transformSchema method policy for "duplicated" column names

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

transformSchema method policy for "duplicated" column names

Alessandro Solimando
Hello everyone,
after one month without any reply on stackoverflow (https://stackoverflow.com/questions/47789265/inconsistency-in-handling-duplicate-names-in-dataframe-schema) I try to pose the question here.

Context: I am refactoring some code of mine, transforming scala methods with a signature of the form methodname(df: DataFrame, **params**): DataFrame into a subclass of org.apache.spark.ml.Transformer, following this document: custom transformer doc (my dev is targeting Spark 2.2.0, if relevant).

I have been trying to clear my mind on how the transformSchema method should be used, but I have realized that there is general agreement on how to use them (SPARK-14760). I have decided to invoke such method from my transform methods when there are chances to catch type errors in advance.

The transformSchema method of one of these classes reads as follows:

override def transformSchema(schema: StructType): StructType = {
  val newFields = schema.fields :+ new StructField($(outputCol),
                                                   DataTypes.StringType, false)
  DataTypes.createStructType(newFields)
}

I have written some unit tests to test my custom transformers:

test("Use of ambiguous column names") {
  val df1 = sqlContext.createDataFrame(
    Seq(
      ("a", "1a", 2),
      ("b", "2a", 4),
      ("c", "3a", 5)
    )
  ).toDF("c1", "c2", "c3")

  val df2 = sqlContext.createDataFrame(
    Seq(
      ("a", "1b", 1),
      ("b", "2b", 3),
      ("c", "3b", 6)
    )
  ).toDF("c1", "c2", "c3")

  val df_join = df1.join(df2, "c1")

  val columnsToConcatenate = Array(df1("c2"), df2("c2"))
  [...transformer instantiation and invocation...]
}

To my surprise the unit test above failed by raising an exception:

java.lang.IllegalArgumentException: fields should have distinct names.
at org.apache.spark.sql.types.DataTypes.createStructType(DataTypes.java:219)

Analysis:

By inspecting the source code of createStructType method, I have realized that duplicate names are not allowed.

Of course I can bypass the factory method and generate my StructType by directly using its constructor, but I fail to see the point of this "check" that fails with legal dataframes, and I generally tend to use factories over constructors, if available.

There is no explanation for not supporting duplicate names in the method documentation, and its semantics is, to my opinion, unexpected.

Question: Is this an inconsistency or there is a reason for the semantics adopted by createStructType (that is, no dups)?

Best regards,
Alessandro