Spark SQL Dataframe resulting from an except( ) is unusable

vijoshi
With Spark 2.x, I construct a DataFrame from a sample libsvm file:

scala> val higgsDF = spark.read.format("libsvm").load("higgs.libsvm")
higgsDF: org.apache.spark.sql.DataFrame = [label: double, features: vector]


Then I build a new DataFrame using except():

scala> val train_df = higgsDF.sample(false, 0.7, 42)
train_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]

scala> val test_df = higgsDF.except(train_df)
test_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]

Now most operations on test_df fail with this exception:

scala> test_df.show()
java.lang.RuntimeException: no default for type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7
  at org.apache.spark.sql.catalyst.expressions.Literal$.default(literals.scala:179)
  at org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:117)
  at org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:110)
   .
   .

While debugging this, I checked the schema of this DataFrame:

scala> test_df.schema
res4: org.apache.spark.sql.types.StructType = StructType(StructField(label,DoubleType,true), StructField(features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))

Looking a little deeper, the error occurs because the QueryPlanner ends up inside

  object ExtractEquiJoinKeys (/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala)

where it processes a LeftAnti join. It then attempts to generate a default Literal value for the org.apache.spark.ml.linalg.VectorUDT DataType, which fails with the above exception. This is because there is no match for VectorUDT in

def default(dataType: DataType): Literal = {..}    (/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala)
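
One way to see the LeftAnti join described above without triggering the exception is to print the optimized logical plan, which is produced before the physical planning step where ExtractEquiJoinKeys runs. This is only a sketch for the spark-shell session shown earlier: the helper name showOptimizedPlan is hypothetical, and test_df is assumed to be the DataFrame built with except().

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not part of Spark): prints the logical plan after
// optimization. Physical planning is not requested here, so the call to
// Literal.default inside ExtractEquiJoinKeys should not be reached.
def showOptimizedPlan(df: DataFrame): Unit =
  println(df.queryExecution.optimizedPlan.treeString)

// Assumes test_df was defined in the same spark-shell session.
showOptimizedPlan(test_df)
```

The printed tree should contain a Join node of type LeftAnti whose condition compares the label and features columns of both sides.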


Any processing on this DataFrame that causes Spark to build a physical query plan (i.e. almost every productive use of it) fails with this exception.

Is this an oversight in the Literal implementation, which does not handle UserDefinedTypes, or was it left out intentionally? Is there a way to work around this problem? It seems to be present in all 2.x versions.

Regards,
Vinayak Joshi

Re: Spark SQL Dataframe resulting from an except( ) is unusable

Liang-Chi Hsieh

Hi Vinayak,

Thanks for reporting this.

I don't think UserDefinedType was left out intentionally. If you know how the UDT is represented in its internal format, you can explicitly convert the UDT column to other SQL types before calling except(), which should get around this problem. It is a bit hacky, though.
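
A minimal sketch of that kind of conversion, under some assumptions: rather than exposing the UDT's internal struct, it turns the features column into a plain array of doubles with a UDF, runs except() on the converted DataFrames, and rebuilds a vector afterwards. The object and helper names (ExceptWithVectorColumn, vecToArray, arrayToVec, exceptOnFeatures) are hypothetical, the column name "features" and the file name come from the original post, and this has not been checked against every 2.x release.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object ExceptWithVectorColumn {
  // Hypothetical helpers: convert the ML vector to array<double> and back.
  private val vecToArray = udf((v: Vector) => v.toArray)
  private val arrayToVec = udf((a: Seq[Double]) => Vectors.dense(a.toArray))

  // Runs except() on copies whose "features" column is a plain SQL array,
  // then restores a vector column on the result.
  def exceptOnFeatures(all: DataFrame, subset: DataFrame): DataFrame = {
    def flatten(df: DataFrame): DataFrame =
      df.withColumn("features", vecToArray(col("features")))
    flatten(all)
      .except(flatten(subset))
      .withColumn("features", arrayToVec(col("features")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("except-udt-workaround")
      .master("local[*]")
      .getOrCreate()

    val higgsDF = spark.read.format("libsvm").load("higgs.libsvm")
    val trainDF = higgsDF.sample(false, 0.7, 42)
    val testDF = exceptOnFeatures(higgsDF, trainDF)

    testDF.show()
    spark.stop()
  }
}
```

Two caveats with this sketch: sparse vectors come back as dense vectors, and a sparse and a dense vector holding the same values compare as equal after the conversion. As with any except(), duplicate rows are also removed from the result.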

I have submitted a PR to fix this, but I am not sure whether it will be merged into master soon.
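
If the only goal is a train/test split, another option that may avoid except() entirely is Dataset.randomSplit, which returns disjoint samples directly. The sketch below assumes the same higgs.libsvm file as the original post; the object name and the 0.7/0.3 weights are illustrative. Note that the result is not identical to sample() followed by except(), since except() also removes duplicate rows.

```scala
import org.apache.spark.sql.SparkSession

object RandomSplitExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("random-split-example")
      .master("local[*]")
      .getOrCreate()

    val higgsDF = spark.read.format("libsvm").load("higgs.libsvm")

    // Split into roughly 70% / 30% with a fixed seed; no join is planned,
    // so Literal.default is never asked for a VectorUDT default value.
    val Array(trainDF, testDF) = higgsDF.randomSplit(Array(0.7, 0.3), 42L)

    testDF.show()
    spark.stop()
  }
}
```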

Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/