SparkSQL can not extract values from UDT (like VectorUDT)

Consider the following code, which gets the "probability" column of a data set.
Note that "probability" is of `vector` type, which is a UDT with the following implementation:

class VectorUDT extends UserDefinedType[Vector] {

  override def sqlType: StructType = {
    // type: 0 = sparse, 1 = dense
    // We only use "values" for dense vectors, and "size", "indices", and "values" for sparse
    // vectors. The "values" field is nullable because we might want to add binary vectors later,
    // which uses "size" and "indices", but not "values".
    StructType(Seq(
      StructField("type", ByteType, nullable = false),
      StructField("size", IntegerType, nullable = true),
      StructField("indices", ArrayType(IntegerType, containsNull = false), nullable = true),
      StructField("values", ArrayType(DoubleType, containsNull = false), nullable = true)))
  }

  // ... (remaining methods omitted)
}

`values` is one of its attributes; however, it cannot be extracted.
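For reference, a minimal sketch of the kind of query that hits this (the DataFrame and its contents are illustrative, not from the original job; only the "probability" column name comes from the error below):

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Hypothetical reproduction with a tiny local DataFrame.
val spark = SparkSession.builder().master("local[1]").appName("udt-extract").getOrCreate()
import spark.implicits._

val df = Seq(Tuple1(Vectors.dense(0.1, 0.9))).toDF("probability")

// "probability" is stored as a VectorUDT; extracting its "values" field
// fails at analysis time instead of returning the underlying array.
df.select($"probability.values")
// throws org.apache.spark.sql.AnalysisException: Can't extract value from probability#...
```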

The first code snippet results in an exception thrown from complexTypeExtractors:

org.apache.spark.sql.AnalysisException: Can't extract value from probability#743;
      at ...
      at ...
      at ...

Here is the code:

It seems that the pattern matching does not take UDT into consideration.
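The gist of the problem can be sketched with a self-contained toy model (all names here are hypothetical, loosely mirroring the resolution logic in complexTypeExtractors.scala): field extraction matches on `StructType`, but a UDT column reports the UDT itself as its data type, so it falls through to the error branch even though its underlying `sqlType` is a struct.

```scala
// Toy model of Catalyst's extraction dispatch (hypothetical, simplified).
sealed trait DataType
case class StructField(name: String, dataType: DataType)
case class StructType(fields: Seq[StructField]) extends DataType
// A UDT is its own DataType; its struct layout lives in `sqlType`.
case class UserDefinedType(sqlType: StructType) extends DataType
case object DoubleArrayType extends DataType

def extractField(dt: DataType, name: String): Option[DataType] = dt match {
  case StructType(fields) => fields.find(_.name == name).map(_.dataType)
  // No case for UserDefinedType, so a UDT column falls through here,
  // analogous to "Can't extract value from probability#743".
  // A fix would delegate to the underlying struct:
  //   case UserDefinedType(sql) => extractField(sql, name)
  case _ => None
}

val vectorUDT = UserDefinedType(StructType(Seq(StructField("values", DoubleArrayType))))

println(extractField(vectorUDT, "values"))         // None: the UDT case is not handled
println(extractField(vectorUDT.sqlType, "values")) // Some(DoubleArrayType)
```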

Is this intended behavior? If not, I would like to create a PR to fix it.

Hao Ren

Data Engineer @ leboncoin

Paris, France