SparkSQL can not extract values from UDT (like VectorUDT)


invkrh
Hi,

Consider the following code using spark.ml to get the probability column on a data set:

model.transform(dataSet)
  .selectExpr("probability.values")
  .printSchema()
Note that "probability" is of `vector` type, which is a UDT with the following implementation:

class VectorUDT extends UserDefinedType[Vector] {

  override def sqlType: StructType = {
    // type: 0 = sparse, 1 = dense
    // We only use "values" for dense vectors, and "size", "indices", and "values" for sparse
    // vectors. The "values" field is nullable because we might want to add binary vectors later,
    // which uses "size" and "indices", but not "values".
    StructType(Seq(
      StructField("type", ByteType, nullable = false),
      StructField("size", IntegerType, nullable = true),
      StructField("indices", ArrayType(IntegerType, containsNull = false), nullable = true),
      StructField("values", ArrayType(DoubleType, containsNull = false), nullable = true)))
  }
  // ...
}
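To illustrate why the extractor never sees these fields, here is a minimal sketch (assuming a local SparkSession and Spark's ml Vectors on the classpath): the column's visible data type is the UDT itself, not the underlying sqlType struct, so there is no struct field for "probability.values" to resolve against.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("udt-schema").getOrCreate()
import spark.implicits._

// Build a tiny DataFrame with a single VectorUDT column, mimicking "probability".
val df = Seq(Tuple1(Vectors.dense(0.2, 0.8))).toDF("probability")

// The printed schema shows the UDT ("vector"), not the struct returned by
// sqlType, which is why field extraction on it fails during analysis.
df.printSchema()
```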

`values` is one of its attributes. However, it cannot be extracted.

The first code snippet results in an exception from `complexTypeExtractors`:

org.apache.spark.sql.AnalysisException: Can't extract value from probability#743;
      at ...
      at ...
      at ...
...
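Until the analyzer handles UDTs, a common workaround is to convert the vector into a plain array with a UDF and index into that instead. A hedged sketch under the same assumptions as above (local SparkSession, spark.ml Vectors; the `vecToArray` name is mine, not an API):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[1]").appName("udt-workaround").getOrCreate()
import spark.implicits._

val df = Seq(Tuple1(Vectors.dense(0.2, 0.8))).toDF("probability")

// vecToArray (hypothetical helper) unwraps the UDT into ArrayType(DoubleType),
// which the extractor does understand, so ordinary indexing works.
val vecToArray = udf((v: Vector) => v.toArray)

df.withColumn("probValues", vecToArray($"probability"))
  .selectExpr("probValues[1]")
  .show()
```

The same trick applies to the transformed data set in the first snippet: wrap the "probability" column before selecting into it.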

Here is the code:

It seems that the pattern matching there does not take UDTs into account.

Is this intended behavior? If not, I would like to create a PR to fix it.

--
Hao Ren

Data Engineer @ leboncoin

Paris, France