[DISCUSS] Null-handling of primitive-type of untyped Scala UDF in Scala 2.12


[DISCUSS] Null-handling of primitive-type of untyped Scala UDF in Scala 2.12

wuyi
Hi all, I'd like to raise a discussion here about the null handling of
primitive types in the untyped Scala UDF [ udf(f: AnyRef, dataType: DataType) ].

After the switch to Scala 2.12 in Spark 3.0, the untyped Scala UDF is broken
because we can no longer use reflection to get the parameter types of the Scala
lambda.

This leads to silent result changes. For example, with a UDF defined as
`val f = udf((x: Int) => x, IntegerType)`, the query `select f($"x")` behaves
differently between 2.4 and 3.0 when the input value of column x is null:

Spark 2.4:  null
Spark 3.0:  0
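A plain-Scala sketch of why this happens (no Spark needed; the object name is mine). Without reflective knowledge that the parameter is a primitive Int, the lambda ends up being invoked through the generic, boxed Function1 interface, and unboxing a null yields the primitive default value:

```scala
object NullUnboxingDemo {
  def main(args: Array[String]): Unit = {
    val f = (x: Int) => x
    // Spark 3.0 effectively calls the lambda through the generic (boxed)
    // Function1 interface, since it can no longer see the primitive type:
    val generic = f.asInstanceOf[Any => Any]
    // Unboxing a boxed null produces the primitive default, so null becomes 0:
    println(generic(null)) // prints 0, not null
  }
}
```

This is why the result silently changes from null to 0 rather than failing loudly.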

Because of this, we deprecated the untyped Scala UDF in 3.0 and recommend users
to use the typed ones. However, I recently identified several valid use
cases, e.g., `val f = (r: Row) => Row(r.getAs[Int](0) * 2)`, where the schema
cannot be derived by the typed Scala UDF [ udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]) ].
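For reference, here is that use case spelled out in Spark API terms (a sketch; the column and schema names are mine, and it needs a Spark session on the classpath to actually run):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// The function consumes a whole struct column as a generic Row, so there is
// no TypeTag from which the typed udf[RT, A1] variant could derive a schema.
val f = (r: Row) => Row(r.getAs[Int](0) * 2)
val outSchema = StructType(Seq(StructField("doubled", IntegerType)))

// Only the untyped variant accepts an explicit result DataType:
val doubleStruct = udf(f, outSchema)
// df.select(doubleStruct($"structCol"))
```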

There are 3 possible solutions:
1. find a way to get the Scala lambda parameter types by reflection (I tried
very hard but had no luck; the Java SAM type is too dynamic)

2. support case class as the input of typed Scala UDF, so at least people
can still deal with struct type input column with UDF

3. add a new variant of untyped Scala UDF where users can specify the input types

I'd like to see more feedback or ideas about how to move forward.

Thanks,
Yi Wu



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: [DISCUSS] Null-handling of primitive-type of untyped Scala UDF in Scala 2.12

Takeshi Yamamuro
hi, Yi,

I'm probably missing something, but can't we just wrap the UDF with
`if (isnull(x)) null else udf(knownnotnull(x))`?
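In Catalyst terms, the suggested guard might look like the sketch below (hypothetical: `KnownNotNull` is an internal catalyst expression rather than a public function, and the helper name is mine):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.{If, IsNull, KnownNotNull, Literal}

// Hypothetical rewrite applied around a UDF call on column x:
//   if (isnull(x)) null else udf(knownnotnull(x))
def guardNull(x: Column, applyUdf: Column => Column): Column = {
  val guardedInput = new Column(KnownNotNull(x.expr))
  new Column(If(IsNull(x.expr), Literal(null), applyUdf(guardedInput).expr))
}
```

The `KnownNotNull` wrapper would tell the planner the value is non-null past the check, so no redundant null handling is generated inside the branch.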




--
---
Takeshi Yamamuro

Re: [DISCUSS] Null-handling of primitive-type of untyped Scala UDF in Scala 2.12

wuyi
Hi Takeshi, thanks for your reply.

Before this broke, we only did the null check for primitive types and left
null values of non-primitive types to the UDF itself, in case they need to be
handled specially; e.g., a UDF may return something else for a null String.
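A plain-Scala illustration of that asymmetry (the function and sentinel value are mine):

```scala
// Reference-typed parameter: the null reaches the UDF, which may want to map
// it to a sentinel value rather than propagate null.
val describe = (s: String) => if (s == null) "<missing>" else s.toUpperCase
println(describe(null))  // prints <missing>
println(describe("abc")) // prints ABC

// Primitive-typed parameter: no Int value can represent null, so pre-3.0
// Spark short-circuited to null before ever calling the UDF.
```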





Re: [DISCUSS] Null-handling of primitive-type of untyped Scala UDF in Scala 2.12

Takeshi Yamamuro
Ah, I see now what the "broken" means. Thanks, Yi.
I personally think option 1 is the best for existing Spark users to support the use case you described above.
So, this decision depends on how difficult it is to implement "get Scala lambda parameter types by reflection"
and the complexity of that implementation.
(I'm not familiar with the 2.12 implementation, so I'm not really sure how difficult it is.)

If we cannot choose option 1, I like option 2 better than
adding a new API for the use case (option 3).

Bests,
Takeshi




--
---
Takeshi Yamamuro

Re: [DISCUSS] Null-handling of primitive-type of untyped Scala UDF in Scala 2.12

Sean Owen
I don't think it's possible to get the parameters by reflection
anymore -- they are lambdas now in the JVM. Indeed, I recall that
a few people couldn't find a solution back when we added 2.12 support.
This isn't 'new': it has always been the case with Scala 2.12.
If there is a better idea, sure.




Re: [DISCUSS] Null-handling of primitive-type of untyped Scala UDF in Scala 2.12

wuyi
Thanks Sean and Takeshi.

Option 1 seems really impossible, so I'm going to take option 2 as the
alternative.





Re: [DISCUSS] Null-handling of primitive-type of untyped Scala UDF in Scala 2.12

cloud0fan
I don't think option 1 is possible.

For option 2: I think we need to do it anyway. It's kind of a bug that the typed Scala UDF doesn't support case classes and thus can't support struct-type input columns.

For option 3: It's a bit risky to add a new API, but it seems like we have a good reason. The untyped Scala UDF supports Row as input/output, which is a valid use case to me. It requires a "returnType" parameter but no input types. This brings 2 problems: 1) if a UDF parameter is primitive-type but the actual value is null, the result will be wrong; 2) the Analyzer can't type-check the UDF's inputs.

Maybe we can add a new method `def udf(f: AnyRef, inputTypes: Seq[(DataType, Boolean)], returnType: DataType)` to allow users to specify the expected input data types and nullabilities.
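Usage of such a variant might look like this sketch (hypothetical: this overload does not exist, and the named parameters are for illustration only):

```scala
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.IntegerType

// Each input is described by (dataType, nullable). With nullable = false on a
// primitive parameter, Spark could insert the null check on the user's behalf,
// restoring the 2.4 behavior of returning null for null input.
val f = udf(
  (x: Int) => x,
  inputTypes = Seq((IntegerType, false)),
  returnType = IntegerType)
```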



Re: [DISCUSS] Null-handling of primitive-type of untyped Scala UDF in Scala 2.12

Hyukjin Kwon
Option 2 seems fine to me.

