Using UDFs in Java without registration

Using UDFs in Java without registration

Justin Uang
I would like to define a UDF in Java via a closure and then use it without registration. In Scala, I believe there are two ways to do this:

    val myUdf = functions.udf((x: Int) => x + 5)
    myDf.select(myUdf(myDf("age")))

or

    myDf.select(functions.callUDF((x: Int) => x + 5, DataTypes.IntegerType, myDf("age")))

However, neither of these works for a Java UDF. The first requires TypeTags. For the second, I was able to hack it by creating a Scala AbstractFunction1 and using callUDF, which requires declaring the Catalyst DataType instead of using TypeTags. It was still nasty, though, because I had to return a Scala Map instead of a Java Map.
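For the record, the shape of that workaround looks roughly like the following self-contained sketch. `UDF1` here just mirrors org.apache.spark.sql.api.java.UDF1, and `Function1Adapter` stands in for the hand-written scala.runtime.AbstractFunction1 subclass (the real bridge needs the Scala runtime on the classpath; all names here are invented for illustration):

```java
// UDF1 mirrors org.apache.spark.sql.api.java.UDF1 for illustration only.
interface UDF1<T1, R> {
    R call(T1 t1) throws Exception;
}

// Plays the role of the hand-written scala.runtime.AbstractFunction1
// subclass that bridges the Java UDF into what callUDF expects.
class Function1Adapter<T1, R> {
    private final UDF1<T1, R> udf;

    Function1Adapter(UDF1<T1, R> udf) {
        this.udf = udf;
    }

    // Matches the shape of scala.Function1.apply.
    R apply(T1 t1) {
        try {
            return udf.call(t1);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

class UdfAdapterDemo {
    public static void main(String[] args) {
        UDF1<Integer, Integer> addFive = x -> x + 5;
        Function1Adapter<Integer, Integer> adapted = new Function1Adapter<>(addFive);
        System.out.println(adapted.apply(37)); // prints 42
    }
}
```

The boilerplate adapter per arity is exactly the part that first-class support would remove.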

Is there first-class support for creating an org.apache.spark.sql.UserDefinedFunction that works with org.apache.spark.sql.api.java.UDF1<T1, R>? I'm fine with having to declare the Catalyst type when creating it.

If it doesn't exist, I would be happy to work on it =)

Justin

Re: Using UDFs in Java without registration

rxin
I think you are right that there is currently no way to call a Java UDF without registration. Adding another 20 methods to functions would be scary. Maybe the best approach is to have a companion object for UserDefinedFunction and define the UDFs there?

e.g.

object UserDefinedFunction {

  def define(f: org.apache.spark.api.java.function.Function0[_], returnType: Class[_]): UserDefinedFunction

  // ... and a few more overloads - maybe up to 5 arguments?
}
 
Ideally, we should ask for both argument class and return class, so we can do the proper type conversion (e.g. if the UDF expects a string, but the input expression is an int, Catalyst can automatically add a cast). However, we haven't implemented those in UserDefinedFunction yet.
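To make the proposal concrete, here is a self-contained sketch of what such a `define` entry point might look like. Everything here is a simplified stand-in, not the real API: `DataType` and `UserDefinedFunction` mirror the Spark classes, java.util.function.Supplier stands in for org.apache.spark.api.java.function.Function0, and the Class-to-DataType mapping table is hypothetical and abbreviated:

```java
import java.util.function.Supplier;

// Simplified stand-in for org.apache.spark.sql.types.DataType.
class DataType {
    final String name;
    DataType(String name) { this.name = name; }
}

// Simplified stand-in for org.apache.spark.sql.UserDefinedFunction.
class UserDefinedFunction {
    final Object f;
    final DataType returnType;

    private UserDefinedFunction(Object f, DataType returnType) {
        this.f = f;
        this.returnType = returnType;
    }

    // The proposed companion-object define(...): map the declared Java
    // Class to a Catalyst DataType (mapping here is invented and partial).
    static UserDefinedFunction define(Supplier<?> f, Class<?> returnType) {
        DataType dt = returnType == Integer.class ? new DataType("IntegerType")
                    : returnType == String.class  ? new DataType("StringType")
                    : new DataType("BinaryType");
        return new UserDefinedFunction(f, dt);
    }
}
```

Overloads for UDF1 through UDF5 would follow the same pattern, each taking the argument function plus the declared return class.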





Re: Using UDFs in Java without registration

Justin Uang
The idea of asking for both the argument and the return class is interesting. I don't think we do that for the Scala APIs currently, right? In functions.scala, we only use the TypeTag for RT.

  def udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]): UserDefinedFunction = {
    UserDefinedFunction(f, ScalaReflection.schemaFor(typeTag[RT]).dataType)
  }

Only a small subset of conversions would make sense implicitly (e.g. int => double, the typical widening conversions in programming languages), but something like double => int might be dangerous, and timestamp => double wouldn't really make sense. Perhaps it's better to be explicit about casts?
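That distinction can be made mechanical. As a toy illustration of the rule being argued for (the type names and the `canAutoCast` helper are invented for this sketch; Catalyst has its own, much richer rules): identity and widening conversions pass, narrowing and cross-domain conversions are rejected.

```java
// Invented type names for the sketch; not Catalyst's actual type system.
enum SqlType { INT, DOUBLE, TIMESTAMP }

class CastRules {
    // Which implicit casts a UDF analyzer might allow.
    static boolean canAutoCast(SqlType from, SqlType to) {
        if (from == to) {
            return true;  // identity: always fine
        }
        // Only classic widening conversions are implicit; everything
        // else (double => int, timestamp => double) needs an explicit cast.
        return from == SqlType.INT && to == SqlType.DOUBLE;
    }
}
```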

If we don't care about declaring the types of the arguments, perhaps we can have all of the Java UDF interfaces (UDF1, UDF2, etc.) extend a generic marker interface called UDF, and then have

    def define(f: UDF, returnType: Class[_])

to simplify the APIs.
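Sketched in Java (the interfaces mirror the Spark ones; `arity` is a hypothetical helper, not an existing API), a single `define` can then accept any arity, with the trade-off that the arity has to be recovered at runtime via instanceof checks rather than at compile time:

```java
// Marker interface plus per-arity subinterfaces, mirroring the proposal.
interface UDF {}

interface UDF1<T1, R> extends UDF {
    R call(T1 t1) throws Exception;
}

interface UDF2<T1, T2, R> extends UDF {
    R call(T1 t1, T2 t2) throws Exception;
}

class UdfUtil {
    // Hypothetical helper: with only the marker type, define(f, returnType)
    // would have to recover the arity like this at runtime.
    static int arity(UDF f) {
        if (f instanceof UDF1) return 1;
        if (f instanceof UDF2) return 2;
        throw new IllegalArgumentException("unsupported UDF arity");
    }
}
```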



Re: Using UDFs in Java without registration

rxin
We added all the TypeTags for the arguments but haven't gotten around to using them yet. I think it'd make sense to have them and do the auto cast, but we can have rules in analysis that forbid certain casts (e.g. don't auto cast double to int).



Re: Using UDFs in Java without registration

Justin Uang
I would like to bring this back up for consideration. I'm open to adding types for all the parameters, but it does seem onerous, and in the case of Python we don't do that. Do you feel strongly about adding them?
