[DISCUSS] Function plugins


[DISCUSS] Function plugins

Ryan Blue

Hi everyone,
I’ve been looking into improving how users of our Spark platform register and use UDFs, and I’d like to discuss a few ideas for making this easier.

The motivation for this is the use case of defining a UDF from Spark SQL or PySpark. We want to make it easy to write JVM UDFs and use them from both SQL and Python. Python UDFs work great in most cases, but we occasionally don’t want to pay the cost of shipping data to Python and processing it there, so we want to make it easy to register UDFs that will run in the JVM.

There is already syntax to create a function from a JVM class in SQL that would work, but this option requires using the Hive UDF API instead of Spark’s simpler Scala API. It also requires argument translation and doesn’t support codegen. Beyond the API and performance problems, it is also annoying that every function must be registered individually with a CREATE FUNCTION statement.
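For reference, here is a rough sketch of what that per-function registration looks like today. This is only illustrative: the UDF class, JAR path, and function name are made up, and the class would have to implement the Hive UDF API.

import org.apache.spark.sql.SparkSession

// Existing approach: one CREATE FUNCTION statement per UDF, written against the Hive UDF API.
// The class name, JAR location, and function name are hypothetical.
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

spark.sql(
  """CREATE FUNCTION my_plus AS 'com.example.hive.MyPlusUDF'
    |USING JAR 'hdfs:///udf-jars/my-udfs.jar'""".stripMargin)

spark.sql("SELECT my_plus(3, 4)").show()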

The alternative that I’d like to propose is to add a way to register a named group of functions using the proposed catalog plugin API.

For anyone unfamiliar with the proposed catalog plugins, the basic idea is to load and configure plugins using a simple property-based scheme. Those plugins expose functionality through mix-in interfaces, like TableCatalog to create/drop/load/alter tables. Another interface could be a UDFCatalog that loads UDFs:

interface UDFCatalog extends CatalogPlugin {
  // Load the UDF registered under the given name in this catalog
  UserDefinedFunction loadUDF(String name);
}

To use this, I would create a UDFCatalog class that returns my Scala functions as UDFs. To look up functions, we would use both the catalog name and the function name.

This would allow my users to write Scala UDF instances, package them in a UDFCatalog class (provided by me), and easily use them in Spark with a few configuration options to add the catalog to their environment.
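As a rough sketch of what one of these catalogs might look like, assuming the proposed UDFCatalog interface above (the class and function names are illustrative, not a settled design):

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Hypothetical catalog: it owns a map of named Scala functions and wraps them
// as Spark UserDefinedFunctions when they are looked up.
class MyUDFCatalog extends UDFCatalog {
  private val udfs: Map[String, UserDefinedFunction] = Map(
    "plus"  -> udf((a: Int, b: Int) => a + b),
    "upper" -> udf((s: String) => s.toUpperCase)
  )

  override def loadUDF(name: String): UserDefinedFunction =
    udfs.getOrElse(name, throw new NoSuchElementException(s"No such UDF: $name"))
}

The wrapping here reuses the existing org.apache.spark.sql.functions.udf helper, which shouldn’t require an active SparkSession just to wrap a function.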

This would also allow me to expose UDF libraries easily in my configuration, like brickhouse, without users needing to ensure the JAR is loaded or to register individual functions.
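On the configuration side, I’m imagining something like the following; the property names follow the scheme from the catalog plugin proposal, and the catalog names and implementation classes are made up:

import org.apache.spark.sql.SparkSession

// Hypothetical configuration: each property maps a catalog name to a UDFCatalog
// implementation class that Spark would load and configure.
val spark = SparkSession.builder()
  .config("spark.sql.catalog.my_udfs", "com.example.MyUDFCatalog")
  .config("spark.sql.catalog.brickhouse", "com.example.BrickhouseUDFCatalog")
  .getOrCreate()

How a function in one of these catalogs is then referenced from SQL ties into the naming questions below.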

Any thoughts on this high-level approach? I know that this ignores things like creating and storing functions in a FunctionCatalog, and we’d have to solve challenges with function naming (whether there is a db component). Right now I’d like to think through the overall idea and not get too focused on those details.

Thanks,

rb

--
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Function plugins

rxin
Having a way to register UDFs that are not using Hive APIs would be great!





Re: [DISCUSS] Function plugins

Matt Cheah

How would this work with:

  1. Codegen – how does one generate code for a user’s UDF? Would the user be able to specify the generated code that represents their function? In practice that’s pretty hard to get right.
  2. Row serialization and representation – Will the UDF receive catalyst rows with optimized internal representations, or will Spark have to convert to something more easily consumed by a UDF?

 

Otherwise +1 for trying to get this to work without Hive. I think even having something without codegen and optimized row formats is worthwhile if only because it’s easier to use than Hive UDFs.

 

-Matt Cheah

 


Re: [DISCUSS] Function plugins

rxin
I don’t think it is realistic to support codegen for UDFs. It’s hooked too deep into the internals.


Re: [DISCUSS] Function plugins

Ryan Blue

I agree that it probably isn’t feasible to support codegen.

My goal is to let users write code the way they can in Scala today, but to change registration so that they don’t need a SparkSession. This is easy with a SparkSession:

In [2]: def plus(a: Int, b: Int): Int = a + b                                                                                                                                                                          
plus: (a: Int, b: Int)Int  

In [3]: spark.udf.register("plus", plus _)                                                                                                                                                                             
Out[3]: UserDefinedFunction(<function2>,IntegerType,Some(List(IntegerType, IntegerType))) 

In [4]: %%sql  
      : select plus(3,4)  

Out[4]:  
+----------------+ 
| UDF:plus(3, 4) | 
+----------------+ 
| 7              | 
+----------------+ 
  available as df0

I want to build a UDFCatalog that can handle indirect registration: a user registers plus with some class that I control, and that class uses the UDFCatalog interface to pass those UDFs to Spark. It would also handle the translation to Spark’s UserDefinedFunction, just like when you use spark.udf.register.
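To make that concrete, here’s a minimal sketch. UDFRegistry and RegistryUDFCatalog are hypothetical names, and the catalog side assumes the proposed UDFCatalog interface from my first message:

import scala.collection.concurrent.TrieMap
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Hypothetical registry that users call instead of spark.udf.register; it holds
// Scala functions wrapped as UserDefinedFunctions without needing a SparkSession.
object UDFRegistry {
  private val registered = TrieMap.empty[String, UserDefinedFunction]

  def register(name: String, function: UserDefinedFunction): Unit =
    registered.put(name, function)

  def lookup(name: String): Option[UserDefinedFunction] = registered.get(name)
}

// A catalog (against the proposed UDFCatalog interface) that serves whatever
// was registered with the registry above.
class RegistryUDFCatalog extends UDFCatalog {
  override def loadUDF(name: String): UserDefinedFunction =
    UDFRegistry.lookup(name)
      .getOrElse(throw new NoSuchElementException(s"No such UDF: $name"))
}

// User code, analogous to the spark.udf.register example above:
object MyFunctions {
  def plus(a: Int, b: Int): Int = a + b

  UDFRegistry.register("plus", udf(plus _))
}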


--
Ryan Blue
Software Engineer
Netflix