[SQL] Is it worth it (and advisable) to implement native UDFs?


[SQL] Is it worth it (and advisable) to implement native UDFs?

email

Hi,

 

I read online [1] that for the best UDF performance it is possible to implement UDFs using internal Spark expressions, and I also saw a couple of pull requests, such as [2] and [3], where this was put into practice (I am not sure whether for that reason or just to extend the API).

 

We have an algorithm that computes a score similar to the Levenshtein distance, and it takes about 30-40% of the overall time of our job. We are looking for ways to improve it without adding more resources.

 

I was wondering whether it would be advisable to implement it by extending BinaryExpression, as in [1], and whether doing so would result in any performance gains.

 

Thanks for your help!

 

[1] https://hackernoon.com/apache-spark-tips-and-tricks-for-better-performance-cf2397cac11

[2] https://github.com/apache/spark/pull/7214

[3] https://github.com/apache/spark/pull/7236
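
For concreteness, something along these lines is what I have in mind (a minimal sketch against the internal, Spark 2.4-era Expression API; SimilarityScore and MyScore are hypothetical placeholders for our algorithm):

    import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Expression, ImplicitCastInputTypes}
    import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
    import org.apache.spark.sql.types.{AbstractDataType, DataType, DoubleType, StringType}
    import org.apache.spark.unsafe.types.UTF8String

    // Placeholder for the actual scoring algorithm.
    object MyScore {
      def compute(a: String, b: String): Double = ???
    }

    case class SimilarityScore(left: Expression, right: Expression)
      extends BinaryExpression with ImplicitCastInputTypes with CodegenFallback {

      override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
      override def dataType: DataType = DoubleType

      // Called only when both inputs are non-null; BinaryExpression handles null propagation.
      // Ideally the algorithm would work on UTF8String directly; the toString calls below
      // keep the sketch simple but reintroduce a per-row copy.
      override protected def nullSafeEval(l: Any, r: Any): Any =
        MyScore.compute(l.asInstanceOf[UTF8String].toString, r.asInstanceOf[UTF8String].toString)
    }

(CodegenFallback avoids hand-writing Java codegen; overriding doGenCode instead would let the expression participate in whole-stage code generation.)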

 


Re: [SQL] Is it worth it (and advisable) to implement native UDFs?

Walaa Eldin Moustafa
Hi,

At LinkedIn, we have benchmarks showing that UDFs written against the
Expression API are more performant than Hive Generic UDFs (I am not
sure which API you used to implement your baseline, but I would expect
Scala UDFs or Hive Generic UDFs). In fact, we have built a full-fledged
UDF API (scalar for now) on top of Spark expressions/internal rows;
you may take a look at it [1]. The same API is reusable for some other
engines/data formats.

[1] https://github.com/linkedin/transport
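
(If you want to reproduce this kind of comparison on your own data, a rough harness like the one below is usually enough to see the gap. scalaUdfScore and exprScore are placeholders for your two implementations; the aggregate forces evaluation of every row while keeping the collected result small, but treat the wall-clock numbers as indicative only:)

    import org.apache.spark.sql.functions.{col, sum}

    // Crude wall-clock timer; fine for coarse comparisons.
    def time[T](label: String)(f: => T): T = {
      val start = System.nanoTime()
      val result = f
      println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
      result
    }

    time("Scala UDF")  { df.select(scalaUdfScore(col("a"), col("b")).as("s")).agg(sum("s")).collect() }
    time("Expression") { df.select(exprScore(col("a"), col("b")).as("s")).agg(sum("s")).collect() }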

Thanks,
Walaa.







Re: [SQL] Is it worth it (and advisable) to implement native UDFs?

rxin
In reply to this post by email
If your UDF itself is very CPU intensive, it probably won't make much of a difference, because the UDF itself will dwarf the serialization/deserialization overhead.

If your UDF is cheap, it will help tremendously.
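
For example, compare a trivial Scala UDF with the equivalent built-in expression (df here is any DataFrame with a string column "name"): the UDF converts between Spark's internal UTF8String and java.lang.String on every row, while the expression does not:

    import org.apache.spark.sql.functions.{col, udf, upper}

    // Scala UDF: each row pays UTF8String -> String -> UTF8String conversions.
    val upperUdf = udf((s: String) => s.toUpperCase)
    df.select(upperUdf(col("name")))

    // Built-in expression: operates on UTF8String directly.
    df.select(upper(col("name")))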


On Mon, Jan 20, 2020 at 6:33 PM, <[hidden email]> wrote:

Hi,

 

I read online[1] that for a best UDF performance it is possible to implement them using internal Spark expressions, and I also saw a couple of pull requests such as [2] and [3] where this was put to practice (not sure if for that reason or just to extend the API).

 

We have an algorithm that computes a score similar to what the Levenshtein distance does and it takes about 30%-40% of the overall time of our job. We are looking for ways to improve it without adding more resources.

 

I was wondering if it would be advisable to implement it extending BinaryExpression like[1] and if it would result in any performance gains.

 

Thanks for your help!

 

[1] https://hackernoon.com/apache-spark-tips-and-tricks-for-better-performance-cf2397cac11

[2] https://github.com/apache/spark/pull/7214

[3] https://github.com/apache/spark/pull/7236



RE: [SQL] Is it worth it (and advisable) to implement native UDFs?

email

Is there any documentation or sample about this besides the pull requests merged into Spark core?

It seems that I need to put my custom functions under the org.apache.spark.sql.* package in order to access some of the internal classes I saw in [1], such as Column [2].

Could you please confirm whether that is how it should be done?

 

Thanks!

 

[1] https://github.com/apache/spark/pull/7214

[2] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L37

 



Re: [SQL] Is it worth it (and advisable) to implement native UDFs?

cloud0fan
This is really a hack, and we don't recommend that users access internal classes directly; that's why there is no public documentation.

If you really need to do it and are aware of the risks, you can read the source code. All expressions (the so-called "native UDFs") extend the base class `Expression`. You can read the code comments and look at some of the implementations.
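
That said, for the basic case you may not need to put your code under org.apache.spark.sql at all: the Column constructor that takes an Expression is public. A minimal sketch, assuming the SimilarityScore expression from earlier in the thread (these are still internal APIs with no stability guarantees across releases):

    import org.apache.spark.sql.Column

    // Thin public wrapper around the custom expression; no Spark fork needed,
    // but it compiles against internals that can change between versions.
    def similarityScore(left: Column, right: Column): Column =
      new Column(SimilarityScore(left.expr, right.expr))

    // Usage: df.select(similarityScore(col("a"), col("b")))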



Re: [SQL] Is it worth it (and advisable) to implement native UDFs?

Walaa Eldin Moustafa
For a general-purpose code example, you may take a look at the class we defined in Transport UDFs to express all Expression UDFs [1]. This is an internal class, though, and not a user-facing API. A user-facing UDF example is in [2]; it leverages [1] behind the scenes.


Thanks,
Walaa.
