[DISCUSS] PySpark Window UDF

Li Jin
Hi All,

I have been looking into leveraging the Arrow and Pandas UDF work we have done so far to support Window UDFs in PySpark. I have done some investigation and believe there is a way to do PySpark window UDFs efficiently.

The basic idea is that, instead of passing each window to Python separately, we pass a "batch of windows" as an Arrow batch of rows plus begin/end indices for each window (the indices are computed on the Java side). Python then rolls over the begin/end indices and applies the UDF to each window.
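
To make the Python side of this concrete, here is a minimal sketch of rolling over the indices. It is only an illustration under assumptions, not the proposed implementation: a plain Python list stands in for the Arrow batch, the indices are assumed to be half-open [begin, end) offsets into that batch, and the names apply_window_udf, begins, and ends are hypothetical.

```python
from typing import Callable, List, Optional, Sequence


def apply_window_udf(
    values: Sequence[float],
    begins: Sequence[int],
    ends: Sequence[int],
    udf: Callable[[Sequence[float]], Optional[float]],
) -> List[Optional[float]]:
    """Apply udf to each window values[begin:end] of one batch.

    Assumption: begins/ends are half-open offsets into `values`,
    one pair per output row, computed elsewhere (e.g. on the JVM side).
    """
    if len(begins) != len(ends):
        raise ValueError("begins and ends must have the same length")
    # One slice per window; the batch itself is only transferred once.
    return [udf(values[b:e]) for b, e in zip(begins, ends)]


def mean(window: Sequence[float]) -> Optional[float]:
    # Return None for an empty window instead of dividing by zero.
    return sum(window) / len(window) if window else None


# One batch of five rows; each window covers "1 preceding .. current row".
values = [1.0, 2.0, 3.0, 4.0, 5.0]
begins = [0, 0, 1, 2, 3]
ends = [1, 2, 3, 4, 5]

result = apply_window_udf(values, begins, ends, mean)
print(result)
```

The point of the sketch is that the rows cross the JVM/Python boundary once per batch, while the per-window work in Python is just slicing by index.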

I have written up my investigation in more detail here:

I think this approach is pretty promising and hope to get some feedback on it from the community. Let's discuss! :)

Li