[SPARK][SQL] Distributed createDataframe from many pandas DFs using Arrow

[SPARK][SQL] Distributed createDataframe from many pandas DFs using Arrow

Linar Savion
We've created a snippet that builds a Spark DF from an RDD of many pandas DFs in a distributed manner, without requiring the driver to collect the entire dataset.

Early tests show a 6x-10x performance improvement over going through pandasDF -> Rows -> sparkDF.

I've seen that there are some open pull requests that change the way Arrow serialization works. Should I open a pull request adding this functionality to SparkSession (`createFromPandasDataframesRDD`)?


Thanks,
Linar

Re: [SPARK][SQL] Distributed createDataframe from many pandas DFs using Arrow

Li Jin
Hi Linar,

This seems useful, but perhaps it's better to reuse the same function name?

Currently createDataFrame takes an RDD of any kind of SQL data representation (e.g. Row, tuple, int, boolean, etc.), a list, or a pandas.DataFrame.

Perhaps we can support taking an RDD of pandas.DataFrame as the "data" arg as well?

What do other people think?

Li
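[Concretely, the overload Li proposes could dispatch on the element type of the incoming RDD. A rough sketch of that check, under the assumption of such a dispatch step inside createDataFrame; the function name and return tags are illustrative, not Spark API:]

```python
import pandas as pd

def infer_create_path(first_elem):
    # Hypothetical dispatch inside createDataFrame: an RDD whose elements
    # are pandas DataFrames would take the distributed Arrow-batch path;
    # everything else falls through to the existing row-based inference.
    if isinstance(first_elem, pd.DataFrame):
        return "arrow-batches"
    return "rows"
```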

On Sun, Jul 8, 2018 at 1:13 PM, Linar Savion <[hidden email]> wrote:


Re: [SPARK][SQL] Distributed createDataframe from many pandas DFs using Arrow

rxin
Yes, I would just reuse the same function.

On Sun, Jul 8, 2018 at 5:01 AM Li Jin <[hidden email]> wrote: