dataframe mappartitions problem


dataframe mappartitions problem

sunerhan1992@sina.com
Hello,
I'm using dataframe.mapPartitions(r => myfunc(r)), and it runs very slowly even when myfunc does nothing but return an empty iterator.
Below are two DAGs, one without mapPartitions and one with it. The job that uses mapPartitions takes more time even though myfunc does nothing.
Is there a problem with how stages are scheduled when mapPartitions is involved?


Re: dataframe mappartitions problem

cloud0fan
`Dataset.mapPartitions` takes `func: Iterator[T] => Iterator[U]`, which means Spark needs to deserialize its internal binary format into type `T` for every row, and this deserialization is costly even if `func` never looks at the rows.

If you really need a workaround, you can use the internal API `Dataset.queryExecution.toRdd.mapPartitions`. It has no compatibility guarantees, and you have to deal with `InternalRow` directly.
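To make the difference concrete, here is a minimal sketch of the two paths. It assumes Spark 2.x with a local `SparkSession` (on 1.6 the entry points differ), and the object name, the sample `id` column, and the summing logic are all hypothetical, chosen only for illustration. It is not a benchmark, just a way to see which API hands you `Row` and which hands you `InternalRow`.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.catalyst.InternalRow

object MapPartitionsSketch {

  // Typed path: Spark deserializes every internal row into a Row
  // before the function sees it, even if the function ignores the rows.
  def typedSum(df: DataFrame): Long = {
    import df.sparkSession.implicits._ // provides Encoder[Long] for the result
    val f: Iterator[Row] => Iterator[Long] =
      rows => Iterator(rows.map(_.getLong(0)).sum)
    df.mapPartitions(f).collect().sum
  }

  // Internal path: operates on InternalRow directly and skips that
  // deserialization. This is an internal API with no compatibility
  // guarantees. Spark may reuse the same InternalRow object between
  // calls to next(), so call copy() before buffering rows.
  def internalSum(df: DataFrame): Long =
    df.queryExecution.toRdd
      .mapPartitions((rows: Iterator[InternalRow]) =>
        Iterator(rows.map(_.getLong(0)).sum))
      .collect()
      .sum

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("mapPartitions-sketch")
      .getOrCreate()
    // Hypothetical sample data: a single LongType column named "id".
    val df = spark.range(0, 1000).toDF("id")
    println(s"typed: ${typedSum(df)}, internal: ${internalSum(df)}")
    spark.stop()
  }
}
```

Both methods should return the same value for the same input. The only intended difference is whether Spark has to convert each row to an external `Row` first. With `InternalRow` you read fields through positional getters such as `getLong`, so you need to know the schema's column order and types yourself.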

On 20 Jun 2017, at 8:10 PM, [hidden email] wrote:

Hello,
I'm using dataframe.mapPartitions(r => myfunc(r)), and it runs very slowly even when myfunc does nothing but return an empty iterator.
Below are two DAGs, one without mapPartitions and one with it. The job that uses mapPartitions takes more time even though myfunc does nothing.
Is there a problem with how stages are scheduled when mapPartitions is involved?
<Catch.jpg>
<Catch696F.jpg>



Re: Re: dataframe mappartitions problem

sunerhan1992@sina.com
Hello,
I have reposted the DAG pictures in the attachment; please check.
BTW, I'm using Spark 1.6.3, and the DataFrame is generated from a HiveContext.

 
Date: 2017-06-20 20:42
Subject: Re: dataframe mappartitions problem
`Dataset.mapPartitions` takes `func: Iterator[T] => Iterator[U]`, which means Spark needs to deserialize its internal binary format into type `T` for every row, and this deserialization is costly even if `func` never looks at the rows.

If you really need a workaround, you can use the internal API `Dataset.queryExecution.toRdd.mapPartitions`. It has no compatibility guarantees, and you have to deal with `InternalRow` directly.






Attachment: appendix.zip (396K)