Re: Spark join over sorted columns of dataset.

Koert Kuipers
For RDDs the shuffle is already skipped, but the sort is not. In spark-sorted we track partitioning and sorting within partitions for key-value RDDs, and so can avoid the sort. See:

For Dataset/DataFrame such optimizations are applied automatically; however, this currently does not always work for Dataset, see:
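As a rough illustration of why known ordering matters (plain Python, not Spark or spark-sorted code): once both sides are known to be sorted by the join key, the join reduces to a single linear merge pass and no sort step is needed. The function name `merge_join` and the list-of-pairs input format are assumptions made for this sketch only.

```python
from typing import Iterator, List, Tuple


def merge_join(
    left: List[Tuple[int, str]],
    right: List[Tuple[int, str]],
) -> Iterator[Tuple[int, str, str]]:
    """Inner-join two lists of (key, value) pairs.

    Assumes both inputs are ALREADY sorted by key; no sorting is done here.
    """
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1  # no match possible for this left row
        elif left_key > right_key:
            j += 1  # no match possible for this right row
        else:
            # Find the end of the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left) and left[i][0] == left_key:
                for r in range(j, j_end):
                    yield (left_key, left[i][1], right[r][1])
                i += 1
            j = j_end
```

If either input were not actually sorted, this pass would silently miss matches, which is why the ordering has to be tracked or verified rather than assumed.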

On Mar 3, 2017 11:06 AM, "Rohit Verma" <[hidden email]> wrote:
Sending this to the dev list.
Could you please help me with some ideas for the question below?

> On Feb 23, 2017, at 3:47 PM, Rohit Verma <[hidden email]> wrote:
> Hi
> When joining on two columns from different datasets, how can the join be optimized if both columns are pre-sorted within their datasets,
> so that when Spark does a sort-merge join the sorting phase can be skipped?
> Regards
> Rohit