Hive Hash in Spark

Hive Hash in Spark

tcondie

Hi,

I noticed the existence of a Hive Hash partitioning implementation in Spark, but also noticed that it’s not being used, and that the Spark hash partitioning function is presently hardcoded to Murmur3. My question is whether Hive Hash is dead code, or whether there are future plans to support reading and understanding data that has been partitioned using Hive Hash. By understanding, I mean being able to avoid a full shuffle of Table A (partitioned by Hive Hash) when joining it with a Table B that I can shuffle via Hive Hash to match Table A.

Thank you,

Tyson
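
For readers following the thread, here is a rough sketch of why the choice of hash function matters for the question above. It is plain Python, not Spark code. It assumes that Hive-style hashing of an integer key is the integer itself with the bucket taken as (hash & Integer.MAX_VALUE) % numBuckets, and that Spark's hash partitioning applies a 32-bit Murmur3 with seed 42 followed by a non-negative modulo. All function names here are hypothetical and exist only for illustration.

```python
def to_int32(x):
    """Wrap a Python int to a signed 32-bit value, as JVM int arithmetic would."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    x &= 0xFFFFFFFF
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_int(value, seed=42):
    """Murmur3 x86_32 over one 4-byte integer (seed 42 is an assumption)."""
    k1 = ((value & 0xFFFFFFFF) * 0xCC9E2D51) & 0xFFFFFFFF
    k1 = rotl32(k1, 15)
    k1 = (k1 * 0x1B873593) & 0xFFFFFFFF
    h1 = (seed & 0xFFFFFFFF) ^ k1
    h1 = rotl32(h1, 13)
    h1 = (h1 * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Finalization mix; 4 is the input length in bytes.
    h1 ^= 4
    h1 ^= h1 >> 16
    h1 = (h1 * 0x85EBCA6B) & 0xFFFFFFFF
    h1 ^= h1 >> 13
    h1 = (h1 * 0xC2B2AE35) & 0xFFFFFFFF
    h1 ^= h1 >> 16
    return to_int32(h1)


def hive_bucket(key, num_buckets):
    """Hive-style bucket for an int key (assumed: the hash is the value itself)."""
    return (to_int32(key) & 0x7FFFFFFF) % num_buckets


def murmur_bucket(key, num_buckets):
    """Murmur3-style bucket; Python's % is already non-negative for positive n."""
    return murmur3_int(key) % num_buckets


if __name__ == "__main__":
    for key in range(8):
        print(key, hive_bucket(key, 4), murmur_bucket(key, 4))
```

Because the two functions generally send the same key to different buckets, rows written under one scheme do not line up with rows shuffled under the other, which is why Table B would have to be shuffled with the same hash that was used to write Table A.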

Re: Hive Hash in Spark

rxin
I think they might be used in bucketing? Not 100% sure.


On Wed, Mar 06, 2019 at 1:40 PM, <[hidden email]> wrote:

Hi,

I noticed the existence of a Hive Hash partitioning implementation in Spark, but also noticed that it’s not being used, and that the Spark hash partitioning function is presently hardcoded to Murmur3. My question is whether Hive Hash is dead code, or whether there are future plans to support reading and understanding data that has been partitioned using Hive Hash. By understanding, I mean being able to avoid a full shuffle of Table A (partitioned by Hive Hash) when joining it with a Table B that I can shuffle via Hive Hash to match Table A.

Thank you,

Tyson

Re: Hive Hash in Spark

Ryan Blue
I think this was needed to add support for bucketed Hive tables. Like Tyson noted, if the other side of a join can be bucketed the same way, then Spark can use a bucketed join. I have long-term plans to support this in the DataSourceV2 API, but I don't think we are very close to implementing it yet.

rb

On Wed, Mar 6, 2019 at 1:57 PM Reynold Xin <[hidden email]> wrote:
I think they might be used in bucketing? Not 100% sure.


--
Ryan Blue
Software Engineer
Netflix
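
As an illustration of the bucketed join described in the reply above, the following is a minimal sketch in plain Python rather than Spark's actual implementation. It assumes both tables are split into the same number of buckets with the same bucket function (a Hive-style function for integer keys is used as a stand-in). The helper names `bucket_of`, `bucketize`, `bucketed_join`, and `naive_join` are hypothetical.

```python
from collections import defaultdict


def bucket_of(key, num_buckets):
    """Stand-in bucket function; any function works if both sides share it."""
    return (key & 0x7FFFFFFF) % num_buckets


def bucketize(rows, num_buckets):
    """Group (key, value) rows by bucket id, as a bucketed table would store them."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets


def bucketed_join(left, right, num_buckets):
    """Join bucket i of the left side only with bucket i of the right side."""
    left_buckets = bucketize(left, num_buckets)
    right_buckets = bucketize(right, num_buckets)
    out = []
    for b in range(num_buckets):
        # Build a small hash index over the left bucket only.
        index = defaultdict(list)
        for key, value in left_buckets.get(b, []):
            index[key].append(value)
        for key, other in right_buckets.get(b, []):
            for value in index.get(key, []):
                out.append((key, value, other))
    return out


def naive_join(left, right):
    """Reference inner join that compares every pair of rows."""
    return [(k, v, w) for k, v in left for k2, w in right if k == k2]


if __name__ == "__main__":
    table_a = [(1, "a1"), (2, "a2"), (5, "a5"), (5, "a5b")]
    table_b = [(5, "b5"), (2, "b2"), (7, "b7")]
    print(sorted(bucketed_join(table_a, table_b, 4)))
```

Matching keys always land in the same bucket on both sides, so each bucket pair can be joined independently. That independence is what lets an engine skip the shuffle when the stored layout of one table already matches the way the other side is distributed.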
RE: Hive Hash in Spark

tcondie

Thanks Ryan and Reynold for the information!

Cheers,

Tyson


From: Ryan Blue <[hidden email]>
Sent: Wednesday, March 6, 2019 3:47 PM
To: Reynold Xin <[hidden email]>
Cc: [hidden email]; Spark Dev List <[hidden email]>
Subject: Re: Hive Hash in Spark

I think this was needed to add support for bucketed Hive tables. Like Tyson noted, if the other side of a join can be bucketed the same way, then Spark can use a bucketed join. I have long-term plans to support this in the DataSourceV2 API, but I don't think we are very close to implementing it yet.