[SQL] hash: 64-bits and seeding

3 messages
[SQL] hash: 64-bits and seeding

Huon.Wilson
Hi,

I’m working on something that requires deterministic randomness, i.e. a row gets the same “random” value no matter the order of the DataFrame. A seeded hash seems to be the perfect way to do this, but the existing hashes have various limitations:

- hash: 32-bit output (only ~4 billion possible values, which means many collisions for large tables: the birthday paradox implies a >50% chance of at least one collision for tables larger than about 77,000 rows, and on the order of a billion collisions in a table of 4 billion rows)
- sha1/sha2/md5: single binary column input, string output
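For concreteness, the birthday-paradox numbers above can be reproduced with the standard approximations (a standalone sketch, not part of the patch: P(collision) ≈ 1 − exp(−n²/2m) for n rows hashed into m possible values):

```scala
// Birthday-paradox estimates for a 32-bit hash (m = 2^32 possible values).
object BirthdayEstimate {
  val m: Double = math.pow(2, 32)

  // Probability of at least one collision among n hashed rows.
  def pCollision(n: Double): Double = 1 - math.exp(-n * n / (2 * m))

  // Expected number of rows that collide with an earlier row:
  // n minus the expected number of distinct hash values.
  def expectedCollisions(n: Double): Double = n - m * (1 - math.exp(-n / m))

  def main(args: Array[String]): Unit = {
    println(f"P(collision), 77,000 rows:  ${pCollision(77000.0)}%.2f")       // ~0.5
    println(f"collisions, 4 billion rows: ${expectedCollisions(4e9)}%.2e")   // order of 1e9
  }
}
```

The exact expected-collision count depends on the counting convention, but it is on the order of a billion either way — far too many for a per-row "random" value.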

It seems there’s already support for a 64-bit hash function that can work with an arbitrary number of arbitrary-typed columns (XxHash64), and exposing this for DataFrames seems like it’s essentially one line in sql/functions.scala to match `hash` (plus docs, tests, function registry etc.):

    def hash64(cols: Column*): Column = withExpr { new XxHash64(cols.map(_.expr)) }

For my use case, this can then be used to get a 64-bit “random” column like

    val seed = rng.nextLong()
    hash64(lit(seed), col1, col2)

I’ve created a (hopefully) complete patch by mimicking ‘hash’ at https://github.com/apache/spark/compare/master...huonw:hash64; should I open a JIRA and submit it as a pull request?

Additionally, both hash and the new hash64 already have support for being seeded, but this isn’t exposed directly and instead requires something like the `lit` above. Would it make sense to add overloads like the following?

    def hash(seed: Int, cols: Column*): Column = …
    def hash64(seed: Long, cols: Column*): Column = …

Though, it does seem a bit unfortunate to be forced to pass the seed first (a varargs parameter must come last in Scala, so the seed can't trail the columns).
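For reference, such overloads would just be thin wrappers over the seeded constructors the underlying expressions already have — a sketch only, assuming the `Murmur3Hash(children, seed)` and `XxHash64(children, seed)` expression constructors:

```scala
// Hypothetical seed-first overloads (sketch, mirroring the existing `hash`).
def hash(seed: Int, cols: Column*): Column =
  withExpr { new Murmur3Hash(cols.map(_.expr), seed) }

def hash64(seed: Long, cols: Column*): Column =
  withExpr { new XxHash64(cols.map(_.expr), seed) }
```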

(I sent this email to [hidden email] a few days ago, but didn't get any discussion about the Spark aspects of this, so I'm resending it here; I apologise in advance if I'm breaking protocol!)

- Huon Wilson


---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: [SQL] hash: 64-bits and seeding

rxin
Rather than calling it hash64, it'd be better to just call it xxhash64. The reason being ten years from now, we probably would look back and laugh at a specific hash implementation. It'd be better to just name the expression what it is.




Re: [SQL] hash: 64-bits and seeding

Huon.Wilson

Thanks for the guidance. Naming it after the specific hash was my initial inclination too, but I decided consistency with the existing ‘hash’ was better. Like you, though, I prefer the specific form.

 

I’ve opened https://issues.apache.org/jira/browse/SPARK-27099 and submitted the patch (using ‘xxhash64’) at https://github.com/apache/spark/pull/24019.
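Once merged, usage is the same as the `hash64` example upthread, just under the new name (`df`, `col1`, and `col2` are placeholders):

```scala
import org.apache.spark.sql.functions.{col, lit, xxhash64}

// Deterministic 64-bit "random" column: a given row always hashes to the
// same value, regardless of DataFrame ordering or partitioning.
val seed = scala.util.Random.nextLong()
val withRand = df.withColumn("rand64", xxhash64(lit(seed), col("col1"), col("col2")))
```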

 

- Huon

 
