Hi everyone,

I was playing around with the LSH/MinHash module in Spark ML and noticed
that the hash computation is done in Int arithmetic (see
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala#L69).
Since "a" and "b" are drawn from a uniform distribution over [1,
MinHashLSH.HASH_PRIME] and MinHashLSH.HASH_PRIME is close to Int.MaxValue,
the multiplication is likely to overflow Int when the input is a large
sparse vector (that is, one whose non-zero indices are large).
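
To make the overflow concrete, here is a minimal sketch. It assumes the
linked line computes something of the form ((1 + elem) * a + b) % prime and
that MinHashLSH.HASH_PRIME is 2038074743; both should be checked against the
source. The object and method names are hypothetical.

```scala
object IntOverflowSketch {
  // Assumed value of MinHashLSH.HASH_PRIME; verify against the linked source.
  val HashPrime: Int = 2038074743

  // Hypothetical Int-only hash: the product (1 + elem) * a can wrap around.
  def hashInt(elem: Int, a: Int, b: Int): Int =
    ((1 + elem) * a + b) % HashPrime

  // Same formula with the product widened to Long, used as the exact reference.
  def hashLong(elem: Int, a: Int, b: Int): Long =
    ((1L + elem) * a + b) % HashPrime

  def main(args: Array[String]): Unit = {
    val elem = 1048575   // hypothetical non-zero index, 2^20 - 1
    val a = 1073741824   // hypothetical coefficient, 2^30, within [1, HashPrime]
    val b = 7
    // In Int, 2^20 * 2^30 = 2^50 wraps to 0, so the Int hash collapses to b.
    println(s"Int:  ${hashInt(elem, a, b)}")  // prints "Int:  7"
    println(s"Long: ${hashLong(elem, a, b)}")
  }
}
```

With these inputs the Int version silently returns 7 while the Long version
returns a different value in [0, HashPrime).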

I wonder whether this is a bug or intended behavior. If it is a bug, one
way to fix it is to compute the hashes in Long and insert a couple of
"mod MinHashLSH.HASH_PRIME" reductions. Because MinHashLSH.HASH_PRIME is
smaller than sqrt(2^63 - 1), the product of two values already reduced mod
MinHashLSH.HASH_PRIME won't overflow a 64-bit integer. Another option is
to use BigInteger.
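
A rough sketch of the Long-based variant, checked against an
arbitrary-precision reference. It assumes the hash has the form
((1 + elem) * a + b) mod prime, that "a" and "b" are positive, and that
MinHashLSH.HASH_PRIME is 2038074743; the object and method names are
hypothetical, and Scala's BigInt (a wrapper around java.math.BigInteger)
stands in for the BigInteger option.

```scala
object LongHashSketch {
  // Assumed value of MinHashLSH.HASH_PRIME; verify against the linked source.
  val HashPrime: Long = 2038074743L

  // Hypothetical Long-based hash. Both factors are reduced mod HashPrime
  // before multiplying, so the product stays below HashPrime^2 < 2^63 - 1.
  def hashLong(elem: Int, a: Int, b: Int): Long = {
    val x = (1L + elem) % HashPrime
    val product = (x * (a % HashPrime)) % HashPrime
    (product + b % HashPrime) % HashPrime
  }

  // Reference computed with arbitrary-precision integers.
  def hashBig(elem: Int, a: Int, b: Int): Long =
    ((BigInt(elem) + 1) * a + b).mod(BigInt(HashPrime)).toLong

  def main(args: Array[String]): Unit = {
    val elem = Int.MaxValue - 1 // largest index for which 1 + elem fits in Int
    val a = 2038074742          // hypothetical coefficient near HashPrime
    val b = 2038074742
    println(hashLong(elem, a, b) == hashBig(elem, a, b)) // prints "true"
  }
}
```

The Long version avoids allocating BigInt objects per element, which is
probably the main reason to prefer it in a per-element hash loop.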

Let me know what you think.

Thanks,

Jiayuan

--

Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
