https://spark-project.atlassian.net/browse/SPARK-1153

kant kodali
Hi All,

Is there any chance of fixing this one? https://spark-project.atlassian.net/browse/SPARK-1153 Or could someone offer a workaround?

Currently, I have a bunch of events streaming into Kafka across various topics, and each event is stamped with a UUIDv1, so it is easy to construct edges using the UUIDs. I am not sure how to generate a long-based unique ID without synchronization in a distributed setting. I read an SO post that shows two ways one might achieve this:

1.  UUID.randomUUID().getMostSignificantBits() & Long.MAX_VALUE

2.  (System.currentTimeMillis() << 20) | (System.nanoTime() & ~9223372036854251520L)

However, I am concerned about collisions and would like to know the probability of a collision for each of the two approaches above. Any suggestions?
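For a rough estimate, the birthday bound p ≈ 1 - exp(-n(n-1) / (2 * 2^b)) can be evaluated directly. The sketch below is only an approximation under stated assumptions. For approach 1, UUID.randomUUID() returns a version 4 UUID whose most significant 64 bits contain 4 fixed version bits, and the & Long.MAX_VALUE mask clears one more bit, which leaves 59 random bits. For approach 2, the constant ~9223372036854251520L equals 0x800000000007FFFFL, so only the low 19 bits of System.nanoTime() (plus its sign bit) are kept. The sketch assumes those 19 bits behave like uniform random values across machines, which is not guaranteed. The class and method names are hypothetical.

```java
final class CollisionEstimate {
    // Mask used in approach 2: keeps the low 19 bits and the sign bit of nanoTime.
    static final long NANO_MASK = ~9223372036854251520L;

    // Birthday-bound approximation of the chance that at least two of n
    // uniformly random values drawn from a space of 2^randomBits collide.
    static double collisionProbability(double n, int randomBits) {
        double space = Math.pow(2.0, randomBits);
        // -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
        return -Math.expm1(-n * (n - 1.0) / (2.0 * space));
    }

    public static void main(String[] args) {
        // Approach 1: 59 random bits (assumption explained in the text above).
        for (double n : new double[] {1e6, 1e8, 1e9, 1e10}) {
            System.out.printf("approach 1, n = %.0e: p ~ %.3e%n",
                    n, collisionProbability(n, 59));
        }

        // Approach 2: show which bits of nanoTime survive the mask.
        System.out.println("mask = 0x" + Long.toHexString(NANO_MASK));

        // Approach 2: k IDs generated within the same millisecond share the
        // shifted timestamp, so they differ only in about 19 bits (assumed uniform).
        for (double k : new double[] {10, 100, 1000}) {
            System.out.printf("approach 2, k = %.0f per ms: p ~ %.3e%n",
                    k, collisionProbability(k, 19));
        }
    }
}
```

Under these assumptions, approach 1 reaches a collision probability above one half at roughly a billion IDs. Approach 2 depends on how many IDs are generated in the same millisecond rather than on the total count, and it can yield negative values when System.nanoTime() is negative.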

I ran the Connected Components algorithm using GraphFrames. It runs well when long-based IDs are used, but with string IDs the performance drops significantly, as pointed out in the ticket. I understand that the algorithm depends heavily on hashing integers, but I wonder: why not a fixed-length byte[]? That way we could convert any data type to a sequence of bytes.
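To illustrate the fixed-length byte[] idea, here is a minimal sketch that packs a UUID into exactly 16 bytes and back using only the JDK. It does not reflect how GraphX or GraphFrames represent vertex IDs; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

final class UuidBytes {
    // Encode the 128-bit UUID as a fixed-length, big-endian 16-byte array.
    static byte[] toBytes(UUID id) {
        return ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    // Decode a 16-byte array produced by toBytes back into a UUID.
    static UUID fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long msb = buf.getLong();
        long lsb = buf.getLong();
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        byte[] key = toBytes(id);
        System.out.println(key.length + " bytes, round trip ok: "
                + id.equals(fromBytes(key)));
    }
}
```

One caveat with this design: Java arrays use identity-based equals and hashCode, so a byte[] key would need java.util.Arrays.equals and Arrays.hashCode (or a wrapper type) before it could be hashed or compared by content.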

Thanks!
Re: https://spark-project.atlassian.net/browse/SPARK-1153

kant kodali
Sorry, please ignore this. I accidentally ran it with GraphX instead of GraphFrames, which indeed generates its own IDs. That's great!

Thanks

On Sun, Feb 23, 2020 at 3:53 PM kant kodali <[hidden email]> wrote: