Unique Partition Id per partition


Chawla,Sumit
Hi All,

I have an RDD which I partition based on some key, and then I can call sc.runJob for each partition.
Inside this function, I assign each partition a unique key using the following:

"%s_%s" % (id(part), int(round(time.time())))

This is to make sure that each partition produces separate bookkeeping data,
which can be aggregated by an external system. However, I sometimes notice multiple
partition results pointing to the same partition_id. Is this an issue with the
way the above code is serialized by PySpark? What's the best way to define a unique id
for each partition? I understand that the same executor gets multiple partitions to process,
but I would expect the above code to produce a unique id for each partition.
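One way to sidestep the collision (id() returns a memory address that CPython recycles, and time.time() rounded to whole seconds is shared by partitions starting in the same second) is to derive the key from the partition index Spark itself assigns. A minimal sketch, assuming mapPartitionsWithIndex; the tag_partition helper name is hypothetical, not from the thread:

```python
import uuid

# Sketch, not the original poster's code: build the key from the partition
# index Spark passes to mapPartitionsWithIndex (unique within a job),
# combined with a uuid4 so keys also stay unique across jobs and retries.
def tag_partition(index, iterator):
    partition_id = "%s_%s" % (index, uuid.uuid4().hex)
    # Emit one bookkeeping record per partition: (unique id, record count).
    yield (partition_id, sum(1 for _ in iterator))

# With a live SparkContext this would be wired up roughly as:
#   rdd.mapPartitionsWithIndex(tag_partition).collect()
```

Because the index comes from Spark rather than from the worker process's memory layout or wall clock, two partitions can never share a key within a job, regardless of which executor processes them.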


Regards
Sumit Chawla

Re: Unique Partition Id per partition

Michael Allman-2
Hi Sumit,


Michael

On Jan 31, 2017, at 9:08 AM, Chawla,Sumit <[hidden email]> wrote:
