get method guid prefix for file parts for write


gpongracz
I lack the vocabulary for this question so please bear with my description of
the problem...

I am searching for a way to get the GUID prefix that Spark uses when it
writes the parts of a file.

e.g.:

part-00000-b5265e7b-b974-4083-a66e-e7698258ca50-c000.csv

I would like to get the prefix "00000-b5265e7b-b974-4083-a66e-e7698258ca50"

Is there a way that I might be able to access this value programmatically?

Any assistance is appreciated.

George Pongracz
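
For reference, a minimal sketch of pulling that prefix out of a part file name once the file already exists. The pattern is an assumption inferred from the single example above, and `split_part_name` is a hypothetical helper, not part of Spark; it does not predict the value before the write.

```python
import re

# Pattern inferred from the one file name quoted in this thread; other
# Spark versions or writers may name their part files differently.
PART_FILE_RE = re.compile(
    r"^part-(?P<partition>\d{5})-"
    r"(?P<uuid>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
    r"(?P<suffix>.*)$"
)

def split_part_name(name):
    """Return (partition, uuid, prefix) for a part file name, or None."""
    match = PART_FILE_RE.match(name)
    if match is None:
        return None
    partition, uuid = match.group("partition"), match.group("uuid")
    return partition, uuid, f"{partition}-{uuid}"

result = split_part_name(
    "part-00000-b5265e7b-b974-4083-a66e-e7698258ca50-c000.csv"
)
print(result[2])  # 00000-b5265e7b-b974-4083-a66e-e7698258ca50
```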




--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: get method guid prefix for file parts for write

EveLiao
If I understand your problem correctly, the prefix you provided is actually
"00000-" + UUID. You can generate a UUID with a generator such as
https://docs.python.org/3/library/uuid.html#uuid.uuid4.
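
A small sketch of that suggestion, assuming the file name layout seen in the example earlier in the thread. `example_part_name` is a hypothetical helper, and the "c000" piece is copied verbatim from that one example. Note that `uuid.uuid4()` is random, so a value generated this way should not be expected to match the UUID in files Spark has written; the sketch only illustrates the shape of the name.

```python
import uuid

def example_part_name(partition_id, job_uuid, ext="csv"):
    # Layout copied from the single file name quoted in this thread;
    # it is an assumption, not a documented Spark contract.
    return f"part-{partition_id:05d}-{job_uuid}-c000.{ext}"

random_id = uuid.uuid4()  # random; differs on every call
print(example_part_name(0, random_id))

# Rebuilding the name from the thread's own example values:
print(example_part_name(0, "b5265e7b-b974-4083-a66e-e7698258ca50"))
```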




Re: get method guid prefix for file parts for write

Nicholas Chammas
I think what George is looking for is a way to determine ahead of time the partition IDs that Spark will use when writing output.

George,


Specifically, the part that says "TaskContext.get.partitionId()".

I don't know how much of that is part of Spark's public API, but there it is.

It would be useful if Spark offered a way to get a manifest of output files for any given write operation, similar to Redshift's MANIFEST option. This would help when, for example, you need to pass a list of files output by Spark to some other system (like Redshift) and don't want to have to worry about the consistency guarantees of your object store's list operations.
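
To make the manifest idea concrete, here is a rough sketch of the JSON shape that Redshift's COPY accepts as a manifest (an "entries" list of objects with "url" and "mandatory"). `build_manifest`, the bucket name, and the key list are hypothetical; Spark does not produce this today, and the keys still have to come from somewhere, which is exactly the gap described above.

```python
import json

def build_manifest(bucket, keys):
    """Build a Redshift-COPY-style manifest dict from a list of object keys."""
    entries = [
        {"url": f"s3://{bucket}/{key}", "mandatory": True}
        for key in sorted(keys)
        # Keep only part files; skip markers such as _SUCCESS.
        if key.rsplit("/", 1)[-1].startswith("part-")
    ]
    return {"entries": entries}

# Hypothetical keys, using the file name from earlier in the thread.
keys = [
    "out/_SUCCESS",
    "out/part-00000-b5265e7b-b974-4083-a66e-e7698258ca50-c000.csv",
]
print(json.dumps(build_manifest("my-bucket", keys), indent=2))
```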

Nick

On Fri, Sep 25, 2020 at 2:00 PM EveLiao <[hidden email]> wrote:
If I understand your problem correctly, the prefix you provided is actually
"00000-" + UUID. You can generate a UUID with a generator such as
https://docs.python.org/3/library/uuid.html#uuid.uuid4.




Re: get method guid prefix for file parts for write

gpongracz
What Nick said was correct.

I should also state that I am using PySpark in this case, not the Scala API.

I am looking to use the GUID prefix of part-0 to prevent a race condition by
using an S3 waiter to wait for the part to appear, but to achieve this I need
to know the GUID value in advance.

Thank you all again for your help.

Regards,

George
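
One possible way around needing the GUID in advance is to wait on the fixed part of the name (for example "part-00000-") instead of the full file name. The sketch below assumes this; `wait_for_key_with_prefix` and `list_keys` are hypothetical names. With boto3, `list_keys` could presumably wrap `list_objects_v2` with a `Prefix` argument, but that wiring is not shown or tested here, and whether it avoids the race still depends on the object store's listing consistency.

```python
import time

def wait_for_key_with_prefix(list_keys, prefix, timeout=60.0, interval=1.0):
    """Poll list_keys(prefix) until it returns at least one key, or time out.

    list_keys is a hypothetical caller-supplied callable that takes a key
    prefix and returns an iterable of matching object keys.
    """
    deadline = time.monotonic() + timeout
    while True:
        keys = list(list_keys(prefix))
        if keys:
            return keys
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no key with prefix {prefix!r} after {timeout}s")
        time.sleep(interval)

# Stand-in for an object store: the part file "appears" on the third poll.
calls = {"n": 0}

def fake_list_keys(prefix):
    calls["n"] += 1
    if calls["n"] < 3:
        return []
    return ["out/part-00000-b5265e7b-b974-4083-a66e-e7698258ca50-c000.csv"]

found = wait_for_key_with_prefix(fake_list_keys, "out/part-00000-", interval=0.01)
print(found)
```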




Re: get method guid prefix for file parts for write

gpongracz
In reply to this post by Nicholas Chammas
I should add that I tried using a waiter on the _SUCCESS file, but it did not
work. Probably because it is so much smaller than the part-0 file, _SUCCESS
seems to appear in S3 before the part-0 file, even though it was written
afterwards.


