Async RDD saves


Async RDD saves

Antonin Delpeuch
Hi all,

Following my request on the user mailing list [1], there does not seem
to be any simple way to save RDDs to the file system in an asynchronous
way. I am looking into implementing this, so I am first checking whether
there is consensus around the idea.

The goal would be to add methods such as `saveAsTextFileAsync` and
`saveAsObjectFileAsync` to the RDD API.

I am thinking about doing this by:

- refactoring SparkHadoopWriter to allow for submitting jobs
asynchronously (with `submitJob` rather than `runJob`)

- adding a `saveAsHadoopFileAsync` method to `PairRDDFunctions`, as a
counterpart to the existing `saveAsHadoopFile`

- adding `saveAsTextFileAsync` (and equivalents for other formats) to
`AsyncRDDActions`.

Because SparkHadoopWriter is private, it is complicated to reimplement
this functionality outside of Spark as a user, so I think this would be
an API worth offering. Hopefully, this can be implemented without too
much code duplication.

Cheers,

Antonin

[1]:
http://apache-spark-user-list.1001560.n3.nabble.com/Async-API-to-save-RDDs-td38320.html



---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Async RDD saves

Sean Owen-2
Why do you need to do it, and can you just use a future in your driver code?
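
As an illustration of the driver-side future idea, here is a minimal Java sketch that uses only the JDK. `blockingSave` is a hypothetical stand-in for a blocking action such as `rdd.saveAsTextFile(path)`, and the class and method names are made up for this example; nothing here is part of Spark's API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class DriverSideFutureSketch {

    // Hypothetical stand-in for a blocking action (for example a save to the file system).
    static String blockingSave(String path) {
        try {
            Thread.sleep(100); // pretend the job takes a while
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("save interrupted", e);
        }
        return path;
    }

    // Runs the blocking call on the given executor and returns immediately.
    static CompletableFuture<String> saveAsync(String path, ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> blockingSave(path), pool);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<String> pending = saveAsync("/tmp/output", pool);
            // The calling thread is free to do other work here.
            String savedPath = pending.join(); // block only when the result is needed
            System.out.println("saved to " + savedPath);
        } finally {
            pool.shutdown();
        }
    }
}
```

This only makes the call non-blocking for the caller; it says nothing yet about cancellation or about per-thread job metadata.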

On Fri, Aug 7, 2020 at 9:01 AM Antonin Delpeuch (lists)
<[hidden email]> wrote:

> The goal would be to add methods such as `saveAsTextFileAsync` and
> `saveAsObjectFileAsync` to the RDD API.


Re: Async RDD saves

edeesis
I will agree that the side effects of using Futures in driver code tend to be tricky to track down.

If you forget to clear the job description and job group information, then the local properties on the SparkContext remain intact, and SparkContext#submitJob makes sure to pass down those localProperties.

This has led to us doing this hack:

[image.png: attachment not preserved]

This can also cause problems with Spark Streaming, where the Streaming UI can get messed up when the various streaming-related properties are cleared or reused.
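
The attached image is not available, so the following is not the workaround referred to above. It is a JDK-only simulation of the pitfall, assuming that job metadata is kept per thread and that driver-side futures run on pooled threads that get reused. `JOB_DESCRIPTION` and the class name are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class LocalPropertyLeakSketch {

    // Hypothetical stand-in for a per-thread job description.
    static final ThreadLocal<String> JOB_DESCRIPTION = new ThreadLocal<>();

    // Runs two tasks on the same pooled thread and returns the description
    // that the second task observes.
    static String descriptionSeenBySecondTask(boolean clearAfterFirst) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Runnable firstTask = () -> {
            JOB_DESCRIPTION.set("save job A");
            try {
                // ... the blocking action would run here ...
            } finally {
                if (clearAfterFirst) {
                    JOB_DESCRIPTION.remove(); // cleanup that is easy to forget
                }
            }
        };
        Callable<String> secondTask = JOB_DESCRIPTION::get;
        try {
            pool.submit(firstTask).get();
            return pool.submit(secondTask).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("without cleanup: " + descriptionSeenBySecondTask(false));
        System.out.println("with cleanup:    " + descriptionSeenBySecondTask(true));
    }
}
```

Without the cleanup, the second task sees the description left behind by the first one. In Spark, `SparkContext.clearJobGroup()` is the kind of call that would go in such a `finally` block.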

On Fri, Aug 7, 2020 at 10:38 AM Sean Owen <[hidden email]> wrote:
> Why do you need to do it, and can you just use a future in your driver code?


Re: Async RDD saves

kalyan
This looks interesting. It would be good if you could elaborate on your expectations and on the other approaches you tried before deciding to do it this way.

Regards,
Kalyan.

On Fri, Aug 7, 2020, 11:24 PM Edward Mitchell <[hidden email]> wrote:
> I will agree that the side effects of using Futures in driver code tend to be tricky to track down.


Re: Async RDD saves

Antonin Delpeuch
In reply to this post by edeesis
Hi both,

Thanks for your replies!

Sean, your proposal to use a driver-side future wrapping the blocking
call sounds a lot easier indeed.

But I want to ensure that canceling the future in the driver code kills
the corresponding tasks on all executors. If I wrap the driver-side call
in a standard Scala or Java future it will not be cancelable, will it? I
think I would need to interrupt the thread that executes the future somehow.
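
To make the question concrete, here is a small JDK-only Java sketch; the class and method names are hypothetical and no Spark code is involved. It compares `cancel(true)` on a `CompletableFuture` with `cancel(true)` on the `Future` returned by `ExecutorService.submit`. The JDK documents that `CompletableFuture.cancel` ignores its `mayInterruptIfRunning` argument, whereas the future returned by `submit` interrupts the worker thread. Scala's standard `Future` has no `cancel` method at all.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

final class FutureCancellationSketch {

    // Returns true if the worker thread was interrupted by cancel(true).
    static boolean workerInterrupted(boolean useCompletableFuture) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(1);
        AtomicBoolean interrupted = new AtomicBoolean(false);

        Runnable blockingCall = () -> {
            started.countDown();
            try {
                Thread.sleep(500); // stand-in for a blocking save
            } catch (InterruptedException e) {
                interrupted.set(true);
            } finally {
                finished.countDown();
            }
        };

        try {
            Future<?> future;
            if (useCompletableFuture) {
                future = CompletableFuture.runAsync(blockingCall, pool);
            } else {
                future = pool.submit(blockingCall);
            }
            started.await();     // make sure the task is running
            future.cancel(true); // request cancellation with interruption
            finished.await();    // wait until the worker has actually stopped
            return interrupted.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("CompletableFuture interrupted worker: " + workerInterrupted(true));
        System.out.println("ExecutorService future interrupted worker: " + workerInterrupted(false));
    }
}
```

Interrupting the driver thread is only half of the question. Whether the tasks on the executors stop as well is up to Spark, which offers `SparkContext.setJobGroup` (with its `interruptOnCancel` flag) and `SparkContext.cancelJobGroup` for that purpose.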

As you can see I am far from an expert on this topic, sorry if I
misunderstood your proposal.

Cheers,
Antonin


On 07/08/2020 19:53, Edward Mitchell wrote:

> I will agree that the side effects of using Futures in driver code tend
> to be tricky to track down.