[VOTE][SPARK-25299] SPIP: Shuffle Storage API

[VOTE][SPARK-25299] SPIP: Shuffle Storage API

Matt Cheah

Hi everyone,

I would like to call a vote for the SPIP for SPARK-25299 [issues.apache.org], which proposes to introduce a pluggable storage API for temporary shuffle data.

You may find the SPIP document here [docs.google.com].

The discussion thread for the SPIP was conducted here [lists.apache.org].

Please vote on whether or not this proposal is agreeable to you.

Thanks!

-Matt Cheah



Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

bo yang
+1. This is great work, allowing different sort shuffle write/read implementations to be plugged in! It is also great to see that it retains the current Spark configuration (spark.shuffle.manager=org.apache.spark.shuffle.YourShuffleManagerImpl).


On Thu, Jun 13, 2019 at 2:58 PM Matt Cheah <[hidden email]> wrote:

Hi everyone,

I would like to call a vote for the SPIP for SPARK-25299, which proposes to introduce a pluggable storage API for temporary shuffle data.

You may find the SPIP document here.

The discussion thread for the SPIP was conducted here.

Please vote on whether or not this proposal is agreeable to you.

Thanks!

-Matt Cheah


Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

ifilonenko
+1 (non-binding). This API is versatile and flexible enough to handle Bloomberg's internal use cases. The ability for us to vary implementation strategies is quite appealing. It is also worth noting the minimal changes to Spark core needed to make it work. This is a much-needed addition to the Spark shuffle story.

On Fri, Jun 14, 2019 at 9:59 AM bo yang <[hidden email]> wrote:
+1. This is great work, allowing different sort shuffle write/read implementations to be plugged in! It is also great to see that it retains the current Spark configuration (spark.shuffle.manager=org.apache.spark.shuffle.YourShuffleManagerImpl).



Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

Imran Rashid-2
+1 (binding)

I think this is a really important feature for Spark.

First, there is already a lot of interest in alternative shuffle storage in the community, from dynamic allocation in Kubernetes to even just improving stability in standard on-premise use of Spark.  However, people are often stuck doing this in forks of Spark, and in ways that are either not maintainable (because they copy-paste many Spark internals) or incorrect (because they do not correctly handle speculative execution and stage retries).

Second, I think the specific proposal strikes the right balance between flexibility and complexity, and allows incremental improvements.  A lot of work has already been put into figuring out which pieces are essential to make alternative shuffle storage implementations feasible.

Of course, that means it doesn't include everything imaginable; some things still aren't supported, and some will still choose to use the older ShuffleManager API, which gives total control over all of shuffle.  But we know there is a reasonable set of things that can be implemented behind the API as a first step, and it can continue to evolve.

On Fri, Jun 14, 2019 at 12:13 PM Ilan Filonenko <[hidden email]> wrote:
+1 (non-binding). This API is versatile and flexible enough to handle Bloomberg's internal use cases. The ability for us to vary implementation strategies is quite appealing. It is also worth noting the minimal changes to Spark core needed to make it work. This is a much-needed addition to the Spark shuffle story.


Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

Saisai Shao
+1 (binding)

Thanks
Saisai

On Sat, Jun 15, 2019 at 3:46 AM Imran Rashid <[hidden email]> wrote:
+1 (binding)

I think this is a really important feature for Spark.

First, there is already a lot of interest in alternative shuffle storage in the community, from dynamic allocation in Kubernetes to even just improving stability in standard on-premise use of Spark.  However, people are often stuck doing this in forks of Spark, and in ways that are either not maintainable (because they copy-paste many Spark internals) or incorrect (because they do not correctly handle speculative execution and stage retries).

Second, I think the specific proposal strikes the right balance between flexibility and complexity, and allows incremental improvements.  A lot of work has already been put into figuring out which pieces are essential to make alternative shuffle storage implementations feasible.

Of course, that means it doesn't include everything imaginable; some things still aren't supported, and some will still choose to use the older ShuffleManager API, which gives total control over all of shuffle.  But we know there is a reasonable set of things that can be implemented behind the API as a first step, and it can continue to evolve.


Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

Dongjoon Hyun-2
+1

Bests,
Dongjoon.


On Sun, Jun 16, 2019 at 9:41 PM Saisai Shao <[hidden email]> wrote:
+1 (binding)

Thanks
Saisai


Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

Ryan Blue
+1 (non-binding)

On Sun, Jun 16, 2019 at 11:11 PM Dongjoon Hyun <[hidden email]> wrote:
+1

Bests,
Dongjoon.



--
Ryan Blue
Software Engineer
Netflix

Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

liyinan926
+1 (non-binding) 

On Mon, Jun 17, 2019 at 1:58 PM Ryan Blue <[hidden email]> wrote:
+1 (non-binding)


Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

Felix Cheung
+1

Glad to see the progress in this space - it’s been more than a year since the original discussion and effort started.


From: Yinan Li <[hidden email]>
Sent: Monday, June 17, 2019 7:14:42 PM
To: [hidden email]
Cc: Dongjoon Hyun; Saisai Shao; Imran Rashid; Ilan Filonenko; bo yang; Matt Cheah; Spark Dev List; Yifei Huang (PD); Vinoo Ganesh; Imran Rashid
Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API
 
+1 (non-binding) 


Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

Vinoo Ganesh

+1 (non-binding).

Thanks for pushing this forward, Matt and Yifei.

From: Felix Cheung <[hidden email]>
Date: Tuesday, June 18, 2019 at 00:01
To: Yinan Li <[hidden email]>, "[hidden email]" <[hidden email]>
Cc: Dongjoon Hyun <[hidden email]>, Saisai Shao <[hidden email]>, Imran Rashid <[hidden email]>, Ilan Filonenko <[hidden email]>, bo yang <[hidden email]>, Matt Cheah <[hidden email]>, Spark Dev List <[hidden email]>, "Yifei Huang (PD)" <[hidden email]>, Vinoo Ganesh <[hidden email]>, Imran Rashid <[hidden email]>
Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

+1

Glad to see the progress in this space - it’s been more than a year since the original discussion and effort started.

 



Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

John Zhuge-2
+1 (non-binding)  Great work!

On Tue, Jun 18, 2019 at 6:22 AM Vinoo Ganesh <[hidden email]> wrote:

+1 (non-binding).

Thanks for pushing this forward, Matt and Yifei.

 


--
John

Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

dhruve ashar
+1 (non-binding)

On Tue, Jun 18, 2019 at 12:12 PM John Zhuge <[hidden email]> wrote:
+1 (non-binding)  Great work!


--
-Dhruve Ashar


RE: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

Guo, Chenzhao

Cool : )

+1 (non-binding)

Chenzhao

From: dhruve ashar [mailto:[hidden email]]
Sent: Wednesday, June 19, 2019 2:58 AM
To: John Zhuge <[hidden email]>
Cc: Vinoo Ganesh <[hidden email]>; Felix Cheung <[hidden email]>; Yinan Li <[hidden email]>; [hidden email]; Dongjoon Hyun <[hidden email]>; Saisai Shao <[hidden email]>; Imran Rashid <[hidden email]>; Ilan Filonenko <[hidden email]>; bo yang <[hidden email]>; Matt Cheah <[hidden email]>; Spark Dev List <[hidden email]>; Yifei Huang (PD) <[hidden email]>; Imran Rashid <[hidden email]>
Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

+1 (non-binding)

 

On Tue, Jun 18, 2019 at 12:12 PM John Zhuge <[hidden email]> wrote:

+1 (non-binding)  Great work!

 

On Tue, Jun 18, 2019 at 6:22 AM Vinoo Ganesh <[hidden email]> wrote:

+1 (non-binding).

 

Thanks for pushing this forward, Matt and Yifei.

 

From: Felix Cheung <[hidden email]>
Date: Tuesday, June 18, 2019 at 00:01
To: Yinan Li <[hidden email]>, "[hidden email]" <[hidden email]>
Cc: Dongjoon Hyun <[hidden email]>, Saisai Shao <[hidden email]>, Imran Rashid <[hidden email]>, Ilan Filonenko <[hidden email]>, bo yang <[hidden email]>, Matt Cheah <[hidden email]>, Spark Dev List <[hidden email]>, "Yifei Huang (PD)" <[hidden email]>, Vinoo Ganesh <[hidden email]>, Imran Rashid <[hidden email]>
Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

 

+1

 

Glad to see the progress in this space - it’s been more than a year since the original discussion and effort started.

 


From: Yinan Li <[hidden email]>
Sent: Monday, June 17, 2019 7:14:42 PM
To: [hidden email]
Cc: Dongjoon Hyun; Saisai Shao; Imran Rashid; Ilan Filonenko; bo yang; Matt Cheah; Spark Dev List; Yifei Huang (PD); Vinoo Ganesh; Imran Rashid
Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

 

+1 (non-binding) 

 

On Mon, Jun 17, 2019 at 1:58 PM Ryan Blue <[hidden email]> wrote:

+1 (non-binding)

 

On Sun, Jun 16, 2019 at 11:11 PM Dongjoon Hyun <[hidden email]> wrote:

+1

 

Bests,

Dongjoon.

 

 

On Sun, Jun 16, 2019 at 9:41 PM Saisai Shao <[hidden email]> wrote:

+1 (binding)

 

Thanks

Saisai

 

Imran Rashid <[hidden email]> 2019615日周六 上午3:46写道:

+1 (binding)

I think this is a really important feature for spark.

First, there is already a lot of interest in alternative shuffle storage in the community.  There is already a lot of interest in alternative shuffle storage, from dynamic allocation in kubernetes, to even just improving stability in standard on-premise use of Spark.  However, they're often stuck doing this in forks of Spark, and in ways that are not maintainable (because they copy-paste many spark internals) or are incorrect (for not correctly handling speculative execution & stage retries).

Second, I think the specific proposal is good for finding the right balance between flexibility and too much complexity, to allow incremental improvements.  A lot of work has been put into this already to try to figure out which pieces are essential to make alternative shuffle storage implementations feasible.

Of course, that means it doesn't include everything imaginable; some things still aren't supported, and some will still choose to use the older ShuffleManager api to give total control over all of shuffle.  But we know there are a reasonable set of things which can be implemented behind the api as the first step, and it can continue to evolve.

 

On Fri, Jun 14, 2019 at 12:13 PM Ilan Filonenko <[hidden email]> wrote:

+1 (non-binding). This API is versatile and flexible enough to handle Bloomberg's internal use-cases. The ability for us to vary implementation strategies is quite appealing. It is also worth to note the minimal changes to Spark core in order to make it work. This is a very much needed addition within the Spark shuffle story. 

 

On Fri, Jun 14, 2019 at 9:59 AM bo yang <[hidden email]> wrote:

+1 This is great work, allowing plugin of different sort shuffle write/read implementation! Also great to see it retain the current Spark configuration (spark.shuffle.manager=org.apache.spark.shuffle.YourShuffleManagerImpl).

 

 

On Thu, Jun 13, 2019 at 2:58 PM Matt Cheah <[hidden email]> wrote:

Hi everyone,

 

I would like to call a vote for the SPIP for SPARK-25299 [issues.apache.org], which proposes to introduce a pluggable storage API for temporary shuffle data.

 

You may find the SPIP document here [docs.google.com].

 

The discussion thread for the SPIP was conducted here [lists.apache.org].

 

Please vote on whether or not this proposal is agreeable to you.

 

Thanks!

 

-Matt Cheah


 

--

Ryan Blue

Software Engineer

Netflix


 

--

John



--

-Dhruve Ashar

 


Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

Tom Graves-2
+1 (binding)

I haven't looked at the low-level API, but I like the idea and the approach to get it started.

Tom



Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

Vinoo Ganesh

Results of the voting:

Binding +1s: 5 (Tom Graves, Dongjoon Hyun, Felix Cheung, Saisai Shao, Imran Rashid)

Non-Binding +1s: 8

-1 from PMC members: 0

 

Per the PMC / SPIP voting rules (https://spark.apache.org/improvement-proposals.html), given that the vote has been open for more than 72 hours and the required 3 binding +1 votes have been received, this SPIP passes.

 

Thanks everyone.

 
