SPARK-25299: Updates As Of December 19, 2018

SPARK-25299: Updates As Of December 19, 2018

Matt Cheah

Hi everyone,

Earlier this year, we opened SPARK-25299, which proposes using other storage systems for persisting shuffle files. Since then, we have continued to work on prototypes for this project. To increase transparency into our work, we have created a progress report document where you can find a summary of the work we have done, as well as links to our prototypes on GitHub. We ask that anyone who is very familiar with the inner workings of Spark’s shuffle provide feedback and comments on our work thus far. We welcome any further discussion in this space. You may comment in this e-mail thread or on the progress report document.

Looking forward to hearing from you. Thanks,

-Matt Cheah


Re: SPARK-25299: Updates As Of December 19, 2018

John Zhuge-2
Matt, appreciate the update!

On Wed, Dec 19, 2018 at 10:51 AM Matt Cheah <[hidden email]> wrote:

--
John
Re: SPARK-25299: Updates As Of December 19, 2018

prudenko
Hi Matt, I'm a developer of the SparkRDMA shuffle manager: https://github.com/Mellanox/SparkRDMA
Thanks for your effort on improving the Spark shuffle API. We are very interested in participating in this. For now I have several comments:
1. I went through these 4 documents:
As I understand it, there are 2 discussions: improving the shuffle manager API itself (Splash manager) and improving the external shuffle service.
2. We may consider revisiting "SPIP: RDMA Accelerated Shuffle Engine" and whether to support RDMA in the main codebase, or at least as a first-class shuffle plugin (not many other open source shuffle plugins exist). We actively develop it and keep adding new features. RDMA is now available on Azure (https://azure.microsoft.com/en-us/blog/introducing-the-new-hb-and-hc-azure-vm-sizes-for-hpc/), Alibaba, and other cloud providers. For now we support only memory <-> memory transfer, but RDMA is extensible to NVM and GPU data transfer.
3. We have users who are interested in having this feature (https://issues.apache.org/jira/browse/SPARK-12196) - we can consider adding it to this new API.

Let me know if you need help with review / testing / benchmarking.
I'll take a closer look at the documents and PR.

Thanks,
Peter Rudenko
Software engineer at Mellanox Technologies.


On Wed, Dec 19, 2018 at 20:54, John Zhuge <[hidden email]> wrote:
Re: SPARK-25299: Updates As Of December 19, 2018

Erik Erlandson-2

I'm curious how SPARK-25299 (where file tracking is pushed to Spark drivers, at least in option-5) interacts with Splash. The shuffle data location in SPARK-25299 would now have additional "fallback" logic for recovering from executor loss.

On Thu, Jan 3, 2019 at 6:24 AM Peter Rudenko <[hidden email]> wrote: