[VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling


[VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

Jiang Xingbo
Hi all,

I want to call for a vote on SPARK-24615. It improves Spark by making it aware of GPUs exposed by cluster managers, so that Spark can match GPU resources to user task requests properly. The proposal and companion scoping doc were made available on dev@ to collect input. You can also find a design sketch at SPARK-27005.
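
To make the idea concrete, here is a minimal sketch of the kind of user experience being proposed. Every property and method name below is illustrative only, since the SPIP deliberately leaves the exact API open:

from pyspark import SparkConf, SparkContext, TaskContext

# Hypothetical configs: the app declares GPU demand, the cluster manager
# hands Spark GPU-bearing executors, and each task can see its assignment.
conf = (SparkConf()
        .set("spark.executor.resource.gpu.amount", "4")   # GPUs per executor
        .set("spark.task.resource.gpu.amount", "1"))      # GPUs per task
sc = SparkContext(conf=conf)

def use_gpu(partition):
    # Hypothetical task-side API: look up which GPU(s) this task was given.
    gpus = TaskContext.get().resources()["gpu"].addresses  # e.g. ["0"]
    yield from run_on_device(partition, gpus[0])  # run_on_device: a stand-in

sc.parallelize(range(8), 8).mapPartitions(use_gpu).collect()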

The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical reasons.

Thank you!

Xingbo

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

Jiang Xingbo
Starting with a +1 from myself.


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

Mingjie Tang
+1 

mingjie


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

Xiangrui Meng
+1

Btw, as Ryan pointed out last time, +0 doesn't mean "Don't really care." The official definitions are:


  • +0: 'I don't feel strongly about it, but I'm okay with this.'
  • -0: 'I won't get in the way, but I'd rather we didn't do this.'



Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

Tom Graves-2
In reply to this post by Jiang Xingbo
+1 for the SPIP.

Tom


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

liyinan926
+1


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

cloud0fan
+1


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

Weichen Xu
In reply to this post by liyinan926
+1, nice feature!


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

Marco Gaido
+1, a critical feature for AI/DL!


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

Felix Cheung
I'm very hesitant about this.

I don't want to vote -1, because I personally think it's important to do, but I'd like to see more of the discussion points addressed rather than voting purely on the spirit of it.

First, the SPIP doesn't match the SPIP format that was proposed and agreed on. (Maybe this is a minor point, and perhaps we should also vote to update the SPIP format.)

Second, there are multiple PDFs/Google docs and JIRAs. And I think, for example, the design sketch does not cover the same points as the updated SPIP doc? It would help to align them before moving forward.

Third, the proposal touches on some fairly core and sensitive components, like the scheduler, and I think more discussion is necessary. We have a few comments there and in the JIRA.

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

Sean Owen-2
I'm for this in general, at least a +0. I do think this has to have a story for what to do with the existing Mesos GPU support, which sounds entirely like the spark.task.gpus config here. Maybe it's just a synonym? That kind of thing.
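
For reference, here is a sketch of the overlap in question; spark.mesos.gpus.max is the existing Mesos-side knob, while spark.task.gpus is the name floated for the new one (both lines are illustrative, not a settled design):

from pyspark import SparkConf

# Existing Mesos-only knob: caps the GPUs Spark acquires from Mesos.
conf = SparkConf().set("spark.mesos.gpus.max", "2")
# Proposed generalized knob: per-task GPU demand that any cluster
# manager's scheduler would have to honor. Whether it becomes a synonym
# for (or supersedes) the Mesos config is exactly the open question.
conf = conf.set("spark.task.gpus", "1")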

Requesting different types of GPUs might be a bridge too far, but that's a P2 detail that can be hashed out later. (For example, if a V100 is available and a K80 was requested, do you use it or fail? Is the right level of resource control GPU RAM and cores?)

The per-stage resource requirements sound like the biggest change; you can even change the CPU cores requested per pandas UDF? And what about memory then? We'll see how that shakes out. That's the only thing I'm kind of unsure about in this proposal.
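
A minimal sketch of what such a per-stage request could look like, assuming a hypothetical ResourceProfile-style builder API (none of these names are fixed by the SPIP; etl_rdd and train_fn are stand-ins):

from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests

# Hypothetical: only the training stage asks for a GPU and extra CPU
# cores; the surrounding ETL stages keep the CPU-only defaults.
reqs = TaskResourceRequests().cpus(2).resource("gpu", 1)
profile = ResourceProfileBuilder().require(reqs).build
gpu_stage = etl_rdd.withResources(profile).mapPartitions(train_fn)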


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

Felix Cheung
Great points, Sean.

Here's what I'd like to suggest to move forward: split the SPIP.

If we want to propose upfront homogeneous allocation (aka spark.task.gpus), this should be an SPIP on its own. For instance, I really agree with Sean (as I did in the discuss thread) that we can't simply non-goal Mesos. We have enough maintenance issues as it is. And IIRC there was a PR proposed for K8s; I'd like to see that discussion brought here as well.

IMO upfront allocation is less useful; specifically, it is too expensive for large jobs.

If we want per-stage resource requests, this should be a full SPIP with a lot more details to be hashed out. Our work with Horovod brings a few specific and critical requirements on how this should work with distributed DL, and I would like to see those addressed.

In any case, I'd like to see more consensus before moving forward; until then, I'm going to -1 this.

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

Xiangrui Meng
Hi Felix,

Just to clarify, we are voting on the SPIP, not the companion scoping doc. What is proposed, and what we are voting on, is to make Spark accelerator-aware. The companion scoping doc and the design sketch are to help demonstrate what features could be implemented, based on the use cases and dev resources the co-authors are aware of. The exact scoping and design would require more community involvement; by no means are we finalizing it in this vote thread.

I think copying the goals and non-goals from the companion scoping doc to the SPIP caused the confusion. As mentioned in the SPIP, we propose two major changes at a high level:
  • At the cluster manager level, we update or upgrade cluster managers to include GPU support, then expose user interfaces for Spark to request GPUs from them.
  • Within Spark, we update the scheduler to understand the GPUs allocated to executors and user task requests, and to assign GPUs to tasks properly.
We should keep our vote discussion at this level. It doesn't exclude Mesos/Windows/TPU/FPGA, nor does it commit to supporting YARN/K8s. Through the initial scoping work, we found that we certainly need domain experts to discuss the support of each cluster manager and each accelerator type. But adding more details on Mesos or FPGA doesn't change the SPIP at a high level. So we concluded the initial scoping, shared the docs, and started this vote.
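
As a purely illustrative example of the first bullet: a cluster manager such as K8s already models GPUs as a vendor-namespaced extended resource, so Spark mainly needs a way to pass the request through. The config names below are hypothetical:

from pyspark import SparkConf

# Hypothetical pass-through: ask the cluster manager for executors that
# each carry 2 GPUs, and name the vendor so K8s can map the request to
# its device-plugin resource (e.g. nvidia.com/gpu).
conf = (SparkConf()
        .set("spark.executor.resource.gpu.amount", "2")
        .set("spark.executor.resource.gpu.vendor", "nvidia.com"))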

I suggest updating the goals and non-goals in the SPIP so we don't turn the vote into a discussion of support or non-support for a specific cluster manager. After we reach a high-level agreement, the work can be fairly distributed. If there are both strong demand and dev resources from the community for a specific cluster manager or accelerator type, I don't see why we should block the work. If the work requires more discussion, we can start a new SPIP thread.

Also see my inline comments below.

On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung <[hidden email]> wrote:
Great points, Sean.

Here's what I'd like to suggest to move forward: split the SPIP.

If we want to propose upfront homogeneous allocation (aka spark.task.gpus), this should be an SPIP on its own.

This is more of an API/design discussion, which can be done after the vote. I don't think the feature alone needs a separate SPIP thread. At a high level, Spark users should be able to request and use GPUs properly. How to implement that is pending the design.
 
For instance, I really agree with Sean (as I did in the discuss thread) that we can't simply non-goal Mesos. We have enough maintenance issues as it is. And IIRC there was a PR proposed for K8s; I'd like to see that discussion brought here as well.

+1. As I mentioned above, discussing each cluster manager's support requires domain experts. The goals and non-goals in the SPIP caused this confusion. I suggest updating the goals and non-goals and then having a separate discussion for each, which doesn't block the main SPIP vote. It would be great if you or Sean could lead the discussion on Mesos support.
 

IMO upfront allocation is less useful; specifically, it is too expensive for large jobs.

This is also an API/design discussion.
 

If we want per-stage resource requests, this should be a full SPIP with a lot more details to be hashed out. Our work with Horovod brings a few specific and critical requirements on how this should work with distributed DL, and I would like to see those addressed.

An SPIP is designed not to have a lot of details. I agree with what Reynold said on the Table Metadata thread:

"""
In general it'd be better to have the SPIPs be higher level, and put the detailed APIs in a separate doc. Alternatively, put them in the SPIP but explicitly vote on the high level stuff and not the detailed APIs. 
"""

Could you create a JIRA and document the list of requirements from Horovod use cases?
 

In any case, I'd like to see more consensus before moving forward; until then, I'm going to -1 this.

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

Sean Owen-2
I think treating SPIPs at this high a level takes away much of the point of VOTEing on them. I'm not sure that's even what Reynold is suggesting elsewhere; we're nowhere near discussing APIs here, just what 'accelerator aware' even generally means. If the scope isn't specified, what are we trying to bind with a formal VOTE? The worst I can say is that this doesn't mean much, so the outcome of the vote doesn't matter. The general ideas seem fine to me and I support _something_ like this.

I think the subtext concern is that SPIPs become a way to request cover to make a bunch of decisions separately, later. This is, to some extent, how it has to work. A small number of interested parties need to decide the details coherently, not design the whole thing by committee with occasional check-ins for feedback. There's a balance between that and using the SPIP as a license to go finish a design and proclaim it later. That's not anyone's bad-faith intention, just the risk of deferring so much.

Mesos support is not a big deal by itself, but it is a fine illustration of the point. That seems like a fine question of scope now, even if the 'how' or some of the 'what' can be decided later. I raised an eyebrow here at the reply that this was already judged out of scope: how much are we on the same page about this being a point to consider feedback on?

If one wants to VOTE on more details, then this vote just doesn't matter much. Is a future step to VOTE on some more detailed design doc? Then that's what I'd call a "SPIP", and it's practically just semantics.




Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

Felix Cheung
Once again, I’d have to agree with Sean.

Let's table the meaning of SPIP for another time. I think a few of us are trying to understand what "accelerator resource aware" means. As far as I know, no one is discussing APIs here. But in the Google doc, the JIRA, email, and off-list, I have seen questions, some of them greatly concerning, like "the scheduler is allocating GPUs, but how does it affect memory?" and many more, so I think finer "high level" goals should be defined.

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

Xiangrui Meng
What finer "high level" goals do you recommend? To make progress on the vote, it would be great if you can articulate more. Current SPIP proposes two high-level changes to make Spark accelerator-aware:
  • At the cluster manager level, we update or upgrade cluster managers to include GPU support, then expose user interfaces for Spark to request GPUs from them.
  • Within Spark, we update the scheduler to understand the GPUs allocated to executors and user task requests, and to assign GPUs to tasks properly.
How do you want to change or refine them? I saw you raised questions around the Horovod requirements and GPU/memory allocation. But there are tens of questions at the same or even higher level. E.g., in preparing the companion scoping doc we saw the following questions:

* How to test GPU support on Jenkins?
* Does the solution proposed also work for FPGA? What are the diffs?
* How to make standalone workers auto-discover GPU resources?
* Do we want to allow users to request GPU resources in Pandas UDF?
* How does a user pass GPU requests to K8s, via the spark-submit command line or a pod template?
* Do we create a separate queue for GPU task scheduling so it doesn't cause regressions on normal jobs?
* How do we monitor GPU utilization? At what levels?
* Do we want to support GPU-backed physical operators?
* Do we allow users to request both non-default number of CPUs and GPUs?
* ...
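
To make one of these concrete, the Pandas UDF question might eventually surface as something like the following. This is purely illustrative; no such API is fixed by the SPIP, and load_model is a stand-in:

import pandas as pd
from pyspark import TaskContext
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def score(batch: pd.Series) -> pd.Series:
    # Hypothetical: the task reads its GPU assignment from TaskContext
    # and pins the DL framework to that device before scoring.
    gpus = TaskContext.get().resources()["gpu"].addresses
    model = load_model(device=int(gpus[0]))  # load_model: a stand-in
    return pd.Series(model.predict(batch.to_numpy()))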

IMHO, we cannot, nor should we, answer questions at this level in this vote. The vote is mainly about whether we should make Spark accelerator-aware to help unify big data and AI solutions, specifically whether Spark should provide proper support for deep learning model training and inference, where accelerators are essential. My +1 vote is based on the following logic:

* It is important for Spark to become the de facto solution in connecting big data and AI.
* The work is doable given the design sketch and the early investigation/scoping.

To me, "-1" means either it is not important for Spark to support such use cases or we certainly cannot afford to implement such support. This is my understanding of the SPIP and the vote. It would be great if you can elaborate what changes you want to make or what answers you want to see.


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

Sean Owen-2
To be clear, those goals sound fine to me. I don't think voting on those two broad points is meaningful, but it does no harm per se. If you mean this is just a check to see whether people believe this is broadly worthwhile, then +1 from me. Yes, it is.

That means we'd want to review something more detailed later, whether it's a) a design doc we vote on or b) a series of pull requests. Given the number of questions this leaves open, a) sounds better, and I think that's what you're suggesting. I'd call that the SPIP, but, so what, it's just a name. The thing is, a) seems already mostly done, in the second document that was attached. I'm hesitating because I'm not sure why it's important to not discuss that level of detail here, as it's already available. Just too much noise? But voting for this seems like endorsing those decisions, as I can only assume the proposer is going to continue the design with those decisions in mind.

What's the next step in your view, after this and before it's implemented? As long as there is one, sure, let's punt. It seems like we could begin that conversation nowish.

Many of those questions you list are _fine_ for a SPIP, in my opinion. (Of course, I'd add which cluster managers are in/out of scope.)




Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

Xiangrui Meng


On Mon, Mar 4, 2019 at 7:24 AM Sean Owen <[hidden email]> wrote:
To be clear, those goals sound fine to me. I don't think voting on those two broad points is meaningful, but it does no harm per se. If you mean this is just a check to see whether people believe this is broadly worthwhile, then +1 from me. Yes, it is.

That means we'd want to review something more detailed later, whether it's a) a design doc we vote on or b) a series of pull requests. Given the number of questions this leaves open, a) sounds better, and I think that's what you're suggesting. I'd call that the SPIP, but, so what, it's just a name. The thing is, a) seems already mostly done, in the second document that was attached.

It is far from done. We still need to review the APIs and the design for each major component:

* Internal changes to Spark job scheduler.
* Interfaces exposed to users.
* Interfaces exposed to cluster managers.
* Standalone / auto-discovery.
* YARN
* K8s
* Mesos
* Jenkins
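
To give a flavor of the standalone auto-discovery item, one plausible shape is an admin-supplied script whose output the worker advertises. Both the script idea and the JSON contract below are assumptions, not something the SPIP fixes:

#!/usr/bin/env python3
# Hypothetical GPU discovery script: print the resource name and the
# device addresses found on this host so the worker can advertise them.
import json
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
    capture_output=True, text=True, check=True).stdout
addresses = [line.strip() for line in out.splitlines() if line.strip()]
print(json.dumps({"name": "gpu", "addresses": addresses}))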

I try to avoid discussing each of them in this thread because they require different domain experts. After we have a high-level agreement on adding accelerator support to Spark, we can kick off the work in parallel. If any committer thinks a piece of follow-up work still needs an SPIP, we just follow the SPIP process to resolve it.
 
I'm hesitating because I'm not sure why it's important to not discuss that level of detail here, as it's already available. Just too much noise?

Yes. If we go down one or two levels, we might have to pull in different domain experts for different questions.
 
But voting for this seems like endorsing those decisions, as I can only assume the proposer is going to continue the design with those decisions in mind.

That is certainly not the purpose, which is why there were two docs, not just one SPIP. The purpose of the companion doc is just to give some concrete stories and estimate what could be done in Spark 3.0. Maybe we should update the SPIP doc and make it clear that certain features are pending follow-up discussions.
 

What's the next step in your view, after this and before it's implemented? As long as there is one, sure, let's punt. It seems like we could begin that conversation nowish.

We should assign each major component an "owner" who can lead the follow-up work, e.g.,

* Internal changes to Spark scheduler
* Interfaces to cluster managers and users
* Standalone support
* YARN support
* K8s support
* Mesos support
* Test infrastructure
* FPGA

Again, for each component the first question we should answer is "Is it important?" and then "How do we implement it?". Community members who are interested in each discussion should subscribe to the corresponding JIRA. If some committer thinks we need a follow-up SPIP, either to make more members aware of the changes or to reach agreement, feel free to call it out.
 

Many of those questions you list are _fine_ for a SPIP, in my opinion. (Of course, I'd add which cluster managers are in/out of scope.)

I think the two that require more discussion are Mesos and K8s. Let me follow what I suggested above and try to answer the two questions for each:

Mesos:
* Is it important? There are certainly Spark/Mesos users, but the overall usage is going downhill. See the attached Google Trends snapshot.
* How to implement it? I believe it is doable, similarly to other cluster managers. However, we need to find someone from our community to do the work. If we cannot find such a person, it would indicate that the feature is not that important.

K8s:
* Is it important? K8s is the fastest-growing cluster manager, but the current Spark support is experimental. Building features on top would add additional cost if we want to make changes.
* How to implement it? There is a sketch in the companion doc. Yinan mentioned three options for exposing the interfaces to users. We need to finalize the design and discuss which option is the best to go with.

You see that such discussions can be done in parallel. It is not efficient if we block the work on K8s because we cannot decide whether we should support Mesos.
 


>

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

Xiangrui Meng


Mesos:
* Is it important? There are certainly Spark/Mesos users, but the overall usage is going downhill. See the attached Google Trends snapshot.
Screen Shot 2019-03-04 at 8.10.50 AM.png
 

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

Sean Owen-2
In reply to this post by Xiangrui Meng
It sounds like there's a discussion about the details coming, which is fine and good. That should maybe also have a VOTE. The debate here is then merely about what and when to call things a SPIP, but that's not important.

On Mon, Mar 4, 2019 at 10:23 AM Xiangrui Meng <[hidden email]> wrote:
> I think the two that require more discussion are Mesos and K8s. Let me follow what I suggested above and try to answer the two questions for each:
>
> Mesos:
> * Is it important? There are certainly Spark/Mesos users, but the overall usage is going downhill. See the attached Google Trends snapshot.
> * How to implement it? I believe it is doable, similarly to other cluster managers. However, we need to find someone from our community to do the work. If we cannot find such a person, it would indicate that the feature is not that important.

I don't think that was the issue that was raised; I don't advocate for investing more in supporting this cluster manager myself. The issue was that we _already_ have support for allocating GPUs in Mesos. Whatever limited support is there presumably doesn't get removed; it merely needs to be attached to whatever new mechanisms are implemented. I only pushed back on the idea that it should be ignored and (presumably) left as a separate, unrelated implementation.

> You see that such discussions can be done in parallel. It is not efficient if we block the work on K8s because we cannot decide whether we should support Mesos.

Is the question blocking anything? An answer is: let's say we just make sure whatever Mesos support exists still works coherently with the new mechanism, whatever those details may be. Is there any disagreement on that out there? I agree with you in that I think it shouldn't have been ruled out at this stage, per earlier comments. This doesn't seem hard to answer as a question of scope, even now.
