Kubernetes: why use init containers?


Marcelo Vanzin
Hello,

Me again. I was playing some more with the kubernetes backend and the
whole init container thing seemed unnecessary to me.

Currently it's used to download remote jars and files, mount the
volume into the driver / executor, and place those jars in the
classpath / move the files to the working directory. This is all stuff
that spark-submit already does without needing extra help.
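
For illustration, a submission along these lines lets spark-submit fetch the
remote jar and file itself (the image name, API server address, and URLs are
placeholders, and the exact Kubernetes configuration keys were still settling
at the time):

    # illustrative only: image, host, and URLs below are placeholders
    ./bin/spark-submit \
      --master k8s://https://<api-server-host>:<port> \
      --deploy-mode cluster \
      --class com.example.MyApp \
      --conf spark.kubernetes.container.image=<spark-image> \
      --jars https://repo.example.com/libs/extra-lib.jar \
      --files https://config.example.com/app.conf \
      https://repo.example.com/apps/my-app.jar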

So I spent some time hacking stuff and removing the init container
code, and launching the driver inside kubernetes using spark-submit
(similar to how standalone and mesos cluster mode works):

https://github.com/vanzin/spark/commit/k8s-no-init

I'd like to point out the output of "git show --stat" for that diff:
 29 files changed, 130 insertions(+), 1560 deletions(-)

You get massive code reuse by simply using spark-submit. The remote
dependencies are downloaded in the driver, and the driver does the job
of serving them to executors.

So I guess my question is: is there any advantage in using an init container?

The current init container code can download stuff in parallel, but
that's an easy improvement to make in spark-submit and that would
benefit everybody. You can argue that executors downloading from
external servers would be faster than downloading from the driver, but
I'm not sure I'd agree - it can go both ways.

Also the same idea could probably be applied to starting executors;
Mesos starts executors using "spark-class" already, so doing that
would both improve code sharing and potentially simplify some code in
the k8s backend.

--
Marcelo

Re: Kubernetes: why use init containers?

Anirudh Ramanathan-3
We were running a similar change in our fork at one point early on. My biggest concerns off the top of my head with this change would be localization performance with large numbers of executors, and what we lose in terms of separation of concerns. Init containers are a standard construct in k8s for resource localization. It would also be interesting to see how this approach affects the HDFS work.

+matt +kimoon
Still thinking about the potential trade-offs here. Adding Matt and Kimoon, who would remember more about our reasoning at the time.



Re: Kubernetes: why use init containers?

Nicholas Chammas

> I'd like to point out the output of "git show --stat" for that diff:
>  29 files changed, 130 insertions(+), 1560 deletions(-)

+1 for that and generally for the idea of leveraging spark-submit.

> You can argue that executors downloading from
> external servers would be faster than downloading from the driver, but
> I'm not sure I'd agree - it can go both ways.

On a tangentially related note, one of the main reasons spark-ec2 is so slow to launch clusters is that it distributes files like the Spark binaries to all the workers via the master. Because of that, the launch time scaled with the number of workers requested.

When I wrote Flintrock, I got a large improvement in launch time over spark-ec2 simply by having all the workers download the installation files in parallel from an external host (typically S3 or an Apache mirror). And launch time became largely independent of the cluster size.

That may or may not say anything about the driver distributing application files vs. having init containers do it in parallel, but I’d be curious to hear more.

Nick



Re: Kubernetes: why use init containers?

Marcelo Vanzin
On Tue, Jan 9, 2018 at 6:25 PM, Nicholas Chammas
<[hidden email]> wrote:
> You can argue that executors downloading from
> external servers would be faster than downloading from the driver, but
> I’m not sure I’d agree - it can go both ways.
>
> On a tangentially related note, one of the main reasons spark-ec2 is so slow
> to launch clusters is that it distributes files like the Spark binaries to
> all the workers via the master. Because of that, the launch time scaled with
> the number of workers requested.

It's true that there are side effects. But there are two things that
can be used to mitigate this:

- k8s uses docker images. Users can create docker images with all the
dependencies their app needs, and submit the app using that image
(sketched below). Spark doesn't yet have documentation on how to create
these customized images, but I'd rather invest time in that instead of
supporting this init container approach.

- The original spark-on-k8s spec mentioned a "dependency server"
approach which sounded like a more generic version of the YARN
distributed cache, which I hope can be a different way of mitigating
that issue. With that work, we could build this functionality into
spark-submit itself and have other backends also benefit.
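
As a rough sketch of the first option (registry, image names, and paths are
placeholders; the tooling for building on top of a Spark base image was still
being documented at the time), a production app could bake its dependencies
into the image once and then point at them with the local:// scheme:

    # illustrative workflow; registry, image names, and paths are placeholders
    docker build -t registry.example.com/my-spark-app:1.0 .   # Dockerfile COPYs the app jar and deps into the image
    docker push registry.example.com/my-spark-app:1.0

    ./bin/spark-submit \
      --master k8s://https://<api-server-host>:<port> \
      --deploy-mode cluster \
      --class com.example.MyApp \
      --conf spark.kubernetes.container.image=registry.example.com/my-spark-app:1.0 \
      local:///opt/spark/app/my-app.jar   # local:// = already present inside the image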

In general, forcing the download of dependencies on every invocation
of an app should be avoided.


Anirudh:
> what we lose in terms of separation of concerns

1500 fewer lines of code lower my level of concern a lot more.

--
Marcelo


Re: Kubernetes: why use init containers?

liyinan926
In reply to this post by Nicholas Chammas
The init-container is required for use with the resource staging server (https://github.com/apache-spark-on-k8s/userdocs/blob/master/src/jekyll/running-on-kubernetes.md#resource-staging-server). The resource staging server (RSS) is a spark-on-k8s component running in a Kubernetes cluster for staging submission-client-local dependencies to Spark pods. The init-container is responsible for downloading the dependencies from the RSS. We haven't upstreamed the RSS code yet, but it is a value-add component for Spark on K8s: it lets users use submission-local dependencies without resorting to mechanisms that are not immediately available on most Kubernetes clusters, e.g., HDFS. We do plan to upstream it in the 2.4 timeframe.

Additionally, the init-container is a Kubernetes-native way of making sure that the dependencies are localized before the main driver/executor containers are started. IMO, this guarantee is positive to have and it helps achieve separation of concerns. So I think the init-container is a valuable component and should be kept.


Re: Kubernetes: why use init containers?

Matt Cheah

A few reasons to prefer init-containers come to mind:

 

Firstly, if we used spark-submit from within the driver container, the executors wouldn't receive the jars on their class loader until after they start, because each executor has to launch before it can localize resources. It is certainly possible to make the class loader work with the user's jars here, as is the case with all the client mode implementations, but it seems cleaner to have the classpath include the user's jars at executor launch time instead of needing to reason about classloading order.

 

We can also consider the idiomatic approach from the perspective of Kubernetes. Yinan touched on this already, but init-containers are traditionally meant to prepare the environment for the application that is to be run, which is exactly what we do here. It also means the localization process can be completely decoupled from the execution of the application itself. We can then, for example, detect errors at the resource localization layer, say when an HDFS cluster is down, before the application itself launches. A failure at the init-container stage is explicitly reported via the Kubernetes pod status API.
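
For instance (pod and container names below are placeholders), such a failure shows up directly in the pod status, and the init container's own logs can be pulled separately:

    # illustrative commands; pod and init container names are placeholders
    kubectl get pod spark-exec-1               # STATUS shows e.g. Init:0/1 or Init:Error
    kubectl describe pod spark-exec-1          # per-container state, including init container exit codes
    kubectl logs spark-exec-1 -c spark-init    # logs from just the init container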

 

Finally, running spark-submit from the container would make the SparkSubmit code inadvertently allow running client mode Kubernetes applications as well. We’re not quite ready to support that. Even if we were, it’s not entirely intuitive for the cluster mode code path to depend on the client mode code path. This isn’t entirely without precedent though, as Mesos has a similar dependency.

 

Essentially the semantics seem neater and the contract is very explicit when using an init-container, even though the code does end up being more complex.

 


Re: Kubernetes: why use init containers?

Anirudh Ramanathan-3
Marcelo, to address the points you raised:

> k8s uses docker images. Users can create docker images with all the
dependencies their app needs, and submit the app using that image.

The entire reason why we support methods of localizing dependencies other than baking everything into docker images is that
it's not a good workflow fit for all use-cases. There are definitely some users who will do that (and I've spoken to some),
and they build a versioned image in their registry every time they change their code with a CD pipeline,
but a lot of people are looking for something lighter - versioning application code, not entire images.
Telling users that they must rebuild images and pay the cost of localizing new images from the docker registry
(which is also not well understood/measured in terms of performance) every time seems less than convincing to me.

> - The original spark-on-k8s spec mentioned a "dependency server"
> approach which sounded like a more generic version of the YARN
> distributed cache, which I hope can be a different way of mitigating
> that issue. With that work, we could build this functionality into
> spark-submit itself and have other backends also benefit.

The resource staging server as written is a non-HA fileserver for staging dependencies within the cluster.
It's not distributed, has no notion of locality, etc. I don't think we had plans (yet) to invest in making it more
like the distributed cache you mentioned, at least not until we heard
back from the community - so that's unplanned work at this point. It's also hard to imagine how we could
extend it beyond just K8s, to be honest. We should definitely have a JIRA tracking this if that's a
direction we want to explore in the future.

I understand the change you're proposing would simplify the code but a decision here seems hard to make
until we get some real benchmarks/measurements, or user feedback.

--
Anirudh Ramanathan

Re: Kubernetes: why use init containers?

Marcelo Vanzin
In reply to this post by liyinan926
One thing I forgot in my previous e-mail is that if a resource is
remote I'm pretty sure (but haven't double checked the code) that
executors will download it directly from the remote server, and not
from the driver. So there, distributed download without an init
container.

On Tue, Jan 9, 2018 at 7:15 PM, Yinan Li <[hidden email]> wrote:
> The init-container is required for use with the resource staging server
> (https://github.com/apache-spark-on-k8s/userdocs/blob/master/src/jekyll/running-on-kubernetes.md#resource-staging-server).

If the staging server *requires* an init container, you already have a
design problem right there.

> Additionally, the init-container is a Kubernetes
> native way of making sure that the dependencies are localized

Sorry, but the init container does not do anything by itself. You had
to add a whole bunch of code to execute the existing Spark code in an
init container, when not doing it would have achieved the exact same
goal much more easily, in a way that is consistent with how Spark
already does things.

Matt:
> the executors wouldn’t receive the jars on their class loader until after the executor starts

I actually consider that a benefit. It means spark-on-k8s application
will behave more like all the other backends, where that is true also
(application jars live in a separate class loader).

> traditionally meant to prepare the environment for the application that is to be run

You guys are forcing this argument when it all depends on where you
draw the line. Spark can be launched without downloading any of those
dependencies, because Spark will download them for you. Forcing the
"kubernetes way" just means you're writing a lot more code, and
breaking the Spark app initialization into multiple container
invocations, to achieve the same thing.

> would make the SparkSubmit code inadvertently allow running client mode Kubernetes applications as well

Not necessarily. I have that in my patch; it doesn't allow client mode
unless a property that only the cluster mode submission code sets is
present. If some user wants to hack their way around that, more power
to them; users can also compile their own Spark without the checks if
they want to try out client mode in some way.

Anirudh:
> Telling users that they must rebuild images  ... every time seems less than convincing to me.

Sure, I'm not proposing people use the docker image approach all the
time. It would be a hassle while developing an app, as it is kind of a
hassle today where the code doesn't upload local files to the k8s
cluster.

But it's perfectly reasonable for people to optimize a production app
by bundling the app into a pre-built docker image to avoid
re-downloading resources every time. Like they'd probably place the
jar + dependencies on HDFS today with YARN, to get the benefits of the
YARN cache.
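
For comparison (paths below are placeholders), the YARN equivalent of that optimization is to park the dependencies on HDFS once and reference them from there:

    # illustrative paths; assumes the jars were uploaded to HDFS once
    hdfs dfs -put extra-lib.jar /apps/my-app/
    ./bin/spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.MyApp \
      --jars hdfs:///apps/my-app/extra-lib.jar \
      hdfs:///apps/my-app/my-app.jar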

--
Marcelo


Re: Kubernetes: why use init containers?

Anirudh Ramanathan-3
Marcelo, I can see that we might be misunderstanding what this change implies for performance and some of the deeper implementation details here.
We have a community meeting tomorrow (at 10am PT), and we'll be sure to explore this idea in detail, and understand the implications and then get back to you.

Thanks for the detailed responses here, and for spending time with the idea.
(Also, you're more than welcome to attend the meeting - there's a link here if you're around.)

Cheers, 
Anirudh


On Jan 9, 2018 8:05 PM, "Marcelo Vanzin" <[hidden email]> wrote:
One thing I forgot in my previous e-mail is that if a resource is
remote I'm pretty sure (but haven't double checked the code) that
executors will download it directly from the remote server, and not
from the driver. So there, distributed download without an init
container.

On Tue, Jan 9, 2018 at 7:15 PM, Yinan Li <[hidden email]> wrote:
> The init-container is required for use with the resource staging server
> (https://github.com/apache-spark-on-k8s/userdocs/blob/master/src/jekyll/running-on-kubernetes.md#resource-staging-server).

If the staging server *requires* an init container you have already a
design problem right there.

> Additionally, the init-container is a Kubernetes
> native way of making sure that the dependencies are localized

Sorry, but the init container does not do anything by itself. You had
to add a whole bunch of code to execute the existing Spark code in an
init container, when not doing it would have achieved the exact same
goal much more easily, in a way that is consistent with how Spark
already does things.

Matt:
> the executors wouldn’t receive the jars on their class loader until after the executor starts

I actually consider that a benefit. It means spark-on-k8s application
will behave more like all the other backends, where that is true also
(application jars live in a separate class loader).

> traditionally meant to prepare the environment for the application that is to be run

You guys are forcing this argument when it all depends on where you
draw the line. Spark can be launched without downloading any of those
dependencies, because Spark will download them for you. Forcing the
"kubernetes way" just means you're writing a lot more code, and
breaking the Spark app initialization into multiple container
invocations, to achieve the same thing.

> would make the SparkSubmit code inadvertently allow running client mode Kubernetes applications as well

Not necessarily. I have that in my patch; it doesn't allow client mode
unless a property that only the cluster mode submission code sets is
present. If some user wants to hack their way around that, more power
to them; users can also compile their own Spark without the checks if
they want to try out client mode in some way.

Anirudh:
> Telling users that they must rebuild images  ... every time seems less than convincing to me.

Sure, I'm not proposing people use the docker image approach all the
time. It would be a hassle while developing an app, as it is kind of a
hassle today where the code doesn't upload local files to the k8s
cluster.

But it's perfectly reasonable for people to optimize a production app
by bundling the app into a pre-built docker image to avoid
re-downloading resources every time. Like they'd probably place the
jar + dependencies on HDFS today with YARN, to get the benefits of the
YARN cache.

--
Marcelo

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: Kubernetes: why use init containers?

Marcelo Vanzin
On a side note, while it's great that you guys have meetings to
discuss things related to the project, it's general Apache practice to
discuss these things on the mailing list - or at the very least send
detailed info about what was discussed in these meetings to the mailing
list. Not everybody can attend these meetings, and I'm not just
talking about people being busy, but there are people who live in
different time zones.

Now that this code is moving into Spark I'd recommend getting people
more involved with the Spark project to move things forward.

--
Marcelo


Re: Kubernetes: why use init containers?

Matt Cheah
A crucial point here is considering whether we want to have a separate scheduler backend code path for client mode versus cluster mode. If we need such a separation in the code paths, it would be difficult to make it possible to run spark-submit in client mode from the driver container.

We discussed this already when we started to think about client mode. See https://github.com/apache-spark-on-k8s/spark/pull/456. In our initial designs for client mode, we considered that there are some concepts that would only apply to cluster mode and not to client mode – see https://github.com/apache-spark-on-k8s/spark/pull/456#issuecomment-343007093. But we haven't worked out all of the details yet. The situation may work out such that client mode is similar enough to cluster mode that we can treat cluster mode as simply running spark-submit in client mode from a container.

I'd imagine this is a reason why YARN hasn't gone with using spark-submit from the application master: there are separate code paths for YarnClientSchedulerBackend versus YarnClusterSchedulerBackend, and the deploy mode serves as the switch between the two implementations. Though I am curious as to why Spark standalone isn't using spark-submit – the DriverWrapper manually fetches the user's jars and puts them on a classloader before invoking the user's main class with that classloader. But there's only one scheduler backend for both client and cluster mode in standalone's case.

The main idea here is that we need to understand if we need different code paths for a client mode scheduler backend versus a cluster mode scheduler backend, before we can know if we can use spark-submit in client mode from the driver container. But using init-containers makes it such that we don’t need to use spark-submit at all, meaning that the differences can more or less be ignored at least in this particular context.

-Matt Cheah


Re: Kubernetes: why use init containers?

Marcelo Vanzin
On Wed, Jan 10, 2018 at 1:10 PM, Matt Cheah <[hidden email]> wrote:
> I'd imagine this is a reason why YARN hasn't gone with using spark-submit from the application master...

I wouldn't use YARN as a template to follow when writing a new
backend. A lot of the reason why the YARN backend works the way it
does is because of backwards compatibility. IMO it would be much
better to change the YARN backend to use spark-submit, because it
would immensely simplify the code there. It was a nightmare to get
YARN to reach feature parity with other backends because it has to
pretty much reimplement everything.

But doing that would break pretty much every Spark-on-YARN deployment,
so it's not something we can do right now.

For the other backends the situation is sort of similar; it probably
wouldn't be hard to change standalone's DriverWrapper to also use
spark-submit. But that brings potential side effects for existing
users that don't exist with spark-on-k8s, because spark-on-k8s is new
(the current fork aside).

>  But using init-containers makes it such that we don’t need to use spark-submit at all

Those are actually separate concerns. There are a whole bunch of
things that spark-submit provides you that you'd have to replicate in
the k8s backend if not using it. Things like properly handling special
characters in arguments, native library paths, "userClassPathFirst",
etc. You get them almost for free with spark-submit, and using an init
container does not solve any of those for you.
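
For example, options along these lines (all values are made up for illustration) are wired through by spark-submit with no extra code in the k8s backend:

    # illustrative values only
    ./bin/spark-submit \
      --master k8s://https://<api-server-host>:<port> \
      --deploy-mode cluster \
      --class com.example.MyApp \
      --conf spark.executor.userClassPathFirst=true \
      --conf spark.executor.extraLibraryPath=/opt/native/lib \
      --conf "spark.executor.extraJavaOptions=-Dapp.mode=prod" \
      local:///opt/spark/app/my-app.jar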

I'd say that using spark-submit is really not up for discussion here;
it saves you from re-implementing a whole bunch of code that you
shouldn't even be trying to re-implement.

Separately, if there is a legitimate need for an init container, then
it can be added. But I don't see that legitimate need right now, so I
don't see what it's bringing other than complexity.

(And no, "the k8s documentation mentions that init containers are
sometimes used to download dependencies" is not a legitimate need.)

--
Marcelo


Re: Kubernetes: why use init containers?

Matt Cheah
If we use spark-submit in client mode from the driver container, how do we handle needing to switch between a cluster-mode scheduler backend and a client-mode scheduler backend in the future?

Something else re: client mode accessibility – if we make client mode accessible to users even if it’s behind a flag, that’s a very different contract from needing to recompile spark-submit to support client mode. The amount of effort required from the user to get to client mode is very different between the two cases, and the contract is much clearer when client mode is forbidden in all circumstances, versus client mode being allowed with a specific flag. If we’re saying that we don’t support client mode, we should bias towards making client mode as difficult as possible to access, i.e. impossible with a standard Spark distribution.

-Matt Cheah


Re: Kubernetes: why use init containers?

Marcelo Vanzin
On Wed, Jan 10, 2018 at 1:33 PM, Matt Cheah <[hidden email]> wrote:
> If we use spark-submit in client mode from the driver container, how do we handle needing to switch between a cluster-mode scheduler backend and a client-mode scheduler backend in the future?

With a config value set by the submission code, like what I'm doing to
prevent client mode submission in my p.o.c.?

There are plenty of solutions to that problem if that's what's worrying you.

> Something else re: client mode accessibility – if we make client mode accessible to users even if it’s behind a flag, that’s a very different contract from needing to recompile spark-submit to support client mode. The amount of effort required from the user to get to client mode is very different between the two cases

Yes. But if we say we don't support client mode, we don't support
client mode regardless of how easy it is for the user to fool Spark
into trying to run in that mode.

--
Marcelo


Re: Kubernetes: why use init containers?

Matt Cheah
> With a config value set by the submission code, like what I'm doing to prevent client mode submission in my p.o.c.?

The contract for what determines the appropriate scheduler backend to instantiate is then going to be different in Kubernetes versus the other cluster managers. The cluster manager typically picks the scheduler backend implementation based only on the master URL format plus the deploy mode. Perhaps this is an acceptable tradeoff for being able to leverage spark-submit in the driver container deployed in cluster mode. Again though, any flag we expose in spark-submit is a user-facing option that can be set erroneously, which is a practice we shouldn't be encouraging.

Taking a step back though, I think we want to use spark-submit's internals without using spark-submit itself. Any flags we add to spark-submit are user-facing. Ideally we would be able to extract the dependency-download and run-user-main-class subroutines from spark-submit and invoke those in all of the cluster managers. Perhaps this calls for a refactor in spark-submit itself to make some parts reusable in other contexts. Just an idea.


Re: Kubernetes: why use init containers?

Marcelo Vanzin
On Wed, Jan 10, 2018 at 1:47 PM, Matt Cheah <[hidden email]> wrote:
>> With a config value set by the submission code, like what I'm doing to prevent client mode submission in my p.o.c.?
>
> The contract for what determines the appropriate scheduler backend to instantiate is then going to be different in Kubernetes versus the other cluster managers.

There is no contract for how to pick the appropriate scheduler. That's
a decision that is completely internal to the cluster manager code.

--
Marcelo



Re: Kubernetes: why use init containers?

liyinan926
I want to reiterate one point: the init-container achieves a clear separation between preparing an application and actually running the application. It's a guarantee provided by the K8s admission control and scheduling components that if the init-container fails, the main container won't be run. I think this is definitely positive to have. In the case of a Spark application, the application code and driver/executor code won't even be run if the init-container fails to localize any of the dependencies. The result is that it's much easier for users to figure out what's wrong if their applications fail to run: they can tell whether the pods were initialized, and if not, simply check the status/logs of the init-container.

Another argument I want to make is that we can easily have the init-container exclusively use certain credentials for downloading dependencies, credentials that are not appropriate to be visible in the main containers and therefore should not be shared. This is not achievable with the canonical spark-submit approach. K8s has built-in support for dynamically injecting containers into pods through the admission control process. One use case would be for cluster operators to inject an init-container (e.g., through an admission webhook) that downloads dependencies requiring access-restricted credentials.
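
To make the pattern concrete, here is a rough sketch of what the driver pod
looks like with an init-container localizing dependencies into a shared
volume, written against the fabric8 model classes the k8s backend already
uses (the image names, paths, and exact builder calls below are illustrative
only):

import io.fabric8.kubernetes.api.model.{Pod, PodBuilder}

object InitContainerPodSketch {

  // The init-container downloads remote dependencies into a shared emptyDir
  // volume; K8s only starts the main container if the init-container succeeds.
  def driverPod(): Pod = new PodBuilder()
    .withNewMetadata()
      .withName("spark-driver")
    .endMetadata()
    .withNewSpec()
      .addNewVolume()
        .withName("spark-local-deps")
        .withNewEmptyDir().endEmptyDir()
      .endVolume()
      .addNewInitContainer()
        .withName("spark-init")
        .withImage("example/spark-init:latest")  // illustrative image
        .addNewVolumeMount()
          .withName("spark-local-deps")
          .withMountPath("/var/spark-deps")
        .endVolumeMount()
        // A download-only credential secret could be mounted here and never
        // exposed to the main container.
      .endInitContainer()
      .addNewContainer()
        .withName("spark-driver")
        .withImage("example/spark:latest")       // illustrative image
        .addNewVolumeMount()
          .withName("spark-local-deps")
          .withMountPath("/var/spark-deps")
        .endVolumeMount()
      .endContainer()
    .endSpec()
    .build()
}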

Note that we are not blindly opposing getting rid of the init-container; it's just that there are still valid reasons to keep it for now, particularly given that we don't have a solid story around client mode yet. Also, given that we have been using it in our fork for over a year, we are definitely more confident in the current way of handling remote dependencies, as it's been tested more thoroughly. Since getting rid of the init-container is such a significant change, I would suggest that we defer the decision on whether to get rid of it until 2.4, so we have a more thorough understanding of the pros and cons.



Re: Kubernetes: why use init containers?

Marcelo Vanzin
On Wed, Jan 10, 2018 at 2:00 PM, Yinan Li <[hidden email]> wrote:
> I want to reiterate one point: the init-container achieves a clear
> separation between preparing an application and actually running the
> application. It's a guarantee provided by the K8s admission control and
> scheduling components that if the init-container fails, the main container
> won't be run. I think this is definitely positive to have. In the case of a
> Spark application, the application code and driver/executor code won't even
> be run if the init-container fails to localize any of the dependencies

That is also the case with spark-submit... (can't download
dependencies -> spark-submit fails before running user code).

> Note that we are not blindly opposing getting rid of the init-container;
> it's just that there are still valid reasons to keep it for now

I'll flip that around: I'm not against having an init container if
it's serving a needed purpose, it's just that nobody is able to tell
me what that needed purpose is.

1500 fewer lines of code trump all of the arguments given so far
for why the init container might be a good idea.

--
Marcelo



Re: Kubernetes: why use init containers?

liyinan926
> 1500 fewer lines of code trump all of the arguments given so far
> for why the init container might be a good idea.

We can also reduce the number of lines of code by simply refactoring it in such a way that a lot of code is shared between the configuration of the main container and that of the init-container. We have actually been discussing this as one of the things to do right after the 2.3 release, and we do have a Jira ticket to track it. It's probably true that none of the arguments we made are convincing enough, but we cannot rule out the benefits init-containers bring either.
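
For instance, a small shared helper, roughly along these lines (illustrative
only, not the actual code in our fork):

import io.fabric8.kubernetes.api.model.{Container, ContainerBuilder}

object SharedContainerConfig {

  // One code path that adds the dependency volume mount, reusable for both
  // the main container and the init-container.
  def withDepsMount(container: Container): Container =
    new ContainerBuilder(container)
      .addNewVolumeMount()
        .withName("spark-local-deps")
        .withMountPath("/var/spark-deps")
      .endVolumeMount()
      .build()
}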

Again, I would suggest we look at this more thoroughly post 2.3.



Re: Kubernetes: why use init containers?

Marcelo Vanzin
On Wed, Jan 10, 2018 at 2:16 PM, Yinan Li <[hidden email]> wrote:
> but we cannot rule out the benefits init-containers bring either.

Sorry, but what are those again? So far all the benefits are already
provided by spark-submit...

> Again, I would suggest we look at this more thoroughly post 2.3.

Actually, one of the reasons why I brought this up is that we should
remove init containers from 2.3 unless they're really required for
something.

Simplifying the code is not the only issue. The init container support
introduces a whole lot of user-visible behavior - like config options
and the execution of a completely separate container that the user can
customize. If removed later, that could be considered a breaking
change.

So if we ship 2.3 without init containers and add them later if
needed, it's a much better world than flipping that around.

--
Marcelo

