Apache Spark Docker image repository

Dongjoon Hyun-2
Hi, All.

From 2020, shall we have an official Docker image repository as an additional distribution channel?

I'm considering the following images.

    - Public binary release (no snapshot image)
    - Public non-Spark base image (OS + R + Python)
      (This can be used in GitHub Action Jobs and Jenkins K8s Integration Tests to speed up jobs and to have more stable environments)

Bests,
Dongjoon.

Re: Apache Spark Docker image repository

Sean Owen-2
What would the images have - just the image for a worker? We wouldn't want to publish N permutations of Python, R, OS, Java, etc. But if we don't, then we make one or a few choices of that combination, and then I wonder how many people will find the image useful. If the goal is just to support Spark testing, that seems fine and tractable, but does it need to be 'public', as in advertised as a convenience binary, vs. just some image that's hosted somewhere for the benefit of project infra?


Re: Apache Spark Docker image repository

Jiaxin Shan
I will vote for this. It's pretty helpful to have managed Spark images. Currently, users have to download Spark binaries and build their own.
With this supported, the user journey is simplified: we only need to build an application image on top of a base image provided by the community, as in the sketch below.

Do we have different OS or architecture support? If not, there will be three container images (Java, R, Python) for every release.
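
For instance, the simplified journey could be a tiny application Dockerfile like this (the base image name and tag are hypothetical; no official image exists yet):

    # Hypothetical: an application image layered on a community-provided base.
    FROM apache/spark:2.4.5
    # Only the application's own code and dependencies go on top.
    COPY target/my-app.jar /opt/app/my-app.jar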


--
Best Regards!
Jiaxin Shan
Tel:  412-230-7670
Address: 470 2nd Ave S, Kirkland, WA


Re: Apache Spark Docker image repository

shane knapp ☠
In reply to this post by Dongjoon Hyun-2
> (This can be used in GitHub Action Jobs and Jenkins K8s Integration Tests to speed up jobs and to have more stable environments)

yep!

not only that, if we ever get around (hopefully this year) to containerizing (the majority of) the master and branch builds, i think it'd be nice to have those available there as well.

ah, an atomic build environment...  one can dream.  :)

shane
--
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

Re: Apache Spark Docker image repository

zero323
In reply to this post by Jiaxin Shan


On 2/6/20 2:53 AM, Jiaxin Shan wrote:
> Do we have different OS or architecture support? If not, there will be three container images (Java, R, Python) for every release.

Well, technically speaking there are 3 non-deprecated Python versions (4 if you count PyPy), 3 non-deprecated R versions, luckily only one non-deprecated Scala version, plus possible JDK variations. The latest and greatest are not necessarily the most popular or useful.

That's on top of native dependencies like BLAS (possibly in different flavors, and accounting for the break in netlib-java development), libparquet, and libarrow.

Not all of these must be generated, but complexity grows pretty fast, especially when native dependencies are involved. It gets worse if you actually want to support Spark builds and tests ‒ for example, to build and fully test SparkR you need half of the universe, including some awkward LaTeX style patches and such (https://github.com/zero323/sparkr-build-sandbox).

And even without that, images tend to grow pretty large.

A few years back Elias and I experimented with the idea of generating different sets of Dockerfiles ‒ https://github.com/spark-in-a-box/spark-in-a-box ‒ though the intended use cases were rather different (mostly quick setup of testbeds). The project has been inactive for a while, with some private patches to fit this or that use case.
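
To make the combinatorial point concrete, a generator like that typically parameterizes one Dockerfile template with build args; a minimal hypothetical sketch (every version, base image, and package below is illustrative, not taken from spark-in-a-box):

    # Hypothetical parameterized template; each ARG is one axis of the matrix.
    ARG JDK_VERSION=8
    FROM openjdk:${JDK_VERSION}-jre-slim

    ARG SPARK_VERSION=2.4.5
    ARG PYTHON_VERSION=3.7
    RUN apt-get update && \
        apt-get install -y --no-install-recommends "python${PYTHON_VERSION}" && \
        rm -rf /var/lib/apt/lists/*
    # ...fetch and unpack the Spark ${SPARK_VERSION} binary release here...

Every axis you expose this way (docker build --build-arg PYTHON_VERSION=3.8 ...) multiplies the number of images to build, test, and patch after every security fix, which is exactly where the complexity blows up.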



-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: C095AA7F33E6123A


Re: Apache Spark Docker image repository

Tom Graves-2
In reply to this post by Dongjoon Hyun-2
When discussions of Docker have come up in the past - mostly related to K8s - there has been a lot of discussion about what the right image to publish is, as well as making sure Apache is OK with it. The official Apache release is the source code, so we may need a disclaimer, and we need to make sure the image doesn't contain anything licensed in a way it shouldn't be. What happens when one of the Docker images we publish needs a security update? We would need to make sure all the legal bases are covered first.

Then the discussion turns to what is in the Docker images and how useful they are. People run different OSes, different Python versions, etc., and, as Sean mentioned, how useful is it really beyond a few examples? There was some discussion on https://issues.apache.org/jira/browse/SPARK-24655

Tom

Re: Apache Spark Docker image repository

Dongjoon Hyun-2
Thank you, Sean, Jiaxin, Shane, and Tom, for the feedback.

1. For legal questions, please see the following three Apache-approved approaches. We can follow one of them.

       1. https://hub.docker.com/u/apache (93 repositories, Airflow/NiFi/Beam/Druid/Zeppelin/Hadoop/...)
       2. https://hub.docker.com/_/solr (This is also official. There are more instances like this.)
       3. https://hub.docker.com/u/apachestreampipes (Some projects try this form.)

2. For non-Spark dev-environment images, this will definitely help both our Jenkins and GitHub Action jobs. The Apache Infra team also supports GitHub Action secrets, as in the following:

       https://issues.apache.org/jira/browse/INFRA-19565 Create a Docker Hub secret for Github Actions

3. For Spark image content questions, we should not do the following, not only because of legal issues, but also because we cannot include or maintain every popular library (like NVIDIA libraries or TensorFlow) in our image.

       https://issues.apache.org/jira/browse/SPARK-26398 Support building GPU docker images

4. The way I see this, it is a minimal legal image containing only our artifacts from the following. We can check the other Apache repos' best practices.

       https://www.apache.org/dist/spark/
 
5. For OS/Java/Python/R runtimes and libraries, those (except the OS) can generally be overlaid by users as additional layers. I don't think we need to provide every combination of (Debian/Ubuntu/CentOS/Alpine) x (JDK/JRE) x (Python2/Python3/PyPy) x (R 3.5/3.6) x (many libraries). Specifically, I don't think we need to install every library like `arrow`; see the overlay sketch below.

6. For the target users, this is a general Docker image. We don't need to assume a K8s-only environment; it can be used in any Docker environment.

7. For the number of images, as suggested in this thread, we may want to follow our existing K8s integration test suite approach by splitting the PySpark and R images from the Java one. But I don't have any hard requirement for this.

What I want to propose in this thread is that we start with a minimal viable product and evolve it (if needed) as an open source community.
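
As a concrete sketch of point 5 (the image name, tag, and packages are hypothetical, not a decided layout): a user who needs Python overlays it themselves, instead of the project publishing that combination.

    # Hypothetical user-side overlay on a minimal Spark image.
    FROM apache/spark:2.4.5
    USER root
    # The user chooses the Python runtime and libraries; the base ships none.
    RUN apt-get update && \
        apt-get install -y --no-install-recommends python3 python3-pip && \
        rm -rf /var/lib/apt/lists/* && \
        pip3 install pyarrow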

Bests,
Dongjoon.

PS. BTW, Apache Spark 2.4.5 artifacts are published to our doc website, our distribution repo, Maven Central, PyPI, CRAN, and Homebrew.
       I'm preparing the website news and download page update.



Re: Apache Spark Docker image repository

Hyukjin Kwon
Quick question: roughly how much overhead is required to maintain the minimal version?
If it doesn't look like too much, I think it's fine to give it a shot.



Re: Apache Spark Docker image repository

Dongjoon Hyun-2
Thank you, Hyukjin.

The maintenance overhead only occurs when we add a new release.

And we can prevent accidental upstream changes by avoiding 'latest' tags, as sketched below.

The overhead will be much smaller than our existing Dockerfile maintenance (e.g., 'spark-rm').

Also, if we have a Docker repository, we can publish the 'spark-rm' image together as a tool. This will save release managers a lot of time and effort.
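
For example (repository and tag names hypothetical, just to illustrate the pinning):

    # Publish only immutable, release-pinned tags; no moving 'latest'.
    docker pull apache/spark:2.4.5       # created once, at release time
    docker pull apache/spark-rm:2.4.5    # hypothetical release-manager tool image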

Bests,
Dongjoon


Re: Apache Spark Docker image repository

Erik Erlandson-2
In reply to this post by Dongjoon Hyun-2
My takeaway from the last time we discussed this was:
1) To be ASF compliant, we need to publish images only at official releases.
2) There was some ambiguity about whether a container image that includes GPL'ed packages (Spark images do) might trip over the GPL's "viral propagation" by integrating ASL and GPL software in a "binary release". The "air gap" GPL provision may apply, since the GPL software interacts only at command-line boundaries.


Re: Apache Spark Docker image repository

Sean Owen-2
To be clear, this is a convenience 'binary' for end users, not just an internal packaging to aid the testing framework?

There's nothing wrong with providing an additional official packaging if we vote on it and it follows all the rules. There is an open question about how much value it adds vs. that maintenance. I see we do already have some Dockerfiles, sure. Is it possible to reuse or repurpose those so that we don't have more to maintain? Or: what is different from the existing Dockerfiles here? (Dumb question; I've never paid much attention to them.)

We definitely can't release GPL bits or anything, yes. Just releasing a Dockerfile referring to GPL bits is a gray area - no bits are being redistributed, but does it constitute a derived work where the GPL stuff is a non-optional dependency? Would any publishing of these images cause us to put a copy of third-party GPL code anywhere?

At the least, we should keep this minimal: one image if possible, that you overlay on top of your preferred OS/Java/Python image. But how much value does that add? I have no info either way that people want or don't need such a thing.
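
(For context on the existing Dockerfiles: the Spark binary distribution already ships a K8s Dockerfile plus a helper script; per the Spark 2.4 docs it is invoked roughly like this, with 'myrepo' as a placeholder:)

    # Build and push Spark images using the tooling bundled with the distribution.
    ./bin/docker-image-tool.sh -r docker.io/myrepo -t v2.4.5 build
    ./bin/docker-image-tool.sh -r docker.io/myrepo -t v2.4.5 push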


Re: Apache Spark Docker image repository

Dongjoon Hyun-2
Hi, Sean.

Yes. We should keep this minimal.

BTW, regarding the following question:

    > But how much value does that add?

How much value do you think our binary distribution at the following link has?

    https://www.apache.org/dist/spark/

A Docker image can have similar value to the above for users who live in a Dockerized environment.

If you are assuming users who build from source or live on vendor distributions, then neither the above binary distribution link nor a Docker image has any value.

Bests,
Dongjoon.



Re: Apache Spark Docker image repository

Ismaël Mejía
+1 to having Spark Docker images, for Dongjoon's arguments: a container-based distribution is definitely something that benefits users and the project. Having this in the Apache Spark repo matters because multiple eyes will fix/improve the images for the benefit of everyone.

What still needs to be worked out is the best distribution approach. I have been involved in both Flink's and Beam's Docker image processes (and went through the whole 'docker official image' validation), and one of the lessons learned is that the less you put in an image, the better it is for everyone. So I wonder whether the include-everything-in-the-world approach (Python, R, etc.) would scale, or whether those should be overlays on top of a more minimal core image. But those are details to settle once consensus on this is reached.

On the Apache INFRA side there is some stuff to deal with at the beginning, but things become smoother once they are in place. In any case, it's a fantastic idea, and if I can help I would be glad to.

Regards,
Ismaël
