No LICENSE file in spark custom build distribution

Xiangyu Li
Hello,

I downloaded spark-2.4.5 source from https://mirrors.ocf.berkeley.edu/apache/spark/spark-2.4.5/spark-2.4.5.tgz
After extracting it and running:

./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes

It creates a Spark binary distribution named:
spark-2.4.5-bin-custom-spark.tgz

So this file is supposedly a ready-to-distribute Spark binary file like the one you can download from http://mirror.metrocast.net/apache/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz

However, one big difference between this custom build and the official build is that the custom build has no LICENSE file. I don't know much about the Apache license, but I would expect a custom build distribution to include one.
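One quick way to confirm the difference is to list the license-related entries in each tarball. This is only a sketch: the helper name `list_license_entries` is hypothetical, and the pattern assumes the files of interest are named LICENSE and NOTICE or live under a `licenses/` directory.

```shell
# Hypothetical helper: print the license-related entries of a .tgz archive.
# Returns non-zero when nothing matches (grep's exit status).
list_license_entries() {
  tar -tzf "$1" | grep -E '(^|/)(LICENSE|NOTICE)$|(^|/)licenses/'
}

# Example with the file name from this thread:
# list_license_entries spark-2.4.5-bin-custom-spark.tgz || echo "no license files found"
```

Running the same helper against the official binary tarball should make the difference between the two archives visible.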

The file is missing because of the following code in make-distribution.sh:
image.png

There is no LICENSE-binary file in the official spark-2.4.5.tgz source archive, so there will be no LICENSE file in your custom build.
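Since the screenshot is not available here, the behaviour described above can be approximated with the sketch below. It is not the actual text of make-distribution.sh. The function and parameter names are hypothetical, and the companion names `licenses-binary` and `NOTICE-binary` are assumptions about what sits next to LICENSE-binary in the repository.

```shell
# Sketch only: copy the binary license files if present, otherwise skip.
# src_dir and dist_dir are hypothetical parameter names.
copy_license_files() {
  src_dir="$1"
  dist_dir="$2"
  if [ -e "$src_dir/LICENSE-binary" ]; then
    # The binary license becomes the LICENSE of the distribution.
    cp "$src_dir/LICENSE-binary" "$dist_dir/LICENSE"
    cp -r "$src_dir/licenses-binary" "$dist_dir/licenses"
    cp "$src_dir/NOTICE-binary" "$dist_dir/NOTICE"
  else
    # Nothing to copy: the build continues without license files.
    echo "Skipping copying LICENSE files"
  fi
}
```

With a source tree that lacks LICENSE-binary, the else branch runs, which matches the missing LICENSE file in the custom build.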

I am aware of two pull requests related to this:

started to use LICENSE-binary instead of just the LICENSE.

And
To avoid failure when there is no LICENSE-binary in spark-2.4.5 source directory.

I think we need to change make-distribution.sh to make sure that the LICENSE file is copied over to its corresponding custom build distribution. However, I am not ready to do a pull request, so hopefully we can discuss it here first.
--
Sincerely
Xiangyu Li

Re: No LICENSE file in spark custom build distribution

Sean Owen-2
The source distribution has the source LICENSE file. The binary distribution has the LICENSE-binary license file. The source release isn't supposed to have LICENSE-binary as it would not be accurate for that release; LICENSE is. If you're redistributing a build, you'll have your own process for modifying and building it, including modifying the LICENSE file as appropriate; these LICENSE files represent what the project delivers to you rather than what you deliver to others. You could get the LICENSE-binary file from the right hash commit from git, if desired, as part of your build. 
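As a rough illustration of the last suggestion, the files could be exported from a local clone at the commit or tag matching the release. This is a sketch under assumptions: the helper name is hypothetical, a local clone of the Spark repository is assumed to exist, the tag name `v2.4.5` in the example is assumed, and so are the `NOTICE-binary` and `licenses-binary` names.

```shell
# Hypothetical helper: export the binary license files from a git ref
# of a local clone into dest_dir.
export_binary_license_files() {
  repo_dir="$1"   # local clone of the repository (assumed to exist)
  ref="$2"        # tag or commit hash matching the release
  dest_dir="$3"   # where the files should be written
  git -C "$repo_dir" archive "$ref" LICENSE-binary NOTICE-binary licenses-binary \
    | tar -xf - -C "$dest_dir"
}

# Example (paths and tag name are assumptions):
# export_binary_license_files ~/src/spark v2.4.5 ./spark-2.4.5
```

Exporting into the extracted source directory before the build runs is one way to make this "part of your build"; whether that is appropriate for a given redistribution is a separate question.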

(In reply to Xiangyu Li's message of Fri, May 1, 2020 at 10:19 AM.)

Re: No LICENSE file in spark custom build distribution

Xiangyu Li
Hi Sean,

Thanks for the quick response! Yes, what you described about how the LICENSE file should be distributed makes sense.

The reason I learned about this is that I was trying to build spark-2.4.5-bin-custom.tgz and then distribute this build to multiple machines, so that:

1. These machines can run Spark with the build.
2. On each machine, I can install pyspark by running `python setup.py install` inside the python directory.

Step 2 would fail because of missing the licenses directory.

Building pyspark out of a binary distribution is a bit unconventional, but I did this after failing to do what the official doc recommended (https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable). Taking a step back to describe what I did originally:

In the spark-2.4.5 source directory, I just ran:

`./build/mvn -DskipTests clean package`

Then I went to the python directory and ran:

`python setup.py sdist` followed by `pip install dist/pyspark-2.4.5.tar.gz` (as mentioned in make-distribution.sh).

This ran into "error: package directory `deps/jars` does not exist".

However, directly running

`sudo python setup.py install`

worked.

(In reply to Sean Owen's message of Fri, May 1, 2020 at 11:30 AM.)

--
Sincerely
Xiangyu Li

Re: No LICENSE file in spark custom build distribution

Sean Owen-2
Hm, the build fails? You can see that this step is just skipped if the file is not present, for this reason.
I'm not clear why you need the file for its own sake, for your own internal modification that you don't redistribute.

(In reply to Xiangyu Li's message of Fri, May 1, 2020 at 11:43 AM.)

Re: No LICENSE file in spark custom build distribution

Xiangyu Li
Hmm, sorry, I don't follow which part of my email you were referring to when you said "the build fails?".

So I am trying to build a custom Spark binary distribution with, say, different Hadoop versions and R support.

I then store this custom build on S3, so that as I build more machines I can download it directly from S3. But besides spark-submit and so on, I also want to install the pyspark Python package on each machine I am building.

The lack of the LICENSE file in the custom build prevents pyspark from being built successfully.

Hopefully this answers your question.

The second part of my last email was about building pyspark inside the Spark source directory. I will raise an issue on Jira for that, as it is more of a clear-cut problem with the documentation on the website and the comments in make-distribution.sh.

(In reply to Sean Owen's message of Fri, May 1, 2020 at 1:31 PM.)

--
Sincerely
Xiangyu Li

Re: No LICENSE file in spark custom build distribution

Sean Owen-2
You wrote:

"
2. On each machine, I can install pyspark by running `python setup.py install` inside the python directory.

Step 2 would fail because of missing the licenses directory. 
"

That shouldn't depend on the license file, and the script you showed does not fail when the file is not present, so I am wondering what this means.
I'm not sure there's a JIRA here yet.

(In reply to Xiangyu Li's message of Fri, May 1, 2020 at 1:46 PM.)

Re: No LICENSE file in spark custom build distribution

Xiangyu Li
To reproduce this, I just did

tar xzf spark-2.4.5.tgz
cd spark-2.4.5
./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
mv spark-2.4.5-bin-custom-spark.tgz ../
cd ..
tar xzf spark-2.4.5-bin-custom-spark.tgz
cd spark-2.4.5-bin-custom-spark/python/
sudo python setup.py install

And here is the output:
image.png
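Before running `setup.py` on each machine, a small preflight check on the extracted distribution can show which license-related paths are absent. This is a sketch, not part of Spark: the helper name is hypothetical, and the list of paths (LICENSE, NOTICE, licenses) is an assumption based on the files discussed in this thread.

```shell
# Hypothetical preflight check: report license-related paths missing from
# an extracted distribution directory. Returns non-zero if any is missing.
check_license_paths() {
  dist_dir="$1"
  missing=0
  for path in LICENSE NOTICE licenses; do
    if [ ! -e "$dist_dir/$path" ]; then
      echo "missing: $dist_dir/$path"
      missing=1
    fi
  done
  return "$missing"
}

# Example with the directory name from the steps above:
# check_license_paths spark-2.4.5-bin-custom-spark
```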


(In reply to Sean Owen's message of Fri, May 1, 2020 at 2:48 PM.)

--
Sincerely
Xiangyu Li

Re: No LICENSE file in spark custom build distribution

Sean Owen-2
Hm, others may have to chime in here. Either that's not how you create the pyspark binary from the source release (make-distribution.sh doesn't do that?), or there is a small but important issue here: the source release doesn't contain one thing that the binary release script expects, which is LICENSE-binary et al. If it's the latter, we could move the LICENSE bits around in the source tree so that both are "source" files included in the source release, and you could make the binary release with it. But I'd probably say it's easier and better to simply skip adding the license in this path (if it's supposed to work this way at all), as the use case, a custom derived work, doesn't need the *ASF's* license statement.

(In reply to Xiangyu Li's message of Fri, May 1, 2020 at 3:13 PM.)

Re: No LICENSE file in spark custom build distribution

Xiangyu Li
Running make-distribution.sh with --pip runs `python setup.py sdist` inside the make-distribution.sh script itself.
I also tested `make-distribution.sh` without --pip, and the same error happens.

Correct me if I'm wrong, but the pyspark binary has always built successfully; it is the pyspark pip package that is failing.
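
Before running the pip install step, it may help to confirm what the unpacked distribution actually contains. The following is only a sketch: `list_license_files` is a hypothetical helper, not part of Spark, and the entry names other than LICENSE, LICENSE-binary and the licenses directory are assumptions about what a build may carry.

```shell
# Hypothetical helper, not part of Spark: report which license-related
# entries exist at the top level of an unpacked distribution directory.
# Entry names other than LICENSE, LICENSE-binary and licenses are assumptions.
list_license_files() {
  for _name in LICENSE LICENSE-binary licenses licenses-binary NOTICE; do
    if [ -e "$1/$_name" ]; then
      echo "present: $_name"
    else
      echo "missing: $_name"
    fi
  done
}
```

For example, `list_license_files spark-2.4.5-bin-custom-spark` would print one line per entry, which makes it easy to compare a custom build against an official binary download.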

On Fri, May 1, 2020 at 4:23 PM Sean Owen <[hidden email]> wrote:
Hm, others may have to chime in here. Either that's not how you create the pyspark binary from the source release (make-distribution.sh doesn't do that?), or there is a small but important issue here: the source release doesn't contain one thing that the binary release script expects, which is LICENSE-binary et al. If it's the latter, we could move the LICENSE bits around in the source tree so that both are "source" files included in the source release, and you can make the binary release with it. But I'd probably say it's easier and better to simply skip adding the license in this path (if it's supposed to work this way at all), since the use case, a custom derived work, doesn't need the *ASF's* license statement.


On Fri, May 1, 2020 at 3:13 PM Xiangyu Li <[hidden email]> wrote:
To reproduce this, I just did

tar xzf spark-2.4.5.tgz
cd spark-2.4.5
./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
mv spark-2.4.5-bin-custom-spark.tgz ../
cd ..
tar xzf spark-2.4.5-bin-custom-spark.tgz
cd spark-2.4.5-bin-custom-spark/python/
sudo python setup.py install

And here is the output:
image.png


On Fri, May 1, 2020 at 2:48 PM Sean Owen <[hidden email]> wrote:
You wrote:

"
2. On each machine, I can install pyspark by running `python setup.py install` inside the python directory.

Step 2 would fail because of missing the licenses directory. 
"

That shouldn't depend on the license file, and the script you showed does not fail when not present, so I am wondering what this means.
I'm not sure there's a JIRA here yet.

On Fri, May 1, 2020 at 1:46 PM Xiangyu Li <[hidden email]> wrote:
Hmm, sorry, I don't get which part of my email you were referring to when you said "the build fails?". 

I am trying to build a custom Spark binary distribution with, say, different Hadoop versions and R support.

I then store this custom build on S3, so that as I build more machines I can download it directly from S3. But besides spark-submit and whatnot, I also want to install the pyspark Python package on each machine I am building. 

The lack of the LICENSE file in the custom build prevents pyspark from being built successfully.

Hopefully this answers your question.

The second part of my last email was about building pyspark inside the Spark source directory. I will raise an issue on Jira for that, as it is more of a clear-cut problem with the documentation on the website and the comments in make-distribution.sh. 



On Fri, May 1, 2020 at 1:31 PM Sean Owen <[hidden email]> wrote:
Hm, the build fails? You can see this is just skipped if not present, for this reason.
I'm not clear why you need the file for its own sake, for your own internal modification that you don't redistribute.




Re: No LICENSE file in spark custom build distribution

Sean Owen-2
I see, that makes more sense, though I have limited knowledge of how the pip packaging works. You don't need pip packaging, do you? Just pyspark itself, right? Omit --pip? 


Re: No LICENSE file in spark custom build distribution

Holden Karau
In reply to this post by Xiangyu Li
Your problem isn't the missing license per se (that just happens to be the first error).

I don't believe that is the way we expect users to pip install the Python library. pip will only install directories/targets underneath the directory where setup.py lives, hence the deps directory, which setup.py constructs with a bunch of symlinks. It assumes that you are building Spark from source, in which case you should follow its instructions:

    To build Spark with maven you can run:
      ./build/mvn -DskipTests clean package
    Building the source dist is done in the Python directory:
      cd python
      python setup.py sdist
      pip install dist/*.tar.gz
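
Since the packaging step relies on jars produced by the Maven build, a small pre-flight check can show whether they are there before `python setup.py sdist` is run. This is a sketch under an assumption that is not confirmed in this thread: that the build leaves its jars under `assembly/target/scala-*/jars` in the source tree. `have_built_jars` is a hypothetical name.

```shell
# Hypothetical pre-flight check, not part of Spark.
# Assumption: a Maven build leaves jars under assembly/target/scala-*/jars.
# Returns 0 if at least one jar file is found there, 1 otherwise.
have_built_jars() {
  for _dir in "$1"/assembly/target/scala-*/jars; do
    if [ -d "$_dir" ]; then
      for _jar in "$_dir"/*.jar; do
        if [ -f "$_jar" ]; then
          return 0
        fi
      done
    fi
  done
  return 1
}
```

It could be used from the top of the source tree as `have_built_jars . || echo "no built jars found; run the Maven build first"`.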




--
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

Re: No LICENSE file in spark custom build distribution

Xiangyu Li
In reply to this post by Sean Owen-2
I do need the pip packaging; all of this effort is actually to get a pyspark pip package.


Re: No LICENSE file in spark custom build distribution

Xiangyu Li
In reply to this post by Holden Karau
Hi Holden,

Please check my second email in this chain. I did that originally; to quote that email:

===========================================================================================
In the spark-2.4.5 src directory, I just did a simple: 

`./build/mvn -DskipTests clean package`


And then went to the python directory and did:


`python setup.py sdist` followed by `pip install dist/pyspark-2.4.5.tar.gz` (as mentioned in the make-distribution.sh.)


This ran into "error: package directory `deps/jars` does not exist".

===============================================================================================


So I did exactly what you said, which is also one of the messages printed by the make-distribution.sh script. 
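
To separate "the sdist was never produced" from "the sdist was produced but pip could not install it", one could locate the tarball explicitly instead of hardcoding its name. The sketch below assumes the sdist is written to `dist/` under the python directory with a `pyspark-*.tar.gz` name, as in the commands quoted above; `find_sdist` is a hypothetical helper.

```shell
# Hypothetical helper, not part of Spark: print the path of the single
# pyspark sdist under <python_dir>/dist, or fail if there is none or several.
find_sdist() {
  _count=0
  _found=""
  for _f in "$1"/dist/pyspark-*.tar.gz; do
    if [ -f "$_f" ]; then
      _count=$((_count + 1))
      _found="$_f"
    fi
  done
  if [ "$_count" -ne 1 ]; then
    echo "expected exactly one pyspark sdist under $1/dist, found $_count" >&2
    return 1
  fi
  echo "$_found"
}
```

From the source directory this would be used as `pip install "$(find_sdist python)"`, which stops early with a clear message when the tarball is missing.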


On Fri, May 1, 2020 at 4:39 PM Holden Karau <[hidden email]> wrote:
Your problem isn't the missing license per se (that just happens to be the first error).

I don't believe that is the way we expect users to pip install the Python library. pip will only install directories/targets underneath the directory where setup.py lives, hence the deps directory, which is constructed by setup.py with a bunch of symlinks. It assumes that you are either building Spark from source, in which case you should follow its instructions:

    To build Spark with maven you can run:
      ./build/mvn -DskipTests clean package
    Building the source dist is done in the Python directory:
      cd python
      python setup.py sdist
      pip install dist/*.tar.gz
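
When the pip install step complains about `deps/jars`, one quick check is whether the sdist tarball contains any files under a `deps/jars/` directory at all. The following is a minimal sketch, assuming the tarball is laid out as `pyspark-<version>/deps/jars/...`; the function name `sdist_has_jars` is hypothetical and not part of Spark.

```shell
# Hypothetical helper: succeed if the given sdist tarball lists at least
# one file under a deps/jars/ directory, fail otherwise.
# Usage: sdist_has_jars <tarball>
sdist_has_jars() {
  local tarball="$1"
  # The trailing '.' requires a name after the directory, so the bare
  # "deps/jars/" directory entry on its own does not count as a match.
  tar tzf "$tarball" | grep -q '/deps/jars/.'
}
```

For example, `sdist_has_jars dist/pyspark-2.4.5.tar.gz && echo "jars bundled" || echo "no jars in sdist"` would show whether the jars made it into the package before pip is involved.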


On Fri, May 1, 2020 at 1:32 PM Xiangyu Li <[hidden email]> wrote:
make-distribution.sh with --pip would run a `python setup.py sdist` within that make-distribution.sh script.
I also tested `make-distribution.sh` without --pip, and the same error happens.

Correct me if I'm wrong, but the pyspark binary has always been built successfully; it is the pyspark pip package that is failing.

On Fri, May 1, 2020 at 4:23 PM Sean Owen <[hidden email]> wrote:
Hm, others may have to chime in here. Either that's not how you create the pyspark binary from the source release (make-distribution.sh doesn't do that?) or there is a small but important issue here, that the source release doesn't contain one thing that the binary release script expects, which is LICENSE-binary et al. If it's the latter, we could move around the LICENSE bits in the source tree so that both are "source" files included in the source release, so you can make the binary release with it, but, I'd probably say it's easier/better to simply skip adding the license in this path (if it's supposed to work this way at all) as the use case, a custom derived work, doesn't need the *ASF's* license statement.


On Fri, May 1, 2020 at 3:13 PM Xiangyu Li <[hidden email]> wrote:
To reproduce this, I just did

tar xzf spark-2.4.5.tgz
cd spark-2.4.5
./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
mv spark-2.4.5-bin-custom-spark.tgz ../
cd ..
tar xzf spark-2.4.5-bin-custom-spark.tgz
cd spark-2.4.5-bin-custom-spark/python/
sudo python setup.py install

And here is the output:
image.png


On Fri, May 1, 2020 at 2:48 PM Sean Owen <[hidden email]> wrote:
You wrote:

"
2. On each machine, I can install pyspark by running `python setup.py install` inside the python directory.

Step 2 would fail because of missing the licenses directory. 
"

That shouldn't depend on the license file, and the script you showed does not fail when not present, so I am wondering what this means.
I'm not sure there's a JIRA here yet.

On Fri, May 1, 2020 at 1:46 PM Xiangyu Li <[hidden email]> wrote:
Hmm, sorry, I don't get which part of my email you were referring to when you said "the build fails?".

So I am trying to build a custom Spark binary distribution with, say, different Hadoop versions and R support.

Then I stored this custom build on S3, so that as I build more machines I can download it directly from S3. But besides spark-submit and whatnot, I also wanted to install the pyspark Python package on the machine I am building.

The lack of the LICENSE file in the custom build would prevent pyspark from being successfully built.

Hopefully this answers your question.

The second part of my last email was about building pyspark inside the Spark source directory. I will raise an issue on Jira for that, as it is more of a clear-cut problem with the documentation on the website and the comments in make-distribution.sh.



On Fri, May 1, 2020 at 1:31 PM Sean Owen <[hidden email]> wrote:
Hm, the build fails? You can see this is just skipped if not present, for this reason.
I'm not clear why you need the file for its own sake, for your own internal modification that you don't redistribute.



On Fri, May 1, 2020 at 11:43 AM Xiangyu Li <[hidden email]> wrote:
Hi Sean,

Thanks for the quick response! Yes, what you described about how LICENSE file should be distributed makes sense. 

The reason I learned about this is that I was trying to build spark-2.4.5-bin-custom.tgz and then distribute this build to multiple machines, so that:

1. These machines can run Spark with the build.
2. On each machine, I can install pyspark by running `python setup.py install` inside the python directory.

Step 2 would fail because of missing the licenses directory. 

Building pyspark out of a binary distribution is a bit unconventional, but I did this after failing to do what the official doc recommended (https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable), so taking a step back to describe what I did originally:

In the spark-2.4.5 src directory, I just did a simple: 

`./build/mvn -DskipTests clean package`


And then went to the python directory and did:


`python setup.py sdist` followed by `pip install dist/pyspark-2.4.5.tar.gz` (as mentioned in the make-distribution.sh.)


This ran into "error: package directory `deps/jars` does not exist".


However, directly running 


`sudo python setup.py install`


worked. 



On Fri, May 1, 2020 at 11:30 AM Sean Owen <[hidden email]> wrote:
The source distribution has the source LICENSE file. The binary distribution has the LICENSE-binary license file. The source release isn't supposed to have LICENSE-binary as it would not be accurate for that release; LICENSE is. If you're redistributing a build, you'll have your own process for modifying and building it, including modifying the LICENSE file as appropriate; these LICENSE files represent what the project delivers to you rather than what you deliver to others. You could get the LICENSE-binary file from the right hash commit from git, if desired, as part of your build. 
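
Pulling a single file such as LICENSE-binary out of a git checkout at a specific commit or tag can be done with `git show <ref>:<path>`. A small sketch follows; the function name `fetch_file_at_ref` and its arguments are hypothetical, and it assumes you have a local clone of the Spark repository.

```shell
# Hypothetical helper: write <file> as it exists at <ref> in the git
# repository <repo> to <dest>. Leaves <dest> untouched on failure.
# Usage: fetch_file_at_ref <repo> <ref> <file> <dest>
fetch_file_at_ref() {
  local repo="$1" ref="$2" file="$3" dest="$4"
  local tmp="$dest.tmp"
  if git -C "$repo" show "$ref:$file" > "$tmp"; then
    mv "$tmp" "$dest"
  else
    # The ref or path does not exist: clean up and report failure.
    rm -f "$tmp"
    return 1
  fi
}
```

A call might look like `fetch_file_at_ref /path/to/spark v2.4.5 LICENSE-binary ./LICENSE-binary`, where the clone path is a placeholder and the tag name `v2.4.5` is an assumption about how the release is tagged in that clone.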

--
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 



Re: No LICENSE file in spark custom build distribution

Holden Karau
Can you send me the output of those two commands?


Re: No LICENSE file in spark custom build distribution

Xiangyu Li
`python setup.py sdist` generates rather long output; there were several warnings like:
image.png
Then at the end it looks successful:
image.png


`pip install dist/pyspark-2.4.5.tar.gz` shows this:
image.png
