[VOTE] SPARK 2.4.0 (RC4)

Re: [VOTE] SPARK 2.4.0 (RC4)

ifilonenko
+1 (non-binding), based on all k8s integration tests passing for Scala 2.11 (including the SparkR tests, with R version 3.4.1):

[INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @ spark-kubernetes-integration-tests_2.11 ---
Discovery starting.
Discovery completed in 202 milliseconds.
Run starting. Expected test count is: 15
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run SparkR on simple dataframe.R example
- Run in client mode.
Run completed in 6 minutes, 47 seconds.
Total number of tests run: 15
Suites: completed 2, aborted 0
Tests: succeeded 15, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

Sean, regarding your issues: the comment you linked is correct in that you would need to build a Kubernetes-enabled distribution, e.g.:
dev/make-distribution.sh --pip --r --tgz -Psparkr -Phadoop-2.7 -Pkubernetes
then set up minikube, e.g.:
minikube start --insecure-registry=localhost:5000 --cpus 6 --memory 6000
and then run the appropriate tests, e.g.:
dev/dev-run-integration-tests.sh --spark-tgz .../spark-2.4.0-bin-2.7.3.tgz
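
(Before kicking off the tests it can also help to sanity-check that the cluster is up; a minimal sketch, not part of the workflow above:)
# Confirm minikube and the Kubernetes API server are reachable
minikube status
kubectl cluster-info
kubectl get nodes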

The newest PR you linked lets us point to a local Kubernetes cluster deployed via Docker for Mac instead of minikube, which gives us another way to test, but it does not change the testing workflow AFAICT.

On Tue, Oct 23, 2018 at 9:14 AM Sean Owen <[hidden email]> wrote:
(I should add, I only observed this with the Scala 2.12 build. It all
seemed to work with 2.11. Therefore I'm not too worried about it. I
don't think it's a Scala version issue, but perhaps something looking
for a Spark 2.11 tarball and not finding it. See
https://github.com/apache/spark/pull/22805#issuecomment-432304622 for
a change that might address this kind of thing.)

On Tue, Oct 23, 2018 at 11:05 AM Sean Owen <[hidden email]> wrote:
>
> Yeah, that's maybe the issue here. This is a source release, not a git checkout, and it still needs to work in this context.
>
> I just added -Pkubernetes to my build and didn't do anything else. I think the ideal is that a "mvn -P... -P... install" works from a source release; that's a good expectation and consistent with the docs.
>
> Maybe these tests simply don't need to run with the normal suite of tests, and can be considered tests run manually by developers running these scripts? Basically, KubernetesSuite shouldn't run in a normal mvn install?
>
> I don't think this has to block the release even if so, just trying to get to the bottom of it.


Re: [VOTE] SPARK 2.4.0 (RC4)

Dongjoon Hyun-2
In reply to this post by Stavros Kontopoulos-3
BTW, for that integration suite, I saw the related artifacts in the RC4 staging directory.

Does Spark 2.4.0 need to start releasing these `spark-kubernetes-integration-tests` artifacts?

On Tue, Oct 23, 2018 at 11:43 AM Stavros Kontopoulos <[hidden email]> wrote:
Sean,

OK, makes sense; I'm using a cloned repo. I built with the Scala 2.12 profile using the related tag v2.4.0-rc4:

./dev/change-scala-version.sh 2.12
./dev/make-distribution.sh  --name test --r --tgz -Pscala-2.12 -Psparkr -Phadoop-2.7 -Pkubernetes -Phive
I pushed the images to Docker Hub (see previous email) since I didn't use the minikube daemon (the default behavior).
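
(For reference, a minimal sketch of building and pushing the images from the extracted distribution with the bundled docker-image-tool.sh; the repo and tag are the ones used in the test command further down, and the exact invocation may have differed:)
# Build the Spark/PySpark/R images from the distribution and push them to Docker Hub
./bin/docker-image-tool.sh -r skonto -t k8s-scala-12 build
./bin/docker-image-tool.sh -r skonto -t k8s-scala-12 push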

Then I ran the tests successfully against minikube:

TGZ_PATH=$(pwd)/spark-2.4.0-bin-test.tgz
cd resource-managers/kubernetes/integration-tests

./dev/dev-run-integration-tests.sh --spark-tgz $TGZ_PATH --service-account default --namespace default --image-tag k8s-scala-12 --image-repo skonto


[INFO] 
[INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @ spark-kubernetes-integration-tests_2.12 ---
Discovery starting.
Discovery completed in 229 milliseconds.
Run starting. Expected test count is: 14
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
Run completed in 5 minutes, 24 seconds.
Total number of tests run: 14
Suites: completed 2, aborted 0
Tests: succeeded 14, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM 2.4.0 ..................... SUCCESS [  4.491 s]
[INFO] Spark Project Tags ................................. SUCCESS [  3.833 s]
[INFO] Spark Project Local DB ............................. SUCCESS [  2.680 s]
[INFO] Spark Project Networking ........................... SUCCESS [  4.817 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  2.541 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [  2.795 s]
[INFO] Spark Project Launcher ............................. SUCCESS [  5.593 s]
[INFO] Spark Project Core ................................. SUCCESS [ 25.160 s]
[INFO] Spark Project Kubernetes Integration Tests 2.4.0 ... SUCCESS [05:30 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 06:23 min
[INFO] Finished at: 2018-10-23T18:39:11Z
[INFO] ------------------------------------------------------------------------


but I had to modify this line and add -Pscala-2.12, otherwise it fails (these tests inherit from the parent POM, but the profile is not propagated to the mvn command that launches the tests; I can create a PR to fix that).


On Tue, Oct 23, 2018 at 7:44 PM, Hyukjin Kwon <[hidden email]> wrote:
https://github.com/apache/spark/pull/22514 sounds like a regression that affects Hive CTAS in the write path (Hive CTAS queries are no longer converted to Spark's internal data sources, hence a performance regression),
but I doubt we should block the release on this.

https://github.com/apache/spark/pull/22144 is just being discussed if I am not mistaken.

Thanks.

On Wed, Oct 24, 2018 at 12:27 AM, Xiao Li <[hidden email]> wrote:
https://github.com/apache/spark/pull/22144 is also not a blocker for the Spark 2.4 release, as discussed in the PR.

Thanks,

Xiao

On Tue, Oct 23, 2018 at 9:20 AM, Xiao Li <[hidden email]> wrote:
Thanks for reporting this. https://github.com/apache/spark/pull/22514 is not a blocker. We can fix it in the next minor release if we are unable to make it into this release.

Thanks, 

Xiao





--
Stavros Kontopoulos
Senior Software Engineer
Lightbend, Inc.
p: +30 6977967274

Re: [VOTE] SPARK 2.4.0 (RC4)

Stavros Kontopoulos-3
+1 (non-binding). I ran the k8s tests with Scala 2.12 and also included RTestsSuite (mentioned by Ilan), although it is not part of the 2.4 RC tag:

[INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @ spark-kubernetes-integration-tests_2.12 ---
Discovery starting.
Discovery completed in 239 milliseconds.
Run starting. Expected test count is: 15
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Run SparkR on simple dataframe.R example
Run completed in 6 minutes, 32 seconds.
Total number of tests run: 15
Suites: completed 2, aborted 0
Tests: succeeded 15, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM 2.4.0 ..................... SUCCESS [  4.480 s]
[INFO] Spark Project Tags ................................. SUCCESS [  3.898 s]
[INFO] Spark Project Local DB ............................. SUCCESS [  2.773 s]
[INFO] Spark Project Networking ........................... SUCCESS [  5.063 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  2.651 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [  2.662 s]
[INFO] Spark Project Launcher ............................. SUCCESS [  5.103 s]
[INFO] Spark Project Core ................................. SUCCESS [ 25.703 s]
[INFO] Spark Project Kubernetes Integration Tests 2.4.0 ... SUCCESS [06:51 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 07:44 min
[INFO] Finished at: 2018-10-23T19:09:41Z
[INFO] ------------------------------------------------------------------------

Stavros


Re: [VOTE] SPARK 2.4.0 (RC4)

Sean Owen-2
In reply to this post by Dongjoon Hyun-2
To be clear I'm currently +1 on this release, with much commentary.

OK, the explanation for kubernetes tests makes sense. Yes I think we need to propagate the scala-2.12 build profile to make it work. Go for it, if you have a lead on what the change is.
This doesn't block the release as it's an issue for tests, and only affects 2.12. However if we had a clean fix for this and there were another RC, I'd include it.

Dongjoon has a good point about the spark-kubernetes-integration-tests artifact. That doesn't sound like it should be published in this way, though, of course, we publish the test artifacts from every module already. This is only a bit odd in being a non-test artifact meant for testing. But it's special testing! So I also don't think that needs to block a release.

This happens because the integration tests module is enabled with the 'kubernetes' profile too, and also this output is copied into the release tarball at kubernetes/integration-tests/tests. Do we need that in a binary release?

If these integration tests are meant to be run ad hoc, manually, not part of a normal test cycle, then I think we can just not enable it with -Pkubernetes. If it is meant to run every time, then it sounds like we need a little extra work shown in recent PRs to make that easier, but then, this test code should just be the 'test' artifact parts of the kubernetes module, no?



Re: [VOTE] SPARK 2.4.0 (RC4)

Dongjoon Hyun-2
In reply to this post by cloud0fan
Ur, Wenchen.

Source distribution seems to fail by default.

https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz
$ dev/make-distribution.sh -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
...
+ cp /spark-2.4.0/LICENSE-binary /spark-2.4.0/dist/LICENSE
cp: /spark-2.4.0/LICENSE-binary: No such file or directory
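
(A quick sanity check, sketched here and consistent with the cp error above, that the source tarball ships LICENSE but not LICENSE-binary:)
# List the source tarball contents and look for the two license files
tar tzf spark-2.4.0.tgz | grep -c 'LICENSE-binary'   # 0 hits in the source release
tar tzf spark-2.4.0.tgz | grep -c 'LICENSE$'         # the plain LICENSE is present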

The root cause seems to be the following fix:

https://github.com/apache/spark/pull/22436/files#diff-01ca42240614718522afde4d4885b40dR175
Although Apache Spark provides the binary distributions, it would be great if this succeeds out of the box.

Bests,
Dongjoon.


On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan <[hidden email]> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.4.0.

The vote is open until October 26 PST and passes if a majority of +1 PMC votes are cast,
with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.4.0-rc4 (commit e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):

The release files, including signatures, digests, etc. can be found at:

Signatures used for Spark RCs can be found in this file:

The staging repository for this release can be found at:

The documentation corresponding to this release can be found at:

The list of bug fixes going into 2.4.0 can be found at the following URL:

FAQ

=========================
How can I help test this release?
=========================

If you are a Spark user, you can help us test this release by taking
an existing Spark workload, running it on this release candidate, and
reporting any regressions.

If you're working in PySpark, you can set up a virtual env, install
the current RC, and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
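
(For example, a minimal sketch; the tarball name below is just a placeholder for the staged PySpark artifact once downloaded:)
# Create a clean virtual env and install the RC's PySpark into it
python -m venv spark-2.4.0-rc4-test
source spark-2.4.0-rc4-test/bin/activate
pip install ./pyspark-2.4.0.tar.gz
python -c "import pyspark; print(pyspark.__version__)"   # expect 2.4.0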

===========================================
What should happen to JIRA tickets still targeting 2.4.0?
===========================================

The current list of open tickets targeted at 2.4.0 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" = 2.4.0

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==================
But my bug isn't fixed?
==================

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.
Reply | Threaded
Open this post in threaded view
|

Re: [VOTE] SPARK 2.4.0 (RC4)

Sean Owen-2
Hm, so you're trying to build a source release from a binary release?
I don't think that needs to work, nor do I expect it to, for reasons
like this. They just contain fairly different things.

On Tue, Oct 23, 2018 at 7:04 PM Dongjoon Hyun <[hidden email]> wrote:

>
> Ur, Wenchen.
>
> Source distribution seems to fail by default.
>
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz
>
> $ dev/make-distribution.sh -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
> ...
> + cp /spark-2.4.0/LICENSE-binary /spark-2.4.0/dist/LICENSE
> cp: /spark-2.4.0/LICENSE-binary: No such file or directory
>
>
> The root cause seems to be the following fix.
>
> https://github.com/apache/spark/pull/22436/files#diff-01ca42240614718522afde4d4885b40dR175
>
> Although Apache Spark provides the binary distributions, it would be great if this succeeds out of the box.
>
> Bests,
> Dongjoon.
>

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [VOTE] SPARK 2.4.0 (RC4)

Ryan Blue
+1 (non-binding)

The Iceberg implementation of DataSourceV2 is passing all tests after updating to the 2.4 API, although I've had to disable ORC support because BufferHolder is no longer public.

One oddity is that the DSv2 API for batch sources now includes an epoch ID, which I think will be removed in the refactor before 2.5 or 3.0 and wasn't part of the 2.3 release. That's strange, but it's minor.

rb




--
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] SPARK 2.4.0 (RC4)

Dongjoon Hyun-2
In reply to this post by cloud0fan
Hi, All.

-0 due to the following issue. From Spark 2.4.0, users may get an incorrect result when they use the new `map_filter` function together with `map_concat`.

https://issues.apache.org/jira/browse/SPARK-25823

SPARK-25823 aims only to fix the data correctness issue in `map_filter`.

PMC members may lower the priority; as always, I respect the PMC's decision.

I'm sending this email to draw more attention to this bug and to warn the community about the new feature's limitations.

Bests,
Dongjoon.



Re: [VOTE] SPARK 2.4.0 (RC4)

cloud0fan
Hi Dongjoon,

Thanks for reporting it! This is indeed a bug that needs to be fixed.

The problem is not with the function `map_filter` itself, but with how map values are created in Spark when there are duplicated keys.

In programming languages like Java/Scala, when creating a map, the later entry wins, e.g. in Scala:
scala> Map(1 -> 2, 1 -> 3)
res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)

scala> Map(1 -> 2, 1 -> 3).get(1)
res1: Option[Int] = Some(3)

However, in Spark, the earlier entry wins:
scala> sql("SELECT map(1,2,1,3)[1]").show
+------------------+
|map(1, 2, 1, 3)[1]|
+------------------+
|                 2|
+------------------+

So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).

But there are several bugs in Spark

scala> sql("SELECT map(1,2,1,3)").show
+----------------+
| map(1, 2, 1, 3)|
+----------------+
|[1 -> 2, 1 -> 3]|
+----------------+
The displayed string of map values has a bug: we should deduplicate the entries. This is tracked by SPARK-25824.


scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
res11: org.apache.spark.sql.DataFrame = []

scala> sql("select * from t").show
+--------+
|     map|
+--------+
|[1 -> 3]|
+--------+
The Hive map value converter has a bug; we should respect the "earlier entry wins" semantic. No ticket yet.


scala> sql("select map(1,2,1,3)").collect
res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
The same bug happens in `collect`. No ticket yet.

I'll create tickets and list all of them as known issues in 2.4.0.

It's arguable whether the "earlier entry wins" semantic is reasonable. Changing it is a behavior change, so we can only apply it to the master branch.

Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's just a symptom of the Hive map value converter bug. I think it's a non-blocker.

Thanks,
Wenchen


Re: [VOTE] SPARK 2.4.0 (RC4)

Dongjoon Hyun-2
Thank you for the follow-ups.

Then, in the end, will Spark 2.4.1 return `{1:2}`, differently from all of the following (including Spark/Scala)?

I hoped to fix `map_filter`, but now Spark looks inconsistent in many ways.

scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
+---------------+
|map(1, 2, 1, 3)|
+---------------+
|    Map(1 -> 3)|
+---------------+

spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
{1:3}

hive> select map(1,2,1,3);  // Hive 1.2.2
OK
{1:3}

presto> SELECT map_concat(map(array[1],array[2]), map(array[1],array[3])); // Presto 0.212
 _col0
-------
 {1=3}

Bests,
Dongjoon.



Re: [VOTE] SPARK 2.4.0 (RC4)

cloud0fan
> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
> {1:3}

Are you running in the Thrift server? Then maybe this is caused by the bug in `Dataset.collect` that I mentioned above.

I think `map_filter` is implemented correctly: map(1,2,1,3) is actually map(1,2) according to the "earlier entry wins" semantic. I don't think this will change in 2.4.1.
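
(For illustration, a minimal sketch of the same lookup from bin/spark-sql, mirroring the Scala example earlier in the thread:)
# Element lookup follows "earlier entry wins": this returns 2, not 3
./bin/spark-sql -e "SELECT map(1,2,1,3)[1]"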


Re: [VOTE] SPARK 2.4.0 (RC4)

Dongjoon Hyun-2
For the first question: that is the `bin/spark-sql` result. I didn't check STS, but it will return the same as `bin/spark-sql`.

> I think map_filter is implemented correctly. map(1,2,1,3) is actually map(1,2) according to the "earlier entry wins" semantic. I don't think this will change in 2.4.1.

For the second one, the `map_filter` issue is not about the "earlier entry wins" semantics. Please see the following examples.

spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
{1:3} {1:2}

spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
{1:3} {1:3}

spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
{1:3} {}

In other words, in terms of the output, `map_filter` behaves like a filter pushed down into the map, while users assume that `map_filter` works on top of the result of `m`.

This is a function semantic issue.



Re: [VOTE] SPARK 2.4.0 (RC4)

cloud0fan
Ah, now I see the problem. `map_filter` has a very weird semantic that is neither "earlier entry wins" nor "later entry wins".

I've opened https://github.com/apache/spark/pull/22821 to remove these newly added map-related functions from FunctionRegistry (for 2.4.0), so that they are invisible to end users and the weird behavior of the Spark map type with duplicated keys is not escalated. We should fix it ASAP in the master branch.

If others are OK with it, I'll start a new RC after that PR is merged.

Thanks,
Wenchen

On Thu, Oct 25, 2018 at 10:32 AM Dongjoon Hyun <[hidden email]> wrote:
For the first question, it's `bin/spark-sql` result. I didn't check STS, but it will return the same with `bin/spark-sql`.

> I think map_filter is implemented correctly. map(1,2,1,3) is actually map(1,2) according to the "earlier entry wins" semantic. I don't think this will change in 2.4.1.

For the second one, `map_filter` issue is not about `earlier entry wins` stuff. Please see the following example.

spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
{1:3} {1:2}

spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
{1:3} {1:3}

spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
{1:3} {}

In other words, `map_filter` works like `push-downed filter` to the map in terms of the output result
while users assumed that `map_filter` works on top of the result of `m`. 

This is a function semantic issue.


On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan <[hidden email]> wrote:
> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
> {1:3}

Are you running in the thrift-server? Then maybe this is caused by the bug in `Dateset.collect` as I mentioned above.

I think map_filter is implemented correctly. map(1,2,1,3) is actually map(1,2) according to the "earlier entry wins" semantic. I don't think this will change in 2.4.1.

On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun <[hidden email]> wrote:
Thank you for the follow-ups.

Then, Spark 2.4.1 will return `{1:2}` differently from the followings (including Spark/Scala) in the end?

I hoped to fix the `map_filter`, but now Spark looks inconsistent in many ways.

scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
+---------------+
|map(1, 2, 1, 3)|
+---------------+
|    Map(1 -> 3)|
+---------------+

spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
{1:3}

hive> select map(1,2,1,3);  // Hive 1.2.2
OK
{1:3}

presto> SELECT map_concat(map(array[1],array[2]), map(array[1],array[3])); // Presto 0.212
 _col0
-------
 {1=3}

Bests,
Dongjoon.


On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan <[hidden email]> wrote:
Hi Dongjoon,

Thanks for reporting it! This is indeed a bug that needs to be fixed.

The problem is not about the function `map_filter`, but about how the map type values are created in Spark, when there are duplicated keys.

In programming languages like Java/Scala, when creating map, the later entry wins. e.g. in scala
scala> Map(1 -> 2, 1 -> 3)
res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)

scala> Map(1 -> 2, 1 -> 3).get(1)
res1: Option[Int] = Some(3)

However, in Spark, the earlier entry wins
scala> sql("SELECT map(1,2,1,3)[1]").show
+------------------+
|map(1, 2, 1, 3)[1]|
+------------------+
|                 2|
+------------------+

So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).

But there are several bugs in Spark

scala> sql("SELECT map(1,2,1,3)").show
+----------------+
| map(1, 2, 1, 3)|
+----------------+
|[1 -> 2, 1 -> 3]|
+----------------+
The displayed string of map values has a bug and we should deduplicate the entries, This is tracked by SPARK-25824.


scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
res11: org.apache.spark.sql.DataFrame = []

scala> sql("select * from t").show
+--------+
|     map|
+--------+
|[1 -> 3]|
+--------+
The Hive map value convert has a bug, we should respect the "earlier entry wins" semantic. No ticket yet.


scala> sql("select map(1,2,1,3)").collect
res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
Same bug happens at `collect`. No ticket yet.

I'll create tickets and list all of them as known issues in 2.4.0.

It's arguable if the "earlier entry wins" semantic is reasonable. Fixing it is a behavior change and we can only apply it to master branch.

Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's just a symptom of the hive map value converter bug. I think it's a non-blocker.

Thanks,
Wenchen

On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun <[hidden email]> wrote:
Hi, All.

-0 due to the following issue. From Spark 2.4.0, users may get an incorrect result when they use new `map_fitler` with `map_concat` functions.

https://issues.apache.org/jira/browse/SPARK-25823

SPARK-25823 is only aiming to fix the data correctness issue from `map_filter`.

PMC members are able to lower the priority. Always, I respect PMC's decision.

I'm sending this email to draw more attention to this bug and to give some warning on the new feature's limitation to the community.

Bests,
Dongjoon.


On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan <[hidden email]> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.4.0.

The vote is open until October 26 PST and passes if a majority of +1 PMC votes are cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.4.0-rc4 (commit e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):

The release files, including signatures, digests, etc. can be found at:

Signatures used for Spark RCs can be found in this file:

The staging repository for this release can be found at:

The documentation corresponding to this release can be found at:

The list of bug fixes going into 2.4.0 can be found at the following URL:

FAQ

=========================
How can I help test this release?
=========================

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running it on this release candidate, then
reporting any regressions.

If you're working in PySpark, you can set up a virtual env, install
the current RC, and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
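For example, in sbt (a sketch only; the repository URL below is a placeholder,
so substitute the actual staging repository listed above in this email):

// build.sbt sketch; the URL is a placeholder, not the real staging repository.
resolvers += "Apache Spark 2.4.0 RC staging" at "https://repository.apache.org/content/repositories/orgapachespark-XXXX/"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided"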

===========================================
What should happen to JIRA tickets still targeting 2.4.0?
===========================================

The current list of open tickets targeted at 2.4.0 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" = 2.4.0

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==================
But my bug isn't fixed?
==================

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.

Re: [VOTE] SPARK 2.4.0 (RC4)

Xiao Li-2
[hidden email]  Thanks! This is a blocking ticket: it returns a wrong result due to our undefined behavior. I agree we should revert the newly added map-oriented functions. In the 3.0 release, we need to define the behavior of duplicate keys in the MAP data type and fix all the related issues that confuse our end users.

Thanks,

Xiao   

On Wed, Oct 24, 2018 at 9:54 PM Wenchen Fan <[hidden email]> wrote:
Ah, now I see the problem. `map_filter` has a very weird semantic that is neither "earlier entry wins" nor "later entry wins".

I've opened https://github.com/apache/spark/pull/22821 to remove these newly added map-related functions from FunctionRegistry (for 2.4.0), so that they are invisible to end users and the weird behavior of Spark's map type with duplicated keys is not escalated. We should fix it ASAP in the master branch.

If others are OK with it, I'll start a new RC after that PR is merged.

Thanks,
Wenchen

On Thu, Oct 25, 2018 at 10:32 AM Dongjoon Hyun <[hidden email]> wrote:
For the first question, that is the `bin/spark-sql` result. I didn't check the STS (Thrift server), but it will return the same result as `bin/spark-sql`.

> I think map_filter is implemented correctly. map(1,2,1,3) is actually map(1,2) according to the "earlier entry wins" semantic. I don't think this will change in 2.4.1.

For the second one, the `map_filter` issue is not about the "earlier entry wins" semantic. Please see the following examples.

spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
{1:3} {1:2}

spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
{1:3} {1:3}

spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
{1:3} {}

In other words, in terms of the output, `map_filter` works like a filter pushed down to the map's underlying entries,
while users assume that `map_filter` works on top of the result of `m`.

This is a function semantic issue.
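(A rough plain-Scala model of what the outputs above suggest is happening; illustration only, not Spark's internal code, and the helper name is made up.)

// Raw entries that map_concat(map(1,2), map(1,3)) appears to carry internally.
val entries = Seq(1 -> 2, 1 -> 3)

// What the RC4 outputs suggest map_filter does: filter the raw entries directly,
// so entries sharing a key pass or fail the predicate independently.
def rawFilter(p: ((Int, Int)) => Boolean): Map[Int, Int] = entries.filter(p).toMap

rawFilter { case (_, v) => v == 2 }  // Map(1 -> 2)  -- matches the {1:2} output above
rawFilter { case (_, v) => v == 3 }  // Map(1 -> 3)  -- matches the {1:3} output above
rawFilter { case (_, v) => v == 4 }  // Map()        -- matches the {} output above

// What users expect instead: `m` resolves to a single value for key 1, so at most
// one of the three filters above should be able to keep key 1.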




--
Spark+AI Summit North America 2019

Re: [VOTE] SPARK 2.4.0 (RC4)

Dongjoon Hyun-2
Thank you for the decision, All.

As of now, to unblock this, it seems that we are trying to remove these new functions from the function registry.


One problem here is that users can simply recover those functions like this:

scala> spark.sessionState.functionRegistry.createOrReplaceTempFunction("map_filter", x => org.apache.spark.sql.catalyst.expressions.MapFilter(x(0),x(1)))
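(Illustration only: assuming the re-registration above succeeds, a query along these lines should resolve again in the same session. Not verified against the RC.)

scala> sql("SELECT map_filter(map(1, 2, 1, 3), (k, v) -> v > 2)").show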

Technically, the PR looks like a compromise: it unblocks the release while still letting some users use the feature fully.

At first glance, I thought this was a workaround that ignored the context of the discussion. But it sounds like one of the practical options for Apache Spark.
(We had the Spark 2.0 Technical Preview before.)

I want to finalize the decision on the `map_filter` (and three related functions) issue. Are we good to go with https://github.com/apache/spark/pull/22821?

Bests,
Dongjoon.

PS. There is also a PR to remove them completely.


On Wed, Oct 24, 2018 at 10:14 PM Xiao Li <[hidden email]> wrote:
[hidden email]  Thanks! This is a blocking ticket. It returns a wrong result due to our undefined behavior. I agree we should revert the newly added map-oriented functions. In 3.0 release, we need to define the behavior of duplicate keys in the data type MAP and fix all the related issues that are confusing to our end users.

Thanks,

Xiao   



--
Spark+AI Summit North America 2019

What if anything to fix about k8s for the 2.4.0 RC5?

Sean Owen-2
In reply to this post by Sean Owen-2
Forking this thread.

Because we'll have another RC, we could possibly address these two
issues, but only if we have a reliable change, of course.

Is it easy enough to propagate the -Pscala-2.12 profile? It can't hurt.

And is it reasonable to essentially 'disable'
kubernetes/integration-tests by removing it from the kubernetes
profile? That doesn't mean it goes away; it just means it's run manually,
not automatically. Is that actually how it's meant to be used anyway,
at least in the short term, given the discussion around its requirements,
minikube, and all that?

(Actually, this would 'solve' the Scala 2.12 build problem too.)

On Tue, Oct 23, 2018 at 2:45 PM Sean Owen <[hidden email]> wrote:

>
> To be clear I'm currently +1 on this release, with much commentary.
>
> OK, the explanation for kubernetes tests makes sense. Yes I think we need to propagate the scala-2.12 build profile to make it work. Go for it, if you have a lead on what the change is.
> This doesn't block the release as it's an issue for tests, and only affects 2.12. However if we had a clean fix for this and there were another RC, I'd include it.
>
> Dongjoon has a good point about the spark-kubernetes-integration-tests artifact. That doesn't sound like it should be published in this way, though, of course, we publish the test artifacts from every module already. This is only a bit odd in being a non-test artifact meant for testing. But it's special testing! So I also don't think that needs to block a release.
>
> This happens because the integration tests module is enabled with the 'kubernetes' profile too, and also this output is copied into the release tarball at kubernetes/integration-tests/tests. Do we need that in a binary release?
>
> If these integration tests are meant to be run ad hoc, manually, not part of a normal test cycle, then I think we can just not enable it with -Pkubernetes. If it is meant to run every time, then it sounds like we need a little extra work shown in recent PRs to make that easier, but then, this test code should just be the 'test' artifact parts of the kubernetes module, no?

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: What if anything to fix about k8s for the 2.4.0 RC5?

Erik Erlandson-2

I would be comfortable making the integration testing manual for now. A JIRA for working out how to make it reliable enough to run automatically, as a goal for 3.0, seems like a good idea.

On Thu, Oct 25, 2018 at 8:11 AM Sean Owen <[hidden email]> wrote:
Forking this thread.

Because we'll have another RC, we could possibly address these two
issues. Only if we have a reliable change of course.

Is it easy enough to propagate the -Pscala-2.12 profile? can't hurt.

And is it reasonable to essentially 'disable'
kubernetes/integration-tests by removing it from the kubernetes
profile? it doesn't mean it goes away, just means it's run manually,
not automatically. Is that actually how it's meant to be used anyway?
in the short term? given the discussion around its requirements and
minikube and all that?

(Actually, this would also 'solve' the Scala 2.12 build problem too)


---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: [VOTE] SPARK 2.4.0 (RC4)

cloud0fan
In reply to this post by Dongjoon Hyun-2
Personally, I don't think it matters. Users can build arbitrary expressions/plans themselves with the internal API, and we never guarantee the result in that case.

Removing these functions from the function registry is a small patch and easy to review, and to me it's better than a 1000+ LOC patch that removes the whole thing.

Again, I don't have a strong opinion here. I'm OK with removing the entire thing if a PR is ready and well reviewed.

On Thu, Oct 25, 2018 at 11:00 PM Dongjoon Hyun <[hidden email]> wrote:
Thank you for the decision, All.

As of now, to unblock this, it seems that we are trying to remove them from the function registry.


One problem here is that users can recover those functions like this simply.

scala> spark.sessionState.functionRegistry.createOrReplaceTempFunction("map_filter", x => org.apache.spark.sql.catalyst.expressions.MapFilter(x(0),x(1)))

Technically, the PR looks like a compromised way to unblock the release and to allow some users that feature completely.

At first glance, I thought this is a workaround to ignore the discussion context. But, that sounds like one of the practical ways for Apache Spark.
(We had Spark 2.0 Tech. Preview before.)

I want to finalize the decision on `map_filter` (and related three functions) issue. Are we good to go with https://github.com/apache/spark/pull/22821?

Bests,
Dongjoon.

PS. Also, there is a PR to completely remove them, too. 



Re: What if anything to fix about k8s for the 2.4.0 RC5?

Stavros Kontopoulos-3
In reply to this post by Erik Erlandson-2
I will open a JIRA for the profile propagation issue and have a look at fixing it.

Stavros

On Thu, Oct 25, 2018 at 6:16 PM, Erik Erlandson <[hidden email]> wrote:

I would be comfortable making the integration testing manual for now.  A JIRA for ironing out how to make it reliable for automatic as a goal for 3.0 seems like a good idea.





Re: What if anything to fix about k8s for the 2.4.0 RC5?

Stavros Kontopoulos-3
I agree these tests should be manual for now, but shouldn't they still be run somehow before a release to make sure things are working right?



On Thu, Oct 25, 2018 at 6:29 PM, Stavros Kontopoulos <[hidden email]> wrote:
I will open a jira for the profile propagation issue and have a look to fix it.

Stavros




