[VOTE] Release Apache Spark 2.0.0 (RC5)

[VOTE] Release Apache Spark 2.0.0 (RC5)

rxin
Please vote on releasing the following candidate as Apache Spark version 2.0.0. The vote is open until Friday, July 22, 2016 at 20:00 PDT and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.0.0
[ ] -1 Do not release this package because ...


The tag to be voted on is v2.0.0-rc5 (13650fc58e1fcf2cf2a26ba11c819185ae1acc1f).

This release candidate resolves ~2500 issues: https://s.apache.org/spark-2.0.0-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1195/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/


=================================
How can I help test this release?
=================================
If you are a Spark user, you can help us test this release by taking an existing Spark workload, running it on this release candidate, and reporting any regressions from 1.x.

==========================================
What justifies a -1 vote for this release?
==========================================
Critical bugs impacting major functionality.

Bugs already present in 1.x, missing features, or bugs related to new features will not necessarily block this release. Note that historically Spark documentation has been published on the website separately from the main release, so we do not need to block the release on documentation errors either.


Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

Sean Owen
+1 at last. Sigs and hashes check out, and it compiles and passes tests
with "-Pyarn -Phadoop-2.7 -Phive" on Ubuntu 16 + Java 8.


There are actually only 2 issues still targeted for 2.0.0, which is great:
SPARK-16633 lag/lead does not return the default value when the offset
row does not exist
SPARK-16648 LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException

These are not marked blocker, though one is critical. I will assume
these don't block.
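For context, SPARK-16633 concerns `lag`/`lead` not honoring the user-supplied default when the offset row falls outside the partition. The intended semantics can be sketched in plain Python (function names and shapes here are illustrative, not Spark's window-function API):

```python
def lag(rows, offset=1, default=None):
    """For each position, return the value `offset` rows earlier,
    or `default` when that row does not exist -- the fallback that
    SPARK-16633 reports as broken."""
    return [rows[i - offset] if i - offset >= 0 else default
            for i in range(len(rows))]

def lead(rows, offset=1, default=None):
    """Mirror image of lag: the value `offset` rows later."""
    return [rows[i + offset] if i + offset < len(rows) else default
            for i in range(len(rows))]

print(lag([10, 20, 30], offset=1, default=-1))   # [-1, 10, 20]
print(lead([10, 20, 30], offset=1, default=-1))  # [20, 30, -1]
```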


The only other JIRA that seems to be "for 2.0" and not resolved is...
https://issues.apache.org/jira/browse/SPARK-16486
... which I suspect is actually just something to be renamed and pushed out.


I did encounter two test failures that weren't reproducible, just FYI:

ExecutorAllocationManagerSuite:
- basic functionality *** FAILED ***
  The code passed to eventually never returned normally. Attempted 613
times over 10.015362111999998 seconds. Last failure message:
  Wanted but not invoked:
  executorAllocationClient.killExecutor("2");
  -> at org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite$$anonfun$2$$anonfun$7.org$apache$spark$streaming$scheduler$ExecutorAllocationManagerSuite$$anonfun$$anonfun$$verifyKilledExec$1(ExecutorAllocationManagerSuite.scala:80)
  Actually, there were zero interactions with this mock.
  . (ExecutorAllocationManagerSuite.scala:61)

StateStoreSuite:
- maintenance *** FAILED ***
  The code passed to eventually never returned normally. Attempted 611
times over 10.007739936 seconds. Last failure message:
StateStoreSuite.this.fileExists(provider, 1L, false) was true earliest
file not deleted. (StateStoreSuite.scala:395)
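Both failures are ScalaTest `eventually` blocks giving up after ~10 seconds of retries, which usually points to a timing-sensitive test rather than a product bug. The retry-until-timeout pattern behind those "Attempted 613 times over 10.01 seconds" messages looks roughly like this (a Python sketch, not ScalaTest itself):

```python
import time

def eventually(assertion, timeout=10.0, interval=0.01):
    """Re-run `assertion` until it stops raising AssertionError or
    `timeout` elapses; on timeout, re-raise the last failure."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return assertion()
        except AssertionError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)

# Demo: a condition that only becomes true after a short delay.
start = time.monotonic()
def ready():
    assert time.monotonic() - start > 0.05, "not ready yet"
eventually(ready, timeout=2.0)  # succeeds after a few retries
```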

On Wed, Jul 20, 2016 at 3:35 AM, Reynold Xin <[hidden email]> wrote:


---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

Shivaram Venkataraman
In reply to this post by rxin
+1

SHA and MD5 sums match for all binaries. Docs look fine this time
around. Built and ran `dev/run-tests` with Java 7 on a Linux machine.

No blocker bugs on JIRA, and the only critical bug targeted at 2.0.0
is SPARK-16633, which doesn't look like a release blocker. I also
checked issues marked Critical affecting version 2.0.0, and the only
other ones that seem applicable are SPARK-15703 and SPARK-16334.
Neither looks like a blocker to me.

Thanks
Shivaram


On Tue, Jul 19, 2016 at 7:35 PM, Reynold Xin <[hidden email]> wrote:



Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

Marcin Tustin
In reply to this post by rxin
Whatever happened with the query regarding benchmarks? Is that resolved?

On Tue, Jul 19, 2016 at 10:35 PM, Reynold Xin <[hidden email]> wrote:


Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

Michael Allman-2
Marcin,

I'm not sure what you're referring to. Can you be more specific?

Cheers,

Michael

On Jul 20, 2016, at 9:10 AM, Marcin Tustin <[hidden email]> wrote:

Whatever happened with the query regarding benchmarks? Is that resolved?




Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

Marcin Tustin
I refer to Maciej Bryński's ([hidden email]) emails of 29 and 30 June 2016 to this list. He said that his benchmarking suggested that Spark 2.0 was slower than 1.6.

I'm wondering whether that was ever investigated, and if so, whether performance has recovered.

On Wed, Jul 20, 2016 at 12:18 PM, Michael Allman <[hidden email]> wrote:
Marcin,

I'm not sure what you're referring to. Can you be more specific?

Cheers,

Michael



Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

Michael Allman-2
In reference to https://issues.apache.org/jira/browse/SPARK-16320, the code path for reading data from parquet files has been refactored extensively. The fact that Maciej was testing performance on a table with 400 partitions makes me wonder if my PR for https://issues.apache.org/jira/browse/SPARK-15968 will make a difference for repeated queries on partitioned tables. That PR was merged into master and backported to 2.0. The commit short hash is d5d2457.

Maciej, can you rerun your test on your original dataset with a version of Spark 2.0 that includes that commit? Run it more than once, and when comparing first-query performance, make sure you start with a fresh spark-shell or spark-sql session each time so caching is not a factor.

As for the issue with initial query performance on a partitioned table or query performance on an unpartitioned table being inferior, I can do a quick test to see if I can reproduce that issue on our end. Assuming there is a perf regression, I may be able to spend some time debugging today. I've spent a substantial amount of time debugging and optimizing parquet table query perf in Spark, and we've been using 2.0 for at least a month now. Not sure if I'll have time to dig that deep, though.

Michael


On Jul 20, 2016, at 9:23 AM, Marcin Tustin <[hidden email]> wrote:

I refer to Maciej Bryński's ([hidden email]) emails of 29 and 30 June 2016 to this list. He said that his benchmarking suggested that Spark 2.0 was slower than 1.6.

I'm wondering if that was ever investigated, and if so if the speed is back up, or not.



Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

Maciej Bryński
@Michael,
I answered in JIRA and will repeat here.
I think my problem is unrelated to Hive, because I'm using the read.parquet method.
I also attached some VisualVM snapshots to SPARK-16321 (I think I should merge both issues).
Code profiling suggests the bottleneck is in reading the parquet file.

I wonder if there are any other benchmarks related to parquet performance.

Regards,
--
Maciek Bryński
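Profiling claims like this are straightforward to double-check with the standard library alone; a minimal cProfile sketch (the `workload` function is a stand-in for the actual parquet read being investigated):

```python
import cProfile
import io
import pstats

def workload():
    # Stand-in for the suspect call, e.g. a parquet read;
    # replace with the code path under investigation.
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Print the hottest functions by cumulative time, similar to what a
# VisualVM snapshot surfaces.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```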

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

Michael Allman-2
I've run some tests with some real and some synthetic parquet data with nested columns, with and without the Hive metastore, on our Spark 1.5, 1.6 and 2.0 versions. I haven't seen any performance surprises, except that Spark 2.0 now does schema inference across all files in a partitioned parquet metastore table. Granted, you aren't using a metastore table, but maybe Spark does that for partitioned non-metastore tables as well.

Michael

> On Jul 20, 2016, at 2:16 PM, Maciej Bryński <[hidden email]> wrote:


Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

Jonathan Kelly
+1 (non-binding)

On Wed, Jul 20, 2016 at 2:48 PM Michael Allman <[hidden email]> wrote:


Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

Joseph E. Gonzalez
In reply to this post by rxin
+1

Sent from my iPad



Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

Krishna Sankar
In reply to this post by rxin
+1 (non-binding, of course)

1. Compiled OS X 10.11.5 (El Capitan) OK Total time: 24:07 min
     mvn clean package -Pyarn -Phadoop-2.7 -DskipTests
2. Tested pyspark, mllib (iPython 4.0)
2.0 Spark version is 2.0.0 
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK 
2.3. Classification : Decision Tree, Naive Bayes OK
2.4. Clustering : KMeans OK
       Center And Scale OK
2.5. RDD operations OK
      State of the Union Texts - MapReduce, Filter,sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
       Model evaluation/optimization (rank, numIter, lambda) with itertools OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 3.6 save(above) - sqlContext.parquetFile, registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK (--packages com.databricks:spark-csv_2.10:1.4.0)
6.0. DataFrames 
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. All joins,sql,set operations,udf OK
[Dataframe Operations very fast from 11 secs to 3 secs, to 1.8 secs, to 1.5 secs! Good work !!!]
7.0. GraphX/Scala
7.1. Create Graph (small and bigger dataset) OK
7.2. Structure APIs - OK
7.3. Social Network/Community APIs - OK
7.4. Algorithms : PageRank of 2 datasets, aggregateMessages() - OK

Cheers
<k/>

On Tue, Jul 19, 2016 at 7:35 PM, Reynold Xin <[hidden email]> wrote:



Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

Dongjoon Hyun
In reply to this post by rxin
+1 (non-binding)

- MD5/SHA/GPG matched.
- Tests passed on Ubuntu 16.04 + Oracle JDK (1.7.0_80) + R (3.2.3)
  * build/mvn -Phive -Phadoop-2.7 -Pyarn clean package
  * python python/run-tests.py
  * R/install-dev.sh & R/run-tests.sh

Cheers!

Dongjoon.


On Tue, Jul 19, 2016 at 7:35 PM, Reynold Xin <[hidden email]> wrote:


Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

rxin
In reply to this post by Krishna Sankar
+1

On Wednesday, July 20, 2016, Krishna Sankar <[hidden email]> wrote:

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

Ricardo Almeida
+1 (non-binding)

Tested PySpark Core, DataFrame/SQL, MLlib and Streaming on a standalone cluster

On 21 July 2016 at 05:24, Reynold Xin <[hidden email]> wrote:
+1


On Wednesday, July 20, 2016, Krishna Sankar <[hidden email]> wrote:
+1 (non-binding, of course)

1. Compiled on OS X 10.11.5 (El Capitan) OK. Total time: 24:07 min
     mvn clean package -Pyarn -Phadoop-2.7 -DskipTests
2. Tested pyspark, mllib (IPython 4.0)
2.0. Spark version is 2.0.0
2.1. Statistics (min, max, mean, Pearson, Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Classification: Decision Tree, Naive Bayes OK
2.4. Clustering: KMeans OK
       Center and Scale OK
2.5. RDD operations OK
      State of the Union texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (MovieLens medium dataset, ~1 M ratings) OK
       Model evaluation/optimization (rank, numIter, lambda) with itertools OK
3. Scala - MLlib
3.1. Statistics (min, max, mean, Pearson, Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (MovieLens medium dataset, ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile, registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK (--packages com.databricks:spark-csv_2.10:1.4.0)
6.0. DataFrames
6.1. cast, dtypes OK
6.2. groupBy, avg, crosstab, corr, isNull, na.drop OK
6.3. All joins, sql, set operations, udf OK
[DataFrame operations very fast: from 11 s to 3 s, to 1.8 s, to 1.5 s. Good work!]
7.0. GraphX/Scala
7.1. Create Graph (small and bigger dataset) OK
7.2. Structure APIs OK
7.3. Social Network/Community APIs OK
7.4. Algorithms: PageRank of 2 datasets, aggregateMessages() OK

Cheers
<k/>


Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

Joseph Bradley
+1

Mainly tested ML/Graph/R. Perf tests from Tim Hunter showed minor speedups over 1.6 for common ML algorithms.

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

Matei Zaharia
+1

Tested on Mac.

Matei

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

Luciano Resende
+1 (non-binding)

Found a minor issue when trying to run some of the docker tests, but nothing blocking the release. Will create a JIRA for that.

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

Holden Karau
+1 (non-binding)

Built locally on Ubuntu 14.04; did basic PySpark sanity checking, and tested with a simple structured streaming project (spark-structured-streaming-ml), spark-testing-base, and high-performance-spark-examples (minor changes required from the preview version, but they seem intentional; there were Jetty conflicts with an out-of-date testing library, but that's not a Spark problem).

--
Cell : 425-233-8271

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

Michael Armbrust
+1
