code freeze and branch cut for Apache Spark 2.4


code freeze and branch cut for Apache Spark 2.4

rxin
FYI, it will soon be six months since the last release. We will cut the branch and enter code freeze on Aug 1st in order to get 2.4 out on time.


Re: code freeze and branch cut for Apache Spark 2.4

Jiang Xingbo
Xiangrui and I are leading an effort to implement a highly desirable feature, Barrier Execution Mode (https://issues.apache.org/jira/browse/SPARK-24374). This introduces a new scheduling model to Apache Spark so users can properly embed distributed DL training as a Spark stage, simplifying the distributed training workflow. The prototype was demoed in the Spark Summit keynote, and the feature received very positive feedback from the whole community. The design doc and pull requests got more comments than we initially anticipated. We want to finish this feature in the upcoming Spark 2.4 release. Would it be possible to extend the code freeze by a week?
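
For readers who have not followed the design doc, here is a rough sketch of how the proposed barrier API might look from the user side, based on the SPARK-24374 design and prototype; the exact names and signatures may still change before the release:

    import org.apache.spark.{BarrierTaskContext, SparkContext}

    // Sketch only: opt a single stage into barrier (gang) scheduling and
    // synchronize all of its tasks, so an external DL framework can be
    // launched across the whole stage.
    def runBarrierStage(sc: SparkContext): Unit = {
      sc.parallelize(1 to 100, numSlices = 4)
        .barrier()                                // opt in to barrier scheduling
        .mapPartitions { iter =>
          val ctx = BarrierTaskContext.get()      // all tasks in the stage start together
          // ... start the distributed training worker for this partition here ...
          ctx.barrier()                           // block until every task reaches this point
          iter
        }
        .count()
    }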

Thanks,

Xingbo


Re: code freeze and branch cut for Apache Spark 2.4

cloud0fan
This seems fine to me.

BTW Ryan Blue and I are working on some data source v2 stuff and hopefully we can get more things done with one more week.

Thanks,
Wenchen


Re: code freeze and branch cut for Apache Spark 2.4

Stavros Kontopoulos-3
Extending the code freeze date would be great for me too. I am working on a PR for supporting Scala 2.12; I am close, but need some more time.
We could get it into 2.4.

Stavros

--
Stavros Kontopoulos
Senior Software Engineer
Lightbend, Inc.
p: +30 6977967274

Re: code freeze and branch cut for Apache Spark 2.4

cloud0fan
If no one objects, how about we make the code freeze one week later (Aug 8th)?

BTW I'd like to volunteer to serve as the release manager for Spark 2.4. I'm familiar with most of the major features targeted for the 2.4 release. I also have a lot of free time during this release timeframe and should be able to figure out problems that may appear during the release.

Thanks,
Wenchen


Re: code freeze and branch cut for Apache Spark 2.4

Stavros Kontopoulos-3
+1. That would be great!

Thanks,
Stavros

--
Stavros Kontopoulos
Senior Software Engineer
Lightbend, Inc.
p: +30 6977967274

Re: code freeze and branch cut for Apache Spark 2.4

Holden Karau
I'm excited to have more folks rotate through the release manager role :)


Re: code freeze and branch cut for Apache Spark 2.4

Tom Graves-2
In reply to this post by rxin
Shouldn't this be a discuss thread?  

I'm also happy to see more release managers, and I agree the time is getting close, but we should look at what features are in progress, see how close they are, and propose a date based on that. Cutting a branch too soon just creates more work for committers, who have to push to more branches.

http://spark.apache.org/versioning-policy.html mentions the code freeze and release branch cut in mid-August.


Tom


Re: code freeze and branch cut for Apache Spark 2.4

Sean Owen-3
In theory releases happen on a time-based cadence, so it's pretty much wrap up what's ready by the code freeze and ship it. In practice, the cadence slips frequently, and it's very much a negotiation about what features should push the code freeze out a few weeks every time. So, kind of a hybrid approach here that works OK. 

Certainly speak up if you think there's something that really needs to get into 2.4. This is that discuss thread.

(BTW I updated the page you mention just yesterday, to reflect the plan suggested in this thread.)


Re: code freeze and branch cut for Apache Spark 2.4

cloud0fan
I went through the open JIRA tickets and here is a list that we should consider for Spark 2.4:

High Priority:
SPARK-24374: Support Barrier Execution Mode in Apache Spark
This one is critical to the Spark ecosystem for deep learning. It only has a few remaining work items, and I think we should have it in Spark 2.4.

Middle Priority:
SPARK-23899: Built-in SQL Function Improvement
We've already added a lot of built-in functions in this release, but there are a few useful higher-order functions in progress, like `array_except`, `transform`, etc. It would be great if we can get them into Spark 2.4 (see the sketch after this list).

SPARK-14220: Build and test Spark against Scala 2.12
Very close to finishing, great to have it in Spark 2.4.

SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
This one has been there for years (thanks for your patience, Michael!) and is also close to finishing. Great to have it in 2.4.

SPARK-24882: data source v2 API improvement
This is to improve the data source v2 API based on what we learned during this release. From the migration of existing sources and design of new features, we found some problems in the API and want to address them. I believe this should be the last significant API change to data source v2, so great to have in Spark 2.4. I'll send a discuss email about it later.

SPARK-24252: Add catalog support in Data Source V2
This is a very important feature for data source v2, and is currently being discussed in the dev list.

SPARK-24768: Have a built-in AVRO data source implementation
Most of it is done, but date/timestamp support is still missing. Great to have in 2.4 (also covered in the sketch after this list).

SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
This is a long-standing correctness bug, great to have in 2.4.

There are some other important features like the adaptive execution, streaming SQL, etc., not in the list, since I think we are not able to finish them before 2.4.

Feel free to add more things if you think they are important to Spark 2.4 by replying to this email.
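
To make a couple of the items above concrete, here is a small, illustrative sketch of how the user-facing side could look, assuming the work lands as proposed. The file paths are purely hypothetical, and the Avro example assumes the spark-avro module is on the classpath:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("spark-2.4-sketch").getOrCreate()

    // Higher-order SQL functions from SPARK-23899 (assuming they land as proposed):
    spark.sql("SELECT transform(array(1, 2, 3), x -> x + 1) AS plus_one").show()
    spark.sql("SELECT array_except(array(1, 2, 3), array(2)) AS diff").show()

    // Built-in Avro source from SPARK-24768 (paths are hypothetical):
    val events = spark.read.format("avro").load("/tmp/events.avro")
    events.write.format("avro").save("/tmp/events_copy.avro")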

Thanks,
Wenchen


Re: code freeze and branch cut for Apache Spark 2.4

Marco Gaido
Hi Wenchen,

I think it would be great to also consider:
 - SPARK-24598: Datatype overflow conditions gives incorrect result

since it is a correctness bug. What do you think?
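
For context, a minimal illustration of the behavior that ticket describes, assuming a SparkSession named `spark` as in the sketch above (this reflects the default, non-ANSI semantics at the time; exact behavior may differ by version):

    // Integer arithmetic follows Java overflow semantics, so the result silently
    // wraps around instead of failing or widening to a larger type:
    spark.sql("SELECT 2147483647 + 1 AS wrapped").show()   // prints -2147483648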

Thanks,
Marco


Re: code freeze and branch cut for Apache Spark 2.4

pzecevic
In reply to this post by cloud0fan

This one is important to us: https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join inner range optimization), but I think it could be useful to others too.

It is finished and is ready to be merged (was ready a month ago at least).

Do you think you could consider including it in 2.4?

Petar



Re: code freeze and branch cut for Apache Spark 2.4

Tomasz Gawęda
Hi,

what is the status of Continuous Processing + Aggregations? As far as I remember, Jose Torres said it should be easy to perform aggregations if coalesce(1) works. IIRC it's already merged to master.

Is this work in progress? If yes, it would be great to have full aggregation/join support for CP in Spark 2.4.
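
To make the question concrete, here is a sketch of the kind of query being asked about. Trigger.Continuous has existed since 2.3, but whether an aggregation like this runs under it is exactly what is in question; the source settings and names are illustrative:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("cp-aggregation-sketch").getOrCreate()
    import spark.implicits._

    // A streaming aggregation over the built-in "rate" test source.
    val counts = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()
      .groupBy(($"value" % 10).as("bucket"))
      .count()

    // Running this with a continuous trigger is the open question: as of 2.3,
    // only map-like operations are supported in continuous processing.
    counts.writeStream
      .format("console")
      .outputMode("update")
      .trigger(Trigger.Continuous("1 second"))
      .start()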

Pozdrawiam / Best regards,

Tomek



Re: code freeze and branch cut for Apache Spark 2.4

Stavros Kontopoulos-3
I have a PR out for SPARK-14540 (Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner).
This should allow us to add support for Scala 2.12; I think we can resolve this long-standing issue with 2.4.

Best,
Stavros


Re: code freeze and branch cut for Apache Spark 2.4

Joseph Torres
In reply to this post by Tomasz Gawęda
Full continuous processing aggregation support ran into unanticipated scalability and scheduling problems. We're planning to overcome those by using some of the barrier execution machinery, but since barrier execution itself is still in progress, the full support isn't going to make it into 2.4.

Jose


Re: code freeze and branch cut for Apache Spark 2.4

Erik Erlandson-2

Barrier mode seems like a high-impact feature on Spark's core code: is one additional week enough time to properly vet this feature?


Re: code freeze and branch cut for Apache Spark 2.4

Mark Hamstra
No reasonable amount of time is likely to be sufficient to fully vet the code as a PR. I'm not entirely happy with the design and code as they currently are (and I'm still trying to find the time to express my thoughts and concerns more publicly), but I'm fine with them going into 2.4 much as they are, as long as they go in with proper stability annotations and are understood not to be cast-in-stone final implementations, but rather as a way to get people using them and generating the feedback needed to reach something more like a final design and implementation.


Re: code freeze and branch cut for Apache Spark 2.4

Erik Erlandson-2
I don't have comprehensive knowledge of the Project Hydrogen PRs; however, I've perused them, and they make substantial modifications to Spark's core DAG scheduler code.

What I'm wondering is: how high is the confidence level that the "traditional" code paths are still stable? Put another way, is it even possible to "turn off" or "opt out" of this experimental feature? This analogy isn't perfect, but the k8s back-end, for example, is a major body of code that has very little impact on any *core* code paths, so if you opt out of it, it is well understood that you aren't running any experimental code.

Looking at the Project Hydrogen code, I'm less sure the same is true. However, maybe there is a clear way to show that it is.



Re: code freeze and branch cut for Apache Spark 2.4

rxin
I actually totally agree that we should make sure it has no impact on existing code if the feature is not used.


On Tue, Jul 31, 2018 at 1:18 PM Erik Erlandson <[hidden email]> wrote:
I don't have a comprehensive knowledge of the project hydrogen PRs, however I've perused them, and they make substantial modifications to Spark's core DAG scheduler code.

What I'm wondering is: how high is the confidence level that the "traditional" code paths are still stable. Put another way, is it even possible to "turn off" or "opt out" of this experimental feature? This analogy isn't perfect, but for example the k8s back-end is a major body of code, but it has a very small impact on any *core* code paths, and so if you opt out of it, it is well understood that you aren't running any experimental code.

Looking at the project hydrogen code, I'm less sure the same is true. However, maybe there is a clear way to show how it is true.


On Tue, Jul 31, 2018 at 12:03 PM, Mark Hamstra <[hidden email]> wrote:
No reasonable amount of time is likely going to be sufficient to fully vet the code as a PR. I'm not entirely happy with the design and code as they currently are (and I'm still trying to find the time to more publicly express my thoughts and concerns), but I'm fine with them going into 2.4 much as they are as long as they go in with proper stability annotations and are understood not to be cast-in-stone final implementations, but rather as a way to get people using them and generating the feedback that is necessary to get us to something more like a final design and implementation.

On Tue, Jul 31, 2018 at 11:54 AM Erik Erlandson <[hidden email]> wrote:

Barrier mode seems like a high impact feature on Spark's core code: is one additional week enough time to properly vet this feature?

On Tue, Jul 31, 2018 at 7:10 AM, Joseph Torres <[hidden email]> wrote:
Full continuous processing aggregation support ran into unanticipated scalability and scheduling problems. We’re planning to overcome those by using some of the barrier execution machinery, but since barrier execution itself is still in progress the full support isn’t going to make it into 2.4.

Jose

On Tue, Jul 31, 2018 at 6:07 AM Tomasz Gawęda <[hidden email]> wrote:
Hi,

what is the status of Continuous Processing + Aggregations? As far as I remember, Jose Torres said it should be easy to perform aggregations if coalesce(1) works. IIRC it's already merged to master.

Is this work in progress? If so, it would be great to have full aggregation/join support for CP in Spark 2.4.

Best regards,

Tomek
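For reference, a rough sketch of the kind of query being discussed: a streaming aggregation funneled through a single partition under a continuous trigger. This only shows the shape of the code; whether it actually runs under continuous processing in 2.4 is exactly the open question above:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("cp-agg-sketch").getOrCreate()
    import spark.implicits._

    // Built-in "rate" test source emitting (timestamp, value) rows.
    val counts = spark.readStream
      .format("rate")
      .load()
      .coalesce(1)                  // single partition, per the coalesce(1) idea
      .groupBy($"value" % 10)
      .count()

    counts.writeStream
      .format("console")
      .outputMode("complete")
      .trigger(Trigger.Continuous("1 second"))
      .start()
      .awaitTermination()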


On 2018-07-31 10:43, Petar Zečević wrote:
> This one is important to us: https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join inner range optimization), but I think it could be useful to others too.
>
> It is finished and ready to be merged (it was ready at least a month ago).
>
> Do you think you could consider including it in 2.4?
>
> Petar
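For readers not familiar with SPARK-24020, the shape of query it targets, as I read the JIRA, is a sort-merge join with an equi-join key plus a range condition on a secondary column. The tables and column names below are made up purely for illustration; the optimization itself needs no API change:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("range-join-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical data: events and measurements keyed by id, each with a timestamp.
    val events = Seq((1L, 1000L), (2L, 5000L)).toDF("id", "ts")
    val measurements = Seq((1L, 1500L, 0.5), (2L, 9500L, 0.7)).toDF("id", "ts", "reading")

    // Equi-join on id plus a range condition on ts -- the pattern the
    // inner range optimization is aimed at.
    val joined = events.as("e").join(
      measurements.as("m"),
      $"e.id" === $"m.id" &&
        $"m.ts".between($"e.ts" - 3600, $"e.ts" + 3600))

    joined.show()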
>
>
> Wenchen Fan wrote:
>
>> I went through the open JIRA tickets and here is a list that we should consider for Spark 2.4:
>>
>> High Priority:
>> SPARK-24374: Support Barrier Execution Mode in Apache Spark
>> This one is critical to the Spark ecosystem for deep learning. It only has a few remaining work items, and I think we should have it in Spark 2.4.
>>
>> Middle Priority:
>> SPARK-23899: Built-in SQL Function Improvement
>> We've already added a lot of built-in functions in this release, but there are a few useful higher-order functions in progress, like `array_except`, `transform`, etc. It would be great if we could get them into Spark 2.4 (see the usage sketch after this list).
>>
>> SPARK-14220: Build and test Spark against Scala 2.12
>> Very close to finishing, great to have it in Spark 2.4.
>>
>> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
>> This one has been there for years (thanks for your patience, Michael!), and is also close to finishing. Great to have it in 2.4.
>>
>> SPARK-24882: data source v2 API improvement
>> This is to improve the data source v2 API based on what we learned during this release. From the migration of existing sources and design of new features, we found some problems in the API and want to address them. I believe this should be
>> the last significant API change to data source v2, so great to have in Spark 2.4. I'll send a discuss email about it later.
>>
>> SPARK-24252: Add catalog support in Data Source V2
>> This is a very important feature for data source v2, and is currently being discussed in the dev list.
>>
>> SPARK-24768: Have a built-in AVRO data source implementation
>> Most of it is done, but date/timestamp support is still missing. Great to have in 2.4.
>>
>> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
>> This is a long-standing correctness bug, great to have in 2.4.
>>
>> There are some other important features, like adaptive execution and streaming SQL, that are not in the list, since I don't think we are able to finish them before 2.4.
>>
>> Feel free to add more things if you think they are important to Spark 2.4 by replying to this email.
>>
>> Thanks,
>> Wenchen
>>
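As a usage sketch for the higher-order functions mentioned in the list above (a sketch only; exactly which functions land depends on what gets merged before the freeze), assuming an existing SparkSession:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hof-sketch").getOrCreate()

    // transform applies a lambda to each array element; array_except removes
    // elements present in the second array. Expected results noted in comments.
    spark.sql("SELECT transform(array(1, 2, 3), x -> x + 1)").show()   // [2, 3, 4]
    spark.sql("SELECT array_except(array(1, 2, 3), array(2))").show()  // [1, 3]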



Reply | Threaded
Open this post in threaded view
|

Re: code freeze and branch cut for Apache Spark 2.4

Imran Rashid-2
In reply to this post by cloud0fan
I'd like to add SPARK-24296, replicating large blocks over 2GB. It's been up for review for a while, and it would end the 2GB block limit (well... subject to a couple of caveats on SPARK-6235).
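For context, the case this addresses is an ordinary replicated persist where a single partition's block ends up larger than 2GB. The API below is existing Spark; the sizes are made up purely to illustrate the shape of the workload:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("big-block-sketch"))

    // Each of the 2 partitions holds a few GB of byte arrays, so the individual
    // blocks being replicated exceed 2GB -- the case SPARK-24296 targets.
    val big = sc.parallelize(0 until 8, numSlices = 2)
      .flatMap(_ => Iterator.fill(200000)(new Array[Byte](4096)))

    big.persist(StorageLevel.MEMORY_AND_DISK_2)   // 2x-replicated storage level
    big.count()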

On Mon, Jul 30, 2018 at 9:01 PM, Wenchen Fan <[hidden email]> wrote:
I went through the open JIRA tickets and here is a list that we should consider for Spark 2.4:

High Priority:
SPARK-24374: Support Barrier Execution Mode in Apache Spark
This one is critical to the Spark ecosystem for deep learning. It only has a few remaining work items, and I think we should have it in Spark 2.4.

Middle Priority:
SPARK-23899: Built-in SQL Function Improvement
We've already added a lot of built-in functions in this release, but there are a few useful higher-order functions in progress, like `array_except`, `transform`, etc. It would be great if we could get them into Spark 2.4.

SPARK-14220: Build and test Spark against Scala 2.12
Very close to finishing, great to have it in Spark 2.4.

SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
This one has been there for years (thanks for your patience, Michael!), and is also close to finishing. Great to have it in 2.4.

SPARK-24882: data source v2 API improvement
This is to improve the data source v2 API based on what we learned during this release. From the migration of existing sources and design of new features, we found some problems in the API and want to address them. I believe this should be the last significant API change to data source v2, so great to have in Spark 2.4. I'll send a discuss email about it later.

SPARK-24252: Add catalog support in Data Source V2
This is a very important feature for data source v2, and is currently being discussed in the dev list.

SPARK-24768: Have a built-in AVRO data source implementation
Most of it is done, but date/timestamp support is still missing. Great to have in 2.4.

SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
This is a long-standing correctness bug, great to have in 2.4.

There are some other important features, like adaptive execution and streaming SQL, that are not in the list, since I don't think we are able to finish them before 2.4.

Feel free to add more things if you think they are important to Spark 2.4 by replying to this email.

Thanks,
Wenchen


