Apache Spark 3.1 Feature Expectation (Dec. 2020)

Apache Spark 3.1 Feature Expectation (Dec. 2020)

Dongjoon Hyun-2
Hi, All.

After a short celebration of Apache Spark 3.0, I'd like to ask the community's opinion on feature expectations for Apache Spark 3.1.

First of all, Apache Spark 3.1 is scheduled for December 2020.
- https://spark.apache.org/versioning-policy.html

I'm expecting the following items:

1. Support Scala 2.13
2. Use Apache Hadoop 3.2 by default for better cloud support
3. Declaring Kubernetes Scheduler GA
    From my perspective, the last major missing piece is dynamic allocation (see the configuration sketch after this list):
    - Dynamic allocation with shuffle tracking already shipped in 3.0.
    - Dynamic allocation with worker decommissioning/data migration is targeting 3.1. (Thanks, Holden)
4. DSv2 Stabilization
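
As a concrete illustration of item 3, here is a minimal application-side sketch. The shuffle-tracking key shipped in 3.0; the decommission key is an assumption, since that work is still in flight for 3.1 and the final name may differ:

    // Minimal sketch (Scala) of dynamic allocation per item 3 above.
    // spark.dynamicAllocation.shuffleTracking.enabled shipped in 3.0;
    // spark.decommission.enabled is an ASSUMED key for the 3.1 work.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("k8s-dynamic-allocation-sketch")
      .config("spark.dynamicAllocation.enabled", "true")
      // Track shuffle files so executors can be released without an
      // external shuffle service (shipped in 3.0).
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      // Migrate blocks off decommissioning workers (targeting 3.1; assumed key).
      .config("spark.decommission.enabled", "true")
      .getOrCreate()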

I'm aware of some more features that are currently on the way, but I'd love to hear opinions from the main developers and, even more so, from the main users who need those features.

Thank you in advance. Any comments are welcome.

Bests,
Dongjoon.

Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

JackyLee
Thank you for putting this forward.
Could we include support for view and partition catalogs in 3.1?
AFAICT, these are great features for DSv2 and the catalog API. With these, Spark can work
well with warehouses such as Delta or Hive (see the sketch after the links below).

https://github.com/apache/spark/pull/28147
https://github.com/apache/spark/pull/28617
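
As a rough, hypothetical sketch of what these would enable: once a v2 catalog implements the proposed view and partition APIs, statements like the following could be routed to it (`mycat` and the table/view names are made up for illustration):

    // Hypothetical sketch: `mycat` is an assumed v2 catalog name, and whether
    // these statements route to the new APIs depends on the PRs above landing.
    // Assumes a SparkSession named `spark`, as in spark-shell.
    spark.sql("CREATE VIEW mycat.db.recent_orders AS " +
      "SELECT * FROM mycat.db.orders WHERE dt > '2020-06-01'")
    spark.sql("ALTER TABLE mycat.db.orders ADD PARTITION (dt = '2020-06-29')")
    spark.sql("SHOW PARTITIONS mycat.db.orders")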

Thanks.

Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

Maxim Gekk
In reply to this post by Dongjoon Hyun-2
Hi Dongjoon,

I would add:
- Filter pushdown to JSON (https://github.com/apache/spark/pull/27366)
- Filter pushdown to other datasources like Avro
- Support for nested attributes in filters pushed down to JSON
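
For example, with pushdown in place, a read like the sketch below could evaluate the predicate on `id` while parsing and skip non-matching records early, instead of filtering after a full parse (the path and schema are made up for illustration):

    // Sketch: with JSON filter pushdown, the predicate could be applied
    // during parsing rather than afterwards. Path/schema are assumptions;
    // assumes a SparkSession named `spark`, as in spark-shell.
    val df = spark.read
      .schema("id INT, name STRING")
      .json("/tmp/events.json")
      .where("id > 100")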

Maxim Gekk
Software Engineer
Databricks, Inc.

Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

Holden Karau
Should we also consider targeting the shuffle service refactoring to support pluggable storage engines for the 3.1 release?

--
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

Jungtaek Lim-2
Does this count only "new features" (probably major ones), or also "improvements"? I'm aware of a couple of improvements that would ideally be included in the next release, but if this counts only major new features, I don't feel they should be listed.


Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

wuyi
In reply to this post by Holden Karau
Could this be a sub-task of SPARK-25299 (Use remote storage for persisting
shuffle data)? https://issues.apache.org/jira/browse/SPARK-25299

It would be good if we could get the whole of SPARK-25299 into Spark 3.1.
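
As a rough sketch of where that plugs in: part of SPARK-25299 already landed in 3.0 as the ShuffleDataIO plugin point, so a remote-storage backend could, in principle, be selected like this (the implementation class named here is hypothetical):

    // Rough sketch: swap the shuffle storage backend via the ShuffleDataIO
    // plugin config from SPARK-25299. The plugin class is hypothetical.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.sort.io.plugin.class",
           "com.example.shuffle.RemoteStorageShuffleDataIO") // hypothetical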




Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

Tom Graves-2
In reply to this post by Dongjoon Hyun-2
I'd like to get the Stage Level Scheduling feature (SPARK-27495) into 3.1 as well.

Tom

Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

Gabor Somogyi
In reply to this post by Dongjoon Hyun-2
Hi Dongjoon,

I would add JDBC Kerberos support w/ keytab: https://issues.apache.org/jira/browse/SPARK-12312
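
A minimal sketch of how that could look from the user side, assuming the work lands roughly as `keytab`/`principal` JDBC options (the option names may change; the URL and paths are made up):

    // Sketch: Kerberos-authenticated JDBC read with a keytab. The option
    // names follow the SPARK-12312 proposal and are not final. Assumes a
    // SparkSession named `spark`, as in spark-shell.
    val jdbcDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db.example.com:5432/sales")
      .option("dbtable", "orders")
      .option("keytab", "/etc/security/keytabs/spark.service.keytab")
      .option("principal", "spark/client@EXAMPLE.COM")
      .load()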

BR,
G



Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

Felix Cheung
In reply to this post by Holden Karau
I think pluggable storage in shuffle is essential for k8s GA.



Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

Dongjoon Hyun-2
Thank you for sharing your opinions, Jacky, Maxim, Holden, Jungtaek, Yi, Tom, Gabor, Felix.

According to the discussion above, I also want to include both `New Features` and `Improvements` together.

When I checked the status of these items as of today, it looked like the following. In short, I explicitly removed K8s GA and DSv2 Stabilization from the ON-TRACK list, based on the concerns raised. For those items, we can try to build consensus for Apache Spark 3.2 (June 2021) or later.

ON-TRACK
1. Support Scala 2.13 (SPARK-25075)
2. Use Apache Hadoop 3.2 by default for better cloud support (SPARK-32058)
3. Stage Level Scheduling (SPARK-27495)
4. Support more filter pushdown (CSV pushdown already shipped in 3.0 via SPARK-30323)
    - Support filter pushdown to JSON (SPARK-30648 in 3.1)
    - Support filter pushdown to Avro (SPARK-XXX in 3.1)
    - Support nested attributes in filters pushed down to JSON
5. Support JDBC Kerberos w/ keytab (SPARK-12312)

NICE TO HAVE OR DEFERRED TO APACHE SPARK 3.2
1. Declaring Kubernetes Scheduler GA
    - Should we also consider targeting the shuffle service refactoring to support pluggable storage engines for the 3.1 release? (Holden)
    - I think pluggable storage in shuffle is essential for k8s GA. (Felix)
    - Use remote storage for persisting shuffle data (SPARK-25299)
2. DSv2 Stabilization (the following, and more)
    - SPARK-31357 Catalog API for view metadata
    - SPARK-31694 Add SupportsPartitions Catalog APIs on DataSourceV2

As we know, we all work willingly and voluntarily. If something lands on the `master` branch before the feature freeze (November), it will of course be part of Apache Spark 3.1.

Thanks,
Dongjoon.
