Thank you for putting this forward.
Can we include support for view and partition catalogs in version 3.1?
AFAICT, these are great features in DSv2 and the catalog API. With them, we can
work well with warehouses such as Delta or Hive.
Does this count only "new features" (probably major), or also "improvements"? I'm aware of a couple of improvements which would ideally be included in the next release, but if this counts only major new features then I don't feel they should be listed.
It would be great if we could put the whole of SPARK-25299 in Spark 3.1.
Holden Karau wrote:
> Should we also consider the shuffle service refactoring to support
> pluggable storage engines as targeting the 3.1 release?
> On Mon, Jun 29, 2020 at 9:31 AM Maxim Gekk <
>> Hi Dongjoon,
>> I would add:
>> - Filters pushdown to JSON (https://github.com/apache/spark/pull/27366)
>> - Filters pushdown to other datasources like Avro
>> - Support nested attributes of filters pushed down to JSON
>> Maxim Gekk
>> Software Engineer
>> Databricks, Inc.
>> On Mon, Jun 29, 2020 at 7:07 PM Dongjoon Hyun <
>>> Hi, All.
>>> After a short celebration of Apache Spark 3.0, I'd like to ask you the
>>> community opinion on Apache Spark 3.1 feature expectations.
>>> First of all, Apache Spark 3.1 is scheduled for December 2020.
>>> - https://spark.apache.org/versioning-policy.html
>>> I'm expecting the following items:
>>> 1. Support Scala 2.13
>>> 2. Use Apache Hadoop 3.2 by default for better cloud support
>>> 3. Declaring Kubernetes Scheduler GA
>>> From my perspective, the last main missing piece was Dynamic allocation.
>>> - Dynamic allocation with shuffle tracking already shipped in 3.0.
>>> - Dynamic allocation with worker decommission/data migration is
>>> targeting 3.1. (Thanks, Holden)
>>> 4. DSv2 Stabilization
>>> I'm aware of some more features which are currently on the way, but I'd
>>> love to hear the opinions of the main developers and, moreover, of the
>>> users who need those features.
>>> Thank you in advance. Any comments are welcome.
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Thank you for sharing your opinions, Jacky, Maxim, Holden, Jungtaek, Yi, Tom, Gabor, Felix.
According to the above discussion, I also want to include both `New Features` and `Improvements`.
When I checked the item status as of today, it looked like the following. In short, I removed K8s GA and DSv2 Stabilization explicitly from ON-TRACK list according to the given concerns. For those items, we can try to build a consensus for Apache Spark 3.2 (June 2021) or later.
ON-TRACK
1. Support Scala 2.13 (SPARK-25075)
2. Use Apache Hadoop 3.2 by default for better cloud support (SPARK-32058)
3. Stage Level Scheduling (SPARK-27495)
4. Support more filter pushdown (CSV is already shipped by SPARK-30323 in 3.0)
   - Support filters pushdown to JSON (SPARK-30648 in 3.1)
   - Support filters pushdown to Avro (SPARK-XXX in 3.1)
   - Support nested attributes of filters pushed down to JSON
5. Support JDBC Kerberos w/ keytab (SPARK-12312)
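For item 4, the user-facing behavior doesn't change; the gain is that predicates are applied inside the scan. A minimal sketch of how one might check it in spark-shell, assuming a 3.1 build with SPARK-30648 merged (the path and column names below are hypothetical):

```scala
// Sketch only: with JSON filter pushdown, rows failing the predicate can be
// skipped during parsing instead of being fully materialized first.
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("id", LongType)
  .add("name", StringType)

val df = spark.read
  .schema(schema)              // explicit schema avoids a separate inference pass
  .json("/tmp/people.json")    // hypothetical input path
  .filter($"id" > 10)

// When pushdown applies, the physical plan should list the predicate
// under PushedFilters in the scan node.
df.explain()
```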
NICE TO HAVE OR DEFERRED TO APACHE SPARK 3.2
1. Declaring Kubernetes Scheduler GA
   - Should we also consider the shuffle service refactoring to support pluggable storage engines as targeting the 3.1 release? (Holden)
   - I think pluggable storage in shuffle is essential for k8s GA (Felix)
   - Use remote storage for persisting shuffle data (SPARK-25299)
2. DSv2 Stabilization? (The following and more)
   - SPARK-31357 Catalog API for view metadata
   - SPARK-31694 Add SupportsPartitions Catalog APIs on DataSourceV2
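To illustrate the shape of the SPARK-31357 proposal: views would become first-class catalog objects, so a DSv2 catalog (e.g. one backed by a Hive metastore or Delta) could store and resolve them itself. The names below mirror the proposal but are hypothetical here, not the final Spark API:

```scala
// Illustrative sketch only: ViewCatalog, View, and the method names are
// hypothetical stand-ins for the SPARK-31357 proposal, not a shipped API.
case class View(name: String, sql: String, comment: Option[String])

trait ViewCatalog {
  // Views live under a namespace, mirroring how DSv2 tables are addressed.
  def listViews(namespace: Array[String]): Array[String]
  def loadView(ident: String): View            // fails if the view is absent
  def createView(ident: String, view: View): Unit
  def dropView(ident: String): Boolean         // true if something was dropped
}
```

SPARK-31694 would play the analogous role for partitions, letting a connector's catalog manage partition metadata instead of relying on the session catalog.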
As we know, we work willingly and voluntarily. If something lands on the `master` branch before the feature freeze (November), it will be a part of Apache Spark 3.1, of course.