Apache Spark 3.2 Expectation

Dongjoon Hyun-2
Hi, All.

We have been preparing Apache Spark 3.2.0 in the master branch since December 2020, so March seems to be a good time to share our thoughts and aspirations for Apache Spark 3.2.

Judging by the progress of the Apache Spark 3.1 release, Apache Spark 3.2 seems likely to be the last minor release of this year. Given that timeframe, we might consider the following. (This is a small set; please add your thoughts to this limited list.)

# Languages

- Scala 2.13 Support: This was expected in 3.1 via SPARK-25075 but slipped. Currently, we are trying to use Scala 2.13.5 via SPARK-34505 and investigating the publishing issue. Thank you for your contributions and feedback on this.

- Java 17 LTS Support: Java 17 LTS will arrive in September 2017. Like Java 11, we need lots of support from our dependencies. Let's see.

- Python 3.6 Deprecation(?): Python 3.6 community support ends on 2021-12-23. So the deprecation is not required yet, but we had better prepare for it because we don't have an ETA for Apache Spark 3.3 in 2022.

- SparkR CRAN publishing: As we know, it's currently discontinued. Resuming it depends on the success of the Apache SparkR 3.1.1 CRAN publication. If we succeed in reviving it, we can keep publishing. Otherwise, I believe we had better officially drop it from the release work item list.

# Dependencies

- Apache Hadoop 3.3.2: Hadoop 3.2.0 became the default Hadoop profile in Apache Spark 3.1. Currently, the Spark master branch lives on Hadoop 3.2.2's shaded clients via SPARK-33212. So far, there is one ongoing report in a YARN environment. We hope it will be fixed within the Spark 3.2 timeframe so that we can move toward Hadoop 3.3.2.

- Apache Hive 2.3.9: Spark 3.0 started using Hive 2.3.7 by default instead of the old Hive 1.2 fork. Spark 3.1 removed the hive-1.2 profile completely via SPARK-32981 and replaced the generated hive-service-rpc code with the official dependency via SPARK-32981. We are steadily improving this area and will consume Hive 2.3.9 when it is available.

- K8s Client 4.13.2: During the K8s GA effort, Spark 3.1 upgraded the K8s client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to support the K8s 1.19 model.

- Kafka Client 2.8: To bring in the client fixes, Spark 3.1 uses Kafka Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala 2.12.13, but this was later reverted due to a Scala 2.12.13 issue. Since KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will hopefully go with Kafka Client 2.8.

# Some Features

- Data Source v2: Spark 3.2 will deliver a much richer DSv2 with Apache Iceberg integration. In particular, we hope the ongoing function catalog SPIP and the upcoming storage-partitioned join SPIP can be delivered as part of Spark 3.2 and become an additional foundation.
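For readers who haven't tried the DSv2 catalog plumbing yet, here is a minimal sketch of registering Iceberg as a catalog. It assumes the Iceberg Spark 3 runtime jar is on the classpath; the catalog name, warehouse path, and table names are placeholders.

    // Minimal sketch: registering Apache Iceberg as a DSv2 catalog plugin.
    // Assumes the Iceberg Spark 3 runtime jar is on the classpath; all names are placeholders.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]") // for local experimentation
      .appName("dsv2-iceberg-sketch")
      .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.demo.type", "hadoop")
      .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
      .getOrCreate()

    // Multi-part identifiers resolve through the registered DSv2 catalog.
    spark.sql("CREATE TABLE demo.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
    spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp())")
    spark.table("demo.db.events").show()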

- Columnar Encryption: As of today, the Apache Spark master branch supports columnar encryption via Apache ORC 1.6, and it's documented via SPARK-34036. Also, the upcoming Apache Parquet 1.12 has a similar capability. Hopefully, Apache Spark 3.2 will be the first release to have this feature officially. Any feedback is welcome.
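To make this concrete, here is a minimal write-side sketch. It assumes a Hadoop KMS is reachable at the given URI and already holds a key named `pii`; the URI, key name, and output path are placeholders.

    // Sketch: ORC columnar encryption on write (Spark master with ORC 1.6).
    // Assumes a Hadoop KMS at the given URI holding a key named "pii"; names are placeholders.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("orc-encryption-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1L, "alice"), (2L, "bob")).toDF("id", "name")

    df.write.format("orc")
      .option("hadoop.security.key.provider.path", "kms://http@localhost:9600/kms")
      .option("orc.key.provider", "hadoop")
      .option("orc.encrypt", "pii:id,name")          // encrypt columns id and name with key "pii"
      .option("orc.mask", "nullify:id;sha256:name")  // what readers without the key will see
      .mode("overwrite")
      .save("/tmp/orc_encrypted")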

- Improved ZStandard Support: Spark 3.2 will bring more benefits to ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support for all IO operations, 2) SPARK-33978 makes the ORC data source support ZSTD compression, 3) SPARK-34503 sets ZSTD as the default codec for event log compression, and 4) SPARK-34479 aims to support ZSTD in the Avro data source. The upcoming Parquet 1.12 also supports ZSTD (with JNI buffer pool support). I'm expecting more benefits.
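As a usage sketch, here is how a user could opt in to the pieces above; the config names follow the JIRAs, and whether each one is on by default is still up to the release.

    // Sketch: opting in to ZSTD in the areas listed above (config names per the JIRAs).
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("zstd-sketch")
      .config("spark.io.compression.codec", "zstd")       // codec for shuffle/broadcast/spill IO
      .config("spark.eventLog.compress", "true")
      .config("spark.eventLog.compression.codec", "zstd") // SPARK-34503 proposes this as default
      .getOrCreate()

    // SPARK-33978: ZSTD as a compression option for the ORC data source.
    spark.range(1000).write.option("compression", "zstd").mode("overwrite").orc("/tmp/zstd_orc")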

- Structured Streaming with RocksDB backend: According to the latest update, the work looks active enough to be merged into the master branch in the Spark 3.2 timeframe.
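If it merges, it would presumably plug into the existing pluggable state store configuration; here is a sketch under that assumption, with the provider class name taken from the proposal (it may change before merge).

    // Sketch: pointing the pluggable state store at a RocksDB-backed provider.
    // The provider class name comes from the ongoing proposal and may change before merge.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rocksdb-state-sketch")
      .config("spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
      .getOrCreate()
    import spark.implicits._

    // Any stateful query then keeps its state in RocksDB instead of the default in-memory store.
    val counts = spark.readStream.format("rate").load()
      .groupBy(($"value" % 10).as("bucket"))
      .count()

    counts.writeStream.format("console").outputMode("complete").start().awaitTermination()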

Please share your thoughts, and let's build a better Apache Spark 3.2 together.

Bests,
Dongjoon.

Re: Apache Spark 3.2 Expectation

Mridul Muralidharan


Nit: Java 17 -> should be available by Sept 2021 :-)
Adoption would also depend on some of our nontrivial dependencies supporting it - it might be a stretch to get it in for Apache Spark 3.2?

Features:
Push-based shuffle and disaggregated shuffle should also be in 3.2.


Regards,
Mridul

Re: Apache Spark 3.2 Expectation

Sean Owen-2
In reply to this post by Dongjoon Hyun-2
I'd roughly expect 3.2 in, say, July of this year, given the usual cadence. No reason it couldn't be a little sooner or later. There is already some good stuff in 3.2, and it will be a good minor release in 5-6 months.


Re: Apache Spark 3.2 Expectation

Dongjoon Hyun-2
Thank you, Mridul and Sean.

1. Yes, `2017` was a typo. Java 17 is scheduled for September 2021. And, of course, it has nice-to-have status. :)

2. `Push based shuffle and disaggregated shuffle`. Definitely. Thanks for sharing.

3. +100 for Apache Spark 3.2.0 in July 2021. Maybe we need the branch cut in April, because we took 3 months for the Spark 3.1 release.
    Let's update the release roadmap on the Apache Spark website.

> I'd roughly expect 3.2 in, say, July of this year, given the usual cadence. No reason it couldn't be a little sooner or later. There is already some good stuff in 3.2 and will be a good minor release in 5-6 months.

Bests,
Dongjoon.




Re: Apache Spark 3.2 Expectation

Xiao Li
Thank you, Dongjoon, for initiating this discussion. Let us keep it open. It might take 1-2 weeks to collect from the community all the features we plan to build and ship in 3.2, since we just finished the 3.1 vote.

> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe we need the branch cut in April, because we took 3 months for the Spark 3.1 release.

TBH, cutting the branch this April does not look good to me. That would leave only one month for feature development of Spark 3.2. Do we have enough features in the current master branch? If not, are we able to finish the major features we collect here? Do they have a timeline or project plan?

Xiao


Re: Apache Spark 3.2 Expectation

huaxin gao
Thanks, Dongjoon and Xiao, for the discussion. I would like to add Data Source V2 aggregate push-down to the list. I am currently working on JDBC Data Source V2 aggregate push-down, but the common code can be used for file-based V2 data sources as well. For example, MAX and MIN can be pushed down to Parquet and ORC, since those formats can use their statistics to perform these operations efficiently. Quite a few users are interested in this aggregate push-down feature, and the preliminary performance test for JDBC aggregate push-down is positive, so I think it is a valuable feature to add for Spark 3.2. A sketch of the kind of query that would benefit follows.
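The example below is a hypothetical illustration only; the connection details and the table and column names are placeholders. Today Spark fetches the rows and aggregates them itself; with push-down, the MAX/MIN would be evaluated inside the database.

    // Hypothetical illustration: the shape of a query that aggregate push-down would help.
    // Connection details and table/column names are placeholders.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{max, min}

    val spark = SparkSession.builder().master("local[*]").appName("agg-pushdown-sketch").getOrCreate()

    val sales = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")
      .option("dbtable", "sales")
      .option("user", "user")
      .option("password", "password")
      .load()

    // explain() shows whether the aggregate stayed in Spark or moved into the source.
    sales.agg(max("amount"), min("amount")).explain()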

Thanks,
Huaxin


Re: Apache Spark 3.2 Expectation

Dongjoon Hyun-2
In reply to this post by Xiao Li

On Fri, Feb 26, 2021 at 11:13 AM Xiao Li <[hidden email]> wrote:
> Do we have enough features in the current master branch?

Hi, Xiao.
Is this a question about Sean's previous comment, `There is already some good stuff in 3.2 and will be a good minor release in 5-6 months.`?



Re: Apache Spark 3.2 Expectation

Dongjoon Hyun-2
In reply to this post by huaxin gao
Thank you for sharing your plan, Huaxin!

Bests,
Dongjoon.



Re: Apache Spark 3.2 Expectation

Hyukjin Kwon
I have an idea that I'll send an email about to discuss next week or the week after. I did not have enough bandwidth to drive both at the same time, so I would appreciate it if we have some more time for 3.2.

In addition, it would also be great if we follow the schedule and catch potential blockers early during QA instead of when we cut RCs. That would considerably speed up the process and keep the release on time.

Thanks.



Re: Apache Spark 3.2 Expectation

Dongjoon Hyun-2
Sure, thank you, Hyukjin.

Bests,
Dongjoon.



Re: Apache Spark 3.2 Expectation

Cheng Su-2

Hi,

Just want to share some things I am working on for 3.2, in case they matter:

  • Shuffled hash join improvement (SPARK-32461)
    • This was one of the release-notes JIRAs in 3.1; the major items left are the sort-based fallback and code-gen for FULL OUTER join. (A usage sketch follows below.)
  • Join and aggregation code-gen (SPARK-34287 and more to create)
    • Add code-gen for all join types of sort merge join, plus object hash aggregation and sort aggregation.
  • Write Hive/Presto-compatible bucketed tables (SPARK-19256)
    • This is a long-standing issue, and we made progress on the plan during 3.1 development. We would ideally like to finish the feature in 3.2.

We have already developed most of these features internally and rolled them out to production.
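For reference, a shuffled hash join can already be requested today, either globally or per query; here is a minimal sketch with placeholder data.

    // Sketch: two existing ways to steer the planner toward a shuffled hash join.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("shj-sketch")
      // Global: prefer shuffled hash join over sort merge join where the planner allows it.
      .config("spark.sql.join.preferSortMergeJoin", "false")
      .getOrCreate()

    val left = spark.range(1000000).toDF("id")
    val right = spark.range(1000).toDF("id")

    // Per query: the SHUFFLE_HASH hint asks Spark to build the hash map on the hinted side.
    left.join(right.hint("SHUFFLE_HASH"), "id").explain()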

 

Thanks,

Cheng Su

 


Re: Apache Spark 3.2 Expectation

wuyi
In reply to this post by Mridul Muralidharan
+1 to continue the incomplete push-based shuffle work.
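
For anyone who wants to experiment once the remaining patches land, here is a minimal sketch of what enabling it is expected to look like. The property names are my assumptions based on the in-flight SPARK-30602 work and may change before release:

    import org.apache.spark.sql.SparkSession

    // Sketch only: assumes the SPARK-30602 configs keep their current names.
    // Push-based shuffle also depends on the external shuffle service (YARN).
    val spark = SparkSession.builder()
      .appName("push-based-shuffle-sketch")
      .config("spark.shuffle.service.enabled", "true")   // prerequisite
      .config("spark.shuffle.push.enabled", "true")      // client-side opt-in (assumed name)
      .getOrCreate()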

--
Yi


Re: Apache Spark 3.2 Expectation

Takeshi Yamamuro
Thanks, Dongjoon, for the discussion.
I would like to add Gengliang's work: SPARK-34246 (new type coercion syntax rules in ANSI mode).
I think it is worth describing in the next release notes, too.
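
As a quick illustration for readers who have not tried ANSI mode: the behavior hinges on the existing spark.sql.ansi.enabled flag. This is only a sketch of the general switch, not of the new coercion rules themselves, which are still being finalized:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("ansi-sketch").getOrCreate()

    // Default (non-ANSI) behavior: an invalid string-to-int cast yields NULL.
    spark.conf.set("spark.sql.ansi.enabled", "false")
    spark.sql("SELECT CAST('abc' AS INT)").show()      // prints null

    // ANSI mode: the same cast fails at runtime instead of silently returning NULL.
    spark.conf.set("spark.sql.ansi.enabled", "true")
    // spark.sql("SELECT CAST('abc' AS INT)").show()   // would throw a NumberFormatException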

Bests,
Takeshi



--
---
Takeshi Yamamuro

Re: Apache Spark 3.2 Expectation

bo yang
In reply to this post by wuyi
+1 for better support for disaggregated shuffle (push-based shuffle is a great example; there are also the Facebook shuffle service and the Uber remote shuffle service). There were previously some community sync-up meetings on this, but they have been discontinued. Are people interested in resuming the sync-up meetings?


Re: Apache Spark 3.2 Expectation

Chang Chen
In reply to this post by huaxin gao
+1 for Data Source V2 Aggregate push down 

On Sat, Feb 27, 2021 at 4:20 AM, huaxin gao <[hidden email]> wrote:
Thanks Dongjoon and Xiao for the discussion. I would like to add Data Source V2 Aggregate push-down to the list. I am currently working on JDBC Data Source V2 Aggregate push-down, but the common code can be reused for the file-based V2 data sources as well. For example, MAX and MIN can be pushed down to Parquet and ORC, since those formats can use statistics to answer these operations efficiently. Quite a few users are interested in this Aggregate push-down feature, and the preliminary performance tests for JDBC Aggregate push-down are positive. So I think it is a valuable feature to add for Spark 3.2.
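
To make the idea concrete, this is roughly the kind of query that would benefit. The JDBC URL and table names below are placeholders for illustration only, and today the MAX/MIN are still computed inside Spark after a full scan:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{max, min}

    val spark = SparkSession.builder().appName("agg-pushdown-sketch").getOrCreate()

    // Placeholder connection details.
    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db.example.com:5432/shop")
      .option("dbtable", "orders")
      .load()

    // With aggregate push-down, MAX/MIN would be evaluated by the JDBC source
    // (or answered from Parquet/ORC statistics) rather than by scanning all
    // rows into Spark first.
    orders.agg(max("price"), min("price")).explain()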

Thanks,
Huaxin


Re: Apache Spark 3.2 Expectation

John Zhuge
In reply to this post by Dongjoon Hyun-2
Hi Dongjoon,

Is it possible to get ViewCatalog in? The community has already had fairly detailed discussions.
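
For anyone who has not followed the SPIP discussion, the shape being proposed is roughly the following. The method names and types here are my illustrative sketch, not the final API:

    // Illustrative only: placeholder types standing in for the real catalog classes.
    case class Identifier(namespace: Array[String], name: String)
    case class View(ident: Identifier, sql: String)

    trait ViewCatalog {
      def listViews(namespace: Array[String]): Array[Identifier]
      def loadView(ident: Identifier): View
      def createView(ident: Identifier, sql: String): View
      def dropView(ident: Identifier): Boolean
    }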

Thanks,
John



--
John Zhuge

Re: Apache Spark 3.2 Expectation

Dongjoon Hyun-2
Hi, John.

This thread aims to share your expectations and goals (and perhaps work progress) for Apache Spark 3.2, because we are making this together. :)

Bests,
Dongjoon.



Re: Apache Spark 3.2 Expectation

Dongjoon Hyun-2
Hi, Xiao.

This thread started 13 days ago. Since you asked the community about major features and timelines at that time, could you share your roadmap or expectations if you have something in mind?

> Thank you, Dongjoon, for initiating this discussion. Let us keep it open. It might take 1-2 weeks to collect from the community all the features we plan to build and ship in 3.2 since we just finished the 3.1 voting. 
> TBH, cutting the branch this April does not look good to me. That would leave only one month for feature development of Spark 3.2. Do we have enough features in the current master branch? If not, will we be able to finish the major features we have collected here? Do they have a timeline or project plan?

Bests,
Dongjoon.




Re: Apache Spark 3.2 Expectation

Xiao Li
Below are some nice-to-have features we can work on in Spark 3.2: lateral join support, the interval data type, timestamp without time zone, un-nesting arbitrary queries, returning metrics from DSv2, and error-message standardization. Spark 3.2 will be another exciting release, I believe!
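
For example, lateral join support would let a subquery reference columns from a preceding FROM item. The sketch below assumes the standard SQL LATERAL form, since the Spark syntax is not finalized yet, and it will not run until the feature lands; `customers` and `orders` are assumed to be registered temp views:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("lateral-sketch").getOrCreate()

    // Sketch only: assumes the standard SQL LATERAL keyword.
    spark.sql("""
      SELECT c.id, o.latest_order
      FROM customers c,
           LATERAL (SELECT MAX(order_date) AS latest_order
                    FROM orders o2
                    WHERE o2.customer_id = c.id) o
    """).show()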

Go Spark!

Xiao





Re: Apache Spark 3.2 Expectation

cloud0fan
There are many projects going on right now, such as the new DSv2 APIs, ANSI interval types, join improvements, and disaggregated shuffle. I don't think it's realistic to cut the branch in April.

I'm +1 on releasing 3.2 around July, but that doesn't mean we have to cut the branch three months earlier. We should make the release process faster and cut the branch around June, probably.


