Spark 3.0 preview release feature list and major changes


Jiang Xingbo
Hi all,

I went over all the finished JIRA tickets targeted at Spark 3.0.0. Below I'm listing all the notable features and major changes that are ready to test/deliver; please don't hesitate to add more to the list:

SPARK-11215 Multiple columns support added to various Transformers: StringIndexer

SPARK-11150 Implement Dynamic Partition Pruning

SPARK-13677 Support Tree-Based Feature Transformation

SPARK-16692 Add MultilabelClassificationEvaluator

SPARK-19591 Add sample weights to decision trees

SPARK-19712 Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.

SPARK-19827 R API for Power Iteration Clustering

SPARK-20286 Improve logic for timing out executors in dynamic allocation

SPARK-20636 Eliminate unnecessary shuffle with adjacent Window expressions

SPARK-22148 Acquire new executors to avoid hang because of blacklisting

SPARK-22796 Multiple columns support added to various Transformers: PySpark QuantileDiscretizer

SPARK-23128 A new approach to do adaptive execution in Spark SQL

SPARK-23674 Add Spark ML Listener for Tracking ML Pipeline Status

SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2

SPARK-24333 Add fit with validation set to Gradient Boosted Trees: Python API

SPARK-24417 Build and Run Spark on JDK11

SPARK-24615 Accelerator-aware task scheduling for Spark

SPARK-24920 Allow sharing Netty's memory pool allocators

SPARK-25250 Fix race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times

SPARK-25341 Support rolling back a shuffle map stage and re-generate the shuffle files

SPARK-25348 Data source for binary files

SPARK-25603 Generalize Nested Column Pruning

SPARK-26132 Remove support for Scala 2.11 in Spark 3.0.0

SPARK-26215 define reserved keywords after SQL standard

SPARK-26412 Allow Pandas UDF to take an iterator of pd.DataFrames

SPARK-26785 data source v2 API refactor: streaming write

SPARK-26956 remove streaming output mode from data source v2 APIs

SPARK-27064 create StreamingWrite at the beginning of streaming execution

SPARK-27119 Do not infer schema when reading Hive serde table with native data source

SPARK-27225 Implement join strategy hints

SPARK-27240 Use pandas DataFrame for struct type argument in Scalar Pandas UDF

SPARK-27338 Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator

SPARK-27396 Public APIs for extended Columnar Processing Support

SPARK-27589 Re-implement file sources with data source V2 API

SPARK-27677 Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation

SPARK-27699 Partially push down disjunctive predicates in Parquet/ORC

SPARK-27763 Port test cases from PostgreSQL to Spark SQL (ongoing)

SPARK-27884 Deprecate Python 2 support

SPARK-27921 Convert applicable *.sql tests into UDF integrated test base

SPARK-27963 Allow dynamic allocation without an external shuffle service

SPARK-28177 Adjust post shuffle partition number in adaptive execution

SPARK-28372 Document Spark WEB UI

SPARK-28399 RobustScaler feature transformer

SPARK-28426 Metadata Handling in Thrift Server

SPARK-28588 Build a SQL reference doc (ongoing)

SPARK-28608 Improve test coverage of ThriftServer

SPARK-28753 Dynamically reuse subqueries in AQE

SPARK-28855 Remove outdated Experimental, Evolving annotations

SPARK-25908 SPARK-28980 Remove deprecated items since <= 2.2.0


Cheers,

Xingbo

Re: Spark 3.0 preview release feature list and major changes

Jungtaek Lim-2
Thanks for putting together this nice summary of Spark 3.0 improvements!

I'd like to add some items from the Structured Streaming side:

SPARK-28199 Move Trigger implementations to Triggers.scala and avoid exposing these to the end users (removal of deprecated)
SPARK-23539 Add support for Kafka headers in Structured Streaming
SPARK-25501 Add kafka delegation token support (there were follow-up issues to add functionality such as multi-cluster support, etc.)
SPARK-26848 Introduce new option to Kafka source: offset by timestamp (starting/ending)
SPARK-28074 Log warn message on possible correctness issue for multiple stateful operations in single query

and from the core side:

SPARK-23155 New feature: apply custom log URL pattern for executor log URLs in SHS (a follow-up issue expanded the functionality to the Spark UI as well)

FYI, if we also count work in progress, there's an ongoing umbrella issue for rolling event log & snapshot (SPARK-28594), which we are struggling to get done for Spark 3.0.

Thanks,
Jungtaek Lim (HeartSaVioR)


On Tue, Oct 8, 2019 at 7:02 AM Xingbo Jiang <[hidden email]> wrote:

Re: Spark 3.0 preview release feature list and major changes

Hyukjin Kwon
Cogroup Pandas UDF missing:

SPARK-27463 Support Dataframe Cogroup via Pandas UDFs
Vectorized R execution:

SPARK-26759 Arrow optimization in SparkR's interoperability



On Tue, Oct 8, 2019 at 7:50 AM, Jungtaek Lim <[hidden email]> wrote:

Re: Spark 3.0 preview release feature list and major changes

Jiang Xingbo
Hi all,

Thanks for all the feedback. Here is the updated feature list:

SPARK-11215 Multiple columns support added to various Transformers: StringIndexer

SPARK-11150 Implement Dynamic Partition Pruning

SPARK-13677 Support Tree-Based Feature Transformation

SPARK-16692 Add MultilabelClassificationEvaluator

SPARK-19591 Add sample weights to decision trees

SPARK-19712 Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.

SPARK-19827 R API for Power Iteration Clustering

SPARK-20286 Improve logic for timing out executors in dynamic allocation

SPARK-20636 Eliminate unnecessary shuffle with adjacent Window expressions

SPARK-22148 Acquire new executors to avoid hang because of blacklisting

SPARK-22796 Multiple columns support added to various Transformers: PySpark QuantileDiscretizer

SPARK-23128 A new approach to do adaptive execution in Spark SQL

SPARK-23155 Apply custom log URL pattern for executor log URLs in SHS

SPARK-23539 Add support for Kafka headers

SPARK-23674 Add Spark ML Listener for Tracking ML Pipeline Status

SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2

SPARK-24333 Add fit with validation set to Gradient Boosted Trees: Python API

SPARK-24417 Build and Run Spark on JDK11

SPARK-24615 Accelerator-aware task scheduling for Spark

SPARK-24920 Allow sharing Netty's memory pool allocators

SPARK-25250 Fix race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times

SPARK-25341 Support rolling back a shuffle map stage and re-generate the shuffle files

SPARK-25348 Data source for binary files

SPARK-25501 Add kafka delegation token support

SPARK-25603 Generalize Nested Column Pruning

SPARK-26132 Remove support for Scala 2.11 in Spark 3.0.0

SPARK-26215 define reserved keywords after SQL standard

SPARK-26412 Allow Pandas UDF to take an iterator of pd.DataFrames

SPARK-26759 Arrow optimization in SparkR's interoperability

SPARK-26785 data source v2 API refactor: streaming write

SPARK-26848 Introduce new option to Kafka source: offset by timestamp (starting/ending)

SPARK-26956 remove streaming output mode from data source v2 APIs

SPARK-27064 create StreamingWrite at the beginning of streaming execution

SPARK-27119 Do not infer schema when reading Hive serde table with native data source

SPARK-27225 Implement join strategy hints

SPARK-27240 Use pandas DataFrame for struct type argument in Scalar Pandas UDF

SPARK-27338 Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator

SPARK-27396 Public APIs for extended Columnar Processing Support

SPARK-27463 Support Dataframe Cogroup via Pandas UDFs

SPARK-27589 Re-implement file sources with data source V2 API

SPARK-27677 Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation

SPARK-27699 Partially push down disjunctive predicates in Parquet/ORC

SPARK-27763 Port test cases from PostgreSQL to Spark SQL

SPARK-27884 Deprecate Python 2 support

SPARK-27921 Convert applicable *.sql tests into UDF integrated test base

SPARK-27963 Allow dynamic allocation without an external shuffle service

SPARK-28177 Adjust post shuffle partition number in adaptive execution

SPARK-28199 Move Trigger implementations to Triggers.scala and avoid exposing these to the end users 

SPARK-28372 Document Spark WEB UI

SPARK-28399 RobustScaler feature transformer

SPARK-28426 Metadata Handling in Thrift Server

SPARK-28588 Build a SQL reference doc

SPARK-28608 Improve test coverage of ThriftServer

SPARK-28753 Dynamically reuse subqueries in AQE

SPARK-28855 Remove outdated Experimental, Evolving annotations

SPARK-25908 SPARK-28980 Remove deprecated items since <= 2.2.0


Cheers,

Xingbo

On Mon, Oct 7, 2019 at 9:29 PM, Hyukjin Kwon <[hidden email]> wrote:

Re: Spark 3.0 preview release feature list and major changes

Li Jin
Thanks for the summary!

I have a semi-related question: What's the process to propose a feature to be included in the final Spark 3.0 release?

In particular, I am interested in https://issues.apache.org/jira/browse/SPARK-28006. I am happy to do the work, so I want to make sure I don't miss the "cut" date.

On Tue, Oct 8, 2019 at 4:53 PM Xingbo Jiang <[hidden email]> wrote:

Re: Spark 3.0 preview release feature list and major changes

Jiang Xingbo
> What's the process to propose a feature to be included in the final Spark 3.0 release?

I don't know of any specific process here. Normally you just merge the feature into Spark master before the release code freeze, and then it will probably be included in the release. The code freeze date for Spark 3.0 has not been decided yet, though.

On Tue, Oct 8, 2019 at 2:14 PM, Li Jin <[hidden email]> wrote:
Thanks for the summary!

I have a semi-related question: what's the process to propose a feature to be included in the final Spark 3.0 release?

In particular, I am interested in https://issues.apache.org/jira/browse/SPARK-28006. I am happy to do the work, so I want to make sure I don't miss the "cut" date.

On Tue, Oct 8, 2019 at 4:53 PM Xingbo Jiang <[hidden email]> wrote:
Hi all,

Thanks for all the feedback. Here is the updated feature list:

SPARK-11215 Multiple columns support added to various Transformers: StringIndexer

SPARK-11150 Implement Dynamic Partition Pruning

SPARK-13677 Support Tree-Based Feature Transformation

SPARK-16692 Add MultilabelClassificationEvaluator

SPARK-19591 Add sample weights to decision trees

SPARK-19712 Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.

SPARK-19827 R API for Power Iteration Clustering

SPARK-20286 Improve logic for timing out executors in dynamic allocation

SPARK-20636 Eliminate unnecessary shuffle with adjacent Window expressions

SPARK-22148 Acquire new executors to avoid hang because of blacklisting

SPARK-22796 Multiple columns support added to various Transformers: PySpark QuantileDiscretizer

SPARK-23128 A new approach to do adaptive execution in Spark SQL

SPARK-23155 Apply custom log URL pattern for executor log URLs in SHS

SPARK-23539 Add support for Kafka headers

SPARK-23674 Add Spark ML Listener for Tracking ML Pipeline Status

SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2

SPARK-24333 Add fit with validation set to Gradient Boosted Trees: Python API

SPARK-24417 Build and Run Spark on JDK11

SPARK-24615 Accelerator-aware task scheduling for Spark

SPARK-24920 Allow sharing Netty's memory pool allocators

SPARK-25250 Fix race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times

SPARK-25341 Support rolling back a shuffle map stage and re-generate the shuffle files

SPARK-25348 Data source for binary files

SPARK-25501 Add kafka delegation token support

SPARK-25603 Generalize Nested Column Pruning

SPARK-26132 Remove support for Scala 2.11 in Spark 3.0.0

SPARK-26215 define reserved keywords after SQL standard

SPARK-26412 Allow Pandas UDF to take an iterator of pd.DataFrames

SPARK-26759 Arrow optimization in SparkR's interoperability

SPARK-26785 data source v2 API refactor: streaming write

SPARK-26848 Introduce new option to Kafka source: offset by timestamp (starting/ending)

SPARK-26956 remove streaming output mode from data source v2 APIs

SPARK-27064 create StreamingWrite at the beginning of streaming execution

SPARK-27119 Do not infer schema when reading Hive serde table with native data source

SPARK-27225 Implement join strategy hints

SPARK-27240 Use pandas DataFrame for struct type argument in Scalar Pandas UDF

SPARK-27338 Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator

SPARK-27396 Public APIs for extended Columnar Processing Support

SPARK-27463 Support Dataframe Cogroup via Pandas UDFs

SPARK-27589 Re-implement file sources with data source V2 API

SPARK-27677 Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation

SPARK-27699 Partially push down disjunctive predicates in Parquet/ORC

SPARK-27763 Port test cases from PostgreSQL to Spark SQL

SPARK-27884 Deprecate Python 2 support

SPARK-27921 Convert applicable *.sql tests into UDF integrated test base

SPARK-27963 Allow dynamic allocation without an external shuffle service

SPARK-28177 Adjust post shuffle partition number in adaptive execution

SPARK-28199 Move Trigger implementations to Triggers.scala and avoid exposing these to the end users 

SPARK-28372 Document Spark WEB UI

SPARK-28399 RobustScaler feature transformer

SPARK-28426 Metadata Handling in Thrift Server

SPARK-28588 Build a SQL reference doc

SPARK-28608 Improve test coverage of ThriftServer

SPARK-28753 Dynamically reuse subqueries in AQE

SPARK-28855 Remove outdated Experimental, Evolving annotations

SPARK-25908 SPARK-28980 Remove deprecated items since <= 2.2.0


Cheers,

Xingbo

On Mon, Oct 7, 2019 at 9:29 PM Hyukjin Kwon <[hidden email]> wrote:
Cogroup Pandas UDF is missing:

SPARK-27463 Support Dataframe Cogroup via Pandas UDFs

Vectorized R execution:

SPARK-26759 Arrow optimization in SparkR's interoperability



On Tue, Oct 8, 2019 at 7:50 AM Jungtaek Lim <[hidden email]> wrote:
Thanks for putting together this nice summary of Spark 3.0 improvements!

I'd like to add some items from the Structured Streaming side:

SPARK-28199 Move Trigger implementations to Triggers.scala and avoid exposing these to the end users (removal of deprecated)
SPARK-23539 Add support for Kafka headers in Structured Streaming
SPARK-25501 Add kafka delegation token support (there were follow-up issues adding functionality such as multi-cluster support)
SPARK-26848 Introduce new option to Kafka source: offset by timestamp (starting/ending)
SPARK-28074 Log warn message on possible correctness issue for multiple stateful operations in single query

and from the core side:

SPARK-23155 New feature: apply custom log URL pattern for executor log URLs in SHS (follow-up issue expanded the functionality to Spark UI as well)

FYI, if we count current work in progress, there is an ongoing umbrella issue for rolling event logs and snapshots (SPARK-28594), which we are struggling to get done in Spark 3.0.

Thanks,
Jungtaek Lim (HeartSaVioR)



Re: Spark 3.0 preview release feature list and major changes

Dongjoon Hyun-2
Thank you for the preparation of 3.0-preview, Xingbo!

Bests,
Dongjoon.


Re: Spark 3.0 preview release feature list and major changes

cloud0fan
Regarding DS v2, I'd like to remove
SPARK-26785 data source v2 API refactor: streaming write
SPARK-26956 remove streaming output mode from data source v2 APIs

and put the umbrella ticket instead
SPARK-25390 data source V2 API refactoring

Thanks,
Wenchen
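
The change proposed above (drop the two DS v2 sub-tickets and list the umbrella ticket instead) can be sketched as a small list operation. This is only an illustration: the `(ticket, title)` tuple layout, the `apply_umbrella` helper, and the choice to keep entries sorted by ticket number are assumptions made for the example, not part of any Spark release tooling.

```python
# Hypothetical helper: the data layout and function name are assumptions
# made for illustration only.
def apply_umbrella(entries, remove_ids, umbrella):
    """Drop the sub-tickets in remove_ids and add the umbrella ticket once.

    entries  -- list of (ticket_id, title) tuples
    umbrella -- a single (ticket_id, title) tuple
    The result is sorted by ticket number (an assumption; the list in the
    thread is only roughly ordered).
    """
    kept = [e for e in entries if e[0] not in remove_ids]
    if umbrella[0] not in {e[0] for e in kept}:
        kept.append(umbrella)
    return sorted(kept, key=lambda e: int(e[0].split("-")[1]))


# A short excerpt of the feature list from this thread.
features = [
    ("SPARK-26412", "Allow Pandas UDF to take an iterator of pd.DataFrames"),
    ("SPARK-26785", "data source v2 API refactor: streaming write"),
    ("SPARK-26956", "remove streaming output mode from data source v2 APIs"),
    ("SPARK-27064", "create StreamingWrite at the beginning of streaming execution"),
]

updated = apply_umbrella(
    features,
    remove_ids={"SPARK-26785", "SPARK-26956"},
    umbrella=("SPARK-25390", "data source V2 API refactoring"),
)

for ticket, title in updated:
    print(ticket, title)
```

Running the sketch on this excerpt leaves SPARK-25390, SPARK-26412 and SPARK-27064, in that order.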


Re: Spark 3.0 preview release feature list and major changes

Xiao Li-2
SPARK-29345 Add an API that allows a user to define and observe arbitrary metrics on streaming queries

Let us add this too. 

Cheers,

Xiao

On Tue, Oct 8, 2019 at 10:31 PM Wenchen Fan <[hidden email]> wrote:
Regarding DS v2, I'd like to remove
SPARK-26785 data source v2 API refactor: streaming write
SPARK-26956 remove streaming output mode from data source v2 APIs

and put the umbrella ticket instead
SPARK-25390 data source V2 API refactoring

Thanks,
Wenchen

On Wed, Oct 9, 2019 at 1:19 PM Dongjoon Hyun <[hidden email]> wrote:
Thank you for the preparation of 3.0-preview, Xingbo!

Bests,
Dongjoon.

On Tue, Oct 8, 2019 at 2:32 PM Xingbo Jiang <[hidden email]> wrote:
 What's the process to propose a feature to be included in the final Spark 3.0 release? 

I don't know whether there exists any specific process here, normally you just merge the feature into Spark master before release code freeze, and then the feature would probably be included in the release. The code freeze date for Spark 3.0 has not been decided yet, though.

Li Jin <[hidden email]> 于2019年10月8日周二 下午2:14写道:
Thanks for summary! 

I have a question that is semi-related - What's the process to propose a feature to be included in the final Spark 3.0 release? 

In particular, I am interested in https://issues.apache.org/jira/browse/SPARK-28006.  I am happy to do the work so want to make sure I don't miss the "cut" date.

On Tue, Oct 8, 2019 at 4:53 PM Xingbo Jiang <[hidden email]> wrote:
Hi all,

Thanks for all the feedbacks, here is the updated feature list:

SPARK-11215 Multiple columns support added to various Transformers: StringIndexer

SPARK-11150 Implement Dynamic Partition Pruning

SPARK-13677 Support Tree-Based Feature Transformation

SPARK-16692 Add MultilabelClassificationEvaluator

SPARK-19591 Add sample weights to decision trees

SPARK-19712 Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.

SPARK-19827 R API for Power Iteration Clustering

SPARK-20286 Improve logic for timing out executors in dynamic allocation

SPARK-20636 Eliminate unnecessary shuffle with adjacent Window expressions

SPARK-22148 Acquire new executors to avoid hang because of blacklisting

SPARK-22796 Multiple columns support added to various Transformers: PySpark QuantileDiscretizer

SPARK-23128 A new approach to do adaptive execution in Spark SQL

SPARK-23155 Apply custom log URL pattern for executor log URLs in SHS

SPARK-23539 Add support for Kafka headers

SPARK-23674 Add Spark ML Listener for Tracking ML Pipeline Status

SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2

SPARK-24333 Add fit with validation set to Gradient Boosted Trees: Python API

SPARK-24417 Build and Run Spark on JDK11

SPARK-24615 Accelerator-aware task scheduling for Spark

SPARK-24920 Allow sharing Netty's memory pool allocators

SPARK-25250 Fix race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times

SPARK-25341 Support rolling back a shuffle map stage and re-generate the shuffle files

SPARK-25348 Data source for binary files

SPARK-25501 Add Kafka delegation token support

SPARK-25603 Generalize Nested Column Pruning

SPARK-26132 Remove support for Scala 2.11 in Spark 3.0.0

SPARK-26215 define reserved keywords after SQL standard

SPARK-26412 Allow Pandas UDF to take an iterator of pd.DataFrames

SPARK-26759 Arrow optimization in SparkR's interoperability

SPARK-26785 data source v2 API refactor: streaming write

SPARK-26848 Introduce new option to Kafka source: offset by timestamp (starting/ending)

SPARK-26956 remove streaming output mode from data source v2 APIs

SPARK-27064 create StreamingWrite at the beginning of streaming execution

SPARK-27119 Do not infer schema when reading Hive serde table with native data source

SPARK-27225 Implement join strategy hints

SPARK-27240 Use pandas DataFrame for struct type argument in Scalar Pandas UDF

SPARK-27338 Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator

SPARK-27396 Public APIs for extended Columnar Processing Support

SPARK-27463 Support Dataframe Cogroup via Pandas UDFs

SPARK-27589 Re-implement file sources with data source V2 API

SPARK-27677 Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation

SPARK-27699 Partially push down disjunctive predicates in Parquet/ORC

SPARK-27763 Port test cases from PostgreSQL to Spark SQL

SPARK-27884 Deprecate Python 2 support

SPARK-27921 Convert applicable *.sql tests into UDF integrated test base

SPARK-27963 Allow dynamic allocation without an external shuffle service

SPARK-28177 Adjust post shuffle partition number in adaptive execution

SPARK-28199 Move Trigger implementations to Triggers.scala and avoid exposing these to the end users 

SPARK-28372 Document Spark WEB UI

SPARK-28399 RobustScaler feature transformer

SPARK-28426 Metadata Handling in Thrift Server

SPARK-28588 Build a SQL reference doc

SPARK-28608 Improve test coverage of ThriftServer

SPARK-28753 Dynamically reuse subqueries in AQE

SPARK-28855 Remove outdated Experimental, Evolving annotations

SPARK-25908 SPARK-28980 Remove deprecated items since <= 2.2.0


Cheers,

Xingbo

On Mon, Oct 7, 2019 at 9:29 PM Hyukjin Kwon <[hidden email]> wrote:
The Cogroup Pandas UDF is missing:

SPARK-27463 Support Dataframe Cogroup via Pandas UDFs

Vectorized R execution:

SPARK-26759 Arrow optimization in SparkR's interoperability



On Tue, Oct 8, 2019 at 7:50 AM Jungtaek Lim <[hidden email]> wrote:
Thanks for putting together this nice summary of Spark 3.0 improvements!

I'd like to add some items from the Structured Streaming side:

SPARK-28199 Move Trigger implementations to Triggers.scala and avoid exposing these to the end users (removal of deprecated)
SPARK-23539 Add support for Kafka headers in Structured Streaming
SPARK-25501 Add Kafka delegation token support (there were follow-up issues adding functionality such as multi-cluster support, etc.)
SPARK-26848 Introduce new option to Kafka source: offset by timestamp (starting/ending)
SPARK-28074 Log warn message on possible correctness issue for multiple stateful operations in single query

and from the core side:

SPARK-23155 New feature: apply custom log URL pattern for executor log URLs in SHS (a follow-up issue expanded the functionality to the Spark UI as well)

FYI, if we also count work in progress, there's an ongoing umbrella issue for rolling event logs & snapshots (SPARK-28594), which we are struggling to get done for Spark 3.0.

Thanks,
Jungtaek Lim (HeartSaVioR)



Re: Spark 3.0 preview release feature list and major changes

antonkulaga
I think SPARK-28547 for sure:
<https://issues.apache.org/jira/projects/SPARK/issues/SPARK-28547>
At the moment there are some flaws in the Spark architecture: it performs
miserably or even freezes wherever the number of columns exceeds 10-15K
(even a simple describe call takes ages, while the same function in
pandas without Spark takes seconds). In many fields (like bioinformatics), wide
datasets with large numbers of both rows and columns are very common (gene
expression data is a good example), and Spark is totally useless there.




Re: Spark 3.0 preview release feature list and major changes

Sean Owen-2
See the JIRA - this is too open-ended and not obviously just due to
choices in data representation, what you're trying to do, etc. It's
correctly closed IMHO.
However, identifying the issue more narrowly, and something that looks
ripe for optimization, would be useful.



Re: Spark 3.0 preview release feature list and major changes

Jiang Xingbo
Hi all,

Here is the updated feature list:


SPARK-11215 Multiple columns support added to various Transformers: StringIndexer

SPARK-11150 Implement Dynamic Partition Pruning

SPARK-13677 Support Tree-Based Feature Transformation

SPARK-16692 Add MultilabelClassificationEvaluator

SPARK-19591 Add sample weights to decision trees

SPARK-19712 Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.

SPARK-19827 R API for Power Iteration Clustering

SPARK-20286 Improve logic for timing out executors in dynamic allocation

SPARK-20636 Eliminate unnecessary shuffle with adjacent Window expressions

SPARK-22148 Acquire new executors to avoid hang because of blacklisting

SPARK-22796 Multiple columns support added to various Transformers: PySpark QuantileDiscretizer

SPARK-23128 A new approach to do adaptive execution in Spark SQL

SPARK-23155 Apply custom log URL pattern for executor log URLs in SHS

SPARK-23539 Add support for Kafka headers

SPARK-23674 Add Spark ML Listener for Tracking ML Pipeline Status

SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2

SPARK-24333 Add fit with validation set to Gradient Boosted Trees: Python API

SPARK-24417 Build and Run Spark on JDK11

SPARK-24615 Accelerator-aware task scheduling for Spark

SPARK-24920 Allow sharing Netty's memory pool allocators

SPARK-25250 Fix race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times

SPARK-25341 Support rolling back a shuffle map stage and re-generate the shuffle files

SPARK-25348 Data source for binary files

SPARK-25390 data source V2 API refactoring

SPARK-25501 Add Kafka delegation token support

SPARK-25603 Generalize Nested Column Pruning

SPARK-26132 Remove support for Scala 2.11 in Spark 3.0.0

SPARK-26215 define reserved keywords after SQL standard

SPARK-26412 Allow Pandas UDF to take an iterator of pd.DataFrames

SPARK-26651 Use Proleptic Gregorian calendar

SPARK-26759 Arrow optimization in SparkR's interoperability

SPARK-26848 Introduce new option to Kafka source: offset by timestamp (starting/ending)

SPARK-27064 create StreamingWrite at the beginning of streaming execution

SPARK-27119 Do not infer schema when reading Hive serde table with native data source

SPARK-27225 Implement join strategy hints

SPARK-27240 Use pandas DataFrame for struct type argument in Scalar Pandas UDF

SPARK-27338 Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator

SPARK-27396 Public APIs for extended Columnar Processing Support

SPARK-27463 Support Dataframe Cogroup via Pandas UDFs

SPARK-27589 Re-implement file sources with data source V2 API

SPARK-27677 Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation

SPARK-27699 Partially push down disjunctive predicates in Parquet/ORC

SPARK-27763 Port test cases from PostgreSQL to Spark SQL

SPARK-27884 Deprecate Python 2 support

SPARK-27921 Convert applicable *.sql tests into UDF integrated test base

SPARK-27963 Allow dynamic allocation without an external shuffle service

SPARK-28177 Adjust post shuffle partition number in adaptive execution

SPARK-28199 Move Trigger implementations to Triggers.scala and avoid exposing these to the end users 

SPARK-28372 Document Spark WEB UI

SPARK-28399 RobustScaler feature transformer

SPARK-28426 Metadata Handling in Thrift Server

SPARK-28588 Build a SQL reference doc

SPARK-28608 Improve test coverage of ThriftServer

SPARK-28753 Dynamically reuse subqueries in AQE

SPARK-28855 Remove outdated Experimental, Evolving annotations

SPARK-29345 Add an API that allows a user to define and observe arbitrary metrics on streaming queries

SPARK-25908 SPARK-28980 Remove deprecated items since <= 2.2.0


Cheers,

Xingbo




Re: Spark 3.0 preview release feature list and major changes

Weichen Xu
Wait... I have some additions:

New API:
SPARK-25097 Support prediction on single instance in KMeans/BiKMeans/GMM
SPARK-28045 add missing RankingEvaluator
SPARK-29121 Support Dot Product for Vectors

Behavior change or new API with behavior change:
SPARK-23265 Update multi-column error handling logic in QuantileDiscretizer
SPARK-22798 Add multiple column support to PySpark StringIndexer
SPARK-11215 Add multiple columns support to StringIndexer
SPARK-24102 RegressionEvaluator should use sample weight data
SPARK-24101 MulticlassClassificationEvaluator should use sample weight data
SPARK-24103 BinaryClassificationEvaluator should use sample weight data
SPARK-23469 HashingTF should use corrected MurmurHash3 implementation

Deprecated API removal:
SPARK-25382 Remove ImageSchema.readImages in 3.0
SPARK-26133 Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder
SPARK-25867 Remove KMeans computeCost
SPARK-28243 remove setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams

Thanks!

Weichen



Re: Spark 3.0 preview release feature list and major changes

Erik Erlandson-2
I'd like to get SPARK-27296 onto 3.0:
SPARK-27296 Efficient User Defined Aggregators


