Spark 3.0 preview release on-going features discussion


Spark 3.0 preview release on-going features discussion

Jiang Xingbo
Hi all,

Let's start a new thread to discuss the on-going features for Spark 3.0 preview release.

Below is the feature list for the Spark 3.0 preview release. The list is collected from the previous discussions on the dev list.

Followup of the shuffle+repartition correctness issue: support rolling back shuffle stages (https://issues.apache.org/jira/browse/SPARK-25341)
Upgrade the built-in Hive to 2.3.5 for hadoop-3.2 (https://issues.apache.org/jira/browse/SPARK-23710)
JDK 11 support (https://issues.apache.org/jira/browse/SPARK-28684)
Scala 2.13 support (https://issues.apache.org/jira/browse/SPARK-25075)
DataSourceV2 features:
  Enable file source v2 writers (https://issues.apache.org/jira/browse/SPARK-27589)
  CREATE TABLE USING with DataSourceV2
  New pushdown API for DataSourceV2
  Support DELETE/UPDATE/MERGE operations in DataSourceV2 (https://issues.apache.org/jira/browse/SPARK-28303)
Correctness issue: stream-stream left outer join gives inconsistent output (https://issues.apache.org/jira/browse/SPARK-26154)
Revisiting Python / pandas UDFs (https://issues.apache.org/jira/browse/SPARK-28264)
Spark Graph (https://issues.apache.org/jira/browse/SPARK-25994)

Features that are nice to have:

Use remote storage for persisting shuffle data (https://issues.apache.org/jira/browse/SPARK-25299)
Spark + Hadoop + Parquet + Avro compatibility problems (https://issues.apache.org/jira/browse/SPARK-25588)
Introduce a new option to the Kafka source to specify a timestamp for the start and end offsets (https://issues.apache.org/jira/browse/SPARK-26848)
Delete files after processing in Structured Streaming (https://issues.apache.org/jira/browse/SPARK-20568)

Here, I am proposing to cut the branch on October 15th. If your features are targeting the 3.0 preview release, please prioritize the work and finish it before that date. Note that Oct. 15th is not the code freeze for Spark 3.0: the community will keep working on features for the upcoming Spark 3.0 release even if they are not included in the preview release. The goal of the preview release is to collect more feedback from the community on the new 3.0 features and behavior changes.
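As a concrete illustration of one of the nice-to-have items collected in this thread, SPARK-26848 (Kafka source: specify a timestamp for the start and end offsets): the idea is to let a query begin and end at per-partition timestamps instead of raw offsets. Below is a minimal sketch of building such an option value, assuming a JSON topic -> {partition: timestamp-in-ms} shape and the `startingOffsetsByTimestamp` option name from the proposal; the final API in Spark may differ.

```python
import json

# Hypothetical helper (names are from the SPARK-26848 proposal, not a
# confirmed API): builds the JSON value for a startingOffsetsByTimestamp-style
# option, mapping topic -> {partition: timestamp in milliseconds}.
def offsets_by_timestamp(topic, partitions, ts_ms):
    return json.dumps({topic: {str(p): ts_ms for p in partitions}})

opt = offsets_by_timestamp("events", [0, 1, 2], 1569000000000)
print(opt)
# Conceptually this would then be passed to a Kafka reader, e.g.
# .option("startingOffsetsByTimestamp", opt)
```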

Thanks!

Re: Spark 3.0 preview release on-going features discussion

cloud0fan
> New pushdown API for DataSourceV2

One correction: I want to revisit the pushdown API to make sure it works for dynamic partition pruning and can be extended to support limit/aggregate/... pushdown in the future. It should be a small API update instead of a new API.
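To make that direction concrete, here is a sketch in Python with made-up names (this is not Spark's actual DataSourceV2 API) of how a pushdown interface can stay small while remaining extensible: each pushdown kind is an optional capability interface, so limit/aggregate pushdown can be added later without breaking existing sources.

```python
# Hypothetical scan-builder-style pushdown design (illustrative only; names
# and shapes are invented, not Spark's real API).

class ScanBuilder:
    """Base builder: sources implement only the capability mixins they support."""
    def build(self):
        raise NotImplementedError

class SupportsPushDownFilters:
    def push_filters(self, filters):
        """Accept the filters the source can evaluate; return the residuals."""
        raise NotImplementedError

class SupportsPushDownLimit:
    def push_limit(self, n):
        """Return True if the limit was fully pushed into the source."""
        raise NotImplementedError

class MyScanBuilder(ScanBuilder, SupportsPushDownFilters, SupportsPushDownLimit):
    def __init__(self):
        self.filters, self.limit = [], None

    def push_filters(self, filters):
        # Toy rule: this source only understands predicates on "col*" columns.
        pushed = [f for f in filters if f.startswith("col")]
        self.filters = pushed
        return [f for f in filters if f not in pushed]  # residual filters

    def push_limit(self, n):
        self.limit = n
        return True

    def build(self):
        return {"filters": self.filters, "limit": self.limit}

# The planner probes capabilities with isinstance checks, so a new pushdown
# kind (aggregate, ...) is just a new mixin; ScanBuilder itself never changes.
builder = MyScanBuilder()
residual = builder.push_filters(["col_a > 1", "rand() < 0.5"])
if isinstance(builder, SupportsPushDownLimit):
    builder.push_limit(10)
print(builder.build(), residual)
```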


Re: Spark 3.0 preview release on-going features discussion

Sean Owen-2
In reply to this post by Jiang Xingbo
Is this a list of items that might be focused on for the final 3.0
release? At least, Scala 2.13 support shouldn't be on that list. The
others look plausible, or are already done, but there are probably
more.

As for the 3.0 preview, I wouldn't necessarily block on any particular
feature, though, yes, the more work that can go into important items
between now and then, the better.
I wouldn't necessarily present any list of things that will or might
be in 3.0 with that preview; just list the things that are done, like
JDK 11 support.

On Fri, Sep 20, 2019 at 2:46 AM Xingbo Jiang <[hidden email]> wrote:

>
> Hi all,
>
> Let's start a new thread to discuss the on-going features for Spark 3.0 preview release.
>
> Below is the feature list for the Spark 3.0 preview release. The list is collected from the previous discussions in the dev list.
>
> Followup of the shuffle+repartition correctness issue: support roll back shuffle stages (https://issues.apache.org/jira/browse/SPARK-25341)
> Upgrade the built-in Hive to 2.3.5 for hadoop-3.2 (https://issues.apache.org/jira/browse/SPARK-23710)
> JDK 11 support (https://issues.apache.org/jira/browse/SPARK-28684)
> Scala 2.13 support (https://issues.apache.org/jira/browse/SPARK-25075)
> DataSourceV2 features
>
> Enable file source v2 writers (https://issues.apache.org/jira/browse/SPARK-27589)
> CREATE TABLE USING with DataSourceV2
> New pushdown API for DataSourceV2
> Support DELETE/UPDATE/MERGE Operations in DataSourceV2 (https://issues.apache.org/jira/browse/SPARK-28303)
>
> Correctness issue: Stream-stream joins - left outer join gives inconsistent output (https://issues.apache.org/jira/browse/SPARK-26154)
> Revisiting Python / pandas UDF (https://issues.apache.org/jira/browse/SPARK-28264)
> Spark Graph (https://issues.apache.org/jira/browse/SPARK-25994)
>
> Features that are nice to have:
>
> Use remote storage for persisting shuffle data (https://issues.apache.org/jira/browse/SPARK-25299)
> Spark + Hadoop + Parquet + Avro compatibility problems (https://issues.apache.org/jira/browse/SPARK-25588)
> Introduce new option to Kafka source - specify timestamp to start and end offset (https://issues.apache.org/jira/browse/SPARK-26848)
> Delete files after processing in structured streaming (https://issues.apache.org/jira/browse/SPARK-20568)
>
> Here, I am proposing to cut the branch on October 15th. If your features are targeting the 3.0 preview release, please prioritize the work and finish it before that date. Note that Oct. 15th is not the code freeze for Spark 3.0: the community will keep working on features for the upcoming Spark 3.0 release even if they are not included in the preview release. The goal of the preview release is to collect more feedback from the community on the new 3.0 features and behavior changes.
>
> Thanks!

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Spark 3.0 preview release on-going features discussion

Dongjoon Hyun-2
Thank you for the summarization, Xingbo.

I also agree with Sean, because I don't think those should block the 3.0.0 preview release.
In particular, correctness issues should not be on that list.

Instead, could you summarize what we have as of now for 3.0.0 preview?

I believe JDK11 (SPARK-28684) and Hive 2.3.5 (SPARK-23710) will be in the what-we-have list for 3.0.0 preview.

Bests,
Dongjoon.


Re: Spark 3.0 preview release on-going features discussion

Ryan Blue

I’m not sure that DSv2 list is accurate. We discussed this in the DSv2 sync this week (just sent out the notes) and came up with these items:

  • Finish TableProvider update to avoid another API change: pass all table config from metastore
  • Catalog behavior fix: https://issues.apache.org/jira/browse/SPARK-29014
  • Stats push-down fix: move push-down to the optimizer
  • Make DataFrameWriter compatible when updating a source from v1 to v2, by adding extractCatalogName and extractIdentifier to TableProvider

Some of the ideas that came up, like changing the pushdown API, were passed on because it is too close to the release to reasonably get the changes done without a serious delay (like the API changes just before the 2.4 release).





--
Ryan Blue
Software Engineer
Netflix

Re: Spark 3.0 preview release on-going features discussion

Jiang Xingbo
Thanks, everyone. Let me first put together the list of features and major changes that have already been finished in the master branch.

Cheers!

Xingbo
