time for Apache Spark 3.0?

time for Apache Spark 3.0?

rxin
There was a discussion thread on scala-contributors about Apache Spark not yet supporting Scala 2.12, and that got me thinking that it is perhaps about time for Spark to work towards the 3.0 release. By the time it comes out, it will be more than 2 years since Spark 2.0.

For contributors less familiar with Spark’s history, I want to give more context on Spark releases:

1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If we were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0 in 2018.

2. Spark’s versioning policy promises that Spark does not break stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to 2.0, 2.x to 3.0).

3. That said, a major version isn’t necessarily a playground for disruptive API changes that make it painful for users to update. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs.

4. Spark as a project has a culture of evolving architecture and developing major new features incrementally, so major releases are not the only time for exciting new features. For example, the bulk of the work in the move towards the DataFrame API was done in Spark 1.3, and Continuous Processing was introduced in Spark 2.3. Both were feature releases rather than major releases.

You can find more background in the thread discussing Spark 2.0: http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html


The primary motivating factor IMO for a major version bump is to support Scala 2.12, which requires minor API breaking changes to Spark’s APIs. Similar to Spark 2.0, I think there are also opportunities for other changes that we know have been biting us for a long time but can’t be changed in feature releases (to be clear, I’m actually not sure they are all good ideas, but I’m writing them down as candidates for consideration):

1. Support Scala 2.12.

2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 2.x.

3. Shade all dependencies.

4. Change the reserved keywords in Spark SQL to be more ANSI-SQL compliant, to prevent users from shooting themselves in the foot, e.g. “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it less painful for users to upgrade here, I’d suggest creating a flag for backward compatibility mode. (A small sketch illustrating this ambiguity appears right after this list.)

5. Similar to 4, make our type coercion rule in DataFrame/SQL more standard compliant, and have a flag for backward compatibility.

6. Miscellaneous other small changes documented in JIRA already (e.g. “JavaPairRDD flatMapValues requires function returning Iterable, not Iterator”, “Prevent column name duplication in temporary view”).
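
To make the reserved-keyword point in item 4 (and the coercion point in item 5) concrete, here is a minimal, illustrative sketch using the public SparkSession API. The object name is made up, and the comments describe the two possible readings rather than asserting how any particular Spark version actually parses or coerces these expressions.

    import org.apache.spark.sql.SparkSession

    // Illustrative only: shows the two readings a parser has to choose between,
    // not the actual behavior of a specific Spark release.
    object ReservedKeywordSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("reserved-keyword-sketch")
          .master("local[*]")
          .getOrCreate()

        // Reading 1: SECOND is a column alias, i.e. a column named SECOND holding 2.
        spark.sql("SELECT 2 AS SECOND").show()

        // Reading 2: SECOND is an interval unit, i.e. an interval of 2 seconds.
        spark.sql("SELECT INTERVAL 2 SECONDS").show()

        // Item 5's concern is similar: mixing types leans on implicit coercion rules
        // (e.g. comparing a string with an integer) that differ from the SQL standard.
        spark.sql("SELECT '2' = 2").show()

        spark.stop()
      }
    }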

Now the reality of a major version bump is that the world often thinks in terms of what exciting features are coming. I do think there are a number of major changes happening already that can be part of the 3.0 release, if they make it in:

1. Scala 2.12 support (listing it twice)
2. Continuous Processing non-experimental
3. Kubernetes support non-experimental
4. A more fleshed-out version of the data source API v2 (I don’t think it is realistic to stabilize it in one release)
5. Hadoop 3.0 support
6. ...


Similar to the 2.0 discussion, this thread should focus on the framework and whether it’d make sense to create Spark 3.0 as the next release, rather than the individual feature requests. Those are important but are best done in their own separate threads.





Re: time for Apache Spark 3.0?

Sean Owen
On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin <[hidden email]> wrote:
The primary motivating factor IMO for a major version bump is to support Scala 2.12, which requires minor API breaking changes to Spark’s APIs. Similar to Spark 2.0, I think there are also opportunities for other changes that we know have been biting us for a long time but can’t be changed in feature releases (to be clear, I’m actually not sure they are all good ideas, but I’m writing them down as candidates for consideration):

IIRC from looking at this, it is possible to support 2.11 and 2.12 simultaneously. The cross-build already works now in 2.3.0. Barring some big change needed to get 2.12 fully working -- and that may be the case -- it nearly works that way now.

Compiling against 2.11 and 2.12 does, however, result in some APIs that differ in bytecode. But Scala itself isn't binary compatible between 2.11 and 2.12 anyway; that has never been promised.

(Interesting question about what *Java* users should expect; they would see a difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)

I don't disagree with shooting for Spark 3.0, just saying I don't know if 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping 2.11 support if needed to make supporting 2.12 less painful.

Re: time for Apache Spark 3.0?

Mark Hamstra
As with Sean, I'm not sure that this will require a new major version, but we should also be looking at Java 9 & 10 support -- particularly with regard to their better functionality in a containerized environment (memory limits from cgroups, not sysconf; support for cpusets). In that regard, we should also be looking at using the latest Scala 2.11.x maintenance release in current Spark branches.
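
As a small, self-contained illustration of the cgroups point (a sketch, not Spark code; the object name is invented): the snippet below prints what the JVM believes its CPU and memory limits are. Run inside a container with CPU and memory limits, an older JVM reports the host's resources (via sysconf) rather than the cgroup limits, which is exactly the mismatch described above; newer JDKs added container awareness to address it.

    // Prints the JVM's view of available CPUs and max heap. Inside a container,
    // whether these respect cgroup limits depends on the JVM version.
    object ContainerLimitsSketch {
      def main(args: Array[String]): Unit = {
        val rt = Runtime.getRuntime
        println(s"availableProcessors = ${rt.availableProcessors()}")
        println(s"maxMemory (MiB)     = ${rt.maxMemory() / (1024 * 1024)}")
      }
    }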



Re: time for Apache Spark 3.0?

Marco Gaido
Hi all,

I also agree with Mark that we should add Java 9/10 support to an eventual Spark 3.0 release. Supporting Java 9 is not a trivial task, since we are using some internal APIs for memory management which have changed: either we find a solution which works on both (but I am not sure it is feasible), or we have to switch between two implementations according to the Java version.
So I'd rather avoid doing this in a non-major release.
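
To make the "internal APIs" point a bit more concrete, here is a rough sketch of the kind of reflective sun.misc access involved in off-heap memory management. It is illustrative only: Spark's real code lives in org.apache.spark.unsafe.Platform and touches more internals (for example the direct-buffer cleaner, which moved in Java 9), and depending on which internals are used, code like this may need --add-opens/--add-exports flags or a separate implementation on Java 9+.

    import java.lang.reflect.Field

    // Rough sketch of grabbing sun.misc.Unsafe reflectively and using it for a tiny
    // off-heap allocation; much simpler than Spark's actual Platform utilities.
    object UnsafeSketch {
      private lazy val unsafe: sun.misc.Unsafe = {
        val f: Field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
        f.setAccessible(true)
        f.get(null).asInstanceOf[sun.misc.Unsafe]
      }

      def main(args: Array[String]): Unit = {
        val address = unsafe.allocateMemory(16) // 16 bytes off-heap
        try {
          unsafe.putLong(address, 42L)
          println(unsafe.getLong(address))      // prints 42
        } finally {
          unsafe.freeMemory(address)
        }
      }
    }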

Thanks,
Marco





Re: time for Apache Spark 3.0?

Matei Zaharia
Java 9/10 support would be great to add as well.

Regarding Scala 2.12, I thought that supporting it would become easier if we change the Spark API and ABI slightly. Basically, it is of course possible to create an alternate source tree today, but it might be possible to share the same source files if we tweak some small things in the methods that are overloaded across Scala and Java. I don’t remember the exact details, but the idea was to reduce the total maintenance work needed at the cost of requiring users to recompile their apps.

I’m personally for moving to 3.0 because of the other things we can clean up as well, e.g. the default SQL dialect, Iterable stuff, and possibly dependency shading (a major pain point for lots of users). It’s also a chance to highlight Kubernetes, continuous processing and other features more if they become “GA”.

Matei





Re: time for Apache Spark 3.0?

Marcelo Vanzin
In reply to this post by Marco Gaido
I remember seeing somewhere that Scala still has some issues with Java
9/10 so that might be hard...

But on that topic, it might be better to shoot for Java 11
compatibility. 9 and 10, following the new release model, aren't
really meant to be long-term releases.

In general, agree with Sean here. Doesn't look like 2.12 support
requires unexpected API breakages. So unless there's a really good
reason to break / remove a bunch of existing APIs...




--
Marcelo



Re: time for Apache Spark 3.0?

Matei Zaharia
Sorry, but just to be clear here, this is the 2.12 API issue: https://issues.apache.org/jira/browse/SPARK-14643, with more details in this doc: https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit.

Basically, if we are allowed to change Spark’s API a little to have only one version of methods that are currently overloaded between Java and Scala, we can get away with a single source tree for all Scala versions and Java ABI compatibility against any Spark build (whether built with Scala 2.11 or 2.12). On the other hand, if we want to keep the API and ABI of the Spark 2.x branch, we’ll need a different source tree for Scala 2.12 with different copies of pretty large classes such as RDD, DataFrame and DStream, and Java users may have to change their code when linking against different versions of Spark.

This is of course only one of the possible ABI changes, but it is a considerable engineering effort, so we’d have to sign up for maintaining all these different source files. It seems kind of silly given that Scala 2.12 was released in 2016, so we’re doing all this work to keep ABI compatibility for Scala 2.11, which isn’t even that widely used any more for new projects. Also keep in mind that the next Spark release will probably take at least 3-4 months, so we’re talking about what people will be using in fall 2018.
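
For anyone who hasn't followed SPARK-14643, here is a self-contained sketch of the overloading problem. The trait and class names are invented stand-ins (Spark's real API overloads methods such as foreach between a Scala function and a Java functional interface like ForeachFunction). Under Scala 2.11 a plain lambda can only match the Function1 overload; under Scala 2.12, where lambdas also satisfy Java-style SAM interfaces, both overloads can apply and the call may be rejected as ambiguous, depending on the exact 2.12.x compiler.

    // Stand-in for a Java functional interface such as
    // org.apache.spark.api.java.function.ForeachFunction.
    trait VoidFunction[T] {
      def call(t: T): Unit
    }

    // Stand-in for a Spark class offering both a Scala-friendly and a Java-friendly overload.
    class FakeDataset[T](data: Seq[T]) {
      def foreach(f: T => Unit): Unit = data.foreach(f)                    // Scala overload
      def foreach(f: VoidFunction[T]): Unit = data.foreach(x => f.call(x)) // Java overload
    }

    object OverloadSketch {
      def main(args: Array[String]): Unit = {
        val ds = new FakeDataset(Seq(1, 2, 3))

        // Unambiguous everywhere: a value already typed as a Scala function is not
        // eligible for SAM conversion, so only the Function1 overload applies.
        val printIt: Int => Unit = i => println(i)
        ds.foreach(printIt)

        // With a bare lambda, 2.11 picks the Function1 overload, while 2.12 can also
        // SAM-convert it to VoidFunction and may report the call as ambiguous:
        // ds.foreach((i: Int) => println(i))
      }
    }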

Matei





Re: time for Apache Spark 3.0?

Matei Zaharia
Oh, I forgot to add that splitting the source tree for Scala also creates a big maintenance burden for third-party libraries built on Spark. As Josh said on the JIRA:

"I think this is primarily going to be an issue for end users who want to use an existing source tree to cross-compile for Scala 2.10, 2.11, and 2.12. Thus the pain of the source incompatibility would mostly be felt by library/package maintainers but it can be worked around as long as there's at least some common subset which is source compatible across all of those versions.”

This means that all the data sources, ML algorithms, etc. developed outside our source tree would have to do the same thing we do internally.
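
As a concrete (and entirely hypothetical) picture of what that means for such a library, a cross-built sbt project against Spark would look roughly like the build definition below. The project name and version numbers are placeholders, and this is an external library's build, not Spark's own (which uses Maven); the point is just that one shared source tree has to compile against the Spark artifacts for every Scala version listed.

    // build.sbt sketch for a hypothetical third-party Spark library.
    name := "my-spark-datasource"

    scalaVersion := "2.11.12"
    crossScalaVersions := Seq("2.11.12", "2.12.6")   // placeholder versions

    // %% picks the Spark artifact matching whichever Scala version is being built;
    // "2.4.0" here is just an example of a release published for both.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided"

    // `sbt +test +publishLocal` then builds and publishes one artifact per Scala
    // version, which only works if the shared sources compile against both.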





Re: time for Apache Spark 3.0?

Steve Loughran
In reply to this post by Matei Zaharia


On 5 Apr 2018, at 18:04, Matei Zaharia <[hidden email]> wrote:

Java 9/10 support would be great to add as well.

Be aware that the work moving Hadoop core to Java 9+ is still a big piece of work being undertaken by Akira Ajisaka & colleagues at NTT.


Big dependency updates and handling Oracle hiding the sun.misc stuff that low-level code depends on are the trouble spots, and a move to Log4j 2 is going to be observably traumatic for all apps that require a log4j.properties file to set themselves up. As usual, any testing that can be done early will be welcomed by all; the earlier the better.

That stuff is all about getting things working: supporting the Java 9 packaging model, which is a really compelling reason to go for it.


I’m personally for moving to 3.0 because of the other things we can clean up as well, e.g. the default SQL dialect, Iterable stuff, and possibly dependency shading (a major pain point for lots of users)

Hadoop 3 does have a shaded client, though not enough for Spark; if work identifying & fixing the outstanding dependencies is started now, Hadoop 3.2 should be able to offer the set of shaded libraries needed by Spark.

There's always a price to that: redistributable size and its impact on start times, duplicate classes loaded (memory, reduced chance of JIT recompilation, ...), and the whole transitive-shading problem. Java 9 should be the real target for a clean solution to all of this.

Re: time for Apache Spark 3.0?

Marcelo Vanzin
In reply to this post by Matei Zaharia

Fair enough. To play devil's advocate, most of those methods seem to
be marked "Experimental / Evolving", which could be used as a reason
to change them for this purpose in a minor release.

Not all of them are, though (e.g. foreach / foreachPartition are not
experimental).
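
For context, this is roughly what those markers look like in source. The class and method below are invented; the annotations themselves are the ones Spark's 2.x code base uses (org.apache.spark.annotation), and APIs carrying them are documented as subject to change even in minor releases.

    import org.apache.spark.annotation.{Experimental, InterfaceStability}

    // Invented API, shown only to illustrate the stability annotations.
    @Experimental
    @InterfaceStability.Evolving
    class HypotheticalStreamWriter {
      def commit(epochId: Long): Unit = ()
    }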

--
Marcelo



Re: time for Apache Spark 3.0?

Dean Wampler
I spoke with Martin Odersky and Lightbend's Scala Team about the known API issue with method disambiguation. They offered to implement a small patch in a new release of Scala 2.12 to handle the issue without requiring a Spark API change. They would cut a 2.12.6 release for it. I'm told that Scala 2.13 should already handle the issue without modification (it's not yet released, to be clear). They can also offer feedback on updating the closure cleaner.

So, this approach would support Scala 2.12 in Spark, limited to 2.12.6+, without the API change requirement, though the closure cleaner would still need updating. Hence, it could be done for Spark 2.x.

Let me know if you want to pursue this approach.

dean







Re: time for Apache Spark 3.0?

Sean Owen
That certainly sounds beneficial, maybe to several other projects too. If there's no downside and it takes away the API issues, it seems like a win.


Re: time for Apache Spark 3.0?

Andy
In reply to this post by rxin
Dear all:

It has been 2 months since this topic was proposed. Any progress? 2018 is about half over.

I agree that the new version should include some exciting new features. How about this one:

6. Integrate an ML/DL framework as a core component and feature (such as Angel / BigDL / …).

3.0 is a very important version for a good open source project. It would be better to shed the historical burden and focus on new areas. Spark has been widely used all over the world as a successful big data framework, and it can be even better than that.

Andy



Re: time for Apache Spark 3.0?

Mark Hamstra
Changing major version numbers is not about new features or a vague notion that it is time to do something that will be seen to be a significant release. It is about breaking stable public APIs.

I still remain unconvinced that the next version can't be 2.4.0.


Re: time for Apache Spark 3.0?

Mridul Muralidharan
I agree, I don't see a pressing need for a major version bump either.


Regards,
Mridul

Re: time for Apache Spark 3.0?

rxin
Yes. At this rate I think it's better to do 2.4 next, followed by 3.0. 



Re: time for Apache Spark 3.0?

Xiao Li
+1

2018-06-15 14:55 GMT-07:00 Reynold Xin <[hidden email]>:
Yes. At this rate I think it's better to do 2.4 next, followed by 3.0. 


On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan <[hidden email]> wrote:
I agree, I dont see pressing need for major version bump as well.


Regards,
Mridul
On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra <[hidden email]> wrote:
>
> Changing major version numbers is not about new features or a vague notion that it is time to do something that will be seen to be a significant release. It is about breaking stable public APIs.
>
> I still remain unconvinced that the next version can't be 2.4.0.
>
> On Fri, Jun 15, 2018 at 1:34 AM Andy <[hidden email]> wrote:
>>
>> Dear all:
>>
>> It have been 2 months since this topic being proposed. Any progress now? 2018 has been passed about 1/2.
>>
>> I agree with that the new version should be some exciting new feature. How about this one:
>>
>> 6. ML/DL framework to be integrated as core component and feature. (Such as Angel / BigDL / ……)
>>
>> 3.0 is a very important version for an good open source project. It should be better to drift away the historical burden and focus in new area. Spark has been widely used all over the world as a successful big data framework. And it can be better than that.
>>
>> Andy
>>
>>


Re: time for Apache Spark 3.0?

vaquarkhan
In reply to this post by rxin

On Fri, Jun 15, 2018 at 4:55 PM, Reynold Xin <[hidden email]> wrote:
Yes. At this rate I think it's better to do 2.4 next, followed by 3.0. 


On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan <[hidden email]> wrote:
I agree, I don't see a pressing need for a major version bump either.


Regards,
Mridul
On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra <[hidden email]> wrote:
>
> Changing major version numbers is not about new features or a vague notion that it is time to do something that will be seen to be a significant release. It is about breaking stable public APIs.
>
> I still remain unconvinced that the next version can't be 2.4.0.
>
> On Fri, Jun 15, 2018 at 1:34 AM Andy <[hidden email]> wrote:
>>
>> Dear all:
>>
>> It has been 2 months since this topic was proposed. Any progress? 2018 is already about half over.
>>
>> I agree that the new version should bring some exciting new features. How about this one:
>>
>> 6. Integrate an ML/DL framework as a core component and feature (such as Angel / BigDL / ……).
>>
>> 3.0 is a very important version for a good open source project. It would be better to shed the historical burden and focus on new areas. Spark has been widely used all over the world as a successful big data framework, and it can be better than that.
>>
>> Andy


--
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago

Re: time for Apache Spark 3.0?

vaquarkhan
Please ignore the YouTube link in my last email; I'm not sure how it got added.
Apologies, I don't know how to delete it.


--
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago

Re: time for Apache Spark 3.0?

vaquarkhan
+1 for 2.4 next, followed by 3.0.

Where can we find the Apache Spark roadmap for 2.4 and 2.5 .... 3.0?
Would it be possible to share a proposed specification for future releases, in the same way the release notes are published (https://spark.apache.org/releases/spark-release-2-3-0.html)?


Regards,
Vaquar Khan

--
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago