[DISCUSS] Spark 2.5 release

[DISCUSS] Spark 2.5 release

Ryan Blue
Hi everyone,

In the DSv2 sync this week, we talked about a possible Spark 2.5 release based on the latest Spark 2.4, but with DSv2 and Java 11 support added.

A Spark 2.5 release with these two additions will help people migrate to Spark 3.0 when it is released, because they will be able to use a single implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly, upgrading to 3.0 won't also require updating to Java 11, because users could update to Java 11 with the 2.5 release and have fewer major changes.

Another reason to consider a 2.5 release is that many people are interested in a release with the latest DSv2 API and support for DSv2 SQL. I'm already going to be backporting DSv2 support to the Spark 2.4 line, so it makes sense to share this work with the community.

This release line would just consist of backports like DSv2 and Java 11 that assist compatibility, to keep the scope of the release small. The purpose is to assist people moving to 3.0 and not distract from the 3.0 release.

Would a Spark 2.5 release help anyone else? Are there any concerns about this plan?


rb


--
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Spark 2.5 release

rxin
DSv2 is far from stable, right? All the actual data types are unstable and you guys have completely ignored that. We'd need to work on that, and that will be a breaking change. If the goal is to make DSv2 work across 3.x and 2.x, that seems too invasive a change to backport once you consider the parts needed to make DSv2 stable.



On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <[hidden email]> wrote:
> Would a Spark 2.5 release help anyone else? Are there any concerns about this plan?

Re: [DISCUSS] Spark 2.5 release

Sean Owen-2
In reply to this post by Ryan Blue
Narrowly on Java 11: the problem is that it'll take some breaking
changes, more than would be usually appropriate in a minor release, I
think. I'm still not convinced there is a burning need to use Java 11
but stay on 2.4, after 3.0 is out, and at least the wheels are in
motion there. Java 8 is still free and being updated.


Re: [DISCUSS] Spark 2.5 release

Ryan Blue
In reply to this post by rxin
> DSv2 is far from stable right?

No, I think it is reasonably stable and very close to being ready for a release.

> All the actual data types are unstable and you guys have completely ignored that.

I think what you're referring to is the use of `InternalRow`. That's a stable API and there has been no work to avoid using it. In any case, I don't think that anyone is suggesting that we delay 3.0 until a replacement for `InternalRow` is added, right?

While I understand the motivation for a better solution here, I think the pragmatic solution is to continue using `InternalRow`.

> If the goal is to make DSv2 work across 3.x and 2.x, that seems too invasive of a change to backport once you consider the parts needed to make dsv2 stable.

I believe that those of us working on DSv2 are confident about the current stability. We set goals for what to get into the 3.0 release months ago and have very nearly reached the point where we are ready for that release.

I don't think instability would be a problem in maintaining compatibility between the 2.5 version and the 3.0 version. If we find that we need to make API changes (other than additions), then we can make those in the 3.1 release. The goals we set for the 3.0 release have been reached with the current API, so if we are ready to release 3.0, we can release a 2.5 with the same API.
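
To illustrate the distinction between additions and other API changes, here is a minimal Java sketch. `TableReader` and `LegacyReader` are hypothetical names invented for this example, not DSv2 interfaces. It shows that a method added later as a `default` method does not break an implementation written against the earlier version of the interface:

```java
import java.util.ArrayList;
import java.util.List;

public class AdditiveChangeSketch {
    /** Hypothetical source-facing interface; not a Spark API. */
    public interface TableReader {
        List<String> readRows();

        // Imagine this method was added in a later release. Because it is a
        // default method, classes written against the earlier interface
        // still compile and run unchanged.
        default List<String> readRows(int limit) {
            List<String> all = readRows();
            return new ArrayList<>(all.subList(0, Math.min(limit, all.size())));
        }
    }

    /** An implementation written before readRows(int) existed. */
    public static final class LegacyReader implements TableReader {
        @Override
        public List<String> readRows() {
            List<String> rows = new ArrayList<>();
            rows.add("a");
            rows.add("b");
            rows.add("c");
            return rows;
        }
    }

    public static void main(String[] args) {
        TableReader reader = new LegacyReader();
        // The legacy class picks up the added method without modification.
        System.out.println(reader.readRows(2));
    }
}
```

Removing or changing the signature of `readRows()` would be the other kind of change: every existing implementation would have to be updated.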

On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <[hidden email]> wrote:
> DSv2 is far from stable right? All the actual data types are unstable and you guys have completely ignored that. We'd need to work on that and that will be a breaking change. If the goal is to make DSv2 work across 3.x and 2.x, that seems too invasive of a change to backport once you consider the parts needed to make dsv2 stable.


--
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Spark 2.5 release

Ryan Blue
In reply to this post by Sean Owen-2
I didn't realize that Java 11 would require breaking changes. What breaking changes are required?

On Fri, Sep 20, 2019 at 11:18 AM Sean Owen <[hidden email]> wrote:
> Narrowly on Java 11: the problem is that it'll take some breaking
> changes, more than would be usually appropriate in a minor release, I
> think. I'm still not convinced there is a burning need to use Java 11
> but stay on 2.4, after 3.0 is out, and at least the wheels are in
> motion there. Java 8 is still free and being updated.

--
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Spark 2.5 release

rxin
In reply to this post by Ryan Blue
To push back, while I agree we should not drastically change "InternalRow", there are a lot of changes that need to happen to make it stable. For example, none of the publicly exposed interfaces should be in the Catalyst package or the unsafe package. External implementations should be decoupled from the internal implementations, with cheap ways to convert back and forth.

When you created the PR to make InternalRow public, the understanding was to work towards making it stable in the future, assuming we would start with an unstable API temporarily. You can't just take a bunch of internal APIs that are tightly coupled with other internal pieces, make them public and stable, and call it a day, just because they happen to satisfy some use cases temporarily, assuming the rest of Spark doesn't change.
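
As a rough illustration of what "decoupled, with cheap ways to convert back and forth" could look like, here is a minimal Java sketch. Every name in it (`PublicRow`, `EngineRow`, `asPublic`, `toEngine`) is hypothetical and invented for this example; none of them are Spark classes. The internal row keeps strings as UTF-8 bytes, and the public interface exposes plain `String` values through a view that does not copy the row:

```java
import java.nio.charset.StandardCharsets;

public class RowDecouplingSketch {
    /** Hypothetical stable, connector-facing row type; not a Spark API. */
    public interface PublicRow {
        int numFields();
        Object get(int ordinal);
    }

    /** Hypothetical engine-internal row that keeps strings as UTF-8 bytes. */
    public static final class EngineRow {
        final Object[] values;

        public EngineRow(Object[] values) {
            this.values = values;
        }
    }

    /** Internal to public: a view over the same array; strings decode on access. */
    public static PublicRow asPublic(EngineRow row) {
        return new PublicRow() {
            @Override
            public int numFields() {
                return row.values.length;
            }

            @Override
            public Object get(int ordinal) {
                Object v = row.values[ordinal];
                return (v instanceof byte[])
                    ? new String((byte[]) v, StandardCharsets.UTF_8)
                    : v;
            }
        };
    }

    /** Public to internal: one pass that encodes strings back to UTF-8 bytes. */
    public static EngineRow toEngine(PublicRow row) {
        Object[] values = new Object[row.numFields()];
        for (int i = 0; i < values.length; i++) {
            Object v = row.get(i);
            values[i] = (v instanceof String)
                ? ((String) v).getBytes(StandardCharsets.UTF_8)
                : v;
        }
        return new EngineRow(values);
    }

    public static void main(String[] args) {
        EngineRow internal = new EngineRow(new Object[] {
            "spark".getBytes(StandardCharsets.UTF_8), 25
        });
        PublicRow external = asPublic(internal);
        System.out.println(external.get(0) + " " + external.get(1));
    }
}
```

With this shape, the internal representation can change without touching the public interface, as long as the two conversion functions are kept up to date.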



On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <[hidden email]> wrote:
> I think what you're referring to is the use of `InternalRow`. That's a stable API and there has been no work to avoid using it. In any case, I don't think that anyone is suggesting that we delay 3.0 until a replacement for `InternalRow` is added, right?

Re: [DISCUSS] Spark 2.5 release

Sean Owen-2
In reply to this post by Ryan Blue
I don't know enough about DSv2 to comment on this part, but, any theoretical 2.5 is still a ways off. Does waiting for 3.0 to 'stabilize' it as much as is possible help?

I say that because, re: Java 11, the main breaking changes are probably the Hive 2 / Hadoop 3 dependency, JPMML (minor), the general classloader changes, and the handling of off-heap memory. These aren't big breaks, but they are probably going to break some things. I think we'd want to see a 'proof of concept' branch to evaluate just how much has to change to get it working, and that is why I think a 2.5 release would still need more investigation.
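
One well-known instance of the classloader changes (an assumption about which change is meant here): on Java 8 the system class loader is a `URLClassLoader`, so code could cast it to list the classpath URLs; on Java 9 and later it is not, and that cast fails. A minimal sketch, with hypothetical class and method names, of a probe for this and a portable fallback that reads the `java.class.path` system property:

```java
import java.io.File;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

public class ClasspathProbe {
    /** True on a default Java 8 runtime, false on Java 9 and later. */
    public static boolean systemLoaderIsUrlClassLoader() {
        return ClassLoader.getSystemClassLoader() instanceof URLClassLoader;
    }

    /** Portable alternative to casting the system class loader. */
    public static List<String> classpathEntries() {
        String classpath = System.getProperty("java.class.path", "");
        List<String> entries = new ArrayList<>();
        for (String entry : classpath.split(File.pathSeparator)) {
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        System.out.println("URLClassLoader: " + systemLoaderIsUrlClassLoader());
        System.out.println("classpath entries: " + classpathEntries().size());
    }
}
```

Code that relied on the cast has to be rewritten along these lines before it can run on Java 11, which is the kind of change a proof-of-concept branch would surface.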

Re: [DISCUSS] Spark 2.5 release

Ryan Blue
In reply to this post by rxin

> When you created the PR to make InternalRow public

This isn’t quite accurate. The change I made was to use InternalRow instead of UnsafeRow, which is a specific implementation of InternalRow. Exposing this API has always been a part of DSv2 and while both you and I did some work to avoid this, we are still in the phase of starting with that API.

Note that any change to InternalRow would be very costly to implement because this interface is widely used. That is why I think we can certainly consider it stable enough to use here, and that’s probably why UnsafeRow was part of the original proposal.

In any case, the goal for 3.0 was not to replace the use of InternalRow; it was to get the majority of SQL working on top of the interface added after 2.4. That’s done and stable, so I think a 2.5 release with it is also reasonable.


On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <[hidden email]> wrote:
> To push back, while I agree we should not drastically change "InternalRow", there are a lot of changes that need to happen to make it stable. For example, none of the publicly exposed interfaces should be in the Catalyst package or the unsafe package. External implementations should be decoupled from the internal implementations, with cheap ways to convert back and forth.


--
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Spark 2.5 release

rxin
I don't think we need to gate a 3.0 release on making a more stable version of InternalRow, but thinking we'd have dsv2 working in both 3.x (which will change and progress towards more stable, but will have to break certain APIs) and 2.x seems like a false premise.

To point out some problems with InternalRow that you think are already pragmatic and stable:


/**
 * Catalyst is a library for manipulating relational query plans.  All classes in catalyst are
 * considered an internal API to Spark SQL and are subject to change between minor releases.
 */

There is not even any annotation on the interface.
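
For readers unfamiliar with the idea, here is a minimal Java sketch of what a stability annotation on an interface provides. The `Stable` and `Unstable` annotations and the two row interfaces are hypothetical stand-ins defined locally for this example, not Spark's own annotations or types. The point is that an annotated interface carries an explicit, checkable promise, while an unannotated one promises nothing:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

public class StabilitySketch {
    /** Hypothetical marker: the API will not break within a major version. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    public @interface Stable {}

    /** Hypothetical marker: the API may change in any release. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    public @interface Unstable {}

    @Stable
    public interface AnnotatedRow {
        int numFields();
    }

    /** No annotation: callers cannot tell what is promised. */
    public interface UnannotatedRow {
        int numFields();
    }

    /** Reports the declared stability of a type via reflection. */
    public static String stabilityOf(Class<?> type) {
        if (type.isAnnotationPresent(Stable.class)) {
            return "stable";
        }
        if (type.isAnnotationPresent(Unstable.class)) {
            return "unstable";
        }
        return "unspecified";
    }

    public static void main(String[] args) {
        System.out.println(stabilityOf(AnnotatedRow.class));
        System.out.println(stabilityOf(UnannotatedRow.class));
    }
}
```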

The entire dependency chain was created to be private and is tightly coupled with internal implementations. For example:


/**
 * A UTF-8 String for internal Spark use.
 * <p>
 * A String encoded in UTF-8 as an Array[Byte], which can be used for comparison,
 * search, see http://en.wikipedia.org/wiki/UTF-8 for details.
 * <p>
 * Note: This is not designed for general use cases, should not be used outside SQL.
 */


(which again is in catalyst package)


If you want to argue this way, you might as well argue we should make the entire catalyst package public to be pragmatic and not allow any changes.




On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <[hidden email]> wrote:
> Note that any change to InternalRow would be very costly to implement because this interface is widely used. That is why I think we can certainly consider it stable enough to use here, and that’s probably why UnsafeRow was part of the original proposal.

Re: [DISCUSS] Spark 2.5 release

Ryan Blue

> I don’t think we need to gate a 3.0 release on making a more stable version of InternalRow

Sounds like we agree, then. We will use it for 3.0, but there are known problems with it.

> Thinking we’d have dsv2 working in both 3.x (which will change and progress towards more stable, but will have to break certain APIs) and 2.x seems like a false premise.

Why do you think we will need to break certain APIs before 3.0?

I’m only suggesting that we release the same support in a 2.5 release that we do in 3.0. Since we are nearly finished with the 3.0 goals, it seems like we can certainly do that. We just won’t add any breaking changes before 3.1.


On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <[hidden email]> wrote:
I don't think we need to gate a 3.0 release on making a more stable version of InternalRow, but thinking we'd have dsv2 working in both 3.x (which will change and progress towards more stable, but will have to break certain APIs) and 2.x seems like a false premise.

To point out some problems with InternalRow that you think are already pragmatic and stable:


/**
* Catalyst is a library for manipulating relational query plans.  All classes in catalyst are
* considered an internal API to Spark SQL and are subject to change between minor releases.
*/

There is no even any annotation on the interface.

The entire dependency chain were created to be private, and tightly coupled with internal implementations. For example, 


/**
* A UTF-8 String for internal Spark use.
* <p>
* A String encoded in UTF-8 as an Array[Byte], which can be used for comparison,
* search, see http://en.wikipedia.org/wiki/UTF-8 for details.
* <p>
* Note: This is not designed for general use cases, should not be used outside SQL.
*/


(which again is in catalyst package)


If you want to argue this way, you might as well argue we should make the entire catalyst package public to be pragmatic and not allow any changes.




On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <[hidden email]> wrote:

When you created the PR to make InternalRow public

This isn’t quite accurate. The change I made was to use InternalRow instead of UnsafeRow, which is a specific implementation of InternalRow. Exposing this API has always been a part of DSv2 and while both you and I did some work to avoid this, we are still in the phase of starting with that API.

Note that any change to InternalRow would be very costly to implement because this interface is widely used. That is why I think we can certainly consider it stable enough to use here, and that’s probably why UnsafeRow was part of the original proposal.

In any case, the goal for 3.0 was not to replace the use of InternalRow, it was to get the majority of SQL working on top of the interface added after 2.4. That’s done and stable, so I think a 2.5 release with it is also reasonable.


On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <[hidden email]> wrote:
To push back, while I agree we should not drastically change "InternalRow", there are a lot of changes that need to happen to make it stable. For example, none of the publicly exposed interfaces should be in the Catalyst package or the unsafe package. External implementations should be decoupled from the internal implementations, with cheap ways to convert back and forth.

When you created the PR to make InternalRow public, the understanding was to work towards making it stable in the future, assuming we will start with an unstable API temporarily. You can't just take a bunch of internal APIs that are tightly coupled with other internal pieces, make them public and stable, and call it a day just because they happen to satisfy some use cases temporarily, assuming the rest of Spark doesn't change.



On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <[hidden email]> wrote:
> DSv2 is far from stable right?

No, I think it is reasonably stable and very close to being ready for a release.

> All the actual data types are unstable and you guys have completely ignored that.

I think what you're referring to is the use of `InternalRow`. That's a stable API and there has been no work to avoid using it. In any case, I don't think that anyone is suggesting that we delay 3.0 until a replacement for `InternalRow` is added, right?

While I understand the motivation for a better solution here, I think the pragmatic solution is to continue using `InternalRow`.

> If the goal is to make DSv2 work across 3.x and 2.x, that seems too invasive of a change to backport once you consider the parts needed to make dsv2 stable.

I believe that those of us working on DSv2 are confident about the current stability. We set goals for what to get into the 3.0 release months ago and have very nearly reached the point where we are ready for that release.

I don't think instability would be a problem in maintaining compatibility between the 2.5 version and the 3.0 version. If we find that we need to make API changes (other than additions) then we can make those in the 3.1 release. Because the goals we set for the 3.0 release have been reached with the current API and if we are ready to release 3.0, we can release a 2.5 with the same API.

On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <[hidden email]> wrote:
DSv2 is far from stable right? All the actual data types are unstable and you guys have completely ignored that. We'd need to work on that and that will be a breaking change. If the goal is to make DSv2 work across 3.x and 2.x, that seems too invasive of a change to backport once you consider the parts needed to make dsv2 stable.



On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <[hidden email]> wrote:
Hi everyone,

In the DSv2 sync this week, we talked about a possible Spark 2.5 release based on the latest Spark 2.4, but with DSv2 and Java 11 support added.

A Spark 2.5 release with these two additions will help people migrate to Spark 3.0 when it is released because they will be able to use a single implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly, upgrading to 3.0 won't also require updating to Java 11 because users could update to Java 11 with the 2.5 release and have fewer major changes.

Another reason to consider a 2.5 release is that many people are interested in a release with the latest DSv2 API and support for DSv2 SQL. I'm already going to be backporting DSv2 support to the Spark 2.4 line, so it makes sense to share this work with the community.

This release line would just consist of backports like DSv2 and Java 11 that assist compatibility, to keep the scope of the release small. The purpose is to assist people moving to 3.0 and not distract from the 3.0 release.

Would a Spark 2.5 release help anyone else? Are there any concerns about this plan?


rb


--
Ryan Blue
Software Engineer
Netflix




Re: [DISCUSS] Spark 2.5 release

Dongjoon Hyun-2
Do you mean you want to have a breaking API change between 3.0 and 3.1?
I believe we follow Semantic Versioning ( https://spark.apache.org/versioning-policy.html ).

> We just won’t add any breaking changes before 3.1.
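
For readers less familiar with the policy being cited, here is a minimal sketch of the semantic-versioning rule at issue. It is not Spark code: the `VersionBump` class and its method names are hypothetical, and it encodes only the general rule that incompatible changes to a public API call for a major version bump. Any exceptions a project's own policy makes, for example for APIs explicitly marked as experimental, are ignored here.

```java
/** Hypothetical helper, not part of Spark: classifies the step between two versions. */
final class VersionBump {
    enum Kind { MAJOR, MINOR, PATCH, NONE }

    /** Parses "MAJOR.MINOR.PATCH"; missing trailing components count as 0. */
    static int[] parse(String version) {
        String[] parts = version.split("\\.");
        int[] nums = new int[3];
        for (int i = 0; i < 3 && i < parts.length; i++) {
            nums[i] = Integer.parseInt(parts[i]);
        }
        return nums;
    }

    /** Returns the most significant component that differs between the two versions. */
    static Kind between(String from, String to) {
        int[] a = parse(from);
        int[] b = parse(to);
        if (a[0] != b[0]) return Kind.MAJOR;
        if (a[1] != b[1]) return Kind.MINOR;
        if (a[2] != b[2]) return Kind.PATCH;
        return Kind.NONE;
    }

    /** Under plain semantic versioning, only a major bump may break a public API. */
    static boolean allowsBreakingChange(String from, String to) {
        return between(from, to) == Kind.MAJOR;
    }

    public static void main(String[] args) {
        System.out.println("2.4.0 -> 3.0.0: " + allowsBreakingChange("2.4.0", "3.0.0"));
        System.out.println("3.0.0 -> 3.1.0: " + allowsBreakingChange("3.0.0", "3.1.0"));
    }
}
```

Under this reading, a breaking DSv2 change would fit a 2.x to 3.0 step but not a 3.0 to 3.1 step, which is the tension the question above points at.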

Bests,
Dongjoon.



Re: [DISCUSS] Spark 2.5 release

Jungtaek Lim
Just my 2 cents: I haven't tracked the changes to DSv2 (though I had to deal with them, as the changes caused confusion on my PRs...), but my bet is that DSv2 has already changed in incompatible ways, at least for those who work on custom DataSources. Making downstream projects diverge their implementations heavily between minor versions (say, 2.4 vs 2.5) wouldn't be a good experience, especially since we have not completely ruled out further modifications to DSv2, and those changes could be backward incompatible.

If we really want to bring the DSv2 changes to the 2.x line so that end users aren't forced to upgrade to Spark 3.x to enjoy the new DSv2, I'd rather say preparation of Spark 2.5 should start after Spark 3.0 is officially released, and honestly even later than that, say, after getting some reports from Spark 3.0 about DSv2 so that we feel DSv2 is OK. I hope we don't make Spark 2.5 a kind of "tech preview" that leaves Spark 2.4 users frustrated when they upgrade to the next minor version.

Btw, do we have any specific target users for this? Personally, I think the DSv2 change would be the major backward incompatibility that makes Spark 2.x users hesitate to upgrade, so they might already be prepared to migrate to Spark 3.0 if they are prepared to migrate to the new DSv2.


Re: [DISCUSS] Spark 2.5 release

Jungtaek Lim
small correction: confusion -> conflict, so I had to go through and understand parts of the changes


Re: [DISCUSS] Spark 2.5 release

Xiao Li-2
+1 on Jungtaek's point. Can we revisit this when we release Spark 3.1? After the release of 3.0, I believe we will get more feedback about DSv2 from the community. The current design was made by just a small group of contributors. The DSv2 and catalog APIs are still evolving. It is very likely we will make more changes after the 3.0 release.

On Fri, Sep 20, 2019 at 9:27 PM Jungtaek Lim <[hidden email]> wrote:
small correction: confusion -> conflict, so I had to go through and understand parts of the changes

On Sat, Sep 21, 2019 at 1:25 PM Jungtaek Lim <[hidden email]> wrote:
Just my 2 cents: I haven't tracked the changes to DSv2 (though I had to deal with them, as the changes caused confusion on my PRs...), but my bet is that DSv2 has already changed in incompatible ways, at least for those who work on custom DataSources. Making downstream projects diverge their implementations heavily between minor versions (say, 2.4 vs 2.5) wouldn't be a good experience, especially since we have not completely ruled out further modifications to DSv2, and those changes could be backward incompatible.

If we really want to bring the DSv2 changes to the 2.x line so that end users aren't forced to upgrade to Spark 3.x to enjoy the new DSv2, I'd rather say preparation of Spark 2.5 should start after Spark 3.0 is officially released, and honestly even later than that, say, after getting some reports from Spark 3.0 about DSv2 so that we feel DSv2 is OK. I hope we don't make Spark 2.5 a kind of "tech preview" that leaves Spark 2.4 users frustrated when they upgrade to the next minor version.

Btw, do we have any specific target users for this? Personally, I think the DSv2 change would be the major backward incompatibility that makes Spark 2.x users hesitate to upgrade, so they might already be prepared to migrate to Spark 3.0 if they are prepared to migrate to the new DSv2.

On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun <[hidden email]> wrote:
Do you mean you want to have a breaking API change between 3.0 and 3.1?
I believe we follow Semantic Versioning ( https://spark.apache.org/versioning-policy.html ).

> We just won’t add any breaking changes before 3.1.

Bests,
Dongjoon.


On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <[hidden email]> wrote:

> I don’t think we need to gate a 3.0 release on making a more stable version of InternalRow

Sounds like we agree, then. We will use it for 3.0, but there are known problems with it.

> Thinking we’d have dsv2 working in both 3.x (which will change and progress towards more stable, but will have to break certain APIs) and 2.x seems like a false premise.

Why do you think we will need to break certain APIs before 3.0?

I’m only suggesting that we release the same support in a 2.5 release that we do in 3.0. Since we are nearly finished with the 3.0 goals, it seems like we can certainly do that. We just won’t add any breaking changes before 3.1.


On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <[hidden email]> wrote:
I don't think we need to gate a 3.0 release on making a more stable version of InternalRow, but thinking we'd have dsv2 working in both 3.x (which will change and progress towards more stable, but will have to break certain APIs) and 2.x seems like a false premise.

To point out some problems with InternalRow that you think are already pragmatic and stable:


/**
* Catalyst is a library for manipulating relational query plans.  All classes in catalyst are
* considered an internal API to Spark SQL and are subject to change between minor releases.
*/

There is not even any annotation on the interface.

The entire dependency chain was created to be private, and is tightly coupled with internal implementations. For example,


/**
* A UTF-8 String for internal Spark use.
* <p>
* A String encoded in UTF-8 as an Array[Byte], which can be used for comparison,
* search, see http://en.wikipedia.org/wiki/UTF-8 for details.
* <p>
* Note: This is not designed for general use cases, should not be used outside SQL.
*/


(which again is in catalyst package)


If you want to argue this way, you might as well argue we should make the entire catalyst package public to be pragmatic and not allow any changes.




On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <[hidden email]> wrote:

> When you created the PR to make InternalRow public

This isn’t quite accurate. The change I made was to use InternalRow instead of UnsafeRow, which is a specific implementation of InternalRow. Exposing this API has always been a part of DSv2 and while both you and I did some work to avoid this, we are still in the phase of starting with that API.

Note that any change to InternalRow would be very costly to implement because this interface is widely used. That is why I think we can certainly consider it stable enough to use here, and that’s probably why UnsafeRow was part of the original proposal.

In any case, the goal for 3.0 was not to replace the use of InternalRow, it was to get the majority of SQL working on top of the interface added after 2.4. That’s done and stable, so I think a 2.5 release with it is also reasonable.


On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <[hidden email]> wrote:
To push back, while I agree we should not drastically change "InternalRow", there are a lot of changes that need to happen to make it stable. For example, none of the publicly exposed interfaces should be in the Catalyst package or the unsafe package. External implementations should be decoupled from the internal implementations, with cheap ways to convert back and forth.

When you created the PR to make InternalRow public, the understanding was that we would work towards making it stable in the future, on the assumption that we would start with an unstable API temporarily. You can't just take a bunch of internal APIs that are tightly coupled with other internal pieces, make them public and stable, and call it a day just because they happen to satisfy some use cases temporarily, assuming the rest of Spark doesn't change.
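
The decoupling suggested above (a public row type kept separate from the internal one, with cheap conversion in both directions) could look roughly like the following. This is only a language-neutral sketch written in Python for brevity; `PublicRow`, `EngineRow`, `EngineRowView`, and `to_engine_row` are hypothetical names and are not part of Spark or DSv2.

```python
from typing import Any, List, Protocol


class PublicRow(Protocol):
    """Hypothetical stable, public row interface (not a Spark API)."""

    def num_fields(self) -> int: ...

    def get(self, ordinal: int) -> Any: ...


class EngineRow:
    """Stand-in for an internal row; strings are stored as UTF-8 bytes."""

    def __init__(self, values: List[Any]) -> None:
        self.values = values


class EngineRowView:
    """Public view over an internal row; wraps it without copying."""

    def __init__(self, row: EngineRow) -> None:
        self._row = row

    def num_fields(self) -> int:
        return len(self._row.values)

    def get(self, ordinal: int) -> Any:
        value = self._row.values[ordinal]
        # Decode the internal string representation only when a field is read.
        return value.decode("utf-8") if isinstance(value, bytes) else value


def to_engine_row(row: PublicRow) -> EngineRow:
    """Convert a public row back to the internal representation."""
    if isinstance(row, EngineRowView):
        return row._row  # cheap path: unwrap, no copy
    values = []
    for i in range(row.num_fields()):
        value = row.get(i)
        values.append(value.encode("utf-8") if isinstance(value, str) else value)
    return EngineRow(values)
```

With a shape like this, external sources would depend only on the public interface, and the internal representation could change without breaking them, which seems to be the property being asked for here.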



On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <[hidden email]> wrote:
> DSv2 is far from stable right?

No, I think it is reasonably stable and very close to being ready for a release.

> All the actual data types are unstable and you guys have completely ignored that.

I think what you're referring to is the use of `InternalRow`. That's a stable API and there has been no work to avoid using it. In any case, I don't think that anyone is suggesting that we delay 3.0 until a replacement for `InternalRow` is added, right?

While I understand the motivation for a better solution here, I think the pragmatic solution is to continue using `InternalRow`.

> If the goal is to make DSv2 work across 3.x and 2.x, that seems too invasive of a change to backport once you consider the parts needed to make dsv2 stable.

I believe that those of us working on DSv2 are confident about the current stability. We set goals for what to get into the 3.0 release months ago and have very nearly reached the point where we are ready for that release.

I don't think instability would be a problem in maintaining compatibility between the 2.5 version and the 3.0 version. If we find that we need to make API changes (other than additions), then we can make those in the 3.1 release. The goals we set for the 3.0 release have been reached with the current API, so if we are ready to release 3.0, we can release a 2.5 with the same API.


Re: [DISCUSS] Spark 2.5 release

Ryan Blue
In reply to this post by Dongjoon Hyun-2

Thanks for pointing this out, Dongjoon.

To clarify, I’m not suggesting that we can break compatibility. I’m suggesting that we make a 2.5 release that uses the same DSv2 API as 3.0.

These APIs are marked unstable, so we could make changes to them if we needed — as we have done in the 2.x line — but I don’t see a reason why we would break compatibility in the 3.x line.
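
To make the rule under discussion concrete, here is a minimal sketch. It assumes (as a simplification of what is said in this thread, not as an official statement of the Spark versioning policy) that stable APIs may only break across major versions, while APIs marked unstable may also change between minor releases. The function names are hypothetical.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a string such as '2.5' or '3.0.1'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def breaking_change_allowed(old: str, new: str, stable_api: bool) -> bool:
    """Hypothetical rule of thumb, not an official policy check."""
    old_major, old_minor = parse_version(old)
    new_major, new_minor = parse_version(new)
    if new_major > old_major:
        return True  # a major version bump may break any API
    if new_major < old_major:
        return False  # not an upgrade
    if stable_api:
        return False  # stable APIs must not break within a major line
    # Unstable APIs may change between minor releases of the same major line.
    return new_minor > old_minor
```

Under these assumptions, `breaking_change_allowed("3.0", "3.1", stable_api=True)` is false, which is the concern raised about 3.0 to 3.1, while the same call with `stable_api=False` is true, which is the point about the APIs being marked unstable.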



Re: [DISCUSS] Spark 2.5 release

Ryan Blue
In reply to this post by Jungtaek Lim
> Making downstream projects diverge heavily in their implementations between minor versions (say, 2.4 vs 2.5) wouldn't be a good experience

You're right that the API has been evolving in the 2.x line. But it is now reasonably stable with respect to the current feature set, and we should not need to break compatibility in the 3.x line. Because we have reached our goals for the 3.0 release, we can backport at least those features to 2.x and confidently have an API that works in a 2.x release and is compatible with 3.0, if not 3.1 and later releases as well.

> I'd rather say preparation of Spark 2.5 should be started after Spark 3.0 is officially released

The reason I'm suggesting this is that I'm already going to do the work to backport the 3.0 release features to 2.4. I've been asked by several people when DSv2 will be released, so I know there is a lot of interest in making this available sooner than 3.0. If I'm already doing the work, then I'd be happy to share that with the community.

I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5 while preparing the 3.0 preview and fixing bugs. For DSv2, the work is nearly complete, so we can easily release the same set of features and API in 2.5 and 3.0.

If we decide for some reason to wait until after 3.0 is released, I don't know that there is much value in a 2.5. The purpose is to be a step toward 3.0, and releasing that step after 3.0 doesn't seem helpful to me. Waiting also wouldn't get these features out any sooner than 3.0, which a 2.5 release probably would, given the work needed to validate the incompatible changes in 3.0.

> the DSv2 change is the major backward incompatibility that may make Spark 2.x users hesitate to upgrade

As I pointed out, DSv2 has been changing in the 2.x line, so this is expected. I don't think it will need incompatible changes in the 3.x line.

On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim <[hidden email]> wrote:
Just my 2 cents: I haven't tracked the changes to DSv2 (though I had to deal with them, since they caused confusion on my PRs...), but my bet is that DSv2 has already changed in incompatible ways, at least for those who work on custom DataSources. Making downstream projects diverge heavily in their implementations between minor versions (say, 2.4 vs 2.5) wouldn't be a good experience, especially since we have not completely ruled out further changes to DSv2, and those changes could be backward incompatible.

If we really want to bring the DSv2 change to the 2.x line so that end users are not forced to upgrade to Spark 3.x to enjoy the new DSv2, I'd rather say preparation of Spark 2.5 should be started after Spark 3.0 is officially released, or honestly even later than that, say, after getting some reports from Spark 3.0 about DSv2 so that we feel DSv2 is OK. I hope we don't make Spark 2.5 a kind of "tech-preview" release that Spark 2.4 users would be frustrated to upgrade to as their next minor version.

Btw, do we have any specific target users for this? Personally, I think the DSv2 change is the major backward incompatibility that may make Spark 2.x users hesitate to upgrade, so anyone prepared to migrate to the new DSv2 is probably already prepared to migrate to Spark 3.0.


Re: [DISCUSS] Spark 2.5 release

rxin
How would you not make incompatible changes in 3.x? As discussed, the InternalRow API is not stable and needs to change.

On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue <[hidden email]> wrote:
> Making downstream to diverge their implementation heavily between minor versions (say, 2.4 vs 2.5) wouldn't be a good experience

You're right that the API has been evolving in the 2.x line. But, it is now reasonably stable with respect to the current feature set and we should not need to break compatibility in the 3.x line. Because we have reached our goals for the 3.0 release, we can backport at least those features to 2.x and confidently have an API that works in both a 2.x release and is compatible with 3.0, if not 3.1 and later releases as well.

> I'd rather say preparation of Spark 2.5 should be started after Spark 3.0 is officially released

The reason I'm suggesting this is that I'm already going to do the work to backport the 3.0 release features to 2.4. I've been asked by several people when DSv2 will be released, so I know there is a lot of interest in making this available sooner than 3.0. If I'm already doing the work, then I'd be happy to share that with the community.

I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5 while preparing the 3.0 preview and fixing bugs. For DSv2, the work is about complete so we can easily release the same set of features and API in 2.5 and 3.0.

If we decide for some reason to wait until after 3.0 is released, I don't know that there is much value in a 2.5. The purpose is to be a step toward 3.0, and releasing that step after 3.0 doesn't seem helpful to me. It also wouldn't get these features out any sooner than 3.0, as a 2.5 release probably would, given the work needed to validate the incompatible changes in 3.0.

> DSv2 change would be the major backward incompatibility which Spark 2.x users may hesitate to upgrade

As I pointed out, DSv2 has been changing in the 2.x line, so this is expected. I don't think it will need incompatible changes in the 3.x line.

On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim <[hidden email]> wrote:
Just 2 cents, I haven't tracked the change of DSv2 (though I needed to deal with this as the change made confusion on my PRs...), but my bet is that DSv2 would be already changed in incompatible way, at least who works for custom DataSource. Making downstream to diverge their implementation heavily between minor versions (say, 2.4 vs 2.5) wouldn't be a good experience - especially we are not completely closed the chance to further modify DSv2, and the change could be backward incompatible.

If we really want to bring the DSv2 change to 2.x version line to let end users avoid forcing to upgrade Spark 3.x to enjoy new DSv2, I'd rather say preparation of Spark 2.5 should be started after Spark 3.0 is officially released, honestly even later than that, say, getting some reports from Spark 3.0 about DSv2 so that we feel DSv2 is OK. I hope we don't make Spark 2.5 be a kind of "tech-preview" which Spark 2.4 users may be frustrated to upgrade to next minor version.

Btw, do we have any specific target users for this? Personally DSv2 change would be the major backward incompatibility which Spark 2.x users may hesitate to upgrade, so they might be already prepared to migrate to Spark 3.0 if they are prepared to migrate to new DSv2.

On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun <[hidden email]> wrote:
Do you mean you want to have a breaking API change between 3.0 and 3.1?
I believe we follow Semantic Versioning ( https://spark.apache.org/versioning-policy.html ).

> We just won’t add any breaking changes before 3.1.

Bests,
Dongjoon.


On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <[hidden email]> wrote:

I don’t think we need to gate a 3.0 release on making a more stable version of InternalRow

Sounds like we agree, then. We will use it for 3.0, but there are known problems with it.

Thinking we’d have dsv2 working in both 3.x (which will change and progress towards more stable, but will have to break certain APIs) and 2.x seems like a false premise.

Why do you think we will need to break certain APIs before 3.0?

I’m only suggesting that we release the same support in a 2.5 release that we do in 3.0. Since we are nearly finished with the 3.0 goals, it seems like we can certainly do that. We just won’t add any breaking changes before 3.1.


On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <[hidden email]> wrote:
I don't think we need to gate a 3.0 release on making a more stable version of InternalRow, but thinking we'd have dsv2 working in both 3.x (which will change and progress towards more stable, but will have to break certain APIs) and 2.x seems like a false premise.

To point out some problems with InternalRow that you think are already pragmatic and stable:


/**
* Catalyst is a library for manipulating relational query plans.  All classes in catalyst are
* considered an internal API to Spark SQL and are subject to change between minor releases.
*/

There is no even any annotation on the interface.

The entire dependency chain were created to be private, and tightly coupled with internal implementations. For example, 


/**
* A UTF-8 String for internal Spark use.
* <p>
* A String encoded in UTF-8 as an Array[Byte], which can be used for comparison,
* search, see http://en.wikipedia.org/wiki/UTF-8 for details.
* <p>
* Note: This is not designed for general use cases, should not be used outside SQL.
*/


(which again is in catalyst package)


If you want to argue this way, you might as well argue we should make the entire catalyst package public to be pragmatic and not allow any changes.




On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <[hidden email]> wrote:

When you created the PR to make InternalRow public

This isn’t quite accurate. The change I made was to use InternalRow instead of UnsafeRow, which is a specific implementation of InternalRow. Exposing this API has always been a part of DSv2 and while both you and I did some work to avoid this, we are still in the phase of starting with that API.

Note that any change to InternalRow would be very costly to implement because this interface is widely used. That is why I think we can certainly consider it stable enough to use here, and that’s probably why UnsafeRow was part of the original proposal.

In any case, the goal for 3.0 was not to replace the use of InternalRow, it was to get the majority of SQL working on top of the interface added after 2.4. That’s done and stable, so I think a 2.5 release with it is also reasonable.


On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <[hidden email]> wrote:
To push back, while I agree we should not drastically change "InternalRow", there are a lot of changes that need to happen to make it stable. For example, none of the publicly exposed interfaces should be in the Catalyst package or the unsafe package. External implementations should be decoupled from the internal implementations, with cheap ways to convert back and forth.

When you created the PR to make InternalRow public, the understanding was to work towards making it stable in the future, assuming we will start with an unstable API temporarily. You can't just make a bunch internal APIs tightly coupled with other internal pieces public and stable and call it a day, just because it happen to satisfy some use cases temporarily assuming the rest of Spark doesn't change.



On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <[hidden email]> wrote:
> DSv2 is far from stable right?

No, I think it is reasonably stable and very close to being ready for a release.

> All the actual data types are unstable and you guys have completely ignored that.

I think what you're referring to is the use of `InternalRow`. That's a stable API and there has been no work to avoid using it. In any case, I don't think that anyone is suggesting that we delay 3.0 until a replacement for `InternalRow` is added, right?

While I understand the motivation for a better solution here, I think the pragmatic solution is to continue using `InternalRow`.

> If the goal is to make DSv2 work across 3.x and 2.x, that seems too invasive of a change to backport once you consider the parts needed to make dsv2 stable.

I believe that those of us working on DSv2 are confident about the current stability. We set goals for what to get into the 3.0 release months ago and have very nearly reached the point where we are ready for that release.

I don't think instability would be a problem in maintaining compatibility between the 2.5 version and the 3.0 version. If we find that we need to make API changes (other than additions) then we can make those in the 3.1 release. Because the goals we set for the 3.0 release have been reached with the current API, if we are ready to release 3.0, then we can release a 2.5 with the same API.

On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <[hidden email]> wrote:
DSv2 is far from stable right? All the actual data types are unstable and you guys have completely ignored that. We'd need to work on that and that will be a breaking change. If the goal is to make DSv2 work across 3.x and 2.x, that seems too invasive of a change to backport once you consider the parts needed to make dsv2 stable.



On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <[hidden email]> wrote:
Hi everyone,

In the DSv2 sync this week, we talked about a possible Spark 2.5 release based on the latest Spark 2.4, but with DSv2 and Java 11 support added.

A Spark 2.5 release with these two additions will help people migrate to Spark 3.0 when it is released because they will be able to use a single implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly, upgrading to 3.0 won't also require updating to Java 11 because users could update to Java 11 with the 2.5 release and have fewer major changes.

Another reason to consider a 2.5 release is that many people are interested in a release with the latest DSv2 API and support for DSv2 SQL. I'm already going to be backporting DSv2 support to the Spark 2.4 line, so it makes sense to share this work with the community.

This release line would just consist of backports like DSv2 and Java 11 that assist compatibility, to keep the scope of the release small. The purpose is to assist people moving to 3.0 and not distract from the 3.0 release.

Would a Spark 2.5 release help anyone else? Are there any concerns about this plan?


rb


--
Ryan Blue
Software Engineer
Netflix




Re: [DISCUSS] Spark 2.5 release

Ryan Blue
Why would that require an incompatible change?

We *could* make an incompatible change and remove support for InternalRow, but I think we would want to carefully consider whether that is the right decision. And in any case, we would be able to keep 2.5 and 3.0 compatible, which is the main goal.

On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin <[hidden email]> wrote:
How would you not make incompatible changes in 3.x? As discussed, the InternalRow API is not stable and needs to change.

On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue <[hidden email]> wrote:
> Making downstream projects diverge their implementations heavily between minor versions (say, 2.4 vs 2.5) wouldn't be a good experience

You're right that the API has been evolving in the 2.x line. But it is now reasonably stable with respect to the current feature set, and we should not need to break compatibility in the 3.x line. Because we have reached our goals for the 3.0 release, we can backport at least those features to 2.x and confidently have an API that works in both a 2.x release and 3.0, if not 3.1 and later releases as well.

> I'd rather say preparation of Spark 2.5 should be started after Spark 3.0 is officially released

The reason I'm suggesting this is that I'm already going to do the work to backport the 3.0 release features to 2.4. I've been asked by several people when DSv2 will be released, so I know there is a lot of interest in making this available sooner than 3.0. If I'm already doing the work, then I'd be happy to share that with the community.

I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5 while preparing the 3.0 preview and fixing bugs. For DSv2, the work is nearly complete, so we can easily release the same set of features and API in 2.5 and 3.0.

If we decide for some reason to wait until after 3.0 is released, I don't know that there is much value in a 2.5. The purpose is to be a step toward 3.0, and releasing that step after 3.0 doesn't seem helpful to me. Waiting also wouldn't get these features out any sooner than 3.0, whereas a 2.5 release now probably would, given the work needed to validate the incompatible changes in 3.0.

> DSv2 change would be the major backward incompatibility that may make Spark 2.x users hesitate to upgrade

As I pointed out, DSv2 has been changing in the 2.x line, so this is expected. I don't think it will need incompatible changes in the 3.x line.

On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim <[hidden email]> wrote:
Just my 2 cents: I haven't tracked the changes to DSv2 (though I had to deal with them, as they caused confusion on my PRs...), but my bet is that DSv2 has already changed in an incompatible way, at least for those who work on custom DataSources. Making downstream projects diverge their implementations heavily between minor versions (say, 2.4 vs 2.5) wouldn't be a good experience - especially since we have not completely ruled out further modifications to DSv2, and such changes could be backward incompatible.

If we really want to bring the DSv2 change to the 2.x version line so that end users are not forced to upgrade to Spark 3.x to enjoy the new DSv2, I'd rather say preparation of Spark 2.5 should be started after Spark 3.0 is officially released, honestly even later than that, say, after getting some reports from Spark 3.0 about DSv2 so that we feel DSv2 is OK. I hope we don't make Spark 2.5 a kind of "tech-preview" that Spark 2.4 users would be frustrated to upgrade to as their next minor version.

Btw, do we have any specific target users for this? Personally, I think the DSv2 change would be the major backward incompatibility that may make Spark 2.x users hesitate to upgrade, so they might already be prepared to migrate to Spark 3.0 if they are prepared to migrate to the new DSv2.

On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun <[hidden email]> wrote:
Do you mean you want to have a breaking API change between 3.0 and 3.1?
I believe we follow Semantic Versioning ( https://spark.apache.org/versioning-policy.html ).

> We just won’t add any breaking changes before 3.1.

Bests,
Dongjoon.


On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <[hidden email]> wrote:

> I don’t think we need to gate a 3.0 release on making a more stable version of InternalRow

Sounds like we agree, then. We will use it for 3.0, but there are known problems with it.

> Thinking we’d have dsv2 working in both 3.x (which will change and progress towards more stable, but will have to break certain APIs) and 2.x seems like a false premise.

Why do you think we will need to break certain APIs before 3.0?

I’m only suggesting that we release the same support in a 2.5 release that we do in 3.0. Since we are nearly finished with the 3.0 goals, it seems like we can certainly do that. We just won’t add any breaking changes before 3.1.


On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <[hidden email]> wrote:
I don't think we need to gate a 3.0 release on making a more stable version of InternalRow, but thinking we'd have dsv2 working in both 3.x (which will change and progress towards more stable, but will have to break certain APIs) and 2.x seems like a false premise.

To point out some problems with InternalRow that you think are already pragmatic and stable:


/**
* Catalyst is a library for manipulating relational query plans.  All classes in catalyst are
* considered an internal API to Spark SQL and are subject to change between minor releases.
*/

There is not even any annotation on the interface.

The entire dependency chain was created to be private and is tightly coupled with internal implementations. For example,


/**
* A UTF-8 String for internal Spark use.
* <p>
* A String encoded in UTF-8 as an Array[Byte], which can be used for comparison,
* search, see http://en.wikipedia.org/wiki/UTF-8 for details.
* <p>
* Note: This is not designed for general use cases, should not be used outside SQL.
*/


(which again is in catalyst package)


If you want to argue this way, you might as well argue we should make the entire catalyst package public to be pragmatic and not allow any changes.





Re: [DISCUSS] Spark 2.5 release

rxin
Because, for example, we'd need to move the location of InternalRow, breaking the package name. If you insist we shouldn't change the unstable temporary API in 3.x to maintain compatibility with 3.0, which is totally different from my understanding of the situation when you exposed it, then I'd say we should gate 3.0 on having a stable row interface.

I also don't get this backporting a giant feature to 2.x line ... as suggested by others in the thread, DSv2 would be one of the main reasons people upgrade to 3.0. What's so special about DSv2 that we are doing this? Why not abandon 3.0 entirely and backport all the features to 2.x?





Re: [DISCUSS] Spark 2.5 release

Ryan Blue
> If you insist we shouldn't change the unstable temporary API in 3.x . . .

Not what I'm saying at all. I said we should carefully consider whether a breaking change is the right decision in the 3.x line.

All I'm suggesting is that we can make a 2.5 release with the feature and an API that is the same as the one in 3.0.

> I also don't get this backporting a giant feature to 2.x line

I am planning to do this so we can use DSv2 before 3.0 is released. Then we can have a source implementation that works in both 2.x and 3.0 to make the transition easier. Since I'm already doing the work, I'm offering to share it with the community.


On Sat, Sep 21, 2019 at 2:36 PM Reynold Xin <[hidden email]> wrote:
Because, for example, we'd need to move the location of InternalRow, breaking the package name. If you insist we shouldn't change the unstable temporary API in 3.x to maintain compatibility with 3.0, which is totally different from my understanding of the situation when you exposed it, then I'd say we should gate 3.0 on having a stable row interface.

I also don't get this backporting a giant feature to 2.x line ... as suggested by others in the thread, DSv2 would be one of the main reasons people upgrade to 3.0. What's so special about DSv2 that we are doing this? Why not abandon 3.0 entirely and backport all the features to 2.x?



On Sat, Sep 21, 2019 at 2:31 PM, Ryan Blue <[hidden email]> wrote:
Why would that require an incompatible change?

We *could* make an incompatible change and remove support for InternalRow, but I think we would want to carefully consider whether that is the right decision. And in any case, we would be able to keep 2.5 and 3.0 compatible, which is the main goal.

On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin <[hidden email]> wrote:
How would you not make incompatible changes in 3.x? As discussed the InternalRow API is not stable and needs to change. 

On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue <[hidden email]> wrote:
> Making downstream to diverge their implementation heavily between minor versions (say, 2.4 vs 2.5) wouldn't be a good experience

You're right that the API has been evolving in the 2.x line. But, it is now reasonably stable with respect to the current feature set and we should not need to break compatibility in the 3.x line. Because we have reached our goals for the 3.0 release, we can backport at least those features to 2.x and confidently have an API that works in both a 2.x release and is compatible with 3.0, if not 3.1 and later releases as well.

> I'd rather say preparation of Spark 2.5 should be started after Spark 3.0 is officially released

The reason I'm suggesting this is that I'm already going to do the work to backport the 3.0 release features to 2.4. I've been asked by several people when DSv2 will be released, so I know there is a lot of interest in making this available sooner than 3.0. If I'm already doing the work, then I'd be happy to share that with the community.

I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5 while preparing the 3.0 preview and fixing bugs. For DSv2, the work is about complete so we can easily release the same set of features and API in 2.5 and 3.0.

If we decide for some reason to wait until after 3.0 is released, I don't know that there is much value in a 2.5. The purpose is to be a step toward 3.0, and releasing that step after 3.0 doesn't seem helpful to me. It also wouldn't get these features out any sooner than 3.0, as a 2.5 release probably would, given the work needed to validate the incompatible changes in 3.0.

> DSv2 change would be the major backward incompatibility which Spark 2.x users may hesitate to upgrade

As I pointed out, DSv2 has been changing in the 2.x line, so this is expected. I don't think it will need incompatible changes in the 3.x line.

On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim <[hidden email]> wrote:
Just 2 cents, I haven't tracked the change of DSv2 (though I needed to deal with this as the change made confusion on my PRs...), but my bet is that DSv2 would be already changed in incompatible way, at least who works for custom DataSource. Making downstream to diverge their implementation heavily between minor versions (say, 2.4 vs 2.5) wouldn't be a good experience - especially we are not completely closed the chance to further modify DSv2, and the change could be backward incompatible.

If we really want to bring the DSv2 change to 2.x version line to let end users avoid forcing to upgrade Spark 3.x to enjoy new DSv2, I'd rather say preparation of Spark 2.5 should be started after Spark 3.0 is officially released, honestly even later than that, say, getting some reports from Spark 3.0 about DSv2 so that we feel DSv2 is OK. I hope we don't make Spark 2.5 be a kind of "tech-preview" which Spark 2.4 users may be frustrated to upgrade to next minor version.

Btw, do we have any specific target users for this? Personally DSv2 change would be the major backward incompatibility which Spark 2.x users may hesitate to upgrade, so they might be already prepared to migrate to Spark 3.0 if they are prepared to migrate to new DSv2.

On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun <[hidden email]> wrote:
Do you mean you want to have a breaking API change between 3.0 and 3.1?
I believe we follow Semantic Versioning ( https://spark.apache.org/versioning-policy.html ).

> We just won’t add any breaking changes before 3.1.

Bests,
Dongjoon.


On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <[hidden email]> wrote:

I don’t think we need to gate a 3.0 release on making a more stable version of InternalRow

Sounds like we agree, then. We will use it for 3.0, but there are known problems with it.

Thinking we’d have dsv2 working in both 3.x (which will change and progress towards more stable, but will have to break certain APIs) and 2.x seems like a false premise.

Why do you think we will need to break certain APIs before 3.0?

I’m only suggesting that we release the same support in a 2.5 release that we do in 3.0. Since we are nearly finished with the 3.0 goals, it seems like we can certainly do that. We just won’t add any breaking changes before 3.1.


On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <[hidden email]> wrote:
I don't think we need to gate a 3.0 release on making a more stable version of InternalRow, but thinking we'd have dsv2 working in both 3.x (which will change and progress towards more stable, but will have to break certain APIs) and 2.x seems like a false premise.

To point out some problems with InternalRow, which you think is already pragmatic and stable:


/**
* Catalyst is a library for manipulating relational query plans.  All classes in catalyst are
* considered an internal API to Spark SQL and are subject to change between minor releases.
*/

There is not even any annotation on the interface.

The entire dependency chain was created to be private and is tightly coupled with internal implementations. For example:


/**
* A UTF-8 String for internal Spark use.
* <p>
* A String encoded in UTF-8 as an Array[Byte], which can be used for comparison,
* search, see http://en.wikipedia.org/wiki/UTF-8 for details.
* <p>
* Note: This is not designed for general use cases, should not be used outside SQL.
*/


(which again is in catalyst package)


If you want to argue this way, you might as well argue we should make the entire catalyst package public to be pragmatic and not allow any changes.




On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <[hidden email]> wrote:

> When you created the PR to make InternalRow public

This isn’t quite accurate. The change I made was to use InternalRow instead of UnsafeRow, which is a specific implementation of InternalRow. Exposing this API has always been a part of DSv2 and while both you and I did some work to avoid this, we are still in the phase of starting with that API.

Note that any change to InternalRow would be very costly to implement because this interface is widely used. That is why I think we can certainly consider it stable enough to use here, and that’s probably why UnsafeRow was part of the original proposal.

In any case, the goal for 3.0 was not to replace the use of InternalRow, it was to get the majority of SQL working on top of the interface added after 2.4. That’s done and stable, so I think a 2.5 release with it is also reasonable.


On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <[hidden email]> wrote:
To push back, while I agree we should not drastically change "InternalRow", there are a lot of changes that need to happen to make it stable. For example, none of the publicly exposed interfaces should be in the Catalyst package or the unsafe package. External implementations should be decoupled from the internal implementations, with cheap ways to convert back and forth.

When you created the PR to make InternalRow public, the understanding was to work towards making it stable in the future, assuming we will start with an unstable API temporarily. You can't just make a bunch of internal APIs that are tightly coupled with other internal pieces public and stable and call it a day, just because they happen to satisfy some use cases temporarily, assuming the rest of Spark doesn't change.



On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <[hidden email]> wrote:
> DSv2 is far from stable right?

No, I think it is reasonably stable and very close to being ready for a release.

> All the actual data types are unstable and you guys have completely ignored that.

I think what you're referring to is the use of `InternalRow`. That's a stable API and there has been no work to avoid using it. In any case, I don't think that anyone is suggesting that we delay 3.0 until a replacement for `InternalRow` is added, right?

While I understand the motivation for a better solution here, I think the pragmatic solution is to continue using `InternalRow`.

> If the goal is to make DSv2 work across 3.x and 2.x, that seems too invasive of a change to backport once you consider the parts needed to make dsv2 stable.

I believe that those of us working on DSv2 are confident about the current stability. We set goals for what to get into the 3.0 release months ago and have very nearly reached the point where we are ready for that release.

I don't think instability would be a problem in maintaining compatibility between the 2.5 version and the 3.0 version. If we find that we need to make API changes (other than additions), then we can make those in the 3.1 release. Because the goals we set for the 3.0 release have been reached with the current API, if we are ready to release 3.0, we can also release a 2.5 with the same API.

On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <[hidden email]> wrote:
DSv2 is far from stable right? All the actual data types are unstable and you guys have completely ignored that. We'd need to work on that and that will be a breaking change. If the goal is to make DSv2 work across 3.x and 2.x, that seems too invasive of a change to backport once you consider the parts needed to make dsv2 stable.



On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <[hidden email]> wrote:
Hi everyone,

In the DSv2 sync this week, we talked about a possible Spark 2.5 release based on the latest Spark 2.4, but with DSv2 and Java 11 support added.

A Spark 2.5 release with these two additions will help people migrate to Spark 3.0 when it is released because they will be able to use a single implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly, upgrading to 3.0 won't also require updating to Java 11 because users could update to Java 11 with the 2.5 release and have fewer major changes.

Another reason to consider a 2.5 release is that many people are interested in a release with the latest DSv2 API and support for DSv2 SQL. I'm already going to be backporting DSv2 support to the Spark 2.4 line, so it makes sense to share this work with the community.

This release line would just consist of backports like DSv2 and Java 11 that assist compatibility, to keep the scope of the release small. The purpose is to assist people moving to 3.0 and not distract from the 3.0 release.

Would a Spark 2.5 release help anyone else? Are there any concerns about this plan?


rb


--
Ryan Blue
Software Engineer
Netflix


