Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?


Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Dongjoon Hyun-2
Hi, All.

There was a discussion on publishing artifacts built with Hadoop 3.
However, we are still publishing with Hadoop 2.7.3, and `3.0-preview` will be the same because we haven't changed anything yet.

Technically, we need to change two places for publishing.

1. Jenkins Snapshot Publishing
    https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/

2. Release Snapshot/Release Publishing
    https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh

To minimize the change, we need to switch our default Hadoop profile.

Currently, the default is the `hadoop-2.7` (2.7.4) profile, and `hadoop-3.2` (3.2.0) is optional.
We should use the `hadoop-3.2` profile by default and keep `hadoop-2.7` as an option.

Note that this also means we would use Hive 2.3.6 by default; only the `hadoop-2.7` distribution would use Hive 1.2.1, like Apache Spark 2.4.x.
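
For reference, here is a minimal sketch of what the switch would mean for people building their own distributions. The profile names are the ones above; the exact `dev/make-distribution.sh` flags are an assumption and may differ per branch.

    # Proposed default: Hadoop 3.2 / Hive 2.3 artifacts.
    ./dev/make-distribution.sh --name hadoop3.2 --tgz \
        -Phadoop-3.2 -Phive -Phive-thriftserver -Pyarn

    # Optional build: Hadoop 2.7 / Hive 1.2, as in Spark 2.4.x.
    ./dev/make-distribution.sh --name hadoop2.7 --tgz \
        -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn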

Bests,
Dongjoon.

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Sean Owen-2
I'm OK with that, but don't have a strong opinion nor info about the implications.
That said, my guess is we're close to the point where we don't need to support Hadoop 2.x anyway, so, yeah.


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Xiao Li-2
The stability and quality of the Hadoop 3.2 profile are unknown. The changes are massive, including the Hive execution module and a new version of the Hive Thrift server.

To reduce the risk, I would like to keep the current default version unchanged. When it becomes stable, we can change the default profile to Hadoop 3.2.

Cheers,

Xiao  


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Dongjoon Hyun-2
Thank you for the feedback, Sean and Xiao.

Bests,
Dongjoon.


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Steve Loughran-2
In reply to this post by Xiao Li-2
What is the current default value? The 2.x releases are reaching EOL: 2.7 is dead, there might be a 2.8.x, and for now 2.9 is the branch-2 release getting attention. 2.10.0 shipped yesterday, but the ".0" means there will inevitably be surprises.

One issue with using older versions is that any problem reported (especially stack traces you can blame me for) will generally be met by a response of "does it go away when you upgrade?". The other issue is how much test coverage these versions are actually getting.

W.r.t. Hadoop 3.2 stability, nothing major has been reported. The ABFS client is there, and the big Guava update (HADOOP-16213) went in. People will either love or hate that.

No major changes in the S3A code between 3.2.0 and 3.2.1; I have a large backport planned, though, including changes to better handle AWS caching of 404s generated from HEAD requests made before an object was actually created.

It would be really good if the Spark distributions shipped with later versions of the Hadoop artifacts.
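
For anyone who wants to try that today, Spark's build already lets you override the Hadoop version; a rough sketch, where 2.9.2 is only an illustrative release:

    # Build Spark against a newer branch-2 Hadoop by overriding hadoop.version.
    ./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.9.2 -DskipTests clean package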


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Xiao Li-2
Hi, Steve, 

Thanks for your comments! My major quality concern is not about Hadoop 3.2 itself. In this release, the Hive execution module upgrade (from 1.2 to 2.3), the Hive Thrift server upgrade, and JDK 11 support are added to the Hadoop 3.2 profile only. Compared with the Hadoop 2.x profile, the Hadoop 3.2 profile is riskier due to these changes.

To speed up the adoption of Spark 3.0, which has many other highly desirable features, I am proposing to keep the Hadoop 2.x profile as the default.

Cheers,

Xiao.




Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Jiaxin Shan
+1 for Hadoop 3.2. It seems a lot of the cloud integration work Steve has done is only available in 3.2. We see lots of users asking for better S3A support in Spark.
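
As a rough illustration of what that looks like from the user side (the bucket name and credentials are placeholders, and the hadoop-aws version must match the Hadoop line Spark is built against):

    # Pull in the S3A connector matching Hadoop 3.2 and pass credentials through Spark's Hadoop config.
    spark-shell \
      --packages org.apache.hadoop:hadoop-aws:3.2.0 \
      --conf spark.hadoop.fs.s3a.access.key=PLACEHOLDER_KEY \
      --conf spark.hadoop.fs.s3a.secret.key=PLACEHOLDER_SECRET
    # Then, inside the shell: spark.read.text("s3a://my-bucket/some/path").count()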


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Dongjoon Hyun-2
Hi, Xiao.

How can JDK 11 support make the `hadoop-3.2` profile risky? We build and publish with JDK 8.

> In this release, the Hive execution module upgrade (from 1.2 to 2.3), the Hive Thrift server upgrade, and JDK 11 support are added to the Hadoop 3.2 profile only.

Since we build and publish with JDK 8 and the default runtime is still JDK 8, I don't think the `hadoop-3.2` profile is risky in that context.

As for JDK 11, the Hive 2.3.6 execution module still doesn't support JDK 11 when talking to a remote HiveMetastore.

So, among the above reasons, only the Hive execution module (Hive 2.3.6) could be the root cause of potential unknown issues.

In other words, Hive 1.2.1 is the one you consider stable, isn't it?

Although Hive 2.3.6 might not be proven in Apache Spark officially, we have also resolved several SPARK issues by upgrading Hive from 1.2.1 to 2.3.6.

Bests,
Dongjoon.




Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Koert Kuipers
In reply to this post by Sean Owen-2
I don't see how we can be close to the point where we don't need to support Hadoop 2.x. This does not agree with the reality from my perspective, which is that all our clients are on Hadoop 2.x; not a single one is on Hadoop 3.x currently. This includes deployments of Cloudera distros, Hortonworks distros, and cloud distros like EMR and Dataproc.

Forcing us to stay on older Spark versions would be unfortunate for us, and also bad for the community (as deployments like ours help find bugs in Spark).


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Dongjoon Hyun-2
Hi, Koert.

Could you be more specific about your Hadoop version requirement?

Although we will have a Hadoop 2.7 profile, support for Hadoop 2.6 and older is already officially dropped in Apache Spark 3.0.0. We cannot give you an answer for clusters on Hadoop 2.6 and older because we are not testing them at all.

Also, Steve already pointed out that Hadoop 2.7 is EOL. Following his advice, we might need to upgrade our Hadoop 2.7 profile to the latest 2.x. I'm wondering whether you are against that because of Hadoop 2.6 or older version support.

BTW, I'm one of the users of Hadoop 3.x clusters. It's already in use, and we are migrating more. Apache Spark 3.0 will arrive in 2020 (not today). We need to consider that, too. Do you have any migration plan for 2020?

In short, for clusters using Hadoop 2.6 and older, Apache Spark 2.4 is supported as an LTS version, and you will keep getting bug fixes. For Hadoop 2.7, Apache Spark 3.0 will still have a profile and a binary release. Making the Hadoop 3.2 profile the default is orthogonal to that.

Bests,
Dongjoon.



Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Koert Kuipers
Yes, I am not against Hadoop 3 becoming the default. I was just questioning the statement that we are close to dropping support for Hadoop 2.

We build our own Spark releases that we deploy on the clusters of our clients. These clusters run HDP 2.x, CDH 5, EMR, Dataproc, etc.

I am aware that the Hadoop 2.6 profile was dropped, and we are handling this in-house.

Given that the latest HDP 2.x is still on Hadoop 2.7, bumping the Hadoop 2 profile to the latest 2.x would probably be an issue for us.


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Xiao Li-2
In reply to this post by Dongjoon Hyun-2
The changes for JDK 11 support do not increase the risk of the Hadoop 3.2 profile.

The Hive 1.2.1 execution JARs are much more stable than the Hive 2.3.6 execution JARs, and the Thrift server changes are massive. We need more evidence of quality and stability before we switch the default to the Hadoop 3.2 profile. Adoption of Spark 3.0 is more important at the moment. I think we can switch the default profile in the Spark 3.1 or 3.2 release instead of Spark 3.0.



Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Nicholas Chammas
In reply to this post by Steve Loughran-2
On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran <[hidden email]> wrote:
It would be really good if the Spark distributions shipped with later versions of the Hadoop artifacts.

I second this. If we need to keep a Hadoop 2.x profile around, why not make it Hadoop 2.8 or something newer?

Koert Kuipers <[hidden email]> wrote:
Given that the latest HDP 2.x is still on Hadoop 2.7, bumping the Hadoop 2 profile to the latest 2.x would probably be an issue for us.

When was the last time HDP 2.x bumped their minor version of Hadoop? Do we want to wait for them to bump to Hadoop 2.8 before we do the same?

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Steve Loughran-2
In reply to this post by Koert Kuipers
I'd move Spark's branch-2 line to 2.9.x because:

(a) Spark's version of httpclient hits a bug in the AWS SDK used in hadoop-2.8 unless you revert that patch: https://issues.apache.org/jira/browse/SPARK-22919
(b) there's only one future version of 2.8.x planned, which is expected once I or someone else sits down to do it. After that, all CVEs will be dealt with by "upgrade".
(c) it's actually tested against Java 8, whereas versions <= 2.8 are nominally Java 7 only.
(d) Microsoft contributed a lot of Azure integration work.


To be fair, the fact that the 2.7 release has lasted so long is actually pretty impressive: core APIs stable, Kerberos under control, HDFS client and server happy (no erasure coding, among other things). The lack of support and performance for object store integration shows how much has changed since its release in April 2015. But that was a long time ago.

On Sat, Nov 2, 2019 at 4:36 PM Koert Kuipers <[hidden email]> wrote:
I don't see how we can be close to the point where we don't need to support Hadoop 2.x. This does not agree with the reality from my perspective, which is that all our clients are on Hadoop 2.x; not a single one is on Hadoop 3.x currently.

Maybe, but they are unlikely to be on a "vanilla" 2.7.x release, except for some very special cases where teams have taken on the task of maintaining their own installation.

This includes deployments of Cloudera distros, Hortonworks distros,

In front of me I have a Git source tree whose repositories let me see the version histories of all of these, and HD/I too. This is a power (I can make changes to all) and a responsibility (I could accidentally break the nightly builds of all if I'm not careful (1)). The one thing it doesn't have is write access to ASF gitbox, but that's only to stop me accidentally pushing an internal HDP or CDH branch up to the ASF/GitHub repos (2).

CDH 5.x: Hadoop branch-2 with some S3A features backported from Hadoop branch-3 (i.e., S3Guard). I'd call it 2.8+, though I don't know it in detail.

HDP 2.6.x: again, 2.8+ with ABFS and GCS support.

Either way: when Spark 3.x ships, it'd be up to Cloudera to deal with that release.

I have no idea what is going to happen there. If other people want to test Spark 3.0.0 on those platforms, go for it, but do consider that the commercial on-premises clusters have had a Hadoop 3 option for 2+ years and that every month the age of those 2.x-based clusters increases. In cloud, things are transient, so it doesn't matter *at all*.

and cloud distros like EMR and Dataproc.


EMR is a closed-source fork of (Hadoop, HBase, Spark, ...) with its own S3 connector, whose source has never been seen other than in stack traces on Stack Overflow. Their problem (3).

HD/I: current, with Azure connectivity; doesn't worry about the rest.

Dataproc: no idea. Their GCS connector has been pretty stable. They build both branch-2 and branch-3.1 artifacts and run the FS contract tests to help catch regressions in our code and theirs.

For all those in-cloud deployments, if you say "the minimum version is the Hadoop 3.x artifacts", then when they offer Spark 3 they'll just do it with their build of the Hadoop 3 JARs. It's not like they have 1000+ node HDFS clusters to upgrade.
 
Forcing us to stay on older Spark versions would be unfortunate for us, and also bad for the community (as deployments like ours help find bugs in Spark).


Bear in mind also: because all the work on Hadoop, Hive, HBase, etc. goes on in branch-3 code, compatibility with those things ages too. If you are worried about Hive, well, you need to be working with their latest releases to get any issues you find fixed.

It's a really hard choice here: stable dependencies versus newer ones. Certainly Hadoop stayed with an old version of Guava because the upgrade was so traumatic (it's changed now), and as for protobuf, that was so traumatic that everyone left it frozen until last month (3.3, not 3.2.x, and protoc is done in Java/Maven). At the same time, CVEs force Jackson updates on a fortnightly basis, and the move to Java 11 breaks so much that it's a big upgrade festival for us all.

You're going to have to consider "how much suffering with Hadoop 2.7 support is justified?" and "what should be the version that is actually shipped for people to play with?". I think my stance is clear: time to move on. You cut your test matrix in half, you can be confident all users reporting bugs will be on Hadoop 3.x, and when you do file bugs with your peer ASF projects, they won't get closed as WONTFIX.


BTW, out of curiosity, what versions of things does Databricks build against: ASF 2.7.x or something later?

-Steve


(1) Narrator: he has accidentally broken the nightly builds of most of these. And IBM WebSphere once. Breaking Google Cloud is still an unrealised ambition.
(2) Narrator: he has accidentally pushed an internal branch up to the ASF/GitHub repos. Colleagues were unhappy.
(3) Pro: they don't have to worry about me directly breaking their S3 integration. Con: I could still indirectly do it elsewhere in the source tree, wouldn't notice, and probably wouldn't care much if they complained.
 

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Steve Loughran-2
In reply to this post by Nicholas Chammas


On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas <[hidden email]> wrote:
On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran <[hidden email]> wrote:
It would be really good if the Spark distributions shipped with later versions of the Hadoop artifacts.

I second this. If we need to keep a Hadoop 2.x profile around, why not make it Hadoop 2.8 or something newer?

Go for 2.9.

Koert Kuipers <[hidden email]> wrote:
Given that the latest HDP 2.x is still on Hadoop 2.7, bumping the Hadoop 2 profile to the latest 2.x would probably be an issue for us.

When was the last time HDP 2.x bumped their minor version of Hadoop? Do we want to wait for them to bump to Hadoop 2.8 before we do the same?

The internal builds of CDH and HDP are not those of ASF 2.7.x; a really large proportion of the later branch-2 patches are backported. 2.7 was left behind a long time ago.


 

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Koert Kuipers
I get that CDH and HDP backport a lot and in that way left 2.7 behind, but they kept the public APIs stable at the 2.7 level, because that's kind of the point. Aren't those the Hadoop APIs Spark uses?


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Cheng Lian

Similar to Xiao, my major concern about making Hadoop 3.2 the default Hadoop version is quality control. The current hadoop-3.2 profile covers too many major component upgrades, i.e.:

  • Hadoop 3.2
  • Hive 2.3
  • JDK 11

We have already found and fixed some feature and performance regressions related to these upgrades. Empirically, I’m not surprised at all if more regressions are lurking somewhere. On the other hand, we do want help from the community to evaluate and stabilize these new changes. With that in mind, I’d like to propose:

  1. Introduce a new profile hive-2.3 to enable (hopefully) less risky Hadoop/Hive/JDK version combinations.

    This new profile allows us to decouple Hive 2.3 from the hadoop-3.2 profile, so that users may try out some less risky Hadoop/Hive/JDK combinations: if you only want Hive 2.3 and/or JDK 11, you don’t need to face potential regressions introduced by the Hadoop 3.2 upgrade.

    Yuming Wang has already sent out PR #26533 to exercise the Hadoop 2.7 + Hive 2.3 + JDK 11 combination (this PR does not have the hive-2.3 profile yet), and the result looks promising: the Kafka streaming and Arrow related test failures should be irrelevant to the topic discussed here.

    After decoupling Hive 2.3 and Hadoop 3.2, I don’t think it makes a lot of difference whether Hadoop 2.7 or Hadoop 3.2 is the default Hadoop version. Users who are still using Hadoop 2.x in production will have to use a hadoop-provided prebuilt package or build Spark 3.0 against their own 2.x version anyway. It does make a difference for cloud users who don’t use Hadoop at all, though. And this probably also helps to stabilize the Hadoop 3.2 code path faster, since our PR builder will exercise it regularly.

  2. Defer Hadoop 2.x upgrade to Spark 3.1+

    I personally do want to bump our Hadoop 2.x version to 2.9 or even 2.10. Steve has already stated the benefits very well. My worry here is still quality control: Spark 3.0 has already had tons of changes and major component version upgrades that are subject to all kinds of known and hidden regressions. Having Hadoop 2.7 there provides us a safety net, since it’s proven to be stable. To me, it’s much less risky to upgrade Hadoop 2.7 to 2.9/2.10 after we stabilize the Hadoop 3.2/Hive 2.3 combinations in the next 1 or 2 Spark 3.x releases.

Cheng


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Cheng Lian
Cc Yuming, Steve, and Dongjoon


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Dongjoon Hyun-2
Thank you for the suggestion.

Having a `hive-2.3` profile sounds good to me because it's orthogonal to Hadoop 3.
IIRC, it was originally proposed that way, but we put it under `hadoop-3.2` to avoid adding new profiles at the time.

And I'm wondering if you are considering additional pre-built distributions and Jenkins jobs.

Bests,
Dongjoon.




Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

cloud0fan
Do we have a limit on the number of pre-built distributions? It seems this time we need:
1. hadoop 2.7 + hive 1.2
2. hadoop 2.7 + hive 2.3
3. hadoop 3 + hive 2.3

AFAIK we always build with JDK 8 (but make it JDK 11 compatible), so we don't need to add the JDK version to the combination. A rough sketch of the three builds follows below.
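
If the proposed `hive-2.3` profile lands, the three combinations might look roughly like this. The `hive-2.3` profile name is an assumption based on the proposal above, not a final name, and the make-distribution flags may differ per branch:

    # 1. hadoop 2.7 + hive 1.2 (today's default combination)
    ./dev/make-distribution.sh --name hadoop2.7-hive1.2 --tgz -Phadoop-2.7 -Phive -Phive-thriftserver

    # 2. hadoop 2.7 + hive 2.3 (assumes the proposed hive-2.3 profile)
    ./dev/make-distribution.sh --name hadoop2.7-hive2.3 --tgz -Phadoop-2.7 -Phive-2.3 -Phive -Phive-thriftserver

    # 3. hadoop 3 + hive 2.3
    ./dev/make-distribution.sh --name hadoop3.2-hive2.3 --tgz -Phadoop-3.2 -Phive -Phive-thriftserver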
