Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Dongjoon Hyun-2
Hi, All.

First of all, I want to put this as a policy issue instead of a technical issue.
Also, this is orthogonal from `hadoop` version discussion.

The Apache Spark community has kept (not maintained) the forked Apache Hive
1.2.1 because there were no other options before. As we can see in
SPARK-20202, this is not a desirable situation among Apache projects.

    https://issues.apache.org/jira/browse/SPARK-20202
Also, please note that I say we `kept`, not `maintained`, because we know it's not good.
There were several attempts to update that forked repository
for various reasons (Hadoop 3 support is one example),
but those attempts were also turned down.

From Apache Spark 3.0, it seems that we have a new feasible option, the
`hive-2.3` profile. What about moving further in this direction?

For example, can we officially and completely remove the usage of the forked
`hive` in Apache Spark 3.0? If someone still needs the forked `hive`, we can
have a `hive-1.2` profile. Of course, it should not be the default profile in the community.
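As a sketch of what the build-side switch could look like (the `hive-2.3` profile already exists on master; the `hive-1.2` profile is the one proposed here, so the second command is hypothetical until that profile is added):

```shell
# Build against the official Apache Hive 2.3 line (proposed default):
./build/mvn -DskipTests -Phive -Phive-thriftserver -Phive-2.3 clean package

# Opt back into the forked Hive 1.2.1 via the proposed, non-default
# `hive-1.2` profile (hypothetical until it lands in the POM):
./build/mvn -DskipTests -Phive -Phive-thriftserver -Phive-1.2 clean package
```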

I want to say this is a goal we should achieve someday.
If we don't do anything, nothing happens. At the very least, we need to prepare for this.
Without any preparation, Spark 3.1+ will be in the same situation.

Shall we focus on what our actual problems with Hive 2.3.6 are?
If the only reason is that we haven't used it before, we can release another
`3.0.0-preview` for that.

Bests,
Dongjoon. 

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Hyukjin Kwon
I struggled hard to deal with this issue multiple times over a year, and thankfully we finally
decided to use the official version of Hive 2.3.x too (thank you, Yuming, Alan, and everyone).
I think it is already huge progress that we have started to use the official version of Hive.

I think we should have at least one minor release cycle to let users test out Spark with Hive 2.3.x before switching it
to the default. My impression was that this decision was made before, at:
http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Upgrade-built-in-Hive-to-2-3-4-td26153.html

How about we try to make it the default in Spark 3.1, using this thread as a reference? For 3.0, I think it's too radical a change.



Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Sean Owen-2
In reply to this post by Dongjoon Hyun-2
Just to clarify, as even I have lost the details over time: hadoop-2.7
works with hive-2.3? It isn't tied to hadoop-3.2?
Roughly how much risk is there in using the Hive 1.x fork over Hive
2.x for end users using Hive via Spark?
I don't have a strong opinion, other than sharing the view that we
have to dump the Hive 1.x fork at the first opportunity.
The question is simply how much risk that entails, keeping in mind that
Spark 3.0 is already something that people understand works
differently. We can accept some behavior changes.




Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Dongjoon Hyun-2
Thank you for the feedback, Hyukjin and Sean.

I proposed a `preview-2` for that purpose, but I'm also +1 for doing that in 3.1
if we can decide to eliminate the illegitimate Hive fork reference
immediately after the `branch-3.0` cut.

Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`.


The way I see this, it's not a user problem. The Apache Spark community hasn't tried to drop the illegitimate Hive fork yet.
We need to drop it ourselves because we created it, and it's our fault.

Bests,
Dongjoon.




Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Cheng Lian
Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor release to stabilize the Hive 2.3 code paths before retiring the Hive 1.2 fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still buggy in terms of JDK 11 support. (BTW, I just found that our root POM is referring to both Hive 2.3.6 and 2.3.5 at the moment; see here and here.)

Again, I'm happy to get rid of ancient legacy dependencies like Hadoop 2.7 and the Hive 1.2 fork, but I do believe we need a safety net for Spark 3.0. As for preview releases, I'm afraid their visibility is not good enough to cover such major upgrades.


Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Dongjoon Hyun-2
Hi, Cheng.

This is irrelevant to JDK 11 and Hadoop 3; I'm talking about the JDK 8 world.
If we do consider them, the comparison looks like the following.

+----------+-----------------+--------------------+
|          | Hive 1.2.1 fork |  Apache Hive 2.3.6 |
+----------+-----------------+--------------------+
|Legitimate|        X        |         O          |
|JDK11     |        X        |         O          |
|Hadoop3   |        X        |         O          |
|Hadoop2   |        O        |         O          |
|Functions |     Baseline    |       More         |
|Bug fixes |     Baseline    |       More         |
+----------+-----------------+--------------------+

To stabilize Spark's Hive 2.3 usage, we should use it ourselves (including Jenkins/GitHub Actions/AppVeyor).

For me, the AS-IS 3.0 is not enough for that. Following your advice,
to give more visibility to the whole community:

1. We need to provide an additional `hadoop-2.7 with Hive 2.3` pre-built distribution.
2. We need to switch the default Hive usage to 2.3 in `master` for 3.1 after the `branch-3.0` cut.

I know that we have been reluctant to do (1) and (2) because of the burden.
But it's time to prepare. Without them, we are going to fall short again and again.

Bests,
Dongjoon.





Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Dongjoon Hyun-2
BTW, `hive.version.short` is a directory name; we are using 2.3.6 only.

For the directory names, we use '1.2.1' and '2.3.5' because we delayed renaming the directories until the 3.0.0 deadline to minimize the diff.

We can rename them right now if we want.
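For illustration, the relevant root-POM fragment could look roughly like this (a sketch: the two property names match what is discussed above, but the comments and the example path are my own, and the surrounding POM structure is abbreviated):

```xml
<properties>
  <!-- The actual Hive artifact version Spark depends on -->
  <hive.version>2.3.6</hive.version>
  <!-- Only selects the versioned source sub-directory (e.g. a v2.3.5/
       folder), so it can lag behind hive.version until the directory
       itself is renamed -->
  <hive.version.short>2.3.5</hive.version.short>
</properties>
```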




Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Cheng Lian
Hmm, what exactly did you mean by "remove the usage of forked `hive` in Apache Spark 3.0 completely officially"? I thought you wanted to remove the forked Hive 1.2 dependencies completely, no? As long as we still keep Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a particular preference between using Hive 1.2 or 2.3 as the default Hive version. After all, end-users and providers who need a particular version combination can always build Spark with the proper profiles themselves.

And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that it's due to the folder name.


Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Cheng Lian
It's kind of like a Scala version upgrade. Historically, we only remove support for an older Scala version once the newer version has proven stable over one or more Spark minor versions.

On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian <[hidden email]> wrote:
Hmm, what exactly did you mean by "remove the usage of forked `hive` in Apache Spark 3.0 completely officially"? I thought you wanted to remove the forked Hive 1.2 dependencies completely, no? As long as we still keep the Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a particular preference between using Hive 1.2 or 2.3 as the default Hive version. After all, for end-users and providers who need a particular version combination, they can always build Spark with proper profiles themselves.

And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that it's due to the folder name.

On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun <[hidden email]> wrote:
BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.

For directory name, we use '1.2.1' and '2.3.5' because we just delayed the renaming the directories until 3.0.0 deadline to minimize the diff.

We can replace it immediately if we want right now.



On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun <[hidden email]> wrote:
Hi, Cheng.

This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8 world.
If we consider them, it could be the followings.

+----------+-----------------+--------------------+
|          | Hive 1.2.1 fork |  Apache Hive 2.3.6 |
+-------------------------------------------------+
|Legitimate|        X        |         O          |
|JDK11     |        X        |         O          |
|Hadoop3   |        X        |         O          |
|Hadoop2   |        O        |         O          |
|Functions |     Baseline    |       More         |
|Bug fixes |     Baseline    |       More         |
+-------------------------------------------------+

To stabilize Spark's Hive 2.3 usage, we should use it by ourselves (including Jenkins/GitHubAction/AppVeyor).

For me, AS-IS 3.0 is not enough for that. According to your advices,
to give more visibility to the whole community,

1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built distribution
2. We need to switch our default Hive usage to 2.3 in `master` for 3.1 after `branch-3.0` branch cut.

I know that we have been reluctant to (1) and (2) due to its burden.
But, it's time to prepare. Without them, we are going to be insufficient again and again.

Bests,
Dongjoon.




On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian <[hidden email]> wrote:
Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor release to stabilize Hive 2.3 code paths before retiring the Hive 1.2 fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still buggy in terms of JDK 11 support. (BTW, I just found that our root POM is referring both Hive 2.3.6 and 2.3.5 at the moment, see here and here.)

Again, I'm happy to get rid of ancient legacy dependencies like Hadoop 2.7 and the Hive 1.2 fork, but I do believe that we need a safety net for Spark 3.0. For preview releases, I'm afraid that their visibility is not good enough for covering such major upgrades.

On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun <[hidden email]> wrote:
Thank you for feedback, Hyujkjin and Sean.

I proposed `preview-2` for that purpose but I'm also +1 for do that at 3.1
if we can make a decision to eliminate the illegitimate Hive fork reference
immediately after `branch-3.0` cut.

Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`.


The way I see this is that it's not a user problem. The Apache Spark community hasn't tried to drop the illegitimate Hive fork yet.
We need to drop it ourselves because we created it, and it's our fault.

Bests,
Dongjoon.



On Tue, Nov 19, 2019 at 5:06 AM Sean Owen <[hidden email]> wrote:
Just to clarify, as even I have lost the details over time: hadoop-2.7
works with hive-2.3? it isn't tied to hadoop-3.2?
Roughly how much risk is there in using the Hive 1.x fork over Hive
2.x, for end users using Hive via Spark?
I don't have a strong opinion, other than sharing the view that we
have to dump the Hive 1.x fork at the first opportunity.
Question is simply how much risk that entails. Keeping in mind that
Spark 3.0 is already something that people understand works
differently. We can accept some behavior changes.

On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun <[hidden email]> wrote:
>
> Hi, All.
>
> First of all, I want to put this as a policy issue instead of a technical issue.
> Also, this is orthogonal from `hadoop` version discussion.
>
> The Apache Spark community has kept (not maintained) the forked Apache Hive
> 1.2.1 because there were no other options before. As we can see in
> SPARK-20202, it's not a desirable situation among Apache projects.
>
>     https://issues.apache.org/jira/browse/SPARK-20202
>
> Also, please note that we `kept`, not `maintained`, because we know it's not good.
> There were several attempts to update that forked repository
> for several reasons (Hadoop 3 support is one example),
> but those attempts were also turned down.
>
> From Apache Spark 3.0, it seems that we have a new feasible option:
> the `hive-2.3` profile. What about moving further in this direction?
>
> For example, can we remove the usage of forked `hive` in Apache Spark 3.0
> completely officially? If someone still needs to use the forked `hive`, we can
> have a profile `hive-1.2`. Of course, it should not be a default profile in the community.
>
> I want to say this is a goal we should achieve someday.
> If we don't do anything, nothing happens. At least we need to prepare for this.
> Without any preparation, Spark 3.1+ will be the same.
>
> Shall we focus on what our problems with Hive 2.3.6 are?
> If the only reason is that we didn't use it before, we can release another
> `3.0.0-preview` for that.
>
> Bests,
> Dongjoon.

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Dongjoon Hyun-2
Yes. It does. I meant SPARK-20202.

Thanks. I understand that it can be considered like the Scala version issue.
So that's the reason I put this as a `policy` issue from the beginning.

> First of all, I want to put this as a policy issue instead of a technical issue.

From a policy perspective, we should remove this immediately once we have a solution to fix it.
For now, I set `Target Versions` of SPARK-20202 to `3.1.0` according to the current discussion status.


And, if there are no other issues, I'll create a PR to remove it from the `master` branch when we cut `branch-3.0`.

For the additional `hadoop-2.7 with Hive 2.3` pre-built distribution, what do you think, Sean?
The preparation has already started in another email thread, and I believe it is the keystone to proving `Hive 2.3` version stability
(which Cheng/Hyukjin/you asked for).

Bests,
Dongjoon.


On Tue, Nov 19, 2019 at 2:09 PM Cheng Lian <[hidden email]> wrote:
It's kinda like a Scala version upgrade. Historically, we only remove the support of an older Scala version when the newer version is proven to be stable after one or more Spark minor versions.

On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian <[hidden email]> wrote:
Hmm, what exactly did you mean by "remove the usage of forked `hive` in Apache Spark 3.0 completely officially"? I thought you wanted to remove the forked Hive 1.2 dependencies completely, no? As long as we still keep Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a particular preference between using Hive 1.2 or 2.3 as the default Hive version. After all, end-users and providers who need a particular version combination can always build Spark with the proper profiles themselves.

And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that it's due to the folder name.

On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun <[hidden email]> wrote:
BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.

For the directory names, we use '1.2.1' and '2.3.5' because we delayed renaming the directories until the 3.0.0 deadline to minimize the diff.

We can rename them right away if we want.



On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun <[hidden email]> wrote:
Hi, Cheng.

This is irrelevant to JDK11 and Hadoop 3; I'm talking about the JDK8 world.
If we consider them, the comparison looks like the following.

+----------+-----------------+--------------------+
|          | Hive 1.2.1 fork |  Apache Hive 2.3.6 |
+-------------------------------------------------+
|Legitimate|        X        |         O          |
|JDK11     |        X        |         O          |
|Hadoop3   |        X        |         O          |
|Hadoop2   |        O        |         O          |
|Functions |     Baseline    |       More         |
|Bug fixes |     Baseline    |       More         |
+-------------------------------------------------+

To stabilize Spark's Hive 2.3 usage, we should use it by ourselves (including Jenkins/GitHubAction/AppVeyor).

For me, AS-IS 3.0 is not enough for that. According to your advices,
to give more visibility to the whole community,

1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built distribution
2. We need to switch our default Hive usage to 2.3 in `master` for 3.1 after `branch-3.0` branch cut.

I know that we have been reluctant to (1) and (2) due to its burden.
But, it's time to prepare. Without them, we are going to be insufficient again and again.

Bests,
Dongjoon.





Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Sean Owen-2
Same idea? support this combo in 3.0 and then remove Hadoop 2 support
in 3.1 or something? or at least make them non-default, not
necessarily publish special builds?

On Tue, Nov 19, 2019 at 4:53 PM Dongjoon Hyun <[hidden email]> wrote:
> For additional `hadoop-2.7 with Hive 2.3` pre-built distribution, how do you think about this, Sean?
> The preparation is already started in another email thread and I believe that is a keystone to prove `Hive 2.3` version stability
> (which Cheng/Hyukjin/you asked).
>

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Cheng Lian
In reply to this post by Dongjoon Hyun-2
Thanks for taking care of this, Dongjoon!

We can target SPARK-20202 at 3.1.0, but I don't think we should do it immediately after cutting branch-3.0. The Hive 1.2 code paths can only be removed once the Hive 2.3 code paths are proven to be stable. If Hive 2.3 turns out to be buggy in Spark 3.1, we may want to postpone SPARK-20202 further, to 3.2.0.


Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Hyukjin Kwon
So, are we able to conclude our plans as below?

1. In Spark 3.0, we release:
  - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works with JDK 11
  - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works with JDK 11
  - Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)

2. In Spark 3.1, we target:
  - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works with JDK 11
  - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works with JDK 11 (default)

3. We avoid removing the "Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)" combo right away after cutting branch-3.0, to see whether Hive 2.3 is considered stable in general.
    I roughly suspect that would be a couple of months after the Spark 3.0 release (?).

BTW, maybe we should officially note that the "Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)" combination is deprecated in Spark 3 anyway.
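If it helps, the three Spark 3.0 combinations above would roughly correspond to profile selections like the following. This is a sketch; the profile names are assumptions taken from this thread and may not match the final build exactly. (JDK 8 vs JDK 11 adds no build permutation; all builds target Java 8.)

```shell
# Hypothetical profile selections for the combos listed above.
./build/mvn -Phadoop-3.2 -Phive -Phive-2.3 -DskipTests package  # Hadoop 3.2 + Hive 2.3
./build/mvn -Phadoop-2.7 -Phive -Phive-2.3 -DskipTests package  # Hadoop 2.7 + Hive 2.3
./build/mvn -Phadoop-2.7 -Phive -DskipTests package             # Hadoop 2.7 + Hive 1.2.1 fork (default)
```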




Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Sean Owen-2
Should Hadoop 2 + Hive 2 be considered to work on JDK 11? I wasn't
sure if 2.7 did, but honestly I've lost track.
Anyway, it doesn't matter much as the JDK doesn't cause another build
permutation. All are built targeting Java 8.

I also don't know if we have to declare a binary release a default.
The published POM will be agnostic to Hadoop / Hive; well, it will
link against a particular version but can be overridden. That's what
you're getting at?





Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Hyukjin Kwon
> Should Hadoop 2 + Hive 2 be considered to work on JDK 11?
This seems to be under investigation in Yuming's PR (https://github.com/apache/spark/pull/26533), if I am not mistaken.

Oh, yes, what I meant by (default) was the default profiles we will use in Spark.



Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Dongjoon Hyun-2
Cheng, could you elaborate on your criteria, `Hive 2.3 code paths are proven to be stable`?
For me, it's difficult to imagine that we can reach any stable situation when we don't use it at all ourselves.

    > The Hive 1.2 code paths can only be removed once the Hive 2.3 code paths are proven to be stable.

Sean, our published POM is pointing at and advertising the illegitimate Hive 1.2 fork as a compile dependency.
Yes, it can be overridden. So why does Apache Spark need to publish it like that?
If someone wants to use that illegitimate Hive 1.2 fork, let them override it. We are unable to delete the illegitimate Hive 1.2 fork artifacts;
they will remain orphans.

    > The published POM will be agnostic to Hadoop / Hive; well,
    > it will link against a particular version but can be overridden. 
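For concreteness, the override being discussed would look roughly like this when building Spark itself. This is a sketch; the `hive.group` and `hive.version` Maven property names are assumptions about the root POM and should be double-checked before use.

```shell
# Hypothetical: build Spark against Apache Hive instead of the fork by
# overriding the Maven properties that the hive profiles normally set.
./build/mvn -Phive -Dhive.group=org.apache.hive -Dhive.version=2.3.6 \
    -DskipTests package
```

A downstream application, by contrast, would override the dependency in its own build file rather than on Spark's command line.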


Bests,
Dongjoon.


On Tue, Nov 19, 2019 at 5:26 PM Hyukjin Kwon <[hidden email]> wrote:
> Should Hadoop 2 + Hive 2 be considered to work on JDK 11?
This seems being investigated by Yuming's PR (https://github.com/apache/spark/pull/26533) if I am not mistaken.

Oh, yes, what I meant by (default) was the default profiles we will use in Spark.


2019년 11월 20일 (수) 오전 10:14, Sean Owen <[hidden email]>님이 작성:
Should Hadoop 2 + Hive 2 be considered to work on JDK 11? I wasn't
sure if 2.7 did, but honestly I've lost track.
Anyway, it doesn't matter much as the JDK doesn't cause another build
permutation. All are built targeting Java 8.

I also don't know if we have to declare a binary release a default.
The published POM will be agnostic to Hadoop / Hive; well, it will
link against a particular version but can be overridden. That's what
you're getting at?


On Tue, Nov 19, 2019 at 7:11 PM Hyukjin Kwon <[hidden email]> wrote:
>
> So, are we able to conclude our plans as below?
>
> 1. In Spark 3,  we release as below:
>   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
>   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11
>   - Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)
>
> 2. In Spark 3.1, we target:
>   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
>   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11 (default)
>
> 3. Avoid removing the "Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)" combo right away after cutting branch-3, to see whether Hive 2.3 is considered stable in general.
>     I roughly suspect it would be a couple of months after the Spark 3.0 release (?).
>
> BTW, maybe we should officially note that "Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)" combination is deprecated anyway in Spark 3.
>
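The combinations above would map to Maven profile selections along these lines (a sketch; `hive-1.2` is the opt-in profile proposed in this thread, and the exact profile names are assumptions):

```shell
# Hadoop 3.2 + Hive 2.3, built with JDK 8
./build/mvn -Phadoop-3.2 -Phive-2.3 -DskipTests clean package

# Hadoop 2.7 + Hive 2.3
./build/mvn -Phadoop-2.7 -Phive-2.3 -DskipTests clean package

# Hadoop 2.7 + forked Hive 1.2.1 (proposed opt-in profile, not the default)
./build/mvn -Phadoop-2.7 -Phive-1.2 -DskipTests clean package
```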

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Sean Owen-2
Yes, good point. A user would get whatever the POM says without
profiles enabled so it matters.

Playing it out, an app _should_ compile with the Spark dependency
marked 'provided'. In that case the app that is spark-submit-ted is
agnostic to the Hive dependency as the only one that matters is what's
on the cluster. Right? we don't leak through the Hive API in the Spark
API. And yes it's then up to the cluster to provide whatever version
it wants. Vendors will have made a specific version choice when
building their distro one way or the other.
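
Sean's point about 'provided' scope, as a POM sketch (version numbers illustrative):

```xml
<!-- The app compiles against Spark but does not bundle it; at spark-submit
     time, the cluster's Spark (and its Hive) is what actually runs. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>3.0.0</version>
  <scope>provided</scope>
</dependency>
```

With this setup the app never ships a Hive jar of its own, so which Hive the published POM defaults to only matters in corner cases.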

If you run a Spark cluster yourself, you're using the binary distro,
and we're already talking about also publishing a binary distro with
this variation, so that's not the issue.

The corner cases where it might matter are:

- I unintentionally package Spark in the app and by default pull in
Hive 2 when I will deploy against Hive 1. But that's user error, and
causes other problems
- I run tests locally in my project, which will pull in a default
version of Hive defined by the POM

Double-checking, is that right? If so, it kind of implies it doesn't
matter. Which is an argument either way about what's the default. I
too would then prefer defaulting to Hive 2 in the POM. Am I missing
something about the implication?

(That fork will stay published forever anyway, that's not an issue per se.)

On Wed, Nov 20, 2019 at 1:40 AM Dongjoon Hyun <[hidden email]> wrote:
> Sean, our published POM is pointing at and advertising the illegitimate Hive 1.2 fork as a compile dependency.
> Yes, it can be overridden. So why does Apache Spark need to publish it like that?
> If someone wants to use that illegitimate Hive 1.2 fork, let them override it. We are unable to delete those illegitimate Hive 1.2 fork artifacts anyway.
> Those artifacts will be orphans.
>



Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Dongjoon Hyun-2
Yes. Right. That's the situation we are hitting and the result I expected.
We need to change our default to Hive 2 in the POM.

Dongjoon.


On Wed, Nov 20, 2019 at 5:20 AM Sean Owen <[hidden email]> wrote:

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Cheng Lian
Sean, thanks for the corner cases you listed. They make a lot of sense. Now I am inclined to have Hive 2.3 as the default version.

Dongjoon, apologies if I didn't make it clear before. What made me concerned initially was only the following part:

> can we remove the usage of forked `hive` in Apache Spark 3.0 completely officially?

So having Hive 2.3 as the default Hive version and adding a `hive-1.2` profile to keep the Hive 1.2.1 fork looks like a feasible approach to me. Thanks for starting the discussion!
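
A `hive-1.2` profile could be a thin switch over the Hive coordinates, roughly like this (the property names and the fork's group id are assumptions, following how Spark's POM parameterizes dependency versions):

```xml
<profile>
  <id>hive-1.2</id>
  <properties>
    <!-- Point the build at the forked artifacts instead of org.apache.hive -->
    <hive.group>org.spark-project.hive</hive.group>
    <hive.version>1.2.1.spark2</hive.version>
  </properties>
</profile>
```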

On Wed, Nov 20, 2019 at 9:46 AM Dongjoon Hyun <[hidden email]> wrote:

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Dongjoon Hyun-2
Thank you all.

I'll try to make JIRA and PR for that.

Bests,
Dongjoon.

On Wed, Nov 20, 2019 at 4:08 PM Cheng Lian <[hidden email]> wrote: