The Myth: the forked Hive 1.2.1 is stabler than XXX

Dongjoon Hyun-2
Hi, All.

I'm sending this email because it's important to discuss this topic narrowly
and reach a clear conclusion.

`The forked Hive 1.2.1 is stable`? It sounds like a myth we created
by ignoring the existing bugs. If you want to say the forked Hive 1.2.1 is
stabler than XXX, please give us the evidence. Then, we can fix it.
Otherwise, let's stop treating `The forked Hive 1.2.1` as invincible.

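Historically, the following forked Hive 1.2.1 has never been stable.
It's just frozen. Since the forked Hive is out of our control, we ignored bugs.
That's all. The reality is far from stable.

    https://mvnrepository.com/artifact/org.spark-project.hive/
    https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark (2015 August)
    https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark2 (2016 April)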


First, let's begin with Hive itself by comparing it with Apache Hive 1.2.2 and 1.2.3.

    Apache Hive 1.2.2 has 50 bug fixes.
    Apache Hive 1.2.3 has 9 bug fixes.

I will not cover all of them, but the Apache Hive community also backports
important patches, just as the Apache Spark community does.

Second, let's move to SPARK issues because we aren't exposed to all Hive issues.

    SPARK-19109 ORC metadata section can sometimes exceed protobuf message size limit
    SPARK-22267 Spark SQL incorrectly reads ORC file when column order is different

These have been reported since Apache Spark 1.6.x because the forked Hive lacks
proper upstream patches such as HIVE-11592 (fixed in Apache Hive 1.3.0).

Since we couldn't update the frozen forked Hive, we added the Apache ORC dependency
in SPARK-20682 (2.3.0), added a switching configuration in SPARK-20728 (2.3.0),
and turned on `spark.sql.hive.convertMetastoreOrc` by default in SPARK-22279 (2.4.0).
However, if you turn off the switch and start to use the forked Hive,
you will be exposed to the buggy forked Hive 1.2.1 again.
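
As a rough illustration of the switch described above, here is a minimal Python sketch that renders the two ORC-related settings as `spark-submit` flags. The helper name `orc_conf_flags` is hypothetical. `spark.sql.hive.convertMetastoreOrc` is the setting named above; `spark.sql.orc.impl` (values `native` or `hive`) is assumed here to be the switching configuration from SPARK-20728.

```python
from typing import List


def orc_conf_flags(use_native_orc: bool = True) -> List[str]:
    """Render spark-submit --conf flags for the ORC reader choice.

    Hypothetical helper for illustration only; it just formats strings
    and does not talk to Spark.
    """
    settings = {
        # "true": ORC tables from the Hive metastore are read with Spark's
        # own ORC data source instead of the Hive SerDe path.
        "spark.sql.hive.convertMetastoreOrc": str(use_native_orc).lower(),
        # Assumed to be the switch from SPARK-20728: "native" or "hive".
        "spark.sql.orc.impl": "native" if use_native_orc else "hive",
    }
    flags: List[str] = []
    for key, value in settings.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    # Turning both settings off is the case that falls back to the forked Hive.
    print(" ".join(orc_conf_flags(use_native_orc=False)))
```

Passing `use_native_orc=False` corresponds to turning the switch off, which is the situation the paragraph above warns about.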

Third, let's talk about the new features like Hadoop 3 and JDK11.
No one believes that the ancient forked Hive 1.2.1 will work with these.
I saw that the following issue is mentioned as evidence of a Hive 2.3.6 bug.

    SPARK-29245 ClassCastException during creating HiveMetaStoreClient

Yes. I know that issue because I reported it and verified HIVE-21508.
It's fixed already and will be released in Apache Hive 2.3.7.

Can we imagine something like this in the forked Hive 1.2.1?
'No'. There is no future for it. It's frozen.

From now on, I want to claim that the forked Hive 1.2.1 is the unstable one.
I welcome all your positive and negative opinions.
Please share your concerns and problems, and let's fix them together.
Apache Spark is an open source project we share.

Bests,
Dongjoon.


Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

Sean Owen-2
Yeah 'stable' is ambiguous. It's old and buggy, but at least it's the
same old and buggy that's been there a while: "stable" in that sense.
I'm sure there is a lot more delta between Hive 1 and 2 in terms of
bug fixes that are important; the question isn't just 1.x releases.

What I don't know is how much of that affects Spark, as it's a Hive client
mostly. Clearly some do.

I'd prefer making it the default in the POM for 3.0. Mostly on the
grounds that its effects are on deployed clusters, not apps. And
deployers can still choose a binary distro with 1.x or make the choice
they want. Those that don't care should probably be nudged to 2.x.
Spark 3.x is already full of behavior changes and 'unstable', so I
think this is minor relative to the overall risk question.

On Wed, Nov 20, 2019 at 12:53 PM Dongjoon Hyun <[hidden email]> wrote:

>
> Hi, All.
>
> I'm sending this email because it's important to discuss this topic narrowly
> and reach a clear conclusion.
>
> `The forked Hive 1.2.1 is stable`? It sounds like a myth we created
> by ignoring the existing bugs. If you want to say the forked Hive 1.2.1 is
> stabler than XXX, please give us the evidence. Then, we can fix it.
> Otherwise, let's stop treating `The forked Hive 1.2.1` as invincible.
>
> Historically, the following forked Hive 1.2.1 has never been stable.
> It's just frozen. Since the forked Hive is out of our control, we ignored bugs.
> That's all. The reality is far from stable.
>
>     https://mvnrepository.com/artifact/org.spark-project.hive/
>     https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark (2015 August)
>     https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark2 (2016 April)
>
> First, let's begin with Hive itself by comparing it with Apache Hive 1.2.2 and 1.2.3.
>
>     Apache Hive 1.2.2 has 50 bug fixes.
>     Apache Hive 1.2.3 has 9 bug fixes.
>
> I will not cover all of them, but the Apache Hive community also backports
> important patches, just as the Apache Spark community does.
>
> Second, let's move to SPARK issues because we aren't exposed to all Hive issues.
>
>     SPARK-19109 ORC metadata section can sometimes exceed protobuf message size limit
>     SPARK-22267 Spark SQL incorrectly reads ORC file when column order is different
>
> These have been reported since Apache Spark 1.6.x because the forked Hive lacks
> proper upstream patches such as HIVE-11592 (fixed in Apache Hive 1.3.0).
>
> Since we couldn't update the frozen forked Hive, we added the Apache ORC dependency
> in SPARK-20682 (2.3.0), added a switching configuration in SPARK-20728 (2.3.0),
> and turned on `spark.sql.hive.convertMetastoreOrc` by default in SPARK-22279 (2.4.0).
> However, if you turn off the switch and start to use the forked Hive,
> you will be exposed to the buggy forked Hive 1.2.1 again.
>
> Third, let's talk about the new features like Hadoop 3 and JDK11.
> No one believes that the ancient forked Hive 1.2.1 will work with these.
> I saw that the following issue is mentioned as evidence of a Hive 2.3.6 bug.
>
>     SPARK-29245 ClassCastException during creating HiveMetaStoreClient
>
> Yes. I know that issue because I reported it and verified HIVE-21508.
> It's fixed already and will be released in Apache Hive 2.3.7.
>
> Can we imagine something like this in the forked Hive 1.2.1?
> 'No'. There is no future for it. It's frozen.
>
> From now on, I want to claim that the forked Hive 1.2.1 is the unstable one.
> I welcome all your positive and negative opinions.
> Please share your concerns and problems, and let's fix them together.
> Apache Spark is an open source project we share.
>
> Bests,
> Dongjoon.
>


Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

Dongjoon Hyun-2
Thanks. That will be a giant step forward, Sean!

> I'd prefer making it the default in the POM for 3.0. 

Bests,
Dongjoon.


Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

Felix Cheung
Just to add - the Hive 1.2 fork is definitely not more stable. We know of a few critical bug fixes that we cherry-picked into a fork of that fork to maintain ourselves.




Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

Cheng Lian
Hey Dongjoon and Felix,

I totally agree that Hive 2.3 is more stable than Hive 1.2. Otherwise, we wouldn't even consider integrating with Hive 2.3 in Spark 3.0.

However, "Hive" and "Hive integration in Spark" are two quite different things, and I don't think anybody has ever mentioned "the forked Hive 1.2.1 is stable" in any recent Hadoop/Hive version discussions (at least I double-checked all my replies).

What I really care about is the stability and quality of "Hive integration in Spark", which has gone through some major updates due to the recent Hive 2.3 upgrade in Spark 3.0. We have already found bugs in this piece, and empirically, for a significant upgrade like this one, it would not be surprising if other bugs/regressions were found in the near future. On the other hand, the Hive 1.2 integration code path in Spark has been battle-tested for years. Yes, there are issues, but people have learned how to get along with these issues. And please don't forget that, for Spark 3.0 end-users who really don't want to interact with this Hive 1.2 fork, they can always use Hive 2.3 at their own risk.

True, "stable" is quite a vague criterion, and hard to prove. But that is exactly the reason why we may want to be conservative and wait for some time to see whether there are further signals suggesting that the Hive 2.3 integration in Spark 3.0 is unstable. After one or two Spark 3.x minor releases, if we've fixed all the outstanding issues and no more significant ones are showing up, we can declare that the Hive 2.3 integration in Spark 3.x is stable, and then we can consider removing references to the Hive 1.2 fork. Does that make sense?

Cheng


Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

Dongjoon Hyun-2
Nice. That's progress.

Let's narrow down the path. We need to clarify which criteria we can agree on.

1. What does `battle-tested for years` mean exactly?
    How and when can we start the `battle-tested` stage for Hive 2.3?

2. What is the new "Hive integration in Spark"?
    While introducing Hive 2.3, we fixed the compatibility stuff as you said.
    Most of the code is shared between Hive 1.2 and Hive 2.3.
    That means if there is a bug inside this shared code, both of them will be affected.
    Of course, we can fix this because it's Spark code. We will learn and fix it as you said.

    >  Yes, there are issues, but people have learned how to get along with these issues.

    The only non-shared code is the following.
    Do you have concerns about the following directories?
    If there are no bugs in the following codebase, can we switch?

    $ find . -name v2.3.5
    ./sql/core/v2.3.5
    ./sql/hive-thriftserver/v2.3.5

3. We know that we can keep both code bases, but the community should choose Hive 2.3 officially.
    That's the right choice from the perspective of Apache project policy. At least, Sean and I prefer that.
    If someone really wants to stick to the Hive 1.2 fork, they can use it at their own risk.

    > for Spark 3.0 end-users who really don't want to interact with this Hive 1.2 fork, they can always use Hive 2.3 at their own risk.

Specifically, what about at least having a `hive-1.2` profile in `3.0.0`, with Hive 2.3 as the default in the pom?
What do you think about that approach, Cheng?
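
To make the proposal above concrete, here is a small Python sketch of how a build invocation might differ between the default and an opt-in profile. The helper `spark_build_command` is hypothetical, and `hive-1.2` is the profile name proposed in this message; whether that profile exists in a given Spark checkout is an assumption. `-Phive` and `-Phive-thriftserver` are Spark's existing Hive-related Maven profiles.

```python
from typing import List, Optional


def spark_build_command(hive_profile: Optional[str] = None) -> List[str]:
    """Assemble a Maven command line for building Spark with Hive support.

    Hypothetical helper: it only builds the argument list and does not
    run the build.
    """
    cmd = ["./build/mvn", "-Phive", "-Phive-thriftserver"]
    if hive_profile is not None:
        # Opt in to a non-default Hive version, e.g. the proposed "hive-1.2".
        cmd.append(f"-P{hive_profile}")
    cmd += ["-DskipTests", "clean", "package"]
    return cmd


if __name__ == "__main__":
    # Default build: under the proposal this would pick up Hive 2.3 from the pom.
    print(" ".join(spark_build_command()))
    # Explicit opt-in to the Hive 1.2 fork.
    print(" ".join(spark_build_command("hive-1.2")))
```

The point of the design is that users who do nothing get the default from the pom, while staying on the fork requires an explicit extra flag.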

Bests,
Dongjoon.



Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

Cheng Lian
Dongjoon, I don't think we have any conflicts here. As stated in other threads multiple times, as long as the Hive 2.3 and Hadoop 3.2 version upgrades can be decoupled, I have no preference as to which Hive/Hadoop version is picked as the default. So the following two plans both work for me:

  1. Keep Hive 1.2 as the default Spark 3.0 execution Hive version, and have an extra hive-2.3 profile.
  2. Choose Hive 2.3 as the default Spark 3.0 execution Hive version, and have an extra hive-1.2 profile.

BTW, I was also discussing Hive dependency issues with other people offline, and I realized that the Hive isolated client loader is not well known, which has caused unnecessary confusion/worry. So I would like to provide some background context here for readers who are not familiar with Spark's Hive integration. Building Spark 3.0 with Hive 1.2.1 does NOT mean that you can only interact with Hive 1.2.1.

Spark works with different versions of the Hive metastore via an isolated classloading mechanism. Even if Spark itself is built with the Hive 1.2.1 fork, you can still interact with a Hive 2.3 metastore, and this has been true ever since Spark 1.x. In order to do this, just set the following two options according to the instructions in our official doc page:

  • spark.sql.hive.metastore.version
  • spark.sql.hive.metastore.jars

Say you set "spark.sql.hive.metastore.version" to "2.3.6" and "spark.sql.hive.metastore.jars" to "maven". Spark will then pull Hive 2.3.6 dependencies from Maven at runtime when initializing the Hive metastore client. Those dependencies will NOT conflict with the built-in Hive 1.2.1 jars, because the downloaded jars are loaded using an isolated classloader (see here). Historically, we call these two sets of Hive dependencies "execution Hive" and "metastore Hive". The former is mostly used for features like SerDe, while the latter is used to interact with the Hive metastore. And the Hive version upgrade we are discussing here is about the execution Hive.
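
The two options above can be sketched as follows. This is a minimal Python illustration, not part of Spark: the helper names `metastore_client_conf` and `as_spark_defaults` are hypothetical, and the only real identifiers are the two configuration keys and the example values ("2.3.6", "maven") mentioned in the text.

```python
from typing import Dict


def metastore_client_conf(version: str = "2.3.6", jars: str = "maven") -> Dict[str, str]:
    """Settings that select the "metastore Hive" client version.

    These do not change the built-in "execution Hive" that Spark was
    compiled against. Hypothetical helper for illustration.
    """
    return {
        "spark.sql.hive.metastore.version": version,
        "spark.sql.hive.metastore.jars": jars,
    }


def as_spark_defaults(conf: Dict[str, str]) -> str:
    """Format settings as spark-defaults.conf lines (key, whitespace, value)."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


if __name__ == "__main__":
    print(as_spark_defaults(metastore_client_conf()))
```

In a PySpark application the same pairs could instead be passed one by one to `SparkSession.builder.config(key, value)` before calling `enableHiveSupport()`; that usage is a suggestion and is not shown in the thread.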

Cheng

On Wed, Nov 20, 2019 at 2:38 PM Dongjoon Hyun <[hidden email]> wrote:
Nice. That's a progress.

Let's narrow down to the path. We need to clarify what is the criteria we can agree.

1. What does `battle-tested for years` mean exactly?
    How and when can we start the `battle-tested` stage for Hive 2.3?

2. What is the new "Hive integration in Spark"?
    During introducing Hive 2.3, we fixed the compatibility stuff as you said.
    Most of code is shared for Hive 1.2 and Hive 2.3.
    That means if there is a bug inside this shared code, both of them will be affected.
    Of course, we can fix this because it's Spark code. We will learn and fix it as you said.

    >  Yes, there are issues, but people have learned how to get along with these issues.

    The only non-shared code are the following.
    Do you have a concern on the following directories?
    If there is no bugs on the following codebase, can we switch?

    $ find . -name v2.3.5
    ./sql/core/v2.3.5
    ./sql/hive-thriftserver/v2.3.5

3. We know that we can keep both code bases, but the community should choose Hive 2.3 officially.
    That's the right choice in the Apache project policy perspective. At least, Sean and I prefer that.
    If someone really want to stick to Hive 1.2 fork, they can use it at their own risks.

    > for Spark 3.0 end-users who really don't want to interact with this Hive 1.2 fork, they can always use Hive 2.3 at their own risks.

Specifically, what about having a `hive-1.2` profile in `3.0.0`, with Hive 2.3 as the default in the POM at least?
What do you think about that approach, Cheng?

Bests,
Dongjoon.


On Wed, Nov 20, 2019 at 12:59 PM Cheng Lian <[hidden email]> wrote:
Hey Dongjoon and Felix,

I totally agree that Hive 2.3 is more stable than Hive 1.2. Otherwise, we wouldn't even consider integrating with Hive 2.3 in Spark 3.0.

However, "Hive" and "Hive integration in Spark" are two quite different things, and I don't think anybody has ever mentioned "the forked Hive 1.2.1 is stable" in any recent Hadoop/Hive version discussions (at least I double-checked all my replies).

What I really care about is the stability and quality of "Hive integration in Spark", which has gone through some major updates due to the recent Hive 2.3 upgrade in Spark 3.0. We have already found bugs in this piece, and empirically, for a significant upgrade like this one, it would not be surprising if other bugs/regressions were found in the near future. On the other hand, the Hive 1.2 integration code path in Spark has been battle-tested for years. Yes, there are issues, but people have learned how to get along with these issues. And please don't forget that Spark 3.0 end-users who really don't want to interact with this Hive 1.2 fork can always use Hive 2.3 at their own risk.

True, "stable" is quite a vague criterion and hard to prove. But that is exactly the reason why we may want to be conservative and wait for some time to see whether there are further signals suggesting that the Hive 2.3 integration in Spark 3.0 is unstable. After one or two Spark 3.x minor releases, if we've fixed all the outstanding issues and no more significant ones are showing up, we can declare that the Hive 2.3 integration in Spark 3.x is stable, and then we can consider removing the reference to the Hive 1.2 fork. Does that make sense?

Cheng

On Wed, Nov 20, 2019 at 11:49 AM Felix Cheung <[hidden email]> wrote:
Just to add - the Hive 1.2 fork is definitely not more stable. We know of a few critical bug fixes that we cherry-picked into a fork of that fork, which we maintain ourselves.



From: Dongjoon Hyun <[hidden email]>
Sent: Wednesday, November 20, 2019 11:07:47 AM
To: Sean Owen <[hidden email]>
Cc: dev <[hidden email]>
Subject: Re: The Myth: the forked Hive 1.2.1 is stabler than XXX
 
Thanks. That will be a giant step forward, Sean!

> I'd prefer making it the default in the POM for 3.0. 

Bests,
Dongjoon.

On Wed, Nov 20, 2019 at 11:02 AM Sean Owen <[hidden email]> wrote:
Yeah, 'stable' is ambiguous. It's old and buggy, but at least it's the
same old and buggy that's been there a while. "stable" in that sense.
I'm sure there is a lot more delta between Hive 1 and 2 in terms of
bug fixes that are important; the question isn't just 1.x releases.

What I don't know is how much of that affects Spark, as it's mostly a
Hive client. Clearly some of it does.

I'd prefer making it the default in the POM for 3.0. Mostly on the
grounds that its effects are on deployed clusters, not apps. And
deployers can still choose a binary distro with 1.x or make the choice
they want. Those that don't care should probably be nudged to 2.x.
Spark 3.x is already full of behavior changes and 'unstable', so I
think this is minor relative to the overall risk question.

On Wed, Nov 20, 2019 at 12:53 PM Dongjoon Hyun <[hidden email]> wrote:
>
> Hi, All.
>
> I'm sending this email because it's important to discuss this topic narrowly
> and make a clear conclusion.
>
> `The forked Hive 1.2.1 is stable`? It sounds like a myth we created
> by ignoring the existing bugs. If you want to say the forked Hive 1.2.1 is
> stabler than XXX, please give us the evidence. Then, we can fix it.
> Otherwise, let's stop making `The forked Hive 1.2.1` invincible.
>
> Historically, the following forked Hive 1.2.1 has never been stable.
> It's just frozen. Since the forked Hive is out of our control, we ignored bugs.
> That's all. The reality is far from stable.
>
>     https://mvnrepository.com/artifact/org.spark-project.hive/
>     https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark (2015 August)
>     https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark2 (2016 April)
>
> First, let's begin with Hive itself by comparing the fork with Apache Hive 1.2.2 and 1.2.3:
>
>     Apache Hive 1.2.2 has 50 bug fixes.
>     Apache Hive 1.2.3 has 9 bug fixes.
>
> I will not cover all of them, but the Apache Hive community also backports
> important patches, as the Apache Spark community does.
>
> Second, let's move to SPARK issues because we aren't exposed to all Hive issues.
>
>     SPARK-19109 ORC metadata section can sometimes exceed protobuf message size limit
>     SPARK-22267 Spark SQL incorrectly reads ORC file when column order is different
>
> These have been reported since Apache Spark 1.6.x because the forked Hive doesn't have
> proper upstream patches such as HIVE-11592 (fixed in Apache Hive 1.3.0).
>
> Since we couldn't update the frozen forked Hive, we added the Apache ORC dependency
> in SPARK-20682 (2.3.0), added a switching configuration in SPARK-20728 (2.3.0), and
> turned on `spark.sql.hive.convertMetastoreOrc` by default in SPARK-22279 (2.4.0).
> However, if you turn off the switch and start to use the forked Hive,
> you will be exposed to the buggy forked Hive 1.2.1 again.
>
> Third, let's talk about new features like Hadoop 3 and JDK11.
> No one believes that the ancient forked Hive 1.2.1 will work with these.
> I saw that the following issue was mentioned as evidence of a Hive 2.3.6 bug.
>
>     SPARK-29245 ClassCastException during creating HiveMetaStoreClient
>
> Yes. I know that issue because I reported it and verified HIVE-21508.
> It's already fixed and will be released in Apache Hive 2.3.7.
>
> Can we imagine something like this in the forked Hive 1.2.1?
> 'No'. There is no future for it. It's frozen.
>
> From now on, I want to claim that the forked Hive 1.2.1 is the unstable one.
> I welcome all your positive and negative opinions.
> Please share your concerns and problems so that we can fix them together.
> Apache Spark is an open source project that we share.
>
> Bests,
> Dongjoon.
>

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

Cheng Lian
Just to summarize my points:
  1. Let's still keep the Hive 1.2 dependency in Spark 3.0, but it is optional. End-users may choose between Hive 1.2/2.3 via a new profile (either adding a hive-1.2 profile or adding a hive-2.3 profile works for me, depending on which Hive version we pick as the default version).
  2. Decouple the Hive version upgrade from the Hadoop version upgrade, so that people have more choices in production and Spark 3.0 migration becomes easier (e.g., you don't have to switch to Hadoop 3 in order to pick Hive 2.3 and/or JDK 11).
  3. For default Hadoop/Hive versions in Spark 3.0, I personally do not have a preference as long as the above two are met.
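
As a rough illustration of what "decoupled" means for the build, the sketch below enumerates every Hive/Hadoop profile pair as independent Maven `-P` flags. The profile names are assumptions: `hive-1.2` and `hive-2.3` are the names proposed in this thread, while `hadoop-2.7` and `hadoop-3.2` are assumed here purely for illustration, and the helper names are hypothetical.

```python
from itertools import product

# Assumed profile names (see the note above); only one per group would be
# the default, the other would be selected explicitly with -P.
HIVE_PROFILES = ("hive-1.2", "hive-2.3")
HADOOP_PROFILES = ("hadoop-2.7", "hadoop-3.2")


def profile_flags(hive_profile, hadoop_profile):
    """Return Maven -P flags for one independently chosen Hive/Hadoop pair."""
    if hive_profile not in HIVE_PROFILES:
        raise ValueError(f"unknown Hive profile: {hive_profile}")
    if hadoop_profile not in HADOOP_PROFILES:
        raise ValueError(f"unknown Hadoop profile: {hadoop_profile}")
    return [f"-P{hive_profile}", f"-P{hadoop_profile}"]


def all_combinations():
    """Every Hive profile paired with every Hadoop profile."""
    return [profile_flags(h, d) for h, d in product(HIVE_PROFILES, HADOOP_PROFILES)]


if __name__ == "__main__":
    for flags in all_combinations():
        print(" ".join(flags))
```

If the upgrades are decoupled as described in point 2, all four pairs are valid builds; if they were coupled, picking Hive 2.3 would force Hadoop 3.2 and only two pairs would remain.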

On Wed, Nov 20, 2019 at 3:22 PM Cheng Lian <[hidden email]> wrote:
Dongjoon, I don't think we have any conflicts here. As stated in other threads multiple times, as long as the Hive 2.3 and Hadoop 3.2 version upgrades can be decoupled, I have no preference as to which Hive/Hadoop version is the default. So the following two plans both work for me:
  1. Keep Hive 1.2 as the default Spark 3.0 execution Hive version, and have an extra hive-2.3 profile.
  2. Choose Hive 2.3 as the default Spark 3.0 execution Hive version, and have an extra hive-1.2 profile.
BTW, I was also discussing Hive dependency issues with other people offline, and I realized that the Hive isolated client loader is not well known, which has caused unnecessary confusion/worry. So I would like to provide some background context for readers who are not familiar with Spark Hive integration here. Building Spark 3.0 with Hive 1.2.1 does NOT mean that you can only interact with Hive 1.2.1.

Spark does work with different versions of Hive metastore via an isolated classloading mechanism. Even if Spark itself is built with the Hive 1.2.1 fork, you can still interact with a Hive 2.3 metastore, and this has been true ever since Spark 1.x. In order to do this, just set the following two options according to instructions in our official doc page:
  • spark.sql.hive.metastore.version
  • spark.sql.hive.metastore.jars
For example, if you set "spark.sql.hive.metastore.version" to "2.3.6" and "spark.sql.hive.metastore.jars" to "maven", Spark will pull the Hive 2.3.6 dependencies from Maven at runtime when initializing the Hive metastore client. Those dependencies will NOT conflict with the built-in Hive 1.2.1 jars, because the downloaded jars are loaded using an isolated classloader (see here). Historically, we have called these two sets of Hive dependencies "execution Hive" and "metastore Hive". The former is mostly used for features like SerDe, while the latter is used to interact with the Hive metastore. The Hive version upgrade we are discussing here is about the execution Hive.

Cheng


Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

Cheng Lian

Oh, actually, in order to decouple the Hadoop 3.2 and Hive 2.3 upgrades, we will need a hive-2.3 profile anyway, regardless of whether we have the hive-1.2 profile or not.



Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

Dongjoon Hyun-2
Thank you for the thoughtful clarification. I agree with all of your options.

In particular, for Hive Metastore connections, the `Hive isolated client loader` is also important with Hive 2.3, because the Hive 2.3 client cannot talk to Hive 2.1 and lower. The `Hive isolated client loader` is one of the good designs in Apache Spark.

One of the reasons I started this thread focusing on the fork is that we don't actually use that fork.


Big companies (and vendors) maintain their own forks of that fork or have already upgraded their Hive dependency. So, when we say it's battle-tested, that is not really true of the fork itself. It's not tested.

The above repository has become something like a stranded phantom. We point to that repo as a legacy interface, but we don't really use its code in large production environments. Since there is no way to contribute back to that repo, our experience with Hive 1.2.1 is also fragmented: some may say it works well, while others still struggle without any community support.

Anyway, thank you so much for the conclusion.
As a next step, I'll create a JIRA and PR for the `hive-1.2` profile first.

Bests,
Dongjoon.


On Wed, Nov 20, 2019 at 4:10 PM Cheng Lian <[hidden email]> wrote:

Oh, actually, in order to decouple Hadoop 3.2 and Hive 2.3 upgrades, we will need a hive-2.3 profile anyway, no matter having the hive-1.2 profile or not.


On Wed, Nov 20, 2019 at 3:33 PM Cheng Lian <[hidden email]> wrote:
Just to summarize my points:
  1. Let's still keep the Hive 1.2 dependency in Spark 3.0, but it is optional. End-users may choose between Hive 1.2/2.3 via a new profile (either adding a hive-1.2 profile or adding a hive-2.3 profile works for me, depending on which Hive version we pick as the default version).
  2. Decouple Hive version upgrade and Hadoop version upgrade, so that people may have more choices in production, and makes Spark 3.0 migration easier (e.g., you don't have to switch to Hadoop 3 in order to pick Hive 2.3 and/or JDK 11.).
  3. For default Hadoop/Hive versions in Spark 3.0, I personally do not have a preference as long as the above two are met.

On Wed, Nov 20, 2019 at 3:22 PM Cheng Lian <[hidden email]> wrote:
Dongjoon, I don't think we have any conflicts here. As stated in other threads multiple times, as long as Hive 2.3 and Hadoop 3.2 version upgrades can be decoupled, I have no preference over picking which Hive/Hadoop version as the default version. So the following two plans both work for me:
  1. Keep Hive 1.2 as default Spark 3.0 execution Hive version, and have an extra hive-2.3 profile.
  2. Choose Hive 2.3 as default Spark 3.0 execution Hive version, and have an extra hive-1.2 profile.
BTW, I was also discussing Hive dependency issues with other people offline, and I realized that the Hive isolated client loader is not well known, and caused unnecessary confusion/worry. So I would like to provide some background context for readers who are not familiar with Spark Hive integration here. Building Spark 3.0 with Hive 1.2.1 does NOT mean that you can only interact with Hive 1.2.1.

Spark does work with different versions of Hive metastore via an isolated classloading mechanism. Even if Spark itself is built with the Hive 1.2.1 fork, you can still interact with a Hive 2.3 metastore, and this has been true ever since Spark 1.x. In order to do this, just set the following two options according to instructions in our official doc page:
  • spark.sql.hive.metastore.version
  • spark.sql.hive.metastore.jars
Say you set "spark.sql.hive.metastore.version" to "2.3.6", and "spark.sql.hive.metastore.jars" to "maven", Spark will pull Hive 2.3.6 dependencies from Maven at runtime when initializing the Hive metastore client. And those dependencies will NOT conflict with the built-in Hive 1.2.1 jars, because the downloaded jars are loaded using an isolated classloader (see here). Historically, we call these two sets of Hive dependencies "execution Hive" and "metastore Hive". The former is mostly used for features like SerDe, while the latter is used to interact with Hive metastore. And the Hive version upgrade we are discussing here is about the execution Hive.

Cheng

On Wed, Nov 20, 2019 at 2:38 PM Dongjoon Hyun <[hidden email]> wrote:
Nice. That's a progress.

Let's narrow down to the path. We need to clarify what is the criteria we can agree.

1. What does `battle-tested for years` mean exactly?
    How and when can we start the `battle-tested` stage for Hive 2.3?

2. What is the new "Hive integration in Spark"?
    While introducing Hive 2.3, we fixed the compatibility issues, as you said.
    Most of the code is shared between Hive 1.2 and Hive 2.3.
    That means if there is a bug inside this shared code, both of them will be affected.
    Of course, we can fix this because it's Spark code. We will learn and fix it as you said.

    >  Yes, there are issues, but people have learned how to get along with these issues.

    The only non-shared code is the following.
    Do you have any concerns about the following directories?
    If there are no bugs in the following codebase, can we switch?

    $ find . -name v2.3.5
    ./sql/core/v2.3.5
    ./sql/hive-thriftserver/v2.3.5

3. We know that we can keep both code bases, but the community should choose Hive 2.3 officially.
    That's the right choice from the Apache project policy perspective. At least, Sean and I prefer that.
    If someone really wants to stick to the Hive 1.2 fork, they can use it at their own risks.

    > for Spark 3.0 end-users who really don't want to interact with this Hive 1.2 fork, they can always use Hive 2.3 at their own risks.

Specifically, what about at least having a `hive-1.2` profile in `3.0.0`, with Hive 2.3 as the default in the pom?
What do you think about that approach, Cheng?

Bests,
Dongjoon.


On Wed, Nov 20, 2019 at 12:59 PM Cheng Lian <[hidden email]> wrote:
Hey Dongjoon and Felix,

I totally agree that Hive 2.3 is more stable than Hive 1.2. Otherwise, we wouldn't even consider integrating with Hive 2.3 in Spark 3.0.

However, "Hive" and "Hive integration in Spark" are two quite different things, and I don't think anybody has ever mentioned "the forked Hive 1.2.1 is stable" in any recent Hadoop/Hive version discussions (at least I double-checked all my replies).

What I really care about is the stability and quality of "Hive integration in Spark", which has gone through some major updates due to the recent Hive 2.3 upgrade in Spark 3.0. We have already found bugs in this piece, and empirically, for a significant upgrade like this one, it would not be surprising if other bugs/regressions were found in the near future. On the other hand, the Hive 1.2 integration code path in Spark has been battle-tested for years. Yes, there are issues, but people have learned how to get along with these issues. And please don't forget that, for Spark 3.0 end-users who really don't want to interact with this Hive 1.2 fork, they can always use Hive 2.3 at their own risks.

True, "stable" is quite a vague criterion, and hard to prove. But that is exactly why we may want to be conservative and wait for some time to see whether there are further signals suggesting that the Hive 2.3 integration in Spark 3.0 is unstable. After one or two Spark 3.x minor releases, if we've fixed all the outstanding issues and no more significant ones are showing up, we can declare that the Hive 2.3 integration in Spark 3.x is stable, and then we can consider removing the reference to the Hive 1.2 fork. Does that make sense?

Cheng

On Wed, Nov 20, 2019 at 11:49 AM Felix Cheung <[hidden email]> wrote:
Just to add - the Hive 1.2 fork is definitely not more stable. We know of a few critical bug fixes that we cherry-picked into a fork of that fork, which we maintain ourselves.



From: Dongjoon Hyun <[hidden email]>
Sent: Wednesday, November 20, 2019 11:07:47 AM
To: Sean Owen <[hidden email]>
Cc: dev <[hidden email]>
Subject: Re: The Myth: the forked Hive 1.2.1 is stabler than XXX
 
Thanks. That will be a giant step forward, Sean!

> I'd prefer making it the default in the POM for 3.0. 

Bests,
Dongjoon.

On Wed, Nov 20, 2019 at 11:02 AM Sean Owen <[hidden email]> wrote:
Yeah 'stable' is ambiguous. It's old and buggy, but at least it's the
same old and buggy that's been there a while. "stable" in that sense.
I'm sure there is a lot more delta between Hive 1 and 2 in terms of
bug fixes that are important; the question isn't just 1.x releases.

What I don't know is how much of that affects Spark, as it's a Hive client
mostly. Clearly some of it does.

I'd prefer making it the default in the POM for 3.0. Mostly on the
grounds that its effects are on deployed clusters, not apps. And
deployers can still choose a binary distro with 1.x or make the choice
they want. Those that don't care should probably be nudged to 2.x.
Spark 3.x is already full of behavior changes and 'unstable', so I
think this is minor relative to the overall risk question.

On Wed, Nov 20, 2019 at 12:53 PM Dongjoon Hyun <[hidden email]> wrote:
>
> Hi, All.
>
> I'm sending this email because it's important to discuss this topic narrowly
> and make a clear conclusion.
>
> `The forked Hive 1.2.1 is stable`? It sounds like a myth we created
> by ignoring the existing bugs. If you want to say the forked Hive 1.2.1 is
> stabler than XXX, please give us the evidence. Then, we can fix it.
> Otherwise, let's stop making `The forked Hive 1.2.1` invincible.
>
> Historically, the following forked Hive 1.2.1 has never been stable.
> It's just frozen. Since the forked Hive is out of our control, we ignored bugs.
> That's all. The reality is far from stable.
>
>     https://mvnrepository.com/artifact/org.spark-project.hive/
>     https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark (2015 August)
>     https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark2 (2016 April)
>
> First, let's begin Hive itself by comparing with Apache Hive 1.2.2 and 1.2.3,
>
>     Apache Hive 1.2.2 has 50 bug fixes.
>     Apache Hive 1.2.3 has 9 bug fixes.
>
> I will not cover all of them, but Apache Hive community also backports
> important patches like Apache Spark community.
>
> Second, let's move to SPARK issues because we aren't exposed to all Hive issues.
>
>     SPARK-19109 ORC metadata section can sometimes exceed protobuf message size limit
>     SPARK-22267 Spark SQL incorrectly reads ORC file when column order is different
>
> These were reported since Apache Spark 1.6.x because the forked Hive doesn't have
> a proper upstream patch like HIVE-11592 (fixed at Apache Hive 1.3.0).
>
> Since we couldn't update the frozen forked Hive, we added Apache ORC dependency
> at SPARK-20682 (2.3.0), added a switching configuration at SPARK-20728 (2.3.0),
> and turned on `spark.sql.hive.convertMetastoreOrc` by default at SPARK-22279 (2.4.0).
> However, if you turn off the switch and start to use the forked hive,
> you will be exposed to the buggy forked Hive 1.2.1 again.
>
> Third, let's talk about the new features like Hadoop 3 and JDK11.
> No one believes that the ancient forked Hive 1.2.1 will work with these.
> I saw that the following issue is mentioned as evidence of a Hive 2.3.6 bug.
>
>     SPARK-29245 ClassCastException during creating HiveMetaStoreClient
>
> Yes. I know that issue because I reported it and verified HIVE-21508.
> It's fixed already and will be released in Apache Hive 2.3.7.
>
> Can we imagine something like this in the forked Hive 1.2.1?
> 'No'. There is no future on it. It's frozen.
>
> From now, I want to claim that the forked Hive 1.2.1 is the unstable one.
> I welcome all your positive and negative opinions.
> Please share your concerns and problems and fix them together.
> Apache Spark is an open source project we shared.
>
> Bests,
> Dongjoon.
>

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

Steve Loughran-2


On Thu, Nov 21, 2019 at 12:53 AM Dongjoon Hyun <[hidden email]> wrote:
Thank you for the very thoughtful clarification. I agree with all of your options.

In particular, for the Hive Metastore connection, the `Hive isolated client loader` is also important with Hive 2.3 because the Hive 2.3 client cannot talk to Hive 2.1 and lower. The `Hive isolated client loader` is one of the good designs in Apache Spark.

One of the reasons I started this thread focusing on the fork is that we don't actually use that fork.


Big companies (and vendors) maintain their own fork of that fork or have already upgraded their Hive dependency. So, when we say it's battle-tested, it isn't really. It's not tested.


I'm not up to date with the Cloudera fork. The last time I went near the then-Hortonworks fork was for this: https://github.com/pwendell/hive/pull/2; I think there were a couple of security patches too.
 
I don't think anyone would have added new features to the branch, but bug fixes and security patches are inevitable.

The above repository has become something like a stranded phantom. We point to that repo as a legacy interface, but we don't really use the code in large production environments. Since there is no way to contribute back to that repo, our experience with Hive 1.2.1 is also fragmented. Some may say it's good while others still struggle without any community support.

Anyway, thank you so much for the conclusion.
I'll try to make a JIRA and PR for the `hive-1.2` profile first as a conclusion.

+1



Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

Dongjoon Hyun-2
Thank you, Steve and all.

As the conclusion of this thread, we will merge the following PR and move forward.

    [SPARK-29981][BUILD] Add hive-1.2/2.3 profiles
    https://github.com/apache/spark/pull/26619

Please leave your comments if you have any concerns.
The following PRs and more will follow soon.

    SPARK-29988 Adjust Jenkins jobs for hive-1.2/2.3 combination
    SPARK-29989 Update release-script for hive-1.2/2.3 combination
    SPARK-29991 Support hive-1.2/2.3 in PR Builder

In this thread, we have been focusing only on the Hive dependency.
These changes become effective in Apache Spark 3.0.0 (or the next preview).
For Hadoop 3 and JDK11, please follow up in the other threads.

Bests,
Dongjoon.