Hadoop 3 support

Hadoop 3 support

rxin
Does anybody know what needs to be done in order for Spark to support Hadoop 3?


Re: Hadoop 3 support

Mridul Muralidharan
Specifically for running Spark with Hadoop 3 Docker support, I have
filed a few JIRAs, tracked under [1].

Regards,
Mridul

[1] https://issues.apache.org/jira/browse/SPARK-23717


On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin <[hidden email]> wrote:
> Does anybody know what needs to be done in order for Spark to support Hadoop
> 3?
>

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Hadoop 3 support

rxin
That's just a nice-to-have improvement, right? I'm more curious about the minimal amount of work required to support 3.0, without all the bells and whistles. (Of course we can also do the bells and whistles, but those would come after we can actually get 3.0 running.)


On Mon, Apr 2, 2018 at 1:50 PM, Mridul Muralidharan <[hidden email]> wrote:
Specifically to run spark with hadoop 3 docker support, I have filed a
few jira's tracked under [1].


Re: Hadoop 3 support

Marcelo Vanzin
In reply to this post by rxin
Saisai filed SPARK-23534, but the main blocking issue is really SPARK-18673.


--
Marcelo


Re: Hadoop 3 support

rxin
Is it difficult to upgrade the Hive execution version to the latest version? The metastore used to be an issue, but that part has now been separated from the execution part.


On Mon, Apr 2, 2018 at 1:57 PM, Marcelo Vanzin <[hidden email]> wrote:
Saisai filed SPARK-23534, but the main blocking issue is really SPARK-18673.



Re: Hadoop 3 support

Marcelo Vanzin
I haven't looked at it in detail...

Somebody's been trying to do that in
https://github.com/apache/spark/pull/20659, but that's kind of a huge
change.

The parts where I'd be concerned are:
- using Hive's original hive-exec package brings in a bunch of shaded
dependencies, which may break Spark in weird ways. HIVE-16391 was
supposed to fix that but nothing has really been done as part of that
bug.
- the hive-exec "core" package avoids the shaded dependencies but used
to have issues of its own. Maybe it's better now, haven't looked.
- what about the current thrift server which is basically a fork of
the Hive 1.2 source code?
- when using Hadoop 3 + an old metastore client that doesn't know
about Hadoop 3, things may break.

The last one has two possible fixes: say that Hadoop 3 builds of
Spark don't support old metastores; or add code so that Spark loads a
separate copy of the Hadoop libraries in that case (search for
"sharesHadoopClasses" in IsolatedClientLoader for where to start with
that).

If we try to update Hive, it would be good to avoid having to fork it,
as is done currently. But I'm not sure that will be possible given the
current hive-exec packaging.
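
For illustration, the choice between the two metastore fixes discussed above can be written down as a small decision function. This is only a sketch of the reasoning, not Spark's code: `ClientLoaderPlan`, `plan_client_loader` and the `metastore_knows_hadoop3` flag are hypothetical names, and the real logic lives around "sharesHadoopClasses" in IsolatedClientLoader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientLoaderPlan:
    """Hypothetical result type: how the metastore client should be loaded."""
    share_hadoop_classes: bool
    reason: str


def plan_client_loader(hadoop_major: int, metastore_knows_hadoop3: bool) -> ClientLoaderPlan:
    """Decide whether the metastore client can share Spark's Hadoop classes.

    Assumption: an "old" metastore client is one that does not know about
    Hadoop 3, as described in the message above.
    """
    if hadoop_major >= 3 and not metastore_knows_hadoop3:
        # Second fix from the message: give the old client its own,
        # separate copy of the Hadoop libraries.
        return ClientLoaderPlan(False, "old metastore client needs a separate Hadoop copy")
    # Otherwise the client and Spark can use the same Hadoop classes.
    return ClientLoaderPlan(True, "client can share Spark's Hadoop classes")


if __name__ == "__main__":
    print(plan_client_loader(3, metastore_knows_hadoop3=False))
    print(plan_client_loader(2, metastore_knows_hadoop3=False))
```

The first fix (declaring old metastores unsupported on Hadoop 3 builds) would correspond to raising an error in the first branch instead of returning a plan.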

On Mon, Apr 2, 2018 at 2:58 PM, Reynold Xin <[hidden email]> wrote:

> Is it difficult to upgrade Hive execution version to the latest version? The
> metastore used to be an issue but now that part had been separated from the
> execution part.

--
Marcelo


Re: Hadoop 3 support

Saisai Shao
In reply to this post by Marcelo Vanzin
Yes, the main blocking issue is that the Hive version used in Spark (1.2.1.spark) doesn't support running on Hadoop 3. Hive checks the Hadoop version at runtime [1]. Besides this, I think some pom changes should be enough to support Hadoop 3.

If we want to use the Hadoop 3 shaded client jar, then the pom requires lots of changes, but this is not necessary.
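
To make the runtime version check concrete, here is a minimal sketch of the kind of major-version switch that would reject Hadoop 3. It is an illustration under assumptions, not Hive's actual code: the function name, the shim labels and the error messages are made up for the example.

```python
# Illustrative only: a major-version switch of the kind that makes an old
# Hive build refuse to run on Hadoop 3. All names here are hypothetical.
SUPPORTED_MAJOR_VERSIONS = {1: "hadoop-1 shim", 2: "hadoop-2 shim"}


def select_shim(hadoop_version: str) -> str:
    """Return a shim label for a Hadoop version string such as '2.7.3'."""
    parts = hadoop_version.split(".")
    if len(parts) < 2:
        raise ValueError(
            f"Illegal Hadoop version: {hadoop_version} (expected A.B.* format)"
        )
    major = int(parts[0])
    if major not in SUPPORTED_MAJOR_VERSIONS:
        # This is the branch a Hadoop 3.x version string ends up in.
        raise ValueError(f"Unrecognized Hadoop major version: {hadoop_version}")
    return SUPPORTED_MAJOR_VERSIONS[major]


if __name__ == "__main__":
    print(select_shim("2.7.3"))      # accepted: major version 2 is known
    try:
        select_shim("3.1.0")         # rejected: major version 3 is unknown
    except ValueError as err:
        print(err)
```

A check of this shape only looks at the leading number, which is why any version string starting with "2." gets through while every 3.x string fails.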

2018-04-03 4:57 GMT+08:00 Marcelo Vanzin <[hidden email]>:
Saisai filed SPARK-23534, but the main blocking issue is really SPARK-18673.



Re: Hadoop 3 support

Steve Loughran


On 3 Apr 2018, at 01:30, Saisai Shao <[hidden email]> wrote:

Yes, the main blocking issue is the hive version used in Spark (1.2.1.spark) doesn't support run on Hadoop 3. Hive will check the Hadoop version in the runtime [1]. Besides this I think some pom changes should be enough to support Hadoop 3.

If we want to use Hadoop 3 shaded client jar, then the pom requires lots of changes, but this is not necessary.


To be ruthless, I'd view Hadoop 3.1 as the first one to play with...3.0.x was more of a wide-version check. Hadoop 3.1RC0 is out this week, making it the ideal (last!) time to find showstoppers.

1. I've got a PR which adds a profile to build Spark against Hadoop 3, with some fixes for the zk import along with a better hadoop-cloud profile.

Apply that patch and both mvn and sbt can build with the RC0 from the ASF staging repo:

build/sbt -Phadoop-3,hadoop-cloud,yarn -Psnapshots-and-staging



2. Everything Marcelo says about hive. 

You can build Hadoop locally with -Dhadoop.version=2.11 and the hive 1.2.1-spark version check goes through. You can't safely bring up HDFS like that, but you can run Spark standalone against things.

Some strategies

Short term: build a new hive-1.2.x-spark which fixes up the version check and merges in those critical patches that Cloudera, Hortonworks, Databricks, and anyone else has got in for their production systems. I don't think we have that many.

That leaves a "how to release" story, as the ASF will want it to come out under the ASF auspices, and, given the liability disclaimers, so should everyone. The Hive team could be "invited" to publish it as their own if people ask nicely. 

Long term
 -do something about that subclassing to get the thrift endpoint to work. That can include fixing hive's service to be subclass friendly.
 -move to hive 2

That's a major piece of work.

Re: Hadoop 3 support

Steve Loughran
In reply to this post by Saisai Shao


On 3 Apr 2018, at 01:30, Saisai Shao <[hidden email]> wrote:

Yes, the main blocking issue is the hive version used in Spark (1.2.1.spark) doesn't support run on Hadoop 3. Hive will check the Hadoop version in the runtime [1]. Besides this I think some pom changes should be enough to support Hadoop 3.

If we want to use Hadoop 3 shaded client jar, then the pom requires lots of changes, but this is not necessary.


I don't think the hadoop-shaded JAR is complete enough for Spark yet...it was very much driven by HBase's needs. But there's only one way to get Hadoop to fix that: try the move, find the problems, complain noisily. Then Hadoop 3.2 and/or a 3.1.x for x>=1 can have the broader shading.

Assume my name is next to the "Shade hadoop-cloud-storage" problem, though given that aws-java-sdk-bundle is already 50 MB, I don't plan to shade that at all. The AWS shading already isolates everything from Amazon's choice of Jackson, which was one of the sore points.

-Steve
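
As a rough way to see the kind of dependency clash that shading is meant to avoid, the sketch below lists class files that appear in more than one JAR on a classpath. It is not part of any Hadoop or Spark tooling; `classes_in` and `duplicated_classes` are hypothetical helper names, and the approach only assumes that a JAR is a zip archive of `.class` entries.

```python
import sys
import zipfile
from collections import defaultdict


def classes_in(jar_path):
    """Return the set of .class entry names inside one JAR (a zip archive)."""
    with zipfile.ZipFile(jar_path) as jar:
        return {name for name in jar.namelist() if name.endswith(".class")}


def duplicated_classes(jar_paths):
    """Map each class file found in more than one JAR to the JARs holding it."""
    owners = defaultdict(list)
    for path in jar_paths:
        for name in classes_in(path):
            owners[name].append(path)
    return {name: paths for name, paths in owners.items() if len(paths) > 1}


if __name__ == "__main__":
    # Usage: python find_dups.py first.jar second.jar ...
    for name, paths in sorted(duplicated_classes(sys.argv[1:]).items()):
        print(name, "<-", ", ".join(paths))
```

Classes relocated by shading live under a different package prefix, so they would not show up as duplicates of the unshaded originals in a listing like this.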

Re: Hadoop 3 support

Felix Cheung
In reply to this post by Steve Loughran
What would be the strategy with hive? Cherry pick patches? Update to more “modern” versions (like 2.3?)

I know of a few critical schema evolution fixes that we could port to hive 1.2.1-spark 


_____________________________
From: Steve Loughran <[hidden email]>
Sent: Tuesday, April 3, 2018 1:33 PM
Subject: Re: Hadoop 3 support
To: Apache Spark Dev <[hidden email]>





Re: Hadoop 3 support

t4
Has anyone got Spark jars working with Hadoop 3.1 that they can share? I am
looking to use the latest hadoop-aws fixes from v3.1.




Re: Hadoop 3 support

Hyukjin Kwon

On Wed, Oct 17, 2018 at 5:06 AM, t4 <[hidden email]> wrote:
has anyone got spark jars working with hadoop3.1 that they can share? i am
looking to be able to use the latest  hadoop-aws fixes from v3.1




Re: Hadoop 3 support

Steve Loughran
In reply to this post by t4


> On 16 Oct 2018, at 22:06, t4 <[hidden email]> wrote:
>
> has anyone got spark jars working with hadoop3.1 that they can share? i am
> looking to be able to use the latest  hadoop-aws fixes from v3.1

We do, but we do it with:

* a patched Hive JAR
* building Spark with the -Phive,yarn,hadoop-3.1,hadoop-cloud,kinesis profiles to pull in the object store stuff *while leaving out the things which cause conflict*
* some extra stuff to wire up the 0-rename-committer

W.r.t. hadoop-aws, the hadoop-2.9 artifacts have the shaded AWS JAR (50 MB of .class files, to avoid Jackson dependency pain) and an early version of S3Guard. For the new commit stuff you will need to go to Hadoop 3.1.

-steve


