Hadoop version(s) compatible with spark-2.4.3-bin-without-hadoop-scala-2.12

Michael Heuer
Hello,

Which Hadoop version or versions are compatible with Spark 2.4.3 and Scala 2.12?

The binary distribution spark-2.4.3-bin-without-hadoop-scala-2.12.tgz is missing avro-1.8.2.jar, so when attempting to run with Hadoop 2.7.7 there are classpath conflicts at runtime, as Hadoop 2.7.7 includes avro-1.7.4.jar.


   michael
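
A minimal sketch of how one might confirm this kind of conflict before launching anything, assuming the Hadoop and Spark jars sit in ordinary directories and follow the usual `artifact-version[-classifier].jar` naming. The function name and the regular expression are illustrative, not part of Spark or Hadoop.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming scheme: <artifact>-<version>[-<classifier>].jar,
# e.g. avro-1.8.2.jar or avro-mapred-1.8.2-hadoop2.jar.
JAR_NAME = re.compile(
    r"^(?P<artifact>.+?)-(?P<version>\d[\w.]*)(?:-(?P<classifier>[\w.]+))?\.jar$"
)


def find_version_conflicts(jar_dirs):
    """Return {artifact: {version: [paths]}} for artifacts present in several versions."""
    seen = defaultdict(lambda: defaultdict(list))
    for jar_dir in jar_dirs:
        for jar in sorted(Path(jar_dir).rglob("*.jar")):
            match = JAR_NAME.match(jar.name)
            if match:  # jars that do not follow the naming scheme are skipped
                seen[match["artifact"]][match["version"]].append(str(jar))
    # Keep only artifacts that show up with more than one distinct version.
    return {a: dict(v) for a, v in seen.items() if len(v) > 1}
```

Pointing it at the Hadoop library directory and the Spark `jars` directory together would list `avro` if both 1.7.4 and 1.8.2 are present. It only compares file names, so it cannot tell which copy the JVM would actually load first.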

Re: Hadoop version(s) compatible with spark-2.4.3-bin-without-hadoop-scala-2.12

Koert Kuipers
we run it without issues on hadoop 2.6 - 2.8, off the top of my head.

we however do some post-processing on the tarball:
1) we fix the ownership of the files inside the tar.gz file (should be uid/gid 0/0, otherwise untarring by root can lead to ownership by unknown user).
2) add avro-1.8.2.jar and jline-2.14.6.jar to the jars folder. i believe these jars being missing from the provided profile is simply a mistake.

best,
koert
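
The two post-processing steps above could be sketched roughly as follows. This is not the script actually used; it assumes the tarball has a single top-level directory containing a `jars` folder, and the function name, its parameters, and the choice to set the owner name to `root` are illustrative.

```python
import tarfile
from pathlib import Path


def repack_tarball(src, dst, extra_jars=(), jars_dir="jars"):
    """Copy src (.tar.gz) to dst with uid/gid 0/0 everywhere, adding extra jars."""
    with tarfile.open(src, "r:gz") as tin, tarfile.open(dst, "w:gz") as tout:
        root = None
        for member in tin:
            if root is None:
                # Assumption: the first path component is the distribution directory.
                root = member.name.split("/")[0]
            member.uid = member.gid = 0
            member.uname = member.gname = "root"
            # Only regular files carry data; directories and links are header-only.
            fileobj = tin.extractfile(member) if member.isfile() else None
            tout.addfile(member, fileobj)
        prefix = f"{root}/" if root else ""
        for jar in extra_jars:
            jar = Path(jar)
            info = tout.gettarinfo(str(jar), arcname=f"{prefix}{jars_dir}/{jar.name}")
            info.uid = info.gid = 0
            info.uname = info.gname = "root"
            with jar.open("rb") as fh:
                tout.addfile(info, fh)
```

A call such as `repack_tarball("spark-2.4.3-bin-without-hadoop-scala-2.12.tgz", "spark-fixed.tgz", ["avro-1.8.2.jar", "jline-2.14.6.jar"])` would write a new archive and leave the original untouched; the output file name here is made up.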


Re: Hadoop version(s) compatible with spark-2.4.3-bin-without-hadoop-scala-2.12

Sean Owen-2
Re: 1), I think we tried to fix that on the build side and it requires
flags that not all tar versions (e.g. OS X) have. But that's
tangential.

I think the Avro + Parquet dependency situation is generally
problematic -- see JIRA for some details. But yes I'm not surprised if
Spark has a different version from Hadoop 2.7.x and that would cause
problems -- if using Avro. I'm not sure the mistake is that the JARs
are missing, as I think this is supposed to be a 'provided'
dependency, but I haven't looked into it. If there's any easy obvious
correction to be made there, by all means.

Not sure what the deal is with jline... I'd expect that's in the
"hadoop-provided" distro? That one may be a real issue if it's
considered provided but isn't used that way.


> On Mon, May 20, 2019 at 3:37 PM Michael Heuer <[hidden email]> wrote:
>>
>> Hello,
>>
>> Which Hadoop version or versions are compatible with Spark 2.4.3 and Scala 2.12?
>>
>> The binary distribution spark-2.4.3-bin-without-hadoop-scala-2.12.tgz is missing avro-1.8.2.jar, so when attempting to run with Hadoop 2.7.7 there are classpath conflicts at runtime, as Hadoop 2.7.7 includes avro-1.7.4.jar.
>>
>> https://issues.apache.org/jira/browse/SPARK-27781
>>
>>    michael

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Hadoop version(s) compatible with spark-2.4.3-bin-without-hadoop-scala-2.12

Koert Kuipers
it's somewhat weird because avro-mapred-1.8.2-hadoop2.jar is included in the hadoop-provided distro, but avro-1.8.2.jar is not. i tried to fix it but i am not too familiar with the pom file.

regarding jline, you only run into this if you use spark-shell (and it isn't always reproducible, it seems). see SPARK-25783
best,
koert





Re: Hadoop version(s) compatible with spark-2.4.3-bin-without-hadoop-scala-2.12

Michael Heuer
The scopes for avro-1.8.2.jar and avro-mapred-1.8.2-hadoop2.jar are different:

<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>${avro.version}</version>
  <scope>${hadoop.deps.scope}</scope>
...
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-mapred</artifactId>
  <version>${avro.version}</version>
  <classifier>${avro.mapred.classifier}</classifier>
  <scope>${hive.deps.scope}</scope>
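
To illustrate why the two scopes lead to different packaging, here is a small sketch that resolves the scope properties for a given set of active build profiles. The default values (`compile`) and the profile names and overrides (`hadoop-provided`, `hive-provided`) are assumptions about the Spark build made for this example, not quotes from the pom; only the two dependency entries come from the fragment above.

```python
import re

# Assumed defaults and profile overrides (illustrative, not copied from the pom).
DEFAULT_PROPERTIES = {"hadoop.deps.scope": "compile", "hive.deps.scope": "compile"}
PROFILE_OVERRIDES = {
    "hadoop-provided": {"hadoop.deps.scope": "provided"},
    "hive-provided": {"hive.deps.scope": "provided"},
}

# Scopes as declared in the fragment above.
DEPENDENCY_SCOPES = {
    "avro": "${hadoop.deps.scope}",
    "avro-mapred": "${hive.deps.scope}",
}


def bundled_artifacts(active_profiles):
    """Return the artifacts whose resolved scope is not 'provided'."""
    properties = dict(DEFAULT_PROPERTIES)
    for profile in active_profiles:
        properties.update(PROFILE_OVERRIDES.get(profile, {}))

    def resolve(value):
        # Substitute ${name} placeholders with the current property values.
        return re.sub(r"\$\{([^}]+)\}", lambda m: properties[m.group(1)], value)

    return sorted(
        artifact
        for artifact, scope in DEPENDENCY_SCOPES.items()
        if resolve(scope) != "provided"
    )
```

Under these assumptions, `bundled_artifacts(["hadoop-provided"])` keeps `avro-mapred` but drops `avro`, which matches what is observed in the without-hadoop distribution.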


What needs to be done then?  At a minimum, something should be added to the release notes for 2.4.3 to say that the spark-2.4.3-bin-without-hadoop-scala-2.12 binary distribution is incompatible with Hadoop 2.7.7 (and perhaps earlier and later versions, I haven't confirmed).

Note that Avro 1.9.0 was just released, with many binary and source incompatibilities compared to 1.8.2, so this problem may soon get worse unless Parquet, Hadoop, Hive, and Spark can all make the move simultaneously.

   michael



Re: Hadoop version(s) compatible with spark-2.4.3-bin-without-hadoop-scala-2.12

Sean Owen-2
Tough one. Yes it's because Hive is still 'included' with the
no-Hadoop build. I think the avro scope is on purpose in that it's
meant to use the version in the larger Hadoop installation it will run
on. But, I suspect you'll find 1.7 doesn't work. Yes, there's a rat's
nest of compatibility problems here which makes tinkering with the
versions problematic.

My instinct is to say that if Spark uses Avro directly and needs 1.8
(and probably doesn't want 1.9) then this needs to be included with
the package, not left to hadoop.deps.scope.

Does anyone who knows more about this piece know better?



Re: Hadoop version(s) compatible with spark-2.4.3-bin-without-hadoop-scala-2.12

Steve Loughran-2

Hadoop is still on the Avro 1.7.7 branch. A move to 1.9 would probably be as painful as a move to 1.8.x, so submit a patch for Hadoop trunk. The last PR there wasn't quite ready and I didn't get any follow-up to the "what is this going to break" question.


There's not actually much use of Avro in hadoop-common (some @Stringable annotations and some ser/deser code in org.apache.hadoop.io.serializer.avro); some avro structures are generated in mapreduce and (maybe) HDFS, which is probably not relevant for spark except for things trying to run in the node manager. 

So bumping up the spark version may not be too destructive for any uses in the Hadoop codebase. *may*. 

