
excluding hadoop dependencies in spark's assembly files


excluding hadoop dependencies in spark's assembly files

Alex Cozzi
I am trying to exclude the Hadoop jar dependencies from Spark's assembly files. The reason is that, in order to work on our cluster, it is necessary to use our own versions of those jars instead of the published ones. I tried defining the Hadoop dependencies as “provided”, but surprisingly this causes compilation errors in the build. Just to be clear, I modified the sbt build file
as follows:

  def yarnEnabledSettings = Seq(
    libraryDependencies ++= Seq(
      // Exclude rule required for all ?
      "org.apache.hadoop" % "hadoop-client" % hadoopVersion  % "provided" excludeAll(excludeJackson, excludeNetty, excludeAsm, excludeCglib),
      "org.apache.hadoop" % "hadoop-yarn-api" % hadoopVersion  % "provided" excludeAll(excludeJackson, excludeNetty, excludeAsm, excludeCglib),
      "org.apache.hadoop" % "hadoop-yarn-common" % hadoopVersion  % "provided" excludeAll(excludeJackson, excludeNetty, excludeAsm, excludeCglib),
      "org.apache.hadoop" % "hadoop-yarn-client" % hadoopVersion  % "provided" excludeAll(excludeJackson, excludeNetty, excludeAsm, excludeCglib)
    )
  )

and compile as

 SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true SPARK_IS_NEW_HADOOP=true sbt assembly
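
For what it's worth, here is how I understood “provided” is supposed to behave with sbt-assembly in a plain single-project build (a minimal sketch, not Spark's actual build; the project name and plugin version are illustrative):

    // build.sbt -- assumes project/plugins.sbt adds the plugin, e.g.:
    //   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.9.2")
    import AssemblyKeys._

    assemblySettings

    name := "provided-scope-demo"

    // A "provided" dependency stays on the compile classpath, so code using
    // Hadoop APIs still compiles, but it is not on the runtime classpath,
    // so sbt-assembly should leave it out of the fat jar.
    libraryDependencies +=
      "org.apache.hadoop" % "hadoop-client" % "2.2.0" % "provided"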


but the assembly still includes the Hadoop libraries, contrary to what the assembly docs say. I managed to exclude them instead using the non-recommended approach of filtering jars by name:
def extraAssemblySettings() = Seq(
    test in assembly := {},
    mergeStrategy in assembly := {
      case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
      case m if m.toLowerCase.matches("meta-inf.*\\.sf$") => MergeStrategy.discard
      case "log4j.properties" => MergeStrategy.discard
      case m if m.toLowerCase.startsWith("meta-inf/services/") => MergeStrategy.filterDistinctLines
      case "reference.conf" => MergeStrategy.concat
      case _ => MergeStrategy.first
    },
    // Filter out of the assembly any jar whose file name contains "hadoop".
    excludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
      cp filter { _.data.getName.contains("hadoop") }
    }
)
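
With Hadoop stripped from the assembly, its classes of course have to come from the cluster at launch time. One way to do that (a sketch, assuming a standard Hadoop install where the "hadoop classpath" command is on the PATH; the launch script and arguments are just examples):

    # "hadoop classpath" prints the install's jar and conf directories;
    # SPARK_CLASSPATH prepends them to Spark's classpath at launch.
    export SPARK_CLASSPATH=$(hadoop classpath)
    ./bin/spark-class org.apache.spark.deploy.yarn.Client --jar myapp.jar ...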


But I would like to hear whether there is interest in excluding the Hadoop jars by default in the build.
Alex Cozzi
[hidden email]

Re: excluding hadoop dependencies in spark's assembly files

Roman Shaposhnik
Alex,

I don't know if it helps or not, but some time back I made a Maven assembly to be
able to package Spark in Bigtop. That assembly excludes all Hadoop
dependencies, so you can simply build it using Maven instead of sbt.
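
If you would rather drive it yourself, the Maven build of that vintage is invoked along these lines (a sketch based on the build docs, not the exact Bigtop packaging invocation; check the Bigtop spec for the profile it uses):

    mvn -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests package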

Regards,
  Cos

On Mon, Jan 06, 2014 at 02:33PM, Alex Cozzi wrote:

> [...]

Re: excluding hadoop dependencies in spark's assembly files

Konstantin Boudnik
Well,

somehow I managed to send an email as Roman Shaposhnik :) I guess he got himself a
stalker ;)

Cos

On Mon, Jan 06, 2014 at 04:55PM, Roman Shaposhnik wrote:

> [...]