yarn, fat-jars and lib_managed


yarn, fat-jars and lib_managed

Alex Cozzi
I am just starting out playing with Spark on our Hadoop 2.2 cluster, and I have a question.

The current way to submit jobs to the cluster is to create fat-jars with sbt assembly. This approach works, but I think it is less than optimal in many large Hadoop installations:

The way we interact with the cluster is to log into a CLI machine, which is the only one authorized to submit jobs. I cannot use the CLI machine as a dev environment, since for security reasons the CLI machine and the Hadoop cluster are firewalled and cannot reach out to the internet, so sbt and Maven dependency resolution do not work.

So the procedure now is:
- hack code
- sbt assembly
- rsync my spark directory to the CLI machine
- run my job.

The issue is that I need to shuttle large binary files (all the fat-jars) back and forth every time; they are about 120 MB now, which is slow, particularly when I am working remotely from home.
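
For concreteness, the round trip currently looks roughly like this; the project name, the CLI host name, and the run script are hypothetical placeholders:

    # hack code, then rebuild the fat-jar (~120 MB with all dependencies baked in)
    sbt assembly
    # ship the whole thing to the firewalled CLI machine, which can reach the cluster
    rsync -avz target/scala-2.10/myjob-assembly-0.1.jar cli-machine:jobs/
    # submit from there (run-my-job.sh stands in for whatever launch script is used)
    ssh cli-machine 'cd jobs && ./run-my-job.sh'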

I was wondering whether a better solution would be to create normal thin jars of my code, which are very small (less than a MB) and therefore no problem to copy to the cluster every time, and to take advantage of the sbt-created lib_managed directory to handle dependencies. We already have this directory, which sbt maintains with all the dependencies the job needs to run. Wouldn't it be possible to have the Spark YARN Client add all the jars in lib_managed to the classpath and distribute them to the workers automatically? They could also be cached across Spark invocations; after all, those jars are versioned and immutable, with the possible exception of -SNAPSHOT releases. I think this would greatly simplify the development procedure and remove the need to mess with ADD_JAR and SPARK_CLASSPATH.
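
To make the thin-jar idea concrete, here is a rough sketch of how the pieces could fit, assuming retrieveManaged := true is set in build.sbt so that sbt copies every resolved dependency under lib_managed (jar names and the CLI host are placeholders):

    # build only my own code; the resulting thin jar is well under 1 MB
    sbt package
    # the dependencies sbt has already resolved and copied locally
    find lib_managed -name '*.jar'
    # only the thin jar has to travel to the CLI machine on each iteration
    rsync -avz target/scala-2.10/myjob_2.10-0.1.jar cli-machine:jobs/

What is missing today is the last step: having the Spark YARN Client pick up lib_managed and ship those jars to the workers itself.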

What do you think?

Alex

RE: yarn, fat-jars and lib_managed

Liu, Raymond
I think you could put the Spark jar, and the other jars your app depends on that do not change often, on HDFS, and use --files or --addJars (depending on the mode you run in, YarnClient or YarnStandalone) to refer to them.
Then you only need to redeploy your thin app jar on each invocation.

Best Regards,
Raymond Liu
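
A minimal sketch of what this suggestion could look like, assuming the pre-spark-submit YARN client invocation of that era; the HDFS directory, jar names, main class, and assembly path are placeholders, and the exact flags vary by Spark version:

    # one-time (or rarely): push the stable dependency jars to HDFS
    hadoop fs -mkdir -p /user/alex/spark-libs
    find lib_managed -name '*.jar' -exec hadoop fs -put {} /user/alex/spark-libs/ \;

    # per change: redeploy only the thin app jar, then point the client at the HDFS copies
    SPARK_JAR=/path/to/spark-assembly.jar \
      ./spark-class org.apache.spark.deploy.yarn.Client \
        --jar myjob_2.10-0.1.jar \
        --class com.example.MyJob \
        --addJars hdfs:///user/alex/spark-libs/dep-a.jar,hdfs:///user/alex/spark-libs/dep-b.jar

Since the dependency jars rarely change, they only need to be uploaded once and can be reused on every run.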


Re: yarn, fat-jars and lib_managed

Alex Cozzi
Well, yes, you can, but it would be much more convenient if Spark were to automatically take care of all the jars under lib_managed, rather than having to list 20-30 jars in --files and --addJars.
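
In the meantime, one hypothetical stopgap is to generate the comma-separated list from lib_managed at submit time instead of typing it by hand (jar name, main class, and the exact client invocation are again placeholders):

    # build the --addJars value from whatever sbt has put in lib_managed
    ADD_JARS=$(find lib_managed -name '*.jar' | paste -sd, -)

    ./spark-class org.apache.spark.deploy.yarn.Client \
      --jar myjob_2.10-0.1.jar \
      --class com.example.MyJob \
      --addJars "$ADD_JARS"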

