Compiling Spark UDF at runtime

Michael Shtelma
Hi all,

I would like to be able to compile Spark UDFs at runtime. Right now I
am using Janino for that.
My problem is that, in order to make my compiled functions visible to
Spark, I have to set the Janino class loader (Janino gives me a class
loader containing the compiled UDF classes) as the context class loader
before I create the SparkSession. This approach works locally for
debugging purposes, but it is not going to work in cluster mode because
the UDF classes will not be distributed to the worker nodes.
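
For reference, here is a minimal sketch of what I am doing locally today
(the generated class, names and types are only placeholders):

  import org.apache.spark.sql.SparkSession;
  import org.apache.spark.sql.api.java.UDF1;
  import org.apache.spark.sql.types.DataTypes;
  import org.codehaus.janino.SimpleCompiler;

  public class RuntimeUdfDemo {
    public static void main(String[] args) throws Exception {
      // Compile a UDF class from a source string with Janino. Janino only
      // supports a subset of Java, so the class implements the raw interface.
      SimpleCompiler compiler = new SimpleCompiler();
      compiler.cook(
          "public class GeneratedUdf implements org.apache.spark.sql.api.java.UDF1 {\n"
        + "  public Object call(Object s) { return ((String) s).length(); }\n"
        + "}");
      ClassLoader janinoLoader = compiler.getClassLoader();

      // Install the Janino class loader as the context class loader *before*
      // the SparkSession is created, so that Spark can see the new class.
      Thread.currentThread().setContextClassLoader(janinoLoader);
      SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

      // Instantiate the generated class and register it as a UDF.
      UDF1<?, ?> udf = (UDF1<?, ?>) janinoLoader.loadClass("GeneratedUdf").newInstance();
      spark.udf().register("generated_udf", udf, DataTypes.IntegerType);

      spark.sql("SELECT generated_udf('hello')").show();
    }
  }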

An alternative is to register the UDFs via the Hive functionality and
generate a temporary jar somewhere, which, at least in standalone
cluster mode, will be made available to the Spark workers through the
embedded HTTP server. As far as I understand, this is not going to work
in YARN mode.
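
For illustration, the Hive-style registration I have in mind would look
roughly like this (assuming a session created with enableHiveSupport()
and a Hive UDF class inside the generated jar; the class name and jar
path are placeholders):

  spark.sql("CREATE TEMPORARY FUNCTION generated_udf "
      + "AS 'com.example.GeneratedHiveUdf' "
      + "USING JAR '/tmp/generated-udfs.jar'");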

I am now wondering what the best way to approach this problem is. My
current best idea is to develop my own small Netty-based file server
and use it to distribute my custom jar, which can be created on the
fly, to the workers in both standalone and YARN modes. Can I reference
the jar as an HTTP URL via the extra driver options and then register
the UDFs contained in this jar using the spark.udf().* methods?
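
Roughly, what I imagine is the fragment below (continuing the sketch from
my first paragraph; the URL and names are placeholders, and I have not
verified this on a real cluster):

  SparkSession spark = SparkSession.builder().getOrCreate();

  // Ship the on-the-fly jar to the executors; as far as I know addJar()
  // (and spark.jars / --jars) also accepts http:, https: and hdfs: URIs.
  spark.sparkContext().addJar("http://driver-host:8080/generated-udfs.jar");

  // On the driver the classes are already visible through the Janino class
  // loader, so registration can stay the same as in local mode.
  UDF1<?, ?> udf = (UDF1<?, ?>) janinoLoader.loadClass("GeneratedUdf").newInstance();
  spark.udf().register("generated_udf", udf, DataTypes.IntegerType);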

Does anybody have any better ideas?
Any assistance would be greatly appreciated!

Thanks,
Michael

Re: Compiling Spark UDF at runtime

geoHeil
You could store the jar in HDFS. Then even in YARN cluster mode the workaround you describe should work.
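
For example, roughly like this (a sketch, assuming fs.defaultFS points at
the cluster's HDFS; all paths are placeholders):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.spark.sql.SparkSession;

  // Upload the jar that was generated on the fly.
  FileSystem fs = FileSystem.get(new Configuration());
  fs.copyFromLocalFile(new Path("/tmp/generated-udfs.jar"),
                       new Path("/user/me/udfs/generated-udfs.jar"));

  // Reference it via an hdfs: URI so that YARN executors can fetch it too.
  SparkSession spark = SparkSession.builder()
      .config("spark.jars", "hdfs:///user/me/udfs/generated-udfs.jar")
      .getOrCreate();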

Re: Compiling Spark UDF at runtime

Michael Shtelma
Thanks!
Yes, this would of course be an option: HDFS or Alluxio.
Sincerely,
Michael Shtelma

