SPARK-23443 - Spark with Glue as external catalog

Edgar Klerks
Hi there,

I am a potentially new contributor, so don't spend too much time on me. However, I would like to give this a try. The reason is that a connection between Glue and Spark would be nice to have at my work. We run our own Spark clusters rather than EMR, and right now our Spark jobs can't benefit from the Glue metastore. This is not a huge problem, because we keep strict naming conventions and use ORC, but it would still be nice for our user base.

As you can guess, our cluster runs on AWS. I have a good amount of experience with the AWS SDKs and a reasonable amount with Scala. I am, however, a beginner with Spark and have never contributed before.

As far as I can see, I need to implement ExternalCatalog for Glue, and Glue seems to support all operations specified in the trait. That includes user-defined functions, which surprised me, because Athena doesn't support them.
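
To make the shape of that idea concrete, here is a minimal sketch of a catalog trait implemented on top of a metastore client. Everything in it is a simplified, hypothetical stand-in: `MiniExternalCatalog` is not Spark's real ExternalCatalog trait (which has many more operations and richer argument types), `MetastoreClient` is not the AWS SDK, and `InMemoryClient` exists only so the sketch runs without AWS credentials. It shows the adapter pattern, nothing more.

```scala
// Hypothetical, simplified stand-ins: none of these names come from Spark or the AWS SDK.
object GlueCatalogSketch {

  // A tiny subset of the kind of operations a catalog trait specifies.
  trait MiniExternalCatalog {
    def databaseExists(db: String): Boolean
    def listTables(db: String): Seq[String]
    def createTable(db: String, table: String, ignoreIfExists: Boolean): Unit
  }

  // Stand-in for a metastore client; a real one would wrap a Glue SDK client.
  trait MetastoreClient {
    def getDatabase(name: String): Option[String]
    def getTables(db: String): Seq[String]
    def putTable(db: String, table: String): Unit
  }

  // In-memory fake: maps each database name to its set of table names.
  final class InMemoryClient(initial: Map[String, Set[String]]) extends MetastoreClient {
    private var state: Map[String, Set[String]] = initial

    def getDatabase(name: String): Option[String] = state.keys.find(_ == name)

    def getTables(db: String): Seq[String] =
      state.getOrElse(db, Set.empty[String]).toSeq.sorted

    def putTable(db: String, table: String): Unit =
      state = state.updated(db, state.getOrElse(db, Set.empty[String]) + table)
  }

  // The adapter: each catalog operation is translated into client calls.
  final class ClientBackedCatalog(client: MetastoreClient) extends MiniExternalCatalog {

    def databaseExists(db: String): Boolean = client.getDatabase(db).isDefined

    def listTables(db: String): Seq[String] = {
      requireDb(db)
      client.getTables(db)
    }

    def createTable(db: String, table: String, ignoreIfExists: Boolean): Unit = {
      requireDb(db)
      val exists = client.getTables(db).contains(table)
      if (exists && !ignoreIfExists)
        throw new IllegalStateException(s"Table $db.$table already exists")
      if (!exists) client.putTable(db, table)
    }

    // Fail early with a clear error when the database is missing.
    private def requireDb(db: String): Unit =
      if (!databaseExists(db)) throw new NoSuchElementException(s"Database $db not found")
  }
}
```

Keeping the client behind its own trait means the catalog logic can be tested against the in-memory fake, with a real SDK-backed client swapped in later.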

I can see some obstacles, e.g. how to deal with permissions. Therefore I will study the Hive implementation of ExternalCatalog. Can I take that as a leading example?

I also saw there was prior work from the mailing list (http://apache-spark-developers-list.1001551.n3.nabble.com/A-new-external-catalog-td23394.html), but unfortunately there is no code. 

Would this be a suitable project to pick up? I thought it might be, because it is kinda on the edge of Spark. 

Thanks for your time in advance!

Greets,

Edgar Klerks

Re: SPARK-23443 - Spark with Glue as external catalog

Edgar Klerks
I already went ahead with this one. Everything is pretty self-explanatory, and previous emails on the list are helpful about how to test things. I no longer need answers to my previous questions.

On Fri, May 22, 2020 at 10:12 AM Edgar Klerks <[hidden email]> wrote: