Benchmark Java/Scala/Python for Apache Spark

Benchmark Java/Scala/Python for Apache Spark

SNEHASISH DUTTA
Hi

Is there a way to get performance benchmarks for developing Spark applications in Java, Scala, or Python?

The use case mostly involves SQL pipelines over data ingested from various sources, including Kafka.

Which language should be preferred? It would be great if the choice could be justified from an application development perspective.

Thanks and Regards
Snehasish

Re: Benchmark Java/Scala/Python for Apache Spark

Jonathan Winandy
Hello Snehasish 

If you are not using UDFs, you will get very similar performance from those languages for SQL workloads.

So it comes down to:
* if you know Python, go for Python.
* if you are used to the JVM and are ready for a bit of a paradigm shift, go for Scala.

Our team uses Scala; however, we help other data engineers who are using Python.

I would say go for pure functional programming, but that is a biased view, and Python gets the job done anyway.

Cheers, 
Jonathan
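
To make the point concrete, here is a minimal PySpark sketch of the kind of Kafka-fed SQL pipeline Snehasish describes, using only built-in functions (no UDFs). The broker address, topic name, and schema are assumptions made up for illustration; because every transformation is a built-in Catalyst expression, the heavy lifting runs in the JVM whichever language the driver program is written in.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window, count
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-sql-pipeline").getOrCreate()

# Assumed message schema and topic name, purely for illustration.
schema = (StructType()
          .add("user", StringType())
          .add("event_time", TimestampType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load()
          # Built-in expressions only: parsing and projection stay in the JVM.
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Windowed aggregation, again entirely with built-in expressions.
counts = (events
          .groupBy(window(col("event_time"), "5 minutes"), col("user"))
          .agg(count("*").alias("n_events")))

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```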

Re: Benchmark Java/Scala/Python for Apache Spark

Dylan Guedes
Btw, even if you are using Python, you can register your UDFs in Scala and use them from Python.
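
For reference, a rough sketch of what that looks like from the Python side. It assumes a hypothetical Scala UDF class (com.example.UpperUdf) has been compiled into a jar that is on Spark's classpath; the registration uses PySpark's registerJavaFunction, so the UDF executes in the JVM and rows never round-trip through a Python worker.

```python
# Assumed Scala UDF, compiled into a jar on the Spark classpath:
#   class UpperUdf extends org.apache.spark.sql.api.java.UDF1[String, String] {
#     override def call(s: String): String = s.toUpperCase
#   }
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("scala-udf-from-python").getOrCreate()

# Register the JVM UDF under a SQL-callable name.
spark.udf.registerJavaFunction("upper_udf", "com.example.UpperUdf", StringType())

# Call it from SQL (or via expr("upper_udf(...)")) in the Python API.
spark.sql("SELECT upper_udf('hello') AS shouted").show()
```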

Re: Benchmark Java/Scala/Python for Apache Spark

Jonathan Winandy
Thanks, I didn't know! 

That being said, any UDF use seems to badly affect code generation (and therefore performance).
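
The effect is visible in the physical plan. A minimal sketch (data and column names made up) comparing a plain Python UDF with the equivalent built-in function: the UDF appears as a separate BatchEvalPython step outside whole-stage code generation, while the built-in version stays inside the generated JVM code.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, upper, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-vs-builtin").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python UDF: opaque to the optimizer; rows are shipped to a Python worker
# and the step breaks whole-stage codegen.
shout = udf(lambda s: s.upper(), StringType())
df.select(shout(col("name")).alias("shouted")).explain()

# Built-in function: handled entirely by Catalyst and generated JVM code.
df.select(upper(col("name")).alias("shouted")).explain()
```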


Re: Benchmark Java/Scala/Python for Apache Spark

rxin
If you use UDFs in Python, you would want to use Pandas UDF for better performance. 
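
For example, a scalar pandas UDF (a minimal sketch; the function and column names are invented) is called with whole Arrow batches as pandas Series rather than one row at a time, which avoids most of the per-row serialization overhead of a plain Python UDF.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, col
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("pandas-udf-example").getOrCreate()

# Scalar pandas UDF: receives and returns a pandas.Series per Arrow batch.
@pandas_udf(DoubleType())
def plus_one(v):
    return v + 1

df = spark.range(0, 1000).select(col("id").cast("double").alias("x"))
df.select(plus_one(col("x")).alias("x_plus_one")).show(5)
```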
