Speeding up Catalyst engine

Maciej Bryński
Hi Everyone,
I'm trying to speed up my Spark streaming application and I have the following problem.
I'm using a lot of joins in my app, and full Catalyst analysis is triggered during every join.

I found two options to speed this up.

1) The spark.sql.selfJoinAutoResolveAmbiguity option
But looking at the code:
https://github.com/apache/spark/blob/8cd9cdf17a7a4ad6f2eecd7c4b388ca363c20982/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L918

Shouldn't lines 925-927 come before lines 920-922?

2) https://issues.apache.org/jira/browse/SPARK-20392

Is it safe to use it on top of 2.2.0?

Regards,
--
Maciek Bryński
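
For readers who want to try option 1, here is a minimal sketch of one way to pass the setting to spark-submit through its --conf flag. The helper name and the application file name are hypothetical and are not part of Spark; only the option name and the false value come from this thread.

```python
def spark_submit_conf_args(settings):
    """Turn a {key: value} dict into a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(settings.items()):
        # Render Python booleans as lowercase true/false, the form used in this thread.
        rendered = str(value).lower() if isinstance(value, bool) else str(value)
        args += ["--conf", f"{key}={rendered}"]
    return args


# The option discussed in the post above, switched off.
settings = {"spark.sql.selfJoinAutoResolveAmbiguity": False}

# "my_streaming_app.py" is a hypothetical application file.
command = ["spark-submit", *spark_submit_conf_args(settings), "my_streaming_app.py"]
print(" ".join(command))
```

The snippet only builds the command line and does not launch Spark. Whether switching the option off is safe for a given job depends on whether that job relies on self-join ambiguity being resolved automatically.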
Re: Speeding up Catalyst engine

Liang-Chi Hsieh

Hi Maciej,

For backporting https://issues.apache.org/jira/browse/SPARK-20392, you can see the suggestion from committers on the PR. I don't think we expect it to be merged into 2.2.


Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
Re: Speeding up Catalyst engine

Maciej Bryński
Hi,

I did backport this to 2.2.
First test results (a join of about 60 tables):
Vanilla Spark: 50 sec
With SPARK-20392: 38 sec
With SPARK-20392 and spark.sql.selfJoinAutoResolveAmbiguity=false: 29 sec
Vanilla Spark with spark.sql.selfJoinAutoResolveAmbiguity=false: 34 sec
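
To put the timings above in relative terms, here is a small sketch that computes the time saved against the vanilla run. The numbers are the ones reported in this post; the dictionary keys and the function name are made up for illustration.

```python
# Timings reported in this post (seconds, join of about 60 tables).
BASELINE = "vanilla"
timings_sec = {
    "vanilla": 50,
    "SPARK-20392": 38,
    "SPARK-20392 + selfJoinAutoResolveAmbiguity=false": 29,
    "vanilla + selfJoinAutoResolveAmbiguity=false": 34,
}


def relative_savings(timings, baseline):
    """Return {name: (seconds_saved, percent_saved)} relative to the baseline run."""
    base = timings[baseline]
    return {
        name: (base - t, round(100.0 * (base - t) / base, 1))
        for name, t in timings.items()
        if name != baseline
    }


savings = relative_savings(timings_sec, BASELINE)
for name, (saved, pct) in savings.items():
    print(f"{name}: {saved} s faster ({pct}% less time)")
```

On these figures the patch alone saves 12 s (24%), the option alone saves 16 s (32%), and both together save 21 s (42%). The combined saving is smaller than the sum of the two individual savings, which suggests the two changes overlap in what they avoid. These look like single measurements, so small differences should be read with care.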

I didn't measure any difference when changing spark.sql.constraintPropagation.enabled or any other spark.sql option.

So I will leave your patch on top of 2.2.
Thank you.

M.

2017-07-25 1:39 GMT+02:00 Liang-Chi Hsieh <[hidden email]>:

> Hi Maciej,
>
> For backporting https://issues.apache.org/jira/browse/SPARK-20392, you can
> see the suggestion from committers on the PR. I don't think we expect it
> to be merged into 2.2.



--
Maciek Bryński