
How to cache SparkPlan.execute for reuse?


summerDG
We are optimizing Spark SQL for adaptive execution, so a SparkPlan may be reused when choosing strategies. But we find that once the result of SparkPlan.execute, an RDD[InternalRow], is cached using RDD.cache, the query output is empty.
1. How can we cache the result of SparkPlan.execute?
2. Why does RDD.cache not work for RDD[InternalRow]?
Re: How to cache SparkPlan.execute for reuse?

Liang-Chi Hsieh

Internally, in each partition of the resulting RDD[InternalRow], you get the same UnsafeRow object back as you iterate over the rows. A plain RDD.cache therefore doesn't work for it: the cached partitions end up holding many references to that one mutable row, so the output repeats the same row. I'm not sure why you see empty output, though.

Dataset.cache() is the supported way to cache SQL query results. Even if you do cache an RDD[InternalRow] via RDD.cache using the trick of copying each row first (at a significant performance cost), a new query (plan) will not automatically reuse the cached RDD, because executing the new plan creates new RDDs.
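
A minimal sketch of that row-copying trick, written against Spark's internal API (df stands for any Dataset; queryExecution, executedPlan, execute(), and InternalRow.copy() are real internals, the rest is illustrative):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.storage.StorageLevel

    // The physical plan behind an arbitrary Dataset df.
    val plan = df.queryExecution.executedPlan

    // execute() yields partitions that keep handing back one mutable
    // UnsafeRow, so each row must be copied before it can be cached safely.
    val cached: RDD[InternalRow] =
      plan.execute().map(_.copy()).persist(StorageLevel.MEMORY_AND_DISK)

    cached.count()  // materialize the cache once

The map(_.copy()) step is exactly the copy that carries the performance penalty mentioned above.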

summerDG wrote
We are optimizing Spark SQL for adaptive execution, so a SparkPlan may be reused when choosing strategies. But we find that once the result of SparkPlan.execute, an RDD[InternalRow], is cached using RDD.cache, the query output is empty.
1. How can we cache the result of SparkPlan.execute?
2. Why does RDD.cache not work for RDD[InternalRow]?
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
Re: How to cache SparkPlan.execute for reuse?

summerDG
Thank you very much. The reason the output is empty is that the query involves a join; I forgot to mention that in the question. So even if I succeed in caching the RDD, the SparkPlans downstream in the query will not reuse it.
If a SparkPlan in the query has several "parent" nodes, do its parents have to reuse it by creating new RDDs?
Re: How to cache SparkPlan.execute for reuse?

Liang-Chi Hsieh

I'm not sure what you mean by "its parents have to reuse it by creating new RDDs".

Since SparkPlan.execute returns a new RDD every time it is called, you shouldn't expect the cached RDD to be reused automatically, even if you reuse the same SparkPlan across several queries.
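
To illustrate, a sketch using the internal API (df stands for any Dataset; this holds for a typical plan node):

    val plan = df.queryExecution.executedPlan
    val rdd1 = plan.execute()
    val rdd2 = plan.execute()
    // Two distinct RDD objects with distinct ids: caching rdd1 does not
    // make the second execute() call pick it up.
    assert(rdd1.id != rdd2.id)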

By the way, is there any existing way to reuse a SparkPlan?


summerDG wrote
Thank you very much. The reason the output is empty is that the query involves a join; I forgot to mention that in the question. So even if I succeed in caching the RDD, the SparkPlans downstream in the query will not reuse it.
If a SparkPlan in the query has several "parent" nodes, do its parents have to reuse it by creating new RDDs?
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
Re: How to cache SparkPlan.execute for reuse?

summerDG
There is indeed no existing way to reuse a SparkPlan in stock Spark. But I have modified the Spark SQL code to optimize multi-way joins. The project needs to collect statistics, so some SparkPlans, each of which has more than one parent, may be executed several times.
The existing design of Spark can hardly support adaptive strategy choice based on statistics.
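
For what it is worth, the pattern described here might look like the following sketch, combining the row-copying trick from earlier in the thread (sharedPlan is an illustrative name for a plan node with more than one parent, not a Spark API):

    import org.apache.spark.storage.StorageLevel

    // Execute the shared subplan once, copying rows so the cache is valid.
    val shared = sharedPlan.execute().map(_.copy())
      .persist(StorageLevel.MEMORY_AND_DISK)

    // A statistic for adaptive strategy choice, computed once from the cache.
    val rowCount = shared.count()

    // Each parent would then have to be rewired to consume shared instead of
    // calling sharedPlan.execute() again, which stock Spark does not do.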