[DISCUSS] Caching SparkPlan


Chang Chen
I'd like to start a discussion on caching SparkPlan.

From my benchmarks, when SQL execution time is under one second, the following overheads cannot be ignored, especially when the data is cached in memory:
  1. Parsing, analyzing, and optimizing the SQL
  2. Generating the physical plan (SparkPlan)
  3. Generating code
  4. Generating the RDD
snappydata (https://github.com/SnappyDataInc/snappydata) already implements a plan cache for queries with the same pattern. For example, given the SQL SELECT intField, count(*), sum(intField) FROM t WHERE intField = 1 GROUP BY intField, any SQL that differs only in its Literal values can reuse the same SparkPlan. However, snappydata's implementation caches the final RDD graph, and since Spark caches shuffle data internally, I believe it cannot run the same query pattern concurrently.
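To illustrate the "same query pattern" idea, here is a minimal, Spark-free sketch of literal normalization: two SQL strings map to the same cache key when they differ only in literals. The object and method names here are hypothetical, and a real implementation (like snappydata's ParamLiteral) works on the parsed expression tree rather than the raw SQL text:

```scala
object QueryPattern {
  // Matches single-quoted string literals and bare numeric literals.
  private val Literal = """'[^']*'|\b\d+(\.\d+)?\b""".r

  /** Replace each literal with a placeholder and collect its value,
    * so queries that differ only in literals share one cache key. */
  def normalize(sql: String): (String, Seq[String]) = {
    val params = Literal.findAllIn(sql).toSeq
    (Literal.replaceAllIn(sql, "?"), params)
  }
}
```

With this, WHERE intField = 1 and WHERE intField = 2 normalize to the same key with different parameter lists, which is exactly the condition for reusing one SparkPlan.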

My idea is to cache the SparkPlan (and hence the generated code) while creating the RDD graph anew on every execution. That is the only overhead we cannot avoid, but it is acceptable when the data is read from memory. I did some experiments by extracting ParamLiteral from snappydata, and the results look fine.
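A rough, Spark-free sketch of that split: the expensive artifact (standing in for the SparkPlan plus its generated code) is cached per query pattern, while per-execution state (standing in for the RDD graph) is built fresh on every call. All names here (PlanCache, CompiledPlan, compile) are hypothetical, and literal extraction is simplified to a regex on the SQL text:

```scala
import java.util.concurrent.ConcurrentHashMap

// Stand-in for a cached SparkPlan plus generated code:
// expensive to build once, cheap to invoke per execution.
final case class CompiledPlan(pattern: String, run: Seq[Long] => Long)

object PlanCache {
  private val Literal = """\b\d+\b""".r
  private val cache = new ConcurrentHashMap[String, CompiledPlan]()

  /** Execute a query, reusing the compiled plan for its pattern. */
  def execute(sql: String): Long = {
    val params  = Literal.findAllIn(sql).map(_.toLong).toSeq
    val pattern = Literal.replaceAllIn(sql, "?")
    val plan    = cache.computeIfAbsent(pattern, p => compile(p))
    // Per-execution state (the "RDD graph") is created fresh on every
    // call, so concurrent runs of the same pattern share nothing mutable.
    plan.run(params)
  }

  def compiledPatterns: Int = cache.size()

  // Pretend compilation: in reality this is planning plus codegen.
  private def compile(pattern: String): CompiledPlan =
    CompiledPlan(pattern, params => params.sum)
}
```

Since ConcurrentHashMap.computeIfAbsent is atomic per key, many threads can execute the same pattern concurrently while the plan is compiled at most once, which is the behavior the RDD-graph cache cannot give us.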

I want to get some feedback before filing a JIRA, and would appreciate comments and discussion from the community.

Thanks
Chang