I'd like to start a discussion on caching SparkPlan
From what I benchmark, if sql execution time is less than 1 second, then we cannot ignore the following overheads , especially if we cache data in memory
Paring, analysing, optimizing SQL
Generating Physical Plan (SparkPlan)
snappydata(https://github.com/SnappyDataInc/snappydata) already implemented plan cache for same query pattern, for example ,given the sql: SELECT intField, count(*), sum(intField) FROM t WHERE intField = 1 group by intField, for any sql which is only difference on Literal, we can re-use the same SparkPlan. snappydata's implementation is based on caching the final RDD graph, since spark will cache shuffle data internally, I believe it can not run the same query pattern concurrently.
My idea is caching SparkPlan(and hence, caching generated codes), and creating RDD graph every time(the only overhead we can not ignore, but if we read data from memory, the overhead is OK). I did some experiment by extracting ParamLiteral from snappydata, and the experiment looks fine.
I want to get some comments before writing a jira. Would appreciate comments and discussions from the community.