Is there any implicit RDD cache operation for query optimizations?

Is there any implicit RDD cache operation for query optimizations?

marcelo.amaral
As the documentation says, the Cache Manager is only invoked when a caching
(i.e. persist) function is explicitly called by the user in the code. Given
that, as far as I understand, unless cache/persist operations are explicitly
called, a job's results (including inputs and intermediate ones) will never
be stored for reuse.
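
To make this concrete, a minimal sketch of the explicit opt-in (the input
path and the names spark/expensive here are hypothetical):

import org.apache.spark.storage.StorageLevel

val expensive = spark.read.parquet("/data/events")
  .groupBy("userId")
  .count()

// Explicit opt-in; without it, every action below recomputes the full plan.
expensive.persist(StorageLevel.MEMORY_AND_DISK)

expensive.count()   // first action computes and stores the partitions
expensive.show(10)  // second action reads from the cache instead of recomputing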

I am wondering whether there exists any optimization of the query execution
plan that applies an implicit cache mechanism without the cache/persist
operation being called, or any other mechanism that can implicitly invoke
the cache in some other situation.

If I understood correctly, is there any strong reason why the Catalyst
Optimizer does not apply any cache mechanism to the intermediate results
between jobs?



RE: Is there any implicit RDD cache operation for query optimizations?

Theodoros Gkountouvas
I think Spark lets users manage the cache space because they can do it much more effectively than an automated approach could. It is very difficult to find a caching strategy that fits the needs of all users. Finally, although the boundary between execution and storage memory in an executor is only a soft limit, Spark does not want to fill the storage space with unnecessary intermediate data and needlessly constrain the execution space by default.

Some things are cached implicitly, though (e.g. shuffle output on disk), and you can avoid re-executing them when they are reused.
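
For example (a sketch with a hypothetical input path and SparkContext sc):

val counts = sc.textFile("/data/words")
  .flatMap(_.split("\\s+"))
  .map(w => (w, 1))
  .reduceByKey(_ + _)   // introduces a shuffle

counts.count()    // runs the map stage and writes shuffle files to local disk
counts.collect()  // the second job reuses the shuffle output; the map stage shows as "skipped" in the UI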

To answer your question directly, I am not aware of any Catalyst optimization that does what you want, but Catalyst accepts custom optimizations, so you can implement your own caching strategy if it fits your purposes (see below).

sparkSession.experimental.extraOptimizations ++= Seq(CacheRule)
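
A skeleton for such a rule might look like the following (CacheRule is a
hypothetical no-op placeholder here; a real rule would rewrite the plan to
materialize whichever sub-trees it decides are worth caching):

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object CacheRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    // Inspect `plan` here, e.g. detect repeated sub-trees and replace them
    // with cached relations. This placeholder leaves the plan unchanged.
    plan
  }
}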

I hope this helps,
Theo.

Re: Is there any implicit RDD cache operation for query optimizations?

attilapiros
In reply to this post by marcelo.amaral
Hi,

There is a good reason why the decision about caching is left to the user:
Spark does not know the future of the DataFrames and RDDs.

Think about how your program runs: at any moment the execution has reached
an exact point, and when Spark reaches an action it evaluates that Spark
job, but it knows nothing about the jobs still to come. Cached data would
only be useful for a future job that reuses it.

On the other hand, this information is available to the user, who writes
all the jobs.

Attila


