Checkpointing clarifications

Alessandro Liparoti
Good morning,

I have a large-scale job that breaks for certain input sizes, so I am experimenting with checkpointing to split the DAG and find the problematic point. I have some questions about checkpointing:
  1. What is the utility of non-eager checkpointing?
  2. How is checkpointing different from manually writing a dataframe (or RDD) to HDFS? Doing that also allows re-reading the stored dataframe later, while with checkpointing I don't see a simple way of re-reading the data in a future job.
  3. I read that checkpointing differs from persisting because the lineage is not stored, but I don't understand why persisting stores the lineage. The point of persisting is that the next computation will start from the persisted data (either memory or memory+disk), so what is the advantage of having the lineage available? Am I missing some basic understanding of these two apparently different operations?
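
To make questions 1 and 3 concrete, here is a toy pure-Python model of the distinction. It is not Spark code: the class `ToyDataset`, its methods, and the `storage` dict standing in for HDFS are all made up for illustration. It assumes the usual description that a persisted dataset keeps its parents so that lost cached data can be recomputed, while a checkpointed dataset reads back from storage and has no parents.

```python
class ToyDataset:
    """Hypothetical stand-in for an RDD/DataFrame: a lazy step plus its parent."""

    def __init__(self, compute, parent=None, name="source"):
        self._compute = compute    # function: parent rows -> rows
        self.parent = parent       # lineage link (None for a source)
        self.name = name
        self._persisted = False
        self._cache = None         # filled on first collect() after persist()
        self.runs = 0              # how many times this step was computed

    def map(self, fn, name="map"):
        return ToyDataset(lambda rows: [fn(r) for r in rows], parent=self, name=name)

    def persist(self):
        self._persisted = True     # lazy: nothing is cached yet
        return self

    def lose_cache(self):
        self._cache = None         # models eviction or a lost executor

    def lineage(self):
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names

    def collect(self):
        if self._cache is not None:
            return list(self._cache)
        parent_rows = self.parent.collect() if self.parent is not None else None
        self.runs += 1
        rows = self._compute(parent_rows)
        if self._persisted:
            self._cache = list(rows)
        return rows

    def checkpoint(self, storage, key, eager=True):
        original = self

        def read(_):
            if key not in storage:             # non-eager: written on first use
                storage[key] = original.collect()
            return list(storage[key])

        result = ToyDataset(read, parent=None, name="checkpoint(" + key + ")")
        if eager:
            result.collect()                   # eager: written right away
        return result


source = ToyDataset(lambda _: [1, 2, 3])
doubled = source.map(lambda x: x * 2, name="doubled").persist()

doubled.collect()              # computes and fills the cache
doubled.collect()              # served from the cache
runs_with_cache = doubled.runs

doubled.lose_cache()
doubled.collect()              # cache gone: recomputed through the lineage
runs_after_loss = doubled.runs

storage = {}
eager_cp = doubled.checkpoint(storage, "eager")
lazy_cp = doubled.checkpoint(storage, "lazy", eager=False)
written_before_action = "lazy" in storage
lazy_cp.collect()
written_after_action = "lazy" in storage
```

In this model `doubled.lineage()` is `["doubled", "source"]`, while `eager_cp.lineage()` contains only the checkpoint itself. The non-eager variant writes nothing until the first `collect()`, which is the only difference from the eager one here.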
Thanks,
Alessandro Liparoti