Spark DAG scheduler

Spark DAG scheduler

Mania Abdi
Hello everyone,

I am implementing a caching mechanism for analytic workloads running on top of Spark, and I need to retrieve the Spark DAG right after it is generated, along with the DAG scheduler. I would appreciate any hints or pointers to documentation about where the DAG is generated and how inputs are assigned to it. I found the DAGScheduler class, but I am not sure whether it is a good starting point.

Regards
Mania

Re: Spark DAG scheduler

rxin
The RDD is the DAG.


On Thu, Apr 16, 2020 at 3:16 PM, Mania Abdi <[hidden email]> wrote:
Hello everyone,

I am implementing a caching mechanism for analytic workloads running on top of Spark, and I need to retrieve the Spark DAG right after it is generated, along with the DAG scheduler. I would appreciate any hints or pointers to documentation about where the DAG is generated and how inputs are assigned to it. I found the DAGScheduler class, but I am not sure whether it is a good starting point.

Regards
Mania



Re: Spark DAG scheduler

Mania Abdi
Is it correct to say that the nodes in the DAG are RDDs and the edges are computations?

On Thu, Apr 16, 2020 at 6:21 PM Reynold Xin <[hidden email]> wrote:
The RDD is the DAG.



Re: Spark DAG scheduler

rxin
If you are talking about a tree, then the RDDs are nodes, and the dependencies are the edges.

If you are talking about a DAG, then the partitions in the RDDs are the nodes, and the dependencies between the partitions are the edges.
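
To make the two views concrete, here is a minimal sketch in plain Python. It is a toy model only: `ToyRDD`, `rdd_edges`, and `partition_edges` are hypothetical names invented for illustration and are not part of Spark. It assumes a simple chain of RDDs and two kinds of dependency: "narrow" (child partition i reads only parent partition i) and "shuffle" (each child partition may read every parent partition).

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Toy model only: these classes and functions are hypothetical, NOT Spark's API.


@dataclass
class ToyRDD:
    name: str
    num_partitions: int
    # Each entry is (parent, kind), where kind is "narrow" or "shuffle".
    deps: List[Tuple["ToyRDD", str]] = field(default_factory=list)


def rdd_edges(rdd):
    """RDD-level view: nodes are RDDs, edges are dependencies."""
    edges = []
    for parent, kind in rdd.deps:
        edges.append((parent.name, rdd.name, kind))
        edges.extend(rdd_edges(parent))
    return edges


def partition_edges(rdd):
    """Partition-level view: nodes are (rdd name, partition index) pairs."""
    edges = []
    for parent, kind in rdd.deps:
        for child_idx in range(rdd.num_partitions):
            if kind == "narrow":
                # One-to-one: child partition i reads only parent partition i.
                parent_idxs = [child_idx]
            else:
                # Shuffle: a child partition may read every parent partition.
                parent_idxs = range(parent.num_partitions)
            for parent_idx in parent_idxs:
                edges.append(((parent.name, parent_idx), (rdd.name, child_idx)))
        edges.extend(partition_edges(parent))
    return edges


# A three-step chain: source -> mapped (narrow) -> reduced (shuffle).
source = ToyRDD("source", 2)
mapped = ToyRDD("mapped", 2, [(source, "narrow")])
reduced = ToyRDD("reduced", 3, [(mapped, "shuffle")])

print(rdd_edges(reduced))             # 2 edges between 3 RDDs
print(len(partition_edges(reduced)))  # 6 shuffle edges + 2 narrow edges
```

Starting from the last RDD and walking its dependencies is enough to recover the whole graph, which is the sense in which the RDD is the DAG. In Spark itself, the Scala `RDD` class exposes `dependencies` and `toDebugString`, which may be a more direct starting point for inspecting lineage than the scheduler.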


On Thu, Apr 16, 2020 at 4:02 PM, Mania Abdi <[hidden email]> wrote:
Is it correct to say that the nodes in the DAG are RDDs and the edges are computations?
