Result obtained before the completion of Stages

Result obtained before the completion of Stages

ckhari4u
I found this interesting behavior while running an ad hoc analysis query. I
have a Spark SQL query that joins 2 tables and then performs a
count operation. In the Spark Web UI, I see 4 stages being
launched.

The interesting part is that I see the result before all
stages have finished. Stage 2, which performs the sort-merge join, is
still running, but the result appears in the Spark shell before Stage 2
completes. The application nevertheless continues to run.

Any thoughts on this behavior?





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: Result obtained before the completion of Stages

Sean Owen
My guess is that either the stages had actually finished before the result appeared and something about the timestamps you're comparing is misleading, or you're looking at stages that belong to a later part of the program.

(In reply to ckhari4u's message of Tue, Dec 26, 2017 at 3:49 PM.)

Re: Result obtained before the completion of Stages

ckhari4u
Hi Sean,

Thanks for the reply. I don't believe either of the scenarios you mentioned applies here.

Timestamp conflict: I am watching the Spark driver logs on the console (tried with
INFO and DEBUG). In every run, the result is printed and
the application keeps executing for 4 more minutes.
I have seen cases where Spark History Server timestamps do not match
the Spark driver logs. Here, however, I am checking only the
driver logs, and they keep being printed on the console
after the result is generated.

Stages of a different action: I am joining 2 tables and doing a
count, so there is only one action. The stage that takes the most
time is the join phase (specifically the sort-merge join). To speed up the join,
I tried caching the smaller dataset, and then I no longer see the issue.

I am just wondering how Spark can return the result before the join
has completed.

PS: My actual query in the application has many operators, UDFs, etc. The
query above is the minimal one for which I can reproduce the
issue.
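
For readers who want to try the caching workaround described above, here is a minimal sketch. It assumes a spark-shell session where `spark` is predefined, and it borrows the table names table1 and table2 from the query posted later in this thread. Which table is the smaller one is not stated in the thread, so treating table2 as the smaller side is purely an assumption.

```scala
// Assumes a spark-shell session (`spark` is predefined) and that the tables
// table1 and table2 from this thread exist in the catalog.
// Assumption: table2 is the smaller side; swap the name if it is table1.
spark.catalog.cacheTable("table2")   // marks the table for caching (lazy)
spark.table("table2").count()        // forces the cache to be materialized

val joined = spark.sql("select * from table1 t1 join table2 t2 on t1.col1 = t2.col1")
joined.explain()                     // prints the physical plan, including the join strategy
println(joined.count())
```

Comparing the output of explain() with and without the cached table may show whether the join strategy or the scan of the smaller side changed.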






Re: Result obtained before the completion of Stages

rxin
What did you run?


(In reply to ckhari4u's message of Tue, Dec 26, 2017 at 10:21 PM.)


Re: Result obtained before the completion of Stages

ckhari4u
Hi Reynold,

I am running a Spark SQL query.

val df = spark.sql("select * from table1 t1 join table2 t2 on t1.col1=t2.col1")
df.count()
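
One way to check the timestamp question raised earlier in the thread is to record, on the driver, when each stage and job ends and compare that with the moment count() returns. The following is only a sketch, assuming a spark-shell session where `spark` is predefined. The `CompletionTimes` object and its `stagesFinishedAfter` method are hypothetical names made up for illustration, not part of Spark; the listener classes it uses are Spark's public scheduler listener API.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerStageCompleted}
import scala.collection.concurrent.TrieMap

// Hypothetical helper: records when each stage and job finishes, as seen by the driver.
object CompletionTimes extends SparkListener {
  val stageDone = TrieMap.empty[Int, Long] // stageId -> completion time (ms since epoch)
  val jobDone   = TrieMap.empty[Int, Long] // jobId   -> end time (ms since epoch)

  override def onStageCompleted(e: SparkListenerStageCompleted): Unit =
    e.stageInfo.completionTime.foreach(t => stageDone.put(e.stageInfo.stageId, t))

  override def onJobEnd(e: SparkListenerJobEnd): Unit =
    jobDone.put(e.jobId, e.time)

  // Stage IDs whose recorded completion time is later than the given instant.
  def stagesFinishedAfter(instantMs: Long): Seq[Int] =
    stageDone.collect { case (id, t) if t > instantMs => id }.toSeq.sorted
}

spark.sparkContext.addSparkListener(CompletionTimes)

val df = spark.sql("select * from table1 t1 join table2 t2 on t1.col1=t2.col1")
val n = df.count()
val resultAt = System.currentTimeMillis() // driver clock, taken right after count() returns
println(s"count=$n returned at $resultAt")
```

Listener events arrive asynchronously, so the maps should be inspected only after the UI shows the remaining stages as finished. At that point, `CompletionTimes.stagesFinishedAfter(resultAt)` lists any stages the driver saw complete after the result was printed, and `CompletionTimes.jobDone` shows when each job ended.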




Re: Result obtained before the completion of Stages

rxin
Is it possible there is a bug in the UI? If you can run jstack on the executor process to see whether anything is actually running, that would help narrow down the issue.

(In reply to ckhari4u's message of Tue, Dec 26, 2017 at 10:28 PM.)

Re: Result obtained before the completion of Stages

ckhari4u
That's a good catch. I just checked the jstack and ps -ef output for the executor processes.
They are still progressing, and they complete well after the result is generated.


