Screen Shot 2020-05-11 at 5.28.03 AM


Screen Shot 2020-05-11 at 5.28.03 AM

zhangliyun
Hi all,
   I have a Spark 2.3.1 job that has been stuck for 23 hours. The Spark history server shows that 5039 of 5043 tasks have finished, which should mean 4 are still running, but when I go to the tasks page there are no running tasks. I downloaded the logs and grepped stdout for "Dropping event from queue" with no matches, so the hang does not appear to be caused by "spark.scheduler.listenerbus.eventqueue.capacity" being too small. I would appreciate any suggestions on how to find out why the job is stuck.
   There are no running tasks in the running stage.
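
(For reference, a minimal sketch of how the setting mentioned above would be raised, e.g. in spark-shell or via spark-submit --conf; the value is illustrative, and this only matters when the driver log really does contain "Dropping event from queue" lines, which it did not here.)

```scala
// A minimal sketch (illustrative, not from the original post): raising the listener
// bus queue capacity. Only relevant if the driver log shows dropped listener events.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("listener-bus-capacity-example") // hypothetical app name
  .config("spark.scheduler.listenerbus.eventqueue.capacity", "20000") // default is 10000
  .getOrCreate()
```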



[Attachment: Screen Shot 2020-05-11 at 5.28.03 AM.png (131K)]

Re: Screen Shot 2020-05-11 at 5.28.03 AM

RussS
Have you checked the executor thread dumps? They may give you some insight into whether there is a deadlock or something else.

They should be available under the Executors tab in the UI.
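
(For reference, the data behind that page is just the JVM's per-thread stack traces; a minimal standalone sketch of printing the same information, not a Spark API:)

```scala
// A minimal standalone sketch (not Spark-specific): print every live thread's name,
// state and stack trace, which is essentially what the Executors tab's
// "Thread Dump" link shows for an executor JVM.
import scala.collection.JavaConverters._

object DumpThreads {
  def main(args: Array[String]): Unit = {
    Thread.getAllStackTraces.asScala.foreach { case (t, frames) =>
      println(s"#${t.getId} ${t.getName} ${t.getState}")
      frames.foreach(f => println(s"    at $f"))
    }
  }
}
```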

On Sun, May 10, 2020, 4:43 PM zhangliyun <[hidden email]> wrote:
<snip...>

Re:Re: Screen Shot 2020-05-11 at 5.28.03 AM

zhangliyun


Hi,

   Appreciate your reply. I guess you mean the Executors page, which I have opened. If there is a deadlock, will the thread state show "Dead Lock"? Which clue should I look for to find out why the stage reports running tasks when there actually are none?
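
(For what it's worth, the JVM has no literal "Dead Lock" thread state; a dump shows the standard java.lang.Thread.State values such as RUNNABLE, BLOCKED and WAITING, and a deadlock is inferred from which thread holds which lock. A minimal sketch of asking the JVM directly via the standard ThreadMXBean API, not a Spark API:)

```scala
// A minimal sketch: ask the JVM whether any threads are deadlocked. Standard JMX.
import java.lang.management.ManagementFactory

object FindDeadlocks {
  def main(args: Array[String]): Unit = {
    val mx = ManagementFactory.getThreadMXBean
    // findDeadlockedThreads returns null when no threads are deadlocked.
    Option(mx.findDeadlockedThreads()) match {
      case None => println("No deadlocked threads detected")
      case Some(ids) =>
        mx.getThreadInfo(ids, true, true).foreach { info =>
          println(s"${info.getThreadName} (${info.getThreadState}) " +
            s"waiting on ${info.getLockName} held by ${info.getLockOwnerName}")
        }
    }
  }
}
```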




At 2020-05-11 08:55:25, "Russell Spitzer" <[hidden email]> wrote:

<snip...>

[Attachment: Screen Shot 2020-05-11 at 9.39.07 AM.png (515K)]

Re: Re:Re: Screen Shot 2020-05-11 at 5.28.03 AM

ZHANG Wei
Sometimes the thread dump table in the Spark UI can provide clues for tracking down thread lock issues, such as:

  Thread ID | Thread Name                  | Thread State | Thread Locks
  13        | NonBlockingInputStreamThread | WAITING      | Blocked by Thread Some(48) Lock(jline.internal.NonBlockingInputStream@103008951})
  48        | Thread-16                    | RUNNABLE     | Monitor(jline.internal.NonBlockingInputStream@103008951})

Each thread row also expands to show its call stack when clicked. In this case, for thread 48, these are the frames of the function holding the lock:

  org.fusesource.jansi.internal.Kernel32.ReadConsoleInputW(Native Method)
  org.fusesource.jansi.internal.Kernel32.readConsoleInputHelper(Kernel32.java:811)
  org.fusesource.jansi.internal.Kernel32.readConsoleKeyInput(Kernel32.java:842)
  org.fusesource.jansi.internal.WindowsSupport.readConsoleInput(WindowsSupport.java:97)
  jline.WindowsTerminal.readConsoleInput(WindowsTerminal.java:222)
  <snip...>

Cheers,
-z
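
(To make the table above concrete, here is a small self-contained sketch, not taken from the thread, that produces the same kind of picture: one thread holding a monitor while another is blocked on it.)

```scala
// A self-contained sketch: thread "holder" owns a monitor while thread "waiter" is
// blocked on it. In a thread dump this looks much like the table above: one thread
// holding the lock (here TIMED_WAITING in sleep) and one BLOCKED thread waiting on it.
object MonitorExample {
  private val lock = new Object

  private def thread(name: String)(body: => Unit): Thread =
    new Thread(new Runnable { def run(): Unit = body }, name)

  def main(args: Array[String]): Unit = {
    val holder = thread("holder") { lock.synchronized { Thread.sleep(60000) } }
    val waiter = thread("waiter") { lock.synchronized { println("waiter got the lock") } }
    holder.start()
    Thread.sleep(100) // let "holder" grab the monitor first
    waiter.start()
    // While both threads are alive, `jstack <pid>` (or the Spark UI thread dump)
    // shows "waiter" BLOCKED on the monitor owned by "holder".
    holder.join(); waiter.join()
  }
}
```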

________________________________________
From: zhangliyun <[hidden email]>
Sent: Monday, May 11, 2020 9:44
To: Russell Spitzer; Spark Dev List
Subject: Re:Re: Screen Shot 2020-05-11 at 5.28.03 AM


<snip...>

Re: Re:Re: Screen Shot 2020-05-11 at 5.28.03 AM

RussS
I would specifically look at the executor task launch threads; if you click on them they will expand. This will tell you which line of code they are executing, which may give some hints as to why the code has not returned yet.

Here is an example where I deliberately stalled a job by putting a Thread.sleep in random partitions:

[Attachment: image.png]

You can see the code is waiting because I called a sleep.
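
(A rough sketch of that kind of experiment for spark-shell, where `spark` is the predefined SparkSession; the numbers are illustrative and this is not the exact code behind the screenshot.)

```scala
// A rough sketch (illustrative): stall some partitions so their tasks stay RUNNING
// forever, then open the executor thread dump and look for the sleeping task threads.
val stalled = spark.sparkContext
  .parallelize(1 to 1000000, numSlices = 100)
  .mapPartitionsWithIndex { (idx, iter) =>
    if (idx % 17 == 0) Thread.sleep(Long.MaxValue) // simulate a stuck task
    iter
  }
stalled.count() // hangs; the stuck tasks' threads show java.lang.Thread.sleep in the dump
```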

On Mon, May 11, 2020 at 2:41 AM ZHANG Wei <[hidden email]> wrote:
<snip...>

Re:Re: Re:Re: Screen Shot 2020-05-11 at 5.28.03 AM

zhangliyun
In reply to this post by ZHANG Wei


  Hi all,
  Thanks for your replies. The job had been hung for 20+ hours and the history server has already deleted the log. I will keep monitoring and will try a thread dump to find something next time.

Best Regards

Kelly Zhang
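
(One hedged option while waiting for it to reproduce, assuming shell access to the executor hosts and the JDK's jstack tool on the PATH: capture periodic thread dumps of the executor JVM, for example driven from a small Scala helper like the sketch below.)

```scala
// A hedged sketch: run `jstack <pid>` every 30 seconds and print the output, so the
// dumps survive even if the Spark history server later drops the application logs.
// Assumes the executor JVM's pid is passed as the first argument (e.g. found via `jps -m`).
import scala.sys.process._

object PeriodicJstack {
  def main(args: Array[String]): Unit = {
    val pid = args(0)
    while (true) {
      val dump = Seq("jstack", pid).!! // capture the full thread dump as a string
      println(s"===== ${java.time.Instant.now} =====\n$dump")
      Thread.sleep(30000)
    }
  }
}
```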





At 2020-05-11 15:41:29, "ZHANG Wei" <[hidden email]> wrote:
<snip...>