[Spark RPC] Help how to debug sudden performance issue


[Spark RPC] Help how to debug sudden performance issue

Hough, Stephen C

Spark Version: 2.0.2

 

I am running a cluster with 400 workers where each worker has 1 executor configured with 40 cores for a total capacity of 16,000 cores.

I run about 10,000 jobs totalling 1.5M tasks, where each job is a simple spark.parallelize(list, list.size()).map(func).collectAsync().  Each job's list contains between 1 and 2,000 elements, so each job produces between 1 and 2,000 tasks.

‘func’ runs our code, which does some computation, posts events to Kafka, and returns a Boolean; we ignore the result.  The duration of func ranges from a few seconds up to 20 minutes.
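For context, a minimal sketch of that job pattern (the names and the body of func below are illustrative assumptions, not our actual code):

import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.Await
import scala.concurrent.duration.Duration

object BatchSketch {
  // Illustrative stand-in for our real 'func': computes for seconds up to
  // 20 minutes, posts events to Kafka (elided), returns an ignored Boolean.
  def func(item: String): Boolean = {
    // ... computation and Kafka publish ...
    true
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("batch-sketch"))
    val list = Seq.tabulate(2000)(i => s"work-item-$i") // 1 to 2000 items per job
    // One partition per element, so every element becomes its own task.
    val future = sc.parallelize(list, list.size).map(func).collectAsync()
    Await.result(future, Duration.Inf) // we ignore the collected Booleans
    sc.stop()
  }
}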

 

The application context is launched on a driver server with 32 cores; the only other services running on that box are a Kafka broker and a ZooKeeper service.

 

This particular batch took around 2 hours in our development environment, which met our SLA.  When we ran it in production, however, it took 3 hours to complete.  We thought the slowdown might be caused by another cluster of around 300 workers we were running at the same time, but AWS informed us that the networks were isolated.  Still, when I moved the job to run after the other cluster's batch had finished, the time dropped back down to 2 hours.

I analyzed our logs, and they show that a yet-to-be-determined incident at 22:02:42 caused Spark to ‘go slow’.

 

By capturing the timestamp of the executor-side ‘Finished task’ message and tracking the same TID through to the task result getter, I measured the duration from task completion until the result is processed on the driver and a core is freed for the scheduler.

 

For the most part this duration stays within a reasonable range of about 10ms.  Then, at the incident time, it suddenly rises to 5s, 20s, 587s, peaking at 32 minutes only 8 minutes after the incident.  So it takes up to 32 minutes from the time a result is sent back to the Spark driver until it is processed, which explains the performance hit: during this time the freed cores on the workers sit idle waiting for new tasks.  Note that I also tracked the Kafka event sent by each task; we saw it on the driver roughly 2ms later, so the results are reaching the server over the network okay.

 

Looking at the RPC code, it became apparent to me that once a backlog of messages builds up for an endpoint, the dispatcher effectively turns single-threaded while it drains that backlog.  So I did another scan of the driver logs looking for long-running dispatcher threads, i.e. a dispatcher thread that processes more than one consecutive message (the scan itself is sketched below).  A very obvious issue became apparent.
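For reference, the scan was along these lines; a rough, illustrative sketch, where the log file name and the exact line format are assumptions:

import scala.io.Source

object DispatcherRunScan {
  // Matches lines like:
  // 04-03-19 22:02:43:032 [INFO] [dispatcher-event-loop-23] o.a.s...
  private val Line =
    raw"""(\S+ \S+) \[INFO\] \[(dispatcher-event-loop-\d+)\].*""".r

  def main(args: Array[String]): Unit = {
    var thread = ""  // dispatcher thread of the current run
    var start = ""   // timestamp of the first message in the run
    var count = 0    // consecutive messages seen on that thread

    def report(): Unit =
      if (count > 1)
        println(s"$thread started $start processed $count consecutive messages")

    for (line <- Source.fromFile("driver.log").getLines()) line match {
      case Line(_, t) if t == thread => count += 1
      case Line(ts, t)               => report(); thread = t; start = ts; count = 1
      case _                         => // not a dispatcher-event-loop line
    }
    report()
  }
}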

 

Dispatcher thread 23 started at 22:02:42:647 and processed 80,386 consecutive messages over a duration of 53 minutes.

 

Looking at the beginning of this run, the slowdown is obvious: the first 3 messages are within milliseconds of each other, then a suspicious ~100ms delay starts appearing between consecutive messages.

 

04-03-19 22:02:43:032 [INFO] [dispatcher-event-loop-23] o.a.s.s.c.CoarseGrainedSchedulerBackend$DriverEndpoint - Launching task 1418419 on executor id: 1048 hostname: 10.9.141.180

04-03-19 22:02:43:034 [INFO] [dispatcher-event-loop-23] o.a.s.s.c.CoarseGrainedSchedulerBackend$DriverEndpoint - Launching task 1418420 on executor id: 967 hostname: 10.9.134.69

04-03-19 22:02:43:037 [INFO] [dispatcher-event-loop-23] o.a.s.s.c.CoarseGrainedSchedulerBackend$DriverEndpoint - Launching task 1418421 on executor id: 791 hostname: 10.9.139.73

04-03-19 22:02:43:136 [INFO] [dispatcher-event-loop-23] o.a.s.s.c.CoarseGrainedSchedulerBackend$DriverEndpoint - Launching task 1418422 on executor id: 941 hostname: 10.9.142.127

04-03-19 22:02:43:234 [INFO] [dispatcher-event-loop-23] o.a.s.s.c.CoarseGrainedSchedulerBackend$DriverEndpoint - Launching task 1418423 on executor id: 1085 hostname: 10.9.142.23

04-03-19 22:02:43:348 [INFO] [dispatcher-event-loop-23] o.a.s.s.c.CoarseGrainedSchedulerBackend$DriverEndpoint - Launching task 1418424 on executor id: 944 hostname: 10.9.141.65

 

Unfortunately I can’t turn on any more logging for the DriverEndpoint ‘StatusUpdate’ handler.  At a guess, I would say the executorData.executorEndpoint.send call inside launchTasks is introducing some sort of blocking, which causes a backlog that takes time to drain.
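For context, that send sits at the end of the task-launch path.  An abridged paraphrase of launchTasks as I read it in the 2.0.x source (simplified from memory, so details may differ; the real code also aborts on oversized serialized tasks):

// Abridged paraphrase of CoarseGrainedSchedulerBackend.launchTasks,
// Spark 2.0.x (simplified; oversized-task abort branch elided).
private def launchTasks(tasks: Seq[Seq[TaskDescription]]): Unit = {
  for (task <- tasks.flatten) {
    val serializedTask = ser.serialize(task)
    if (serializedTask.limit < maxRpcMessageSize) {
      val executorData = executorDataMap(task.executorId)
      executorData.freeCores -= scheduler.CPUS_PER_TASK
      logInfo(s"Launching task ${task.taskId} on executor id: " +
        s"${task.executorId} hostname: ${executorData.executorHost}.")
      // The suspect call: nominally fire-and-forget, but anything that
      // stalls here stalls the dispatcher thread draining the inbox.
      executorData.executorEndpoint.send(
        LaunchTask(new SerializableBuffer(serializedTask)))
    }
  }
}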

 

When the system is running okay we don’t see this behaviour.

 

Q: Have you guys seen this behaviour before, and if so, would an upgrade to Spark 2.4 do the trick?

 

If not, is there any extra logging or debugging I can enable to track down the external event that is introducing the delay?  Given the volume of tasks, log analysis is really my only option.

 

 

 



Re: [Spark RPC] Help how to debug sudden performance issue

Jörn Franke
Well, it is a little bit difficult to say, because a lot of things are mixed together here. What function is calculated? Does it need a lot of memory? Could it be that you run out of memory, some spill-over to disk happens, and the resulting disk I/O is blocking?

Related to that: 1 executor with 40 cores could be a factor. How much memory does it have, and how much does it need?

I would not put Kafka+ZK on the server where the driver is running.

Whether a different Spark version would help depends on the answers to the questions above.

On 11.03.2019 at 07:40, Hough, Stephen C <[hidden email]> wrote:


RE: [External] Re: [Spark RPC] Help how to debug sudden performance issue

Hough, Stephen C

Thanks

 

There is no issue on the worker/executor side; the workers have ample memory (> 200GB).  I gave that information as background on the system; apologies for the confusion.

 

The problem is isolated to the lifetime of processing a DriverEndpoint StatusUpdate message.  For 40 minutes the system runs fine, with 30+ dispatcher threads taking turns processing results; then, given some yet-to-be-determined external trigger, there is a slight slowdown.  It is slight, but it is enough to upset the system for 50 minutes before it recovers.

 

:-) I have little expectation of finding a solution; this is a last-resort punt.  I was hoping you devs might have more insight into possible sources of interruption, for example why StatusUpdate would take almost exactly 100ms to process one result (as seen in the consecutive TIDs quoted earlier), or know of any logging I could turn on to narrow the search.

 

There are no errors or warnings in the logs.

 

 

From: Jörn Franke [mailto:[hidden email]]
Sent: Monday, March 11, 2019 3:08 PM
To: Hough, Stephen C <[hidden email]>
Cc: [hidden email]
Subject: [External] Re: [Spark RPC] Help how to debug sudden performance issue

 


Re: [External] Re: [Spark RPC] Help how to debug sudden performance issue

Jörn Franke
Well, it will be difficult to say anything without knowing func. It could be that 40 cores and 200 GB for an executor is not a setup that suits func and the overall architecture.

It could also be GC issues, etc.

Sometimes it does not help to throw hardware at the issue; it really depends on the processing pipeline.

On 11.03.2019 at 08:50, Hough, Stephen C <[hidden email]> wrote:


RE: [External] Re: [Spark RPC] Help how to debug sudden performance issue

Hough, Stephen C

The problem is located in scheduler/cluster/CoarseGrainedSchedulerBackend.scala, in the receive function's StatusUpdate case.

 

When my incident occurs, the scheduler becomes effectively single-threaded, processing 80k consecutive messages on one dispatcher thread.

 

Take 3 of those consecutive messages:

 

04-03-19 22:02:43:037 [INFO] [dispatcher-event-loop-23] o.a.s.s.c.CoarseGrainedSchedulerBackend$DriverEndpoint - Launching task 1418421 on executor id: 791

04-03-19 22:02:43:136 [INFO] [dispatcher-event-loop-23] o.a.s.s.c.CoarseGrainedSchedulerBackend$DriverEndpoint - Launching task 1418422 on executor id: 941

04-03-19 22:02:43:234 [INFO] [dispatcher-event-loop-23] o.a.s.s.c.CoarseGrainedSchedulerBackend$DriverEndpoint - Launching task 1418423 on executor id: 1085

 

This is the only log line emitted from within this function, and as you can see it is taking ~100ms to complete each loop.  I would suggest this is a critical pathway in the Spark execution pipeline, and in my scenario some blocking-like call inside it is killing Spark's efficiency.  The delay must be transient (at a steady 100ms per message Spark could only process 10 completed tasks per second, and 80k messages would take over 2 hours rather than the observed 53 minutes), but it is enough to back up the inbox queue.

 

The function will (see the abridged sketch below):

  1. Call scheduler.statusUpdate.
  2. Make an offer to the scheduler with the freed core.
  3. Serialize the newly scheduled task and launch it via the executor's endpoint ref.
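An abridged paraphrase of that path in the 2.0.x DriverEndpoint (simplified from memory, so details may differ; dead-executor and error branches elided):

// Step 1: record the task result; steps 2-3: free the core and re-offer it.
override def receive: PartialFunction[Any, Unit] = {
  case StatusUpdate(executorId, taskId, state, data) =>
    scheduler.statusUpdate(taskId, state, data.value)
    if (TaskState.isFinished(state)) {
      val executorInfo = executorDataMap(executorId)
      executorInfo.freeCores += scheduler.CPUS_PER_TASK
      makeOffers(executorId)
    }
  // ... other messages ...
}

// Offers the freed cores back to the scheduler, then serializes and sends
// newly scheduled tasks via launchTasks.  All of this runs on the single
// dispatcher thread draining the DriverEndpoint inbox, so any blocking in
// here (including in executorEndpoint.send) delays every StatusUpdate
// still queued behind it.
private def makeOffers(executorId: String): Unit = {
  val executorData = executorDataMap(executorId)
  val workOffers = Seq(
    new WorkerOffer(executorId, executorData.executorHost, executorData.freeCores))
  launchTasks(scheduler.resourceOffers(workOffers))
}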

Somewhere in here a delay of ~100ms per message is being added.  My suspicion is the client send of the next task request, perhaps because my external event causes Netty to re-establish the connection, but I have no errors or warnings to prove that.  Hence my questions: do you know of any additional logging that could help me trace this issue, and have you seen anything similar before?

 

 

 

From: Jörn Franke [mailto:[hidden email]]
Sent: Monday, March 11, 2019 4:28 PM
To: Hough, Stephen C <[hidden email]>
Cc: Hough, Stephen C <[hidden email]>; [hidden email]
Subject: Re: [External] Re: [Spark RPC] Help how to debug sudden performance issue

 
