Total memory tracking: request for comments


Total memory tracking: request for comments

Jose Soltren
https://issues.apache.org/jira/browse/SPARK-21157

Hi - oftentimes, Spark applications are killed by YARN, Mesos, or the OS for overrunning available memory. In SPARK-21157, I propose a design for capturing and reporting "total memory" usage for Spark executors - that is, memory usage as visible from the OS, including on-heap and off-heap memory used by Spark and third-party libraries. This builds on many ideas from SPARK-9103.
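
To make "memory usage as visible from the OS" concrete, here is a minimal, hypothetical sketch (not part of the design doc; it assumes Linux procfs and 4 KiB pages, and the `TotalMemorySketch` name is made up) that reads the resident set size of the current executor process:

```scala
import scala.io.Source

// Hypothetical sketch: read the resident set size (RSS) of the current
// JVM process from Linux procfs. /proc/self/statm reports sizes in pages;
// the second field is the resident set size.
object TotalMemorySketch {
  private val PageSizeBytes = 4096L // assumption: 4 KiB pages

  def currentRssBytes(): Long = {
    val fields = Source.fromFile("/proc/self/statm").mkString.trim.split("\\s+")
    fields(1).toLong * PageSizeBytes
  }

  def main(args: Array[String]): Unit = {
    // This number includes the JVM heap, metaspace, thread stacks, and any
    // native/off-heap memory allocated by Spark or third-party libraries.
    println(s"Executor process RSS: ${currentRssBytes() / (1024 * 1024)} MiB")
  }
}
```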

I'd really welcome review and feedback on this design proposal. I think this could be a helpful feature for Spark users who are trying to triage memory usage issues. In the future I'd like to think about reporting memory usage from third-party libraries like Netty, as was originally proposed in SPARK-9103.

Cheers,
--José

Re: Total memory tracking: request for comments

rxin
Thanks. This is an important direction to explore and my apologies for the late reply.

One thing that is really hard about this is that, with the different layers of abstraction, we often use other libraries that might allocate large amounts of memory (e.g. the snappy library, Parquet itself), which makes it very difficult to track. That's where I see most of the OOMs or crashes happening. How do you propose solving those?




Re: Total memory tracking: request for comments

Vadim Semenov
I'm just going to share what we did at Datadog for counting the total memory of each executor; maybe someone will find it useful.

We get the PID of the Java process that runs an executor, read the Resident Set Size (RSS) from the system's `/proc/<pid>/stat`, and then send that value along with the YARN container ID.
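
As a rough illustration of that step (a sketch only, not the code from the PR linked below; it assumes Linux procfs, 4 KiB pages, and YARN's `CONTAINER_ID` environment variable), reading RSS from `/proc/<pid>/stat` could look like:

```scala
import java.lang.management.ManagementFactory
import scala.io.Source

// Sketch only: read the RSS of an arbitrary PID from /proc/<pid>/stat.
// Field 24 (1-based) is RSS in pages; field 2 (comm) is "(name)" and may
// contain spaces, so we split after the closing parenthesis.
object RssReader {
  def rssBytesForPid(pid: Long): Long = {
    val stat = Source.fromFile(s"/proc/$pid/stat").mkString
    val afterComm = stat.substring(stat.lastIndexOf(')') + 2).trim.split(" ")
    // afterComm(0) is field 3 (state), so field 24 (rss) is at index 21.
    afterComm(21).toLong * 4096L // assumption: 4 KiB pages
  }

  def main(args: Array[String]): Unit = {
    // YARN exposes the container ID via the CONTAINER_ID env variable; the
    // metric can be tagged with it before being sent to the StatsD backend.
    val containerId = sys.env.getOrElse("CONTAINER_ID", "unknown")
    // On HotSpot the runtime MXBean name is "pid@hostname"; use our own PID for the demo.
    val pid = ManagementFactory.getRuntimeMXBean.getName.split("@")(0).toLong
    println(s"$containerId rss_bytes=${rssBytesForPid(pid)}")
  }
}
```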

We forked Etsy's `statsd-jvm-profiler` and added the above there (https://github.com/DataDog/spark-jvm-profiler/pull/1), and then add the Java agent to the executor's JVM options like:

`--conf "spark.executor.extraJavaOptions=… -javaagent:spark-jvm-profiler.jar=server=localhost,port=8125,profilers=MemoryProfiler"`

and then it goes to any StatsD backend, in this case our datadog-agent.

And then we get the metrics in our UI and can see the actual total process memory, as well as Heap Total/Used/Avg/Max.





Re: Total memory tracking: request for comments

Vadim Semenov
I read the design doc in https://issues.apache.org/jira/browse/SPARK-21157, and what I described above is essentially what it proposes.


> One thing that is really hard about this is that, with the different layers of abstraction, we often use other libraries that might allocate large amounts of memory (e.g. the snappy library, Parquet itself), which makes it very difficult to track. That's where I see most of the OOMs or crashes happening. How do you propose solving those?

Since they execute within the JVM process, we don't need to account for them separately: we're already looking at the whole process's memory stats.
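
For example, a small self-contained sketch (illustrative only, assuming Linux procfs and 4 KiB pages) showing that off-heap allocations made inside the process are visible in the process-wide RSS even though JVM heap metrics never see them:

```scala
import java.nio.ByteBuffer
import scala.io.Source

// Illustrative sketch: off-heap memory allocated inside the process, e.g.
// by native compression libraries or direct buffers, never appears in JVM
// heap metrics but does appear in the process-wide RSS.
object OffHeapVisibleInRss {
  def rssMiB(): Long =
    Source.fromFile("/proc/self/statm").mkString.trim.split("\\s+")(1).toLong * 4096L / (1024 * 1024)

  def main(args: Array[String]): Unit = {
    val before = rssMiB()
    val buf = ByteBuffer.allocateDirect(256 * 1024 * 1024) // 256 MiB off-heap
    // Touch the pages so the OS actually makes them resident.
    var i = 0
    while (i < buf.capacity()) { buf.put(i, 1.toByte); i += 4096 }
    println(s"RSS before: $before MiB, after: ${rssMiB()} MiB")
  }
}
```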
