Hi - oftentimes, Spark applications are killed by YARN, Mesos, or the OS for overrunning available memory. In SPARK-21157, I propose a design for collecting and reporting "total memory" usage for Spark executors - that is, memory usage as visible from the OS, including on-heap and off-heap memory used by Spark and third-party libraries. This builds on many ideas from SPARK-9103.
I'd really welcome review and feedback on this design proposal. I think this could be a helpful feature for Spark users who are trying to triage memory usage issues. In the future I'd like to think about reporting memory usage from third-party libraries like Netty, as was originally proposed in SPARK-9103.
Thanks. This is an important direction to explore and my apologies for the late reply.
One thing that is really hard about this is that, with different layers of abstraction, we often use other libraries that might allocate large amounts of memory (e.g. the snappy library, Parquet itself), which makes it very difficult to track. That's where I see most of the OOMs or crashes happen. How do you propose solving those?
On Tue, Jun 20, 2017 at 4:15 PM, Jose Soltren <[hidden email]> wrote:
I'll just describe what we did at Datadog to measure the total memory of each executor; maybe someone will find it useful.
We get the PID of the Java process that runs an executor, read its resident set size (RSS) from the system's `/proc/<pid>/stat`, and then send that value along with the YARN container ID.
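For anyone who wants to try the `/proc/<pid>/stat` part on its own, here is a minimal Python sketch of the idea. It is not the code from our profiler (which is a Java agent); the function names are made up for illustration, and it assumes a Linux `/proc` layout in which `rss` is field 24 and is counted in pages.

```python
import os


def parse_rss_pages(stat_line: str) -> int:
    """Return the rss field (field 24, in pages) of a /proc/<pid>/stat line."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the LAST ')' before tokenizing the rest.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state), so field 24 sits at index 24 - 3 = 21.
    return int(after_comm[21])


def rss_bytes(pid: int) -> int:
    """Read the resident set size of a process in bytes (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        pages = parse_rss_pages(f.read())
    # rss is reported in pages, so scale by the system page size.
    return pages * os.sysconf("SC_PAGE_SIZE")


if __name__ == "__main__":
    # Hypothetical usage: report the RSS of the current process.
    print(rss_bytes(os.getpid()))
```

Because this number comes from the OS, it covers the whole process (heap, off-heap, and native allocations), not just what the JVM reports.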
We forked Etsy's `statsd-jvm-profiler` and added this RSS reporting to it: https://github.com/DataDog/spark-jvm-profiler/pull/1
We then add this Java agent to the executors' JVM options, like
`--conf "spark.executor.extraJavaOptions=… -javaagent:spark-jvm-profiler.jar=server=localhost,port=8125,profilers=MemoryProfiler"`
The metrics then go to any StatsD backend, in this case our datadog-agent.
We then get the metrics in our UI and can see the actual total process memory, as well as heap total/used/avg/max.
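To make the reporting step concrete, here is a rough Python sketch of sending one such value as a StatsD gauge over UDP, with the container ID attached as a DogStatsD-style tag. The metric name, tag key, and helper names are hypothetical and are not what the profiler actually emits; the host and port mirror the `server=localhost,port=8125` agent options above.

```python
import socket


def format_gauge(name: str, value: int, tags: dict) -> bytes:
    """Build a StatsD gauge line, with optional DogStatsD-style tags."""
    # Plain StatsD gauge: <name>:<value>|g
    line = f"{name}:{value}|g"
    if tags:
        # DogStatsD extension: |#key:value,key:value
        tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
        line += f"|#{tag_str}"
    return line.encode("ascii")


def send_gauge(payload: bytes, host: str = "localhost", port: int = 8125) -> None:
    """Fire-and-forget a single StatsD datagram over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Hypothetical metric name and container ID, for illustration only.
    payload = format_gauge(
        "spark.executor.rss_bytes", 10240000, {"container_id": "container_01"}
    )
    send_gauge(payload)
```

A plain StatsD backend without tag support would need the container ID folded into the metric name instead.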
On Wed, Sep 20, 2017 at 1:21 PM, Reynold Xin <[hidden email]> wrote:
I read the design doc in https://issues.apache.org/jira/browse/SPARK-21157
and what I described is essentially what you proposed.
> One thing that is really hard about this is that, with different layers of abstraction, we often use other libraries that might allocate large amounts of memory (e.g. the snappy library, Parquet itself), which makes it very difficult to track. That's where I see most of the OOMs or crashes happen. How do you propose solving those?
Because those libraries execute within the JVM process, we don't need to account for them separately: we're looking at the memory stats of the whole process.
On Wed, Sep 20, 2017 at 1:56 PM, Vadim Semenov <[hidden email]> wrote: