[METRICS] Metrics names inconsistent between executions

[METRICS] Metrics names inconsistent between executions

Anton Kirillov
Hi everyone!

We are currently working on building a unified monitoring/alerting solution for Spark and would like to rely on Spark's own metrics to avoid diverging from upstream. One of the challenges is supporting metrics coming from multiple Spark applications running on a cluster: scheduled jobs, long-running streaming applications, and so on.

Original problem:
Spark builds metric names using spark.app.id and spark.executor.id as part of them. As a result, the set of metric names grows continuously: the IDs are unique between executions even though the metrics report the same thing. Another issue that arises here is how to use constantly changing metric names in dashboards.

For example, jvm_heap_used as reported by the different Spark components:
- <spark.app.id>_driver_jvm_heap_used (Driver)
- <spark.app.id>_<spark.executor.id>_jvm_heap_used (Executors)

While the spark.app.id prefix can be overridden with spark.metrics.namespace, there is no such option for spark.executor.id, which makes it impossible to build a reusable dashboard: given the uniqueness of the IDs, differently named metrics are emitted for each execution.
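
For reference, the namespace override is just a regular Spark conf; a minimal sketch (the namespace value, class, and jar are placeholders):

    spark-submit \
      --conf spark.metrics.namespace=my_streaming_app \
      --class com.example.MyApp \
      my-app.jar

With that, the driver metric above becomes my_streaming_app_driver_jvm_heap_used, but the executor metrics still embed the ever-changing executor ID.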

One possible solution would be to make executor metric names follow the driver's naming pattern, e.g.:
- <spark.app.id>_driver_jvm_heap_used (Driver)
- <spark.app.id>_executor_jvm_heap_used (Executors)

and distinguish executors by tags (tags would have to be configured in the metric reporters in this case). I'm not sure whether this could break the Driver UI, though.
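
For illustration only (this is not an existing output format, just the idea sketched in a Prometheus-style notation), the same gauge would then be emitted under a single name and distinguished by tags:

    jvm_heap_used{app_id="<spark.app.id>", executor_id="driver"}
    jvm_heap_used{app_id="<spark.app.id>", executor_id="2"}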

I'd really appreciate any feedback on this issue and would be happy to create a Jira issue/PR if this change looks sane to the community.

Thanks in advance.

--
Anton Kirillov
Senior Software Engineer, Mesosphere

Re: [METRICS] Metrics names inconsistent between executions

Stavros Kontopoulos-3
Hi,

With jmx_exporter and Prometheus you can always rewrite the metric name patterns on the fly. By the way, if you use Grafana it's easy to filter things even without the rewrite.
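
For example, a jmx_exporter rule along the following lines folds the IDs into labels and collapses the names into one stable metric. Treat it as a sketch: the regex is an assumption about how the JmxSink-registered MBeans are named and would need to be adjusted to the actual bean names:

    rules:
      - pattern: 'metrics<name=(\S+)\.(driver|\d+)\.jvm\.heap\.used.*><>Value'
        name: spark_jvm_heap_used
        type: GAUGE
        labels:
          app_id: "$1"
          executor_id: "$2"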
If this is a custom dashboard, you can always group metrics using the spark.app.id as a prefix, no? Also, I think it's sometimes good to know whether a specific executor failed and why, and to report execution-specific metrics: for example, if you have skewed data that caused JVM issues.
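
For instance, if the names land in Prometheus with the IDs baked in, a regex selector such as the following (the name suffix is only illustrative) still pulls in every application's and executor's heap gauge, which a Grafana panel or template variable can then filter or group:

    {__name__=~".*_jvm_heap_used"}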

Stavros