[DISCUSS][CORE] Exposing application status metrics via a source

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

[DISCUSS][CORE] Exposing application status metrics via a source

Stavros Kontopoulos-3
Hi all,

I have a PR https://github.com/apache/spark/pull/22381 that exposes application status
metrics (related jira: SPARK-25394).

So far metrics tooling needs to scrape the metrics rest api to get metrics like job delay, stages failed, stages completed etc.
From devops perspective it is good to standardize on a unified way of gathering metrics.
The need came up on the K8s side where jmx prometheus exporter is commonly used to scrape metrics for several components such as kafka, cassandra, but the need is not limited there.

"The rest api is great for UI and consolidated analytics, but monitoring through it is not as straightforward as when the data emits directly from the source like this. There is all kinds of nice context that we get when the data from this spark node is collected directly from the node itself, and not proxied through another collector / reporter. It is easier to build a monitoring data model across the cluster when node, jmx, pod, resource manifests, and spark data all align by virtue of coming from the same collector. Building a similar view of the cluster just from the rest api, as a comparison, is simply harder and quite challenging to do in general purpose terms."

The PR is ok to be merged but the major concern here is the mirroring of the metrics. I think that mirroring is ok since people may dont want to check the ui and they just want to integrate with jmx only (my use case) and gather metrics in grafana (common case out there).

Does any of the committers or the community have an opinion on this?
Is there an agreement about moving on with this? Note that the addition does not change much and can always be refactored if we come up with a new plan for the metrics story in the future.

Thanks,
Stavros
Reply | Threaded
Open this post in threaded view
|

RE: [DISCUSS][CORE] Exposing application status metrics via a source

Luca Canali

Hi Stavros, All,

 

Interesting topic, I add here some thoughts and personal opinions on it: I find too the metrics system quite useful for the use case of building Grafana dashboards as opposed to scraping logs and/or using the Event Listener infrastructure, as you mentioned in your mail.

A few additional points in favour of Dropwizard metrics for me are:

-          Regarding the metrics defined on the ExecutorSource, I believe they have better scalability compare to standard Task Metrics, as the Dropwizard metrics go directly from executors to sink(s) rather than passing via the driver through the ListenerBus.

-          Another advantage that I see is that Dropwizard metrics make it easy to expose information not available otherwise from the EveloLog/Listener events, such as executor.jvmCpuTime (SPARK-25228).

 

I ’d like to add some feedback and random thoughts based on recent work on SPARK-25228 and SPARK-22190, SPARK-25277, SPARK-25285:

-          the “Dropwizard metrics” space currently appears a bit “crowded”,  we could probably profit from adding a few configuration parameters to turn some of the metrics on/off as needed (I see that this point is also raised in the discussion in your PR 22381).

-          Another point is that the metrics instrumentation is a bit scattered around the code, it would be nice to have a central point where the available metrics are exposed (maybe just in the documentation).

-          Testing of new metrics seems to be a bit of a manual process at the moment (at least it was for me) which could be improved. Related to that I believe that some recent work on adding new metrics has ended up with a minor side effect/issue, details in SPARK-25277.

 

Best regards,

Luca

 

From: Stavros Kontopoulos <[hidden email]>
Sent: Wednesday, September 12, 2018 22:35
To: Dev <[hidden email]>
Subject: [DISCUSS][CORE] Exposing application status metrics via a source

 

Hi all,

 

I have a PR https://github.com/apache/spark/pull/22381 that exposes application status

metrics (related jira: SPARK-25394).

 

So far metrics tooling needs to scrape the metrics rest api to get metrics like job delay, stages failed, stages completed etc.

From devops perspective it is good to standardize on a unified way of gathering metrics.

The need came up on the K8s side where jmx prometheus exporter is commonly used to scrape metrics for several components such as kafka, cassandra, but the need is not limited there.

 

"The rest api is great for UI and consolidated analytics, but monitoring through it is not as straightforward as when the data emits directly from the source like this. There is all kinds of nice context that we get when the data from this spark node is collected directly from the node itself, and not proxied through another collector / reporter. It is easier to build a monitoring data model across the cluster when node, jmx, pod, resource manifests, and spark data all align by virtue of coming from the same collector. Building a similar view of the cluster just from the rest api, as a comparison, is simply harder and quite challenging to do in general purpose terms."

 

The PR is ok to be merged but the major concern here is the mirroring of the metrics. I think that mirroring is ok since people may dont want to check the ui and they just want to integrate with jmx only (my use case) and gather metrics in grafana (common case out there).

 

Does any of the committers or the community have an opinion on this?

Is there an agreement about moving on with this? Note that the addition does not change much and can always be refactored if we come up with a new plan for the metrics story in the future.

 

Thanks,

Stavros