[Discuss] Metrics Support for DS V2

[Discuss] Metrics Support for DS V2

sandeep_katta
Hi Devs,

Currently DS V2 does not update any input metrics. SPARK-30362 aims to solve this problem.

We propose the following approach: introduce a marker interface, say "ReportMetrics".

If a DataSource implements this interface, collecting the metrics becomes straightforward. For example, FilePartitionReaderFactory could support metrics by implementing ReportMetrics.

Please share your views, or let me know if you would prefer a different solution or design.
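To make the proposal concrete, here is a minimal sketch of what such a marker-style interface might look like. All names here (ReportMetrics, MeteredFileReader, the metric keys) are illustrative assumptions, not Spark's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Spark's actual API: an interface a DSv2
// reader could implement so the engine can pull metrics from it.
interface ReportMetrics {
    Map<String, Long> currentMetrics();
}

// Illustrative reader that tracks rows and bytes as it reads.
class MeteredFileReader implements ReportMetrics {
    private long rows = 0;
    private long bytes = 0;

    void readRow(long sizeInBytes) {
        rows += 1;
        bytes += sizeInBytes;
    }

    @Override
    public Map<String, Long> currentMetrics() {
        Map<String, Long> m = new HashMap<>();
        m.put("numRows", rows);
        m.put("bytesRead", bytes);
        return m;
    }
}
```

The engine would check `reader instanceof ReportMetrics` and, if so, periodically call `currentMetrics()` to feed the task's input metrics.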

Re: [Discuss] Metrics Support for DS V2

cloud0fan
I think there are a few details we need to discuss.

1. How frequently should a source update its metrics? For example, if the file source reports size metrics per row, it will be super slow.

2. What metrics should a source report? Data size? numFiles? Read time?

3. Shall we show metrics in the SQL web UI as well?
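On the first question, one common way to avoid per-row overhead is to batch the updates so the (potentially costly) reporting step fires once every N rows. A hypothetical sketch, with illustrative names that are not part of Spark's API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: batch metric updates so the reporting step
// fires once every `interval` rows instead of on every row.
class ThrottledMetrics {
    private final int interval;
    private final List<long[]> reports = new ArrayList<>(); // {numRows, bytesRead}
    private long rows = 0;
    private long bytes = 0;
    private int sinceLastReport = 0;

    ThrottledMetrics(int interval) {
        this.interval = interval;
    }

    void recordRow(long sizeInBytes) {
        rows += 1;
        bytes += sizeInBytes;
        sinceLastReport += 1;
        if (sinceLastReport >= interval) {
            flush();
        }
    }

    // Stand-in for pushing to a real metrics sink (e.g. task metrics).
    void flush() {
        reports.add(new long[] {rows, bytes});
        sinceLastReport = 0;
    }

    List<long[]> reports() {
        return reports;
    }
}
```

A final `flush()` at end-of-partition would report any remaining rows, so the last partial batch is not lost.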


Re: [Discuss] Metrics Support for DS V2

Ryan Blue
We've implemented these metrics in the RDD (for input metrics) and in the v2 DataWritingSparkTask. That approach gives you the same metrics in the stage views that you get with v1 sources, regardless of the v2 implementation.

I'm not sure why they weren't included from the start. It looks like the way metrics are collected is changing. There are a couple of metrics for the number of rows; it looks like one goes to the Spark SQL tab and one is used for the stages view.

If you'd like, I can send you a patch.

rb
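The read-side pattern Ryan describes can be sketched as wrapping the partition's row iterator so counters are bumped as rows flow through; the class and callback names here are illustrative, not Spark internals:

```java
import java.util.Iterator;
import java.util.function.Consumer;

// Hypothetical sketch: wrap the partition's row iterator so per-task
// counters are updated as rows are consumed by the query.
class MeteredIterator<T> implements Iterator<T> {
    private final Iterator<T> underlying;
    private final Consumer<T> onRow; // e.g. increment recordsRead / bytesRead

    MeteredIterator(Iterator<T> underlying, Consumer<T> onRow) {
        this.underlying = underlying;
        this.onRow = onRow;
    }

    @Override
    public boolean hasNext() {
        return underlying.hasNext();
    }

    @Override
    public T next() {
        T row = underlying.next();
        onRow.accept(row); // metrics hook runs once per row consumed
        return row;
    }
}
```

Because the wrapper is transparent to the consumer, it works regardless of the v2 source implementation, which is what makes the metrics match v1 sources in the stage views.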



--
Ryan Blue
Software Engineer
Netflix

Re: [Discuss] Metrics Support for DS V2

sandeep_katta
Please send me the patch; I will apply and test it.


Re: [Discuss] Metrics Support for DS V2

Ryan Blue
I sent it to you directly, since the ASF mailing list removes attachments. I'm happy to send it to others as well if needed.


