Asking for reviewing PRs regarding structured streaming

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Asking for reviewing PRs regarding structured streaming

Jungtaek Lim
Hi Spark devs,

I have couple of pull requests for structured streaming which are getting older and fading out from earlier pages in PR pages.


Two of them are in a kind of approval by couple of folks, but no approval from committers yet.
One of them needs rebase and I would be happy to do it after reviewing or in progress of reviewing.

Getting reviewed in time would be critical for contributors to be honest, so I'd like to ask dev mailing list to review my PRs.

Thanks in advance,
Jungtaek Lim (HeartSaVioR) 
Reply | Threaded
Open this post in threaded view
|

Re: Asking for reviewing PRs regarding structured streaming

Jungtaek Lim
Kindly reminder since around 2 weeks passed. I've added more PR during 2 weeks and even planning to do more.

2018년 6월 19일 (화) 오후 6:34, Jungtaek Lim <[hidden email]>님이 작성:
Hi Spark devs,

I have couple of pull requests for structured streaming which are getting older and fading out from earlier pages in PR pages.


Two of them are in a kind of approval by couple of folks, but no approval from committers yet.
One of them needs rebase and I would be happy to do it after reviewing or in progress of reviewing.

Getting reviewed in time would be critical for contributors to be honest, so I'd like to ask dev mailing list to review my PRs.

Thanks in advance,
Jungtaek Lim (HeartSaVioR) 
Reply | Threaded
Open this post in threaded view
|

Re: Asking for reviewing PRs regarding structured streaming

Jungtaek Lim
Bump. I have been having hard time working on making additional PRs since some of these rely on non-merged PRs, so spending additional time to decouple these things if possible.


Pending 5 PRs so far, and may add more sooner or later.

Thanks,
Jungtaek Lim (HeartSaVioR)

2018년 7월 1일 (일) 오전 6:21, Jungtaek Lim <[hidden email]>님이 작성:
Kindly reminder since around 2 weeks passed. I've added more PR during 2 weeks and even planning to do more.

2018년 6월 19일 (화) 오후 6:34, Jungtaek Lim <[hidden email]>님이 작성:
Hi Spark devs,

I have couple of pull requests for structured streaming which are getting older and fading out from earlier pages in PR pages.


Two of them are in a kind of approval by couple of folks, but no approval from committers yet.
One of them needs rebase and I would be happy to do it after reviewing or in progress of reviewing.

Getting reviewed in time would be critical for contributors to be honest, so I'd like to ask dev mailing list to review my PRs.

Thanks in advance,
Jungtaek Lim (HeartSaVioR) 
Reply | Threaded
Open this post in threaded view
|

Re: Asking for reviewing PRs regarding structured streaming

Jungtaek Lim
Ted Yu suggested posting the improved numbers to this thread and I think it's good idea, so also posting here, but I also think explaining rationalization of my issues would help understanding why I'm submitting couple of patches, so I'll explain it first. (Sorry to post a wall of text).

tl;dr. SPARK-24717 [1] can reduce the overall memory usage of HDFS state store provider from 10x~80x of size of state for a batch according to various stateful workloads to less than or around 2x. The new option is flexible so it can be even around 1x or even effectively disable cache. Please refer the comment in the PR [2] to get more details. (hard to post detailed numbers in mail format so link a Github comment instead)

I have interest on stateful streaming processing on Structured Streaming, and have been learning from codebase as well as analyzing memory usage as well as latency (while I admit it is hard to measure latency correctly...).


While took a look at HDFSBackedStateStoreProvider I indicated a kind of excessive caching. As I described in section "The impact of storing multiple versions from HDFSBackedStateStoreProvider" in above link, while multiple versions share the same UnsafeRow unless there's a change on the value which lessen the impact of caching multiple versions (credit to Jose Torres since I realized it from his comment). But in some workloads which lots of writes to state incurs in a batch, the overall memory usage of state is going to be out of expectation.

Related patch [3] is also submitted from other contributor (so I'm not the one to notice this behavior), whereas the patch might not look enough generalized to apply.

First I decided to track the overall memory size of state provider cache and expose to UI as well as query status (SPARK-24441 [4]). The metric looked like critical and worth to monitor, so I thought it is better to expose it (and watermark) to Dropwizard (SPARK-24637 [5]).

Based on adoption of SPARK-24441, I could find more flexible way to resolve the issue (SPARK-24717) what I've mentioned in tl;dr.

So 3 of 5 issues are coupled so far to track and resolve one issue. Hope that it helps explaining worth of reviews for these patches.

Thanks,
Jungtaek Lim (HeartSaVioR)


ps. Before all mentioned issues I also submitted some other issues regarding feature addition/refactor (2 of 5 issues).


2018년 7월 6일 (금) 오전 10:09, Jungtaek Lim <[hidden email]>님이 작성:
Bump. I have been having hard time working on making additional PRs since some of these rely on non-merged PRs, so spending additional time to decouple these things if possible.


Pending 5 PRs so far, and may add more sooner or later.

Thanks,
Jungtaek Lim (HeartSaVioR)

2018년 7월 1일 (일) 오전 6:21, Jungtaek Lim <[hidden email]>님이 작성:
Kindly reminder since around 2 weeks passed. I've added more PR during 2 weeks and even planning to do more.

2018년 6월 19일 (화) 오후 6:34, Jungtaek Lim <[hidden email]>님이 작성:
Hi Spark devs,

I have couple of pull requests for structured streaming which are getting older and fading out from earlier pages in PR pages.


Two of them are in a kind of approval by couple of folks, but no approval from committers yet.
One of them needs rebase and I would be happy to do it after reviewing or in progress of reviewing.

Getting reviewed in time would be critical for contributors to be honest, so I'd like to ask dev mailing list to review my PRs.

Thanks in advance,
Jungtaek Lim (HeartSaVioR) 
Reply | Threaded
Open this post in threaded view
|

Re: Asking for reviewing PRs regarding structured streaming

Jungtaek Lim
Now I'm adding one more issue (SPARK-24763 [1]), which proposes a new option to enable optimization of state size in streaming aggregation without hurting performance.

The idea is to remove data for key fields from value which is duplicated between key and value in state row. This requires additional operations like projection and join, but smaller state row would also give performance benefit, which can offset each other.

Please refer the comment in JIRA issue [2] to see the numbers from simple perf. test.

Thanks,

2018년 7월 6일 (금) 오후 1:54, Jungtaek Lim <[hidden email]>님이 작성:
Ted Yu suggested posting the improved numbers to this thread and I think it's good idea, so also posting here, but I also think explaining rationalization of my issues would help understanding why I'm submitting couple of patches, so I'll explain it first. (Sorry to post a wall of text).

tl;dr. SPARK-24717 [1] can reduce the overall memory usage of HDFS state store provider from 10x~80x of size of state for a batch according to various stateful workloads to less than or around 2x. The new option is flexible so it can be even around 1x or even effectively disable cache. Please refer the comment in the PR [2] to get more details. (hard to post detailed numbers in mail format so link a Github comment instead)

I have interest on stateful streaming processing on Structured Streaming, and have been learning from codebase as well as analyzing memory usage as well as latency (while I admit it is hard to measure latency correctly...).


While took a look at HDFSBackedStateStoreProvider I indicated a kind of excessive caching. As I described in section "The impact of storing multiple versions from HDFSBackedStateStoreProvider" in above link, while multiple versions share the same UnsafeRow unless there's a change on the value which lessen the impact of caching multiple versions (credit to Jose Torres since I realized it from his comment). But in some workloads which lots of writes to state incurs in a batch, the overall memory usage of state is going to be out of expectation.

Related patch [3] is also submitted from other contributor (so I'm not the one to notice this behavior), whereas the patch might not look enough generalized to apply.

First I decided to track the overall memory size of state provider cache and expose to UI as well as query status (SPARK-24441 [4]). The metric looked like critical and worth to monitor, so I thought it is better to expose it (and watermark) to Dropwizard (SPARK-24637 [5]).

Based on adoption of SPARK-24441, I could find more flexible way to resolve the issue (SPARK-24717) what I've mentioned in tl;dr.

So 3 of 5 issues are coupled so far to track and resolve one issue. Hope that it helps explaining worth of reviews for these patches.

Thanks,
Jungtaek Lim (HeartSaVioR)


ps. Before all mentioned issues I also submitted some other issues regarding feature addition/refactor (2 of 5 issues).


2018년 7월 6일 (금) 오전 10:09, Jungtaek Lim <[hidden email]>님이 작성:
Bump. I have been having hard time working on making additional PRs since some of these rely on non-merged PRs, so spending additional time to decouple these things if possible.


Pending 5 PRs so far, and may add more sooner or later.

Thanks,
Jungtaek Lim (HeartSaVioR)

2018년 7월 1일 (일) 오전 6:21, Jungtaek Lim <[hidden email]>님이 작성:
Kindly reminder since around 2 weeks passed. I've added more PR during 2 weeks and even planning to do more.

2018년 6월 19일 (화) 오후 6:34, Jungtaek Lim <[hidden email]>님이 작성:
Hi Spark devs,

I have couple of pull requests for structured streaming which are getting older and fading out from earlier pages in PR pages.


Two of them are in a kind of approval by couple of folks, but no approval from committers yet.
One of them needs rebase and I would be happy to do it after reviewing or in progress of reviewing.

Getting reviewed in time would be critical for contributors to be honest, so I'd like to ask dev mailing list to review my PRs.

Thanks in advance,
Jungtaek Lim (HeartSaVioR) 
Reply | Threaded
Open this post in threaded view
|

Re: Asking for reviewing PRs regarding structured streaming

Jungtaek Lim
I recently added more test results to SPARK-24763 [1] which shows that the proposal reduces state size according to the ratio of key-value size, whereas there's no performance hit and sometimes even slight boost.

Please refer the latest comment in JIRA issue [2] to see the numbers from perf. tests.

Thanks,
Jungtaek Lim (HeartSaVioR)



2018년 7월 9일 (월) 오후 5:28, Jungtaek Lim <[hidden email]>님이 작성:
Now I'm adding one more issue (SPARK-24763 [1]), which proposes a new option to enable optimization of state size in streaming aggregation without hurting performance.

The idea is to remove data for key fields from value which is duplicated between key and value in state row. This requires additional operations like projection and join, but smaller state row would also give performance benefit, which can offset each other.

Please refer the comment in JIRA issue [2] to see the numbers from simple perf. test.

Thanks,

2018년 7월 6일 (금) 오후 1:54, Jungtaek Lim <[hidden email]>님이 작성:
Ted Yu suggested posting the improved numbers to this thread and I think it's good idea, so also posting here, but I also think explaining rationalization of my issues would help understanding why I'm submitting couple of patches, so I'll explain it first. (Sorry to post a wall of text).

tl;dr. SPARK-24717 [1] can reduce the overall memory usage of HDFS state store provider from 10x~80x of size of state for a batch according to various stateful workloads to less than or around 2x. The new option is flexible so it can be even around 1x or even effectively disable cache. Please refer the comment in the PR [2] to get more details. (hard to post detailed numbers in mail format so link a Github comment instead)

I have interest on stateful streaming processing on Structured Streaming, and have been learning from codebase as well as analyzing memory usage as well as latency (while I admit it is hard to measure latency correctly...).


While took a look at HDFSBackedStateStoreProvider I indicated a kind of excessive caching. As I described in section "The impact of storing multiple versions from HDFSBackedStateStoreProvider" in above link, while multiple versions share the same UnsafeRow unless there's a change on the value which lessen the impact of caching multiple versions (credit to Jose Torres since I realized it from his comment). But in some workloads which lots of writes to state incurs in a batch, the overall memory usage of state is going to be out of expectation.

Related patch [3] is also submitted from other contributor (so I'm not the one to notice this behavior), whereas the patch might not look enough generalized to apply.

First I decided to track the overall memory size of state provider cache and expose to UI as well as query status (SPARK-24441 [4]). The metric looked like critical and worth to monitor, so I thought it is better to expose it (and watermark) to Dropwizard (SPARK-24637 [5]).

Based on adoption of SPARK-24441, I could find more flexible way to resolve the issue (SPARK-24717) what I've mentioned in tl;dr.

So 3 of 5 issues are coupled so far to track and resolve one issue. Hope that it helps explaining worth of reviews for these patches.

Thanks,
Jungtaek Lim (HeartSaVioR)


ps. Before all mentioned issues I also submitted some other issues regarding feature addition/refactor (2 of 5 issues).


2018년 7월 6일 (금) 오전 10:09, Jungtaek Lim <[hidden email]>님이 작성:
Bump. I have been having hard time working on making additional PRs since some of these rely on non-merged PRs, so spending additional time to decouple these things if possible.


Pending 5 PRs so far, and may add more sooner or later.

Thanks,
Jungtaek Lim (HeartSaVioR)

2018년 7월 1일 (일) 오전 6:21, Jungtaek Lim <[hidden email]>님이 작성:
Kindly reminder since around 2 weeks passed. I've added more PR during 2 weeks and even planning to do more.

2018년 6월 19일 (화) 오후 6:34, Jungtaek Lim <[hidden email]>님이 작성:
Hi Spark devs,

I have couple of pull requests for structured streaming which are getting older and fading out from earlier pages in PR pages.


Two of them are in a kind of approval by couple of folks, but no approval from committers yet.
One of them needs rebase and I would be happy to do it after reviewing or in progress of reviewing.

Getting reviewed in time would be critical for contributors to be honest, so I'd like to ask dev mailing list to review my PRs.

Thanks in advance,
Jungtaek Lim (HeartSaVioR) 
Reply | Threaded
Open this post in threaded view
|

Re: Asking for reviewing PRs regarding structured streaming

Jungtaek Lim
Bump. I got couple of review comments from contributors including soft LGTM, but still haven't got any (non code style) review from committers, so technically haven't have any progress to be merged. 

I'm planning to work on adding new feature as well, but it's not easy for me to concentrate on something with also concerning to maintain 6 existing pull requests. Merge conflicts would be matter on maintaining, especially other pull requests (submitted later than my pull requests) are getting reviewed and merged.

I'd like to ask any structured streaming related committer to take a look at pull requests.

- Jungtaek Lim (HeartSaVioR)

2018년 7월 12일 (목) 오후 10:41, Jungtaek Lim <[hidden email]>님이 작성:
I recently added more test results to SPARK-24763 [1] which shows that the proposal reduces state size according to the ratio of key-value size, whereas there's no performance hit and sometimes even slight boost.

Please refer the latest comment in JIRA issue [2] to see the numbers from perf. tests.

Thanks,
Jungtaek Lim (HeartSaVioR)


2018년 7월 9일 (월) 오후 5:28, Jungtaek Lim <[hidden email]>님이 작성:
Now I'm adding one more issue (SPARK-24763 [1]), which proposes a new option to enable optimization of state size in streaming aggregation without hurting performance.

The idea is to remove data for key fields from value which is duplicated between key and value in state row. This requires additional operations like projection and join, but smaller state row would also give performance benefit, which can offset each other.

Please refer the comment in JIRA issue [2] to see the numbers from simple perf. test.

Thanks,

2018년 7월 6일 (금) 오후 1:54, Jungtaek Lim <[hidden email]>님이 작성:
Ted Yu suggested posting the improved numbers to this thread and I think it's good idea, so also posting here, but I also think explaining rationalization of my issues would help understanding why I'm submitting couple of patches, so I'll explain it first. (Sorry to post a wall of text).

tl;dr. SPARK-24717 [1] can reduce the overall memory usage of HDFS state store provider from 10x~80x of size of state for a batch according to various stateful workloads to less than or around 2x. The new option is flexible so it can be even around 1x or even effectively disable cache. Please refer the comment in the PR [2] to get more details. (hard to post detailed numbers in mail format so link a Github comment instead)

I have interest on stateful streaming processing on Structured Streaming, and have been learning from codebase as well as analyzing memory usage as well as latency (while I admit it is hard to measure latency correctly...).


While took a look at HDFSBackedStateStoreProvider I indicated a kind of excessive caching. As I described in section "The impact of storing multiple versions from HDFSBackedStateStoreProvider" in above link, while multiple versions share the same UnsafeRow unless there's a change on the value which lessen the impact of caching multiple versions (credit to Jose Torres since I realized it from his comment). But in some workloads which lots of writes to state incurs in a batch, the overall memory usage of state is going to be out of expectation.

Related patch [3] is also submitted from other contributor (so I'm not the one to notice this behavior), whereas the patch might not look enough generalized to apply.

First I decided to track the overall memory size of state provider cache and expose to UI as well as query status (SPARK-24441 [4]). The metric looked like critical and worth to monitor, so I thought it is better to expose it (and watermark) to Dropwizard (SPARK-24637 [5]).

Based on adoption of SPARK-24441, I could find more flexible way to resolve the issue (SPARK-24717) what I've mentioned in tl;dr.

So 3 of 5 issues are coupled so far to track and resolve one issue. Hope that it helps explaining worth of reviews for these patches.

Thanks,
Jungtaek Lim (HeartSaVioR)


ps. Before all mentioned issues I also submitted some other issues regarding feature addition/refactor (2 of 5 issues).


2018년 7월 6일 (금) 오전 10:09, Jungtaek Lim <[hidden email]>님이 작성:
Bump. I have been having hard time working on making additional PRs since some of these rely on non-merged PRs, so spending additional time to decouple these things if possible.


Pending 5 PRs so far, and may add more sooner or later.

Thanks,
Jungtaek Lim (HeartSaVioR)

2018년 7월 1일 (일) 오전 6:21, Jungtaek Lim <[hidden email]>님이 작성:
Kindly reminder since around 2 weeks passed. I've added more PR during 2 weeks and even planning to do more.

2018년 6월 19일 (화) 오후 6:34, Jungtaek Lim <[hidden email]>님이 작성:
Hi Spark devs,

I have couple of pull requests for structured streaming which are getting older and fading out from earlier pages in PR pages.


Two of them are in a kind of approval by couple of folks, but no approval from committers yet.
One of them needs rebase and I would be happy to do it after reviewing or in progress of reviewing.

Getting reviewed in time would be critical for contributors to be honest, so I'd like to ask dev mailing list to review my PRs.

Thanks in advance,
Jungtaek Lim (HeartSaVioR) 
Reply | Threaded
Open this post in threaded view
|

Re: Asking for reviewing PRs regarding structured streaming

Jungtaek Lim
I'd like to bump this again, since only one of 6 pull requests is merged (5 remaining), and others are not reviewed (non code style) from committers.


All pull requests are related to Structured Streaming, and most of all are already reviewed by couple of contributors. They're open for 17 days at least and more than 2 months at most. I'm not persuading committers to merge them in 2.4, but hope to see any reactions / reviews so that I can hopefully reflect and take them forward to be ready to merge.

- Jungtaek Lim (HeartSaVioR)

2018년 7월 16일 (월) 오후 1:04, Jungtaek Lim <[hidden email]>님이 작성:
Bump. I got couple of review comments from contributors including soft LGTM, but still haven't got any (non code style) review from committers, so technically haven't have any progress to be merged. 

I'm planning to work on adding new feature as well, but it's not easy for me to concentrate on something with also concerning to maintain 6 existing pull requests. Merge conflicts would be matter on maintaining, especially other pull requests (submitted later than my pull requests) are getting reviewed and merged.

I'd like to ask any structured streaming related committer to take a look at pull requests.

- Jungtaek Lim (HeartSaVioR)

2018년 7월 12일 (목) 오후 10:41, Jungtaek Lim <[hidden email]>님이 작성:
I recently added more test results to SPARK-24763 [1] which shows that the proposal reduces state size according to the ratio of key-value size, whereas there's no performance hit and sometimes even slight boost.

Please refer the latest comment in JIRA issue [2] to see the numbers from perf. tests.

Thanks,
Jungtaek Lim (HeartSaVioR)


2018년 7월 9일 (월) 오후 5:28, Jungtaek Lim <[hidden email]>님이 작성:
Now I'm adding one more issue (SPARK-24763 [1]), which proposes a new option to enable optimization of state size in streaming aggregation without hurting performance.

The idea is to remove data for key fields from value which is duplicated between key and value in state row. This requires additional operations like projection and join, but smaller state row would also give performance benefit, which can offset each other.

Please refer the comment in JIRA issue [2] to see the numbers from simple perf. test.

Thanks,

2018년 7월 6일 (금) 오후 1:54, Jungtaek Lim <[hidden email]>님이 작성:
Ted Yu suggested posting the improved numbers to this thread and I think it's good idea, so also posting here, but I also think explaining rationalization of my issues would help understanding why I'm submitting couple of patches, so I'll explain it first. (Sorry to post a wall of text).

tl;dr. SPARK-24717 [1] can reduce the overall memory usage of HDFS state store provider from 10x~80x of size of state for a batch according to various stateful workloads to less than or around 2x. The new option is flexible so it can be even around 1x or even effectively disable cache. Please refer the comment in the PR [2] to get more details. (hard to post detailed numbers in mail format so link a Github comment instead)

I have interest on stateful streaming processing on Structured Streaming, and have been learning from codebase as well as analyzing memory usage as well as latency (while I admit it is hard to measure latency correctly...).


While took a look at HDFSBackedStateStoreProvider I indicated a kind of excessive caching. As I described in section "The impact of storing multiple versions from HDFSBackedStateStoreProvider" in above link, while multiple versions share the same UnsafeRow unless there's a change on the value which lessen the impact of caching multiple versions (credit to Jose Torres since I realized it from his comment). But in some workloads which lots of writes to state incurs in a batch, the overall memory usage of state is going to be out of expectation.

Related patch [3] is also submitted from other contributor (so I'm not the one to notice this behavior), whereas the patch might not look enough generalized to apply.

First I decided to track the overall memory size of state provider cache and expose to UI as well as query status (SPARK-24441 [4]). The metric looked like critical and worth to monitor, so I thought it is better to expose it (and watermark) to Dropwizard (SPARK-24637 [5]).

Based on adoption of SPARK-24441, I could find more flexible way to resolve the issue (SPARK-24717) what I've mentioned in tl;dr.

So 3 of 5 issues are coupled so far to track and resolve one issue. Hope that it helps explaining worth of reviews for these patches.

Thanks,
Jungtaek Lim (HeartSaVioR)


ps. Before all mentioned issues I also submitted some other issues regarding feature addition/refactor (2 of 5 issues).


2018년 7월 6일 (금) 오전 10:09, Jungtaek Lim <[hidden email]>님이 작성:
Bump. I have been having hard time working on making additional PRs since some of these rely on non-merged PRs, so spending additional time to decouple these things if possible.


Pending 5 PRs so far, and may add more sooner or later.

Thanks,
Jungtaek Lim (HeartSaVioR)

2018년 7월 1일 (일) 오전 6:21, Jungtaek Lim <[hidden email]>님이 작성:
Kindly reminder since around 2 weeks passed. I've added more PR during 2 weeks and even planning to do more.

2018년 6월 19일 (화) 오후 6:34, Jungtaek Lim <[hidden email]>님이 작성:
Hi Spark devs,

I have couple of pull requests for structured streaming which are getting older and fading out from earlier pages in PR pages.


Two of them are in a kind of approval by couple of folks, but no approval from committers yet.
One of them needs rebase and I would be happy to do it after reviewing or in progress of reviewing.

Getting reviewed in time would be critical for contributors to be honest, so I'd like to ask dev mailing list to review my PRs.

Thanks in advance,
Jungtaek Lim (HeartSaVioR)