Handling user-facing metadata issues on file stream source & sink

Handling user-facing metadata issues on file stream source & sink

Jungtaek Lim-2
Hi devs,

I'm seeing more and more Structured Streaming end users run into the metadata issues on the file stream source and sink. These are known issues with long-standing JIRA tickets, and end users reported them again on the user@ mailing list in April:

* Spark Structure Streaming | FileStreamSourceLog not deleting list of input files | Spark -2.4.0 [1]
* [Structured Streaming] Checkpoint file compact file grows big [2]
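The growth reported in the second thread above comes from how a compact-on-every-Nth-batch metadata log behaves without retention. The following is an illustrative sketch (not Spark's actual implementation; the interval and entry names are made up) showing why each compact file is larger than the last:

```python
# Illustrative sketch: a metadata log that rewrites *all* entries seen
# so far into one compact file every Nth batch. Without any retention
# or eviction, the compact file grows linearly for the life of the query.

COMPACT_INTERVAL = 10  # hypothetical: compact every 10 batches

def is_compact_batch(batch_id: int) -> bool:
    return (batch_id + 1) % COMPACT_INTERVAL == 0

def run(num_batches: int, files_per_batch: int) -> list:
    """Return the entry count of each compact batch file, in order."""
    all_entries = []   # every file ever seen; nothing is ever evicted
    compact_sizes = []
    for batch_id in range(num_batches):
        all_entries.extend(
            f"batch{batch_id}/file{i}" for i in range(files_per_batch))
        if is_compact_batch(batch_id):
            # The compact batch rewrites the entire history into one file.
            compact_sizes.append(len(all_entries))
    return compact_sizes

print(run(50, 100))  # each compact file is strictly bigger than the last
```

Each compact file contains the full history, so a query processing many files per trigger sees its checkpoint metadata grow without bound.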

I've proposed various improvements in this area (see my PRs [3]), but they have suffered from a lack of interest/reviews. I feel the issue is critical (and under-estimated) because...

1. It's one of the "built-in" data sources maintained by the Spark community. (End users may judge the state of the project/area by the quality of the built-in data sources, because those are what they start with.)
2. It's the only built-in data source that provides "end-to-end exactly-once" semantics in Structured Streaming.

I'd like to see us address these issues so that end users can live with the built-in data source. (It doesn't need to be perfect, but it should at least behave reasonably for long-running streaming workloads.) I know there are a couple of alternatives, but I don't think beginners would start from there. End users may simply look for alternatives - not an alternative data source, but an alternative stream processing framework.

Thanks,
Jungtaek Lim (HeartSaVioR)

Re: Handling user-facing metadata issues on file stream source & sink

Jungtaek Lim-2
(bump to expose the discussion to more readers)


Re: Handling user-facing metadata issues on file stream source & sink

Jungtaek Lim-2
Worth noting that I've gotten similar questions in my local community as well. These reporters didn't hit an edge case; they hit the critical issue during normal operation of a streaming query.


Re: Handling user-facing metadata issues on file stream source & sink

Jungtaek Lim-2
Bump again - I hope to get some traction, since these PRs are either fixes for long-standing problems or noticeable improvements (each PR includes numbers/UI graphs showing the improvement).

Fixed long-standing problems:

* [SPARK-17604][SS] FileStreamSource: provide a new option to have retention on input files [1]
* [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files [2]

There's no logic to control the size of the metadata for the file stream source & sink, and it affects end users who run streaming queries with many input/output files over the long run. Both PRs resolve the metadata growing unboundedly over time. As the low JIRA number of SPARK-17604 suggests, it's a fairly old problem. At least three relevant issues have been reported against SPARK-27188.
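The retention idea behind the two PRs above can be sketched as follows. This is a hedged illustration, not Spark's actual code; the retention window and entry layout are hypothetical, and the real PRs expose this through data source options:

```python
# Sketch of retention on a metadata log: when compacting, drop entries
# older than a retention window instead of keeping every entry forever.
# The window and field names here are illustrative.

RETENTION_MS = 60 * 60 * 1000  # hypothetical: keep one hour of entries

def compact(entries, current_time_ms, retention_ms=RETENTION_MS):
    """Keep only entries whose age is within the retention window."""
    return [e for e in entries
            if current_time_ms - e["timestamp"] <= retention_ms]

now = 50_000_000
entries = [
    {"path": "f1", "timestamp": now - 20_000_000},  # too old, evicted
    {"path": "f2", "timestamp": now - 1_000},       # recent, kept
]
print(compact(entries, now))  # only f2 survives compaction
```

With eviction at compaction time, the compact file size is bounded by the retention window rather than by the total lifetime of the query.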

Improvements:

* [SPARK-30866][SS] FileStreamSource: Cache fetched list of files beyond maxFilesPerTrigger as unread files [3]
* [SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log twice if the query restarts from compact batch [4]
* [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log [5]

The above patches provide better performance under the conditions described in each PR. Worth noting, SPARK-30946 delivers much better performance (~10x) on compaction for every compact batch, while also shrinking the compact batch log file (to ~30% of its current size).
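The size reduction from SPARK-30946 comes from replacing one-JSON-object-per-line serialization with a compact binary layout plus compression. A rough illustration of the effect (zlib stands in for LZ4, since Python has no LZ4 in the stdlib, and the entry layout is made up, not Spark's):

```python
# Compare a length-prefixed binary serde + compression against JSON
# lines for file-log-like entries of (path, size) pairs.
import json
import struct
import zlib

def serialize_binary(entries):
    out = bytearray()
    for path, size in entries:
        raw = path.encode("utf-8")
        # 4-byte length prefix, then the path, then a fixed 8-byte size
        out += struct.pack(">I", len(raw)) + raw + struct.pack(">q", size)
    return zlib.compress(bytes(out))

def serialize_json_lines(entries):
    return "\n".join(
        json.dumps({"path": p, "size": s}) for p, s in entries).encode()

entries = [(f"s3://bucket/part-{i:05d}.parquet", 1024 * i)
           for i in range(1000)]
binary = serialize_binary(entries)
text = serialize_json_lines(entries)
print(len(binary), len(text))  # the binary+compressed form is far smaller
```

Repeated key names and structural characters in JSON compress away entirely in the binary form, which is where most of the on-disk savings come from.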




Re: Handling user-facing metadata issues on file stream source & sink

Jungtaek Lim-2
Bump + adding one more issue I fixed (and, by coincidence, there's a relevant report on the user mailing list recently):

* [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue [1]

The patch stabilizes the driver's memory usage when working with a huge metadata log, which previously threw OutOfMemoryError.
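The core idea of keeping driver memory flat is to process log entries one at a time instead of materializing the whole log in memory. A minimal sketch, with an illustrative line format that is not Spark's actual on-disk layout:

```python
# Lazily parse a metadata log entry-by-entry with a generator, so only
# one entry is alive at a time, instead of building a list that holds
# millions of entries at once on the driver.
from typing import Iterator

def read_entries(lines) -> Iterator[dict]:
    """Yield one parsed entry per line; nothing is buffered."""
    for line in lines:
        path, size = line.rstrip("\n").split(",")
        yield {"path": path, "size": int(size)}

def total_bytes(lines) -> int:
    # Aggregate in constant memory: the generator pipeline never holds
    # the full log, only the entry currently being summed.
    return sum(e["size"] for e in read_entries(lines))

lines = (f"part-{i},{i}" for i in range(1_000_000))
print(total_bytes(lines))
```

Any per-entry transformation (filtering expired files, rewriting paths) composes into the same pipeline without changing the memory profile.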

