[DISCUSS] Preferred approach on dealing with SPARK-29322

[DISCUSS] Preferred approach on dealing with SPARK-29322

Jungtaek Lim-2
Hi devs,

I've discovered an issue with the event logger: when reading an incomplete event log file compressed with 'zstd', the reader thread gets stuck on that file.

This is very easy to reproduce: set the configuration as below

- spark.eventLog.enabled=true
- spark.eventLog.compress=true
- spark.eventLog.compression.codec=zstd

and start a Spark application. While the application is running, load the application in the SHS (Spark History Server) web page. It may succeed in replaying the event log, but most likely it will get stuck, and the loading page will hang as well.
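For reference, a minimal sketch of the repro as driver code (the app name and the sleep are illustrative; any long-running application should work):

    // Minimal repro sketch; app name and sleep duration are placeholders.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("zstd-eventlog-repro")
      .config("spark.eventLog.enabled", "true")
      .config("spark.eventLog.compress", "true")
      .config("spark.eventLog.compression.codec", "zstd")
      .getOrCreate()

    // Keep the application alive, then open it in the SHS web page.
    Thread.sleep(600000)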

Please refer to SPARK-29322 for more details.

Since the issue only occurs with 'zstd', the simplest approach is to drop 'zstd' support for the event log. A more general approach would be to introduce a timeout on reading the event log file, but that would have to differentiate a thread that is stuck from a thread that is simply busy reading a huge event log file.
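To illustrate why that differentiation is hard, here is a naive sketch of a per-read timeout (illustrative only, not actual Spark code):

    // Naive read-with-timeout sketch: it times out the same way whether the
    // decompressor is deadlocked or the read is just slow, and the blocked
    // reader thread still leaks underneath.
    import java.io.InputStream
    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.concurrent.ExecutionContext.Implicits.global

    def readWithTimeout(in: InputStream, buf: Array[Byte], timeout: FiniteDuration): Int = {
      val pending = Future(in.read(buf))
      Await.result(pending, timeout) // throws TimeoutException on expiry
    }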

Which approach would be preferred in the Spark community, or does someone have a better idea for handling this?

Thanks,
Jungtaek Lim (HeartSaVioR)

Re: [DISCUSS] Preferred approach on dealing with SPARK-29322

Mridul Muralidharan
It makes more sense to drop support for zstd, assuming the fix is not something at the Spark end (configuration, etc.).
It does not make sense to try to detect a deadlock in the codec.

Regards,
Mridul



Re: [DISCUSS] Preferred approach on dealing with SPARK-29322

Dongjoon Hyun-2
Thank you for reporting, Jungtaek.

Can we try to upgrade it to the newer version first?

Since we are at 1.4.2, the newer version is 1.4.3.

Bests,
Dongjoon.





Re: [DISCUSS] Preferred approach on dealing with SPARK-29322

Jungtaek Lim-2
The change log for zstd v1.4.3 suggests that the changes aren't related:


v1.4.3
bug: Fix Dictionary Compression Ratio Regression by @cyan4973 (#1709)
bug: Fix Buffer Overflow in v0.3 Decompression by @felixhandte (#1722)
build: Add support for IAR C/C++ Compiler for Arm by @joseph0918 (#1705)
misc: Add NULL pointer check in util.c by @leeyoung624 (#1706)

But it's only a matter of updating the dependency and rebuilding, so I'll try it out.

Before that, I just noticed that ZstdOutputStream has a parameter "closeFrameOnFlush" which seems to deal with flushing. We leave it at its default value, which is "false". Let me set it to "true" and see if it helps. Please let me know if someone knows why we picked false (or left it at the default).
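A sketch of what I mean (this assumes the three-argument ZstdOutputStream constructor in zstd-jni 1.4.x; the file name and compression level are placeholders):

    // Open the stream with closeFrameOnFlush = true so that each flush() also
    // finishes the current zstd frame, making the flushed bytes decodable.
    import java.io.{BufferedOutputStream, FileOutputStream}
    import com.github.luben.zstd.ZstdOutputStream

    val raw = new FileOutputStream("app-1234.inprogress.zstd") // placeholder path
    val level = 1 // placeholder compression level
    val zout = new ZstdOutputStream(new BufferedOutputStream(raw), level, /* closeFrameOnFlush = */ true)
    zout.write("event".getBytes("UTF-8"))
    zout.flush() // the frame is closed here, even though the stream stays open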




Re: [DISCUSS] Preferred approach on dealing with SPARK-29322

Jungtaek Lim-2
I need to do a full manual test to be sure, but according to an experiment (a small unit test), "closeFrameOnFlush" seems to work.

There was a relevant change on the master branch, SPARK-26283 [1], which changed the way the zstd event log file is read to "continuous", which appears to read an open frame. With "closeFrameOnFlush" being false for ZstdOutputStream, the frame is never closed (even when flushing the output stream) unless the output stream itself is closed.
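Roughly, the small UT looks like this (a sketch, not the actual Spark test; it assumes zstd-jni 1.4.x's three-argument ZstdOutputStream constructor and ZstdInputStream.setContinuous, which the continuous read path uses):

    // Write with closeFrameOnFlush = true, flush without closing (simulating an
    // in-progress event log), then read the bytes back in continuous mode.
    import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
    import com.github.luben.zstd.{ZstdInputStream, ZstdOutputStream}

    val sink = new ByteArrayOutputStream()
    val zout = new ZstdOutputStream(sink, 1, /* closeFrameOnFlush = */ true)
    zout.write("some event log line\n".getBytes("UTF-8"))
    zout.flush() // frame ends here; the stream is intentionally left open

    val zin = new ZstdInputStream(new ByteArrayInputStream(sink.toByteArray))
    zin.setContinuous(true) // tolerate a file that is still being written
    val buf = new Array[Byte](8192)
    val n = zin.read(buf) // returns the flushed data instead of blocking
    assert(new String(buf, 0, n, "UTF-8").startsWith("some event log line"))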

I'll raise a patch once the manual test passes. Sorry for the false alarm.

Thanks,
Jungtaek Lim (HeartSaVioR)





Re: [DISCUSS] Preferred approach on dealing with SPARK-29322

Dongjoon Hyun-2
Thank you for the investigation and making a fix.

So, are both issues only on the master (3.0.0) branch?

Bests,
Dongjoon.




Re: [DISCUSS] Preferred approach on dealing with SPARK-29322

Jungtaek Lim-2
I'm not 100% sure I understand the question. Assuming that by "both" you mean SPARK-26283 [1] and SPARK-29322 [2]: if you're asking about the fix, then yes, it's only on the master branch, since the fix for SPARK-26283 was not ported back to branch-2.4. If you're asking about the issue itself, then maybe not, according to the affected versions of SPARK-26283 (2.4.0 is listed there as well).
