Output mode in Structured Streaming and DSv1 sink/DSv2 table

Output mode in Structured Streaming and DSv1 sink/DSv2 table

Jungtaek Lim-2
Hi devs,

We have a capability check in DSv2 that defines which operations can be performed against a data source, for both reads and writes. The concept was introduced in DSv2, so it's not surprising that DSv1 has no such concept.

The problem arises in SS: if I understand correctly, we would like to couple the output mode of the query with the output table. That is, complete mode should force the output table to truncate its content, and update mode should force the output table to "upsert" or "delete and append" the content.
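To make the intended coupling concrete, here is a minimal sketch as a toy Python model. It is not Spark code: `OutputMode`, `apply_batch`, and the representation of a table as a list of `(key, value)` rows are all illustrative assumptions, and update mode is modeled as "delete and append" by key.

```python
from enum import Enum, auto


class OutputMode(Enum):
    # Hypothetical stand-in for the three streaming output modes.
    APPEND = auto()
    COMPLETE = auto()
    UPDATE = auto()


def apply_batch(table, batch, mode):
    """Return the new table content after writing one micro-batch.

    `table` and `batch` are lists of (key, value) rows (an assumption
    made for this sketch only).
    """
    if mode is OutputMode.APPEND:
        # Append mode: existing rows are untouched, new rows are added.
        return table + batch
    if mode is OutputMode.COMPLETE:
        # Complete mode: truncate the table, then write the full result.
        return list(batch)
    if mode is OutputMode.UPDATE:
        # Update mode as "delete and append": drop rows whose key is
        # being rewritten, then append the new rows.
        updated_keys = {key for key, _ in batch}
        kept = [(key, value) for key, value in table if key not in updated_keys]
        return kept + batch
    raise ValueError(f"unknown output mode: {mode}")


existing = [("a", 1), ("b", 2)]
new_rows = [("a", 3)]
```

With these inputs, a sink that ignores the output mode and always appends ends up with both the old and the new row for key `a`, which is the DSv1 behavior described next.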

Nothing has been done for DSv1 sinks: Spark doesn't enforce anything and the sink works as if in append mode, though the query still respects the output mode for stateful operations.

I understand we don't want to surprise end users by breaking compatibility, but shouldn't this be a "temporary", "exceptional" case that DSv2 never repeats? I'm seeing many built-in data sources being migrated to DSv2 with the exception of "do nothing for update/truncate", which simply defeats the rationale for having capabilities.

In addition, they don't add TRUNCATE to their capabilities but do add SupportsTruncate to the WriteBuilder, which is odd. It works for now because SS doesn't check capabilities on the writer side (I guess it only checks STREAMING_WRITE), but once we check capabilities up front, things will break.
(I'm looking into adding a writer plan in SS before the analyzer and checking capabilities there.)
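The mismatch can be illustrated with a small toy model in Python. Nothing here is Spark's API: `ToyTable`, `current_check`, and `strict_check` are hypothetical names, capabilities are modeled as plain strings, and the "current" behavior follows the guess above that only STREAMING_WRITE is checked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyTable:
    # Hypothetical model of a DSv2 table for this sketch only.
    name: str
    capabilities: frozenset          # e.g. {"STREAMING_WRITE", "TRUNCATE"}
    builder_supports_truncate: bool  # models SupportsTruncate on the builder


def current_check(table, output_mode):
    # Assumed current behavior: only STREAMING_WRITE is verified,
    # regardless of the output mode.
    return "STREAMING_WRITE" in table.capabilities


def strict_check(table, output_mode):
    # Sketch of an up-front check: complete mode additionally
    # requires the TRUNCATE capability to be declared.
    if "STREAMING_WRITE" not in table.capabilities:
        return False
    if output_mode == "complete":
        return "TRUNCATE" in table.capabilities
    return True


# Declares truncate support on the builder but not as a capability.
inconsistent = ToyTable("sink", frozenset({"STREAMING_WRITE"}), True)
# Declares truncate support in both places.
consistent = ToyTable("sink", frozenset({"STREAMING_WRITE", "TRUNCATE"}), True)
```

Under these assumptions, `inconsistent` passes `current_check` for complete mode but fails `strict_check`, while `consistent` passes both.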

What would be the best fix for this issue? Should we leave the responsibility for handling "truncate" to the data source (so doing nothing is fine if that is intended) and just add TRUNCATE to its capabilities? (That behavior should be documented in the data source's description, though.) Or should we drop truncate support if the data source is unable to truncate? (The Foreach and Kafka output tables would then no longer be able to run in complete mode.)

Looking forward to hearing everyone's thoughts.

Thanks,
Jungtaek Lim (HeartSaVioR)

Re: Output mode in Structured Streaming and DSv1 sink/DSv2 table

Jungtaek Lim-2
Bumping this to see whether anyone is interested in or concerned about it.

On Sun, Sep 20, 2020 at 1:59 PM Jungtaek Lim <[hidden email]> wrote:
[quoted original message trimmed]