[SS] full outer stream-stream join

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

[SS] full outer stream-stream join

Cheng Su-2
Hi,
 
Stream-stream join in spark structured streaming right now supports INNER,
LEFT OUTER, RIGHT OUTER and LEFT SEMI join type. But it does not support
FULL OUTER join and we are working on to add it in
https://github.com/apache/spark/pull/30395 .
 
Given LEFT OUTER and RIGHT OUTER stream-stream join is supported, the code
needed for FULL OUTER join is actually quite straightforward:

* For left side input row, check if there's a match on right side state
store. if there's a match, output the joined row, o.w. output nothing. Put
the row in left side state store.
* For right side input row, check if there's a match on left side state
store. if there's a match, output the joined row, o.w. output nothing. Put
the row in right side state store.
* State store eviction: evict rows from left/right side state store below
watermark, and output rows never matched before (a combination of left outer
and right outer join).

Given FULL OUTER join consumes same amount of space in state store, compared
with INNER/LEFT OUTER/RIGH OUTER join, and pretty easy to add. I don’t see
any issues from system perspective that FULL OUTER join should not be added.

I am wondering is there any major blocker to add FULL OUTER stream-stream
join? Asking in dev mailing list in case we miss anything besides PR review
participation, thanks.

Cheng Su



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [SS] full outer stream-stream join

Jungtaek Lim-2
Adding rationalization here, my request for raising the thead to dev mailing list is, to figure out possible reasons not having full outer join at the moment when adding left/right outer join.

This is rather historical knowledge, so I have no idea about this. Most likely a limited number of folks could answer and I hope we could get some historical information.

Note that I don't object the change. Just wanted to make clear we don't miss something.

On Sat, Nov 21, 2020 at 4:15 AM real-cheng-su <[hidden email]> wrote:
Hi,

Stream-stream join in spark structured streaming right now supports INNER,
LEFT OUTER, RIGHT OUTER and LEFT SEMI join type. But it does not support
FULL OUTER join and we are working on to add it in
https://github.com/apache/spark/pull/30395 .

Given LEFT OUTER and RIGHT OUTER stream-stream join is supported, the code
needed for FULL OUTER join is actually quite straightforward:

* For left side input row, check if there's a match on right side state
store. if there's a match, output the joined row, o.w. output nothing. Put
the row in left side state store.
* For right side input row, check if there's a match on left side state
store. if there's a match, output the joined row, o.w. output nothing. Put
the row in right side state store.
* State store eviction: evict rows from left/right side state store below
watermark, and output rows never matched before (a combination of left outer
and right outer join).

Given FULL OUTER join consumes same amount of space in state store, compared
with INNER/LEFT OUTER/RIGH OUTER join, and pretty easy to add. I don’t see
any issues from system perspective that FULL OUTER join should not be added.

I am wondering is there any major blocker to add FULL OUTER stream-stream
join? Asking in dev mailing list in case we miss anything besides PR review
participation, thanks.

Cheng Su



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]