Hi there,
I'd like to know the root reason why multiple aggregations on a streaming DataFrame are not allowed, since it's a very useful feature and Flink has supported it for a long time.
Thanks.
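For context, a minimal Scala sketch of the kind of query that hits this restriction (the "rate" source is just Spark's built-in test source; the error text reflects the Spark versions discussed in this thread):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").appName("multi-agg").getOrCreate()
import spark.implicits._

// Built-in "rate" source: emits (timestamp, value) rows for testing.
val stream = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

// First aggregation: events per 1-second event-time window.
val perWindow = stream
  .withWatermark("timestamp", "10 seconds")
  .groupBy(window($"timestamp", "1 second"))
  .count()

// Second aggregation on top of the first: histogram of window counts.
val histogram = perWindow.groupBy($"count").agg(count("*").as("windows"))

// Starting the query fails at analysis time with:
//   AnalysisException: Multiple streaming aggregations are not supported with
//   streaming DataFrames/Datasets
histogram.writeStream.outputMode("append").format("console").start()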
There is a PR for this, but it is not yet merged.
Here's the proposal for supporting it in "append" mode: https://github.com/apache/spark/pull/23576. You could see if it addresses your requirement and post your feedback on the PR. For "update" mode it's going to be much harder to support this without first adding support for "retractions"; otherwise we would end up with wrong results.
- Arun
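To illustrate the "update" mode problem Arun describes, here is a plain-Scala sketch (deliberately not Spark API) of why a downstream aggregation gives wrong results when upstream updates arrive without retractions:

// In update mode, the first aggregation (count per key) re-emits a key's row
// every time its count changes: ("a",1), then ("a",2) after a second "a" event.
val emitted = Seq(("a", 1L), ("b", 1L), ("a", 2L))

// A downstream aggregation ("how many keys have count N?") that treats every
// emitted row as a live fact counts key "a" under both N=1 and N=2:
val naive = emitted.groupBy(_._2).map { case (n, rows) => n -> rows.size }
println(naive) // 1 -> 2, 2 -> 1: wrong, "a" is double-counted

// The correct result requires ("a",1) to be retracted when ("a",2) arrives:
val current = Map("a" -> 2L, "b" -> 1L)
val correct = current.groupBy(_._2).map { case (n, rows) => n -> rows.size }
println(correct) // 1 -> 1, 2 -> 1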
Thanks, I'll check it out.
Hi all,
I'm also very interested in this feature, but the PR has been open since January 2019 and has not been updated. It raised a design discussion around watermarks, and a design doc was written (https://docs.google.com/document/d/1IAH9UQJPUiUCLd7H6dazRK2k1szDX38SnM6GVNZYvUo/edit#heading=h.npkueh4bbkz1). We also commented on this design, but the subject still seems stale.
Is there any interest in the community in delivering this feature, or is it considered worthless? If the latter, can you explain why?
Best
Etienne
Unfortunately I don't see enough active committers working on Structured Streaming; I don't expect major features/improvements can be brought in this situation.
Technically I can review and merge a PR on major improvements in SS, but that depends on how much the proposal changes. If the proposal brings a conceptual change, review by a single committer still wouldn't be enough.
So it's not that we think it's worthless. (That might be only me, though.) I'd read it as there simply not being much investment in SS. There is also a known workaround for multiple aggregations (I've documented it in the SS guide, in the "Limitation of global watermark" section), though I totally agree the workaround is bad.
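For reference, a minimal sketch of that documented workaround: split the two aggregations into separate streaming queries connected through intermediate storage. The paths, source, and sink choices below are illustrative assumptions, not the guide's exact example.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Query 1: first aggregation, persisted to an intermediate location in append mode.
val events = spark.readStream.format("rate").load()
val counts = events
  .withWatermark("timestamp", "10 seconds")
  .groupBy(window($"timestamp", "5 seconds"), $"value")
  .count()

counts.writeStream
  .format("parquet")
  .option("path", "/tmp/agg1")
  .option("checkpointLocation", "/tmp/agg1-ckpt")
  .outputMode("append")
  .start()

// Query 2: a separate streaming query reads the intermediate results back
// and runs the second aggregation.
val histogram = spark.readStream
  .schema(counts.schema)
  .parquet("/tmp/agg1")
  .groupBy($"count")
  .agg(count("*").as("groups"))

histogram.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/agg2-ckpt")
  .outputMode("complete")
  .start()

The second query only sees a window's row after the first query finalizes and writes it out, which is where the extra end-to-end delay comes from.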
Hi Jungtaek Lim,
Nice to hear from you again since the last time we talked :) and congrats on becoming a Spark committer in the meantime! (If I'm not mistaken, you were not one at the time.)
I totally agree with what you're saying about merging structural parts of Spark without a broader consensus. What I don't understand is why there is not more investment in SS, especially because in another thread the community is discussing deprecating the regular DStream streaming framework.
Is the orientation of Spark now mostly batch?
PS: yeah, I saw your update on the doc when I took a look at 3.0 preview 2 searching for this particular feature. And regarding the workaround, I'm not sure it meets my needs, as it will add delays and may also mess up watermarks.
Best
Etienne Chauchot
Thanks Etienne! Yeah, I forgot to say it was nice talking with you again. And sorry I forgot to send this reply (it was sitting in drafts).
Regarding investment in SS, unfortunately I don't know - I'm just an individual. There might be various reasons, most probably "priority" among other things. There's not much I can change.
I agree the workaround is sub-optimal, but unless I see sufficient support in the community I probably can't make it go forward. I'll just say there's an elephant in the room: as the project has been going for more than 10 years, backward compatibility is a top-priority concern, even across major versions for features/APIs. That is great for end users, who can migrate versions easily, but it also blocks devs from fixing bad designs once they ship. I'm the one complaining about these issues on the dev list, and I don't see willingness to correct them.
Thanks for the great discussion!
I'm also interested in this feature and did some investigation on it before. As Arun mentioned, similar to "update" mode, "complete" mode also needs more design. We might need an operation-level output mode to support complete mode; that is to say, if we use "complete" mode for every aggregation operator, wrong results will be returned.
SPARK-26655 would be a good start; it only considers "append" mode. Maybe we need more discussion on the watermark interface. I will take a close look at the doc and PR. Hopefully we will have a first version with limitations and fix/remove them gradually.
Best,
Yuanjian
Hi,
Regarding this subject, I wrote a blog article that gives details about the watermark architecture proposal that was discussed in the design doc and in the PR:
https://echauchot.blogspot.com/2020/11/watermark-architecture-proposal-for.html
Best
Etienne
Nice blog! Thanks for sharing, Etienne!
Let's try to raise this discussion again after the 3.1 release. I do think more committers/contributors have realized the issue of the global watermark, per SPARK-24634 and SPARK-33259.
Leaving some thoughts on my end:
1. Compatibility: the per-operation watermark should be compatible with the original global one when there are no multi-aggregations.
2. Versioning: if we need to change the checkpoint format, versioning info should be added for the first time.
3. Fix more things together: we'd better fix related issues (e.g. per-operation output mode for multi-aggregations) together, since they would require versioning changes in the same Spark version.
Best,
Yuanjian
To be clear, what Arun meant in the old PR is that the watermark and the output mode are not related: when we only deal with the watermark, we are limited to append mode in any case. So in this phase we don't (and shouldn't) bring output mode into the topic and make things complicated, unless we really have a solid plan to introduce retractions.