Plan on Structured Streaming in next major/minor release?

classic Classic list List threaded Threaded
17 messages Options
Reply | Threaded
Open this post in threaded view
|

Plan on Structured Streaming in next major/minor release?

Jungtaek Lim
Hi devs,

While Spark 2.4.0 is still in progress of release votes, I'm seeing some pull requests on non-SS are being reviewed and merged into master branch, so I guess discussion about next release is OK.

Looks like there's a major TODO left on structured streaming: allowing stateful operation in continuous mode (watermark, stateful exactly-once) and no other major milestone is shared. (Please let me know if I'm missing here!) As a structured streaming contributor's point of view, there're another features we could discuss and see which are good to have, and prioritize if possible (NOTE: just a brainstorming and some items might not be valid for structured streaming):

* Native support on session window (SPARK-10816 [1])
  ** patch available
* Support delegation token on Kafka (SPARK-25501 [2])
  ** patch available
* Queryable State (SPARK-16738 [3])
  ** some discussion took place, but no action is taken yet
* End to end exactly-once with Kafka sink
  ** given Kafka is the first class on streaming source/sink nowadays
* Custom window / custom watermark
* Physically scale (up/down) streaming state
* State TTL (especially for non-watermark state)
  ** "timeout" in map/flatmapGroupsWithState fits it, but just to check whether we want to have it for normal streaming aggregation
* Provide discarded events due to late via side output or similar feature
  ** for me it looks like tricky one, since Spark's RDD as well as SQL semantic provide one output
* more?

Would like to hear others opinions about this. Please also share if there're ongoing efforts on other items for structured streaming. Happy to help out if it needs another hand.

Thanks,
Jungtaek Lim (HeartSaVioR)


Reply | Threaded
Open this post in threaded view
|

Re: Plan on Structured Streaming in next major/minor release?

Jungtaek Lim
Small correction: "timeout" in map/flatmapGroupsWithState would not work similar as State TTL when event time and watermark is set. So timeout in map/flatmapGroupsWithState is to guarantee removal of state when the state will not be used, as similar as what we do with streaming aggregation, whereas State TTL is just work as its name is represented (self-explanatory). Hence State TTL looks valid for all the cases.

2018년 10월 19일 (금) 오후 12:20, Jungtaek Lim <[hidden email]>님이 작성:
Hi devs,

While Spark 2.4.0 is still in progress of release votes, I'm seeing some pull requests on non-SS are being reviewed and merged into master branch, so I guess discussion about next release is OK.

Looks like there's a major TODO left on structured streaming: allowing stateful operation in continuous mode (watermark, stateful exactly-once) and no other major milestone is shared. (Please let me know if I'm missing here!) As a structured streaming contributor's point of view, there're another features we could discuss and see which are good to have, and prioritize if possible (NOTE: just a brainstorming and some items might not be valid for structured streaming):

* Native support on session window (SPARK-10816 [1])
  ** patch available
* Support delegation token on Kafka (SPARK-25501 [2])
  ** patch available
* Queryable State (SPARK-16738 [3])
  ** some discussion took place, but no action is taken yet
* End to end exactly-once with Kafka sink
  ** given Kafka is the first class on streaming source/sink nowadays
* Custom window / custom watermark
* Physically scale (up/down) streaming state
* State TTL (especially for non-watermark state)
  ** "timeout" in map/flatmapGroupsWithState fits it, but just to check whether we want to have it for normal streaming aggregation
* Provide discarded events due to late via side output or similar feature
  ** for me it looks like tricky one, since Spark's RDD as well as SQL semantic provide one output
* more?

Would like to hear others opinions about this. Please also share if there're ongoing efforts on other items for structured streaming. Happy to help out if it needs another hand.

Thanks,
Jungtaek Lim (HeartSaVioR)


Reply | Threaded
Open this post in threaded view
|

Re: Plan on Structured Streaming in next major/minor release?

kant kodali
+1 For Raising all this.
+1 For Queryable State (SPARK-16738 [3])

On Thu, Oct 18, 2018 at 9:59 PM Jungtaek Lim <[hidden email]> wrote:
Small correction: "timeout" in map/flatmapGroupsWithState would not work similar as State TTL when event time and watermark is set. So timeout in map/flatmapGroupsWithState is to guarantee removal of state when the state will not be used, as similar as what we do with streaming aggregation, whereas State TTL is just work as its name is represented (self-explanatory). Hence State TTL looks valid for all the cases.

2018년 10월 19일 (금) 오후 12:20, Jungtaek Lim <[hidden email]>님이 작성:
Hi devs,

While Spark 2.4.0 is still in progress of release votes, I'm seeing some pull requests on non-SS are being reviewed and merged into master branch, so I guess discussion about next release is OK.

Looks like there's a major TODO left on structured streaming: allowing stateful operation in continuous mode (watermark, stateful exactly-once) and no other major milestone is shared. (Please let me know if I'm missing here!) As a structured streaming contributor's point of view, there're another features we could discuss and see which are good to have, and prioritize if possible (NOTE: just a brainstorming and some items might not be valid for structured streaming):

* Native support on session window (SPARK-10816 [1])
  ** patch available
* Support delegation token on Kafka (SPARK-25501 [2])
  ** patch available
* Queryable State (SPARK-16738 [3])
  ** some discussion took place, but no action is taken yet
* End to end exactly-once with Kafka sink
  ** given Kafka is the first class on streaming source/sink nowadays
* Custom window / custom watermark
* Physically scale (up/down) streaming state
* State TTL (especially for non-watermark state)
  ** "timeout" in map/flatmapGroupsWithState fits it, but just to check whether we want to have it for normal streaming aggregation
* Provide discarded events due to late via side output or similar feature
  ** for me it looks like tricky one, since Spark's RDD as well as SQL semantic provide one output
* more?

Would like to hear others opinions about this. Please also share if there're ongoing efforts on other items for structured streaming. Happy to help out if it needs another hand.

Thanks,
Jungtaek Lim (HeartSaVioR)


Reply | Threaded
Open this post in threaded view
|

Re: Plan on Structured Streaming in next major/minor release?

JackyLee
In reply to this post by Jungtaek Lim
Thanks for raising them.

FYI, I believe this open issues could also be considered:

https://issues.apache.org/jira/browse/SPARK-24630
<https://issues.apache.org/jira/browse/SPARK-24630>  

An new ability to express Struct Streaming on pure SQL.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Plan on Structured Streaming in next major/minor release?

Stavros Kontopoulos-3
That is a very interesting list thanks. I could create a design doc as a starting pointing for discussion if this is a feature we would like to have.

Regards,
Stavros

On Sun, Oct 21, 2018 at 3:04 PM, JackyLee <[hidden email]> wrote:
Thanks for raising them.

FYI, I believe this open issues could also be considered:

https://issues.apache.org/jira/browse/SPARK-24630
<https://issues.apache.org/jira/browse/SPARK-24630

An new ability to express Struct Streaming on pure SQL.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]





Reply | Threaded
Open this post in threaded view
|

Re: Plan on Structured Streaming in next major/minor release?

Jungtaek Lim
Stavros, if my memory is right, you were trying to drive queryable state, right?

Could you summary the progress and the reason why the progress got stopped?

2018년 10월 21일 (일) 오후 10:27, Stavros Kontopoulos <[hidden email]>님이 작성:
That is a very interesting list thanks. I could create a design doc as a starting pointing for discussion if this is a feature we would like to have.

Regards,
Stavros

On Sun, Oct 21, 2018 at 3:04 PM, JackyLee <[hidden email]> wrote:
Thanks for raising them.

FYI, I believe this open issues could also be considered:

https://issues.apache.org/jira/browse/SPARK-24630
<https://issues.apache.org/jira/browse/SPARK-24630

An new ability to express Struct Streaming on pure SQL.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]





Reply | Threaded
Open this post in threaded view
|

Re: Plan on Structured Streaming in next major/minor release?

Stavros Kontopoulos-3
Hi Jungtaek,

I just tried to start the discussion in the dev list along time ago. 
I enumerated some uses cases as Michael proposed here. The discussion didn't go further.

If people find it useful we should start discussing it in detail again.

Stavros

On Sun, Oct 21, 2018 at 4:54 PM, Jungtaek Lim <[hidden email]> wrote:
Stavros, if my memory is right, you were trying to drive queryable state, right?

Could you summary the progress and the reason why the progress got stopped?

2018년 10월 21일 (일) 오후 10:27, Stavros Kontopoulos <[hidden email]>님이 작성:
That is a very interesting list thanks. I could create a design doc as a starting pointing for discussion if this is a feature we would like to have.

Regards,
Stavros

On Sun, Oct 21, 2018 at 3:04 PM, JackyLee <[hidden email]> wrote:
Thanks for raising them.

FYI, I believe this open issues could also be considered:

https://issues.apache.org/jira/browse/SPARK-24630
<https://issues.apache.org/jira/browse/SPARK-24630

An new ability to express Struct Streaming on pure SQL.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]







Reply | Threaded
Open this post in threaded view
|

Re: Plan on Structured Streaming in next major/minor release?

Jungtaek Lim
Yeah, the main intention of this thread is to collect interest on possible feature list for structured streaming. From what I can see in Spark community, most of the discussions as well as contributions are for SQL, and I'd wish to see similar activeness / efforts on structured streaming.
(Unfortunately there's less effort to review others' works - design doc as well as pull request - most of efforts looks like being spent to their own works.)

I respect the role of PMC member, so the final decision would be up to PMC members, but contributors as well as end users could show the interest as well as discuss about requirements on SPIP, which could be a good background to persuade PMC members. 

Before going into the deep I guess we could use this thread to discuss about possible use cases, and if we would like to move forward to individual thread we could initiate (or resurrect) its discussion thread.

For queryable state, at least there seems no workaround in Spark to provide similar thing, especially state is getting bigger. I may have some concerns on the details, but I'll add my thought on the discussion thread.

- Jungtaek Lim (HeartSaVioR)

2018년 10월 22일 (월) 오전 1:15, Stavros Kontopoulos <[hidden email]>님이 작성:
Hi Jungtaek,

I just tried to start the discussion in the dev list along time ago. 
I enumerated some uses cases as Michael proposed here. The discussion didn't go further.

If people find it useful we should start discussing it in detail again.

Stavros

On Sun, Oct 21, 2018 at 4:54 PM, Jungtaek Lim <[hidden email]> wrote:
Stavros, if my memory is right, you were trying to drive queryable state, right?

Could you summary the progress and the reason why the progress got stopped?

2018년 10월 21일 (일) 오후 10:27, Stavros Kontopoulos <[hidden email]>님이 작성:
That is a very interesting list thanks. I could create a design doc as a starting pointing for discussion if this is a feature we would like to have.

Regards,
Stavros

On Sun, Oct 21, 2018 at 3:04 PM, JackyLee <[hidden email]> wrote:
Thanks for raising them.

FYI, I believe this open issues could also be considered:

https://issues.apache.org/jira/browse/SPARK-24630
<https://issues.apache.org/jira/browse/SPARK-24630

An new ability to express Struct Streaming on pure SQL.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]







Reply | Threaded
Open this post in threaded view
|

Re: Plan on Structured Streaming in next major/minor release?

Jungtaek Lim
Adding more: again, it doesn't mean they're feasible to do. Just a kind of brainstorming.

* SPARK-20568: Delete files after processing in structured streaming
  * There hasn't been consensus regarding supporting this: there were voices for both YES and NO.
* Support multiple levels of aggregations in structured streaming
  * There're plenty of questions in SO regarding this. While I don't think it makes sense on structured streaming if it requires additional shuffle, there might be another case: group by keys, apply aggregation, apply aggregation on aggregated result (grouped keys don't change)

2018년 10월 22일 (월) 오후 12:25, Jungtaek Lim <[hidden email]>님이 작성:
Yeah, the main intention of this thread is to collect interest on possible feature list for structured streaming. From what I can see in Spark community, most of the discussions as well as contributions are for SQL, and I'd wish to see similar activeness / efforts on structured streaming.
(Unfortunately there's less effort to review others' works - design doc as well as pull request - most of efforts looks like being spent to their own works.)

I respect the role of PMC member, so the final decision would be up to PMC members, but contributors as well as end users could show the interest as well as discuss about requirements on SPIP, which could be a good background to persuade PMC members. 

Before going into the deep I guess we could use this thread to discuss about possible use cases, and if we would like to move forward to individual thread we could initiate (or resurrect) its discussion thread.

For queryable state, at least there seems no workaround in Spark to provide similar thing, especially state is getting bigger. I may have some concerns on the details, but I'll add my thought on the discussion thread.

- Jungtaek Lim (HeartSaVioR)

2018년 10월 22일 (월) 오전 1:15, Stavros Kontopoulos <[hidden email]>님이 작성:
Hi Jungtaek,

I just tried to start the discussion in the dev list along time ago. 
I enumerated some uses cases as Michael proposed here. The discussion didn't go further.

If people find it useful we should start discussing it in detail again.

Stavros

On Sun, Oct 21, 2018 at 4:54 PM, Jungtaek Lim <[hidden email]> wrote:
Stavros, if my memory is right, you were trying to drive queryable state, right?

Could you summary the progress and the reason why the progress got stopped?

2018년 10월 21일 (일) 오후 10:27, Stavros Kontopoulos <[hidden email]>님이 작성:
That is a very interesting list thanks. I could create a design doc as a starting pointing for discussion if this is a feature we would like to have.

Regards,
Stavros

On Sun, Oct 21, 2018 at 3:04 PM, JackyLee <[hidden email]> wrote:
Thanks for raising them.

FYI, I believe this open issues could also be considered:

https://issues.apache.org/jira/browse/SPARK-24630
<https://issues.apache.org/jira/browse/SPARK-24630

An new ability to express Struct Streaming on pure SQL.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]







Reply | Threaded
Open this post in threaded view
|

Re: Plan on Structured Streaming in next major/minor release?

Michael Armbrust
Thanks for bringing up some possible future directions for streaming. Here are some thoughts:
 - I personally view all of the activity on Spark SQL also as activity on Structured Streaming. The great thing about building streaming on catalyst / tungsten is that continued improvement to these components improves streaming use cases as well.
 - I think the biggest on-going project is DataSourceV2, whose goal is to provide a stable / performant API for streaming and batch data sources to plug in.  I think connectivity to many different systems is one of the most powerful aspects of Spark and right now there is no stable public API for streaming. A lot of committer / PMC time is being spent here at the moment.
 - As you mention, 2.4.0 significantly improves the built in connectivity for Kafka, giving us the ability to read exactly once from a topic being written to transactional producers. I think projects to extend this guarantee to the Kafka Sink and also to improve authentication with Kafka are a great idea (and it seems like there is a lot of review activity on the latter).

You bring up some other possible projects like session window support.  This is an interesting project, but as far as I can tell it still there is still a lot of work that would need to be done before this feature could be merged.  We'd need to understand how it works with update mode amongst other things. Additionally, a 3000+ line patch is really time consuming to review. This coupled with the fact that all the users that I have interacted with need "session windows + some custom business logic" (usually implemented with flatMapGroupsWithState), mean that I'm more inclined to direct limited review bandwidth to incremental improvements in that feature than to something large/new. This is not to say that this feature isn't useful / shouldn't be merge, just a bit of explanation as to why there might be less activity here than you would hope.

Similarly, multiple aggregations are an often requested feature.  However, fundamentally, this is going to be a fairly large investment (I think we'd need to combine the unsupported operation checker and the query planner and also create a high performance (i.e. whole stage code-gened) aggregation operator that understands negation).

Thanks again for starting the discussion, and looking forward to hearing about what features are most requested!

On Tue, Oct 30, 2018 at 12:23 AM Jungtaek Lim <[hidden email]> wrote:
Adding more: again, it doesn't mean they're feasible to do. Just a kind of brainstorming.

* SPARK-20568: Delete files after processing in structured streaming
  * There hasn't been consensus regarding supporting this: there were voices for both YES and NO.
* Support multiple levels of aggregations in structured streaming
  * There're plenty of questions in SO regarding this. While I don't think it makes sense on structured streaming if it requires additional shuffle, there might be another case: group by keys, apply aggregation, apply aggregation on aggregated result (grouped keys don't change)

2018년 10월 22일 (월) 오후 12:25, Jungtaek Lim <[hidden email]>님이 작성:
Yeah, the main intention of this thread is to collect interest on possible feature list for structured streaming. From what I can see in Spark community, most of the discussions as well as contributions are for SQL, and I'd wish to see similar activeness / efforts on structured streaming.
(Unfortunately there's less effort to review others' works - design doc as well as pull request - most of efforts looks like being spent to their own works.)

I respect the role of PMC member, so the final decision would be up to PMC members, but contributors as well as end users could show the interest as well as discuss about requirements on SPIP, which could be a good background to persuade PMC members. 

Before going into the deep I guess we could use this thread to discuss about possible use cases, and if we would like to move forward to individual thread we could initiate (or resurrect) its discussion thread.

For queryable state, at least there seems no workaround in Spark to provide similar thing, especially state is getting bigger. I may have some concerns on the details, but I'll add my thought on the discussion thread.

- Jungtaek Lim (HeartSaVioR)

2018년 10월 22일 (월) 오전 1:15, Stavros Kontopoulos <[hidden email]>님이 작성:
Hi Jungtaek,

I just tried to start the discussion in the dev list along time ago. 
I enumerated some uses cases as Michael proposed here. The discussion didn't go further.

If people find it useful we should start discussing it in detail again.

Stavros

On Sun, Oct 21, 2018 at 4:54 PM, Jungtaek Lim <[hidden email]> wrote:
Stavros, if my memory is right, you were trying to drive queryable state, right?

Could you summary the progress and the reason why the progress got stopped?

2018년 10월 21일 (일) 오후 10:27, Stavros Kontopoulos <[hidden email]>님이 작성:
That is a very interesting list thanks. I could create a design doc as a starting pointing for discussion if this is a feature we would like to have.

Regards,
Stavros

On Sun, Oct 21, 2018 at 3:04 PM, JackyLee <[hidden email]> wrote:
Thanks for raising them.

FYI, I believe this open issues could also be considered:

https://issues.apache.org/jira/browse/SPARK-24630
<https://issues.apache.org/jira/browse/SPARK-24630

An new ability to express Struct Streaming on pure SQL.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]







Reply | Threaded
Open this post in threaded view
|

Re: Plan on Structured Streaming in next major/minor release?

Stavros Kontopoulos-3
@Michael any update about queryable state?

Stavros

On Tue, Oct 30, 2018 at 10:43 PM, Michael Armbrust <[hidden email]> wrote:
Thanks for bringing up some possible future directions for streaming. Here are some thoughts:
 - I personally view all of the activity on Spark SQL also as activity on Structured Streaming. The great thing about building streaming on catalyst / tungsten is that continued improvement to these components improves streaming use cases as well.
 - I think the biggest on-going project is DataSourceV2, whose goal is to provide a stable / performant API for streaming and batch data sources to plug in.  I think connectivity to many different systems is one of the most powerful aspects of Spark and right now there is no stable public API for streaming. A lot of committer / PMC time is being spent here at the moment.
 - As you mention, 2.4.0 significantly improves the built in connectivity for Kafka, giving us the ability to read exactly once from a topic being written to transactional producers. I think projects to extend this guarantee to the Kafka Sink and also to improve authentication with Kafka are a great idea (and it seems like there is a lot of review activity on the latter).

You bring up some other possible projects like session window support.  This is an interesting project, but as far as I can tell it still there is still a lot of work that would need to be done before this feature could be merged.  We'd need to understand how it works with update mode amongst other things. Additionally, a 3000+ line patch is really time consuming to review. This coupled with the fact that all the users that I have interacted with need "session windows + some custom business logic" (usually implemented with flatMapGroupsWithState), mean that I'm more inclined to direct limited review bandwidth to incremental improvements in that feature than to something large/new. This is not to say that this feature isn't useful / shouldn't be merge, just a bit of explanation as to why there might be less activity here than you would hope.

Similarly, multiple aggregations are an often requested feature.  However, fundamentally, this is going to be a fairly large investment (I think we'd need to combine the unsupported operation checker and the query planner and also create a high performance (i.e. whole stage code-gened) aggregation operator that understands negation).

Thanks again for starting the discussion, and looking forward to hearing about what features are most requested!

On Tue, Oct 30, 2018 at 12:23 AM Jungtaek Lim <[hidden email]> wrote:
Adding more: again, it doesn't mean they're feasible to do. Just a kind of brainstorming.

* SPARK-20568: Delete files after processing in structured streaming
  * There hasn't been consensus regarding supporting this: there were voices for both YES and NO.
* Support multiple levels of aggregations in structured streaming
  * There're plenty of questions in SO regarding this. While I don't think it makes sense on structured streaming if it requires additional shuffle, there might be another case: group by keys, apply aggregation, apply aggregation on aggregated result (grouped keys don't change)

2018년 10월 22일 (월) 오후 12:25, Jungtaek Lim <[hidden email]>님이 작성:
Yeah, the main intention of this thread is to collect interest on possible feature list for structured streaming. From what I can see in Spark community, most of the discussions as well as contributions are for SQL, and I'd wish to see similar activeness / efforts on structured streaming.
(Unfortunately there's less effort to review others' works - design doc as well as pull request - most of efforts looks like being spent to their own works.)

I respect the role of PMC member, so the final decision would be up to PMC members, but contributors as well as end users could show the interest as well as discuss about requirements on SPIP, which could be a good background to persuade PMC members. 

Before going into the deep I guess we could use this thread to discuss about possible use cases, and if we would like to move forward to individual thread we could initiate (or resurrect) its discussion thread.

For queryable state, at least there seems no workaround in Spark to provide similar thing, especially state is getting bigger. I may have some concerns on the details, but I'll add my thought on the discussion thread.

- Jungtaek Lim (HeartSaVioR)

2018년 10월 22일 (월) 오전 1:15, Stavros Kontopoulos <[hidden email]>님이 작성:
Hi Jungtaek,

I just tried to start the discussion in the dev list along time ago. 
I enumerated some uses cases as Michael proposed here. The discussion didn't go further.

If people find it useful we should start discussing it in detail again.

Stavros

On Sun, Oct 21, 2018 at 4:54 PM, Jungtaek Lim <[hidden email]> wrote:
Stavros, if my memory is right, you were trying to drive queryable state, right?

Could you summary the progress and the reason why the progress got stopped?

2018년 10월 21일 (일) 오후 10:27, Stavros Kontopoulos <[hidden email]>님이 작성:
That is a very interesting list thanks. I could create a design doc as a starting pointing for discussion if this is a feature we would like to have.

Regards,
Stavros

On Sun, Oct 21, 2018 at 3:04 PM, JackyLee <[hidden email]> wrote:
Thanks for raising them.

FYI, I believe this open issues could also be considered:

https://issues.apache.org/jira/browse/SPARK-24630
<https://issues.apache.org/jira/browse/SPARK-24630

An new ability to express Struct Streaming on pure SQL.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]









Reply | Threaded
Open this post in threaded view
|

Re: Plan on Structured Streaming in next major/minor release?

Jungtaek Lim
In reply to this post by Michael Armbrust
Thanks Micheal for explaining activity on SS as well as giving opinion on some items!

Replying inline.

2018년 10월 31일 (수) 오전 5:44, Michael Armbrust <[hidden email]>님이 작성:
Thanks for bringing up some possible future directions for streaming. Here are some thoughts:
 - I personally view all of the activity on Spark SQL also as activity on Structured Streaming. The great thing about building streaming on catalyst / tungsten is that continued improvement to these components improves streaming use cases as well.

While I agree with you (in terms of performance and improvement on built-in functions), like I enumerated, streaming area has its own features which require another efforts. It would be also great if someone resumes putting major efforts on continuous mode (Spark specific project): I guess we were waiting on barrier execution. I'm happy to help on reviewing design doc, taking up implementing part(s) of.
 
 - I think the biggest on-going project is DataSourceV2, whose goal is to provide a stable / performant API for streaming and batch data sources to plug in.  I think connectivity to many different systems is one of the most powerful aspects of Spark and right now there is no stable public API for streaming. A lot of committer / PMC time is being spent here at the moment.

100% agree that DSv2 should be the thing to be stabilized sooner than later, and understand major efforts are going there.
 
 - As you mention, 2.4.0 significantly improves the built in connectivity for Kafka, giving us the ability to read exactly once from a topic being written to transactional producers. I think projects to extend this guarantee to the Kafka Sink and also to improve authentication with Kafka are a great idea (and it seems like there is a lot of review activity on the latter).

Actually I was spending time to design former, and realized that it should give up either scalability or transactional to respect Spark's contract on exactly-once. (Most of storages don't support transaction on multiple connections so transaction can't be achieved among tasks. They also don't support moving data without resending.) That's what I sent a mail in different mail thread on lessening contract. I think it is related to DSv2 and need to be considered while discussing DSv2, since the issue is not only for Kafka, but also most of external storages.
 
You bring up some other possible projects like session window support.  This is an interesting project, but as far as I can tell it still there is still a lot of work that would need to be done before this feature could be merged.  We'd need to understand how it works with update mode amongst other things. Additionally, a 3000+ line patch is really time consuming to review. This coupled with the fact that all the users that I have interacted with need "session windows + some custom business logic" (usually implemented with flatMapGroupsWithState), mean that I'm more inclined to direct limited review bandwidth to incremental improvements in that feature than to something large/new. This is not to say that this feature isn't useful / shouldn't be merge, just a bit of explanation as to why there might be less activity here than you would hope.

Yeah while I would like to get another feedbacks on session window stuff (because without feedback I need to explore all possible paths by myself without any help), I didn't intend to get attraction on session window in this mail thread. The rationalization on this mail thread is to get attraction on broader area: features support on streaming area. 

Anyway, thanks for explaining! For individual contributors, determining whether the proposal is (softly) rejected or not is very important in terms of further investigation, and it helped much on understanding current status.
 
Similarly, multiple aggregations are an often requested feature.  However, fundamentally, this is going to be a fairly large investment (I think we'd need to combine the unsupported operation checker and the query planner and also create a high performance (i.e. whole stage code-gened) aggregation operator that understands negation).

Agree. Just curious, could you explain what do you mean by "negation"? Does it mean applying retraction on aggregated?
 
Thanks again for starting the discussion, and looking forward to hearing about what features are most requested!

On Tue, Oct 30, 2018 at 12:23 AM Jungtaek Lim <[hidden email]> wrote:
Adding more: again, it doesn't mean they're feasible to do. Just a kind of brainstorming.

* SPARK-20568: Delete files after processing in structured streaming
  * There hasn't been consensus regarding supporting this: there were voices for both YES and NO.
* Support multiple levels of aggregations in structured streaming
  * There're plenty of questions in SO regarding this. While I don't think it makes sense on structured streaming if it requires additional shuffle, there might be another case: group by keys, apply aggregation, apply aggregation on aggregated result (grouped keys don't change)

2018년 10월 22일 (월) 오후 12:25, Jungtaek Lim <[hidden email]>님이 작성:
Yeah, the main intention of this thread is to collect interest on possible feature list for structured streaming. From what I can see in Spark community, most of the discussions as well as contributions are for SQL, and I'd wish to see similar activeness / efforts on structured streaming.
(Unfortunately there's less effort to review others' works - design doc as well as pull request - most of efforts looks like being spent to their own works.)

I respect the role of PMC member, so the final decision would be up to PMC members, but contributors as well as end users could show the interest as well as discuss about requirements on SPIP, which could be a good background to persuade PMC members. 

Before going into the deep I guess we could use this thread to discuss about possible use cases, and if we would like to move forward to individual thread we could initiate (or resurrect) its discussion thread.

For queryable state, at least there seems no workaround in Spark to provide similar thing, especially state is getting bigger. I may have some concerns on the details, but I'll add my thought on the discussion thread.

- Jungtaek Lim (HeartSaVioR)

2018년 10월 22일 (월) 오전 1:15, Stavros Kontopoulos <[hidden email]>님이 작성:
Hi Jungtaek,

I just tried to start the discussion in the dev list along time ago. 
I enumerated some uses cases as Michael proposed here. The discussion didn't go further.

If people find it useful we should start discussing it in detail again.

Stavros

On Sun, Oct 21, 2018 at 4:54 PM, Jungtaek Lim <[hidden email]> wrote:
Stavros, if my memory is right, you were trying to drive queryable state, right?

Could you summary the progress and the reason why the progress got stopped?

2018년 10월 21일 (일) 오후 10:27, Stavros Kontopoulos <[hidden email]>님이 작성:
That is a very interesting list thanks. I could create a design doc as a starting pointing for discussion if this is a feature we would like to have.

Regards,
Stavros

On Sun, Oct 21, 2018 at 3:04 PM, JackyLee <[hidden email]> wrote:
Thanks for raising them.

FYI, I believe this open issues could also be considered:

https://issues.apache.org/jira/browse/SPARK-24630
<https://issues.apache.org/jira/browse/SPARK-24630

An new ability to express Struct Streaming on pure SQL.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]







Reply | Threaded
Open this post in threaded view
|

Re: Plan on Structured Streaming in next major/minor release?

Michael Armbrust
Agree. Just curious, could you explain what do you mean by "negation"? Does it mean applying retraction on aggregated?

Yeah exactly.  Our current streaming aggregation assumes that the input is in append-mode and multiple aggregations break this.
Reply | Threaded
Open this post in threaded view
|

Re: Plan on Structured Streaming in next major/minor release?

Jungtaek Lim
OK thanks for clarifying. I guess it is one of major features in streaming area and nice to add, but also agree it would require huge investigation.

2018년 10월 31일 (수) 오전 8:06, Michael Armbrust <[hidden email]>님이 작성:
Agree. Just curious, could you explain what do you mean by "negation"? Does it mean applying retraction on aggregated?

Yeah exactly.  Our current streaming aggregation assumes that the input is in append-mode and multiple aggregations break this.
Reply | Threaded
Open this post in threaded view
|

Re: Plan on Structured Streaming in next major/minor release?

kant kodali
If I can add one thing to this list I would say stateless aggregations using Raw SQL.

For example: As I read micro-batches from Kafka I want to do say a count of that micro batch and spit it out using Raw SQL . (No Count aggregation across batches.)



On Tue, Oct 30, 2018 at 4:55 PM Jungtaek Lim <[hidden email]> wrote:
OK thanks for clarifying. I guess it is one of major features in streaming area and nice to add, but also agree it would require huge investigation.

2018년 10월 31일 (수) 오전 8:06, Michael Armbrust <[hidden email]>님이 작성:
Agree. Just curious, could you explain what do you mean by "negation"? Does it mean applying retraction on aggregated?

Yeah exactly.  Our current streaming aggregation assumes that the input is in append-mode and multiple aggregations break this.
Reply | Threaded
Open this post in threaded view
|

Re: Plan on Structured Streaming in next major/minor release?

Jungtaek Lim
My 2 cents, "micro-batch" is the way how Spark handles stream, not a semantic we are considering. Semantically and ideally, same SQL query should provide same result between batch and streaming except late events once the operations in query are supported.

2018년 11월 2일 (금) 오후 3:54, kant kodali <[hidden email]>님이 작성:
If I can add one thing to this list I would say stateless aggregations using Raw SQL.

For example: As I read micro-batches from Kafka I want to do say a count of that micro batch and spit it out using Raw SQL . (No Count aggregation across batches.)



On Tue, Oct 30, 2018 at 4:55 PM Jungtaek Lim <[hidden email]> wrote:
OK thanks for clarifying. I guess it is one of major features in streaming area and nice to add, but also agree it would require huge investigation.

2018년 10월 31일 (수) 오전 8:06, Michael Armbrust <[hidden email]>님이 작성:
Agree. Just curious, could you explain what do you mean by "negation"? Does it mean applying retraction on aggregated?

Yeah exactly.  Our current streaming aggregation assumes that the input is in append-mode and multiple aggregations break this.
Reply | Threaded
Open this post in threaded view
|

Re: Plan on Structured Streaming in next major/minor release?

JackyLee
Can these things be added into this list?
1. [SPARK-24630] Support SQLStreaming in Spark
      This patch defines the Table API for StructStreaming
2. [SPARK-25937] Support user-defined schema in Kafka Source & Sink
      This patch make user easier to work with StructStreaming
3. SS supports dynamic partition scheduling
       SS uses the serial execution engine, which means, SS can not catch up
with data output effectively when back pressure or computing speed is
reduced. If the dynamic partition scheduling for SS can be realized, the
partition number will be automatically increased when needed, then SS can
effectively catch up with the calculation speed.The main idea is to replace
time with computing resources.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]