Support SqlStreaming in spark

Support SqlStreaming in spark

JackyLee
Hello

Nowadays, more and more streaming products support SQL streaming, such as
Kafka SQL, Flink SQL and Storm SQL. Supporting SQL streaming not only lowers
the barrier to entry for streaming, but also makes streaming easier for
everyone to adopt.

At present, Structured Streaming is relatively mature, and it is based on the
Dataset API, which makes it possible to provide a SQL entry point for
Structured Streaming and to run Structured Streaming in SQL.

To support SQL streaming, there are two key points:
1. Analysis should be able to parse streaming-type SQL.
2. The analyzer should be able to map metadata information to the
corresponding Relation.
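
The two points above can be illustrated with a toy sketch in plain Python. It is not Spark code: the `STREAM` keyword, the `CATALOG` dictionary and the `Relation` class are assumptions made up for this example, standing in for a parser extension and a metastore lookup.

```python
import re
from dataclasses import dataclass

# Hypothetical catalog: table name -> metadata, standing in for a metastore.
CATALOG = {
    "clicks": {"format": "kafka", "streaming": True},
    "users": {"format": "parquet", "streaming": False},
}


@dataclass
class Relation:
    table: str
    format: str
    streaming: bool


# Matches "SELECT [STREAM] ... FROM <table>"; STREAM is an assumed keyword.
_QUERY = re.compile(
    r"^\s*SELECT\s+(STREAM\s+)?.+?\s+FROM\s+(\w+)",
    re.IGNORECASE | re.DOTALL,
)


def resolve(sql: str) -> Relation:
    """Point 1: parse the statement. Point 2: map the table to a relation."""
    match = _QUERY.match(sql)
    if match is None:
        raise ValueError(f"unsupported statement: {sql!r}")
    wants_stream = match.group(1) is not None
    table = match.group(2)
    meta = CATALOG.get(table)
    if meta is None:
        raise KeyError(f"table not found: {table}")
    if wants_stream and not meta["streaming"]:
        raise ValueError(f"{table} is not a streaming table")
    return Relation(table, meta["format"], wants_stream)
```

Calling `resolve("SELECT STREAM * FROM clicks")` yields a streaming relation backed by the `kafka` entry, while the same call on `users` is rejected because that table is batch-only.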

Running Structured Streaming in SQL can bring several benefits:
1. It lowers the entry threshold of Structured Streaming and attracts users
more easily.
2. It encapsulates the metadata of a source or sink in a table, so that it can
be maintained and managed uniformly and is more accessible to users.
3. Metadata permission management, which is based on Hive, allows tighter
control over the overall authorization scheme for Structured Streaming.

We have found some ways to solve this problem. It would be a pleasure to
discuss them with you.

Thanks,  

Jackey Lee



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re:Re: Support SqlStreaming in spark

JackyLee
The repo you pointed to may solve some of the SqlStreaming problems, but it is not user-friendly enough: users need to learn its new syntax.

--
Jacky Lee

At 2018-06-15 11:48:01, "Bowden, Chris" <[hidden email]> wrote:

Not sure if there is a question in here, but if you are hinting that structured streaming should support a sql interface, spark has appropriate extensibility hooks to make it possible. However, the most powerful construct in structured streaming is quite difficult to find a sql equivalent for (e.g., flatMapGroupsWithState). This repo could use some cleanup but is an example of providing a sql interface to a subset of structured streaming's functionality: https://github.com/vertica/pstl/blob/master/pstl/src/main/antlr4/org/apache/spark/sql/catalyst/parser/pstl/PstlSqlBase.g4.



From: JackyLee <[hidden email]>
Sent: Thursday, June 14, 2018 7:06:17 PM
To: [hidden email]
Subject: Support SqlStreaming in spark
 
Re: Support SqlStreaming in spark

Shixiong(Ryan) Zhu
In reply to this post by JackyLee
Structured Streaming supports the same standard SQL as batch queries, so users can easily switch their queries between batch and streaming. Could you clarify what problems SqlStreaming solves and what the benefits of the new syntax are?
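
A minimal PySpark sketch of that point, assuming an existing `SparkSession` is passed in as `spark`: the same SQL text runs over a batch view or a streaming view. The view name `events`, the query and the helper `run` are illustrative choices; `rate` is Spark's built-in test source.

```python
# Requires pyspark at call time; `spark` is an existing SparkSession, e.g.
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.master("local[2]").getOrCreate()

QUERY = """
SELECT value % 3 AS bucket, count(*) AS n
FROM events
GROUP BY value % 3
"""


def run(spark, streaming: bool):
    """Register `events` as a batch or streaming view, then run QUERY unchanged."""
    if streaming:
        # The "rate" source emits rows with (timestamp, value) columns.
        events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
    else:
        events = spark.range(100).withColumnRenamed("id", "value")
    events.createOrReplaceTempView("events")

    result = spark.sql(QUERY)
    if result.isStreaming:
        # A streaming aggregation needs an explicit sink and output mode.
        return result.writeStream.outputMode("complete").format("console").start()
    result.show()
    return None
```

Only the way the view is created and the way the result is consumed differ; the SQL statement itself is identical in both modes.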

Best Regards,

Ryan

Re: Support SqlStreaming in spark

JackyLee
Spark JIRA:
https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24630

Benefits:

Firstly, users who are unfamiliar with streaming can easily use SQL to run
Structured Streaming, especially when migrating offline tasks to real-time
processing tasks.
Secondly, supporting a SQL API in Structured Streaming also makes it possible
to combine Structured Streaming with Hive. Users can store the source/sink
metadata in a table and use the Hive metastore to manage it. Users who want to
read this data can easily create a stream by accessing the table, which can
greatly reduce the development and maintenance costs of Structured Streaming.
Finally, it becomes easy to achieve unified management and access control of
sources and sinks, and the management of private data becomes more
controllable, especially in areas such as finance or security.



Re: Support SqlStreaming in spark

JackyLee
The code of SQLStreaming has been pushed:

https://github.com/apache/spark/pull/22575



Re: Support SqlStreaming in spark

cloud0fan
It would be great to add pure-SQL support to Structured Streaming. I think it goes without saying how important SQL support is, but we should produce a complete design first.

The Kafka streaming syntax has CREATE STREAM and WINDOW TUMBLING. Shall we check other streaming systems with SQL support, and justify the places where we are going to differ?

We should also take into account the full lifecycle:
1. How to restart a streaming query from a checkpoint?
2. How to stop a streaming query?
3. How to check the status/progress of a streaming query?
4. ...

Basically, we should check what functions the DataStreamReader/Writer APIs support, and see if we can support them with SQL as well.
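
For reference, a sketch of how these lifecycle operations look in the existing PySpark API, which a SQL surface would need to match. The helper names and directory arguments are hypothetical; the calls themselves (`checkpointLocation`, `status`, `lastProgress`, `stop`, `spark.streams.active`) are part of Structured Streaming.

```python
def start_or_resume(spark, out_dir: str, checkpoint_dir: str):
    """Start a query; if checkpoint_dir already holds state, it resumes from there."""
    events = spark.readStream.format("rate").load()
    return (
        events.writeStream.format("parquet")
        .option("path", out_dir)
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )


def describe(query) -> dict:
    """Collect the status and latest progress of a StreamingQuery."""
    return {
        "id": query.id,
        "active": query.isActive,
        "status": query.status,
        "last_progress": query.lastProgress,
    }


def stop_all(spark) -> int:
    """Stop every active query in this session; return how many were stopped."""
    active = list(spark.streams.active)
    for query in active:
        query.stop()
    return len(active)
```

Restarting is not a separate call here: running `start_or_resume` again with the same `checkpoint_dir` is what resumes the query, which is one of the behaviours a SQL statement would have to express.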


Thanks for your proposal!
Wenchen

On Mon, Oct 22, 2018 at 11:15 AM JackyLee <[hidden email]> wrote:
The code of SQLStreaming has been pushed:

https://github.com/apache/spark/pull/22575



Re: Support SqlStreaming in spark

Arun Mahadevan
There have been efforts to come up with a unified syntax for streaming (see [1] [2]), but I guess there will be differences based on the streaming features supported by each system.

I agree that it needs a detailed design, and it can be as close to the Spark batch SQL syntax as possible.

Also, I am not sure whether it is possible or makes sense to express all the operations via pure SQL. For example, query start/stop, triggers, watermarks, etc. might be better expressed via APIs.
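
As a concrete case of that split, here is a hedged PySpark sketch in which the aggregation is plain SQL while the watermark and the trigger are set through the API. `events` is assumed to be a streaming DataFrame with a `timestamp` column; the view name, window size and helper name are made up for illustration.

```python
WINDOWED_COUNT = """
SELECT window(timestamp, '5 minutes') AS win, count(*) AS n
FROM events_wm
GROUP BY window(timestamp, '5 minutes')
"""


def start_windowed_count(spark, events, delay="10 minutes", every="1 minute"):
    """Run a SQL window aggregation with API-level watermark and trigger."""
    # Watermark: how late data may arrive before it is dropped (set via API).
    events.withWatermark("timestamp", delay).createOrReplaceTempView("events_wm")

    counts = spark.sql(WINDOWED_COUNT)

    # Trigger: how often a micro-batch is started (also set via API).
    return (
        counts.writeStream.outputMode("update")
        .format("console")
        .trigger(processingTime=every)
        .start()
    )
```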



On Fri, 21 Dec 2018 at 18:13, Wenchen Fan <[hidden email]> wrote:

Re: Support SqlStreaming in spark

JackyLee
Hi Wenchen and Arun Mahadevan,
    Thanks for your replies.

    SQLStreaming is not just a way to support pure SQL, but also a way to
define a table API for streaming.
    I have redefined SQLStreaming so that it supports a table API. Users can
use either SQL or the table API to run SQLStreaming.

    I will update the design document of SQLStreaming. Could you help me
improve the design doc?

    Again, thanks for your attention.



Re: Support SqlStreaming in spark

JackyLee
In reply to this post by cloud0fan
Hi Wenchen,
    I have been working on SQLStreaming for a year, and I have promoted it
within my company.
    I have looked at the designs of Kafka and Calcite, and I believe my design
is better than theirs. They support pure SQL but not a table API for
streaming. Users can only use the dedicated streaming statements, and the same
statement cannot run batch queries.
    In my opinion, the Table API is actually the key to SQLStreaming;
pure SQL is just another expression of the Streaming Table API.



Re: Support SqlStreaming in spark

cloud0fan
Hi JackyLee,

Can you put the answers to these questions in the design doc?

For example, if we don't want to support manipulating a streaming query, is `SELECT STREAM ...` then a blocking action? How can users create a Spark application with multiple streaming jobs? How can users run Structured Streaming interactively?
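
For context on the blocking question: in the existing API, `start()` returns immediately, which is what lets one application run several queries. A sketch in PySpark follows; `base_dir`, the query names and the helper are hypothetical.

```python
def start_many(spark, names, base_dir: str) -> list:
    """Start one independent query per name; none of the calls block."""
    queries = []
    for name in names:
        events = spark.readStream.format("rate").load()
        queries.append(
            events.writeStream.queryName(name)
            .format("parquet")
            .option("path", f"{base_dir}/{name}/data")
            .option("checkpointLocation", f"{base_dir}/{name}/checkpoint")
            .start()
        )
    return queries


# Blocking is opt-in: spark.streams.awaitAnyTermination() waits until
# one of the started queries stops or fails.
```

A blocking `SELECT STREAM ...` statement would have no obvious counterpart to this pattern, which is why the design needs to say how several streams coexist in one session.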

On Sat, Dec 22, 2018 at 3:04 PM JackyLee <[hidden email]> wrote:

Re: Support SqlStreaming in spark

JackyLee
No problem



Re: Support SqlStreaming in spark

JackyLee
In reply to this post by cloud0fan
Hi, Wenchen

Thank you for recognizing the value of streaming on SQL. I have written the
SQLStreaming design document:
https://docs.google.com/document/d/19degwnIIcuMSELv6BQ_1VQI5AIVcvGeqOm5xE2-aRA0/edit#

Your questions are answered here:
https://docs.google.com/document/d/19degwnIIcuMSELv6BQ_1VQI5AIVcvGeqOm5xE2-aRA0/edit#heading=h.t96f9l205fk1

There may be some details that I have not considered; we can discuss them in
more depth.

Thanks



Re: Support SqlStreaming in spark

sujith71955
Hi All,

 A few more updates have been added to the design document since the last version, which a few folks reviewed and provided input on. I request all experts to review the design document and help us baseline the design for the SPIP
'Support SQL streaming' in Spark Structured Streaming. A few more sections have been added in order to handle some scenarios, as below:

1) Passing stream-level configurations to the SQL command instead of setting them at the session/application level.

2) Supporting multiple streams in a single application, etc.


Link to the design document


A few questions have already been clarified by Jacky; please find them through the link below

Regards,
Sujith

On Thu, Dec 27, 2018 at 6:39 PM JackyLee <[hidden email]> wrote:

Re: Support SqlStreaming in spark

uncleGen
Hi all,

I have rewritten the design doc based on the previous discussion.
https://docs.google.com/document/d/19degwnIIcuMSELv6BQ_1VQI5AIVcvGeqOm5xE2-aRA0

Would be interested to hear what others think.

Regards,
Genmao Yu


