[SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

[SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

Jacek Laskowski
Hi,

I think I'm stuck, and without your help I won't get any further. Please help.

I'm on Spark 2.4.4, developing a streaming Source (DSv1, MicroBatch). In the getBatch phase, when the source is requested for a DataFrame, there is this assert [1] that I can't seem to get past with any DataFrame I've managed to create, as none of them is streaming:

          assert(batch.isStreaming,
            s"DataFrame returned by getBatch from $source did not have isStreaming=true\n" +
              s"${batch.queryExecution.logical}")


All I could find is private[sql] code, e.g. SQLContext.internalCreateDataFrame(..., isStreaming = true) [2] or [3].
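
For reference, here's a minimal sketch of what I've been trying (my own names; the object has to be compiled into the org.apache.spark.sql package so it can reach the private[sql] method):

    // Hypothetical helper; it must live in the org.apache.spark.sql package
    // to be allowed to call the private[sql] internalCreateDataFrame.
    package org.apache.spark.sql

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.types.StructType

    object StreamingDataFrameHack {
      def create(
          sqlContext: SQLContext,
          rows: RDD[InternalRow],
          schema: StructType): DataFrame =
        // isStreaming = true marks the logical plan as streaming, which is
        // exactly what the assert in getBatch checks for.
        sqlContext.internalCreateDataFrame(rows, schema, isStreaming = true)
    }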


Pozdrawiam,
Jacek Laskowski
----
The Internals of Spark SQL https://bit.ly/spark-sql-internals
The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming
The Internals of Apache Kafka https://bit.ly/apache-kafka-internals

Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

Jungtaek Lim-2
Looks like it's missing, or it's intended to force custom streaming sources to be implemented as DSv2.

I'm not sure the Spark community wants to expand the DSv1 API; I could propose the change if we get some support here.

To the Spark community: given that we're bringing major changes to DSv2, some users will want to rely on DSv1 while the transition from the old DSv2 to the new DSv2 happens and the new DSv2 stabilizes. Would we like to provide the necessary changes to DSv1?

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

Jacek Laskowski
Hi Jungtaek,

Thanks a lot for your very prompt response!

> Looks like it's missing, or it's intended to force custom streaming sources to be implemented as DSv2.

That's exactly my understanding: no more DSv1 data sources. That, however, is not consistent with the official message, is it? Spark 2.4.4 does not actually say "we're abandoning DSv1", and people can't really be expected to jump to DSv2 since it's not recommended (unless I missed that).

I love surprises (as that's where people pay more for consulting :)), but not necessarily right before public talks (with one at SparkAISummit in two weeks!). It's going to be challenging! I hope I won't spread the wrong word.

Pozdrawiam,
Jacek Laskowski
----
The Internals of Spark SQL https://bit.ly/spark-sql-internals
The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming
The Internals of Apache Kafka https://bit.ly/apache-kafka-internals



Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

Jungtaek Lim-2
I remembered an actual case from a developer who implements a custom data source.


Quoting here: 
We started implementing DSv2 in the 2.4 branch, but quickly discovered that the DSv2 in 3.0 was a complete breaking change (to the point where it could have been named DSv3 and it wouldn’t have come as a surprise). Since the DSv2 in 3.0 has a compatibility layer for DSv1 datasources, we decided to fall back into DSv1 in order to ease the future transition to Spark 3.

Given that DSv2 for Spark 2.x and 3.x have diverged a lot, a realistic way of dealing with the DSv2 breaking change is to keep DSv1 as a temporary solution, even once DSv2 for 3.x becomes available. Developers need some time to make the transition.

I'll file an issue to support streaming data sources on DSv1 and submit a patch unless someone objects.



Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

cloud0fan
AFAIK there is no public streaming data source API before DSv2. The Source and Sink APIs are private and are only for built-in streaming sources. Advanced users can still implement custom streaming sources with private Spark APIs (you can put your classes under the org.apache.spark.sql package to access the private methods).

That said, DSv2 is the first public streaming data source API. It's really hard to design a stable, efficient and flexible data source API that is unified between batch and streaming. DSv2 has evolved a lot in the master branch, and hopefully there will be no more big breaking changes.
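
For example, just a sketch with made-up names (not an official API): a Source compiled under org.apache.spark.sql can convert its Row data with an encoder and call the private method directly:

    // Sketch only: what getBatch of a custom DSv1 Source could return,
    // assuming the class sits under the org.apache.spark.sql package
    // (Spark 2.4 APIs).
    package org.apache.spark.sql

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.encoders.RowEncoder
    import org.apache.spark.sql.types.StructType

    object MyBatchFactory {
      def makeStreamingBatch(
          sqlContext: SQLContext,
          rows: RDD[Row],
          schema: StructType): DataFrame = {
        // Convert external Rows to InternalRows; toRow is the Spark 2.4 API
        // (replaced by createSerializer in 3.x). copy() is needed because
        // the encoder reuses its output row across calls.
        val internalRows = rows.mapPartitions { iter =>
          val encoder = RowEncoder(schema).resolveAndBind()
          iter.map(row => encoder.toRow(row).copy())
        }
        // private[sql], hence the package trick.
        sqlContext.internalCreateDataFrame(internalRows, schema, isStreaming = true)
      }
    }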



Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

Jungtaek Lim-2
Would you mind if I ask what the conditions are for being a public API? The Source/Sink traits are not marked as @DeveloperApi, but they are defined as public and live in sql-core, so they are not even semantically private (as catalyst is); it's easy to get the impression that they're public APIs.

Also, unless I'm missing something, creating a streaming DataFrame from an RDD[Row] is not available even as a private API. There are a couple of approaches using private APIs: 1) SQLContext.internalCreateDataFrame - since it requires an RDD[InternalRow], implementers also have to depend on catalyst and deal with InternalRow, which the Spark community seems to want to change eventually; 2) Dataset.ofRows - it requires a LogicalPlan, which is also in catalyst. So implementers not only need to apply the "package hack" but also need to depend on catalyst. (A sketch of approach 2 follows below.)
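
To illustrate approach 2 (a sketch with my own names; it needs the same "package hack" plus a catalyst dependency):

    // Sketch only (Spark 2.4): building a streaming DataFrame via
    // Dataset.ofRows and a LogicalRDD with isStreaming = true.
    package org.apache.spark.sql

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.execution.LogicalRDD
    import org.apache.spark.sql.types.StructType

    object OfRowsHack {
      def create(
          spark: SparkSession,
          schema: StructType,
          rows: RDD[InternalRow]): DataFrame = {
        // LogicalRDD carries the isStreaming flag the getBatch assert checks.
        val plan = LogicalRDD(schema.toAttributes, rows, isStreaming = true)(spark)
        Dataset.ofRows(spark, plan)
      }
    }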



Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

cloud0fan
> Would you mind if I ask what the conditions are for being a public API?

The module doesn't matter, but the package matters. We have many public APIs in the catalyst module as well (e.g. DataType).

There are 3 packages in Spark SQL that are meant to be private:
1. org.apache.spark.sql.catalyst
2. org.apache.spark.sql.execution
3. org.apache.spark.sql.internal

You can check out the full list of private packages of Spark in project/SparkBuild.scala#Unidoc#ignoreUndocumentedPackages

Basically, classes/interfaces that don't appear in the official Spark API doc are private.

Source/Sink traits are in org.apache.spark.sql.execution and thus they are private.


Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

Jacek Laskowski
Hi,

Thanks so much for such a thorough conversation. I enjoyed it very much.

> Source/Sink traits are in org.apache.spark.sql.execution and thus they are private.

That would explain why I couldn't find the scaladocs.

Pozdrawiam,
Jacek Laskowski
----
The Internals of Spark SQL https://bit.ly/spark-sql-internals
The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming
The Internals of Apache Kafka https://bit.ly/apache-kafka-internals

