[VOTE] [SPIP] SPARK-15689: Data Source API V2 read path


[VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

cloud0fan
Hi all,

In the previous discussion, we decided to split the read and write paths of Data Source V2 into two SPIPs, and I'm sending this email to call a vote for the Data Source V2 read path only.

The full document of the Data Source API V2 is:

The ready-for-review PR that implements the basic infrastructure for the read path is:

The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical reasons.

Thanks!

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

cloud0fan
adding my own +1 (binding)



Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

Xiao Li
+1

Xiao




Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

Sameer Agarwal
+1


--
Sameer Agarwal
Software Engineer | Databricks Inc.

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

Suresh Thalamati
In reply to this post by cloud0fan
+1 (non-binding)




Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

Andrew Ash
+0 (non-binding)

I think there are benefits to unifying all the Spark-internal datasources into a common public API for sure.  It will serve as a forcing function to ensure that those internal datasources aren't advantaged vs datasources developed externally as plugins to Spark, and that all Spark features are available to all datasources.

But I also think this read-path proposal avoids the more difficult questions around how to continue pushing datasource performance forward. James Baker (my colleague) had a number of questions about advanced pushdowns (combined sorting and filtering), and Reynold also noted that pushdown of aggregates and joins is desirable on longer timeframes as well. The Spark community saw similar requests: aggregate pushdown in SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown in SPARK-12449. Clearly a number of people are interested in this kind of performance work for datasources.

To leave enough space for datasource developers to continue experimenting with advanced interactions between Spark and their datasources, I'd propose we leave some sort of escape valve that enables these datasources to keep pushing the boundaries without forking Spark. Possibly that looks like an additional unsupported/unstable interface that pushes down an entire (unstable API) logical plan and is expected to break API on every release (Spark attempts this full-plan pushdown, and if that fails it ignores it and continues with the rest of the V2 API for compatibility). Or maybe it looks like something else that we don't know of yet. Possibly this falls outside the desired goals for the V2 API and should instead be a separate SPIP.
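
To make that concrete, here is a purely hypothetical Scala sketch of what such an unstable hook could look like; neither the trait nor its method exists in Spark, and the SPIP does not propose them:

// Hypothetical "escape valve" -- illustrative only, not a Spark interface.
// LogicalPlan is Spark's internal (and unstable) logical plan type.
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

trait UnstableFullPlanPushdown {
  // Offer the entire logical plan to the datasource. Returning false means
  // the source declined, and Spark falls back to the stable V2 read path
  // (column pruning, filter pushdown, etc.).
  def pushDownFullPlan(plan: LogicalPlan): Boolean
}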

If we had a plan for this kind of escape valve for advanced datasource developers, I'd be an unequivocal +1. Right now it feels like this SPIP is focused more on getting the basics right for what many datasources are already doing in API V1 combined with other private APIs, versus pushing forward the state of the art for performance.

Andrew



Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

Herman van Hövell tot Westerflier-2
+1 (binding)

I personally believe that there is quite a big difference between having a generic data source interface with a low surface area and pushing down a significant part of query processing into a datasource. The latter has a much wider surface area and will require us to stabilize most of the internal Catalyst APIs, which would be a significant burden on the community to maintain and has the potential to slow development velocity significantly. If you want to write such integrations, then you should be prepared to work with Catalyst internals and own up to the fact that things might change across minor versions (and in some cases even maintenance releases). If you are willing to go down that road, then your best bet is to use the already existing Spark session extensions, which allow you to write such integrations and can be used as an `escape hatch`.
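
For reference, a minimal sketch of that escape hatch using the SparkSessionExtensions API available since Spark 2.2; MyPushdownStrategy is a placeholder for an integration's own planner rule, not a real class:

import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Placeholder strategy: inspect the (internal, unstable) logical plan and
// emit physical plans; returning Nil defers to Spark's built-in strategies.
object MyPushdownStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
}

val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions(_.injectPlannerStrategy(_ => MyPushdownStrategy))
  .getOrCreate()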



--

Herman van Hövell

Software Engineer

Databricks Inc.

[hidden email]

+31 6 420 590 27

databricks.com



Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

Ryan Blue

+1 (non-binding)

Thanks for making the updates reflected in the current PR. It would be great to see the doc updated before it is finally published though.

Right now it feels like this SPIP is focused more on getting the basics right for what many datasources are already doing in API V1 combined with other private APIs, vs pushing forward state of the art for performance.

I think that’s the right approach for this SPIP. We can add the support you’re talking about later with a more specific plan that doesn’t block fixing the problems that this addresses.



--
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

Michael Armbrust
+1



Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

rxin
+1 as well



Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

Jiang Xingbo
+1




Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

Dongjoon Hyun-2
+1 (non-binding).



Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

wangzhenhua (G)

+1 (non-binding). Great to see the data source API is going to be improved!

best regards,
-Zhenhua (Xander)



Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

Noman Khan
+1


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

vaquarkhan
+1

Regards,
Vaquar khan



Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

Hemant Bhanawat
+1 (non-binding)

I have found the suggestion from Andrew Ash and James about plan pushdown quite interesting. However, I am not clear about join push-down support at the data source level. Shouldn't it be the responsibility of the join node to carry out a data-source-specific join? I mean, the join node and the data source scans of the two sides can (theoretically) be coalesced into a single node. This can be done by providing a Strategy that replaces the join node with a data-source-specific join node. We are doing it that way for our data sources, and I find this more intuitive.
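
A rough sketch of that approach, with isMyScan and planSourceJoin standing in as placeholders for the source-specific pieces (illustrative only, not part of the proposed API):

import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
import org.apache.spark.sql.execution.SparkPlan

object SourceJoinStrategy extends Strategy {
  // Placeholder: recognize a logical scan over this particular datasource.
  private def isMyScan(plan: LogicalPlan): Boolean = false

  // Placeholder: build a single physical operator that lets the source
  // execute the join natively.
  private def planSourceJoin(join: Join): Seq[SparkPlan] = Nil

  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // Both children scan the same source: coalesce scan + join + scan
    // into one data-source-specific join node.
    case j: Join if isMyScan(j.left) && isMyScan(j.right) => planSourceJoin(j)
    case _ => Nil // defer to Spark's built-in join strategies
  }
}

Such a Strategy can be registered through the Spark session extensions mentioned earlier in the thread.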

BTW, aggregate push-down support is desirable and should be considered an enhancement going forward.


On Sun, Sep 10, 2017 at 8:45 PM, vaquar khan <[hidden email]> wrote:
+1

Regards,
Vaquar khan

On Sep 10, 2017 5:18 AM, "Noman Khan" <[hidden email]> wrote:
+1
From: wangzhenhua (G) <[hidden email]>
Sent: Friday, September 8, 2017 2:20:07 AM
To: Dongjoon Hyun; 蒋星博
Cc: Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
Subject: 答复: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path
 

+1 (non-binding)  Great to see data source API is going to be improved!

 

best regards,

-Zhenhua(Xander)

 

发件人: Dongjoon Hyun [mailto:[hidden email]]
发送时间: 201798 4:07
收件人: 蒋星博
抄送: Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
主题: Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

 

+1 (non-binding).

 

On Thu, Sep 7, 2017 at 12:46 PM, 蒋星博 <[hidden email]> wrote:

+1

 

 

Reynold Xin <[hidden email]>201797日 周四下午12:04写道:

+1 as well

 

On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust <[hidden email]> wrote:

+1

 

On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue <[hidden email]> wrote:

+1 (non-binding)

Thanks for making the updates reflected in the current PR. It would be great to see the doc updated before it is finally published though.

Right now it feels like this SPIP is focused more on getting the basics right for what many datasources are already doing in API V1 combined with other private APIs, vs pushing forward state of the art for performance.

I think that’s the right approach for this SPIP. We can add the support you’re talking about later with a more specific plan that doesn’t block fixing the problems that this addresses.

 

On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <[hidden email]> wrote:

+1 (binding)

 

I personally believe that there is quite a big difference between having a generic data source interface with a low surface area and pushing down a significant part of query processing into a datasource. The later has much wider wider surface area and will require us to stabilize most of the internal catalyst API's which will be a significant burden on the community to maintain and has the potential to slow development velocity significantly. If you want to write such integrations then you should be prepared to work with catalyst internals and own up to the fact that things might change across minor versions (and in some cases even maintenance releases). If you are willing to go down that road, then your best bet is to use the already existing spark session extensions which will allow you to write such integrations and can be used as an `escape hatch`.


On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash <[hidden email]> wrote:

+0 (non-binding)

I think there are benefits to unifying all the Spark-internal datasources into a common public API for sure. It will serve as a forcing function to ensure that those internal datasources aren't advantaged vs datasources developed externally as plugins to Spark, and that all Spark features are available to all datasources.

But I also think this read-path proposal avoids the more difficult questions around how to continue pushing datasource performance forwards. James Baker (my colleague) had a number of questions about advanced pushdowns (combined sorting and filtering), and Reynold also noted that pushdown of aggregates and joins is desirable on longer timeframes as well. The Spark community has seen similar requests: aggregate pushdown in SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown in SPARK-12449. Clearly a number of people are interested in this kind of performance work for datasources.

To leave enough space for datasource developers to continue experimenting with advanced interactions between Spark and their datasources, I'd propose we leave some sort of escape valve that enables these datasources to keep pushing the boundaries without forking Spark. Possibly that looks like an additional unsupported/unstable interface that pushes down an entire (unstable-API) logical plan and is expected to break its API on every release (Spark attempts this full-plan pushdown, and if that fails it ignores the attempt and continues on with the rest of the V2 API for compatibility). Or maybe it looks like something else that we don't know of yet. Possibly this falls outside the desired goals for the V2 API and should instead be a separate SPIP.
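Nothing below exists in Spark; it is only a sketch of what such an explicitly unstable contract might look like, with Spark falling back to the stable V2 path whenever the source declines the plan:

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical opaque handle the source returns when it accepts a plan.
trait PushedDownScan

// Hypothetical, explicitly unstable contract: the source may absorb the
// entire logical plan (LogicalPlan itself is an internal, unstable API).
trait UnstablePlanPushDown {
  // Return None to decline; Spark would then ignore the attempt and plan
  // the query through the ordinary, stable V2 read path.
  def pushDownPlan(plan: LogicalPlan): Option[PushedDownScan]
}
```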


If we had a plan for this kind of escape valve for advanced datasource developers, I'd be an unequivocal +1. Right now it feels like this SPIP is focused more on getting the basics right for what many datasources are already doing in API V1 combined with other private APIs, vs pushing forward the state of the art for performance.

Andrew

On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <[hidden email]> wrote:

+1 (non-binding)

Reply | Threaded
Open this post in threaded view
|

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

cloud0fan
Yea, join push-down (providing the other reader and the join conditions) and aggregate push-down (providing the grouping keys and aggregate functions) can be added via the current framework in the future; a hypothetical sketch of what that could look like follows.
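As a purely hypothetical illustration of how that could slot into the current framework — every name and signature below is invented for this sketch and is not part of the voted SPIP:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression

// Stand-in for the SPIP's reader interface.
trait DataSourceV2Reader

// Spark would offer the other side's reader and the join condition;
// returning true means this source evaluates the join itself.
trait SupportsPushDownJoin { self: DataSourceV2Reader =>
  def pushJoin(other: DataSourceV2Reader, condition: Expression): Boolean
}

// Spark would offer grouping keys and aggregate expressions to push down.
trait SupportsPushDownAggregates { self: DataSourceV2Reader =>
  def pushAggregate(groupingKeys: Seq[Expression], aggregates: Seq[Expression]): Boolean
}
```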



Reply | Threaded
Open this post in threaded view
|

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

cloud0fan
This vote passes with 4 binding +1 votes, 10 non-binding +1 votes, one +0 vote, and no -1 votes.

Thanks all!

+1 votes (binding):
Wenchen Fan
Herman van Hövell tot Westerflier
Michael Armbrust
Reynold Xin


+1 votes (non-binding):
Xiao Li
Sameer Agarwal
Suresh Thalamati
Ryan Blue
Xingbo Jiang
Dongjoon Hyun
Zhenhua Wang
Noman Khan
vaquar khan
Hemant Bhanawat

+0 votes:
Andrew Ash
