DSv2 & DataSourceRegister


Andrew Melo
Hi all,

I posted an improvement ticket in JIRA and Hyukjin Kwon requested I
send an email to the dev list for discussion.

As the DSv2 API evolves, breaking changes are occasionally made to the
API. It's possible to split a plugin into a "common" part and multiple
version-specific parts, and this works reasonably well for shipping a
single artifact to users, as long as they write out the fully
qualified classname in the DataFrame format() call. The one part that
can't currently be worked around is the DataSourceRegister trait.
Since classes that implement DataSourceRegister must also implement
DataSourceV2 (and its mixins), changes to those interfaces cause the
ServiceLoader to fail when it attempts to load the "wrong"
DataSourceV2 class. (There is also a prohibition against multiple
implementations sharing the same shortName in
org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource.)
This means users need to update their notebooks/code/tutorials if they
run at a different site whose cluster runs a different Spark version.
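
For context, registration happens through a Java ServiceLoader
provider file; a sketch of what that file looks like is below (the
file path is Spark's real ServiceLoader hook, the class name is
illustrative):

# META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
# Every class listed here is instantiated during
# DataSource.lookupDataSource, so if it references a DSv2 interface
# that doesn't exist in the running Spark version, the load itself
# fails -- before shortName() is ever consulted.
example.RootRegister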

To solve this, I proposed in SPARK-31363 a new trait that would
function the same as the existing DataSourceRegister trait but adds
an additional method:

public Class<? extends DataSourceV2> getImplementation();

...which would allow DSv2 plugins to dynamically choose the
appropriate DataSourceV2 class based on the runtime environment. This
would let us release a single artifact for different Spark versions,
and users could use the same artifactID and format() regardless of
where they were executing their code. If no services were registered
with this new trait, the behavior would remain the same as before.
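
As a rough sketch of what this would enable (the trait name, class
names, and version check below are illustrative, not the actual
SPARK-31363 API; the method returns a plain Class[_] here,
anticipating the version-neutral variant discussed later in this
thread):

import org.apache.spark.sql.sources.DataSourceRegister

// Hypothetical registration trait: keeps the stable shortName()
// contract but defers the choice of implementation class to runtime.
trait DataSourceRegisterV2 extends DataSourceRegister {
  def getImplementation(): Class[_]
}

// One registration class can then serve multiple Spark versions:
class RootRegister extends DataSourceRegisterV2 {
  override def shortName(): String = "root"

  override def getImplementation(): Class[_] =
    if (org.apache.spark.SPARK_VERSION.startsWith("2.4")) {
      Class.forName("Root_v24") // implements the 2.4 DSv2 interfaces
    } else {
      Class.forName("Root_v30") // implements the 3.0 TableProvider API
    }
}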

I think this functionality will be useful as DSv2 continues to
evolve; please let me know your thoughts.

Thanks
Andrew

Re: DSv2 & DataSourceRegister

Ryan Blue
Hi Andrew,

With DataSourceV2, I recommend plugging in a catalog instead of using DataSource. As you've noticed, the way that you plug in data sources isn't very flexible. That's one of the reasons why we changed the plugin system and made it possible to use named catalogs that load implementations based on configuration properties.

I think it's fine to consider how to patch the registration trait, but I really don't recommend continuing to identify table implementations directly by name.
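
For illustration, a minimal sketch of that configuration-driven path
in Spark 3.0 (the catalog name and implementation class here are
hypothetical):

// The catalog class is chosen by configuration, so which
// implementation gets loaded can vary per cluster without changing
// user code:
spark.conf.set("spark.sql.catalog.root_cat", "example.RootCatalog")

// Tables are then addressed through the catalog name rather than a
// format() short name resolved via ServiceLoader:
val df = spark.table("root_cat.ns.events")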


--
Ryan Blue
Software Engineer
Netflix

Re: DSv2 & DataSourceRegister

Andrew Melo
Hi Ryan,

Can you be a bit more concrete about what you mean by plugging in a
catalog instead of a DataSource? We have been using
spark.read.format("root").load([list of paths]), which works well.
Since we don't have databases or tables, I don't fully understand
what's different between the two interfaces that would make us prefer
one over the other.

That being said, WRT the registration trait, if I'm not misreading
createTable() and friends, the "source" parameter is resolved the same
way as DataFrameReader.format(), so a solution that helps registration
should help both interfaces.

Thanks again,
Andrew



Re: DSv2 & DataSourceRegister

cloud0fan
Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm not sure this is possible, as the DS v2 API is very different in 3.0; e.g., there is no `DataSourceV2` anymore, and you should implement `TableProvider` (if you don't have databases/tables).
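
(For readers following along, a skeleton of the 3.0 entry point
Wenchen mentions, sketched against the 3.0.0 API with method bodies
elided; the class name echoes the one Andrew uses below:)

import java.util
import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// A 3.0 source with no database/table notion implements TableProvider
// instead of the removed DataSourceV2 marker interface.
class Root_v30 extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType = ???
  override def getTable(schema: StructType, partitioning: Array[Transform],
                        properties: util.Map[String, String]): Table = ???
}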



Re: DSv2 & DataSourceRegister

Andrew Melo
Hello

Correct, I've got a single jar for both Spark 2.4 and 3.0, with a top-level Root_v24 (implements DataSourceV2) and Root_v30 (implements TableProvider). I can load this jar in both pyspark 2.4 and 3.0 and it works well -- as long as I remove the registration from META-INF and pass the full class name to the DataFrameReader.
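
(An illustration of the churn this causes -- the load path is made up,
the class names are the ones above, packages omitted:

// On a 2.4 cluster, without ServiceLoader registration the
// version-specific class must be named explicitly:
val df24 = spark.read.format("Root_v24").load("/data/events.root")

// ...and the same notebook on a 3.0 cluster has to change:
val df30 = spark.read.format("Root_v30").load("/data/events.root")

which is exactly the per-site divergence the proposed trait avoids.)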

Thanks
Andrew




Re: DSv2 & DataSourceRegister

cloud0fan
It would be good to support your use case, but I'm not sure how to accomplish it. Can you open a PR so that we can discuss it in detail? How can `public Class<? extends DataSourceV2> getImplementation();` be possible in 3.0, as there is no `DataSourceV2`?



Re: DSv2 & DataSourceRegister

Andrew Melo

You're right, that was a typo. Since the whole point is to separate
the (stable) registration interface from the (evolving) DSv2 API, it
defeats the purpose to then directly reference the DSv2 API within the
registration interface.
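
A version-neutral signature, as a sketch (Scala syntax; illustrative
rather than quoted from the PR):

// No reference to any DSv2 type, so the registration interface can
// stay stable while the provider interfaces evolve underneath it:
def getImplementation(): Class[_]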

I'll put together a PR.

Thanks again,
Andrew



Re: DSv2 & DataSourceRegister

Andrew Melo
Hi all,

I've opened a WIP PR here: https://github.com/apache/spark/pull/28159
I'm a novice at Scala, so I'm sure the code isn't idiomatic, but it
behaves functionally as I'd expect. I've added unit tests to the PR,
but if you would like to verify the intended functionality, I've
uploaded a fat jar with my datasource to
http://mirror.accre.vanderbilt.edu/spark/laurelin-both.jar and an
example input file to
https://github.com/spark-root/laurelin/raw/master/testdata/stdvector.root.
The following in spark-shell successfully chooses the proper plugin
implementation based on the Spark version:

spark.read.format("root").option("tree","tvec").load("stdvector.root")

Additionally, I did a very rough POC for Spark 2.4, which you can find
at https://github.com/PerilousApricot/spark/tree/feature/registerv2-24.
The same jar/input file works there as well.

Thanks again,
Andrew



Re: DSv2 & DataSourceRegister

Andrew Melo
Hi again,

Does anyone have thoughts on either the idea or the implementation?

Thanks,
Andrew
