Public API access to UDTs

Simeon Fitch
Hi,

First time posting here, so apologies if I need to be directing this topic elsewhere.

I'm the author of RasterFrames, and a contributor to GeoMesa's Spark SQL module. Both make use of fairly low-level Catalyst constructs, including custom UDTs: RasterFrames introduces a geospatial raster type, and GeoMesa a geometry type.

In order to make this work we've circumvented the [`package private`](https://bit.ly/3pr0fVv) restriction on `UDTRegistration` by inserting sibling classes into the package namespace. It's a hack that works fine with JVM 8, but it violates the [much more restrictive](https://bit.ly/3aadO5g) module constructs in JVM 9+.
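
For readers unfamiliar with the workaround, here is a minimal sketch of what "inserting sibling classes into the package namespace" can look like. It is an illustration under stated assumptions, not the actual RasterFrames or GeoMesa code: the package `org.apache.spark.sql.example`, the `Point` class, `PointUDT`, and `PointUDTRegistrar` are all hypothetical names, and it assumes a Spark version in which `UserDefinedType` and `UDTRegistration` are still `private[spark]`.

```scala
// Hypothetical file living in *our* source tree, but declared inside Spark's
// package hierarchy so that private[spark] members become visible to it.
package org.apache.spark.sql.example

import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types.{ArrayType, DataType, DoubleType, UDTRegistration, UserDefinedType}

// Hypothetical user class that we want to store in a DataFrame column.
case class Point(x: Double, y: Double)

// The UDT maps Point to and from Catalyst's internal array representation.
class PointUDT extends UserDefinedType[Point] {
  // Underlying storage type: a non-nullable array of doubles.
  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)

  override def serialize(p: Point): Any =
    new GenericArrayData(Array[Any](p.x, p.y))

  // Throws a MatchError for anything that is not Catalyst array data.
  override def deserialize(datum: Any): Point = datum match {
    case values: ArrayData => Point(values.getDouble(0), values.getDouble(1))
  }

  override def userClass: Class[Point] = classOf[Point]
}

object PointUDTRegistrar {
  // Only compiles because this object sits under org.apache.spark.
  def register(): Unit =
    UDTRegistration.register(classOf[Point].getName, classOf[PointUDT].getName)
}
```

Calling `PointUDTRegistrar.register()` once at startup, before any `Point` values are encoded, is how a sketch like this would typically be used. Declaring classes in a package owned by another library is exactly what becomes a problem once Spark and the application are loaded as separate named modules on JVM 9+.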

We've been monitoring [SPARK-7768](https://issues.apache.org/jira/browse/SPARK-7768) (filed in 2015) and its [associated PR](https://github.com/apache/spark/pull/16478) for years now, but it keeps getting kicked down the road(map).

As authors of open source systems we completely understand how and why this happens, but we are at a critical juncture in our projects' lifecycle, anchored to JVM 8 while other systems have moved on to later versions. We'd also like to enjoy the benefits of later JVMs.

So... I'm here to find out how I and others critically needing public access to `UDTRegistration` might better advocate for it?

I think (but am not 100% sure) the PR linked above is more extensive than what we need, as it also addresses usability around Encoders, for which we have our own type class solution. My assumption to date has been that all we need is line 32 of `UDTRegistration` deleted (if there's folly therein, please say so!). While I understand a reluctance to promote `UDTRegistration` to `public`, I note that it has not been changed since 2016, perhaps a good indicator that the API is stable enough. Marking it as `@Experimental` could be a compromise option.

Thanks for reading this far and giving this consideration. Any and all advice is appreciated.

Simeon (@metasim)


--
Simeon Fitch
Co-founder & VP of R&D
Astraea, Inc.


Re: Public API access to UDTs

Sean Owen-2
I'm also interested: are there problems with opening up this API beyond needing to freeze it and keep it stable? It's pretty stable. As `@DeveloperApi` at least?
Are there implications for storing UDTs in particular engines or formats?
Just making it public for developers, even with a 'use at your own risk' warning, seems like a pretty small change?



Re: Public API access to UDTs

Simeon Fitch

On Fri, Jan 29, 2021 at 9:46 AM Sean Owen <[hidden email]> wrote:
Are there implications for storing UDTs in particular engines or formats?

I've found that UDTs read from and write to Parquet without problems.

They work fine with PySpark, given an implementation of mirror classes. Without properly constructed mirror classes they show up as structs, which isn't a bad fallback.

However, they do *not* work with Spark's use of Arrow, as they get rejected here:


 


Re: Public API access to UDTs

Sean Owen-2
In reply to this post by Sean Owen-2
I'm not hearing any objection to making it public as a `@DeveloperApi`. Does anyone object to a PR on that?



Re: Public API access to UDTs

Simeon Fitch
🙇



Re: Public API access to UDTs

Sean Owen-2
I opened https://github.com/apache/spark/pull/31461 to track the discussion further. It narrowly proposes making a few types public.
