[SPARK-30319][SQL] Add a stricter version of as[T]


[SPARK-30319][SQL] Add a stricter version of as[T]

Enrico Minack

Hi Devs,

I'd like to propose a stricter version of as[T]. Given the interface def as[T](): Dataset[T], it is counter-intuitive that the schema of the returned Dataset[T] still depends on the schema of the originating Dataset. The schema should be derived from T alone.

I am proposing a stricter version so that user code does not need to pair each .as[T] with a select(schemaOfT.fields.map(col(_.name)): _*) whenever it expects the Dataset[T] to really contain only the columns of T.
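To make the pain point concrete, here is a minimal sketch of the current behaviour and the select workaround (Person, the column names, and the local session setup are made-up for the example):

```scala
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical example class; any case class works the same way
case class Person(id: Long, name: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// A DataFrame with one column ("age") beyond the fields of Person
val df = Seq((1L, "Alice", 30)).toDF("id", "name", "age")

// as[Person] alone keeps the extra column in the schema
val ds = df.as[Person]
println(ds.columns.mkString(", "))  // id, name, age

// The explicit projection user code currently has to pair with as[Person]
val schemaOfT = Encoders.product[Person].schema
val strict = df.select(schemaOfT.fields.map(f => col(f.name)): _*).as[Person]
println(strict.columns.mkString(", "))  // id, name
```

A stricter as[T] would make the second, projected form the default behaviour.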

https://github.com/apache/spark/pull/26969

Regards,
Enrico


Re: [SPARK-30319][SQL] Add a stricter version of as[T]

cloud0fan
I think it's simply because as[T] is lazy. You will see the right schema if you do `df.as[T].map(identity)`.
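For illustration, a self-contained sketch of that workaround (Person and the extra column are made-up; assumes a local session):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example class
case class Person(id: Long, name: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1L, "Alice", 30)).toDF("id", "name", "age")

// as[Person] alone still reports the originating schema
println(df.as[Person].columns.mkString(", "))  // id, name, age

// map(identity) deserializes each row into a Person and back,
// so the resulting schema is derived from Person alone
val fixed = df.as[Person].map(identity)
println(fixed.columns.mkString(", "))  // id, name
```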



On Tue, Jan 7, 2020 at 4:42 PM Enrico Minack <[hidden email]> wrote:



Re: [SPARK-30319][SQL] Add a stricter version of as[T]

Enrico Minack
Yes, as[T] is lazy, as any transformation is, but that laziness concerns the data processing, not the schema. You seem to imply that as[T] is lazy in terms of the schema, but I do not know of any other transformation that behaves like this.

Your proposed solution works because the map transformation returns the right schema, even though it is also a lazy transformation. The as[T] should behave like this too.

The map transformation is a quick fix in terms of code length, but it materializes the data as instances of T, which introduces a prohibitive deserialization / serialization round trip for no good reason.
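The cost difference shows up in the physical plans; a sketch (Person and the session setup are made-up, and the exact plan text varies by Spark version):

```scala
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical example class
case class Person(id: Long, name: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1L, "Alice", 30)).toDF("id", "name", "age")

// map(identity): the plan contains DeserializeToObject / SerializeFromObject
// nodes, i.e. every row takes a full object round trip
df.as[Person].map(identity).explain()

// select-based projection: a plain Project, no (de)serialization
val fields = Encoders.product[Person].schema.fields.map(f => col(f.name))
df.select(fields: _*).as[Person].explain()
```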

I think returning the right schema does not need to touch any data and should be as lightweight as a projection.

Enrico


On 07.01.20 at 10:13, Wenchen Fan wrote:
I think it's simply because as[T] is lazy. You will see the right schema if you do `df.as[T].map(identity)`.


