Why are DataFrames always read with nullable=True?


Why are DataFrames always read with nullable=True?

Jason White
If I create a DataFrame in Spark with non-nullable columns and then save it to disk as a Parquet file, the columns are properly marked as non-nullable; I confirmed this using parquet-tools. However, when loading it back, Spark forces nullable back to True.
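A minimal sketch of the reproduction I'm describing (the output path and column names are just placeholders):

  import org.apache.spark.sql.{Row, SparkSession}
  import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

  val spark = SparkSession.builder().master("local[*]").getOrCreate()

  // Schema with explicitly non-nullable columns.
  val schema = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = false)))

  val df = spark.createDataFrame(
    spark.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, "b"))), schema)

  df.printSchema()  // both fields report nullable = false

  df.write.mode("overwrite").parquet("/tmp/nonnullable")

  // Reading it back: every column comes back with nullable = true.
  spark.read.parquet("/tmp/nonnullable").printSchema()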

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L378

If I remove the `.asNullable` part, Spark performs exactly as I'd like by default, picking up the data using either the schema in the Parquet file or the one I provide.
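(`.asNullable` is internal to Spark, but for reference its effect on the schema is roughly the following; this is just a sketch, not the actual source, and the real method also recurses into nested types:)

  import org.apache.spark.sql.types.StructType

  // Roughly what the read path does to the user- or file-provided schema:
  // every top-level field is forced to nullable = true.
  def forceNullable(schema: StructType): StructType =
    StructType(schema.fields.map(_.copy(nullable = true)))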

This particular line of code has been in place for a year now, and I've seen a variety of discussions about this issue, in particular with Michael here: https://www.mail-archive.com/user@spark.apache.org/msg39230.html. Those seemed to be discussing writing rather than reading, though, and writing is already supported now.

Is this functionality still desirable? Is it potentially not applicable for all file formats and situations (e.g. HDFS/Parquet)? Would it be suitable to pass an option to the DataFrameReader to disable this functionality?

Re: Why are DataFrames always read with nullable=True?

Kazuaki Ishizaki
Hi,
Regarding the read path for nullable, adding a data cleaning step seems to be under consideration, as Xiao said at https://www.mail-archive.com/user@.../msg39233.html.

Here is a PR, https://github.com/apache/spark/pull/17293, that adds a data cleaning step which throws an exception if a null exists in a non-nullable column.
Any comments are appreciated.
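To illustrate the idea (a rough user-side sketch only, not the code in the PR): the check amounts to failing fast when a column declared non-nullable actually contains nulls, e.g.:

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.col

  // Sketch: throw if any declared non-nullable column contains nulls.
  // The PR performs this kind of validation inside Spark's read path;
  // this function name and approach are only an approximation of the idea.
  def assertNoNullsInNonNullableColumns(df: DataFrame): Unit = {
    df.schema.fields.filterNot(_.nullable).foreach { f =>
      val nullCount = df.filter(col(f.name).isNull).count()
      if (nullCount > 0) {
        throw new RuntimeException(
          s"Column ${f.name} is declared non-nullable but contains $nullCount null(s)")
      }
    }
  }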

Kazuaki Ishizaki



Re: Why are DataFrames always read with nullable=True?

Takeshi Yamamuro
Hi,

Have you checked the related JIRA, e.g., https://issues.apache.org/jira/browse/SPARK-19950?
If you have any asks or requests, it would be better to raise them there.

Thanks!

// maropu


--
---
Takeshi Yamamuro
Re: Why are DataFrames always read with nullable=True?

Jason White
Thanks for pointing to those JIRA tickets; I hadn't seen them. It's encouraging that they're recent. I hope we can find a solution there.