FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

Dongjoon Hyun-2
Hi, All.

I want to share the following change with the community.

    SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

This was merged today, so Spark's `CREATE TABLE` now uses Spark's default data source instead of the `hive` provider. This is a good and big improvement for Apache Spark 3.0, but it might surprise some users. (Please note that there is a fallback option for them.)

Thank you, Yi, Wenchen, Xiao.

Cheers,
Dongjoon.

Re: FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

Takeshi Yamamuro
Oh, looks nice. Thanks for sharing, Dongjoon.

Bests,
Takeshi

On Sat, Dec 7, 2019 at 3:35 AM Dongjoon Hyun wrote:
Hi, All.

I want to share the following change with the community.

    SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

This was merged today, so Spark's `CREATE TABLE` now uses Spark's default data source instead of the `hive` provider. This is a good and big improvement for Apache Spark 3.0, but it might surprise some users. (Please note that there is a fallback option for them.)

Thank you, Yi, Wenchen, Xiao.

Cheers,
Dongjoon.


--
---
Takeshi Yamamuro

Re: FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

cloud0fan
I'm reviving this thread because this feature was reverted before the 3.0 release, and now we are trying to add it back since the CREATE TABLE syntax is unified.

The benefits are pretty clear: CREATE TABLE by default (without USING or STORED AS) should create native tables that work best with Spark. You can see all the benefits listed in https://github.com/apache/spark/pull/30554.
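
To make the rule above concrete, here is a minimal sketch in Python. It is a toy model, not Spark's parser: the function name `resolve_provider` and its parameters are hypothetical, it only looks at the `USING` and `STORED AS` clauses mentioned above, and it assumes the default data source is Parquet.

```python
import re


def resolve_provider(ddl: str,
                     default_source: str = "parquet",
                     hive_by_default: bool = False) -> str:
    """Toy model of which provider a CREATE TABLE statement ends up with.

    Hypothetical helper for illustration only; other Hive clauses are not modeled.
    """
    # An explicit USING clause names the provider directly.
    using = re.search(r"\bUSING\s+(\w+)", ddl, flags=re.IGNORECASE)
    if using:
        return using.group(1).lower()
    # STORED AS is Hive syntax, so the table is a Hive table.
    if re.search(r"\bSTORED\s+AS\b", ddl, flags=re.IGNORECASE):
        return "hive"
    # Neither clause: the old behavior picks hive, the proposed one picks
    # the default data source (assumed to be parquet here).
    return "hive" if hive_by_default else default_source


statements = [
    "CREATE TABLE t1 (id INT) USING csv",
    "CREATE TABLE t2 (id INT) STORED AS TEXTFILE",
    "CREATE TABLE t3 (id INT)",
]
for ddl in statements:
    print(ddl, "->", resolve_provider(ddl))
```

Only the last statement is affected by the change: with `hive_by_default=True` in this sketch it resolves to `hive`, otherwise to the default source.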

I'm sending this email to collect feedback about the risks. AFAIK the broken use cases are:
1. A user issues `CREATE TABLE ... LOCATION ...`. After some table insertions, they want to read the data files directly from the table location. Because the file format changes from Hive text to Parquet, this use case may break.
2. A user issues `CREATE TABLE ...` and then runs `ALTER TABLE SET SERDE` or `LOAD DATA`. These two are Hive-specific commands and don't work with Spark native tables.
3. A user issues `CREATE TABLE ...` and then uses Hive to add partitions with different serdes to this table. Spark doesn't allow a native partitioned table to have partitions with different formats.

From my personal experience, Hive text tables are usually used to import CSV-like data. It's very likely that people create Hive text tables explicitly, as they need the Hive syntax to specify options like the delimiter. Besides, I'm not sure how many Spark users are using this feature, as the native CSV data source can do the same job.

I'd consider it a bad user experience if a simple `CREATE TABLE` gives users a very slow table. Changing it to return a native Parquet table doesn't seem to break many people, but I could be wrong.

Please reply to this thread if you know more use cases that may be affected by this change, and share your thoughts.

Thanks,
Wenchen

On Sat, Dec 7, 2019 at 1:58 PM Takeshi Yamamuro wrote:
Oh, looks nice. Thanks for sharing, Dongjoon.

Bests,
Takeshi


Re: FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

Ryan Blue
Wenchen, could you start a new thread? Many people have probably already muted this one, and it isn't really on topic.

The question that needs to be discussed is whether this is a safe change for the 3.1 release, and reusing an old thread is not a great way to get people's attention about something potentially harmful like that.

On Tue, Dec 1, 2020 at 10:46 AM Wenchen Fan wrote:
I'm reviving this thread because this feature was reverted before the 3.0 release, and now we are trying to add it back since the CREATE TABLE syntax is unified.

The benefits are pretty clear: CREATE TABLE by default (without USING or STORED AS) should create native tables that work best with Spark. You can see all the benefits listed in https://github.com/apache/spark/pull/30554.

I'm sending this email to collect feedback about the risks. AFAIK the broken use cases are:
1. A user issues `CREATE TABLE ... LOCATION ...`. After some table insertions, they want to read the data files directly from the table location. Because the file format changes from Hive text to Parquet, this use case may break.
2. A user issues `CREATE TABLE ...` and then runs `ALTER TABLE SET SERDE` or `LOAD DATA`. These two are Hive-specific commands and don't work with Spark native tables.
3. A user issues `CREATE TABLE ...` and then uses Hive to add partitions with different serdes to this table. Spark doesn't allow a native partitioned table to have partitions with different formats.

From my personal experience, Hive text tables are usually used to import CSV-like data. It's very likely that people create Hive text tables explicitly, as they need the Hive syntax to specify options like the delimiter. Besides, I'm not sure how many Spark users are using this feature, as the native CSV data source can do the same job.

I'd consider it a bad user experience if a simple `CREATE TABLE` gives users a very slow table. Changing it to return a native Parquet table doesn't seem to break many people, but I could be wrong.

Please reply to this thread if you know more use cases that may be affected by this change, and share your thoughts.

Thanks,
Wenchen


--
Ryan Blue
Software Engineer
Netflix