Official support of CREATE EXTERNAL TABLE


cloud0fan
Hi all,

I'd like to start a discussion thread about this topic, as it blocks an important feature that we are targeting for Spark 3.1: unifying the CREATE TABLE SQL syntax.

A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden feature in Spark for Hive compatibility.

When you write native CREATE TABLE syntax such as `CREATE EXTERNAL TABLE ... USING parquet`, the parser fails and tells you that EXTERNAL can't be specified.

When you write Hive CREATE TABLE syntax, EXTERNAL can only be specified if a LOCATION clause or path option is present. For example, `CREATE EXTERNAL TABLE ... STORED AS parquet` is not allowed because there is no LOCATION clause or path option. This is not 100% Hive compatible.
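To illustrate, here is a rough sketch of the current behavior in spark-shell (error messages paraphrased from memory and may differ by version):

    // Native syntax: the parser rejects EXTERNAL outright.
    spark.sql("CREATE EXTERNAL TABLE t1 (id INT) USING parquet")
    // => ParseException (roughly): EXTERNAL cannot be specified

    // Hive syntax without LOCATION: rejected during analysis.
    spark.sql("CREATE EXTERNAL TABLE t2 (id INT) STORED AS parquet")
    // => roughly: CREATE EXTERNAL TABLE must be accompanied by LOCATION

    // Hive syntax with LOCATION: allowed.
    spark.sql("CREATE EXTERNAL TABLE t3 (id INT) STORED AS parquet LOCATION '/tmp/t3'")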

As we are unifying the CREATE TABLE SQL syntax, one problem is how to deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it was, or we can officially support it.

Please let us know your thoughts:
1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do? Have you used it in production before? For what use cases?
2. As a catalog developer, how would you implement EXTERNAL TABLE? It seems to me that it only makes sense for file sources, where the table directory can be managed. I'm not sure how to interpret EXTERNAL in catalogs like JDBC, Cassandra, etc.

For more details, please refer to the long discussion in https://github.com/apache/spark/pull/28026

Thanks,
Wenchen

Re: Official support of CREATE EXTERNAL TABLE

Ryan Blue

I would summarize both the problem and the current state differently.

Currently, Spark parses the EXTERNAL keyword for compatibility with Hive SQL, but Spark’s built-in catalog doesn’t allow creating a table with EXTERNAL unless LOCATION is also present. This “hidden feature” breaks compatibility with Hive SQL because all combinations of EXTERNAL and LOCATION are valid in Hive, but creating an external table with a default location is not allowed by Spark. Note that Spark must still handle these tables because it shares a metastore with Hive, which can still create them.

Now that catalogs can be plugged in, the question is whether to pass the fact that EXTERNAL was in the CREATE TABLE statement to the v2 catalog handling the create command, or to suppress it and apply Spark’s rule that LOCATION must be present.

If it is not passed to the catalog, then a Hive catalog cannot implement the behavior of Hive SQL, even though Spark added the keyword for Hive compatibility. The Spark catalog can interpret EXTERNAL however Spark chooses to, but I think it is a poor choice to force different behavior on other catalogs.

Wenchen has also argued that the purpose of this is to standardize behavior across catalogs. But hiding EXTERNAL would not accomplish that goal. Whether to physically delete data is a choice that is up to the catalog. Some catalogs have no “external” concept and will always drop data when a table is dropped. The ability to keep underlying data files is specific to a few catalogs, and whether that is controlled by EXTERNAL, the LOCATION clause, or something else is still up to the catalog implementation.

I don’t think that there is a good reason to force catalogs to break compatibility with Hive SQL, while making it appear as though DDL is compatible. Because removing EXTERNAL would be a breaking change to the SQL parser, I think the best option is to pass it to v2 catalogs so the catalog can decide how to handle it.
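To sketch what I mean (property keys are hypothetical; the actual reserved names would be settled as part of the work), a catalog plugin that receives EXTERNAL as a table property can implement Hive's orthogonal semantics itself:

    import java.util

    // Hypothetical sketch, not Spark's actual API surface: how a catalog
    // could interpret an "external" table property if Spark passed the
    // keyword through. In Hive, all four combinations below are valid.
    object ExternalTableSemantics {
      sealed trait Placement
      case object ManagedDefault  extends Placement // neither EXTERNAL nor LOCATION
      case object ManagedCustom   extends Placement // LOCATION only
      case object ExternalDefault extends Placement // EXTERNAL only (rejected by Spark today)
      case object ExternalCustom  extends Placement // both EXTERNAL and LOCATION

      def interpret(properties: util.Map[String, String]): Placement = {
        val external = "true".equalsIgnoreCase(properties.get("external"))
        val located  = properties.containsKey("location")
        (external, located) match {
          case (false, false) => ManagedDefault
          case (false, true)  => ManagedCustom
          case (true,  false) => ExternalDefault
          case (true,  true)  => ExternalCustom
        }
      }
    }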

rb


--
Ryan Blue
Software Engineer
Netflix

Re: Official support of CREATE EXTERNAL TABLE

Holden Karau
As someone who's had the job of porting different SQL dialects to Spark, I'm also very much in favor of keeping EXTERNAL, and I think Ryan's suggestion of leaving it up to the catalogs on how to handle this makes sense.



--
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

Re: Official support of CREATE EXTERNAL TABLE

RussS
In reply to this post by Ryan Blue
I don't feel differently than I did on the thread linked above: I think treating "external" as a table option is still the safest way to go. For the Cassandra catalog this option wouldn't appear on our whitelist of allowed options, the same as "path" and other options that don't apply to Cassandra.
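As a rough sketch of what I mean (made-up names, not the actual connector code):

    // Hypothetical option-whitelist check: validate incoming table options
    // against the keys this catalog supports, so "external" and "path"
    // fail fast instead of being silently ignored or misinterpreted.
    object CassandraOptionValidation {
      private val allowed = Set("keyspace", "table", "ttl") // illustrative keys only

      def validate(options: Map[String, String]): Unit = {
        val unknown = options.keySet -- allowed
        require(unknown.isEmpty,
          s"Options not supported by this catalog: ${unknown.mkString(", ")}")
      }
    }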


Re: Official support of CREATE EXTERNAL TABLE

cloud0fan
In reply to this post by Holden Karau
> As someone who's had the job of porting different SQL dialects to Spark, I'm also very much in favor of keeping EXTERNAL

Just to be clear: no one is proposing to remove EXTERNAL. The two options we are discussing are:
1. Keep the behavior the same as before, i.e. EXTERNAL must co-exist with LOCATION (or the path option).
2. Always allow EXTERNAL, and decouple it from LOCATION.
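In DDL terms, the only statement whose outcome differs between the two options is one like this (sketch):

    // Fails today (option 1); under option 2 the catalog would receive
    // EXTERNAL and could create the table at its default location.
    spark.sql("CREATE EXTERNAL TABLE events (id BIGINT) STORED AS parquet")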

I'm fine with option 2 if there are reasonable use cases, but I think it's always safer to keep the behavior the same as before. If we want to change the behavior and follow option 2, we need use cases to justify it.

For now, the only use case I see is Hive compatibility: allowing EXTERNAL TABLE without a user-specified LOCATION. Are there any more use cases we are targeting?


Re: Official support of CREATE EXTERNAL TABLE

Ryan Blue

Wenchen, why are you ignoring Hive as a “reasonable use case”?

The keyword came from Hive and we all agree that a Hive catalog with Hive behavior can’t be implemented if Spark chooses to couple this with LOCATION. Why is this use case not a justification?

Also, the option to keep behavior the same as before is not mutually exclusive with passing EXTERNAL to catalogs. Spark can continue to have the same behavior in its own catalog. But Spark cannot just choose to break compatibility with external systems by deciding when to fail certain combinations of DDL options. Choosing not to allow EXTERNAL without LOCATION, when that combination is valid in Hive, prevents building a compatible catalog.

There are many reasons to build a Hive-compatible catalog. A great recent example is Nessie, which enables branching and tagging table states across several table formats and aims to be compatible with Hive.



Re: Official support of CREATE EXTERNAL TABLE

cloud0fan
I don't think Hive compatibility by itself is a "use case". The Nessie example you mentioned is a reasonable use case to me: some frameworks/applications want to create external tables without a user-specified location, so that they can manage the table directory themselves and implement fancy features.

That said, I now agree it's better to decouple EXTERNAL and LOCATION. We should clearly document that EXTERNAL and LOCATION are only applicable to file-based data sources, and that a catalog implementation should fail if a table has the EXTERNAL or LOCATION property but its provider is not file-based.
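The documented rule could be enforced with a check along these lines (hypothetical sketch, not actual Spark code):

    // Reject EXTERNAL/LOCATION for non-file-based providers, per the
    // documentation rule above. The provider list is illustrative only.
    val fileBasedProviders = Set("parquet", "orc", "json", "csv", "text", "avro")

    def validateExternalClauses(provider: String, properties: Map[String, String]): Unit = {
      val usesFileClauses =
        properties.contains("external") || properties.contains("location")
      require(!usesFileClauses || fileBasedProviders.contains(provider.toLowerCase),
        s"EXTERNAL and LOCATION are only supported for file-based sources, got: $provider")
    }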

BTW, what about LOCATION without EXTERNAL? Currently Spark treats such a table as external. Hive gives a warning when you create a managed table with a custom location, which suggests that behavior is not recommended. Shall we "infer" EXTERNAL from LOCATION, even though that's not Hive compatible?
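Concretely, the inference rule in question would be something like (hypothetical helper):

    // "Infer EXTERNAL from LOCATION": any custom location makes the table
    // external. Hive instead allows managed tables with custom locations
    // (with a warning), so this rule diverges from Hive.
    def inferTableType(external: Boolean, location: Option[String]): String =
      if (external || location.isDefined) "EXTERNAL" else "MANAGED"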


Re: Official support of CREATE EXTERNAL TABLE

Holden Karau


On Wed, Oct 7, 2020 at 9:57 AM Wenchen Fan <[hidden email]> wrote:
> I don't think Hive compatibility itself is a "use case".

Ok, let's add on top of this: I have some Hive queries that I want to run on Spark. I believe that makes it a use case.

Re: Official support of CREATE EXTERNAL TABLE

Ryan Blue

> how about LOCATION without EXTERNAL? Currently Spark treats it as an external table.

I think there is some confusion about what Spark has to handle. Regardless of what Spark allows as DDL, these tables can exist in a Hive MetaStore that Spark connects to, and the general expectation is that Spark doesn’t change the meaning of table configuration. There are notable bugs where Spark has different behavior, but that is the expectation.

In this particular case, we’re talking about what can be expressed in DDL that is sent to an external catalog. Spark could (unwisely) choose to disallow some DDL combinations, but the table is implemented through a plugin so the interpretation is up to the plugin. Spark has no role in choosing how to treat this table, unless it is loaded through Spark’s built-in catalog; in which case, see above.

> I don’t think Hive compatibility itself is a “use case”.

Why?

Hive is an external database that defines its own behavior and with which Spark claims to be compatible. If Hive isn’t a valid use case, then why is EXTERNAL supported at all?



Re: Official support of CREATE EXTERNAL TABLE

cloud0fan
> I have some hive queries that I want to run on Spark.

Spark is not compatible with Hive in many places. Decoupling EXTERNAL and LOCATION won't help you much here. If you do have this use case, we need a much wider discussion about how to achieve it.

For this particular topic, we need concrete use cases like Nessie. It would be great to see more concrete use cases, but I think the Nessie use case is good enough to justify decoupling EXTERNAL and LOCATION.

BTW, CREATE EXTERNAL TABLE is not a Hive-specific feature; many databases have it. That's why I think Hive compatibility alone is not a reasonable use case. For your reference:
1. Snowflake supports CREATE EXTERNAL TABLE and requires the LOCATION clause as Spark does: doc
2. Redshift supports CREATE EXTERNAL TABLE and requires the LOCATION clause as Spark does: doc
3. Db2 supports CREATE EXTERNAL TABLE and requires DATAOBJECT or FILE_NAME option: doc
4. SQL Server also supports CREATE EXTERNAL TABLE: doc

> with which Spark claims to be compatible

I don't think Spark ever claims to be 100% Hive compatible. In fact, we diverged from Hive intentionally in several places, where we think the Hive behavior was not reasonable and we shouldn't follow it.


Re: Official support of CREATE EXTERNAL TABLE

Ryan Blue

> I don’t think Spark ever claims to be 100% Hive compatible.

By accepting the EXTERNAL keyword in some circumstances, Spark is providing compatibility with Hive DDL. Yes, there are places where it breaks. The question is whether we should deliberately break what a Hive catalog could implement, when we know what Hive’s behavior is.

> CREATE EXTERNAL TABLE is not a Hive-specific feature

Great. So there are other catalogs that could use it. Why should Spark choose to limit Hive’s interpretation of this keyword?

While it is great that we seem to agree that Spark shouldn’t do this — now that Nessie was pointed out — I’m concerned that you still seem to think this is a choice that Spark could reasonably make. Spark cannot arbitrarily choose how to interpret DDL for an external catalog.

You may not consider this arbitrary because there are other examples where location is required. But the Hive community made the choice that these clauses are orthogonal, so it is clearly a choice of the external system, and it is not Spark’s role to dictate how an external database should behave.

> I think the Nessie use case is good enough to justify the decoupling of EXTERNAL and LOCATION.

It appears that we have consensus. This will be passed to catalogs, which can implement the behavior that they choose.


On Wed, Oct 7, 2020 at 11:08 AM Wenchen Fan <[hidden email]> wrote:
> I have some hive queries that I want to run on Spark.

Spark is not compatible with Hive in many places. Decoupling EXTERNAL and LOCATION can't help you too much here. If you do have this use case, we need a much wider discussion about how to achieve it.

For this particular topic, we need concrete use cases like Nessie. It would be great to see more concrete use cases, but I think the Nessie one alone is good enough to justify decoupling EXTERNAL and LOCATION.

BTW, CREATE EXTERNAL TABLE is not a Hive-specific feature. Many databases have it. That's why I think Hive-compatibility alone is not a reasonable use case. For your reference:
1. Snowflake supports CREATE EXTERNAL TABLE and requires the LOCATION clause as Spark does: doc
2. Redshift supports CREATE EXTERNAL TABLE and requires the LOCATION clause as Spark does: doc
3. Db2 supports CREATE EXTERNAL TABLE and requires DATAOBJECT or FILE_NAME option: doc
4. SQL Server also supports CREATE EXTERNAL TABLE: doc

> with which Spark claims to be compatible

I don't think Spark ever claims to be 100% Hive compatible. In fact, we intentionally diverged from Hive in several places where we think the Hive behavior is not reasonable and shouldn't be followed.

On Thu, Oct 8, 2020 at 1:58 AM Ryan Blue <[hidden email]> wrote:

> how about LOCATION without EXTERNAL? Currently Spark treats it as an external table.

I think there is some confusion about what Spark has to handle. Regardless of what Spark allows as DDL, these tables can exist in a Hive MetaStore that Spark connects to, and the general expectation is that Spark doesn’t change the meaning of table configuration. There are notable bugs where Spark has different behavior, but that is the expectation.

In this particular case, we’re talking about what can be expressed in DDL that is sent to an external catalog. Spark could (unwisely) choose to disallow some DDL combinations, but the table is implemented through a plugin, so the interpretation is up to the plugin. Spark has no role in choosing how to treat this table unless it is loaded through Spark’s built-in catalog, in which case see above.
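For instance, a table created by Hive in the shared metastore still surfaces through Spark's public catalog API with its table type intact (the database and table names here are hypothetical):

    import org.apache.spark.sql.SparkSession

    object SharedMetastoreSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
        // A table that Hive created as EXTERNAL without LOCATION would still
        // show up here, and DROP TABLE semantics follow the recorded type,
        // not the engine that issued the original DDL.
        val t = spark.catalog.getTable("db", "hive_created_table")
        println(s"${t.name}: ${t.tableType}") // "EXTERNAL" or "MANAGED"
        spark.stop()
      }
    }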

> I don’t think Hive compatibility itself is a “use case”.

Why?

Hive is an external database that defines its own behavior and with which Spark claims to be compatible. If Hive isn’t a valid use case, then why is EXTERNAL supported at all?


On Wed, Oct 7, 2020 at 10:17 AM Holden Karau <[hidden email]> wrote:


On Wed, Oct 7, 2020 at 9:57 AM Wenchen Fan <[hidden email]> wrote:
> I don't think Hive compatibility itself is a "use case".
Ok, let's add on top of this: I have some Hive queries that I want to run on Spark. I believe that makes it a use case.
The Nessie example you mentioned is a reasonable use case to me: some frameworks/applications want to create external tables without a user-specified location, so that they can manage the table directory themselves and implement fancy features.

That said, I now agree it's better to decouple EXTERNAL and LOCATION. We should clearly document that EXTERNAL and LOCATION are only applicable to file-based data sources, and that a catalog implementation should fail if a table has the EXTERNAL or LOCATION property but its provider is not file-based.
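A minimal sketch of that check, assuming the provider and the clauses arrive as table properties (the property keys and the file-based provider list are assumptions):

    import java.util

    object CreateTableChecks {
      private val fileBasedProviders = Set("parquet", "orc", "json", "csv", "text", "avro")

      // Fail table creation when EXTERNAL or LOCATION is present but the
      // provider is not file-based, as suggested above.
      def checkExternalAndLocation(properties: util.Map[String, String]): Unit = {
        val provider = Option(properties.get("provider")).map(_.toLowerCase)
        val hasExternal = "true".equalsIgnoreCase(properties.get("external"))
        val hasLocation = properties.containsKey("location")
        if ((hasExternal || hasLocation) && !provider.exists(fileBasedProviders.contains)) {
          throw new IllegalArgumentException(
            s"EXTERNAL and LOCATION are only supported for file-based sources, " +
              s"but got provider: ${provider.getOrElse("<none>")}")
        }
      }
    }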

BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as an external table. Hive gives a warning when you create a managed table with a custom location, which suggests this behavior is not recommended. Shall we "infer" EXTERNAL from LOCATION even though it's not Hive-compatible?
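The inference rule in question is tiny; roughly (a sketch of the current behavior as described above, not actual Spark code):

    object InferExternalSketch {
      // Spark's current treatment: a custom LOCATION implies an external table
      // even without the EXTERNAL keyword. Hive instead allows managed tables
      // with a custom location (with a warning), so this inference diverges.
      def isExternal(externalKeyword: Boolean, location: Option[String]): Boolean =
        externalKeyword || location.isDefined
    }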

On Thu, Oct 8, 2020 at 12:24 AM Ryan Blue <[hidden email]> wrote:

Wenchen, why are you ignoring Hive as a “reasonable use case”?

The keyword came from Hive and we all agree that a Hive catalog with Hive behavior can’t be implemented if Spark chooses to couple this with LOCATION. Why is this use case not a justification?

Also, the option to keep behavior the same as before is not mutually exclusive with passing EXTERNAL to catalogs. Spark can continue to have the same behavior in its catalog. But Spark cannot just choose to break compatibility with external systems by deciding when to fail certain combinations of DDL options. Choosing not to allow EXTERNAL without LOCATION, when that combination is valid in Hive, prevents building a compatible catalog.

There are many reasons to build a Hive-compatible catalog. A great recent example is Nessie, which enables branching and tagging table states across several table formats and aims to be compatible with Hive.


On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan <[hidden email]> wrote:
> As someone who's had the job of porting different SQL dialects to Spark, I'm also very much in favor of keeping EXTERNAL

Just to be clear: no one is proposing to remove EXTERNAL. The two options we are discussing are (see the sketch after the list):
1. Keep the behavior the same as before, i.e. EXTERNAL must coexist with LOCATION (or a path option).
2. Always allow EXTERNAL, and decouple it from LOCATION.
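A sketch of the difference, phrased as an acceptance rule over the parsed statement (the case class and method names are illustrative, not real Spark internals):

    object CreateTableOptions {
      final case class CreateTableSpec(external: Boolean, location: Option[String])

      // Option 1 (status quo): EXTERNAL is only accepted together with
      // LOCATION (or a path option), so EXTERNAL alone fails analysis.
      def allowedOption1(spec: CreateTableSpec): Boolean =
        !spec.external || spec.location.isDefined

      // Option 2: EXTERNAL is always accepted and forwarded to the catalog,
      // which decides what it means (and whether to reject it).
      def allowedOption2(spec: CreateTableSpec): Boolean = true
    }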

I'm fine with option 2 if there are reasonable use cases. I think it's always safer to keep the behavior the same as before. If we want to change the behavior and follow option 2, we need use cases to justify it.

For now, the only use case I see is Hive compatibility: allowing EXTERNAL TABLE without a user-specified LOCATION. Are there any more use cases we are targeting?

On Wed, Oct 7, 2020 at 5:06 AM Holden Karau <[hidden email]> wrote:
As someone who's had the job of porting different SQL dialects to Spark, I'm also very much in favor of keeping EXTERNAL, and I think Ryan's suggestion of leaving it up to the catalogs on how to handle this makes sense.



--
Ryan Blue
Software Engineer
Netflix

Re: Official support of CREATE EXTERNAL TABLE

Ryan Blue

> I don’t think Spark ever claims to be 100% Hive compatible.

I just found some relevant documentation on this, where Databricks claims that “Apache Spark SQL in Databricks is designed to be compatible with the Apache Hive, including metastore connectivity, SerDes, and UDFs.”

There is a strong expectation that Spark is compatible with Hive, and the Spark community has made that claim. It isn’t a claim of 100% compatibility, but no one suggested there was 100% compatibility.




--
Ryan Blue
Software Engineer
Netflix

Re: Official support of CREATE EXTERNAL TABLE

beliefer
In reply to this post by cloud0fan
Personally, I think EXTERNAL is a special feature supported by Hive. If Spark SQL wants to support it, it should only be considered for Hive. We should only unify `CREATE EXTERNAL TABLE` in the parser and add a check that rejects unsupported data sources.



