Spark SQL parser and DDL

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Spark SQL parser and DDL

Ryan Blue

Hi everyone,

I’ve been working on SQL DDL statements for v2 tables lately, including the proposed additions to drop, rename, and alter columns. The most recent update I’ve added is to allow transformation functions in the PARTITION BY clause to pass to v2 data sources. This allows sources like Iceberg to do partition pruning internally.

One of the difficulties has been that the SQL parser is coupled to the current logical plans and includes details that are specific to them. For example, data source table creation makes determinations like the EXTERNAL keyword is not allowed and instead the mode (external or managed) is set depending on whether a path is set. It also translates IF NOT EXISTS into a SaveMode and introduces a few other transformations.

The main problem with this is that converting the SQL plans produced by the parser to v2 plans requires interpreting these alterations and not the original SQL. Another consequence is that there are two parsers: AstBuilder in spark-catalyst and SparkSqlParser in spark-sql (core) because not all of the plans are available to the parser in the catalyst module.

I think it would be cleaner if we added a sql package with catalyst plans that carry the SQL options as they were parsed, and then convert those plans to specific implementations depending on the tables that are used. That makes support for v2 plans much cleaner by converting from a generic SQL plan instead of creating a v1 plan that assumes a data source table and then converting that to a v2 plan (playing telephone with logical plans).

This has simplified the work I’ve been doing to add PARTITION BY transformations. Instead of needing to add transformations to the CatalogTable metadata that’s used everywhere, this only required a change to the rule that converts from the parsed SQL plan to CatalogTable-based v1 plans. It is also cleaner to have the logic for converting to CatalogTable in DataSourceAnalysis instead of in the parser itself.

Are there objections to this approach for integrating v2 plans?

--
Ryan Blue
Software Engineer
Netflix
Reply | Threaded
Open this post in threaded view
|

Re: Spark SQL parser and DDL

Felix Cheung
Sounds like a good idea?

Would this be a step in the direction of supporting variation of the SQL dialect, too?

 

From: Ryan Blue <[hidden email]>
Sent: Thursday, October 4, 2018 8:56 AM
To: Spark Dev List
Subject: Spark SQL parser and DDL
 

Hi everyone,

I’ve been working on SQL DDL statements for v2 tables lately, including the proposed additions to drop, rename, and alter columns. The most recent update I’ve added is to allow transformation functions in the PARTITION BY clause to pass to v2 data sources. This allows sources like Iceberg to do partition pruning internally.

One of the difficulties has been that the SQL parser is coupled to the current logical plans and includes details that are specific to them. For example, data source table creation makes determinations like the EXTERNAL keyword is not allowed and instead the mode (external or managed) is set depending on whether a path is set. It also translates IF NOT EXISTS into a SaveMode and introduces a few other transformations.

The main problem with this is that converting the SQL plans produced by the parser to v2 plans requires interpreting these alterations and not the original SQL. Another consequence is that there are two parsers: AstBuilder in spark-catalyst and SparkSqlParser in spark-sql (core) because not all of the plans are available to the parser in the catalyst module.

I think it would be cleaner if we added a sql package with catalyst plans that carry the SQL options as they were parsed, and then convert those plans to specific implementations depending on the tables that are used. That makes support for v2 plans much cleaner by converting from a generic SQL plan instead of creating a v1 plan that assumes a data source table and then converting that to a v2 plan (playing telephone with logical plans).

This has simplified the work I’ve been doing to add PARTITION BY transformations. Instead of needing to add transformations to the CatalogTable metadata that’s used everywhere, this only required a change to the rule that converts from the parsed SQL plan to CatalogTable-based v1 plans. It is also cleaner to have the logic for converting to CatalogTable in DataSourceAnalysis instead of in the parser itself.

Are there objections to this approach for integrating v2 plans?

--
Ryan Blue
Software Engineer
Netflix