[DISCUSS] ViewCatalog interface for DSv2

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

[DISCUSS] ViewCatalog interface for DSv2

John Zhuge
Catalog support has been added to DSv2 along with a table catalog interface. Here I'd like to propose a view catalog interface, for the following benefit:
  • Abstraction for view management thus allowing different view backends
  • Disassociation of view definition storage from Hive Metastore
A catalog plugin can be both TableCatalog and ViewCatalog. Resolve an identifier as view first then table.

More details in SPIP and PR if we decide to proceed. Here is a quick glance at the API:

ViewCatalog interface:
  • loadView
  • listViews
  • createView
  • deleteView
View interface:
  • name
  • originalSql
  • defaultCatalog
  • defaultNamespace
  • viewColumns
  • owner
  • createTime
  • softwareVersion
  • options (map)
ViewColumn interface:
  • name
  • type

Thanks,
John Zhuge
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] ViewCatalog interface for DSv2

Ryan Blue
Thanks for working on this, John!

I'd like to see a more complete write-up of what you're proposing. Without that, I don't think we can have a productive discussion about this.

For example, I think you're proposing to keep the view columns to ensure that the same columns are produced by the view every time, based on requirements from the SQL spec. Let's start by stating what those behavior requirements are, so that everyone has the context to understand why your proposal includes the view columns. Similarly, I'd like to know why you're proposing `softwareVersion` in the view definition.

On Tue, Aug 13, 2019 at 8:56 AM John Zhuge <[hidden email]> wrote:
Catalog support has been added to DSv2 along with a table catalog interface. Here I'd like to propose a view catalog interface, for the following benefit:
  • Abstraction for view management thus allowing different view backends
  • Disassociation of view definition storage from Hive Metastore
A catalog plugin can be both TableCatalog and ViewCatalog. Resolve an identifier as view first then table.

More details in SPIP and PR if we decide to proceed. Here is a quick glance at the API:

ViewCatalog interface:
  • loadView
  • listViews
  • createView
  • deleteView
View interface:
  • name
  • originalSql
  • defaultCatalog
  • defaultNamespace
  • viewColumns
  • owner
  • createTime
  • softwareVersion
  • options (map)
ViewColumn interface:
  • name
  • type

Thanks,
John Zhuge


--
Ryan Blue
Software Engineer
Netflix
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] ViewCatalog interface for DSv2

John Zhuge
Thanks for the feedback, Ryan! I can share the WIP copy of the SPIP if that makes sense.

I can't find out a lot about view resolution and validation in SQL Spec Part1. Anybody with full SQL knowledge may chime in.

Here are my understanding based on online manuals, docs, and other resources:
  • A view has a name in the database schema so that other queries can use it like a table.
  • A view's schema is frozen at the time the view is created; subsequent changes to underlying tables (e.g. adding a column) will not be reflected in the view's schema. If an underlying table is dropped or changed in an incompatible fashion, subsequent attempts to query the invalid view will fail.
In Preso, view columns are used for validation only (see StatementAnalyzer.Visitor#isViewStale):
  • view column names must match the visible fields of analyzed view sql
  • the visible fields can be coerced to view column types
In Spark 2.2+, view columns are also used for validation (see CheckAnalysis#checkAnalysis case View):
  • view column names must match the output fields of the view sql
  • view column types must be able to UpCast to output field types
Rule EliminateView adds a Project to viewQueryColumnNames if it exists.

As for `softwareVersion`, the purpose is to track which software version is used to create the view, in preparation for different versions of the same software or even different softwares, such as Presto vs Spark.


On Tue, Aug 13, 2019 at 9:47 AM Ryan Blue <[hidden email]> wrote:
Thanks for working on this, John!

I'd like to see a more complete write-up of what you're proposing. Without that, I don't think we can have a productive discussion about this.

For example, I think you're proposing to keep the view columns to ensure that the same columns are produced by the view every time, based on requirements from the SQL spec. Let's start by stating what those behavior requirements are, so that everyone has the context to understand why your proposal includes the view columns. Similarly, I'd like to know why you're proposing `softwareVersion` in the view definition.

On Tue, Aug 13, 2019 at 8:56 AM John Zhuge <[hidden email]> wrote:
Catalog support has been added to DSv2 along with a table catalog interface. Here I'd like to propose a view catalog interface, for the following benefit:
  • Abstraction for view management thus allowing different view backends
  • Disassociation of view definition storage from Hive Metastore
A catalog plugin can be both TableCatalog and ViewCatalog. Resolve an identifier as view first then table.

More details in SPIP and PR if we decide to proceed. Here is a quick glance at the API:

ViewCatalog interface:
  • loadView
  • listViews
  • createView
  • deleteView
View interface:
  • name
  • originalSql
  • defaultCatalog
  • defaultNamespace
  • viewColumns
  • owner
  • createTime
  • softwareVersion
  • options (map)
ViewColumn interface:
  • name
  • type

Thanks,
John Zhuge


--
Ryan Blue
Software Engineer
Netflix


--
John Zhuge
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] ViewCatalog interface for DSv2

cloud0fan
I'm fine with the view definition proposed here, but my major concern is how to make sure table/view share the same namespace. According to the SQL spec, if there is a view named "a", we can't create a table named "a" anymore.

We can add documents and ask the implementation to guarantee it, but it's better if this can be guaranteed by the API.

On Wed, Aug 14, 2019 at 1:46 AM John Zhuge <[hidden email]> wrote:
Thanks for the feedback, Ryan! I can share the WIP copy of the SPIP if that makes sense.

I can't find out a lot about view resolution and validation in SQL Spec Part1. Anybody with full SQL knowledge may chime in.

Here are my understanding based on online manuals, docs, and other resources:
  • A view has a name in the database schema so that other queries can use it like a table.
  • A view's schema is frozen at the time the view is created; subsequent changes to underlying tables (e.g. adding a column) will not be reflected in the view's schema. If an underlying table is dropped or changed in an incompatible fashion, subsequent attempts to query the invalid view will fail.
In Preso, view columns are used for validation only (see StatementAnalyzer.Visitor#isViewStale):
  • view column names must match the visible fields of analyzed view sql
  • the visible fields can be coerced to view column types
In Spark 2.2+, view columns are also used for validation (see CheckAnalysis#checkAnalysis case View):
  • view column names must match the output fields of the view sql
  • view column types must be able to UpCast to output field types
Rule EliminateView adds a Project to viewQueryColumnNames if it exists.

As for `softwareVersion`, the purpose is to track which software version is used to create the view, in preparation for different versions of the same software or even different softwares, such as Presto vs Spark.


On Tue, Aug 13, 2019 at 9:47 AM Ryan Blue <[hidden email]> wrote:
Thanks for working on this, John!

I'd like to see a more complete write-up of what you're proposing. Without that, I don't think we can have a productive discussion about this.

For example, I think you're proposing to keep the view columns to ensure that the same columns are produced by the view every time, based on requirements from the SQL spec. Let's start by stating what those behavior requirements are, so that everyone has the context to understand why your proposal includes the view columns. Similarly, I'd like to know why you're proposing `softwareVersion` in the view definition.

On Tue, Aug 13, 2019 at 8:56 AM John Zhuge <[hidden email]> wrote:
Catalog support has been added to DSv2 along with a table catalog interface. Here I'd like to propose a view catalog interface, for the following benefit:
  • Abstraction for view management thus allowing different view backends
  • Disassociation of view definition storage from Hive Metastore
A catalog plugin can be both TableCatalog and ViewCatalog. Resolve an identifier as view first then table.

More details in SPIP and PR if we decide to proceed. Here is a quick glance at the API:

ViewCatalog interface:
  • loadView
  • listViews
  • createView
  • deleteView
View interface:
  • name
  • originalSql
  • defaultCatalog
  • defaultNamespace
  • viewColumns
  • owner
  • createTime
  • softwareVersion
  • options (map)
ViewColumn interface:
  • name
  • type

Thanks,
John Zhuge


--
Ryan Blue
Software Engineer
Netflix


--
John Zhuge
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] ViewCatalog interface for DSv2

John Zhuge
Thanks for the feedback. I am preparing a doc and a PoC, will post soon.

On Mon, Oct 14, 2019 at 3:17 AM Wenchen Fan <[hidden email]> wrote:
I'm fine with the view definition proposed here, but my major concern is how to make sure table/view share the same namespace. According to the SQL spec, if there is a view named "a", we can't create a table named "a" anymore.

We can add documents and ask the implementation to guarantee it, but it's better if this can be guaranteed by the API.

On Wed, Aug 14, 2019 at 1:46 AM John Zhuge <[hidden email]> wrote:
Thanks for the feedback, Ryan! I can share the WIP copy of the SPIP if that makes sense.

I can't find out a lot about view resolution and validation in SQL Spec Part1. Anybody with full SQL knowledge may chime in.

Here are my understanding based on online manuals, docs, and other resources:
  • A view has a name in the database schema so that other queries can use it like a table.
  • A view's schema is frozen at the time the view is created; subsequent changes to underlying tables (e.g. adding a column) will not be reflected in the view's schema. If an underlying table is dropped or changed in an incompatible fashion, subsequent attempts to query the invalid view will fail.
In Preso, view columns are used for validation only (see StatementAnalyzer.Visitor#isViewStale):
  • view column names must match the visible fields of analyzed view sql
  • the visible fields can be coerced to view column types
In Spark 2.2+, view columns are also used for validation (see CheckAnalysis#checkAnalysis case View):
  • view column names must match the output fields of the view sql
  • view column types must be able to UpCast to output field types
Rule EliminateView adds a Project to viewQueryColumnNames if it exists.

As for `softwareVersion`, the purpose is to track which software version is used to create the view, in preparation for different versions of the same software or even different softwares, such as Presto vs Spark.


On Tue, Aug 13, 2019 at 9:47 AM Ryan Blue <[hidden email]> wrote:
Thanks for working on this, John!

I'd like to see a more complete write-up of what you're proposing. Without that, I don't think we can have a productive discussion about this.

For example, I think you're proposing to keep the view columns to ensure that the same columns are produced by the view every time, based on requirements from the SQL spec. Let's start by stating what those behavior requirements are, so that everyone has the context to understand why your proposal includes the view columns. Similarly, I'd like to know why you're proposing `softwareVersion` in the view definition.

On Tue, Aug 13, 2019 at 8:56 AM John Zhuge <[hidden email]> wrote:
Catalog support has been added to DSv2 along with a table catalog interface. Here I'd like to propose a view catalog interface, for the following benefit:
  • Abstraction for view management thus allowing different view backends
  • Disassociation of view definition storage from Hive Metastore
A catalog plugin can be both TableCatalog and ViewCatalog. Resolve an identifier as view first then table.

More details in SPIP and PR if we decide to proceed. Here is a quick glance at the API:

ViewCatalog interface:
  • loadView
  • listViews
  • createView
  • deleteView
View interface:
  • name
  • originalSql
  • defaultCatalog
  • defaultNamespace
  • viewColumns
  • owner
  • createTime
  • softwareVersion
  • options (map)
ViewColumn interface:
  • name
  • type

Thanks,
John Zhuge


--
Ryan Blue
Software Engineer
Netflix


--
John Zhuge


--
John Zhuge