SQL DDL statements when replacing the default catalog with a custom catalog


SQL DDL statements when replacing the default catalog with a custom catalog

Jungtaek Lim
Hi devs,

I'm not sure whether it's addressed in Spark 3.1, but at least as of Spark 3.0.1, many SQL DDL statements don't seem to go through the custom catalog when I replace the default catalog with a custom catalog and provide only 'dbName.tableName' as the table identifier.
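For reference, a rough sketch of the setup I'm describing (the catalog implementation class name is just a placeholder):

  import org.apache.spark.sql.SparkSession

  // Minimal sketch of the setup described above; com.example.MyCatalog stands in
  // for the custom catalog implementation.
  val spark = SparkSession.builder()
    .master("local[*]")
    // Replace the default session catalog ('spark_catalog') with the custom one.
    .config("spark.sql.catalog.spark_catalog", "com.example.MyCatalog")
    .getOrCreate()

  // Only 'dbName.tableName' is provided, so the identifier resolves against the
  // default (now custom) catalog - but many DDL statements still take the v1 path.
  spark.sql("DROP TABLE dbName.tableName")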

I'm not an expert in this area, but after skimming the code, TempViewOrV1Table looks broken for this case, as the identifier it matches can still refer to a V2 table. Classifying the table identifier as either a V2 table or a "temp view or v1 table" looks mandatory, since the two take different code paths and different catalog interfaces.

That sounds to me like a dead end, and the only "clear" approach seems to be disallowing replacement of the default catalog with a custom one. Am I missing something?

Thanks,
Jungtaek Lim (HeartSaVioR)

Re: SQL DDL statements when replacing the default catalog with a custom catalog

Ryan Blue
I've hit this with `DROP TABLE` commands that should be passed to a registered v2 session catalog, but are handled by v1. I think that's the only case we hit in our downstream test suites, but we haven't been exploring the use of a session catalog for fallback. We use v2 for everything now, which avoids the problem and comes with multi-catalog support.
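For illustration, roughly what we do instead (the catalog name 'mycat' and the implementation class are placeholders):

  import org.apache.spark.sql.SparkSession

  // Sketch only: register a separate named v2 catalog instead of replacing
  // 'spark_catalog'; the name and class are placeholders.
  val spark = SparkSession.builder()
    .master("local[*]")
    .config("spark.sql.catalog.mycat", "com.example.MyTableCatalog")
    .getOrCreate()

  // Catalog-qualified identifiers always resolve through the v2 code path.
  spark.sql("DROP TABLE mycat.db.tbl")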

--
Ryan Blue
Software Engineer
Netflix

Re: SQL DDL statements when replacing the default catalog with a custom catalog

Jungtaek Lim
The logical plan for the parsed statement gets converted to either the old (v1) command or the v2 one, and the former keeps using the external catalog (Hive) - so replacing the default session catalog with a custom one and expecting it to be used like the external catalog doesn't work, which defeats the purpose of replacing the default session catalog.

Btw I see one possible approach: in TempViewOrV1Table, if the identifier matches SessionCatalogAndIdentifier and the catalog is a TableCatalog, call loadTable on the catalog and check whether the result is a V1 table. I'm not sure it's a viable approach though, as it requires loading the table during resolution of the table identifier.
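A very rough sketch of the idea (illustration only - the real change would be in the analyzer's extractors, and the class-name check just stands in for matching the internal V1Table wrapper):

  import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog}

  // Decide "v1 vs v2" by actually loading the table from the (possibly custom)
  // session catalog and inspecting what comes back.
  def looksLikeV1Table(catalog: TableCatalog, ident: Identifier): Boolean = {
    val table = catalog.loadTable(ident)
    table.getClass.getSimpleName == "V1Table"
  }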


Re: SQL DDL statements when replacing the default catalog with a custom catalog

cloud0fan
Not all DDL commands support the v2 catalog APIs (e.g. CREATE TABLE LIKE), so it's possible that some commands still go through the v1 session catalog even though you've configured a custom v2 session catalog.

Can you create JIRA tickets if you hit any DDL commands that don't support v2 catalog? We should fix them.


Re: SQL DDL statements when replacing the default catalog with a custom catalog

Jungtaek Lim
My case is DROP TABLE, and DROP TABLE supports both v1 and v2 (it simply works when I use the custom catalog without replacing the default catalog).

It only fails when the "default catalog" is replaced (say I replace 'spark_catalog'), because TempViewOrV1Table still matches even for a v2 table, and then Catalyst goes down the v1 exec path. I guess all commands that leverage TempViewOrV1Table to determine whether a table is v1 or v2 would suffer from this issue.


Re: SQL DDL statements when replacing the default catalog with a custom catalog

cloud0fan
Ah, this is by design. V1 tables should still go through the v1 session catalog. I think we can remove this restriction when we are confident about the new v2 DDL commands that work with v2 catalog APIs.


Re: SQL DDL statements when replacing the default catalog with a custom catalog

Jungtaek Lim
If this is by design but not ready yet, then IMHO replacing the default session catalog is better restricted until things are sorted out, as it causes quite a bit of confusion and has known bugs. Actually there's another bug/limitation in the default session catalog around the length of the identifier, so things that work with a custom catalog no longer work when it replaces the default session catalog.


Re: SQL DDL statements when replacing the default catalog with a custom catalog

cloud0fan
If you just want to save typing the catalog name when writing table names, you can set your custom catalog as the default catalog (See SQLConf.DEFAULT_CATALOG). SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION is used to extend the v1 session catalog, not replace it.
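Roughly, assuming `spark` is an existing SparkSession (catalog name and classes are placeholders):

  // Option 1: register a named v2 catalog and make it the default, so unqualified
  // table names resolve against it (SQLConf.DEFAULT_CATALOG = "spark.sql.defaultCatalog").
  spark.conf.set("spark.sql.catalog.mycat", "com.example.MyTableCatalog")
  spark.conf.set("spark.sql.defaultCatalog", "mycat")

  // Option 2: extend (not replace) the built-in session catalog
  // (SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION = "spark.sql.catalog.spark_catalog").
  spark.conf.set("spark.sql.catalog.spark_catalog", "com.example.MySessionCatalogExtension")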


Re: SQL DDL statements when replacing the default catalog with a custom catalog

Ryan Blue

I disagree that this is “by design”. An operation like DROP TABLE should use a v2 drop plan if the table is v2.

If a v2 table is loaded or created using a v2 catalog it should also be dropped that way. Otherwise, the v2 catalog is not notified when the table is dropped and can’t perform other necessary updates, like invalidating caches or dropping state outside of Hive. V2 tables should always use the v2 API, and I’m not aware of a design where that wasn’t the case.

I’d also say that for DROP TABLE in particular, all calls could use the v2 catalog. We may not want to do this until we are confident, as Wenchen said, but this would be the simpler solution. The v2 catalog can delegate to the old session catalog, after all.
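For example, a rough sketch using DelegatingCatalogExtension (the class name and the invalidation hook are placeholders):

  import org.apache.spark.sql.connector.catalog.{DelegatingCatalogExtension, Identifier}

  // Sketch only: a session catalog extension that gets notified on DROP TABLE and
  // then delegates to the built-in session catalog.
  class NotifyingSessionCatalog extends DelegatingCatalogExtension {
    override def dropTable(ident: Identifier): Boolean = {
      invalidateState(ident)   // placeholder: clear caches / external state for this table
      super.dropTable(ident)   // delegate to the old session catalog
    }

    private def invalidateState(ident: Identifier): Unit = ()
  }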


--
Ryan Blue
Software Engineer
Netflix

Re: SQL DDL statements when replacing the default catalog with a custom catalog

Jungtaek Lim
> If you just want to save typing the catalog name when writing table names, you can set your custom catalog as the default catalog (See SQLConf.DEFAULT_CATALOG). SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION is used to extend the v1 session catalog, not replace it.

I'm sorry, but I don't get this.

The custom session catalog I use for V2_SESSION_CATALOG_IMPLEMENTATION is intended to go to a specific (v2) provider first and fall back to Spark's catalog if the table doesn't exist in that provider. If that is not the design intent of V2_SESSION_CATALOG_IMPLEMENTATION then OK (it should probably be documented somewhere), but the implementation doesn't receive any method calls at all, so it's a no-op even if it is only meant to extend the V1 session catalog.
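To be concrete, the intent is roughly this (sketch only; loadFromMyProvider is a placeholder for the provider-specific lookup):

  import org.apache.spark.sql.connector.catalog.{DelegatingCatalogExtension, Identifier, Table}

  // Try the custom (v2) provider first, then fall back to Spark's built-in
  // session catalog for tables the provider doesn't know about.
  class FallbackSessionCatalog extends DelegatingCatalogExtension {
    override def loadTable(ident: Identifier): Table =
      loadFromMyProvider(ident).getOrElse(super.loadTable(ident))

    // placeholder: look the table up in the custom provider
    private def loadFromMyProvider(ident: Identifier): Option[Table] = None
  }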

My understanding is that V1 commands leverage sparkSession.sessionState.catalog, which doesn't seem to know about the extended session catalog. It just uses the ExternalCatalog, which sticks to the Spark built-in one. In other words, the functionality is only partially working. Is this something we should fix for Spark 3.0.2/3.1.0, or is it better to disable the feature until we ensure it works for all commands?

