DataSourceV2 sync notes - 2 October 2019

DataSourceV2 sync notes - 2 October 2019

Ryan Blue

Here are my notes from last week's DSv2 sync.

Attendees:

Ryan Blue
Terry Kim
Wenchen Fan

Topics:

Discussion:

  • Update identifier and table resolution
    • Wenchen: will not handle SPARK-29014; it is a pure refactor
    • Ryan: I think this should separate the v2 rules from the v1 fallback, to keep table and identifier resolution separate. The only time that table resolution needs to be done at the same time is for v1 fallback.
    • This was merged last week
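A minimal sketch of the separation discussed above, assuming hypothetical names (this is not Spark's actual code): identifier resolution maps a multi-part name to a (catalog, identifier) pair first, and table resolution mixes in only for the v1 fallback path, when the name lands in the session catalog.

```python
# Hypothetical sketch: resolve the identifier first; fall back to v1
# table lookup only when the session catalog is responsible.

SESSION_CATALOG = "spark_catalog"

def resolve_identifier(parts, catalogs):
    """Split ["cat", "db", "tbl"] into (catalog_name, remaining_parts)."""
    if len(parts) > 1 and parts[0] in catalogs:
        return parts[0], parts[1:]
    # No explicit catalog: assume the session catalog.
    return SESSION_CATALOG, parts

def resolve_table(parts, catalogs, v1_lookup):
    catalog, ident = resolve_identifier(parts, catalogs)
    if catalog == SESSION_CATALOG:
        # v1 fallback: the only place identifier and table resolution mix.
        return v1_lookup(ident)
    return catalogs[catalog].get(".".join(ident))

catalogs = {"spark_catalog": {}, "testcat": {"db.t": "v2-table"}}
print(resolve_table(["testcat", "db", "t"], catalogs,
                    lambda i: "v1:" + ".".join(i)))  # v2-table
print(resolve_table(["db", "t"], catalogs,
                    lambda i: "v1:" + ".".join(i)))  # v1:db.t
```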
  • Update to use spark_catalog
  • Early DSv2 pushdown
    • Ryan: this depends on fixing a few more tests. To validate there are no calls to computeStats with the DSv2 relation, I’ve temporarily removed the method. Other than a few remaining test failures where the old relation was expected, it looks like there are no uses of computeStats before early pushdown in the optimizer.
    • Wenchen: agreed that the batch was in the correct place in the optimizer
    • Ryan: once tests are passing, will add the computeStats implementation back, using Utils.isTesting so that a call before early pushdown fails during testing but not at runtime
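The "fail only under test" pattern above can be sketched as follows. This is a hypothetical illustration, not Spark's implementation: Spark checks a similar flag via Utils.isTesting, while here an environment variable stands in for it.

```python
# Hypothetical sketch: computeStats raises in test runs if called before
# early pushdown, but degrades to a conservative default at runtime.
import os

def is_testing():
    # Stand-in for Utils.isTesting: treat a set env var as "in a test run".
    return os.environ.get("SPARK_TESTING") is not None

class DataSourceV2Relation:
    def __init__(self):
        self.pushdown_done = False

    def compute_stats(self):
        if not self.pushdown_done:
            if is_testing():
                raise AssertionError("computeStats called before early pushdown")
            # Conservative default so production queries still plan.
            return {"size_in_bytes": 2**63 - 1}
        return {"size_in_bytes": 1024}  # real stats once pushdown has run
```

The point of the pattern is that test suites surface any rule that still asks for stats too early, without risking a hard failure for users at runtime.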
  • Wenchen: when using v2, there is no way to configure custom options for a JDBC table. For v1, the table was created and stored in the session catalog, at which point Spark-specific properties like parallelism could be stored. In v2, the catalog is the source of truth, so tables don’t get created in the same way. Options are only passed in a create statement.
    • Ryan: this could be fixed by allowing users to pass options as table properties. We mix the two today, but if we used a prefix for table properties, “options.”, then you could use SET TBLPROPERTIES to get around this. That’s also better for compatibility. I’ll open a PR for this.
    • Ryan: this could also be solved by adding an OPTIONS clause or hint to SELECT
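The prefix idea can be sketched with a small helper. This is a hypothetical illustration of the proposal, not an existing Spark API: options live among table properties under a reserved "options." prefix, so a statement like ALTER TABLE t SET TBLPROPERTIES ('options.fetchsize'='100') could change read/write options after table creation.

```python
# Hypothetical sketch: split a table's property map into reader/writer
# options (under the reserved "options." prefix) and plain properties.
OPTION_PREFIX = "options."

def split_properties(props):
    options = {k[len(OPTION_PREFIX):]: v for k, v in props.items()
               if k.startswith(OPTION_PREFIX)}
    table_props = {k: v for k, v in props.items()
                   if not k.startswith(OPTION_PREFIX)}
    return options, table_props

options, rest = split_properties({"options.fetchsize": "100", "owner": "etl"})
print(options)  # {'fetchsize': '100'}
print(rest)     # {'owner': 'etl'}
```

The reserved prefix is what makes the scheme compatible: properties and options no longer collide in one namespace.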
  • Wenchen: There are commands without v2 statements. We should add v2 statements to reject non-v1 uses.
    • Ryan: Doesn’t the parser only parse up to two name parts for these commands? That would handle the majority of cases.
    • Wenchen: Yes, but there is still a problem for identifiers with one part inside a v2 catalog, like catalog.table. Commands that don’t support v2 will resolve catalog.table as database.table in the v1 catalog.
    • Ryan: Sounds like a good plan to update the parser and add statements for these. Do we have a list of commands to update?
    • Wenchen: REFRESH TABLE, ANALYZE TABLE, ALTER TABLE PARTITION, etc. Will open an umbrella JIRA with a list.
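The ambiguity behind this discussion can be illustrated with a toy parser (hypothetical, not Spark's grammar): a v1-only command accepts at most two name parts, so testcat.t is read as database testcat plus table t in the session catalog, instead of table t in the v2 catalog testcat.

```python
# Hypothetical illustration: how a two-part v1 identifier misreads a
# v2 catalog name as a database name.

def parse_v1_identifier(sql_name):
    parts = sql_name.split(".")
    if len(parts) > 2:
        raise ValueError("too many name parts for a v1 command: " + sql_name)
    db = parts[0] if len(parts) == 2 else None
    return {"database": db, "table": parts[-1]}

def parse_multipart_identifier(sql_name):
    # A v2 statement keeps all parts and lets the analyzer pick the catalog.
    return sql_name.split(".")

print(parse_v1_identifier("testcat.t"))        # {'database': 'testcat', 'table': 't'}
print(parse_multipart_identifier("testcat.t")) # ['testcat', 't']
```

Adding v2 statements for these commands means the multi-part form is parsed everywhere, and the analyzer, not the parser, decides whether the first part is a catalog.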
--
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2 sync notes - 2 October 2019

cloud0fan
Hi Ryan,

Thanks for summarizing and sending out the notes! I've created the JIRA ticket to add v2 statements for all the commands that need to resolve a table: https://issues.apache.org/jira/browse/SPARK-29481

Contributions to it are appreciated!

Thanks,
Wenchen

On Fri, Oct 11, 2019 at 7:05 AM Ryan Blue <[hidden email]> wrote:
