DataSourceV2 sync notes - 12 June 2019

Ryan Blue

Here are the latest DSv2 sync notes. Please reply with updates or corrections.


Attendees:
Ryan Blue
Michael Armbrust
Gengliang Wang
Matt Cheah
John Zhuge


Topics:
Wenchen’s reorganization proposal
Problems with TableProvider - property map isn’t sufficient

New PRs:


  • Wenchen’s organization proposal
    • Ryan: Wenchen proposed using `org.apache.spark.sql.connector.{catalog, expressions, read, write, extensions}`
    • Ryan: I’m not sure we need extensions, but otherwise it looks good to me
    • Matt: This is in the catalyst module, right?
    • Ryan: Right. The API is in catalyst. The extensions package would be used for any parts that need to be in SQL, but hopefully there aren’t any.
    • Consensus was to go with the proposed organization
  • Problems with TableProvider:
    • Gengliang: CREATE TABLE with an ORC v2 table can’t report its schema because there are no files
    • Ryan: We hit this when trying to use ORC in SQL unit tests for v2. The problem is that the source can’t be passed the schema and other information
    • Gengliang: Schema could be passed using the userSpecifiedSchema arg
    • Ryan: The user schema is for cases where the data lacks specific types and a user supplies them, like CSV. I don’t think it makes sense to reuse that to pass the schema from the catalog
    • Ryan: Other table metadata should be passed as well, like partitioning, so sources don’t infer it. I think this requires some thought. Anyone want to volunteer?
    • No one volunteered to fix the problem
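
The TableProvider gap discussed above can be illustrated with a toy sketch. These interfaces are hypothetical stand-ins, not the actual Spark API: a provider that only receives a property map must infer its schema from data files, which fails for a fresh CREATE TABLE, while a provider handed the catalog's schema and partitioning never needs to infer.

```scala
// Hypothetical sketch of the TableProvider gap; illustrative only.
case class Schema(fields: Seq[String])
case class Table(schema: Schema, partitioning: Seq[String])

// Today: the provider only gets a property map, so a file source must
// infer its schema by reading data files.
trait TableProvider {
  def getTable(properties: Map[String, String]): Table
}

object InferringOrcProvider extends TableProvider {
  def getTable(properties: Map[String, String]): Table = {
    val files = properties.getOrElse("files", "")
    if (files.isEmpty)
      // A fresh CREATE TABLE has no files yet, so inference fails.
      throw new IllegalStateException("cannot infer schema: no files")
    Table(Schema(files.split(",").toSeq), Nil) // stand-in for inference
  }
}

// Proposed direction: the catalog passes schema and partitioning in,
// so the source never has to infer them.
trait CatalogAwareProvider {
  def getTable(schema: Schema,
               partitioning: Seq[String],
               properties: Map[String, String]): Table =
    Table(schema, partitioning)
}

object CatalogAwareOrcProvider extends CatalogAwareProvider
```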
  • ReplaceTable PR
    • Matt: Needs another update after comments, but about ready
    • Ryan: I agree it is almost ready to commit. I should point out that this includes a default implementation that will leave a table deleted if the write fails. I think this is expected behavior because REPLACE is a DROP combined with CTAS
    • Michael: Sources should be able to opt out of that behavior
    • Ryan: We want to ensure consistent behavior across sources
    • Resolution: sources can implement the staging and throw an exception if they choose to opt out
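
The REPLACE TABLE semantics discussed above can be sketched with a toy catalog (illustrative names, not the real Spark API): the default is DROP plus CTAS, so a failed write leaves the table deleted; a staging catalog builds the replacement aside and keeps the old table on failure; a source opting out simply throws.

```scala
import scala.collection.mutable

// Toy sketch of the REPLACE TABLE behaviors discussed in the notes.
class ToyCatalog {
  val tables = mutable.Map[String, Seq[String]]()

  // Default REPLACE: DROP + CTAS. If the write fails, the table
  // stays deleted -- the expected non-atomic behavior.
  def replaceTable(name: String, write: () => Seq[String]): Unit = {
    tables.remove(name)
    tables(name) = write() // on failure, name is already gone
  }
}

// A catalog that supports staging makes REPLACE atomic: the new
// table is written aside and only swapped in once the write succeeds.
class StagingToyCatalog extends ToyCatalog {
  override def replaceTable(name: String, write: () => Seq[String]): Unit = {
    val staged = write() // failure here leaves the old table intact
    tables(name) = staged
  }
}

// A source that opts out of REPLACE entirely just throws.
class NoReplaceCatalog extends ToyCatalog {
  override def replaceTable(name: String, write: () => Seq[String]): Unit =
    throw new UnsupportedOperationException("REPLACE TABLE not supported")
}
```

With the default catalog a failing write leaves the table dropped; with the staging variant the old data survives the same failure.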
  • V2 table resolution:
    • John: this should be ready to go, only minor comments from Dongjoon left
    • This was merged the next day
  • V2 session catalog
    • Ryan: When testing, we realized that if a default catalog is used for v2 sources (like ORC v2) then you can run CREATE TABLE, which goes to some v2 catalog, but then you can’t load the same table using the same name because the session catalog doesn’t have it.
    • Ryan: To fix this, we need a v2 catalog that delegates to the session catalog. This should be used for all v2 operations when the session catalog can’t be used.
    • Ryan: Then the v2 default catalog should be used instead of the session catalog when it is set. This provides a smooth transition from the session catalog to v2 catalogs.
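
The resolution order described above can be sketched as follows (hypothetical types, not the real Spark API): v2 operations go through a catalog that delegates to the session catalog, so CREATE TABLE and a later lookup by the same name agree, and a configured default v2 catalog takes precedence when set.

```scala
import scala.collection.mutable

// Illustrative sketch of the v2 session catalog delegation.
trait Catalog {
  def create(name: String, schema: String): Unit
  def load(name: String): Option[String]
}

class SessionCatalog extends Catalog {
  private val tables = mutable.Map[String, String]()
  def create(name: String, schema: String): Unit = tables(name) = schema
  def load(name: String): Option[String] = tables.get(name)
}

// A v2 facade that forwards to the session catalog, so tables created
// through v2 paths are visible to session-catalog lookups and vice versa.
class V2SessionCatalog(session: SessionCatalog) extends Catalog {
  def create(name: String, schema: String): Unit = session.create(name, schema)
  def load(name: String): Option[String] = session.load(name)
}

// Resolution: prefer the configured default v2 catalog when it is set;
// otherwise fall back to the session-catalog delegate.
def resolveCatalog(defaultV2: Option[Catalog], session: SessionCatalog): Catalog =
  defaultV2.getOrElse(new V2SessionCatalog(session))
```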
  • Gengliang: another topic: decimals
    • Gengliang: v2 doesn’t insert unsafe casts, so literals in SQL cannot be inserted to double/float columns
    • Michael: Shouldn’t queries use decimal literals so that floating point literals can be floats? What do other databases do?
    • Matt: is this a v2 problem?
    • Ryan: this is not specific to v2 and was discovered when converting v1 to use the v2 output rules
    • Ryan: we could add a new decimal type that doesn’t lose data but is allowed to be cast because it can only be used for literals where the intended type is unknown. There is precedent for this in the parser with Hive char and varchar types.
    • Conclusion: This isn’t really a v2 problem
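
The literal-decimal idea floated above can be sketched with a toy type check. The types and rules are illustrative, not Spark's actual type system: a decimal that came from a literal, where the intended type is unknown, is allowed to write to a double column, while an ordinary decimal column still is not.

```scala
// Toy sketch of the literal-only decimal type idea; illustrative only.
sealed trait DataType
case object DoubleType extends DataType
case class DecimalType(precision: Int, scale: Int) extends DataType
// A decimal produced by a literal, where the intended type is unknown.
case class LiteralDecimalType(precision: Int, scale: Int) extends DataType

// v2 insert rules reject unsafe casts, so a plain decimal cannot be
// written to a double column; a literal decimal may be.
def canWriteTo(from: DataType, to: DataType): Boolean = (from, to) match {
  case (a, b) if a == b                    => true
  case (_: LiteralDecimalType, DoubleType) => true  // intended type unknown
  case (_: DecimalType, DoubleType)        => false // unsafe: may lose data
  case _                                   => false
}
```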
  • Michael: Any work so far on MERGE INTO?
    • Ryan: Not yet, but feel free to make a proposal and start working
    • Ryan: Do you also need to pass extra metadata with each row?
    • Michael: No, this should be delegated to the source
    • Matt: That would be operator push-down
    • Ryan: I agree, that’s operator push-down. It would be great to hear how that would work, but I think MERGE INTO should have a default implementation. It should be supported across sources instead of in just one so we have a reference implementation.
    • Michael: Having only a reference implementation was the problem with v1. The behavior should be written down in a spec. Hive has a reasonable implementation to follow.
    • Ryan: Yes, but it is still valuable to have a reference implementation. And of course a spec is needed.
  • Matt: what does the roadmap look like for finishing in time for Spark 3.0?
    • Ryan: Looking good if we get v2 table resolution soon. Then major features are ALTER TABLE, INSERT INTO, and the new v2 API
    • Ryan: There are also DESCRIBE TABLE, SHOW (TABLES/etc), and others that are low priority, but would be nice to get done. Those will require the namespace support PR. (Everyone, please contribute here!)

Ryan Blue
Software Engineer