DataSourceV2 sync notes - 10 July 2019


Ryan Blue

Here are my notes from the last sync. If you’d like to be added to the invite or have topics, please let me know.

Attendees:

Ryan Blue
Matt Cheah
Yifei Huang
Jose Torres
Burak Yavuz
Gengliang Wang
Michael Artz
Russel Spitzer

Topics:

Discussion:

  • ALTER TABLE PR is ready to commit (and was committed after the sync)
  • REPLACE and RTAS PR: waiting on more reviews
  • INSERT INTO PR: Ryan will review
  • DESCRIBE TABLE has test failures, Matt will fix
  • V2 session catalog:
    • How will v2 catalog be configured?
    • Ryan: This is up for discussion because it currently uses a table property. I think it needs to be configurable
    • Burak: Agree that it should be configurable
    • Ryan: Does this need to be determined now, or can we solve this after getting the functionality in?
    • Jose: let’s get it in and fix it later
  • Stats integration:
    • Matt: has anyone looked at stats integration? What needs to be done?
    • Ryan: stats are part of the Scan API: configure a scan with a ScanBuilder, then get stats from the resulting Scan (a sketch follows these notes). The problem is that this happens when converting to the physical plan, after the optimizer, yet it is the optimizer that decides what gets broadcast. A work-around Netflix uses is to run push-down in the stats code. This runs push-down twice and was rejected from Spark, but it is important for performance. We should add a property to enable it.
    • Ryan: The larger problem is that stats are used in the optimizer, but push-down happens when converting to the physical plan. This is also related to our earlier discussions about when join types are chosen. Fixing this is a big project.
  • CTAS and DataFrameWriter behavior
    • Burak: DataFrameWriter uses CTAS where it shouldn’t. It is difficult to predict v1 behavior
    • Ryan: Agree, v1 DataFrameWriter does not have clear behavior. We suggest a replacement with a clear verb for each SQL action: append/insert, overwrite, overwriteDynamic, create (table), replace (table)
    • Ryan: Prototype available here: https://gist.github.com/rdblue/6bc140a575fdf266beb2710ad9dbed8f (a usage sketch follows these notes)
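
For reference, a minimal sketch of how a source reports stats through the Scan API, as described above. Package names are those of the Spark 3.x connector API, and the class names and numbers are hypothetical; treat this as a sketch, not a definitive implementation:

    import java.util.OptionalLong

    import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, Statistics, SupportsReportStatistics}
    import org.apache.spark.sql.types.StructType

    // Hypothetical source. Stats come from the Scan, so Spark can only ask
    // for them after the ScanBuilder has produced one, which today happens
    // during conversion to the physical plan, after the optimizer has run.
    class ExampleScanBuilder(schema: StructType) extends ScanBuilder {
      override def build(): Scan = new ExampleScan(schema)
    }

    class ExampleScan(schema: StructType) extends Scan with SupportsReportStatistics {
      override def readSchema(): StructType = schema

      // Estimates should already reflect pushed filters; the numbers here
      // are made up for the sketch.
      override def estimateStatistics(): Statistics = new Statistics {
        override def sizeInBytes(): OptionalLong = OptionalLong.of(10L * 1024 * 1024)
        override def numRows(): OptionalLong = OptionalLong.of(100000L)
      }
    }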
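
And a usage sketch of a writer with one explicit verb per SQL action. The method names follow what eventually shipped as Spark's DataFrameWriterV2 (the overwriteDynamic verb listed above landed as overwritePartitions); the table names are made up, and the actual prototype is in the linked gist:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("writer-sketch").getOrCreate()
    import spark.implicits._

    val df = spark.table("db.source")

    // One explicit verb per SQL action, instead of v1's save modes:
    df.writeTo("db.events").append()                            // append rows (INSERT INTO)
    df.writeTo("db.events").overwrite($"day" === "2019-07-10")  // overwrite rows matching a filter
    df.writeTo("db.events").overwritePartitions()               // dynamic partition overwrite
    df.writeTo("db.events").create()                            // CTAS
    df.writeTo("db.events").replace()                           // RTAS
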
--
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2 sync notes - 10 July 2019

cloud0fan (Wenchen Fan)
Hi Ryan,

Thanks for summarizing and sending out the meeting notes! Unfortunately, I missed the last sync, but the topics are really interesting, especially the stats integration.

The ideal solution I can think of is to refactor the optimizer/planner and move all the stats-based optimizations to the physical plan phase (or do them during planning). This needs a lot of design work, and I'm not sure we can finish it in the near future.

Alternatively, we can do the operator pushdown at the logical plan phase, via the optimizer. This is not ideal, but I think it is a better workaround than doing pushdown twice. Parquet nested column pruning is also done at the logical plan phase, so I don't see any serious problem with doing operator pushdown there as well.
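
To make the phase question concrete, here is a self-contained toy model of why pushdown in the optimizer helps: once the filter is folded into the scan, any stats-based decision made afterwards (e.g. broadcast join selection) sees post-push-down estimates. None of these classes are Spark's; the names and the halving heuristic are purely illustrative (runs as-is in a Scala REPL):

    sealed trait Plan { def stats: Long } // stats = estimated row count

    case class Scan(pushed: List[String], baseRows: Long) extends Plan {
      // crude model: each pushed predicate halves the estimate
      def stats: Long = pushed.foldLeft(baseRows)((rows, _) => rows / 2)
    }

    case class Filter(predicate: String, child: Plan) extends Plan {
      def stats: Long = child.stats / 2
    }

    // the "optimizer rule": collapse Filter-over-Scan into a Scan that owns
    // the predicate, so the scan's own estimate reflects the push-down
    def pushDown(plan: Plan): Plan = plan match {
      case Filter(p, child) =>
        pushDown(child) match {
          case s: Scan => s.copy(pushed = p :: s.pushed)
          case other   => Filter(p, other)
        }
      case other => other
    }

    val before = Filter("day = '2019-07-10'", Scan(Nil, baseRows = 1000000L))
    val after  = pushDown(before)
    assert(after.stats == before.stats) // same estimate, now reported by the scan itself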

This is only about the internal implementation, so we can fix it at any time. But it may hurt data source v2 performance a lot, so we'd better fix it sooner rather than later.


Re: DataSourceV2 sync notes - 10 July 2019

Ryan Blue
I agree that the long-term solution is much farther away, but I'm not sure it is a good idea to do this in the optimizer. Maybe we could find a good way to do it, but the complexity we had before push-down was moved to the physical plan conversion was really bad. Plus, this has been outstanding for probably a year now, so I am not confident that the long-term solution would be a priority; it seems to me that band-aid solutions persist for far too long.

--
Ryan Blue
Software Engineer
Netflix