dsv2 remaining work

dsv2 remaining work

rxin
Unfortunately, I can't make it to the DSv2 sync today, so I'm sending an email with my thoughts instead. I spent a few hours thinking about this. Progress has clearly been slow: this is an important API, people come to it with very different requirements, and they weight the priorities very differently (e.g. an issue that is critical to one person may matter little to another, so people talk past each other, arguing about why a PR or proposal ignored some broader issue).

I think the only real way to make progress is to decouple the efforts into major areas, and make progress somewhat independently. Of course, some care is needed to take care of

Here's one attempt at listing some of the remaining big rocks:

1. Basic write API -- with the current SaveMode.

2. Add Overwrite (or Replace) logical plan, and the associated API in Table.

3. Add APIs for per-table metadata operations (note that I'm not calling it a catalog API here). Create/drop/alter table goes here. We also need to figure out how to do this for file system sources, which have no underlying catalog. One idea is to treat the file system as a catalog (with arbitrary levels of databases). To do that, it'd be great if the identifier for a table were not a fixed two- or three-part name, but just a string array.

4. Remove SaveMode. This is blocked on at least 1 + 2, and potentially 3.

5. Design a stable, fast row format with a smaller API surface to replace the existing InternalRow (and all the internal data types), which is internal and unstable. This can be further decoupled into the design for each data type.
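
To make the string-array idea in item 3 concrete, here is a minimal sketch in plain Python (not Spark code). It assumes a hypothetical `TableIdentifier` class and a hypothetical `identifier_from_path` helper, neither of which exists in Spark under these names; it only illustrates how a file system path could map onto an identifier with arbitrarily many namespace levels.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Tuple


@dataclass(frozen=True)
class TableIdentifier:
    """Hypothetical identifier: any number of name parts, not a fixed 2 or 3."""
    parts: Tuple[str, ...]

    def __post_init__(self):
        if not self.parts:
            raise ValueError("an identifier needs at least one part")

    @property
    def namespace(self) -> Tuple[str, ...]:
        # Everything before the last part plays the role of nested databases.
        return self.parts[:-1]

    @property
    def name(self) -> str:
        # The last part is the table name itself.
        return self.parts[-1]


def identifier_from_path(path: str) -> TableIdentifier:
    """Treat each directory level of a POSIX-style path as one namespace level."""
    parts = tuple(p for p in PurePosixPath(path).parts if p != "/")
    return TableIdentifier(parts)


# A classic two-part name and a deep file system path use the same type.
catalog_table = TableIdentifier(("db", "events"))
fs_table = identifier_from_path("/warehouse/sales/2018/orders")
print(catalog_table.namespace, catalog_table.name)
print(fs_table.namespace, fs_table.name)
```

With this shape, a catalog-backed source and a path-based source could share one identifier type, which is the property item 3 asks for.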

The above are the big ones I can think of. I probably missed some, but a lot of other smaller things can be improved on later.






Re: dsv2 remaining work

Ryan Blue
We discussed this issue in the sync. I'll be sending out a summary later today, but we came to a conclusion on some of these.

For #1, there are 2 parts: the design and the implementation. We agreed that the design should not include SaveMode. The implementation may include SaveMode until we can replace it with Overwrite, #2. We decided to create a release-blocking issue to remove SaveMode, so the DataSourceV2 redesign will not ship in a release unless SaveMode has been removed from the read/write API (not the public API).

Let's continue discussions on #3. I don't think removing SaveMode needs to be blocked by this because the justification for keeping SaveMode was to not break existing tests. Existing tests only rely on overwrite. I agree that CTAS is important and I'd prefer to get that in before a release as well, though we didn't talk about that.
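
As a rough illustration of what replacing SaveMode with explicit plans could look like, here is a toy sketch in plain Python. All names (`InMemoryTable`, `Append`, `Overwrite`, `execute`) are hypothetical and are not the actual Spark classes; the point is only that each write operation becomes its own plan node with one well-defined behavior, instead of a mode flag that each source interprets for itself.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Row = Tuple  # toy row representation, just a Python tuple


@dataclass
class InMemoryTable:
    """Hypothetical table exposing one method per write operation."""
    rows: List[Row] = field(default_factory=list)

    def append(self, new_rows: List[Row]) -> None:
        self.rows.extend(new_rows)

    def overwrite(self, new_rows: List[Row]) -> None:
        self.rows = list(new_rows)


@dataclass
class Append:
    """Plan node: add rows to the existing contents."""
    table: InMemoryTable
    rows: List[Row]


@dataclass
class Overwrite:
    """Plan node: replace the existing contents with the given rows."""
    table: InMemoryTable
    rows: List[Row]


def execute(plan) -> None:
    # Dispatch on the plan type; there is no mode flag to interpret.
    if isinstance(plan, Append):
        plan.table.append(plan.rows)
    elif isinstance(plan, Overwrite):
        plan.table.overwrite(plan.rows)
    else:
        raise TypeError(f"unsupported plan: {type(plan).__name__}")


table = InMemoryTable()
execute(Append(table, [(1, "a")]))
execute(Append(table, [(2, "b")]))
execute(Overwrite(table, [(3, "c")]))
print(table.rows)
```

In this sketch, tests that only rely on overwrite behavior would need nothing beyond the `Overwrite` node.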

rb

On Wed, Dec 12, 2018 at 4:58 PM Reynold Xin <[hidden email]> wrote:
--
Ryan Blue
Software Engineer
Netflix