DSv1 removal


DSv1 removal

Gabor Somogyi
Hi All,

  I've taken a look at the code and docs to find out when DSv1 sources have to be removed (in cases where a DSv2 replacement is implemented). After some digging I've found DSv1 sources that have already been removed, but in some cases v1 and v2 still exist in parallel.

Can somebody please tell me what's the overall plan in this area?

BR,
G


Re: DSv1 removal

Ryan Blue
Hi Gabor,

First, a little context... one of the goals of DSv2 is to standardize the behavior of SQL operations in Spark. For example, running CTAS when a table already exists will fail, rather than taking some action that depends on what the source chooses, like drop & CTAS, inserting, or failing.
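To make that concrete, here is a minimal sketch of the standardized CTAS semantics. The `Catalog` class and `createTableAs` helper are hypothetical illustrations for this thread, not Spark's actual catalog API:

```scala
// Sketch of the standardized v2 CTAS semantics described above.
// `Catalog` and `createTableAs` are hypothetical, not Spark's real API.
case class TableAlreadyExistsException(name: String)
  extends RuntimeException(s"Table already exists: $name")

class Catalog {
  private val tables = scala.collection.mutable.Map.empty[String, Seq[String]]

  // v2 semantics: CTAS always fails if the table exists; it never
  // silently drops, overwrites, or appends as some v1 sources did.
  def createTableAs(name: String, rows: Seq[String]): Unit = {
    if (tables.contains(name)) throw TableAlreadyExistsException(name)
    tables(name) = rows
  }

  def rows(name: String): Seq[String] = tables(name)
}
```

Under v1, the same statement could mean three different things depending on the source; under v2 there is exactly one outcome.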

Unfortunately, this means that DSv1 can't be easily replaced because it has behavior differences between sources. In addition, we're not really sure how DSv1 works in all cases -- it really depends on what seemed reasonable to authors at the time. For example, we don't have a good understanding of how file-based tables behave (those not backed by a Metastore). There are also changes that we know are breaking and are okay with, like only inserting safe casts when writing with v2.

Because of this, we can't just replace v1 with v2 transparently; instead, deployments will migrate to v2 in stages. Here's the plan:
1. Use v1 by default so all existing queries work as they do today for identifiers like `db.table`
2. Allow users to add additional v2 catalogs that will be used when identifiers specifically start with one, like `test_catalog.db.table`
3. Add a v2 catalog that delegates to the session catalog, so that v2 read/write implementations can be used, but are stored just like v1 tables in the session catalog
4. Add a setting to use a v2 catalog as the default. Setting this would use a v2 catalog for all identifiers without a catalog, like `db.table`
5. Add a way for a v2 catalog to return a table that gets converted to v1. This is what `CatalogTableAsV2` does in #24768.
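The identifier resolution in steps 1, 2, and 4 above can be sketched roughly like this. The `resolve` helper, the catalog registry, and the `"session"` fallback name are all made up for illustration:

```scala
// Rough sketch of catalog resolution for multi-part identifiers.
// The registry, `resolve`, and the "session" name are illustrative only.
object Resolution {
  // Registered v2 catalogs; identifiers whose first part matches one
  // of these names are routed to that catalog (step 2).
  val v2Catalogs = Set("test_catalog")

  // Returns (catalog name, remaining identifier parts).
  // Plain identifiers like db.table fall back to the v1/session catalog
  // unless a default v2 catalog is configured (steps 1 and 4).
  def resolve(identifier: String,
              defaultV2: Option[String] = None): (String, Seq[String]) = {
    val parts = identifier.split('.').toSeq
    parts match {
      case head +: rest if v2Catalogs.contains(head) => (head, rest)
      case _ => (defaultV2.getOrElse("session"), parts)
    }
  }
}
```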

PR #24768 implements the rest of these changes. Specifically, we initially used the default catalog for v2 sources, but that causes namespace problems, so we need the v2 session catalog (point #3) as the default when there is no default v2 catalog.

I hope that answers your question. If not, I'm happy to answer follow-ups and we can add this as a topic in the next v2 sync on Wednesday. I'm also planning on talking about metadata columns or function push-down from the Kafka v2 PR at that sync, so you may want to attend.

rb



--
Ryan Blue
Software Engineer
Netflix

Re: DSv1 removal

Gabor Somogyi
Hi Ryan,

Thanks for the explanation! This sheds light on some areas but also triggered some questions.

The main conclusion for me on the Kafka connector side is to keep v1 as the default, give users some time to migrate to v2, and delete v1 later once v2 is stable (which makes sense from my perspective).

The interesting part is that Kafka microbatch already uses v2 as the default, and I don't fully understand how that fits into this plan.
Since https://issues.apache.org/jira/browse/SPARK-23362 was merged into 2.4, it shouldn't be breaking (I assume the batch part should be similar).

We can continue the discussion about the Kafka batch v1/v2 default on https://github.com/apache/spark/pull/24738 so we don't spam everybody.

Please send me an invite to the sync meeting. I'm not sure exactly when it happens, but I presume it's at night from a CET timezone perspective.
I'll try to organize my time to participate...

BR,
G

