[DatasourceV2] Allowing Partial Writes to DSV2 Tables


[DatasourceV2] Allowing Partial Writes to DSV2 Tables

RussS
In DSV1 this was pretty easy to do because the burden of verifying writes fell on the datasource; the new setup makes partial writes difficult.

resolveOutputColumns checks the table schema against the write plan's output and will fail any request which doesn't contain every column specified in the table schema.
I would like it if we instead either made this check optional for a datasource, perhaps via an "allow partial writes" trait for the table, or just allowed analysis
to fail in "withInputDataSchema", where an implementer could throw exceptions on underspecified writes.


The use case here is that C* (and many other sinks) has mandated columns that must be present during an insert as well as columns
which are not required.

Please let me know if I've misread this.

Thanks for your time again,
Russ
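
For context, a minimal sketch of the failure mode being described, assuming made-up catalog and table names and a table with one mandatory column (id) and two optional ones:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partial-write-demo").getOrCreate()
    import spark.implicits._

    // The target table declares (id, optional_a, optional_b), but this job only
    // produces two of the three columns.
    val partial = Seq((1, "a")).toDF("id", "optional_a")

    // Under the current check this fails during analysis (resolveOutputColumns
    // rejects the plan because optional_b is absent) before the data source is
    // ever asked whether a partial row is acceptable.
    partial.writeTo("mycatalog.ks.partial_write_table").append()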

Re: [DatasourceV2] Allowing Partial Writes to DSV2 Tables

Ryan Blue
I agree with adding a table capability for this. This is something that we support in our Spark branch so that users can evolve tables without breaking existing ETL jobs -- when you add an optional column, it shouldn't fail the existing pipeline writing data to a table. I can contribute the changes to validation if people are interested.

--
Ryan Blue
Software Engineer
Netflix

Re: [DatasourceV2] Allowing Partial Writes to DSV2 Tables

RussS
I would really appreciate that. For now I'm probably going to just write a planner rule which matches my table schema up with the query output when they are valid and fails analysis otherwise. This approach is how I got metadata columns in, so I believe it would work for writing as well.
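
A rough sketch of the shape that workaround could take. This is a sketch only: AppendData is a Catalyst-internal node whose fields can shift between Spark versions, and the required-column set here is a stand-in for whatever per-table metadata the source keeps.

    import org.apache.spark.sql.SparkSessionExtensions
    import org.apache.spark.sql.catalyst.plans.logical.{AppendData, LogicalPlan}

    // Extra analysis check: walk the plan and, for every append, make sure the
    // query output covers the columns the sink considers mandatory.
    class RequireMandatoryColumns(required: Set[String]) extends (LogicalPlan => Unit) {
      override def apply(plan: LogicalPlan): Unit = plan.foreach {
        case append: AppendData =>
          val written = append.query.output.map(_.name.toLowerCase).toSet
          val missing = required.filterNot(c => written.contains(c.toLowerCase))
          if (missing.nonEmpty) {
            // Throwing here fails the query during analysis.
            throw new IllegalArgumentException(
              s"Write to ${append.table.name} is missing mandatory column(s): " +
                missing.mkString(", "))
          }
        case _ => // other plan nodes pass through untouched
      }
    }

    // Registered through SparkSessionExtensions, e.g. via spark.sql.extensions:
    class PartialWriteExtensions extends (SparkSessionExtensions => Unit) {
      override def apply(extensions: SparkSessionExtensions): Unit =
        extensions.injectCheckRule(_ => new RequireMandatoryColumns(Set("id")))
    }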

Re: [DatasourceV2] Allowing Partial Writes to DSV2 Tables

cloud0fan
I think we already have this table capability: ACCEPT_ANY_SCHEMA. Can you try that?
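
For reference, a rough sketch of a table advertising that capability. The class name and schema are made up; the capability values come from TableCapability, and the exact write-builder hooks differ slightly across the 3.0 previews.

    import java.util.{Set => JSet}

    import scala.collection.JavaConverters._

    import org.apache.spark.sql.connector.catalog.{SupportsWrite, TableCapability}
    import org.apache.spark.sql.connector.write.{LogicalWriteInfo, WriteBuilder}
    import org.apache.spark.sql.types.StructType

    class PartialWriteTable extends SupportsWrite {
      override def name(): String = "partial_write_table"

      override def schema(): StructType =
        StructType.fromDDL("id INT, optional_a STRING, optional_b STRING")

      // ACCEPT_ANY_SCHEMA tells the analyzer to skip the full-schema output
      // check for writes to this table.
      override def capabilities(): JSet[TableCapability] =
        Set(TableCapability.BATCH_WRITE, TableCapability.ACCEPT_ANY_SCHEMA).asJava

      // Write path omitted; with this capability the source now has to do its
      // own checking (see the sketch after Ryan's follow-up below).
      override def newWriteBuilder(info: LogicalWriteInfo): WriteBuilder =
        new WriteBuilder {}
    }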

Re: [DatasourceV2] Allowing Partial Writes to DSV2 Tables

RussS
Yeah! That is working for me. Thanks!

Re: [DatasourceV2] Allowing Partial Writes to DSV2 Tables

Ryan Blue
ACCEPT_ANY_SCHEMA isn't a good way to solve the problem because you often want at least some checking in Spark to validate the rows match. It's a good way to be unblocked, but not a long-term solution.
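
Put differently, once a table opts out of Spark's check, the connector has to enforce its own rules at write time. A sketch of what that could look like for the hypothetical PartialWriteTable above; the trait and column names are illustrative, not an existing API.

    import org.apache.spark.sql.connector.write.{LogicalWriteInfo, WriteBuilder}

    // With ACCEPT_ANY_SCHEMA the analyzer no longer lines the query output up
    // against the table schema, so mandated columns have to be checked by the
    // source itself.
    trait RequiresMandatoryColumns {
      // Illustrative: for a C* table these would be the primary key columns.
      def mandatoryColumns: Set[String]

      protected def checkMandatoryColumns(info: LogicalWriteInfo): Unit = {
        val written = info.schema().fieldNames.map(_.toLowerCase).toSet
        val missing = mandatoryColumns.filterNot(c => written.contains(c.toLowerCase))
        if (missing.nonEmpty) {
          // Throwing here fails the query when the write is planned.
          throw new IllegalArgumentException(
            s"Underspecified write: missing mandatory column(s) ${missing.mkString(", ")}")
        }
      }
    }

    // For example, inside the PartialWriteTable sketch above:
    //   override def newWriteBuilder(info: LogicalWriteInfo): WriteBuilder = {
    //     checkMandatoryColumns(info)          // with mandatoryColumns = Set("id")
    //     new WriteBuilder {}
    //   }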

--
Ryan Blue
Software Engineer
Netflix