[DISCUSS] Default values and data sources


[DISCUSS] Default values and data sources

Ryan Blue

Hi everyone,

This thread is a follow-up to a discussion that we started in the DSv2 community sync last week.

The problem I’m trying to solve is that the format I’m integrating with DSv2 supports schema evolution: specifically, adding a new optional column so that rows without that column get a default value (null for Iceberg). The current validation rule for an append in DSv2 fails a write if it is missing a column, so adding a column to an existing table will cause currently scheduled jobs that insert data to start failing. Clearly, schema evolution shouldn't break existing jobs that produce valid data.
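To make the failure concrete, here is a minimal sketch (Scala, spark-shell style with a spark session in scope; the source name, table identifier, and exact error are made up for illustration) of a scheduled append that starts failing once the optional column is added:

// A nightly job written before the schema change: it produces the table's
// original columns (id, data) but not the newly added optional category column.
val df = spark.range(100).selectExpr("id", "CAST(id AS STRING) AS data")

// After the table gains an optional category column (default null in Iceberg),
// the DSv2 append validation rejects this write because the input schema is
// missing a column, even though the table would simply treat it as null.
df.write
  .format("iceberg")   // hypothetical source name
  .mode("append")
  .save("db.events")   // hypothetical table identifier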

To fix this problem, I suggested option 1: adding a way for Spark to check whether to fail when an optional column is missing. Other contributors in the sync thought that Spark should go with option 2: Spark’s schema should have defaults and Spark should handle filling in defaults the same way across all sources, like other databases.

I think we agree that option 2 would be ideal. The problem is that it is very hard to implement.

A source might manage data stored in millions of immutable Parquet files, so adding a default value to existing data files isn’t possible. Spark would need to fill in defaults at read time for files written before the column was added (it could fill in defaults for new files at write time). Filling in defaults at read time would require Spark to fill in defaults for only some of the files in a scan, so Spark would need different handling for each task depending on that task’s schema. Tasks would also be required to produce a consistent schema, so a file without the new column couldn’t be combined into a task with a file that has the new column. This adds quite a bit of complexity.
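As a rough DataFrame-level sketch of the per-file handling this would require (the real work would happen inside the file scan; the function and map names here are made up), the projection Spark builds has to differ per file depending on which columns that file actually contains:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StructType

// For one file (or task) whose schema may predate the new column, project the
// table schema, substituting the default for any column the file doesn't have.
// 'defaults' maps column name -> default value.
def fillMissingColumns(
    fileDf: DataFrame,
    tableSchema: StructType,
    defaults: Map[String, Any]): DataFrame = {
  val present = fileDf.columns.toSet
  val projection = tableSchema.fields.map { f =>
    if (present.contains(f.name)) col(f.name)
    else lit(defaults.getOrElse(f.name, null)).cast(f.dataType).as(f.name)
  }
  fileDf.select(projection: _*)
}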

Other sources may not need Spark to fill in the default at all. A JDBC source would be capable of filling in the default values itself, so Spark would need some way to communicate the default to that source. If the source had a different policy for default values (write time instead of read time, for example) then behavior could still be inconsistent.

I think that this complexity probably isn’t worth consistency in default values across sources, if that is even achievable.

In the sync we thought it was a good idea to send this out to the larger group to discuss. Please reply with comments!

rb

--
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Default values and data sources

cloud0fan
I agree that we should not rewrite existing Parquet files when a new column is added, but we should also try our best to make the behavior the same as the RDBMS/SQL standard.

1. It should be the user who decides the default value of a column, via CREATE TABLE, ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
2. When adding a new column, the default value should be effective for all existing data as well as newly written data.
3. When altering an existing column to change its default value, the new default should be effective for newly written data only.

A possible implementation:
1. A column has two default values: the initial one and the latest one.
2. When adding a column with a default value, set both the initial and the latest default to this value, but do not update existing data.
3. When reading data, fill the missing column with the initial default value.
4. When writing data, fill the missing column with the latest default value.
5. When altering a column to change its default value, only update the latest default value.

This works because:
1. New files will be written with the latest default value, so there is nothing to worry about at read time.
2. Old files will be read with the initial default value, which returns the expected result. (A rough code sketch of this scheme follows below.)
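Here is a rough, self-contained sketch of that two-default scheme in Scala. The names and types are illustrative only, not a proposal for an actual Spark API:

// Each column carries an initial default (applied to old data at read time)
// and a latest default (applied to new data at write time).
case class ColumnDefaults(initial: Any, latest: Any)
case class ColumnSpec(name: String, defaults: Option[ColumnDefaults])

// ADD COLUMN ... DEFAULT v: both defaults start out equal; existing files are untouched.
def addColumn(name: String, default: Any): ColumnSpec =
  ColumnSpec(name, Some(ColumnDefaults(initial = default, latest = default)))

// ALTER COLUMN ... SET DEFAULT v: only the latest default changes.
def setDefault(column: ColumnSpec, newDefault: Any): ColumnSpec =
  column.copy(defaults = column.defaults.map(_.copy(latest = newDefault)))

// Reads fill a missing column with the initial default (old files);
// writes fill a missing column with the latest default (new files).
def valueForRead(column: ColumnSpec): Any  = column.defaults.map(_.initial).orNull
def valueForWrite(column: ColumnSpec): Any = column.defaults.map(_.latest).orNull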

Re: [DISCUSS] Default values and data sources

RussS
I'm not sure why option 1 wouldn't be fine. I'm guessing the reason we want option 2 is to have a unified way of dealing with missing columns? I feel that should probably be left up to the underlying data source implementation. For example, if columns are missing when writing to a database, the data source can choose a value based on the database's metadata, if such a thing exists. I don't think Spark should really need this level of detail, but I've also missed out on all of these meetings (sorry, it's family dinner time :) ), so I may be missing something.

So my tl;dr is: let a data source report whether or not missing columns are OK, and let the data source deal with the missing data based on its underlying storage.
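In code, that idea could look something like the sketch below; the trait and helper are hypothetical, not an existing DSv2 API:

// A source opts in to handling missing columns itself (e.g. from its own
// metadata), so Spark's write validation only fails for columns the source
// cannot handle.
trait AcceptsMissingColumns {
  def canFillMissingColumn(columnName: String): Boolean
}

def missingColumnsToFail(source: Any, missing: Seq[String]): Seq[String] =
  source match {
    case s: AcceptsMissingColumns => missing.filterNot(s.canFillMissingColumn)
    case _                        => missing
  }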

Re: [DISCUSS] Default values and data sources

cloud0fan
Note that the design we make here will affect both data source developers and end users. It's better to provide reliable behavior to end users, instead of asking them to read each data source's spec to know which value will be used for missing columns when they write data.

If we do want to go with the "data source decides default value" approach, we should create a new SQL syntax for ADD COLUMN, since its behavior would differ from the SQL standard ADD COLUMN command.

Re: [DISCUSS] Default values and data sources

Ryan Blue
Wenchen, can you give more detail about the different ADD COLUMN syntax? That sounds like it would be confusing to end users.

--
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Default values and data sources

cloud0fan
The standard ADD COLUMN SQL syntax is: ALTER TABLE table_name ADD COLUMN column_name datatype [DEFAULT value];

If the DEFAULT clause is not specified, the default value is null. If we are going to change the behavior and say the default value is decided by the underlying data source, we should use a new SQL syntax (I don't have a proposal in mind) instead of reusing the existing syntax, in order to stay SQL compatible.

Personally, I don't like reinventing the wheel. It's better to just implement the SQL standard ADD COLUMN command, which means the default value is decided by the end user.
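For reference, this is roughly what the standard-compliant behavior looks like from the user's side (whether Spark parses these exact statements depends on the version, so treat the snippet as an illustration of the semantics, with a made-up table name):

// The user decides the default in DDL; an unspecified DEFAULT means null.
spark.sql("ALTER TABLE db.events ADD COLUMN category STRING DEFAULT 'unknown'")

// Rows written before the column existed read back as 'unknown', and inserts
// that omit the column also get 'unknown'.
spark.sql("INSERT INTO db.events (id, data) VALUES (1, 'a')")
spark.sql("SELECT id, category FROM db.events").show()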

Re: [DISCUSS] Default values and data sources

Ryan Blue
I don't think we have to change the syntax. Isn't the right thing (for option 1) to pass the default through to the underlying data source? Sources that don't support defaults would throw an exception.
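A hypothetical catalog-side sketch of what "pass the default through" could mean (none of these types exist in Spark; they only illustrate the flow):

import org.apache.spark.sql.types.DataType

// Sources that can store a default implement this; everything else rejects
// ADD COLUMN ... DEFAULT instead of silently changing its meaning.
trait SupportsColumnDefaults {
  def addColumnWithDefault(name: String, dataType: DataType, default: Any): Unit
}

def addColumn(table: AnyRef, name: String, dataType: DataType, default: Option[Any]): Unit =
  (table, default) match {
    case (t: SupportsColumnDefaults, Some(d)) => t.addColumnWithDefault(name, dataType, d)
    case (_, Some(_)) =>
      throw new UnsupportedOperationException(
        s"Table does not support DEFAULT values for new column '$name'")
    case _ => ()  // plain ADD COLUMN without a default is passed through as today
  }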

--
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Default values and data sources

cloud0fan
So you agree with my proposal that we should follow the RDBMS/SQL standard regarding the behavior?

> pass the default through to the underlying data source

This is one way to implement the behavior.

Re: [DISCUSS] Default values and data sources

RussS
I guess my question is: why is this a Spark-level behavior? Say the user has an underlying source with different behavior at the source level. If they set a new default in Spark and it's added to the catalog, is the source expected to propagate this? Or does the user have to be aware that their source's settings may differ between a client connecting via Spark and one using a native driver?

For example, say I'm using C* (sorry, but obviously I'm always thinking about C*) and I add a new column to the database. When I connect to the database with a non-Spark application, I expect to be able to insert into the table as long as I satisfy the required columns. Now suppose someone sets a default value for that column in Spark (there is no such feature in C*): depending on how I connect to the source, I get two different behaviors. If I insert from the native app I get empty cells; if I insert from Spark I get a default value. That sounds more confusing to an end user than having consistent behavior between native clients and Spark clients. This is why I asked whether the goal was just to have a common "Spark" behavior, because I don't think it makes sense if you consider the multiple interaction points for a source.

Re: [DISCUSS] Default values and data sources

Ryan Blue
I think it is good to know that not all sources support default values. That makes me think that we should delegate this behavior to the source, have a way for sources to signal that they accept default values in DDL (a capability), and assume that they do not in most cases.
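A small sketch of that capability check at DDL time (illustrative only; the trait and method are hypothetical, not part of DSv2):

trait SourceCapabilities {
  // Conservative default: assume the source cannot store column defaults.
  def acceptsDefaultValuesInDDL: Boolean = false
}

def checkAddColumnWithDefault(source: SourceCapabilities, column: String): Unit =
  if (!source.acceptsDefaultValuesInDDL) {
    throw new UnsupportedOperationException(
      s"This source cannot store a DEFAULT for column '$column'; " +
        "add the column without a default or use the source's native tools")
  }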

--
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Default values and data sources

cloud0fan
Hi Ryan,

That's a good point. Since in this case Spark is just a channel that passes the user's actions to the data source, we should think about what actions the data source supports.

Following this direction, it makes more sense to delegate everything to data sources.

As a first step, maybe we should not add DDL commands to change the schema of a data source, but just use the capability API to let the data source decide what to do when the input schema doesn't match the table schema during a write. Users can use the data source's native client to change the schema.
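A sketch of the write-side check that first step implies, assuming a hypothetical per-source policy (names made up for illustration):

import org.apache.spark.sql.types.StructType

sealed trait MissingColumnPolicy
case object FailWrite extends MissingColumnPolicy           // current DSv2 behavior
case object SourceFillsDefaults extends MissingColumnPolicy // source handles it

def validateWrite(
    tableSchema: StructType,
    inputSchema: StructType,
    policy: MissingColumnPolicy): Unit = {
  val missing = tableSchema.fieldNames.toSet -- inputSchema.fieldNames.toSet
  if (missing.nonEmpty && policy == FailWrite) {
    throw new IllegalArgumentException(
      s"Cannot write: input is missing columns ${missing.mkString(", ")}")
  }
  // With SourceFillsDefaults the write proceeds and the source supplies values
  // for the missing columns from its own metadata.
}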

On Fri, Dec 21, 2018 at 8:03 AM Ryan Blue <[hidden email]> wrote:
I think it is good to know that not all sources support default values. That makes me think that we should delegate this behavior to the source and have a way for sources to signal that they accept default values in DDL (a capability) and assume that they do not in most cases.

On Thu, Dec 20, 2018 at 1:32 PM Russell Spitzer <[hidden email]> wrote:
I guess my question is: why is this a Spark-level behavior? Say the user has an underlying source with different behavior at the source level. In Spark they set a new default and it's added to the catalog; is the source expected to propagate this? Or does the user have to be aware that their source's settings may differ for a client connecting via Spark versus a native driver?

For example, say I'm using C* (sorry, but obviously I'm always thinking about C*) and I add a new column to the database. When I connect to the database with a non-Spark application, I expect to be able to insert into the table as long as I satisfy the required columns. Now someone in Spark sets the column as having a default value (there is no such feature in C*), and depending on how I connect to the source I get two different behaviors: if I insert from the native app I get empty cells, and if I insert from Spark I get a default value inserted. That sounds more confusing to an end user than having consistent behavior between native clients and Spark clients. This is why I asked whether the goal was just to have a common "Spark" behavior, because I don't think that makes sense once you consider multiple interaction points for a source.
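
Russell's concern can be restated as a tiny model (nothing here is Cassandra-specific; the toy functions below are invented purely to show the divergence): the same insert that omits a column ends up stored differently depending on whether it went through a native client, which leaves the cell unset, or through a Spark-side default configured only in Spark's catalog.

// Invented toy model of one table; it only illustrates the client divergence.
object ClientDivergence {
  type Row = Map[String, Any]

  // A native client writes exactly what it was given; the missing cell stays unset.
  def nativeInsert(row: Row): Row = row

  // A Spark-side default (configured only in Spark's catalog) fills the column before writing.
  def sparkInsert(row: Row, sparkDefaults: Map[String, Any]): Row =
    sparkDefaults.foldLeft(row) { case (r, (col, v)) =>
      if (r.contains(col)) r else r + (col -> v)
    }

  val insert: Row = Map("id" -> 1)                       // new column "flag" omitted
  val viaNative = nativeInsert(insert)                   // Map(id -> 1): empty cell
  val viaSpark  = sparkInsert(insert, Map("flag" -> 0))  // Map(id -> 1, flag -> 0)
  // Two clients, two different stored rows for the "same" insert.
}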

On Wed, Dec 19, 2018 at 9:28 PM Wenchen Fan <[hidden email]> wrote:
So you agree with my proposal that we should follow the RDBMS/SQL standard regarding the behavior?

> pass the default through to the underlying data source

This is one way to implement the behavior.

On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue <[hidden email]> wrote:
I don't think we have to change the syntax. Isn't the right thing (for option 1) to pass the default through to the underlying data source? Sources that don't support defaults would throw an exception.
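
As a rough sketch of "pass the default through to the underlying data source", with invented interfaces (this is not Spark's actual catalog or connector API): the ADD COLUMN request carries the user-supplied default, and a source that has no notion of stored defaults simply rejects it.

// Hypothetical catalog interface, for illustration only.
final case class AddColumn(name: String, dataType: String, default: Option[String])

trait SourceCatalog {
  def addColumn(table: String, change: AddColumn): Unit
}

// A source that supports defaults records them in its own metadata.
class JdbcLikeCatalog extends SourceCatalog {
  private var defaults = Map.empty[(String, String), Option[String]]
  def addColumn(table: String, change: AddColumn): Unit =
    defaults += ((table, change.name) -> change.default)
}

// A source with no notion of defaults fails fast when one is supplied.
class NoDefaultsCatalog extends SourceCatalog {
  def addColumn(table: String, change: AddColumn): Unit =
    if (change.default.isDefined)
      throw new UnsupportedOperationException(
        s"Column defaults are not supported (column '${change.name}')")
}

Spark would only relay the DDL; the decision about whether defaults are supported, and the resulting error, stays with the source.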

On Wed, Dec 19, 2018 at 6:29 PM Wenchen Fan <[hidden email]> wrote:
The standard ADD COLUMN SQL syntax is: ALTER TABLE table_name ADD COLUMN column_name datatype [DEFAULT value];

If the DEFAULT clause is not specified, then the default value is null. If we are going to change the behavior and say the default value is decided by the underlying data source, we should use a new SQL syntax (I don't have a proposal in mind) instead of reusing the existing syntax, in order to stay SQL compatible.

Personally I don't like reinventing the wheel. It's better to just implement the SQL-standard ADD COLUMN command, which means the default value is decided by the end users.

On Thu, Dec 20, 2018 at 12:43 AM Ryan Blue <[hidden email]> wrote:
Wenchen, can you give more detail about the different ADD COLUMN syntax? That sounds like it would be confusing to end users.

On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan <[hidden email]> wrote:
Note that the design we make here will affect both data source developers and end users. It's better to provide reliable behavior to end users, instead of asking them to read the data source's spec to find out which value will be used for missing columns when they write data.

If we do want to go with the "data source decides default value" approach, we should create a new SQL syntax for ADD COLUMN, as its behavior is different from the SQL standard ADD COLUMN command.

On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer <[hidden email]> wrote:
I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want 2) is for a unified way of dealing with missing columns? I feel like that should probably be left up to the underlying data source implementation. For example, if you have missing columns with a database, the data source can choose a value based on the database's metadata, if such a thing exists. I don't think Spark should really have this level of detail, but I've also missed out on all of these meetings (sorry, it's family dinner time :) ) so I may be missing something.

So my tl;dr is: let a data source report whether or not missing columns are OK, and let the data source deal with the missing data based on its underlying storage.


Re: [DISCUSS] Default values and data sources

Alessandro Solimando
Hello,
I agree that Spark should check whether the underlying data source supports default values or not, and adjust its behavior accordingly.

If we follow this direction, do you see the default-values capability in scope of the "DataSourceV2 capability API"?

Best regards,
Alessandro


Re: [DISCUSS] Default values and data sources

RussS
I definitely would like to have a "column can be missing" capability, allowing the underlying datasource to fill in a default if it wants to (or not).


Re: [DISCUSS] Default values and data sources

rxin
I'd only do any of the schema evolution things as an add-on on top. This is an extremely complicated area, and we could risk never shipping anything because there would be a lot of different requirements.



Re: [DISCUSS] Default values and data sources

Ryan Blue-3
In reply to this post by Alessandro Solimando
Alessandro, yes. This was one of the use cases that motivated the capability API I proposed.

After this discussion, I think we probably need a couple of capabilities. First, the capability that indicates reads will fill in some default value for missing columns. That way, Spark allows writes to continue when they do not include a column that was recently added. Second, I think we need a capability to indicate that the source accepts non-null default values in DDL. Those are distinct because some sources can return non-null defaults, but sources like C* and Iceberg don't support them. In Iceberg, the default value is always null for optional columns so it supports missing columns but not non-null defaults.



--
Ryan Blue
Software Engineer
Netflix
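
To make the distinction in Ryan's last message concrete, here is a small sketch with invented flag names (the eventual DSv2 API may look nothing like this): one capability gates write validation when optional columns are missing, and the other gates DDL that carries a non-null DEFAULT.

// Hypothetical capability flags, for illustration only.
sealed trait SourceCapability
case object FillsMissingColumnsOnRead extends SourceCapability // e.g. Iceberg: a missing column reads as null
case object AcceptsNonNullDefaults extends SourceCapability    // e.g. a JDBC database that stores defaults

final case class TableInfo(capabilities: Set[SourceCapability])

object DefaultChecks {
  // A write may omit an optional column only if the source fills something in at read time.
  def allowMissingOptionalColumn(t: TableInfo): Boolean =
    t.capabilities.contains(FillsMissingColumnsOnRead)

  // ALTER TABLE ... ADD COLUMN ... DEFAULT <non-null value> is passed through only if the
  // source can actually store and honor that default.
  def allowNonNullDefaultInDdl(t: TableInfo): Boolean =
    t.capabilities.contains(AcceptsNonNullDefaults)

  // An Iceberg-like table would report the first capability but not the second.
  val icebergLike: TableInfo = TableInfo(Set(FillsMissingColumnsOnRead))
  // allowMissingOptionalColumn(icebergLike) == true
  // allowNonNullDefaultInDdl(icebergLike)   == false
}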

Re: [DISCUSS] Default values and data sources

Ryan Blue-3
In reply to this post by rxin
I agree with Reynold's sentiment here. We don't want to create too many capabilities because it makes everything more complicated for both sources and Spark. Let's just go with the capability to read missing columns for now and we can add support for default values if and when Spark DDL begins to support them.

On Fri, Dec 21, 2018 at 9:56 AM Reynold Xin <[hidden email]> wrote:
I'd only do any of the schema evolution things as add-on on top. This is an extremely complicated area and we could risk never shipping anything because there would be a lot of different requirements.


On Fri, Dec 21, 2018 at 9:46 AM, Russell Spitzer <[hidden email]> wrote:
I definitely would like to have a "column can be missing" capability, allowing the underlying datasource to fill in a default if it wants to (or not).

On Fri, Dec 21, 2018 at 1:40 AM Alessandro Solimando <[hidden email]> wrote:
Hello,
I agree that Spark should check whether the underlying datasource
support default values or not, and adjust its behavior accordingly.

If we follow this direction, do you see the default-values capability
in scope of the "DataSourceV2 capability API"?

Best regards,
Alessandro

On Fri, 21 Dec 2018 at 03:31, Wenchen Fan <[hidden email]> wrote:
>
> Hi Ryan,
>
> That's a good point. Since in this case Spark is just a channel to pass user's action to the data source, we should think of what actions the data source supports.
>
> Following this direction, it makes more sense to delegate everything to data sources.
>
> As the first step, maybe we should not add DDL commands to change schema of data source, but just use the capability API to let data source decide what to do when input schema doesn't match the table schema during writing. Users can use native client of data source to change schema.
>
> On Fri, Dec 21, 2018 at 8:03 AM Ryan Blue <[hidden email]> wrote:
>>
>> I think it is good to know that not all sources support default values. That makes me think that we should delegate this behavior to the source and have a way for sources to signal that they accept default values in DDL (a capability) and assume that they do not in most cases.
>>
>> On Thu, Dec 20, 2018 at 1:32 PM Russell Spitzer <[hidden email]> wrote:
>>>
>>> I guess my question is why is this a Spark level behavior? Say the user has an underlying source where they have a different behavior at the source level. In Spark they set a new default behavior and it's added to the catalogue, is the Source expected to propagate this? Or does the user have to be aware that their own Source settings may be different for a client connecting via Spark or via a native driver.
>>>
>>> For example say i'm using C* (sorry but obviously I'm always thinking about C*), and I add a new column to the database. When i connect to the database with a non-spark application I expect to be able to insert to the table given that I satisfy the required columns. In Spark someone sets the columns as having a default value (there is no such feature in C*), now depending on how I connect to the source I have two different behaviors. If I insert from the native app I get empty cells, if I insert from spark i get a default value inserted. That sounds more confusing to an end-user to than having a consistent behavior between native clients and Spark clients. This is why I asked if the goal was to just have a common "Spark" behavior because I don't think it makes sense if you consider multiple interaction points for a source.
>>>
>>> On Wed, Dec 19, 2018 at 9:28 PM Wenchen Fan <[hidden email]> wrote:
>>>>
>>>> So you agree with my proposal that we should follow RDBMS/SQL standard regarding the behavior?
>>>>
>>>> > pass the default through to the underlying data source
>>>>
>>>> This is one way to implement the behavior.
>>>>
>>>> On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue <[hidden email]> wrote:
>>>>>
>>>>> I don't think we have to change the syntax. Isn't the right thing (for option 1) to pass the default through to the underlying data source? Sources that don't support defaults would throw an exception.
>>>>>
>>>>> On Wed, Dec 19, 2018 at 6:29 PM Wenchen Fan <[hidden email]> wrote:
>>>>>>
>>>>>> The standard ADD COLUMN SQL syntax is: ALTER TABLE table_name ADD COLUMN column_name datatype [DEFAULT value];
>>>>>>
>>>>>> If the DEFAULT statement is not specified, then the default value is null. If we are going to change the behavior and say the default value is decided by the underlying data source, we should use a new SQL syntax(I don't have a proposal in mind), instead of reusing the existing syntax, to be SQL compatible.
>>>>>>
>>>>>> Personally I don't like re-invent wheels. It's better to just implement the SQL standard ADD COLUMN command, which means the default value is decided by the end-users.
>>>>>>
>>>>>> On Thu, Dec 20, 2018 at 12:43 AM Ryan Blue <[hidden email]> wrote:
>>>>>>>
>>>>>>> Wenchen, can you give more detail about the different ADD COLUMN syntax? That sounds confusing to end users to me.
>>>>>>>
>>>>>>> On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan <[hidden email]> wrote:
>>>>>>>>
>>>>>>>> Note that the design we make here will affect both data source developers and end-users. It's better to provide reliable behaviors to end-users, instead of asking them to read the spec of the data source and know which value will be used for missing columns, when they write data.
>>>>>>>>
>>>>>>>> If we do want to go with the "data source decides default value" approach, we should create a new SQL syntax for ADD COLUMN, as its behavior is different from the SQL standard ADD COLUMN command.
>>>>>>>>
>>>>>>>> On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer <[hidden email]> wrote:
>>>>>>>>>
>>>>>>>>> I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want 2 is for a unified way of dealing with missing columns? I feel like that probably should be left up to the underlying datasource implementation. For example if you have missing columns with a database the Datasource can choose a value based on the Database's metadata if such a thing exists, I don't think Spark should really have a this level of detail but I've also missed out on all of these meetings (sorry it's family dinner time :) ) so I may be missing something.
>>>>>>>>>
>>>>>>>>> So my tl;dr is: let a data source report whether or not missing columns are OK, and let the data source deal with the missing data based on its underlying storage.
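A minimal sketch of that capability-flag idea, again with invented names rather than real Spark interfaces:

    // Hypothetical capability flag reported by a data source.
    trait WriteCapabilities {
      // True if the source accepts writes that omit optional columns and will
      // handle the gap itself (null, an unset cell, a stored default, ...).
      def acceptsMissingOptionalColumns: Boolean
    }

    // The append validation consults the flag instead of always failing.
    def validateAppend(tableColumns: Set[String], writeColumns: Set[String],
                       source: WriteCapabilities): Unit = {
      val missing = tableColumns -- writeColumns
      if (missing.nonEmpty && !source.acceptsMissingOptionalColumns) {
        throw new IllegalArgumentException(
          s"Cannot append: write is missing columns ${missing.mkString(", ")}")
      }
    }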
>>>>>>>>>
>>>>>>>>> On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan <[hidden email]> wrote:
>>>>>>>>>>
>>>>>>>>>> I agree that we should not rewrite existing Parquet files when a new column is added, but we should also try our best to make the behavior the same as the RDBMS/SQL standard.
>>>>>>>>>>
>>>>>>>>>> 1. it should be the user who decides the default value of a column, by CREATE TABLE, or ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
>>>>>>>>>> 2. When adding a new column, the default value should be effective for all the existing data, and newly written data.
>>>>>>>>>> 3. When altering an existing column to change its default value, it should be effective for newly written data only.
>>>>>>>>>>
>>>>>>>>>> A possible implementation:
>>>>>>>>>> 1. a column has 2 default values: the initial one and the latest one.
>>>>>>>>>> 2. when adding a column with a default value, set both the initial one and the latest one to this value. But do not update existing data.
>>>>>>>>>> 3. when reading data, fill the missing column with the initial default value
>>>>>>>>>> 4. when writing data, fill the missing column with the latest default value
>>>>>>>>>> 5. when altering a column to change its default value, only update the latest default value.
>>>>>>>>>>
>>>>>>>>>> This works because:
>>>>>>>>>> 1. new files will be written with the latest default value, nothing we need to worry about at read time.
>>>>>>>>>> 2. old files will be read with the initial default value, which returns the expected result.
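A self-contained sketch of the two-default scheme described above, with made-up names (this is not how Spark stores column metadata): reads fill missing columns with the initial default, writes fill them with the latest one.

    // Each column carries the default in effect when it was added ("initial")
    // and the default in effect now ("latest").
    case class ColumnSpec(name: String, initialDefault: Any, latestDefault: Any)

    // Read path: files written before the column existed are patched with the
    // initial default, so old data reads back as of ADD COLUMN time.
    def fillOnRead(row: Map[String, Any], schema: Seq[ColumnSpec]): Map[String, Any] =
      schema.map(c => c.name -> row.getOrElse(c.name, c.initialDefault)).toMap

    // Write path: rows that omit the column are completed with the latest
    // default, so changing a default only affects newly written data.
    def fillOnWrite(row: Map[String, Any], schema: Seq[ColumnSpec]): Map[String, Any] =
      schema.map(c => c.name -> row.getOrElse(c.name, c.latestDefault)).toMap

    // "flag" was added with default false; its default was later altered to true.
    val schema = Seq(ColumnSpec("id", null, null), ColumnSpec("flag", false, true))
    fillOnRead(Map("id" -> 1), schema)   // Map(id -> 1, flag -> false): old file
    fillOnWrite(Map("id" -> 2), schema)  // Map(id -> 2, flag -> true):  new write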



--
Ryan Blue
Software Engineer
Netflix