Schema Evolution in Apache Spark

Dongjoon Hyun-2
Hi, All.

A data schema can evolve in several ways, and Apache Spark 2.3 already supports the following for file-based data sources like CSV/JSON/ORC/Parquet (a concrete sketch follows the list).

1. Add a column
2. Remove a column
3. Change a column position
4. Change a column type
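
As a concrete illustration, here is a minimal sketch of reading old Parquet files with a user-given final schema. The path and column names are illustrative, and the exact behavior varies by format (e.g. CSV resolves columns by position rather than by name):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Old files were written with schema (a INT, b STRING).
    Seq((1, "x"), (2, "y")).toDF("a", "b")
      .write.mode("overwrite").parquet("/tmp/evolution-demo")

    // Evolved schema: b removed, c added, and positions changed.
    // Parquet resolves columns by name, so a is still found and
    // the missing c comes back as null.
    val evolved = new StructType()
      .add("c", StringType)
      .add("a", IntegerType)

    spark.read.schema(evolved).parquet("/tmp/evolution-demo").show()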

Can we guarantee users a level of schema evolution coverage on file-based data sources by adding explicit schema evolution test suites? So far, there are only scattered test cases.

For simplicity, I make several assumptions about schema evolution (sketched after this list).

1. Only safe evolutions without data loss.
    - e.g. widening from smaller to larger types, like int-to-long, not vice versa.
2. The final schema is given by the user (or Hive).
3. Only simple Spark data types that are supported by Spark's vectorized execution.
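
For example, here is a minimal sketch of assumption 1, widening int to long under a user-given final schema (reusing the SparkSession and implicits from the sketch above; the path is illustrative). Whether a given format and reader accept this upcast is exactly the kind of coverage the proposed suite would pin down:

    import org.apache.spark.sql.types._

    // Old files store `id` as INT.
    Seq(1, 2, 3).toDF("id")
      .write.mode("overwrite").parquet("/tmp/widening-demo")

    // User-given final schema widens int -> long (assumption 2).
    val finalSchema = new StructType().add("id", LongType)

    // This may succeed or fail depending on the format and on the
    // (vectorized) reader implementation; that is what the tests check.
    spark.read.schema(finalSchema).parquet("/tmp/widening-demo").show()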

I made a test-case PR to gather your opinions on this.

[SPARK-23007][SQL][TEST] Add schema evolution test suite for file-based data sources
- https://github.com/apache/spark/pull/20208

Could you take a look and share your opinions?

Bests,
Dongjoon.

Re: Schema Evolution in Apache Spark

geoHeil
Isn't this related to the data format used, e.g. Parquet, Avro, ..., which already support schema changes?
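
For instance, Parquet's format-level schema merging in Spark looks roughly like this (a minimal sketch assuming a SparkSession with implicits in scope; the paths are illustrative):

    // Two file sets written with different but compatible schemas.
    Seq((1, "x")).toDF("a", "b")
      .write.parquet("/tmp/merge-demo/key=1")
    Seq((2, "y", 3.0)).toDF("a", "b", "c")
      .write.parquet("/tmp/merge-demo/key=2")

    // mergeSchema reconciles the per-file schemas into a superset.
    spark.read.option("mergeSchema", "true")
      .parquet("/tmp/merge-demo").show()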


Re: Schema Evolution in Apache Spark

Dongjoon Hyun-2
This is about Spark-layer test cases on **read-only** CSV, JSON, Parquet, and ORC files. In the PR you can find more details and a comparison of Spark's support coverage.
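
A rough sketch of what one such read-only case could look like, written in the style of Spark's SQL test suites (withTempPath and checkAnswer are Spark test helpers; the test name and data here are hypothetical, not taken from the PR):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    test("add a column: read old files with the evolved schema") {
      withTempPath { dir =>
        val path = dir.getCanonicalPath
        // "Old" files lack column c2.
        Seq(1, 2, 3).toDF("c1").write.parquet(path)
        // Read them back with the user-given final schema;
        // the missing column should surface as nulls.
        val evolved = new StructType()
          .add("c1", IntegerType)
          .add("c2", StringType)
        checkAnswer(
          spark.read.schema(evolved).parquet(path),
          Seq(Row(1, null), Row(2, null), Row(3, null)))
      }
    }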

Bests,
Dongjoon.

