[discuss][SQL] Partitioned column type inference proposal


Hyukjin Kwon
Hi dev,

I would like to post a proposal about partitioned column type inference (related to the 'spark.sql.sources.partitionColumnTypeInference.enabled' configuration).

This thread focuses on type coercion (finding the common type) for partitioned columns, in particular when data in different forms is written to a partition column and then read back with type inference.


Problem:


// Two string values in different forms are written to the same partition column:
// one parses as a date, the other as a timestamp.
val df = Seq((1, "2015-01-01"), (2, "2016-01-01 00:00:00")).toDF("i", "ts")
df.write.format("parquet").partitionBy("ts").save("/tmp/foo")
spark.read.load("/tmp/foo").printSchema()

// "1" fits in an integer, while the 30-digit "11...1" only fits in DecimalType.
val df = Seq((1, "1"), (2, "1" * 30)).toDF("i", "decimal")
df.write.format("parquet").partitionBy("decimal").save("/tmp/bar")
spark.read.load("/tmp/bar").printSchema()


It currently returns:


root
 |-- i: integer (nullable = true)
 |-- ts: date (nullable = true)

root
 |-- i: integer (nullable = true)
 |-- decimal: integer (nullable = true)

That is, the [DateType, TimestampType] conflict collapses to DateType, and the [IntegerType, DecimalType(38,0)] conflict collapses to IntegerType. The type coercion here is not well designed yet, and there are currently a few holes, which is not ideal:


private val upCastingOrder: Seq[DataType] =
  Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType)
...
literals.map(_.dataType).maxBy(upCastingOrder.indexOf(_))


The current logic does not handle types outside of upCastingOrder: they all map to index -1, so it simply returns the first such type as the coerced one.
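For example, the hole can be demonstrated in plain Scala (a minimal sketch using type names as strings, not Spark's actual code):

val upCastingOrder: Seq[String] =
  Seq("NullType", "IntegerType", "LongType", "FloatType", "DoubleType", "StringType")

// Types missing from the order all map to index -1, so maxBy keeps whichever
// maximal element comes first -- the coerced type depends on the input order.
Seq("TimestampType", "DateType").maxBy(upCastingOrder.indexOf(_))  // "TimestampType"
Seq("DateType", "TimestampType").maxBy(upCastingOrder.indexOf(_))  // "DateType"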

This has been discussed in https://github.com/apache/spark/pull/19389#discussion_r150426911, but I would like more feedback from the community since this could be a breaking change.

For current Spark releases (up to and including 2.2.0), we support the types below for partitioned column schema inference, given my investigation - https://github.com/apache/spark/pull/19389#discussion_r150528207:

  NullType
  IntegerType
  LongType
  DoubleType
  DecimalType(...)*
  DateType
  TimestampType
  StringType

  * DecimalType is inferred only when the integer value is too big for LongType.
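As a rough illustration of how a single partition value string could map to one of these types (the helper below is only a sketch - the null-partition marker and the exact check order are my assumptions, not the actual Spark code path):

import scala.util.Try

def inferPartitionType(raw: String): String = {
  if (raw == "__HIVE_DEFAULT_PARTITION__") "NullType"  // assumed null-partition marker
  else if (Try(raw.toInt).isSuccess) "IntegerType"
  else if (Try(raw.toLong).isSuccess) "LongType"
  else if (Try(BigDecimal(raw)).toOption.exists(_.isWhole)) "DecimalType(38,0)"  // integer too big for LongType
  else if (Try(raw.toDouble).isSuccess) "DoubleType"
  else if (Try(java.sql.Date.valueOf(raw)).isSuccess) "DateType"
  else if (Try(java.sql.Timestamp.valueOf(raw)).isSuccess) "TimestampType"
  else "StringType"
}

inferPartitionType("1" * 30)               // "DecimalType(38,0)"
inferPartitionType("2016-01-01 00:00:00")  // "TimestampType"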

I believe this is something we should definitely fix.



Proposal:



Please refer to the chart I produced here - https://github.com/apache/spark/pull/19389/files#r150528361. The current proposal brings the type coercion behaviour changes in the cases below:


Input types | Old output type | New output type
[NullType, DecimalType(38,0)] | StringType | DecimalType(38,0)
[NullType, DateType] | StringType | DateType
[NullType, TimestampType] | StringType | TimestampType
[IntegerType, DecimalType(38,0)] | IntegerType | DecimalType(38,0)
[IntegerType, DateType] | IntegerType | StringType
[IntegerType, TimestampType] | IntegerType | StringType
[LongType, DecimalType(38,0)] | LongType | DecimalType(38,0)
[LongType, DateType] | LongType | StringType
[LongType, TimestampType] | LongType | StringType
[DoubleType, DateType] | DoubleType | StringType
[DoubleType, TimestampType] | DoubleType | StringType
[DecimalType(38,0), NullType] | StringType | DecimalType(38,0)
[DecimalType(38,0), IntegerType] | IntegerType | DecimalType(38,0)
[DecimalType(38,0), LongType] | LongType | DecimalType(38,0)
[DecimalType(38,0), DateType] | DecimalType(38,0) | StringType
[DecimalType(38,0), TimestampType] | DecimalType(38,0) | StringType
[DateType, NullType] | StringType | DateType
[DateType, IntegerType] | IntegerType | StringType
[DateType, LongType] | LongType | StringType
[DateType, DoubleType] | DoubleType | StringType
[DateType, DecimalType(38,0)] | DateType | StringType
[DateType, TimestampType] | DateType | TimestampType
[TimestampType, NullType] | StringType | TimestampType
[TimestampType, IntegerType] | IntegerType | StringType
[TimestampType, LongType] | LongType | StringType
[TimestampType, DoubleType] | DoubleType | StringType
[TimestampType, DecimalType(38,0)] | TimestampType | StringType



Other possible suggestions:

We could also consider simply falling back to StringType whenever there is any type conflict, as sketched below.
Alternatively, there could be stricter rules. I would like more opinions from the community.
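For concreteness, that simplest alternative would look something like this (a sketch only):

import org.apache.spark.sql.types._

// Any conflict falls back to StringType; only identical types survive.
def mergeStrict(a: DataType, b: DataType): DataType =
  if (a == b) a else StringType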



Questions:

- Does the proposal above look good?

- If not, what would be the alternative?





Re: [discuss][SQL] Partitioned column type inference proposal

cloud0fan
My 2 cents:

1. when merging NullType with another type, the result should always be that type.
2. when merging StringType with another type, the result should always be StringType.
3. when merging integral types, the priority from high to low is: DecimalType, LongType, IntegerType. This is because DecimalType is used as a big integer when parsing partition column values.
4. DoubleType can't be merged with other types, except DoubleType itself.
5. when merging TimestampType with DateType, return TimestampType.
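These five rules could be sketched as a pairwise merge over Spark's DataType as below (the StringType fallback for unhandled pairs is an assumption on top of the rules, and this is not the actual implementation):

import org.apache.spark.sql.types._

def merge(a: DataType, b: DataType): DataType = (a, b) match {
  case _ if a == b                                           => a
  case (NullType, t)                                         => t            // rule 1
  case (t, NullType)                                         => t
  case (StringType, _) | (_, StringType)                     => StringType   // rule 2
  case (_: DecimalType, IntegerType | LongType) |
       (IntegerType | LongType, _: DecimalType)              => DecimalType(38, 0)  // rule 3
  case (IntegerType, LongType) | (LongType, IntegerType)     => LongType     // rule 3
  case (DateType, TimestampType) | (TimestampType, DateType) => TimestampType  // rule 5
  case _                                                     => StringType   // rule 4 and other conflicts
}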







Re: [discuss][SQL] Partitioned column type inference proposal

rxin
Most of those thoughts from Wenchen make sense to me.


Rather than a list, can we create a table? X-axis is data type, and Y-axis is also data type, and the intersection explains what the coerced type is? Can we also look at what Hive, standard SQL (Postgres?) do?


Also, this shouldn't be isolated to partition column inference. We should make sure most of the type coercions are consistent across different functionalities, with the caveat that we need to preserve backward compatibility.









Re: [discuss][SQL] Partitioned column type inference proposal

Hyukjin Kwon
Thanks all for the feedback.

> 1. when merging NullType with another type, the result should always be that type.
> 2. when merging StringType with another type, the result should always be StringType.
> 3. when merging integral types, the priority from high to low is: DecimalType, LongType, IntegerType. This is because DecimalType is used as a big integer when parsing partition column values.
> 4. DoubleType can't be merged with other types, except DoubleType itself.
> 5. when merging TimestampType with DateType, return TimestampType.

> Rather than a list, can we create a table? X-axis is data type, and Y-axis is also data type, and the intersection explains what the coerced type is?


Here are the tables I produced:

Before

InputA \ InputB | NullType | IntegerType | LongType | DoubleType | DecimalType(38,0) | DateType | TimestampType | StringType
NullType | StringType | IntegerType | LongType | DoubleType | StringType | StringType | StringType | StringType
IntegerType | IntegerType | IntegerType | LongType | DoubleType | IntegerType | IntegerType | IntegerType | StringType
LongType | LongType | LongType | LongType | DoubleType | LongType | LongType | LongType | StringType
DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | StringType
DecimalType(38,0) | StringType | IntegerType | LongType | DoubleType | DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | StringType
DateType | StringType | IntegerType | LongType | DoubleType | DateType | DateType | DateType | StringType
TimestampType | StringType | IntegerType | LongType | DoubleType | TimestampType | TimestampType | TimestampType | StringType
StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType

After

InputA \ InputB | NullType | IntegerType | LongType | DoubleType | DecimalType(38,0) | DateType | TimestampType | StringType
NullType | StringType | IntegerType | LongType | DoubleType | DecimalType(38,0) | DateType | TimestampType | StringType
IntegerType | IntegerType | IntegerType | LongType | DoubleType | DecimalType(38,0) | StringType | StringType | StringType
LongType | LongType | LongType | LongType | DoubleType | DecimalType(38,0) | StringType | StringType | StringType
DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | StringType | StringType | StringType
DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | DoubleType | DecimalType(38,0) | StringType | StringType | StringType
DateType | DateType | StringType | StringType | StringType | StringType | DateType | TimestampType | StringType
TimestampType | TimestampType | StringType | StringType | StringType | StringType | TimestampType | TimestampType | StringType
StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType


This follows Wenchen's comments (1. to 5.), and I also updated the PR description there with the code I used.
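For reference, such a matrix can be printed from any pairwise merge function along these lines (a sketch only; the merge function is supplied by the caller, and the actual code I used is in the PR):

import org.apache.spark.sql.types._

def printCoercionMatrix(merge: (DataType, DataType) => DataType): Unit = {
  val types: Seq[DataType] = Seq(NullType, IntegerType, LongType, DoubleType,
    DecimalType(38, 0), DateType, TimestampType, StringType)
  // One row per InputA; each cell is the coerced type for (InputA, InputB).
  for (a <- types) {
    val cells = types.map(b => merge(a, b).toString)
    println((a.toString +: cells).mkString(" | "))
  }
}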


> Can we also look at what Hive, standard SQL (Postgres?) do?
> Also, this shouldn't be isolated to partition column inference.
> We should make sure most of the type coercions are consistent across different functionalities, with the caveat that we need to preserve backward compatibility.


Sure. If I understood correctly, we preserve backward compatibility for improvements, but not for bugs in general.


Probably, there are two things to be done now:

  - Deduplicate the type coercion logic and fix the obvious bugs that don't make sense (e.g., decimal and timestamp ending up as decimal).

  - Improve the deduplicated type coercion to follow other systems such as Hive and Postgres, or to make it coherent by referring to them.


I think my PR focuses on the former, because the current partition column inference is already isolated, and I am proposing to put this type coercion logic into one place in TypeCoercion. Maybe we could also accept some divergence as an exception, given the nature of this functionality (although I agree that most type coercions should be consistent across different functionalities in general).

If the deduplicated type coercion itself should be changed across functionalities as an improvement, then, if I understood correctly, that change should preserve backward compatibility and requires a look at other systems.

I am willing to do the latter soon; however, it will probably take me quite a while to investigate and propose a change.

So, in the meantime, could we proceed with this PR separately, or is there something I missed?










Re: [discuss][SQL] Partitioned column type inference proposal

Hyukjin Kwon
I just reordered the columns a little and coloured the differences so they are easier to check:

Before:

InputA \ InputB | NullType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DateType | TimestampType | StringType
NullType | StringType | IntegerType | LongType | StringType | DoubleType | StringType | StringType | StringType
IntegerType | IntegerType | IntegerType | LongType | IntegerType | DoubleType | IntegerType | IntegerType | StringType
LongType | LongType | LongType | LongType | LongType | DoubleType | LongType | LongType | StringType
DecimalType(38,0) | StringType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DecimalType(38,0) | DecimalType(38,0) | StringType
DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | StringType
DateType | StringType | IntegerType | LongType | DateType | DoubleType | DateType | DateType | StringType
TimestampType | StringType | IntegerType | LongType | TimestampType | DoubleType | TimestampType | TimestampType | StringType
StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType

After:
InputA \ InputB | NullType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DateType | TimestampType | StringType
NullType | StringType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DateType | TimestampType | StringType
IntegerType | IntegerType | IntegerType | LongType | DecimalType(38,0) | DoubleType | StringType | StringType | StringType
LongType | LongType | LongType | LongType | DecimalType(38,0) | DoubleType | StringType | StringType | StringType
DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | DoubleType | StringType | StringType | StringType
DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | StringType | StringType | StringType
DateType | DateType | StringType | StringType | StringType | StringType | DateType | TimestampType | StringType
TimestampType | TimestampType | StringType | StringType | StringType | StringType | TimestampType | TimestampType | StringType
StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType














Re: [discuss][SQL] Partitioned column type inference proposal

cloud0fan
1. Shall we return NullType when merging two NullTypes?
2. When merging DoubleType with LongType/DecimalType, we should return StringType; otherwise precision loss may happen.

One thing special about type coercion for partition values is that we have a safe fallback: StringType. This means we can be very conservative on type coercion and reject any precision loss, while for normal type coercion, coercing is sometimes better than failing the query.
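Point 2 is easy to see in plain Scala: a large Long does not survive a round trip through Double, which has only a 53-bit mantissa.

val l = 1234567890123456789L
l.toDouble.toLong == l  // false: the value is rounded to the nearest representable Double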










Re: [discuss][SQL] Partitioned column type inference proposal

Hyukjin Kwon
If we do 1. and 2., I guess it will be:


Before:

InputA \ InputB | NullType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DateType | TimestampType | StringType
NullType | StringType | IntegerType | LongType | StringType | DoubleType | StringType | StringType | StringType
IntegerType | IntegerType | IntegerType | LongType | IntegerType | DoubleType | IntegerType | IntegerType | StringType
LongType | LongType | LongType | LongType | LongType | DoubleType | LongType | LongType | StringType
DecimalType(38,0) | StringType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DecimalType(38,0) | DecimalType(38,0) | StringType
DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | StringType
DateType | StringType | IntegerType | LongType | DateType | DoubleType | DateType | DateType | StringType
TimestampType | StringType | IntegerType | LongType | TimestampType | DoubleType | TimestampType | TimestampType | StringType
StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType

After:
InputA \ InputB | NullType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DateType | TimestampType | StringType
NullType | NullType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DateType | TimestampType | StringType
IntegerType | IntegerType | IntegerType | LongType | DecimalType(38,0) | DoubleType | StringType | StringType | StringType
LongType | LongType | LongType | LongType | StringType | StringType | StringType | StringType | StringType
DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | StringType | DecimalType(38,0) | StringType | StringType | StringType | StringType
DoubleType | DoubleType | DoubleType | StringType | StringType | DoubleType | StringType | StringType | StringType
DateType | DateType | StringType | StringType | StringType | StringType | DateType | TimestampType | StringType
TimestampType | TimestampType | StringType | StringType | StringType | StringType | TimestampType | TimestampType | StringType
StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType


Sounds fine to me, and it looks like we stay safe from the precision loss concern.





Re: [discuss][SQL] Partitioned column type inference proposal

Hyukjin Kwon
Sorry, I meant:

Before:

InputA \ InputB | NullType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DateType | TimestampType | StringType
NullType | StringType | IntegerType | LongType | StringType | DoubleType | StringType | StringType | StringType
IntegerType | IntegerType | IntegerType | LongType | IntegerType | DoubleType | IntegerType | IntegerType | StringType
LongType | LongType | LongType | LongType | LongType | DoubleType | LongType | LongType | StringType
DecimalType(38,0) | StringType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DecimalType(38,0) | DecimalType(38,0) | StringType
DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | StringType
DateType | StringType | IntegerType | LongType | DateType | DoubleType | DateType | DateType | StringType
TimestampType | StringType | IntegerType | LongType | TimestampType | DoubleType | TimestampType | TimestampType | StringType
StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType

After:
InputA \ InputB | NullType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DateType | TimestampType | StringType
NullType | NullType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DateType | TimestampType | StringType
IntegerType | IntegerType | IntegerType | LongType | DecimalType(38,0) | DoubleType | StringType | StringType | StringType
LongType | LongType | LongType | LongType | DecimalType(38,0) | StringType | StringType | StringType | StringType
DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | StringType | StringType | StringType | StringType
DoubleType | DoubleType | DoubleType | StringType | StringType | DoubleType | StringType | StringType | StringType
DateType | DateType | StringType | StringType | StringType | StringType | DateType | TimestampType | StringType
TimestampType | TimestampType | StringType | StringType | StringType | StringType | TimestampType | TimestampType | StringType
StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType
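
Read as code, the corrected "After" table above amounts to a small merge function. A hedged Scala sketch (the function name is made up for illustration; the real change lives in the PR):

import org.apache.spark.sql.types._

// Equal types keep themselves, NullType yields to the other side, a few safe
// widenings are allowed, and every other conflict falls back to StringType --
// the safe representation for partition values.
def mergePartitionType(a: DataType, b: DataType): DataType = (a, b) match {
  case (t1, t2) if t1 == t2 => t1
  case (NullType, t) => t
  case (t, NullType) => t
  case (IntegerType, LongType) | (LongType, IntegerType) => LongType
  case (IntegerType | LongType, d: DecimalType) => d  // big-integer decimal wins
  case (d: DecimalType, IntegerType | LongType) => d
  case (IntegerType, DoubleType) | (DoubleType, IntegerType) => DoubleType  // an Int fits in a Double exactly
  case (DateType, TimestampType) | (TimestampType, DateType) => TimestampType
  case _ => StringType  // e.g. LongType vs DoubleType: possible precision loss, so fall back
}

Under these rules, the /tmp/bar example from the first post would infer decimal(38,0) for the partition column instead of integer.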


2017-11-15 21:22 GMT+09:00 Hyukjin Kwon <[hidden email]>:
If we do 1. and 2., I guess it will be:


Before:

InputA \ InputB | NullType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DateType | TimestampType | StringType
NullType | StringType | IntegerType | LongType | StringType | DoubleType | StringType | StringType | StringType
IntegerType | IntegerType | IntegerType | LongType | IntegerType | DoubleType | IntegerType | IntegerType | StringType
LongType | LongType | LongType | LongType | LongType | DoubleType | LongType | LongType | StringType
DecimalType(38,0) | StringType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DecimalType(38,0) | DecimalType(38,0) | StringType
DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | StringType
DateType | StringType | IntegerType | LongType | DateType | DoubleType | DateType | DateType | StringType
TimestampType | StringType | IntegerType | LongType | TimestampType | DoubleType | TimestampType | TimestampType | StringType
StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType

After:
InputA \ InputB | NullType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DateType | TimestampType | StringType
NullType | NullType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DateType | TimestampType | StringType
IntegerType | IntegerType | IntegerType | LongType | DecimalType(38,0) | DoubleType | StringType | StringType | StringType
LongType | LongType | LongType | LongType | StringType | StringType | StringType | StringType | StringType
DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | StringType | DecimalType(38,0) | StringType | StringType | StringType | StringType
DoubleType | DoubleType | DoubleType | StringType | StringType | DoubleType | StringType | StringType | StringType
DateType | DateType | StringType | StringType | StringType | StringType | DateType | TimestampType | StringType
TimestampType | TimestampType | StringType | StringType | StringType | StringType | TimestampType | TimestampType | StringType
StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType


Sounds fine to me, and it looks like we stay safer from the precision loss concern.



2017-11-15 19:15 GMT+09:00 Wenchen Fan <[hidden email]>:
1. shall we return NullType when merging 2 NullTypes?
2. when merging DoubleType and LongType/DecimalType, we should return StringType; otherwise precision loss may happen.

One thing special about partition value type coercion is that we have a safe fallback: StringType. This means we can be very conservative on type coercion and reject any precision loss, while for normal type coercion, coercing is sometimes better than failing the query.
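
A tiny plain-Scala illustration of that precision loss (not Spark code):

val big = 9007199254740993L     // 2^53 + 1, the first Long a Double cannot represent exactly
println(big.toDouble.toLong)    // prints 9007199254740992 -- the value silently changed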

On Wed, Nov 15, 2017 at 8:50 AM, Hyukjin Kwon <[hidden email]> wrote:
I just reordered the columns a little and coloured the differences so they are easy to check:

Before:

InputA \ InputB | NullType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DateType | TimestampType | StringType
NullType | StringType | IntegerType | LongType | StringType | DoubleType | StringType | StringType | StringType
IntegerType | IntegerType | IntegerType | LongType | IntegerType | DoubleType | IntegerType | IntegerType | StringType
LongType | LongType | LongType | LongType | LongType | DoubleType | LongType | LongType | StringType
DecimalType(38,0) | StringType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DecimalType(38,0) | DecimalType(38,0) | StringType
DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | StringType
DateType | StringType | IntegerType | LongType | DateType | DoubleType | DateType | DateType | StringType
TimestampType | StringType | IntegerType | LongType | TimestampType | DoubleType | TimestampType | TimestampType | StringType
StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType

After:
InputA \ InputB | NullType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DateType | TimestampType | StringType
NullType | StringType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DateType | TimestampType | StringType
IntegerType | IntegerType | IntegerType | LongType | DecimalType(38,0) | DoubleType | StringType | StringType | StringType
LongType | LongType | LongType | LongType | DecimalType(38,0) | DoubleType | StringType | StringType | StringType
DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | DoubleType | StringType | StringType | StringType
DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | StringType | StringType | StringType
DateType | DateType | StringType | StringType | StringType | StringType | DateType | TimestampType | StringType
TimestampType | TimestampType | StringType | StringType | StringType | StringType | TimestampType | TimestampType | StringType
StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType




