SPARK-22267 issue: Spark SQL incorrectly reads ORC file when column order is different


SPARK-22267 issue: Spark SQL incorrectly reads ORC file when column order is different

Mark Petruska
Hi,
I'm very new to Spark development and would like to get guidance from more experienced members.
Sorry, this email will be long, as I try to explain the details.

I started to investigate SPARK-22267 and added some test cases to the PR to highlight the problem. Here are my findings:

- for Parquet the test case succeeds as expected

- the SQL test case for ORC (see the first sketch after this list):
    - when CONVERT_METASTORE_ORC is set to "true", the data fields are presented in the desired order
    - when it is "false", the columns are read in the wrong order
    - Reason: when `isConvertible` returns true in `RelationConversions`, the plan executes `convertToLogicalRelation`, which in turn uses `OrcFileFormat` to read the data; otherwise it uses the classes in "hive-exec:1.2.1".

- the HadoopRDD test case was added to investigate the parameter values further and discover a working combination, but unfortunately no combination of "serialization.ddl" and "columns" results in success (see the second sketch below). It seems that those properties have no effect on the order of the resulting data fields.
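
A minimal sketch of the scenario (assuming a Hive-enabled SparkSession; the table name, path and columns are made up for illustration and are not the exact ones from the PR's test cases):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
import spark.implicits._

// The ORC files on disk have the physical column order (c2, c1) ...
Seq(("a", 1), ("b", 2)).toDF("c2", "c1").write.orc("/tmp/spark22267")

// ... while the Hive table declares the opposite order (c1, c2).
spark.sql(
  """CREATE EXTERNAL TABLE spark22267 (c1 INT, c2 STRING)
    |STORED AS ORC LOCATION '/tmp/spark22267'""".stripMargin)

// CONVERT_METASTORE_ORC = "true": the plan is converted via
// `convertToLogicalRelation` and read with `OrcFileFormat`; the fields
// come back in the desired (declared) order.
spark.sql("SET spark.sql.hive.convertMetastoreOrc=true")
spark.table("spark22267").show()

// CONVERT_METASTORE_ORC = "false": the hive-exec:1.2.1 classes are used
// and the columns are read in the wrong order.
spark.sql("SET spark.sql.hive.convertMetastoreOrc=false")
spark.table("spark22267").show()
```

And a sketch of the HadoopRDD experiment; the property values here are only illustrative, and in my tests neither of them changed the field order:

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
import org.apache.hadoop.hive.ql.io.orc.{OrcInputFormat, OrcStruct}

val jobConf = new JobConf(spark.sparkContext.hadoopConfiguration)
FileInputFormat.setInputPaths(jobConf, "/tmp/spark22267")
// Neither of these properties seems to affect the resulting field order.
jobConf.set("columns", "c1,c2")
jobConf.set("serialization.ddl", "struct spark22267 { int c1, string c2 }")

// OrcInputFormat implements the old mapred InputFormat, so it works with HadoopRDD.
val hadoopRdd = spark.sparkContext.hadoopRDD(
  jobConf, classOf[OrcInputFormat], classOf[NullWritable], classOf[OrcStruct])
hadoopRdd.map(_._2.toString).collect().foreach(println)
```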


At this point I do not see any option to fix this issue without risking backward-compatibility problems.
The possible actions (as I see them):
- link against a newer version of "hive-exec": surely this bug has been fixed in a newer version
- use `OrcFileFormat` for reading ORC data regardless of the CONVERT_METASTORE_ORC setting
- also, there is an `OrcNewInputFormat` class in "hive-exec", but it implements the InputFormat interface from a different package (the new `mapreduce` API rather than `mapred`), hence it is incompatible with HadoopRDD at the moment (see the sketch after this list)
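
A small sketch of that package mismatch (the path is illustrative): `OrcNewInputFormat` implements `org.apache.hadoop.mapreduce.InputFormat`, while `HadoopRDD` expects the old `org.apache.hadoop.mapred.InputFormat`, so it can only be used through the new-API entry points:

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.hive.ql.io.orc.{OrcNewInputFormat, OrcStruct}

// Does not compile: OrcNewInputFormat is not an org.apache.hadoop.mapred.InputFormat,
// so it cannot be passed to hadoopFile/hadoopRDD (i.e. HadoopRDD).
// spark.sparkContext.hadoopFile(
//   "/tmp/spark22267", classOf[OrcNewInputFormat], classOf[NullWritable], classOf[OrcStruct])

// The new-API entry point (backed by NewHadoopRDD) accepts it:
val newApiRdd = spark.sparkContext.newAPIHadoopFile(
  "/tmp/spark22267",
  classOf[OrcNewInputFormat],
  classOf[NullWritable],
  classOf[OrcStruct])
newApiRdd.map(_._2.toString).collect().foreach(println)
```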

Please help me: did I miss any viable options?

Thanks,
Mark

Re: SPARK-22267 issue: Spark SQL incorrectly reads ORC file when column order is different

Dongjoon Hyun
Hi, Mark.

That is one of the reasons why I left it out of the previous PR (below); I'm focusing on the second approach: use OrcFileFormat with convertMetastoreOrc.

https://github.com/apache/spark/pull/19470
[SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark schema to read ORC table instead of ORC file schema

With `convertMetastoreOrc=true`, Spark 2.3 will become more stable and faster. That is also how Spark handles Parquet by default.

BTW, thank you for looking at SPARK-22267; I have not been looking into that issue so far.
It would be great if we could get a fix for SPARK-22267 into Spark 2.3!

Bests,
Dongjoon.


Re: SPARK-22267 issue: Spark SQL incorrectly reads ORC file when column order is different

Mark Petruska
Hi Dongjoon,
Thanks for the info.
Unfortunately, I did not find any way to fix the issue without forcing CONVERT_METASTORE_ORC or changing the ORC reader implementation.
I'm closing the PR, as it was only used to demonstrate the root cause.
Best regards,
Mark
