I'm very new to Spark development and would like some guidance from more experienced members.
Sorry, this email will be long, as I try to explain the details.
I started to investigate issue SPARK-22267 and added some test cases to the PR to highlight the problem. Here are my findings:
- for Parquet, the test case succeeds as expected
- the SQL test case for ORC:
  - when CONVERT_METASTORE_ORC is set to "true", the data fields are returned in the desired order
  - when it is "false", the columns are read in the wrong order
- Reason: when `isConvertible` returns true in `RelationConversions`, the plan executes `convertToLogicalRelation`, which in turn uses `OrcFileFormat` to read the data; otherwise it falls back to the classes in "hive-exec:1.2.1".
- the HadoopRDD test case was added to probe the parameter values for a working combination, but unfortunately no combination of "serialization.ddl" and "columns" results in success; those fields appear to have no effect on the order of the resulting data fields.
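The behaviour above suggests the old reader resolves columns by physical position in the file rather than by name against the Metastore schema. Here is a minimal, Spark-free sketch of the two resolution strategies (all names below are made up for illustration; this is not the actual reader code):

```scala
// Hypothetical illustration of the root cause: positional vs. by-name
// column resolution. All identifiers here are invented for the example.
object ColumnResolution {
  // Column order as stored in the physical ORC file (the writer's order)
  val fileSchema = Seq("c1", "c0")
  // Column order the Hive Metastore reports for the table
  val tableSchema = Seq("c0", "c1")

  // A row's values keyed by the file's column names
  val fileRow = Map("c1" -> "b", "c0" -> "a")

  // Positional resolution (what the hive-exec path effectively does):
  // the i-th table column is filled from the i-th file column, so the
  // tableSchema is ignored and the values come out in file order.
  def readByPosition(): Seq[String] =
    fileSchema.map(fileRow)

  // By-name resolution (what OrcFileFormat does): each table column is
  // looked up in the file by its name, so the order matches the table.
  def readByName(): Seq[String] =
    tableSchema.map(fileRow)
}
```

Under the table schema (c0, c1) the positional read yields ("b", "a"), i.e. the wrong order, while the by-name read yields ("a", "b").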
At this point I do not see any option to fix this issue without risking backward-compatibility problems.
The possible actions (as I see them):
- link against a newer version of "hive-exec": presumably this bug has been fixed in a later release
- use `OrcFileFormat` for reading ORC data regardless of the CONVERT_METASTORE_ORC setting
- there is also an `OrcNewInputFormat` class in "hive-exec", but it implements the InputFormat interface from a different package (`org.apache.hadoop.mapreduce` instead of `org.apache.hadoop.mapred`), so it is incompatible with HadoopRDD at the moment
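For reference, the second option amounts to forcing the conversion through the existing configuration key. A sketch for spark-shell (assumes a running SparkSession bound to `spark`; the table name is hypothetical):

```scala
// Sketch only: run inside spark-shell, where `spark` is the SparkSession.
// Forces Hive metastore ORC tables to be read with Spark's native
// OrcFileFormat instead of the bundled hive-exec 1.2.1 reader.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

// With the flag on, the converted read path resolves columns by name,
// so the fields come back in the expected order.
// `some_orc_table` is a placeholder, not a table from the PR.
spark.sql("SELECT * FROM some_orc_table").show()
```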
Please help me: did I miss any viable options?
That is one of the reasons why I left it out of the previous PR (below); I'm focusing on the second approach: use OrcFileFormat with convertMetastoreOrc.
[SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark schema to read ORC table instead of ORC file schema
With `convertMetastoreOrc=true`, Spark 2.3 will become more stable and faster. It is also the default way Spark handles Parquet.
BTW, thank you for looking into SPARK-22267. I'm not working on that issue so far.
If we have a fix for SPARK-22267 in Spark 2.3, it would be great!
On Tue, Nov 14, 2017 at 3:46 AM, Mark Petruska <[hidden email]> wrote:
Thanks for the info.
Unfortunately I did not find any means to fix the issue without forcing CONVERT_METASTORE_ORC or changing the ORC reader implementation.
Closing the PR, as it was only used to demonstrate the root cause.
On Tue, Nov 14, 2017 at 6:58 PM, Dongjoon Hyun <[hidden email]> wrote: