Spark Issues on ORC

Spark Issues on ORC

Dong Joon Hyun

Hi, All.

 

Today, while looking over the JIRA issues for Spark 2.2.0 in Apache Spark, I noticed that there are many unresolved community requests and related efforts around `Feature parity for ORC with Parquet`.

Some examples I found are the following. I created SPARK-20901 as an umbrella to organize these, although I may not be in a position to do this.

Please let me know if this is not the proper way to do it in the Apache Spark community.

I think we can leverage or transfer the improvements made for Parquet in Spark; one concrete example is sketched after the list below.

 

SPARK-11412   Support merge schema for ORC

SPARK-12417   Orc bloom filter options are not propagated during file write in spark

SPARK-14286   Empty ORC table join throws exception

SPARK-14387   Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc

SPARK-15347   Problem select empty ORC table

SPARK-15474   ORC data source fails to write and read back empty dataframe

SPARK-15682   Hive ORC partition write looks for root hdfs folder for existence

SPARK-15731   orc writer directory permissions

SPARK-15757   Error occurs when using Spark sql "select" statement on orc file …

SPARK-16060   Vectorized Orc reader

SPARK-16628   OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if …

SPARK-17047   Spark 2 cannot create ORC table when CLUSTERED

SPARK-18355   Spark SQL fails to read data from a ORC hive table that has a new column added to it

SPARK-18540   Wholestage code-gen for ORC Hive tables

SPARK-19109   ORC metadata section can sometimes exceed protobuf message size limit

SPARK-19122   Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order

SPARK-19430   Cannot read external tables with VARCHAR columns if they're backed by ORC files written by Hive 1.2.1

SPARK-19809   NullPointerException on empty ORC file

SPARK-20515   Issue with reading Hive ORC tables having char/varchar columns in Spark SQL

SPARK-20682   Implement new ORC data source based on Apache ORC

SPARK-20728   Make ORCFileFormat configurable between sql/hive and sql/core

SPARK-20799   Unable to infer schema for ORC on reading ORC from S3
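
As a concrete illustration, here is a minimal sketch of the schema-merging behaviour that the Parquet source already supports and that SPARK-11412 asks for on ORC. It assumes a SparkSession named `spark` is in scope and the path is a placeholder; passing the same option to the ORC reader is the requested feature, not existing behaviour:

    // Schema merging works for Parquet today; SPARK-11412 asks for the
    // equivalent on the ORC source.
    val merged = spark.read
      .option("mergeSchema", "true")        // honoured by the Parquet source
      .parquet("hdfs:///warehouse/events")  // placeholder path

    merged.printSchema()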

 

Bests,

Dongjoon.

Re: Spark Issues on ORC

Steve Loughran

On 26 May 2017, at 19:02, Dong Joon Hyun <[hidden email]> wrote:

[…]
SPARK-20799   Unable to infer schema for ORC on reading ORC from S3


Fixed that one for you by changing the title: SPARK-20799 Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL

I'd recommend closing that one as a WONTFIX, as it's related to some security work in HADOOP-3733, where Path.toString/toURI now strip out the AWS credentials; since things get passed around as Path.toString(), the credentials are being lost. Under the previous model, everything which logged a path would be logging AWS secrets, and the logs & exceptions weren't being treated as the sensitive documents they became the moment that happened.
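
As background, a minimal sketch of the pattern that sidesteps the problem: supply the S3A credentials through the Hadoop configuration instead of embedding them in the path URI. The app name, bucket, path, and environment-variable names below are placeholders, while the fs.s3a.* keys are real Hadoop configuration names:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("orc-on-s3").getOrCreate()

    // Credentials come from the environment and go into the Hadoop
    // configuration, so they never appear in a Path and never get logged.
    val hc = spark.sparkContext.hadoopConfiguration
    hc.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    hc.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    // No userinfo in the URI, so Path.toString() stays safe to log.
    val df = spark.read.orc("s3a://my-bucket/path/to/table")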

It could count as a regression, but as it never worked when there was a "/" in the secret, it's always been a bit patchy.

If this is really needed, then it could be pushed back into Hadoop 2.8.2, but disabled by default unless you set some option like "fs.s3a.insecure.secrets.in.URL".

Maybe also (somehow) change it to only support AWS session-token triples (id, session-secret, session-token), so that the damage caused by secrets in logs, bug reports &c is less destructive.
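
For reference, a sketch of how a session-token triple is already expressed through the S3A configuration; the provider class and fs.s3a.* keys exist in Hadoop 2.8+, and the values are placeholders:

    import org.apache.hadoop.conf.Configuration

    val conf = new Configuration()
    // Switch to the temporary-credentials provider, then supply the triple.
    conf.set("fs.s3a.aws.credentials.provider",
      "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    conf.set("fs.s3a.access.key", "<temporary-access-key-id>")
    conf.set("fs.s3a.secret.key", "<temporary-secret-key>")
    conf.set("fs.s3a.session.token", "<session-token>")

Since session tokens expire on their own, a triple leaked into a log or bug report goes stale instead of staying live indefinitely.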


Re: Spark Issues on ORC

Dong Joon Hyun

Thank you for confirming, Steve.

 

I removed the dependency of SPARK-20799 on SPARK-20901.

 

Bests,

Dongjoon.

 

