Honor ParseMode in AvroFileFormat

tim
Hi Spark Devs,

We're processing a large number of Avro files with Spark and found that the
Avro reader is missing the ability to handle malformed or truncated files
like the JSON reader. Currently the Avro reader throws exceptions when it
encounters any bad or truncated record in an Avro file, causing the entire
Spark job to fail from a single dodgy file.

Ideally the AvroFileFormat would accept a Permissive or DropMalformed
ParseMode like Spark's JSON format. This would enable the Avro reader to
drop bad records and continue processing the good records rather than abort
the entire job.
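
To make the three modes concrete, here is a minimal pure-Python sketch of the semantics being asked for. It is an illustration only, not Spark's implementation: it parses JSON lines rather than Avro, the function name `parse_records` is hypothetical, and the `_corrupt_record` key is borrowed from the default column name Spark's JSON reader uses in permissive mode.

```python
import json
from typing import Iterable, Iterator

VALID_MODES = {"PERMISSIVE", "DROPMALFORMED", "FAILFAST"}


def parse_records(lines: Iterable[str], mode: str = "FAILFAST") -> Iterator[dict]:
    """Parse one JSON object per line, handling bad lines according to `mode`.

    Hypothetical helper for illustration; the mode names mirror Spark's
    ParseMode values, the logic is a simplification.
    """
    mode = mode.upper()
    if mode not in VALID_MODES:
        raise ValueError(f"unknown parse mode: {mode}")

    for line in lines:
        try:
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError("record is not a JSON object")
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            if mode == "FAILFAST":
                # Current effective Avro behavior: one bad record aborts the read.
                raise
            if mode == "PERMISSIVE":
                # Keep a placeholder row that carries the raw malformed input.
                yield {"_corrupt_record": line}
            # DROPMALFORMED: silently skip the bad record.
            continue
        yield record
```

With a truncated second line, `DROPMALFORMED` yields only the first and third records, `PERMISSIVE` yields all three positions with the bad one wrapped under `_corrupt_record`, and `FAILFAST` raises on the second line.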

I've searched through Jira and haven’t found any related issues, but it’s a
relatively straightforward change that brings consistency across the
readers. The default could remain FailFastMode, which is the
current effective behavior, so this wouldn’t break anything for existing users.

Is there any reason why this behavior doesn't exist, or an obvious workaround
that I missed?

If not, are there any further details needed to consider adding this
capability to Spark's Avro reader? I’m happy to propose a solution and
contribute this update if somebody isn't already working on it.

Thanks,
Tim



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

Re: Honor ParseMode in AvroFileFormat

Xiao Li-2
Hi, Tim,

This is a valid requirement. Could you open a JIRA?

Thanks,

Xiao

On Thu, Mar 7, 2019 at 1:04 PM tim <[hidden email]> wrote:
> [...]

Re: Honor ParseMode in AvroFileFormat

tim
Thanks Xiao, it's good to have that validated.

I've created a ticket here: https://issues.apache.org/jira/browse/AVRO-2342



Re: Honor ParseMode in AvroFileFormat

Xiao Li-2
Could you just create an Apache Spark JIRA? https://issues.apache.org/jira/projects/SPARK/

On Thu, Mar 7, 2019 at 2:13 PM tim <[hidden email]> wrote:
> [...]

Re: Honor ParseMode in AvroFileFormat

tim
/facepalm

Here we go: https://issues.apache.org/jira/browse/SPARK-27093

Tim



Re: Honor ParseMode in AvroFileFormat

Gengliang
Hi Tim,

I think you can try setting the option spark.sql.files.ignoreCorruptFiles to true. With the option enabled, Spark jobs continue to run when they encounter corrupted files, and the contents that have already been read are still returned.
The CSV/JSON data sources support PERMISSIVE mode when reading files because users may still want partial row results.
When reading corrupted Avro files, I think skipping the rest of each corrupted file is enough if users want to ignore such files.
For processing data with the function `from_avro`, I have created a PR to support PERMISSIVE/FAILFAST mode: https://github.com/apache/spark/pull/22814
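
As a rough sketch of the workaround, assuming PySpark and the Avro data source module on the classpath, the option can be set on the session before reading. The helper name `read_avro_skipping_corrupt_files` and the path in the usage note are hypothetical; the conf key is the one named above.

```python
def read_avro_skipping_corrupt_files(spark, path):
    """Read Avro files at `path`, tolerating corrupted files.

    `spark` is expected to be a pyspark.sql.SparkSession. Hypothetical helper
    for illustration only.
    """
    # Session-level setting: keep the job running when a corrupted file is
    # hit, returning whatever was read from it before the corruption.
    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

    # "avro" is the short name of Spark's Avro data source.
    return spark.read.format("avro").load(path)
```

It would be called as `read_avro_skipping_corrupt_files(SparkSession.builder.getOrCreate(), "/data/events")`. Note that this works at file granularity: once a file turns out to be corrupted, the remainder of that file is skipped, which is different from the per-record PERMISSIVE/DROPMALFORMED handling requested at the start of the thread.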

Gengliang


On Fri, Mar 8, 2019 at 6:25 AM tim <[hidden email]> wrote:
> [...]