Option for silent failure while reading a list of files.

Option for silent failure while reading a list of files.

Naresh Peshwe
Hi All,
When I try to read a list of parquet files from S3, my application errors out if even one of the files is absent. When I searched for solutions, most of them suggested filtering the list of files (on presence) before calling read.
Shouldn't Spark handle this by providing an option to continue without throwing an error? If not, could you point me to the thread where this was discussed?
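
For reference, this is roughly the read call that fails. The bucket and file names below are only illustrative, not my actual data:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-parquet-list").getOrCreate()

// Illustrative list of objects; any one of them may have been deleted from S3.
val paths = Seq(
  "s3a://some-bucket/data/part-0000.parquet",
  "s3a://some-bucket/data/part-0001.parquet",
  "s3a://some-bucket/data/part-0002.parquet"
)

// The whole read fails if any single path is absent;
// in my case it surfaces as an AnalysisException ("Path does not exist").
val df = spark.read.parquet(paths: _*)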


Regards,
Naresh

Re: Option for silent failure while reading a list of files.

Steve Loughran-2
Where is this list of files coming from?

If you made the list, then yes, the expectation is generally "supply a list of files which are present", on the basis that the general convention is "missing files are considered bad".

Though you could try setting spark.sql.files.ignoreCorruptFiles=true to see what happens 
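
Something along these lines, that is (untested sketch; "paths" here stands for whatever list you are passing to the reader):

// Untested: flip the flag on the running session and retry the same read.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

// or set it once at submit time instead:
//   spark-submit --conf spark.sql.files.ignoreCorruptFiles=true ...

val df = spark.read.parquet(paths: _*)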

There has been past discussion on the topic of what happens when the set of files off S3 includes files which have been moved offline; the conclusion there was "you get to filter, sorry".

Re: Option for silent failure while reading a list of files.

Naresh Peshwe
Thanks for your reply. It makes sense why the option is not provided, since the user is the one who is imperatively asking Spark to read the files.

Yes, I provide the list of files. I'll try the ignoreCorruptFiles option. Also, I'll look into how I can avoid missing files, or at least check whether each file is present before reading.
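
Something like the following is what I had in mind for the presence check (rough sketch using the Hadoop FileSystem API; the bucket and file names are made up):

import org.apache.hadoop.fs.Path

// Made-up list of candidate objects.
val paths = Seq(
  "s3a://some-bucket/data/part-0000.parquet",
  "s3a://some-bucket/data/part-0001.parquet"
)

val hadoopConf = spark.sparkContext.hadoopConfiguration

// Keep only the paths that actually exist. This is one existence check
// per path against S3, so it adds some overhead for long lists.
val existing = paths.filter { p =>
  val path = new Path(p)
  path.getFileSystem(hadoopConf).exists(path)
}

val df =
  if (existing.nonEmpty) spark.read.parquet(existing: _*)
  else spark.emptyDataFrame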

Regards,
Naresh
