FileNotFoundException on _temporary dir when running multiple apps for the same table

Bimalendu Varsha Choudhary
Hi All,

I have a use case where we run multiple parallel Spark SQL queries that write data to the same table, but to different partitions, to speed things up.
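To make the pattern concrete, here is a rough sketch of the write each application performs (the partition column and source table names here are made up for illustration); each app targets a different static partition of the same table:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// App 1 runs this for dt='2021-03-18'; app 2 runs the same statement
// for dt='2021-03-19'. Both writes resolve to the same table location.
spark.sql(
  """INSERT OVERWRITE TABLE db.tb PARTITION (dt = '2021-03-19')
    |SELECT col1, col2 FROM staging_tb WHERE dt = '2021-03-19'""".stripMargin)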

But it looks like different applications end up using the same _temporary directory. For example:
/user/hive/db/tb/_temporary/0/task_20210319101649_0006_m_000000..


The application that finishes first may delete this directory, and the other application then fails with this exception:

java.io.FileNotFoundException: File hdfs://nameservice1/user/test/db/tb/_temporary/0 does not exist.
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:985)
        at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:121)
        at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1045)
        at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1042)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1052)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:473)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:475)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:392)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:364)
        at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
        at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:166)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:187)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
        at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
        at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
        at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

Going over similar reported issues, I did not find a solution to this. The issue might be specific to V1, but since V2 has its own limitations we are stuck with V1.

I was thinking we could configure the temporary directory to be application specific, but it does not look like we can (neither mapred.output.dir nor mapreduce.output.fileoutputformat.outputdir seems to be used). Why can't the _temporary name include the application id, so that applications stop interfering with each other, or why can't the _temporary location be made configurable?
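For reference, this is roughly the workaround I attempted (the staging path is made up for illustration); setting these on the Hadoop configuration had no visible effect on where _temporary is created:

// Attempted workaround: point the output directory at an
// application-specific location so each app would get its own _temporary.
// As noted above, neither property appears to be honored by this write path.
val appId = spark.sparkContext.applicationId
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("mapred.output.dir", s"/user/hive/db/tb/.staging-$appId")
hadoopConf.set("mapreduce.output.fileoutputformat.outputdir",
  s"/user/hive/db/tb/.staging-$appId")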

Any suggestions?

Regards
Bimal