
vtygoss


Hi devs,


Question: how do I convert a Hive output format to a Spark SQL datasource format?


Spark version: 2.3.0

Scene: Spark SQL applications generate many small files on HDFS (Hive tables) when dynamic partitioning is enabled or when spark.sql.shuffle.partitions is set above 200. So I am trying to develop a new feature: after the temporary files have been written to HDFS but before they are moved to the final path, calculate the ideal file number from dfs.blocksize and the temporary files' total length, then merge (coalesce/repartition) down to that number (a sketch of the sizing calculation follows the failing attempt below).

But I have run into a difficulty: the temporary files are written in the output format defined by the Hive TableDesc (e.g. org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat), and I cannot load them with:


```
sparkSession
  .read.format(TableDesc.getInputFormatClassName)
  .load(tempDataPath)
  .repartition(idealFileNumber)  // the computed ideal file number
  .write.format(TableDesc.getOutputFormatClassName)
  .save(finalPath)               // the final table path
```

This throws "xxx is not a valid Spark SQL Data Source" from DataSource#resolveRelation.
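For context, the sizing calculation mentioned above is roughly the following (a minimal sketch; the name idealFileNumber and the use of getContentSummary are my own illustrative choices):

```
import org.apache.hadoop.fs.{FileSystem, Path}

// Minimal sketch: derive the target file count from the HDFS block size
// (dfs.blocksize) and the total length of the temporary files.
def idealFileNumber(fs: FileSystem, tempDataPath: Path): Int = {
  val blockSize   = fs.getDefaultBlockSize(tempDataPath)          // dfs.blocksize
  val totalLength = fs.getContentSummary(tempDataPath).getLength  // total bytes
  // At least one output file; otherwise ceiling(totalLength / blockSize).
  math.max(1L, (totalLength + blockSize - 1) / blockSize).toInt
}
```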

To load the temporary files, I also tried:


```
sparkSession.read
  .option("inputFormat", TableDesc.getInputFormatClassName)
  .option("outputFormat", TableDesc.getOutputFormatClassName)
  .load(tempDataPath)
  ….
```


This doesn't work either: the options are ignored and the Spark SQL DataSource falls back to the default, parquet.
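If the concrete format is already known, reading through the built-in datasource short name would presumably work; a sketch for the ORC case (mergedDataPath and fs are assumed from the surrounding job):

```
// Sketch for a known format: use Spark's datasource short name ("orc")
// instead of the Hive output format class.
val numFiles = idealFileNumber(fs, new Path(tempDataPath)) // from the sketch above

sparkSession.read
  .format("orc")
  .load(tempDataPath)
  .repartition(numFiles)
  .write
  .format("orc")
  .save(mergedDataPath) // hypothetical merged-output path
```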


So: how can I convert a Hive output format into a Spark SQL datasource format in general? Is there any better way than building a Map<hive output format, spark sql datasource>?
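For reference, the kind of mapping I would otherwise have to hand-maintain looks like this (a sketch covering only a few formats; the class names are the standard Hive ones, but hiveToDataSource is a hypothetical name):

```
// Naive lookup from Hive output format class name to a Spark SQL
// datasource short name; incomplete by construction, which is the problem.
val hiveToDataSource: Map[String, String] = Map(
  "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"               -> "orc",
  "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat" -> "parquet",
  "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"     -> "text"
)

val sourceName = hiveToDataSource.getOrElse(
  TableDesc.getOutputFormatClassName,
  sys.error("no datasource mapping for this Hive output format"))
```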



Thanks in advance