Question: how to convert Hive output format files to Spark SQL data source format?
Spark version: 2.3.0
Scene: Spark SQL applications generate many small files on HDFS (Hive tables) when dynamic partitioning is enabled, or when spark.sql.shuffle.partitions is set above 200. I am trying to develop a new feature: after the temporary files have been written to HDFS but before they are moved to the final path, compute the ideal number of output files from dfs.blocksize and the total length of the temporary files, then merge (coalesce/repartition) down to that number.

The difficulty: the temporary files are written in the output format defined in the Hive TableDesc (e.g. org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat), so I cannot load them and call

.repartition(idealFileNumber)

because DataSource#resolveRelation throws: xxx is not a valid Spark SQL Data Source.
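For context, the "ideal file number" step described above is plain arithmetic: the ceiling of the temporary files' total length over dfs.blocksize, floored at one file. A minimal sketch (the function name and the 128 MB default block size are illustrative, not from the source):

```python
import math

# Hypothetical helper: target file count = ceil(total size / HDFS block size).
# 128 MB is the common dfs.blocksize default, but it is cluster-specific.
def ideal_file_number(total_length_bytes: int,
                      dfs_block_size: int = 128 * 1024 * 1024) -> int:
    if total_length_bytes <= 0:
        return 1
    return max(1, math.ceil(total_length_bytes / dfs_block_size))

# 10 temp files of 5 MB each fit in one 128 MB block -> 1 output file
print(ideal_file_number(10 * 5 * 1024 * 1024))
# 300 MB of temp data spans three 128 MB blocks -> 3 output files
print(ideal_file_number(300 * 1024 * 1024))
```

The merge itself would then pass this number to `.repartition(...)` (or `.coalesce(...)`, which avoids a shuffle when only reducing the partition count), once the temporary files can actually be loaded.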