to prune the partitions. From my understanding, pruning works by looking up the partition path in leafDirToChildrenFiles, which in this case is s3://bucket/table/date=2019-11-25/, and therefore it fails to find any files for this partition.
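To make the failure mode concrete, here is a minimal sketch (not the actual Spark code) of how a path-keyed lookup in the spirit of leafDirToChildrenFiles can come back empty when the partition location does not exactly match the key recorded while listing files. The object name, the concrete paths, and the particular mismatch (s3 vs. s3a) are purely illustrative assumptions:

// Illustrative sketch only: a direct lookup returning no files when the
// partition path does not exactly match the key built during file listing.
// The paths and the s3/s3a mismatch are hypothetical.
import org.apache.hadoop.fs.Path

object PartitionLookupSketch {
  def main(args: Array[String]): Unit = {
    // Key recorded while listing leaf directories under the table root.
    val listedDir = new Path("s3a://bucket/table/date=2019-11-25")

    // Partition location as reported for the partition being pruned.
    val partitionPath = new Path("s3://bucket/table/date=2019-11-25")

    val leafDirToChildrenFiles: Map[Path, Seq[String]] =
      Map(listedDir -> Seq("part-00000.snappy.parquet"))

    // Pruning resolves the partition by direct lookup; any mismatch in the
    // key means the partition appears to contain no files.
    val files = leafDirToChildrenFiles.getOrElse(partitionPath, Seq.empty)
    println(s"Files found for date=2019-11-25: ${files.size}") // prints 0
  }
}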
I have tested this by updating the jar running on EMR, and we can now correctly read the data from these partitioned tables. It's also worth noting that we can read
the data, without any modifications to the code, if we use the following settings (applied as in the sketch below):
"spark.sql.hive.convertMetastoreParquet" to "false",
"spark.hive.mapred.supports.subdirectories" to "true",
"spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive" to "true"
However, with these settings we lose the ability to prune partitions, causing us to read the entire table every time, as we aren't using a Spark relation.
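For context, this is the kind of partition-filtered read we would expect pruning to help with; the database, table, and column names are placeholders (reusing the session from the sketch above), and inspecting the physical plan is simply one way to see whether the partition predicate is pushed down when the table is read through a Spark relation:

// Placeholder names; look for PartitionFilters in the physical plan when the
// scan goes through a Spark file-source relation.
val df = spark.sql(
  "SELECT * FROM my_db.my_partitioned_table WHERE `date` = '2019-11-25'")
df.explain(true)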
I want to start a discussion on whether this is the correct change, or if we are missing something more obvious. In either case I would be happy to fully implement the change.