Parquet read performance for different schemas

Tomas Bartalos

I have 2 parquets (each containing 1 file):
  • parquet-wide - schema has 25 top level cols + 1 array
  • parquet-narrow - schema has 3 top level cols
Both files have the same data for the given columns.
When I read from parquet-wide, Spark reports 52.6 KB read; from parquet-narrow, only 2.6 KB.
For a bigger dataset the difference is 413 MB vs 961 MB. Needless to say, reading the narrow parquet is much faster.

Since schema pruning is applied, I expected to get similar results for both scenarios (timing and amount of data read).
What do you think is the reason for such a big difference? Is there any tuning I can do?
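
For reference, a minimal PySpark-style sketch of the comparison. The paths and column names are placeholders, not the real ones, and the helper names are made up for illustration. It assumes an existing SparkSession is passed in, and Spark 3.0+ for the "noop" sink, which forces a full scan without writing any output.

```python
def read_selected(spark, path, columns):
    """Read the Parquet dataset at `path`, keeping only `columns`."""
    return spark.read.parquet(path).select(*columns)


def compare_reads(spark, paths, columns):
    """Run the same column selection against each dataset.

    For every path this prints the physical plan (the ReadSchema entry of
    the Parquet scan shows which columns Spark actually requests) and then
    forces a full scan so the input size shows up in the Spark UI.
    """
    for path in paths:
        df = read_selected(spark, path, columns)
        df.explain()  # check ReadSchema to see whether pruning happened
        # "noop" sink (assumed Spark 3.0+): reads all rows, writes nothing
        df.write.format("noop").mode("overwrite").save()
```

With a session named spark, this would be called as compare_reads(spark, ["/data/parquet-wide", "/data/parquet-narrow"], ["col_a", "col_b", "col_c"]), where both paths and all three column names are hypothetical. If ReadSchema lists the same 3 columns in both plans, pruning works at the schema level and the difference must come from somewhere else.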

Thank you,