FileSystem.getContentSummary for total size stats in DetermineTableStats VS CommandUtils?

FileSystem.getContentSummary for total size stats in DetermineTableStats VS CommandUtils?

Jacek Laskowski
Hi,

I was wondering what's wrong with FileSystem.getContentSummary in CommandUtils.calculateLocationSize as "expressed" in the comment [1]:

    // This method is mainly based on
    // org.apache.hadoop.hive.ql.stats.StatsUtils.getFileSizeForTable(HiveConf, Table)
    // in Hive 0.13 (except that we do not use fs.getContentSummary).
    // TODO: Generalize statistics collection.
    // TODO: Why fs.getContentSummary returns wrong size on Jenkins?
    // Can we use fs.getContentSummary in future?
    // Seems fs.getContentSummary returns wrong table size on Jenkins. So we use
    // countFileSize to count the table size.

until I found out that there seems to be no issue whatsoever since DetermineTableStats uses it just fine [2].

Why does CommandUtils.calculateLocationSize *not* use what DetermineTableStats does successfully?


Re: FileSystem.getContentSummary for total size stats in DetermineTableStats VS CommandUtils?

Steve Loughran

The default implementation is a recursive treewalk, though HDFS and ADL both push the work out to the remote system for performance.

If odd numbers are coming back from getContentSummary() against HDFS, then it's a bug there. Though if it's Jenkins test runs against the local FS, then it's in the client-side treewalk.
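
For illustration only, a client-side recursive treewalk that totals file sizes looks roughly like the sketch below. It uses java.nio.file against the local filesystem as a stand-in; it is not Hadoop's FileSystem code, and the names TreeWalkSize and totalLength are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class TreeWalkSize {
    /** Recursively sums the sizes of all regular files under root. */
    static long totalLength(Path root) throws IOException {
        if (Files.isDirectory(root)) {
            long total = 0L;
            // One listing call per directory: this is the part that gets
            // expensive on a deep/wide tree when each listing is a remote call.
            try (DirectoryStream<Path> children = Files.newDirectoryStream(root)) {
                for (Path child : children) {
                    total += totalLength(child);
                }
            }
            return total;
        }
        // Count regular files only; anything else contributes nothing.
        return Files.isRegularFile(root) ? Files.size(root) : 0L;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny table-like layout: one file at the top, one in a partition dir.
        Path root = Files.createTempDirectory("treewalk");
        Path sub = Files.createDirectory(root.resolve("part=1"));
        Files.write(root.resolve("a.bin"), new byte[10]);
        Files.write(sub.resolve("b.bin"), new byte[32]);
        System.out.println(totalLength(root)); // prints 42
    }
}
```

The walk issues one listing per directory, so its cost grows with the number of directories rather than being a single call.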

Reimplementing the treewalk in Spark works, but it is very inefficient on a deep/wide tree compared to one RPC call to HDFS, which can then lock the directory once and recurse down. And, if needed, the blobstore clients can do a flat listing, which is much more efficient than the recursion, in time and $.

Only ADL does, though. If getContentSummary() does get used on a path where performance matters, the other stores could be uprated fairly easily.

-steve


On 2 Jan 2018, at 09:45, Jacek Laskowski <[hidden email]> wrote:

Hi,

I was wondering what's wrong with FileSystem.getContentSummary in CommandUtils.calculateLocationSize as "expressed" in the comment [1]:

    // This method is mainly based on
    // org.apache.hadoop.hive.ql.stats.StatsUtils.getFileSizeForTable(HiveConf, Table)
    // in Hive 0.13 (except that we do not use fs.getContentSummary).
    // TODO: Generalize statistics collection.
    // TODO: Why fs.getContentSummary returns wrong size on Jenkins?
    // Can we use fs.getContentSummary in future?
    // Seems fs.getContentSummary returns wrong table size on Jenkins. So we use
    // countFileSize to count the table size.

until I found out that there seems to be no issue whatsoever since DetermineTableStats uses it just fine [2].

Why does CommandUtils.calculateLocationSize *not* use what DetermineTableStats does successfully?



Regards,
Jacek Laskowski
----
Spark Structured Streaming https://bit.ly/spark-structured-streaming