Fw:Re:Re: A question about RDD byte size


Fw:Re:Re: A question about RDD byte size

zhangliyun






-------- Forwarded message --------
From: "zhangliyun" <[hidden email]>
Date: 2019-12-03 05:56:55
To: "Wenchen Fan" <[hidden email]>
Subject: Re:Re: A question about RDD byte size

Hi Fan,
   Thanks for the reply. I agree that how the data is stored determines the total byte size of the table files.
In my experiments, I found that:
a sequence file with gzip compression is about 0.5x the total byte size calculated in memory;
a parquet file with lzo compression is about 0.2x the total byte size calculated in memory.

So is the reason that the actual Hive table size is less than the total size calculated in memory determined by the file format (sequence, orc, parquet, etc.), by the compression algorithm, or by both?


Meanwhile, can I directly use org.apache.spark.util.SizeEstimator.estimate(RDD) to estimate the total size of an RDD? I guess there is some difference between the actual size and the estimated size, so in which cases can we use it, and in which cases can we not?
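
A minimal sketch of the kind of estimate in question, assuming SizeEstimator is applied per row rather than to the RDD handle itself (df and the helper name estimatedRowBytes are placeholders for illustration, and the result reflects JVM object overhead rather than any on-disk format):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.util.SizeEstimator

    // Estimate the JVM heap footprint of the Row objects, partition by partition,
    // then sum the per-partition totals. This is an approximation, not an exact size.
    def estimatedRowBytes(df: DataFrame): Long =
      df.rdd
        .mapPartitions(rows => Iterator.single(rows.map(SizeEstimator.estimate(_)).sum))
        .fold(0L)(_ + _)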

Best Regards
Kelly Zhang




On 2019-12-02 15:54:19, "Wenchen Fan" <[hidden email]> wrote:
When we talk about byte size, we need to specify how the data is stored. For example, if we cache the DataFrame, then the byte size is the number of bytes of the binary format of the table cache. If we write to Hive tables, then the byte size is the total size of the data files of the table.
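
As a hedged illustration of the first case, one way to read the table-cache size in recent Spark versions is through the optimized plan's statistics; this is an internal API that may differ across versions, and df and spark are placeholders:

    // Cache the DataFrame and force materialization so the cache statistics are populated.
    df.cache()
    df.count()

    // Once materialized, the optimized plan resolves to the in-memory relation,
    // whose statistics report the size of the cached binary batches.
    val cachedBytes: BigInt = df.queryExecution.optimizedPlan.stats.sizeInBytes
    println(s"table cache size: $cachedBytes bytes")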

On Mon, Dec 2, 2019 at 1:06 PM zhangliyun <[hidden email]> wrote:
Hi:

 I want to get the total byte size of a DataFrame with the following function, but when I insert the DataFrame into Hive, I find that the value returned by the function differs from spark.sql.statistics.totalSize. The spark.sql.statistics.totalSize is less than the result of the following function, getRDDBytes.

   def getRDDBytes(df: DataFrame): Long = {
     df.rdd.getNumPartitions match {
       case 0 =>
         0L
       case _ =>
         // Sum the UTF-8 length of each row's string representation.
         // Note: this measures Row.toString, not the binary or on-disk size.
         val rddOfDataframe = df.rdd.map(_.toString().getBytes("UTF-8").length.toLong)
         if (rddOfDataframe.isEmpty()) 0L else rddOfDataframe.reduce(_ + _)
     }
   }
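
For comparison, a hedged sketch of reading back the catalog-side figure this function is measured against (my_table is a placeholder table name; the statistics row only appears once statistics have been collected):

    // Collect table-level statistics, then read them from the catalog.
    spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS")
    spark.sql("DESCRIBE TABLE EXTENDED my_table")
      .filter("col_name = 'Statistics'")
      .show(truncate = false)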
I would appreciate it if you could provide your suggestions.

Best Regards
Kelly Zhang



 



 



 


Re: Fw:Re:Re: A question about RDD byte size

cloud0fan
You can only know the actual data size of your RDD if you serialize your data objects to binary. Otherwise you can only estimate the size of the data objects in the JVM.
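
A minimal sketch of that idea, assuming plain Java serialization of the Row objects (the helper name serializedRowBytes and the DataFrame df are placeholders, and the total depends entirely on the serializer used):

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}

    import org.apache.spark.sql.DataFrame

    // Serialize every row of each partition and count the bytes produced.
    // The result reflects the chosen serialization format, not the JVM heap footprint.
    def serializedRowBytes(df: DataFrame): Long =
      df.rdd
        .mapPartitions { rows =>
          val buffer = new ByteArrayOutputStream()
          val out = new ObjectOutputStream(buffer)
          rows.foreach(out.writeObject)
          out.close()
          Iterator.single(buffer.size().toLong)
        }
        .fold(0L)(_ + _)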
