[Spark SQL] ceil and floor functions on doubles


[Spark SQL] ceil and floor functions on doubles

Anton Okolnychyi
Hi all,

I am wondering why the results of the ceil and floor functions on doubles are internally cast to longs. This causes a loss of precision, since a double can represent values far beyond Long.MaxValue.

Consider the following example:

// 9.223372036854786E20 is greater than Long.MaxValue
val df = sc.parallelize(Array(("col", 9.223372036854786E20))).toDF()
df.createOrReplaceTempView("tbl")
spark.sql("select _2 as original_value, ceil(_2) as ceil_result from tbl").show()

+--------------------+-------------------+
|      original_value|        ceil_result|
+--------------------+-------------------+
|9.223372036854786E20|9223372036854775807|
+--------------------+-------------------+

So, the original double value is clamped to 9223372036854775807, which is Long.MaxValue.
I think it would be better to return 9.223372036854786E20 unchanged (which is what math.ceil actually returns before the cast to long). If this is considered a problem, I can fix it.
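
For reference, here is a minimal sketch of the clamping itself in a plain Scala REPL (no Spark involved); the double-to-long conversion on the JVM saturates at Long.MaxValue:

val d = 9.223372036854786E20
math.ceil(d)        // Double = 9.223372036854786E20 (value preserved)
math.ceil(d).toLong // Long   = 9223372036854775807  (saturated at Long.MaxValue)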

Best regards,
Anton 

Re: [Spark SQL] ceil and floor functions on doubles

Dong Joon Hyun

Hi, Anton.

It's the same result as in Hive, isn't it?

hive> select 9.223372036854786E20, ceil(9.223372036854786E20);
OK
_c0      _c1
9.223372036854786E20         9223372036854775807
Time taken: 2.041 seconds, Fetched: 1 row(s)

Bests,
Dongjoon.



Re: [Spark SQL] ceil and floor functions on doubles

Anton Okolnychyi
Hi Dongjoon,

Yeah, it seems to be the same. So, was it done on purpose to match Hive's behavior?

Best regards,
Anton




Re: [Spark SQL] ceil and floor functions on doubles

Vadim Semenov
Yes, it was done on purpose to match the behavior of Hive (https://issues.apache.org/jira/browse/SPARK-10865).

And I believe Hive returns `Long`s because they adopted the definition used in MySQL (https://issues.apache.org/jira/browse/HIVE-615).
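
If keeping the full double precision matters for your use case, one possible workaround (just a sketch; the helper name ceilAsDouble is for illustration only) is to call math.ceil through a UDF, so the result stays a double instead of going through the built-in cast to long:

import org.apache.spark.sql.functions.{col, udf}

// a ceil that keeps the Double type, bypassing the built-in LongType result
val ceilAsDouble = udf((x: Double) => math.ceil(x))

// using the df from the original example (columns _1 and _2)
df.select(ceilAsDouble(col("_2")).as("ceil_result")).show()
// +--------------------+
// |         ceil_result|
// +--------------------+
// |9.223372036854786E20|
// +--------------------+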
