[DISCUSS] upper/lower of special characters

[DISCUSS] upper/lower of special characters

seancxmao
Hi, all

We found some differences in how Spark and other database systems handle the case of special characters. See the list below for an example (you may also check the attached pictures):

select upper("i"), lower("İ"), upper("ı"), lower("I");
------------------------------------------------------
Spark      I, i with dot, I, i
Hive       I, i with dot, I, i
Teradata   I, i,          I, i
Oracle     I, i,          I, i
SQLServer  I, i,          I, i
MySQL      I, i,          I, i

"İ" and "ı" are Turkish characters. If locale-sensitive case handling were used, the expected results of the above upper/lower functions would be:
select upper("i"), lower("İ"), upper("ı"), lower("I");
------------------------------------------------------
İ, i, I, ı

However, it seems that these systems all do locale-insensitive mapping. Presto explicitly describes this as a known issue in its docs (https://prestodb.io/docs/current/functions/string.html):
> The lower() and upper() functions do not perform locale-sensitive, context-sensitive, or one-to-many mappings required for some languages. Specifically, this will return incorrect results for Lithuanian, Turkish and Azeri.

Java-based systems behave the same because they all depend on the same JDK String methods, and Teradata/Oracle/SQLServer/MySQL also behave the same as each other. However, the two groups return different results for lower("İ"): Java-based systems (Spark/Hive) return "i with dot", while the other database systems (Teradata/Oracle/SQLServer/MySQL) return "i".
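The JDK behavior described above can be reproduced directly with java.lang.String. A minimal sketch showing how the Turkish dotted/dotless "i" maps under the root locale versus a Turkish locale:

```java
import java.util.Locale;

public class TurkishCaseDemo {
    public static void main(String[] args) {
        String dottedI = "\u0130";  // "İ" LATIN CAPITAL LETTER I WITH DOT ABOVE

        // Root (locale-insensitive) lowering: "İ" becomes "i" + U+0307
        // (COMBINING DOT ABOVE) -- the "i with dot" that Spark/Hive return.
        String lowered = dottedI.toLowerCase(Locale.ROOT);
        System.out.println(lowered + " length=" + lowered.length());  // length=2

        // Turkish locale: "İ" lowers to a plain "i", and "I" lowers to
        // dotless "ı" (U+0131) -- the locale-sensitive expected results.
        Locale tr = Locale.forLanguageTag("tr");
        System.out.println(dottedI.toLowerCase(tr));   // i
        System.out.println("I".toLowerCase(tr));       // ı
        System.out.println("i".toUpperCase(tr));       // İ
    }
}
```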

My questions:
(1) Should we let Spark return "i" for lower("İ"), the same as other database systems?
(2) Should Spark support locale-sensitive upper/lower functions? Because different rows of a table may need different locales, we cannot even set the locale at the table level. What we could do is provide upper(string, locale)/lower(string, locale) and let users decide which locale they want to use.
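To make question (2) concrete, here is a hypothetical sketch of what two-argument, locale-aware functions could look like. The names lowerWithLocale/upperWithLocale and the BCP 47 language-tag argument are assumptions for illustration, not an existing Spark API:

```java
import java.util.Locale;

public class LocaleAwareCase {
    // Hypothetical helpers: the second argument is a BCP 47 language tag
    // (e.g. "tr" for Turkish), chosen per call rather than per table.
    static String lowerWithLocale(String s, String languageTag) {
        return s.toLowerCase(Locale.forLanguageTag(languageTag));
    }

    static String upperWithLocale(String s, String languageTag) {
        return s.toUpperCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        System.out.println(lowerWithLocale("\u0130", "tr"));  // i
        System.out.println(upperWithLocale("i", "tr"));       // İ
        System.out.println(lowerWithLocale("I", "en"));       // i
    }
}
```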

Some references below. Just FYI.

* https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#toLowerCase-java.util.Locale-
* https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#toUpperCase-java.util.Locale-
* http://grepalex.com/2013/02/14/java-7-and-the-dotted--and-dotless-i/
* https://stackoverflow.com/questions/3322152/is-there-a-way-to-get-rid-of-accents-and-convert-a-whole-string-to-regular-lette

Your comments and advice are highly appreciated.

Many thanks!
Chenxiao Mao (Sean)


---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Attachments: i-hive.png (33K), i-mysql.png (49K), i-oracle.jpeg (67K), i-spark.png (31K), i-sqlserver.png (42K), i-teradata.jpeg (142K)

Re: [DISCUSS] upper/lower of special characters

rxin
I'd just document it as a known limitation and move on for now, until there are enough end users that need this. Spark is also very powerful with UDFs and end users can easily work around this using UDFs.

--
excuse the brevity and lower case due to wrist injury


On Tue, Sep 18, 2018 at 11:14 PM seancxmao <[hidden email]> wrote:

Re: [DISCUSS] upper/lower of special characters

Sean Owen-2
In reply to this post by seancxmao
I don't have the details in front of me, but I recall we explicitly overhauled locale-sensitive toUpper and toLower in the code for this exact situation. The current behavior should be on purpose. I believe user data strings are handled in a case sensitive way but things like reserved words in SQL are not of course. The Spark behavior is most correct and consistent with Hive, right?

On Wed, Sep 19, 2018, 1:14 AM seancxmao <[hidden email]> wrote:

Re: [DISCUSS] upper/lower of special characters

seancxmao
In reply to this post by rxin
Hi, Reynold

Sorry for the slow response. Thanks for your suggestion. I'd like to document this in the API docs (SQL built-in functions).

BTW, this is a real case we met in production; the Turkish data comes from other systems through ETL. As you suggested, we use UDFs to avoid these issues. E.g., for the special Turkish character "İ" (U+0130), we first apply regexp_replace(c, 'İ', 'I') before further processing.
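The same preprocessing idea, sketched outside Spark in plain Java (a character replacement before the usual locale-insensitive lowercasing, so the result matches what Teradata/Oracle/SQLServer/MySQL return):

```java
import java.util.Locale;

public class TurkishPreprocess {
    // Replace "İ" (U+0130) with "I" before lowercasing, so lower("İ")
    // yields a plain "i" instead of "i" + combining dot above.
    static String normalizeLower(String s) {
        return s.replace('\u0130', 'I').toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalizeLower("\u0130STANBUL"));  // istanbul
    }
}
```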

Thanks,
Chenxiao Mao (Sean)
On 09/19/2018 14:18, [hidden email] wrote:

Re: [DISCUSS] upper/lower of special characters

seancxmao
In reply to this post by Sean Owen-2
Hi, Sean

After a brief investigation, I found there are already some tickets/PRs about this issue; I just didn't know about them.

I have carefully read the discussions and really learned a lot. I totally agree with this idea:
> I could see backing out changes that affect user application strings, to be conservative. We could decide to change that later. The issue here really stems from lowercasing of purely internal strings.

As for the lower/upper functions, Spark has the same behavior as Hive:
select upper("i"), lower("İ"), upper("ı"), lower("I");
------------------------------------------------------
Spark      I, i with dot, I, i
Hive       I, i with dot, I, i

Thanks,
Chenxiao Mao
On 09/19/2018 18:35, [hidden email] wrote: