[SQL] parse_url does not work for Internationalized domain names ?

yash datta
Hi devs,

I stumbled across an interesting problem with the parse_url function, which was implemented in Spark in https://issues.apache.org/jira/browse/SPARK-16281

When a URL contains an internationalized domain name, such as:

val url = "http://правительство.рф"
Spark's parse_url returns null, whereas Hive's version of parse_url works fine.

On digging further, I found that the difference lies in the following call in Spark:

private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}

while Hive uses java.net.URL:

url = new URL(urlStr)

Sure enough, this simple test demonstrates that URL works but URI does not in this case:

val url = "http://правительство.рф"

val uriHost = new URI(url).getHost
val urlHost = new URL(url).getHost

println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
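For what it's worth, here is a minimal sketch of one possible workaround, assuming the host name is already known separately from the rest of the URL: convert it to its ASCII (Punycode) form with java.net.IDN before handing it to URI. The object and helper names (IdnHostSketch, toAsciiHost, uriHost) are made up for illustration and are not part of Spark or Hive.

```scala
import java.net.{IDN, URI}

// Hypothetical sketch, not Spark code: shows that java.net.URI can report the
// host once the internationalized name is converted to its ASCII form.
object IdnHostSketch {
  // Convert a Unicode host name to its ASCII-compatible encoding (Punycode).
  def toAsciiHost(unicodeHost: String): String = IDN.toASCII(unicodeHost)

  // Host as seen by java.net.URI for an http URL built from the given host.
  def uriHost(host: String): String = new URI(s"http://$host").getHost

  def main(args: Array[String]): Unit = {
    val unicodeHost = "правительство.рф"
    val asciiHost = toAsciiHost(unicodeHost)

    // The raw Unicode host is not recognized by URI, the ASCII form is.
    println(s"raw host via URI   = ${uriHost(unicodeHost)}")
    println(s"ascii host via URI = ${uriHost(asciiHost)}")

    // IDN.toUnicode maps the ASCII form back for display.
    println(s"decoded host       = ${IDN.toUnicode(uriHost(asciiHost))}")
  }
}
```

This only illustrates the host handling; whether parse_url should do such a conversion itself, or fall back to java.net.URL, is a separate question.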

To reproduce the problem in spark-sql:

spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
This returns NULL.

Could someone please explain the reason for using URI instead of URL? Does this problem warrant creating a JIRA ticket?


Best Regards
Yash

--
When events unfold with calm and ease
When the winds that blow are merely breeze
Learn from nature, from birds and bees
Live your life in love, and let joy not cease.

Re: [SQL] parse_url does not work for Internationalized domain names ?

StanZhai
This problem was introduced by
<https://issues.apache.org/jira/browse/SPARK-16826> which is designed to
improve performance of PARSE_URL().

The same issue exists in the following SQL:

```SQL
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')

-- returns NULL in Spark 2.1+
-- returns ["abc"] in versions before Spark 2.1
```

I think it's a regression.
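A small sketch of the same difference outside Spark, comparing the two JDK classes on this URL: java.net.URI rejects the unescaped quote characters in the query, while java.net.URL accepts the string as is. The object and helper names here (QueryParseSketch, queryViaUri, queryViaUrl) are hypothetical and only serve the illustration.

```scala
import java.net.{URI, URISyntaxException, URL}
import scala.util.Try

// Hypothetical sketch: compares how java.net.URI and java.net.URL treat a
// query string that contains characters URI considers illegal.
object QueryParseSketch {
  val raw: String = """http://stanzhai.site?p=["abc"]"""

  // Strict parsing: None when URI refuses the input or there is no query.
  def queryViaUri(s: String): Option[String] =
    try Option(new URI(s).getQuery)
    catch { case _: URISyntaxException => None }

  // Lenient parsing: URL does not validate the characters in the query part.
  def queryViaUrl(s: String): Option[String] =
    Try(new URL(s).getQuery).toOption.flatMap(Option(_))

  def main(args: Array[String]): Unit = {
    println(s"query via URI = ${queryViaUri(raw)}")
    println(s"query via URL = ${queryViaUrl(raw)}")
  }
}
```

If the quotes and brackets are percent-encoded first, URI should parse the string as well, so callers can work around this, but that changes what existing queries return.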




Re: [SQL] parse_url does not work for Internationalized domain names ?

yash datta
Thanks for the prompt reply!



BR
Yash

On Fri, Jan 12, 2018 at 3:41 PM, StanZhai <[hidden email]> wrote:
This problem was introduced by
<https://issues.apache.org/jira/browse/SPARK-16826> which is designed to
improve performance of PARSE_URL().

The same issue exists in the following SQL:

```SQL
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')

-- returns NULL in Spark 2.1+
-- returns ["abc"] in versions before Spark 2.1
```

I think it's a regression.


