Spark Utf 8 encoding


Spark Utf 8 encoding

lsn24
Hello,

 Per the documentation, Spark's default character encoding is UTF-8. But
when I try to read non-ASCII characters, Spark tends to read them as question
marks. What am I doing wrong? Below is my code:

val ds = spark.read.textFile("a .bz2 file from hdfs");
ds.show();

The string "KøBENHAVN" gets displayed as "K�BENHAVN".

I tested this in the Spark shell and also ran the same command as part of a
Spark job. Both yield the same result.

I don't know what I am missing. I read the documentation but couldn't find
any explicit config for this.

Any pointers will be greatly appreciated!

Thanks




--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Spark Utf 8 encoding

Sean Owen-2
That doesn't necessarily look like a Spark-related issue. Your
terminal seems to be displaying the glyph with a question mark because
the font lacks that symbol, maybe?
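
One way to tell a display problem from a data problem is to look at the code points of the string rather than at the rendered glyphs. The following is only a minimal sketch: the object name CodePointDump and its helpers are hypothetical, not part of Spark. If the string really holds U+00F8 (ø), the terminal or font is the likely culprit; if it holds U+FFFD (the replacement character), the bytes were probably already mis-decoded before display.

```scala
// Hypothetical helper (not a Spark API): inspect what a string really contains.
object CodePointDump {
  // Render every code point as "U+XXXX" so the result does not depend on fonts.
  def codePoints(s: String): List[String] =
    s.codePoints().toArray.toList.map(cp => f"U+$cp%04X")

  // True if the string contains U+FFFD, the Unicode replacement character.
  def hasReplacementChar(s: String): Boolean =
    s.indexOf('\uFFFD') >= 0
}
```

In the Spark shell this could be applied to a few rows, for example by mapping CodePointDump.codePoints over the result of ds.take(5), assuming ds is the Dataset[String] from the original post.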
On Fri, Nov 9, 2018 at 7:17 PM lsn24 <[hidden email]> wrote:



Re: Spark Utf 8 encoding

Jörn Franke
In reply to this post by lsn24
Is the original file indeed UTF-8? Windows environments in particular tend to mess up files (e.g. Java on Windows does not use UTF-8 by default). However, the software that processed the data earlier could also have modified it.
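
To check whether bytes are really valid UTF-8, a strict decoder can be used on a sample of the raw file. The sketch below uses only the JDK; the object name Utf8Check and its fields are hypothetical and chosen for illustration. It also reproduces the reported symptom under the assumption that the file was written as ISO-8859-1 (Latin-1), which is a guess and not something the original post confirms.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

// Hypothetical helper (not a Spark API): strict UTF-8 validation of raw bytes.
object Utf8Check {
  // Returns false on malformed input instead of silently substituting U+FFFD.
  def isValidUtf8(bytes: Array[Byte]): Boolean = {
    val decoder = StandardCharsets.UTF_8.newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    try { decoder.decode(ByteBuffer.wrap(bytes)); true }
    catch { case _: CharacterCodingException => false }
  }

  val city = "K\u00f8BENHAVN"

  // Assumption: the writer used Latin-1, so ø is the single byte 0xF8.
  val latin1Bytes: Array[Byte] = city.getBytes(StandardCharsets.ISO_8859_1)

  // The same text as UTF-8, where ø takes two bytes (0xC3 0xB8).
  val utf8Bytes: Array[Byte] = city.getBytes(StandardCharsets.UTF_8)

  // Lenient decoding of Latin-1 bytes as UTF-8 replaces 0xF8 with U+FFFD.
  val misdecoded: String = new String(latin1Bytes, StandardCharsets.UTF_8)

  // Decoding with the charset the bytes were written in recovers the text.
  val recovered: String = new String(latin1Bytes, StandardCharsets.ISO_8859_1)
}
```

If isValidUtf8 returns false for the decompressed file contents, the file is not UTF-8 and would need to be decoded with its actual charset, or converted, before Spark reads it as text.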

> On 10.11.2018 at 02:17, lsn24 <[hidden email]> wrote:


Re: Spark Utf 8 encoding

lsn24
In reply to this post by Sean Owen-2
My terminal can display UTF-8 encoded characters. I already verified that,
but I will double-check.
Thanks!


