specifying schema on dataframe


specifying schema on dataframe

Sam Elamin
Hi All

I would like to specify a schema when reading from JSON, but when I try to map a number to a Double it fails. I tried FloatType and IntegerType with no joy!


When inferring the schema, customerid is set to String, and I would like to cast it to Double.

so df1 comes back corrupted, while df2 shows the data correctly.


Also, FYI, I need this to be generic, as I would like to apply it to any JSON; I wrote the schema below just as an example of the issue I am facing.

import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val testSchema = StructType(Array(StructField("customerid", DoubleType)))
val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
df1.show(1)
df2.show(1)

Any help would be appreciated; I am sure I am missing something obvious, but for the life of me I can't tell what it is!


Kind Regards
Sam

Re: specifying schema on dataframe

Dirceu Semighini Filho
Hi Sam 
Remove the quotes (") from the number and it will work.
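A minimal sketch of that suggestion, reusing the schema from the original post (this assumes a Spark 2.x shell where `spark` and `sc` are already in scope): once the quotes are gone, the value is a JSON number and the DoubleType schema applies cleanly.

```scala
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val testSchema = StructType(Array(StructField("customerid", DoubleType)))
// The value is now an unquoted JSON number, so it parses as a Double
val dfOk = spark.read
  .schema(testSchema)
  .json(sc.parallelize(Array("""{"customerid":535137}""")))
dfOk.show(1)
```

Alternatively, when the source data cannot change, the inferred String column can be cast after reading, e.g. df2.withColumn("customerid", df2("customerid").cast(DoubleType)).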

On Feb 4, 2017 at 11:46 AM, "Sam Elamin" <[hidden email]> wrote:

Re: specifying schema on dataframe

Sam Elamin
Hi Dirceu

Thanks, you're right! That did work.


But now I'm facing an even bigger problem: I don't have access to change the underlying data, and I just want to apply a schema over something that was written via sparkContext.newAPIHadoopRDD.

Basically, I am reading in an RDD[JsonObject] and would like to convert it into a DataFrame to which I pass the schema.

What's the best way to do this?

I doubt removing all the quotes in the JSON is the best solution, is it?

Regards
Sam 
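One common way to handle the RDD[JsonObject] question above (a sketch under assumptions: `jsonRdd` is a hypothetical handle on the RDD produced by newAPIHadoopRDD, and each JsonObject's `toString` yields valid JSON text): serialize the objects back to JSON strings and hand them to the JSON reader together with the desired schema.

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Desired schema; customerid is read as String here because the source
// stores it quoted, and it can be cast afterwards if needed.
val mySchema = StructType(Array(StructField("customerid", StringType)))

// Hypothetical: jsonRdd is the RDD[JsonObject] read via newAPIHadoopRDD.
val jsonStrings = jsonRdd.map(_.toString)  // serialize back to JSON text
val df = spark.read.schema(mySchema).json(jsonStrings)
```

In Spark 2.x, DataFrameReader.json accepts an RDD[String] directly (it was later deprecated in favour of Dataset[String]).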

On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <[hidden email]> wrote:


Re: specifying schema on dataframe

Michael Armbrust

On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <[hidden email]> wrote: