How to load a Python pickle file into a Spark DataFrame


How to load a Python pickle file into a Spark DataFrame

hxngillani
Hello dear members,
I want to train a model using BigDL. I have a dataset of medical images in the
form of pickle object files (.pck); each pickle file is a 3D image (a 3D array).

I have tried:

 pickleRdd = sc.pickleFile("/home/student/BigDL-trainings/elephantscale/data/volumetric_data/329637-8.pck")
 sqlContext = SQLContext(sc)
 df = sqlContext.createDataFrame(pickleRdd)

This code throws the following error:

Caused by: java.io.IOException:
file:/home/student/BigDL-trainings/elephantscale/data/volumetric_data/329637-8.pck
not a SequenceFile


What I have found out is that the function sc.pickleFile loads a pickle file
that was created by rdd.saveAsPickleFile, whereas my pickle file was created by
Python's "pickle" module.

My question is: is there any way to load such a file into a Spark DataFrame?





Re: How to load a Python pickle file into a Spark DataFrame

Roland Johann
The error you provided hints that pySpark reads pickle files as SequenceFiles, but your files were written as plain pickle files without the SequenceFile format in mind.

I'm no pySpark expert, but I suggest you look into loading the pickle files as binary files and deserializing them with custom code.

Then you should be able to deserialize the records and flat-map the results to get an RDD[YourType].
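
For illustration, a minimal PySpark sketch of that approach. The wildcard path, the assumption that each .pck file holds a single pickled NumPy array, and the column names are only illustrative, not taken from the thread:

import pickle

# Read each .pck file as one (path, bytes) pair instead of as a SequenceFile.
binary_rdd = sc.binaryFiles(
    "/home/student/BigDL-trainings/elephantscale/data/volumetric_data/*.pck")

def to_rows(path_and_bytes):
    path, raw = path_and_bytes
    volume = pickle.loads(raw)  # assumed: one 3D NumPy array per file
    # Flatten into plain Python types so createDataFrame can infer a schema.
    yield (path, list(volume.shape), volume.flatten().tolist())

rows_rdd = binary_rdd.flatMap(to_rows)
df = sqlContext.createDataFrame(rows_rdd, ["path", "shape", "values"])
df.printSchema()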

Best Regards

Roland Johann
Software Developer/Data Engineer

phenetic GmbH
Lütticher Straße 10, 50674 Köln, Germany

Mobile: +49 172 365 26 46
Mail: [hidden email]
Web: phenetic.io

Commercial register: Amtsgericht Köln (HRB 92595)
Managing directors: Roland Johann, Uwe Reimann






Re: How to load a Python pickle file into a Spark DataFrame

Sean Owen-2
In reply to this post by hxngillani
Yes, this does not read raw pickle files. It reads files written in
the standard Spark/Hadoop form for binary objects (SequenceFiles) but
uses Python pickling for the serialization. See the docs, which say
this reads what saveAsPickleFile() writes.
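
A short round trip illustrating that pairing; the path and sample data here are only for illustration, and a SparkContext `sc` is assumed to exist:

# Write an RDD in the format sc.pickleFile expects (a SequenceFile of
# pickled Python objects), then read it back.
rdd = sc.parallelize([{"id": 1, "label": "a"}, {"id": 2, "label": "b"}])
rdd.saveAsPickleFile("/tmp/example_pickle_rdd")

restored = sc.pickleFile("/tmp/example_pickle_rdd")
print(restored.collect())

# A file produced directly by Python's pickle.dump() is not in this format,
# which is why reading it with sc.pickleFile fails with "not a SequenceFile".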

