Unsupported Catalyst types in Parquet

11 messages
Unsupported Catalyst types in Parquet

Alessandro Baretta
Michael,

I'm having trouble storing my SchemaRDDs in Parquet format with SparkSQL,
due to my RDDs having DateType and DecimalType fields. What would it
take to add Parquet support for these Catalyst types? Are there any other
Catalyst types for which there is no Parquet support?

Alex

RE: Unsupported Catalyst types in Parquet

Wang, Daoyuan
Hi Alex,

I'll create JIRA SPARK-4985 for date type support in Parquet, and SPARK-4987 for timestamp type support. For decimal type, I think we only support decimals that fit in a long.

Thanks,
Daoyuan

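For context, the "fits in a long" constraint refers to the convention of storing a decimal as its unscaled integer value at a fixed scale, which only works in a 64-bit column up to roughly precision 18. A minimal sketch of that encoding (illustrative only, not Spark's actual code):

```python
from decimal import Decimal

def decimal_to_unscaled_long(d: Decimal, scale: int) -> int:
    """Encode a decimal as its unscaled integer value at the given scale.

    A 64-bit (long) column can only hold the result when the unscaled
    value fits in a signed 64-bit range, i.e. roughly precision <= 18.
    """
    unscaled = int(d.scaleb(scale))  # shift the decimal point right by `scale`
    if not -2**63 <= unscaled < 2**63:
        raise OverflowError("decimal does not fit in a long")
    return unscaled
```

For example, `decimal_to_unscaled_long(Decimal("12.34"), 2)` yields the unscaled value `1234`, which a reader reinterprets as `1234 * 10**-2`.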

RE: Unsupported Catalyst types in Parquet

Alessandro Baretta
Daoyuan,

Thanks for creating the JIRAs. I need these features by... last week, so
I'd be happy to take care of this myself, if only you or someone more
experienced than I am in the SparkSQL codebase could provide some guidance.

Alex

Re: Unsupported Catalyst types in Parquet

Michael Armbrust
I'd love to get both of these in. There is some trickiness that I talk
about on the JIRA for timestamps, since the SQL timestamp class can support
nanoseconds and I don't think Parquet has a type for this. Other systems
(Impala) seem to use INT96. It would be great to ask on the Parquet
mailing list what the plan is there, to make sure that whatever we do is
going to be compatible long term.

Michael
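For reference, the Impala-style INT96 timestamp mentioned above packs nanoseconds-within-day plus a Julian day number into 12 bytes. A hedged sketch of that layout (my reading of the convention, not Spark code; note Python's `datetime` only carries microseconds, which is exactly the kind of precision gap the thread is worried about):

```python
import struct
from datetime import date, datetime

# Julian day number of the Unix epoch (1970-01-01); a standard constant.
JULIAN_DAY_OF_EPOCH = 2440588

def timestamp_to_int96(ts: datetime) -> bytes:
    """Pack a timestamp as an Impala-style INT96: 8 little-endian bytes
    of nanoseconds within the day, then 4 little-endian bytes of the
    Julian day number."""
    days_since_epoch = ts.date().toordinal() - date(1970, 1, 1).toordinal()
    julian_day = days_since_epoch + JULIAN_DAY_OF_EPOCH
    nanos_in_day = ((ts.hour * 3600 + ts.minute * 60 + ts.second) * 10**9
                    + ts.microsecond * 1000)
    return struct.pack("<qi", nanos_in_day, julian_day)  # 12 bytes total
```

Because the nanos field is a full 64-bit count within the day, this layout can represent nanosecond precision; truncation only happens if the source type (here, `datetime`) cannot supply it.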


Re: Unsupported Catalyst types in Parquet

Alessandro Baretta
Michael,

Actually, Adrian Wang already created pull requests for these issues.

https://github.com/apache/spark/pull/3820
https://github.com/apache/spark/pull/3822

What do you think?

Alex


Re: Unsupported Catalyst types in Parquet

Michael Armbrust
Yeah, I saw those.  The problem is that #3822 truncates timestamps that
include nanoseconds.


RE: Unsupported Catalyst types in Parquet

Wang, Daoyuan
I have modified #3822 to include nanoseconds by adding a flag in SQLContext. Since passing too many individual flags is ugly, I now pass the whole SQLContext, so that we can put more flags there.

Thanks,
Daoyuan



Re: Unsupported Catalyst types in Parquet

Alessandro Baretta
Gents,

I tried #3820. It doesn't work. I'm still getting the following exception:

Exception in thread "Thread-45" java.lang.RuntimeException: Unsupported datatype DateType
        at scala.sys.package$.error(package.scala:27)
        at org.apache.spark.sql.parquet.ParquetTypesConverter$anonfun$fromDataType$2.apply(ParquetTypes.scala:343)
        at org.apache.spark.sql.parquet.ParquetTypesConverter$anonfun$fromDataType$2.apply(ParquetTypes.scala:292)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291)
        at org.apache.spark.sql.parquet.ParquetTypesConverter$anonfun$4.apply(ParquetTypes.scala:363)
        at org.apache.spark.sql.parquet.ParquetTypesConverter$anonfun$4.apply(ParquetTypes.scala:362)

I would be more than happy to fix this myself, but I would need some help
wading through the code. Could anyone explain to me what exactly is needed
to support a new data type in SparkSQL's Parquet storage engine?

Thanks.

Alex
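To illustrate the format side of what a date mapping needs: Parquet's DATE logical type is an INT32 count of days since the Unix epoch, so a converter has to map DateType to that physical type plus the value conversion below. A sketch of the value mapping (illustrative; not the Spark code path in the trace above):

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def date_to_parquet_days(d: date) -> int:
    """Parquet's DATE logical type: days since the Unix epoch, as INT32."""
    return (d - EPOCH).days

def parquet_days_to_date(days: int) -> date:
    """Inverse mapping, used when reading the column back."""
    return EPOCH + timedelta(days=days)
```

For example, `date_to_parquet_days(date(1970, 1, 2))` is `1`; the round trip through `parquet_days_to_date` recovers the original date.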


Re: Unsupported Catalyst types in Parquet

Alessandro Baretta
Sorry! My bad. I had stale spark jars sitting on the slave nodes...

Alex


Re: Unsupported Catalyst types in Parquet

Alessandro Baretta
Here's a more meaningful exception:

java.lang.ClassCastException: org.apache.spark.sql.catalyst.types.DateType$ cannot be cast to org.apache.spark.sql.catalyst.types.PrimitiveType
        at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:188)
        at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:167)
        at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:130)
        at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
        at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
        at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$writeShard$1(ParquetTableOperations.scala:309)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:326)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:326)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)


This is easy to fix even for a newbie like myself: it suffices to add the
PrimitiveType trait to the DateType object. You can find this change here:

https://github.com/alexbaretta/spark/compare/parquet-date-support

However, even this does not work. Here's the next blocker:

java.lang.RuntimeException: Unsupported datatype DateType, cannot write to consumer
        at scala.sys.package$.error(package.scala:27)
        at org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:361)
        at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:329)
        at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:315)
        at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
        at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
        at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$writeShard$1(ParquetTableOperations.scala:309)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:326)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:326)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Any input on how to address this issue would be welcome.

Alex


Re: Unsupported Catalyst types in Parquet

Alessandro Baretta
I think I might have figured it out myself. Here's a pull request for you
guys to check out:

https://github.com/apache/spark/pull/3855

I successfully tested this code on my cluster.
