Inconsistent schema on Encoders.bean (reported issues from user@)

Inconsistent schema on Encoders.bean (reported issues from user@)

Jungtaek Lim-2
Hi devs,

There are a couple of issues reported on the user@ mailing list that trace back to the inconsistent schema produced by Encoders.bean.

1. Typed datataset from Avro generated classes? [1]
2. spark structured streaming GroupState returns weird values from sate [2]

Below is the part of JavaTypeInference.inferDataType() which handles beans:

https://github.com/apache/spark/blob/f72220b8ab256e8e6532205a4ce51d50b69c26e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L139-L157

It collects properties based on the availability of a getter only.

(The same logic is applied in `SQLContext.beansToRows`.)

JavaTypeInference.serializerFor() and JavaTypeInference.deserializerFor() are different: they collect properties that have both a getter and a setter.
(They also call JavaTypeInference.inferDataType() internally, so the inconsistency shows up even when only these methods are called.)

This inconsistency produces runtime issues when a Java bean has a getter-only property for some fields; there doesn't even need to be a backing field for the getter, since getter/setter methods are discovered purely by naming convention.
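To make the divergence concrete, here is a small self-contained sketch (my own illustration, not Spark code) using java.beans.Introspector, which JavaTypeInference is also built on. `MyBean` and the helper method names are hypothetical:

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.List;

public class BeanSchemaSketch {
    // A bean with one read-write property and one getter-only property.
    public static class MyBean {
        private int key;
        public int getKey() { return key; }
        public void setKey(int key) { this.key = key; }
        // Getter-only: discovered as a bean property even without a backing field or setter.
        public int getDoubledKey() { return key * 2; }
    }

    // Roughly what inferDataType() does: collect properties that have a read method.
    static List<String> readableProps(Class<?> cls) {
        List<String> names = new ArrayList<>();
        for (PropertyDescriptor pd : descriptors(cls)) {
            if (pd.getReadMethod() != null) names.add(pd.getName());
        }
        return names;
    }

    // Roughly what serializerFor()/deserializerFor() do: require read AND write methods.
    static List<String> readWriteProps(Class<?> cls) {
        List<String> names = new ArrayList<>();
        for (PropertyDescriptor pd : descriptors(cls)) {
            if (pd.getReadMethod() != null && pd.getWriteMethod() != null) names.add(pd.getName());
        }
        return names;
    }

    private static PropertyDescriptor[] descriptors(Class<?> cls) {
        try {
            return Introspector.getBeanInfo(cls, Object.class).getPropertyDescriptors();
        } catch (IntrospectionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("schema sees: " + readableProps(MyBean.class));
        System.out.println("serde sees:  " + readWriteProps(MyBean.class));
    }
}
```

The inferred schema ends up with a column ("doubledKey") that the serializer never writes, which is exactly the shape mismatch described above.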

I feel this is something we should fix, but I would like to hear opinions on how to fix it. If a user query contains such problematic beans but hasn't hit the issue yet, fixing it would drop some columns, which is backward incompatible. I still think this is the way to go, but if we care more about not breaking existing queries, we should at least document the ideal form of bean that Spark expects.

Would like to hear opinions on this.

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://lists.apache.org/thread.html/r8f8e680e02955cdf05b4dd34c60a9868288fd10a03f1b1b8627f3d84%40%3Cuser.spark.apache.org%3E
2. http://mail-archives.apache.org/mod_mbox/spark-user/202003.mbox/%3cCAFX8L21Dzbyv5m1QOzs3y+PCmYCwbtJkO6YTWvKydZTq7u4gZw@...%3e

Re: Inconsistent schema on Encoders.bean (reported issues from user@)

Jungtaek Lim-2
(bump to expose the discussion to more readers)

On Mon, May 4, 2020 at 4:57 PM Jungtaek Lim <[hidden email]> wrote:

Re: Inconsistent schema on Encoders.bean (reported issues from user@)

cloud0fan
Can you give some simple examples to demonstrate the problem? I think the inconsistency would bring problems but don't know how.

On Fri, May 8, 2020 at 3:49 PM Jungtaek Lim <[hidden email]> wrote:

Re: Inconsistent schema on Encoders.bean (reported issues from user@)

Jungtaek Lim-2
The first user report is straightforward: according to the report, the Avro-generated code contains a getter that returns the class itself, so Spark disallows it (throws an exception). But that getter has no matching setter (if I understand correctly), so technically the property shouldn't matter at all.

For the second user report, I've reproduced the problem with my own code. Please refer to the gist: https://gist.github.com/HeartSaVioR/fab85734b5be85198c48f45004c8e0ca

The code aggregates the max value per key, where keys range from 0 to 9.

We expect results like (0, 10000), (1, 10001), ..., (9, 10009), but the actual output is incorrect, as shown below:

-------------------------------------------
Batch: 0
-------------------------------------------
+---+--------+
|key|maxValue|
+---+--------+
+---+--------+

-------------------------------------------
Batch: 1
-------------------------------------------
+---+--------+
|key|maxValue|
+---+--------+
|  0|   18990|
|  7|   18997|
|  6|   18996|
|  9|   18999|
|  5|   18995|
|  1|   18991|
|  3|   18993|
|  8|   18998|
|  2|   18992|
|  4|   18994|
+---+--------+

-------------------------------------------
Batch: 2
-------------------------------------------
+-----+------------+
|  key|    maxValue|
+-----+------------+
|18990|       30990|
|18997|540502118145|
|18996|249574852617|
|18999|146327314953|
|18995|243603134985|
|18991|476309451025|
|18993|287916490001|
|18998|324427845137|
|18992|412640801297|
|18994|302012976401|
+-----+------------+
...

This can happen with such inconsistent schemas because state in Structured Streaming doesn't check the schema (neither names nor types) and simply applies the raw values positionally, by column order.
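As a sketch of the failure mode (my own illustration; the column orders and values are hypothetical, loosely following the reproducer's output): when raw values are applied to columns purely by position, any difference between the writer's and the reader's column order silently scrambles the fields:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StateMismatchSketch {
    // Apply raw values to column names purely by position, without checking
    // names or types -- the way the state row is rebuilt in this scenario.
    static Map<String, Object> applyPositionally(String[] columns, Object[] raw) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            row.put(columns[i], raw[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        // Suppose the values were written in one column order: maxValue first...
        Object[] raw = {18990, 0};
        // ...but are read back assuming the inferred schema's order:
        String[] assumedSchema = {"key", "maxValue"};
        // The max value lands in the "key" column -- compare the Batch 2 output above.
        System.out.println(applyPositionally(assumedSchema, raw));
    }
}
```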

On Fri, May 8, 2020 at 5:50 PM Wenchen Fan <[hidden email]> wrote:

Re: Inconsistent schema on Encoders.bean (reported issues from user@)

cloud0fan
Is it a problem only for streaming, or does it affect batch queries as well?

On Fri, May 8, 2020 at 11:42 PM Jungtaek Lim <[hidden email]> wrote:

Re: Inconsistent schema on Encoders.bean (reported issues from user@)

Jungtaek Lim-2
The first case is not tied to batch vs. streaming, as Encoders.bean simply fails while inferring the schema.

The second case is tied to streaming, and I've described the reason in my last reply. I'm not sure batch is free of similar cases, though: if any operator relies only on column order when matching rows against a schema, it could be affected as well.

On Mon, May 11, 2020 at 1:24 PM Wenchen Fan <[hidden email]> wrote:

Re: Inconsistent schema on Encoders.bean (reported issues from user@)

Jungtaek Lim-2
OK, I just went through the change, and it breaks a bunch of existing UTs:

https://github.com/apache/spark/pull/28611
Note that I changed every place where Spark extracts columns from "read method"-only properties to require both "read" and "write" methods. This changes not only the code path of Encoders.bean, but also that of createDataFrame from a Java bean, including case classes used from Java (Scala-Java interop). Case classes don't have explicit getter and setter methods.

Personally I'm not in favor of the current uncertainty in Spark's definition of a Java bean (it is explained nowhere), but I'm also not sure we are OK with the breaking changes. We might reduce the breakage by letting createDataFrame (leave as-is) and Encoders.bean (require read and write methods) behave differently, but that is still a breaking change, and the difference would be confusing unless we explain it well.

Any thoughts?


On Mon, May 11, 2020 at 1:36 PM Jungtaek Lim <[hidden email]> wrote:

Re: Inconsistent schema on Encoders.bean (reported issues from user@)

Sean Owen-2
Java Beans are well-defined; it's valid to have a getter- or
setter-only property. That doesn't mean Spark can meaningfully use
such a property, as it typically has to both read and write them. I
guess it depends on context. For example, I don't see how you can have
a deserializer without setters, or a serializer without getters.

case classes do have accessor (and if applicable mutator) methods
generated automatically but they do not follow bean conventions.
("foo" gets a "foo" method, not "getFoo")
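This can be checked directly with bean introspection. In this sketch (my own; the class name is hypothetical), a Scala-style accessor `foo()` is invisible to the bean conventions:

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;

public class CaseClassNamingSketch {
    // Mimics the accessor a Scala case class generates: foo(), not getFoo().
    public static class CaseClassLike {
        private final String foo;
        public CaseClassLike(String foo) { this.foo = foo; }
        public String foo() { return foo; } // Scala-style accessor, not a bean getter
    }

    // Count the bean properties declared by the class itself (Object.class excluded).
    static int beanPropertyCount(Class<?> cls) {
        try {
            return Introspector.getBeanInfo(cls, Object.class).getPropertyDescriptors().length;
        } catch (IntrospectionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // foo() does not match the get/is naming convention, so no property is found.
        System.out.println(beanPropertyCount(CaseClassLike.class)); // 0
    }
}
```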

I haven't read this in detail but it seems like most of the issue you
are seeing is that it's not checking the property names, just using
ordering, in your reproducer. That seems different?

On Sun, May 24, 2020 at 3:00 AM Jungtaek Lim
<[hidden email]> wrote:



Re: Inconsistent schema on Encoders.bean (reported issues from user@)

Jungtaek Lim-2
I meant that how Spark interprets Java beans is not consistently defined.

Unlike what you've guessed, in most paths Spark uses "read-only" properties. (All of the existing tests that failed in my experiment involve "read-only" properties.) The problematic case is when a Java bean is used for both reading and writing; one example is using a Java bean as the data type of "state" in Structured Streaming, where Spark converts rows to Java beans and vice versa.
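A tiny round-trip sketch (my own illustration; the class and method names are hypothetical) shows why the read-write path is the problematic one: every readable property makes it into the row, but only properties with a setter can be restored into the bean:

```java
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.LinkedHashMap;
import java.util.Map;

public class RoundTripSketch {
    public static class StateBean {
        private int key;
        public int getKey() { return key; }
        public void setKey(int k) { this.key = k; }
        // Getter-only property: readable, but impossible to restore.
        public int getKeyPlusOne() { return key + 1; }
    }

    // "Serialize": read every readable property (what schema inference sees).
    static Map<String, Object> toRow(Object bean) {
        try {
            Map<String, Object> row = new LinkedHashMap<>();
            for (PropertyDescriptor pd :
                    Introspector.getBeanInfo(bean.getClass(), Object.class).getPropertyDescriptors()) {
                if (pd.getReadMethod() != null) {
                    row.put(pd.getName(), pd.getReadMethod().invoke(bean));
                }
            }
            return row;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // "Deserialize": only properties with a write method can be written back.
    static StateBean fromRow(Map<String, Object> row) {
        try {
            StateBean bean = new StateBean();
            for (PropertyDescriptor pd :
                    Introspector.getBeanInfo(StateBean.class, Object.class).getPropertyDescriptors()) {
                if (pd.getWriteMethod() != null && row.containsKey(pd.getName())) {
                    pd.getWriteMethod().invoke(bean, row.get(pd.getName()));
                }
            }
            return bean;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        StateBean b = new StateBean();
        b.setKey(41);
        Map<String, Object> row = toRow(b);   // row has both "key" and "keyPlusOne"
        StateBean restored = fromRow(row);    // only "key" is written back
        System.out.println(row.keySet() + " -> key=" + restored.getKey());
    }
}
```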

On Sun, May 24, 2020 at 11:01 PM Sean Owen <[hidden email]> wrote:
Java Beans are well-defined; it's valid to have a getter- or
setter-only property. That doesn't mean Spark can meaningfully use
such a property, as it typically has to both read and write them. I
guess it depends on context. For example, I don't see how you can have
a deserializer without setters, or a serializer without getters.

case classes do have accessor (and if applicable mutator) methods
generated automatically but they do not follow bean conventions.
("foo" gets a "foo" method, not "getFoo")

I haven't read this in detail but it seems like most of the issue you
are seeing is that it's not checking the property names, just using
ordering, in your reproducer. That seems different?

On Sun, May 24, 2020 at 3:00 AM Jungtaek Lim
<[hidden email]> wrote:
>
> OK I just went through the change, and the change breaks bunch of existing UTs.
>
> https://github.com/apache/spark/pull/28611
>
> Note that I modified all the cases where Spark extracts the columns for "read method" only properties to both "read" & "write". It doesn't only change the code path of Encoders.bean, but also change the code path of createDataFrame from Java bean, including case class in Java language (Scala-Java Interop). Case class doesn't have explicit setter & getter methods.
>
> Personally I'm not in favor of the uncertainly of definition of Java bean in Spark (explained nowhere), but also not sure we are OK with the breaking changes. We might be able to reduce the breaking changes by allowing the difference between createDataFrame (leave as it is) and Encoders.bean (require read & write methods), but it is still a breaking change and the difference would be confusing if we don't explain it enough.
>
> Any thoughts?
>
>
> On Mon, May 11, 2020 at 1:36 PM Jungtaek Lim <[hidden email]> wrote:
>>
>> The first case is not tied to batch vs. streaming, as Encoders.bean simply fails when inferring the schema.
>>
>> The second case is tied to streaming, and I've described the reason in the last reply. I'm not sure we don't have a similar case for batch, though. (If there are operators that rely only on the sequence of columns while matching a row with a schema, they could be affected.)
>>
>> On Mon, May 11, 2020 at 1:24 PM Wenchen Fan <[hidden email]> wrote:
>>>
>>> Is it a problem only for streaming, or does it affect batch queries as well?
>>>
>>> On Fri, May 8, 2020 at 11:42 PM Jungtaek Lim <[hidden email]> wrote:
>>>>
>>>> The first user report is obvious: according to the report, the Avro-generated code contains a getter that refers to the class itself, which Spark disallows (throws an exception); but it doesn't have a matching setter method (if I understand correctly), so technically it shouldn't matter.
>>>>
>>>> For the second user report, I've reproduced it with my own code. Please refer to the gist: https://gist.github.com/HeartSaVioR/fab85734b5be85198c48f45004c8e0ca
>>>>
>>>> This code aggregates the max value per key, where the key is in the range 0 to 9.
>>>>
>>>> We're expecting a result like (0, 10000), (1, 10001), ..., (9, 10009), but the actual result is incorrect, as below:
>>>>
>>>> -------------------------------------------
>>>> Batch: 0
>>>> -------------------------------------------
>>>> +---+--------+
>>>> |key|maxValue|
>>>> +---+--------+
>>>> +---+--------+
>>>>
>>>> -------------------------------------------
>>>> Batch: 1
>>>> -------------------------------------------
>>>> +---+--------+
>>>> |key|maxValue|
>>>> +---+--------+
>>>> |  0|   18990|
>>>> |  7|   18997|
>>>> |  6|   18996|
>>>> |  9|   18999|
>>>> |  5|   18995|
>>>> |  1|   18991|
>>>> |  3|   18993|
>>>> |  8|   18998|
>>>> |  2|   18992|
>>>> |  4|   18994|
>>>> +---+--------+
>>>>
>>>> -------------------------------------------
>>>> Batch: 2
>>>> -------------------------------------------
>>>> +-----+------------+
>>>> |  key|    maxValue|
>>>> +-----+------------+
>>>> |18990|       30990|
>>>> |18997|540502118145|
>>>> |18996|249574852617|
>>>> |18999|146327314953|
>>>> |18995|243603134985|
>>>> |18991|476309451025|
>>>> |18993|287916490001|
>>>> |18998|324427845137|
>>>> |18992|412640801297|
>>>> |18994|302012976401|
>>>> +-----+------------+
>>>> ...
>>>>
>>>> This can happen with such inconsistent schemas because state in Structured Streaming doesn't check the schema (neither names nor types are checked) and simply applies the raw values by column position.
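A toy model of that positional restore (this is not Spark's state-store code, just the failure mode described above, with hypothetical schemas):

```java
import java.util.List;

public class PositionalState {
    // Restore a saved state row purely by column position, with no name
    // or type check: values silently land in the wrong fields whenever
    // the writer's and reader's schemas disagree.
    public static String restore(List<String> readerColumns, Object[] savedRow) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < readerColumns.size(); i++) {
            if (sb.length() > 0) sb.append(", ");
            sb.append(readerColumns.get(i)).append("=").append(savedRow[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The writer saved the row under a schema like [key, maxValue, extra] ...
        Object[] savedRow = {7, 18997L, "extra-column"};
        // ... but the reader inferred a shifted schema with a column dropped.
        List<String> readerColumns = List.of("maxValue", "key");
        // "key" ends up holding the old max value, the corruption pattern
        // visible in the Batch 2 output.
        System.out.println(restore(readerColumns, savedRow));
    }
}
```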
>>>>
>>>> On Fri, May 8, 2020 at 5:50 PM Wenchen Fan <[hidden email]> wrote:
>>>>>
>>>>> Can you give a simple example to demonstrate the problem? I think the inconsistency would cause problems, but I don't see how.
>>>>>
>>>>> On Fri, May 8, 2020 at 3:49 PM Jungtaek Lim <[hidden email]> wrote:
>>>>>>
>>>>>> (bump to expose the discussion to more readers)
>>>>>>
>>>>>> On Mon, May 4, 2020 at 4:57 PM Jungtaek Lim <[hidden email]> wrote:
>>>>>>>
>>>>>>> Hi devs,
>>>>>>>
>>>>>>> There are a couple of issues reported on the user@ mailing list that boil down to the inconsistent schema produced by Encoders.bean.
>>>>>>>
>>>>>>> 1. Typed datataset from Avro generated classes? [1]
>>>>>>> 2. spark structured streaming GroupState returns weird values from sate [2]
>>>>>>>
>>>>>>> Below is a part of JavaTypeInference.inferDataType() which handles beans:
>>>>>>>
>>>>>>> https://github.com/apache/spark/blob/f72220b8ab256e8e6532205a4ce51d50b69c26e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L139-L157
>>>>>>>
>>>>>>> It collects properties based on the availability of a getter alone.
>>>>>>>
>>>>>>> (The same applies to `SQLContext.beansToRows`.)
>>>>>>>
>>>>>>> JavaTypeInference.serializerFor() and JavaTypeInference.deserializerFor() don't. They collect properties based on the availability of both a getter and a setter.
>>>>>>> (They call JavaTypeInference.inferDataType() internally, so the schemas are inconsistent even when only these methods are called.)
>>>>>>>
>>>>>>> This inconsistency produces runtime issues when a Java bean has only a getter for some fields - even when there is no backing field for the getter method, since getter/setter methods are determined purely by naming convention.
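The two selection rules described above can be sketched side by side with the JDK's own introspector (this is not Spark's actual code; `Event` is a hypothetical bean with a getter-only property and no backing field for it):

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.List;

public class SchemaMismatch {
    // Hypothetical bean: "value" is read-write, "displayValue" is getter-only.
    public static class Event {
        private long value;
        public long getValue() { return value; }
        public void setValue(long value) { this.value = value; }
        public String getDisplayValue() { return "v=" + value; }
    }

    // The rule described for inferDataType / beansToRows: a getter is enough.
    public static List<String> readableProps(Class<?> cls) throws IntrospectionException {
        List<String> out = new ArrayList<>();
        for (PropertyDescriptor pd :
                Introspector.getBeanInfo(cls, Object.class).getPropertyDescriptors()) {
            if (pd.getReadMethod() != null) out.add(pd.getName());
        }
        return out;
    }

    // The rule described for serializerFor / deserializerFor: getter AND setter.
    public static List<String> readWriteProps(Class<?> cls) throws IntrospectionException {
        List<String> out = new ArrayList<>();
        for (PropertyDescriptor pd :
                Introspector.getBeanInfo(cls, Object.class).getPropertyDescriptors()) {
            if (pd.getReadMethod() != null && pd.getWriteMethod() != null) out.add(pd.getName());
        }
        return out;
    }

    public static void main(String[] args) throws IntrospectionException {
        // The two rules disagree on the column set for the same bean:
        System.out.println(readableProps(Event.class));   // includes displayValue
        System.out.println(readWriteProps(Event.class));  // value only
    }
}
```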
>>>>>>>
>>>>>>> I feel this is something we should fix, but I would like to see opinions on how to fix it. If a user query uses the problematic beans but hasn't yet encountered the issue, fixing it would drop some columns, which would be backward incompatible. I think this is still the way to go, but if we care more about not breaking existing queries, we may want to at least document the ideal form of the bean Spark expects.
>>>>>>>
>>>>>>> Would like to hear opinions on this.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>
>>>>>>> 1. https://lists.apache.org/thread.html/r8f8e680e02955cdf05b4dd34c60a9868288fd10a03f1b1b8627f3d84%40%3Cuser.spark.apache.org%3E
>>>>>>> 2. http://mail-archives.apache.org/mod_mbox/spark-user/202003.mbox/%3cCAFX8L21Dzbyv5m1QOzs3y+PCmYCwbtJkO6YTWvKydZTq7u4gZw@...%3e