Distinct on Map data type -- SPARK-19893


Distinct on Map data type -- SPARK-19893

ckhari4u
I see SPARK-19893 was backported to Spark 2.1 and 2.0.1 as well, but I do not
see a clear justification for why it is important and needed. I have a sample
table that works fine with an earlier build of Spark 2.1.0. Now that the
latest build includes the backport of SPARK-19893, it's failing with this
error:

Error in query: Cannot have map type columns in DataFrame which calls set
operations(intersect, except, etc.), but the type of column metrics is
map<string,int>;;
Distinct


*In an old build of Spark 2.1.0, I tried the following:*


CREATE TABLE map_demo2
(
  country_id BIGINT,
  metrics    MAP<STRING, INT>
);

INSERT INTO TABLE map_demo2 SELECT 2, map("chaka", 102);
INSERT INTO TABLE map_demo2 SELECT 3, map("chaka", 102);
INSERT INTO TABLE map_demo2 SELECT 4, map("mangaa", 103);


spark-sql> select distinct metrics from map_demo2;
{"mangaa":103}
{"chaka":102}
{"chaka":103}
Time taken: 15.331 seconds, Fetched 3 row(s)

Here the simple DISTINCT query works fine in Spark. Any thoughts on why the
DISTINCT/EXCEPT/INTERSECT operators are not supported on map data types?
The PR itself says:

// TODO: although map type is not orderable, technically map type should be
// able to be used in equality comparison; remove this type check once we
// support it.

I could not figure out what issue these operators actually cause.
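
To illustrate the tension in plain terms (a minimal Python sketch, not Spark internals): map equality must ignore entry order, but a comparison over the raw stored entry sequence does not, and hash/sort-based set operations group rows by exactly such a representation:

```python
# Hedged sketch: two storage orders of the same logical map.
a = [("x", 1), ("y", 2)]   # one entry order
b = [("y", 2), ("x", 1)]   # same map, entries stored in another order

# Map semantics: equal, regardless of entry order.
print(dict(a) == dict(b))    # True

# Raw entry sequence, as a grouping key would see it: not equal.
print(tuple(a) == tuple(b))  # False
```

A DISTINCT built on the second comparison silently treats equal maps as different rows, which is the kind of wrong result the check guards against.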





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Distinct on Map data type -- SPARK-19893

cloud0fan
Actually, Spark 2.1.0 doesn't work correctly for your case; it may give you wrong results.
We are still working on adding this feature, but until then we should fail early instead of returning wrong results.




Re: Distinct on Map data type -- SPARK-19893

ckhari4u
Hi Wenchen, could you please be more specific about the scenarios where it will give wrong results? I checked the DISTINCT and INTERSECT operators against many of my use cases and could not find a failure scenario that gives wrong results.

Thanks





Re: Distinct on Map data type -- SPARK-19893

cloud0fan
A very simple example:

sql("select create_map(1, 'a', 2, 'b')")
  .union(sql("select create_map(2, 'b', 1, 'a')"))
  .distinct

By definition a map should not care about the order of its entries, so the query above should return one record. However, it returned two records before SPARK-19893.
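
To mirror this in a runnable form, here is a minimal Python analogue (a hypothetical model of the behavior, not how Spark actually evaluates it) contrasting an entry-order-sensitive distinct with an order-insensitive one:

```python
# The two maps from the example above, in their respective construction orders.
map1 = [(1, "a"), (2, "b")]   # create_map(1, 'a', 2, 'b')
map2 = [(2, "b"), (1, "a")]   # create_map(2, 'b', 1, 'a') -- same map, reordered

union = [map1, map2]

# Entry-order-sensitive distinct: 2 rows -- the pre-SPARK-19893 wrong result.
buggy_distinct = {tuple(m) for m in union}

# Order-insensitive distinct: 1 row, matching map semantics.
fixed_distinct = {frozenset(m) for m in union}

print(len(buggy_distinct), len(fixed_distinct))  # 2 1
```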






Re: Distinct on Map data type -- SPARK-19893

ckhari4u
Wenchen, thanks a lot! I see the issue now.

Do we have any JIRAs open for the future work on this?





Re: Distinct on Map data type -- SPARK-19893

Tejas Patil
There is a JIRA for making map types orderable: https://issues.apache.org/jira/browse/SPARK-18134. Given that this is a non-trivial change, it will take time.
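
One conceivable approach (a hedged sketch, not necessarily what SPARK-18134 will implement) is to compare maps through a canonical form with entries sorted by key, which makes equal maps compare equal regardless of entry order and gives unequal maps a consistent total order:

```python
def map_sort_key(entries):
    """Canonical form of a map: its entries sorted by key."""
    return sorted(entries)

def compare_maps(m1, m2):
    """Comparator over canonical forms: -1, 0, or 1."""
    a, b = map_sort_key(m1), map_sort_key(m2)
    return (a > b) - (a < b)

m1 = [(2, "b"), (1, "a")]
m2 = [(1, "a"), (2, "b")]
m3 = [(1, "a"), (3, "c")]

print(compare_maps(m1, m2))  # 0  -- same map, different entry order
print(compare_maps(m2, m3))  # -1 -- a consistent order enables sort-based ops
```

With such an ordering in place, sort-based implementations of DISTINCT, EXCEPT, and INTERSECT could handle map columns correctly.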
