Catalyst fails to eliminate common subexpression

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Catalyst fails to eliminate common subexpression

Sandeep Joshi

I was examining the filters passed down to the Data Source API and noticed that a common subexpression in the SQL Select Where clause was not eliminated.   Notice the "gender = M" is listed twice in the plan explain.   Is this expected ?  I haven't tried more complex examples with AND in the where clause

scala> val df = spark.read.format("json").load("./people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, gender: string ... 2 more fields]

scala> df.show
+---+------+-------+------+
|age|gender|   name|salary|
+---+------+-------+------+
| 80|     M|   Andy|    10|
| 40|     M|Michael|    30|
| 70|     F| Justin|    70|
+---+------+-------+------+


scala> df.registerTempTable("people")
warning: there was one deprecation warning; re-run with -deprecation for details
                                           
scala> val outdf = spark.sql("select * from people where gender = \"M\" or salary > 30 or gender = \"M\"")
outdf: org.apache.spark.sql.DataFrame = [age: bigint, gender: string ... 2 more fields]

In the output below, you get the "gender = M" clause twice.

scala> outdf.explain
== Physical Plan ==
*Filter (((gender#9 = M) || (salary#11L > 30)) || (gender#9 = M))
+- *FileScan json [age#8L,gender#9,name#10,salary#11L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/home/sandeep/utils/DB/spark/people.json], PartitionFilters: [], PushedFilters: [Or(Or(EqualTo(gender,M),GreaterThan(salary,30)),EqualTo(gender,M))], ReadSchema: struct<age:bigint,gender:string,name:string,salary:bigint>