Why some queries use logical.stats while others analyzed.stats?

Why some queries use logical.stats while others analyzed.stats?

Jacek Laskowski
Hi,

I'm using Spark built from master today.

$ ./bin/spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
      /_/

Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_152
Branch master
Compiled by user jacek on 2018-01-04T05:44:05Z
Revision 7d045c5f00e2c7c67011830e2169a4e130c3ace8

Can anyone explain why some queries have stats available in the logical plan while others don't (so that I had to use the analyzed logical plan)?

I can trace the difference through the code, but I don't know why the difference exists.

spark.range(1000).write.parquet("/tmp/p1000")
// The stats are available in the logical plan (i.e. in the logical "phase")
scala> spark.read.parquet("/tmp/p1000").queryExecution.logical.stats
res21: org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(sizeInBytes=6.9 KB, hints=none)

// Here the logical plan fails, yet it worked fine above --> WHY?!
val names = Seq((1, "one"), (2, "two")).toDF("id", "name")
scala> names.queryExecution.logical.stats
java.lang.UnsupportedOperationException
  at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats(LogicalPlan.scala:232)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:55)
  at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:27)

// analyzed logical plan works fine
scala> names.queryExecution.analyzed.stats
res23: org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(sizeInBytes=48.0 B, hints=none)
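
For reference, one quick way to compare the two plans and see where they diverge (just printing the trees; output omitted here):

// Print the unanalyzed input plan and the plan after analysis; the
// former may still contain nodes the analyzer later resolves or removes.
scala> println(names.queryExecution.logical.treeString)
scala> println(names.queryExecution.analyzed.treeString)
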
Re: Why some queries use logical.stats while others analyzed.stats?

cloud0fan
First of all, I think you know that `QueryExecution` is a developer API, right? By definition, `QueryExecution.logical` is the input plan, which can even be unresolved. Developers should be aware of this and should not apply operations that need the plan to be resolved. Obviously `LogicalPlan.stats` needs the plan to be resolved.

For this particular case, we can make it work by defining `computeStats` in `AnalysisBarrier`. But it's also OK to just leave it as it is, as this doesn't break any real use cases.

Re: Why some queries use logical.stats while others analyzed.stats?

Jacek Laskowski
Thanks Wenchen. That makes a lot of sense now, after your point about `AnalysisBarrier`, which I've been seeing here and there but hadn't spent much time exploring yet; it turned out to be important.

Re: Why some queries use logical.stats while others analyzed.stats?

Jacek Laskowski
In reply to this post by cloud0fan
Hi Wenchen,

Just now I stumbled across this comment in `LeafNode.computeStats` [1]:

> Leaf nodes that can survive analysis must define their own statistics.

And this other one in the scaladoc of `AnalysisBarrier` [2]:

> This analysis barrier will be removed at the end of analysis stage.

That makes a lot of sense now, and it makes `QueryExecution.analyzed` crucial (since a plan may still contain `AnalysisBarrier`s). Thanks again!
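
For anyone following along, here's a simplified sketch of the relevant pieces on this snapshot (signatures trimmed to the essentials; see the source for the real definitions):

// LeafNode's default computeStats throws: leaf nodes that can survive
// analysis are expected to override it with their own statistics.
abstract class LeafNode extends LogicalPlan {
  def computeStats(): Statistics = throw new UnsupportedOperationException
}

// AnalysisBarrier is a LeafNode wrapping an already-analyzed child and
// defines no computeStats, hence the exception when .stats hits a barrier.
case class AnalysisBarrier(child: LogicalPlan) extends LeafNode {
  override def output: Seq[Attribute] = child.output
}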

Regarding your comment:

> For this particular case, we can make it work by defining `computeStats` in `AnalysisBarrier`. But it's also OK to just leave it as it is, as this doesn't break any real use cases.

Don't you think `AnalysisBarrier.computeStats` could simply dispatch to the child's stats? Even though "this doesn't break any real use cases", it would avoid questions like this one: there'd be less to worry about, and things would be more comprehensible. Mind if I proposed a PR with the change (and another one for the typo in the scaladoc, where it says: "The SQL Analyzer goes through a whole query plan even most part of it is analyzed." [3])?
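
Concretely, I'm thinking of something along these lines (an untested sketch; I'd use `child.stats` rather than `child.computeStats`, since `computeStats` is only defined on leaf nodes):

// proposed addition to AnalysisBarrier: forward statistics to the
// wrapped, already-analyzed child instead of throwing
override def computeStats(): Statistics = child.stats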

Re: Why some queries use logical.stats while others analyzed.stats?

cloud0fan
Yea sure, thanks for doing it!