dataFrame.na.fill() fails for column with dot

Amandeep Sharma
Hi guys,
Apologies for the long mail.

I am running the code snippet below:

import org.apache.spark.sql.SparkSession

object ColumnNameWithDot {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Simple Application")
      .config("spark.master", "local").getOrCreate()

    spark.sparkContext.setLogLevel("OFF")

    import spark.implicits._
    // One column name contains a dot; the third row has a null in that column.
    val df = Seq(("abc", 23), ("def", 44), (null, 9)).toDF("ColWith.Dot", "Col")
    // Backticks mark the dotted name as a single column rather than a nested field path.
    df.na.fill(Map("`ColWith.Dot`" -> "n/a")).show()
  }
}

and it fails with this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "ColWith.Dot" among (ColWith.Dot, Col);

I checked whether fixes had already been made for similar issues and found https://issues.apache.org/jira/browse/SPARK-19473, but none of them covers all cases.

I debugged the code; these are my observations:
  1. In org.apache.spark.sql.DataFrameNaFunctions.fillMap(values: Seq[(String, Any)]), the df.resolve(colName) call succeeds: because the column name is quoted with backticks, it resolves to the column.
  2. val projections = df.schema.fields.map {
        ...
        ...
    }.getOrElse(df.col(f.name))
    fails, because the resolved column name (f.name) is not quoted with backticks.
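
The two observations can be reproduced outside na.fill() with a small sketch. It assumes a local Spark session; the object name DottedColumnLookup and the helper canResolve are made up for illustration and are not part of Spark. Given the error above, the unquoted lookup is expected to fail and the backtick-quoted one to succeed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.Try

object DottedColumnLookup {
  // Hypothetical helper: true if df.col(name) resolves without throwing.
  def canResolve(df: DataFrame, name: String): Boolean =
    Try(df.col(name)).isSuccess

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("DottedColumnLookup")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("abc", 23), (null, 9)).toDF("ColWith.Dot", "Col")

    // Unquoted: the name is split on the dot into the parts "ColWith" and "Dot".
    println(s"unquoted resolves: ${canResolve(df, "ColWith.Dot")}")
    // Backtick-quoted: the whole string is kept as a single name part.
    println(s"quoted resolves: ${canResolve(df, "`ColWith.Dot`")}")
    // The schema stores the raw name without backticks, which is what f.name returns.
    println(df.schema.fieldNames.mkString(", "))

    spark.stop()
  }
}
```

The last line shows why the second step breaks: the schema hands back the raw, unquoted name, and passing that straight to df.col() goes through the unquoted path.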
The problem lies in org.apache.spark.sql.catalyst.expressions, in
resolve(nameParts: Seq[String], resolver: Resolver): Option[NamedExpression]

where the comment says that we try to resolve the name as a column:

// If none of attributes match database.table.column pattern or
// `table.column` pattern, we try to resolve it as a column.
val (candidates, nestedFields) = matches match {
    case (Seq(), _) =>
        val name = nameParts.head
        val attributes = collectMatches(name, direct.get(name.toLowerCase(Locale.ROOT)))
        (attributes, nameParts.tail)
    case _ => matches
}

This should be changed to:

// If none of attributes match database.table.column pattern or
// `table.column` pattern, we try to resolve it as a column.
val (candidates, nestedFields) = matches match {
    case (Seq(), _) =>
        val name = nameParts.mkString(".")
        val attributes = collectMatches(name, direct.get(name.toLowerCase(Locale.ROOT)))
        (attributes, Seq.empty)
    case _ => matches
}

The git diff is below:

-          val name = nameParts.head
+          val name = nameParts.mkString(".")
           val attributes = collectMatches(name, direct.get(name.toLowerCase(Locale.ROOT)))
-          (attributes, nameParts.tail)
+          (attributes, Seq.empty)
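
To make the effect of the two variants concrete, here is a toy model in plain Scala. It is only a sketch: Resolved, resolveWithHead and resolveWithJoin are hypothetical names, columns are modelled as a plain Set[String], and case-insensitive matching and the earlier qualifier-matching branches of the real resolve() are left out.

```scala
object ResolveSketch {
  // Toy stand-in for a resolved attribute: the matched column plus any leftover nested field path.
  final case class Resolved(column: String, nestedFields: Seq[String])

  // Simplified current behaviour: only the first name part is looked up as a column.
  def resolveWithHead(nameParts: Seq[String], columns: Set[String]): Option[Resolved] =
    nameParts.headOption.filter(columns.contains).map(Resolved(_, nameParts.tail))

  // Simplified proposed behaviour: all parts are joined and looked up as one column name.
  def resolveWithJoin(nameParts: Seq[String], columns: Set[String]): Option[Resolved] = {
    val name = nameParts.mkString(".")
    if (columns.contains(name)) Some(Resolved(name, Seq.empty)) else None
  }

  def main(args: Array[String]): Unit = {
    val columns = Set("ColWith.Dot", "Col", "person")
    // An unquoted "ColWith.Dot" arrives as two name parts.
    println(resolveWithHead(Seq("ColWith", "Dot"), columns))
    println(resolveWithJoin(Seq("ColWith", "Dot"), columns))
    // A struct column "person" with a nested field "name" also arrives as two parts.
    println(resolveWithHead(Seq("person", "name"), columns))
    println(resolveWithJoin(Seq("person", "name"), columns))
  }
}
```

In this model the joined lookup finds the dotted column where the head-only lookup does not, but it no longer resolves the nested path person.name. Whether the real patch has the same effect on nested fields would need to be checked against the full resolve() logic.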

I tested this change; with it, backticks are no longer needed for columns that have a dot in the name.
Can this change be merged?

Regards,
Amandeep

Re: dataFrame.na.fill() fails for column with dot

Amandeep Sharma
Adding a note to the previous email:
the test suite org.apache.spark.sql.catalyst.analysis.AnalysisSuite passed with the aforementioned changes.

Regards,
Amandeep Sharma


On Tue, Feb 9, 2021 at 3:07 PM Amandeep Sharma <[hidden email]> wrote:

Re: dataFrame.na.fill() fails for column with dot

Terry Kim
In reply to this post by Amandeep Sharma
Thanks Amandeep. This seems like a valid bug to me, as quoted columns are not handled properly for na.fill(). I think the better place to fix this is in DataFrameNaFunctions.scala, where "f.name" should be quoted.
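
As a rough illustration of what quoting "f.name" could mean, the sketch below wraps a raw column name in backticks before it is used in a lookup. The helper quoteName is hypothetical and is not the actual patch; it assumes that an embedded backtick is escaped by doubling it.

```scala
object QuoteSketch {
  // Hypothetical helper: wrap a raw column name in backticks, doubling any embedded backtick.
  def quoteName(name: String): String =
    "`" + name.replace("`", "``") + "`"

  def main(args: Array[String]): Unit = {
    // A dotted name becomes a single quoted identifier.
    println(quoteName("ColWith.Dot"))
    // Quoting a plain name is harmless.
    println(quoteName("Col"))
  }
}
```

With a helper along these lines, a lookup such as df.col(quoteName(f.name)) would treat the dotted name as one column rather than as a nested field path.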

Could you create a JIRA?

Thanks,
Terry

Re: dataFrame.na.fill() fails for column with dot

Terry Kim
You probably need to update f.name here as well, but we can discuss further when you create a JIRA/PR.

Thanks,
Terry

On Tue, Feb 9, 2021 at 9:53 AM Terry Kim <[hidden email]> wrote: