Accessing DataFrame inside UserDefinedFunction.

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Accessing DataFrame inside UserDefinedFunction.

knowsnothing
Hi Everyone,

I've been told, that accessing DataFrame object from UserDefinedFunction is
not possible, but turns out that the code shown below works fine (taken from
StackOverflow). Why is it so? Is it a bug, or is it expected?

Thanks in advance.

case class Target(wordListOne: Seq[String], WordListTwo: Seq[String])
val targetData = Seq(Target(Seq("Spark", "Wrong", "Something"), Seq("Java",
"Grape", "Banana")),
                     Target(Seq("Java", "Scala"), Seq("Scala", "Banana")),
                     Target(Seq(""), Seq("Grape", "Banana")),
                     Target(Seq(""), Seq("")))
val targets = spark.createDataset(targetData)

case class WordSimilarity(first: String, second: String, similarity: Double)
val similarityData = Seq(WordSimilarity("Spark", "Java", 0.8),
                     WordSimilarity("Scala", "Spark", 0.9),
                     WordSimilarity("Java", "Scala", 0.9),
                     WordSimilarity("Apple", "Grape", 0.66),
                     WordSimilarity("Scala", "Apple", -0.1),
                     WordSimilarity("Gine", "Spark", 0.1))
val dict = spark.createDataset(similarityData)

val countPositiveSimilarity = udf[Long, Seq[String], Seq[String]]((a, b) =>
    dict.filter(
        (($"first".isin(a: _*) && $"second".isin(b: _*)) ||
($"first".isin(b: _*) && $"second".isin(a: _*))) && $"similarity" > 0.7
    ).count
)

val countDF = targets.withColumn("positive_count",
countPositiveSimilarity($"wordListOne", $"wordListTwo"))




--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Accessing DataFrame inside UserDefinedFunction.

Anurag Verma

This is expected. You are not accessing the DataSet Dict when calling UDF countPositiveSimilarity. The dict dataframe as it existed when udf was created is encoded into udf. If you change  dict later on the changes will not get automatically picked up in UDF countPositiveSimilarity.

 

Sent from Mail for Windows 10

 

From: [hidden email]
Sent: Sunday, November 5, 2017 8:12 AM
To: [hidden email]
Subject: Accessing DataFrame inside UserDefinedFunction.

 

Hi Everyone,

 

I've been told, that accessing DataFrame object from UserDefinedFunction is

not possible, but turns out that the code shown below works fine (taken from

StackOverflow). Why is it so? Is it a bug, or is it expected?

 

Thanks in advance.

 

case class Target(wordListOne: Seq[String], WordListTwo: Seq[String])

val targetData = Seq(Target(Seq("Spark", "Wrong", "Something"), Seq("Java",

"Grape", "Banana")),

                     Target(Seq("Java", "Scala"), Seq("Scala", "Banana")),

                     Target(Seq(""), Seq("Grape", "Banana")),

                     Target(Seq(""), Seq("")))

val targets = spark.createDataset(targetData)

 

case class WordSimilarity(first: String, second: String, similarity: Double)

val similarityData = Seq(WordSimilarity("Spark", "Java", 0.8),

                     WordSimilarity("Scala", "Spark", 0.9),

                     WordSimilarity("Java", "Scala", 0.9),

                     WordSimilarity("Apple", "Grape", 0.66),

                     WordSimilarity("Scala", "Apple", -0.1),

                     WordSimilarity("Gine", "Spark", 0.1))

val dict = spark.createDataset(similarityData)

 

val countPositiveSimilarity = udf[Long, Seq[String], Seq[String]]((a, b) =>

    dict.filter(

        (($"first".isin(a: _*) && $"second".isin(b: _*)) ||

($"first".isin(b: _*) && $"second".isin(a: _*))) && $"similarity" > 0.7

    ).count

)

 

val countDF = targets.withColumn("positive_count",

countPositiveSimilarity($"wordListOne", $"wordListTwo"))

 

 

 

 

--

Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

 

---------------------------------------------------------------------

To unsubscribe e-mail: [hidden email]

 

 

Reply | Threaded
Open this post in threaded view
|

RE: Accessing DataFrame inside UserDefinedFunction.

knowsnothing
Thank you for your response Anurag.

I am not sure if I get your point. Are you suggesting that UDF somehow
serializes not only reference to  Dataset, but also all the data?





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]