approx_percentile computation

Rishi
I need to compute quantiles on a numeric field in Spark after a group-by operation. Is there a way to apply approxPercentile to an aggregated list instead of a column?

E.g., the DataFrame looks like:

k1 | k2 | k3 | v1
a1 | b1 | c1 | 879
a2 | b2 | c2 | 769
a1 | b1 | c1 | 129
a2 | b2 | c2 | 323

I need to first run groupBy(k1, k2, k3) with collect_list(v1), and then compute quantiles [10th, 50th...] on each list of v1 values.

Re: approx_percentile computation

Liang-Chi Hsieh

Hi,

You don't need to run approxPercentile against a list. Since it is an aggregation function, you can simply run:

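// Just to illustrate the idea. This uses Spark's internal Catalyst classes.
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
import org.apache.spark.sql.functions.{col, collect_list}

// `percentage` is assumed to be defined already, e.g. a Double in [0.0, 1.0].
val approxPercentile = new ApproximatePercentile(col("v1").expr, Literal(percentage))
val agg_approx_percentile = new Column(approxPercentile.toAggregateExpression())

df.groupBy("k1", "k2", "k3").agg(collect_list(col("v1")), agg_approx_percentile)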
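
For readers who would rather not depend on Catalyst internals, the same per-group aggregation can likely be written with the SQL function percentile_approx, called through expr. What follows is only a minimal sketch, assuming Spark 2.1 or later and a local SparkSession. The object name, the helper quantilesPerGroup, and the output column names v1_list and v1_quantiles are made up for illustration. The sample rows are the ones from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, expr}

// Hypothetical helper object; the names here are illustrative, not part of Spark.
object ApproxPercentileSketch {

  // Groups by the three key columns and computes, per group, both the
  // collected list of v1 values and the approximate 10th and 50th percentiles.
  def quantilesPerGroup(df: DataFrame): DataFrame =
    df.groupBy("k1", "k2", "k3").agg(
      collect_list(col("v1")).as("v1_list"),
      // percentile_approx is an aggregate, so it runs on the column directly;
      // passing an array of percentages yields an array of results.
      expr("percentile_approx(v1, array(0.1, 0.5))").as("v1_quantiles")
    )

  def main(args: Array[String]): Unit = {
    // A local session is assumed here purely for the sake of the example.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("approx-percentile-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample rows taken from the question.
    val df = Seq(
      ("a1", "b1", "c1", 879),
      ("a2", "b2", "c2", 769),
      ("a1", "b1", "c1", 129),
      ("a2", "b2", "c2", 323)
    ).toDF("k1", "k2", "k3", "v1")

    quantilesPerGroup(df).show(false)
    spark.stop()
  }
}
```

As in the snippet above, the percentile is computed by the aggregation itself, so collect_list is only needed if the raw values are wanted alongside the quantiles.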


Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/