# distributed computation of median

5 messages
Open this post in threaded view
|
Report Content as Inappropriate

## distributed computation of median

 Hello,   Is there any interest in an efficient distributed computation of the median algorithm? A google search pulls some stackoverflow discussion but it would be good to have one provided. I have an implementation (that could be improved) from the paper " Fast Computation of the Median by Successive Binning": https://github.com/4d55397500/medianbinning Thanks-
Open this post in threaded view
|
Report Content as Inappropriate

## Re: distributed computation of median

 Have you looked at t-digests? Calculating percentiles (including medians) is something that is inherently difficult/inefficient to do in a distributed system. T-digests provide a useful probabilistic structure to allow you to compute any percentile with a known (and tunable) margin of error. https://github.com/tdunning/t-digest
Open this post in threaded view
|
Report Content as Inappropriate

## Re: distributed computation of median

 In reply to this post by svjk24 The DataFrame API includes an approximate quartile implementation. If you ask for quantile 0.5, you will get approximate median. On Sun, Apr 16, 2017 at 9:24 PM svjk24 <[hidden email]> wrote: Hello,   Is there any interest in an efficient distributed computation of the median algorithm? A google search pulls some stackoverflow discussion but it would be good to have one provided. I have an implementation (that could be improved) from the paper " Fast Computation of the Median by Successive Binning": https://github.com/4d55397500/medianbinning Thanks-
Open this post in threaded view
|
Report Content as Inappropriate

## Re: distributed computation of median

 In reply to this post by Jason White Also q-tree is implemented in algebird, not hard to get it going in spark. That is another probabilistic data structure that is useful for this.On Apr 17, 2017 11:27, "Jason White" <[hidden email]> wrote:Have you looked at t-digests? Calculating percentiles (including medians) is something that is inherently difficult/inefficient to do in a distributed system. T-digests provide a useful probabilistic structure to allow you to compute any percentile with a known (and tunable) margin of error. https://github.com/tdunning/t-digest -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/distributed-computation-of-median-tp21356p21357.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe e-mail: [hidden email]