distributed computation of median

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

distributed computation of median

svjk24
Hello,
  Is there any interest in an efficient distributed computation of the median algorithm?
A google search pulls some stackoverflow discussion but it would be good to have one provided.

I have an implementation (that could be improved)
from the paper " Fast Computation of the Median by Successive Binning":

https://github.com/4d55397500/medianbinning

Thanks-




Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: distributed computation of median

Jason White
Have you looked at t-digests?

Calculating percentiles (including medians) is something that is inherently difficult/inefficient to do in a distributed system. T-digests provide a useful probabilistic structure to allow you to compute any percentile with a known (and tunable) margin of error.

https://github.com/tdunning/t-digest
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: distributed computation of median

rxin
In reply to this post by svjk24
The DataFrame API includes an approximate quartile implementation. If you ask for quantile 0.5, you will get approximate median. 


On Sun, Apr 16, 2017 at 9:24 PM svjk24 <[hidden email]> wrote:
Hello,
  Is there any interest in an efficient distributed computation of the median algorithm?
A google search pulls some stackoverflow discussion but it would be good to have one provided.

I have an implementation (that could be improved)
from the paper " Fast Computation of the Median by Successive Binning":

https://github.com/4d55397500/medianbinning

Thanks-




Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: distributed computation of median

Koert Kuipers
In reply to this post by Jason White
Also q-tree is implemented in algebird, not hard to get it going in spark. That is another probabilistic data structure that is useful for this.

On Apr 17, 2017 11:27, "Jason White" <[hidden email]> wrote:
Have you looked at t-digests?

Calculating percentiles (including medians) is something that is inherently
difficult/inefficient to do in a distributed system. T-digests provide a
useful probabilistic structure to allow you to compute any percentile with a
known (and tunable) margin of error.

https://github.com/tdunning/t-digest




--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/distributed-computation-of-median-tp21356p21357.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: distributed computation of median

pavan adukuri
In reply to this post by svjk24
Do you know of any python implementation for the same?

thanks
pavan
On 4/17/17, 9:54 AM, svjk24 wrote:
Hello,
  Is there any interest in an efficient distributed computation of the median algorithm?
A google search pulls some stackoverflow discussion but it would be good to have one provided.

I have an implementation (that could be improved)
from the paper " Fast Computation of the Median by Successive Binning":

https://github.com/4d55397500/medianbinning

Thanks-





Loading...