Design document - MLlib's statistical package for DataFrames

Tim Hunter
Hello all,

I have been looking at some of the missing items for complete feature
parity between spark.ml and spark.mllib. Here is a proposal for
porting mllib.stats, the descriptive statistics package:

https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit?usp=sharing

The umbrella ticket for this task is:
https://issues.apache.org/jira/browse/SPARK-4591

Please comment on the document. Also, if you want to work on one of
the algorithms, the design doc and the umbrella ticket have subtasks
that you can assign yourself to.

The cutoff deadline for Spark 2.2 is rapidly approaching, and it would
be great if we could claim parity for this release!

Cheers

Tim

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: Design document - MLlib's statistical package for DataFrames

bradc
Hi,

While it is also missing in spark.mllib, I'd suggest adding cardinality as part of the Simple descriptive statistics for both spark.ml and spark.mllib. This is useful even for double-precision floating-point data, to understand the "uniqueness" of the feature data.

Cheers,
Brad
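
To make the suggestion concrete, here is a minimal sketch, assuming "cardinality" means the number of distinct values observed in each feature column. The function name `column_cardinality` and the sample data are hypothetical; this is plain Python for illustration and is not an API of spark.ml or spark.mllib.

```python
from typing import List, Sequence


def column_cardinality(rows: Sequence[Sequence[float]]) -> List[int]:
    """Return the number of distinct values in each column of `rows`.

    Hypothetical helper for illustration only; not part of Spark.
    """
    if not rows:
        return []
    n_cols = len(rows[0])
    # One set per column collects the distinct values seen so far.
    distinct = [set() for _ in range(n_cols)]
    for row in rows:
        if len(row) != n_cols:
            raise ValueError("all rows must have the same number of columns")
        for j, value in enumerate(row):
            distinct[j].add(value)
    return [len(values) for values in distinct]


# Four rows of three double-precision features (made-up data).
features = [
    [1.0, 0.5, 3.0],
    [1.0, 0.7, 3.0],
    [2.0, 0.5, 3.0],
    [1.0, 0.9, 3.0],
]
cardinalities = column_cardinality(features)
print(cardinalities)  # [2, 3, 1]
```

Under this reading, a cardinality of 1 flags a constant feature, while a cardinality close to the number of rows suggests a continuous one.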

Re: Design document - MLlib's statistical package for DataFrames

Tim Hunter
Hi Brad,

This task is focused on moving the existing algorithms, so that we
are not held up by parity issues.

Do you have some paper suggestions for cardinality? I do not think
there is a feature request on JIRA either.

Tim

RE: Design document - MLlib's statistical package for DataFrames

Pritish Nawlakhe
Hi

Would anyone know how to unsubscribe from this list?

Thank you!!

Regards
Pritish
Nirvana International Inc.

Big Data, Hadoop, Oracle EBS and IT Solutions
VA - SWaM, MD - MBE Certified Company
[hidden email]
http://www.nirvana-international.com 
Twitter: @nirvanainternat

Re: Design document - MLlib's statistical package for DataFrames

Holden Karau
It's at the bottom of every message (although some mail clients hide it for some reason): send an email to [hidden email].
