[MLlib] PCA Aggregator

8 messages

[MLlib] PCA Aggregator

mttsndrs
I built an Aggregator that computes PCA on grouped datasets. I wanted to use the PCA functions provided by MLlib, but they only work on a full dataset, and I needed to do it on a grouped dataset (like a RelationalGroupedDataset).

So I built a little Aggregator that can do that; here’s an example of how it’s called:

    val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn

    // For each grouping, compute a PCA matrix/vector
    val pcaModels = inputData
      .groupBy(keys:_*)
      .agg(pcaAggregation.as(pcaOutput))

I used the same algorithms under the hood as RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works directly on Datasets without converting to an RDD first.
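
Roughly, such an Aggregator can accumulate the running sums needed for a covariance matrix (the row count, per-feature sums, and the Gramian) and do the eigendecomposition in finish(). Here is a simplified sketch of that shape -- not the exact code; the names, the buffer layout, the explicit k parameter, and the use of Breeze's eigSym are just for illustration:

    import breeze.linalg.{eigSym, sum, DenseMatrix => BDM, DenseVector => BDV}
    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.{Encoder, Encoders, Row}
    import org.apache.spark.sql.expressions.Aggregator

    // Running totals needed for a covariance matrix: the row count, the
    // per-feature sums, and the Gramian sum(x * x^T) stored flattened.
    case class PCABuffer(n: Long, sums: Array[Double], gram: Array[Double])

    // Top-k principal components (one array per component) plus explained variance.
    case class PCAResult(components: Array[Array[Double]], explainedVariance: Array[Double])

    class PCAAggregator(vectorCol: String, k: Int)
        extends Aggregator[Row, PCABuffer, PCAResult] {

      def zero: PCABuffer = PCABuffer(0L, Array.empty, Array.empty)

      def reduce(b: PCABuffer, row: Row): PCABuffer = {
        val v = row.getAs[Vector](vectorCol).toArray
        val d = v.length
        // Cloning keeps the sketch simple; a real implementation would update in place.
        val sums = if (b.n == 0L) new Array[Double](d) else b.sums.clone()
        val gram = if (b.n == 0L) new Array[Double](d * d) else b.gram.clone()
        var i = 0
        while (i < d) {
          sums(i) += v(i)
          var j = 0
          while (j < d) { gram(i * d + j) += v(i) * v(j); j += 1 }
          i += 1
        }
        PCABuffer(b.n + 1, sums, gram)
      }

      def merge(a: PCABuffer, b: PCABuffer): PCABuffer =
        if (a.n == 0L) b
        else if (b.n == 0L) a
        else PCABuffer(a.n + b.n,
          a.sums.zip(b.sums).map { case (x, y) => x + y },
          a.gram.zip(b.gram).map { case (x, y) => x + y })

      // Sample covariance from the running totals, then an eigendecomposition.
      // eigSym returns eigenvalues in ascending order, so the top-k are at the end.
      def finish(b: PCABuffer): PCAResult = {
        val d = b.sums.length
        val n = b.n.toDouble
        val mean = BDV(b.sums) / n
        val cov = (new BDM(d, d, b.gram) - (mean * mean.t) * n) / (n - 1.0)
        val es = eigSym(cov)
        val total = sum(es.eigenvalues)
        val top = (1 to k).map(i => d - i)
        PCAResult(
          top.map(c => es.eigenvectors(::, c).toArray).toArray,
          top.map(c => es.eigenvalues(c) / total).toArray)
      }

      def bufferEncoder: Encoder[PCABuffer] = Encoders.product[PCABuffer]
      def outputEncoder: Encoder[PCAResult] = Encoders.product[PCAResult]
    }

As in RowMatrix, the components come from the sample covariance of the group, and the explained variance is each eigenvalue's share of the eigenvalue total.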

I’ve seen others who wanted this ability (for example on Stack Overflow) so I’d like to contribute it if it would be a benefit to the larger community.

So, is this something worth contributing to MLlib? And if so, what are the next steps to start the process?

thanks!

Re: [MLlib] PCA Aggregator

Erik Erlandson-2
Hi Matt!

There are a couple of ways to do this. If you want to submit it for inclusion in Spark, you should start by filing a JIRA for it and then opening a pull request. Another possibility is to publish it as your own 3rd-party library, which I have done for aggregators before.


Re: [MLlib] PCA Aggregator

Stephen Boesch
Erik - is there a current venue for approved/recommended third-party additions? The spark-packages site seems to have been stale for years.

Re: [MLlib] PCA Aggregator

Erik Erlandson-2

For 3rd-party libs, I have been publishing independently, for example isarn-sketches-spark or silex.
Either of these repos provides some good working examples of publishing a Spark UDAF or ML library for the JVM and PySpark.
(If anyone is interested in contributing new components to either of these, feel free to reach out.)
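
As a rough starting point for that route, a minimal build.sbt for a standalone aggregator library could look something like the sketch below; the name, organization, and versions are placeholders, and Spark is marked Provided so the cluster's own Spark jars are used at runtime:

    // Minimal build.sbt sketch for publishing a small Spark aggregator library.
    // All coordinates and versions here are placeholders.
    name         := "pca-aggregator"
    organization := "com.example"
    version      := "0.1.0"

    scalaVersion       := "2.11.12"
    crossScalaVersions := Seq("2.11.12", "2.12.7")

    // Compile against Spark without bundling it; the cluster provides its own jars.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"   % "2.4.0" % Provided,
      "org.apache.spark" %% "spark-mllib" % "2.4.0" % Provided
    )

From there, publishing to Maven Central (or any resolver you like) works the same as for any other sbt project.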

For people new to Spark library dev, Will Benton and I recently gave a talk at SAI-EU on publishing Spark libraries.
Cheers,
Erik

Re: [MLlib] PCA Aggregator

mttsndrs
In reply to this post by Erik Erlandson-2
Thanks, Erik. I went ahead and created SPARK-25782 for this improvement, since it is a feature I and others have looked for in MLlib but which doesn't seem to exist yet. Also, while searching for PCA-related issues in JIRA I noticed that someone added grouping support for PCA to the MADlib project a while back (see MADLIB-947), so there does seem to be demand for it.

thanks!
--Matt


Re: [MLlib] PCA Aggregator

Sean Owen-2
It's OK to open a JIRA, though I generally doubt any new functionality will be added. This might be viewed as a small, worthwhile enhancement; I haven't looked at it. It's always more compelling if you can sketch the use case for it and why it is more meaningful in Spark than outside it.

There is spark-packages for recording third-party packages, but it is not required, nor is it necessarily a comprehensive list. If you develop a third-party library, you can just self-publish it like any Git or Maven project.

Re: [MLlib] PCA Aggregator

mttsndrs
Hi Sean, thanks for your feedback. I saw this as a missing feature in the existing PCA implementation in MLlib. I suspect the use case is a common one: you have data from different entities (could be different users, different locations, or different products, for example) and you need to model them separately since they behave differently--perhaps their features run in different ranges, or perhaps they have completely different features. 

For example, suppose you were modeling the weather in different parts of the world for a given time period, and the features were things like temperature, humidity, wind speed, and pressure. With the current PCA/RowMatrix options, you can only calculate PCA on the entire dataset, when you really want to model the weather in New York separately from the weather in Buenos Aires. Today your options are to collect the data from each city and calculate PCA using some other library like Breeze, or to use the PCA implementation from MLlib on only one key at a time.
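
To make the contrast concrete, the one-key-at-a-time route today looks roughly like the sketch below (the weather DataFrame and its city/features column names are just illustrative):

    import org.apache.spark.ml.feature.PCA
    import org.apache.spark.sql.functions.col

    // Workaround today: fit MLlib's PCA separately for each key.
    // `weather` has a "city" string column and a "features" vector column.
    val cities = weather.select("city").distinct().collect().map(_.getString(0))

    val modelsByCity = cities.map { city =>
      val pca = new PCA()
        .setInputCol("features")
        .setOutputCol("pcaFeatures")
        .setK(3)
      city -> pca.fit(weather.filter(col("city") === city))
    }.toMap

That's one fit() job per city, whereas a grouping Aggregator does the whole thing in a single pass over the grouped Dataset.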

The reason I thought it would be useful in Spark is that it makes the PCA offering in MLlib useful to more people. As it stands today, I wasn't able to use it for much, and I suspect others have had the same experience (there are similar questions on Stack Overflow, for example).

This isn't really big enough to warrant its own library--it's just a single class. But if you think it's better to publish it externally, I can certainly do that.

thanks again,
--Matt


Re: [MLlib] PCA Aggregator

Sean Owen-2
I think this is great info and context to put in the JIRA. 
