
MLlib mission and goals


MLlib mission and goals

Joseph Bradley
This thread is split off from the "Feedback on MLlib roadmap process proposal" thread for discussing the high-level mission and goals for MLlib.  I hope this thread will collect feedback and ideas, not necessarily lead to huge decisions.

Copying from the previous thread:

Seth:
"""
I would love to hear some discussion on the higher level goal of Spark MLlib (if this derails the original discussion, please let me know and we can discuss in another thread). The roadmap does contain specific items that help to convey some of this (ML parity with MLlib, model persistence, etc.), but I'm interested in what the "mission" of Spark MLlib is. We often see PRs for brand new algorithms which are sometimes rejected and sometimes not. Do we aim to keep implementing more and more algorithms? Or is our focus really, now that we have a reasonable library of algorithms, to simply make the existing ones faster/better/more robust? Should we aim to make interfaces that are easily extended for developers to easily implement their own custom code (e.g. custom optimization libraries), or do we want to restrict things to out-of-the-box algorithms? Should we focus on more flexible, general abstractions like distributed linear algebra?

I was not involved in the project in the early days of MLlib when this discussion may have happened, but I think it would be useful to either revisit it or restate it here for some of the newer developers.
"""

Mingjie:
"""
+1 general abstractions like distributed linear algebra.
"""


I'll add my thoughts, starting with our past trajectory:
* Initially, MLlib was mainly trying to build a set of core algorithms.
* Two years ago, the big effort was adding Pipelines.
* In the last year, big efforts have been around completing Pipelines and making the library more robust.

I agree with Seth that a few immediate goals are very clear:
* feature parity for DataFrame-based API
* completing and improving testing for model persistence
* Python, R parity

In the future, it's harder to say, but if I had to pick my top 2 items, I'd list:

(1) Making MLlib more extensible
It will not be feasible to support a huge number of algorithms, so allowing users to customize their ML-on-Spark workflows will be critical.  This is IMO the most important thing we could do for MLlib.
Part of this could be building a healthy community of Spark Packages, and we will need to make it easier for users to write their own algorithms and packages to facilitate this.  Part of this could be allowing users to customize existing algorithms with custom loss functions, etc.
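To make the "custom loss functions" idea concrete, here is a framework-agnostic toy sketch in plain Python — not Spark's actual API, and `train_linear` / `squared_loss_grad` are hypothetical names — showing the shape of a pluggable-loss interface: the training loop depends only on a user-supplied loss gradient, so swapping in a custom objective requires no change to the trainer itself.

```python
from typing import Callable, Sequence

def squared_loss_grad(pred: float, label: float) -> float:
    """Gradient of 0.5 * (pred - label)^2 with respect to pred."""
    return pred - label

def train_linear(xs: Sequence[float], ys: Sequence[float],
                 loss_grad: Callable[[float, float], float],
                 lr: float = 0.1, iters: int = 200) -> float:
    """Fit y ~ w * x by gradient descent using a user-supplied loss gradient.

    The trainer never needs to know which loss is in use; a user could pass
    a Huber or quantile gradient here without touching this function.
    """
    w = 0.0
    for _ in range(iters):
        # Average gradient of the loss over the dataset, via the chain rule.
        g = sum(loss_grad(w * x, y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * g
    return w

# Toy data generated by y = 2x; the fitted weight should approach 2.0.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w = train_linear(xs, ys, squared_loss_grad)
```

A real MLlib extension point would of course go through the Pipeline APIs rather than a bare function, but the separation of trainer from objective is the same idea.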

(2) Consistent improvements to core algorithms
A less exciting but still very important item will be constantly improving the core set of algorithms in MLlib. This could mean speed, scaling, robustness, and usability for the few algorithms which cover 90% of use cases.

There are plenty of other possibilities, and it will be great to hear the community's thoughts!

Thanks,
Joseph

--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com


Re: MLlib mission and goals

Stephen Boesch
Along the lines of #1: spark packages seemed to have had a good start about two years ago, but now no more than a handful are in general use - e.g. databricks CSV.
When the available packages are browsed, the majority are incomplete, empty, unmaintained, or unclear.

Any ideas on how to resurrect spark packages in a way that there will be sufficient adoption for it to be meaningful?

2017-01-23 17:03 GMT-08:00 Joseph Bradley <[hidden email]>:



Re: MLlib mission and goals

Sean Owen
In reply to this post by Joseph Bradley
My $0.02, which shouldn't be weighted too much.

I believe the mission of Spark ML has been to provide the framework, and then implementations of 'the basics' only. It should have the tools that cover ~80% of use cases, out of the box, in a pretty well-supported and tested way.

It's not a goal to support an arbitrarily large collection of algorithms, because each one adds marginally less value and, IMHO, carries proportionally more baggage: the contributors tend to skew academic, produce worse code, and don't stick around to maintain it.

The project is already generally quite overloaded; I don't know if there's bandwidth to even cover the current scope. While 'the basics' is a subjective label, de facto, I think we'd have to define it as essentially "what we already have in place" for the foreseeable future.

That the bits on spark-packages.org aren't so hot is not a problem but a symptom. Would these really be better in the core project?

And, or: I entirely agree with Joseph's take.

On Tue, Jan 24, 2017 at 1:03 AM Joseph Bradley <[hidden email]> wrote:


Re: MLlib mission and goals

Jörn Franke
I also agree with Joseph and Sean.
With respect to spark-packages: I think the issue is that you have to add a package manually, even though it is basically fetched from Maven Central (or a custom upload).

From an organizational perspective there are other issues. E.g. you have to download packages from the internet instead of using an artifact repository within the enterprise. You do not want users to download arbitrary packages from the Internet into a production cluster. You also want to make sure that they do not use outdated or snapshot versions, and that you have control over dependencies, licenses, etc.

Currently I do not see the big artifact repository managers supporting Spark packages anytime soon, and I do not see it from the big Hadoop distributions either.


On 24 Jan 2017, at 11:37, Sean Owen <[hidden email]> wrote:



Re: MLlib mission and goals

Stephen Boesch
In reply to this post by Sean Owen
re: spark-packages.org and "Would these really be better in the core project?"   That was not at all the intent of my input; rather, I was asking how and where to structure/place deployment-quality code that is *not* part of the distribution.   Spark Packages has no curation whatsoever: no minimum standards of code quality or deployment structure, let alone qualitative measures of usefulness.

While Spark Packages would never rival CRAN and friends, there is not even any mechanism in place to get started.  From the CRAN site:

   Even at the current growth rate of several packages a day, all submissions are still rigorously quality-controlled using strong testing features available in the R system.

Maybe give something with a subset of these processes a try? Perhaps with different folks than the ones already over-subscribed in MLlib?

2017-01-24 2:37 GMT-08:00 Sean Owen <[hidden email]>:



Re: MLlib mission and goals

Miao Wang
I have been working on ML/MLlib/R since last year. Here are some of my thoughts from a beginner's perspective:
 
Current ML/MLlib core algorithms can serve as good implementation examples, which makes adding new algorithms easier. Even a beginner like me can pick it up quickly and learn how to add new algorithms. So adding new algorithms should not be a barrier for developers who really need specific algorithms, but it should not be the first priority in ML/MLlib's long-term goals. We should only add highly demanded algorithms, and I hope there will be detailed JIRA/email discussions to decide whether we want to accept a new algorithm.
 
I strongly agree that we should improve ML/MLlib usability, stability, and performance in the core algorithms and in foundations such as the linear algebra library. This will keep Spark ML/MLlib competitive among machine learning frameworks. For example, Microsoft just open-sourced a fast, distributed, high-performance gradient boosting (GBDT, GBRT, GBM, or MART) framework based on decision tree algorithms. Its performance and accuracy are reported to be much better than XGBoost's. We need to follow up and improve Spark's GBT algorithms in the near future.
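For readers new to the technique being discussed, here is a toy, stdlib-only Python illustration of the GBDT idea: successive depth-1 "stumps" are fit to the residuals of the running model under squared loss. This is not the Spark GBT or LightGBM API; all names (`fit_stump`, `boost`) are hypothetical, and real frameworks add gradients of arbitrary losses, regularization, and histogram-based split finding.

```python
def fit_stump(xs, residuals):
    """Pick the 1-D split minimizing squared error of a piecewise-constant fit."""
    best = None
    for s in xs:
        left = [r for x, r in zip(xs, residuals) if x <= s]
        right = [r for x, r in zip(xs, residuals) if x > s]
        lm = sum(left) / len(left) if left else 0.0
        rm = sum(right) / len(right) if right else 0.0
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, s, lm, rm)
    _, s, lm, rm = best
    return lambda x: lm if x <= s else rm

def boost(xs, ys, rounds=20, lr=0.5):
    """Gradient boosting with stumps: each round fits the current residuals."""
    stumps, preds = [], [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * st(x) for st in stumps)

# A step-shaped target; the ensemble should drive training error near zero.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 1.0, 1.0, 3.0, 3.0]
model = boost(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```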
 
Another related area is SparkR. API parity between SparkR and ML/MLlib is important, and we should also pay attention to R users' habits and expectations when maintaining it.
 
Miao    
 
----- Original message -----
From: Stephen Boesch <[hidden email]>
To: Sean Owen <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Subject: Re: MLlib mission and goals
Date: Tue, Jan 24, 2017 4:42 AM
 


Re: MLlib mission and goals

Asher Krim
On the topic of usability, I think more effort should be put into large-scale testing. We've encountered issues with building large models that are not apparent in small models, and these issues have made productizing ML/MLlib much more difficult than we first anticipated. Considering that one of the biggest selling points for Spark is ease of scaling to large datasets, I think fleshing out SPARK-15573 and testing large models should be a priority.

On Tue, Jan 24, 2017 at 2:23 PM, Miao Wang <[hidden email]> wrote:
"""
 
 
I'll add my thoughts, starting with our past trajectory:
* Initially, MLlib was mainly trying to build a set of core algorithms.
* Two years ago, the big effort was adding Pipelines.
* In the last year, big efforts have been around completing Pipelines and making the library more robust.
 
I agree with Seth that a few immediate goals are very clear:
* feature parity for DataFrame-based API
* completing and improving testing for model persistence
* Python, R parity
 
In the future, it's harder to say, but if I had to pick my top 2 items, I'd list:
 
(1) Making MLlib more extensible
It will not be feasible to support a huge number of algorithms, so allowing users to customize their ML on Spark workflows will be critical.  This is IMO the most important thing we could do for MLlib.
Part of this could be building a healthy community of Spark Packages, and we will need to make it easier for users to write their own algorithms and packages to facilitate this.  Part of this could be allowing users to customize existing algorithms with custom loss functions, etc.
 
(2) Consistent improvements to core algorithms
A less exciting but still very important item will be constantly improving the core set of algorithms in MLlib. This could mean speed, scaling, robustness, and usability for the few algorithms which cover 90% of use cases.
 
There are plenty of other possibilities, and it will be great to hear the community's thoughts!
 
Thanks,
Joseph
 
 
 
--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com

 

--------------------------------------------------------------------- To unsubscribe e-mail: [hidden email]



--
Asher Krim
Senior Software Engineer
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: MLlib mission and goals

Saikat Kanjilal

In reading through this and thinking about usability: is there any interest in building a performance-measurement framework around some (or maybe all) of the MLlib algorithms? I envision this as something that can be run for each release build for our end users, and it may also be useful for internal ML devs to see what impact each change to their code has on performance. Please pardon me if this already exists; I am new to the codebase and to contributing to Spark.
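As a rough sketch of what such a per-release check could look like (all names here are hypothetical; this is not an existing Spark tool): record per-algorithm training times for each build, then flag regressions against a stored baseline.

```python
# Hypothetical sketch: compare per-algorithm benchmark timings against a
# stored baseline and flag regressions beyond a tolerance. Illustrative
# only -- not an existing Spark tool.

def flag_regressions(baseline, current, tolerance=0.10):
    """Return names of algorithms whose current time exceeds the baseline
    by more than `tolerance` (fractional), sorted for stable output."""
    flagged = []
    for name, base_secs in baseline.items():
        cur_secs = current.get(name)
        if cur_secs is not None and cur_secs > base_secs * (1.0 + tolerance):
            flagged.append(name)
    return sorted(flagged)

# Example: training times (seconds) from two release builds.
baseline = {"als_fit": 120.0, "logreg_fit": 45.0, "kmeans_fit": 60.0}
current  = {"als_fit": 128.0, "logreg_fit": 58.0, "kmeans_fit": 61.0}
print(flag_regressions(baseline, current))  # ['logreg_fit']
```

The interesting part is less the comparison than the plumbing around it: a CI job would have to persist baselines per cluster configuration and dataset size, since timings are only comparable on fixed hardware.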




From: Asher Krim <[hidden email]>
Sent: Tuesday, January 24, 2017 12:17 PM
To: Miao Wang
Cc: [hidden email]; [hidden email]; Sean Owen
Subject: Re: MLlib mission and goals
 
On the topic of usability, I think more effort should be put into large-scale testing. We've encountered issues when building large models that are not apparent in small models, and these issues have made productizing ML/MLlib much more difficult than we first anticipated. Considering that one of Spark's biggest selling points is ease of scaling to large datasets, I think fleshing out SPARK-15573 and testing large models should be a priority.

On Tue, Jan 24, 2017 at 2:23 PM, Miao Wang <[hidden email]> wrote:
I started working on ML/MLlib/R last year. Here are some of my thoughts from a beginner's perspective:
 
The current ML/MLlib core algorithms can serve as good implementation examples, which makes adding new algorithms easier. Even a beginner like me can pick them up quickly and learn how to add new algorithms. So adding new algorithms is not a barrier for developers who really need specific ones, and it should not be the first priority in ML/MLlib's long-term goals. We should only add highly demanded algorithms, and I hope there will be detailed JIRA/email discussions to decide whether we want to accept a new algorithm.
 
I strongly agree that we should improve ML/MLlib usability, stability, and performance in the core algorithms and in foundations such as the linear algebra library. This will keep Spark ML/MLlib competitive among machine learning frameworks. For example, Microsoft just open-sourced a fast, distributed, high-performance gradient-boosting (GBDT, GBRT, GBM, or MART) framework based on decision-tree algorithms, whose performance and accuracy are much better than XGBoost's. We need to follow up and improve Spark's GBT algorithms in the near future.
 
Another related area is SparkR. API parity between SparkR and ML/MLlib is important, and we should also pay attention to R users' habits and expectations when maintaining that parity.
 
Miao    
 
----- Original message -----
From: Stephen Boesch <[hidden email]>
To: Sean Owen <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Subject: Re: MLlib mission and goals
Date: Tue, Jan 24, 2017 4:42 AM
 
re: spark-packages.org and "Would these really be better in the core project?" That was not at all the intent of my input; instead it was to ask how and where to structure and place deployment-quality code that is *not* part of the distribution. Spark Packages has no curation whatsoever: no minimum standards of code quality or deployment structure, let alone qualitative measures of usefulness.
 
While Spark Packages would never rival CRAN and friends, there is not even any mechanism in place to get started. From the CRAN site:
 
   Even at the current growth rate of several packages a day, all submissions are still rigorously quality-controlled using strong testing features available in the R system.
 
Maybe give something with a subset of these processes a try, run by different folks than the ones already over-subscribed in MLlib?
 
2017-01-24 2:37 GMT-08:00 Sean Owen <[hidden email]>:
My $0.02, which shouldn't be weighted too much.
 
I believe the mission, as of Spark ML, has been to provide the framework, and then implementations of 'the basics' only. It should have the tools that cover ~80% of use cases, out of the box, in a pretty well-supported and tested way.
 
It's not a goal to support an arbitrarily large collection of algorithms, because each additional one adds marginally less value and, IMHO, proportionally more baggage: the contributors tend to skew academic, produce worse code, and don't stick around to maintain it.
 
The project is already generally quite overloaded; I don't know if there's bandwidth to even cover the current scope. While 'the basics' is a subjective label, de facto, I think we'd have to define it as essentially "what we already have in place" for the foreseeable future.
 
That the bits on spark-packages.org aren't so hot is not a problem but a symptom. Would these really be better in the core project?
 
And, or: I entirely agree with Joseph's take.
 
On Tue, Jan 24, 2017 at 1:03 AM Joseph Bradley <[hidden email]> wrote:
This thread is split off from the "Feedback on MLlib roadmap process proposal" thread for discussing the high-level mission and goals for MLlib.  I hope this thread will collect feedback and ideas, not necessarily lead to huge decisions.
 
Copying from the previous thread:
 
Seth:
"""
I would love to hear some discussion on the higher level goal of Spark MLlib (if this derails the original discussion, please let me know and we can discuss in another thread). The roadmap does contain specific items that help to convey some of this (ML parity with MLlib, model persistence, etc...), but I'm interested in what the "mission" of Spark MLlib is. We often see PRs for brand new algorithms which are sometimes rejected and sometimes not. Do we aim to keep implementing more and more algorithms? Or is our focus really, now that we have a reasonable library of algorithms, to simply make the existing ones faster/better/more robust? Should we aim to make interfaces that are easily extended for developers to easily implement their own custom code (e.g. custom optimization libraries), or do we want to restrict things to out-of-the box algorithms? Should we focus on more flexible, general abstractions like distributed linear algebra?
 
I was not involved in the project in the early days of MLlib when this discussion may have happened, but I think it would be useful to either revisit it or restate it here for some of the newer developers.
"""
 
Mingjie:
"""
+1 general abstractions like distributed linear algebra.
"""
 
 
I'll add my thoughts, starting with our past trajectory:
* Initially, MLlib was mainly trying to build a set of core algorithms.
* Two years ago, the big effort was adding Pipelines.
* In the last year, big efforts have been around completing Pipelines and making the library more robust.
 
I agree with Seth that a few immediate goals are very clear:
* feature parity for DataFrame-based API
* completing and improving testing for model persistence
* Python, R parity
 
In the future, it's harder to say, but if I had to pick my top 2 items, I'd list:
 
(1) Making MLlib more extensible
It will not be feasible to support a huge number of algorithms, so allowing users to customize their ML on Spark workflows will be critical.  This is IMO the most important thing we could do for MLlib.
Part of this could be building a healthy community of Spark Packages, and we will need to make it easier for users to write their own algorithms and packages to facilitate this.  Part of this could be allowing users to customize existing algorithms with custom loss functions, etc.
 
(2) Consistent improvements to core algorithms
A less exciting but still very important item will be constantly improving the core set of algorithms in MLlib. This could mean speed, scaling, robustness, and usability for the few algorithms which cover 90% of use cases.
 
There are plenty of other possibilities, and it will be great to hear the community's thoughts!
 
Thanks,
Joseph
 
 
 
--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com

 

--------------------------------------------------------------------- To unsubscribe e-mail: [hidden email]



--
Asher Krim
Senior Software Engineer

Re: MLlib mission and goals

bradc

I believe one of the higher-level goals of Spark MLlib should be to improve the efficiency of the ML algorithms that already exist. Currently ML has reasonable coverage of the important core algorithms. The work to get to feature parity for the DataFrame-based API and model persistence is also important.

Apache Spark needs to use higher-level BLAS3 and LAPACK routines instead of BLAS1 & BLAS2. For a long time we've used the concept of compute intensity (compute_intensity = FP_operations/word) to help analyze the performance of the underlying compute kernels (see the papers referenced below). It has been proven in many implementations that better performance, better scalability, and a huge reduction in memory pressure can be achieved by using higher-level BLAS3 or LAPACK routines in both single-node and distributed computations.

I performed a survey of some of Apache Spark's ML algorithms. Unfortunately, most of them are implemented with BLAS1 or BLAS2 routines, which have very low compute intensity. BLAS1 and BLAS2 routines require a lot more memory bandwidth and will not achieve peak performance on x86, GPUs, or any other processor.

Apache Spark 2.1.0 ML routines & BLAS Routines

ALS (Alternating Least Squares) matrix factorization

  • BLAS2: _SPR, _TPSV
  • BLAS1: _AXPY, _DOT, _SCAL, _NRM2
Logistic regression classification
  • BLAS2: _GEMV
  • BLAS1: _DOT, _SCAL
Generalized linear regression
  • BLAS1: _DOT
Gradient-boosted tree regression
  • BLAS1: _DOT
GraphX SVD++
  • BLAS1: _AXPY, _DOT,_SCAL
Neural Net Multi-layer Perceptron
  • BLAS3: _GEMM
  • BLAS2: _GEMV

Only the neural-net multi-layer perceptron uses a BLAS3 matrix multiply (_GEMM). BTW, the underscores are replaced by S, D, C, Z for 32-bit real, 64-bit real, 32-bit complex, and 64-bit complex operations, respectively.

Refactoring the algorithms to use BLAS3 or higher-level LAPACK routines will require code changes to use sub-block algorithms, but the performance benefits can be great.
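To make the compute-intensity argument concrete, here is a rough flops-per-word model for the three BLAS levels. This is a simplified sketch: it counts only compulsory memory traffic and ignores caching, but it shows why the level matters.

```python
# Rough compute intensity (floating-point ops per word of memory traffic)
# for representative BLAS routines, using compute_intensity = FP_ops / words.
# Simplified model: compulsory traffic only, cache effects ignored.

def intensity_dot(n):
    # BLAS1 _DOT on length-n vectors: 2n flops, reads 2n words.
    return (2.0 * n) / (2.0 * n)

def intensity_gemv(n):
    # BLAS2 _GEMV (n x n): ~2n^2 flops over n^2 + 3n words (matrix dominates).
    return (2.0 * n * n) / (n * n + 3.0 * n)

def intensity_gemm(n):
    # BLAS3 _GEMM (n x n): 2n^3 flops over ~4n^2 words (read A and B,
    # read and write C) -- intensity grows linearly with n.
    return (2.0 * n ** 3) / (4.0 * n * n)

for n in (256, 1024):
    print(f"n={n}: BLAS1 dot={intensity_dot(n):.1f}, "
          f"BLAS2 gemv={intensity_gemv(n):.2f}, "
          f"BLAS3 gemm={intensity_gemm(n):.0f} flops/word")
```

Under this model, at n = 1024 a _DOT does about 1 flop per word and a _GEMV about 2, while a _GEMM does roughly 512 flops per word. That is why blocked BLAS3 kernels can run near peak floating-point rate while BLAS1/BLAS2 code stays memory-bound.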

More at: https://blogs.oracle.com/BestPerf/entry/improving_algorithms_in_spark_ml

Background:

Brad Carlile. Parallelism, compute intensity, and data vectorization. SuperComputing'93, November 1993.

John McCalpin. Memory Bandwidth and Machine Balance in Current High Performance Computers. 1995.


Re: MLlib mission and goals

Joseph Bradley
Re: performance measurement framework
We (Databricks) used to use spark-perf, but that was mainly for the RDD-based API.  We've now switched to spark-sql-perf, which does include some ML benchmarks despite the project name.  I'll see about updating the project README to document how to run MLlib tests.


On Tue, Jan 24, 2017 at 6:02 PM, bradc <[hidden email]> wrote:

Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com


Re: MLlib mission and goals

Seth Hendrickson
I agree with what Sean said about not supporting arbitrarily many algorithms. I think the goal of MLlib should be to support only core algorithms for machine learning. Ideally, Spark ML provides a relatively small set of heavily optimized algorithms, plus a framework that makes it easy for users to extend it and build their own packages and algorithms when they need to. Spark ML is already quite good for this. We have of course been doing a lot of work migrating to this new API, and now that we are approaching full parity, it would be good to shift the focus to performance, as others have noted. Supporting a few algorithms that perform very well is significantly better than supporting many algorithms with moderate performance, IMO.
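To illustrate the extension point being discussed, here is a dependency-free caricature of the Pipelines abstraction. The class and column names are invented for illustration, "rows" stands in for a DataFrame, and the real spark.ml API differs in its details; the point is only that a small, stable set of interfaces lets users add stages without touching the core library.

```python
# Dependency-free caricature of a pipeline abstraction. A "DataFrame"
# here is just a list of dicts; names are hypothetical, not spark.ml's.

class Transformer:
    def transform(self, rows):
        raise NotImplementedError

class Pipeline(Transformer):
    """Chains stages; itself a Transformer, so pipelines compose."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

# A user-defined stage: no changes to the core library required.
class SquareColumn(Transformer):
    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col
    def transform(self, rows):
        return [dict(r, **{self.output_col: r[self.input_col] ** 2})
                for r in rows]

pipeline = Pipeline([SquareColumn("x", "x2")])
print(pipeline.transform([{"x": 2}, {"x": 3}]))
# [{'x': 2, 'x2': 4}, {'x': 3, 'x2': 9}]
```

In the real API the contract is richer (Params, schemas, Estimator.fit producing a Model), but the composition principle is the same.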

I also think a more complete, optimized distributed linear algebra library would be a great asset, but it may be a more long term goal. A performance framework for regression testing would be great, but keeping it up to date is difficult.

Thanks for kicking this thread off Joseph!

On Tue, Jan 24, 2017 at 7:30 PM, Joseph Bradley <[hidden email]> wrote: