Revisiting Online serving of Spark models?

30 messages

Revisiting Online serving of Spark models?

Holden Karau
Hi y'all,

With the renewed interest in ML in Apache Spark, now seems like as good a time as any to revisit the online serving situation in Spark ML. DB and others have done some excellent work moving a lot of the necessary tools into a local linear algebra package that doesn't depend on having a SparkContext.

There are a few different commercial and non-commercial solutions around this, but currently our individual transform/predict methods are private, so those solutions either need to copy the logic, re-implement it, or put themselves in org.apache.spark to access it. How would folks feel about adding a new trait for ML pipeline stages that exposes transformation of single-element inputs (or local collections), which could be optionally implemented by stages that support it? That way we'd have less copy-and-paste code that can get out of sync with our model training.
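As a rough sketch, such an opt-in trait might look something like this (all names here are illustrative, not actual Spark API):

```scala
// Hypothetical sketch of an opt-in trait for single-element serving.
// A stage that mixes this in can score one input (or a local collection)
// without a SparkContext.
trait LocalTransformer[I, O] {
  /** Transform a single input element locally. */
  def transformOne(input: I): O

  /** Transform a local collection by reusing the single-element path. */
  def transformLocal(inputs: Seq[I]): Seq[O] = inputs.map(transformOne)
}

// Toy model standing in for a fitted stage: element-wise scaling of a
// feature vector by learned coefficients.
class ScalingModel(coefficients: Array[Double])
    extends LocalTransformer[Array[Double], Array[Double]] {
  override def transformOne(input: Array[Double]): Array[Double] =
    input.zip(coefficients).map { case (x, w) => x * w }
}
```

A serving system could then check whether a loaded stage implements the trait and fall back to DataFrame-based transform for stages that don't.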

I think continuing to have online serving grow in different projects is probably the right path forward (folks have different needs), but I'd love to see us make it simpler for other projects to build reliable serving tools.

I realize this may put some folks in an awkward position with their own commercial offerings, but hopefully if we make it easier for everyone, the commercial vendors can benefit as well.

Cheers,

Holden :)

--

Re: Revisiting Online serving of Spark models?

Joseph Bradley
Thanks for bringing this up Holden!  I'm a strong supporter of this.

This was one of the original goals for mllib-local: to have local versions of MLlib models which could be deployed without the big Spark JARs and without a SparkContext or SparkSession.  There are related commercial offerings like this : ) but the overhead of maintaining those offerings is pretty high.  Building good APIs within MLlib to avoid copying logic across libraries will be well worth it.

We've talked about this need at Databricks and have also been syncing with the creators of MLeap.  It'd be great to get this functionality into Spark itself.  Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a Row to the current Models.  Instead, it would be ideal to have local, lightweight versions of models in mllib-local, outside of the main mllib package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize elements of Spark SQL, particularly Rows and Types, which could be moved into a local sql package.
* This architecture may currently require some awkward APIs, with model prediction logic and local model classes in mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We might find it helpful to break some DeveloperApis in Spark 3.0 to facilitate this architecture while making it feasible for 3rd-party developers to extend MLlib APIs (especially in Java).
* It could also be worth discussing local DataFrames.  They might not be as important as per-Row transformations, but they would be helpful for batching for higher throughput.
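To make the Pipeline point concrete, here is a hedged sketch of what per-row local pipeline execution could look like, using a plain Map in place of the local Row/Type support discussed above (all names are illustrative, not actual Spark API):

```scala
// Illustrative local stage interface: one record is a column-name -> value
// map, standing in for a local Row with a schema.
trait LocalStage {
  def transformRow(row: Map[String, Any]): Map[String, Any]
}

// Toy tokenizer stage: lowercases a text column and splits on whitespace.
class LocalTokenizer(inputCol: String, outputCol: String) extends LocalStage {
  def transformRow(row: Map[String, Any]): Map[String, Any] =
    row + (outputCol ->
      row(inputCol).asInstanceOf[String].toLowerCase.split("\\s+").toSeq)
}

// Local analogue of PipelineModel: apply each stage in order to a single row.
class LocalPipelineModel(stages: Seq[LocalStage]) extends LocalStage {
  def transformRow(row: Map[String, Any]): Map[String, Any] =
    stages.foldLeft(row)((r, stage) => stage.transformRow(r))
}
```

Batching for throughput (the local-DataFrame point) would then amount to mapping transformRow over a local collection of rows.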

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau <[hidden email]> wrote:



--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com

Reply | Threaded
Open this post in threaded view
|

Re: Revisiting Online serving of Spark models?

Holden Karau


On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <[hidden email]> wrote:
> Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
> This was one of the original goals for mllib-local: to have local versions of MLlib models which could be deployed without the big Spark JARs and without a SparkContext or SparkSession.  There are related commercial offerings like this : ) but the overhead of maintaining those offerings is pretty high.  Building good APIs within MLlib to avoid copying logic across libraries will be well worth it.

> We've talked about this need at Databricks and have also been syncing with the creators of MLeap.  It'd be great to get this functionality into Spark itself.  Some thoughts:
> * It'd be valuable to have this go beyond adding transform() methods taking a Row to the current Models.  Instead, it would be ideal to have local, lightweight versions of models in mllib-local, outside of the main mllib package (for easier deployment with smaller & fewer dependencies).
> * Supporting Pipelines is important.  For this, it would be ideal to utilize elements of Spark SQL, particularly Rows and Types, which could be moved into a local sql package.
> * This architecture may currently require some awkward APIs, with model prediction logic and local model classes in mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We might find it helpful to break some DeveloperApis in Spark 3.0 to facilitate this architecture while making it feasible for 3rd-party developers to extend MLlib APIs (especially in Java).
I agree this could be interesting, and it feeds into the other discussion around when (or if) we should be considering Spark 3.0.
I _think_ we could probably do it with optional traits people could mix in to avoid breaking the current APIs, but I could be wrong on that point.
> * It could also be worth discussing local DataFrames.  They might not be as important as per-Row transformations, but they would be helpful for batching for higher throughput.
That could be interesting as well. 

> I'll be interested to hear others' thoughts too!
>
> Joseph


Re: Revisiting Online serving of Spark models?

Felix Cheung
Huge +1 on this!


From: [hidden email] <[hidden email]> on behalf of Holden Karau <[hidden email]>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?
 



Re: Revisiting Online serving of Spark models?

Felix Cheung
Specifically, I’d like to bring part of the discussion to Model and PipelineModel, and the various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker on reusing trained models outside of Spark for online serving.

What’s the next step? Would folks be interested in getting together to discuss/get some feedback?


_____________________________
From: Felix Cheung <[hidden email]>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <[hidden email]>, Joseph Bradley <[hidden email]>
Cc: dev <[hidden email]>



Re: Revisiting Online serving of Spark models?

Joseph Bradley
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  It's easier to parse JSON without Spark, and using the same format simplifies architecture.  Plus, some people want to check files into version control, and JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).
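As a hedged sketch of the JSON idea (this mirrors the shape of the approach, not Spark's actual on-disk layout): model parameters can round-trip through a small JSON document that any language can parse without Spark or Parquet, and that diffs cleanly in version control.

```scala
// Illustrative parameters for a simple linear model.
case class LinearParams(intercept: Double, coefficients: Array[Double])

object JsonModelIO {
  // Write the params as a one-line JSON document.
  def toJson(p: LinearParams): String =
    s"""{"intercept":${p.intercept},"coefficients":[${p.coefficients.mkString(",")}]}"""

  // Minimal reader for exactly the format written above (a real
  // implementation would use a proper JSON library).
  def fromJson(s: String): LinearParams = {
    val intercept = s.split("\"intercept\":")(1).takeWhile(_ != ',').toDouble
    val coefs = s.split("\\[")(1).takeWhile(_ != ']').split(",").map(_.toDouble)
    LinearParams(intercept, coefs)
  }
}
```

The same-format point above is the key design choice: a serving process only needs this small reader, not the Spark JARs.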

This would be a big project, so proposing a SPIP might be best.  If people are around at the Spark Summit, that could be a good time to meet up & then post notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <[hidden email]> wrote:
> Specifically, I’d like to bring part of the discussion to Model and PipelineModel, and the various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker on reusing trained models outside of Spark for online serving.
>
> What’s the next step? Would folks be interested in getting together to discuss/get some feedback?



Re: Revisiting Online serving of Spark models?

Holden Karau
I like that idea. I’ll be around Spark Summit.

On Mon, May 21, 2018 at 1:52 PM Joseph Bradley <[hidden email]> wrote:
> Regarding model reading and writing, I'll give quick thoughts here:
> * Our approach was to use the same format but write JSON instead of Parquet.  It's easier to parse JSON without Spark, and using the same format simplifies architecture.  Plus, some people want to check files into version control, and JSON is nice for that.
> * The reader/writer APIs could be extended to take format parameters (just like DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).
>
> This would be a big project, so proposing a SPIP might be best.  If people are around at the Spark Summit, that could be a good time to meet up & then post notes back to the dev list.

Re: Revisiting Online serving of Spark models?

Holden Karau
(Oh also the write API has already been extended to take formats).

On Mon, May 21, 2018 at 2:51 PM Holden Karau <[hidden email]> wrote:
> I like that idea. I’ll be around Spark Summit.

On Mon, May 21, 2018 at 1:52 PM Joseph Bradley <[hidden email]> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  It's easier to parse JSON without Spark, and using the same format simplifies architecture.  Plus, some people want to check files into version control, and JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).

This would be a big project, so proposing a SPIP might be best.  If people are around at the Spark Summit, that could be a good time to meet up & then post notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <[hidden email]> wrote:
Specifically I’d like bring part of the discussion to Model and PipelineModel, and various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker on reusing  trained models outside of Spark for online serving.

What’s the next step? Would folks be interested in getting together to discuss/get some feedback?


_____________________________
From: Felix Cheung <[hidden email]>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <[hidden email]>, Joseph Bradley <[hidden email]>
Cc: dev <[hidden email]>



Huge +1 on this!


From: [hidden email] <[hidden email]> on behalf of Holden Karau <[hidden email]>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?
 


On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <[hidden email]> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of MLlib models which could be deployed without the big Spark JARs and without a SparkContext or SparkSession.  There are related commercial offerings like this : ) but the overhead of maintaining those offerings is pretty high.  Building good APIs within MLlib to avoid copying logic across libraries will be well worth it.

We've talked about this need at Databricks and have also been syncing with the creators of MLeap.  It'd be great to get this functionality into Spark itself.  Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a Row to the current Models.  Instead, it would be ideal to have local, lightweight versions of models in mllib-local, outside of the main mllib package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize elements of Spark SQL, particularly Rows and Types, which could be moved into a local sql package.
* This architecture may require some awkward APIs currently to have model prediction logic in mllib-local, local model classes in mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We might find it helpful to break some DeveloperApis in Spark 3.0 to facilitate this architecture while making it feasible for 3rd party developers to extend MLlib APIs (especially in Java).
I agree this could be interesting, and it could feed into the other discussion around when (or if) we should be considering Spark 3.0.
I _think_ we could probably do it with optional traits people could mix in to avoid breaking the current APIs, but I could be wrong on that point.
* It could also be worth discussing local DataFrames.  They might not be as important as per-Row transformations, but they would be helpful for batching for higher throughput.
That could be interesting as well. 

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau <[hidden email]> wrote:
Hi y'all,

With the renewed interest in ML in Apache Spark, now seems like as good a time as any to revisit the online serving situation in Spark ML. DB & others have done some excellent work moving a lot of the necessary tools into a local linear algebra package that doesn't depend on having a SparkContext.

There are a few different commercial and non-commercial solutions around this, but currently our individual transform/predict methods are private, so those solutions either need to copy or re-implement them (or put themselves in org.apache.spark) to access them. How would folks feel about adding a new trait for ML pipeline stages that exposes transformation of single-element inputs (or local collections), which could be optionally implemented by stages that support it? That way we can have less copy-and-paste code that might get out of sync with our model training.
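A minimal sketch of what such an optional single-item transform interface might look like. This is plain Python with illustrative names (Spark's actual implementation would be a Scala trait mixed into pipeline stages); the toy scaler stands in for any stage that can predict on one input without a SparkContext:

```python
from abc import ABC, abstractmethod

class SingleItemTransformer(ABC):
    """Hypothetical optional mixin: stages that support it expose
    single-item transforms without requiring a SparkContext."""

    @abstractmethod
    def transform_one(self, row: dict) -> dict:
        ...

class StandardScalerModel(SingleItemTransformer):
    """Toy local model: scales a single feature value locally."""

    def __init__(self, mean: float, std: float):
        self.mean = mean
        self.std = std

    def transform_one(self, row: dict) -> dict:
        scaled = (row["feature"] - self.mean) / self.std
        return {**row, "scaled": scaled}

model = StandardScalerModel(mean=2.0, std=0.5)
print(model.transform_one({"feature": 3.0}))  # {'feature': 3.0, 'scaled': 2.0}
```

Serving frameworks could then check for the mixin and call `transform_one` per request, falling back to DataFrame-based transform for stages that don't implement it.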

I think continuing to have online serving grow in different projects is probably the right path forward (folks have different needs), but I'd love to see us make it simpler for other projects to build reliable serving tools.

I realize this may put some folks in an awkward position with their own commercial offerings, but hopefully if we make it easier for everyone, the commercial vendors can benefit as well.

Cheers,

Holden :)

--



--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com




Re: Revisiting Online serving of Spark models?

Felix Cheung
+1 on meeting up!


From: Holden Karau <[hidden email]>
Sent: Monday, May 21, 2018 2:52:20 PM
To: Joseph Bradley
Cc: Felix Cheung; dev
Subject: Re: Revisiting Online serving of Spark models?
 
(Oh, also: the write API has already been extended to take formats.)

On Mon, May 21, 2018 at 2:51 PM Holden Karau <[hidden email]> wrote:
I like that idea. I’ll be around Spark Summit.

On Mon, May 21, 2018 at 1:52 PM Joseph Bradley <[hidden email]> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  It's easier to parse JSON without Spark, and using the same format simplifies architecture.  Plus, some people want to check files into version control, and JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).
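A rough sketch of how a format-aware model writer could look. The `LocalModelWriter` class and its layout are hypothetical (not the actual Spark API); the point is that a fluent `format()`, mirroring `DataFrameWriter.format()`, lets JSON persistence coexist with the Parquet default:

```python
import json
import os
import tempfile

class LocalModelWriter:
    """Hypothetical sketch of a format-aware model writer.
    format() mirrors DataFrameWriter.format() in spirit."""

    def __init__(self, params: dict):
        self.params = params       # model parameters to persist
        self._format = "parquet"   # Spark's current default format

    def format(self, fmt: str) -> "LocalModelWriter":
        # Fluent setter, like df.write.format("json")
        self._format = fmt
        return self

    def save(self, path: str) -> None:
        if self._format == "json":
            # JSON needs no Spark to read back and diffs nicely in version control
            with open(path, "w") as f:
                json.dump(self.params, f, indent=2)
        else:
            raise NotImplementedError(f"format {self._format!r} not supported in this sketch")

# Usage: persist coefficients as JSON, reload with plain stdlib json
path = os.path.join(tempfile.mkdtemp(), "model.json")
LocalModelWriter({"coefficients": [0.5, -1.2], "intercept": 0.1}).format("json").save(path)
with open(path) as f:
    print(json.load(f)["intercept"])  # 0.1
```

Keeping the on-disk schema identical across formats (as suggested above) would let the same reader logic handle both.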

This would be a big project, so proposing a SPIP might be best.  If people are around at the Spark Summit, that could be a good time to meet up & then post notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <[hidden email]> wrote:
Specifically, I’d like to bring part of the discussion to Model and PipelineModel, and the various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker for reusing trained models outside of Spark for online serving.

What’s the next step? Would folks be interested in getting together to discuss/get some feedback?



Re: Revisiting Online serving of Spark models?

Leif Walsh
In reply to this post by Joseph Bradley
I’m with you on JSON being more readable than Parquet, but we’ve had success using pyarrow’s Parquet reader and have been quite happy with it so far. If your target is Python (and probably, if not now then soon, R), you should look into it.

Cheers,
Leif

Re: Revisiting Online serving of Spark models?

Maximiliano Felice
Hi!

I don't usually write a lot on this list, but I keep up to date with the discussions and I'm a heavy user of Spark. This topic caught my attention, as we're currently facing this issue at work. I'm attending the summit and was wondering if it would be possible for me to join that meeting. I might be able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice


Re: Revisiting Online serving of Spark models?

Saikat Kanjilal
I’m in the exact same boat as Maximiliano and have use cases for model serving as well; I'd love to join this discussion.

Sent from my iPhone


Re: Revisiting Online serving of Spark models?

Felix Cheung
Hi! How about we meet with the community to discuss this on June 6 at 4 pm, at (or near) the Summit?

(I propose we meet at the venue entrance so we can accommodate people who might not be attending the conference.)


From: Saikat Kanjilal <[hidden email]>
Sent: Tuesday, May 22, 2018 7:47:07 AM
To: Maximiliano Felice
Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
Subject: Re: Revisiting Online serving of Spark models?
 
I’m in the same exact boat as Maximiliano and have use cases as well for model serving and would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice <[hidden email]> wrote:

Hi!

I'm don't usually write a lot on this list but I keep up to date with the discussions and I'm a heavy user of Spark. This topic caught my attention, as we're currently facing this issue at work. I'm attending to the summit and was wondering if it would it be possible for me to join that meeting. I might be able to share some helpful usecases and ideas.

Thanks,
Maximiliano Felice

El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <[hidden email]> escribió:
I’m with you on json being more readable than parquet, but we’ve had success using pyarrow’s parquet reader and have been quite happy with it so far. If your target is python (and probably if not now, then soon, R), you should look in to it. 

On Mon, May 21, 2018 at 16:52 Joseph Bradley <[hidden email]> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  It's easier to parse JSON without Spark, and using the same format simplifies architecture.  Plus, some people want to check files into version control, and JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).

This would be a big project, so proposing a SPIP might be best.  If people are around at the Spark Summit, that could be a good time to meet up & then post notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <[hidden email]> wrote:
Specifically I’d like bring part of the discussion to Model and PipelineModel, and various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker on reusing  trained models outside of Spark for online serving.

What’s the next step? Would folks be interested in getting together to discuss/get some feedback?


_____________________________
From: Felix Cheung <[hidden email]>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <[hidden email]>, Joseph Bradley <[hidden email]>
Cc: dev <[hidden email]>



Huge +1 on this!


From: [hidden email] <[hidden email]> on behalf of Holden Karau <[hidden email]>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?
 


On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <[hidden email]> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of MLlib models which could be deployed without the big Spark JARs and without a SparkContext or SparkSession.  There are related commercial offerings like this : ) but the overhead of maintaining those offerings is pretty high.  Building good APIs within MLlib to avoid copying logic across libraries will be well worth it.

We've talked about this need at Databricks and have also been syncing with the creators of MLeap.  It'd be great to get this functionality into Spark itself.  Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a Row to the current Models.  Instead, it would be ideal to have local, lightweight versions of models in mllib-local, outside of the main mllib package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize elements of Spark SQL, particularly Rows and Types, which could be moved into a local sql package.
* This architecture may require some awkward APIs currently to have model prediction logic in mllib-local, local model classes in mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We might find it helpful to break some DeveloperApis in Spark 3.0 to facilitate this architecture while making it feasible for 3rd party developers to extend MLlib APIs (especially in Java).
I agree this could be interesting, and feed into the other discussion around when (or if) we should be considering Spark 3.0
I _think_ we could probably do it with optional traits people could mix in to avoid breaking the current APIs but I could be wrong on that point.
* It could also be worth discussing local DataFrames.  They might not be as important as per-Row transformations, but they would be helpful for batching for higher throughput.
That could be interesting as well. 

I'll be interested to hear others' thoughts too!

Joseph
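[Editor's note: to make the "optional trait" idea above concrete, here is a rough sketch in Python terms. All names here are hypothetical; in Spark ML this would be a Scala trait mixed into pipeline stages. The point is that stages which can transform a single row without a SparkContext opt in, and a local pipeline wrapper rejects stages that can't.]

```python
class LocalTransformer:
    """Opt-in mixin: stages implementing this can serve single rows locally."""

    def transform_row(self, row: dict) -> dict:
        raise NotImplementedError


class ScalerModel(LocalTransformer):
    """Toy stage: scales one numeric feature by a fitted factor."""

    def __init__(self, factor: float):
        self.factor = factor

    def transform_row(self, row: dict) -> dict:
        out = dict(row)
        out["scaled"] = row["feature"] * self.factor
        return out


class LocalPipelineModel:
    """Chains stages that opted into LocalTransformer for online serving."""

    def __init__(self, stages):
        # Fail fast if any stage can't do local, single-row transforms.
        for s in stages:
            if not isinstance(s, LocalTransformer):
                raise TypeError(f"{type(s).__name__} does not support local serving")
        self.stages = stages

    def transform_row(self, row: dict) -> dict:
        for s in self.stages:
            row = s.transform_row(row)
        return row


pipeline = LocalPipelineModel([ScalerModel(2.0)])
print(pipeline.transform_row({"feature": 3.0}))  # {'feature': 3.0, 'scaled': 6.0}
```

Because the trait is optional, existing stages that genuinely need a SparkContext are unaffected; serving frameworks can check for the capability at pipeline-load time rather than failing per request.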


Re: Revisiting Online serving of Spark models?

Felix Cheung
Bump.


From: Felix Cheung <[hidden email]>
Sent: Saturday, May 26, 2018 1:05:29 PM
To: Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
Cc: Leif Walsh; Holden Karau; dev
Subject: Re: Revisiting Online serving of Spark models?
 
Hi! How about we meet with the community and discuss on June 6 at 4pm at (near) the Summit?

(I propose we meet at the venue entrance so we can accommodate people who might not be in the conference.)


From: Saikat Kanjilal <[hidden email]>
Sent: Tuesday, May 22, 2018 7:47:07 AM
To: Maximiliano Felice
Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
Subject: Re: Revisiting Online serving of Spark models?
 
I’m in the exact same boat as Maximiliano, have use cases for model serving as well, and would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice <[hidden email]> wrote:

Hi!

I don't usually write a lot on this list, but I keep up to date with the discussions and I'm a heavy user of Spark. This topic caught my attention, as we're currently facing this issue at work. I'm attending the summit and was wondering if it would be possible for me to join that meeting. I might be able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <[hidden email]> escribió:
I’m with you on JSON being more readable than Parquet, but we’ve had success using pyarrow’s Parquet reader and have been quite happy with it so far. If your target is Python (and probably if not now, then soon, R), you should look into it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley <[hidden email]> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  It's easier to parse JSON without Spark, and using the same format simplifies architecture.  Plus, some people want to check files into version control, and JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).

This would be a big project, so proposing a SPIP might be best.  If people are around at the Spark Summit, that could be a good time to meet up & then post notes back to the dev list.
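[Editor's note: a minimal sketch of why the JSON approach above is attractive for online serving. The schema here is invented for illustration, not Spark's actual ML persistence format: fitted parameters written as JSON can be round-tripped and scored with nothing but the standard library, no Spark JARs on the serving side.]

```python
import json
import math

# "Training side": dump fitted logistic-regression parameters to JSON.
exported = json.dumps({
    "class": "LogisticRegressionModel",
    "coefficients": [0.5, -0.25],
    "intercept": 0.1,
})


def predict_proba(model_json: str, features: list) -> float:
    """'Serving side': pure-Python scorer reconstructed from the JSON alone."""
    model = json.loads(model_json)
    margin = model["intercept"] + sum(
        w * x for w, x in zip(model["coefficients"], features)
    )
    return 1.0 / (1.0 + math.exp(-margin))


p = predict_proba(exported, [1.0, 2.0])
assert 0.0 < p < 1.0
```

A Parquet export would need a columnar reader on the serving side; a JSON export like this is also diff-friendly, which matters for the version-control use case mentioned above.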

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <[hidden email]> wrote:
Specifically, I’d like to bring part of the discussion to Model and PipelineModel, and the various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker on reusing trained models outside of Spark for online serving.

What’s the next step? Would folks be interested in getting together to discuss/get some feedback?


_____________________________
From: Felix Cheung <[hidden email]>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <[hidden email]>, Joseph Bradley <[hidden email]>
Cc: dev <[hidden email]>



Huge +1 on this!



Re: Revisiting Online serving of Spark models?

Holden Karau
I'm down for that. We could all go for a walk, maybe to the Mint Plaza Blue Bottle, and grab coffee (if the weather holds, have our design meeting outside :p)?


Re: Revisiting Online serving of Spark models?

Felix Cheung
You had me at Blue Bottle!


Re: Revisiting Online serving of Spark models?

Maximiliano Felice
Big +1 to a meeting with fresh air.

Could anyone send the invites? I don't really know which place Holden is talking about.



Re: Revisiting Online serving of Spark models?

Saikat Kanjilal
Would love to join but am in Seattle, thoughts on how to make this work?

Regards

Sent from my iPhone

On May 29, 2018, at 10:35 AM, Maximiliano Felice <[hidden email]> wrote:

Big +1 to a meeting with fresh air.

Could anyone send the invites? I don't really know which is the place Holden is talking about.

2018-05-29 14:27 GMT-03:00 Felix Cheung <[hidden email]>:
You had me at blue bottle!

_____________________________
From: Holden Karau <[hidden email]>
Sent: Tuesday, May 29, 2018 9:47 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Felix Cheung <[hidden email]>
Cc: Saikat Kanjilal <[hidden email]>, Maximiliano Felice <[hidden email]>, Joseph Bradley <[hidden email]>, Leif Walsh <[hidden email]>, dev <[hidden email]>



I'm down for that; we could all go for a walk, maybe to the Mint Plaza Blue Bottle, and grab coffee (if the weather holds, have our design meeting outside :p)?

On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <[hidden email]> wrote:
Bump.


From: Felix Cheung <[hidden email]>
Sent: Saturday, May 26, 2018 1:05:29 PM
To: Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
Cc: Leif Walsh; Holden Karau; dev

Subject: Re: Revisiting Online serving of Spark models?
 
Hi! How about we meet with the community and discuss on June 6 at 4pm, at (or near) the Summit?

(I propose we meet at the venue entrance so we can accommodate people who might not be at the conference.)


From: Saikat Kanjilal <[hidden email]>
Sent: Tuesday, May 22, 2018 7:47:07 AM
To: Maximiliano Felice
Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
Subject: Re: Revisiting Online serving of Spark models?
 
I’m in the same exact boat as Maximiliano and have use cases as well for model serving and would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice <[hidden email]> wrote:

Hi!

I don't usually write a lot on this list, but I keep up to date with the discussions and I'm a heavy user of Spark. This topic caught my attention, as we're currently facing this issue at work. I'm attending the summit and was wondering if it would be possible for me to join that meeting. I might be able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <[hidden email]> escribió:
I’m with you on JSON being more readable than Parquet, but we’ve had success using pyarrow’s parquet reader and have been quite happy with it so far. If your target is Python (and probably if not now, then soon, R), you should look into it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley <[hidden email]> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  It's easier to parse JSON without Spark, and using the same format simplifies architecture.  Plus, some people want to check files into version control, and JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).
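As an illustration of the JSON idea (the field names below are hypothetical, not the actual MLlib persistence schema), a model's metadata and parameters serialize to something a serving process can load with nothing but a JSON parser:

```python
import json

# Hypothetical model layout: same kind of metadata Spark writes today, but as
# JSON, so a lightweight server (or a version-control diff) can read it
# without Spark. Field names here are invented for illustration.
model = {
    "class": "LogisticRegressionModel",
    "sparkVersion": "2.3.0",
    "paramMap": {"threshold": 0.5},
    "coefficients": [0.25, -0.75],
    "intercept": 0.1,
}
text = json.dumps(model, indent=2, sort_keys=True)  # diff-friendly for VCS

# A serving process reloads the model and scores one input locally.
loaded = json.loads(text)
features = [1.0, 1.0]
margin = sum(c * x for c, x in zip(loaded["coefficients"], features))
margin += loaded["intercept"]
prediction = 1 if margin > 0 else 0
# margin ~= -0.4, so prediction == 0
```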

This would be a big project, so proposing a SPIP might be best.  If people are around at the Spark Summit, that could be a good time to meet up & then post notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <[hidden email]> wrote:
Specifically, I’d like to bring part of the discussion to Model and PipelineModel, and the various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker on reusing trained models outside of Spark for online serving.

What’s the next step? Would folks be interested in getting together to discuss/get some feedback?


_____________________________
From: Felix Cheung <[hidden email]>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <[hidden email]>, Joseph Bradley <[hidden email]>
Cc: dev <[hidden email]>



Huge +1 on this!




Re: Revisiting Online serving of Spark models?

Felix Cheung
Hi!

Thank you! Let’s meet then

June 6 4pm

Moscone West Convention Center
800 Howard Street, San Francisco, CA 94103

Ground floor (outside of conference area - should be available for all) - we will meet and decide where to go

(Would not send invite because that would be too much noise for dev@)

To paraphrase Joseph, we will use this to kick off the discussion, post notes afterwards, and follow up online. As for Seattle, I would be very interested to meet in person later and discuss ;) 





Re: Revisiting Online serving of Spark models?

Denny Lee
I most likely will not be able to join SF next week but definitely up for a session after Summit in Seattle to dive further into this, eh?! 


