SparkGraph review process

Mats Rydberg
Hello dear Spark community

We are the developers behind the SparkGraph SPIP, a project that grew out of our work on openCypher Morpheus (https://github.com/opencypher/morpheus). This year we have collaborated mainly with Xiangrui Meng of Databricks to define and develop a new SparkGraph module based on our experience with Morpheus. Morpheus, formerly known as "Cypher for Apache Spark", has been in development for over 3 years and has matured in both its API and its implementation.

The SPIP work has been on hold for some time now, as priorities at Databricks have changed, which has occupied Xiangrui's time (along with other happenings). As you may know, the latest API PR (https://github.com/apache/spark/pull/24851) is blocking us from moving forward with the implementation.

So as not to lose track of this project, we are reaching out to ask whether any Spark committers in the community would be prepared to commit to helping us review and merge our code contributions to Apache Spark. We are not asking for much direct development support, as we believe the implementation has been more or less complete since early this year. A proof-of-concept PR (https://github.com/apache/spark/pull/24297) contains the functionality.

If you could offer such aid it would be greatly appreciated. None of us are Spark committers, which is hindering our ability to deliver this project in time for Spark 3.0.

Sincerely
the Neo4j Graph Analytics team
Mats, Martin, Max, Sören, Jonatan

Re: SparkGraph review process

Xiangrui Meng
Hi all,

I want to clarify my role first to avoid misunderstanding. I'm an individual contributor here. My work on the graph SPIP, as well as on other Spark features I have contributed to, is not associated with my employer. It has become quite challenging for me to keep track of the graph SPIP work because I have less time available at home.

In retrospect, we should have involved more Spark devs and committers early on so that there would be no single point of failure, i.e., me. Hopefully it is not too late to fix this. I summarize my thoughts here to help onboard other reviewers:

1. On the technical side, my main concern is the runtime dependency on org.opencypher:okapi-shade. okapi depends on several Scala libraries. We came up with the solution of shading a few Scala libraries to avoid pollution. However, I'm not very confident that the approach is sustainable, for two reasons: a) there are no proper shading libraries for Scala, and b) we will have to wait for upgrades from those Scala libraries before we can upgrade Spark to use a newer Scala version. So it would be great if some Scala experts could help review the current implementation and assess the risk.

2. Overloading helper methods. MLlib used to have several overloaded helper methods for each algorithm, which later became a major maintenance burden. Builders and setters/getters are more maintainable. I will comment again on the PR.

3. The proposed API partitions the graph into sub-graphs, as described in the property graph model. It is unclear to me how this would affect query performance, because it requires the SQL optimizer to correctly recognize data from the same source and make execution efficient.

4. The feature, although originally targeted for Spark 3.0, should not be a Spark 3.0 release blocker because it doesn't require breaking changes. If we miss the code freeze deadline, we can introduce a build flag to exclude the module from the official release/distribution, and then include it by default once the module is ready.

5. If, unfortunately, we still don't see sufficient committer reviews, I think the best option would be to submit the work to the Apache Incubator instead, to unblock it. But maybe it is too early to discuss this option.
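
As a rough illustration of point 2 (this is not code from MLlib or from the SparkGraph PRs; every class, method, and default value below is hypothetical), the difference between overloaded helper methods and a builder with setters/getters might be sketched in Java like this:

```java
/** Hypothetical settings object; names and defaults are illustrative only. */
final class TrainingConfig {
    private final int k;
    private final int maxIterations;
    private final double tolerance;

    private TrainingConfig(Builder b) {
        this.k = b.k;
        this.maxIterations = b.maxIterations;
        this.tolerance = b.tolerance;
    }

    // Getters expose the resolved settings.
    int k() { return k; }
    int maxIterations() { return maxIterations; }
    double tolerance() { return tolerance; }

    /** Only the required parameter is positional; everything else has a setter. */
    static Builder builder(int k) { return new Builder(k); }

    static final class Builder {
        private final int k;
        private int maxIterations = 20;   // assumed default
        private double tolerance = 1e-4;  // assumed default

        private Builder(int k) {
            if (k <= 0) {
                throw new IllegalArgumentException("k must be positive");
            }
            this.k = k;
        }

        // Adding a new optional parameter means adding one setter here;
        // existing call sites keep compiling unchanged.
        Builder setMaxIterations(int value) { this.maxIterations = value; return this; }
        Builder setTolerance(double value) { this.tolerance = value; return this; }

        TrainingConfig build() { return new TrainingConfig(this); }
    }
}

/**
 * The overload style, for contrast. Each optional parameter adds another
 * public signature, and the defaults are repeated across the overloads.
 * These helpers only return the resolved settings; they do not train anything.
 */
final class OverloadedHelpers {
    private OverloadedHelpers() {}

    static TrainingConfig config(int k) {
        return config(k, 20);
    }

    static TrainingConfig config(int k, int maxIterations) {
        return config(k, maxIterations, 1e-4);
    }

    static TrainingConfig config(int k, int maxIterations, double tolerance) {
        return TrainingConfig.builder(k)
                .setMaxIterations(maxIterations)
                .setTolerance(tolerance)
                .build();
    }
}
```

With the builder, a caller writes `TrainingConfig.builder(3).setMaxIterations(50).build()` and names only the settings that differ from the defaults, whereas the overload style needs a new public method for each combination of optional arguments it wants to support.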

It would be great if other committers can offer help on the review! Really appreciated!

Best,
Xiangrui

In reply to Mats Rydberg's message of Fri, Oct 4, 2019 at 1:32 AM.

Re: SparkGraph review process

Xiao Li
> 1. On the technical side, my main concern is the runtime dependency on org.opencypher:okapi-shade. okapi depends on several Scala libraries. We came up with the solution of shading a few Scala libraries to avoid pollution. However, I'm not very confident that the approach is sustainable, for two reasons: a) there are no proper shading libraries for Scala, and b) we will have to wait for upgrades from those Scala libraries before we can upgrade Spark to use a newer Scala version. So it would be great if some Scala experts could help review the current implementation and assess the risk.

This concern is valid. I think we should start a vote to ensure that the whole community is aware of the risk and takes responsibility for maintaining this in the long term.

Cheers,

Xiao


In reply to Xiangrui Meng's message of Fri, Oct 4, 2019 at 12:27 PM.

Re: SparkGraph review process

Holden Karau
Maybe let's ask the folks from Lightbend who helped with the previous Scala upgrade for their thoughts?

In reply to Xiao Li's message of Mon, Oct 14, 2019 at 8:24 PM.
