[PYTHON][DISCUSS] Moving to cloudpickle and/or Py4J as dependencies?

Holden Karau
Hi PySpark Developers,

Cloudpickle is a core part of PySpark, originally copied from (and improved upon) PiCloud. Since then, other projects have found cloudpickle useful, and a fork of cloudpickle is now maintained as its own library (with better test coverage and, as I understand it, resulting bug fixes). We've had a few PRs backporting fixes from the cloudpickle project into Spark's local copy of cloudpickle - how would people feel about taking an explicit (pinned) dependency on cloudpickle instead?

We could add cloudpickle to setup.py, and to a requirements.txt file for users who prefer not to do a system installation of PySpark.

Py4J is maybe an even simpler case: we currently bundle a zip of Py4J in our repo, but could instead require a pinned version. While we do depend on a lot of Py4J internal APIs, version pinning should be sufficient to ensure functionality (and would simplify the update process).
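
As a rough sketch of what that could look like (the version numbers below are illustrative placeholders, not proposed pins), the setup.py side might be something like:

    # Sketch only - the pinned versions below are placeholders.
    from setuptools import setup, find_packages

    setup(
        name="pyspark",
        packages=find_packages(),
        install_requires=[
            "py4j==0.10.4",        # exact pin, since we rely on Py4J internals
            "cloudpickle==0.2.2",  # exact pin on the maintained fork
        ],
    )

and requirements.txt would just carry the same two pinned lines.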

Cheers,

Holden :)

Re: [PYTHON][DISCUSS] Moving to cloudpickle and/or Py4J as dependencies?

rxin
With any dependency update (or refactoring of existing code), I always ask this question: what's the benefit? In this case it looks like the benefit is to reduce the effort spent on backports. Do you know how often we've needed to do those?


Re: [PYTHON][DISCUSS] Moving to cloudpickle and/or Py4J as dependencies?

Holden Karau
It's a good question. Py4J seems to have been updated 5 times in 2016, and each update is a bit involved (from a review point of view, verifying the zip file contents is somewhat tedious).

cloudpickle is harder to quantify, since we can have changes to our local copy which aren't correctly tagged as backports from the fork (and these can take a while to review, since we don't always catch right away that they are backports).

Another difficulty with counting backports is that, since our review process for PySpark has historically been on the slow side, changes benefiting systems like Dask or IPython Parallel were not backported to Spark unless they caused serious errors.

I think the key benefits are better test coverage of the forked version of cloudpickle, more standardized packaging of our dependencies, and simpler dependency updates, which reduce the friction in picking up related projects' work - Python serialization really isn't our secret sauce.

If I'm missing any substantial benefits or costs I'd love to know :)

--
Cell : 425-233-8271
Re: [PYTHON][DISCUSS] Moving to cloudpickle and/or Py4J as dependencies?

zero323

I don't have any strong views, so just to highlight possible issues:

  • Based on different issues I've seen, there is a substantial number of users who depend on system-wide Python installations. As far as I am aware, neither Py4J nor cloudpickle is present in the standard system repositories of Debian or Red Hat derivatives.
  • Assuming that Spark is committed to supporting Python 2 beyond its end of life, we have to be sure that any external dependency has the same policy.
  • Py4J is missing from the default Anaconda channel. Not a big issue, just a small annoyance.
  • External dependencies with pinned versions add some overhead to development across versions (effectively we may need a separate env for each major Spark release). I've seen small inconsistencies in PySpark behavior with different Py4J versions, so this is not completely hypothetical (see the sketch after this list).
  • Adding possible version conflicts. It is probably not a big risk, but something to consider (for example, in the combination Blaze + Dask + PySpark).
  • Adding another party the user has to trust.
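
Purely as an illustration of the version-skew point (this is not existing PySpark code, and the expected version string is a placeholder), an explicit dependency would at least make a runtime sanity check possible:

    # Sketch only: not existing PySpark code; the expected version is a
    # placeholder for whatever Spark would pin in setup.py.
    import warnings
    import pkg_resources

    EXPECTED_PY4J = "0.10.4"  # placeholder pin

    try:
        installed = pkg_resources.get_distribution("py4j").version
    except pkg_resources.DistributionNotFound:
        installed = None

    if installed != EXPECTED_PY4J:
        warnings.warn(
            "PySpark expects Py4J %s but found %s; behavior may differ."
            % (EXPECTED_PY4J, installed)
        )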


-- 
Maciej Szymkiewicz