Installing PySpark on a local machine

Uri Laserson
Is there a documented/preferred method for installing PySpark on a local
machine?  I want to be able to run a Python interpreter on my local
machine, point it to my Spark cluster and go.  There doesn't appear to be a
setup.py file anywhere, nor is pyspark registered with PyPI.  I'm happy to
contribute these, but want to hear what the preferred method is first.
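
Concretely, something like the following is the workflow I have in mind (a rough
sketch only: it assumes SPARK_HOME points at an unpacked Spark distribution whose
python/ directory has been added to PYTHONPATH by hand, and the master URL and
paths are placeholders):

    # Rough sketch: drive a remote Spark cluster from a local Python interpreter.
    # Assumes SPARK_HOME points at an unpacked Spark distribution and that its
    # python/ directory is already on PYTHONPATH; the master URL is a placeholder.
    import os
    os.environ.setdefault("SPARK_HOME", "/path/to/spark")

    from pyspark import SparkContext

    sc = SparkContext("spark://master-host:7077", "local-driver-example")
    print(sc.parallelize(range(100)).map(lambda x: x * x).sum())
    sc.stop()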

Uri

--
Uri Laserson, PhD
Data Scientist, Cloudera
Twitter/GitHub: @laserson
+1 617 910 0447
[hidden email]

Re: Installing PySpark on a local machine

Josh Rosen
I've thought about creating a setup.py file for PySpark; there are a couple
of subtleties involved:

   - PySpark uses Py4J to create a regular Java Spark driver, so it's
   subject to the same limitations that Scala / Java Spark have when
   connecting from a local machine to a remote cluster; a number of ports need
   to be opened (this is discussed in more detail in other posts on this list;
   try searching for "connect to remote cluster" or something like that).
   - PySpark needs the Spark assembly JAR, so you'd still have to point the
   SPARK_HOME environment variable at a local copy of the Spark assemblies.
   - We need to be careful about communication between incompatible
   versions of the Python and Java portions of the library.  We can probably
   fix this by embedding version numbers in the Python and Java libraries and
   comparing those numbers when launching the Java gateway.
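
To make that last point concrete, here is a rough sketch of what such a check
could look like; PYTHON_SIDE_VERSION and get_java_side_version are hypothetical
names used for illustration, not existing PySpark APIs:

    # Sketch of the version check described above: embed a version string on each
    # side and compare them when the Py4J gateway is launched. PYTHON_SIDE_VERSION
    # and get_java_side_version are hypothetical names, not existing PySpark APIs.
    PYTHON_SIDE_VERSION = "0.9.0"  # would be baked into the Python package at release time

    def verify_gateway_version(gateway, get_java_side_version):
        """Raise if the Python and Java halves of PySpark are at different versions."""
        java_side_version = get_java_side_version(gateway)
        if java_side_version != PYTHON_SIDE_VERSION:
            raise RuntimeError(
                "Version mismatch: Python side is %s but the JVM side is %s"
                % (PYTHON_SIDE_VERSION, java_side_version))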

If we decide to distribute a PySpark package on PyPI, we should integrate
its release with the regular Apache release process for Spark.
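
For what it's worth, a hypothetical minimal setup.py could look something like
this; the name, version, and dependency list are illustrative guesses rather
than an agreed-upon packaging:

    # Hypothetical minimal setup.py for a PyPI "pyspark" package; the name,
    # version, and dependencies are illustrative assumptions, not an existing file.
    from setuptools import setup, find_packages

    setup(
        name="pyspark",
        version="0.9.0",            # would track the corresponding Apache Spark release
        description="Python bindings for Apache Spark",
        packages=find_packages(),   # would pick up the pyspark package
        install_requires=["py4j"],  # PySpark talks to the JVM through Py4J
    )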

Does anyone know how other projects like Mesos distribute their Python
bindings?  Is there a good existing model that we should emulate?

- Josh

