There seems to be demand for third-party language extensions for Apache Spark. Some notable examples include:


Apache Spark currently supports Python and R through a tightly integrated interop layer. Much of that existing layer could probably be refactored into a clean surface for general third-party language bindings, such as those mentioned above. More specifically, could we generalize the following modules:

  1. Deploy runners (e.g., PythonRunner and RRunner)
  2. DataFrame Executors
  3. RDD operations?


The last item is questionable: integrating third-party language extensions at the RDD level may be too heavyweight, and unnecessary given the preference for the DataFrame abstraction.
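
To make the first item more concrete, here is a rough sketch of what a generalized deploy-runner surface might look like: each language registers a runner for the file extensions it owns, and the submit path picks a runner from the primary resource's extension rather than hard-coding the Python and R cases. Nothing below is an existing Spark API. LanguageRunner, InterpreterRunner and RunnerRegistry are hypothetical names, and the sketch is written in Python only for brevity; a real version would presumably live on the JVM side next to PythonRunner and RRunner.

```python
from abc import ABC, abstractmethod
from pathlib import PurePath
from typing import Dict, List


class LanguageRunner(ABC):
    """Hypothetical plug-in point: one implementation per guest language."""

    @abstractmethod
    def extensions(self) -> List[str]:
        """File extensions (with leading dot) that this runner claims."""

    @abstractmethod
    def launch_command(self, script: str, args: List[str]) -> List[str]:
        """Command line that would start the guest-language driver."""


class InterpreterRunner(LanguageRunner):
    """Simplest case (an assumption): run the script with an interpreter."""

    def __init__(self, interpreter: str, extensions: List[str]) -> None:
        self._interpreter = interpreter
        self._extensions = list(extensions)

    def extensions(self) -> List[str]:
        return list(self._extensions)

    def launch_command(self, script: str, args: List[str]) -> List[str]:
        return [self._interpreter, script, *args]


class RunnerRegistry:
    """Maps a script's file extension to the runner registered for it."""

    def __init__(self) -> None:
        self._by_ext: Dict[str, LanguageRunner] = {}

    def register(self, runner: LanguageRunner) -> None:
        for ext in runner.extensions():
            key = ext.lower()
            if key in self._by_ext:
                raise ValueError(f"extension already registered: {key}")
            self._by_ext[key] = runner

    def resolve(self, script: str) -> LanguageRunner:
        # Extensions are matched case-insensitively, so .R and .r are the same.
        ext = PurePath(script).suffix.lower()
        if ext not in self._by_ext:
            raise LookupError(f"no runner registered for {ext!r}")
        return self._by_ext[ext]


# The two built-in languages become ordinary registrations,
# and a third-party language would be added the same way.
registry = RunnerRegistry()
registry.register(InterpreterRunner("python3", [".py"]))
registry.register(InterpreterRunner("Rscript", [".r"]))

command = registry.resolve("job.R").launch_command("job.R", ["--rows", "10"])
```

The point of the sketch is only that the Python and R cases stop being special: a third-party extension would ship its own runner and register it, without changes to the submit code.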


The main goals of this effort would be:

  1. Provide a clean abstraction for third-party language extensions, making each extension easier to maintain as Apache Spark evolves
  2. Provide guidance to third-party language authors on how a language extension should be implemented
  3. Provide general, reusable libraries that are not specific to any language extension
  4. Open the door to developers who prefer alternative languages


-Tyson Condie