PySpark and py4j: NoClassDefFoundError when upgrading a jar

Alessandro Liparoti

We developed a Scala library, called FV, that runs on Spark. We also built Python wrappers for its public API using py4j, the same way Spark does. For example, the main object is instantiated like this:

self._java_obj = self._new_java_obj("com.example.FV", self.uid)

and methods on the object are called like this:

def add(self, r):
    self._java_obj.add(r)
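
For context, here is a minimal sketch of what the wrapper roughly looks like (the class below is illustrative and assumes pyspark.ml.wrapper.JavaWrapper, the same helper Spark's own ML wrappers use):

from pyspark.ml.wrapper import JavaWrapper

class FV(JavaWrapper):
    """Thin Python facade over the Scala com.example.FV object (sketch)."""

    def __init__(self, uid):
        super(FV, self).__init__()
        self.uid = uid
        # Instantiate the JVM-side object through py4j
        self._java_obj = self._new_java_obj("com.example.FV", self.uid)

    def add(self, r):
        # Delegate to the JVM object; py4j converts the argument
        return self._java_obj.add(r)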

We are experiencing an annoying issue when running pyspark with this external library. We start the pyspark shell like this:

pyspark --repositories <our-own-maven-release-repo> --packages <com.example.FV:latest.release>
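
For reference, the equivalent when building the session programmatically would be roughly the following (the repository URL and Maven coordinates are illustrative placeholders only):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.jars.repositories", "https://maven.example.com/releases")  # our release repo
         .config("spark.jars.packages", "com.example:fv_2.11:latest.release")      # group:artifact:version
         .getOrCreate())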

When we release a new version that changes the Scala API, things start to break randomly for some users. For example, version 0.44 had a class DateUtils (used by class Utils, which is in turn used by class FV in its add method) that was dropped in version 0.45. When version 0.45 was released and a user called the add method through the Python API, we got

java.lang.NoClassDefFoundError: Could not initialize class DateUtils

Basically, the Python API runs the add method, which still references class DateUtils (from v0.44), but when the JVM actually goes to load that class it cannot find it, because the jar that was loaded is v0.45 (as the Ivy log shows when the shell starts up).
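
For what it's worth, the jar a class actually comes from can be checked from inside the shell with something like this (a sketch, assuming the usual spark session object of the pyspark shell; com.example.FV is our class):

# Ask the JVM where it loaded our class from
jvm = spark.sparkContext._jvm
clazz = jvm.java.lang.Class.forName("com.example.FV")
location = clazz.getProtectionDomain().getCodeSource().getLocation()
print(location.toString())  # path of the jar that was actually resolved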

Do you have any idea what the problem might be? Does py4j perhaps cache something, so that when we upgrade the classes we get this error?

Alessandro Liparoti