PySpark .collect() output to Scala Array[Row]


Nick Ruest
Hi,

I've hit a wall trying to implement a couple of our Scala methods in a
Python version of our project.

My Python function looks like this:

def Write_Graphml(data, graphml_path, sc):
    return sc.getOrCreate()._jvm.io.archivesunleashed.app.WriteGraphML(data, graphml_path).apply


Here, data is a DataFrame that has been collected, i.e. the result of
calling .collect() on it.

On the Scala side, it is basically:

import org.apache.spark.sql.Row

object WriteGraphML {
  def apply(data: Array[Row], graphmlPath: String): Boolean = {
    // ...
    // massages an Array[Row] into GraphML
    // ...
    true
  }
}

When I try to use it in PySpark, I end up getting this error message:

Py4JError: An error occurred while calling None.io.archivesunleashed.app.WriteGraphML. Trace:
py4j.Py4JException: Constructor io.archivesunleashed.app.WriteGraphML([class java.util.ArrayList, class java.lang.String]) does not exist
        at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
        at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
        at py4j.Gateway.invoke(Gateway.java:237)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)



Based on my research, I'm fairly certain it is because of how Py4J is
passing off the Python List (data) to the JVM, and then passing it to
Scala. It's ending up as an ArrayList instead of an Array[Row].

Do I need to tweak data before it is passed to Write_Graphml? Or am I
doing something else wrong here?

...and not 100% sure if this is a user or dev list question. Let me know
if I should move this over to user.

Thanks in advance for any help!

cheers!

-nruest


Re: PySpark .collect() output to Scala Array[Row]

Sean Owen-2
(This is better for user@)
You have an object, which can't be instantiated, and writing
WriteGraphML(data, graphml_path) in Python asks Py4J for a constructor.
You could make it a class so that it is instantiable, but you can try
writing ... WriteGraphML.apply(...) in Python instead.
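
A minimal sketch of that suggestion, assuming jvm is the Py4J view that PySpark exposes as sc._jvm and that the jar containing WriteGraphML is on the driver classpath. The helper name write_graphml and its jvm parameter are illustrative, not part of the project. It only addresses the constructor lookup: the error message shows the Python list arriving as java.util.ArrayList, so a parameter typed Array[Row] may still need a different argument.

```python
def write_graphml(data, graphml_path, jvm):
    """Call the Scala object's apply method through a Py4J JVM view.

    `jvm` is assumed to be the view PySpark exposes as `sc._jvm`
    (an assumption for this sketch, not confirmed by the thread).
    """
    # No parentheses after WriteGraphML: calling it would ask Py4J to
    # look up a constructor, which a Scala object does not have.
    writer = jvm.io.archivesunleashed.app.WriteGraphML
    # Invoke apply as a method on the object, passing the arguments here.
    return writer.apply(data, graphml_path)
```

Under those assumptions, a call would look like write_graphml(rows, "out.graphml", sc._jvm), where rows and the output path are placeholders.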

On Mon, May 25, 2020 at 1:23 PM Nick Ruest <[hidden email]> wrote:

> [original message snipped]
