Encouraging tests SparkR, in particular, dapply, gapply and RDD based APIs (SPARK-21093)

Hyukjin Kwon
Hi all,

Recently, there was an issue about a leak in SparkR: https://issues.apache.org/jira/browse/SPARK-21093.
It was even worse because R workers crash easily on CentOS. This was fixed
in SparkR, and the logic was changed rather radically after careful review by a few reviewers.
Thanks to the reviewers, in particular Felix and Shivaram, who stuck with my PR and the issue.

However, it is still a rather radical change that might affect many APIs that run R's native
functions (e.g., gapply, dapply and the old RDD-based APIs). Due to this concern, it was not targeted
to Spark 2.2.0.

In short, as suggested by the R committers, I would like to encourage testing the APIs that run
R native functions (UDFs) so that any bugs are found early. More specifically, I would like to suggest
the two ways below to check whether the PR really fixed the JIRA and whether it introduced any bugs.

1. Run the APIs multiple times and see if they work. If you are interested in digging further,
you could open another terminal and enter ...

  watch -n 0.01 "lsof -c R | wc -l"

and see if the number consistently increases, which was the original issue. If you want to be
more specific, run ...

  ps -fe | grep /exec/R

and check the PID of daemon.R. And then, run

  watch -n 0.01 "lsof -p [PID] | wc -l"

and check the same thing. Checking this with other good tools would also be very welcome.

2. Run existing workloads with the APIs and check whether they work correctly, to find any hidden bugs early.
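
For those who would rather not watch the counter by eye, the check in step 1 could be scripted roughly as below. This is only a sketch, not part of the PR: the function names (count_open_files, sample_open_files, classify_trend) are made up for illustration, it assumes lsof is installed, and "increasing" here is a simple heuristic of my own (the count never drops and ends higher than it started), so treat its verdict as a hint rather than proof of a leak.

```shell
# Count the open files of one process (assumes lsof is available).
# tr strips the padding that some wc implementations add.
count_open_files() (
  lsof -p "$1" 2>/dev/null | wc -l | tr -d ' '
)

# Print one open-file count per line for a PID.
# $1 = PID, $2 = number of samples (default 10), $3 = seconds between samples (default 1).
sample_open_files() (
  pid=$1
  samples=${2:-10}
  interval=${3:-1}
  i=0
  while [ "$i" -lt "$samples" ]; do
    count_open_files "$pid"
    sleep "$interval"
    i=$((i + 1))
  done
)

# Classify a series of counts given as arguments.
# Heuristic (an assumption, not an official criterion): "increasing" when the
# series never drops and the last value is higher than the first.
classify_trend() (
  first=""
  prev=""
  dropped=0
  for n in "$@"; do
    if [ -z "$first" ]; then
      first=$n
    elif [ "$n" -lt "$prev" ]; then
      dropped=1
    fi
    prev=$n
  done
  if [ -n "$first" ] && [ "$dropped" -eq 0 ] && [ "$prev" -gt "$first" ]; then
    echo "increasing"
  else
    echo "not increasing"
  fi
)
```

While the APIs are running repeatedly in another terminal, it could be used like this, with [PID] being the PID of daemon.R found as in step 1:

  classify_trend $(sample_open_files [PID] 30 1)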