Spark 1.6.3 Driver OOM on createDataFrame

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Spark 1.6.3 Driver OOM on createDataFrame

Asher Krim
Hi All,

There seems to be a bug in Spark 1.6.3 which causes the driver to OOM when creating a dataframe using a lot of data in memory on the driver. Examining a heap dump, it looks like the driver is filled with multiple copies of the data. The following java code reproduces the bug:
  public void run() {
    try (JavaSparkContext sc = getSparkContext()) {
      SQLContext sqlContext = new SQLContext(sc);

      DataFrame df = sqlContext.createDataFrame(generateData().stream()
              .map(floats -> RowFactory.create(floats))
              .collect(Collectors.toList()),
          DataTypes.createStructType(new StructField[] { VECTOR_FIELD }));

      LOG.info("successfully parallelized {} rows", df.count());
    }
  }
private List<List<Float>> generateData() { List<List<Float>> data = new ArrayList<>(3_000_000); for (int i = 0; i < 3_000_000; i++) { List<Float> row = new ArrayList<>(300); for (int j = 0; j < 300; j++) { row.add(random.nextFloat()); } data.add(row); }

Increasing the driver memory to insane values (28g) doesn't help. I tested in Spark 2 and the problem seems to have been solved, however I'm not sure which issue is responsible for solving it. I assume it's one of these: https://issues.apache.org/jira/browse/SPARK-12511?jql=project%20%3D%20SPARK%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%202.0.0%20AND%20text%20~%20%22OOM%22

The reason this is an issue is because some machine learning models are represented as large-ish local data structures on the driver, so this bug is encountered while attempting to save them. Unfortunately, using mllib instead of ml is not an option since some mllib algorithms also rely on the dataframe API for persisting the model (such as word2vec and LDA), even though mllib is supposed to be based on RDDs. This makes these algorithms unusable for anything larger than toy examples in < Spark 2.

If anyone is familiar with this bug, I would really appreciate it if they could point me in the direction of the pr that fixed it.

Is a 1.6.4 release planned?
Would be possible to backport the dataframe bugfix?

Thanks,
Asher Krim
Senior Software Engineer
Loading...