There seems to be a bug in Spark 1.6.3 that causes the driver to OOM when creating a DataFrame from a large amount of data held in memory on the driver. Examining a heap dump, the driver heap appears to be filled with multiple copies of the data. The following Java code reproduces the bug:
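(The snippet below is a minimal sketch of that repro, assuming local mode with a modest driver heap; the schema, row count, and output path are placeholders rather than the exact original values.)

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class DriverOomRepro {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("driver-oom-repro").setMaster("local[2]");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(jsc);

        // Build a large local dataset on the driver. With a few GB of driver heap,
        // a few million rows of modest strings is enough to trigger the problem.
        int numRows = 5_000_000;
        List<Row> rows = new ArrayList<>(numRows);
        for (int i = 0; i < numRows; i++) {
            rows.add(RowFactory.create((long) i, "some reasonably long payload string number " + i));
        }

        StructType schema = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("id", DataTypes.LongType, false),
            DataTypes.createStructField("payload", DataTypes.StringType, false)
        });

        // Turning the local collection into a DataFrame and writing it out is where
        // the driver heap blows up; a heap dump taken here shows the row data
        // duplicated multiple times.
        DataFrame df = sqlContext.createDataFrame(rows, schema);
        df.write().mode("overwrite").parquet("/tmp/driver-oom-repro");

        jsc.stop();
    }
}
```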
The reason this is an issue is that some machine learning models are represented as fairly large local data structures on the driver, so this bug is hit when attempting to save them. Unfortunately, using mllib instead of ml is not a workaround, because some mllib algorithms (such as Word2Vec and LDA) also rely on the DataFrame API for persisting the model, even though mllib is nominally RDD-based. This makes those algorithms unusable for anything beyond toy examples on Spark versions before 2.x.
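For example (a hypothetical sketch, assuming the spark.mllib Word2Vec Java API in 1.6; the toy corpus, vector size, and output path are placeholders), the save() call below is where the model's driver-local word vectors get turned into a DataFrame, and therefore where the OOM shows up for large models:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.feature.Word2Vec;
import org.apache.spark.mllib.feature.Word2VecModel;

public class Word2VecSaveExample {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext("local[2]", "w2v-save");

        // Toy corpus; a real one yields a vocabulary-sized vector table on the driver.
        JavaRDD<List<String>> sentences = jsc.parallelize(Arrays.asList(
            Arrays.asList("spark", "driver", "memory"),
            Arrays.asList("word", "vectors", "on", "the", "driver")
        ));

        Word2VecModel model = new Word2Vec().setVectorSize(100).setMinCount(1).fit(sentences);

        // save() persists the model via the DataFrame API, building a DataFrame from
        // the driver-local word vectors, which is the code path affected by the bug.
        model.save(jsc.sc(), "/tmp/w2v-model");

        jsc.stop();
    }
}
```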
If anyone is familiar with this bug, I would really appreciate a pointer to the PR that fixed it.
Is a 1.6.4 release planned?
Would it be possible to backport the DataFrame bugfix?