[DISCUSS] Reducing memory usage of toPandas with Arrow "self_destruct" option
We've been working with PySpark and Pandas, and have found that to
convert a dataset using N bytes of memory to Pandas, we need to have
2N bytes free, even with the Arrow optimization enabled. The
fundamental reason is ARROW-3789: Arrow does not free the Arrow
table until conversion finishes, so there are 2 copies of the dataset
We'd like to improve this by taking advantage of the Arrow
"self_destruct" option available in Arrow >= 0.16. When converting a
suitable[*] Arrow table to a Pandas dataframe, it avoids the
worst-case 2x memory usage, with something more like ~25% overhead
instead, by freeing the columns in the Arrow table after converting
each column instead of at the end of conversion.
Does this sound like a desirable optimization to have in Spark? If so,
how should it be exposed to users? As discussed below, there are cases
where a user may or may not want it enabled.
There are some cases where you may _not_ want this optimization,
however, so the patch leaves it as a toggle. Is this the API we'd
want, or would we prefer a different API (e.g. a configuration flag)?
The reason we may not want this enabled by default is that the related
split_blocks option is more likely to find zero-copy opportunities,
which will result in the Pandas dataframe being backed by immutable
buffers. Some Pandas operations will error in these cases, e.g. .
Also, to minimize memory usage, we set use_threads=False to converts
each column sequentially, rather than in parallel, but this slows down
the conversion somewhat. One option here may be to set self_destruct
by default, but relegate the other two options (which further save
memory) to a toggle, and I can measure the impact of this if desired.