Contract for PartitionReader/InputPartition for ColumnarBatch?

Contract for PartitionReader/InputPartition for ColumnarBatch?

Micah Kornfield
Hello spark-dev,

Looking at ColumnarBatch [1] it seems to indicate a single object is meant to be used for the entire loading process.

Does this imply that Spark assumes a ColumnarBatch returned by an InputPartitionReader/PartitionReader [2][3], along with any direct references into it (e.g. UTF8Strings), is invalidated once "next()" is called on the reader?

Does the same apply for InternalRow?

Does it make sense to update the contracts one way or the other? (I'm happy to make a PR.)
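[Editor's note: for context, the DSv2 reader contract under discussion is paraphrased below. This is a sketch, not the exact Spark source; the real interface is org.apache.spark.sql.connector.read.PartitionReader.]

```java
import java.io.Closeable;
import java.io.IOException;

// Paraphrase of the PartitionReader contract in question. The open issue is
// whether the object returned by get() (an InternalRow or ColumnarBatch) may
// be invalidated by the next call to next().
interface PartitionReaderSketch<T> extends Closeable {
    boolean next() throws IOException; // advance to the next record/batch
    T get();                           // the record/batch at the current position
}
```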

Thanks,
Micah


Re: Contract for PartitionReader/InputPartition for ColumnarBatch?

Bobby Evans
Micah,

You are correct. The contract for processing ColumnarBatches is that the code that produced the batch is responsible for closing it, and anything downstream cannot keep references to it. This is just like UnsafeRow: if an UnsafeRow is cached, as it is for aggregates or sorts, it must be copied into a separate memory buffer. This does not lend itself to efficient memory management when doing columnar processing, but for the intended purpose of loading columnar data and immediately turning it into rows, it works fine.
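[Editor's note: a toy illustration of that copy-on-cache rule, with hand-rolled stand-in types rather than Spark's actual UnsafeRow or reader classes. The reader reuses one mutable buffer across next() calls, so cached references all alias the final value unless the consumer makes a defensive copy.]

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a mutable row backed by a reused buffer (not a Spark class).
class MutableRow {
    int value;
    MutableRow copy() {
        MutableRow fresh = new MutableRow();
        fresh.value = this.value;
        return fresh;
    }
}

// A reader that reuses one row object, mirroring the contract Bobby describes.
class ReusingReader {
    private final int[] data = {1, 2, 3};
    private int pos = -1;
    private final MutableRow row = new MutableRow(); // single reused buffer

    boolean next() {
        if (++pos >= data.length) return false;
        row.value = data[pos]; // overwrites in place; earlier get() results go stale
        return true;
    }

    MutableRow get() { return row; } // always the same object
}

public class CopyOnCacheDemo {
    public static void main(String[] args) {
        ReusingReader r = new ReusingReader();
        List<MutableRow> refs = new ArrayList<>();
        List<MutableRow> copies = new ArrayList<>();
        while (r.next()) {
            refs.add(r.get());          // broken: every element aliases one buffer
            copies.add(r.get().copy()); // correct: defensive copy, as for aggregates/sorts
        }
        System.out.println(refs.get(0).value);   // prints 3 (stale alias)
        System.out.println(copies.get(0).value); // prints 1 (copied value)
    }
}
```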

Any change to this contract would require performance testing, because several of the input formats are written to reuse the batch/memory buffer. Spark is configured by default to keep the batch size small so that the batch can fit in the CPU cache. A change to the contract could mean a lot of object churn for GC to handle, or some possibly complex reference-counting and memory-reuse code.
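[Editor's note: the reused-batch size mentioned here is controlled by configs like the following; names and defaults are from recent Spark versions and worth double-checking against yours.]

```
spark.sql.parquet.columnarReaderBatchSize  4096   # rows per reused batch, Parquet vectorized reader
spark.sql.orc.columnarReaderBatchSize      4096   # rows per reused batch, ORC vectorized reader
```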

I personally would prefer to see this change, because we are doing columnar processing with many transformations and don't want to keep memory statically allocated for them. In our plugin, the consumer is responsible for closing the incoming batch; our batch sizes are much larger, so GC pressure is less of an issue. The only cost for us is managing the transition between the Spark columnar model and our plugin's internal columnar model. Not a big deal, though.
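[Editor's note: a minimal sketch of that alternative "consumer closes" contract, using hypothetical types rather than the Spark or plugin API. The producer hands ownership of each batch to the consumer, which releases it when finished, e.g. via try-with-resources.]

```java
// Hypothetical types illustrating consumer-owned batch lifetime; a real
// implementation would release off-heap buffers in close().
class OwnedBatch implements AutoCloseable {
    private boolean closed = false;
    boolean isClosed() { return closed; }
    @Override public void close() { closed = true; } // release resources here
}

class BatchProducer {
    OwnedBatch nextBatch() {
        return new OwnedBatch(); // fresh batch per call; the caller now owns it
    }
}

public class ConsumerClosesDemo {
    public static void main(String[] args) {
        BatchProducer producer = new BatchProducer();
        // try-with-resources makes the hand-off and the release point explicit
        try (OwnedBatch batch = producer.nextBatch()) {
            // the consumer may keep this reference as long as it likes,
            // but must not use it after close()
        }
    }
}
```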

Thanks,

Bobby

On Sat, Jun 27, 2020 at 11:28 PM Micah Kornfield <[hidden email]> wrote: