DSV2 API Question


DSV2 API Question

Andrew Melo
Hello,

I've (nearly) implemented a DSV2-reader interface to read particle physics data stored in the ROOT (https://root.cern.ch/) file format. You can think of these ROOT files as roughly Parquet-like: column-wise and nested (i.e. a column can be of type "float[]", meaning each row in the column is a variable-length array of floats). The overwhelming majority of our columns are these variable-length arrays, since they represent physical quantities that vary widely with each particle collision*.
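To make the column shape concrete, here is a small illustration (the class and field names are made up for this sketch; this is not ROOT's or Spark's API). Each row of the column holds a different number of floats, one per particle produced in that event:

```java
// Hypothetical illustration of a variable-length "float[]" column:
// one row per collision event, each row a different-length array.
final class EventColumns {
    static final float[][] ELECTRON_MOMENTUM = {
        {12.3f, 45.6f},  // event 0: two electrons produced
        {},              // event 1: no electrons
        {7.8f}           // event 2: one electron
    };
}
```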

Exposing these columns via the "SupportsScanColumnarBatch" interface has raised a question I have about the DSV2 API. I know the interface is currently Evolving, but I don't know if this is the appropriate place to ask about it (I presume JIRA would be a good place as well, but I had trouble finding exactly where the best place to join the discussion is).

There is no provision in the org.apache.spark.sql.vectorized.ColumnVector interface to return multiple rows of arrays (i.e. no "getArrays" analogue to "getArray"). A big use case we have is to pipe these data through UDFs, so it would be nice to get the data from the file into a UDF batch without converting to an intermediate row-wise representation. Looking into ColumnarArray, however, it seems that instead of storing a single offset and length, it could be extended to hold arrays of "offsets" and "lengths". The public interface could remain the same by adding a second constructor that accepts arrays and keeping the existing constructor as the degenerate case of a 1-length array.
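A minimal standalone sketch of the idea (this is NOT Spark's actual ColumnarArray API; the class below is hypothetical and dependency-free): a batch-level view backed by parallel "offsets" and "lengths" arrays, with the existing single-row form expressed as the degenerate 1-length case.

```java
// Hypothetical sketch of a ColumnarArray-like view that covers many rows
// at once. Flattened element data plus per-row (offset, length) pairs.
final class JaggedFloatColumn {
    private final float[] values;   // flattened element data for all rows
    private final int[] offsets;    // start index of each row's array
    private final int[] lengths;    // element count of each row's array

    // Proposed multi-row constructor: one (offset, length) pair per row.
    JaggedFloatColumn(float[] values, int[] offsets, int[] lengths) {
        this.values = values;
        this.offsets = offsets;
        this.lengths = lengths;
    }

    // Existing single-row shape as the degenerate 1-length case.
    JaggedFloatColumn(float[] values, int offset, int length) {
        this(values, new int[] {offset}, new int[] {length});
    }

    int numRows() { return offsets.length; }

    // Copy out row i's variable-length array (the "getArray" analogue).
    float[] getArray(int i) {
        float[] out = new float[lengths[i]];
        System.arraycopy(values, offsets[i], out, 0, lengths[i]);
        return out;
    }
}
```

The point of the two constructors is that callers using the single-row form would not need to change, while batch producers could hand over a whole column of jagged rows in one object.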


* e.g. the "electron_momentum" column will have a different number of entries in each row, one for each electron produced in the collision.

Re: DSV2 API Question

Bobby Evans-2
Columnar UDFs are still a work in progress. For now, all UDFs are row-based, and in fact, all processing is row-based. We are working on plumbing more columnar support into Spark (https://github.com/apache/spark/pull/24795), but it is going to be a little while before we are at the point where we could support doing what you want.

The APIs for ColumnVector were really designed to provide an iterator that goes from columnar-formatted data to row-formatted data. When we get to the point where we can support columnar UDFs, some of the APIs in ColumnVector are likely to be expanded so you can do some of the things you are requesting.
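The columnar-to-row bridge described above can be sketched roughly as follows (a simplified, self-contained illustration, not Spark's actual internals): a batch stores data column-major, and a row iterator materializes one logical row at a time by gathering the value at the same row index from every column.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Simplified sketch of iterating a columnar batch as rows.
final class ColumnBatch implements Iterable<double[]> {
    private final double[][] columns; // columns[c][r]: column-major storage
    private final int numRows;

    ColumnBatch(double[][] columns, int numRows) {
        this.columns = columns;
        this.numRows = numRows;
    }

    // Each next() gathers one value per column into a single row.
    public Iterator<double[]> iterator() {
        return new Iterator<double[]>() {
            private int row = 0;
            public boolean hasNext() { return row < numRows; }
            public double[] next() {
                if (!hasNext()) throw new NoSuchElementException();
                double[] r = new double[columns.length];
                for (int c = 0; c < columns.length; c++) r[c] = columns[c][row];
                row++;
                return r;
            }
        };
    }
}
```

This is why row-based operators can consume a columnar source today: the batch is unwound row by row at the boundary, which is exactly the intermediate row-wise conversion the original question was hoping to avoid.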

I hope this helps,

Bobby

On Tue, Jun 25, 2019 at 4:24 PM Andrew Melo <[hidden email]> wrote: