DSv2 reader lifecycle

DSv2 reader lifecycle

Andrew Melo
Hello,

During testing of our DSv2 implementation (on 2.4.3 FWIW), it appears that our DataSourceReader is being instantiated multiple times for the same dataframe. For example, the following snippet

        Dataset<Row> df = spark
                .read()
                .format("edu.vanderbilt.accre.laurelin.Root")
                .option("tree", "Events")
                .load("testdata/pristine/2018nanoaod1june2019.root");

constructs edu.vanderbilt.accre.laurelin.Root twice and then calls createReader once. (As an aside, the code generation time seems like a lot for 1000 columns: "CodeGenerator: Code generated in 8162.847517 ms".)

but then each operation on that dataframe (e.g. df.count()) calls createReader again, instead of reusing the existing DataSourceReader.

Is that the expected behavior? Because of the file format, it's quite expensive to deserialize all the various metadata, so I was holding the deserialized version in the DataSourceReader; but if Spark repeatedly constructs new readers, that caching doesn't help. If this is the expected behavior, how should I handle it as a consumer of the API?

Thanks!
Andrew
Re: DSv2 reader lifecycle

Ryan Blue
Hi Andrew,

This is expected behavior for DSv2 in 2.4. A separate reader is configured for each operation because the configuration can change between operations. A count, for example, doesn't need to project any columns, but a count distinct will. Similarly, if your reads use different filters, those need to be pushed to a separate reader for each scan.
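
To make that concrete, here's a sketch against your snippet (nMuon is a made-up column name; each action below plans its own scan, so each one triggers its own createReader with its own projection and filter configuration):

        import static org.apache.spark.sql.functions.col;

        // Reusing the df from your snippet; each action plans a new scan:
        df.count();                                // projects no columns at all
        df.select("nMuon").distinct().count();     // projects a column for the distinct
        df.filter(col("nMuon").gt(0)).count();     // pushes a filter to its own reader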

The newer API that we are releasing in Spark 3.0 addresses the concern that each reader is independent by using Catalog and Table interfaces. In the new version, Spark will load a table by name from a persistent catalog (loaded once) and use the table to create a reader for each operation. That way, you can load common metadata in the table, cache the table, and pass its info to readers as they are created.
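
In rough code terms, the table side looks something like this (a sketch against the 3.0 connector interfaces; RootTable, RootScanBuilder, and RootMetadata are placeholder names, and most of the plumbing is stubbed out):

        import java.util.Collections;
        import java.util.Set;
        import org.apache.spark.sql.connector.catalog.SupportsRead;
        import org.apache.spark.sql.connector.catalog.Table;
        import org.apache.spark.sql.connector.catalog.TableCapability;
        import org.apache.spark.sql.connector.read.Scan;
        import org.apache.spark.sql.connector.read.ScanBuilder;
        import org.apache.spark.sql.types.StructType;
        import org.apache.spark.sql.util.CaseInsensitiveStringMap;

        public class RootTable implements Table, SupportsRead {
            // Placeholder for the expensive deserialized state
            public static class RootMetadata {
                public StructType schema() { return new StructType(); }
            }

            // Placeholder; a real implementation builds a Scan per operation
            public static class RootScanBuilder implements ScanBuilder {
                RootScanBuilder(RootMetadata meta, CaseInsensitiveStringMap options) {}
                @Override
                public Scan build() {
                    throw new UnsupportedOperationException("sketch only");
                }
            }

            // Expensive metadata, loaded once when the table is resolved
            private final RootMetadata cachedMetadata;

            public RootTable(RootMetadata cachedMetadata) {
                this.cachedMetadata = cachedMetadata;
            }

            @Override
            public ScanBuilder newScanBuilder(CaseInsensitiveStringMap options) {
                // Called once per operation; every scan shares the cached metadata
                return new RootScanBuilder(cachedMetadata, options);
            }

            @Override
            public String name() { return "root"; }

            @Override
            public StructType schema() { return cachedMetadata.schema(); }

            @Override
            public Set<TableCapability> capabilities() {
                return Collections.singleton(TableCapability.BATCH_READ);
            }
        }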

rb

--
Ryan Blue
Software Engineer
Netflix
Re: DSv2 reader lifecycle

Andrew Melo
Hi Ryan,

Thanks for the pointers.

On Thu, Nov 7, 2019 at 8:13 AM Ryan Blue <[hidden email]> wrote:
> Hi Andrew,
>
> This is expected behavior for DSv2 in 2.4. A separate reader is configured for each operation because the configuration can change between operations. A count, for example, doesn't need to project any columns, but a count distinct will. Similarly, if your reads use different filters, those need to be pushed to a separate reader for each scan.

Ah, I had presumed the interaction was slightly different: a single reader would be configured, and (e.g.) pruneColumns would be called repeatedly to change the desired output schema. I guess for 2.4 it's best for me to cache/memoize the metadata for paths/files to keep it from being repeatedly recalculated.
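
Something like this static memoization is what I have in mind (a sketch; RootFileMetadata and the deserialization step are placeholders for our actual types):

        import java.util.concurrent.ConcurrentHashMap;

        public class MetadataCache {
            // Placeholder for our deserialized file-format metadata
            public static class RootFileMetadata { /* ... */ }

            private static final ConcurrentHashMap<String, RootFileMetadata> CACHE =
                    new ConcurrentHashMap<>();

            // computeIfAbsent deserializes at most once per path per JVM,
            // no matter how many DataSourceReaders Spark constructs
            public static RootFileMetadata forPath(String path) {
                return CACHE.computeIfAbsent(path, MetadataCache::deserialize);
            }

            private static RootFileMetadata deserialize(String path) {
                // the expensive metadata deserialization would go here
                return new RootFileMetadata();
            }
        }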
 

> The newer API that we are releasing in Spark 3.0 addresses the concern that each reader is independent by using Catalog and Table interfaces. In the new version, Spark will load a table by name from a persistent catalog (loaded once) and use the table to create a reader for each operation. That way, you can load common metadata in the table, cache the table, and pass its info to readers as they are created.

That's good to know; I'll search around JIRA for docs describing that functionality.

Thanks again,
Andrew
 
