Custom file datasource implementations

Navin Viswanath
I’m looking into upgrading our Spark version from 2.4 to 3.0 and noticed that there was an API change that seems unavoidable. I wanted to check whether our implementation is the ideal way to do this and if this change is necessary.
We have files on HDFS in Thrift format and implement custom datasources in V1 by implementing a PartitionedFile => InternalRow conversion (based on the FileFormat trait). Here is how we do that currently:

Given a thrift type T,
val schema: StructType = ... // infer thrift schema for an object of type T
val encoder: ExpressionEncoder[Row] = RowEncoder(schema)
val genericRow: GenericRow = toGenericRow(thriftObject, schema) // converts a thrift object to a GenericRow

Once we have a GenericRow and an ExpressionEncoder, we used to do the following to produce an InternalRow:

val internalRow: InternalRow = encoder.toRow(genericRow)

In 3.0, we now need:

val serializer = encoder.createSerializer()
val internalRow: InternalRow = serializer(genericRow)

I have a couple of questions about this:
1. Do we have to make this change in our sources going from 2.4 to 3.0? Since this change is marked as internal in the PR, I was wondering if this is not the ideal way to implement this.
2. Does moving to Datasource V2 avoid this change? From looking at the V2 API, it looks like I would still need to generate an InternalRow, and an ExpressionEncoder appears to be the only way to do this, which implies I would still need this change.

My goal is to try and keep our source code the same going from 2.4 to 3.0 if possible.
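
For illustration, one way to keep a single source tree might be a small reflection-based shim that picks whichever of the two calls above exists at runtime. This is only a sketch, not something we have tested on both versions: the names RowSerializerCompat and serializerFor are hypothetical, and it assumes that the object returned by createSerializer() can be used as a Row => InternalRow function (which is how it is called above) and that toRow is visible through reflection with its type parameter erased to Object.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Hypothetical helper: hides the 2.4 vs 3.0 difference behind one function type.
object RowSerializerCompat {

  def serializerFor(encoder: ExpressionEncoder[Row]): Row => InternalRow = {
    val cls = encoder.getClass

    // Look for the zero-argument createSerializer() used in the 3.0 snippet above.
    val createSerializer = cls.getMethods.find { m =>
      m.getName == "createSerializer" && m.getParameterCount == 0
    }

    createSerializer match {
      case Some(m) =>
        // 3.0 path. Assumption: the returned serializer is a (Row => InternalRow).
        m.invoke(encoder).asInstanceOf[Row => InternalRow]

      case None =>
        // 2.4 path: fall back to encoder.toRow(row), looked up once and reused.
        val toRow = cls.getMethod("toRow", classOf[Object])
        (row: Row) => toRow.invoke(encoder, row).asInstanceOf[InternalRow]
    }
  }
}
```

With this, the conversion in our sources would become val serialize = RowSerializerCompat.serializerFor(encoder), created once per partition, followed by serialize(genericRow) for each record. The trade-off is a reflective call per row on the 2.4 path and the loss of compile-time checking, so a version-specific source directory may still be the cleaner option.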

Any help is appreciated!