[Spark SQL] Making InferSchema and JacksonParser public

Brian Hong
I work for a mobile game company. I'm trying to answer a simple question: "Can we efficiently and cheaply query the logs of a particular user within a given date range?"

I've created a special JSON text-based file format that has these traits:
 - Snappy compressed, saved in AWS S3
 - Partitioned by date, e.g. 2017-01-01.sz, 2017-01-02.sz, ...
 - Sorted by a primary key (log_type) and a secondary key (user_id), Snappy block-compressed in 5 MB blocks
 - Blocks are indexed with primary/secondary key in file 2017-01-01.json
 - Efficient block based random access on primary key (log_type) and secondary key (user_id) using the index
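
To make the index idea concrete, here is a minimal sketch of a block lookup in plain Python. The index layout is an assumption for illustration only: the `BlockEntry` fields, the `blocks_for_key` helper, and the sample offsets are hypothetical and are not taken from the actual 2017-01-01.json index.

```python
from bisect import bisect_left, bisect_right
from typing import List, NamedTuple, Tuple

Key = Tuple[str, str]  # (log_type, user_id)


class BlockEntry(NamedTuple):
    """One index record per compressed block (hypothetical layout)."""
    first_key: Key  # smallest key stored in the block
    last_key: Key   # largest key stored in the block
    offset: int     # byte offset of the compressed block in the .sz file
    length: int     # compressed length in bytes


def blocks_for_key(index: List[BlockEntry], key: Key) -> List[BlockEntry]:
    """Return the blocks whose [first_key, last_key] range can contain `key`.

    Assumes `index` is in file order, so first_key and last_key are both
    non-decreasing from one entry to the next.
    """
    first_keys = [b.first_key for b in index]
    last_keys = [b.last_key for b in index]
    lo = bisect_left(last_keys, key)    # first block with last_key >= key
    hi = bisect_right(first_keys, key)  # one past the last block with first_key <= key
    return index[lo:hi]


# Made-up index for one day; a run of equal keys may straddle a block boundary.
index = [
    BlockEntry(("login", "u001"), ("login", "u500"), 0, 100),
    BlockEntry(("login", "u500"), ("purchase", "u120"), 100, 120),
    BlockEntry(("purchase", "u120"), ("purchase", "u900"), 220, 90),
]

hits = blocks_for_key(index, ("login", "u500"))
print([(b.offset, b.length) for b in hits])
```

Only the byte ranges returned by the lookup would then need to be fetched (for example with a ranged read from S3) and decompressed, which is where the random-access saving would come from.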

I've created a Spark SQL DataFrame relation that can query this file format.  Since the schema of each log type is fairly consistent, I've reused the `InferSchema.inferSchema` method and `JacksonParser` from the Spark SQL code to support structured querying.  I've also implemented filter push-down to optimize file access.
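
For readers unfamiliar with JSON schema inference, the general idea can be sketched in plain Python: infer a type for each record, then merge the per-record types into one schema. This is a toy illustration only. It is not Spark's `InferSchema` code, and the type names, the `infer_type`/`merge_types`/`infer_schema` helpers, and the fall-back-to-string rule are assumptions made for this example.

```python
import json
from typing import Any, Iterable


def infer_type(value: Any) -> Any:
    """Map one parsed JSON value to a toy type description."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        elem: Any = "null"
        for item in value:
            elem = merge_types(elem, infer_type(item))
        return {"array": elem}
    if isinstance(value, dict):
        return {"struct": {k: infer_type(v) for k, v in value.items()}}
    raise TypeError(f"unsupported JSON value: {value!r}")


def merge_types(a: Any, b: Any) -> Any:
    """Find a type that can hold values of both `a` and `b`."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if isinstance(a, str) and isinstance(b, str) and {a, b} == {"long", "double"}:
        return "double"  # widen integers to floating point
    if isinstance(a, dict) and isinstance(b, dict):
        if "struct" in a and "struct" in b:
            fa, fb = a["struct"], b["struct"]
            merged = {}
            for name in list(fa) + [n for n in fb if n not in fa]:
                if name in fa and name in fb:
                    merged[name] = merge_types(fa[name], fb[name])
                else:
                    merged[name] = fa[name] if name in fa else fb[name]
            return {"struct": merged}
        if "array" in a and "array" in b:
            return {"array": merge_types(a["array"], b["array"])}
    return "string"  # toy rule: conflicting types fall back to string


def infer_schema(lines: Iterable[str]) -> Any:
    """Merge the inferred types of all non-empty JSON lines."""
    schema: Any = "null"
    for line in lines:
        if line.strip():
            schema = merge_types(schema, infer_type(json.loads(line)))
    return schema


sample = [
    '{"user_id": "u1", "level": 3, "device": null}',
    '{"user_id": "u2", "level": 3.5, "items": [1, 2]}',
]
print(infer_schema(sample))
```

Because each log type has a fairly consistent schema, a merge like this would be expected to settle after a small sample of records, which is presumably why sampling works well here.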

It is very fast when querying for a single user, or when querying a single log type with a sampling ratio of 10000 to 1, compared to the Parquet file format.  (We do use Parquet for some log types when we need batch analysis.)
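
The filter push-down mentioned above is presumably what makes single-user and single-log-type queries cheap. A rough sketch of how equality filters might be turned into key ranges over the sorted (log_type, user_id) order follows. The `(column, operator, value)` filter tuples, the `key_ranges` helper, and the `MIN_ID`/`MAX_ID` sentinels are all hypothetical and are not part of any Spark API.

```python
from typing import Iterable, List, Tuple

Key = Tuple[str, str]           # (log_type, user_id)
Filter = Tuple[str, str, str]   # (column, operator, value), e.g. ("user_id", "=", "u42")

MIN_ID = ""            # sorts before every user_id
MAX_ID = "\U0010ffff"  # assumed to sort after every real user_id


def key_ranges(filters: Iterable[Filter],
               known_log_types: Iterable[str]) -> List[Tuple[Key, Key]]:
    """Turn equality filters into inclusive (low, high) key ranges.

    Filters that are not equality tests are ignored here; the query engine
    would still have to evaluate them on the rows that are read.
    """
    eq = {col: val for col, op, val in filters if op == "="}
    if "log_type" in eq:
        log_types = [eq["log_type"]]
    else:
        # No log_type filter: probe the user's range inside every log type.
        log_types = sorted(known_log_types)
    user_id = eq.get("user_id")
    ranges = []
    for log_type in log_types:
        if user_id is None:
            ranges.append(((log_type, MIN_ID), (log_type, MAX_ID)))
        else:
            ranges.append(((log_type, user_id), (log_type, user_id)))
    return ranges


# A single-user query touches one narrow range per log type.
print(key_ranges([("user_id", "=", "u42")], ["purchase", "login"]))
```

Each resulting range can be matched against the block index, so only blocks overlapping a range need to be read.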

One of the problems we face is that the methods mentioned above are private APIs, so we are forced to resort to hacks to call them (things like copying the code or placing our code in the org.apache.spark.sql package namespace).

I've been following the Spark SQL code since 1.4, and the JSON schema inference code and JacksonParser seem to have been relatively stable recently.  Can the core devs make these APIs public?

We are willing to open source this file format because it works very well for archiving user-related logs in S3.  The dependency on private APIs in Spark SQL is the main hurdle to making this a reality.

Thank you for reading!

Re: [Spark SQL] Making InferSchema and JacksonParser public

Michael Allman-2
Personally, I'd love to see some kind of pluggability or configurability in the JSON schema parsing, maybe as an option in the DataFrameReader. Perhaps you can propose an API?

On Jan 18, 2017, at 5:51 AM, Brian Hong <[hidden email]> wrote:

[...]

Re: [Spark SQL] Making InferSchema and JacksonParser public

rxin
In reply to this post by Brian Hong
That is internal, but the amount of code is not a lot. Can you just copy the relevant classes over to your project?

On Wed, Jan 18, 2017 at 5:52 AM Brian Hong <[hidden email]> wrote:
[...]

Re: [Spark SQL] Making InferSchema and JacksonParser public

Brian Hong
Yes, that is the option I took while implementing this under Spark 1.4.  But every time there is a major update to Spark, I need to re-copy the relevant parts, which is very time-consuming.

The reason is that InferSchema and JacksonParser use many other Spark internal methods, which makes them very hard to copy and maintain.

Thanks!

On Thu, Jan 19, 2017 at 2:41 AM Reynold Xin <[hidden email]> wrote:
That is internal, but the amount of code is not a lot. Can you just copy the relevant classes over to your project?
