MongoDB Spark Connector - Schema Inference


I am using the MongoDB Spark Connector (mongo-spark-connector_2.10) to read MongoDB documents. My question is about schema inference.

I see that the connector uses MongoSinglePartitioner to infer the schema, so sampling a big collection (a few million documents) to infer the schema is very slow. The default sample size is 1000. Is there a reason the connector uses a single partitioner for schema inference instead of multiple partitions? I want to read all fields from a collection, so I sample a large number of records to infer the schema. Right now, schema inference over 1 million records takes 20 minutes.
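To illustrate what I mean by inferring the schema across multiple partitions, here is a minimal sketch in plain Python. It is not the connector's code: the function names are hypothetical, the documents are made up, and a thread pool stands in for Spark partitions. Each partition builds a partial schema from its own documents, and the partial schemas are then merged.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def infer_partial_schema(docs):
    """Map each field name to the set of Python type names seen in one partition."""
    schema = {}
    for doc in docs:
        for field, value in doc.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def merge_schemas(left, right):
    """Union the type sets of two partial schemas, field by field."""
    merged = {field: set(types) for field, types in left.items()}
    for field, types in right.items():
        merged.setdefault(field, set()).update(types)
    return merged


def infer_schema(partitions, max_workers=4):
    """Infer a partial schema per partition concurrently, then merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(infer_partial_schema, partitions))
    return reduce(merge_schemas, partials, {})


# Made-up stand-in for sampled documents split across three partitions.
partitions = [
    [{"_id": 1, "name": "a"}, {"_id": 2, "name": "b", "age": 30}],
    [{"_id": 3, "tags": ["x"]}],
    [{"_id": 4, "age": "unknown"}],
]
schema = infer_schema(partitions)
```

Because the merge is a per-field union, the order in which partitions finish does not matter, which is why I would expect this step to parallelize.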

Is there any way to specify a different partitioner for schema inference to speed it up? Or are there other approaches to inferring the schema from MongoDB for big collections?