Deduplication in RDD


This post has NOT been accepted by the mailing list yet.


I am new to Spark and would like to know how deduplication is handled in RDDs for unstructured data. I have seen papers and articles on removing duplicate records from structured data, but I have not found much on unstructured data. Can an RDD itself be deduplicated? Doing so might allow more RDDs to fit in memory, and processing a deduplicated RDD might reduce the processing time. Can anyone point me to references on this topic? I am very interested in how deduplication works in a data processing engine like Spark. Thanks in advance to anyone who can provide information on this; I really appreciate your help.
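To make the question concrete, here is a minimal sketch of what I mean by deduplicating unstructured records. It is plain Python, not Spark code, and the function names (normalize, fingerprint, deduplicate) and the normalization rule are my own assumptions for illustration only. The idea is to hash the normalized content of each record and keep the first record seen for each hash.

```python
import hashlib


def normalize(text):
    # Hypothetical normalization rule (an assumption, not a Spark feature):
    # collapse runs of whitespace and ignore case.
    return " ".join(text.split()).lower()


def fingerprint(text):
    # Content hash of the normalized record; equal fingerprints are
    # treated as duplicates.
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def deduplicate(records):
    # Keep the first record seen for each fingerprint, preserving order.
    seen = set()
    unique = []
    for record in records:
        key = fingerprint(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    "Spark is a data processing engine.",
    "spark  is a data processing engine.",  # same text, different case and spacing
    "RDDs are immutable.",
    "Spark is a data processing engine.",   # exact duplicate
]
unique_records = deduplicate(records)
```

As far as I can tell, the closest built-in operation is RDD.distinct(), which removes exact duplicates, and the fingerprint approach above would presumably map to keying each record by its hash and then using reduceByKey to keep one record per key. I am not sure whether that is the recommended way, which is part of what I am asking.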