Data Property Accumulators

Data Property Accumulators

Holden Karau
Are folks interested in seeing data property accumulators for RDDs? I made a proposal for this back in 2016 ( https://docs.google.com/document/d/1lR_l1g3zMVctZXrcVjFusq2iQVpr4XvRK_UUDsDr6nk/edit ), but ABI compatibility was a stumbling block I couldn't design around. I can look at reviving it for Spark 3, or just close out this idea.

--
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

Re: Data Property Accumulators

Erik Erlandson

I'm wondering whether keeping track of accumulation in "consistent mode" is a case for mapping straight to a Try value, so that parsedData has type RDD[Try[...]], and counting failures is parsedData.filter(_.isFailure).count, etc.
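
For concreteness, a minimal sketch of that pattern (the input data and surrounding setup are hypothetical, not from the proposal):

    import scala.util.Try
    import org.apache.spark.{SparkConf, SparkContext}

    object TrySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("try-sketch").setMaster("local[*]"))

        // Hypothetical input where some records fail to parse.
        val raw = sc.parallelize(Seq("1", "2", "oops", "4"))

        // parsedData: RDD[Try[Int]] -- failures live in the data itself,
        // inside the normal RDD compute model, not in a side channel.
        val parsedData = raw.map(line => Try(line.toInt))

        // An ordinary action, so task retries and recomputation cannot
        // double-count the way legacy accumulators can.
        val failures = parsedData.filter(_.isFailure).count()
        println(s"parse failures: $failures")

        sc.stop()
      }
    }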

Put another way: consistent-mode accumulation seems (to me) like it is trying to obey Spark's RDD compute model, in contrast with legacy accumulators, which subvert that model. I think the fact that your "option 3" sends information about accumulators down through the mapping-function API, as well as passing it through an Option stage, also hints at that idea.

That might mean the idiomatic way to do consistent mode is via the existing Spark API, using constructs like Try, Either, Option, Tuple, or just a new column carrying additional accumulator channels.
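
A sketch of the tuple/extra-channel variant (again with names of my own invention), where each record carries its accumulator contributions as ordinary data and they are reduced like anything else:

    import scala.util.Try
    import org.apache.spark.{SparkConf, SparkContext}

    object ChannelSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("channel-sketch").setMaster("local[*]"))

        val raw = sc.parallelize(Seq("1", "2", "oops", "4"))

        // Each record becomes (parsed value, (ok count, failure count)):
        // the "accumulator channels" are just more columns of data.
        val withChannels = raw.map { line =>
          Try(line.toInt).toOption match {
            case Some(v) => (Some(v), (1L, 0L))
            case None    => (None, (0L, 1L))
          }
        }

        // Aggregating the channels is a plain reduce over the data,
        // so the counts are consistent by construction.
        val (ok, failed) = withChannels
          .map(_._2)
          .reduce { case ((o1, f1), (o2, f2)) => (o1 + o2, f1 + f2) }

        println(s"ok = $ok, failed = $failed")
        sc.stop()
      }
    }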

