After using Spark for many years, including for SystemML's Spark backend, I'd like to give some feedback on potential PairRDD API extensions that I would find very useful:
1) mapToPair with preservesPartitioning flag: For many binary operations with broadcasts, we always have to use mapPartitionsToPair just to flag the operation as partitioning-preserving, because mapToPair offers no such flag. A simple mapValues (which does preserve partitioning) cannot be used in these situations because the keys are needed for broadcast block lookups.
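The property such a flag would assert can be illustrated without Spark. The sketch below (plain Python, with `hash_partition` and `map_to_pair` as hypothetical stand-ins for a hash-partitioned pair RDD and a key-aware map) shows why the key must be visible to the map function for the broadcast lookup, and why the partitioning remains valid as long as the function leaves keys unchanged:

```python
# Hedged sketch: in-memory stand-in for a hash-partitioned pair RDD.
# A key-aware map that provably keeps keys unchanged can safely claim
# partitioning preservation, which is exactly what the flag would assert.
def hash_partition(pairs, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        parts[hash(k) % num_partitions].append((k, v))
    return parts

broadcast = {1: 10, 2: 20, 3: 30}  # stands in for a broadcast variable

def map_to_pair(parts, f):
    # key-aware map: the key is passed to f so it can index the broadcast;
    # mapValues could not do this because it hides the key
    return [[f(k, v) for k, v in p] for p in parts]

parts = hash_partition([(1, 'a'), (2, 'b'), (3, 'c')], 2)
out = map_to_pair(parts, lambda k, v: (k, (v, broadcast[k])))

# keys are unchanged, so the original hash partitioning still holds
assert all(hash(k) % 2 == i for i, p in enumerate(out) for k, _ in p)
```

In Spark itself this guarantee cannot be checked cheaply at runtime, which is why it would have to be an explicit flag supplied by the caller, as it already is for mapPartitionsToPair.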
2) Multi-key lookups: For indexing operations over hash-partitioned out-of-core RDDs, a lookup that takes a list of keys would be very helpful, because filter and join have to read all RDD partitions just to deserialize them and inspect the keys. We currently emulate an efficient multi-key lookup by computing a probe set of partition ids and wrapping Spark's PartitionPruningRDD around the input RDD.
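The mechanics of this emulation can be sketched without Spark. Below, `partition_id` is a hypothetical stand-in for HashPartitioner's partition assignment, and the pruned scan plays the role of PartitionPruningRDD: only partitions that can contain a probe key are touched, and only their pairs are deserialized and filtered.

```python
# Hedged sketch of the multi-key lookup emulation: compute the probe set of
# partition ids from the keys, then scan only those partitions.
def partition_id(key, num_partitions):
    # stand-in for Spark's HashPartitioner assignment (assumption)
    return hash(key) % num_partitions

def multi_key_lookup(parts, keys, num_partitions):
    keyset = set(keys)
    probe = {partition_id(k, num_partitions) for k in keyset}
    # analogue of PartitionPruningRDD: untouched partitions are never read
    return [(k, v) for pid in sorted(probe)
            for (k, v) in parts[pid] if k in keyset]

parts = [[(2, 'b'), (4, 'd')], [(1, 'a'), (3, 'c')]]
hits = multi_key_lookup(parts, [1, 3], 2)
```

For keys 1 and 3 the probe set contains a single partition id, so partition 0 is never deserialized; with a native multi-key lookup this pruning would happen inside Spark.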
3) Streaming collect/parallelize: For applications that manage their own driver and executor memory, collect and parallelize pose challenges. The collect action requires driver memory for both the target in-memory representation and the collected list of key-value pairs. Unfortunately, the existing toLocalIterator is even slower than writing the RDD to HDFS and reading it back into the driver. Similarly, strong references to parallelized RDDs pin memory at the driver; an (unsafe) flag to instead pin the RDD at a persistent storage level on the executors would significantly simplify driver memory management.
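The memory argument for a streaming collect can be made concrete with a small sketch. Here `to_local_iterator` is a hypothetical stand-in for fetching partitions from executors one at a time; pairs are folded directly into the target representation, so peak driver memory is one partition plus the target, rather than the full pair list plus the target as with collect:

```python
# Hedged sketch of streaming collect: consume one partition at a time and
# build the target in-memory representation incrementally, never holding
# the complete list of key-value pairs at the driver.
def to_local_iterator(partitions):
    # stands in for fetching partitions from executors one by one
    for part in partitions:
        yield from part

partitions = [[(2, 'b'), (4, 'd')], [(1, 'a'), (3, 'c')]]

dest = {}
for k, v in to_local_iterator(partitions):
    dest[k] = v  # fold each pair into the target representation
```

A fast native implementation of this pattern is exactly what the existing toLocalIterator fails to provide in practice, per the timing observation above.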
One can work around these minor issues, but it might be worthwhile to address them once in Spark via API extensions, for the benefit of many users.