Map with state for RDDs


Map with state for RDDs

Antonin Delpeuch
Hi,

Spark Streaming has a `mapWithState` API to run a map on a stream while
maintaining a state as elements are read.

The core RDD API does not seem to have anything similar. Given an RDD of
elements of type T, an initial state of type S and a map function (S,T)
-> (S,T), return an RDD of Ts obtained by applying the map function in
sequence, updating the state as elements are mapped.

There seems to be some interest on the user mailing list for this:
http://apache-spark-user-list.1001560.n3.nabble.com/Keep-state-inside-map-function-td10968.html
The solution suggested there is to use mapPartitions, but that does not
make it possible to share the state from one partition to another.

I am thinking a proper mapWithState could be implemented with a
dedicated RDD along the lines of what zipWithIndex does. When the RDD is
created, run a job to compute the state at partition boundaries. Then,
store those states in the partitions returned, which lets you iterate
these partitions independently, starting from the stored state.
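The idea can be sketched locally with plain Scala collections, modeling an RDD's partitions as a Seq of Seqs. This is only a sketch of the two-pass structure, not Spark API code; the mapWithState name and signature here are hypothetical:

```scala
object MapWithStateSketch {
  // Model an RDD's partitions as a sequence of sequences.
  // mapWithState applies f sequentially across all partitions,
  // threading the state S through; once each partition's starting
  // state is known, partitions can be mapped independently.
  def mapWithState[S, T, U](partitions: Seq[Seq[T]], initial: S)(
      f: (S, T) => (S, U)): Seq[Seq[U]] = {
    // Pass 1 (the "job" run at RDD creation time): compute the
    // state at each partition boundary by folding f over the
    // elements, discarding the mapped outputs.
    val boundaryStates: Seq[S] =
      partitions.scanLeft(initial) { (state, part) =>
        part.foldLeft(state)((s, t) => f(s, t)._1)
      }.init // the state *entering* each partition

    // Pass 2: map each partition independently, starting from
    // its stored boundary state.
    partitions.zip(boundaryStates).map { case (part, start) =>
      var s = start
      part.map { t =>
        val (s2, u) = f(s, t)
        s = s2
        u
      }
    }
  }
}
```

For example, a running sum over two partitions: mapWithState(Seq(Seq(1, 2), Seq(3, 4)), 0)((s, t) => (s + t, s + t)) yields Seq(Seq(1, 3), Seq(6, 10)).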

Obviously I can do this in my own project with a custom RDD. Would there
be appetite to have this in Spark itself?

Cheers,
Antonin

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Map with state for RDDs

Antonin Delpeuch
The API signature would of course be more general (sorry!):

Given an RDD of elements of type T, an initial state of type S and a map
function (S,T) -> (S,U), return an RDD of Us obtained by applying the
map function in sequence, updating the state as elements are mapped.

With this formulation, zipWithIndex would be a special case of
mapWithState (so it could be refactored to be expressed as such).
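As a local illustration (plain Scala, not the Spark API; the sequential mapWithState below is hypothetical), zipWithIndex falls out of the generalized signature by taking the state to be the next index:

```scala
object ZipWithIndexViaState {
  // Hypothetical sequential mapWithState over a plain Seq,
  // mirroring the proposed (S, T) => (S, U) signature.
  def mapWithState[S, T, U](elems: Seq[T], initial: S)(
      f: (S, T) => (S, U)): Seq[U] = {
    var s = initial
    elems.map { t => val (s2, u) = f(s, t); s = s2; u }
  }

  // zipWithIndex as the special case: S = Long (next index),
  // U = (element, index).
  def zipWithIndex[T](elems: Seq[T]): Seq[(T, Long)] =
    mapWithState(elems, 0L)((i, t) => (i + 1, (t, i)))
}
```

For example, ZipWithIndexViaState.zipWithIndex(Seq("a", "b", "c")) gives Seq(("a", 0L), ("b", 1L), ("c", 2L)).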

Antonin

On 24/05/2020 10:58, Antonin Delpeuch (lists) wrote:




Re: Map with state for RDDs

Antonin Delpeuch
On 24/05/2020 11:27, Antonin Delpeuch (lists) wrote:
> With this formulation, zipWithIndex would be a special case of
> mapWithState (so it could be refactored to be expressed as such).

Forget about this part: zipWithIndex obviously should not be refactored
this way, since it can compute the size of each partition independently,
so expressing it through mapWithState would be inefficient.
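The difference can be sketched locally (plain Scala, modeling partitions as a Seq of Seqs): zipWithIndex's first pass needs only the partition sizes to get each partition's starting offset, whereas a general mapWithState would have to fold the map function over every element of the preceding partitions.

```scala
object BoundaryOffsets {
  // zipWithIndex's first pass: the boundary offsets come from
  // the partition sizes alone, without touching the elements.
  def startOffsets[T](partitions: Seq[Seq[T]]): Seq[Long] =
    partitions.map(_.size.toLong).scanLeft(0L)(_ + _).init
}
```

For example, partitions of sizes 3, 1 and 2 start at offsets 0, 3 and 4.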

Antonin
