[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates


[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954345#comment-13954345 ]

ASF GitHub Bot commented on SPARK-732:
--------------------------------------

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/228#issuecomment-39002053
 
    Merged build finished. Build is starting -or- tests failed to complete.


> Recomputation of RDDs may result in duplicated accumulator updates
> ------------------------------------------------------------------
>
>                 Key: SPARK-732
>                 URL: https://issues.apache.org/jira/browse/SPARK-732
>             Project: Apache Spark
>          Issue Type: Bug
>    Affects Versions: 0.7.0, 0.6.2, 0.7.1, 0.8.0, 0.7.2, 0.7.3, 0.8.1, 0.9.0, 0.8.2
>            Reporter: Josh Rosen
>            Assignee: Nan Zhu
>             Fix For: 1.0.0
>
>
> Currently, Spark doesn't guard against duplicated updates to the same accumulator due to recomputations of an RDD.  For example:
> {code}
>     val acc = sc.accumulator(0)
>     data.map { x => acc += 1; f(x) }
>     data.count()
>     // acc should equal data.count() here
>     data.foreach{...}
>     // Now, acc = 2 * data.count() because the map() was recomputed.
> {code}
> I think that this behavior is incorrect, especially because it allows the addition or removal of a cache() call to affect the outcome of a computation.
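> As an aside for illustration only (this sketch is not from the original report): assuming a SparkContext sc and an RDD data as in the snippet above, the cache() dependence can be seen in a minimal sketch:
> {code}
>     val acc = sc.accumulator(0)
>     val mapped = data.map { x => acc += 1; x }
>     // Without a cache() call, both actions below recompute the map stage,
>     // so acc ends up at 2 * data.count().
>     // Uncommenting the next line makes the map run once, so acc == data.count().
>     // mapped.cache()
>     mapped.count()
>     mapped.foreach(_ => ())
> {code}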
> There's an old TODO to fix this duplicate update issue in the [DAGScheduler code|https://github.com/mesos/spark/blob/ec5e553b418be43aa3f0ccc24e0d5ca9d63504b2/core/src/main/scala/spark/scheduler/DAGScheduler.scala#L494].
> I haven't tested whether recomputation due to blocks being dropped from the cache can trigger duplicate accumulator updates.
> Hypothetically someone could be relying on the current behavior to implement performance counters that track the actual number of computations performed (including recomputations).  To be safe, we could add an explicit warning in the release notes that documents the change in behavior when we fix this.
> Ignoring duplicate updates shouldn't be too hard, but there are a few subtleties.  Currently, we allow accumulators to be used in multiple transformations, so we'd need to detect duplicate updates at the per-transformation level.  I haven't dug too deeply into the scheduler internals, but we might also run into problems where pipelining causes what is logically one set of accumulator updates to show up in two different tasks (e.g. rdd.map(accum += x; ...) and rdd.map(accum += x; ...).count() may cause what's logically the same accumulator update to be applied from two different contexts, complicating the detection of duplicate updates).
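> As a purely illustrative sketch (not Spark's actual scheduler code; all names below are hypothetical), one way to drop duplicates would be to apply an update only the first time a given (accumulator, stage, partition) combination reports it:
> {code}
> import scala.collection.mutable
>
> // Hypothetical driver-side guard: remember which (accumulatorId, stageId,
> // partitionId) combinations have already reported, and ignore re-reports
> // caused by recomputation of the same partition.
> class DedupingAccumulatorSink {
>   private val seen = mutable.Set.empty[(Long, Int, Int)]
>   private val totals = mutable.Map.empty[Long, Long].withDefaultValue(0L)
>
>   def applyUpdate(accId: Long, stageId: Int, partitionId: Int, delta: Long): Unit = {
>     if (seen.add((accId, stageId, partitionId))) {
>       totals(accId) += delta  // first report from this partition: apply it
>     }                         // otherwise: a recomputation, so drop it
>   }
>
>   def value(accId: Long): Long = totals(accId)
> }
> {code}
> As the paragraph above notes, this per-(stage, partition) keying would not by itself resolve the pipelining case, where what is logically the same update can arrive from two different stages.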



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Re: [jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates

Mattmann, Chris A (3980)
Guys I fixed this by adding [hidden email] to the mailing list, no
more moderation required.

Cheers,
Chris





-----Original Message-----
From: "ASF GitHub Bot   (JIRA)" <[hidden email]>
Reply-To: "[hidden email]" <[hidden email]>
Date: Saturday, March 29, 2014 10:14 AM
To: "[hidden email]" <[hidden email]>
Subject: [jira] [Commented] (SPARK-732) Recomputation of RDDs may result
in duplicated accumulator updates
