File JIRAs for all flaky test failures

File JIRAs for all flaky test failures

Kay Ousterhout
Hi all,

I've noticed the Spark tests getting increasingly flaky -- it seems more common than not now that the tests need to be re-run at least once on PRs before they pass.  This is both annoying and problematic because it makes it harder to tell when a PR is introducing new flakiness.

To try to clean this up, I'd propose filing a JIRA *every time* Jenkins fails on a PR (for a reason unrelated to the PR). Just provide a quick description of the failure -- e.g., "Flaky test: DAGSchedulerSuite" or "Tests failed because the 250m timeout expired" -- plus a link to the failed build, and include the "Tests" component. If there's already a JIRA for the issue, just comment with a link to the latest failure. I know folks don't always have time to track down why a test failed, but this is at least helpful to someone else who, later on, is trying to diagnose when the issue started and to find the problematic code / test.

If this seems like too high overhead, feel free to suggest alternative ways to make the tests less flaky!

-Kay

Re: File JIRAs for all flaky test failures

Saikat Kanjilal

I was working on something to address this a while ago (https://issues.apache.org/jira/browse/SPARK-9487), but the difficulty of reproducing failures locally made each unit test much more complicated to fix. Should we resurface this JIRA? I would wholeheartedly agree with the assessment that the unit tests are flaky.

In Python we use `local[4]` for unit tests, while in Scala/Java we use `local[2]` and `local` for some unit tests in SQL, MLLib, and other components. If the ...
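For context, those master strings control local-mode parallelism in a SparkContext. Below is a minimal sketch of what the difference means in practice, assuming stock Spark; the object name, app name, and assertion are illustrative only, not from SPARK-9487:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// `local` runs Spark with one worker thread, `local[2]` with two, and
// `local[4]` with four -- the strings the JIRA above is about unifying.
object LocalMasterSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("local-master-sketch"))
    try {
      // At least two threads matter for tests that rely on concurrent tasks.
      assert(sc.parallelize(1 to 100).count() == 100)
    } finally {
      sc.stop()  // always stop so the next context can start cleanly
    }
  }
}
```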





Re: File JIRAs for all flaky test failures

Armin Braun
I think one thing contributing to this a lot is the general issue of the tests taking up a lot of file descriptors (10k+ if I run them on a standard Debian machine).
There are a few suites that contribute to this in particular, like `org.apache.spark.ExecutorAllocationManagerSuite`, which, like a few others, appears to consume a lot of fds.

Wouldn't it make sense to open JIRAs about those and actively try to reduce the resource consumption of these tests?
It seems to me these can cause a lot of unpredictable behavior (making the cause of flaky tests hard to identify, especially when timeouts are involved), and they make it prohibitively expensive for many people to test locally.
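One rough way to quantify that claim on Linux is to poll /proc/self/fd from inside the test JVM. A minimal sketch, not part of Spark's test code -- the object name is made up:

```scala
import java.io.File

// Hypothetical diagnostic: count this JVM's open file descriptors on Linux
// by listing /proc/self/fd. Call it from a suite's beforeAll/afterAll (or a
// background thread) to see which suites grow or leak fds.
object FdProbe {
  def openFds(): Int =
    Option(new File("/proc/self/fd").list()).map(_.length).getOrElse(-1)

  def main(args: Array[String]): Unit =
    println(s"open fds: ${openFds()}")
}
```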



Re: File JIRAs for all flaky test failures

shane knapp
It's not an open-file limit -- I have the Jenkins workers set up with a soft file limit of 100k and a hard limit of 200k.
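A quick way to check what limits forked test JVMs actually inherit is to ask a child shell. A sketch using a shell-out; nothing here is from the Spark build, and it assumes bash is available:

```scala
import scala.sys.process._

// Hypothetical check: print the soft and hard open-file limits visible to a
// child shell, i.e. what spawned test JVMs inherit on Linux.
object UlimitProbe {
  def main(args: Array[String]): Unit = {
    val soft = Seq("bash", "-c", "ulimit -Sn").!!.trim
    val hard = Seq("bash", "-c", "ulimit -Hn").!!.trim
    println(s"soft open-file limit: $soft, hard: $hard")
  }
}
```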




Re: File JIRAs for all flaky test failures

Saikat Kanjilal
In reply to this post by Armin Braun

I would recommend we just open JIRAs for unit tests by module (core/ml/sql, etc.) and fix them one module at a time; this at least keeps the number of unit tests needing fixes down to a manageable number.






Re: File JIRAs for all flaky test failures

Josh Rosen
A useful tool for investigating test flakiness is my Jenkins Test Explorer service, running at https://spark-tests.appspot.com/

This has some useful timeline views for debugging flaky builds. For instance, at https://spark-tests.appspot.com/jobs/spark-master-test-maven-hadoop-2.6 (may be slow to load) you can see this chart: https://i.imgur.com/j8LV3pX.png. Here, each column represents a test run and each row represents a test which failed at least once over the displayed time period.

In that linked example screenshot you'll notice that a few columns have grey squares indicating that tests were skipped but lack any red squares to indicate test failures. This usually indicates that the build failed due to a problem other than an individual test failure. For example, I clicked into one of those builds and found that one test suite failed in test setup because the previous suite had not properly cleaned up its SparkContext (I'll file a JIRA for this).
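That cleanup failure is the classic pattern where one suite leaks a SparkContext and the *next* suite's setup fails. A minimal sketch of the discipline that avoids it, assuming ScalaTest; the suite and test names are made up, not from Spark's tree:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Hypothetical suite showing per-suite SparkContext hygiene: create in
// beforeAll, stop in afterAll, so a failure here cannot poison later suites.
class ExampleCleanupSuite extends FunSuite with BeforeAndAfterAll {
  @transient private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("cleanup-sketch"))
  }

  override def afterAll(): Unit = {
    try {
      if (sc != null) sc.stop()  // skipping this is exactly the failure mode above
    } finally {
      sc = null
      super.afterAll()
    }
  }

  test("a trivial job runs") {
    assert(sc.parallelize(1 to 10).count() === 10)
  }
}
```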

You can click through the interface to drill down to reports on individual builds, tests, suites, etc. As an example of an individual test's detail page, https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.rdd.LocalCheckpointSuite&test_name=missing+checkpoint+block+fails+with+informative+message shows the patterns of flakiness in a streaming checkpoint test.

Finally, there's an experimental "interesting new test failures" report which tries to surface tests that have started failing very recently: https://spark-tests.appspot.com/failed-tests/new. Specifically, entries in this feed are test failures which a) occurred in the last week, b) were not part of a build with 20 or more failed tests, c) were not observed to fail during the previous week (i.e. no failures in [2 weeks ago, 1 week ago)), and d) represent the first time the test failed this week (i.e. a test case will appear at most once in the results list). I've also exposed this as an RSS feed at https://spark-tests.appspot.com/rss/failed-tests/new.
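Those four rules translate directly into a filter. A sketch of the selection logic as described, with a made-up Failure record standing in for the real data model (and `distinctBy`, which needs Scala 2.13+):

```scala
import java.time.Instant
import java.time.temporal.ChronoUnit

// Hypothetical record: one observed failure of `test`, in a build where
// `failedInBuild` tests failed in total.
case class Failure(test: String, at: Instant, failedInBuild: Int)

def interestingNewFailures(all: Seq[Failure], now: Instant): Seq[Failure] = {
  val oneWeekAgo  = now.minus(7, ChronoUnit.DAYS)
  val twoWeeksAgo = now.minus(14, ChronoUnit.DAYS)
  // (c) tests that also failed in [2 weeks ago, 1 week ago) are excluded
  val failedPriorWeek: Set[String] = all.collect {
    case f if !f.at.isBefore(twoWeeksAgo) && f.at.isBefore(oneWeekAgo) => f.test
  }.toSet
  all.filter(f => f.at.isAfter(oneWeekAgo))    // (a) within the last week
    .filter(_.failedInBuild < 20)              // (b) skip mass-failure builds
    .filterNot(f => failedPriorWeek(f.test))   // (c) new this week
    .sortBy(_.at.toEpochMilli)
    .distinctBy(_.test)                        // (d) at most one entry per test
}
```

(`distinctBy` keeps the first occurrence, so after the sort each entry is that test's earliest failure of the week.)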




Re: File JIRAs for all flaky test failures

Saikat Kanjilal
The issue was not a lack of tooling -- I used the URL you describe below to drill down to the exact test failure / stack trace. The problem was that my builds would work like a charm locally but fail with these errors on Jenkins; that was the whole challenge in fixing the unit tests. It was rare (if ever) that I could replicate a test failure locally.

Sent from my iPhone



Re: File JIRAs for all flaky test failures

Saikat Kanjilal

I'd just like to follow up again on this thread: should we devote some energy to fixing unit tests module by module? There wasn't much interest in this last time, but given the nature of this thread I'd be willing to dive deep into this again, with some help.




Re: File JIRAs for all flaky test failures

rxin
What exactly is the issue? I've been working on Spark dev for a long time, and very rarely do I run into an issue that manifests only on Jenkins and not locally. I don't have some magic local setup either.

We should definitely cut down test flakiness.



Re: File JIRAs for all flaky test failures

Saikat Kanjilal

Reynold,

It's not one issue -- I encountered multiple issues (stack traces, exceptions, etc.) that occurred only on Jenkins and not in my local environment. I would have to dig up all those old unit tests to list them all 😊, and I'm not willing to do that unless we deem this an actual problem that we want to spend time and energy fixing.


Thanks






Re: File JIRAs for all flaky test failures

Sean Owen
In reply to this post by Saikat Kanjilal
I'm not sure what you're specifically suggesting. Of course flaky tests are bad and should be fixed, and people do fix them. Yes, some are pretty hard to fix because they are rarely, if ever, reproducible. If you want to fix, fix; there's nothing more to it.

I don't perceive flaky tests to be a significant problem. It has gone from bad to occasional over the past year in my anecdotal experience.



Re: File JIRAs for all flaky test failures

Saikat Kanjilal

I am specifically suggesting that we document a list of the flaky tests and fix them, that's all. To organize the effort, I suggested tackling this module by module. Your second sentence is what I was trying to gauge from the community before putting any more effort into this.






Re: File JIRAs for all flaky test failures

rxin
Josh's tool should give enough signal there already; I don't think we need a manual process to document them. If you want to work on those, that'd be great -- I bet you'll get a lot of love, because all developers hate flaky tests.


On Thu, Feb 16, 2017 at 6:19 PM, Saikat Kanjilal <[hidden email]> wrote:

I am specifically suggesting documenting a list of the the flaky tests and fixing them, that's all.  To organize the effort I suggested tackling this by module.  Your second sentence is what I was trying to gauge from the community before putting anymore effort into this.




From: Sean Owen <[hidden email]>
Sent: Thursday, February 16, 2017 8:45 AM
To: Saikat Kanjilal; [hidden email]

Subject: Re: File JIRAs for all flaky test failures
 
I'm not sure what you're specifically suggesting. Of course flaky tests are bad and they should be fixed, and people do. Yes, some are pretty hard to fix because they are rarely reproducible if at all. If you want to fix, fix; there's nothing more to it.

I don't perceive flaky tests to be a significant problem. It has gone from bad to occasional over the past year in my anecdotal experience.

On Thu, Feb 16, 2017 at 4:26 PM Saikat Kanjilal <[hidden email]> wrote:

I'd just like to follow up again on this thread: should we devote some energy to fixing unit tests module by module?  There wasn't much interest in this last time, but given the nature of this thread I'd be willing to dive deep into this again with some help.


From: Saikat Kanjilal <[hidden email]>
Sent: Wednesday, February 15, 2017 6:12 PM
To: Josh Rosen
Cc: Armin Braun; Kay Ousterhout; [hidden email]

Subject: Re: File JIRAs for all flaky test failures
The issue was not a lack of tooling.  I used the URL you describe below to drill down to the exact test failure/stack trace.  The problem was that my builds would work like a charm locally but fail with these errors on Jenkins; that was the whole challenge in fixing the unit tests.  It was rare (if ever) that I could replicate a test failure locally.


On Feb 15, 2017, at 5:40 PM, Josh Rosen <[hidden email]> wrote:

A useful tool for investigating test flakiness is my Jenkins Test Explorer service, running at https://spark-tests.appspot.com/

This has some useful timeline views for debugging flaky builds. For instance, at https://spark-tests.appspot.com/jobs/spark-master-test-maven-hadoop-2.6 (may be slow to load) you can see this chart: https://i.imgur.com/j8LV3pX.png. Here, each column represents a test run and each row represents a test which failed at least once over the displayed time period.

In that linked example screenshot you'll notice that a few columns have grey squares indicating that tests were skipped but lack any red squares to indicate test failures. This usually indicates that the build failed due to a problem other than an individual test failure. For example, I clicked into one of those builds and found that one test suite failed in test setup because the previous suite had not properly cleaned up its SparkContext (I'll file a JIRA for this).
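
(For reference, here is a minimal sketch of the cleanup pattern that avoids leaking a SparkContext into the next suite, assuming ScalaTest's BeforeAndAfterAll; the suite name is hypothetical, and Spark's own LocalSparkContext trait does essentially this.)

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Hypothetical suite: creates one SparkContext for all of its tests and
// always stops it afterwards, so the next suite starts from a clean slate.
class MySuite extends FunSuite with BeforeAndAfterAll {
  @transient private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("MySuite"))
  }

  override def afterAll(): Unit = {
    try {
      if (sc != null) sc.stop()
      // Spark records the driver port in a system property; clear it so a
      // subsequent suite can bind a fresh context.
      System.clearProperty("spark.driver.port")
    } finally {
      super.afterAll()
    }
  }
}
```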

You can click through the interface to drill down to reports on individual builds, tests, suites, etc. As an example of an individual test's detail page, https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.rdd.LocalCheckpointSuite&test_name=missing+checkpoint+block+fails+with+informative+message shows the patterns of flakiness in a streaming checkpoint test.

Finally, there's an experimental "interesting new test failures" report which tries to surface tests which have started failing very recently: https://spark-tests.appspot.com/failed-tests/new. Specifically, entries in this feed are test failures which a) occurred in the last week, b) were not part of a build which had 20 or more failed tests, c) were not observed to fail during the previous week (i.e. no failures from [2 weeks ago, 1 week ago)), and d) represent the first time that the test failed this week (i.e. a test case will appear at most once in the results list). I've also exposed this as an RSS feed at https://spark-tests.appspot.com/rss/failed-tests/new.
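
Rendered as code, those criteria look roughly like the following sketch; the `Failure` case class and its field names are made up for illustration, since the service's actual data model isn't public.

```scala
import java.time.Instant
import java.time.temporal.ChronoUnit

object NewFailures {
  // Hypothetical record of one test failure observed in one build.
  case class Failure(suite: String, test: String, at: Instant, failuresInBuild: Int)

  def interestingNewFailures(failures: Seq[Failure], now: Instant): Seq[Failure] = {
    val oneWeekAgo  = now.minus(7, ChronoUnit.DAYS)
    val twoWeeksAgo = now.minus(14, ChronoUnit.DAYS)
    failures
      .filter(_.at.isAfter(oneWeekAgo))   // (a) failed within the last week
      .filter(_.failuresInBuild < 20)     // (b) not part of a mass build failure
      .filterNot { f =>                   // (c) no failures in [2 weeks ago, 1 week ago)
        failures.exists(g => g.test == f.test &&
          g.at.isAfter(twoWeeksAgo) && g.at.isBefore(oneWeekAgo))
      }
      .groupBy(_.test)                    // (d) keep only the first failure per test
      .values.map(_.minBy(_.at.toEpochMilli))
      .toSeq
  }
}
```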


On Wed, Feb 15, 2017 at 12:51 PM Saikat Kanjilal <[hidden email]> wrote:

I would recommend we just open JIRAs for unit tests by module (core/ML/SQL, etc.) and fix them one module at a time; this at least keeps the number of unit tests needing fixes down to a manageable number.






Re: File JIRAs for all flaky test failures

Kay Ousterhout
Following up on this with a renewed plea to file JIRAs when you see flaky tests.  I did a quick skim of the PR builder, and there were 17 times in the last week when a flaky test led to a Jenkins failure, and someone re-ran the tests without filing (or updating) a JIRA (my apologies to anyone who was incorrectly added here):

cloud-fan (4)
gatorsmile (4)
wzhfy (4)
holdenk (1)
kunalkhamar (1)
hyukjinkwon (1)
scrapcodes (1)
srowen (1)

Are you on this list?  It's not too late to look at the test that failed and file (or update) the appropriate JIRA.

If you weren't convinced by my last email, here are some reasons to file a JIRA:

(0) Flaky tests are not always broken tests -- sometimes the underlying code is broken (e.g., SPARK-19803, SPARK-19988, SPARK-19072).

(1) Before a flaky test gets fixed, some human needs to file a JIRA.  The person who sees the flaky test via the PR builder is best suited to do this: you already had to look at which test failed (to make sure it wasn't related to your change), and you know the test is flaky (because you're re-running it assuming it will succeed), at which point it takes <1 minute to file a JIRA.

(2) Related to the above, existing automation is not sufficient.  Josh's tool is very useful for debugging flaky tests [1], but it does not yet automatically file JIRAs.  Many recent flaky tests might have shown up on the nifty "interesting new test failures" dashboard, but they weren't noticed until they had been failing for more than a week and dropped off that dashboard.  One recent flaky test (SPARK-19990) was causing *every* Maven build to fail for over a week, but was only noticed when someone filed a JIRA as a result of a flaky PR test.

(3) JIRAs result in helpful people fixing the tests!  If you're interested in doing this, the flaky test label is a good place to start.  Many thanks to folks who have recently helped with filing and fixing flaky tests: Sital Kedia, Song Jun, Xiao Li, Shubham Chopra, Shixiong Zhu, Genmao Yu, Imran Rashid, and I'm sure many more (this list is based on a quick JIRA skim).

Thanks!!

Kay



[1] You can create a URL using the "suite_name" and optionally "test_name" GET parameters in Josh's app to investigate a flaky test; e.g., to see how often the "hive bucketing is not supported" test in ShowCreateTableSuite has been failing: https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.ShowCreateTableSuite&test_name=hive+bucketing+is+not+supported (be patient -- it takes a minute and sometimes a re-load to work).
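
If it's helpful, here's a tiny sketch of building that URL programmatically; the object and method names are made up, and only the two query parameters come from Josh's app.

```scala
import java.net.URLEncoder

object TestUrls {
  // URLEncoder encodes spaces as '+', matching the URLs above.
  private def enc(s: String): String = URLEncoder.encode(s, "UTF-8")

  def testDetailsUrl(suite: String, test: String): String =
    s"https://spark-tests.appspot.com/test-details?suite_name=${enc(suite)}&test_name=${enc(test)}"
}

// Example:
// TestUrls.testDetailsUrl("org.apache.spark.sql.hive.ShowCreateTableSuite",
//                         "hive bucketing is not supported")
```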



Re: File JIRAs for all flaky test failures

Saikat Kanjilal

I'm happy to help out with this effort; I'll look at that label and see which tests I can investigate and/or fix.






