Crowdsourced triage Scapegoat compiler plugin warnings

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Crowdsourced triage Scapegoat compiler plugin warnings

Josh Rosen-2
I'm interested in using the Scapegoat Scala compiler plugin to find potential bugs and performance problems in Spark. Scapegoat has a useful built-in set of inspections and is pretty easy to extend with custom ones. For example, I added an inspection to spot places where we call .apply() on a Seq which is not an IndexedSeq in order to make it easier to spot potential O(n^2) performance bugs.

There are lots of false-positives and benign warnings (as with any linter / static analyzer) so I don't think it's feasible to us to include this as a blocking step in our regular build. I am planning to build tooling to surface only new warnings so going forward this can become a useful code-review aid. 

The current codebase has roughly 1700 warnings that I would like to triage and categorize as false-positives or real bugs. I can't do this alone, so here's how you can help:
  • Visit the Google Docs spreadsheet at https://docs.google.com/spreadsheets/d/1z7xNMjx7VCJLCiHOHhTth7Hh4R0F6LwcGjEwCDzrCiM/edit?usp=sharing and find an un-triaged warning.
  • In the columns at the right of the sheet, enter your name in the appropriate column to mark a warning as a false-positive or as a real bug and/or performance issue. If think a warning is a real issue then use the "comments" column for providing additional detail.
  • Please don't file JIRAs or PRs for individual warnings; I suspect that we'll find clusters of issues which are best fixed in a few larger PRs vs. lots of smaller ones. Certain warnings are probably simply style issues so we should discuss those before trying to fix them.
The sheet has hidden columns capturing the Spark revision and Scapegoat revision. I can use this to programmatically update the sheet and remap lines after updating either Scapegoat (to suppress false-positives) or Spark (to incorporate fixes and surface new warnings). For those who are interested, the sheet was produced with this script: https://gist.github.com/JoshRosen/1ae12a979880d9a98988aa87d70ff2a8

Depending on the results of this experiment we might want to integrate a high-signal subset of the Scapegoat warnings into our build. I'm also hoping that we'll be able to build a useful corpus of triaged warnings in order to help improve Scapegoat itself and eliminate common false-positives.

Thanks and happy bug-hunting,
Josh Rosen
Reply | Threaded
Open this post in threaded view
|

Re: Crowdsourced triage Scapegoat compiler plugin warnings

Hyukjin Kwon
Gentle ping to dev for help. I hope this effort is not abandoned.

On 25 May 2017 9:41 am, "Josh Rosen" <[hidden email]> wrote:
I'm interested in using the Scapegoat Scala compiler plugin to find potential bugs and performance problems in Spark. Scapegoat has a useful built-in set of inspections and is pretty easy to extend with custom ones. For example, I added an inspection to spot places where we call .apply() on a Seq which is not an IndexedSeq in order to make it easier to spot potential O(n^2) performance bugs.

There are lots of false-positives and benign warnings (as with any linter / static analyzer) so I don't think it's feasible to us to include this as a blocking step in our regular build. I am planning to build tooling to surface only new warnings so going forward this can become a useful code-review aid. 

The current codebase has roughly 1700 warnings that I would like to triage and categorize as false-positives or real bugs. I can't do this alone, so here's how you can help:
  • Visit the Google Docs spreadsheet at https://docs.google.com/spreadsheets/d/1z7xNMjx7VCJLCiHOHhTth7Hh4R0F6LwcGjEwCDzrCiM/edit?usp=sharing and find an un-triaged warning.
  • In the columns at the right of the sheet, enter your name in the appropriate column to mark a warning as a false-positive or as a real bug and/or performance issue. If think a warning is a real issue then use the "comments" column for providing additional detail.
  • Please don't file JIRAs or PRs for individual warnings; I suspect that we'll find clusters of issues which are best fixed in a few larger PRs vs. lots of smaller ones. Certain warnings are probably simply style issues so we should discuss those before trying to fix them.
The sheet has hidden columns capturing the Spark revision and Scapegoat revision. I can use this to programmatically update the sheet and remap lines after updating either Scapegoat (to suppress false-positives) or Spark (to incorporate fixes and surface new warnings). For those who are interested, the sheet was produced with this script: https://gist.github.com/JoshRosen/1ae12a979880d9a98988aa87d70ff2a8

Depending on the results of this experiment we might want to integrate a high-signal subset of the Scapegoat warnings into our build. I'm also hoping that we'll be able to build a useful corpus of triaged warnings in order to help improve Scapegoat itself and eliminate common false-positives.

Thanks and happy bug-hunting,
Josh Rosen

Reply | Threaded
Open this post in threaded view
|

Re: Crowdsourced triage Scapegoat compiler plugin warnings

Sean Owen
Looks like a whole lot of the results have been analyzed. I suspect there's more than enough to act on already. I think we should wait until after 2.2 is done.
Anybody prefer how to proceed here -- just open a JIRA to take care of a batch of related types of issues and go for it?

On Sat, Jun 17, 2017 at 4:45 PM Hyukjin Kwon <[hidden email]> wrote:
Gentle ping to dev for help. I hope this effort is not abandoned.


On 25 May 2017 9:41 am, "Josh Rosen" <[hidden email]> wrote:
I'm interested in using the Scapegoat Scala compiler plugin to find potential bugs and performance problems in Spark. Scapegoat has a useful built-in set of inspections and is pretty easy to extend with custom ones. For example, I added an inspection to spot places where we call .apply() on a Seq which is not an IndexedSeq in order to make it easier to spot potential O(n^2) performance bugs.

There are lots of false-positives and benign warnings (as with any linter / static analyzer) so I don't think it's feasible to us to include this as a blocking step in our regular build. I am planning to build tooling to surface only new warnings so going forward this can become a useful code-review aid. 

The current codebase has roughly 1700 warnings that I would like to triage and categorize as false-positives or real bugs. I can't do this alone, so here's how you can help:
  • Visit the Google Docs spreadsheet at https://docs.google.com/spreadsheets/d/1z7xNMjx7VCJLCiHOHhTth7Hh4R0F6LwcGjEwCDzrCiM/edit?usp=sharing and find an un-triaged warning.
  • In the columns at the right of the sheet, enter your name in the appropriate column to mark a warning as a false-positive or as a real bug and/or performance issue. If think a warning is a real issue then use the "comments" column for providing additional detail.
  • Please don't file JIRAs or PRs for individual warnings; I suspect that we'll find clusters of issues which are best fixed in a few larger PRs vs. lots of smaller ones. Certain warnings are probably simply style issues so we should discuss those before trying to fix them.
The sheet has hidden columns capturing the Spark revision and Scapegoat revision. I can use this to programmatically update the sheet and remap lines after updating either Scapegoat (to suppress false-positives) or Spark (to incorporate fixes and surface new warnings). For those who are interested, the sheet was produced with this script: https://gist.github.com/JoshRosen/1ae12a979880d9a98988aa87d70ff2a8

Depending on the results of this experiment we might want to integrate a high-signal subset of the Scapegoat warnings into our build. I'm also hoping that we'll be able to build a useful corpus of triaged warnings in order to help improve Scapegoat itself and eliminate common false-positives.

Thanks and happy bug-hunting,
Josh Rosen

Reply | Threaded
Open this post in threaded view
|

Re: Crowdsourced triage Scapegoat compiler plugin warnings

Hyukjin Kwon
Hi all,


Another gentle ping for help.

Probably, let me open up a JIRA and proceed this after a couple of weeks if no one is going to do this although I hope someone takes this.


Thanks.

2017-06-18 2:16 GMT+09:00 Sean Owen <[hidden email]>:
Looks like a whole lot of the results have been analyzed. I suspect there's more than enough to act on already. I think we should wait until after 2.2 is done.
Anybody prefer how to proceed here -- just open a JIRA to take care of a batch of related types of issues and go for it?

On Sat, Jun 17, 2017 at 4:45 PM Hyukjin Kwon <[hidden email]> wrote:
Gentle ping to dev for help. I hope this effort is not abandoned.


On 25 May 2017 9:41 am, "Josh Rosen" <[hidden email]> wrote:
I'm interested in using the Scapegoat Scala compiler plugin to find potential bugs and performance problems in Spark. Scapegoat has a useful built-in set of inspections and is pretty easy to extend with custom ones. For example, I added an inspection to spot places where we call .apply() on a Seq which is not an IndexedSeq in order to make it easier to spot potential O(n^2) performance bugs.

There are lots of false-positives and benign warnings (as with any linter / static analyzer) so I don't think it's feasible to us to include this as a blocking step in our regular build. I am planning to build tooling to surface only new warnings so going forward this can become a useful code-review aid. 

The current codebase has roughly 1700 warnings that I would like to triage and categorize as false-positives or real bugs. I can't do this alone, so here's how you can help:
  • Visit the Google Docs spreadsheet at https://docs.google.com/spreadsheets/d/1z7xNMjx7VCJLCiHOHhTth7Hh4R0F6LwcGjEwCDzrCiM/edit?usp=sharing and find an un-triaged warning.
  • In the columns at the right of the sheet, enter your name in the appropriate column to mark a warning as a false-positive or as a real bug and/or performance issue. If think a warning is a real issue then use the "comments" column for providing additional detail.
  • Please don't file JIRAs or PRs for individual warnings; I suspect that we'll find clusters of issues which are best fixed in a few larger PRs vs. lots of smaller ones. Certain warnings are probably simply style issues so we should discuss those before trying to fix them.
The sheet has hidden columns capturing the Spark revision and Scapegoat revision. I can use this to programmatically update the sheet and remap lines after updating either Scapegoat (to suppress false-positives) or Spark (to incorporate fixes and surface new warnings). For those who are interested, the sheet was produced with this script: https://gist.github.com/JoshRosen/1ae12a979880d9a98988aa87d70ff2a8

Depending on the results of this experiment we might want to integrate a high-signal subset of the Scapegoat warnings into our build. I'm also hoping that we'll be able to build a useful corpus of triaged warnings in order to help improve Scapegoat itself and eliminate common false-positives.

Thanks and happy bug-hunting,
Josh Rosen


Reply | Threaded
Open this post in threaded view
|

Re: Crowdsourced triage Scapegoat compiler plugin warnings

Holden Karau
I'm happy to help out in the places I'm more familiar with some starting next week. I suspect things probably just fell to the wayside for some folks during the 2.2.0 release.

On Thu, Jul 13, 2017 at 1:16 AM, Hyukjin Kwon <[hidden email]> wrote:
Hi all,


Another gentle ping for help.

Probably, let me open up a JIRA and proceed this after a couple of weeks if no one is going to do this although I hope someone takes this.


Thanks.

2017-06-18 2:16 GMT+09:00 Sean Owen <[hidden email]>:
Looks like a whole lot of the results have been analyzed. I suspect there's more than enough to act on already. I think we should wait until after 2.2 is done.
Anybody prefer how to proceed here -- just open a JIRA to take care of a batch of related types of issues and go for it?

On Sat, Jun 17, 2017 at 4:45 PM Hyukjin Kwon <[hidden email]> wrote:
Gentle ping to dev for help. I hope this effort is not abandoned.


On 25 May 2017 9:41 am, "Josh Rosen" <[hidden email]> wrote:
I'm interested in using the Scapegoat Scala compiler plugin to find potential bugs and performance problems in Spark. Scapegoat has a useful built-in set of inspections and is pretty easy to extend with custom ones. For example, I added an inspection to spot places where we call .apply() on a Seq which is not an IndexedSeq in order to make it easier to spot potential O(n^2) performance bugs.

There are lots of false-positives and benign warnings (as with any linter / static analyzer) so I don't think it's feasible to us to include this as a blocking step in our regular build. I am planning to build tooling to surface only new warnings so going forward this can become a useful code-review aid. 

The current codebase has roughly 1700 warnings that I would like to triage and categorize as false-positives or real bugs. I can't do this alone, so here's how you can help:
  • Visit the Google Docs spreadsheet at https://docs.google.com/spreadsheets/d/1z7xNMjx7VCJLCiHOHhTth7Hh4R0F6LwcGjEwCDzrCiM/edit?usp=sharing and find an un-triaged warning.
  • In the columns at the right of the sheet, enter your name in the appropriate column to mark a warning as a false-positive or as a real bug and/or performance issue. If think a warning is a real issue then use the "comments" column for providing additional detail.
  • Please don't file JIRAs or PRs for individual warnings; I suspect that we'll find clusters of issues which are best fixed in a few larger PRs vs. lots of smaller ones. Certain warnings are probably simply style issues so we should discuss those before trying to fix them.
The sheet has hidden columns capturing the Spark revision and Scapegoat revision. I can use this to programmatically update the sheet and remap lines after updating either Scapegoat (to suppress false-positives) or Spark (to incorporate fixes and surface new warnings). For those who are interested, the sheet was produced with this script: https://gist.github.com/JoshRosen/1ae12a979880d9a98988aa87d70ff2a8

Depending on the results of this experiment we might want to integrate a high-signal subset of the Scapegoat warnings into our build. I'm also hoping that we'll be able to build a useful corpus of triaged warnings in order to help improve Scapegoat itself and eliminate common false-positives.

Thanks and happy bug-hunting,
Josh Rosen





--
Cell : 425-233-8271
Reply | Threaded
Open this post in threaded view
|

Re: Crowdsourced triage Scapegoat compiler plugin warnings

Sean Owen
In reply to this post by Hyukjin Kwon
I don't think everything needs to be triaged. There are a ton of useful changes that have been identified. I think you could just pick some warning types where they've all been triaged and go fix them. 

On Thu, Jul 13, 2017 at 9:16 AM Hyukjin Kwon <[hidden email]> wrote:
Hi all,


Another gentle ping for help.

Probably, let me open up a JIRA and proceed this after a couple of weeks if no one is going to do this although I hope someone takes this.


Reply | Threaded
Open this post in threaded view
|

Re: Crowdsourced triage Scapegoat compiler plugin warnings

Jorge Sánchez
It might be worth sharing this with the user list, there must be people willing to collaborate who are not on the dev list.

2017-07-13 10:00 GMT+01:00 Sean Owen <[hidden email]>:
I don't think everything needs to be triaged. There are a ton of useful changes that have been identified. I think you could just pick some warning types where they've all been triaged and go fix them. 

On Thu, Jul 13, 2017 at 9:16 AM Hyukjin Kwon <[hidden email]> wrote:
Hi all,


Another gentle ping for help.

Probably, let me open up a JIRA and proceed this after a couple of weeks if no one is going to do this although I hope someone takes this.



Reply | Threaded
Open this post in threaded view
|

Re: Crowdsourced triage Scapegoat compiler plugin warnings

Sean Owen
Possibly, but it will take some degree of sophistication and understanding of Spark to evaluate these. I'd want committers or at least Spark developers to look at this rather than users. 

On Thu, Jul 13, 2017, 15:00 Jorge Sánchez <[hidden email]> wrote:
It might be worth sharing this with the user list, there must be people willing to collaborate who are not on the dev list.

2017-07-13 10:00 GMT+01:00 Sean Owen <[hidden email]>:
I don't think everything needs to be triaged. There are a ton of useful changes that have been identified. I think you could just pick some warning types where they've all been triaged and go fix them. 

On Thu, Jul 13, 2017 at 9:16 AM Hyukjin Kwon <[hidden email]> wrote:
Hi all,


Another gentle ping for help.

Probably, let me open up a JIRA and proceed this after a couple of weeks if no one is going to do this although I hope someone takes this.