[MLlib][Test] Smoke and Metamorphic Testing of MLlib


[MLlib][Test] Smoke and Metamorphic Testing of MLlib

Steffen Herbold
Dear developers,

I am writing you because I applied an approach for the automated testing
of classification algorithms to Spark MLlib and would like to forward
the results to you.

The approach is a combination of smoke testing and metamorphic testing.
The smoke tests try to find problems by executing the training and
prediction functions of classifiers with different data. These smoke
tests should ensure the basic functioning of classifiers. I defined 20
different data sets, some very simple (uniform features in [0,1]), some
with extreme distributions, e.g., data close to machine precision. The
metamorphic tests determine whether classification results change as
expected when the training data is modified, e.g., by reordering
features, flipping class labels, or reordering instances.
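
As an illustration, a feature-reordering test of this kind could look
roughly as follows against the Spark ML Scala API (a simplified sketch,
not one of the generated JUnit tests; the data, class names, and
parameter values are made up):

import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

object FeatureReorderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("metamorphic-sketch").getOrCreate()
    import spark.implicits._

    // Tiny hand-made training set: (label, feature values).
    val rows = Seq(
      (0.0, Array(0.1, 0.9)), (1.0, Array(0.8, 0.2)),
      (0.0, Array(0.2, 0.7)), (1.0, Array(0.9, 0.1))
    )
    val original = rows.map { case (l, f) => (l, Vectors.dense(f)) }
      .toDF("label", "features")
    // Metamorphic transformation: reverse the feature order of every instance.
    val reordered = rows.map { case (l, f) => (l, Vectors.dense(f.reverse)) }
      .toDF("label", "features")

    val dt = new DecisionTreeClassifier().setSeed(42L)
    def predictions(df: DataFrame): Seq[Double] =
      dt.fit(df).transform(df).select("prediction")
        .collect().map(_.getDouble(0)).toSeq

    // Expectation checked by the test: feature order should not change
    // the predicted classes of the same instances.
    val diff = predictions(original).zip(predictions(reordered))
      .count { case (a, b) => a != b }
    assert(diff == 0, s"$diff instances classified differently after reordering")

    spark.stop()
  }
}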

I generated 70 different JUnit tests for six different Spark ML
classifiers. In summary, I found the following potential problems:
- One error due to a value being out of bounds for the
LogisticRegression classifier when the data approaches MAXDOUBLE; the
error message does not explain which bound is violated.
- The classifications of NaiveBayes and LinearSVC sometimes changed
when 1 was added to each feature value.
- The classifications of LogisticRegression, DecisionTree, and
RandomForest were not inverted when all binary class labels were
flipped.
- The classifications of LogisticRegression, DecisionTree, GBT, and
RandomForest sometimes changed when the features were reordered.
- The classifications of LogisticRegression, RandomForest, and
LinearSVC sometimes changed when the instances were reordered.

You can find the details of our results online [1]. The provided
resources include the current draft of the paper that describes the
tests as well as the detailed results. Moreover, we provide an
executable test suite with all tests we executed, as well as an export
of our test results as an XML file that contains all details of the
test execution, including stack traces in case of exceptions. The
preprint and online materials also contain the results for two other
machine learning libraries, namely Weka and scikit-learn. Additionally,
you can find the atoml tool used to generate the tests on GitHub [2].

I hope that these tests may help with the future development of Spark
MLlib. You could help me a lot by answering the following questions:
- Do you consider the tests helpful?
- Are you considering any source code or documentation changes as a
result of our findings?
- Would you be interested in a pull request or any other type of
integration of (a subset of) the tests into your project?
- Would you be interested in more such tests, e.g., tests that consider
hyperparameters, other algorithm types like clustering, or more complex
algorithm-specific metamorphic tests?

I am looking forward to your feedback.

Best regards,
Steffen Herbold

[1] http://user.informatik.uni-goettingen.de/~sherbold/atoml-results/
[2] https://github.com/sherbold/atoml

--
Dr. Steffen Herbold
Institute of Computer Science
University of Goettingen
Goldschmidtstraße 7
37077 Göttingen, Germany
mailto. [hidden email]
tel. +49 551 39-172037




Re: [MLlib][Test] Smoke and Metamorphic Testing of MLlib

Sean Owen
Certainly if your tests have found a problem, open a JIRA and/or pull request with the fix and relevant tests.

More tests generally can't hurt, though I guess we should maybe have a look at them first. If they're a lot of boilerplate and covering basic functions already covered by other tests, they're not as useful, but tests covering new cases should probably be added.



Re: [MLlib][Test] Smoke and Metamorphic Testing of MLlib

Matei Zaharia
In reply to this post by Steffen Herbold
Hi Steffen,

Thanks for sharing your results about MLlib — this sounds like a useful tool. However, I wanted to point out that some of the results may be expected for certain machine learning algorithms, so it might be good to design those tests with that in mind. For example:

> - The classifications of LogisticRegression, DecisionTree, and RandomForest were not inverted when all binary class labels were flipped.
> - The classifications of LogisticRegression, DecisionTree, GBT, and RandomForest sometimes changed when the features were reordered.
> - The classifications of LogisticRegression, RandomForest, and LinearSVC sometimes changed when the instances were reordered.

All of these things might occur because the algorithms are nondeterministic. Were the effects large or small? Or, for example, was the final difference in accuracy statistically significant? Many ML algorithms are trained using randomized algorithms like stochastic gradient descent, so you can’t expect exactly the same results under these changes.

> - The classifications of NaiveBayes and LinearSVC sometimes changed when 1 was added to each feature value.

This might be due to nondeterminism as above, but it might also be due to regularization or nonlinear effects for some algorithms. For example, some algorithms might look at the relative values of features, in which case adding 1 to each feature value transforms the data. Other algorithms might require that data be centered around a mean of 0 to work best.

I haven’t read the paper in detail, but basically it would be good to account for randomized algorithms as well as various model assumptions, and make sure the differences in results in these tests are statistically significant.

Matei




Re: [MLlib][Test] Smoke and Metamorphic Testing of MLlib

Steffen Herbold
Dear Matei,

thanks for the feedback!

I used the setSeed option for all randomized classifiers and always
used the same seeds for training, in the hope that this deals with the
non-determinism. I did not run any significance tests, because I was
considering this from a functional perspective, assuming that the
nondeterminism would be taken care of once I fixed the seed values. The
test results report how many instances were classified differently.
Sometimes these are only 1 or 2 out of 100 instances, i.e., almost
certainly not significant. Other cases seem more interesting. For
example, 20/100 instances were classified differently by the linear SVM
on informative uniformly distributed data when we added 1 to each
feature value.
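
Concretely, the seeds are fixed along these lines (a simplified sketch
rather than the actual generated test code; the classifier choice, data,
and seed value here are only illustrative):

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object FixedSeedSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("fixed-seed-sketch").getOrCreate()
    import spark.implicits._

    val data = Seq(
      (0.0, Vectors.dense(0.1, 0.9)), (1.0, Vectors.dense(0.8, 0.2)),
      (0.0, Vectors.dense(0.3, 0.6)), (1.0, Vectors.dense(0.7, 0.4))
    ).toDF("label", "features")

    // Same seed, same input: two training runs should agree exactly.
    val rf = new RandomForestClassifier().setNumTrees(10).setSeed(42L)
    val a = rf.fit(data).transform(data).select("prediction")
      .collect().map(_.getDouble(0)).toSeq
    val b = rf.fit(data).transform(data).select("prediction")
      .collect().map(_.getDouble(0)).toSeq
    assert(a == b)
    // With a fixed seed, repeated training on identical input is
    // reproducible, so differences observed in the metamorphic tests are
    // not caused by run-to-run randomness.

    spark.stop()
  }
}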

I know that some of these behaviors are to be expected. However, I was
actually not sure what to expect, especially after I started to compare
the results across different ML libraries. Random forests are a good
example. I expected them to depend on feature/instance order. However,
they do not in Weka, only in scikit-learn and Spark MLlib. There are
more such examples, like logistic regression, which exhibits different
behavior in all three libraries. Thus, I decided to just give my
results to the people who know what to expect from their
implementations, i.e., the devs.

I will probably expand my test generator to allow more detailed
specifications of the expected behavior of the algorithms in the
future. This seems to be a "must" for potentially productive use by
projects. Relaxing the assertions so that they only fail if the
differences are significant would be another possible change. This
could be a command-line option that allows different levels of
strictness for the testing.

Best,
Steffen







Re: [MLlib][Test] Smoke and Metamorphic Testing of MLlib

Erik Erlandson

Behaviors at this level of detail, across different ML implementations, are highly unlikely to ever align exactly. Statistically small changes in logic, such as "<" versus "<=", or differences in random number generators, etc. (to say nothing of different implementation languages), will accumulate over training to yield different models, even if their overall performance should be similar.

> Random forests are a good example. I expected them to depend on feature/instance order. However, they do not in Weka, only in scikit-learn and Spark MLlib. There are more such examples, like logistic regression, which exhibits different behavior in all three libraries.

Re: [MLlib][Test] Smoke and Metamorphic Testing of MLlib

Matei Zaharia
In reply to this post by Steffen Herbold
Yes, that makes sense, but just to be clear, using the same seed does *not* imply that the algorithm should produce “equivalent” results by some definition of equivalent if you change the input data. For example, in SGD, the random seed might be used to select the next minibatch of examples, but if you reorder the data or change the labels, this will result in a different gradient being computed. Just because the dataset transformation seems to preserve the ML problem at a high abstraction level does not mean that even a deterministic ML algorithm (MLlib with seed) will give the same result. Maybe other libraries do, but it doesn’t necessarily mean that MLlib is doing something wrong here.

Basically, I’m just saying that as an ML library developer I wouldn’t be super concerned about these particular test results (especially if just a few instances change classification). I would be much more interested, however, in results like the following:

- The algorithm’s evaluation metrics (loss, accuracy, etc.) change in a statistically significant way if you change these properties of the data. This probably requires you to run multiple times with different seeds.
- MLlib’s evaluation metrics for a problem differ in a statistically significant way from those of other ML libraries, for algorithms configured with equivalent hyperparameters. (Sometimes libraries have different definitions for hyperparameters, though.)

The second one is definitely something we’ve tested for informally in the past, though it is not in unit tests as far as I know.
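
For the first kind of check, a rough sketch of a seed sweep with a
significance test might look like this (purely illustrative: synthetic
data, arbitrary hyperparameters, evaluation on the training set for
brevity, and commons-math3 assumed to be available for the t-test):

import org.apache.commons.math3.stat.inference.TTest
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

import scala.util.Random

object SeedSweepSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("seed-sweep-sketch").getOrCreate()
    import spark.implicits._

    // Synthetic two-class problem with ~15% label noise; the metamorphic
    // transformation under test is reversing the feature order.
    val rnd = new Random(0)
    val rows = Seq.fill(200) {
      val x = rnd.nextDouble(); val y = rnd.nextDouble()
      val label = if ((x > y) ^ (rnd.nextDouble() < 0.15)) 1.0 else 0.0
      (label, Array(x, y))
    }
    val original = rows.map { case (l, f) => (l, Vectors.dense(f)) }
      .toDF("label", "features")
    val reordered = rows.map { case (l, f) => (l, Vectors.dense(f.reverse)) }
      .toDF("label", "features")

    val evaluator = new MulticlassClassificationEvaluator().setMetricName("accuracy")

    // Train with a given seed and report accuracy (on the training data,
    // for brevity; a held-out split would be more meaningful).
    def accuracy(train: DataFrame, seed: Long): Double = {
      val model = new RandomForestClassifier()
        .setNumTrees(20).setSeed(seed).fit(train)
      evaluator.evaluate(model.transform(train))
    }

    val seeds = (1L to 10L).toArray
    val accOriginal  = seeds.map(accuracy(original, _))
    val accReordered = seeds.map(accuracy(reordered, _))

    // Paired t-test over the per-seed accuracies.
    val p = new TTest().pairedTTest(accOriginal, accReordered)
    println(f"mean accuracy original=${accOriginal.sum / seeds.length}%.3f " +
      f"reordered=${accReordered.sum / seeds.length}%.3f, paired t-test p=$p%.3f")

    spark.stop()
  }
}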

Matei


