Testing machine learning and AI algorithms is hard. In fact, testing scientific software in general is hard, and there is a body of literature on this broader topic. As the introduction of Carver et al (2017) puts it:
The development of scientific software differs significantly from the development of more traditional business information systems, from which many software engineering best practices and tools have been drawn. These differences appear at various phases of the software lifecycle…
These differences appear throughout the stages of the software lifecycle: requirements, design, coding, validation and verification, and deployment. In particular, according to Carver et al, the difficulties one usually faces in validating and verifying scientific software are:
- Results are often unknown when exploring novel science or engineering areas and algorithms
- Popular software engineering tools often do not work on the architectures used in computational science and engineering
Indeed, machine learning and AI algorithms are often the result of explorations by data scientists and researchers. When the resulting algorithms are subsequently implemented by engineers, some very important practices in software quality assurance are often difficult to implement, due to the differences in working culture between the two roles.
The oracle problem
One of these difficulties is the oracle problem, discussed extensively in Weyuker (1982). In software testing, an oracle is a mechanism that tells you whether a program is working correctly. In the simplest case, an oracle could be a direct comparison of the program's output with the known correct answer. More complicated oracles may involve running another program to determine whether the output of the target program is correct.
For machine learning and AI algorithms, the problem is that there is often no oracle without human intervention. Take image recognition as an example. A common way of solving an image recognition problem is supervised machine learning, which usually begins by curating a dataset of correctly labelled images for training and validation. The curation itself introduces an oracle for the model under development through human intervention. This implies that the range of tests conducted on the model is limited by human effort, and expanding the test cases, for example by significantly increasing the sample size and variety with online testing, would be a problem.
Several methods have been proposed to mitigate the oracle problem, including the use of pseudo-oracles and metamorphic testing, which we discuss in the rest of this article.
A pseudo-oracle, proposed by Davis and Weyuker (1981), consists of multiple implementations solving the same problem that verify each other. The outputs of all implementations are compared, and if one of them differs, it may indicate that the corresponding implementation has a fault. In machine learning, for example, one method of filtering out adversarial examples is to compare the outputs of a model with and without feature squeezing. This is an application of the pseudo-oracle to test a supervised machine learning model online.
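To make the idea concrete, here is a minimal sketch of a pseudo-oracle, using sorting as a stand-in problem (the example and function names are ours, not from the paper): a hand-written sort is checked against an independent implementation, Python's built-in `sorted`, on random inputs.

```python
import random

def insertion_sort(xs):
    """A hand-written implementation, acting as the program under test."""
    result = []
    for x in xs:
        i = len(result)
        while i > 0 and result[i - 1] > x:
            i -= 1
        result.insert(i, x)
    return result

def pseudo_oracle_test(trials=100):
    """Compare the implementation under test against an independent
    implementation of the same problem (here Python's built-in sorted).
    A disagreement signals a fault in at least one of the two."""
    for _ in range(trials):
        xs = [random.randint(-1000, 1000) for _ in range(random.randint(0, 50))]
        if insertion_sort(xs) != sorted(xs):
            return False, xs  # disagreement found: report the failing input
    return True, None

ok, counterexample = pseudo_oracle_test()
print("implementations agree" if ok else f"disagree on {counterexample}")
```

Note that a pseudo-oracle only detects disagreement; when the two implementations differ, further diagnosis is needed to decide which one is at fault.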
In the following we will discuss metamorphic testing, which was proposed by Chen et al (1998). We start by introducing the concept of metamorphic testing as a general-purpose toolkit in software testing, and then review some research on applying metamorphic testing to machine learning software.
Metamorphic testing is a software testing methodology to mitigate the oracle problem. The idea is simple: even if we do not know the correct output for a single input, we might still know the relations between the outputs of multiple inputs, especially if the inputs themselves are related. We can check the software for these relations, called metamorphic relations. If they do not hold, it is a sure sign that the software has defects.
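A classic small illustration (our own, not from the original paper): we may not know offhand what sin(1.234) should be, but we do know that sine satisfies sin(x) = sin(π − x), so we can check that relation without an oracle for any individual value.

```python
import math

def check_sine_mr(x, tol=1e-12):
    """Metamorphic relation for sine: sin(x) == sin(pi - x).
    We need no oracle for the value of sin(x) itself; we only
    check that the relation between the two outputs holds."""
    return abs(math.sin(x) - math.sin(math.pi - x)) <= tol

# Check the relation over a range of inputs.
assert all(check_sine_mr(x / 10) for x in range(-100, 100))
```

An implementation of sine with a sign error or a wrong range reduction would violate this relation for many inputs, even though no single test case has a known expected value.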
Since its first publication nearly 20 years ago, metamorphic testing has shown many promising results and has detected many defects in popular software such as GCC and LLVM (Le et al, 2014). These success stories are collected in the review article by Chen et al (2017).
An example given by Chen et al (1998) is to test whether a given path is the shortest path in an undirected weighted graph. This is a typical case in software testing where resorting to an oracle would be expensive when the graph is complicated. If (x, v1, …, vN, y) is the proposed shortest path between x and y with length p, Chen et al suggest a few follow-up tests to conduct on the algorithm, among them:
- Find the shortest path between y and x (in the reverse direction), and verify that its length is also p.
- Find the shortest paths between x and vK and between vK and y, and verify that the sum of the lengths of the two paths is also p, for any K between 1 and N.
To put the methodology formally, let f be the software under test and x an input with output f(x). We have no oracle for f(x), so verifying this single test case is difficult. However, a known transformation T, based on a metamorphic relation, can be applied to x to generate the follow-up test case T(x). We can compute f(T(x)) and verify it against an oracle T'(f(x)) constructed from the known f(x) using the metamorphic relation. In the shortest path example above, the known resulting length p becomes an oracle for the two categories of follow-up test cases.
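The shortest path example can be sketched in code. Below, a small Dijkstra implementation stands in for the program under test (the graph and helper are our own illustration); the two follow-up checks correspond to the two metamorphic relations listed above.

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra on an undirected weighted graph given as
    {node: {neighbor: weight}}. Returns (length, path).
    This stands in for the implementation under test."""
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, w in graph[u].items():
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return dist[dst], path[::-1]

# A small example graph (our own illustration, not from the paper).
graph = {
    "x": {"a": 1, "b": 4},
    "a": {"x": 1, "b": 2, "y": 6},
    "b": {"x": 4, "a": 2, "y": 3},
    "y": {"a": 6, "b": 3},
}

p, path = shortest_path(graph, "x", "y")  # the original test case

# MR 1: reversing the endpoints (the transformation T) must yield the
# same length; the known p serves as the oracle T'(f(x)).
assert shortest_path(graph, "y", "x")[0] == p

# MR 2: splitting at any intermediate vertex vK, the two partial
# lengths must sum to p.
for vk in path[1:-1]:
    assert shortest_path(graph, "x", vk)[0] + shortest_path(graph, vk, "y")[0] == p
```

Neither check requires knowing the true shortest path length in advance; a fault such as a wrong relaxation condition would typically break at least one of the relations.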
Metamorphic testing on machine learning classifiers
Xie et al (2011) proposed applying metamorphic testing to machine learning classifiers. They conducted tests on Weka version 3.5.7, using its implementations of k-Nearest Neighbors (kNN) and naive Bayes classifier (NBC) as examples. They discovered some unexpected assumptions in the kNN implementation, and found faults in the NBC implementation.
They also used mutation analysis to compare metamorphic testing with cross-validation, a common practice in data science. In other words, they intentionally injected errors into the implementations of the algorithms, and then used cross-validation and metamorphic testing to see whether the errors were detected. It turns out that some common programming errors are hard to detect by cross-validation but are detectable by metamorphic testing. The metamorphic relations they used are:
- MR-0: Consistency with affine transformation
- MR-1.1: Permutation of class labels
- MR-1.2: Permutation of the attribute
- MR-2.1: Addition of uninformative attributes
- MR-2.2: Addition of informative attributes
- MR-3.1: Consistency with re-prediction
- MR-3.2: Additional training sample
- MR-4.1: Addition of classes by duplicating samples
- MR-4.2: Addition of classes by re-labeling samples
- MR-5.1: Removal of classes
- MR-5.2: Removal of samples
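As an illustration of one of these relations, here is a hedged sketch of MR-1.2 (permutation of the attributes) against a tiny hand-rolled kNN classifier. This is our own toy implementation, not Weka's: applying the same column permutation to the training and test samples should leave every prediction unchanged.

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """A minimal k-nearest-neighbour classifier (squared Euclidean
    distance, majority vote), standing in for the implementation
    under test."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def permute_attrs(X, perm):
    """Apply the same column permutation to every sample (MR-1.2)."""
    return [[row[i] for i in perm] for row in X]

# Toy two-class training set (our own illustration).
train_X = [[1.0, 5.0], [1.2, 4.8], [0.9, 5.1],
           [4.0, 1.0], [4.2, 0.8], [3.9, 1.1]]
train_y = ["A", "A", "A", "B", "B", "B"]
test_X = [[1.1, 5.0], [4.1, 0.9]]

perm = [1, 0]  # swap the two attributes
for x in test_X:
    before = knn_predict(train_X, train_y, x)
    after = knn_predict(permute_attrs(train_X, perm), train_y,
                        [x[i] for i in perm])
    assert before == after  # MR-1.2: predictions must be unchanged
```

A distance computation that accidentally weighted columns by position, for instance, would violate this relation even if the classifier's cross-validation accuracy looked reasonable.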
These are intuitive relations among program outputs that can be checked without knowing the correct classification of any individual sample.
There have not been many reports on applying metamorphic testing to recently developed machine learning and AI algorithms, especially to deep learning models based on neural networks. However, we believe it is a promising approach to validating machine learning software. Equipped with more domain-specific metamorphic relations, metamorphic testing provides a significantly larger body of test samples with which to examine an implementation. It could be a good practice for testing the software in addition to cross-validation against curated datasets.
- J. Carver, N. P. C. Hong, and G. K. Thiruvathukal, Eds., Software engineering for science. Boca Raton: Taylor & Francis, CRC Press, 2017.
- E. J. Weyuker, “On testing non-testable programs,” The Computer Journal, vol. 25, no. 4, pp. 465–470, 1982.
- M. D. Davis and E. J. Weyuker, “Pseudo-oracles for Non-testable Programs,” in Proceedings of the ACM ’81 Conference, New York, NY, USA, 1981, pp. 254–257.
- T. Y. Chen, S. C. Cheung, and S. M. Yiu, “Metamorphic testing: a new approach for generating next test cases,” Technical Report HKUST-CS98-01, Department of Computer Science, Hong Kong University of Science and Technology, Hong Kong, 1998.
- V. Le, M. Afshari, and Z. Su, “Compiler Validation via Equivalence Modulo Inputs,” in Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, New York, NY, USA, 2014, pp. 216–226.
- T. Y. Chen et al., “Metamorphic Testing: A Review of Challenges and Opportunities,” 2017.
- X. Xie, J. W. K. Ho, C. Murphy, G. Kaiser, B. Xu, and T. Y. Chen, “Testing and validating machine learning classifiers by metamorphic testing,” Journal of Systems and Software, vol. 84, no. 4, pp. 544–558, Apr. 2011.
- C. Murphy, G. E. Kaiser, L. Hu, and L. Wu, “Properties of Machine Learning Applications for Use in Metamorphic Testing,” in SEKE, 2008, vol. 8, pp. 867–872.