Test your Machine Learning Algorithm with Metamorphic Testing

Testing machine learning and AI algorithms is hard. In fact, testing scientific software in general is hard, and there has been some literature on this more general topic. As said in the introduction of Carver et al (2017)[1]:

The development of scientific software differs significantly from the development of more traditional business information systems, from which many software engineering best practices and tools have been drawn. These differences appear at various phases of the software lifecycle…

These differences appear throughout the following stages of the software lifecycle: requirement, design, coding, validation and verification, and deployment. In particular, the difficulties one usually faces in validating and verifying scientific software are, according to Carver et al:

  • Results are often unknown when exploring novel science or engineering areas and algorithms
  • Popular software engineering tools often do not work on the architectures used in computational science and engineering

Indeed, machine learning and AI algorithms are often the result of explorations by data scientists and researchers. When the resulting algorithms are subsequently implemented by engineers, some very important practices in software quality assurance are often difficult implement, due to the differences in working culture between the two roles.

The oracle problem

One of them is the oracle problem, as was discussed in Weyuker (1982)[2] extensively. In software testing, an oracle refers to a mechanism which can tell you whether a program is working correctly. For the simplest case, an oracle could be a direct comparison of the output of the program with the correct answer. More complicated oracles may involve running another program to determine if the output of the target program is correct.

For machine learning and AI algorithms, the problem is that often there is no oracle without human intervention. Take image recognition for example. A common way of solving image recognition problem is by supervised machine learning. This usually begins by curating a dataset of correctly labelled image for training and validating, which is itself a way of introducing an oracle for the software model in development by human intervention. This implies that the range of test conducted to the model will be limited by human efforts, and expanding the test cases, for example increasing the sampling size and varieties significantly with online testing, would be a problem.

Several methods have been proposed to mitigate the oracle problem, including the use of pseudo-oracle and metamorphic testing, which we will discuss later in this article.

Pseudo-oracle, proposed by Davis and Weyuker (1981)[3], means that we can have multiple implementations solving the same problem verifying each other. The outputs of all implementations are compared, and if one of them is different, it could indicate that the algorithm has some faults in it. In machine learning, for example, one of the methods to filter out adversarial examples is to compare the outputs of the model with and without feature squeezing. This is an application of pseudo-orcale to test a supervised machine learning model online.

In the following we will discuss metamorphic testing which is proposed by Chen et al (1998)[4]. We will start by introducing the concept of metamorphic testing as a general purpose toolkit in software testing, followed by explaining some research regarding applying metamorphic testing to machine learning software.

Metamorphic testing

Metamorphic testing is a software testing methodology to mitigate the oracle problem. The idea is simple: even if we do not know the correct output of a single input, we might still know the relations between the outputs of multiple inputs, especially if the inputs are themselves related. We can check the software for these relations, called metamorphic relation. If they don’t hold, it is a sure sign that the software has some defects.

Since its first publication nearly 20 years ago, metamorphic testing has shown many promising results and detects many defects in popular software such as GCC and LLVM[5]. These success stories are collected in the reviewing article by Chen et al (2017)[6].

An example given by Chen et al (1998) is to test whether a given path is the shortest path in an undirected weighted graph. This is a typical case in software testing where resorting to an oracle would be expensive when the graph is complicated. If (x, v1, …, vN, y) is the proposed shortest path between x and y with length p, Chen et al suggest a few follow-up tests to conduct on the algorithm, among them:

  • Find the shortest paths between y and x (in reverse direction), and verify that the length is also p.
  • Find the shortest paths between x and vK, and between vK and y. Verify that the sum of the lengths of the two paths is also p, for any K between 1 and N.

To put the methodology formally, let’s say f is the software we are testing and we have an input x with the output f(x). We don’t have an oracle for f(x) so verifying this single test case would be difficult. However, a known transformation T, which is based on a metamorphic relation, can be applied to x to generate the follow-up test case T(x). We can calculate f(T(x)), and then verify it with an oracle T'(f(x)) constructed from the known f(x) by using the metamorphic relation. In the above example of the shortest path problem, the known resulting length p becomes an oracle for the two categories of follow-up test cases.

Metamorphic testing on machine learning classifiers

Xie et al (2011)[7] proposed applying metamorphic testing to machine learning classifiers. They conducted tests on Weka version 3.5.7, using its implementations of k-Nearest Neighbors (kNN) and naive Bayes classifier (NBC) as examples. They discovered some unexpected assumptions in the kNN implementation, and in the case of NBC, found some faults in it.

They also compared the performance of cross-validation, which is a common practice in data science, with metamorphic testing by using mutation analysis. In other words, they changed the implementation of algorithms to inject errors intentionally, and then use cross-validation and metamorphic testing to see if the errors are detected. It turns out there are some common programming errors that are hard to detect by cross-validation, but are detectable by metamorphic testing.

The metamorphic relations used by Xie et al are a subset of the relations proposed by Murphy et al (2008)[8], including:

  • MR-0: Consistence with affine transformation
  • MR-1.1: Permutation of class labels
  • MR-1.2: Permutation of the attribute
  • MR-2.1: Addition of uninformative attributes
  • MR-2.2: Addition of informative attributes
  • MR-3.1: Consistence with re-prediction
  • MR-3.2: Additional training sample
  • MR-4.1: Addition of classes by duplicating samples
  • MR-4.2: Addition of classes by re-labeling samples
  • MR-5.1: Removal of classes
  • MR-5.2: Removal of samples

These are intuitive relations among program outputs.


There hasn’t been a lot of reports on applying metamorphic testing to recently developed machine learning and AI algorithms, especially to deep learning models based on neural networks. However, we believe this is a promising approach to validate machine learning software. Equipped with more domain specific metamorphic relations, metamorphic testing provides a significantly larger body of test samples to examine the implementation. This could be a good practice to test the software in addition to cross-validating against curated dataset.

  1. J. Carver, N. P. C. Hong, and G. K. Thiruvathukal, Eds., Software engineering for science. Boca Raton: Taylor & Francis, CRC Press, 2017. 
  2. E. J. Weyuker, “On testing non-testable programs,” The Computer Journal, vol. 25, no. 4, pp. 465–470, 1982. 
  3. M. D. Davis and E. J. Weyuker, “Pseudo-oracles for Non-testable Programs,” in Proceedings of the ACM ’81 Conference, New York, NY, USA, 1981, pp. 254–257. 
  4. T. Y. Chen, S. C. Cheung, and S. M. Yiu, “Metamorphic testing: a new approach for generating next test cases,” Technical Report HKUST-CS98-01, Department of Computer Science, Hong Kong University of Science and Technology, Hong Kong, 1998. 
  5. V. Le, M. Afshari, and Z. Su, “Compiler Validation via Equivalence Modulo Inputs,” in Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, New York, NY, USA, 2014, pp. 216–226. 
  6. T. Y. Chen et al., “Metamorphic Testing: A Review of Challenges and Opportunities,” 2017. 
  7. X. Xie, J. W. K. Ho, C. Murphy, G. Kaiser, B. Xu, and T. Y. Chen, “Testing and validating machine learning classifiers by metamorphic testing,” Journal of Systems and Software, vol. 84, no. 4, pp. 544–558, Apr. 2011. 
  8. C. Murphy, G. E. Kaiser, L. Hu, and L. Wu, “Properties of Machine Learning Applications for Use in Metamorphic Testing,” in SEKE, 2008, vol. 8, pp. 867–872. 

GDPR and its impacts on machine learning applications

The General Data Protection Regulation (GDPR) was adopted by European Parliament in April 2016, and will be enforceable throughout EU by May 2018. Many regulations regarding algorithmic decision-making are added to this new set of law, compared to the previous Data Protection Directive (DPD) which is expected to be superseded. In what follows we give an overview on the implicated technical challenges in algorithmic fairness and explainable AI, following the outlines given in Goodman et al (2016)[1].

Firstly we would like to say that we are not experts on EU law system. The opinions laid out in this article represents a summary of the research papers we have read on the subject matter, and we do our best to present them accurately. We believe that a mutual beneficial relationship built on trust between human and algorithms is not purely a technical problem. Therefore we are also concerned about the legal and social aspects of algorithms. By this article we hope to bring out more discussions on these very important issues.

Highlights in GDPR

According to Goodman et al, much of the regulations in the GDPR “clearly aimed at perceived gaps and inconsistencies in the EU’s current approach to data protection.” This includes, for example, a clear specification of the right to be forgotten, and regulations on collecting data from EU citizens by foreign companies.

There are three major differences between the GDPR and the previous DPD:

  1. GDPR is a Regulation, while DPD is a Directive. A Directive is a set of general rules which is only enforceable when each EU country transfer them into their national law. On the other hand a Regulation is similar to national laws, the only difference being it covers the whole region of EU. Therefore the GDPR will be enforceable in May 2018 without any additional legislative process.
  2. GDPR explicitly states that companies violating the regulation is subject to a penalty up to 20 million euro or 4% of their global revenue, whichever is higher. (Article 83, Paragraph 5)
  3. GDPR is not limited to companies with their headquarters set in EU, but to all companies that are holding data from EU citizen. (Article 3, Paragraph 1)

For the rest of the article we will focus on Article 22 of the GDPR regarding automated individual decision making:

Full text of GDPR Article 22


GDPR states in Article 22 Paragraph 4 that, decisions “which produces legal effects concerning him or her” or of similar importance shall not be based on the following categories of personal data specified in Article 9 Paragraph 1:

…personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation.

Under minimal interpretation, using the above categories of sensitive data directly in algorithms is prohibited. However as Goodman et al point out (and as we have been saying (Chinese) in our review of algorithmic fairness research), discrimination cannot be eliminated solely by excluding sensitive data in decision-making process, due to redundant encoding, redlining, and other problems.

Under maximal interpretation, it is possible that all data having some relations with the above categories of data have to be excluded. Goodman et al argue that this would introduce some technical challenges, and it could still not be enough to eliminate all biases from the algorithm, because:

  1. It could be difficult to completely remove the influences of sensitive data from an algorithm and still keep it useful.
  2. Uncertainty bais, where the algorithm has different levels of confidence to its predictions about different groups of people, due to one or some of the groups being underrepresented in the data. (This, however, might be addressed by the “equal opportunity” fairness proposed by Hardt et al[2].)

Please refer to our previous article “Are algorithms fair?” (Chinese) for the state of current algorithmic fairness research.

Due to these difficulties, Goodman et al argue that it could be hard to find an interpretation of the Regulation that suits all kinds of situation. Therefore it is likely that one will have to examine the actual usage and details of the algorithm in question in each of the cases, and the explainability of algorithms will be important.

Right to explanation

It has been said that GDPR has introduced the “right to explanation”, but what exactly is being granted in the Regulation might still be in debate.

GDPR Article 22 Paragraph 3 states that a data controller “shall implement suitable measures to safeguard…at least the right to obtain human intervention on the part of the controller, to express his or her point of view and to contest the decision”, otherwise a person has “the right not to be subject to a decision based solely on automated processing” (Paragraph 1).

As Wachter et al (2017)[3] point out, a right to explanation is not mentioned in the above text. Even with Recital 71, where an explanation is explicitly mentioned, it is dubious whether the right is legally binding. Wachter et al believe that in summary, what GDPR has introduced is a “right to be informed”, and this is not equivalent to the right to ask for explanations about any decisions made by algorithms.

On the other hand, Goodman et al believe that Article 13 to 15 give a person the right to access the data that’s been collected, and the right to know the purpose of collecting it, which includes the right to receive “meaningful information about the logic (algorithm) and possible impact.” For example, Article 13 Paragraph 2 (f) states that data controllers must inform the user about the followings before collecting data:

Full text of GDPR Article 13 Paragraph 2 (h)

Therefore it might be worthwhile to ask to what extend can one ask for an explanation about an algorithm. Goodman et al cite Burrell(2016)[4] for the following barriers to understanding an algorithm:

  1. Intentional concealment on the part of corporations or other institutions, where decision making procedures are kept from public scrutiny.
  2. Gaps in technical literacy which mean that, for most people, simply having access to underlying code is insufficient.
  3. A “mismatch between the mathematical optimization in high-dimensionality characteristic of machine learning and the demands of human-scale reasoning and styles of interpretation”

Recently, research in explainable machine learning and explainable AI has made substantial progress. Projects such as Explainable Artificial Intelligence of DARPA, and the 2016 ICML Workshop on Human Interpretability in Machine Learning, and IJCAI-17 Workshop on Explainable AI (XAI) Proceedings have a lot of information, which we will cover in future articles.

When GDPR is enforced

Apart from the regulations in GDPR which will be enforceable soon, there are discussions about further protections to human rights with regard to advancements in machine learning applications. Thelisson et al (2016)[5] draw a comparison between regulations on algorithms and EU regulations regarding food safety, and point out several possible measures for further protections:

  • Code of conduct
  • Quality label
  • Data chain transparency
  • Discrimination-aware machine learning research

These are directions worth following. Furthermore, as Wachter et al have pointed out in the previous section, the protective measures granted by the Regulation might not be as effective as expected. Therefore we suspect further amendments to GDPR might also be in progress.

  1. Goodman, B., & Flaxman, S. (2016). European Union regulations on algorithmic decision-making and a “right to explanation,” 1–9. Retrieved from http://arxiv.org/abs/1606.08813 
  2. M. Hardt, E. Price, and N. and Srebro, “Equality of Opportunity in Supervised Learning,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Barcelona, Spain: Curran Associates, Inc., 2016, pp. 3315–3323. 
  3. Wachter, S., Mittelstadt, B., & Floridi, L. (2017). Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation. International Data Privacy Law, 7(2), 76–99. Retrieved from https://academic.oup.com/idpl/article/3860948 
  4. J. Burrell, “How the machine ‘thinks’: Understanding opacity in machine learning algorithms,” Big Data & Society, vol. 3, no. 1, p. 2053951715622512, Jan. 2016. 
  5. Thelisson, E., Padh EPFL, K., & Elisa Celis EPFL, L. (2016). Regulatory Mechanisms and Algorithms towards Trust in AI/ML. Retrieved from http://home.earthlink.net/~dwaha/research/meetings/ijcai17-xai/9. (Thelisson, Padh, & Celis XAI-17) Regulatory Mechanisms and Algorithms towards Trust in AIML.pdf 

Recent developments in Adversarial Example, Part I

As of mid-2017 there are already many proposals to fool DNNs by adversarial example. Papernot et al (2016) proposed so far the most powerful such attack against image classifiers by exploiting transferability. According to the experiments conducted, the method can achieve nearly 90% misclassification rates for DNNs services hosted on Amazon and Google. This practical shows that black-box attack is feasible.

Adversarial example is a set of data that was created deliberately to fool DNNs. The topic started when Szegedy et al (2013) found that for DNNs trained with ImageNet or AlexNet, often a small amount of change in the input can result in vast differences in the output.

For example, suppose the model recognizes a picture of a truck correctly, it may only require changing relatively few pixels in the picture for the model to classify it differently, and the changes are tiny relative to the picture that human eyes are unlikely to recognize the differences.

This will be a two-part blog post on recent developments of adversarial examples on DNNs. We will start by mentioning the important properties of these examples, pointing readers to the two algorithms of finding them, and some published strategies of defense. In the second part we will collect some important developments in this regard since 2017. The readers may also wish to consult the excellent introduction Attacking Machine Learning with Adversarial Examples by Open AI.

Important properties of adversarial examples

There are some basic findings in Szegedy et al (2013):

  1. DNNs trained by different sections of the same dataset, with different network architecture, supposedly are different networks. However by experimentation it has been found that often they are mislead by the same adversarial example. In other words, an adversarial example found for one DNN can often be transferred to another DNN on the same problem.
  2. Finding adversarial example is basically an optimisation problem. One needs to look around an input data point by algorithms. It’s not goining to be easy to find one by randomly perturb the data.

So far what we can explain about the existence of adversarial example is that they are the results of the linear calculations in DNNs. This is explained in Goodfellow et al (2014) roughly as follows. The linear calculations in DNNs are equations such as , where and are weight and bias respectively, is the input, and is the output stimulation. Here and are both vectors, and is the inner product of the two vectors. Therefore when is parallel to , a small amount to changes represented by can results in vast differences in the output , because can be large. When the dimension of and is high (when there are lots of model features), is big enough to push the output over a decision boundary.

This view of adversarial examples explains a few things:

  1. Common regularization techniques such as dropout, pertaining, model average is unlikely to be effective against adversarial examples.
  2. The existence of adversarial examples is a result of the geometric properties of the decision boundary, so this also explains their transferability.
  3. For nonlinear models such as RBF network, adversarial examples are not as effective as to linear models. Of course, it is much harder to train nonlinear models. Therefore model programmers might be facing the choice of linear models which are easier to train and potentially unstable, or nonlinear models which are difficult to train but more robust.

Algorithms for finding adversarial examples

So far there are two main algorithms for finding adversarial examples:

  1. Fast Gradient Sign Method (FGSM) in Goodfellow et al (2014).

  2. Jacobian-based Saliency Map Approach (JSMA) in Papernot et al (2016).

The thread model of exploiting DNNs with adversarial examples generally has the following categories (Papernot et al (2016)).

Adversarial goals (from easy to difficult)

  1. Confidence reduction.
  2. Misclassification: change to output.
  3. Targeted misclassification: craft an input with a specific output.

Source/target misclassify: modify a specific input to a specific output.

Adversarial capabilities (from more to little)

  1. Architecture & training data
  2. Architecture
  3. Training data sample
  4. Oracle
  5. Samples

Under this thread model, the research we have mentioned so far all have the adversarial goal (4). As for their adversarial capabilities, scenarios before 2015 all require (2) which means the exploiters need to know the architecture of the target network, and since 2016 it has been (4) which means the exploiters only need to have access to some prediction results.

Note that through the scenarios in Tramer et al (2016) a malicious entity can gain information about the target network architecture, and then apply the information to generate adversarial examples.

Defense strategies

Two broad categories exist for defense: reactive and proactive.

  • Reactive: add defense against adversarial example after the model is trained. Many of the results concerns how to filter out adversarial examples from model input.
  • Proactive: train the model to be more resistant to adversarial examples.

The two main approaches to proactive defense are:

  • Adversarial training in Shaham et al (2015) uses generated adversarial examples to retrain the model, thereby reduce the changes caused by perturbation. Such retrained models are more stable locally and generating adversarial examples for them would be harder.
  • Defensive distillation in Papernot et al (2016) feeds the probability vectors produced by the model back to itself to distill the weights, causing the decision boundaries to be smoother and harder to find adversarial examples.

The more effective approach so far seems to be defense distillation, which can be shown to be effective against FGSM and JSMA. This is before Papernot et al (2016) and Carlini et al (2016) announced new scenarios of attack. Papernot et al (2017) attempts to harden defensive distillation but the results are limited.


  • Carlini, N., & Wagner, D. (2016). Towards Evaluating the Robustness of Neural Networks. Retrieved from http://arxiv.org/abs/1608.04644
  • Goodfellow, I. J., Shlens, J., & Szegedy, C. (2014). Explaining and Harnessing Adversarial Examples, 1–11. Retrieved from http://arxiv.org/abs/1412.6572
  • Papernot, N., & McDaniel, P. (2017). Extending Defensive Distillation. arXiv. Retrieved from http://arxiv.org/abs/1705.05264
  • Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., & Swami, A. (2016). Practical Black-Box Attacks against Machine Learning. https://doi.org/10.1145/3052973.3053009
  • Papernot, N., McDaniel, P., Wu, X., Jha, S., & Swami, A. (2016). Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks. Proceedings — 2016 IEEE Symposium on Security and Privacy, SP 2016, 582–597. https://doi.org/10.1109/SP.2016.41
  • Papernot, N., Mcdaniel, P., Jha, S., Fredrikson, M., Celik, Z. B., & Swami, A. (2016). The limitations of deep learning in adversarial settings. Proceedings — 2016 IEEE European Symposium on Security and Privacy, EURO S and P 2016, 372–387. https://doi.org/10.1109/EuroSP.2016.36
  • Shaham, U., Yamada, Y., & Negahban, S. (2015). Understanding Adversarial Training: Increasing Local Stability of Neural Nets through Robust Optimization, 1–12. Retrieved from http://arxiv.org/abs/1511.05432
  • Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2013). Intriguing properties of neural networks. https://doi.org/10.1021/ct2009208
  • Tramèr, F., Zhang, F., & Juels, A. (2016). Stealing Machine Learning Models via Prediction APIs. In Proceedings of the 25th USENIX Security Symposium (pp. 601–618). https://doi.org/10.1103/PhysRevC.94.034301