University of Wollongong
School of Computing and Information Technology
CSIT314 Software Development Methodologies, Autumn 2020

Group Project – Part II (20 marks)
Project title: Developing an automated testing tool
Due (to be submitted via Moodle): 10:30 am, Thursday of Week 13 (11 June 2020)

What to submit: Produce (1) a project report detailing the group's work (indicate each and every team member's contribution on the cover page, as this group work assesses individual contributions) and submit it together with (2) your source code. In addition to documenting your detailed work, your project report should include instructions (with screenshots as evidence) showing how to compile and run your programs, as well as a demo (again, using screenshots) of the features your tool provides for automated test case generation, execution, and result verification.

How to submit: Submit a zip file under "Group Project Part II" in Moodle. Your zip file should include (1) your project report (a PDF file) and (2) your source code. The Moodle submission must be made by only ONE member of your group by the deadline. Remember to name your submission file using your group ID.

Project description
1. Read the paper "A Testing Tool for Machine Learning Applications" attached at the end of this document.
2. Follow the style of the above paper to develop a software testing tool for any domain of your choice (it does not have to be the machine learning domain; for example, it could be the testing of booking.com, a search engine, or a compiler), using any programming language available in the University lab (e.g., C, C++, Java, etc.).
3. Key requirements:
- You must follow the Test Driven Development methodology.
- Your tool must support automated test case generation (at least randomly).
- Your tool must support automated execution of the software under test.
- Your tool must support automated result checking and test report generation.

Marking criteria
- Have you correctly followed the Test Driven Development (TDD) methodology? Please show the test data / test suite you designed and executed for each iteration of your TDD process. The quality of the test data and appropriate refactoring are important marking criteria. (5 marks)
- To what degree can your tool assist with test data generation? (5 marks)
- To what degree can your tool assist with test executions? (2 marks)
- To what degree can your tool check the correctness or appropriateness of the test results automatically? That is, the test oracle you designed. (8 marks)

Note: A "test oracle" is a mechanism, or a method, with which the tester can decide whether the outcomes of test case executions are correct or acceptable. A test oracle answers the question "how can we know whether the test results are correct or acceptable?" Your automated testing tool must implement an oracle in order to decide whether the test has passed or failed.
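To make the three key requirements and the oracle concept concrete, the following is a minimal illustrative sketch in Python (your tool may be written in any lab language); the software under test here, a hypothetical sorting routine, and all names are invented for illustration only:

import random
from collections import Counter

# Minimal sketch of the three required capabilities. The software under
# test (a sorting routine) and all names are hypothetical.

def software_under_test(xs):
    return sorted(xs)  # stand-in for the real program being tested

def generate_test_cases(n, size=10, lo=-100, hi=100):
    # Requirement: automated (random) test case generation.
    return [[random.randint(lo, hi) for _ in range(size)] for _ in range(n)]

def oracle(inp, out):
    # Requirement: automated result checking. The output must be sorted
    # and must contain exactly the same elements as the input.
    is_sorted = all(a <= b for a, b in zip(out, out[1:]))
    return is_sorted and Counter(inp) == Counter(out)

def run_tests(n=100):
    # Requirement: automated execution, plus a simple test report.
    failures = []
    for case in generate_test_cases(n):
        result = software_under_test(case)
        if not oracle(case, result):
            failures.append((case, result))
    print(f"{n} tests executed, {len(failures)} failed")
    return failures

if __name__ == "__main__":
    run_tests()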
A Testing Tool for Machine Learning Applications

Yelin Liu
School of Computing and IT, University of Wollongong, Wollongong, NSW 2522, Australia
[email protected]

Yang Liu
Booking.com B.V., Herengracht 597, 1017 CE Amsterdam, Netherlands
[email protected]

Tsong Yueh Chen
Department of Computer Science & Software Engineering, Swinburne University of Technology, Hawthorn, VIC 3122, Australia
[email protected]

Zhi Quan Zhou∗
School of Computing and IT, University of Wollongong, Wollongong, NSW 2522, Australia
[email protected]

∗ All correspondence should be addressed to Dr. Z. Q. Zhou.

ABSTRACT
We present the design of MTKeras, a generic metamorphic testing framework for machine learning, and demonstrate its effectiveness through case studies in image classification and sentiment analysis.

KEYWORDS
Metamorphic testing, metamorphic relation pattern, MR composition, oracle problem, neural network API, Keras, MTKeras

ACM Reference Format:
Yelin Liu, Yang Liu, Tsong Yueh Chen, and Zhi Quan Zhou. 2020. A Testing Tool for Machine Learning Applications. In IEEE/ACM 42nd International Conference on Software Engineering Workshops (ICSEW'20), 2020. ACM. https://doi.org/10.1145/3387940.3392694

1 INTRODUCTION
Researchers have applied metamorphic testing (MT) to test machine learning (ML) systems in specific domains such as computer vision, machine translation, and autonomous systems [8, 9]. Nevertheless, the current practice of applying MT to ML is still at an early stage. In particular, the identification of metamorphic relations (MRs) is still largely a manual process, not to mention the implementation (coding) of MRs into test drivers. MRs are the most important component of MT, referring to the expected relations among the inputs and outputs of multiple executions of the target program [3]. It has been observed that MRs identified for different application domains often share similar viewpoints, hence the introduction of the concept of metamorphic relation patterns (MRPs) [5, 9]. For example, "equivalence under geometric transformation" is an MRP that can be used to derive a concrete MR for the time series analysis domain and another concrete MR for the autonomous driving domain [9].

In this research, therefore, we ask the following research questions. RQ1: Can we develop a generic, domain-independent automated metamorphic testing framework to allow developers and testers of ML systems to define their own MRs? Here, "define" means "identify and implement." RQ2: What is the applicability and effectiveness of our solution? To address RQ1, we have developed and open-sourced the first version of an automated metamorphic testing framework named MTKeras, which allows the users to define their own MRs based on a prescribed collection of operators. We have also conducted preliminary case studies to investigate RQ2.

2 MTKERAS: AN MT FRAMEWORK FOR ML
ML platforms and libraries, such as TensorFlow and Theano, are now widely available to allow users to develop and train their own ML models. We have built our MT framework, MTKeras, on the Keras platform. Keras (https://keras.io) is a popular high-level neural networks API, developed in Python and working on top of low-level libraries: backend engines such as TensorFlow and Theano can be plugged seamlessly into Keras. The Keras API empowers users to configure and train a neural network model on datasets for various tasks such as image classification or sentiment analysis. MTKeras enables automated metamorphic testing by providing the users with an MR library for testing their ML models and applications. We have designed the MR library based on the concept of a hierarchical structure (levels of abstraction) of MRPs [9]. MTKeras also allows the users to define and run new MRs through the composition of multiple MRs. The source test cases are provided by the users, whereas follow-up test cases are generated by MTKeras. MR-violating tests are automatically recorded during testing.
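In outline, the loop that such a framework automates can be sketched as follows; the function and parameter names below are illustrative placeholders, not MTKeras's actual internals:

# Conceptual sketch of the metamorphic testing loop automated by a
# framework such as MTKeras. `transform` (an MRIP) and `relation_holds`
# (an MROP) are placeholder functions, not MTKeras's actual API.

def metamorphic_test(predict, source_inputs, transform, relation_holds):
    violating_cases = []
    for src in source_inputs:
        followup = transform(src)  # follow-up test case generation
        if not relation_holds(predict(src), predict(followup)):
            violating_cases.append((src, followup))
    return violating_cases         # MR violations recorded automatically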
The design of MTKeras is centered around two basic concepts: metamorphic relation input patterns (MRIPs) [9] and metamorphic relation output patterns (MROPs) [5], which describe the relations among the source and follow-up inputs and outputs, respectively. Both MRIPs and MROPs can have multiple levels of abstraction. Examples of MRIPs include replace (changing the value of part of the input to another value; cf. MRreplace of [6]), noise (adding noise to the input data; cf. [7]), and additive and multiplicative (modifying the input by addition and multiplication, respectively; cf. the "metamorphic properties" defined by Murphy et al. [4]). Examples of MROPs include subsume/subset [5, 10], equivalent, and equal [5]. MTKeras is extendable as it allows a user to plug in new MRIPs and MROPs and configure them into concrete MRs. We have implemented it as a Python package for ease of use and open-sourced it on GitHub (https://github.com/lawrence415610/Mtkeras.git).
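As an illustration of the two pattern kinds, a noise MRIP and an equal MROP could be written as plain functions like the sketch below; these are hypothetical helpers for exposition, not MTKeras's actual plug-in code:

import numpy as np

# Illustrative MRIP "noise": add a single random noise point to an image
# (cf. [7]). A hypothetical helper, not MTKeras's implementation.
def noise(image):
    follow_up = image.copy()
    x = np.random.randint(follow_up.shape[0])
    y = np.random.randint(follow_up.shape[1])
    follow_up[x, y] = 255  # set one pixel to maximum intensity
    return follow_up

# Illustrative MROP "equal": source and follow-up outputs must match.
def equal(source_output, followup_output):
    return np.array_equal(source_output, followup_output)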
The user can perform MT in a simple and intuitive way by writing a single line of code in the following format:

Mtkeras(<testSet>, <dataType>[, <mlModel>]).<MRIP>()[.<MROP>()]

where <testSet> points to the place where the source test cases are stored; <dataType> declares the type of each element (test case) of <testSet> (e.g., grayscaleImage, colorImage, text, etc.); <mlModel> (optional) gives the name of the ML model under test; <MRIP> represents an MRIP or a sequence of MRIPs; and "[.<MROP>()]" represents an optional MROP. Note that <mlModel> and <MROP> always go together: they are either both present or both absent. For example, when testing an image classification model, we could write:

Mtkeras(myTestSet, colorImage, myDNNModel).noise().fliph().equal()

which tells MTKeras to use "myTestSet" (an array name) as the set of source test cases, where each test case is a color image, and to generate follow-up test cases by first adding a noise point to each image and then horizontally flipping it. The name of the ML model under test is "myDNNModel." The last term, equal(), tells MTKeras to check whether the classification results for the source and follow-up test cases are the same. MTKeras then performs MT automatically and identifies all the violating cases. Mtkeras returns an object, and the violating cases are stored in its variable named "violatingCases." Note that the model name "myDNNModel" and the MROP "equal()" are optional, without which MTKeras will return a set of follow-up test cases without further tests. The user can then use this set of test cases for various purposes, including but not limited to MT (such as for data augmentation).

3 TWO CASE STUDIES WITH MTKERAS
All ML models tested in our case studies are taken from the official website of Keras [1]. The first case study uses the MNIST handwritten digit dataset [2], which contains 70,000 grayscale 28 × 28-pixel images of handwritten digits, with 60,000 and 10,000 images being the training and test sets, respectively. The Keras model under test has a 98.40% accuracy. To test this ML model, we first define MR1 as follows: Increasing the gray scale (by 10%) of each of the images should not improve the ML model's accuracy because, intuitively, a darker image is more difficult to recognize. As expected, our experiment returns 0 violations. We then define MR2 as follows: Adding a random noise point to each of the images should not improve the ML model's accuracy because, intuitively, an image with a noise point is more difficult to recognize. It is interesting to find that, out of 1,000 experiments (hence 1,000 pairs of source and follow-up accuracy scores), 387 violated MR2, meaning that in 38.7% of the situations, adding a random noise point helped to increase the model's accuracy. We plan to investigate the reasons for this phenomenon further; for example, is it because the addition of the noise point happened to accentuate an important ML feature for a specific model? A more interesting observation is that combining MR1 and MR2 yielded a violation rate of 45.5% (455 out of 1,000), providing evidence that the composition/combination of MRs can effectively increase fault-detection effectiveness.

The second case study applies MTKeras to test four different types of ML models (CNN, RCNN, FastText, and LSTM [1]) trained on an IMDB sentiment classification dataset, a collection of movie reviews labeled "1" for positive and "0" for negative sentiment. We define MR3 as follows: Randomly shuffling (permuting) the words in each movie review shall dramatically reduce the accuracy of the ML models. The permutative MRIP is very popular in MT practice (cf. [4]).
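A permutative MRIP for text inputs is simple to express; the following sketch is a hypothetical helper for illustration, not MTKeras's code:

import random

# Illustrative permutative MRIP (cf. [4]): randomly shuffle the words of
# a movie review to produce the follow-up input.
def shuffle_words(review):
    words = review.split()
    random.shuffle(words)
    return " ".join(words)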
The validity of MR3 is obvious, as shuffling the words makes the sentence meaningless. The experimental results, however, are surprising. The results of 100 MT experiments show that shuffling the words only decreases the accuracy by a very small degree (RCNN: around 4%, CNN: around 7%, FastText: 0%, LSTM: around 3.5%), indicating that the ML models under test are insensitive to word order. This case study shows that MRs can help to enhance system understanding, confirming our previous report [9].

4 CONCLUSIONS AND FUTURE WORK
We have presented the design of MTKeras, a generic metamorphic testing framework on top of the Keras machine learning platform, and demonstrated its applicability and problem-detection effectiveness through case studies in two very different problem domains: image classification and sentiment analysis. We have shown that the composition of MRs can greatly improve the problem-detection effectiveness of individual MRs, and that MRs can help to enhance the understanding of the underlying ML models. This work demonstrates the usefulness of metamorphic relation patterns. We have open-sourced MTKeras on GitHub. Future research will include an investigation of the time cost associated with the learning curve for a novice tester to use the tool, as well as further extensions and larger-scale case studies of the framework.

ACKNOWLEDGMENTS
This work was supported in part by an Australian Government Research Training Program scholarship and a Western River entrepreneurship grant. We wish to thank Morphick Solutions Pty Ltd, Australia, for supporting this research.

REFERENCES
[1] [n.d.]. https://github.com/keras-team/keras/tree/master/examples
[2] [n.d.]. http://yann.lecun.com/exdb/mnist/
[3] T. Y. Chen, F.-C. Kuo, H. Liu, P.-L. Poon, D. Towey, T. H. Tse, and Z. Q. Zhou. Metamorphic testing: A review of challenges and opportunities. ACM Computing Surveys 51, 1 (2018), 4:1–4:27. https://doi.org/10.1145/3143561
[4] C. Murphy, G. Kaiser, L. Hu, and L. Wu. 2008. Properties of machine learning applications for use in metamorphic testing. In Proceedings of the 20th International Conference on Software Engineering and Knowledge Engineering. 867–872.
[5] S. Segura, J. A. Parejo, J. Troya, and A. Ruiz-Cortés. Metamorphic testing of RESTful web APIs. IEEE Transactions on Software Engineering 44, 11 (2018), 1083–1099. https://doi.org/10.1109/TSE.2017.2764464
[6] L. Sun and Z. Q. Zhou. 2018. Metamorphic testing for machine translations: MT4MT. In Proceedings of the 25th Australasian Software Engineering Conference (ASWEC 2018). IEEE, 96–100.
[7] C. Wu, L. Sun, and Z. Q. Zhou. 2019. The impact of a dot: Case studies of a noise metamorphic relation pattern. In Proceedings of the IEEE/ACM 4th International Workshop on Metamorphic Testing (MET '19). IEEE, 17–23.
[8] J. M. Zhang, M. Harman, L. Ma, and Y. Liu. 2019. Machine learning testing: Survey, landscapes and horizons. https://arxiv.org/abs/1906.10742
[9] Z. Q. Zhou, L. Sun, T. Y. Chen, and D. Towey. Metamorphic relations for enhancing system understanding and use. IEEE Transactions on Software Engineering. https://doi.org/10.1109/TSE.2018.2876433
[10] Z. Q. Zhou, S. Zhang, M. Hagenbuchner, T. H. Tse, F.-C. Kuo, and T. Y. Chen. Automated functional testing of online search services. Software Testing, Verification and Reliability 22, 4 (2012), 221–243.