School of Computer Science, University of St Andrews
CS5011, 2020-21

CS5011 A3: Learning
Pump it Up: Data Mining the Water Table

Assignment: A3 (Assignment 3)
Deadline: 7th of April 2021
Weighting: 25% of module mark

Please note that MMS is the definitive source for deadline and credit details. You are expected to have read and understood all the information in this specification and any accompanying documents at least a week before the deadline. You must contact the lecturer regarding any queries well in advance of the deadline.

1 Objective

This practical aims to construct and use artificial neural networks to solve a water-pump classification problem with real-world data. The problem and the data themselves come with various challenges, which we will try to address one by one during the practical.

2 Competencies

• Design, implement, and document an artificial neural network system for classification problems and the relevant data preprocessing techniques.
• Understand different classification metrics and apply them in specific contexts.
• Understand, use and adapt a neural network within an AI system.

3 Practical Requirements

3.1 Introduction

Access to safe water is one of the most critical necessities and a very basic human right. However, according to a 2018 report by the World Bank (https://www.worldbank.org/en/news/press-release/2018/03/20/improving-water-supply-and-sanitation-can-help-tanzania-achieve-its-human-development-goals), "more than 23 million citizens in Tanzania retrieve drinking water from unimproved sources, and 41 million people use unimproved sanitation facilities". In many rural areas of the country, the water pump systems are poorly maintained, resulting in low-quality water or no access to water at all. This has negatively affected many people's health and quality of life. The Tanzanian Ministry of Water and other NGOs have been seeking solutions to improve the maintenance operations at waterpoints, to make sure that clean water is accessible to local communities.

In this practical, we will look into a real-world dataset on the operating conditions of water pumps collected at several waterpoints across Tanzania. The data was provided by Taarifa and the Tanzanian Ministry of Water, and is downloadable from DrivenData's competition "Pump it Up: Data Mining the Water Table" (https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table) [1]. Each example in the competition dataset represents various pieces of information about a water pump (input features) and the pump's status (output label). There are three possible statuses: functional, functional needs repair, and non functional. The task is to predict the status of a pump.

This dataset comes with a number of challenging characteristics of real-world data. The input features are of mixed types (numerical, categorical and datetime; see the Appendix at the end of this document for the full list of features). Some parts of the data are missing. The dataset is also imbalanced, i.e., the amount of data per class (pump status) is uneven.

3.2 Core Tasks

Imagine we want to use our prediction model to make a maintenance plan for the water pumps. Due to limited resources, the engineers will only visit and fix the pumps labelled as functional needs repair. It is important that our prediction is correct, so that pumps that need repair are not skipped by the engineers. Since the main aim is to predict whether a pump needs repair, we will merge the two other classes (functional and non functional) into one.
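As a minimal sketch of this relabelling (assuming pandas, and assuming the label column is called status_group as in the original competition data; neither the file name nor the column name is prescribed by this specification):

```python
import pandas as pd

# Illustrative only: file name and label column name are assumptions.
df = pd.read_csv("task1_train.csv")

# 1 = functional needs repair, 0 = everything else (functional / non functional).
df["needs_repair"] = (df["status_group"] == "functional needs repair").astype(int)
```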
The resulting binary classification problem, in which we want to predict whether a pump needs to be repaired, will be used for all the tasks of this part.

The dataset provided by the competition is imbalanced. However, we will first start with a very simple setting in which we only consider numeric input features and balanced data (Task 1). We will then move to Task 2, where all input features provided by the competition are considered (but still with balanced data). In Task 3 and Task 4, we will investigate the issues caused by imbalanced data (Task 3) and how to deal with them (Task 4). We will use artificial neural networks (NNs) as the prediction models for all tasks in this part.

The datasets used for each task are available as .csv files on studres and are mentioned in the task specifications below. For simplicity, you can assume that the column order is fixed as in the files provided, and that the numeric input features have no missing data.

Task 1: Building a neural network prediction model using only numeric input features

For this task, we will consider a very simple setting in which only numeric input features are used for the prediction. The dataset for this task can be found on studres (task1_train.csv, task1_test.csv, and task1_test_nolabels.csv).

Your task is to design, build, and train a feed-forward neural network (NN) model to predict whether a pump needs repair using the given data. In addition, the trained network is to be used by the engineers to predict which, among a list of newly acquired waterpoints, need pump repairs according to their input conditions. The list of ids of those needing repairs is to be output so that the engineers can carry out the work, together with information about the quality of this prediction. The evaluation metric used for this part is classification accuracy.

The following points should be considered in your implementation and your report:

• What data preprocessing should be done?
• Assuming that we use two hidden layers for the NN, what are the number of input and output nodes, the number of hidden nodes per hidden layer, and the choice of activation function and loss function?
• How is the training done (e.g., learning rate, mini-batch size, avoiding overfitting)?

Your submitted program must support three functionalities, namely train, test and predict, as described below:

• train: Given an input training dataset (task1_train.csv), your program should train an NN prediction model and save it (together with any necessary data preprocessing) to disk.
• test: Given your saved (and trained) NN prediction model (and data preprocessors) and a test dataset (task1_test.csv), your program should load the trained model, apply the data preprocessing to the test data, and output the corresponding predictions to a text file. It should also output to the screen the evaluation metric value(s) on the given test set.
• predict: Given your saved (and trained) NN prediction model (and data preprocessors) and a dataset of pumps without labels (task1_test_nolabels.csv), your program should load the trained model, apply the data preprocessing to the given dataset, and output to the screen the list of ids of pumps needing repair.

You must also include in your submission your final trained model (produced by the train functionality) on the dataset given for this task. For more details about the requirements on the submitted program and on how to organise your submission, see Section 4.
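To make the shape of these three functionalities concrete, a minimal Python sketch is given below. It assumes scikit-learn, the binary label derived as in the earlier snippet, and illustrative column names (NUMERIC, needs_repair, id) and file handling; none of these choices is prescribed by the specification, and a real submission would need its own argument parsing, feature selection and overfitting control.

```python
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# All names below (columns, label encoding, file layout) are assumptions.
NUMERIC = ["amount_tsh", "gps_height", "longitude", "latitude", "population"]
LABEL = "needs_repair"                                 # binary target derived earlier
LABELS = {1: "functional needs repair", 0: "other"}    # merged-class name is an assumption

def train(train_csv, model_prefix):
    df = pd.read_csv(train_csv)
    model = Pipeline([                                 # preprocessing and NN saved together
        ("scale", StandardScaler()),
        ("nn", MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)),
    ])
    model.fit(df[NUMERIC], df[LABEL])
    joblib.dump(model, model_prefix + ".joblib")

def test(test_csv, model_prefix, out_txt):
    df = pd.read_csv(test_csv)
    model = joblib.load(model_prefix + ".joblib")
    pred = model.predict(df[NUMERIC])
    pd.Series([LABELS[p] for p in pred]).to_csv(out_txt, index=False, header=False)
    print("accuracy:", accuracy_score(df[LABEL], pred))

def predict(nolabels_csv, model_prefix):
    df = pd.read_csv(nolabels_csv)
    model = joblib.load(model_prefix + ".joblib")
    pred = model.predict(df[NUMERIC])
    print("ids needing repair:", df.loc[pred == 1, "id"].tolist())   # "id" column assumed
```

This only shows the rough structure; a real submission would also need the A3main entry point described in Section 4.2, a train/validation split to monitor overfitting, and justification of the chosen architecture and hyperparameters.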
Requirements for the report can be found in Section 5.

Task 2: Building a neural network prediction model using all input features

This task is similar to the previous one. The difference is that categorical input features are also used. Those extra features contain missing data, and some of them have a large number of categories, so the data preprocessing will be more involved. Your task is to identify the necessary preprocessing steps, implement them, build an NN prediction model for the given dataset, report the results and compare them with the results obtained in Task 1. The output requirements are the same as in Task 1. The dataset used for this task is also available on studres (task2_train.csv, task2_test.csv, and task2_test_nolabels.csv).

Task 3: Identifying issues with imbalanced data

For this task, we will use the whole dataset provided by the competition (but still with two output classes instead of three). This dataset is imbalanced, i.e., the number of examples per class is uneven. The dataset can be found on studres (task3_train.csv, task3_test.csv, and task3_test_nolabels.csv).

The first step is to use the program created in Task 2 to train an NN model on the new dataset and evaluate it. After that, analyse the results and discuss in your report the potential issues with using classification accuracy as the evaluation metric. Which alternative metrics should we look at instead in such situations? The required submission for this task includes the report and the trained NN used in your analysis.

Task 4: Dealing with imbalanced binary classification

This task is a continuation of Task 3, in which we look into solutions for dealing with imbalanced binary classification problems. We really do not want to miss the true functional needs repair pumps in our prediction, as it means they will be skipped by the engineers. Of course, incorrectly identifying too many pumps as functional needs repair is also not great, given the limited resources we have. Propose your solutions for this task, implement them and discuss your results in the report.

The required outputs for this task are similar to the ones described in Task 1. You can add more arguments to the programs' command lines if needed (for example, if you are investigating more than one solution approach, you can add an argument to your program to indicate which approach is being used), but remember to include precise instructions on how to run your programs with the modified argument list in your report. The output printed to the screen by the test program should also include all evaluation metrics used for this task.

3.3 Advanced-Level Tasks

For this part, you can choose at most two advanced-level tasks. Please note that there is no benefit in implementing more than two. It is strongly recommended that you ensure you have completed the core tasks in the previous section before attempting any of these requirements.

Task 5: Dealing with imbalanced multi-class classification

This task extends the scenario considered in Task 4 by dealing directly with the three output classes instead of merging the two majority classes into one. The dataset for this task is available on studres (task5_train.csv, task5_test.csv, and task5_test_nolabels.csv). Do your approaches from Task 4 still work as expected, or what changes do you make for this multi-class case? What do the results look like?
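For Tasks 3 to 5, two ingredients that often appear in solutions are (i) evaluation metrics that stay informative under class imbalance and (ii) rebalancing the training data. The sketch below illustrates both, assuming scikit-learn and the illustrative column names used earlier; it is one possible starting point, not a prescribed approach.

```python
import pandas as pd
from sklearn.metrics import balanced_accuracy_score, classification_report, confusion_matrix

def report_metrics(y_true, y_pred):
    # Per-class precision/recall/F1 and the confusion matrix expose failures
    # (e.g., never predicting the minority class) that plain accuracy hides.
    print(classification_report(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))
    print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))

def oversample_minority(train_df, label_col="needs_repair", minority_value=1, seed=0):
    # scikit-learn's MLPClassifier has no class_weight/sample_weight option, so
    # one simple counter-measure is to oversample the minority class before fitting.
    minority = train_df[train_df[label_col] == minority_value]
    majority = train_df[train_df[label_col] != minority_value]
    balanced = pd.concat([majority,
                          minority.sample(len(majority), replace=True, random_state=seed)])
    return balanced.sample(frac=1.0, random_state=seed)   # shuffle rows
```

Undersampling the majority class, generating synthetic minority examples, or switching to a model that accepts class weights are other options you may wish to compare.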
Now imagine a more realistic scenario. If we misclassify a functional needs repair pump as functional, the pump will remain in use but its issues will not be checked and repaired by the engineers. This can pose health risks to the community if the water is of low quality or contaminated, or cause other serious problems if, for example, the water pipes burst. On the other hand, if we misclassify a functional needs repair pump as non functional, the pump will be deactivated and checked later (after all the functional ones have been repaired), which wastes some resources. The former misclassification is likely more serious than the latter. This problem is called cost-sensitive classification: different misclassification cases carry different costs. At the same time, we still have the issue of imbalanced data. Discuss your solution approaches for this scenario, implement them and analyse the results.

The required outputs for this task are similar to the ones described in Task 1, but feel free to add more arguments to the programs' command lines if needed. The output printed to the screen by the test program should also include all evaluation metrics used for this task.

Task 6: Improving performance using hyperparameter tuning

The performance of a machine learning method can be improved by tuning its hyperparameters. For this task, you are asked to identify the possible hyperparameter choices for the neural network binary classifier built in Task 4, set up the tuning experiment properly, tune those hyperparameters using a method of your choice, and report the results. The required outputs for this task include the final trained NN model produced at the end of this task, and a test program to use the saved model as described in the previous tasks.

Task 7: Solving the imbalanced multi-class classification problem using other types of models

The scenario considered in this task is the same as the one described in Task 5. However, you are allowed to use other types of prediction models rather than neural networks. Discuss your solution approaches, implement them and report the results. The required outputs for this task include the final trained model produced, and a test program to use the saved model as described in the previous tasks.

Task 8: Making the NN prediction system more flexible

In the core tasks, the system is given the ability to predict the status of the pumps at newly seen waterpoints. In this task you are asked to build a new system that merges the train and predict functionalities, simulating a more dynamic system in which the engineers can input the status of additional waterpoints and their input conditions on the fly, in addition to receiving predictions. The dataset used for this task is the same as Task 4's. The system should provide a simple text-based interface for interacting with the engineers. The newly added data should be used to retrain the network. After training, the system should be ready to predict the status of newer waterpoints, or to accept newer data points, in a loop. In your report, please motivate your design choices and describe how your system deals with newly received training data.
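As an illustration for Task 6, the tuning experiment could be organised as a cross-validated grid search over a scikit-learn pipeline like the one sketched for Task 1. This is only a sketch: the parameter grid is arbitrary (chosen to show the mechanics, not as a recommended search space) and the scoring metric is an assumption.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("nn", MLPClassifier(max_iter=500, random_state=0)),
])

param_grid = {                                    # illustrative values only
    "nn__hidden_layer_sizes": [(32,), (64, 32), (32, 16)],
    "nn__alpha": [1e-4, 1e-3, 1e-2],              # L2 regularisation strength
    "nn__learning_rate_init": [1e-3, 1e-2],
}

# Cross-validation runs on the training data only; the held-out test set is
# kept for the final evaluation of the selected configuration.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5, n_jobs=-1)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```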
4 Code Specification

4.1 Programming Languages and Libraries

For this practical, you can choose either Java or Python as the programming language. You may only use Python if you are already familiar with this programming language. Your implementation must run (and compile) without the use of an IDE. Your system should be compatible with the version of Java/Python available on the School Lab Machines (Java version: Amazon Corretto 11 (JDK 11); Python version: ≥ 3.7). The DL4J library (https://deeplearning4j.konduit.ai/) and the scikit-learn library (https://scikit-learn.org/) are used for Java and Python implementations, respectively.

For a Java implementation: the DL4J website recommends using maven to manage all dependencies. In case you are not familiar with maven, we provide a .jar library file which includes all the basic dependencies required by DL4J. This can be found in the example DL4J code on studres (A3ExampleWithoutMaven.zip). We also provide example DL4J code with maven (A3ExampleWithMaven.zip). You can use either of them as starter code for your Java implementation. Important note: the provided DL4J library .jar file is quite heavy (almost 1GB); therefore, DO NOT include that file in your submission even if you choose to use it. We will copy our local version of that file into your submission folder during marking. Also, if you choose to use maven, remember not to include the .jar files downloaded by maven in your submission; the maven pom.xml file is sufficient.

For a Python implementation: if you use external libraries not included in scikit-learn, you must provide a requirements.txt file (see https://pip.pypa.io/en/stable/reference/pip_freeze/) listing those extra Python packages.

Some useful links:

• Access to the School lab PCs:
  https://systems.wiki.cs.st-andrews.ac.uk/index.php/Lab_PCs
  https://systems.wiki.cs.st-andrews.ac.uk/index.php/Working_remotely
  https://systems.wiki.cs.st-andrews.ac.uk/index.php/Video_tutorials
• Access to Python on the School lab PCs:
  https://systems.wiki.cs.st-andrews.ac.uk/index.php/Python_on_Windows_lab_clients
  https://systems.wiki.cs.st-andrews.ac.uk/index.php/Python_on_Linux
• DL4J:
  The official DL4J website: https://deeplearning4j.konduit.ai/
  DL4J blog: https://blog.konduit.ai/
  DL4J examples repository: https://github.com/eclipse/deeplearning4j-examples
  DL4J community forum, where you can post questions and get support: https://community.konduit.ai/
• scikit-learn: the official website of scikit-learn should have everything you need: https://scikit-learn.org/

4.2 Code Submission and Running

4.2.1 Code Running

Your code should run with the following command:

java A3main
or

python A3main.py

More specifically, the three Java command lines for Task 1 should be:

java A3main task1 train
java A3main task1 test
java A3main task1 predict

Examples:

java A3main task1 train task1_train.csv task1_NN
java A3main task1 test task1_test.csv task1_NN task1_test_predictions.txt
java A3main task1 predict task1_test_nolabels.csv task1_NN

Note that the task1_NN argument in the commands above is the prefix of the files in which the trained NN (and its preprocessing) are saved. For example, when the user runs the command line

java A3main task1 train task1_train.csv task1_NN

the trained NN and its data preprocessors will be saved in a file (or multiple files, depending on your implementation) with name(s) starting with task1_NN. The output text file produced by the test functionality (task1_test_predictions.txt in the example above) contains the predictions for the input dataset. Each i-th line of this output file must contain one single predicted label (e.g., functional needs repair) corresponding to the prediction for the i-th input sample.

4.2.2 Code Submission

Your source code should be placed in a directory called A3src/ and should include all non-standard external libraries besides DL4J and scikit-learn, as described in Section 4.1. Your submission must include all the required components described for each task. Please do not put the datasets provided on studres into your submission! Submit the whole folder as a single .zip file to MMS.

Your source code should be well structured and well commented wherever possible. Where any code from other sources is used, you must provide a clear description of which code is yours. Please note that code and submissions that do not adhere to these instructions may not be accepted.

5 Report

You are required to submit a report describing your submission as a PDF, with the structure and requirements presented in the additional document CS5011_A_Reports. The report should also include clear instructions on how to run your code. The core sections about design and implementation have an advisory limit of 1000 words for all tasks in total. The evaluation and analysis sections have an advisory limit of an additional 300 words per task. Consider the following points in your report:

1. Describe all components of your solution approaches, including data preprocessing methods, the network's structure and its hyperparameters (e.g., number of hidden layers, number of hidden nodes, learning rate), the evaluation metrics used, and the experimental setup (e.g., how was the data split? how were the evaluations done?), together with insights into why they were chosen.
2. Report the results and analyse them thoroughly. You are encouraged to use summary tables and graphs where possible to visualise your results and analysis, and to support your findings.
3. DO NOT put in your report large tables with lots of numbers, or many similar graphs that might take up dozens of pages, as they will make it difficult for the readers to go through your report and understand the key points. Use summary statistics and merge graphs together where possible. The detailed results of all experiments should be provided in separate .csv files instead.
4. Provide a summary of all your key findings for each task (either at the beginning or at the end of the report's section for the relevant task).

6 Deliverables

A single ZIP file must be submitted electronically via MMS by the deadline. Submissions in any other format will be rejected. Your ZIP file should contain:

1. A PDF report as discussed in Section 5.
2. Your code and the trained NN models as described in Section 4 and in the specification of each task.
7 Assessment Criteria

Marking will follow the guidelines given in the School Student Handbook. The following factors will be considered:

• Achieved requirements and quality of the implementations provided.
• Quality of the report, including analysis and insights into the proposed solutions and results.

Some guideline descriptors for this assignment are given below:

• For a mark of 7 to 11: the submission implements Task 1. At the higher end of this range, the implementation, experiments, evaluations and report must be adequately done.
• For a mark of 11 to 13: the submission provides the outputs expected for the higher range of the previous band and for Task 2; the implementation, experiments and evaluations should be adequate. At the higher end of this range, the report and comparisons should be of good quality.
• For a mark of 13 to 15: the submission provides the outputs expected for the higher range of the previous band and for Task 3. At the higher end of this range, a good analysis of the results of Task 3, with precise and insightful discussion of the evaluation metrics, should be provided.
• For a mark of 15 to 17: the submission provides the outputs expected for the higher range of the previous band and for Task 4. At the higher end of this range, a very good implementation with a clear, insightful and well-written report should be provided.
• For a mark of 17 and above: the submission provides the outputs expected for the higher range of the previous band, together with either good attempts at two advanced-level tasks (including implementation, experiments and report), or excellent implementation, experiments and report on one advanced-level task.

8 Policies and Guidelines

Marking: See the standard mark descriptors in the School Student Handbook:
https://info.cs.st-andrews.ac.uk/student-handbook/learning-teaching/feedback.html#Mark_-Descriptors

Lateness Penalty: The standard penalty for late submission applies (Scheme B: 1 mark per 8 hour period, or part thereof):
https://info.cs.st-andrews.ac.uk/student-handbook/learning-teaching/assessment.html#latenesspenalties

Good Academic Practice: The University policy on Good Academic Practice applies:
https://www.st-andrews.ac.uk/students/rules/academicpractice/

Nguyen Dang, Alice Toniolo
[email protected]andrews.ac.uk
March 10, 2021

References

[1] Peter Bull, Isaac Slavitt, and Greg Lipstein. Harnessing the power of the crowd to increase capacity for data science in the social sector. arXiv preprint arXiv:1606.07781, 2016.

Appendix 1: List of all features

1. amount_tsh: Total static head (amount of water available to the waterpoint)
2. date_recorded: The date the row was entered
3. funder: Who funded the well
4. gps_height: Altitude of the well
5. installer: Organization that installed the well
6. longitude: GPS coordinate
7. latitude: GPS coordinate
8. wpt_name: Name of the waterpoint, if there is one
9. num_private
10. basin: Geographic water basin
11. subvillage: Geographic location
12. region: Geographic location
13. region_code: Geographic location (coded)
14. district_code: Geographic location (coded)
15. lga: Geographic location
16. ward: Geographic location
17. population: Population around the well
18. public_meeting: True/False
19. recorded_by: Group entering this row of data
20. scheme_management: Who operates the waterpoint
21. scheme_name: Who operates the waterpoint
22. permit: If the waterpoint is permitted
23. construction_year: Year the waterpoint was constructed
24. extraction_type: The kind of extraction the waterpoint uses
25. extraction_type_group: The kind of extraction the waterpoint uses
26. extraction_type_class: The kind of extraction the waterpoint uses
27. management: How the waterpoint is managed
28. management_group: How the waterpoint is managed
29. payment: What the water costs
30. payment_type: What the water costs
31. water_quality: The quality of the water
32. quality_group: The quality of the water
33. quantity: The quantity of water
34. quantity_group: The quantity of water
35. source: The source of the water
36. source_type: The source of the water
37. source_class: The source of the water
38. waterpoint_type: The kind of waterpoint
39. waterpoint_type_group: The kind of waterpoint
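The list above mixes numeric, categorical and date features, some with missing values and some with very many categories. For Task 2, one possible preprocessing arrangement is sketched below; the column groupings, the choice to derive year/month features from date_recorded, and the decision to leave out very high-cardinality columns (e.g., funder, installer, wpt_name) are illustrative assumptions, not requirements.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["amount_tsh", "gps_height", "longitude", "latitude",
           "population", "construction_year", "recorded_year", "recorded_month"]
CATEGORICAL = ["basin", "region", "extraction_type_class", "payment_type",
               "water_quality", "quantity", "source_class", "waterpoint_type"]

def add_date_features(df):
    # Turn the date_recorded string into simple numeric features before fitting.
    recorded = pd.to_datetime(df["date_recorded"])
    out = df.copy()
    out["recorded_year"] = recorded.dt.year
    out["recorded_month"] = recorded.dt.month
    return out

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), NUMERIC),
    # Missing categories become an explicit "missing" token; handle_unknown
    # avoids failures on categories that only appear at test/predict time.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="constant", fill_value="missing")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), CATEGORICAL),
])
# Typical use: preprocess.fit_transform(add_date_features(train_df)), then feed
# the result to the neural network; the fitted transformer is saved with the model.
```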