CO832: Data Mining and Knowledge Discovery
Assessment 2: Practical Data Analysis
Instructions and General Marking Scheme
Marek Grześ
March 6, 2020
This document is complemented by a video that is available on the module’s Moodle
page. You are strongly encouraged to watch the video before you start working on this
assessment.
You can do this assessment either individually or in a small group with just two people. In the
latter case, the group must hand in a single assessment, and the two students in the group will get
the same mark.
This assessment is worth 10% of the total marks for this module.
1 Objectives
Section 3 presents the dataset that you will analyse in this assessment. The dataset defines a
standard classification problem. Your task can be summarised as follows:
1. Select a classification algorithm that you will use for your analysis. Decision trees and classi-
fication rules were introduced in lectures, and you are encouraged to use them, but you may
want to explore other types of classification algorithms. You should not use instance-based
classification algorithms (e.g. k-NN) because they will make parts of your analysis challenging.
Naïve Bayes is strongly biased, and you may not be able to reduce its bias easily. For this
reason, you should avoid Naïve Bayes in this assessment too.
2. Having the algorithm selected in the previous step, you will explore the performance of the
algorithm on the data. The goal of your exploration is to tune the parameters of the algorithm
such that the 10-fold cross-validation accuracy is maximised. By doing that, you will also
explore the training error of the method.
3. Having the results obtained in the previous step, you will analyse the results referring to the
bias-variance trade-off that was presented in lectures. Specifically, your challenge is to identify
results and their parameter settings that lead to low bias, low variance, and a good trade-off
between bias and variance. In order to receive the highest mark for this assessment, you will
need to argue about the properties of the data. This discussion should be supported by the
results of your bias-variance trade-off analysis. For example, you can argue whether the data
that you are analysing is noisy or not, or whether a complex decision boundary is required. It
will be helpful for you to read (James et al., 2013, Secs. 2.2.1–2.2.2)1 and study their Fig. 2.12
to see how the properties of the data can influence the bias-variance trade-off.
4. The last step of your investigation is to analyse the models learned from the data. In partic-
ular, you will identify attributes that influence the class attribute. For example, inspecting
the decision trees learned from the data, you will look for attributes that are important. Im-
portance will depend on the position of the attribute in the decision tree or on the number
of nodes in which the attribute appears. If your model is Naïve Bayes, you will look for
attributes that have high probability P (attribute|class). Your goal in this step is basically to
look into the models that you will obtain, and to extract domain specific knowledge about
the data from the models.
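The steps above can be sketched in Python with scikit-learn. This is an illustrative sketch only, not a prescribed method: the choice of DecisionTreeClassifier, the use of the Iris data as a stand-in for the diabetes dataset, and the particular parameter grid are all assumptions.

```python
# A minimal sketch of steps 2-3: tune a decision tree with 10-fold
# cross-validation and record both training and CV accuracy.
# DecisionTreeClassifier, the Iris stand-in data, and the parameter
# grid are illustrative assumptions, not requirements.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in for diabetes.arff

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"min_samples_leaf": [1, 2, 5, 10, 20, 50]},
    cv=10,                    # 10-fold cross-validation, as required
    return_train_score=True,  # also record accuracy on the training data
)
grid.fit(X, y)

print("best parameters:", grid.best_params_)
print("best CV accuracy: %.3f" % grid.best_score_)
# A large gap between mean_train_score and mean_test_score for a setting
# suggests high variance (overfitting); low training accuracy suggests
# high bias (underfitting).
```

The full `grid.cv_results_` dictionary contains the per-setting training and cross-validation scores needed for the bias-variance discussion in step 3.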
2 What to submit?
Submit a technical report that will have the following structure:
• A title page with your name and Kent login.
• One-page report that will present your findings, observations, and technical analysis of your
results. Note that it is not sufficient to present observations. You need to analyse your
observations using technical terms. Mentioning observations without explaining and justifying
them will not give you the highest marks.
• An appendix that will contain figures and tables that you want to include to support your
discussion in the technical report. The figures and tables in the appendix should have infor-
mative and self-explanatory captions. This means that the figures and tables can to a large
extent be self-contained. The size of the appendix is unlimited, and you can have as many
figures and tables as you want (but you don’t have to add many if you don’t need to).
Note that the main part of the technical report has a strict limit of one page, and your marks
will be reduced for violating this restriction. All pages should be of the usual size A4, margins
of at least 1 inch, and the font size 12pt or larger.
If you produce any code for this assessment in any programming language, the code should
be submitted as well. The code and the report will be separate parts in the submission page.
3 Dataset
The dataset is called Pima Indians Onset of Diabetes. Each instance represents medical details for
one patient and the task is to predict whether the patient will have an onset of diabetes within the
next five years. There are 8 numerical input variables all of which have varying scales. You can
1. The e-book is available at https://www.kent.ac.uk/library/
learn more about this dataset on the UCI Machine Learning Repository
(http://www.ics.uci.edu/~mlearn/MLRepository.html). Top results on this dataset are in the
order of 77% accuracy.
The dataset is available in a file called diabetes.arff on the module’s web page on Moodle. The
file extension “.arff” means it is a file in a format specifically suitable for WEKA.
The class attribute in the diabetes.arff dataset is called “class”. Make sure that you use this
attribute as a class attribute in your investigation.
4 Data Format
After a few lines at the top of the file defining the dataset name, its attributes and corresponding
values, each line in the file represents a data example/instance, and it consists of several attribute
values. One of those attributes is then chosen as a class attribute, whose value is to be predicted
by a classification algorithm. Missing values (if present in a dataset) are represented using the “?”
symbol.
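For example, pandas can map the “?” symbol to a proper missing value when a file is read. The tiny CSV below is made-up data used purely to show the mechanism (the diabetes file itself may contain no “?” entries):

```python
# Treat "?" as a missing value when loading data with pandas.
# The two-row CSV here is invented purely for illustration.
import io
import pandas as pd

csv_text = "preg,plas,class\n6,148,tested_positive\n1,?,tested_negative\n"
df = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(df)
print(df.isna().sum())  # the "?" in column plas has become NaN
```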
The arff format is a standard data format in Weka, but it can be imported in Python easily. The
following script can read an example dataset.
Listing 1: load-arff-data.py
import os
from scipy.io import arff
import pandas as pd

data = arff.loadarff('/home/mgrzes/Documents/Teaching/datasets/diabetes.arff')
df = pd.DataFrame(data[0])
print("Original arff data:")
print(df.head())
# above we can see that we need to decode the last column
df['class'] = df['class'].str.decode("utf-8")
# strings are fine now
print("Data with fixed strings:")
print(df.head())
# let's save this data frame in the CSV format
df.to_csv(r'/tmp/diabetes.csv', index=False, header=True)
print("The CSV file:")
os.system("head -n 5 /tmp/diabetes.csv")
# you can load the csv data easily
df = pd.read_csv("/tmp/diabetes.csv")
# preview the first 5 lines of the loaded data
print("Read from the CSV file:")
print(df.head())
The output of this program is below.
Original arff data:
   preg   plas  pres  skin   insu  mass   pedi   age               class
0   6.0  148.0  72.0  35.0    0.0  33.6  0.627  50.0  b'tested_positive'
1   1.0   85.0  66.0  29.0    0.0  26.6  0.351  31.0  b'tested_negative'
2   8.0  183.0  64.0   0.0    0.0  23.3  0.672  32.0  b'tested_positive'
3   1.0   89.0  66.0  23.0   94.0  28.1  0.167  21.0  b'tested_negative'
4   0.0  137.0  40.0  35.0  168.0  43.1  2.288  33.0  b'tested_positive'
Data with fixed strings:
   preg   plas  pres  skin   insu  mass   pedi   age            class
0   6.0  148.0  72.0  35.0    0.0  33.6  0.627  50.0  tested_positive
1   1.0   85.0  66.0  29.0    0.0  26.6  0.351  31.0  tested_negative
2   8.0  183.0  64.0   0.0    0.0  23.3  0.672  32.0  tested_positive
3   1.0   89.0  66.0  23.0   94.0  28.1  0.167  21.0  tested_negative
4   0.0  137.0  40.0  35.0  168.0  43.1  2.288  33.0  tested_positive
The CSV file:
preg,plas,pres,skin,insu,mass,pedi,age,class
6.0,148.0,72.0,35.0,0.0,33.6,0.627,50.0,tested_positive
1.0,85.0,66.0,29.0,0.0,26.6,0.351,31.0,tested_negative
8.0,183.0,64.0,0.0,0.0,23.3,0.672,32.0,tested_positive
1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0,tested_negative
Read from the CSV file:
   preg   plas  pres  skin   insu  mass   pedi   age            class
0   6.0  148.0  72.0  35.0    0.0  33.6  0.627  50.0  tested_positive
1   1.0   85.0  66.0  29.0    0.0  26.6  0.351  31.0  tested_negative
2   8.0  183.0  64.0   0.0    0.0  23.3  0.672  32.0  tested_positive
3   1.0   89.0  66.0  23.0   94.0  28.1  0.167  21.0  tested_negative
4   0.0  137.0  40.0  35.0  168.0  43.1  2.288  33.0  tested_positive
If you decide to use Python for this assessment, you don’t have to use the arff format. You can
find this dataset in alternative formats, and you can use those.
5 Software
You can choose one of the two options:
1. Weka, which won’t require any programming
2. Machine learning packages in Python, which will require writing code in Python
5.1 Option 1—Weka
WEKA is a freely-available data mining tool that is installed on university PCs. It implements
the J48 algorithm to learn decision trees that was introduced in our lectures. You can use this
algorithm for your investigation, or you can choose another classification algorithm. Note that J48
is WEKA's implementation of the algorithm known in the machine learning literature as C4.5
(C5.0 is its commercial successor).
The video that complements this document contains a brief introduction to WEKA, and it shows
the major features of this package that are required to do the assessment. The module’s Moodle
page includes a few additional documents and tutorials on WEKA.
5.2 Option 2—Python
There is no dedicated Python tutorial associated with this assessment. However, an example script
is provided that could be a good starting point for you. If you don’t want to use Python, you can simply
choose the first option and do this assessment in Weka.
The following script runs the k-NN algorithm on the Iris dataset, and plots a validation curve in
which the parameter k is varied. Note that the validation curves were explained in our lecture on
the bias-variance decomposition and inductive bias.
Listing 2: validation-curve.py
# https://scipy-lectures.org/packages/scikit-learn/index.html
# The validation curve is computed using the sklearn function validation_curve.
# Note that only one parameter can be varied when this function is used.
# If you decide to vary more than one parameter, you may need to plot
# a 3D surface using matplotlib or to generate several univariate plots
# (like the one generated by this program) for different values of the other
# parameters.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

# we use the standard Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# range of k in k-NN
nrange = np.arange(1, 31)
# model = make_pipeline(PolynomialFeatures(), LinearRegression())
model = KNeighborsClassifier()  # weights='uniform' by default
# we vary k in the k-NN algorithm
train_scores, validation_scores = validation_curve(
    model, X, y, param_name='n_neighbors', param_range=nrange)
# Plot the mean train score and validation score across folds.
# plot accuracy
plt.plot(nrange, validation_scores.mean(axis=1), label='cross-validation')
plt.plot(nrange, train_scores.mean(axis=1), label='training')
plt.legend(loc='best')
plt.title("Accuracy - NOTE HIGH BIAS ON THE RIGHT-HAND SIDE")
plt.show()
# plot error
plt.figure()
plt.plot(nrange, 1 - validation_scores.mean(axis=1), label='cross-validation')
plt.plot(nrange, 1 - train_scores.mean(axis=1), label='training')
plt.legend(loc='best')
plt.title("Error - NOTE HIGH BIAS ON THE RIGHT-HAND SIDE")
plt.xlabel('k in KNN')
plt.ylabel('Error')
plt.show()

[Figure 1: Output of the Python example included in this document: training and cross-validation error plotted against k in k-NN (error roughly in the range 0.00 to 0.05); note the high bias on the right-hand side.]

The output of this program is shown in Fig. 1.
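The same validation-curve code can be adapted to a decision tree by swapping the model and the parameter name. In the sketch below, varying max_depth is only one illustrative choice; other tree parameters would work equally well:

```python
# Validation curve for a decision tree: vary max_depth instead of k.
# Everything except the model and the parameter name mirrors Listing 2.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
depths = np.arange(1, 11)
train_scores, validation_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), iris.data, iris.target,
    param_name="max_depth", param_range=depths, cv=5)

# Deep trees fit the training data almost perfectly (low bias, high
# variance); very shallow trees underfit (high bias, low variance).
print(train_scores.mean(axis=1))
print(validation_scores.mean(axis=1))
```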
6 Implementation
Make sure the test mode for evaluating predictive accuracy is set to 10-fold cross-validation. This
will give you an estimate of the generalisation error on the unseen data. Note that when cross-
validation is used in WEKA, there is no information in the output about the error on the
training data. The % of correctly classified instances that is printed for cross-validation is the
average across the k folds of cross-validation. In order to obtain the % of correctly classified instances
on the training data in WEKA, you will need to repeat every experiment selecting “Use training
set” instead of “Cross-validation” in the “Test options” panel in the “Classify” tab. This means
that you will need to run every experiment (i.e., for every set of the parameters) twice to record
accuracy on the training data and on cross-validation.
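If you use Python instead, scikit-learn can record both numbers in a single run via cross_validate. The sketch below again uses the Iris data as a stand-in for diabetes.arff, and the decision tree is only an illustrative choice of classifier:

```python
# Record 10-fold CV accuracy and training accuracy in one call,
# avoiding the two-run WEKA workflow described above.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in for the diabetes data
scores = cross_validate(DecisionTreeClassifier(random_state=0), X, y,
                        cv=10, return_train_score=True)

print("mean CV accuracy:       %.3f" % scores["test_score"].mean())
print("mean training accuracy: %.3f" % scores["train_score"].mean())
```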
In order to implement your analysis and to address objectives stated in Sec. 1, you will need to run
your classification algorithm several times specifying different values of the algorithm’s parameters.
The parameters that you will adjust should be those that influence the bias-variance trade-off of
the algorithm that you will use. For example, if your algorithm is J48, you may want to tune the
parameters that are shown in Tab. 1 because those parameters control the size of the decision trees.

Parameter            Description
minNumObj            The minimum number of instances per leaf
unpruned             Whether pruning is performed
confidenceFactor     The confidence factor used for pruning (smaller values incur more pruning)
reducedErrorPruning  Whether reduced-error pruning is used instead of C4.5 (J48) pruning
numFolds             Determines the amount of data used for reduced-error pruning. One fold is
                     used for pruning, the rest for growing the tree.

Table 1: J48 parameters that can be explored in this assessment.
Note that the last objective in Sec. 1 may require different values of the parameters than the pre-
vious objectives because its goal is to interpret the model instead of purely increasing its predictive
accuracy.
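For the interpretation objective in Python, a small tree can be printed and its attribute importances inspected, as in the sketch below (scikit-learn on the Iris data, an illustrative stand-in; in WEKA you would instead read the J48 tree printed in the Classify output):

```python
# Inspect a learned tree: which attributes appear near the root, and
# what are their impurity-based importances?
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)  # small, readable tree
tree.fit(iris.data, iris.target)

rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)  # attributes tested near the root are the most informative
for name, imp in zip(iris.feature_names, tree.feature_importances_):
    print("%-20s %.3f" % (name, imp))
```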
7 Deadline
The printed technical report has to be handed in to the Student Administration Office by the
deadline for this assessment, which is specified in the Student Data System. It is your responsibility
to find out what time the Student Administration Office closes on the day of the deadline. If you
prefer to submit your report electronically, there will be a Turnitin link on Moodle that will allow
you to do so. For electronic submissions, the preferred format is PDF. If you submit ODT or DOCX,
your file will be automatically converted to PDF using my bash script that will run libreoffice on
your file.
If you submit electronically, please include your name and your Kent login on the title page of
your document. It takes a lot of time for us to deal with anonymous submissions after we have
printed them.
8 Time Estimated to Complete the Assessment
The time that students take to write a short report as required in this assessment varies significantly
across students; but as a rough estimate, students can be expected to spend about 20 hours to do
this assessment. This estimate refers to the total time to do the assessment, i.e., including the time
to read documentation about the parameters of the algorithms and their software implementation,
carry out the experiments and analyse the results, and write the technical report. Note that this
time estimate assumes that the students have been learning the module material on a regular basis.
If they did not engage in intensive self-study and reflection on the material provided in the lectures,
they may need to spend considerably more time on this assessment.
9 Notes on Plagiarism
Senate has agreed the following definition of plagiarism:
Plagiarism is the act of repeating the ideas or discoveries of another as one’s own.
To copy sentences, phrases or even striking expressions without acknowledgement in
a manner that may deceive the reader as to the source is plagiarism; to paraphrase
in a manner that may deceive the reader is likewise plagiarism. Where such copying
or close paraphrase has occurred the mere mention of the source in a bibliography will
not be deemed sufficient acknowledgement; in each such instance it must be referred
specifically to its source. Verbatim quotations must be directly acknowledged either in
inverted commas or by indenting.
The work you submit must be your own, except where its original author is clearly referenced. We
reserve the right to run checks on all submitted work in an effort to identify possible plagiarism,
and take disciplinary action against anyone found to have committed plagiarism. When you use
other peoples’ material, you must clearly indicate the source of the material.
10 General Marking Scheme
Your technical report will be assessed based on two main criteria: (1) technical quality, and (2)
the comprehensibility of the report. Technical quality, which is the most important criterion,
involves the correct use of technical terms, concepts and arguments. In general, the more advanced
(and correct) the technical concepts and arguments that you used in your report, the higher the
mark. The comprehensibility of the report involves the use of well-written sentences, which are
understandable and meaningful, as well as grammatically correct. It also involves the use of clear
figures to illustrate your arguments—for instance, you will lose marks if your figure includes text
in a very small font size which is hard to read. The more clearly (and correctly) written your text
is, and the clearer the figures are, the higher the mark.
Your report will be assigned a mark based on a categorical marking scale used by the University,
which includes a range of a few discrete numerical marks for each categorical mark, as follows:
Mark range: 100, 95, 85, 78, 75, 72 Marks within that range are allocated based on the extent
to which your technical report has the following characteristics: The analyses are of excellent
technical quality, reporting all the information required in the assessment’s instructions and with
many arguments that involve advanced technical concepts and are clearly and correctly explained—
with no technical mistakes.
Mark range: 68, 65, 62 Marks within that range are allocated based on the extent to which your
technical report has the following characteristics: The analyses are of very good technical quality,
reporting all the information required in the assessment’s instructions and with several arguments
that involve advanced technical concepts and are in general clearly and correctly explained—possibly
with a few relatively minor technical mistakes or a few hard-to-understand sentences.
Mark range: 58, 55, 52 Marks within that range are allocated based on the extent to which
your technical report has the following characteristics: The analyses are not of good quality, in
general, but at least the report contains most of the information required in the assessment’s
instructions. The technical arguments are not clearly and correctly explained—there are some
significant technical mistakes and possibly many hard-to-understand sentences. If the report is of
reasonable technical quality with arguments that show legitimate (although not advanced) technical
knowledge, then a higher mark in this range (e.g. 58) will be allocated.
The marks below (i.e., marks < 50) correspond to a “fail” mark since the pass mark for this
module is 50%.
Mark range: 48, 45, 42 Marks within that range are allocated based on the extent to which
your technical report has the following characteristics: The analyses are of poor quality, in general,
and/or the report contains a relatively small part of the information required in the assessment’s
instructions. The technical arguments are not clearly and correctly explained—there are many
significant technical mistakes and many hard-to-understand sentences, and/or too few technical
arguments.
Mark range: 38, 35, 32, 20, 10, 0 Marks within that range are allocated based on the extent to
which your technical report has the following characteristics: The analyses are of very poor quality,
in general, and/or the report lacks most parts of the information required in the assessment’s
instructions. The technical arguments cannot be understood, and/or the exiting arguments are
invalid.
References
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning
with applications in R, volume 112. Springer.