ECMM444 Fundamentals of Data Science
Continuous Assessment 2
This continuous assessment (CA) comprises 60% of the overall module
assessment. This is an individual exercise and your attention is drawn to the
College and University guidelines on collaboration and plagiarism, which are
available from the College website. As a rule of thumb, to understand when
collaboration becomes plagiarism, consider the following:
it is OK when students communicate and support each other in better
understanding the concepts presented in the lectures;
it is not OK when students communicate how these concepts can be
combined and used to solve specific assignment questions.
Question 1
Acquire the Iris dataset using the following procedure:
from sklearn.datasets import load_iris
X,y = load_iris(return_X_y=True)
The data matrix contains 150 vectors (also called instances) with 4 attributes
each (i.e. it is a 150 x 4 matrix) and the vector y contains the class encoded as
the integers 0, 1, and 2.
a) Split the data matrix into two data matrices Dtr and Dts, each containing a
balanced number of instances per class (i.e. Dtr contains as many instances
from class k as Dts). Split the class vector y into ytr and yts accordingly (i.e.
the first instance in ytr is the class of the first instance in Dtr, etc.).
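For illustration only, a minimal sketch of one way such a balanced split could be produced (the helper name balanced_split and the 50/50 split fraction are assumptions, not part of the brief):

import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

def balanced_split(X, y, frac=0.5, seed=0):
    # Split X and y so that every class contributes the same fraction of its
    # instances to the training part (Dtr) and the rest to the test part (Dts).
    rng = np.random.default_rng(seed)
    tr_idx, ts_idx = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.where(y == c)[0])
        n_tr = int(len(idx) * frac)
        tr_idx.extend(idx[:n_tr])
        ts_idx.extend(idx[n_tr:])
    return X[tr_idx], X[ts_idx], y[tr_idx], y[ts_idx]

Dtr, Dts, ytr, yts = balanced_split(X, y)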
b) Define a function to compute the distance between two vectors (of arbitrary
dimension) as the length of the difference vector.
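A minimal sketch of such a distance function using the NumPy norm (the name distance is illustrative):

import numpy as np

def distance(u, v):
    # Length (Euclidean norm) of the difference vector u - v.
    return np.linalg.norm(np.asarray(u) - np.asarray(v))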
c) Using the distance function, build the function one_knn_predict that
implements the 1 nearest neighbor classification technique. This will later be
used to implement a k-nearest neighbor classifier. The function takes as input
two data matrices Dts and Dtr and a target vector ytr. For each instance in Dts
it returns the class associated with the closest (i.e. the least distant) instance in
Dtr.
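One possible shape for this function, assuming the distance() helper sketched in part b); an illustrative sketch, not a prescribed solution:

import numpy as np

def one_knn_predict(Dts, Dtr, ytr):
    # For each instance in Dts return the class of the closest (least distant)
    # instance in Dtr, using the distance() helper from part b).
    preds = []
    for x in Dts:
        dists = [distance(x, z) for z in Dtr]
        preds.append(ytr[int(np.argmin(dists))])
    return np.array(preds)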
d) Create a function fit_LDA that takes as input a data matrix and an
associated class vector and outputs the fit parameters for the Linear
Discriminant Analysis (LDA) classifier. Create a function test_LDA that takes as
input a data matrix and the fit parameters for the LDA classifier and returns a
prediction for each element in the data matrix. The two functions fit_LDA and
test_LDA form your implementation of an LDA classifier (i.e. do not use a third-party
library implementation for the LDA classifier).
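As a hedged illustration of what the fit/test pair might compute, the sketch below parameterises LDA by class priors, class means and a pooled covariance matrix, and classifies with the usual linear discriminant scores; details such as the exact packing of the fit parameters are left to you:

import numpy as np

def fit_LDA(D, y):
    # Class labels, priors, class means and the pooled (shared) covariance
    # matrix that parameterise the linear discriminant functions.
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([D[y == c].mean(axis=0) for c in classes])
    pooled = sum(np.cov(D[y == c], rowvar=False) * (np.sum(y == c) - 1)
                 for c in classes) / (len(y) - len(classes))
    return classes, priors, means, pooled

def test_LDA(D, params):
    classes, priors, means, pooled = params
    inv = np.linalg.inv(pooled)
    # One linear discriminant score per class for every row of D.
    scores = np.stack([D @ inv @ m - 0.5 * m @ inv @ m + np.log(p)
                       for m, p in zip(means, priors)], axis=1)
    return classes[np.argmax(scores, axis=1)]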
e) Use Dtr and ytr to fit your implementation of the 1 nearest neighbor
classifier using the one_knn_predict function and your implementation of the
LDA classifier. Compute the accuracy of the 1 nearest neighbor classifier and
the accuracy of the LDA classifier on Dts. The accuracy is the proportion of
true results (i.e. when the class predicted was the same as the true class) over
the total number of predictions.
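Assuming the helper functions sketched above, the accuracies can then be computed as the mean of an element-wise comparison, for example:

import numpy as np

# Accuracy = fraction of predictions equal to the true class.
acc_1nn = np.mean(one_knn_predict(Dts, Dtr, ytr) == yts)
acc_lda = np.mean(test_LDA(Dts, fit_LDA(Dtr, ytr)) == yts)
print(acc_1nn, acc_lda)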
(Total 30 marks)
Question 2
Acquire the Iris dataset as indicated in Question 1. Select only the instances
relative to a single class and denote the resulting data matrix as D.
a) Create a function add_missing that takes as input a data matrix D and a
number k and returns a data matrix D′ and a k × 2 matrix P. Each row of the
matrix P contains the row and column indices i, j of an entry Dij in D chosen
uniformly at random. The matrix D′ is a copy of D except for the entries
specified in P: the value in each such entry D′ij is the column average of D.
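An illustrative sketch of add_missing (whether the k positions are drawn independently, so they may repeat, or without replacement is a decision the brief leaves to you):

import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
D = X[y == 0]            # instances of a single class (class 0 chosen here)

def add_missing(D, k, seed=None):
    # Pick k entry positions uniformly at random and overwrite each
    # with the average of its column in D.
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, D.shape[0], size=k)
    cols = rng.integers(0, D.shape[1], size=k)
    P = np.column_stack([rows, cols])            # k x 2 index matrix
    D_prime = D.copy()
    D_prime[rows, cols] = D.mean(axis=0)[cols]   # column averages of D
    return D_prime, P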
b) Create a function impute that takes as input a data matrix M and a number
r and returns a data matrix M′. The matrix M′ is the reconstruction of M
using its r largest singular vectors and values (i.e. it is the truncated SVD
reconstruction of M).
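A minimal sketch of impute based on numpy.linalg.svd:

import numpy as np

def impute(M, r):
    # Truncated SVD reconstruction of M using its r largest singular
    # vectors and values.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]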
c) Compute E, the average length of the difference vectors between the
corresponding instances in D and in its reconstructed version.
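For example, E could be computed as follows (the name reconstruction_error is illustrative):

import numpy as np

def reconstruction_error(D, M):
    # Average length of the difference vectors between corresponding rows.
    return np.mean(np.linalg.norm(D - M, axis=1))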
d) Repeat the following 30 times:
from D generate D′ using k = 50;
apply impute to D′ using a specific r to compute M;
compute E between D and M.
Consider the average E over the 30 trials. Report a plot of the average E value
when the procedure is repeated 30 times, for k = 50 and r = 1, 2, 3, 4.
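Putting the previous sketches together, the experiment could be organised roughly as follows (assuming the add_missing, impute and reconstruction_error helpers sketched above and the single-class matrix D):

import numpy as np
import matplotlib.pyplot as plt

r_values = (1, 2, 3, 4)
avg_E = []
for r in r_values:
    # 30 independent trials for this value of r.
    trials = [reconstruction_error(D, impute(add_missing(D, k=50)[0], r))
              for _ in range(30)]
    avg_E.append(np.mean(trials))

plt.plot(r_values, avg_E, marker="o")
plt.xlabel("r (rank of the truncated SVD reconstruction)")
plt.ylabel("average E over 30 trials")
plt.show()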
(Total marks 30)
Question 3
a) Build a function make_data to generate a sample matrix from a random
multivariate Gaussian distribution. The function should take as input the vector
space dimension p and the desired number of samples. To define the
distribution, generate a random p-dimensional vector as the mean with values
in [−20, 20] and a random p × p covariance matrix with values in [−1, 1]. Your
procedure should guarantee that the covariance matrix is a positive definite
matrix.
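One possible sketch, which guarantees positive definiteness by building the covariance as A Aᵀ from a random matrix A with entries in [−1, 1]; note that the entries of the product are then not confined to [−1, 1], and how you reconcile the two requirements is one of the decisions left to you:

import numpy as np

def make_data(p, n, seed=None):
    # Sample n points from a p-dimensional Gaussian with a random mean in
    # [-20, 20] and a covariance built as A @ A.T from a random A with
    # entries in [-1, 1]; the small ridge keeps it strictly positive definite.
    rng = np.random.default_rng(seed)
    mean = rng.uniform(-20, 20, size=p)
    A = rng.uniform(-1, 1, size=(p, p))
    cov = A @ A.T + 1e-6 * np.eye(p)
    return rng.multivariate_normal(mean, cov, size=n)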
b) Using the function make_data, generate 3 sample matrices, each
containing 200 instances of dimension p = 4. Combine them in a single data
matrix D. Build a corresponding class vector y containing the identity of the
Gaussian distribution of origin for each instance.
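For example (assuming the make_data sketch above):

import numpy as np

samples = [make_data(p=4, n=200) for _ in range(3)]
D = np.vstack(samples)           # 600 x 4 data matrix
y = np.repeat([0, 1, 2], 200)    # distribution of origin for each instance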
c) Perform a Principal Component Analysis and compute D′ as the 2-dimensional
projection of D along the main components. Plot D′, distinguishing
the instance class by color.
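A sketch of the projection via an SVD of the centred data (the brief only forbids third-party implementations of the LDA and QDA classifiers, so a library PCA routine is an alternative):

import numpy as np
import matplotlib.pyplot as plt

# Centre the data and project it onto the two leading principal components.
Dc = D - D.mean(axis=0)
U, s, Vt = np.linalg.svd(Dc, full_matrices=False)
D_prime = Dc @ Vt[:2].T

plt.scatter(D_prime[:, 0], D_prime[:, 1], c=y)
plt.xlabel("first principal component")
plt.ylabel("second principal component")
plt.show()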
d) Create a function fit_QDA that takes as input a data matrix and an
associated class vector and outputs the fit parameters for the Quadratic
Discriminant Analysis (QDA) classifier. Create a function test_QDA that takes
as input a data matrix and the fit parameters for the QDA classifier and returns a
prediction for each element in the data matrix. The two functions fit_QDA and
test_QDA form your implementation of a QDA classifier (i.e. do not use a third-party
library implementation for the QDA classifier).
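In the same spirit as the LDA sketch, a QDA implementation might estimate a prior, mean and covariance matrix per class and score with the quadratic discriminant functions; again an illustrative sketch only:

import numpy as np

def fit_QDA(D, y):
    # Per-class prior, mean and covariance matrix.
    classes = np.unique(y)
    params = [(np.mean(y == c), D[y == c].mean(axis=0),
               np.cov(D[y == c], rowvar=False)) for c in classes]
    return classes, params

def test_QDA(D, fit):
    classes, params = fit
    scores = []
    for prior, mu, cov in params:
        inv = np.linalg.inv(cov)
        diff = D - mu
        # Squared Mahalanobis distance of every row of D to this class mean.
        maha = np.einsum("ij,jk,ik->i", diff, inv, diff)
        scores.append(np.log(prior) - 0.5 * np.linalg.slogdet(cov)[1] - 0.5 * maha)
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]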
e) Compute and plot the decision surface of your Quadratic Discriminant
Analysis classifier relative to the data matrix D′ and class vector y (hint: you
can use a set of test points arranged on a regular 2D grid). You should obtain a
plot similar to the following:
[example decision-surface figure not reproduced]
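A sketch of the grid-based decision-surface plot, assuming the fit_QDA/test_QDA sketches above and the projected data D_prime and labels y from parts b) and c):

import numpy as np
import matplotlib.pyplot as plt

qda_params = fit_QDA(D_prime, y)

# Evaluate the classifier on a regular grid covering the projected data.
x_min, x_max = D_prime[:, 0].min() - 1, D_prime[:, 0].max() + 1
y_min, y_max = D_prime[:, 1].min() - 1, D_prime[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                     np.linspace(y_min, y_max, 300))
grid = np.column_stack([xx.ravel(), yy.ravel()])
zz = test_QDA(grid, qda_params).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)                   # decision regions
plt.scatter(D_prime[:, 0], D_prime[:, 1], c=y, s=10)  # projected instances
plt.show()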
(Total marks 40)
Submitting your work
Please write your student ID in the first cell of the notebook. You should submit
the Jupyter notebook containing the code with its output for all the questions.
Make a separate cell for each point a), b), c), etc. of each question. Submit a
single archive file (.zip or .tgz) containing both a PDF copy of your
notebook and the source file with extension .ipynb. Markers will not be
able to give feedback if you do not submit the PDF version of your code and
marks will be deducted if you fail to do so.
Marking criteria
Work will be marked against the following criteria. Although the emphasis varies a bit
from question to question, the criteria all have approximately equal weight.
Does your algorithm correctly solve the problem? In most of the
questions the required code has been described, but not always in
complete detail and some decisions are left to you.
Is the code syntactically correct? Is your program a legal Python program
regardless of whether it implements the algorithm?
Is the code beautiful or ugly? Is the implementation clear and efficient or
is it unclear and extremely inefficient (e.g. it takes more than a few minutes
to execute)? Is the code well structured? Have you made good use of
functions? Are you using NumPy functions on entire arrays when possible?
Is the code well laid out and commented? Is there a comment describing
what the code does? Have you used space to make the code clear to
human readers?
There are 10% penalties for:
Not submitting the PDF version of your programs.
Not creating functions as instructed in the questions.