Introduction to Data Science - EMAT20011
Final Assessed Coursework 2020-2021
Instructions

Please submit by: 1pm Friday May 7th, 2021, via Blackboard.
Please state your name and 7-digit student-number (not candidate number) at the top of your report - it can
be found on your card. This is an individual coursework; you are expected to work alone.
This assessed coursework is based on the examples and lab practice of the course, and it is worth 100% of the unit. Please
submit your work as a PDF document, containing all text and results of your work. Use MATLAB to complete the tasks
described below. Keep your answers short. Please include the basic MATLAB commands used to generate each result.
This document is divided into 2 parts: instructions and tasks.

INSTRUCTIONS
Please find the file uspsdata.mat in Blackboard, in the same location where you found this
coursework script. This contains the data that you will analyse. Save it locally, and use the
command load uspsdata.mat in MATLAB to import the file. It contains two matrices: one is the
training set and one is the test set. Please test that you can import the data as soon as possible, and
contact the TAs if you find that you cannot. This data represents a set of square images and their
labels. Each image is a row in a matrix. Labels are in the first column, pixels are in the subsequent
256 positions. They represent handwritten digits, and there are 10 classes (10 different labels).
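
The row layout described above can be sketched in a few lines of MATLAB. This is only an illustration on synthetic data: the variable names (data, labels, pixels, img) are hypothetical, since this script does not state the names of the matrices stored in uspsdata.mat, and the 0-9 label range is an assumption.

```matlab
% Synthetic stand-in for one of the matrices in uspsdata.mat.
% The real variable names are not given in this script; they can be
% listed with: whos('-file', 'uspsdata.mat')
rng(0);                                   % reproducible random numbers
nImages = 20;
data = [randi([0 9], nImages, 1), rand(nImages, 256)];

labels = data(:, 1);                      % first column: class label
pixels = data(:, 2:end);                  % remaining 256 columns: pixel values

side = sqrt(size(pixels, 2));             % 256 pixels give a 16-by-16 image
img  = reshape(pixels(1, :), side, side); % one row becomes one square image

imagesc(img);                             % display the image
colormap(gray);
axis image off;
title(sprintf('Label %d', labels(1)));
```

reshape fills the matrix column by column, so if the real images appear rotated or mirrored, transposing the result (img') may be needed. That depends on how the pixels were stored, which this script does not specify.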

For each question you are expected to report an answer, in the form of a figure or a plot or a table,
as requested. It may also ask about design choices. The methods used should be mentioned but not
explained at length, unless requested. Cite sources if you are using any material external to the course.

Please upload a PDF document containing the answers through the online submission system in
Blackboard, by the deadline. Include in the PDF document the key lines of MATLAB that you used to
produce the results.
TASKS:

● Q1 - Download the USPS dataset from Blackboard (file: uspsdata.mat, attached to the same post
where this coursework script was posted). Save it and import it into MATLAB with load
uspsdata.mat. Report descriptive statistics for each of the two datasets (e.g., size,
dimensions, number of classes, histogram or pie chart of the relative sizes of each class, etc.). Visualise
some of the data items, noting that the first entry of each data item is its “class label” and the
remaining 256 entries represent the entries of a square matrix. Visualise 4 randomly selected
images, using the reshape, subplot and imagesc commands of MATLAB. Each item will be
visualised as a square image. (10 marks)

● Q2 - Visualize the datapoints in two dimensions using either Principal Components Analysis (PCA)
or Multidimensional Scaling (cmdscale). Each item will be a point in a space. Use different colors
for each class. (10 marks)
● Q3a - Cluster the training data (the full-dimensional pixel vectors, without the labels) using k-means. Choose an
appropriate number of clusters. Measure how well the clusters match the class labels by cross-tabulating them
with crosstab. Compare the different distance measures available in kmeans: is one of them
better than the others? Notice that the centroids can be regarded as datapoints themselves:
visualize the centroid of each cluster using imagesc and subplot. Each centroid can be seen as an
image. (20 marks)

● Q3b - Discuss why answers may be different in two identical runs. Use crosstab to quantify the
difference between clusters induced by two successive runs of k-means. (10 marks)

● Q4 - Consider one specific class of images / vectors (that is, images having the same class label).
Compute the mean squared distance (MSD) across all pairs of images / vectors within that class.
Compare it with the same quantity measured on a random set of vectors of equal size sampled from
the whole dataset. Generate a histogram showing the distribution of the MSD for random subsets of
images. Compare this histogram with the MSD of the chosen class. What can you conclude from
these results? (20 marks)

● Q5 - Using the dedicated MATLAB command (fitctree), train (fit) a decision tree to separate class 5
from the other 9 classes. Train on the training set, then test its performance on the test set. Report
a confusion matrix for this 2-class problem. Repeat with a Support Vector Machine (SVM,
fitcsvm). Compare the confusion matrices. (20 marks)

● Q6 - Repeat the above task (Q5) using k-Nearest Neighbour (fitcknn) and compare the results.
(10 marks)
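
The one-against-the-rest setup shared by Q5 and Q6 can be illustrated with a minimal sketch. It runs on synthetic data, not on uspsdata.mat. All variable names are hypothetical, the 0-9 label range and the choice of 3 neighbours are assumptions, and it requires the Statistics and Machine Learning Toolbox.

```matlab
% Synthetic stand-in for the training and test sets (not the USPS data).
rng(1);                                   % reproducible random numbers
nTrain = 200; nTest = 100; nPixels = 256;
trainLabels = randi([0 9], nTrain, 1);    % assumed label range 0..9
testLabels  = randi([0 9], nTest, 1);

% Shift the class-5 rows so that the toy problem is learnable.
trainX = rand(nTrain, nPixels) + 0.5 * (trainLabels == 5);
testX  = rand(nTest,  nPixels) + 0.5 * (testLabels  == 5);

% Turn the 10-class labels into a 2-class problem: 1 = class 5, 0 = rest.
targetClass = 5;
yTrain = double(trainLabels == targetClass);
yTest  = double(testLabels  == targetClass);

% Fit on the training set, predict on the test set.
model = fitcknn(trainX, yTrain, 'NumNeighbors', 3);
yPred = predict(model, testX);

% Rows are true classes, columns are predicted classes, in the order [0 1].
C = confusionmat(yTest, yPred, 'Order', [0 1]);
disp(C);
```

fitctree and fitcsvm accept the same (X, Y) call pattern, so the same relabelling and confusionmat steps carry over when comparing the three classifiers.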


Please submit your report as a PDF file, including figures and the key MATLAB commands used to generate
the results.
