MATH38161 Coursework
Korbinian Strimmer
14 November 2022

Overview

This coursework is either about PCA of the MNIST data or about clustering using the K-means algorithm. Choose one of the two coursework tasks A and B (details follow further below):

- Task A is oriented towards data analysis and interpretation.
- Task B is about implementing the K-means algorithm in R.

The assignment will be marked out of 20. See below for the marking schemes for Tasks A and B.

You have two weeks to complete the coursework. The intended total workload is 10 hours.

Format

- You may use any document preparation system of your choice, but the final document must be a single PDF in A4 format. The text in the PDF must be machine-readable.
- Your report must include the complete analysis in a reproducible way: include the complete computer code, figures, text etc. in one document.
- Show your full name and your University ID on the title page of your report.
- Indicate which task (A or B) you are addressing.
- Recommended length: 4 pages of content (single sided) plus title page. Maximum length: 5 pages of content. Any content beyond 5 pages will be ignored and will not count.

Submission process and deadline

The deadline for submission is Monday 28 November 2022, 12 noon. Submission is online on Blackboard. Note that you can only submit one version of your coursework.

Copying and plagiarism

This is individual coursework: all text and analyses need to be done and written independently by yourself and in your own words! Cite and acknowledge all your sources. Do not copy anything, write everything yourself, including computer code!

Copying and plagiarism (= passing off someone else's work as your own) is a very serious offence and will be strictly prosecuted. As a minimum you will fail the coursework and receive zero marks. As 3rd year students you will also be referred to a Faculty level scientific misconduct panel and you may be subject to additional penalties.
For more details see the Guidance to students on plagiarism available at https://documents.manchester.ac.uk/display.aspx?DocID=2870 .

Coursework descriptions:

Task A: Dimension reduction for MNIST data using PCA

Your task is to analyse a small part of the MNIST image data set containing handwritten digits from 0 to 9, see https://en.wikipedia.org/wiki/MNIST_database and http://yann.lecun.com/exdb/mnist/ for details.

The data are contained in the file mnistTest.rda (size 2.8 MB) available on Blackboard. In total 10,000 images are contained in the test data, each containing 784 pixels (picture size 28 times 28). Each pixel takes an integer value between 0 and 255:

    load("mnistTest.rda")
    dim(mnistTest$x) # 10000 784
    ## [1] 10000 784
    range(mnistTest$x) # 0 255
    ## [1] 0 255

Here is a plot of the first 15 images:

    par(mfrow=c(3,5))
    for (k in 1:15) # first 15 images
    {
      m = matrix( mnistTest$x[k,] , nrow=28, byrow=TRUE)
      image(t(apply(m, 2, rev)), col=grey(seq(1,0,length=256)), axes = FALSE)
    }

The numerical value of the digit shown in each image is available as label (as R factor with 10 levels):

    mnistTest$y[1:15] # first 15 labels
    ## [1] 7 2 1 0 4 1 4 9 5 9 0 6 9 0 1
    ## Levels: 0 1 2 3 4 5 6 7 8 9

In this coursework your task is to analyse this data set using PCA.

- Compute the 784 principal components from the 784 original pixel variables.
- Compute and plot the proportion of variation attributed to each principal component.
- Show a scatter plot of the first two principal components. Use the known labels to colourise the scatter plot.
- Interpret and discuss the result.

Structure of the report

Your report should be structured into the following sections:

1. Dataset
2. Methods
3. Results and Discussion
4. References

In Section 1 provide some background and describe the data set.
In Section 2 briefly introduce the method(s) you are using to analyse the data.
In Section 3 run the analyses and present and interpret the results. Show all your R code so that your results are fully reproducible.
In Section 4 list all journal articles, books, Wikipedia entries, GitHub pages and other sources you refer to in your report.

Marking scheme

The assignment will be marked out of 20 as follows:

- Description of the data: 4 marks
- Description of the methods: 4 marks
- Results and Discussion section: 8 marks
- Overall presentation of report: 4 marks

Task B: Implementing the K-means algorithm in R

Your task is to create your own R implementation of the K-means algorithm and investigate its properties using synthetic simulated data.

In this task you need to implement everything yourself in pure R. Do not use any R package for clustering or any built-in functions for K-means.

Structure of the report

Your report should be structured into the following sections:

1. The K-means method
2. Implementation in R
3. Results and Discussion
4. References

In Section 1 explain in your own words how K-means works.
In Section 2 present your own plain R implementation of the K-means algorithm. The implementation must be elementary and must not rely on any external R package or on the internal built-in command for K-means.
In Section 3 investigate your implementation of the K-means algorithm by analysing simulated data and compare it with the built-in version. Show all your R code so that your results are fully reproducible.
In Section 4 list all journal articles, books, Wikipedia entries, GitHub pages and other sources you refer to in your report.

Marking scheme (total marks: 20)

- Description of theory: 4 marks
- Implementation in own R code: 4 marks
- Discussion of analysis and results: 8 marks
- Overall presentation of report: 4 marks
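
For the comparison asked for in Section 3 of Task B, one possible way to generate synthetic data and obtain a reference clustering from the built-in kmeans() function is sketched below. This is only an illustration under stated assumptions: the helper name simulate_clusters, the cluster centres, the sample size, the standard deviation and the seed are all hypothetical choices that are not prescribed by this coursework. The built-in call provides the baseline only and is not a substitute for your own implementation.

```r
# Hypothetical helper (not part of the coursework specification):
# draw n points per cluster from spherical Gaussians around the given centres.
simulate_clusters = function(centres, n = 50, sd = 0.5)
{
  K = nrow(centres)                                  # number of clusters
  d = ncol(centres)                                  # dimension of the data
  label = rep(1:K, each = n)                         # true cluster membership
  x = matrix(rnorm(K * n * d, sd = sd), ncol = d)    # Gaussian noise
  x = x + centres[label, , drop = FALSE]             # shift by cluster centre
  list(x = x, label = factor(label))
}

set.seed(1)                                  # arbitrary seed for reproducibility
centres = rbind(c(0, 0), c(4, 0), c(2, 4))   # assumed, well-separated centres
sim = simulate_clusters(centres, n = 50, sd = 0.5)

# Reference result from the built-in function (baseline for comparison only)
ref = kmeans(sim$x, centers = 3, nstart = 10)

# Cross-tabulate the true labels against the reference clustering
table(sim$label, ref$cluster)

# Total within-cluster sum of squares, to compare with your own implementation
ref$tot.withinss
```

Note that the cluster numbering returned by kmeans() is arbitrary, so a comparison with your own result should be made via a cross-tabulation or via the within-cluster sum of squares rather than by comparing the label values directly.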