ST3MVA Assignment 1 – Hand in Date: 12 (noon), 6th Nov 2019

Late work will be subject to University policy.

You can complete this assignment either individually or in pairs. If working in a pair, you must email me your names before 12 (noon) on 25th October.

The datasets are available to download from Blackboard (from the Assessments page).

Note the page limits: any work outside of these limits will not be read! R output, including graphs, is not included in the page limit, so please submit sensibly sized output.

Question 1:

Mid-infrared spectroscopy (MIR) is a method in which infrared light is beamed at a sample of matter and the absorption of the light is then measured; different absorptions are seen for different wavelengths of light. Different samples are expected to respond slightly differently depending on their makeup. For example, if samples of fruit were being tested, measurements at one wavelength might vary with the degree of ripeness, while measurements at another wavelength might vary with the concentration of juice. Such data can therefore be useful for modelling variables that are otherwise hard to measure directly.

You have the MIR absorption values for 30 different wavelengths (variables wl1 to wl30) for 29 samples of manure. The aim is to see if there are clear patterns in the data. Furthermore, it is of interest to see which of the wavelengths are best for uncovering any patterns in the data.

a) In R (or RStudio) apply principal components analysis to this dataset. Include all relevant R output in your submission. In your answer you should include ONLY R output (graphs and results) that you think is RELEVANT, even if it is not specifically referred to in the questions below (for example, if PCs 8 and 9 are not important then don’t include loadings/scores plots of PC8 vs PC9).
Parts b) to d) cover the interpretation; as such, you should not include any discussion/comments amongst your output submitted for part a). [30 marks]

Department of Mathematics and Statistics

The page limit for the following written parts of this question is ¾ of a page of A4.

b) How many principal components do you think should be interpreted and why? [5 marks]

c) Using the principal component output, interpret your chosen number of principal components. [15 marks]

d) Interpret your scores plot/s: do there appear to be any groups of samples? If so, what properties do these groups have? [15 marks]

Question 2:

A dog toy manufacturer has conducted a small-scale piece of market research to investigate customer reactions to their latest product. The manufacturer asked 15 dog owners to use the new toy to play with their dog for a week. Each dog owner then completed a questionnaire rating 8 different aspects of the toy, from ‘durability’ to ‘ease of play’ to ‘cost’; these are variables A1 – A8 in the dataset. Each aspect has been given an integer rating from 0 to 9, where 0 is the worst rating and 9 the best.

Before you read the dataset into R you need to add yourself/yourselves as extra individual/s! The rating you provide for each aspect is based on your unique 8-digit student number/s. For example, if my student number was 28374615 I would add myself to the dataset as follows: ID of 28374615 and ratings of A1 = 2, A2 = 8, A3 = 3, A4 = 7, A5 = 4, A6 = 6, A7 = 1, and A8 = 5, see below:

ID        A1  A2  A3  A4  A5  A6  A7  A8
28374615   2   8   3   7   4   6   1   5

If you are working in a pair you should add both of you as two new individuals in the dataset (resulting in 17 individuals in total). If you have any questions regarding this please speak to/email me ([email protected]).

You should upload your individualised dog toy dataset, or provide a screenshot!

a) Produce a set of star or segment glyphs for the dog toy data, including yourself/yourselves as the additional individual/s, using R (or RStudio). Your plot should have a sensible title which includes your student number/s. [10 marks]

b) In R (or RStudio) produce two dendrograms of your dog toy data, including yourself/yourselves as the additional individual/s. The distance measure you should use for both plots is Manhattan. The two clustering algorithms you should consider are furthest neighbour/complete linkage and nearest neighbour/single linkage. Your plots should have sensible titles including the clustering algorithm and your student number/s. (It is OK for your additional individuals to be displayed as 16 and 17 in your plots.) [10 marks]

You are required to write no more than a couple of sentences each for c), d) and e); as such, the page limit for the following written parts of this question is ½ page of A4, minimum font size 12 with standard margins.

c) Specify one clear example of agreement between your two dendrograms and one clear example of how they differ. [5 marks]

d) State your preferred dendrogram (there is no right or wrong answer for this), and suggest a height at which you would cut it. You should justify why you would cut at your chosen height. [5 marks]

e) Consider both your glyphs and your preferred dendrogram together. Briefly discuss similarities in the conclusions that can be made from the two analyses; you should highlight an example in your discussion. What additional information is provided in your glyphs? [5 marks]

Further guidance: Be succinct – these are not trick questions. Focus on exactly what the question is asking and answer it directly.
What to upload: You can upload a single Word or PDF file containing all of your R output and answers. If you prefer, you can upload one file per question (but not one for each part of a question). You should upload your individualised dog toy dataset, or provide a screenshot! (I will be reproducing your plots to mark them, so I need to be able to check that the data you are entering is correct.) You do not need to submit any R code.
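
As a quick check on the individualised row described in Question 2, the digit-to-rating mapping can be sketched in R as below. This is only an illustration under stated assumptions: the helper name `ratings_from_id` is hypothetical, the column names `ID` and `A1` to `A8` follow the wording of the question and may differ from those in the Blackboard file, and the student number used is the example one given in Question 2.

```r
# Hypothetical helper (not part of the assignment materials):
# split an 8-digit student number into the ratings A1..A8.
ratings_from_id <- function(student_number) {
  # Pass the number as a string so that any leading zero is preserved.
  id <- as.character(student_number)
  # The question assumes a unique 8-digit student number.
  stopifnot(grepl("^[0-9]{8}$", id))
  # One digit per aspect, in order: first digit -> A1, ..., last digit -> A8.
  digits <- as.integer(strsplit(id, "")[[1]])
  names(digits) <- paste0("A", 1:8)
  # One-row data frame; the column names are assumed from the question text.
  data.frame(ID = id, as.list(digits), stringsAsFactors = FALSE)
}

# Example student number taken from Question 2.
new_row <- ratings_from_id("28374615")
print(new_row)
```

Comparing the printed row with the one typed into the data file is one way to confirm the entry before uploading it.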