Tutoring case: MATH 1309 Assignment 2
Assignment 2 PG
MATH 1309 MULTIVARIATE ANALYSIS
45 points
DUE date October 25, 2019 11.59pm.

Show your SAS code, output and answers within the one attached assignment pdf or docx that you submit in Canvas.

Question 1 (23 marks)

The file THC.csv contains data on the concentrations of 13 different chemical compounds in marijuana plants grown in the same region in Colombia that are derived from three different species varieties.

1. Compute the mean and standard deviation for the 13 chemical concentrations in the sample THC data via SAS. (1.5 marks)
2. Produce the correlation matrix and a scatterplot in SAS. Is the correlation matrix suitable for a principal component analysis? (1.5 marks)
3. Perform a principal component analysis using SAS on the raw data and assess how many PCs need to be retained. Answer the following from the resultant output. (10 marks; each part below is worth 2 marks)
   a) What percentage of the total sample variation is accounted for by the first, second and third PCs?
   b) Interpret the first 3 PCs.
   c) Write out the first, second and third PCs as linear functions of the original variables.
   d) Can the data be effectively summarised in fewer than 13 dimensions? Justify your answer and comment on it.
   e) Obtain via SAS, or sketch, the scree plot to confirm your choice of the number of PCs.
4. Perform a principal component analysis using SAS on the correlation matrix. Answer the following from the resultant output. (10 marks; each part below is worth 2 marks)
   a) What percentage of the total sample variation is accounted for by the first, second and third PCs?
   b) Interpret the first 3 PCs.
   c) Write out the first, second and third PCs as linear functions of the standardised variables.
   d) Can the data be effectively summarised in fewer than 13 dimensions? Justify your answer and comment on it.
   e) Obtain via SAS, or sketch, the scree plot to confirm your choice of the number of PCs.
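The steps of Question 1 could be sketched in SAS roughly as follows. This is only an outline under stated assumptions: the file path, the data set name `thc`, and the use of `_numeric_` to pick up the 13 concentration columns are all hypothetical (if the species indicator is stored as a number it would need to be excluded), and "raw data" is read here as a PCA on the covariance matrix.

```sas
/* Hypothetical path and data set name; adjust to your environment. */
proc import datafile="THC.csv" out=thc dbms=csv replace;
    getnames=yes;
run;

/* Part 1: means and standard deviations of the numeric columns. */
proc means data=thc mean std;
    var _numeric_;
run;

/* Part 2: correlation matrix plus a scatterplot matrix. */
ods graphics on;
proc corr data=thc plots=matrix(nvar=all);
    var _numeric_;
run;

/* Part 3: PCA on the raw data (assumed to mean the covariance matrix). */
proc princomp data=thc cov plots=scree;
    var _numeric_;
run;

/* Part 4: PCA on the correlation matrix, which is the PRINCOMP default. */
proc princomp data=thc plots=scree;
    var _numeric_;
run;
```

The two PROC PRINCOMP calls differ only in the COV option, which makes it easy to compare how the unscaled and the standardised analyses distribute the total variance across components.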
Question 2 (14 marks)

Consider the raw data set with 12 observations on 5 socio-economic variables, called Population, School, Employment, Services and HouseValue.

    data SocioEconomics;
       input Population School Employment Services HouseValue;
       datalines;
    5700 12.8 2500 270 25000
    1000 10.9 600 10 10000
    3400 8.8 1000 10 9000
    3800 13.6 1700 140 25000
    4000 12.8 1600 140 25000
    8200 8.3 2600 60 12000
    1200 11.4 400 10 16000
    9100 11.5 3300 60 14000
    9900 12.5 3400 180 18000
    9600 13.7 3600 390 25000
    9600 9.6 3300 80 12000
    9400 11.4 4000 100 13000
    ;
    proc factor data=SocioEconomics simple corr;
    run;

Conduct a factor analysis using the SAS statements above. Show your SAS code (it can vary from the one I suggest), output and answers within the ONE assignment pdf or docx that you submit in Canvas.

1. Prepare the dataset for a factor analysis via SAS. (1 mark)
2. Generate the means and standard deviations of the data. (1 mark)
3. Perform a factor analysis on the raw data and the correlation matrix using the code above, and answer the following questions. (2 marks)
4. From the eigenvalues of the correlation matrix, the factor loading matrix and the communalities in the output, answer the following questions.
   a) Do the first two principal components (factors) provide an adequate summary of the data? (1 mark)
   b) How much of the variation is accounted for by 2 factors? (1 mark)
   c) How much of the variation is accounted for by 3 factors? (1 mark)
5. To display the scoring coefficients as eigenvectors, use PROC PRINCOMP as below, and answer the following questions.

    proc princomp data=SocioEconomics;
    run;

   a) What are the eigenvalues and the respective eigenvectors? (1 mark)
   b) What is the proportion of the variance accounted for by the first and second component respectively? (1 mark)
   c) Together, how much of the standardised variance do the first and second factors account for? (1 mark)
   d) According to the final communality estimates, how many components or factors are needed for all the variables to be well accounted for? Justify your answer. (1 mark)
6. To obtain the component scores as linear combinations of the observed variables, request the standardized scoring coefficients by adding the SCORE option in the FACTOR statement, and run this. Note that the SCORE option in the code below requests the display of the standardized scoring coefficients.

    proc factor data=SocioEconomics n=5 score;
    run;

   As each factor/component can be expressed as a linear combination of the standardised observed variables using the code above, answer the following questions:
   a) Write down the first principal component or Factor1 in terms of the standardised variables. (1 mark)
   b) Write down the second principal component or Factor2 in terms of the standardised variables. (1 mark)
   c) Write the first and second PCs in terms of eigenvectors. (1 mark)

NOTES/HINTS:

The SIMPLE option specified in the PROC FACTOR statement generates the means and standard deviations of all observed variables in the analysis.

The CORR option specified in the PROC FACTOR statement generates the output of the observed correlations.

To express the observed variables as functions of the components (or factors), you inspect the factor loading matrix.
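As a sketch of how the loading matrix in the hint above relates to eigenvectors and scoring coefficients, assume the principal component method (the PROC FACTOR default) applied to the correlation matrix of p = 5 standardised variables z_1, ..., z_p, with eigenvalues λ_1 ≥ ... ≥ λ_p and unit-length eigenvectors e_k. The symbols are notation introduced here for illustration and do not come from the assignment.

```latex
\begin{align*}
\ell_{jk} &= \sqrt{\lambda_k}\, e_{jk}
  && \text{loading of variable } j \text{ on factor } k \\
z_j &= \sum_{k=1}^{p} \ell_{jk} F_k
  && \text{observed variable in terms of all } p \text{ factors} \\
h_j^2 &= \sum_{k=1}^{m} \ell_{jk}^2
  && \text{communality of variable } j \text{ with } m \text{ factors retained} \\
F_k &= \sum_{j=1}^{p} \frac{e_{jk}}{\sqrt{\lambda_k}}\, z_j
  && \text{factor } k \text{ via standardised scoring coefficients} \\
\mathrm{PC}_k &= \sum_{j=1}^{p} e_{jk}\, z_j = \sqrt{\lambda_k}\, F_k
  && \text{component } k \text{ via eigenvector coefficients} \\
\mathrm{Prop}_k &= \frac{\lambda_k}{p}
  && \text{proportion of standardised variance due to component } k
\end{align*}
```

Under these assumptions the PROC PRINCOMP eigenvectors and the PROC FACTOR scoring coefficients differ only by the scale factor 1/√λ_k, which is why the same component can be written either "in terms of the standardised variables" or "in terms of eigenvectors".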
To obtain the component scores as linear combinations of the observed variables, request the standardized scoring coefficients by adding the SCORE option in the FACTOR statement. The SCORE option in the code below requests the display of the standardized scoring coefficients.

    proc factor data=SocioEconomics n=5 score;
    run;

QUESTION 3 (8 marks)

Six variables measured on 100 genuine and 100 forged (counterfeit/fake) old Swiss 1000-franc bank notes are given in Appendix A of the assignment (also available in an R library as data(banknote)).

data(banknote) is a data.frame of dimension 200x7 with the following 7 variables:

Class     a factor with classes: genuine, counterfeit
Length    Length of bill (mm)
Left      Width of left edge (mm)
Right     Width of right edge (mm)
Bottom    Bottom margin width (mm)
Top       Top margin width (mm)
Diagonal  Length of diagonal (mm)

Note that, if you do not use R, the data in Appendix A have 6 columns corresponding to the following 6 variables:

1. Length of the bank note, length
2. Height of the bank note, measured on the left, left
3. Height of the bank note, measured on the right, right
4. Distance of inner frame to the lower border, bottom
5. Distance of inner frame to the upper border, top
6. Length of the diagonal, diag

You need to add the class or group indicator column (genuine versus fake) to the Appendix A data.

Show your SAS code, SAS output and answers within your final assignment pdf or docx that you submit in Canvas.

1. Prepare the dataset for input for a discriminant analysis via SAS. (0.5 marks)
2. Generate the means and the variance-covariance matrix of the data for the genuine notes. (0.5 marks)
3. Generate the means and standard deviations and the variance-covariance matrix of the data for the forged/fake/counterfeit notes. (0.5 marks)
4. Produce the correlation matrix and an associated scatterplot of the inputted data for the genuine notes. (0.5 marks)
5. Produce the correlation matrix and an associated scatterplot of the inputted data for the forged/fake notes. (0.5 marks)
6. Run the discriminant analysis using the SAS code below, which allocates a bank note with the characteristics X0^T = (214.9, 130.1, 129.9, 9, 10.6, 140.5) to the appropriate grouping, i.e. allocates it to either the genuine or the forged/fake class. Using the SAS DISCRIM code below and the resultant output, answer the following questions.
   a) Is Σ1 = Σ2? Justify your answer. (1 mark)
   b) How is the bank note with X0^T = (214.9, 130.1, 129.9, 9, 10.6, 140.5) allocated? (1 mark)
   c) Write down the resultant confusion matrix. (1 mark)

    data test;
       input length left right bottom top diag;
       cards;
    214.9 130.1 129.9 9 10.6 140.5
    ;
    run;

    proc discrim data=combine pool=test crossvalidate testdata=test testout=a;
       class type;
       var length left right bottom top diag;
       priors "real"=0.99 "fake"=0.01;
    run;

    proc print;
    run;

HINTS AND NOTES TO LEARN AND TO INTERPRET THE OUTPUT:

In the SAS code above, by including pool=test, SAS will decide what kind of discriminant analysis to carry out based on the results of this test.
o If the test fails to reject, then SAS will automatically do a linear discriminant analysis (LDF).
o If the test rejects, then SAS will do a quadratic discriminant analysis (QDF).

There are two other options also. If we put pool=yes, then SAS will conduct a linear discriminant analysis whether it is warranted or not. It will pool the variance-covariance matrices of the 2 classes/groups and do a linear discriminant analysis without reporting Bartlett's test.

If pool=no, then SAS will not pool the variance-covariance matrices and will perform the quadratic discriminant analysis. SAS does not actually print out the quadratic discriminant function, but it will use quadratic discriminant analysis to classify sample units into populations.

Note: SAS runs Bartlett's test to test whether there is a significant difference between the variance-covariance matrices of the genuine and counterfeit (fake) bank notes, i.e. it tests whether Σ1 = Σ2.
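The test and the two classification rules described in these hints can be written out as below. This is a sketch in textbook notation, not SAS output: the sample mean vectors \bar{x}_k, the group covariance matrices S_k, the pooled matrix S_pl and the prior probabilities π_k (0.99 and 0.01 in the code above) are symbols introduced here for illustration.

```latex
\begin{align*}
H_0 &: \Sigma_1 = \Sigma_2
  \quad \text{versus} \quad H_1 : \Sigma_1 \neq \Sigma_2 \\[4pt]
d_k^{L}(x_0) &= \bar{x}_k^{T} S_{\mathrm{pl}}^{-1} x_0
  - \tfrac{1}{2}\, \bar{x}_k^{T} S_{\mathrm{pl}}^{-1} \bar{x}_k + \ln \pi_k
  && \text{(linear rule, pooled } S_{\mathrm{pl}}\text{)} \\
d_k^{Q}(x_0) &= -\tfrac{1}{2} \ln \lvert S_k \rvert
  - \tfrac{1}{2}\, (x_0 - \bar{x}_k)^{T} S_k^{-1} (x_0 - \bar{x}_k) + \ln \pi_k
  && \text{(quadratic rule, separate } S_k\text{)}
\end{align*}
```

In either case the note x_0 is allocated to the group k with the larger score. PROC DISCRIM expresses an equivalent rule through generalized squared distances and posterior probabilities, so its printed quantities need not match these scores term for term.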
APPENDIX A

Observations 1-100 are the genuine bank notes and the other 100 observations are the counterfeit (forged/fake) bank notes. You need to create the class or group indicator column; otherwise use data(banknote).

Length  Height (left)  Height (right)  Inner Frame (lower)  Inner Frame (upper)  Diagonal
214.8 131.0 131.1 9.0 9.7 141.0
214.6 129.7 129.7 8.1 9.5 141.7
214.8 129.7 129.7 8.7 9.6 142.2
214.8 129.7 129.6 7.5 10.4 142.0
215.0 129.6 129.7 10.4 7.7 141.8
215.7 130.8 130.5 9.0 10.1 141.4
215.5 129.5 129.7 7.9 9.6 141.6
214.5 129.6 129.2 7.2 10.7 141.7
214.9 129.4 129.7 8.2 11.0 141.9
215.2 130.4 130.3 9.2 10.0 140.7
215.3 130.4 130.3 7.9 11.7 141.8
215.1 129.5 129.6 7.7 10.5 142.2
215.2 130.8 129.6 7.9 10.8 141.4
214.7 129.7 129.7 7.7 10.9 141.7
215.1 129.9 129.7 7.7 10.8 141.8
214.5 129.8 129.8 9.3 8.5 141.6
214.6 129.9 130.1 8.2 9.8 141.7
215.0 129.9 129.7 9.0 9.0 141.9
215.2 129.6 129.6 7.4 11.5 141.5
214.7 130.2 129.9 8.6 10.0 141.9
215.0 129.9 129.3 8.4 10.0 141.4
215.6 130.5 130.0 8.1 10.3 141.6
215.3 130.6 130.0 8.4 10.8 141.5
215.7 130.2 130.0 8.7 10.0 141.6
215.1 129.7 129.9 7.4 10.8 141.1
215.3 130.4 130.4 8.0 11.0 142.3
215.5 130.2 130.1 8.9 9.8 142.4
215.1 130.3 130.3 9.8 9.5 141.9
215.1 130.0 130.0 7.4 10.5 141.8
214.8 129.7 129.3 8.3 9.0 142.0
215.2 130.1 129.8 7.9 10.7 141.8
214.8 129.7 129.7 8.6 9.1 142.3
215.0 130.0 129.6 7.7 10.5 140.7
215.6 130.4 130.1 8.4 10.3 141.0
215.9 130.4 130.0 8.9 10.6 141.4
214.6 130.2 130.2 9.4 9.7 141.8
215.5 130.3 130.0 8.4 9.7 141.8
215.3 129.9 129.4 7.9 10.0 142.0
215.3 130.3 130.1 8.5 9.3 142.1
213.9 130.3 129.0 8.1 9.7 141.3
214.4 129.8 129.2 8.9 9.4 142.3
214.8 130.1 129.6 8.8 9.9 140.9
214.9 129.6 129.4 9.3 9.0 141.7
214.9 130.4 129.7 9.0 9.8 140.9
214.8 129.4 129.1 8.2 10.2 141.0
214.3 129.5 129.4 8.3 10.2 141.8
214.8 129.9 129.7 8.3 10.2 141.5
214.8 129.9 129.7 7.3 10.9 142.0
214.6 129.7 129.8 7.9 10.3 141.1
214.5 129.0 129.6 7.8 9.8 142.0
214.6 129.8 129.4 7.2 10.0 141.3
215.3 130.6 130.0 9.5 9.7 141.1
214.5 130.1 130.0 7.8 10.9 140.9
215.4 130.2 130.2 7.6 10.9 141.6
214.5 129.4 129.5 7.9 10.0 141.4
215.2 129.7 129.4 9.2 9.4 142.0
215.7 130.0 129.4 9.2 10.4 141.2
215.0 129.6 129.4 8.8 9.0 141.1
215.1 130.1 129.9 7.9 11.0 141.3
215.1 130.0 129.8 8.2 10.3 141.4
215.1 129.6 129.3 8.3 9.9 141.6
215.3 129.7 129.4 7.5 10.5 141.5
215.4 129.8 129.4 8.0 10.6 141.5
214.5 130.0 129.5 8.0 10.8 141.4
215.0 130.0 129.8 8.6 10.6 141.5
215.2 130.6 130.0 8.8 10.6 140.8
214.6 129.5 129.2 7.7 10.3 141.3
214.8 129.7 129.3 9.1 9.5 141.5
215.1 129.6 129.8 8.6 9.8 141.8
214.9 130.2 130.2 8.0 11.2 139.6
213.8 129.8 129.5 8.4 11.1 140.9
215.2 129.9 129.5 8.2 10.3 141.4
215.0 129.6 130.2 8.7 10.0 141.2
214.4 129.9 129.6 7.5 10.5 141.8
215.2 129.9 129.7 7.2 10.6 142.1
214.1 129.6 129.3 7.6 10.7 141.7
214.9 129.9 130.1 8.8 10.0 141.2
214.6 129.8 129.4 7.4 10.6 141.0
215.2 130.5 129.8 7.9 10.9 140.9
214.6 129.9 129.4 7.9 10.0 141.8
215.1 129.7 129.7 8.6 10.3 140.6
214.9 129.8 129.6 7.5 10.3 141.0
215.2 129.7 129.1 9.0 9.7 141.9
215.2 130.1 129.9 7.9 10.8 141.3
215.4 130.7 130.2 9.0 11.1 141.2
215.1 129.9 129.6 8.9 10.2 141.5
215.2 129.9 129.7 8.7 9.5 141.6
215.0 129.6 129.2 8.4 10.2 142.1
214.9 130.3 129.9 7.4 11.2 141.5
215.0 129.9 129.7 8.0 10.5 142.0
214.7 129.7 129.3 8.6 9.6 141.6
215.4 130.0 129.9 8.5 9.7 141.4
214.9 129.4 129.5 8.2 9.9 141.5
214.5 129.5 129.3 7.4 10.7 141.5
214.7 129.6 129.5 8.3 10.0 142.0
215.6 129.9 129.9 9.0 9.5 141.7
215.0 130.4 130.3 9.1 10.2 141.1
214.4 129.7 129.5 8.0 10.3 141.2
215.1 130.0 129.8 9.1 10.2 141.5
214.7 130.0 129.4 7.8 10.0 141.2
214.4 130.1 130.3 9.7 11.7 139.8
214.9 130.5 130.2 11.0 11.5 139.5
214.9 130.3 130.1 8.7 11.7 140.2
215.0 130.4 130.6 9.9 10.9 140.3
214.7 130.2 130.3 11.8 10.9 139.7
215.0 130.2 130.2 10.6 10.7 139.9
215.3 130.3 130.1 9.3 12.1 140.2
214.8 130.1 130.4 9.8 11.5 139.9
215.0 130.2 129.9 10.0 11.9 139.4
215.2 130.6 130.8 10.4 11.2 140.3
215.2 130.4 130.3 8.0 11.5 139.2
215.1 130.5 130.3 10.6 11.5 140.1
215.4 130.7 131.1 9.7 11.8 140.6
214.9 130.4 129.9 11.4 11.0 139.9
215.1 130.3 130.0 10.6 10.8 139.7
215.5 130.4 130.0 8.2 11.2 139.2
214.7 130.6 130.1 11.8 10.5 139.8
214.7 130.4 130.1 12.1 10.4 139.9
214.8 130.5 130.2 11.0 11.0 140.0
214.4 130.2 129.9 10.1 12.0 139.2
214.8 130.3 130.4 10.1 12.1 139.6
215.1 130.6 130.3 12.3 10.2 139.6
215.3 130.8 131.1 11.6 10.6 140.2
215.1 130.7 130.4 10.5 11.2 139.7
214.7 130.5 130.5 9.9 10.3 140.1
214.9 130.0 130.3 10.2 11.4 139.6
215.0 130.4 130.4 9.4 11.6 140.2
215.5 130.7 130.3 10.2 11.8 140.0
215.1 130.2 130.2 10.1 11.3 140.3
214.5 130.2 130.6 9.8 12.1 139.9
214.3 130.2 130.0 10.7 10.5 139.8
214.5 130.2 129.8 12.3 11.2 139.2
214.9 130.5 130.2 10.6 11.5 139.9
214.6 130.2 130.4 10.5 11.8 139.7
214.2 130.0 130.2 11.0 11.2 139.5
214.8 130.1 130.1 11.9 11.1 139.5
214.6 129.8 130.2 10.7 11.1 139.4
214.9 130.7 130.3 9.3 11.2 138.3
214.6 130.4 130.4 11.3 10.8 139.8
214.5 130.5 130.2 11.8 10.2 139.6
214.8 130.2 130.3 10.0 11.9 139.3
214.7 130.0 129.4 10.2 11.0 139.2
214.6 130.2 130.4 11.2 10.7 139.9
215.0 130.5 130.4 10.6 11.1 139.9
214.5 129.8 129.8 11.4 10.0 139.3
214.9 130.6 130.4 11.9 10.5 139.8
215.0 130.5 130.4 11.4 10.7 139.9
215.3 130.6 130.3 9.3 11.3 138.1
214.7 130.2 130.1 10.7 11.0 139.4
214.9 129.9 130.0 9.9 12.3 139.4
214.9 130.3 129.9 11.9 10.6 139.8
214.6 129.9 129.7 11.9 10.1 139.0
214.6 129.7 129.3 10.4 11.0 139.3
214.5 130.1 130.1 12.1 10.3 139.4
214.5 130.3 130.0 11.0 11.5 139.5
215.1 130.0 130.3 11.6 10.5 139.7
214.2 129.7 129.6 10.3 11.4 139.5
214.4 130.1 130.0 11.3 10.7 139.2
214.8 130.4 130.6 12.5 10.0 139.3
214.6 130.6 130.1 8.1 12.1 137.9
215.6 130.1 129.7 7.4 12.2 138.4
214.9 130.5 130.1 9.9 10.2 138.1
214.6 130.1 130.0 11.5 10.6 139.5
214.7 130.1 130.2 11.6 10.9 139.1
214.3 130.3 130.0 11.4 10.5 139.8
215.1 130.3 130.6 10.3 12.0 139.7
216.3 130.7 130.4 10.0 10.1 138.8
215.6 130.4 130.1 9.6 11.2 138.6
214.8 129.9 129.8 9.6 12.0 139.6
214.9 130.0 129.9 11.4 10.9 139.7
213.9 130.7 130.5 8.7 11.5 137.8
214.2 130.6 130.4 12.0 10.2 139.6
214.8 130.5 130.3 11.8 10.5 139.4
214.8 129.6 130.0 10.4 11.6 139.2
214.8 130.1 130.0 11.4 10.5 139.6
214.9 130.4 130.2 11.9 10.7 139.0
214.3 130.1 130.1 11.6 10.5 139.7
214.5 130.4 130.0 9.9 12.0 139.6
214.8 130.5 130.3 10.2 12.1 139.1
214.5 130.2 130.4 8.2 11.8 137.8
215.0 130.4 130.1 11.4 10.7 139.1
214.8 130.6 130.6 8.0 11.4 138.7
215.0 130.5 130.1 11.0 11.4 139.3
214.6 130.5 130.4 10.1 11.4 139.3
214.7 130.2 130.1 10.7 11.1 139.5
214.7 130.4 130.0 11.5 10.7 139.4
214.5 130.4 130.0 8.0 12.2 138.5
214.8 130.0 129.7 11.4 10.6 139.2
214.8 129.9 130.2 9.6 11.9 139.4
214.6 130.3 130.2 12.7 9.1 139.2
215.1 130.2 129.8 10.2 12.0 139.4
215.4 130.5 130.6 8.8 11.0 138.6
214.7 130.3 130.2 10.8 11.1 139.2
215.0 130.5 130.3 9.6 11.0 138.5
214.9 130.3 130.5 11.6 10.6 139.8
215.0 130.4 130.3 9.9 12.1 139.6
215.1 130.3 129.9 10.3 11.5 139.7
214.8 130.3 130.4 10.6 11.1 140.0
214.7 130.7 130.8 11.2 11.2 139.4
214.3 129.9 129.9 10.2 11.5 139.6
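One way to attach the class indicator to the Appendix A data is sketched below. It assumes the 200 numeric rows have been saved, in the order given and without a header line, to a plain-text file; the file name `banknotes.txt` is hypothetical. The data set name `combine`, the class variable `type` and the labels "real" and "fake" are chosen to match the PROC DISCRIM code in Question 3.

```sas
/* Read the 200 rows; rows 1-100 are genuine, rows 101-200 are forged. */
data combine;
    infile "banknotes.txt";          /* hypothetical file name */
    input length left right bottom top diag;
    if _n_ <= 100 then type = "real";
    else type = "fake";
run;

/* BY-group processing needs the data sorted by the class variable. */
proc sort data=combine;
    by type;
run;

/* Per-class means, standard deviations, covariance and correlation
   matrices, with a scatterplot matrix for each class. */
ods graphics on;
proc corr data=combine cov plots=matrix(nvar=all);
    by type;
    var length left right bottom top diag;
run;
```

Because PROC CORR prints simple statistics alongside the covariance and correlation matrices, this single call is one possible way to cover the per-class summaries requested in parts 2 to 5 of Question 3.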