Multivariate Analysis: Assignment 2

The University of New South Wales

School of Mathematics

Department of Statistics

2019 T3: Due Friday, Week 10, 22 November at 23:59

• Submission instructions will be posted shortly.

• No late assignments will be accepted without a successful application for
Special Consideration.

• For computational and applied exercises, you may use either R or SAS.

Include commands used and a reasonable amount of relevant output.

• Use of computer algebra systems is permitted and encouraged, though

note that one may not be available during the exams.

1. Consider identifying the neurotic state of an individual referred for
psychiatric examination. Three measurements A, B, and C are made on each
individual. The mean scores for each of 3 groups are:

Group        A       B      C
Anxiety      2.970   1.13   0.795
Normal       0.655   0.06   0.090
Obsession    4.420   1.72   1.155

The pooled within-group covariance matrix is

\[
S_{\text{pooled}} =
\begin{pmatrix}
2.27 & 0.371 & 0.517 \\
0.371 & 0.565 & -0.013 \\
0.517 & -0.013 & 0.505
\end{pmatrix}.
\]

(a) Discriminant analysis. Answer the following using only the information
provided above.

i. Assuming equal misclassification costs and equal priors for the three
groups, calculate the linear discriminant scores for each of the three
groups.


ii. Based on the above scores, classify the following newly observed

individuals:

Individual   A     B     C
Mary         2.5   1.1   1.0
Fred         4.2   1.4   1.3
Giselda      1.1   0.6   0.3

iii. Suppose that in the population of people administered this examination,
20% are, in fact, “normal”, 40% have anxiety, and 40% have obsession. Show
how this changes the linear discriminant scores and classifications of the
three individuals.

iv. Consider classifying individuals from the “Anxiety” and “Obsession”
groups only. Determine the linear discriminant function and estimate the
probabilities of misclassification P(1|2) and P(2|1).
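For reference, the calculations in Part (a) need only a few lines of base R. The sketch below is one possible approach, not a model answer. It assumes the usual linear discriminant score d_i(x) = mu_i' S^{-1} x - (1/2) mu_i' S^{-1} mu_i + log(p_i), with S the pooled covariance matrix given above. For the two-group error rates it further assumes multivariate normality with equal priors and costs, in which case P(1|2) = P(2|1) = Phi(-Delta/2), where Delta^2 is the squared Mahalanobis distance between the two group means. The function name ld_scores and all variable names are illustrative.

```r
# Group means (rows = groups, columns = measurements), as given in Question 1.
means <- rbind(
  Anxiety   = c(A = 2.970, B = 1.13, C = 0.795),
  Normal    = c(A = 0.655, B = 0.06, C = 0.090),
  Obsession = c(A = 4.420, B = 1.72, C = 1.155)
)

# Pooled within-group covariance matrix, as given in Question 1.
S <- matrix(c(2.270,  0.371,  0.517,
              0.371,  0.565, -0.013,
              0.517, -0.013,  0.505),
            nrow = 3, byrow = TRUE)

# Linear discriminant scores of one observation x (ordered A, B, C).
# priors must be in the same order as the rows of `means`.
ld_scores <- function(x, means, S,
                      priors = rep(1 / nrow(means), nrow(means))) {
  coefs <- solve(S, t(means))                     # column i is S^{-1} mu_i
  const <- -0.5 * colSums(t(means) * coefs) + log(priors)
  setNames(drop(crossprod(coefs, x)) + const, rownames(means))
}

# Two-group case (Anxiety vs Obsession): squared Mahalanobis distance and
# the plug-in misclassification probability under the stated assumptions.
delta <- means["Anxiety", ] - means["Obsession", ]
D2    <- drop(crossprod(delta, solve(S, delta)))
err   <- pnorm(-sqrt(D2) / 2)
```

A new individual is assigned to the group with the largest score, for example `names(which.max(ld_scores(c(2.5, 1.1, 1.0), means, S)))`. Unequal priors enter only through the log(p_i) term, so passing `priors = c(0.4, 0.2, 0.4)` shifts each score by a constant.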

(b) Discriminant analysis, continued. Load the original dataset from the
provided file neurotic.csv. Using R or SAS:

i.–iii. Repeat the corresponding parts of Part (a).

iv. Calculate the in-sample confusion matrix for LDA (assuming

equal prior probabilities).

v. Use an appropriate hypothesis test to check whether the equal
within-group covariance assumption required by LDA is satisfied. Report the
test statistic and the p-value, and state the conclusion in the context of
the problem.
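As a rough illustration of the workflow in part iv, the sketch below fits LDA with equal priors and tabulates the in-sample confusion matrix. It uses lda() from the MASS package and, as a stand-in, the built-in iris data set. The column names of neurotic.csv are not given here, so they are not guessed; the formula and data argument would need to be adapted.

```r
library(MASS)  # provides lda()

# Stand-in data: built-in iris (three groups, numeric measurements).
# Replace with the data read from neurotic.csv and its own column names.
fit <- lda(Species ~ ., data = iris, prior = rep(1 / 3, 3))

# In-sample (resubstitution) predictions and confusion matrix
pred <- predict(fit)$class
conf <- table(Actual = iris$Species, Predicted = pred)
conf

# Resubstitution accuracy: proportion of observations on the diagonal
acc <- sum(diag(conf)) / sum(conf)
```

Because the same observations are used to fit and to evaluate, this accuracy tends to be optimistic compared with a cross-validated estimate.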

(c) Support vector machine. Fit and tune a support vector machine of your
choice for predicting the patient group from the measurements.

Report the following for the SVM fit:

i. Selected tuning parameters.

ii. In-sample confusion matrix.

iii. Out-of-sample accuracy estimated by cross-validation.

iv. Predictions for the individuals in 1(a)ii.

(d) Principal component analysis. Perform a principal component analysis on
the three measurements A, B, and C, ignoring grouping.

i. Report the coefficients for the components, the eigenvalues, and

the cumulative variance explained.

ii. How many components are needed to explain at least 90% of the

variation in the data?

iii. How many components are needed according to Kaiser's rule?
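The quantities asked for in Part (d) can all be read off a single prcomp() fit. The sketch below is one way to do it, shown on stand-in data (three columns of the built-in iris data set, renamed A, B, C purely for illustration). It assumes the analysis is done on standardised variables, which is the setting in which Kaiser's rule is usually stated as "keep components with eigenvalue above 1". Whether to standardise the actual data is a judgment left to the reader.

```r
# Stand-in data only (NOT the neurotic.csv measurements).
X <- setNames(iris[, 1:3], c("A", "B", "C"))

# PCA on standardised variables, i.e. on the correlation matrix
pca <- prcomp(X, center = TRUE, scale. = TRUE)

pca$rotation                        # component coefficients (loadings)
eig     <- pca$sdev^2               # eigenvalues
cum_var <- cumsum(eig) / sum(eig)   # cumulative proportion of variance

k90      <- which(cum_var >= 0.90)[1]  # fewest components reaching 90%
k_kaiser <- sum(eig > 1)               # Kaiser's rule on the correlation scale
```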

2. Data have been collected for n = 20 consecutive years on the annual
average prices of beef steers (X1) and hogs (X2) and the annual per capita
consumption of beef (X3) and pork (X4). We are interested in the
relationship of livestock prices to meat production. The file price-cons.csv


contains the variables Y (year index) and X1, X2, X3, X4. We could proceed
by calculating U = (X1 + X2)/2, V = X3 + X4 and then regressing U on V.

(a) Canonical correlation. A perhaps better procedure would be to construct
a (weighted) price index U = a1 X1 + a2 X2 and consumption index
V = b3 X3 + b4 X4 and to look at the maximal correlation between U and V.
This is the canonical correlation analysis approach.

i. Find and list both the canonical correlations and the related canonical
variates (i.e., U and V). Express the canonical variates using the raw
coefficients and also by using the standardised coefficients (i.e.,
coefficients obtained by first standardising the variables involved). Since
the prices are in dollar units but the consumption is in pounds, does it
make sense to standardise here?

Hint: Recall from the lecture that SAS provides standardised coefficients as
a part of its output. In R, they may be obtained by first using the scale()
function to standardise the inputs and then performing canonical correlation
analysis on those.

ii. Using canonical correlation analysis, formulate and test the hypothesis
of independence of the price index and the consumption index (intuition
suggests that it should be rejected). Report the test statistic and the
p-value, and state the conclusion in the context of the problem.

iii. Is only one pair of canonical variates enough, or is the second
canonical correlation also significant?
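The hint and the two tests can be sketched in base R with cancor(). The data below are simulated placeholders, not the contents of price-cons.csv, and the tests use Bartlett's chi-square approximation, which is one common choice; the course may prefer a different statistic. The names prices, consumption, stat_all, and stat_2 are illustrative.

```r
set.seed(1)
n <- 20
# Placeholder data only: simulated stand-ins for X1..X4.
X1 <- rnorm(n)
X2 <- 0.5 * X1 + rnorm(n)
X3 <- -0.7 * X1 + rnorm(n)
X4 <- -0.4 * X2 + rnorm(n)
prices      <- cbind(X1, X2)
consumption <- cbind(X3, X4)

cc_raw <- cancor(prices, consumption)                 # raw variables
cc_std <- cancor(scale(prices), scale(consumption))   # standardised variables

r <- cc_raw$cor             # canonical correlations (same for both fits)
cc_raw$xcoef; cc_raw$ycoef  # coefficients on the raw scale
cc_std$xcoef; cc_std$ycoef  # coefficients on the standardised scale

# Bartlett's approximation: -(n - 1 - (p + q + 1)/2) * sum(log(1 - r^2))
p <- ncol(prices); q <- ncol(consumption)
mult <- -(n - 1 - (p + q + 1) / 2)

stat_all <- mult * sum(log(1 - r^2))         # H0: all canonical corr. are 0
p_all    <- pchisq(stat_all, df = p * q, lower.tail = FALSE)

stat_2 <- mult * log(1 - r[2]^2)             # H0: second canonical corr. is 0
p_2    <- pchisq(stat_2, df = (p - 1) * (q - 1), lower.tail = FALSE)
```

Note that cancor() scales its coefficients so that each canonical variate has unit sum of squares, so they may differ by a constant factor from coefficients normalised to unit variance.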

(b) Multivariate linear model. Now, suppose that our goal is not correlation
but explanation: we wish to model consumption as a function of the prices.

i. Fit a multivariate linear model with the consumption variables

as responses and prices as predictors. Report the coefficients, the

standard errors, and the estimated variance–covariance matrix of

the residuals.

ii. Briefly (in 2–3 sentences), interpret the regression coefficients

and their significance.
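A multivariate linear model of this form can be fitted in base R by giving lm() a matrix response. The sketch below uses simulated placeholder data rather than price-cons.csv, and estimates the residual variance–covariance matrix as the residual cross-product divided by the residual degrees of freedom, which is one standard estimator. The object names dat, fit, and Sigma_hat are illustrative.

```r
set.seed(1)
n <- 20
# Placeholder data only: simulated stand-ins for X1..X4.
dat <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
dat$X3 <- 5 - 0.7 * dat$X1 + 0.2 * dat$X2 + rnorm(n)
dat$X4 <- 4 + 0.3 * dat$X1 - 0.5 * dat$X2 + rnorm(n)

# Consumption variables as joint responses, prices as predictors
fit <- lm(cbind(X3, X4) ~ X1 + X2, data = dat)

coef(fit)                                          # 3 x 2 coefficient matrix
lapply(summary(fit), function(s) s$coefficients)   # estimates, SEs, t, p

# Estimated residual variance-covariance matrix
Sigma_hat <- crossprod(resid(fit)) / df.residual(fit)
```

Each column of coef(fit) matches what a separate univariate regression of that response would give; the joint fit adds the estimated covariance between the two sets of residuals.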
