Multivariate Analysis: Assignment 2
The University of New South Wales
School of Mathematics
Department of Statistics
2019 T3: Due Friday, Week 10, 22 November at 23:59
• Submission instructions will be posted shortly.
• No late assignments will be accepted without a successful application for
a Special Consideration.
• For computational and applied exercises, you may use either R or SAS.
Include commands used and a reasonable amount of relevant output.
• Use of computer algebra systems is permitted and encouraged, though
note that one may not be available during the exams.
1. Consider identifying the neurotic state of an individual referred for
psychiatric examination. Three measurements A, B, and C are made on each
individual. The mean scores for each of 3 groups are:
Group       A       B      C
Anxiety     2.970   1.13   0.795
Normal      0.655   0.06   0.090
Obsession   4.420   1.72   1.155
The pooled within-group covariance matrix is
$$
S_{\text{pooled}} =
\begin{pmatrix}
2.27 & 0.371 & 0.517 \\
0.371 & 0.565 & -0.013 \\
0.517 & -0.013 & 0.505
\end{pmatrix}.
$$
(a) Discriminant analysis For the following parts, use only the
information provided here.
i. Assuming equal misclassification costs and equal priors for the
three groups, calculate the linear discriminant scores used to
classify an individual into each of the three groups.
ii. Based on the above scores, classify the following newly observed
individuals:
A B C
Mary 2.5 1.1 1.0
Fred 4.2 1.4 1.3
Giselda 1.1 0.6 0.3
iii. Suppose that in the population of people administered this
examination, 20% are, in fact, “normal”, 40% have anxiety, and 40%
have obsession. Show how this changes the linear discriminant
scores and classifications of the three individuals.
iv. Consider classifying individuals from the “Anxiety” and
“Obsession” groups only. Determine the linear discriminant
function and estimate the probabilities of misclassification P(1|2) and
P(2|1).
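The calculations in Part (a) can be sketched in a few lines of code. The block below is a minimal illustration, not a model answer: it assumes the usual linear discriminant score d_i(x) = m_i' S^{-1} x - (1/2) m_i' S^{-1} m_i + ln(p_i), with m_i the group mean, S the pooled covariance matrix and p_i the prior, and it assumes the plug-in estimate Phi(-D/2) for the two-group misclassification probabilities under equal priors and costs, where D is the Mahalanobis distance between the two group means. The helper names (solve, discriminant_scores, classify, misclassification_prob) are hypothetical and chosen only for this example.

```python
import math

# Group means and pooled covariance matrix copied from the question.
MEANS = {
    "Anxiety":   [2.970, 1.13, 0.795],
    "Normal":    [0.655, 0.06, 0.090],
    "Obsession": [4.420, 1.72, 1.155],
}
S_POOLED = [
    [2.27, 0.371, 0.517],
    [0.371, 0.565, -0.013],
    [0.517, -0.013, 0.505],
]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def discriminant_scores(x, priors=None):
    """Return {group: d_i(x)}; equal priors are assumed when none are given."""
    if priors is None:
        priors = {g: 1.0 / len(MEANS) for g in MEANS}
    scores = {}
    for group, mean in MEANS.items():
        w = solve(S_POOLED, mean)  # w = S^{-1} m_i
        scores[group] = dot(w, x) - 0.5 * dot(w, mean) + math.log(priors[group])
    return scores


def classify(x, priors=None):
    """Assign x to the group with the largest linear discriminant score."""
    scores = discriminant_scores(x, priors)
    return max(scores, key=scores.get)


def misclassification_prob(g1, g2):
    """Plug-in estimate Phi(-D/2) for two groups with equal priors and costs."""
    diff = [a - b for a, b in zip(MEANS[g1], MEANS[g2])]
    d = math.sqrt(dot(diff, solve(S_POOLED, diff)))  # Mahalanobis distance
    return 0.5 * math.erfc(d / (2.0 * math.sqrt(2.0)))  # Phi(-d/2)
```

For example, classify([2.5, 1.1, 1.0]) scores the first new individual under equal priors, and passing priors={"Normal": 0.2, "Anxiety": 0.4, "Obsession": 0.4} shifts each score by the log prior. With equal priors the two error probabilities in this sketch coincide by construction.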
(b) Discriminant analysis continued Load the original dataset from
neurotic.csv provided. Using R or SAS:
i.–iii. Repeat the corresponding parts of Part (a).
iv. Calculate the in-sample confusion matrix for LDA (assuming
equal prior probabilities).
v. Use an appropriate hypothesis test to check whether the equal
within-group covariance assumption required by LDA is satisfied. Report
the test statistic, the p-value, and state the conclusion in the
context of the problem.
(c) Support vector machine Fit and tune a support vector machine of
your choice for predicting the patient group from the measurements.
Report the following for the SVM fit:
i. Selected tuning parameters.
ii. In-sample confusion matrix.
iii. Out-of-sample accuracy estimated by cross-validation.
iv. Predictions for the individuals in 1(a)ii.
(d) Principal component analysis Perform a principal component
analysis on the three measurements A, B, and C, ignoring grouping.
i. Report the coefficients for the components, the eigenvalues, and
the cumulative variance explained.
ii. How many components are needed to explain at least 90% of the
variation in the data?
iii. How many components are needed according to Kaiser’s rule?
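Once the eigenvalues are available from R or SAS, parts ii and iii reduce to simple arithmetic. The sketch below assumes that Kaiser's rule is read as "retain components whose eigenvalue exceeds the average eigenvalue", which equals 1 when the analysis is run on the correlation matrix; check this against the definition used in lectures. The function names and the example eigenvalues are hypothetical and are not computed from neurotic.csv.

```python
def cumulative_variance(eigenvalues):
    """Cumulative proportion of total variance, largest eigenvalue first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running, out = 0.0, []
    for ev in ordered:
        running += ev
        out.append(running / total)
    return out


def components_for_threshold(eigenvalues, threshold=0.90):
    """Smallest number of leading components reaching the given proportion."""
    for k, prop in enumerate(cumulative_variance(eigenvalues), start=1):
        if prop >= threshold:
            return k
    return len(eigenvalues)


def kaiser_count(eigenvalues):
    """Number of eigenvalues above the average eigenvalue (assumed reading of Kaiser's rule)."""
    mean_ev = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for ev in eigenvalues if ev > mean_ev)


# Hypothetical eigenvalues of a 3x3 correlation matrix, for illustration only.
EXAMPLE_EIGENVALUES = [2.2, 0.6, 0.2]
```

The two criteria need not agree: a variance threshold and an eigenvalue cut-off answer different questions, so it is worth reporting both.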
2. Data for n = 20 consecutive years have been collected on the annual
average prices of beef steers X1 and of hogs X2 and the annual per capita
consumption of beef X3 and of pork X4. We are interested in the
relationship of livestock prices to meat production. The file price-cons.csv
contains the variables Y (year index) and X1, X2, X3, X4. We could
proceed by calculating U = (X1 + X2)/2, V = X3 + X4 and then regressing
U on V.
(a) Canonical correlation A perhaps better procedure would be to
construct a (weighted) price index U = a1X1 + a2X2 and a consumption
index V = b3X3 + b4X4 and to look at the maximal correlation
between U and V. This is the canonical correlation analysis approach.
i. Find and list both the canonical correlations and the related
canonical variates (i.e., U and V). Express the canonical variates
using the raw coefficients and also by using the standardised
coefficients (i.e., coefficients obtained by first standardising the
variables involved). Since the prices are in dollar units but the
consumption is in pounds, does it make sense to standardise
here?
Hint: Recall from the lecture that SAS provides standardised
coefficients as a part of its output. In R, they may be
obtained by first using the scale() function to standardise the
inputs and then performing canonical correlation analysis on
those.
ii. Using canonical correlation analysis, formulate and test the
hypothesis of independence of the price index and of the
consumption index (intuition suggests that it should be rejected). Report the
test statistic, the p-value, and state the conclusion in the context
of the problem.
iii. Is only one canonical variate pair enough, or is the second
canonical correlation also significant?
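For reference, the quantities in Part (a) can be written compactly. The block below is a sketch in common textbook notation, which is an assumption here since the assignment does not fix notation: S_11 and S_22 denote the sample covariance matrices of the prices (X1, X2) and of the consumption variables (X3, X4), S_12 = S_21' is their cross-covariance, p = q = 2, and n = 20. Bartlett's chi-square approximation is shown as one common choice of test; the lectures may use a different statistic, such as an F approximation to Wilks' lambda.

```latex
% Squared sample canonical correlations
\hat\rho_1^2 \ge \hat\rho_2^2 \quad \text{are the eigenvalues of} \quad
S_{11}^{-1} S_{12} S_{22}^{-1} S_{21}.

% Independence of the two sets, H_0 : \Sigma_{12} = 0
-\Bigl(n - 1 - \tfrac{1}{2}(p + q + 1)\Bigr)
  \ln\bigl[(1 - \hat\rho_1^2)(1 - \hat\rho_2^2)\bigr]
  \;\overset{\text{approx.}}{\sim}\; \chi^2_{pq}

% Second canonical correlation, H_0 : \rho_2 = 0
-\Bigl(n - 1 - \tfrac{1}{2}(p + q + 1)\Bigr)
  \ln\bigl(1 - \hat\rho_2^2\bigr)
  \;\overset{\text{approx.}}{\sim}\; \chi^2_{(p-1)(q-1)}
```

With n = 20 and p = q = 2, the multiplier is 16.5 and the reference distributions have 4 and 1 degrees of freedom respectively.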
(b) Multivariate linear model Now, suppose that our goal is not
correlation but explanation: we wish to model consumption as a function
of the prices.
i. Fit a multivariate linear model with the consumption variables
as responses and prices as predictors. Report the coefficients, the
standard errors, and the estimated variance–covariance matrix of
the residuals.
ii. Briefly (in 2–3 sentences), interpret the regression coefficients
and their significance.
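The quantities requested in Part (b) have standard closed forms, sketched below. The notation is an assumption made for this illustration: Z is the n x (r + 1) design matrix with a column of ones followed by the r = 2 price variables, and Y is the n x m matrix of the m = 2 consumption responses. The divisor n - r - 1 gives the unbiased estimate of the residual covariance; some software reports the maximum likelihood version with divisor n instead.

```latex
% Least squares coefficient matrix and residuals
\hat{B} = (Z^\top Z)^{-1} Z^\top Y, \qquad
\hat{E} = Y - Z \hat{B}

% Estimated variance--covariance matrix of the residuals
\hat\Sigma = \frac{1}{n - r - 1}\, \hat{E}^\top \hat{E}

% Covariance between the coefficient vectors of responses i and k;
% standard errors for response i are the square roots of the
% diagonal of \hat\sigma_{ii} (Z^\top Z)^{-1}
\widehat{\operatorname{Cov}}\bigl(\hat\beta_{(i)}, \hat\beta_{(k)}\bigr)
  = \hat\sigma_{ik}\, (Z^\top Z)^{-1}
```

With n = 20 and r = 2, the divisor in the unbiased estimate is 17. Each column of the coefficient matrix coincides with the ordinary least squares fit of that response on the prices, so the responses can be checked one at a time.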