The University of Sydney Page 1

STAT5003

Week 13

Review and Final Exam

Presented by

Dr. Justin Wishart


Exam format

– Two hour written exam

– 20 Multiple Choice questions

– Questions can have one or two correct answers. You need to select the

exact correct answer(s) to get a mark

– Some short answer questions

– Two longer answer questions


Topics covered

– Everything in the lectures/tutorials from Weeks 1 to 12 (except

any topic that was marked as not examinable)

– Writing R code is not tested, but there could be questions on

interpreting R outputs

– You should understand how the algorithms work and be able to

sketch out the key steps in pseudo code


Methods we have learnt

– Regression
    – Multivariate linear regression
– Clustering
    – Hierarchical clustering
    – K-means clustering
– Classification
    – Logistic regression
    – LDA
    – KNN
    – SVM
    – Random Forest
    – Decision trees
    – Boosted trees (AdaBoost, XGBoost, GBM)


Multiple Regression

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon$

– Find coefficients to minimise the total sum of squares of the

residuals
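As a minimal sketch of the least squares idea, assuming a single predictor for brevity (the slide covers the multiple-predictor case), the coefficients that minimise the residual sum of squares have a closed form. The function names and the toy data are illustrative assumptions, not course material.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope is the ratio of the x-y cross-deviation sum to the x deviation sum.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residual_sum_of_squares(x, y, intercept, slope):
    """Sum of squared differences between observed and fitted values."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data, not taken from the lecture.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]
b0, b1 = fit_simple_ols(x, y)
print(b0, b1, residual_sum_of_squares(x, y, b0, b1))
```

Any other choice of intercept and slope gives a residual sum of squares at least as large as the one printed.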


Local regression (smoothing)

A typical model in this case is

$y = f(x) + \epsilon$

– The function f is some smooth function (differentiable).


Density estimation

– Maximum Likelihood approach

– Reformulate as

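$L(\theta) = f(x_1, x_2, \dots, x_n \mid \theta)$: probability of observing $x_1, x_2, \dots, x_n$ given parameter(s) $\theta$

$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta) \;\rightarrow\; \ln L(\theta) = \sum_{i=1}^{n} \ln f(x_i \mid \theta)$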
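A small sketch of the log-likelihood in code may help. It assumes independent observations from an exponential density f(x | rate) = rate * exp(-rate * x), which is chosen here only because its maximiser has a simple closed form. The data and function names are hypothetical.

```python
import math


def exponential_log_likelihood(rate, data):
    """ln L(rate) = sum over i of ln f(x_i | rate) for the exponential density."""
    return sum(math.log(rate) - rate * x for x in data)


def exponential_mle(data):
    """Closed-form maximiser of the log-likelihood above: n / sum(x_i)."""
    return len(data) / sum(data)


# Hypothetical observations, not from the lecture.
data = [0.5, 1.2, 0.3, 2.0, 1.0]
rate_hat = exponential_mle(data)
print(rate_hat, exponential_log_likelihood(rate_hat, data))
```

Evaluating the log-likelihood at values slightly above or below `rate_hat` gives a smaller result, which is what "maximum likelihood" means.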


Kernel density estimation

– Smooths the data with a chosen hyperparameter (bandwidth)

to estimate the density.

$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$
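The estimator can be sketched directly from its formula. The Gaussian kernel is an assumption here (the slide does not name a kernel), and the data and bandwidth values are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, data, h):
    """Kernel density estimate at x: (1 / (n h)) * sum of K((x - x_i) / h)."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)


# Hypothetical sample and bandwidths, not from the lecture.
data = [-1.0, 0.0, 2.0]
for h in (0.2, 0.5, 1.0):
    print(h, kde(0.0, data, h))
```

A smaller bandwidth `h` follows the individual data points more closely, while a larger one gives a smoother estimate.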


Hierarchical Clustering

– Bottom-up clustering approach.
– Each point starts as its own cluster.
– Clusters are formed by repeatedly merging the closest pair of clusters.
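The bottom-up procedure can be sketched as follows. Single linkage (distance between the closest members of two clusters) and Euclidean distance are assumptions made for this sketch, and the points are hypothetical.

```python
import math


def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    # Every point starts as its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical points, not from the lecture.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
print(single_linkage(points, 2))
```

Stopping at a different `n_clusters` corresponds to cutting the dendrogram at a different height.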


K-means algorithm

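– 1. Data points are randomly allocated to K clusters.
– 2. The centre of each cluster is computed.
– 3. Each data point is matched to its closest centre.
– 4. Repeat steps 2 and 3 until the allocation stops changing.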
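The steps above can be sketched as below. Euclidean distance, the `max_iter` cap and the handling of empty clusters (re-seeding at a random data point) are assumptions of this sketch rather than part of the slide.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Return (labels, centres) after running the K-means steps."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Step 1: randomly allocate each point to one of k clusters.
    labels = [rng.randrange(k) for _ in points]
    centres = []
    for _ in range(max_iter):
        # Step 2: compute each cluster centre as the mean of its members.
        centres = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                # Assumption: an empty cluster is re-seeded at a random point.
                members = [rng.choice(points)]
            centres.append(
                tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
            )
        # Step 3: match every point to its closest centre.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points
        ]
        # Step 4: repeat until the allocation no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centres


# Hypothetical points forming two well-separated groups.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
print(k_means(points, 2))
```

In general the result can depend on the random initial allocation, so it is common to run the algorithm from several starts and keep the best solution.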


Principal Components Analysis (PCA)

– Find linear combinations of the variables that maximise the variance.
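As a rough sketch, the first principal component is the direction of largest variance, which is the leading eigenvector of the sample covariance matrix. Power iteration is used below only to keep the example dependency-free; it is a swapped-in technique, not necessarily how statistical software computes PCA. It assumes the starting vector is not orthogonal to the leading direction and that the data are not constant.

```python
import math


def first_principal_component(data, n_iter=200):
    """Return (direction, variance) of the first principal component."""
    n = len(data)
    dim = len(data[0])
    # Centre each variable at its mean.
    means = [sum(row[j] for row in data) / n for j in range(dim)]
    centred = [[row[j] - means[j] for j in range(dim)] for row in data]
    # Sample covariance matrix.
    cov = [
        [sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(dim)]
        for a in range(dim)
    ]
    # Power iteration: repeatedly multiply by cov and renormalise.
    v = [1.0] * dim
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance of the data projected onto v.
    variance = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)
    )
    return v, variance


# Hypothetical data lying exactly on the line y = 2x.
data = [(float(x), 2.0 * x) for x in range(5)]
print(first_principal_component(data))
```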


PCA and t-SNE

[Figure panels: PCA and t-SNE]


Logistic Regression

Logistic regression model:

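$z = \log\frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p = \beta^T x$

$p(x) = \Pr(Y = 1 \mid x) = h_\beta(x) = g(z) = \frac{1}{1 + e^{-z}}$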

[Figure: logistic curve, Probability (0 to 1) plotted against X (-5 to 5)]
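The logistic function above can be sketched in a few lines. The coefficient values are hypothetical, and `predict_probability` is an illustrative name rather than a function from any library.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)); intended for moderate values of z."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(beta, x):
    """Pr(Y = 1 | x) for beta = [b0, b1, ..., bp] and x = [x1, ..., xp]."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return sigmoid(z)


# Hypothetical coefficients: intercept 0 and slope 1, over the range -5 to 5.
for x in range(-5, 6):
    print(x, round(predict_probability([0.0, 1.0], [float(x)]), 3))
```

The probability is exactly 0.5 where the linear predictor is zero, and approaches 0 or 1 as it becomes strongly negative or positive.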


Linear Discriminant Analysis (LDA)

$\pi_k$: Probability of coming from class k (prior probability)

$f_k(x)$: Density function for X given that X is an observation from class k
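These two quantities combine through Bayes' theorem to give the posterior probability of class k. Writing $\pi_k$ for the prior, $f_k(x)$ for the class density and $K$ for the number of classes (symbol names assumed here), the standard form is:

```latex
\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}
```

An observation is assigned to the class with the largest posterior probability. LDA additionally assumes that each $f_k$ is Gaussian with a covariance matrix shared across classes.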


Cross validation

– Fitting a model to the entire dataset can overfit the data and perform poorly on new data.
– Split the data into training and test sets to alleviate this and find the right bias/variance trade-off.
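A minimal sketch of how K-fold cross-validation splits the data is shown below. It produces index sets only; fitting and scoring a model on each split is left out. The function name and the shuffling seed are assumptions.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


for train, test in k_fold_indices(10, 5):
    print("train:", sorted(train), "test:", sorted(test))
```

Each observation appears in exactly one test fold, so every data point is used for both training and testing, but never in the same fit.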


Bootstrap

– Simulate related datasets by sampling the observed data with replacement, and examine the behaviour of the statistic across all the re-sampled datasets.
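As an illustrative sketch, the bootstrap can estimate the standard error of a statistic. The data, the number of resamples and the function name are assumptions chosen for the example.

```python
import random
import statistics


def bootstrap_standard_error(data, statistic, n_resamples=1000, seed=0):
    """Standard deviation of the statistic over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Resample n observations from the original data, with replacement.
        resample = [rng.choice(data) for _ in range(n)]
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)


# Hypothetical data, not from the lecture.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print(bootstrap_standard_error(data, statistics.mean))
print(bootstrap_standard_error(data, statistics.median))
```

The same function works for statistics such as the median, where no simple formula for the standard error is available.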


Support Vector Machines (SVM)

– Find the best hyperplane or boundary to separate data into

classes.

– Image taken from

https://en.wikipedia.org/wiki/Support_vector_machine


Missing Data

– Remove missing data (complete cases)

– Single Imputation

– Multiple imputation

– Expert knowledge of reasons for missing data.


Basic decision trees

– Partition space into rectangular regions that minimise outcome

deviation.



Bagging trees and random forests

– Use bootstrap technique to

create resampled trees and

average the result.

– $\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$

– Random forests additionally sample a random subset of the predictors at each split, which decorrelates the trees and improves the model.


Boosting

– Fit tree to residuals and learn slowly

– Slowly improve the fit in areas where the model doesn’t

perform well.

– Some boosting algorithms discussed

– AdaBoost

– Stochastic gradient boosting

– XGBoost
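The idea of fitting trees to residuals and learning slowly can be sketched with one-split regression trees (stumps) on a single predictor. This is a simplified illustration under those assumptions, not an implementation of AdaBoost, stochastic gradient boosting or XGBoost. It needs at least two distinct x values, and the data are hypothetical.

```python
def fit_stump(x, r):
    """Fit a one-split regression tree to r: (threshold, left_mean, right_mean)."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2.0
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Repeatedly fit a stump to the current residuals, adding a shrunken copy."""
    stumps = []
    residuals = list(y)
    for _ in range(n_rounds):
        t, lm, rm = fit_stump(x, residuals)
        stumps.append((t, lm, rm))
        # Learn slowly: remove only a fraction of the fitted values.
        residuals = [
            ri - learning_rate * (lm if xi <= t else rm)
            for xi, ri in zip(x, residuals)
        ]
    return stumps


def predict(stumps, xi, learning_rate=0.1):
    """Sum of the shrunken stump predictions at xi."""
    return sum(learning_rate * (lm if xi <= t else rm) for t, lm, rm in stumps)


def mean_squared_error(stumps, x, y, learning_rate=0.1):
    return sum((yi - predict(stumps, xi, learning_rate)) ** 2 for xi, yi in zip(x, y)) / len(y)


# Hypothetical data with a jump between x = 4 and x = 5.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.0, 1.2, 0.9, 1.1, 5.0, 5.1, 4.9, 5.2]
for rounds in (1, 10, 50):
    print(rounds, mean_squared_error(boost(x, y, n_rounds=rounds), x, y))
```

The training error falls as rounds are added; in practice the number of rounds and the learning rate are tuned on held-out data to avoid overfitting.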


Feature Selection

– Filter selection via fold changes.

– Best subset selection.

– Forward selection.

– Backward selection.

– Choose model that minimises test error

– Directly via test set

– Indirectly via penalised criterion.


Ridge Regression and Lasso

– Constrained optimisation techniques that minimise the residual sum of squares subject to different constraints on the coefficients.
– Lasso has the added benefit of performing feature selection.
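Written out in their standard constrained form, with $s$ as the constraint budget (notation assumed here), the two problems differ only in how the size of the coefficients is measured:

```latex
\text{Ridge: } \min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2
\quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s

\text{Lasso: } \min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2
\quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s
```

The absolute-value constraint of the lasso can set some coefficients exactly to zero, which is why it performs feature selection, whereas ridge only shrinks them towards zero.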


Monte Carlo Methods

– Repeated simulation to estimate the full distribution and

summary values.

– Exploits law of large numbers.

– If the inverse of the CDF $F$ exists, we can sample from $f$ by generating $X = F^{-1}(U)$, where $U \sim \text{Uniform}(0, 1)$.

– Acceptance rejection method to handle more difficult

distributions.

$E[g(X)] = \int g(x) \cdot f(x)\,dx \approx \frac{1}{n} \sum_{i=1}^{n} g(x_i)$
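A short sketch combining the inverse-CDF method with a Monte Carlo average is given below. The exponential distribution is an assumed example, chosen because its CDF inverts in closed form. The rate, the sample size and the function names are hypothetical.

```python
import math
import random


def sample_exponential(rate, rng):
    """Inverse-CDF sampling: F(x) = 1 - exp(-rate * x), so x = -ln(1 - u) / rate."""
    u = rng.random()
    return -math.log(1.0 - u) / rate


def monte_carlo_mean(g, sampler, n, seed=0):
    """Approximate E[g(X)] by averaging g over n simulated draws."""
    rng = random.Random(seed)
    return sum(g(sampler(rng)) for _ in range(n)) / n


# Estimate E[X] for an exponential distribution with rate 2 (true value 0.5).
estimate = monte_carlo_mean(
    lambda x: x, lambda rng: sample_exponential(2.0, rng), 100_000
)
print(estimate)
```

By the law of large numbers the average approaches the true expectation as the number of draws grows.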


Markov Chain Monte Carlo

– Widely used in Bayesian modelling.
– Simulates a process (a random variable that changes over time).
– Simulate each new point based on the current point.
– Can estimate even more complex distributions than plain Monte Carlo methods.
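One common MCMC procedure is the random-walk Metropolis algorithm, sketched below. The standard normal target, the Gaussian proposal and the step size are assumptions made for the example. The target only needs to be known up to a constant, which is why it is useful for Bayesian posteriors.

```python
import math
import random


def metropolis(log_target, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target density."""
    rng = random.Random(seed)
    current = start
    current_lp = log_target(current)
    samples = []
    for _ in range(n_samples):
        # Propose a new point based on the current point.
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        # On rejection the chain stays at the current point.
        samples.append(current)
    return samples


def log_standard_normal(x):
    """Log density of N(0, 1), up to an additive constant."""
    return -0.5 * x * x


samples = metropolis(log_standard_normal, 0.0, 5000)
print(sum(samples) / len(samples))
```

Successive samples are correlated, so in practice early draws are often discarded as burn-in and the chain is run for longer than an independent sample would need.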


Methods and metrics to evaluate models

– Sensitivity and specificity

– Accuracy

– Residual sum of squares (for regression)

– ROC curves and AUC

– K-fold cross-validation
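For binary classification, the first two metrics follow directly from the confusion matrix. The sketch below assumes labels coded as 1 (positive) and 0 (negative); the example labels are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Return (tp, tn, fp, fn) for labels coded 1 = positive, 0 = negative."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(tp, tn, fp, fn):
    """Proportion of actual positives that are predicted positive."""
    return tp / (tp + fn)


def specificity(tp, tn, fp, fn):
    """Proportion of actual negatives that are predicted negative."""
    return tn / (tn + fp)


def accuracy(tp, tn, fp, fn):
    """Proportion of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels, not from the lecture.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
counts = confusion_counts(actual, predicted)
print(counts, sensitivity(*counts), specificity(*counts), accuracy(*counts))
```

An ROC curve plots sensitivity against 1 - specificity as the classification threshold varies, and AUC is the area under that curve.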


Example multiple choice question

Which of the following method(s) is/are unsupervised learning

methods?

A. K-means clustering

B. Logistic regression

C. Random forest

D. Support vector machines


Example short answer question

a. Explain how the parameters are estimated in simple least

squares regression.

b. Explain a scenario where simple linear regression is not

appropriate.

c. Compute the predicted weight for a person that is 160cm tall

and compute the residual of the first person in the table below.

Sample   X: Height (cm)   Y: Weight (kg)
1        160              60
2        170.2            77
3        172              62

$\hat{Y} = 50.412 + 0.0634X$
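For part (c), the arithmetic can be sketched as follows, assuming the fitted line reads predicted weight = 50.412 + 0.0634 × height (the predictor appears to have been lost from the printed equation). The variable names are illustrative.

```python
# Coefficients as given in the question (assumed to multiply height in cm).
intercept = 50.412
slope = 0.0634

# First person in the table: 160 cm tall, observed weight 60 kg.
height_cm = 160.0
observed_weight_kg = 60.0

# Predicted value from the fitted line.
predicted_weight_kg = intercept + slope * height_cm

# Residual = observed value minus predicted value.
residual_kg = observed_weight_kg - predicted_weight_kg

print(round(predicted_weight_kg, 3), round(residual_kg, 3))
```

Under that reading the predicted weight is about 60.556 kg and the residual is about -0.556 kg, so the first person weighs slightly less than the line predicts.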


Example long answer question

– Describe the Markov Chain Monte Carlo procedure. You may

use pseudo code as part of your answer.