QBUS2820 Predictive Analytics
Semester 1, 2021

Individual Assignment 2

Key information

1. Required submissions:
a. ONE written report including Task A and Task B (Word or PDF format, through
Canvas - Assignments - Report Submission (Individual Assignment 2)).
b. ONE Jupyter Notebook .ipynb file (through Canvas - Assignments - Upload
Your Program Code Files (Individual Assignment 2)).
2. Due date/time: Sunday 6-Jun-2021, 11:59 pm.
3. The late penalty for the assignment is 5% of the assigned mark per day, starting after
the due date. The closing date Wednesday 16-Jun-2021, 11:59 pm is the last date on
which an assessment will be accepted for marking.
4. Weight: 30% of the final mark.
5. Anonymous marking: Owing to the anonymous marking policy of the University,
please only include your student ID in the submitted report, and do NOT include your
name. The file name of your report and code file should follow the following format.
Replace "SID" with your Student ID. Example: SIDAssignment2S22021.
6. Presentation of the assignment is part of the marking criteria. Markers
will assign 5 marks for clarity of writing and presentation. Numbers with decimals
should be reported to four decimal places.

Key rules:
▪ Carefully read the requirements for each part of the assignment.
▪ Please follow any further instructions announced on Canvas.
▪ Reproducibility is fundamental in data analysis, so you are required to submit a
Jupyter Notebook that generates your results. Not submitting your code will lead to a loss
of 50% of the assignment marks.
▪ Failure to read information and follow instructions may lead to a loss of marks. Furthermore,
note that it is your responsibility to be informed of the University of Sydney and Business
School rules and guidelines, and follow them.
▪ Referencing: Harvard Referencing System. (You may find the details at:
http://libguides.library.usyd.edu.au/c.php?g=508212&p=3476130)

Task A

Question 1

Use K-nearest neighbours (KNN) regression to answer the following questions. The data
set, with two features X1 and X2 and response Y, is given in the following table:

X1    X2    Y
5     7     3
3.3   2     5
6     4.5   6.5

Suppose we have a new test data point X0 = [4, 5]^T.

a. With k = 1 in KNN regression, find the prediction f̂(X0).
b. With k = 2 in KNN regression, find the prediction f̂(X0).
c. With k = 3 in KNN regression, find the prediction f̂(X0).
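
The mechanics of KNN regression can be checked with a short Python sketch. It assumes Euclidean distance and an unweighted average of the k nearest responses; the question does not state either choice, so treat both as assumptions. The function name `knn_predict` is illustrative only.

```python
import math

# Training data copied from the table above: (X1, X2, Y).
train = [
    (5.0, 7.0, 3.0),
    (3.3, 2.0, 5.0),
    (6.0, 4.5, 6.5),
]


def knn_predict(x0, data, k):
    """Average Y over the k training points closest to x0.

    Assumption: Euclidean distance, equal weights for all k neighbours.
    """
    by_distance = sorted(data, key=lambda row: math.dist(x0, row[:2]))
    nearest = by_distance[:k]
    return sum(row[2] for row in nearest) / k


x0 = (4.0, 5.0)
predictions = {k: knn_predict(x0, train, k) for k in (1, 2, 3)}
for k, value in predictions.items():
    print(f"k = {k}: prediction = {value:.4f}")
```

With k = 3 every training point is a neighbour, so the prediction reduces to the sample mean of Y. A different distance metric or a distance-weighted average could change the k = 1 and k = 2 results.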

Question 2

a. What is overfitting?
b. How do we cope with overfitting?
c. We have seen methods like Ridge and LASSO to reduce variance among the coefficients.
We can use these methods to do feature selection also. Which one of them is more
appropriate? Explain.

Question 3

Travellers frequently buy insurance, which pays for medical emergencies while travelling.
The premiums are determined primarily on the basis of age. However, additional variables
are often considered. Foremost among these are continuing medical problems such as cancer
and previous heart attacks. To help refine the calculation of premiums, an actuary was in
the process of determining the probabilities of various outcomes. One area of interest is
people who have diabetes. It is known that diabetics suffer a greater incidence of heart
attacks than non-diabetics. After consulting medical specialists, the actuary found that
diabetics who smoke, have high cholesterol levels and are overweight have a much higher
probability of heart attacks. Additionally, age and gender also affect the probability in
virtually all populations. To evaluate the risks more precisely, the actuary took a random
sample of diabetics and used the following regression model:

ln(y) = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + ε where

y = odds of suffering a heart attack in the next five years
x1 = average number of cigarettes smoked per day
x2 = cholesterol level
x3 = number of kilograms overweight
x4 = age
x5 = gender (1 = female; 0 = male)

The coefficients of the above regression equation are:

β̂0 = -2.15, β̂1 = 0.00847, β̂2 = 0.00214, β̂3 = 0.00539, β̂4 = 0.00989, and β̂5 = -0.288

a. What is the above model called?
b. Is an ordinary least squares (OLS) regression model appropriate in this scenario? Why or
why not?
c. Was this model estimated by the method of least squares? If not, what estimation method
was used?
d. Interpret the sign of each of the coefficients (except the intercept) in terms of the
probability that an individual will have a heart attack in the next five years.
e. Calculate the probability of a heart attack in the next five years for the following
individual who suffers from diabetes:

Average number of cigarettes per day: 20
Cholesterol level: 200
Number of kilograms overweight: 25
Age: 50
Gender: Male

f. Refer to part (e). How would you classify this particular individual?
g. Recalculate the probability of a heart attack if the individual in part (e) is able to quit
smoking.
h. Recalculate the probability of a heart attack if the individual in part (e) is able to reduce
their cholesterol level to 150.
i. Recalculate the probability of a heart attack if the individual in part (e) loses 25 kilograms.
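
Because the model is written in terms of ln(odds), a probability is recovered as odds / (1 + odds). The sketch below applies that conversion to the reported coefficients; the dictionary keys and the function name `heart_attack_probability` are illustrative labels, not part of the original model statement.

```python
import math

# Estimated coefficients as reported above.
INTERCEPT = -2.15
COEFS = {
    "cigarettes": 0.00847,     # x1: average cigarettes smoked per day
    "cholesterol": 0.00214,    # x2: cholesterol level
    "kg_overweight": 0.00539,  # x3: kilograms overweight
    "age": 0.00989,            # x4: age
    "female": -0.288,          # x5: 1 = female, 0 = male
}


def heart_attack_probability(person):
    """Convert the fitted log-odds into a probability."""
    log_odds = INTERCEPT + sum(COEFS[name] * person[name] for name in COEFS)
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


# The individual described in part (e).
baseline = {"cigarettes": 20, "cholesterol": 200, "kg_overweight": 25,
            "age": 50, "female": 0}

# Each scenario changes one input and leaves the others at baseline.
scenarios = {
    "baseline": baseline,
    "quits smoking": {**baseline, "cigarettes": 0},
    "cholesterol 150": {**baseline, "cholesterol": 150},
    "loses 25 kg": {**baseline, "kg_overweight": 0},
}
probabilities = {name: heart_attack_probability(p) for name, p in scenarios.items()}
for name, prob in probabilities.items():
    print(f"{name}: {prob:.4f}")
```

Classifying the individual additionally requires a probability cut-off, which the question leaves to the reader, so the sketch does not apply one.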

Question 4

Suppose that X1, ..., Xn form a random sample from a normal distribution for which both
the mean μ and the variance σ² are unknown.
a. Find the maximum likelihood estimators (MLEs) of μ and σ².
b. Refer to part (a). Are they unbiased for μ and σ²? Briefly explain.
c. Find the MLE of the 0.95 quantile of the distribution; i.e., of the point θ such that
P(X < θ) = 0.95.
d. Find the MLE of φ = P(X > 2).
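
A numerical illustration may help when checking derivations for this question. The sketch below relies on two standard results: the normal MLEs are the sample mean and the variance computed with divisor n, and the MLE of a function of the parameters is that function evaluated at the MLEs (the invariance property). The sample values and function names are hypothetical.

```python
from statistics import NormalDist, fmean, pvariance


def normal_mle(sample):
    """Return (mu_hat, sigma2_hat) for an i.i.d. normal sample."""
    mu_hat = fmean(sample)
    # pvariance divides by n (not n - 1), which matches the MLE of the variance.
    sigma2_hat = pvariance(sample, mu_hat)
    return mu_hat, sigma2_hat


def plug_in_estimates(sample, level=0.95, threshold=2.0):
    """Plug the MLEs into the quantile and tail-probability formulas."""
    mu_hat, sigma2_hat = normal_mle(sample)
    fitted = NormalDist(mu_hat, sigma2_hat ** 0.5)
    quantile_hat = fitted.inv_cdf(level)    # point theta with P(X < theta) = level
    tail_hat = 1.0 - fitted.cdf(threshold)  # P(X > threshold)
    return quantile_hat, tail_hat


sample = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical data, for illustration only
mu_hat, sigma2_hat = normal_mle(sample)
quantile_hat, tail_hat = plug_in_estimates(sample)
print(f"mu_hat = {mu_hat:.4f}, sigma2_hat = {sigma2_hat:.4f}")
print(f"0.95 quantile estimate = {quantile_hat:.4f}, P(X > 2) estimate = {tail_hat:.4f}")
```

Comparing `pvariance` with `statistics.variance` (divisor n - 1) on the same sample is a quick way to see the difference that part (b) asks about.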

Question 5

Consider the MA(2) model:

Y_t = θ0 + θ1 ε_{t-1} + θ2 ε_{t-2} + ε_t

where |θ1| < 1, |θ2| < 1, and ε_t ~ WN(0, σ²).

Calculate

a. E(Y_{t+i} | Ω_t) for i = 1, 2, 3, 4, where Ω_t is all information up to and including time t.
b. Var(Y_{t+i} | Ω_t) for i = 1, 2, 3, 4.
c. Derive an expression for a 2-standard error confidence band around the forecast of Y_{t+i} for
i = 1, 2, 3, 4; i.e., E(Y_{t+i} | Ω_t) ± 2√Var(Y_{t+i} | Ω_t). Your answer to part (a) gives the
forecast, and the square root of your answer to part (b) gives the standard error of your forecast.
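
The logic behind these calculations can be sketched numerically. The key idea is that shocks dated t or earlier are treated as known given the information set, while future shocks have mean zero and each contributes its squared weight times σ² to the forecast variance. The sketch assumes the last two shocks are observable and uses hypothetical parameter values; the function name `ma2_forecast` is illustrative.

```python
import math


def ma2_forecast(theta0, theta1, theta2, sigma2, eps_t, eps_tm1, horizon=4):
    """Return (mean, variance, lower, upper) for each step ahead, 1..horizon.

    Assumption: eps_t and eps_tm1 (the shocks at t and t-1) are known at time t.
    """
    weights = [1.0, theta1, theta2]       # weights on shocks at lags 0, 1, 2
    known = {0: eps_t, -1: eps_tm1}       # known shocks, keyed by offset from t
    results = []
    for i in range(1, horizon + 1):
        mean, var = theta0, 0.0
        for lag, weight in enumerate(weights):
            offset = i - lag              # this term involves the shock at t + offset
            if offset > 0:
                var += weight ** 2 * sigma2     # future shock: zero mean, adds variance
            else:
                mean += weight * known[offset]  # past shock: known, shifts the mean
        half_width = 2.0 * math.sqrt(var)
        results.append((mean, var, mean - half_width, mean + half_width))
    return results


# Hypothetical values, for illustration only.
bands = ma2_forecast(theta0=10.0, theta1=0.5, theta2=0.25, sigma2=4.0,
                     eps_t=1.0, eps_tm1=-2.0)
for step, (mean, var, lower, upper) in enumerate(bands, start=1):
    print(f"i = {step}: forecast = {mean:.4f}, variance = {var:.4f}, "
          f"band = [{lower:.4f}, {upper:.4f}]")
```

From three steps ahead onward no known shock remains in the equation, so the forecast and its variance stop changing with the horizon.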

Task B

In this task, you will use the “Visitors.csv” data to forecast the monthly number of
visitors to a country for the 24 months following the last period in the dataset.

Your objective is to develop univariate forecasting models, i.e., only using the historical
number of visitors, to address this problem.

You can download the dataset “Visitors.csv” from Canvas.

In this task, you need to:
▪ conduct exploratory data analysis
▪ select two different forecasting models, with justification, to complete the
forecasting task. At least one of the two models must be a model
covered in the unit. For each of the two models, you need to present:
• your rationale,
• methodology,
• model diagnostics,
• model validations,
• forecasts of the monthly number of visitors for the 24 months following the last
period in the dataset.
▪ present conclusions, limitations and next step suggestions.
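
One way to structure the validation step is to hold out the last 24 months, forecast them from the earlier data, and compare the error against a simple benchmark. The sketch below uses a seasonal naive forecast as that benchmark. It is only an illustration: the series is synthetic rather than read from “Visitors.csv”, whose column layout is not described here, and whether seasonal naive counts as a model covered in the unit is an assumption to check.

```python
import math


def seasonal_naive(history, horizon, season=12):
    """Forecast each future month with the value from the same month one season earlier."""
    if len(history) < season:
        raise ValueError("need at least one full season of history")
    last_season = history[-season:]
    return [last_season[h % season] for h in range(horizon)]


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Synthetic monthly series (48 months): linear trend plus a 12-month cycle.
# Hypothetical data, standing in for the real visitor counts.
visitors = [200 + 2 * t + 30 * math.sin(2 * math.pi * t / 12) for t in range(48)]

# Validation: hold out the final 24 months and forecast them from the rest.
train, holdout = visitors[:-24], visitors[-24:]
validation_rmse = rmse(holdout, seasonal_naive(train, horizon=24))

# Final forecast: refit on all data and project 24 months past the last period.
forecast = seasonal_naive(visitors, horizon=24)
print(f"validation RMSE = {validation_rmse:.4f}")
print(f"first three forecasts = {[round(v, 4) for v in forecast[:3]]}")
```

The same split and error function can be reused for each candidate model, which keeps the model comparison on a common footing.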
