STATS 369
Page 1 of 6
THE UNIVERSITY OF AUCKLAND


SEMESTER TWO 2019
Campus: City



STATISTICS


Data Science Practice




(Time Allowed: THREE hours)



NOTE:

Calculators are permitted

There are 5 questions, with a total of 125 marks.

1. In the following R code:

sqcon <- dbConnect(dbDriver("SQLite"), "WIKIDB/sqlite.db")
sqevents <- tbl(sqcon, "events")
byday <- sqevents %>%
    mutate(timestamp = as.character(timestamp)) %>%
    mutate(day = if_else(substr(timestamp, 7, 8) == "30",
                         substr(timestamp, 8, 9),
                         substr(timestamp, 7, 8))) %>%
    group_by(session_id, day) %>%
    summarise(pages = sum(action == "visitPage")) %>%
    group_by(day) %>%
    summarise(mean(pages > 0)) %>%
    arrange(day) %>%
    collect()


(a) What does the call to tbl do?
(5 marks)

(b) What does the second call to mutate do and how would the code differ if you were
working in memory instead of from a database?
(5 marks)

(c) What does group_by(session_id,day) do?
(5 marks)

(d) At what point does the SQL query creating the variable pages get run?
(5 marks)





(20 marks total)
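(A sketch, not part of the exam paper: the code in Question 1 presumes that DBI, dplyr and dbplyr are loaded and that "WIKIDB/sqlite.db" contains an events table. The setup below builds a throwaway in-memory stand-in so the pipeline can be run; the column names come from the query, while the example rows are hypothetical.)

```r
# Assumed setup, for illustration only: a tiny in-memory SQLite database
# with an "events" table shaped like the one the exam code queries.
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "events",
             data.frame(session_id = c("s1", "s1", "s2"),
                        timestamp  = c("20190630120000", "20190630121500",
                                       "20190701090000"),
                        action     = c("visitPage", "searchResultPage",
                                       "visitPage")))
sqevents <- tbl(con, "events")  # a lazy table reference; no rows are fetched yet
```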

2. Consider the following R/keras code. The inputs are 400-word text excerpts with all
but the 5000 most common words removed, and the goal is to classify the sentiment of
each excerpt as positive or negative:

filters <- 250
kernel_size <- 3
model %>%
    layer_embedding(input_dim = 5000, output_dim = 50, input_length = 400) %>%
    layer_dropout(0.2) %>%
    layer_conv_1d(filters, kernel_size, padding = "valid",
                  activation = "relu") %>%
    layer_global_max_pooling_1d() %>%
    layer_dense(250) %>%
    layer_dropout(0.2) %>%
    layer_activation("relu") %>%
    layer_dense(1) %>%
    layer_activation("sigmoid")

(a) What does the layer_embedding() line do?
(6 marks)
(b) What does layer_dropout(0.2) do?
(6 marks)
(c) What does the layer_conv_1d() line do?
(6 marks)
(d) Give an advantage of this model over a ‘bag of words’ representation, and an
advantage over a dense multilayer perceptron.
(12 marks)







(30 marks total)
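(A sketch, not part of the exam paper: how a model like the one in Question 2 would typically be compiled and trained for this binary task. The names x_train and y_train are hypothetical, standing for integer-encoded excerpts and 0/1 sentiment labels.)

```r
library(keras)

model %>% compile(
    loss = "binary_crossentropy",  # matches the single sigmoid output unit
    optimizer = "adam",
    metrics = "accuracy"
)
# x_train: integer matrix, one row per excerpt, 400 word indices in [0, 4999]
# y_train: vector of 0/1 sentiment labels
# model %>% fit(x_train, y_train, batch_size = 32, epochs = 3,
#               validation_split = 0.2)
```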

3. Consider the following R code and output, which predicts the area in Italy where samples
of olive oil were grown, in order to help detect inaccurate labelling:

olive <- read.csv("olive.csv")
xgb.cv(data = as.matrix(select(olive, palmitic:eicosenoic)),
       label = olive$area - 1, num_class = 9, nrounds = 10, nfold = 10,
       objective = "multi:softmax")
## [1] train-merror:0.032639+0.005834 test-merror:0.092732+0.044335
## [2] train-merror:0.021364+0.003561 test-merror:0.070125+0.040668
## [3] train-merror:0.017674+0.002794 test-merror:0.068401+0.042468
## [4] train-merror:0.012625+0.002164 test-merror:0.068401+0.042468
## [5] train-merror:0.008158+0.002086 test-merror:0.070186+0.042181
## [6] train-merror:0.005632+0.001825 test-merror:0.071850+0.043688
## [7] train-merror:0.004663+0.001290 test-merror:0.071850+0.039872
## [8] train-merror:0.002720+0.000952 test-merror:0.071819+0.042138
## [9] train-merror:0.002331+0.000778 test-merror:0.073695+0.038882
## [10] train-merror:0.001553+0.000777 test-merror:0.064885+0.042008

(a) Describe the class of model being fitted: what are the components and how are they
estimated and then combined into a single overall predictor?
(15 marks)
(b) How are the train and test errors computed, and why is the train error smaller?
(5 marks)
(c) What does nrounds=10 mean, and how many rounds appears to be optimal?
(5 marks)
(d) If olive oil not from Italy were tested using the resulting classifier, would it be
possible to tell from the xgboost output that the predictions were not reliable? If so,
how?
(5 marks)







(30 marks total)
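(A sketch, not part of the exam paper: once a number of rounds has been chosen from the cross-validation output in Question 3, a single final model would typically be fitted on all the data and used for prediction. The value of nrounds below simply reuses the exam's setting and is a placeholder for whatever the CV suggests.)

```r
library(xgboost)
library(dplyr)

X <- as.matrix(select(olive, palmitic:eicosenoic))
fit <- xgboost(data = X, label = olive$area - 1,
               num_class = 9, nrounds = 10,   # placeholder: choose from the CV
               objective = "multi:softmax", verbose = 0)
pred_area <- predict(fit, X) + 1  # back to the original 1..9 area coding
```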

4. Consider the following algorithms/estimators as covered in the course:
1. forward selection minimising AIC for linear regression
2. random forests
3. lasso
4. xgboost
5. convolutional neural networks



(a) Which one(s) give(s) sparse predictors?
(5 marks)

(b) Which one(s) find(s) the true training-set optimum of their objective function?
(5 marks)

(c) Which one(s) use(s) an explicit penalty for regularisation?
(5 marks)

(d) Which one(s) can exactly reproduce a linear relationship?
(5 marks)



(20 marks total)

5. A research paper published in 2016 claimed that a neural network could distinguish
‘criminal’ from ‘non-criminal’ men with 95% accuracy using photographs of their faces. The
algorithm was trained on 1856 national ID card photographs of convicted criminals and 1126
photographs of non-criminals scraped from the internet, converted to grayscale, and then
trimmed and scaled to the same resolution as the ID card photographs. The subjects of the
photos were all of the same ethnicity and aged 18 to 55.

The researchers say in their Conclusions section: “By extensive experiments and vigorous
cross validations, we have demonstrated that via supervised machine learning, data-driven
face classifiers are able to make reliable inference on criminality… After controlled for
race, gender and age, the general law-biding public have facial appearances that vary in a
significantly lesser degree than criminals”.


(a) The algorithm was reported to have an area under the ROC curve of 0.95, with
sensitivity and specificity of 90%. Suppose that 10% of men have a criminal
conviction of the sort being used. What proportion of classifications in the
population would be accurate based on these numbers?
(8 marks)
(b) What is one reason, apart from a visible impact of criminality, that the two
groups of photographs would be systematically different?
(7 marks)
(c) For such a classifier to be useful, it would have to be used on photographs of
people who do not yet have a criminal conviction, to predict future criminal
behaviour. What are the major impacts (beneficial or harmful) of doing this,
both assuming that the classifier truly is as accurate as claimed, and assuming
that it is actually much less accurate?
(10 marks)



(25 marks total)

