ST3400 UNIVERSITY OF WARWICK SUMMER EXAMINATIONS 2020
PROGRAMMING FOR DATA SCIENCE
Time Allowed: 2 Hours

Full marks may be gained by correctly answering three complete questions. Candidates may attempt all questions. Marks will be awarded for the best three answers only.

1. (a) For each of the following statements, provide either a proof or a counterexample:
(i) $f = O(g) \implies f = o(g)$, [2]
(ii) $f = o(g)$, $g = o(h) \implies f = o(h)$. [4]

(b) Define the following concepts in the context of supervised learning:
(i) Overfitting and underfitting. [3]
(ii) The curse of dimensionality. [3]

(c) Consider a regression problem with design matrix $(x_{ij}) \in \mathbb{R}^{N \times D}$, a response vector $(y_1, \dots, y_N)^\top \in \mathbb{R}^N$, and a vector of regression coefficients $(\beta_1, \dots, \beta_D)^\top \in \mathbb{R}^D$, i.e.
$$y_i = \sum_{j=1}^{D} \beta_j x_{ij} + \epsilon_i, \qquad i = 1, \dots, N,$$
for some independent, mean-zero errors $\epsilon_i$.
(i) Write down the optimization problem in this setting corresponding to
(I) ordinary least squares regression, [2]
(II) ridge regression, [2]
(III) LASSO regression. [2]
(You do not need to solve these optimization problems.)
(ii) Suggest an advantage of ridge regression over the other two, and suggest an advantage of LASSO regression over the other two. [2]
[TOTAL: 20]

2. Consider the computation of $ABc$, where $A$ and $B$ are both $n \times n$ matrices and $c$ is an $n \times 1$ vector.
(a) One approach to compute $ABc$ is to multiply $A$ and $B$ first, and then multiply the product $AB$ by $c$. Give the exact time complexity of this approach in terms of the number of basic instructions, where both additions and multiplications count as basic instructions. [3]
(b) The other approach is to compute the product $Bc$ first, and then multiply $A$ by this product. Give the exact time complexity of this approach in terms of the number of basic instructions, where both additions and multiplications count as basic instructions. [2]
(c) Denote by $f(n)$ the time complexity in (a) and by $g(n)$ the time complexity in (b).
Which of the following statements are correct (there may be more than one): (1) $f = O(g)$, (2) $f = o(g)$, (3) $f = \Omega(g)$, (4) $f = \Theta(g)$? Provide a proof for your choice. [3]
(d) A tridiagonal matrix is a band matrix that has nonzero elements only on the main diagonal, the first diagonal below it, and the first diagonal above it. Compute $f(n)$ and $g(n)$ when $A$ and $B$ are both tridiagonal, using as few basic instructions as possible. [6]
(e) Provide a sufficient condition on $A$, $B$ and $c$ such that $f(n) = g(n) = 2n$. [2]
(f) Design the fastest algorithm for computing $A^{10}$, where $A$ is an $n \times n$ matrix. Compute the time complexity of your algorithm. [4]
[TOTAL: 20]

3. (a) The PageRank algorithm is used for ranking web pages. It makes use of a large $n \times n$ stochastic matrix of the form
$$G = \alpha(H + dw^\top) + (1 - \alpha)\mathbf{1}p^\top,$$
where $\alpha \in [0, 1]$, $H$ is a sparse "hyperlink" transition matrix with entries $H_{ij} \neq 0$ iff there is a link from page $i$ to page $j$, $d$ is a vector with $d_i = 1$ iff page $i$ does not link to any other pages, $w$ and $p$ are probability vectors with nonnegative elements summing to 1, and $\mathbf{1}$ is a vector of all ones.
(i) Provide an interpretation of the transition matrix $G$ in terms of the behaviour of a "random surfer". [2]
(ii) Give sufficient conditions on $G$ for the location of the random surfer to converge to a unique distribution, and explain how, given any hyperlink matrix $H$, we can choose $G$ so that it satisfies these conditions. [2]

(b) Consider the miniature world-wide web shown below, comprising three web pages. Page 1 links to page 2, page 2 links to page 3, and page 3 links to page 1.

[Figure: three pages with links 1 → 2 → 3 → 1.]

For this example, assume $\alpha = 1$ and $w = (1/3, 1/3, 1/3)^\top$.
(i) Write down $G$ for this example. [2]
(ii) Show the PageRank scores of the three web pages. (By 'PageRank score' we mean $\pi_i$, where $\pi^\top G = \pi^\top$ and $\pi = (\pi_1, \pi_2, \pi_3)^\top$ is a probability vector with nonnegative components summing to 1.)
[4]
(iii) If the owner of web page 3 removes the link from page 3 to page 1, compute the PageRank scores of the three web pages. [4]

(c) Two-armed Bernoulli bandit problems.
(i) Define the epsilon-greedy strategy. [2]
(ii) In an experiment, arm 1 has been played 10 times with 5 successes observed, whilst arm 2 has been played 20 times with 8 successes observed. Assuming that the unknown probabilities of success associated with arms 1 and 2 are 0.5 and 0.6 respectively, what is the expected reward at the next time step when playing the epsilon-greedy strategy with $\epsilon = 0.1$? [4]
[TOTAL: 20]

4. For this question, recall the linear regression $Y = X\beta + e$, where $X = [x_1, \dots, x_N]^\top$ is the $N \times D$ design matrix, and $Y = (y_1, \dots, y_N)^\top$ and $e$ are $N \times 1$ column vectors. Suppose $N = 1000$ and $D = 10$. Assume that ordinary least squares is used to estimate $\beta$.
(a) Suppose you want to use best subsets regression to choose the model that has the smallest squared error. Describe how you may use 10-fold cross-validation to do this. Provide the pseudo-code. [5]
(b) Now suppose you want to assess the predictive performance of the full model by calculating
$$\mathrm{CV} = \frac{1}{N}\sum_{i=1}^{N} \big(y_i - \hat{y}^{(i)}\big)^2,$$
where $\hat{y}^{(i)}$ is the predicted value at $x_i$ obtained when the model is estimated with the $i$th data point deleted. This is also called leave-one-out cross-validation. Denote the ordinary least squares estimate of $\beta$ by $\hat\beta$.
(i) Denote by $\hat\beta^{(i)}$ the ordinary least squares estimate of $\beta$ with the $i$th data point deleted. Prove
$$\hat\beta^{(i)} = (X^\top X)^{-1} X^\top Y^{(i)},$$
where $Y^{(i)} = (y_1, \dots, y_{i-1}, \hat{y}^{(i)}, y_{i+1}, \dots, y_N)^\top$ with $\hat{y}^{(i)} = x_i^\top \hat\beta^{(i)}$. [2]
(ii) Using the result in (i), prove
$$y_i - \hat{y}^{(i)} = \frac{y_i - \hat{y}_i}{1 - h_{ii}},$$
where $\hat{y}_i = x_i^\top \hat\beta$ and $h_{ii}$ is the $i$th diagonal entry of $H = X(X^\top X)^{-1} X^\top$. [5]
(iii) Comment on the implication of the result in (ii). [2]
(iv) Recall that the ridge regression estimator can be written as
$$\hat\beta^{\mathrm{RIDGE}} = (X^\top X + \lambda I_D)^{-1} X^\top Y,$$
where $I_D$ is the $D \times D$ identity matrix. Assume $e \sim N(0, \sigma^2 I_N)$.
Show
$$\hat\beta^{\mathrm{RIDGE}} \sim N\big(W X^\top X \beta, \; \sigma^2 W X^\top X W\big), \qquad W := (X^\top X + \lambda I_D)^{-1}. \quad [2]$$
(v) Diagonalising the symmetric matrix $X^\top X = PDP^\top$ ($P$ orthogonal, $PP^\top = P^\top P = I_D$, $D$ diagonal), show that
$$\mathrm{Var}(\hat\beta^{\mathrm{RIDGE}}) = \sigma^2 P (D + \lambda I_D)^{-2} D P^\top,$$
which is decreasing in $\lambda$, and
$$\mathrm{Bias}^2(\hat\beta^{\mathrm{RIDGE}}) = \lambda^2 \|W\beta\|_2^2,$$
which is increasing in $\lambda$. [4]
[TOTAL: 20]
End

ST3400 UNIVERSITY OF WARWICK SUMMER EXAMINATIONS 2020: SOLUTIONS

1. (a) (i) [SEEN SIMILAR, 2 marks] False. For example, $n = O(n)$ but $n \neq o(n)$, since $n = o(n)$ would imply that $n \leq \epsilon n$ for sufficiently large $n$, for any $\epsilon > 0$.
(ii) [SEEN SIMILAR, 4 marks] Suppose that for every $\epsilon > 0$ there exist $n_0(\epsilon), n_1(\epsilon)$ such that $f(n) \leq \epsilon g(n)$ for all $n \geq n_0(\epsilon)$ and $g(n) \leq \epsilon h(n)$ for all $n \geq n_1(\epsilon)$. Now, for any $\epsilon > 0$ we have, for all $n \geq \max\{n_0(\sqrt{\epsilon}), n_1(\sqrt{\epsilon})\}$,
$$f(n) \leq \sqrt{\epsilon}\, g(n) \leq \sqrt{\epsilon}\sqrt{\epsilon}\, h(n) = \epsilon h(n),$$
i.e. $f = o(h)$.

(b) (i) [BOOKWORK, 3 marks] Overfitting is the phenomenon whereby the model for a supervised learning problem is too flexible and the fitted solution captures both the signal and the noise in the training data. It is usually caused by a model that is too complicated / of too high capacity. Underfitting is the reverse: the model is too inflexible and cannot capture the signal in the training data. It is usually caused by a model that is too simple / of too low capacity.
(ii) [BOOKWORK, 3 marks] The curse of dimensionality refers to the phenomenon that in a high-dimensional space, even the nearest points are still a long way apart, so methods that rely on local neighbourhoods break down.

(c) (i) (I) [BOOKWORK, 2 marks] The ordinary least squares (OLS) estimator $\hat\beta^{\mathrm{OLS}}$ minimizes $\|Y - X\beta\|_2^2$, where $Y = (y_1, \dots, y_N)^\top$, $X = (x_{ij})$ and $\beta = (\beta_1, \dots, \beta_D)^\top$, i.e.
$$\hat\beta^{\mathrm{OLS}} = \arg\min_\beta \sum_{i=1}^{N} (Y - X\beta)_i^2.$$
(II) [BOOKWORK, 2 marks] The ridge regression estimator minimizes $\|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$, i.e.
$$\hat\beta^{\mathrm{RIDGE}} = \arg\min_\beta \sum_{i=1}^{N} (Y - X\beta)_i^2 + \lambda \sum_{j=1}^{D} \beta_j^2.$$
(III) [BOOKWORK, 2 marks] The LASSO estimator $\hat\beta^{\mathrm{LASSO}}$ minimizes $\|Y - X\beta\|_2^2 + \lambda \|\beta\|_1$, i.e.
$$\hat\beta^{\mathrm{LASSO}} = \arg\min_\beta \sum_{i=1}^{N} (Y - X\beta)_i^2 + \lambda \sum_{j=1}^{D} |\beta_j|.$$
(ii) [SEEN, 2 marks] Unlike OLS and LASSO, ridge regression always has a unique, closed-form solution.
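This advantage can be illustrated numerically: when $D > N$ the matrix $X^\top X$ is rank-deficient, so the OLS normal equations have no unique solution, while the ridge system is always invertible. A minimal sketch with synthetic data (numpy assumed available; the dimensions are chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 10                     # more predictors than observations
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)

XtX = X.T @ X
# With D > N, X^T X cannot have full rank, so OLS has no unique solution.
assert np.linalg.matrix_rank(XtX) < D

# Ridge always has the unique closed form (X^T X + lam I_D)^{-1} X^T y.
lam = 0.1
beta_ridge = np.linalg.solve(XtX + lam * np.eye(D), X.T @ y)
```

Any $\lambda > 0$ makes $X^\top X + \lambda I_D$ positive definite, which is why the `solve` call above never fails.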
[OLS does not always have a unique solution; LASSO's solution is not closed-form.] The LASSO solution typically sets a subset of the coefficients $\beta_j$ exactly to 0, encouraging sparsity in the solution, which aids interpretability.

2. (a) [SEEN SIMILAR, 3 marks] Define $C = AB$. Obtaining each element of $AB$ involves $n$ multiplications and $n-1$ additions, so the time complexity of computing $AB$ is $n^2[n + (n-1)]$. The time complexity of computing the product of $AB$ and $c$ is $n[n + (n-1)]$. Thus the total time complexity is $n^2(2n-1) + n(2n-1) = n(n+1)(2n-1)$.
(b) [SEEN SIMILAR, 2 marks] Computing $ABc$ via this approach involves computing the product of an $n \times n$ matrix and an $n \times 1$ vector twice. Using the argument in (a), the total time complexity is $2n(2n-1)$.
(c) [BOOKWORK, 3 marks] Only (3) $f = \Omega(g)$ is correct. To prove it, note that if we take $n_0 = 1$ and $M = 1$, we have $f(n) = n(n+1)(2n-1) \geq M g(n) = 2n(2n-1)$ for all $n \geq n_0$ (since $n + 1 \geq 2$), as required by the definition of big-$\Omega$. Moreover $f(n)/g(n) = (n+1)/2 \to \infty$, so $f \neq O(g)$, and hence (1), (2) and (4) all fail.
(d) [UNSEEN, 6 marks] We compute $g(n)$ first. Define $d = Bc$, where $d_i = \sum_{k=1}^{n} B_{i,k} c_k$. Since $B$ is tridiagonal, $d_i = B_{i,i-1}c_{i-1} + B_{i,i}c_i + B_{i,i+1}c_{i+1}$ for $1 < i < n$, while $d_1 = B_{1,1}c_1 + B_{1,2}c_2$ and $d_n = B_{n,n-1}c_{n-1} + B_{n,n}c_n$. So in total we need $5(n-2) + 3 \times 2 = 5n - 4$ instructions to compute $Bc$. Since $Bc$ is a dense vector, computing $A(Bc)$ requires another $5n - 4$ instructions, so $g(n) = 10n - 8$.
Now we compute $f(n)$. Define $C = AB = (C_{ij})$ and first calculate the time complexity of multiplying $A$ and $B$. By definition $C_{ij} = \sum_{k=1}^{n} A_{ik}B_{kj}$. Note that $C_{ij} = 0$ when $|i - j| > 2$, because both $A$ and $B$ are tridiagonal; $C$ is therefore pentadiagonal. For the main diagonal we can write
$$C_{ii} = A_{i,i-1}B_{i-1,i} + A_{i,i}B_{i,i} + A_{i,i+1}B_{i+1,i}, \qquad 1 < i < n,$$
with $C_{11} = A_{11}B_{11} + A_{12}B_{21}$ and $C_{nn} = A_{n,n-1}B_{n-1,n} + A_{n,n}B_{n,n}$, so calculating the diagonal takes $3 \times 2 + 5(n-2) = 5n - 4$ instructions. For the first pair of off-diagonals, we have
$$C_{i,i+1} = A_{i,i}B_{i,i+1} + A_{i,i+1}B_{i+1,i+1}, \qquad C_{i+1,i} = A_{i+1,i}B_{i,i} + A_{i+1,i+1}B_{i+1,i},$$
giving a total of $2(n-1) \times 3 = 6(n-1)$ instructions. For the second pair of off-diagonals, we have
$$C_{i,i+2} = A_{i,i+1}B_{i+1,i+2}, \qquad C_{i+2,i} = A_{i+2,i+1}B_{i+1,i},$$
giving a total of $2(n-2)$ instructions (one multiplication each).
So in total, computing $AB$ requires $5n - 4 + 6(n-1) + 2(n-2) = 13n - 14$ instructions. Multiplying the pentadiagonal matrix $AB$ by $c$ requires another $2 \times 5 + 2 \times 7 + 9(n-4) = 9n - 12$ instructions (rows 1 and $n$ have 3 nonzeros, rows 2 and $n-1$ have 4, and the remaining $n-4$ rows have 5). So in total $f(n) = (13n - 14) + (9n - 12) = 22n - 26$ instructions.
(e) [UNSEEN, 2 marks] $f(n) = g(n) = 2n$ occurs, for example, when $A$ and $B$ are both diagonal matrices.
(f) [UNSEEN, 4 marks] Noting $A^{10} = [(A^2)^2]^2 A^2$, we need four matrix multiplications: $A^2$, $A^4 = (A^2)^2$, $A^8 = (A^4)^2$ and $A^8 A^2$. Each $n \times n$ matrix multiplication costs $n^2(2n-1)$ instructions, as discussed in (a), so the total time complexity is $4n^2(2n-1)$.

3. (a) (i) [BOOKWORK, 2 marks] We can view $G = \alpha(H + dw^\top) + (1 - \alpha)\mathbf{1}p^\top$ as the transition matrix of a "random surfer". With probability $\alpha$, the surfer follows one of the links on the current page, each with equal probability, or chooses a page according to the probability vector $w$ if the current page has no outgoing links. With probability $1 - \alpha$, the surfer instead chooses a page according to the probability vector $p$, which represents the personal preferences of the surfer.
(ii) [BOOKWORK, 2 marks] The location of the random surfer will converge to the unique stationary distribution of the Markov chain with transition matrix $G$ if $G$ is irreducible and aperiodic [this follows from the Perron–Frobenius theorem, which need not be named for full marks]. If we choose $\alpha < 1$ and $p$ to have all entries non-zero, then every entry of $G$ is positive, which is enough to guarantee that $G$ is irreducible and aperiodic: from any page we can reach any other page with positive probability in one step (irreducibility), and from any page we can remain on the same page with positive probability (guaranteeing aperiodicity). [It is not enough to choose $w$ to have all non-zero entries; consider the example in part (b) of this question, which has positive $w$ but is periodic, so the surfer's location does not converge.]

(b) (i) [UNSEEN, 2 marks]
$$G = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}.$$
(ii) [UNSEEN, 4 marks] For this example, writing $\pi = (a, b, c)^\top$ and solving $\pi^\top G = \pi^\top$, we find $\pi = (1/3, 1/3, 1/3)^\top$.
(iii) [UNSEEN, 4 marks] When the link from page 3 to page 1 is removed, page 3 becomes a dangling page, so its row is replaced by $w^\top$:
$$G = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1/3 & 1/3 & 1/3 \end{pmatrix}.$$
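Stationary distributions like this one can also be found numerically by power iteration, the method underlying PageRank in practice. A small sketch for the part (iii) matrix (numpy assumed available):

```python
import numpy as np

# Transition matrix after the link 3 -> 1 is removed: page 3 is dangling,
# so its row becomes w = (1/3, 1/3, 1/3).
G = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1/3, 1/3, 1/3]])

pi = np.full(3, 1/3)        # any starting distribution works here
for _ in range(200):        # power iteration: pi^T <- pi^T G
    pi = pi @ G

print(pi)                   # close to [1/6, 1/3, 1/2]
```

Row 3 gives the chain a self-loop, so it is irreducible and aperiodic and the iteration converges; with the $\alpha = 1$ three-cycle of part (i), the iterates from a non-uniform start would oscillate instead.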
Solving $\pi^\top G = \pi^\top$, we find $\pi = (1/6, 1/3, 1/2)^\top$.

(c) (i) [BOOKWORK, 2 marks] In the epsilon-greedy strategy we play the arm with the best observed rate of success (breaking ties arbitrarily) with probability $1 - \epsilon$, and with probability $\epsilon$ we play an arm chosen uniformly at random.
(ii) [UNSEEN, 4 marks] Since we have observed a better rate of success with arm 1 ($5/10 = 0.5$ versus $8/20 = 0.4$), we play it with probability $1 - \epsilon/2$ and play arm 2 with probability $\epsilon/2$. Hence the expected reward is
$$\Big(1 - \frac{\epsilon}{2}\Big) \times 0.5 + \frac{\epsilon}{2} \times 0.6 = \frac{19}{20} \times \frac{1}{2} + \frac{1}{20} \times \frac{3}{5} = \frac{101}{200}.$$

4. (a) [BOOKWORK, 5 marks]
0. Given training data $(x_i, y_i)_{i=1}^{N}$, randomly split it into $K = 10$ equally sized subsets (folds).
1. For each candidate model (in total we have $2^D$ models), do:
2. For $j = 1, \dots, K$:
   - Train the model on the data with the $j$th fold removed.
   - Test the model on the $j$th fold only (a validation set of size $N/K$) and compute the mean validation error.
3. Compute the average error over the $K$ folds.
4. Pick the model with the lowest average validation error.
5. Re-train that model using all $N$ training cases.

(b) (i) [UNSEEN, 2 marks] Recall that $\hat\beta^{(i)}$ is the least squares estimator with the $i$th data point deleted. By definition,
$$\hat\beta^{(i)} = \arg\min_\beta \sum_{j=1, j \neq i}^{N} (x_j^\top \beta - y_j)^2,$$
where the minimization problem has the same solution as
$$\arg\min_\beta \sum_{j=1, j \neq i}^{N} (x_j^\top \beta - y_j)^2 + (x_i^\top \beta - x_i^\top \hat\beta^{(i)})^2,$$
since the added term is nonnegative and vanishes at $\beta = \hat\beta^{(i)}$. The second problem is an ordinary least squares problem with response vector $Y^{(i)}$, so
$$\hat\beta^{(i)} = (X^\top X)^{-1} X^\top Y^{(i)},$$
where $Y^{(i)} = (y_1, \dots, y_{i-1}, \hat{y}^{(i)}, y_{i+1}, \dots, y_N)^\top$ with $\hat{y}^{(i)} = x_i^\top \hat\beta^{(i)}$.
(ii) [UNSEEN, 5 marks] Using the result in (i), we have
$$\hat{y}^{(i)} = x_i^\top \hat\beta^{(i)} = x_i^\top (X^\top X)^{-1} X^\top Y^{(i)} = x_i^\top (X^\top X)^{-1} X^\top Y + x_i^\top (X^\top X)^{-1} X^\top (Y^{(i)} - Y) = \hat{y}_i + h_{ii}(\hat{y}^{(i)} - y_i),$$
where $h_{ii}$ is the $i$th diagonal entry of $H = X(X^\top X)^{-1} X^\top$ (the vector $Y^{(i)} - Y$ has a single nonzero entry, $\hat{y}^{(i)} - y_i$, in position $i$). Solving this equation for $\hat{y}^{(i)}$, we have
$$\hat{y}^{(i)} = \frac{\hat{y}_i}{1 - h_{ii}} - \frac{h_{ii}}{1 - h_{ii}}\, y_i.$$
Therefore
$$y_i - \hat{y}^{(i)} = \frac{y_i - \hat{y}_i}{1 - h_{ii}}.$$
(iii) [UNSEEN, 2 marks] Leave-one-out cross-validation would require fitting the same model on $N$ different datasets, and is thus computationally intensive.
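The identity in (ii) can be checked numerically by comparing the one-shot residuals against $N$ explicit refits. A small sketch with synthetic data (numpy assumed available):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 3
X = rng.standard_normal((N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(N)

# Hat matrix, full-data fitted values, and leverages h_ii
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
h = np.diag(H)

# Shortcut: all leave-one-out residuals from a single fit
loo_short = (y - y_hat) / (1 - h)

# Direct computation: refit N times, each time deleting one point
loo_direct = np.empty(N)
for i in range(N):
    mask = np.arange(N) != i
    beta_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    loo_direct[i] = y[i] - X[i] @ beta_i

assert np.allclose(loo_short, loo_direct)   # the shortcut matches the refits
```

The loop costs $N$ least-squares fits, whereas the shortcut needs only the single full-data fit, which is exactly the computational point made in (iii).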
The result in (ii) enables us to compute
$$\mathrm{CV} = \frac{1}{N}\sum_{i=1}^{N} \big(y_i - \hat{y}^{(i)}\big)^2 = \frac{1}{N}\sum_{i=1}^{N} \left(\frac{y_i - \hat{y}_i}{1 - h_{ii}}\right)^2,$$
where computing the $\hat{y}_i$ requires fitting the model on the whole dataset, and the $h_{ii}$ come from the hat matrix. That is, computing the CV score only requires fitting the model once, and is thus computationally much more efficient.
(iv) [BOOKWORK, 2 marks] Since $e \sim N(0, \sigma^2 I_N)$, we have $Y = X\beta + e \sim N(X\beta, \sigma^2 I_N)$. Since $\hat\beta^{\mathrm{RIDGE}} = (X^\top X + \lambda I_D)^{-1} X^\top Y$ is a linear transformation of $Y$, and $Z \sim N(\mu, \Sigma) \implies MZ \sim N(M\mu, M\Sigma M^\top)$, we find
$$\hat\beta^{\mathrm{RIDGE}} \sim N\big(W X^\top X \beta, \; \sigma^2 W X^\top X W\big), \qquad W := (X^\top X + \lambda I_D)^{-1}.$$
(v) [BOOKWORK, 4 marks] Since $X^\top X = PDP^\top$, we have $W = P(D + \lambda I_D)^{-1}P^\top$ and thus
$$\mathrm{Var}(\hat\beta^{\mathrm{RIDGE}}) = \sigma^2 P(D + \lambda I_D)^{-1} P^\top P D P^\top P (D + \lambda I_D)^{-1} P^\top = \sigma^2 P (D + \lambda I_D)^{-1} D (D + \lambda I_D)^{-1} P^\top = \sigma^2 P (D + \lambda I_D)^{-2} D P^\top$$
(diagonal matrices commute with each other).
$$\mathrm{Bias}^2(\hat\beta^{\mathrm{RIDGE}}) = \|W X^\top X \beta - \beta\|_2^2 = \|(W X^\top X - W W^{-1})\beta\|_2^2 = \|W[X^\top X - (X^\top X + \lambda I_D)]\beta\|_2^2 = \lambda^2 \|W\beta\|_2^2.$$
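The two closed forms in (v) can be verified numerically against the direct matrix expressions. A small sketch (numpy assumed available; the dimensions and $\lambda$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 40, 4
X = rng.standard_normal((N, D))
beta = rng.standard_normal(D)
sigma2, lam = 1.5, 0.7

XtX = X.T @ X
W = np.linalg.inv(XtX + lam * np.eye(D))

# Eigendecomposition X^T X = P diag(d) P^T (eigh handles symmetric matrices)
d, P = np.linalg.eigh(XtX)

# Var(beta_ridge): sigma^2 W X^T X W  vs  sigma^2 P (D + lam I)^{-2} D P^T
var_direct = sigma2 * W @ XtX @ W
var_diag = sigma2 * P @ np.diag(d / (d + lam) ** 2) @ P.T
assert np.allclose(var_direct, var_diag)

# Bias^2: ||W X^T X beta - beta||^2  vs  lam^2 ||W beta||^2
bias2_direct = np.sum((W @ XtX @ beta - beta) ** 2)
bias2_formula = lam**2 * np.sum((W @ beta) ** 2)
assert np.allclose(bias2_direct, bias2_formula)
```

Each diagonal entry $\sigma^2 d_j/(d_j + \lambda)^2$ is decreasing in $\lambda$ while $\lambda^2\|W\beta\|_2^2$ is increasing, which is the bias–variance trade-off the question is driving at.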