ST3400 UNIVERSITY OF WARWICK SUMMER EXAMINATIONS 2020
PROGRAMMING FOR DATA SCIENCE
Time Allowed: 2 Hours

Full marks may be gained by correctly answering three complete questions. Candidates may attempt all questions. Marks will be awarded for the best three answers only.

1. (a) For each of the following statements, provide either a proof or a counterexample:
(i) $f = O(g) \implies f = o(g)$, [2]
(ii) $f = o(g)$, $g = o(h) \implies f = o(h)$. [4]

(b) Define the following concepts in the context of supervised learning:
(i) Overfitting and underfitting. [3]
(ii) The curse of dimensionality. [3]

(c) Consider a regression problem with design matrix $(x_{ij}) \in \mathbb{R}^{N \times D}$, a response vector $(y_1, \dots, y_N)^\top \in \mathbb{R}^N$, and a vector of regression coefficients $(\beta_1, \dots, \beta_D)^\top \in \mathbb{R}^D$, i.e.
$$y_i = \sum_{j=1}^{D} \beta_j x_{ij} + \epsilon_i, \qquad i = 1, \dots, N,$$
for some independent, mean-zero errors $\epsilon_i$.
(i) Write down the optimization problem in this setting corresponding to
(I) ordinary least squares regression, [2]
(II) ridge regression, [2]
(III) LASSO regression. [2]
(You do not need to solve these optimization problems.)
(ii) Suggest an advantage of ridge regression over the other two, and suggest an advantage of LASSO regression over the other two. [2]
[TOTAL: 20]

2. Consider the computation of $ABc$, where $A$ and $B$ are both $n \times n$ matrices and $c$ is an $n \times 1$ vector.
(a) One approach to compute $ABc$ is to multiply $A$ and $B$ first, and then multiply the product $AB$ by $c$. Give the exact time complexity of this approach in terms of the number of basic instructions, where both additions and multiplications count as basic instructions. [3]
(b) The other approach is to compute the product $Bc$ first, and then multiply $A$ by this product. Give the exact time complexity of this approach in terms of the number of basic instructions, where both additions and multiplications count as basic instructions. [2]
(c) Denote by $f(n)$ the time complexity in (a) and by $g(n)$ the time complexity in (b).
Which of the following statements are correct (there may be more than one): (1) $f = O(g)$, (2) $f = o(g)$, (3) $f = \Omega(g)$, (4) $f = \Theta(g)$? Provide a proof for your choice. [3]
(d) A tridiagonal matrix is a band matrix that has nonzero elements only on the main diagonal, the first diagonal below it, and the first diagonal above it. Compute $f(n)$ and $g(n)$ when $A$ and $B$ are both tridiagonal, using as few basic instructions as possible. [6]
(e) Provide a sufficient condition on $A$, $B$ and $c$ such that $f(n) = g(n) = 2n$. [2]
(f) Design the fastest algorithm for computing $A^{10}$, where $A$ is an $n \times n$ matrix. Compute the time complexity of your algorithm. [4]
[TOTAL: 20]

3. (a) The PageRank algorithm is used for ranking web pages. It makes use of a large $n \times n$ stochastic matrix of the form
$$G = \alpha(H + dw^\top) + (1 - \alpha)\mathbf{1}p^\top,$$
where $\alpha \in [0, 1]$, $H$ is a sparse "hyperlink" transition matrix with entries $H_{ij} \neq 0$ iff there is a link from page $i$ to page $j$, $d$ is a vector with $d_i = 1$ iff page $i$ does not link to any other pages, $w$ and $p$ are probability vectors with nonnegative elements summing to 1, and $\mathbf{1}$ is a vector of all ones.
(i) Provide an interpretation of the transition matrix $G$ in terms of the behaviour of a "random surfer". [2]
(ii) Give sufficient conditions on $G$ for the location of the random surfer to converge to a unique distribution, and explain how, given any hyperlink matrix $H$, we can choose $G$ so that it satisfies these conditions. [2]

(b) Consider the miniature world-wide web shown below, comprising three web pages. Page 1 links to page 2, page 2 links to page 3, and page 3 links to page 1.

[Figure: three pages with links 1 → 2 → 3 → 1.]

For this example, assume $\alpha = 1$ and $w = (1/3, 1/3, 1/3)^\top$.
(i) Write down $G$ for this example. [2]
(ii) Show the PageRank scores of the three web pages. (By 'PageRank score' we mean $\pi_i$, where $\pi^\top G = \pi^\top$ and $\pi = (\pi_1, \pi_2, \pi_3)^\top$ is a probability vector with nonnegative components summing to 1.)
[4]
(iii) If the owner of web page 3 removes the link from page 3 to page 1, compute the PageRank scores of the three web pages. [4]

(c) Two-armed Bernoulli bandit problems.
(i) Define the epsilon-greedy strategy. [2]
(ii) In an experiment, arm 1 has been played 10 times with 5 successes observed, whilst arm 2 has been played 20 times with 8 successes observed. Assuming that the unknown probabilities of success associated with arms 1 and 2 are 0.5 and 0.6 respectively, what is the expected reward at the next time step when playing the epsilon-greedy strategy with $\epsilon = 0.1$? [4]
[TOTAL: 20]

4. For this question, recall the linear regression $Y = X\beta + e$, where $X = [x_1, \dots, x_N]^\top$ is the $N \times D$ design matrix, and $Y = (y_1, \dots, y_N)^\top$ and $e$ are $N \times 1$ column vectors. Suppose $N = 1000$ and $D = 10$. Assume that ordinary least squares is used to estimate $\beta$.
(a) Suppose you want to use best subsets regression to choose the model that has the smallest squared error. Describe how you may use 10-fold cross-validation to do this. Provide the pseudo-code. [5]
(b) Now suppose you want to assess the predictive performance of the full model by calculating
$$\mathrm{CV} = \frac{1}{N}\sum_{i=1}^{N} \big(y_i - \hat{y}^{(i)}\big)^2,$$
where $\hat{y}^{(i)}$ is the predicted value at $x_i$ obtained when the model is estimated with the $i$th data point deleted. This is also called leave-one-out cross-validation. Denote the ordinary least squares estimate of $\beta$ by $\hat\beta$.
(i) Denote by $\hat\beta^{(i)}$ the ordinary least squares estimate of $\beta$ with the $i$th data point deleted. Prove
$$\hat\beta^{(i)} = (X^\top X)^{-1} X^\top Y^{(i)},$$
where $Y^{(i)} = (y_1, \dots, y_{i-1}, \hat{y}^{(i)}, y_{i+1}, \dots, y_N)^\top$ with $\hat{y}^{(i)} = x_i^\top \hat\beta^{(i)}$. [2]
(ii) Using the result in (i), prove
$$y_i - \hat{y}^{(i)} = \frac{y_i - \hat{y}_i}{1 - h_{ii}},$$
where $\hat{y}_i = x_i^\top \hat\beta$ and $h_{ii}$ is the $i$th diagonal entry of $H = X(X^\top X)^{-1} X^\top$. [5]
(iii) Comment on the implication of the result in (ii). [2]
(iv) Recall that the ridge regression estimator can be written as
$$\hat\beta^{\mathrm{RIDGE}} = (X^\top X + \lambda I_D)^{-1} X^\top Y,$$
where $I_D$ is the $D \times D$ identity matrix. Assume $e \sim N(0, \sigma^2 I_N)$.
Show
$$\hat\beta^{\mathrm{RIDGE}} \sim N\big(W X^\top X \beta, \; \sigma^2 W X^\top X W\big), \qquad W := (X^\top X + \lambda I_D)^{-1}. \quad [2]$$
(v) Diagonalising the symmetric matrix $X^\top X = PDP^\top$ ($P$ orthogonal, $PP^\top = P^\top P = I_D$, $D$ diagonal), show that
$$\mathrm{Var}(\hat\beta^{\mathrm{RIDGE}}) = \sigma^2 P (D + \lambda I_D)^{-2} D P^\top,$$
which is decreasing in $\lambda$, and
$$\mathrm{Bias}^2(\hat\beta^{\mathrm{RIDGE}}) = \lambda^2 \|W\beta\|_2^2,$$
which is increasing in $\lambda$. [4]
[TOTAL: 20]
End

ST3400 UNIVERSITY OF WARWICK SUMMER EXAMINATIONS 2020: SOLUTIONS

1. (a) (i) [SEEN SIMILAR, 2 marks] False. For example, $n = O(n)$ but $n \neq o(n)$, since $n = o(n)$ would imply that $n \leq \epsilon n$ for sufficiently large $n$, for any $\epsilon > 0$.
(ii) [SEEN SIMILAR, 4 marks] Suppose that for every $\epsilon > 0$ there exist $n_0(\epsilon), n_1(\epsilon)$ such that $f(n) \leq \epsilon g(n)$ for all $n \geq n_0(\epsilon)$ and $g(n) \leq \epsilon h(n)$ for all $n \geq n_1(\epsilon)$. Now, for any $\epsilon > 0$ we have, for all $n \geq \max\{n_0(\sqrt{\epsilon}), n_1(\sqrt{\epsilon})\}$,
$$f(n) \leq \sqrt{\epsilon}\, g(n) \leq \sqrt{\epsilon}\sqrt{\epsilon}\, h(n) = \epsilon h(n),$$
i.e. $f = o(h)$.

(b) (i) [BOOKWORK, 3 marks] Overfitting is the phenomenon whereby the model for a supervised learning problem is too flexible and the fitted solution captures both the signal and the noise in the training data. It is usually caused by a model that is too complicated / of too high capacity. Underfitting is the reverse: the model is too inflexible and cannot capture the signal in the training data. It is usually caused by a model that is too simple / of too low capacity.
(ii) [BOOKWORK, 3 marks] The curse of dimensionality refers to the phenomenon that in a high-dimensional space, even the nearest points are still a long way apart, so methods that rely on local neighbourhoods break down.

(c) (i) (I) [BOOKWORK, 2 marks] The ordinary least squares (OLS) estimator $\hat\beta^{\mathrm{OLS}}$ minimizes $\|Y - X\beta\|_2^2$, where $Y = (y_1, \dots, y_N)^\top$, $X = (x_{ij})$ and $\beta = (\beta_1, \dots, \beta_D)^\top$, i.e.
$$\hat\beta^{\mathrm{OLS}} = \arg\min_\beta \sum_{i=1}^{N} (Y - X\beta)_i^2.$$
(II) [BOOKWORK, 2 marks] The ridge regression estimator minimizes $\|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$, i.e.
$$\hat\beta^{\mathrm{RIDGE}} = \arg\min_\beta \sum_{i=1}^{N} (Y - X\beta)_i^2 + \lambda \sum_{j=1}^{D} \beta_j^2.$$
(III) [BOOKWORK, 2 marks] The LASSO estimator $\hat\beta^{\mathrm{LASSO}}$ minimizes $\|Y - X\beta\|_2^2 + \lambda \|\beta\|_1$, i.e.
$$\hat\beta^{\mathrm{LASSO}} = \arg\min_\beta \sum_{i=1}^{N} (Y - X\beta)_i^2 + \lambda \sum_{j=1}^{D} |\beta_j|.$$
(ii) [SEEN, 2 marks] Unlike OLS and LASSO, ridge regression always has a unique, closed-form solution.
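This advantage can be illustrated numerically: when $D > N$ the matrix $X^\top X$ is rank-deficient, so the OLS normal equations have no unique solution, while the ridge system is always invertible. A minimal sketch with synthetic data (numpy assumed available; the dimensions are chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 10                     # more predictors than observations
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)

XtX = X.T @ X
# With D > N, X^T X cannot have full rank, so OLS has no unique solution.
assert np.linalg.matrix_rank(XtX) < D

# Ridge always has the unique closed form (X^T X + lam I_D)^{-1} X^T y.
lam = 0.1
beta_ridge = np.linalg.solve(XtX + lam * np.eye(D), X.T @ y)
```

Any $\lambda > 0$ makes $X^\top X + \lambda I_D$ positive definite, which is why the `solve` call above never fails.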
[OLS does not always have a unique solution; LASSO's solution is not closed-form.] The LASSO solution typically sets a subset of the coefficients $\beta_j$ exactly to 0, encouraging sparsity in the solution, which aids interpretability.

2. (a) [SEEN SIMILAR, 3 marks] Define $C = AB$. Obtaining each element of $AB$ involves $n$ multiplications and $n-1$ additions, so the time complexity of computing $AB$ is $n^2[n + (n-1)]$. The time complexity of computing the product of $AB$ and $c$ is $n[n + (n-1)]$. Thus the total time complexity is $n^2(2n-1) + n(2n-1) = n(n+1)(2n-1)$.
(b) [SEEN SIMILAR, 2 marks] Computing $ABc$ via this approach involves computing the product of an $n \times n$ matrix and an $n \times 1$ vector twice. Using the argument in (a), the total time complexity is $2n(2n-1)$.
(c) [BOOKWORK, 3 marks] Only (3) $f = \Omega(g)$ is correct. To prove it, note that if we take $n_0 = 1$ and $M = 1$, we have $f(n) = n(n+1)(2n-1) \geq M g(n) = 2n(2n-1)$ for all $n \geq n_0$ (since $n + 1 \geq 2$), as required by the definition of big-$\Omega$. Moreover $f(n)/g(n) = (n+1)/2 \to \infty$, so $f \neq O(g)$, and hence (1), (2) and (4) all fail.
(d) [UNSEEN, 6 marks] We compute $g(n)$ first. Define $d = Bc$, where $d_i = \sum_{k=1}^{n} B_{i,k} c_k$. Since $B$ is tridiagonal, $d_i = B_{i,i-1}c_{i-1} + B_{i,i}c_i + B_{i,i+1}c_{i+1}$ for $1 < i < n$, while $d_1 = B_{1,1}c_1 + B_{1,2}c_2$ and $d_n = B_{n,n-1}c_{n-1} + B_{n,n}c_n$. So in total we need $5(n-2) + 3 \times 2 = 5n - 4$ instructions to compute $Bc$. Since $Bc$ is a dense vector, computing $A(Bc)$ requires another $5n - 4$ instructions, so $g(n) = 10n - 8$.
Now we compute $f(n)$. Define $C = AB = (C_{ij})$ and first calculate the time complexity of multiplying $A$ and $B$. By definition $C_{ij} = \sum_{k=1}^{n} A_{ik}B_{kj}$. Note that $C_{ij} = 0$ when $|i - j| > 2$, because both $A$ and $B$ are tridiagonal; $C$ is therefore pentadiagonal. For the main diagonal we can write
$$C_{ii} = A_{i,i-1}B_{i-1,i} + A_{i,i}B_{i,i} + A_{i,i+1}B_{i+1,i}, \qquad 1 < i < n,$$
with $C_{11} = A_{11}B_{11} + A_{12}B_{21}$ and $C_{nn} = A_{n,n-1}B_{n-1,n} + A_{n,n}B_{n,n}$, so calculating the diagonal takes $3 \times 2 + 5(n-2) = 5n - 4$ instructions. For the first pair of off-diagonals, we have
$$C_{i,i+1} = A_{i,i}B_{i,i+1} + A_{i,i+1}B_{i+1,i+1}, \qquad C_{i+1,i} = A_{i+1,i}B_{i,i} + A_{i+1,i+1}B_{i+1,i},$$
giving a total of $2(n-1) \times 3 = 6(n-1)$ instructions. For the second pair of off-diagonals, we have
$$C_{i,i+2} = A_{i,i+1}B_{i+1,i+2}, \qquad C_{i+2,i} = A_{i+2,i+1}B_{i+1,i},$$
giving a total of $2(n-2)$ instructions (one multiplication each).
So in total, computing $AB$ requires $5n - 4 + 6(n-1) + 2(n-2) = 13n - 14$ instructions. Multiplying the pentadiagonal matrix $AB$ by $c$ requires another $2 \times 5 + 2 \times 7 + 9(n-4) = 9n - 12$ instructions (rows 1 and $n$ have 3 nonzeros, rows 2 and $n-1$ have 4, and the remaining $n-4$ rows have 5). So in total $f(n) = (13n - 14) + (9n - 12) = 22n - 26$ instructions.
(e) [UNSEEN, 2 marks] $f(n) = g(n) = 2n$ occurs, for example, when $A$ and $B$ are both diagonal matrices.
(f) [UNSEEN, 4 marks] Noting $A^{10} = [(A^2)^2]^2 A^2$, we need four matrix multiplications: $A^2$, $A^4 = (A^2)^2$, $A^8 = (A^4)^2$ and $A^8 A^2$. Each $n \times n$ matrix multiplication costs $n^2(2n-1)$ instructions, as discussed in (a), so the total time complexity is $4n^2(2n-1)$.

3. (a) (i) [BOOKWORK, 2 marks] We can view $G = \alpha(H + dw^\top) + (1 - \alpha)\mathbf{1}p^\top$ as the transition matrix of a "random surfer". With probability $\alpha$, the surfer follows one of the links on the current page, each with equal probability, or chooses a page according to the probability vector $w$ if the current page has no outgoing links. With probability $1 - \alpha$, the surfer instead chooses a page according to the probability vector $p$, which represents the personal preferences of the surfer.
(ii) [BOOKWORK, 2 marks] The location of the random surfer will converge to the unique stationary distribution of the Markov chain with transition matrix $G$ if $G$ is irreducible and aperiodic [this follows from the Perron–Frobenius theorem, which need not be named for full marks]. If we choose $\alpha < 1$ and $p$ to have all entries non-zero, then every entry of $G$ is positive, which is enough to guarantee that $G$ is irreducible and aperiodic: from any page we can reach any other page with positive probability in one step (irreducibility), and from any page we can remain on the same page with positive probability (guaranteeing aperiodicity). [It is not enough to choose $w$ to have all non-zero entries; consider the example in part (b) of this question, which has positive $w$ but is periodic, so the surfer's location does not converge.]

(b) (i) [UNSEEN, 2 marks]
$$G = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}.$$
(ii) [UNSEEN, 4 marks] For this example, writing $\pi = (a, b, c)^\top$ and solving $\pi^\top G = \pi^\top$, we find $\pi = (1/3, 1/3, 1/3)^\top$.
(iii) [UNSEEN, 4 marks] When the link from page 3 to page 1 is removed, page 3 becomes a dangling page, so its row is replaced by $w^\top$:
$$G = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1/3 & 1/3 & 1/3 \end{pmatrix}.$$
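Stationary distributions like this one can also be found numerically by power iteration, the method underlying PageRank in practice. A small sketch for the part (iii) matrix (numpy assumed available):

```python
import numpy as np

# Transition matrix after the link 3 -> 1 is removed: page 3 is dangling,
# so its row becomes w = (1/3, 1/3, 1/3).
G = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1/3, 1/3, 1/3]])

pi = np.full(3, 1/3)        # any starting distribution works here
for _ in range(200):        # power iteration: pi^T <- pi^T G
    pi = pi @ G

print(pi)                   # close to [1/6, 1/3, 1/2]
```

Row 3 gives the chain a self-loop, so it is irreducible and aperiodic and the iteration converges; with the $\alpha = 1$ three-cycle of part (i), the iterates from a non-uniform start would oscillate instead.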
Solving $\pi^\top G = \pi^\top$, we find $\pi = (1/6, 1/3, 1/2)^\top$.

(c) (i) [BOOKWORK, 2 marks] In the epsilon-greedy strategy we play the arm with the best observed rate of success (breaking ties arbitrarily) with probability $1 - \epsilon$, and with probability $\epsilon$ we play an arm chosen uniformly at random.
(ii) [UNSEEN, 4 marks] Since we have observed a better rate of success with arm 1 ($5/10 = 0.5$ versus $8/20 = 0.4$), we play it with probability $1 - \epsilon/2$ and play arm 2 with probability $\epsilon/2$. Hence the expected reward is
$$\Big(1 - \frac{\epsilon}{2}\Big) \times 0.5 + \frac{\epsilon}{2} \times 0.6 = \frac{19}{20} \times \frac{1}{2} + \frac{1}{20} \times \frac{3}{5} = \frac{101}{200}.$$

4. (a) [BOOKWORK, 5 marks]
0. Given training data $(x_i, y_i)_{i=1}^{N}$, randomly split it into $K = 10$ equally sized subsets (folds).
1. For each candidate model (in total we have $2^D$ models), do:
2. For $j = 1, \dots, K$:
   - Train the model on the data with the $j$th fold removed.
   - Test the model on the $j$th fold only (a validation set of size $N/K$) and compute the mean validation error.
3. Compute the average error over the $K$ folds.
4. Pick the model with the lowest average validation error.
5. Re-train that model using all $N$ training cases.

(b) (i) [UNSEEN, 2 marks] Recall that $\hat\beta^{(i)}$ is the least squares estimator with the $i$th data point deleted. By definition,
$$\hat\beta^{(i)} = \arg\min_\beta \sum_{j=1, j \neq i}^{N} (x_j^\top \beta - y_j)^2,$$
where the minimization problem has the same solution as
$$\arg\min_\beta \sum_{j=1, j \neq i}^{N} (x_j^\top \beta - y_j)^2 + (x_i^\top \beta - x_i^\top \hat\beta^{(i)})^2,$$
since the added term is nonnegative and vanishes at $\beta = \hat\beta^{(i)}$. The second problem is an ordinary least squares problem with response vector $Y^{(i)}$, so
$$\hat\beta^{(i)} = (X^\top X)^{-1} X^\top Y^{(i)},$$
where $Y^{(i)} = (y_1, \dots, y_{i-1}, \hat{y}^{(i)}, y_{i+1}, \dots, y_N)^\top$ with $\hat{y}^{(i)} = x_i^\top \hat\beta^{(i)}$.
(ii) [UNSEEN, 5 marks] Using the result in (i), we have
$$\hat{y}^{(i)} = x_i^\top \hat\beta^{(i)} = x_i^\top (X^\top X)^{-1} X^\top Y^{(i)} = x_i^\top (X^\top X)^{-1} X^\top Y + x_i^\top (X^\top X)^{-1} X^\top (Y^{(i)} - Y) = \hat{y}_i + h_{ii}(\hat{y}^{(i)} - y_i),$$
where $h_{ii}$ is the $i$th diagonal entry of $H = X(X^\top X)^{-1} X^\top$ (the vector $Y^{(i)} - Y$ has a single nonzero entry, $\hat{y}^{(i)} - y_i$, in position $i$). Solving this equation for $\hat{y}^{(i)}$, we have
$$\hat{y}^{(i)} = \frac{\hat{y}_i}{1 - h_{ii}} - \frac{h_{ii}}{1 - h_{ii}}\, y_i.$$
Therefore
$$y_i - \hat{y}^{(i)} = \frac{y_i - \hat{y}_i}{1 - h_{ii}}.$$
(iii) [UNSEEN, 2 marks] Leave-one-out cross-validation would require fitting the same model on $N$ different datasets, and is thus computationally intensive.
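The identity in (ii) can be checked numerically by comparing the one-shot residuals against $N$ explicit refits. A small sketch with synthetic data (numpy assumed available):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 3
X = rng.standard_normal((N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(N)

# Hat matrix, full-data fitted values, and leverages h_ii
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
h = np.diag(H)

# Shortcut: all leave-one-out residuals from a single fit
loo_short = (y - y_hat) / (1 - h)

# Direct computation: refit N times, each time deleting one point
loo_direct = np.empty(N)
for i in range(N):
    mask = np.arange(N) != i
    beta_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    loo_direct[i] = y[i] - X[i] @ beta_i

assert np.allclose(loo_short, loo_direct)   # the shortcut matches the refits
```

The loop costs $N$ least-squares fits, whereas the shortcut needs only the single full-data fit, which is exactly the computational point made in (iii).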
The result in (ii) enables us to compute
$$\mathrm{CV} = \frac{1}{N}\sum_{i=1}^{N} \big(y_i - \hat{y}^{(i)}\big)^2 = \frac{1}{N}\sum_{i=1}^{N} \left(\frac{y_i - \hat{y}_i}{1 - h_{ii}}\right)^2,$$
where computing the $\hat{y}_i$ requires fitting the model on the whole dataset, and the $h_{ii}$ come from the hat matrix. That is, computing the CV score only requires fitting the model once, and is thus computationally much more efficient.
(iv) [BOOKWORK, 2 marks] Since $e \sim N(0, \sigma^2 I_N)$, we have $Y = X\beta + e \sim N(X\beta, \sigma^2 I_N)$. Since $\hat\beta^{\mathrm{RIDGE}} = (X^\top X + \lambda I_D)^{-1} X^\top Y$ is a linear transformation of $Y$, and $Z \sim N(\mu, \Sigma) \implies MZ \sim N(M\mu, M\Sigma M^\top)$, we find
$$\hat\beta^{\mathrm{RIDGE}} \sim N\big(W X^\top X \beta, \; \sigma^2 W X^\top X W\big), \qquad W := (X^\top X + \lambda I_D)^{-1}.$$
(v) [BOOKWORK, 4 marks] Since $X^\top X = PDP^\top$, we have $W = P(D + \lambda I_D)^{-1}P^\top$ and thus
$$\mathrm{Var}(\hat\beta^{\mathrm{RIDGE}}) = \sigma^2 P(D + \lambda I_D)^{-1} P^\top P D P^\top P (D + \lambda I_D)^{-1} P^\top = \sigma^2 P (D + \lambda I_D)^{-1} D (D + \lambda I_D)^{-1} P^\top = \sigma^2 P (D + \lambda I_D)^{-2} D P^\top$$
(diagonal matrices commute with each other).
$$\mathrm{Bias}^2(\hat\beta^{\mathrm{RIDGE}}) = \|W X^\top X \beta - \beta\|_2^2 = \|(W X^\top X - W W^{-1})\beta\|_2^2 = \|W[X^\top X - (X^\top X + \lambda I_D)]\beta\|_2^2 = \lambda^2 \|W\beta\|_2^2.$$
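The two closed forms in (v) can be verified numerically against the direct matrix expressions. A small sketch (numpy assumed available; the dimensions and $\lambda$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 40, 4
X = rng.standard_normal((N, D))
beta = rng.standard_normal(D)
sigma2, lam = 1.5, 0.7

XtX = X.T @ X
W = np.linalg.inv(XtX + lam * np.eye(D))

# Eigendecomposition X^T X = P diag(d) P^T (eigh handles symmetric matrices)
d, P = np.linalg.eigh(XtX)

# Var(beta_ridge): sigma^2 W X^T X W  vs  sigma^2 P (D + lam I)^{-2} D P^T
var_direct = sigma2 * W @ XtX @ W
var_diag = sigma2 * P @ np.diag(d / (d + lam) ** 2) @ P.T
assert np.allclose(var_direct, var_diag)

# Bias^2: ||W X^T X beta - beta||^2  vs  lam^2 ||W beta||^2
bias2_direct = np.sum((W @ XtX @ beta - beta) ** 2)
bias2_formula = lam**2 * np.sum((W @ beta) ** 2)
assert np.allclose(bias2_direct, bias2_formula)
```

Each diagonal entry $\sigma^2 d_j/(d_j + \lambda)^2$ is decreasing in $\lambda$ while $\lambda^2\|W\beta\|_2^2$ is increasing, which is the bias–variance trade-off the question is driving at.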