Statistical ML STATS 303

Please assign questions to the corresponding pages when submitting, and show your work to earn credit (you do not need to show the details of 1-dimensional integrations; feel free to use your favorite integral calculator, but adding the steps of a computation may earn you partial credit in case of a wrong answer). Give numerical answers to 3 decimal digits.

1. Suppose that $P(Y=1) = P(Y=0) = 1/2$ and
   $X \mid Y = 0 \sim N(0, 1)$,
   $X \mid Y = 1 \sim \frac{1}{2} N(-5, 1) + \frac{1}{2} N(5, 1)$.

   (a) Find an expression for the Bayes classifier (the Bayes decision rule for classification) for the 0/1 loss, and find an expression for the corresponding Bayes risk.

   (b) What linear classifier minimizes the risk, and what is its risk? (If it is not unique, give one optimal linear classifier.) Here, a linear classifier is a classifier that divides the space with a single hyperplane (which in the $d = 1$ case is a single point).

2. Consider a three-category classification problem with prior probabilities
   $P(Y=1) = P(Y=2) = P(Y=3) = 1/3$.
   The class-conditional densities are multivariate normal densities with parameters
   $\mu_1 = [0, 0]^\top$, $\mu_2 = [1, 1]^\top$, $\mu_3 = [-1, 1]^\top$ and
   $\Sigma_1 = \Sigma_2 = \Sigma_3 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$.

   (a) Compute the Bayes optimal classifier for the 0/1 loss at the points $x = [-0.7, 0.1]$ and $x = [0.7, 0.7]$.

   (b) Now assume that
   $\Sigma_1 = \begin{pmatrix} 0.7 & 0 \\ 0 & 0.7 \end{pmatrix}$, $\Sigma_2 = \begin{pmatrix} 0.8 & 0.2 \\ 0.2 & 0.8 \end{pmatrix}$, $\Sigma_3 = \begin{pmatrix} 0.8 & 0.2 \\ 0.2 & 0.8 \end{pmatrix}$.
   Compute the Bayes optimal classifier for the 0/1 loss at the points $x = [-0.5, 0.5]$ and $x = [0.5, 0.5]$.

3. Assume a regression model $y = f(x) + \epsilon$ where $x, y \in \mathbb{R}$, $f(x)$ is some deterministic but unknown function, and $\epsilon \sim N(0, \sigma^2)$. Suppose $g(x \mid \theta)$ is our estimator of $f$, where $\theta$ denotes the parameters.

   (a) Write the density $p(y \mid x)$ in terms of $g(x \mid \theta)$ and $\sigma$.
   (b) Suppose there is an unknown joint density $p(x, y)$ for $x$ and $y$. Explain why the log likelihood $L(\theta \mid \mathcal{X})$ of $p(x, y)$, where the sample $\mathcal{X} = \{x_\ell, y_\ell\}_{\ell=1}^{N}$ contains i.i.d. data points, can be written as
   $L(\theta \mid \mathcal{X}) = \log \prod_{\ell=1}^{N} p(y_\ell \mid x_\ell) + C$.

   (c) Using Parts (a) and (b), show that the maximum likelihood estimator is obtained by minimizing
   $\frac{1}{2} \sum_{\ell=1}^{N} \left[ y_\ell - g(x_\ell \mid \theta) \right]^2$.

4. Consider the data points $x_1 = (0, 1, 2)^\top$, $x_2 = (-1, 3, 4)^\top$, $x_3 = (0, 0, 1)^\top$ and $x_4 = (2, 3, -2)^\top$.

   (a) Write a data matrix $X$ for the data points, where each row corresponds to a data point.

   (b) Suppose a system gives output $y_j$ when we input $x_j$, for $j = 1, 2, 3, 4$. We fit a ridge regression model by solving
   $\min_{w \in \mathbb{R}^4} \frac{1}{2} \| y - \tilde{X} w \|_2^2 + \frac{\lambda}{2} \| w \|_2^2$,   (1)
   where $y = [y_1, y_2, y_3, y_4]^\top$. What is $\tilde{X}$?

   (c) By taking the gradient with respect to $w$, derive the solution of (1) in terms of $\tilde{X}$, $\lambda$ and $y$.

   (d) Describe qualitatively how you expect your answer to the previous question to change if you used a regularization of the form $\lambda \| w \|_1$, where $\| \cdot \|_1$ denotes the $\ell_1$ norm in $\mathbb{R}^4$.

   (e) What are the names of the two problems considered above?

5. (a) Let $\{x_\ell\}_{\ell=1}^{N}$ with $x_\ell \in \mathbb{R}$ be given. The K-NN density estimator is given by
   $\hat{p}(x) = \frac{K}{2 N d_K(x)}$,
   where $d_K(x)$ is the distance between $x$ and its $K$-th closest neighbor in $\{x_\ell\}_{\ell=1}^{N}$. Prove that $\hat{p}$ is NOT a probability density.

   (b) Consider applying K-means with $K = 2$ clusters to the five points $(0,0)$, $(1,2)$, $(2,0)$, $(3,2)$, $(4,0)$. Suppose the initial centers are set to $(0,0)$ and $(3,0)$. Write the E-step and the M-step for the first iteration. You need to clearly state the locations of the centers and the labels of the points.

6. Let $(Z^1, Y^1), \ldots, (Z^N, Y^N)$ be generated independently as follows:
   $Z^\ell \sim \mathrm{Bernoulli}(p)$,
   $Y^\ell \mid Z^\ell = 0 \sim N(0, 1)$,
   $Y^\ell \mid Z^\ell = 1 \sim N(3, 1)$.

   (a) Assume we do not observe the $\{Z^\ell\}$. Write the distribution (density) $f_Y(y)$ of $Y$ as a mixture.

   (b) Write down the complete likelihood function for $p$ (assuming the $Z^\ell$ are observed).

   (c) Write down the E-step and the M-step of the EM algorithm (i.e., write the updates of the algorithm explicitly, as we did in class for the Gaussian case).

7.
Let $X_1 \in \mathbb{R}$ and $X_2 \in \mathbb{R}$, and let $Y = m(X_1, X_2) + \epsilon$ where $E(\epsilon) = 0$.

   (a) Consider the class of multiplicative predictors of the form $\hat{m}(x_1, x_2) = \beta x_1 x_2$. Let $\beta^*$ be the best predictor, that is, $\beta^*$ minimizes $E(Y - \beta X_1 X_2)^2$. Find an expression for $\beta^*$ in terms of expectations of the quantities being considered.

   (b) Suppose the true regression function is $Y = X_1 + X_2 + \epsilon$. Also assume that $E(X_1) = E(X_2) = 0$, $E(X_1^2) = E(X_2^2) = 1$, and that $X_1$ and $X_2$ are independent. Find the predictive risk $R = E(Y - \beta^* X_1 X_2)^2$, where $\beta^*$ was defined in part (a). (This answer should be a number.)
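Not part of what you hand in, but if you want to sanity-check the decision rule you derive in Problem 1, a minimal sketch (the helper names are mine) that compares the two class densities pointwise, as the 0/1-loss Bayes rule with equal priors does:

```python
import math

def normpdf(x, mu):
    # density of N(mu, 1) at x
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def p0(x):
    # class-conditional density given Y = 0
    return normpdf(x, 0.0)

def p1(x):
    # class-conditional density given Y = 1: equal mixture of N(-5, 1) and N(5, 1)
    return 0.5 * normpdf(x, -5.0) + 0.5 * normpdf(x, 5.0)

def bayes_classify(x):
    # equal priors and 0/1 loss: predict the class with the larger density at x
    return 0 if p0(x) > p1(x) else 1

print(bayes_classify(0.0))  # near the N(0, 1) mode
print(bayes_classify(5.0))  # near one of the mixture modes
```

Scanning `bayes_classify` over a grid of `x` values also shows where the decision boundaries sit, which is useful when writing the Bayes risk as a sum of tail integrals.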
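For Problem 2(a), equal priors and a shared identity covariance reduce the Bayes rule to nearest-mean classification, so a hand computation can be checked with a short script (the function name is mine):

```python
means = {1: (0.0, 0.0), 2: (1.0, 1.0), 3: (-1.0, 1.0)}

def nearest_mean_class(x):
    # equal priors + shared identity covariance: pick the class whose mean
    # minimizes the squared Euclidean distance to x
    d2 = {k: (x[0] - m[0]) ** 2 + (x[1] - m[1]) ** 2 for k, m in means.items()}
    return min(d2, key=d2.get)

print(nearest_mean_class((-0.7, 0.1)))
print(nearest_mean_class((0.7, 0.7)))
```

For part (b) this shortcut no longer applies: with unequal covariances you must compare the full quadratic discriminants, including the $\log \det \Sigma_k$ terms.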
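For Problem 4(c), whatever closed form you derive can be verified numerically: at the true minimizer, the gradient of the objective in (1) must vanish. A sketch with random stand-in data (the matrix and $\lambda$ below are arbitrary choices of mine, not the assignment's data):

```python
import numpy as np

rng = np.random.default_rng(0)
Xt = rng.standard_normal((4, 4))   # stand-in for the augmented matrix X~
y = rng.standard_normal(4)
lam = 0.5

# candidate minimizer w, obtained by solving the normal equations of (1)
w = np.linalg.solve(Xt.T @ Xt + lam * np.eye(4), Xt.T @ y)

# gradient of (1/2)||y - X~ w||^2 + (lam/2)||w||^2; should be ~0 at the minimizer
grad = Xt.T @ (Xt @ w - y) + lam * w
print(np.max(np.abs(grad)))
```

If your derived formula disagrees with `np.linalg.solve` here, recheck the sign conventions in your gradient.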
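The one K-means iteration asked for in Problem 5(b) can be checked against a direct implementation of the two steps (a sketch; variable names are mine):

```python
def kmeans_step(points, centers):
    # E-step: assign each point to its nearest center (squared Euclidean distance)
    labels = []
    for p in points:
        d2 = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
        labels.append(d2.index(min(d2)))
    # M-step: move each center to the mean of its assigned points
    new_centers = []
    for k in range(len(centers)):
        members = [p for p, l in zip(points, labels) if l == k]
        new_centers.append(tuple(sum(m[i] for m in members) / len(members) for i in (0, 1)))
    return labels, new_centers

pts = [(0, 0), (1, 2), (2, 0), (3, 2), (4, 0)]
labels, centers = kmeans_step(pts, [(0, 0), (3, 0)])
print(labels)    # cluster label of each point after the E-step
print(centers)   # center locations after the M-step
```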
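Finally, the EM updates you write for Problem 6(c) can be tried on synthetic data. The sketch below assumes the two component means (0 and 3) and unit variances are held fixed, so only the Bernoulli weight $p$ is updated; the sample values are arbitrary illustrations of mine:

```python
import math

def phi(y, mu):
    # N(mu, 1) density at y
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

def em_update(ys, p):
    # E-step: responsibility that each y was generated by the N(3, 1) component
    gamma = [p * phi(y, 3) / ((1 - p) * phi(y, 0) + p * phi(y, 3)) for y in ys]
    # M-step: the new mixing weight is the average responsibility
    return sum(gamma) / len(gamma)

ys = [0.1, -0.3, 3.2, 2.8, 0.0]
p = 0.5
for _ in range(20):
    p = em_update(ys, p)
print(round(p, 3))  # mixing-weight estimate after 20 iterations
```

With two of the five points near 3, the estimate settles near 0.4, which matches the intuition behind the M-step.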