COMP3670/6670 Introduction to Machine Learning Semester 2, 2020 Final Exam • Write your name and UID on the first page (you will be fine if you forget to write them). • This is an open book exam. You may bring in any materials including electronic and paper- based ones. Any calculators (programmable included) are allowed. No communication devices are permitted during the exam. • Reading time: 30 minutes • Writing time: 180 minutes • For all the questions, write your answer CLEARLY on papers prepared by yourself. • There are totally 8 pages (including the cover page) • Points possible: 100 • This is not a hurdle. • When you are asked to provide a justification to your answer, if your justification is incorrect, you will get 0. 1 • Section 1. Linear Algebra and Matrix Decomposition (13 points) 1. (6 points) Let {v1,v2} be linearly independent vectors in Rn. Let v3 be a vector in Rn that does not lie in the span of v1,v2. Prove that {v1,v2,v3} is linearly independent. Solution. Assume for a contradiction that {v1,v2,v3} are not linearly independent. Then there exists constants c1, c2, c3, not all zero, such that c1v1 + c2v2 + c3v3 = 0 Now, if c3 = 0, then the above equation becomes c1v1 + c2v2 = 0 and therefore {v1,v2} are linearly dependent, a contradiction. So, we have that c3 6= 0. Hence c1v1 + c2v2 + c3v3 = 0 c1v1 + c2v2 = −c3v3 −c1 c3 v1 + −c2 c3 v2 = v3 and hence v3 lies in the span of v1,v2, a contradiction. 2. (7 points) Consider the matrix [ 0 −1 1 0 ] Find its eigenvalues. What does this matrix geometrically do when applied to a vector? Explain how this relates to the set of eigenvalues for this matrix. Solution. We form the characteristic equation det(A− λI) det(A− λI) = det [−λ −1 1 λ ] = λ2 − (−1)(1) = λ2 + 1 Setting it equal to zero and solving, λ2 = 1⇒ λ = ±√−1 We find no eigenvalues, as we cannot square root a negative number (at least not in the real numbers). This matrix can be seen to perform a rotation of 90 degrees about the origin,[ 0 −1 1 0 ] [ x y ] = [−y x ] which means that no vector in R2 will be transformed into a scalar multiple of itself. This is why we see no real eigenvalues. 2 • Section 2. Analytic Geometry and Vector Calculus (12 points) 1. (6 points) Find all matrices T ∈ R2×2 such that for any v ∈ R2, T (v) · v = 0 Solution. Let T = [ a b c d ] ,v = [ x y ] . Then, T (v) · v = [ a b c d ] [ x y ] · [ x y ] = 0[ ax+ by cx+ dy ] · [ x y ] = 0 (ax+ by)x+ (cx+ dy)y = 0 Choose x = 0, then (a0 + by)0 + (c0 + dy)y = 0⇒ dy2 = 0 Since dy2 = 0 needs to hold for all y, this is only possible when d = 0. Now, choose y = 0, then (ax+ b0)x+ (cx+ d0)0 = 0⇒ ax2 = 0 Since ax2 = 0 needs to hold for all x, this is only possible when a = 0. Substituting this into the above, (0x+ by)x+ (cx+ 0y)y = 0 bxy + cxy = 0 Since this needs to hold for all x and for all y, it must be the case that b = −c. Hence, {T ∈ R2×2 : ∀v ∈ R2, (Tv) · v = 0} = {[ 0 α −α 0 ] : α ∈ R } 2. (6 points) Let x,a ∈ Rn×1, and define f : Rn×1 → R as f(x) = xTaaTx Compute ∇xf(x). Solution. There are a few ways to do this. Students could take the derivative with respect to a component of x, or could use product rule, or could recognize that xTaaTx = xTaxTa = (xTa)2 and use chain rule. I will use the latter approach. Let h(x) = xTa and g(x) = x2. Then f(x) = g(h(x)), and so df dx = dg dh dh dx Then, dgdh = 2h and dh dx = a T , by identities provided in lecture slides. Hence, df dx = 2haT = 2xTaaT . as required. 3 • Section 3. Probability (15 points) Consider the following scenario. I flip a fair coin. If the coin comes up heads, I roll a fair 4 sided die (with sides {1, 2, 3, 4}), and then I tell you the result of rolling the die. If the coin comes up tails, I roll a fair 6 sided die (with sides {1, 2, 3, 4, 5, 6}), and then I tell you the result of rolling the die. Let X denote the number I tell you. 1. (3 points) What is the set of all possible outcomes X for X? Solution. We either roll a 4-sided die, and get a number from 1 to 4, or roll a 6-sided die, and get a number from 1 to 6. Hence X = {1, 2, 3, 4, 5, 6}. 2. (4 points) Compute P (X = x) for all x ∈ X . Solution. First consider the case where x ∈ {1, 2, 3, 4}. Let C denote the outcome of the coin, with C = {heads, tails} denoting the set of outcomes for the coin. P (X = x) = ∑ c∈C P (X = x | C = c)P (C = c) = P (X = x | C = heads)P (C = heads) + P (X = x | C = tails)P (C = tails) = 1/4 · 1/2 + 1/6 · 1/2 = 5/24 Then consider the case where x ∈ {5, 6}, which will be the same as before, but now P (X = x | C = heads) = 0, as the four-sided die can’t roll 5’s or 6’s. P (X = x) = ∑ c∈C P (X = x | C = c)P (C = c) = P (X = x | C = heads)P (C = heads) + P (X = x | C = tails)P (C = tails) = 0 · 1/2 + 1/6 · 1/2 = 1/12 Hence, P (X = x) = { 5/24 x ∈ {1, 2, 3, 4} 1/12 x ∈ {5, 6} 3. (4 points) I tell you that X = 1. How likely is it that the coin flipped heads? Solution. We apply Bayes rule. P (C = heads | X = 1) = P (X = 1 | C = heads)P (C = heads) P (X = 1) = 1/4 · 1/2 5/24 = 3/5 4. (4 points) We repeat the above experiment, but this time whatever die is selected, is rolled twice. I inform you that the outcome for both rolls was a 1. How likely is it that the coin flipped was heads? Solution. Each die roll is i.i.d. P (C = heads | rolled one twice) = P (rolled one twice | C = heads)P (C = heads) P (rolled one twice) Now, P (rolled one twice | C = heads) = P (X = 1 | C = heads)2 = 1/16, and P (rolled one twice | C = tails) = P (X = 1 | C = tails)2 = 1/36 P (rolled one twice) = P (rolled one twice | C = heads)P (C = heads) + P (rolled one twice | C = tails)P (C = tails = P (X = 1 | C = heads)2(1/2) + P (X = 1 | C = tails)2(1/2) = 1/16 · 1/2 + 1/36 · 1/2 = 13/288 4 Hence, P (C = heads | rolled one twice) = 1/16 · 1/2 13/288 = 9/13 5 • Section 4. Clustering and Gaussian Mixture Model (GMM) (15 points) Both Kmeans and GMM can be viewed as aiming to find θ to optimise p(X |θ). Here, X is the dataset, and θ is related to the model. Answer the following questions. 1. (2 points) In kmeans, use no more than 2 sentences to describe what θ contains. Solution. θ contains the coordinates of centroids, as well as the cluster ID that each sample is assigned to. Deduct 0.5 mark if answering the number of clusters. 2. (3 points) In kmeans, use no more than 2 sentences to describe the probabilistic meaning of p(X |θ). Solution. Given model parameters θ, the probability that dataset X is generated by this model. 3. (3 points) Assume that samples in X are from 3 classes. After training a GMM with 3 components on X , we use this GMM as a classifier to predict which class a new sample x belongs to. In no more than 3 sentences, describe the prediction process. (Use math symbols where relevant; you do not have to explain the symbols if they are same with lecture slides, e.g., µ). Solution. We obtain three Gaussian distributions from training. We compute p(x|θk), the probability that the test sample x is generated by the kth Gaussian distribution. We pick the Gaussian distribution that gives the largest probability as the label of this test sample. 4. (3 points) Is it correct to say that the kmeans method enables us to find θ that minimises p(X |θ)? Explain your answer in 2 sentences. Solution. Incorrect. First, we aim to maximise instead of minimise p(X |θ). Second, kmeans enables us to find a local maximum instead of a global one. 5. (4 points) Suppose we have the following 10 data points. They are partitioned into two classes, red and blue. Which model could generate this partition, kmeans only, GMM only, both, or neither? Explain your answer in three sentences. (Note: where GMM is relevant, the classification principal is similar to Question 3 above, except that this question has two classes.) Figure 1: 10 data points divided into two classes, blue and red. Solution. GMM only. It cannot be generated by kmeans because of the right most point is closer to the center of the red class, so it will be classified as red. GMM can generate this figure because the blue class has a larger variance under GMM. 6 • Section 5. Linear Regression (13 points) You are doing a machine learning internship at a bank, analysing user age and their daily expense. You collected seven samples and plotted them in Fig. 2(a). 20 25 30 35 40 45 50 age 30 40 50 60 70 80 90 ex pe ns e 20 25 30 35 40 45 50 age 30 40 50 60 70 80 90 ex pe ns e Point ! (a) Regression with normal data (b) Regression with normal and abnormal data Figure 2: Linear regression with normal and abnormal data. 1. (3 points) From your observation of Fig. 2(a), use 1 sentence to describe the relationship between user age and expense. Solution. Expense increases linearly as age increases. After obtaining Fig. 2(a), you collected a new data point A. Together with the previous seven points, you do linear regression for a second time and obtain the dashed line in Fig. 2(b). 2. (4 points) Generally when adding new samples to the dataset, it is expected that the fitted line will be different. In our example, there is quite a large difference between the new model (dashed line) and the old model (solid line). In two sentences, explain why the change is large. (Hint: you don’t have to explain why there is a “change”. Focus on “large”.) Solution. 1) Point A deviates a lot from the other points, thus rendering a large gradi- ent/loss asking the model to be closer to it. 2) There are limited number of samples in this dataset, so the big gradient from A causes the model to move a lot. 3. (6 points) You originally used the squared error, i.e., for the nth training sample (xn, yn), l(xn, yn,θ) = (yn − θTxn)2, where xn is the feature of the sample, yn is the label, and θ contains the model parameters. Your supervisor tells you that Point A is an outlier and that it is best to exclude its impact on your model. Write down an amended loss function that can achieve this goal. Explain how it excludes the impact of outliers on your linear model. Note: you will get partial marks if your loss function can merely alleviate the impact of A. Solution. l(xn, yn,θ) = min((yn − θTxn)2 −m, 0). m is a hyperparameter. From Fig. 1, its value should be approximately between 100 and 400 (it’s not necessary for a student to make this estimation). There could be other equivalent ones. Explanation. For a proper value of m, the outlier point A will have a large (yn− θTxn)2, thus causing term (yn − θTxn)2 −m to be greater than 0. In this case, there will be 0 loss value. For other points, (yn − θTxn)2 − m will be less than 0, thus preserving the value after the min operator. 7 • Section 6. Principal Component Analysis (PCA) and Linear Regression (20 points) We are given a centered dataset, (x1, y1), (x2, y2), ..., (xN , yN ), where xn ∈ R, yn ∈ R, and N is the number of samples. “Centered” means ∑N n=1 xn = 0, and ∑N n=1 yn = 0. Now for this dataset, we apply linear regression and PCA. For linear regression, our model is y = θx+θ0 = θx, where we treat yn as labels and xn as the feature. For PCA, we obtain the first and second principal components: pc1 and pc2. An example is shown in Fig. 5. -6 -4 -2 0 2 4 6 x -5 -4 -3 -2 -1 0 1 2 3 4 5 y data points y= x pc1 pc2 Figure 3: Linear regression with normal and abnormal data. 1. (3 points) Are pc1 and pc2 orthogonal? Explain your answer in two sentences. Solution. Yes. The covariance matrx of this 2-D dataset has two different eigenvalues because the data distribution is oval. Eigenvectors corresponding to different eigenvalues are orthogonal. 2. (4 points) In usual cases, the regression output θ is not in the same direction with pc1 (and pc2). Explain why θ and pc1 are usually different in direction. You can use whatever is relevant to help you illustrate, such as figures or maths. (Hint: differences of PCA and linear regression in their optimisation objective.) Solution. Both linear regression and PCA aim to find a model to minimise a loss function. Linear regression aims to find a subspace such that the predicted output and the model output are close. In comparison, PCA wants to find a subspace such that the orthogonal projection of samples onto this subspace is minimised. See figure below for an illustration. Regression PCA Figure 4: Difference between PCA and regression. 3. (4 points) On your paper, draw an example dataset for which θ and pc1 are of the same direction. Your figure should contain the x-axis, the y-axis, at least 3 data points, as well as θ and pc1 (the latter two should be overlapping). If necessary, write the coordinates of the data points. 8 Solution. Many scenarios. For example, if three different points are on the same line, θ and pc1 overlap. Two examples are shown below. (-1, 0) (0, 0) (1, 0) (-2, -1) (-2, 1) (2, 1) (2, -1) Example 1 Example 2 Figure 5: Two examples of overlapping PCA and regression outputs. 4. (5 points) Let a = [x1, x2, ..., xN ] T ∈ RN , and b = [y1, y2, ..., yN ]T ∈ RN . On this centered dataset, show that when aTb = 0, the regression output is the x-axis. (We assume the MSE as loss function) Solution. The model of linear regression is y = θ1x+ θ0 = θx. The closed form solution of linear regression is [θ1, θ0] = θ = (X TX)−1XTy, where X = [ x1 x2 ... xN 1 1 ... 1 ]T ,y = [y1, y2, ..., yN ] T . Because aTb = 0 and y1 + y2 + ...+ yN = 0, we obtain that XTy = [ N∑ i=1 xiyi, N∑ i=1 yi] = [0, 0]. Therefore, the output of linear regression is the x-axis, i.e., y = 0. 5. (4 points) Continuing from Question 4, calculate the covariance matrix of this dataset (you do not have to do standardization). When ∑N i=1 x 2 i> ∑N i=1 y 2 i , show that the first principal component pc1 is horizontal. Solution. The covariance matrix is calculated as, S = 1 N [ x1 x2 ... xN y1 y2 ... yN ] [ x1 x2 ... xN y1 y2 ... yN ]T = 1 N [ ∑N i=1 x 2 i ∑N i=1 xiyi∑N i=1 xiyi ∑N i=1 y 2 i ] = 1 N [∑N i=1 x 2 i 0 0 ∑N i=1 y 2 i ] . When ∑N i=1 x 2 i 6= ∑N i=1 y 2 i , the two eigenvalues are λ1 = N∑ i=1 x2i , λ2 = N∑ i=1 y2i . When ∑N i=1 x 2 i> ∑N i=1 y 2 i , the eigenvector corresponding to λ1 is the principal component pc1. We calculate this eigenvector by solving the following equation system, 1 N [∑N i=1 x 2 i 0 0 ∑N i=1 y 2 i ] b = λ1b. Substituting λ1 with ∑N i=1 x 2 i , we can obtain the eigenvector as b = [1, 0]T , which is horizontal. 9 • Section 7. Classification (12 points) You have developed a linear classifier to classify the sentiment of a sentence into positive and negative. Assume that positive sentences and negative sentences have equal numbers in both training and testing sets. Your classifier obtains an accuracy of 40% on the test data. 1. (2 points) Is this classifier meaningful? Explain your answer in two sentences. Solution. No. Random guess would have 50% accuracy, so the classifier is not meaningful. 2. (3 points) Without re-training the classifier, use three sentences to describe how you improve the previous classifier and why it becomes better. Solution. Negate every prediction, i.e., if the prediction is negative then change it to positive and vice versa. Then we get a classifier that has 60% accuracy. PCA is a useful technique to project data samples onto a lower-dimensional subspace that preserves data variance. Oftentimes, it is used to preprocess features before training a classifier. 3. (4 points) You have four data points of two classes. Their class labels (A or B) and coordi- nates are listed below. - Class A: (10, 1) and (-10, 1). - Class B: (10, -1) and (-10, -1) For this case, is it a useful step to project the data onto the first principle component before training a classifier? If yes, draw the decision boundary after the projection. If no, briefly explain your answer. Solution. No. The first principal component is the x-axis. If we project the points on the x-axix, (10, 1) and (10, -1) will overlap, and (-10, 1) and (-10, -1) will overlap. The two classes are no longer separable. In this question, the informative dimension is trimmed off. 4. (3 points) Explain why PCA is helpful for classifier training in many real-world cases. Solution. In real-world cases, features have a higher dimension and there might not be sufficient training data. PCA allows us to remove the redundant features, alleviate overfit- ting and improve training efficiency. A student would get full marks if the answer contains “remove redundant features”. ——— End of the paper ——– 10
欢迎咨询51作业君