CSE 142 Machine Learning Due: November 7th Homework 2 Directions: This homework is to be done individually. Please upload a set of solutions containing your name and @ucsc.edu e-mail address to Canvas. Typeset (e.g. TeX) solutions are preferred, but scans or photographs of hand-written solutions are acceptable provided that they are neat and legible. The TA may deduct points for poorly organized or illegible solutions. Question: 1 2 3 4 Total Points: 20 30 20 30 100 Bonus Points: 0 0 0 0 0 Score: Questions: 1. Logistic Regression. Logistic regression treats a binary classification (e.g. is/is-not a dog) probabal- istically, where the probability for any example x to be assigned the class y = 1 is given in terms the logistic sigmoid function g and a weight vector w that must be learned: g(w · x) = exp (w · x) 1 + exp (w · x) p(1 | x,w) = g(w · x) p(0 | x,w) = 1− g(w · x) (a) (10 points) Let g(w · x) = q From the definitions above, prove that w · x is a logit or “log-odds" function when 0 < q < 1 w · x = ln q 1− q (b) (10 points) Just as p(1|w,x) = g(w · x) is the probability for a positive classification conditioned on a set of weights, it is the likelihood of weights w given a classification y = 1. Show that the gradient (in parameter space - derivatives should be taken with respect to the components of w) of this log likelihood function reduces to ∇ log g(w · x) = x(1− g(w · x)) which is a vector quantity parallel to x. 2. (30 points) Naive Bayes. Use Naive Bayes to estimate whether a student will be an honor student (H) or normal student (N) in college based on their high school performance. Each instance has two features: the student’s high school GPA (a real number) and whether or not the student took any AP courses (a Boolean value, yes=1, no=0). Based on the following training data, create (by hand or with computational tools) a Naive Bayes prediction rule using normal (Gaussian) distributions to estimate the conditional probability density of high school GPAs given honors status (H or N) (this assigns non- zero probability to negative or greater-than-four GPA values, but that is fine for our purposes) and a Bernoulli distribution for the AP feature. label AP GPA H yes 4.0 H yes 3.7 H no 2.5 N no 3.8 N yes 3.3 N yes 3.0 N no 3.0 N no 2.7 N no 2.2 Recall that Naive Bayes makes the simplifying assumption that the features are conditionally independent given a class / label: p(gpa, ap | honors) = p(gpa | honors)p(ap | honors). Use maximum likelihood estimation (not the unbiased or Laplace estimates) for the distributions of the two features conditioned on the two classes. Give the mean and variance for each distribution over GPA values. For the variance here, you only need to calculate the biased sample variance estimator (divided by n), not the unbiased one (divided by n− 1). Describe your prediction rule in the following form: If AP courses are taken, predict H if the GPA is in a ⊂ R if AP courses are not taken, predict H if the GPA is in b ⊂ R where a and b must be found. Hint: It is probably easier to get this description if you take logarithms. 3 digits of precision should sufficient. Also, the logarithm of the Gaussian densities are quadratic, possibly yielding two distrinct zeros that correspond to the boundaries of the intervals a or b. 3. Nearest Neighbors. Assume that examples are drawn uniformly from the unit square. Independent of example features (i.e., location (x1, x2) in the unit square), labels are generated at random such that a proportion q where 0.5 < q < 1 are assigned label y = 1 and the remainder are assigned label y = 0. The Bayes-optimal hypothesis will minimize the error rate of its predictions given (x1, x2) by always predicting y = 1 and suffering an error rate of 1− q. How should we expect the 3-Nearest-Neighbors algorithm perform? Assume that the algorithm is trained on a large set of known labels. For each new sample, the algorithm finds the three closest (according to any metric!) points in its training set, finds the label shared by the majority of these three, and assigns this label to the new example. (a) (10 points) What is the expected error rate of the 3-Nearest-Neighbor algorithm, in terms of q? (b) (10 points) When is this better or worse than the Naive Bayes solution (recall, 0.5 < q < 1)? 4. Decision Tree. Sammy-the slug owns a car dealership that sells two types of cars: Honda(H) and BMW(B). He collected the following data of his customers that records their Gender, Annual income and the type of car they purchased: Gender AI (Annual Income in thousands) Preference F 100 B M 150 B F 80 B M 75 B F 90 H F 85 H M 85 H F 70 H Page 2 This question is about building decision trees to predict the car type that a new customer will prefer. Note that one of the attributes, Annual Income (AI), is a continuous variable. Lets assume that we will only allow binary splits for this attribute of the form AI < a and AI ≥ a, where a lies in the dataset. However, there can be multiple such splits in one path from root to leaf. For all your calculations, use log base 2. (a) (3 points) For this part of the question, lets assume that for some unknown reason, Sammy insists on keeping the “Annual Income” at the root node. How many possible values of a does he need to consider? What are they? (b) (3 points) What is the entropy of labels (car type) in the training dataset? (c) (9 points) What is the optimal root node for this dataset? Show your calculations. (d) (15 points) Draw the Decision Tree that would be learned by ID3 on this dataset. Only binary splits are allowed i.e. a node can have only two children. Your tree should contain all the information necessary to read it. At each node, indicate: 1. the attribute you are splitting on (when splitting on AI, also include the a used); 2. the label distribution of the instances at that node before the split (e.g. if there are 3 instances at a node and 2 belong to H and 1 belongs to B class, then label the node as 〈2, 1〉). 3. Additionally, for each non-leaf node indicate the gain attained by the corresponding split, and label each leaf node by its class-label (H or B). 4. All edges between a parent and a child should be labeled with the value of the attribute. 5. It is okay to draw the tree by hand and include a clear picture in your pdf. 6. Don’t forget to include your calculations (6 points here). Page 3
欢迎咨询51作业君