Midterm Examination
COMP-4311-WA Big Data, Winter 2020
Time: 50 minutes
Total Points: 30

Name (Last Name, First Name):
Student ID:

1) What is complete case deletion in imputation? Why is it an acceptable imputation method for dealing with MCAR (Missing Completely at Random) data? Explain with an example. (3 points)

2) Assume the population of a city consists of poor, middle-class, and rich people. Someone collects a random sample from this population and somehow ends up collecting data only from middle-class people. What is the Gini impurity of this sample that contains only middle-class people? Explain. (2 points)

3) Explain the dynamic threshold approach for incorporating continuous-valued attributes into decision tree learning. (2 points)

4) What is bagging? Describe a machine learning method that uses bagging. (4 points)

5) How does increasing or decreasing the number of random features considered in the information gain calculation at each node affect the performance of random forests? (4 points)

6) What is the objective function in k-means clustering? Why is it not always suitable for determining the appropriate value of k? (3 points)

7) What are the single linkage and complete linkage methods in hierarchical clustering? Is the single linkage method good for achieving compact clusters? Explain your answer using an example or figure. (5 points)

8) When the size of the best subset of features is large, does sequential backward selection perform better or worse? Why? Explain your answer. (3 points)

9) Can replicated over-sampling cause overfitting? Why? Explain your answer. (3 points)

10) What is the difference between cost-sensitive learning and cost-sensitive prediction? (1 point)