DSC 190: Midterm May 7, 2020
Guidelines
• Exam duration: 9:30 AM, May 7 to 9:30 AM, May 8 PT.
• There are 9 problems in total.
• You are not supposed to code anything for this midterm.
• Submission must be made on Gradescope.
• Some of the problems may not have unique answers. Your justification and reasoning will be the most
important part.
1. SVM (10 points):
Suppose you have the following dataset with 6 observations and 2 classes (Green, Blue).
x1 x2 y
2 4 Green
4 4 Blue
2 1 Green
2 2 Green
4 3 Blue
4 2 Blue
Table 1: Dataset for SVM
(a) Draw a rough plot of the 2D observations and also draw the maximal margin hyperplane separating
the two classes. What is the equation for the hyperplane?
(b) What are the support vectors for the maximal margin classifier?
(c) What is the margin for the maximal margin hyperplane?
You must solve this question manually; code will not be accepted as a valid answer.
2. Logistic Regression (10 points):
Assume we collect data on NFL players with two predictors, X1 = number of hours of training per week and
X2 = player rating on a scale of 1 to 5, and a response Y = whether the player scores a touchdown (Yes = 1,
No = 0). We fit a logistic regression and find the estimated weights w0 = −6, w1 = 0.05, w2 = 1. Assuming
the fitted model is accurate, answer the following questions:
(a) Estimate the probability that a player who trains for 40 hrs per week and has a rating of 3.5 scores a
touchdown in the upcoming game.
(b) How many hours per week would the player in part (a) need to train to have a 50% chance of scoring
a touchdown in the next game?
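For reference, the fitted model above is the standard logistic (sigmoid) form. A minimal sketch in Python, assuming the weights given in the problem; the function name touchdown_probability is my own choice, not from the exam:

```python
import math

def touchdown_probability(hours, rating, w0=-6.0, w1=0.05, w2=1.0):
    """P(Y = 1 | X1 = hours, X2 = rating) under the logistic model:
    sigma(w0 + w1*x1 + w2*x2), where sigma is the logistic sigmoid."""
    z = w0 + w1 * hours + w2 * rating
    return 1.0 / (1.0 + math.exp(-z))
```

Part (b) reduces to solving w0 + w1·x1 + w2·x2 = 0 for x1, since the sigmoid equals 0.5 exactly when its argument is zero.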
3. Overfitting, Bias vs. Variance (10 points):
(a) List three common strategies to address overfitting.
(b) Draw a graph of bias and variance vs. model complexity to show how bias and variance change as the
model complexity increases. Briefly explain the graph.
4. Comparing Data Mining Concepts/Methods (10 points):
Briefly describe two major differences between the following pairs of concepts or methods. Please be concise
in your explanations.
(a) Linear Regression and Logistic Regression
(b) Linear Regression and Linear SVM
(c) Linear SVM and Kernel SVM
(d) k-Means and Expectation-Maximization (EM)
(e) PCA and t-SNE
(f) Frequent Patterns and Association Rules
5. Naive Bayes (10 points):
In this question, give the final value as well as the necessary intermediate steps. Suppose you have
the following training set with three Boolean input variables x, y, and z, and a Boolean output variable U.
x y z U
1 0 0 0
0 1 1 0
0 0 1 0
1 0 0 1
0 0 1 1
0 1 0 1
1 1 0 1
Table 2: Input-Output features
Suppose you have trained a Naive Bayes classifier to predict U with x, y, and z as features. Once the learning
is finished:
(a) What would be the predicted probability of P (U = 0|x = 0, y = 1, z = 0)?
(b) What would be the predicted probability of P (U = 0|x = 0)?
For the next two parts, assume that a Naive Bayes classifier is learned by considering the combination
of (x, y, z) as one feature instead of x, y, z being three different features:
(c) What would be the predicted probability of P (U = 0|x = 0, y = 1, z = 0)?
(d) What would be the predicted probability of P (U = 0|x = 0)?
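For concreteness, the count-based estimates that a standard (no-smoothing) Naive Bayes classifier would use in part (a) can be sketched as follows. This is an illustration only, not a substitute for showing the intermediate steps; the helper name nb_posterior_u0 is my own:

```python
# Training set from Table 2: each row is (x, y, z, U).
data = [
    (1, 0, 0, 0), (0, 1, 1, 0), (0, 0, 1, 0),
    (1, 0, 0, 1), (0, 0, 1, 1), (0, 1, 0, 1), (1, 1, 0, 1),
]

def nb_posterior_u0(x, y, z):
    """P(U = 0 | x, y, z) under Naive Bayes: for each class, multiply the
    class prior by the per-feature conditional frequencies, then normalize."""
    scores = {}
    for u in (0, 1):
        rows = [r for r in data if r[3] == u]
        prior = len(rows) / len(data)
        px = sum(r[0] == x for r in rows) / len(rows)
        py = sum(r[1] == y for r in rows) / len(rows)
        pz = sum(r[2] == z for r in rows) / len(rows)
        scores[u] = prior * px * py * pz
    return scores[0] / (scores[0] + scores[1])
```

For parts (c) and (d), the joint-feature variant would instead count whole (x, y, z) tuples per class rather than multiplying per-feature frequencies.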
6. Evaluation Measurement (10 points):
Consider the sentiment detection task where the possible labels are Positive and Negative. Suppose you are
evaluating on two datasets: Dataset-A and Dataset-B.
The dataset statistics are given in Tables 3 and 4.
Label Number of Samples
Positive 900
Negative 100
Table 3: Dataset-A Statistics
Label Number of Samples
Positive 550
Negative 450
Table 4: Dataset-B Statistics
Actual Positive Actual Negative
Predicted Positive 800 70
Predicted Negative 100 30
Table 5: Confusion Matrix on Dataset-A
Actual Positive Actual Negative
Predicted Positive 450 130
Predicted Negative 100 320
Table 6: Confusion Matrix on Dataset-B
(a) Which of the following performance metrics would best represent the performance of any classifier
on Dataset-A? That is, high performance on Dataset-A according to a metric M should imply that the
classifier is very good. Explain your reasoning.
(a) Precision
(b) Recall
(c) F1-score
(d) Accuracy
(b) Which of the following performance metrics would best represent the performance of any classifier
on Dataset-B? That is, high performance on Dataset-B according to a metric M should imply that the
classifier is very good. Explain your reasoning.
(a) Precision
(b) Recall
(c) F1-score
(d) Accuracy
(c) Suppose we evaluated a classifier on Dataset-A and Dataset-B and obtained the confusion matrices shown
in Tables 5 and 6. For each case, compute Precision, Recall, F1-score, and Accuracy. Comment on the
classifier's performance: on which dataset is it more effective, and why?
(d) Consider the metric M = (TP × TN) / (FN × FP), where TP, TN, FN, and FP denote the numbers of
True Positives, True Negatives, False Negatives, and False Positives, respectively. Come up with a
scenario in which this metric is useful and representative of the classifier's performance, and justify why.
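The Precision/Recall/F1/Accuracy computations requested above follow directly from the confusion-matrix counts. A minimal sketch, with the helper name metrics being my own choice:

```python
def metrics(tp, fp, fn, tn):
    """Precision, Recall, F1, and Accuracy from binary confusion-matrix
    counts, treating Positive as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy
```

For example, Table 5 corresponds to metrics(800, 70, 100, 30): TP = 800, FP = 70, FN = 100, TN = 30.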
7. Model Choices for Binary Classification (10 points):
(a) Given a binary classification training set of 1,000,000 instances, suppose 1% of the training instances
were wrongly labeled. Which classifier would you prefer to train on this dataset: Decision Tree or
Random Forest? Why?
(b) Given a binary classification training set of 1,000,000 instances, suppose there are a few outliers, as
observed in the training data visualization. Which classifier would you prefer to train on this dataset:
Logistic Regression or SVM? Why?
(c) To build a classifier on high-dimensional features using small training data, one needs to consider
the scenario where many features are just irrelevant noise. To train a generalizable classifier, would you
use Naive Bayes or Logistic Regression? If you choose Logistic Regression, which regularization
setting would you use, and why?
8. Arya Mixture Model (20 points):
Suppose there exists a distribution called Arya (a distribution created by us) whose probability density
function is described below. Suppose there are K clusters, each characterized by an Arya distribution,
with cluster priors as given below. Assume each data point Xi ∈ R+ (i = 1, . . . , n) is drawn as follows:
P(Zi = k) = πk for k = 1, 2, . . . , K
Xi ∼ Arya(2, βZi)
The probability density function of Arya(2, β) is:
P(X = x) = β^2 x e^(−βx)
(a) Suppose K = 3 and β1 = 1, β2 = 2, β3 = 4. What is P (Z = 1|X = 1)?
(b) Describe the E-step. Write an equation for each value that is computed.
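For orientation, by Bayes' rule the posterior over cluster assignments in a mixture like this takes the standard responsibility form (a general expression, not a numeric answer):

```latex
P(Z_i = k \mid X_i = x) =
  \frac{\pi_k \, \beta_k^2 \, x \, e^{-\beta_k x}}
       {\sum_{j=1}^{K} \pi_j \, \beta_j^2 \, x \, e^{-\beta_j x}}
```

In mixture-model EM, the E-step computes exactly these responsibilities for each data point under the current parameter estimates.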
9. K-means Clustering (10 points):
Given the (x, y) pairs in Table 7, cluster them into 2 clusters using the k-means algorithm.
Assume k-means uses Euclidean distance.
Data Point Index x y
1 1.90 0.97
2 1.76 0.84
3 2.32 1.63
4 2.31 2.09
5 1.14 2.11
6 5.02 3.02
7 5.74 3.84
8 2.25 3.47
9 4.71 3.60
10 3.17 4.96
Table 7: Dataset for k-means clustering
Figure 1: Dataset for k-means
The plot of the points is shown in Figure 1.
Let the first cluster center be the tenth data point and the second cluster center be the first
data point. Run the k-means algorithm for 1 iteration with the number of clusters = 2. What are
the cluster assignments after 1 iteration? What are the cluster assignments after convergence?
Fill in the table below. You need not code this; you can do it manually by computing Euclidean
distances.
Data Point Index Cluster Assignment after One Iteration Cluster Assignment after Convergence
1
2
3
4
5
6
7
8
9
10
Table 8: Cluster Assignments
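Although the exam asks for manual calculation, the two k-means steps being exercised here can be sketched as follows (function names are my own, and the toy usage is illustrative, not the exam data):

```python
import math

def assign_clusters(points, centers):
    """Assignment step: give each point the index of its nearest center
    under Euclidean distance."""
    def dist(p, c):
        return math.hypot(p[0] - c[0], p[1] - c[1])
    return [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in points]

def update_centers(points, labels, k):
    """Update step: recompute each center as the mean of its assigned points."""
    centers = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        centers.append((sum(p[0] for p in members) / len(members),
                        sum(p[1] for p in members) / len(members)))
    return centers
```

One iteration is one assignment step followed by one update step; convergence means the assignments no longer change between iterations.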