CS 334: Homework #2

Submission Instructions: The homework is due on Sept 16th at 11:59 PM ET. Make sure your program uses Python 3. Your submission consists of two steps (detailed below). If either of the two steps is late, then your assignment is late.

1. Upload a PDF to Gradescope: Create a single high-quality PDF with your solutions to all the problems. The solutions can be typed and/or written and scanned, but the resulting PDF must be easily readable.

2. Submit a single zip containing all the source code to Canvas: Make sure your code is complete, well-commented, and that all files are included in the directory. Also include any additional code files or auxiliary data that are needed to run your code or that were used to generate any of your answers in the PDF. These files should be named appropriately so we can easily tell which question each is related to. The zip must also include the honor statement, a README.txt file that contains the following words:

/* THIS CODE IS MY OWN WORK, IT WAS WRITTEN WITHOUT CONSULTING CODE WRITTEN BY OTHER STUDENTS. Your_Name_Here */

I collaborated with the following classmates for this homework:
Rename your code directory to hw2-<eid> and zip it. As an example, for an eid of lxiong, the following lines on a unix terminal would rename everything and zip it into the file hw2-lxiong.zip:

$ mv cs334-hw2 hw2-lxiong
$ zip -r hw2-lxiong.zip hw2-lxiong

Submit the single zip file to Canvas.

1. Decision Tree Implementation (25+10+7+8=50 points)

For this problem, you will implement the decision tree algorithm. The algorithm should work on both of the datasets from homework 1 (q3 and q4). The template code is found in dt.py. You are not allowed to use any scikit-learn modules in this problem. You can either execute the file from the command line by running the command python dt.py samples> or use another Python file to call the DecisionTree class methods.

(a) Implement the train function of the decision tree. The arguments to the function are xFeat, an n × d array where each row represents a sample and each column a feature, and y, a 1-D array of size n × 1 that contains the class (0 or 1) of each sample. You can add variables to the class to store whatever information you need.

(b) Implement the predict function of the decision tree. This will take in a 2-D array (m × d) and should output a 1-D array (m × 1) that represents the predicted class of each sample.

(c) What are the training accuracy and test accuracy for different values of the maximum depth and the minimum number of samples in a leaf? Plot the accuracy on train and test as a function of these parameters (you can have two plots, one for each variable, fixing the other variable at a default value; or a 3D plot to understand the relationship of accuracy to both values).

(d) What is the computational complexity of the train and predict functions you implemented in terms of the training size (n), the number of features (d), and the maximum depth (p)? You must justify your answer.
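To make the interface in (a) and (b) concrete, here is a minimal sketch of a train/predict decision tree. It is NOT the dt.py template: the Node class and the helper names (_gini, _best_split, _grow) are assumptions made for this illustration, and it uses a simple Gini-impurity exhaustive split search rather than any particular required criterion.

```python
import numpy as np

class Node:
    """One tree node; leaves have feat=None."""
    def __init__(self, pred, feat=None, thresh=None, left=None, right=None):
        self.pred, self.feat, self.thresh = pred, feat, thresh
        self.left, self.right = left, right

class DecisionTree:
    def __init__(self, maxDepth=5, minLeafSample=5):
        self.maxDepth = maxDepth
        self.minLeafSample = minLeafSample

    def _gini(self, y):
        p = np.bincount(y, minlength=2) / len(y)
        return 1.0 - np.sum(p ** 2)

    def _best_split(self, xFeat, y):
        # Exhaustively try every (feature, threshold) pair.
        best = (None, None, np.inf)  # (feature, threshold, impurity)
        for f in range(xFeat.shape[1]):
            for t in np.unique(xFeat[:, f]):
                mask = xFeat[:, f] <= t
                nL, nR = mask.sum(), (~mask).sum()
                if nL < self.minLeafSample or nR < self.minLeafSample:
                    continue  # split would create a too-small leaf
                imp = (nL * self._gini(y[mask]) +
                       nR * self._gini(y[~mask])) / len(y)
                if imp < best[2]:
                    best = (f, t, imp)
        return best

    def _grow(self, xFeat, y, depth):
        pred = int(np.bincount(y, minlength=2).argmax())  # majority class
        if depth >= self.maxDepth or len(np.unique(y)) == 1:
            return Node(pred)
        f, t, _ = self._best_split(xFeat, y)
        if f is None:           # no legal split found
            return Node(pred)
        mask = xFeat[:, f] <= t
        return Node(pred, f, t,
                    self._grow(xFeat[mask], y[mask], depth + 1),
                    self._grow(xFeat[~mask], y[~mask], depth + 1))

    def train(self, xFeat, y):
        self.root = self._grow(np.asarray(xFeat),
                               np.asarray(y).ravel().astype(int), 0)
        return self

    def predict(self, xFeat):
        out = []
        for row in np.asarray(xFeat):
            node = self.root
            while node.feat is not None:  # walk until we hit a leaf
                node = node.left if row[node.feat] <= node.thresh else node.right
            out.append(node.pred)
        return np.array(out)
```

Counting the per-level work of the exhaustive split search in this sketch is a good starting point for the complexity analysis asked for in (d).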
2. Exploring Model Assessment Strategies (5+6+5+7=23 pts)

We will use the wine quality dataset from the previous homework to explore the different model assessment strategies covered in class. The template code is found in q2.py. You may use any of the scikit-learn modules in this problem.

(a) Implement the holdout technique to assess the model. The arguments to the function holdout are model, xFeat, y, and testSize:
i. model is a DecisionTreeClassifier object from sklearn,
ii. xFeat is an n × d array where each row represents a sample and each column a feature,
iii. y is a 1-D array of size n × 1 that contains the class (0 or 1) of each sample,
iv. testSize is a float between 0 and 1 that represents the proportion of the dataset to include in the test split.

The output of the function is the following:
i. trainAuc, the AUC on the training dataset,
ii. testAuc, the AUC on the holdout dataset,
iii. timeElapsed, the time the function took.

(b) Implement the k-fold cross-validation approach for a model, where the performance is reported across the k different validation folds. The arguments to the function are model, xFeat, y, and k. See (a) for the definitions of the first three parameters; k is the user-specified value for k-fold cross-validation. For the output, trainAuc and testAuc should reflect the averages across the k different folds.

(c) Implement the Monte Carlo cross-validation approach with s samples. The arguments to the function are model, xFeat, y, testSize, and s. See (a) for the definitions of the first four parameters; s is the number of samples for performing the validation assessment. Similar to (b), trainAuc and testAuc should reflect the averages across the s samples. Hint: You may find it useful to re-use code from (a).

(d) Run the q2.py script from the command line to get a table of the AUC and the time.
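The three strategies might be sketched as follows using scikit-learn (allowed in this problem). holdout follows the signature given in (a); the names kfold_cv, monte_carlo_cv, and _auc are assumptions for parts (b) and (c), and xFeat/y are assumed to be NumPy arrays.

```python
import time
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, train_test_split

def _auc(model, xTr, yTr, xTe, yTe):
    """Fit the model and return (train AUC, test AUC)."""
    model.fit(xTr, yTr)
    return (roc_auc_score(yTr, model.predict_proba(xTr)[:, 1]),
            roc_auc_score(yTe, model.predict_proba(xTe)[:, 1]))

def holdout(model, xFeat, y, testSize):
    start = time.time()
    xTr, xTe, yTr, yTe = train_test_split(xFeat, y, test_size=testSize)
    trainAuc, testAuc = _auc(model, xTr, yTr, xTe, yTe)
    return trainAuc, testAuc, time.time() - start

def kfold_cv(model, xFeat, y, k):
    start = time.time()
    aucs = [_auc(model, xFeat[tr], y[tr], xFeat[te], y[te])
            for tr, te in KFold(n_splits=k, shuffle=True).split(xFeat)]
    trainAuc, testAuc = np.mean(aucs, axis=0)  # average over the k folds
    return trainAuc, testAuc, time.time() - start

def monte_carlo_cv(model, xFeat, y, testSize, s):
    start = time.time()
    # Re-use holdout (the hint in (c)): s independent random splits.
    aucs = [holdout(model, xFeat, y, testSize)[:2] for _ in range(s)]
    trainAuc, testAuc = np.mean(aucs, axis=0)
    return trainAuc, testAuc, time.time() - start
```

Note how Monte Carlo CV is just the holdout routine repeated s times with fresh random splits, which is why its running time scales roughly linearly in s.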
Comment on how the different model assessment techniques compare with one another with regard to AUC, robustness of the validation estimate, and the computational time of the three techniques.

3. Robustness of Decision Trees and k-NN (10+6+6+5=27 pts)

In this problem, we will explore how sensitive decision trees and k-NN are to the training data on the wine-quality dataset.

(a) Use k-fold cross-validation to find the optimal hyperparameters for k-NN and the decision tree. What are the optimal parameters, θopt, for each model? Also make sure to justify your choice of k (associated with k-fold, not to be confused with the k in k-NN).

(b) For the optimal k-NN hyperparameters found in (a), θknn-opt, train the k-NN on the entire training dataset and evaluate both AUC and accuracy on the test dataset. Create 3 datasets by randomly removing 5%, 10%, and 20% of the original training data. For each random subset, train a new k-NN model using the same parameters, θknn-opt, and evaluate both AUC and accuracy on the test dataset.

(c) For the optimal decision tree parameters found in (a), θdt-opt, train a decision tree on the entire training dataset and evaluate both AUC and accuracy on the test dataset. Using the random subsets from (b), train a new decision tree model with the same parameters and evaluate both AUC and accuracy on the test dataset.

(d) Report the AUC and accuracy of the 8 different models from (b) and (c) in a table. Comment on the sensitivity of the two algorithms to the training data (how they are impacted, not sensitivity in the sense of sensitivity and specificity).
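The subset experiment in (b) and (c) can be sketched as below: drop a fraction of the training rows at random, retrain with the fixed optimal hyperparameters, and score AUC and accuracy on the untouched test set. The function name evaluate_subsets and the fixed seed are assumptions for this illustration; the same function can be called with the tuned k-NN and the tuned decision tree to fill in the 8-row table for (d).

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate_subsets(model, xTrain, yTrain, xTest, yTest,
                     fracs=(0.05, 0.10, 0.20), seed=334):
    """Return {fraction removed: (test AUC, test accuracy)}."""
    rng = np.random.RandomState(seed)  # assumed seed, for repeatability
    results = {}
    for frac in (0.0,) + tuple(fracs):  # 0.0 = the full training set
        # Keep a random (1 - frac) share of the training rows.
        keep = rng.permutation(len(xTrain))[:int(round(len(xTrain) * (1 - frac)))]
        model.fit(xTrain[keep], yTrain[keep])
        auc = roc_auc_score(yTest, model.predict_proba(xTest)[:, 1])
        acc = accuracy_score(yTest, model.predict(xTest))
        results[frac] = (auc, acc)
    return results
```

Example usage, with hypothetical tuned models: evaluate_subsets(KNeighborsClassifier(n_neighbors=7), xTrain, yTrain, xTest, yTest) and the analogous call with DecisionTreeClassifier give one table row per fraction for each model.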