Programming Assignment
CMPT 727 Spring 2021

Naive Bayes (NB) classifiers are often competitive classifiers even though their strong independence assumptions may be unrealistic. If C denotes the class variable and (A1, ..., AN) the attributes, then an NB model can be represented as a directed graph with these variables as nodes and edges {(C, Ai) : 1 ≤ i ≤ N}. The following figure illustrates the graph structure.

In this assignment, you will implement parameter learning for NB classifiers and compare the result with logistic regression with LASSO. You will apply these classifiers to predict the party affiliation, Democrat or Republican, of US Congresspeople (the class variable) based on their votes on 16 different measures (the attribute variables), shown in Table 1. Not all Congresspeople voted on all 16 measures, so some entries in this dataset have missing attributes; however, we will still be able to use our Bayes network to classify these examples accurately. To keep things simple, the class and attribute variables are all binary, with 0 and 1 corresponding to a no and a yes vote respectively.

We have provided starter code for this assignment. You can download it here.

What to submit

Please submit the following files to CourSYS:
• nb.py – Your completed NB implementation. Do not change the signatures of the functions you were asked to implement.
• lasso.py – Your completed LASSO implementation using the scikit-learn package.
• report.pdf – A PDF file answering all the questions in this assignment.

(8 points) Question 1
Implement an NB classifier that both learns its parameters from the training data and uses these parameters to score and classify examples. When training the model, some of the parameters may not have enough examples for accurate estimation. To mitigate this, we will use a Beta(0.1, 0.1) prior on the parameters of the vote distributions.
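Concretely, a Beta(0.1, 0.1) prior turns each vote-probability estimate into a smoothed count ratio: the posterior mean is (yes-votes + 0.1) / (total votes + 0.2). A minimal sketch of this estimation and the resulting classifier is below; the function names here are illustrative, not the starter code's actual API.

```python
import numpy as np

def estimate_nb_params(X, y, a=0.1, b=0.1):
    """Estimate NB parameters with a Beta(a, b) prior on each vote
    distribution (hypothetical helper; the starter code's function
    names and signatures may differ).

    X : (n_samples, n_features) array of 0/1 votes
    y : (n_samples,) array of 0/1 party labels
    Returns P(C=1) and theta, where theta[c, i] = P(A_i = 1 | C = c).
    """
    p_class = y.mean()                  # MLE for the class prior P(C = 1)
    theta = np.zeros((2, X.shape[1]))
    for c in (0, 1):
        Xc = X[y == c]
        # Posterior-mean estimate under the Beta(a, b) prior:
        # (yes-votes + a) / (total votes + a + b)
        theta[c] = (Xc.sum(axis=0) + a) / (len(Xc) + a + b)
    return p_class, theta

def predict(x, p_class, theta):
    """Classify one fully observed example by its log-posterior odds."""
    log_odds = np.log(p_class) - np.log(1 - p_class)
    log_odds += np.sum(x * np.log(theta[1]) + (1 - x) * np.log(1 - theta[1]))
    log_odds -= np.sum(x * np.log(theta[0]) + (1 - x) * np.log(1 - theta[0]))
    return int(log_odds > 0)
```

The smoothing keeps every estimated probability strictly inside (0, 1), so the log terms above are always finite even when a class never (or always) voted yes on some measure in the training data.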
To evaluate the performance of our classifiers on the dataset, we will use 10-fold cross-validation. Under 10-fold cross-validation, the dataset is first partitioned into 10 equally sized partitions. One of these partitions is held out as the test set while the rest of the data are used as the training set, and the test error is computed on the held-out partition. This process is repeated for the other nine partitions, and the average of the resulting test errors gives the 10-fold CV test error. We have implemented this procedure for you in the function evaluate.

(This assignment is adapted from Stanford CS 228.)

What is your test error rate using 10-fold cross-validation?
Note: you can use the evaluate function in nb.py, but leave the optional argument train_subset at its default value until Question 3.

(2 points) Question 2
Using the LASSO classifier from the scikit-learn package (set the regularization parameter alpha = 0.001), implement 10-fold cross-validation and compute the associated test error. What is your test error rate using 10-fold cross-validation? Which classifier gives the better result?
Note: you can use the evaluate function in lasso.py, but leave the optional argument train_subset at its default value until Question 3.

(2 points) Question 3
To investigate the effect of the amount of training data, let's instead use a smaller subset of the training data during 10-fold cross-validation. Start with 10 samples and increase this number by 10 on each run until you reach 100. Compute the training error and test error of your trained classifiers. Plot NB training error, NB test error, LASSO training error, and LASSO test error against sample size on the same plot. What do you observe? Briefly explain the patterns and why we expected them.
Note: set the arguments train_subset=True, subset_size=x when calling evaluate so that the classifiers are trained on data of size x.
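The cross-validation loop described above, paired with scikit-learn's Lasso, can be sketched as follows. Note that Lasso is a regression model, so this sketch thresholds its continuous predictions at 0.5 to obtain class labels; the starter code's evaluate function may organize this differently, and cv_error is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import Lasso

def cv_error(X, y, n_folds=10, alpha=0.001, seed=0):
    """10-fold cross-validated test error for a LASSO 'classifier'.

    A sketch under the assumptions stated above: shuffle, split into
    n_folds parts, train on all but one fold, test on the held-out fold,
    and average the per-fold error rates.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = Lasso(alpha=alpha)          # alpha is the L1 penalty strength
        model.fit(X[train], y[train])
        # Threshold the regression output at 0.5 to get a 0/1 label.
        pred = (model.predict(X[test]) >= 0.5).astype(int)
        errors.append(np.mean(pred != y[test]))
    return float(np.mean(errors))
```

In scikit-learn, alpha is the L1 regularization constant, not a learning rate; larger values drive more feature weights exactly to zero, which is what Question 4 exploits.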
(4 points) Question 4
To evaluate whether our models can learn to base their predictions on the right features, we will create synthetic data. We imagine that bills 1-4 are partisan, with Democrats voting yes 95% of the time and Republicans voting yes 5% of the time, and that bills 5-16 are nonpartisan, with all members voting either way 50% of the time. We will simulate equal numbers of Democrats and Republicans. Call the generate_q4_data function in data_helper.py to create 4000 synthetic Congresspeople. Apply LASSO and NB respectively to this data set. Start with 400 samples and increase this number by 400 on each run until you reach 4000.

We will say that LASSO correctly ignores a nonpartisan bill if the associated feature weight is 0; we will say that NB does so if the difference between P(Yes | Democrat) and P(Yes | Republican) is less than 0.1. Plot the fraction of nonpartisan bills that are ignored as a function of the training set size. What do you observe?

(4 points) Question 5
In general, learning model parameters from data in which attribute values or labels are missing is difficult. However, we can still use the generative model of a fully trained Bayes network to classify examples in which some of the attributes are unobserved. Suppose Ai is unobserved. We can still compute P(C | A1, ..., Ai-1, Ai+1, ..., AN) by marginalizing the joint distribution over Ai.

Update your NB implementation to handle the case where some attributes have missing values, and use this new implementation to classify Incomplete Entry 1 in Table 1. Given the votes we did observe, what is the marginal probability that this Congressperson is a Democrat (C = 1)? Can you predict how this Congressperson voted on education spending (A12)?

Run LASSO, using the commonly used strategy of replacing unknown values with 0 ("no").
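For NB, this marginalization is especially simple: because the attributes are conditionally independent given C, summing the joint distribution over an unobserved Ai just removes its factor from the product. A minimal sketch, assuming missing votes are encoded as NaN (the starter code may use a different encoding, and these function names are illustrative):

```python
import numpy as np

def log_posterior_odds(x, p_class, theta):
    """Log-odds of C=1 vs C=0 for an example with possibly missing votes.

    x       : length-N array of votes, with np.nan for unobserved entries
    p_class : P(C = 1)
    theta   : theta[c, i] = P(A_i = 1 | C = c)

    Marginalizing over an unobserved A_i is equivalent to skipping its
    factor, since sum_a P(A_i = a | C) = 1 for each class.
    """
    log_odds = np.log(p_class) - np.log(1 - p_class)
    for i, xi in enumerate(x):
        if np.isnan(xi):        # unobserved vote: its factor marginalizes away
            continue
        if xi == 1:
            log_odds += np.log(theta[1, i]) - np.log(theta[0, i])
        else:
            log_odds += np.log(1 - theta[1, i]) - np.log(1 - theta[0, i])
    return log_odds

def prob_democrat(x, p_class, theta):
    """P(C = 1 | observed votes), i.e. the sigmoid of the log-odds."""
    return 1.0 / (1.0 + np.exp(-log_posterior_odds(x, p_class, theta)))
```

This is exactly the quantity Question 5 asks for: prob_democrat evaluated on the incomplete entry, with the parameters trained on the original dataset.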
What is the marginal probability that this Congressperson is a Democrat (C = 1)?
Note: you should train your classifier on the original dataset.

Question 6
How many hours did you spend on this assignment?