Assignment 3: kNN and SVM
UVA CS 6316:
Machine Learning (Fall 2019)
Out: Oct. 10, 2019
Due: Oct. 22, 11:59 pm, @ Collab
a The assignment should be submitted in PDF format through Collab. If you prefer to handwrite the QA
parts of your answers, please convert them (e.g., by scanning or using an app) to PDF.
b For questions and clarifications, please post on Piazza.
c Policy on collaboration:
Homework should be done individually: each student must hand in their own answers. It is acceptable,
however, for students to collaborate in figuring out answers and to help each other solve the problems.
Under the honor code, we assume you take responsibility for making sure you personally understand
the solution to any work arising from such collaboration.
d Policy on late homework: Homework is worth full credit if submitted by midnight on the due date. Each
student has three extension days to be used at his or her own discretion throughout the entire course.
Your grade is discounted by 15% per day when you use these 3 late days. You can use the 3 days in
whatever combination you like: for example, all 3 days on 1 assignment (for a maximum grade of
55%) or 1 day on each of 3 assignments (for a maximum grade of 85% each). After you have used all
3 days, you cannot get credit for anything turned in late.
e Policy on grading:
1: 50 points in total. 20 points for code submission (and able to run). 30 points for displaying results
in the report.
2: 40 points in total. 20 points for code submission (and able to run). 10 points for result submission.
10 points for the result table in your report.
3: 10 points in total. 5 points for answering each question.
4: 10 extra points if you win the competition (i.e., achieve top-20 accuracy on the leaderboard)!
The overall grade will be divided by 10 and entered into the grade book. Therefore, you can earn 11
out of 10.
1 KNN and Model Selection (k) (programming)
In this problem, we will implement a k-nearest-neighbor (kNN) classifier and select k using cross-validation.
We have provided the dataset file "Movie_Review_Data.txt". Please use the "knn_template.py" template to guide
your work, and follow the instructions and the function names/descriptions in the template. Feel free
to cross-check your implementation against scikit-learn's KNN. Other requirements and recommendations are the
same as in HW1 and HW2. This problem covers two main tasks:
• (1): Implement a very simple classifier, k-nearest neighbors (kNN), from scratch.
• (2): Implement a 4-fold cross-validation strategy for selecting the best k for kNN classification.
Feel free to reuse the cross-validation code from the previous HWs.
Detailed Instructions:
1. We have provided the dataset file "Movie_Review_Data.txt". The template also provides an x, y = read_file(file)
function to load the dataset.
2. You are required to fill in the function
findBestK(x, y, klist, nfolds)
following the guidance in the template. You are given the list of k ∈ {3, 5, 7, 9, 11, 13}. The goal is to
find the best k using nfolds = 4 cross-validation.
3. The first step is implementing the cross-validation method that splits the data into folds:
x_train, y_train, x_test, y_test = fold(x, y, i, nfolds)
Here i indicates the fold number (iterated from 0 to nfolds − 1). This function returns the i-th train and
test fold from x and y. Feel free to try other numbers of folds for cross-validation, e.g., 3-fold or 10-fold.
One possible implementation is sketched below.
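For reference, here is a minimal sketch of one way fold could work. It assumes x and y are NumPy
arrays of equal length whose rows have already been shuffled (see the hint in step 5); the exact
signature should follow the template.

import numpy as np

def fold(x, y, i, nfolds):
    # Split the data into nfolds contiguous folds; the i-th one is the test fold.
    n = len(y)
    start = i * n // nfolds
    end = (i + 1) * n // nfolds
    test_idx = np.arange(start, end)
    train_idx = np.concatenate([np.arange(0, start), np.arange(end, n)])
    return x[train_idx], y[train_idx], x[test_idx], y[test_idx]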
4. After getting a train and test fold, the next step is implementing the kNN classification method using the
function:
y_predict = classify(x_train, y_train, x_test, k)
where k is the number of data points that kNN takes into consideration when labeling an unlabeled
data point. Note that here the training data is part of the testing (classify) algorithm: in kNN, we label a
new data point based on the k points that are closest to it in our training dataset. In this function, for
each testing point, find its k nearest neighbors in the training set (those having the smallest Euclidean
distance to the testing point). As reviewed in class, kNN requires a distance metric to compute distances
between samples; we will use Euclidean distance. Then label the testing point according to the
majority-voting rule. The returned y_predict is a list of 0/1 predictions obtained from the classify
method. One possible implementation is sketched below.
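A sketch of classify under these rules (Euclidean distance, majority vote). It is written for clarity,
not speed, which is acceptable since computational speed is not graded (see Task 2):

import numpy as np
from collections import Counter

def classify(x_train, y_train, x_test, k):
    y_predict = []
    for test_point in x_test:
        # Euclidean distance from this test point to every training point.
        dists = np.sqrt(((x_train - test_point) ** 2).sum(axis=1))
        # Indices of the k nearest training points.
        nearest = np.argsort(dists)[:k]
        # Majority vote over the neighbors' 0/1 labels.
        votes = Counter(y_train[j] for j in nearest)
        y_predict.append(votes.most_common(1)[0][0])
    return y_predict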
5. Once predictions are obtained, the next step is to evaluate them. For this, implement the
calc_accuracy method:
acc = calc_accuracy(y_predict, y)
where y_predict is the list of 0/1 predictions obtained from the classify method and y holds the true
labels for the testing points (the last column of the test data). Repeat this process for all folds to get
the cross-validation accuracy for each k, as sketched below.
(Hint 1: If your accuracy is below 50%, consider how the order of the samples is dictated by the
class: the file may be sorted by label, so shuffle the rows before splitting into folds.)
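Putting the pieces together, here is a sketch of calc_accuracy and of the cross-validation loop inside
findBestK, reusing the fold and classify sketches above; the exact return value of findBestK should
follow the template.

import numpy as np

def calc_accuracy(y_predict, y):
    # Fraction of predictions that match the true labels.
    return float(np.mean(np.asarray(y_predict) == np.asarray(y)))

def findBestK(x, y, klist, nfolds):
    accuracy_list = []
    for k in klist:
        # Average the test-fold accuracy over all nfolds folds.
        fold_accs = []
        for i in range(nfolds):
            x_tr, y_tr, x_te, y_te = fold(x, y, i, nfolds)
            fold_accs.append(calc_accuracy(classify(x_tr, y_tr, x_te, k), y_te))
        accuracy_list.append(np.mean(fold_accs))
    best_k = klist[int(np.argmax(accuracy_list))]
    return best_k, accuracy_list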
6. Task 1: Report the cross-validation accuracy for all values of k, as well as the best k. Discuss why
some values work better than others.
7. Task 2: Implement the function barplot(klist, accuracy_list). A bar graph is recommended to show how
accuracy varies with k: use k as the x-axis and accuracy as the y-axis, and put the bar plot in your report.
You can use matplotlib for this question, as sketched below. Note: we will not consider computational
speed when grading your kNN code.
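One possible barplot using matplotlib (the output filename is illustrative):

import matplotlib.pyplot as plt

def barplot(klist, accuracy_list):
    plt.bar([str(k) for k in klist], accuracy_list)
    plt.xlabel("k")
    plt.ylabel("cross-validation accuracy")
    plt.title("kNN accuracy vs. k")
    plt.savefig("knn_accuracy.png")  # illustrative filename
    plt.show()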
2 Support Vector Machines with Scikit-Learn and preprocessing
This question covers two main tasks:
• (1) For this assignment, you will implement an SVM classifier. Given a proper set of attributes, your
program will be able to determine whether an individual makes more than 50,000 USD/year.
• (2) You will use scikit-learn's C-Support Vector Classifier, sklearn.svm.SVC. Install the latest stable
version of scikit-learn following the directions available at
http://scikit-learn.org/stable/install.html
Dataset Details: Two dataset files are provided: the labeled sample set "salary.labeled.csv" and the unlabeled
sample set "salary.2Predict.csv". Make sure to download them from Collab! The file "salary.labeled.csv" is
the total training data available to you. The unlabeled sample set "salary.2Predict.csv" is a text file in
the same format as the labeled dataset, except that its last column contains a fake field for class
labels. The details of the column names and feature properties are provided in svm_template.py.
You are required to generate/predict labels for the samples in "salary.2Predict.csv" (about 10k samples).
Note: We will evaluate your output 'predictions', an array of strings (">50K" or "<=50K"), against
the true labels of these test samples in "salary.2Predict.csv" (ATT: you don't have these labels!!!).
This simulates a Kaggle-style competition in which test labels are always held out and only the team
ranking is released after all teams have submitted their predictions. When grading this assignment, we will
rank all students' predictions, so please try to submit the best-performing model that you can! The best
performers in the tournament will receive bonus credit.
Detailed Steps:
1. The first step is loading and preprocessing the data:
• The dataset contains both categorical and continuous features (see the sample data in Table 1).
First, implement the load_data function in svm_template.py. This function takes the filename
training_csv as input and processes the data into a numerical vector representation. For example,
to start, you can try normalization on the continuous features and one-hot encoding on the
categorical ones (see the sketch after this list).
• For preprocessing, use techniques mentioned in class!
• Be creative; good data preprocessing can help you win the tournament.
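As a starting point, here is one possible load_data sketch using pandas: z-score the continuous
columns and one-hot encode the rest. It assumes the CSV has a header row matching the column names
in Table 1 (adjust if the file is headerless); the continuous-column list below is taken from Table 1.

import pandas as pd
from sklearn.preprocessing import StandardScaler

CONTINUOUS = ["age", "fnlwgt", "education-num",
              "capital-gain", "capital-loss", "hours-per-week"]

def load_data(csv_fpath):
    df = pd.read_csv(csv_fpath)
    y = df.iloc[:, -1].values   # last column: ">50K"/"<=50K" (fake in the test file)
    features = df.iloc[:, :-1]
    categorical = [c for c in features.columns if c not in CONTINUOUS]
    # One-hot encode the categorical features.
    features = pd.get_dummies(features, columns=categorical)
    # Normalize the continuous features to zero mean, unit variance.
    features[CONTINUOUS] = StandardScaler().fit_transform(features[CONTINUOUS])
    return features.values.astype(float), y

Note that if a categorical value appears in only one of the two files, get_dummies will produce
different columns for the train and test sets; one common fix is to reindex the test features to the
training columns, filling missing ones with 0.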
2. The next step is to choose good hyperparameters for the SVM classifier via cross-validation. You are
required to write wrapper code for cross-validation as part of train_and_select_model(training_csv),
and iterate through different hyperparameters (just as we did in HW1/2).
• The hyperparameters you can choose include the SVM kernel (basic linear kernel / polynomial
kernel / RBF kernel), the C value, etc. A few examples are provided in svm_template.py.
• Call sklearn.svm.SVC with your parameters to get a classifier!
• You need to include the CV classification accuracy results in your PDF report by performing 3-fold
cross-validation (CV) on the labeled set "salary.labeled.csv" (about 38k samples) using
at least three different SVM kernels of your choice. Please provide details about the kernels you have
tried and their performance (e.g., 3-fold CV classification accuracy), including results on both CV train
accuracy and CV test accuracy. For instance, you can summarize the results in a table with each row
containing the kernel choice, kernel parameters, CV train accuracy, and CV test accuracy.
3. Next, also as part of train_and_select_model(training_csv), select the best hyperparameters, train a
model using them, and return both the corresponding cross-validation score and the trained SVC model,
as sketched below.
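A sketch of how train_and_select_model might combine steps 2 and 3, assuming the load_data sketch
above; the candidate hyperparameters shown are illustrative and hard-coded, as required.

from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

def train_and_select_model(training_csv):
    x, y = load_data(training_csv)
    # Hard-coded candidate hyperparameters (illustrative values).
    param_grid = [
        {"kernel": "linear", "C": 1.0},
        {"kernel": "poly", "C": 1.0, "degree": 2},
        {"kernel": "rbf", "C": 1.0, "gamma": "scale"},
    ]
    best_score, best_params = -1.0, None
    for params in param_grid:
        # 3-fold CV; record both train and test accuracy for the report.
        cv = cross_validate(SVC(**params), x, y, cv=3, return_train_score=True)
        train_acc, test_acc = cv["train_score"].mean(), cv["test_score"].mean()
        print(params, "CV train acc:", train_acc, "CV test acc:", test_acc)
        if test_acc > best_score:
            best_score, best_params = test_acc, params
    # Refit the best setting on all of the labeled data.
    trained_model = SVC(**best_params).fit(x, y)
    return trained_model, best_score

Since SVC training scales poorly with sample size, you may want to explore hyperparameters on a
subsample of the 38k rows before running the final 3-fold CV.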
4. Finally, use the trained model to make predictions on the test set via predict("salary.2Predict.csv", trained_model)
and save the output. svm_template.py provides output_results(predictions) to save the predictions
in the desired format. Remember not to shuffle your prediction outputs! A sketch of both functions follows.
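A sketch of these last two functions, assuming the load_data sketch above (the fake label column of
the test file is simply discarded) and that the train/test one-hot columns have been aligned:

def predict(test_csv, trained_model):
    x_test, _ = load_data(test_csv)   # fake labels are ignored
    return trained_model.predict(x_test)

def output_results(predictions):
    # One prediction (">50K" or "<=50K") per line, in the original row order.
    with open("predictions.txt", "w") as f:
        for label in predictions:
            f.write(str(label) + "\n")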
Submission Instructions: You are required to submit the following:
1. A Python program svm.py (the filled-in svm_template.py):
It should be able to train and select a model using a set of hyperparameters on the training data; these
hyperparameters should be hard-coded.
2. We should be able to use the learned/trained model to classify samples in an unlabeled test set using
the following function:
predictions = predict("salary.2Predict.csv", trained_model)
3. A file “predictions.txt” generated by:
output_results(predictions)
Please do not archive the file or change the file name, so that automated grading can proceed.
4. In your PDF submission, include:
• A table reporting the classification accuracy (score) averaged over the test folds, along with details
of the best-performing kernel, C, etc.
• The CV classification accuracy results: provide details about the hyperparameters you have tried
and their performance (e.g., 3-fold CV classification accuracy), including results on both train and
test folds. For instance, you can summarize the results in a table with each row containing the
kernel choice, kernel parameters, CV train accuracy, and CV test accuracy.
• A discussion of how you preprocessed the data and how you chose your SVM hyperparameters.
Classes: >50K, <=50K.
Attributes:
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th,
Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical,
Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China,
Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan,
Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
Table 1: Classes and attributes of the data used in the SVM coding question.
3 Sample QA Questions:
Question 1. Support Vector Machine
Soft-margin linear SVM can be formulated as the following constrained quadratic optimization problem:

\operatorname{argmin}_{w,b} \; \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i

such that

y_i (w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \;\; \forall i

where C is the regularization parameter, which balances the margin (smaller w^T w) and the penalty of
mis-classification (smaller \sum_{i=1}^{m} \xi_i).
(a) (True/False) The number of support vectors does not depend on the selected value of C. Please provide a
one-sentence justification.
In the following figure, n_{C=0} is the number of support vectors as C gets close to 0 (\lim_{C \to 0}) and
n_{C=\infty} is the number of support vectors as C gets close to \infty.
(b) Select the correct option: ( )
(1) n_{C=0} > n_{C=\infty}
(2) n_{C=0} < n_{C=\infty}
(3) n_{C=0} = n_{C=\infty}
(4) Not enough information provided
(c) Match the following list of kernels used for SVM classification with the following figures:
(1) Linear kernel: K(x, x') = x^T x'
(2) Polynomial kernel (order = 2): K(x, x') = (1 + x^T x')^2
(3) Radial basis kernel: K(x, x') = \exp(-\frac{1}{2} \|x - x'\|^2)