辅导案例-CSCE 478/878 -Assignment 1
1


Introduction to Machine Learning
CSCE 478/878

Programming Assignment 1

Fall 2020

K Nearest Neighbors Model for Binary Classification


Basic Info

Late submission is not allowed on the 1st programming assignment.

You will work in teams of maximum three students.

The programming code will be graded on both implementation and correctness.

The written report will be graded on content, conclusions, and presentation. It must be
formatted according to the given template (posted on Canvas). The report will be graded
as if the values obtained from the code portion were correct. The report should be short
and to the point. The length should be between 2 to 4 pages of text plus any tables and
graphs.


Assignment Goals and Tasks

This assignment is intended to build the following skills:
1. Implementation of the brute force K-NN algorithm for binary classification
2. Data pre-processing and feature selection techniques
3. Model evaluation techniques

Furthermore, the functions you develop for the data pre-processing and model evaluation
will form a basis for future assignments.

In particular, you will be evaluating various K-NN models on the white wine portion of
the Wine Quality dataset from UCI’s repository. In order to carry out this evaluation, you
will:

2
1. Implement two distance metrics (Euclidean and Manhattan)
2. Implement the brute-force K-NN algorithm
3. Implement a cross-validation function
4. Apply scaling to the dataset
5. Perform feature selection
6. Evaluate your model
7. Write a short report





Assignment Instructions

Note: you are not allowed to use any Scikit-Learn models or functions.

i) The code should be written in a Jupyter notebook and the report should be
prepared as a PDF file.
a. Name the notebook
`______a
ssignment1.ipynb`
b. Name the PDF
`______<
assignment1.pdf`
ii) The Jupyter notebook and the report should be submitted via webhandin. Only
one submission is required for each group.
iii) Use the cover page (posted on Canvas) as the front page of your report.
iv) Download the wine quality dataset from:
http://archive.ics.uci.edu/ml/datasets/Wine+Quality
You will be using the white wine dataset: “winequality-white.csv”

________________________________________________________________________

Score Distribution

Part A (Model Code): 478 (50 pts) & 878 (65 pts)
Part B (Data Processing): 478 (20 pts) & 878 (25 pts)
Part C (Model Evaluation): 478 (40 pts) & 878 (50 pts)
Part D (Written Report): 478 & 878 (30 pts)

Total: 478 (140 pts) & 878 (170 pts)

________________________________________________________________________





3


Part A: Model Code (478: 50 pts & 878: 65 pts)

1) Write a function to calculate and return the Euclidean distance of two vectors.
[2 pts]
2) Write a function to calculate and return the Manhattan distance of two vectors.
[2 pts]
3) Write a function to calculate and return the accuracy and generalization error of
two vectors.
[4 pts]
4) Write three functions to compute: precision, recall and F1 score.
[6 pts]
5) Write a function to compute the confusion matrix of two vectors.
[4 pts]
6) Write a function to generate the Receiver Operating Characteristic (ROC) curve.
[8 pts]
7) Write a function to compute area under curve (AUC) for the ROC curve.
[4 pts]
8) [Extra Credit for 478 and Mandatory for 878] Write a function to generate the
precision-recall curve.
[10 pts]

9) Implement a KNN_Classifier model class. It should have the following three
methods.
[20 pts]


a)
fit(self, X, Y, n_neighbors, weights, kwargs)

This method simply needs to store the relevant values as instance variables.


Arguments:
X : ndarray
A numpy array with rows representing data samples and columns
representing features.

Y : ndarray
A 1D numpy array with labels corresponding to each row of the feature
matrix X.

n_neighbors : int
The number of nearest neighbors.


4

weights : string, optional (default = ‘uniform’)
The weight function used in prediction. Possible values:

- ‘uniform’: uniform weights. All points in each neighborhood
are weighted equally.
- ‘distance’: weight points by the inverse of their distance. in this
case, closer neighbors of a query point will have a greater
influence than neighbors which are further away.
[Extra Credit for 478 and Mandatory for 878]
[5 pts]



kwargs :
Dictionary of arguments to be passed to the distance function (this
will not be used with our simple distance functions, but is
important for more complex functions). If you are not familiar,
look up using the ** operator to unpack dictionaries as arguments.


Returns:
No return value necessary.


b)
predict(self, X)

This method will use the instance variables stored by the fit method.


Arguments:
X : ndarray
A numpy array containing samples to be used for prediction. Its rows
represent data samples and columns represent features.


Returns:
1D array of predictions for each row in X.
The 1D array should be designed as a column vector.


c)
__init__(self)
It’s a standard python initialization function so we can instantiate the
class. Just “pass” this.

5


Part B: Data Processing (478: 20 pts & 878: 25 pts)

10) Read in the winequality-white.csv file as a Pandas data frame.
11) The target will be the “quality” column which represents rating of wine and
ranges from 3 to 8. You will need to convert it into a two-category variable
consisting of “good” (quality > 5) & “bad” (quality <= 5). Your target vector
should have 0s (representing “bad” quality wine) and 1s (representing “good”
quality wine).
12) Use the techniques from the recitation to summarize each of the variables in the
dataset in terms of mean, standard deviation, and quartiles. Include this in your
report. [3 pts]
13) Shuffle the rows of your data. You can use def = df.sample(frac=1) as an
idiomatic way to shuffle the data in Pandas without losing column names.
[2 pts]
14) Generate pair plots using the seaborn package (see 2nd recitation notebook). This
will be used to identify and report the redundant features, if there is any.
[2 pts]
15) Drop the redundant features. [1 pts]
16) Write a function named “partition” to split your data into train and test set. The
function should take 3 arguments: feature matrix (numpy array with rows
representing data samples and columns representing features.), target vector
(numpy array with labels corresponding to each row of the feature matrix), t. Here
t is a real number to determine the size of partition. For example, if t is set to 0.2,
then 80% of the data will be used for training and 20% for testing. This function
should return two feature matrices for train and test data, and two target vectors
for train and test data. [6 pts]

17) Naively run your KNN_Classifier model on the train dataset with n_neighbors =
5 and using Euclidean distance. [6 pts]

a. Use accuracy and F1 score to compare your predictions to the expected
labels.
b. Now standardize each feature of your training set (subtract mean and
divide by standard deviation). Use the mean and standard deviation values
for each feature in the training set to scale the test data.
c. Re-run the KNN_Classifier model on the standardized data, find the
accuracy and F1 score with the expected labels.
d. Compare the two accuracy values and the F1 scores; and decide whether
you should use standardized data or unscaled data for the remainder of the
assignment. This will go in the report
e. [Extra Credit for 478 and Mandatory for 878] Perform a similar test for
inverse distance weighting in the KNN_Classifier model and determine
whether or not to use it. This will go in the report. [5 pts]


6

Part C: Model Evaluation (478: 40 pts & 878: 50 pts)

18) Evaluation of an estimator performance via cross-validation: Implement the
S-fold cross-validation function. [10 pts]
a. sFold(folds, data, labels, model, model_args, error_fuction)
i. folds is an integer number of folds.
ii. data is a numpy array with rows representing data samples and
columns representing features.
iii. labels is a numpy array with labels corresponding to each row of
training_features.
iv. model is an object with the fit and predict methods.
v. model args is a dictionary of arguments to pass to the classification
algorithm. If you are unfamiliar, look up using the ** operator to
unpack dictionaries as arguments
vi. error_function
1. Returns error value between predicted and true labels. For
example, mean squared error (mse) function could be used
as error_function.
b. How it should work:
i. Use a helper function to calculate an s-partition of the data (i.e.,
partition the data into s equally sized portions)
ii. For each partition
1. Make a model using the model class
2. Fit the data to all other partitions (1 – folds)
3. Make prediction on current partition
4. Store expected labels and predicted labels for current
partition
iii. Calculate the average error (for all partitions) using the
error_function on stored expected and predicted labels.
c. It should return:
i. A Python dictionary with the following elements
1. Expected labels
2. Predicted labels
3. Average error

19) Use your sfold function to evaluate the performance of your model over each
combination of k and distance metrics from the following sets: [5 pts]
a. k = [1, 5, 9, 11]
b. distance = [Euclidean, Manhattan]
c. [Extra Credit for 478 and Mandatory for 878] weights = [uniform,
distance] [5 pts]
d. Store the returned dictionary for each. We will need these for the report.
e. Determine the best model based on the overall performance (lowest
average error). For the error_function of the S-fold function argument use
the F1 score function.
7

20) Evaluate your model on the test data and report the performance measures.
[10 pts]
a. Precision
b. Recall
c. F1 score
d. Confusion matrix
e. Accuracy & Generalization Error

21) Generate the ROC curve and determine the optimal threshold. This will go in your
report.
[8 pts]

22) Compute the AUC score.
[2 pts]

23) [Extra Credit for 478 and Mandatory for 878] Generate the precision-recall
curve and determine the optimal threshold.
[5 pts]

24) Calculate and report the 95% confidence interval on the generalization error
estimate.
[5 pts]



Part D: Written Report (30 pts)

25) Write a “Data Summary” section.
[10 pts]

a. Describe the dataset and the variables. What is the target? What are you
calculating it from?
b. Include a table of each variables’ descriptive statistics (mean, standard
deviation, quartiles).
c. Describe whether or not you used feature scaling and why or why not.
d. Describe whether or not you dropped any feature and why or why not.

26) Write a “Methods” section.
[10 pts]

a. Describe the runtime complexity of the KNN_Classifier model.
b. Explain the effects of increasing k. When is and isn’t it (increasing k)
effective?
c. 878 students: Describe whether or not you used inverse distance
weighting in the features and why.

8
27) Write a “Results” section.
[10 pts]

a. Describe the performance of the model with respect to the different levels
of k and the different distance metrics. Include a table of performances,
bolding the best.
b. Characterize the overall performance of your model.
c. Discuss on which quality values your model performed well, and on which
it performed poorly. Include a table of average error (e.g., F1 score) to
support your claims.
d. Give any final conclusions.



At the beginning of the report clearly specify the amount of contribution by each
member of the team. One submission per group is required.

51作业君 51作业君

Email:51zuoyejun

@gmail.com

添加客服微信: ITCSdaixie