辅导案例-CSCE 478/878 -Assignment 1
1 Introduction to Machine Learning CSCE 478/878 Programming Assignment 1 Fall 2020 K Nearest Neighbors Model for Binary Classification Basic Info Late submission is not allowed on the 1st programming assignment. You will work in teams of maximum three students. The programming code will be graded on both implementation and correctness. The written report will be graded on content, conclusions, and presentation. It must be formatted according to the given template (posted on Canvas). The report will be graded as if the values obtained from the code portion were correct. The report should be short and to the point. The length should be between 2 to 4 pages of text plus any tables and graphs. Assignment Goals and Tasks This assignment is intended to build the following skills: 1. Implementation of the brute force K-NN algorithm for binary classification 2. Data pre-processing and feature selection techniques 3. Model evaluation techniques Furthermore, the functions you develop for the data pre-processing and model evaluation will form a basis for future assignments. In particular, you will be evaluating various K-NN models on the white wine portion of the Wine Quality dataset from UCI’s repository. In order to carry out this evaluation, you will: 2 1. Implement two distance metrics (Euclidean and Manhattan) 2. Implement the brute-force K-NN algorithm 3. Implement a cross-validation function 4. Apply scaling to the dataset 5. Perform feature selection 6. Evaluate your model 7. Write a short report Assignment Instructions Note: you are not allowed to use any Scikit-Learn models or functions. i) The code should be written in a Jupyter notebook and the report should be prepared as a PDF file. a. Name the notebook `______a ssignment1.ipynb` b. Name the PDF `______< assignment1.pdf` ii) The Jupyter notebook and the report should be submitted via webhandin. Only one submission is required for each group. iii) Use the cover page (posted on Canvas) as the front page of your report. iv) Download the wine quality dataset from: http://archive.ics.uci.edu/ml/datasets/Wine+Quality You will be using the white wine dataset: “winequality-white.csv” ________________________________________________________________________ Score Distribution Part A (Model Code): 478 (50 pts) & 878 (65 pts) Part B (Data Processing): 478 (20 pts) & 878 (25 pts) Part C (Model Evaluation): 478 (40 pts) & 878 (50 pts) Part D (Written Report): 478 & 878 (30 pts) Total: 478 (140 pts) & 878 (170 pts) ________________________________________________________________________ 3 Part A: Model Code (478: 50 pts & 878: 65 pts) 1) Write a function to calculate and return the Euclidean distance of two vectors. [2 pts] 2) Write a function to calculate and return the Manhattan distance of two vectors. [2 pts] 3) Write a function to calculate and return the accuracy and generalization error of two vectors. [4 pts] 4) Write three functions to compute: precision, recall and F1 score. [6 pts] 5) Write a function to compute the confusion matrix of two vectors. [4 pts] 6) Write a function to generate the Receiver Operating Characteristic (ROC) curve. [8 pts] 7) Write a function to compute area under curve (AUC) for the ROC curve. [4 pts] 8) [Extra Credit for 478 and Mandatory for 878] Write a function to generate the precision-recall curve. [10 pts] 9) Implement a KNN_Classifier model class. It should have the following three methods. [20 pts] a) fit(self, X, Y, n_neighbors, weights, kwargs) This method simply needs to store the relevant values as instance variables. Arguments: X : ndarray A numpy array with rows representing data samples and columns representing features. Y : ndarray A 1D numpy array with labels corresponding to each row of the feature matrix X. n_neighbors : int The number of nearest neighbors. 4 weights : string, optional (default = ‘uniform’) The weight function used in prediction. Possible values: - ‘uniform’: uniform weights. All points in each neighborhood are weighted equally. - ‘distance’: weight points by the inverse of their distance. in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away. [Extra Credit for 478 and Mandatory for 878] [5 pts] kwargs : Dictionary of arguments to be passed to the distance function (this will not be used with our simple distance functions, but is important for more complex functions). If you are not familiar, look up using the ** operator to unpack dictionaries as arguments. Returns: No return value necessary. b) predict(self, X) This method will use the instance variables stored by the fit method. Arguments: X : ndarray A numpy array containing samples to be used for prediction. Its rows represent data samples and columns represent features. Returns: 1D array of predictions for each row in X. The 1D array should be designed as a column vector. c) __init__(self) It’s a standard python initialization function so we can instantiate the class. Just “pass” this. 5 Part B: Data Processing (478: 20 pts & 878: 25 pts) 10) Read in the winequality-white.csv file as a Pandas data frame. 11) The target will be the “quality” column which represents rating of wine and ranges from 3 to 8. You will need to convert it into a two-category variable consisting of “good” (quality > 5) & “bad” (quality <= 5). Your target vector should have 0s (representing “bad” quality wine) and 1s (representing “good” quality wine). 12) Use the techniques from the recitation to summarize each of the variables in the dataset in terms of mean, standard deviation, and quartiles. Include this in your report. [3 pts] 13) Shuffle the rows of your data. You can use def = df.sample(frac=1) as an idiomatic way to shuffle the data in Pandas without losing column names. [2 pts] 14) Generate pair plots using the seaborn package (see 2nd recitation notebook). This will be used to identify and report the redundant features, if there is any. [2 pts] 15) Drop the redundant features. [1 pts] 16) Write a function named “partition” to split your data into train and test set. The function should take 3 arguments: feature matrix (numpy array with rows representing data samples and columns representing features.), target vector (numpy array with labels corresponding to each row of the feature matrix), t. Here t is a real number to determine the size of partition. For example, if t is set to 0.2, then 80% of the data will be used for training and 20% for testing. This function should return two feature matrices for train and test data, and two target vectors for train and test data. [6 pts] 17) Naively run your KNN_Classifier model on the train dataset with n_neighbors = 5 and using Euclidean distance. [6 pts] a. Use accuracy and F1 score to compare your predictions to the expected labels. b. Now standardize each feature of your training set (subtract mean and divide by standard deviation). Use the mean and standard deviation values for each feature in the training set to scale the test data. c. Re-run the KNN_Classifier model on the standardized data, find the accuracy and F1 score with the expected labels. d. Compare the two accuracy values and the F1 scores; and decide whether you should use standardized data or unscaled data for the remainder of the assignment. This will go in the report e. [Extra Credit for 478 and Mandatory for 878] Perform a similar test for inverse distance weighting in the KNN_Classifier model and determine whether or not to use it. This will go in the report. [5 pts] 6 Part C: Model Evaluation (478: 40 pts & 878: 50 pts) 18) Evaluation of an estimator performance via cross-validation: Implement the S-fold cross-validation function. [10 pts] a. sFold(folds, data, labels, model, model_args, error_fuction) i. folds is an integer number of folds. ii. data is a numpy array with rows representing data samples and columns representing features. iii. labels is a numpy array with labels corresponding to each row of training_features. iv. model is an object with the fit and predict methods. v. model args is a dictionary of arguments to pass to the classification algorithm. If you are unfamiliar, look up using the ** operator to unpack dictionaries as arguments vi. error_function 1. Returns error value between predicted and true labels. For example, mean squared error (mse) function could be used as error_function. b. How it should work: i. Use a helper function to calculate an s-partition of the data (i.e., partition the data into s equally sized portions) ii. For each partition 1. Make a model using the model class 2. Fit the data to all other partitions (1 – folds) 3. Make prediction on current partition 4. Store expected labels and predicted labels for current partition iii. Calculate the average error (for all partitions) using the error_function on stored expected and predicted labels. c. It should return: i. A Python dictionary with the following elements 1. Expected labels 2. Predicted labels 3. Average error 19) Use your sfold function to evaluate the performance of your model over each combination of k and distance metrics from the following sets: [5 pts] a. k = [1, 5, 9, 11] b. distance = [Euclidean, Manhattan] c. [Extra Credit for 478 and Mandatory for 878] weights = [uniform, distance] [5 pts] d. Store the returned dictionary for each. We will need these for the report. e. Determine the best model based on the overall performance (lowest average error). For the error_function of the S-fold function argument use the F1 score function. 7 20) Evaluate your model on the test data and report the performance measures. [10 pts] a. Precision b. Recall c. F1 score d. Confusion matrix e. Accuracy & Generalization Error 21) Generate the ROC curve and determine the optimal threshold. This will go in your report. [8 pts] 22) Compute the AUC score. [2 pts] 23) [Extra Credit for 478 and Mandatory for 878] Generate the precision-recall curve and determine the optimal threshold. [5 pts] 24) Calculate and report the 95% confidence interval on the generalization error estimate. [5 pts] Part D: Written Report (30 pts) 25) Write a “Data Summary” section. [10 pts] a. Describe the dataset and the variables. What is the target? What are you calculating it from? b. Include a table of each variables’ descriptive statistics (mean, standard deviation, quartiles). c. Describe whether or not you used feature scaling and why or why not. d. Describe whether or not you dropped any feature and why or why not. 26) Write a “Methods” section. [10 pts] a. Describe the runtime complexity of the KNN_Classifier model. b. Explain the effects of increasing k. When is and isn’t it (increasing k) effective? c. 878 students: Describe whether or not you used inverse distance weighting in the features and why. 8 27) Write a “Results” section. [10 pts] a. Describe the performance of the model with respect to the different levels of k and the different distance metrics. Include a table of performances, bolding the best. b. Characterize the overall performance of your model. c. Discuss on which quality values your model performed well, and on which it performed poorly. Include a table of average error (e.g., F1 score) to support your claims. d. Give any final conclusions. At the beginning of the report clearly specify the amount of contribution by each member of the team. One submission per group is required.