2020/11/30 4:00 PM  Copy of "COMP7404C 2021 - Assignment 4" - Colaboratory
https://colab.research.google.com/drive/13s70d0RXOC34UUOu9T-GIn…N5CGTtN?userstoinvite=zhangjie98721%40gmail.com&actionButton=1

COMP7404 - Assignment 4

Part A: Conceptual Questions

Solve the following questions by hand. No need to implement any code.

A1

Consider a Perceptron with 2 inputs and 1 output. Let the weights of the Perceptron be $w_1 = 1$ and $w_2 = 1$ and let the bias be $w_0 = -1.5$. Calculate the output for each of the following inputs: (0, 0), (1, 0), (0, 1), (1, 1)

Your answer here

A2

Define a perceptron for each of the following logical functions: AND, NOT, NAND, NOR

Your answer here

A3

The parity problem returns 1 if the number of inputs that are 1 is even, and 0 otherwise. Can a perceptron learn this problem for 3 inputs?

Your answer here

A4

Suppose that the following are a set of points in two classes:

Class 1: (1, 1), (1, 2), (2, 1)
Class 2: (0, 0), (1, 0), (0, 1)

Plot them and find the optimal separating line. What are the support vectors, and what is their meaning?

Your answer here

A5

Suppose that the probabilities of five events are $P(first) = 0.5$, $P(second) = P(third) = P(fourth) = P(fifth) = 0.125$. Calculate the entropy and write down in words what this means.

Your answer here

A6

Design a decision tree that computes the logical AND function. How does it compare to the Perceptron solution?

Your answer here

A7

Turn the following politically incorrect data into a decision tree to classify which attributes make a person attractive, and then extract the rules. Use the Gini impurity.

Height  Hair    Eyes   Attractive?
Small   Blonde  Brown  No
Tall    Dark    Brown  No
Tall    Blonde  Blue   Yes
Tall    Dark    Blue   No
Small   Dark    Blue   No
Tall    Red     Blue   Yes
Tall    Blonde  Brown  No
Small   Blonde  Blue   Yes

Your answer here

A8

Suppose we collect data for a group of students in a postgraduate machine learning class with features $x_1$ = hours studied, $x_2$ = undergraduate GPA and label $y$ = receive an A. We fit a logistic regression and produce estimated weights as follows: $w_0 = -6$, $w_1 = 0.05$, $w_2 = 1$.

1. Estimate the probability that a student who studies for 40h and has an undergraduate GPA of 3.5 gets an A in the class.
2. How many hours would the student in part 1 need to study to have a 50% chance of getting an A in the class?

Your answer here

A9

Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next we use 1-nearest neighbors (i.e., K=1) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?

Your answer here

A10

Suppose the features in your training set have very different scales. Which algorithms discussed in class might suffer from this, and how? What can you do about it?

Your answer here

A11

If your AdaBoost ensemble underfits the training data, which hyperparameters should you tweak and how?

Your answer here

A12

What is the benefit of out-of-bag evaluation?

Your answer here

A13

What is the difference between hard and soft voting classifiers?

Your answer here

Part B: Applied Questions

Solve the following questions by implementing solutions in code.

B1

Consider the following Perceptron code.

    import numpy as np

    class Perceptron(object):
        """Perceptron classifier.

        Parameters
        ------------
        eta : float
            Learning rate (between 0.0 and 1.0)
        n_iter : int
            Passes over the training dataset.

        Attributes
        -----------
        w_ : 1d-array
            Weights after fitting.
        errors_ : list
            Number of misclassifications in every epoch.

        """
        def __init__(self, eta=0.01, n_iter=10):
            self.eta = eta
            self.n_iter = n_iter

        def fit(self, X, y):
            """Fit training data.

            Parameters
            ----------
            X : {array-like}, shape = [n_samples, n_features]
                Training vectors, where n_samples is the number of samples
                and n_features is the number of features.
            y : array-like, shape = [n_samples]
                Target values.

            Returns
            -------
            self : object

            """
            # w_[0] is the bias, w_[1:] are the feature weights
            self.w_ = np.zeros(1 + X.shape[1])
            self.errors_ = []

            for _ in range(self.n_iter):
                errors = 0
                for xi, target in zip(X, y):
                    # update is zero when the sample is classified correctly
                    update = self.eta * (target - self.predict(xi))
                    self.w_[1:] += update * xi
                    self.w_[0] += update
                    errors += int(update != 0.0)
                self.errors_.append(errors)
            return self

        def net_input(self, X):
            """Calculate net input"""
            return np.dot(X, self.w_[1:]) + self.w_[0]

        def predict(self, X):
            """Return class label after unit step"""
            return np.where(self.net_input(X) >= 0.0, 1, -1)

    import pandas as pd

    data_src = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    df = pd.read_csv(data_src, header=None)
    df.tail()

Output:

           0    1    2    3               4
    145  6.7  3.0  5.2  2.3  Iris-virginica
    146  6.3  2.5  5.0  1.9  Iris-virginica
    147  6.5  3.0  5.2  2.0  Iris-virginica
    148  6.2  3.4  5.4  2.3  Iris-virginica
    149  5.9  3.0  5.1  1.8  Iris-virginica

    %matplotlib inline
    import matplotlib.pyplot as plt
    import numpy as np

    # select setosa and versicolor
    y = df.iloc[0:100, 4].values
    y = np.where(y == 'Iris-setosa', -1, 1)

    # extract sepal length and petal length
    X = df.iloc[0:100, [0, 2]].values

    # plot data
    plt.scatter(X[:50, 0], X[:50, 1], color='red', marker='o', label='setosa')
    plt.scatter(X[50:100, 0], X[50:100, 1], color='blue', marker='x', label='versicolor')

    plt.xlabel('sepal length [cm]')
    plt.ylabel('petal length [cm]')
    plt.legend(loc='upper left')

    plt.tight_layout()
    plt.show()

    ppn = Perceptron(eta=0.1, n_iter=10)
    ppn = ppn.fit(X, y)
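The errors_ attribute described in the Perceptron docstring records the number of updates per epoch, so a run has converged once that count reaches zero. The following is a minimal, self-contained sketch of that check. It uses a condensed stand-in for the fit loop (the function name train_perceptron is hypothetical) and a toy AND data set rather than the Iris data; the learning rate 0.5 is an arbitrary choice that keeps the arithmetic exact.

```python
import numpy as np


def train_perceptron(X, y, eta=0.5, n_iter=10):
    """Condensed stand-in for Perceptron.fit (hypothetical helper).

    Returns the weight vector and the number of updates in each epoch.
    """
    w = np.zeros(1 + X.shape[1])  # w[0] is the bias, w[1:] are the input weights
    errors_per_epoch = []
    for _ in range(n_iter):
        errors = 0
        for xi, target in zip(X, y):
            # unit step on the net input, same threshold as Perceptron.predict
            prediction = 1 if np.dot(xi, w[1:]) + w[0] >= 0.0 else -1
            update = eta * (target - prediction)
            w[1:] += update * xi
            w[0] += update
            errors += int(update != 0.0)
        errors_per_epoch.append(errors)
    return w, errors_per_epoch


# Toy data (an assumption for illustration): logical AND with labels -1 / 1
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_toy = np.array([-1, -1, -1, 1])

w, errors_per_epoch = train_perceptron(X_toy, y_toy)
predictions = np.where(X_toy @ w[1:] + w[0] >= 0.0, 1, -1)

print("errors per epoch:", errors_per_epoch)
print("converged:", errors_per_epoch[-1] == 0)
```

The same inspection applies to the fitted ppn object by reading ppn.errors_; a count that never reaches zero suggests either too few epochs or data that is not linearly separable.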
    from matplotlib.colors import ListedColormap

    def plot_decision_regions(X, y, classifier, resolution=0.01):
        # one marker and one color per class label
        markers = ('s', 'x', 'o', '^', 'v')
        colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
        cmap = ListedColormap(colors[:len(np.unique(y))])

        # build a grid that covers the data with a margin of 1 on each side
        x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                               np.arange(x2_min, x2_max, resolution))

        # classify every grid point and draw the regions as filled contours
        Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
        Z = Z.reshape(xx1.shape)
        plt.contourf(xx1, xx2, Z, alpha=0.3, cmap=cmap)
        plt.xlim(xx1.min(), xx1.max())
        plt.ylim(xx2.min(), xx2.max())

        # overlay the samples of each class
        for idx, cl in enumerate(np.unique(y)):
            plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                        alpha=0.8, c=colors[idx], marker=markers[idx],
                        label=cl, edgecolor='black')

As shown in the function plot_decision_regions, the decision regions can be visualized by dense sampling via meshgrid. However, if the grid resolution is too coarse, as artificially set below, the boundary will appear inaccurate. Implement the function plot_decision_boundary below to analytically compute and plot the decision boundary.
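As a hedged starting point, the analytic boundary mentioned above can be derived from the net input of the Perceptron code. The symbols $w_0$, $w_1$, $w_2$ are assumed to correspond to classifier.w_[0], classifier.w_[1], classifier.w_[2], and $x_1$, $x_2$ to the two plotted features; this is a sketch of the algebra, not a prescribed solution.

```latex
\[
\begin{aligned}
z &= w_0 + w_1 x_1 + w_2 x_2 && \text{(net input of the Perceptron)} \\
0 &= w_0 + w_1 x_1 + w_2 x_2 && \text{(decision boundary, } z = 0\text{)} \\
x_2 &= -\frac{w_0 + w_1 x_1}{w_2} && \text{(solved for } x_2\text{, assuming } w_2 \neq 0\text{)}
\end{aligned}
\]
```

Evaluating the last line at the two ends of the $x_1$ interval gives two points, and the straight line through them is the boundary. If $w_2 = 0$ the boundary is vertical and would need separate handling.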
    def plot_decision_boundary(X, y, classifier):
        # replace the two lines below with your code
        x1_interval = [X[:, 0].min() - 1, X[:, 0].max() + 1]
        x2_interval = [X[:, 1].min() - 1, X[:, 1].max() + 1]
        plt.plot(x1_interval, x2_interval, color='green', linewidth=4, label='boundary')

    low_res = 0.1 # intentional for this exercise
    plot_decision_regions(X, y, classifier=ppn, resolution=low_res)
    plot_decision_boundary(X, y, classifier=ppn)

    plt.xlabel('sepal length [cm]')
    plt.ylabel('petal length [cm]')
    plt.legend(loc='upper left')

    plt.tight_layout()
    plt.show()

B2

In class we applied different scikit-learn classifiers to the Iris data set. In this question, we will apply the same set of classifiers to a different data set: handwritten digits. Please write the code for the different classifiers, choose their hyperparameters, and compare their performance via the accuracy score, as we did for the Iris data set. Which classifier(s) perform(s) the best and worst, and why?

The classifiers include:

- perceptron
- logistic regression
- SVM
- decision tree
- random forest
- KNN

The dataset is available as part of scikit-learn, as follows.

    from sklearn.datasets import load_digits

    digits = load_digits()
    X = digits.data    # feature matrix, one flattened 8x8 image per row
    y = digits.target  # labels
    print(X.shape)
    print(y.shape)

Output:

    (1797, 64)
    (1797,)

    import matplotlib.pyplot as plt
    import pylab as pl
    import matplotlib as mpl
    mpl.rcParams['figure.dpi'] = 150

    num_rows = 4
    num_cols = 5

    # show the first num_rows * num_cols digit images with their labels
    fig, ax = plt.subplots(nrows=num_rows, ncols=num_cols, sharex=True, sharey=True)
    ax = ax.flatten()
    for index in range(num_rows*num_cols):
        img = digits.images[index]
        label = digits.target[index]
        ax[index].imshow(img, cmap='Greys', interpolation='nearest')
        ax[index].set_title('digit ' + str(label))

    ax[0].set_xticks([])
    ax[0].set_yticks([])
    plt.tight_layout()
    plt.show()

Data Preprocessing

Hint: Split the data into training and test sets, and apply other techniques we have learned if needed.

    #Your code comes here

Classifier #1 Perceptron

    #Your code, including training and testing, to observe the accuracies.

Classifier #2 Logistic Regression

    #Your code, including training and testing, to observe the accuracies.

Classifier #3 SVM

    #Your code, including training and testing, to observe the accuracies.

Classifier #4 Decision Tree

    #Your code, including training and testing, to observe the accuracies.

Classifier #5 Random Forest

    #Your code, including training and testing, to observe the accuracies.

Classifier #6 KNN

    #Your code, including training and testing, to observe the accuracies.
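The workflow the hint describes (split, scale, fit, score) could look roughly like the sketch below for one of the six classifiers. The split ratio, random seed, use of StandardScaler, and n_neighbors=5 are all illustrative assumptions rather than recommended settings, and the other classifiers would slot into the same pattern.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

digits = load_digits()

# Hold out 30% of the samples for testing; stratify keeps class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    test_size=0.3, random_state=1, stratify=digits.target)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# n_neighbors=5 is an arbitrary starting value, not a tuned hyperparameter
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_std, y_train)

train_accuracy = accuracy_score(y_train, knn.predict(X_train_std))
test_accuracy = accuracy_score(y_test, knn.predict(X_test_std))
print(f"KNN train accuracy: {train_accuracy:.3f}")
print(f"KNN test accuracy:  {test_accuracy:.3f}")
```

Reporting both training and test accuracy makes it easier to spot overfitting when comparing the classifiers.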
B3

Build a spam classifier:

- Download examples of spam and ham from Apache SpamAssassin's public datasets.
- Unzip the datasets and familiarize yourself with the data format.
- Split the datasets into a training set and a test set.
- Write a data preparation pipeline to convert each email into a feature vector. Your preparation pipeline should transform an email into a (sparse) vector that indicates the presence or absence of each possible word. For example, if all emails only ever contain four words, "Hello," "how," "are," "you," then the email "Hello you Hello Hello you" would be converted into a vector [1, 0, 0, 1] (meaning ["Hello" is present, "how" is absent, "are" is absent, "you" is present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of each word.
- You may want to add hyperparameters to your preparation pipeline to control whether or not to strip off email headers, convert each email to lowercase, remove punctuation, replace all URLs with "URL," replace all numbers with "NUMBER," or even perform stemming (i.e., trim off word endings; there are Python libraries available to do this).
- Finally, try out several classifiers and see if you can build a great spam classifier, with both high recall and high precision.

    #Your answer here
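The four-word example in the task can be reproduced with a few lines of standard-library Python. This is only a sketch of the vectorization step: the function name email_to_vector, its parameters, and the simple letters-only tokenizer are assumptions for illustration, and a real pipeline would also need header stripping, URL and number replacement, and a sparse representation.

```python
import re
from collections import Counter


def email_to_vector(text, vocabulary, binary=True, lowercase=False):
    """Map an email body to a word-presence or word-count vector.

    Hypothetical helper: tokenizes on runs of letters and apostrophes only.
    """
    if lowercase:
        text = text.lower()
        vocabulary = [word.lower() for word in vocabulary]
    counts = Counter(re.findall(r"[A-Za-z']+", text))
    if binary:
        return [int(counts[word] > 0) for word in vocabulary]
    return [counts[word] for word in vocabulary]


vocabulary = ["Hello", "how", "are", "you"]
email = "Hello you Hello Hello you"

presence = email_to_vector(email, vocabulary)             # presence / absence
occurrences = email_to_vector(email, vocabulary, binary=False)  # word counts

print(presence)
print(occurrences)
```

The binary flag plays the role of one of the pipeline hyperparameters mentioned in the task; further switches (punctuation removal, stemming) could be added in the same way.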