# Tutoring Case: COMP7404C Assignment 4

2020/11/30 4:00 PM, copy of “COMP7404C 2021 - Assignment 4” - Colaboratory
Part A: Conceptual Questions
Solve the following questions by hand. No need to implement any code.
COMP7404 - Assignment 4
Consider a Perceptron with 2 inputs and 1 output. Let the weights of the Perceptron be
$w_1 = 1$ and $w_2 = 1$, and let the bias be $w_0 = -1.5$. Calculate the output of the following
inputs: (0, 0), (1, 0), (0, 1), (1, 1)
A1
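A hand calculation for A1 can be checked with a short sketch like the one below. It assumes a unit-step activation that outputs 1 when the net input is non-negative and 0 otherwise; the question itself does not fix the output coding, and the helper name `perceptron_output` is illustrative only.

```python
def perceptron_output(x1, x2, w1=1.0, w2=1.0, w0=-1.5):
    """Return 1 if w1*x1 + w2*x2 + w0 >= 0, else 0 (assumed step activation)."""
    net = w1 * x1 + w2 * x2 + w0
    return 1 if net >= 0 else 0


inputs = [(0, 0), (1, 0), (0, 1), (1, 1)]
outputs = [perceptron_output(x1, x2) for x1, x2 in inputs]

for point, out in zip(inputs, outputs):
    print(point, "->", out)
```

Under this assumption only (1, 1) produces a positive net input (0.5), so the unit behaves like a logical AND.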
Define a perceptron for the following logical functions: AND, NOT, NAND, NOR
A2
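One way to check candidate weights for A2 is to run them over the full truth table. The weights below are one possible choice, not the only valid one, and the sketch again assumes a step activation with outputs in {0, 1}; `make_perceptron` is a hypothetical helper.

```python
def make_perceptron(weights, bias):
    """Build a step-activation unit: 1 if sum(w*x) + bias >= 0, else 0."""
    def unit(*inputs):
        net = sum(w * x for w, x in zip(weights, inputs)) + bias
        return 1 if net >= 0 else 0
    return unit


# One possible weight choice per function (many others work too)
AND = make_perceptron([1, 1], -1.5)
NOT = make_perceptron([-1], 0.5)
NAND = make_perceptron([-1, -1], 1.5)
NOR = make_perceptron([-1, -1], 0.5)

pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
print("AND ", [AND(a, b) for a, b in pairs])
print("NAND", [NAND(a, b) for a, b in pairs])
print("NOR ", [NOR(a, b) for a, b in pairs])
print("NOT ", [NOT(a) for a in (0, 1)])
```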
The parity problem returns 1 if the number of inputs that are 1 is even, and 0 otherwise. Can a
perceptron learn this problem for 3 inputs?
A3
Suppose that the following are a set of points in two classes:
Class1: (1, 1), (1, 2), (2, 1)
Class2: (0, 0), (1, 0), (0, 1)
Plot them and find the optimal separating line. What are the support vectors, and what is the
meaning?
A4
Suppose that the probabilities of five events are
$P(\text{first}) = 0.5$, $P(\text{second}) = P(\text{third}) = P(\text{fourth}) = P(\text{fifth}) = 0.125$. Calculate
the entropy and write down in words what this means.
A5
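The entropy in A5 follows from $H = -\sum_i p_i \log_2 p_i$. A minimal sketch for checking the hand result, assuming entropy is measured in bits (base-2 logarithm); `entropy_bits` is an illustrative name:

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits; zero-probability events contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


probs = [0.5, 0.125, 0.125, 0.125, 0.125]
H = entropy_bits(probs)
print(H)  # 0.5 * 1 + 4 * (0.125 * 3) = 2.0 bits
```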
Design a decision tree that computes the logical AND function. How does it compare to the
Perceptron solution?
A6
Turn the following politically incorrect data into a decision tree to classify which attributes
make a person attractive, and then extract the rules. Use the Gini Impurity.
Height Hair Eyes Attractive?
Small Blonde Brown No
Tall Dark Brown No
Tall Blonde Blue Yes
Tall Dark Blue No
Small Dark Blue No
Tall Red Blue Yes
Tall Blonde Brown No
Small Blonde Blue Yes
A7
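To compare candidate root splits for A7, the weighted Gini impurity of each attribute can be computed as sketched below. This assumes a multiway split (one branch per attribute value); a binary-split procedure such as CART would group values differently. The function names are illustrative.

```python
from collections import Counter, defaultdict

# (Height, Hair, Eyes, Attractive?) rows copied from the table above
rows = [
    ("Small", "Blonde", "Brown", "No"),
    ("Tall", "Dark", "Brown", "No"),
    ("Tall", "Blonde", "Blue", "Yes"),
    ("Tall", "Dark", "Blue", "No"),
    ("Small", "Dark", "Blue", "No"),
    ("Tall", "Red", "Blue", "Yes"),
    ("Tall", "Blonde", "Brown", "No"),
    ("Small", "Blonde", "Blue", "Yes"),
]
FEATURES = {"Height": 0, "Hair": 1, "Eyes": 2}


def gini(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(rows, col):
    """Size-weighted average impurity after splitting on column `col`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())


root_gini = gini([row[-1] for row in rows])
scores = {name: weighted_gini(rows, col) for name, col in FEATURES.items()}
print(root_gini, scores)
```

The attribute with the lowest weighted impurity is the candidate for the root; the same computation can then be repeated on each impure branch.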
Suppose we collect data for a group of students in a postgraduate machine learning class
with features $x_1$ = hours studied, $x_2$ = undergraduate GPA and label $y$ = receive an A. We fit a
logistic regression and produce estimated weights as follows: $w_0 = -6$, $w_1 = 0.05$,
$w_2 = 1$.
1. Estimate the probability that a student who studies for 40h and has an undergraduate
GPA of 3.5 gets an A in the class
2. How many hours would the student in part 1. need to study to have a 50% chance of
getting an A in the class?
A8
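Both parts of A8 reduce to the logistic model $P(y = 1) = 1 / (1 + e^{-z})$ with $z = w_0 + w_1 x_1 + w_2 x_2$. The sketch below can be used to check a hand calculation; it assumes this standard sigmoid form, and the function names are illustrative.

```python
import math

W0, W1, W2 = -6.0, 0.05, 1.0  # weights given in the question


def prob_a(hours, gpa):
    """Probability of an A under the logistic model."""
    z = W0 + W1 * hours + W2 * gpa
    return 1.0 / (1.0 + math.exp(-z))


def hours_for_probability(p, gpa):
    """Invert the model: hours needed so that prob_a(hours, gpa) == p."""
    logit = math.log(p / (1.0 - p))
    return (logit - W0 - W2 * gpa) / W1


print(prob_a(40, 3.5))                  # z = -0.5, about 0.378
print(hours_for_probability(0.5, 3.5))  # z must be 0, about 50 hours
```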
Suppose that we take a data set, divide it into equally-sized training and test sets, and then try
out two different classification procedures. First we use logistic regression and get an error
rate of 20% on the training data and 30% on the test data. Next we use 1-nearest neighbors
(i.e., K=1) and get an average error rate (averaged over both test and training data sets) of
18%. Based on these results, which method should we prefer to use for classification of new
observations? Why?
A9
Suppose the features in your training set have very different scales. Which algorithms
discussed in class might suffer from this, and how? What can you do about it?
A10
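One common remedy for differing feature scales is standardization (zero mean, unit variance per feature). The sketch below uses made-up income and age values purely for illustration and assumes the population standard deviation; libraries such as scikit-learn provide equivalent scalers.

```python
import statistics


def standardize(column):
    """Rescale a feature column to zero mean and unit standard deviation."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population standard deviation
    return [(value - mean) / std for value in column]


# Hypothetical features on very different scales
income = [30000.0, 45000.0, 60000.0, 120000.0]
age = [25.0, 32.0, 47.0, 51.0]

income_scaled = standardize(income)
age_scaled = standardize(age)
print(income_scaled)
print(age_scaled)
```

After scaling, both features contribute on a comparable scale to distances and gradient updates.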
If your AdaBoost ensemble underfits the training data, which hyperparameters should you
tweak and how?
A11
What is the benefit of out-of-bag evaluation?
A12
What is the difference between hard and soft voting classifiers?
A13
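The difference in A13 can be seen on a small example. The sketch below assumes a binary problem and three hypothetical classifiers that each report a probability for class 1; the numbers are invented for illustration.

```python
def hard_vote(probabilities, threshold=0.5):
    """Majority vote over each classifier's own predicted label."""
    votes = [1 if p >= threshold else 0 for p in probabilities]
    return 1 if sum(votes) > len(votes) / 2 else 0


def soft_vote(probabilities, threshold=0.5):
    """Threshold the average of the predicted class-1 probabilities."""
    mean = sum(probabilities) / len(probabilities)
    return 1 if mean >= threshold else 0


# Hypothetical class-1 probabilities from three classifiers
p_class1 = [0.45, 0.45, 0.90]

print(hard_vote(p_class1))  # two of three classifiers vote for class 0
print(soft_vote(p_class1))  # the mean probability is 0.6, so class 1
```

Soft voting lets one highly confident classifier outweigh two weakly confident ones, which hard voting cannot do.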
Part B: Applied Questions
Solve the following questions by implementing solutions in code.
Consider the following Perceptron code.
B1
import numpy as np

class Perceptron(object):
    """Perceptron classifier.

    Parameters
    ------------
    eta : float
        Learning rate (between 0.0 and 1.0)
    n_iter : int
        Passes over the training dataset.

    Attributes
    -----------
    w_ : 1d-array
        Weights after fitting.
    errors_ : list
        Number of misclassifications in every epoch.

    """
    def __init__(self, eta=0.01, n_iter=10):
        self.eta = eta
        self.n_iter = n_iter

    def fit(self, X, y):
        """Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object

        """
        # w_[0] is the bias, w_[1:] are the feature weights
        self.w_ = np.zeros(1 + X.shape[1])
        self.errors_ = []

        for _ in range(self.n_iter):
            errors = 0
            for xi, target in zip(X, y):
                update = self.eta * (target - self.predict(xi))
                self.w_[1:] += update * xi
                self.w_[0] += update
                errors += int(update != 0.0)
            self.errors_.append(errors)
        return self

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.net_input(X) >= 0.0, 1, -1)
import pandas as pd

# The file has no header row, so pandas numbers the columns 0-4
data_src = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
df = pd.read_csv(data_src, header=None)

df.tail()
       0    1    2    3               4
145  6.7  3.0  5.2  2.3  Iris-virginica
146  6.3  2.5  5.0  1.9  Iris-virginica
147  6.5  3.0  5.2  2.0  Iris-virginica
148  6.2  3.4  5.4  2.3  Iris-virginica
149  5.9  3.0  5.1  1.8  Iris-virginica
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# select setosa and versicolor
y = df.iloc[0:100, 4].values
y = np.where(y == 'Iris-setosa', -1, 1)

# extract sepal length and petal length
X = df.iloc[0:100, [0, 2]].values

# plot data
plt.scatter(X[:50, 0], X[:50, 1],
            color='red', marker='o', label='setosa')
plt.scatter(X[50:100, 0], X[50:100, 1],
            color='blue', marker='x', label='versicolor')

plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.legend(loc='upper left')

plt.tight_layout()
plt.show()
ppn = Perceptron(eta=0.1, n_iter=10)
ppn = ppn.fit(X, y)
from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, resolution=0.01):
    # one marker and color per class
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # classify every point of a dense grid covering the data range
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.3, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # overlay the samples of each class
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0],
                    y=X[y == cl, 1],
                    alpha=0.8,
                    c=colors[idx],
                    marker=markers[idx],
                    label=cl,
                    edgecolor='black')
As shown in function plot_decision_regions, the decision regions can be visualized by dense
sampling via meshgrid. However, if the grid resolution is too coarse, as artificially set below,
the boundary will appear inaccurate.
Implement function plot_decision_boundary below to analytically compute and plot the
decision boundary.
def plot_decision_boundary(X, y, classifier):

    # replace the two lines below with your code
    x1_interval = [X[:, 0].min() - 1, X[:, 0].max() + 1]
    x2_interval = [X[:, 1].min() - 1, X[:, 1].max() + 1]

    plt.plot(x1_interval, x2_interval, color='green', linewidth=4, label='boundary')
low_res = 0.1 # intentional for this exercise
plot_decision_regions(X, y, classifier=ppn, resolution=low_res)
plot_decision_boundary(X, y, classifier=ppn)
plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
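For the analytic boundary asked for in plot_decision_boundary, one starting point is that a two-feature perceptron changes its prediction where the net input is zero, i.e. on the line $w_0 + w_1 x_1 + w_2 x_2 = 0$, which gives $x_2 = -(w_0 + w_1 x_1) / w_2$. The sketch below assumes a weight vector laid out as [bias, w1, w2] and $w_2 \neq 0$; the helper name and the example weights are hypothetical, not fitted values.

```python
def boundary_x2(w, x1):
    """Return x2 on the line w[0] + w[1]*x1 + w[2]*x2 = 0."""
    w0, w1, w2 = w
    if w2 == 0:
        raise ValueError("w2 is zero: the boundary is vertical in x1")
    return -(w0 + w1 * x1) / w2


# Hypothetical weights [bias, w1, w2], for illustration only
w_example = [-2.0, 1.0, 2.0]
x1_interval = [0.0, 4.0]
x2_interval = [boundary_x2(w_example, x1) for x1 in x1_interval]
print(x2_interval)
```

Evaluating the line at the two ends of the x1 range gives the two points needed for a single plt.plot call.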
In class we applied different scikit-learn classifiers to the Iris data set.
In this question, we will apply the same set of classifiers to a different data set: handwritten
digits. Please write the code for the different classifiers, choose their hyperparameters,
and compare their performance via the accuracy score, as for the Iris dataset.
Which classifier(s) perform(s) the best and worst, and why?
The classifiers include:
perceptron
logistic regression
SVM
decision tree
random forest
KNN
The dataset is available as part of scikit-learn, as follows.
B2
from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data # features (flattened 8x8 images)
y = digits.target # labels (digits 0-9)

print(X.shape)
print(y.shape)
(1797, 64)
(1797,)
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 150
num_rows = 4
num_cols = 5
fig, ax = plt.subplots(nrows=num_rows, ncols=num_cols, sharex=True, sharey=True)
ax = ax.flatten()
# show the first 20 digit images with their labels
for index in range(num_rows*num_cols):
    img = digits.images[index]
    label = digits.target[index]
    ax[index].imshow(img, cmap='Greys', interpolation='nearest')
    ax[index].set_title('digit ' + str(label))
# axes are shared, so clearing the ticks on one subplot clears them on all
ax[0].set_xticks([])
ax[0].set_yticks([])
plt.tight_layout()
plt.show()
Data Preprocessing
Hint: Divide the data into training and test sets and apply other techniques we have learned if needed.

Classifier #1 Perceptron
# Your code, including training and testing, to observe the accuracies.

Classifier #2 Logistic Regression
# Your code, including training and testing, to observe the accuracies.

Classifier #3 SVM
# Your code, including training and testing, to observe the accuracies.

Classifier #4 Decision Tree
# Your code, including training and testing, to observe the accuracies.

Classifier #5 Random Forest
# Your code, including training and testing, to observe the accuracies.

Classifier #6 KNN
# Your code, including training and testing, to observe the accuracies.

Build a spam classifier:
Unzip the datasets and familiarize yourself with the data format.
Split the datasets into a training set and a test set.
Write a data preparation pipeline to convert each email into a feature vector. Your
preparation pipeline should transform an email into a (sparse) vector that indicates the
presence or absence of each possible word. For example, if all emails only ever contain
four words, “Hello,” “how,” “are,” “you,” then the email “Hello you Hello Hello you” would
be converted into a vector [1, 0, 0, 1] (meaning [“Hello” is present, “how” is absent, “are”
is absent, “you” is present]), or [3, 0, 0, 2] if you prefer to count the number of
occurrences of each word.
You may want to add hyperparameters to your preparation pipeline to control whether or
not to strip off email headers, convert each email to lowercase, remove punctuation,
replace all URLs with “URL,” replace all numbers with “NUMBER,” or even perform
stemming (i.e., trim off word endings; there are Python libraries available to do this).
Finally, try out several classifiers and see if you can build a great spam classifier, with
both high recall and high precision.
B3  
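The word-presence encoding described in B3 can be sketched as follows, using the four-word vocabulary from the example. This is a minimal illustration under stated assumptions: the vocabulary is fixed in advance, tokens are runs of letters and apostrophes, and the result is a dense Python list rather than the sparse vector the task suggests; `email_to_vector` and its `binary` and `lowercase` options are hypothetical names.

```python
import re
from collections import Counter


def email_to_vector(text, vocabulary, binary=True, lowercase=True):
    """Encode `text` as word presence (or counts) over a fixed vocabulary."""
    if lowercase:
        text = text.lower()
        vocabulary = [word.lower() for word in vocabulary]
    counts = Counter(re.findall(r"[A-Za-z']+", text))
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]


vocab = ["Hello", "how", "are", "you"]
email = "Hello you Hello Hello you"

print(email_to_vector(email, vocab))                # [1, 0, 0, 1]
print(email_to_vector(email, vocab, binary=False))  # [3, 0, 0, 2]
```

The `binary` and `lowercase` flags play the role of the pipeline hyperparameters mentioned above; header stripping, URL and number replacement, and stemming would be further options of the same kind.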