辅导案例-CSCE 478/878 -Assignment 2

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

1

Introduction to Machine Learning
CSCE 478/878

Programming Assignment 2

Fall 2020

Linear Regression

Basic Info

You will work in teams of maximum three students from the previous assignment.

The programming code will be graded on both implementation and correctness.

The written report will be graded on content, conclusions, and presentation. It must be
formatted according to the given template (posted on Canvas). The report will be graded
as if the values obtained from the code portion were correct. The report should be short
and to the point. The length should be between 2 to 4 pages of text plus any tables and
graphs.

Assignment Goals and Tasks

This assignment is intended to build the following skills:
1. Implementation of the iterative optimization Gradient Descent algorithms for
solving a Linear Regression problem
2. Implementation of various regularization techniques
3. Polynomial regression
4. Learning curve

2
Assignment Instructions

Note: you are not allowed to use any Scikit-Learn models or functions.

i. The code should be written in a Jupyter notebook and the report should be
prepared as a PDF file.
a. Name the notebook
`____assignment1.ipynb`
b. Name the PDF
`____assignment1.pdf`
ii. The Jupyter notebook and the report should be submitted via webhandin. Only
one submission is required for each group.
iii. Use the cover page (posted on Canvas) as the front page of your report.
iv. Download the wine quality dataset from:
http://archive.ics.uci.edu/ml/datasets/Wine+Quality
You will be using the red wine dataset: “winequality-red.csv”

________________________________________________________________________
Score Distribution

Part A (Model Code): 478 (68 pts) & 878 (78 pts)
Part B (Data Processing): 478 & 878 (7 pts)
Part C (Model Evaluation): 478 (25 pts) & 878 (35 pts)
Part D (Written Report): 478 & 878 (25 pts)
Extra credit (BONUS) tasks for both 478 & 878: 30 pts

Total: 478 (125 pts) & 878 (145 pts)

________________________________________________________________________

Part A: Model Code (478: 68 pts & 878: 78 pts)

1. Implement the following function that generates the polynomial and interaction
features for a given degree of the polynomial.
[10 pts]

polynomialFeatures(X, degree)

Argument:
X : ndarray
A numpy array with rows representing data samples and columns
representing features (d-dimensional feature).

degree : integer
The degree of the polynomial features. Default = 1.
3

Returns:
A new feature matrix consisting of all polynomial combinations of the features
with degree equal to the specified degree. For example, if an input sample is two
dimensional and of the form [a, b], the degree-2 polynomial features are [a, b, a2,
ab, b2].

2. Implement the following function to calculate and return the mean squared error
(mse) of two vectors. [3 pts]

mse(Y_true, Y_pred)

Arguments:
Y_true : ndarray
1D array containing data with “float” type. True y values.

Y_pred : ndarray
1D array containing data with “float” type. Values predicted by your model.

Returns:
cost : float
It returns a float value containing mean squared error between Y_true and
Y_pred.

Note: these 1D arrays should be designed as column vectors.

3. Implement the following function to compute training and validation errors. It
will be used to plot learning curves. The function takes the feature matrix X
(usually the training data matrix) and the training size (from the “train_size”
parameter) and by using cross-validation computes the average mse for the
training fold and the validation fold. It iterates through the entire X with an
increment step of the “train_size”. For example, if there are 50 samples (rows) in
X and the “train_size” is 10, then the function will start from the first 10 samples
and will successively add 10 samples in each iteration. During each iteration it
will use k-fold cross-validation to compute the average mse for the training fold
and the validation fold. Thus, for example, for 50 samples there will be 5
iterations (on 10, 20, 30, 40, and 50 samples) and for each iteration it will
compute the cross-validated average mse for the training and the validation fold.
For training the model (using the “fit” method) it will use the model parameters
from the function argument. The function will return two arrays containing
training and validation root-mean-square error (rmse) values.
[15 pts]

4

learning_curve(model, X, Y, cv, train_size=1, learning_rate=0.01,
epochs=1000, tol=None, regularizer=None, lambd=0.0, **kwargs)

Arguments:

model : object type that implements the “fit” and “predict” methods.
An object of that type which is cloned for each validation.

X : ndarray
A numpy array with rows representing data samples and columns
representing features.

Y : ndarray
A 1D numpy array with labels corresponding to each row of the feature
matrix X.

cv : int
integer, to specify the number of folds in a k-fold cross-validation.

train_sizes : int or float
Relative or absolute numbers of training examples that will be used to
generate the learning curve. If the data type is float, it is regarded as a
fraction of the maximum size of the training set (that is determined by
the selected validation method), i.e. it has to be within (0, 1]. Otherwise
it is interpreted as absolute sizes of the training sets.

learning_rate : float
It provides the step size for parameter update.

epochs : int
The maximum number of passes over the training data for updating the
weight vector.

tol : float or None
The stopping criterion. If it is not None, the iterations will stop when
(error > previous_error - tol). If it is None, the number of iterations will
be set by the “epochs”.

regularizer : string
The string value could be one of the following: l1, l2, None.
If it’s set to None, the cost function without the regularization term will
be used for computing the gradient and updating the weight vector.
However, if it’s set to l1 or l2, the appropriate regularized cost function
needs to be used for computing the gradient and updating the weight
vector.

lambd : float
It provides the regularization coefficient. It is used only when the
“regularizer” is set to l1 or l2.
5

Returns:

train_scores : ndarray
root-mean-square error (rmse) values on training sets.

val_scores : ndarray
root-mean-square error (rmse) values on validation sets.

4. [Extra Credit for 478 and Mandatory for 878] Implement the following
function to plot the training and validation root mean square error (rmse) values
of the data matrix X for various polynomial degree starting from 1 up to the value
set by “maxPolynomialDegree”. It takes the data matrix X (usually the training data
matrix) and the maxPolynomialDegree; and for each polynomial degree it will
augment the data matrix, then use k-fold cross-validation to compute the average
mse for both the training and the validation fold. For training the model (using
the “fit” method) it will use the model parameters from the function argument.
Finally, the function will plot the root-mean-square error (rmse) values for the
training and validation folds for each degree of the data matrix starting from 1 up
to the maxPolynomialDegree.
[10 pts]

plot_polynomial_model_complexity(model, X, Y, cv, maxPolynomialDegree,
learning_rate=0.01, epochs=1000, tol=None, regularizer=None, lambd=0.0,
**kwargs)

Arguments:

model : object type that implements the “fit” and “predict” methods.
An object of that type which is cloned for each validation.

X : ndarray
A numpy array with rows representing data samples and columns
representing features.

Y : ndarray
A 1D numpy array with labels corresponding to each row of the feature
matrix X.

cv : int
integer, to specify the number of folds in a (Stratified)K-Fold,

maxPolynomialDegree : int
It will be used to determine the maximum polynomial degree for X. For
example, if it is set to 3, then the function will compute both the training
and validation mse values for degree 1, 2 and 3.

6
learning_rate : float
It provides the step size for parameter update.

epochs : int
The maximum number of passes over the training data for updating the
weight vector.

tol : float or None
The stopping criterion. If it is not None, the iterations will stop when
(error > previous_error - tol). If it is None, the number of iterations will
be set by the “epochs”.

regularizer : string
The string value could be one of the following: l1, l2, None.
If it’s set to None, the cost function without the regularization term will
be used for computing the gradient and updating the weight vector.
However, if it’s set to l1 or l2, the appropriate regularized cost function
needs to be used for computing the gradient and updating the weight
vector.

lambd : float
It provides the regularization coefficient. It is used only when the
“regularizer” is set to l1 or l2.

Returns:
There is no return value. This function plots the root-mean-square error
(rmse) values for both the training set and the validation set for
degree of X between 1 and maxPolynomialDegree.

5. Implement a Linear_Regression model class. It should have the following three
methods. Note the that “fit” method should implement the batch gradient
descent algorithm.
[40 pts]

a)
fit(self, X, Y, learning_rate=0.01, epochs=1000, tol=None, regularizer=None,
lambd=0.0, **kwargs)

Arguments:
X : ndarray
A numpy array with rows representing data samples and columns
representing features.

Y : ndarray
A 1D numpy array with labels corresponding to each row of the feature
matrix X.

7
learning_rate : float
It provides the step size for parameter update.

epochs : int
The maximum number of passes over the training data for updating the
weight vector.

tol : float or None
The stopping criterion. If it is not None, the iterations will stop when
(error > previous_error - tol). If it is None, the number of iterations will
be set by the “epochs”.

regularizer : string
The string value could be one of the following: l1, l2, None.
If it’s set to None, the cost function without the regularization term will
be used for computing the gradient and updating the weight vector.
However, if it’s set to l1 or l2, the appropriate regularized cost function
needs to be used for computing the gradient and updating the weight
vector.

Note: you may define two helper functions for computing the regularized
cost for “l1” and “l2” regularizers.

lambd : float
It provides the regularization coefficient. It is used only when the
“regularizer” is set to l1 or l2.

Returns:
No return value necessary.

Note: the “fit” method should use a weight vector “theta_hat” that contains the
parameters for the model (one parameter for each feature and one for bias). The
“theta_hat” should be a 1D column vector.

Finally, it should update the model parameter “theta” to be used in “predict”
method as follows.
self.theta = theta_hat

b)
predict(self, X)

Arguments:
X : ndarray
A numpy array containing samples to be used for prediction. Its rows
represent data samples and columns represent features.

8
Returns:
1D array of predictions for each row in X.
The 1D array should be designed as a column vector.

Note: the “predict” method uses the self.theta to make predictions.

__init__(self)
It’s a standard python initialization function so we can instantiate the
class. Just “pass” this.

Part B: Data Processing (478 & 878: 7 pts)

6. Read in the winequality-white.csv file as a Pandas data frame.
7. Use the techniques from the recitation to summarize each of the variables in the
dataset in terms of mean, standard deviation, and quartiles. Include this in your
report. [3 pts]
8. Shuffle the rows of your data. You can use def = df.sample(frac=1) as an
idiomatic way to shuffle the data in Pandas without losing column names.
[2 pts]
9. Generate pair plots using the seaborn package. This will be used to identify and
report the redundant features, if there is any. [2 pts]

Part C: Model Evaluation (478: 25 pts & 878: 35 pts)

10. Model selection via Hyperparameter tuning: Use the kFold function (known
as sFold function from previous assignment) to evaluate the performance of your
model over each combination of lambd, learning_rate and regularizer from the
following sets:
[15 pts]
a. lambd = [1.0, 0, 0.1, 0.01, 0.001, 0.0001]
b. learning_rate = [0.1, 0.01, 0.001, 0.001]
c. regularizer = [l1, l2]
d. Store the returned dictionary for each and present it in the report.
e. Determine the best model (model selection) based on the overall
performance (lowest average error). For the error_function argument of
the kFold function (known as sFold function from previous assignment),
use the “mse” function. For the model selection don’t augment the
features. In other words, your model selection procedure should use the
data matrix X as it is.

11. Evaluate your model on the test data and report the mean squared error. [4 pts]
9

12. Using the best model plot the learning curve. Use the rmse values obtained from
the “learning_curve” function to plot this curve. [3 pts]

13. Determine the best model hyperparameter values for the training data matrix with
polynomial degree 3 and plot the learning curve. Use the rmse values obtained
from the “learning_curve” function to plot this curve. [3 pts]

14. [Extra Credit for 478 and Mandatory for 878] Using the
plot_polynomial_model_complexity function plot the rmse values for the training
and validation folds for polynomial degree 1, 2, 3, 4 and 5. Use the training data
as input for this function. You need to choose the hyperparameter values
judiciously to work on the higher-degree polynomial models.
[10 pts]

15. [Extra Credit for both 478 & 878] Implement the Stochastic Gradient
Descent Linear Regression algorithm. Using cross-validation determine the best
model. Evaluate your model (for polynomial degree 1) on the test data and report
the mean squared error.
[30 pts]

Part D: Written Report (25 pts)

16. Describe whether or not you used feature scaling and why or why not. [3 pts]
17. Describe whether or not you dropped any feature and why or why not. [3 pts]
18. In the lecture we have studied two types of Linear Regression algorithm: closed-
form solution and iterative optimization. Which algorithm is more suitable for the
current dataset? Justify your answer. [4 pts]
19. Would the batch gradient descent and the stochastic gradient descent algorithm
learn similar values for the model weights? Justify your answer. Let’s say that
you used a large learning rate. Would that make any difference in terms of
learning the weights by both algorithms? [5 pts]
20. Consider the learning curve of your model (degree 1). What conclusion can you
draw from the learning curve about (a) whether your model is
overfitting/underfitting and (b) its generalization error. Justify your answer.
[5 pts]
21. Consider the learning curve of the 3rd degree polynomial data matrix. What
conclusion can you draw from the learning curve about (a) whether your model is
overfitting/underfitting and (b) its generalization error. Justify your answer.
[5 pts]