School of Electronics and Computer Science
University of Southampton
COMP6245 (2019/20): Foundations of Machine Learning Lab 4 (of 6)
Issue 30 October 2019
Deadline 10 November 2019 (10:00)
Feedback 16 November 2019
In this lab you are expected to work independently; that is, you may discuss the work
with, or put questions to, only a demonstrator or the lecturer. You are, of course, free to
refer to the cited texts or to access information from Web-based resources (indeed, this
is recommended).
Objective
• Implementing linear regression
• Regularization using quadratic and sparsity-inducing penalties
• Implementing sparse regression on a realistic problem in chemoinformatics
1 Linear Least Squares Regression:
We will work with the Diabetes dataset from the UCI Machine Learning repository [1]
taken from the package sklearn. Load the data and inspect the features and targets. It
is usually a good idea to plot a few histograms of the targets and pair-wise scatters of the
features in any new problem you are tasked to solve.
• Implement a linear predictor that is solved by the pseudo-inverse method:
$$ a = \left( Y^{T} Y \right)^{-1} Y^{T} f, $$
where Y is the N × p input matrix and f is the N × 1 vector of responses.
• Solve the same problem using the linear model from sklearn and compare the
results.
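When comparing the two solutions, keep in mind that sklearn's LinearRegression fits an intercept by default, while the formula above has no offset term. The sketch below shows one way to reconcile them: append a column of ones to the input matrix before solving. It uses synthetic data as a stand-in for the Diabetes set, and the names `Y1`, `a_normal` and `a_lstsq` are illustrative assumptions, not part of the lab.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the lab data (an assumption, not the Diabetes set):
# N samples, p features, and a target with a non-zero mean.
N, p = 200, 5
Y = rng.standard_normal((N, p))
a_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
f = Y @ a_true + 150.0 + 0.1 * rng.standard_normal(N)

# Append a column of ones so the model can learn an offset (intercept).
Y1 = np.hstack([Y, np.ones((N, 1))])

# Pseudo-inverse solution via the normal equations (Y1^T Y1) a = Y1^T f.
a_normal = np.linalg.solve(Y1.T @ Y1, Y1.T @ f)

# The same problem solved by a general least-squares routine, for comparison.
a_lstsq, *_ = np.linalg.lstsq(Y1, f, rcond=None)

max_diff = np.max(np.abs(a_normal - a_lstsq))
print("largest coefficient difference:", max_diff)
print("estimated intercept:", a_normal[-1])
```

Solving the normal equations with `np.linalg.solve` avoids forming the explicit inverse; the result is the same vector as in the formula above whenever the matrix is invertible.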
2 Regularization
Tikhonov regularization minimizes the sum of squared errors with a quadratic penalty on
the weights [2] (available online at https://web.stanford.edu/~hastie/Papers/ESLII.pdf):
$$ \min_{a} \; \| f - Y a \|_2^2 + \gamma \| a \|_2^2 $$
Derive and implement a regularized regression. Show, using two bar graphs of the weights
side by side to the same scale, how the two solutions differ.
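As a starting point for the derivation, the steps can be sketched as follows: expand the objective, set its gradient with respect to a to zero, and solve the resulting linear system. The symbol J for the objective is introduced here for convenience only.

```latex
\begin{align*}
J(a) &= \| f - Y a \|_2^2 + \gamma \| a \|_2^2
      = f^{T} f - 2 a^{T} Y^{T} f + a^{T} Y^{T} Y a + \gamma a^{T} a, \\
\nabla_a J(a) &= -2 Y^{T} f + 2 Y^{T} Y a + 2 \gamma a = 0, \\
\left( Y^{T} Y + \gamma I \right) a &= Y^{T} f
\quad\Longrightarrow\quad
a = \left( Y^{T} Y + \gamma I \right)^{-1} Y^{T} f .
\end{align*}
```

Setting γ = 0 recovers the pseudo-inverse solution of Section 1, and for γ > 0 the matrix being inverted is always positive definite.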
Figure 1: Solutions of Linear and Regularized Regressions
3 Sparse Regression
L1 regularization is a method for achieving sparse solutions [3]. It minimizes:
$$ \min_{a} \; \| f - Y a \|_2^2 + \gamma \| a \|_1 $$
We will use the sklearn package to solve it. For the Diabetes problem
considered above, solve the lasso problem and plot the resulting weights as a bar graph.
Observe how the number of non-zero weights changes with the regularization parameter
γ. Your comparisons should look similar to Fig. 1. In each of these cases, compare the
prediction errors. In the case of the sparse regression, would you say the features with
non-zero weights are more meaningful? (To answer, you will have to find the source of the
data and look at the variables.)
Regularization Path
In implementing the lasso it is convenient to study the regularization path (Fig. 2; image
taken from https://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_lars.html).
Implement and study the regularization path for the six-variable illustrative example
considered in [3].
Figure 2: Regularization Path: How regression coefficients change with hyperparameter.
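Before computing a path, it is useful to know which range of penalties is worth scanning. The sketch below generates the six-variable example in vectorized form and builds a log-spaced grid of penalties; it is written for the scaled objective (1/(2N))||y − Xa||² + α||a||₁ used by sklearn, and the grid (100 values down to 0.001 of the maximum) is an assumption chosen to resemble the defaults of `lasso_path`, not a requirement.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100

# Two latent variables; each drives a group of three noisy inputs.
Z = rng.standard_normal((N, 2))
y = 3.0 * Z[:, 0] - 1.5 * Z[:, 1] + 2.0 * rng.standard_normal(N)
X = np.repeat(Z, 3, axis=1) + 0.2 * rng.standard_normal((N, 6))

# For alpha >= alpha_max every lasso coefficient is zero, so the path
# only needs to be computed for smaller penalties.
alpha_max = np.max(np.abs(X.T @ y)) / N

# Log-spaced grid from alpha_max down to 0.001 * alpha_max.
alphas = alpha_max * np.logspace(0, -3, num=100)

corr = np.corrcoef(X, rowvar=False)
print("alpha_max:", alpha_max)
print("correlation of X1 with X2 (same group):", corr[0, 1])
print("correlation of X1 with X4 (different groups):", corr[0, 3])
```

Inputs within a group are strongly correlated, which is what makes the example interesting: along the path the lasso tends to pick one representative from each group rather than sharing the weight among all three.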
4 Solubility Prediction:
We will now look at a large problem of predicting solubility of chemical compounds from
features derived from their molecular structure. Predicting function from structural variables
is an important problem because it is easy to define and synthesize small chemical
compounds, but very expensive to test them experimentally. Hence the step known as in
silico screening is increasingly popular. The dataset we will use is from Huuskonen et al.
[4] and the problem has also been considered recently in Pirashvili et al. [5] using more
sophisticated machinery. Have a skim-read through the introductory and results sections
of these papers.
The data used in [4], with several additional features and more compounds, is available in the
Excel spreadsheet Husskonen Solubility Features.xlsx.
• Load the data, split into training and test sets, implement a linear regression and
plot the predicted solubilities against the true solubilities on the training and test
sets. To facilitate comparison, draw the two scatter plots side by side to the same
scale on both axes.
• Implement a lasso regularized solution and plot graphs of how the prediction error
(on the test data) and the corresponding number of non-zero coefficients change with
increasing regularization.
• Are you able to make any comment comparing your results to those claimed in [4]
or [5]?
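The evaluation loop for the tasks above could be organised as in the following sketch, which splits the data, standardizes the features using training statistics only, fits a least-squares model, and reports the mean squared error on both sets. It runs on synthetic data standing in for the solubility features, and the 70/30 split and the helper `mse` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix and log-solubility targets.
N, p = 300, 10
Y = rng.standard_normal((N, p))
f = Y @ rng.standard_normal(p) + 0.3 * rng.standard_normal(N)

# Random 70/30 split obtained by shuffling the row indices.
idx = rng.permutation(N)
n_test = (3 * N) // 10
test_idx, train_idx = idx[:n_test], idx[n_test:]
Y_train, f_train = Y[train_idx], f[train_idx]
Y_test, f_test = Y[test_idx], f[test_idx]

# Standardize with statistics from the training set only, so that no
# information from the test set leaks into the model.
mu, sigma = Y_train.mean(axis=0), Y_train.std(axis=0)
Y_train_s = (Y_train - mu) / sigma
Y_test_s = (Y_test - mu) / sigma
f_mean = f_train.mean()

# Least-squares fit on centred targets; the mean is added back at prediction.
a, *_ = np.linalg.lstsq(Y_train_s, f_train - f_mean, rcond=None)

def mse(f_true, f_pred):
    """Mean squared prediction error."""
    return float(np.mean((f_true - f_pred) ** 2))

mse_train = mse(f_train, Y_train_s @ a + f_mean)
mse_test = mse(f_test, Y_test_s @ a + f_mean)
print("train MSE:", mse_train, " test MSE:", mse_test)
```

For the lasso experiment, the same split and the same `mse` helper can be reused inside a loop over the regularization parameter, recording the test error and the number of non-zero coefficients at each value.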
Report
Write a short report of no more than four pages, summarising your work.
Appendix: Snippets of Code
1. Linear regression on diabetes dataset
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LinearRegression
# Load data, inspect and do exploratory plots
#
diabetes = datasets.load_diabetes()
Y = diabetes.data      # N x p matrix of inputs
f = diabetes.target    # N responses
NumData, NumFeatures = Y.shape
print(NumData, NumFeatures)
print(f.shape)
# Histogram of the targets and a scatter plot of two of the features
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))
ax[0].hist(f, bins=40)
ax[0].set_title("Distribution of Target", fontsize=14)
ax[1].scatter(Y[:,6], Y[:,7], c='m', s=3)
ax[1].set_title("Scatter of Two Inputs", fontsize=14)
2. Comparing pseudo-inverse solution to sklearn output
# Linear regression using sklearn
#
lin = LinearRegression()
lin.fit(Y, f)
fh1 = lin.predict(Y)
# Pseudo-inverse solution to linear regression
# (no intercept term here, whereas LinearRegression fits one by default)
a = np.linalg.inv(Y.T @ Y) @ Y.T @ f
fh2 = Y @ a
# Plot predictions to check if they look the same!
#
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
ax[0].scatter(f, fh1, c='c', s=3)
ax[0].grid(True)
ax[0].set_title("Sklearn", fontsize=14)
ax[1].scatter(f, fh2, c='m', s=3)
ax[1].grid(True)
ax[1].set_title("Pseudoinverse", fontsize=14)
3. Tikhonov Regularizer
gamma = 0.5
aR = np.linalg.inv(Y.T @ Y + gamma*np.identity(NumFeatures)) @ Y.T @ f
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8,4))
ax[0].bar(np.arange(len(a)), a)
ax[0].set_title('Pseudo-inverse solution', fontsize=14)
ax[0].grid(True)
ax[0].set_ylim(np.min(a), np.max(a))
ax[1].bar(np.arange(len(aR)), aR)
ax[1].set_title('Regularized solution', fontsize=14)
ax[1].grid(True)
ax[1].set_ylim(np.min(a), np.max(a))
4. Sparsity inducing (lasso) regularizer
from sklearn.linear_model import Lasso
ll = Lasso(alpha=0.2)
ll.fit(Y, f)
yh_lasso = ll.predict(Y)
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10,4))
ax[0].bar(np.arange(len(a)), a)
ax[0].set_title('Pseudo-inverse solution', fontsize=14)
ax[0].grid(True)
ax[0].set_ylim(np.min(a), np.max(a))
ax[1].bar(np.arange(len(ll.coef_)), ll.coef_)
ax[1].set_title('Lasso solution', fontsize=14)
ax[1].grid(True)
ax[1].set_ylim(np.min(a), np.max(a))
5. Lasso Regularization path on a synthetic example:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path
from sklearn import datasets
# Synthetic data:
# Problem taken from Hastie, et al., Statistical Learning with Sparsity
# Z1, Z2 ~ N(0,1)
# Y = 3*Z1 - 1.5*Z2 + 2*N(0,1) Noisy response
# Noisy inputs (the six are in two groups of three each)
# Xj= Z1 + 0.2*N(0,1) for j = 1,2,3, and
# Xj= Z2 + 0.2*N(0,1) for j = 4,5,6.
N = 100
y = np.empty(0)
X = np.empty([0,6])
for i in range(N):
    Z1 = np.random.randn()
    Z2 = np.random.randn()
    y = np.append(y, 3*Z1 - 1.5*Z2 + 2*np.random.randn())
    Xarr = np.array([Z1,Z1,Z1,Z2,Z2,Z2]) + np.random.randn(6)/5
    X = np.vstack((X, Xarr.tolist()))
# Compute regressions with Lasso and return paths
# (lasso_path works on X and y as given and does not fit an intercept)
alphas_lasso, coefs_lasso, _ = lasso_path(X, y)
# Plot each coefficient
#
fig, ax = plt.subplots(figsize = (8,4))
for i in range(6):
    ax.plot(alphas_lasso, coefs_lasso[i,:])
ax.grid(True)
ax.set_xlabel("Regularization")
ax.set_ylabel("Regression Coefficients")
6. Predicting Solubility of Chemical Compounds
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
sol = pd.read_excel("Husskonen_Solubility_Features.xlsx", verbose=False)
print(sol.shape)
colnames = sol.columns
print(colnames)
f = sol["LogS.M."].values
fig, ax = plt.subplots(figsize=(4,4))
ax.hist(f, bins=40, facecolor='m')
ax.set_title("Histogram of Log Solubility", fontsize=14)
ax.grid(True)
Y = sol[colnames[5:len(colnames)]]
N, p = Y.shape
print(Y.shape)
print(f.shape)
# Split data into training and test sets
#
from sklearn.model_selection import train_test_split
Y_train, Y_test, f_train, f_test = train_test_split(Y, f, test_size=0.3)
# Regularized regression
#
gamma = 2.3
a = np.linalg.inv(Y_train.T @ Y_train + gamma*np.identity(p)) @ Y_train.T @ f_train
fh_train = Y_train @ a.values
fh_test = Y_test @ a.values
# Plot training and test predictions
#
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10,4))
ax[0].scatter(f_train, fh_train, c=’m’, s=3)
ax[0].grid(True)
ax[0].set_title("Training Data", fontsize=14)
ax[1].scatter(f_test, fh_test, c=’m’, s=3)
ax[1].grid(True)
ax[1].set_title("Test Data", fontsize=14)
# Over to you for implementing Lasso
#
References
[1] K. Bache and M. Lichman, “UCI machine learning repository.” http://archive.ics.uci.edu/ml, 2013.
[2] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning.
Springer, 2008.
[3] T. Hastie, R. Tibshirani, and M. Wainwright, Statistical Learning with Sparsity: The
Lasso and Generalizations. Chapman & Hall/CRC, 2015.
[4] J. Huuskonen, M. Salo, and J. Taskinen, “Aqueous solubility prediction of drugs based
on molecular topology and neural network modeling,” Journal of Chemical Information
and Computer Sciences, vol. 38, no. 3, pp. 450–456, 1998.
[5] M. Pirashvili, L. Steinberg, F. Belchi Guillamon, M. Niranjan, J. G. Frey, and J. Brodzki,
“Improved understanding of aqueous solubility modeling through topological data analysis,”
Journal of Cheminformatics, vol. 10, no. 1, p. 54, 2018.
Mahesan Niranjan November 2014