程序代写案例-CSI 531

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

Homework 1 for CSI 531
Due: Oct 6, 23:59pm
All homeworks are individual assignments. This means: write your own solutions and do not
copy code/solutions from peers or online. Should academic dishonesty be detected, the proper
reporting protocols will be invoked (see Syllabus for details).
Instructions: Submit two files. One should be a write-up of all solutions and observations, as LastnameFirst-
nameSolution.pdf. The second should be an archive LastnameFirstnameCode.zip containing code and any results
files.
1 [5 pts.] Irreducible data example
In class we discussed that not all datasets’ dimensionality can be successfully reduced using PCA.
(a)[1 pts] Discuss the cases when PCA will fail.
(b)[2 pts] How do we quantify that it fails?
(c)[2 pts] Provide an example dataset of 2D points (specify the points as vectors of numbers) in which PCA will
not work well for dimensionality reduction. Explain why. Hint: Think of 2D points and reduction to 1D.
2 [40 pts] Dimensionality reduction
For this question you will use the Cloud Dataset from the UCI ML repository: https://archive.ics.uci.edu/
ml/datasets/Cloud. Read about it to get familiar with what is measured. Within the data, there are two
datasets: DB #1 and DB #2. For this homework, just use the 1024 vectors in DB #1. Use python for all your
programming. (The data file is also included in the homework on BB for your convenience.) You will have to
submit your code in LastnameFirstnameCode.zip together with the relevant write-up in the main solution file
LastnameFirstnameSolution.pdf.
(a)[5 pts] Load the data into a python program and center it. Note: there should be a function called center()
in your code that achieves this.
(b)[5 pts] Compute the covariance matrix of the data Σ. Hint: by using the definition of sample covariance, as
a matrix product or as a sum of outer products. See book for details. Use Numpy for linear algebra computations
(https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.linalg.html).As a result you should have a function
covar() in your code which does not use the built-in covariance functions.
(c)[5 pts] Compute the eigenvectors and eigenvalues of Σ. The numpy linear algebra module referenced above
has a function that can help.
(d)[10 pts] Plot the percentage of retained variance as a function of the number of principle components used.
Determine the number of principal components (PCs) r that will ensure 90% retained variance? How did you
compute this? Provide a function in your code that determines r based on an arbitrary percentage α of retained
variance.
(e)[10 pts] Compute the reduced dimension data matrix A with two dimensions by projection on the first two
PCs. Plot the points using a scatter plot (two dimensional diagram that places each sample i according to its new
dimensions ai1, ai2). Discuss the observations. Are there clusters of close-by points? Argue for or against whether
1
these are sufficient dimensions.
(f)[5 pts] Study the PCA implementation in pythons’ sklearn library https://scikit-learn.org/stable/
modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA. Do PCA using the li-
brary on the same data. Do the eigenvalues approximately match to what you computed above?
3 [20 pts.] Kernel methods
Consider the problem of finding the most dissimilar diametric pair (MDDP): this is a pair of data points that are
dissimilar from the mean and also dissimilar from each other. Below is an algorithm that would find such a pair
given a data matrix D:
Algorithm 1: MDDP(D)
Result: a, b - the most dissimilar diametric points in D
Compute the data mean µ = mean(D);
s = +∞;
for i in (1 . . . n) do
for j in (i+ 1 . . . n) do
temp = xTi µ+ x
T
j µ+ x
T
i xj ;
if temp < s then
s = temp;
a = xi;
b = xj ;
end
end
end
The algorithm computes the sum of inner products xTi µ+ x
T
j µ+ x
T
i xj for each pair of points and returns the
pair with the lowest such quantity.
(a)[5 pts] Demonstrate the execution of this algorithm on the following data matrix of 2D instances: D=

0 1
1 3
5 0
2 4
.
Show the steps and the resulting MDD pair of points.
(b)[15 pts] As we discussed in class sometimes we would like to kernelize methods to handle non-linearity in data.
Provide a pseudo-code for a kernel version of the MDDP algorithm above. The goal is to kernelize the algorithm
for an arbitrary kernel Hint: Assume that you can compute the kernel matrix K, corresponding to
some mapping φ() and then use the basic kernel operations we discussed in class and also in the
book, to derive the steps of MDDP in terms of elements in K.
4 [10 pts.] Orthogonality of Error in Regression:
Prove that Yˆ T = 0, where Yˆ is the predicted response and = Y −Yˆ is the error between the actual and predicted
response. Hint: Use the solution for the predicted response as a transformation of Y through the
hat matrix.
5 [25 pts.] Regression to understand CO2 pollution:
For this task, you will use data about CO2 emissions in European cities included as an excel sheet in the homework
in BB. The data has total emissions per city (Column E) as well as other variables including number of airports,
buildings, etc (columns G and later). Use only data for cities (Column C: admin level=8). Feel free to prepare
the data into simpler text format before you use code.
The goal will be to train linear regression models of the total CO2 as a function of the predictor variables in
columns G and on.
2
(a) [7pts] Use scikitlearn’s linear regression: linear model.LinearRegression to learn an unregularized model.
Report the corresponding coefficient for each predictor variable.
(b) [7pts] Use scikitlearn’s ridge regression: linear model.Ridge to learn a regularized model. Try 2 values for
alpha: 1 and 10. Report the coefficients of the predictor variables for the two models.
(c) [11pts] Discuss your findings. Which are the most important factors determining the CO2 emissions? How
do the loading coefficients for factors differ between regularized and non-regularized models? How does the
residual error differ? Use visuazlizations as necessary to discuss your findings.
3

欢迎咨询51作业君