UNIVERSITY OF CALIFORNIA, DAVIS

DEPARTMENT OF COMPUTER SCIENCE

ECS 171: Homework Set 1

Instructor: Ilias Tagkopoulos

TAs: Jason Youn, Ameen Eetemadi, and ChengEn Tan

{jyoun, eetemadi, cetan}@ucdavis.edu

September 30, 2019

General Instructions: The homework should be submitted electronically through Canvas. Each

submission should be a zip file that includes the following: (a) a report in pdf format ("re-

port_HW1.pdf") that includes your answers to all questions, plots, figures and any instructions

to run your code, (b) the python code files. Please note: (a) do not include any other files,

for instance files that we have provided such as datasets, (b) each function should be written

with the appropriate remarks in the code so it is generally understandable (what it does, how it

does it), (c) do not use any toolbox unless it is explicitly allowed in the homework description.

Shared/copied code from any source is not allowed, as it is considered plagiarism.

1 OF CARS AND MEN [100PT ]

In this exercise, you will investigate the type of relationship that exists between the “miles per

gallon” (mpg) rating of a car and several of its attributes. For this task, you will use the “Auto

MPG” dataset (“auto-mpg.data” file; 398 cars, 9 features; remove the 6 records with missing

values to end up with 392 samples) that is available in the UCI Machine Learning Repository:

https://archive.ics.uci.edu/ml/datasets/Auto+MPG

For this assignment, you will need to code your solution from scratch. Unless explicitly stated,

it is fine to use open source code, for example sci-kit learn, to help you write your own imple-

mentation of the methods. Perform and report (code and results) the following:

1

1. Assume that we want to classify the cars into 4 categories: low, medium, high, and very

high mpg. Find what the threshold for each category should be, so that all samples are

divided into four equally-sized bins. [10pt]

2. Create a 2D scatterplot matrix, similar to that of Figure 1.4 in the ML book (K. Murphy,

page 6; also available on the lecture 1 slides - the figure with the flowers). You may use

any published code to perform this. Which pair from all pair-wise feature combinations

is the most informative regarding the four mpg categories? [10pt]

3. Write a linear regression solver that can accommodate polynomial basis functions on a

single variable for prediction of MPG. Your code should use the Ordinary Least Squares

(OLS) estimator (i.e. the Maximum-likelihood estimator). Code this from scratch. Its

recommended to use a library (e.g. numpy) for basic linear algebra operations (addition,

multiplication and inverse). [20p]

4. Split the dataset in the first 292 samples for training and the rest 100 samples for testing.

Use your solver to regress for 0th to 3rd order polynomial on a single independent variable

(feature) each time by using mpg as the dependent variable. Report (a) the training and

(b) the testing mean squared errors for each variable individually (except the “car name”

string variable, so a total of 7 features that are independent variables). Plot the lines and

data for the testing set, one plot per variable (so 4 lines in each plot, 7 plots total). Which

polynomial order performs the best in the test set? Which is the most informative feature

for mpg consumption in that case? [20pt]

5. Modify your solver to be able to handle second order polynomials of all 7 independent

variables simultaneously (i.e. 15 terms). Regress with 0th, 1st and 2nd order and report

(a) the training and (b) the testing mean squared error (MSE). Use the same 292/100 split.

[15pt]

6. Using logistic regression (1st order), perform classification on the various classes (low/medium/high/very

high). Report the training/testing classification precision (you might want to look how

precision is defined and how it is calculated). You can use a library (e.g. scikit-learn) to

perform logistic regression. [10pt]

7. re-do the logistic regression training/testing, but now after you apply min-max normal-

ization to the dataset. Do you see any difference in performance? [5pt]

8. If a USA manufacturer (origin 1) had considered to introduce a model in 1981 with the fol-

lowing characteristics: 4 cylinders, 400 cc displacement, 150 horsepower, 3500 lb weight,

8 m/sec2 acceleration, what is the MPG rating that we should have expected? In which

mpg category (low,medium,high mpg) would it belong? Use second-order, multi-variate

polynomial and logistic regression. [10pt]

2