04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 1/23

FIT5197 Assignment 3

Student Name: FILL IN HERE

Student ID: FILL IN HERE

Tutorial Time: FILL IN HERE

Tutor's Name: FILL IN HERE

Introduction

This assignment is out of a total of 64 marks (including 59 marks for questions, 2 marks for elegance of graphics, and 3 marks for code quality).

Purpose of assignment

To test your knowledge of hypothesis testing, linear and logistic regression.

This assignment achieves the Learning Outcomes of Modelling for Data Analysis

1. perform exploratory data analysis with descriptive statistics on given datasets;

2. construct models for inferential statistical analysis;

3. produce models for predictive statistical analysis

4. perform fundamental random sampling, simulation and hypothesis testing for required scenarios;

5. implement a model for data analysis through programming and scripting;

6. interpret results for a variety of models.

Do's and Don'ts

Do's

To complete this assignment, you will need to fill in several of the cells in the below notebook. Please make sure to follow all the instructions

given in this assignment carefully and completely.

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 2/23

We recommend that you use Azure notebooks (just like for the previous assignment - you might find the video on Moodle for that assignment

useful here too).

Marks are given for both correct answers and working out, so please make sure to provide all working out you use to reach your answer.

For questions which require R code, please make sure to put your code into the cell provided. You should not have to create any new cells to

complete this assignment.

For questions which require written answers (i.e. questions that don't require code, and which do not explicitly state otherwise in the question),

you may either write your answers by hand, scan them in as photos and embed those photos into your Jupyter notebook, or alternatively you can

use Markdown to write your answers in the cells directly. We recommend the second method if you would like to learn Markdown and LaTeX

(both of which are useful skills), but we do not require this.

Make sure you submit to Moodle a completed Jupyter Notebook .ipynb file and a .pdf file of the notebook (you can convert the file to a PDF

within Jupyter by going to File -> Download as -> PDF)

Apply for special consideration if appropriate

Get support early from this unit or other services in the university

Study skills

General support

Don'ts

Don't submit late

5% late penalty per day applies including weekends and public holidays

e.g. 65/100 becomes 55/100 if 2 days late as 0.05*100=5

Don't your files together in one zip file

Zip file submission will have a penalty of 10% as it adds to the manual handling of assignments.

Don't participate in plagiarism, collusion or contract cheating

All suspected cases will be investigated

consequences for such misconduct can include receiving zero marks for the assignment or unit or suspension or exclusion

see university policy for more details

Don't import any libraries - you should be able to do all questions in this assignment without using the library() function (i.e. just using

R's built-in functions). We reserve the right to deduct marks for using libraries to answer questions, if the libraries make the questions

easier to answer.

Marking Criteria

Your work will be marked on:

Correctness of answers

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 3/23

Quality of explanation and working out

Coding quality and accuracy

Visualisation quality and informativeness/labelling

Overall clarity

Assignment 3 contributes to 15% of your final unit mark

Additional Marks

There are a couple of criteria which relate to the assignment as a whole. The criteria in this section are not bonus marks, but instead directly count

towards the total marks for the assignment.

Elegance of Graphics (2 marks)

To ensure your graphics are of a high enough quality to receive the marks for elegance of graphics, please do one of the following: * Generate your

graphs using R (where R code is required by the question) or some computer system such as Excel, R, Python, etc. (where R code is not required by

the question), OR * Draw your graphs by hand, very neatly and with a ruler

Code Quality (3 marks)

To ensure your code is of a high enough quality to receive the marks for code quality, please ensure the following conditions are met: * Your code

does not take too long to run (if a cell takes more than a minute to run, it takes too long UNLESS OTHERWISE NOTED) * No trivial optimisations are

missing from the code - we don't expect perfection but if something is blatantly written in a very sub-optimal way we might dock code quality marks

for this

Bonus Marks

For this assignment, there are several bonus marks available

Use of Markdown/LaTeX (2 marks)

Although we do not require that you use Markdown/LaTeX for this assignment, we have assigned 2 bonus marks to encourage you to try this.

Markdown/LaTeX is a great skill to have up your sleeve when you're working on maths or trying to share your work with other people. To receive one

bonus mark, you must use Markdown/LaTeX for a majority of the questions in this assignment (at least 3). To receive two bonus marks, you must use

Markdown/LaTeX for all of the questions in this assignment, where appropriate (i.e. except when code is required instead, of course).

You can see some documentation and examples for Jupyter's markdown and LaTeX system at the following links:

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 4/23

https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html (https://jupyter-

notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html) https://jupyter-

notebook.readthedocs.io/en/stable/examples/Notebook/Typesetting%20Equations.html (https://jupyter-

notebook.readthedocs.io/en/stable/examples/Notebook/Typesetting%20Equations.html)

Question 1 - Confidence Intervals and Hypothesis Testing (20 marks)

Previously, we worked with a company who was making phone cases using a new 3D printing method. Because 3D printing is not a perfect process,

management wanted to know how many samples they should produce to get a good approximation for the failure probability of the process.

This was several weeks ago, and the engineers have now begun testing the new 3D printing process. We are no longer concerned with whether a

print is a "success" or "failure" - we are going to be more precise now. Although they will do more tests later, the engineers have performed an initial

test run of 20 successful units (that is, they keep producing units until they have 20 "successful" prints).

For this question, we are going to assume that the 3D print thickness is normally distributed.

In [ ]:

Question 1.a (2 marks)

We have decided that, for the 3D printing process to be considered a success, we are going to require the following facts to be true (with a

confidence level ):

The mean print thickness must be 1mm

The confidence interval for the mean must be 0.1mm wide

We will test these facts by determining whether the corresponding values are within the 99% confidence intervals for our estimates, and also whether

the wideness of our confidence intervals fits. That is, we will calculate a two-sided 99% confidence interval for the mean of the print thickness and see

if 1mm is within that range, and whether that range has a total width 0.1mm wide.

Using R code, write a function conf.interval.mean.n, which takes two arguments (data and alpha) and returns a list containing two values:

list$lower: the lower end of the two-sided confidence interval of the mean of with confidence level

list$upper: the upper end of the two-sided confidence interval of the mean of with confidence level

α = 0.01

≤

≤

data alpha

data alpha

test.20.units <- read.csv("test20units.csv")$x

head(test.20.units)

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 5/23

For this question, since we both have a small sample size and also don't know the standard deviation of the population (we are using the

estimate from the sample), use the t-distribution rather than the normal distribution

The function should look something like this:

conf.interval.mean.t <- function(data, alpha) {

# Your code here

}

Note: we are going to use hypothesis testing, rather than this confidence-interval based approach, later in this question. Hypothesis

testing is a better way to determine whether the 3D printing process meets the requirements or not

In [ ]:

In [ ]:

Question 1.b (2 marks)

Are the two criteria established in Question 1.a met? That is, is the desired mean of 1mm thickness within the confidence interval, and is the

confidence interval less than 0.1mm wide from the lower to upper bound?

Using this question as an example, why is it important to consider both whether the value is within the confidence interval and how wide the

confidence interval is (relative to how wide an error we are allowed)?

YOUR ANSWER HERE

Question 1.c (6 marks)

The engineers have developed two alternative methods for 3D printing the phone cases. The engineers are now trying to minimize "warping" (which

is where phone units are not printed flat and are instead warped slightly - this warping is measured in millimeters).

We have two datasets (one for each of the two new printing methods), where each dataset contains warping measurements from 50 prints produced

using that dataset's 3D printing method. From our many tests with the original method, we know that the mean warping measurement for the original

printing method is very close to 0.1mm (so we will use that as our estimate for the mean of the original population).

# your code here

ci.mean.20 <- conf.interval.mean.t(test.20.units, 0.01)

ci.mean.20

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 6/23

The engineers have asked us to determine whether either of their new 3D printing methods are superior to the original method. To do this, we could

just calculate the mean of each dataset's measurements and compare that to the 0.1mm mean of the original method, but this does not take into

account the variance introduced by sampling from the population. Instead, we will use hypothesis testing.

For each of the two new printing methods, each with datasets method1.50.units and method2.50.units respectively, we need to test the hypothesis

that the mean of that method's dataset is less than the mean of the original population (i.e., less than 0.1mm). So, for each of the two printing

methods, do the following (there are four cells below - the first cell imports the data, and the next three cells correspond to the following three points):

Create a hypothesis test (defining the null hypothesis and alternative hypothesis). Note that this should be a one-sided test.

Calculate a p-value using a t.test

Interpret this p-value, using a confidence level

When interpreting the p-value, keep in mind that the goal is to determine whether each of the new methods is better than the original method (in

terms of warping).

Do not use the inbuilt t.test function in R to answer this question. Use the formulas in the lecture notes to calculate the values from

scratch. You may write your own version of a t-test function if you like, but you don't have to do this.

α = 0.01

In [ ]:

YOUR ANSWER HERE

In [ ]:

YOUR ANSWER HERE

Question 1.d (3 marks)

When we use for a single hypothesis test, we expect to find a "false positive" (that is, we reject the null hypothesis when we should not

have) in around 1 in 100 experiments.

However, in Question 1.c, we did two hypothesis tests (one for each of our two methods). We will consider the implications of this below.

Please do the following:

α = 0.01

method1.50.units <- read.csv("method1_50units.csv")$x

method2.50.units <- read.csv("method2_50units.csv")$x

# your code here

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 7/23

Assuming that the probability of finding a false positive is 1% for each experiment, calculate the probability of finding at least 1 false positive

overall (for two hypothesis tests).

Calculate the probability of finding at least 1 false positive overall, if we tested 100 methods instead of just two.

Briefly discuss something you think data scientists could do to reduce the risk of false positives when performing multiple hypothesis tests at

once. Justify your answer.

YOUR ANSWER HERE

Question 1.e (5 marks)

The engineers have decided, for other reasons than warping, to use one of the two new 3D printing methods instead of the old one. They would now

like us to determine which of these two new printing methods is superior.

So, we need to test two hypotheses:

That method 1 is superior to method 2 (i.e. method1 has a lower mean warping than method2)

That method 2 is superior to method 1 (i.e. method2 has a lower mean warping than method1)

For each of methods 1 and 2, please do the following (there are three cells below, corresponding to the the following three points):

Create a hypothesis test (defining the null and alternative hypotheses) which tests whether the chosen method is superior to the alternative

method (i.e. either method 1 is superior to method 2, or method 2 is superior to method 1). Note that this should be a one-sided test.

Calculate a p-value

Interpret the p-value, using a confidence level

Do not use the inbuilt t.test function in R to answer this question. Use the formulas in the lecture notes to calculate the values from

scratch. You may write your own version of a t-test function if you like, but you don't have to do this.

α = 0.01

YOUR ANSWER HERE

In [ ]:

YOUR ANSWER HERE

# your code here

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 8/23

Question 1.f (2 marks)

If you completed Question 1.e correctly, you should have found that the two p-values calculated (one for each of the two methods) sum to 1. Or, put

another way, each p-value is the other p-value.

Explain why this should be the case, justifying your answer.

1−

YOUR ANSWER HERE

Question 2 - Logistic Regression (19 marks)

The engineers are now doing the first production run, which is going to take a while. One of your colleagues has challenged you to a game of chess

while you wait for the test run to complete.

The game is being played by correspondance (meaning that you aren't playing over a board, and can do other things while you play). Your opponent

neglected to tell you that they are a former Australian chess champion, and you have unfortunately found yourself in a tricky position.

Your opponent is now beginning to gloat, and has asked you to resign so as to not waste the time of such a great chess player as them (they're only

kidding but it still hurts). You think that maybe you can use data science to determine whether the position is lost (meaning you should resign) or

drawn (meaning you should keep playing and teach your colleague a lesson).

One of your other colleagues suggested that, since your opponent is so good, they surely should be able to win in ten turns or less. Your opponent

has agreed to these rules - if they cannot beat you in ten turns, the game will be declared a draw.

You've found a dataset which contains thousands of chess positions where white has a rook and a king, and black has a king. Each entry in the

original dataset also includes whether the position is winnable by white some number of turns (labelled by the number of turns, e.g. "one", "two", ...,

"fifteen", "sixteen") or not (labelled "draw", since black can only draw with a king). In each position, it is black's turn to play.

The code cell below imports the dataset and does a small transformation on it, converting it from a 17-class problem ("draw", "one", "two", ...,

"sixteen") to a binary problem ("loss" or "draw"), where a position is considered a draw if either black can force a draw, or if white cannot win within

ten moves (with optimal play on both sides). Each entry in this transformed dataset contains the following information:

The white king's file (a, b, c, d)

The white king's rank (1, 2, 3, 4)

The white rook's file (a, b, ..., h)

The white rook's rank (1, 2, ..., 8)

The black king's file (a, b, ..., h)

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 9/23

The black king's rank (1, 2, ..., 8)

The optimal end result of this position for black ("loss" if the position can be won by white in less than 10 moves, and "draw" otherwise).

Notice that the white king's rank and file are always within 1-4 and a-d respectively. It appears that whoever prepared this dataset rotated the board

so the white king was always in this quadrant, probably to reduce the number of variables in our model. This would be something we would have to

worry about if we wanted to make this model work for any possible chess position; in any case, don't worry about this for the purposes of this

assignment as it won't be an issue (your opponent's king happens to be in this quadrant so it's not a problem).

Important Note: In this question, you might see the error message "prediction from a rank-deficient fit may be misleading..." show up. You

can ignore this for the purposes of this assignment (try to think why we might be seeing a "rank-deficient" fit - there's a discussion at

https://stats.stackexchange.com/questions/35071/what-is-rank-deficiency-and-how-to-deal-with-it

(https://stats.stackexchange.com/questions/35071/what-is-rank-deficiency-and-how-to-deal-with-it) if you're curious).

Background information This dataset is derived from a real dataset, which was "generated by Michael Bain and Arthur van Hoff at the Turing

Institute, Glasgow, UK". You can find the original dataset in the UCI repository at https://archive.ics.uci.edu/ml/datasets/Chess+%28King-

Rook+vs.+King%29 (https://archive.ics.uci.edu/ml/datasets/Chess+%28King-Rook+vs.+King%29) (but don't download the dataset from there;

instead, use the code that we have provided below as well as the file "krkopt.data" on Moodle).

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 10/23

In [ ]:

Question 2.a (5 marks)

Before we can use our dataset, we need to preprocess it to make it suitable for use in a logistic regression. At the moment, our dataset consists of six

feature variables, all of which are categorical, as well as one label variable (which is binary). We need to perform "one-hot encoding" on all of these

categorical variables (i.e. all of the categorical variables, but not the binary label variable - think about why it would be useless to do one-hot encoding

on a binary variable, if we're avoiding the dummy variable trap as discussed below).

One-hot encoding (referred to as "indicator variables" in the lectures - see slide 41 in week 8's lecture notes) is the process of taking a categorical

feature (with possible values it can take) and converting it to a set of binary values (in practice we use , to avoid the "dummy variable trap").

It is important to do this because, when categorical variables are encoded as a single integer (e.g. "red" = 1, "green" = 2, "blue" = 3, ...), a logistic

regression (and many other modelling techniques) will fit the feature as though "red" is closer to "green" than it is to "blue" (since 1 is closer to 2 than

to 3). So, we split the values up into individual variables to avoid this.

k k k − 1

N = 250

set.seed(42)

raw.df <- read.csv("krkopt.data")

names(raw.df) <- c("wk_file", "wk_rank", "wr_file", "wr_rank", "bk_file", "bk_rank", "result")

draws <- c("draw", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen")

draw.idx <- raw.df$result %in% draws

loss.idx <- !draw.idx

draw.df <- raw.df[draw.idx,]

loss.df <- raw.df[loss.idx,]

draw.df$result = 1

loss.df$result = 0

draw.df <- draw.df[sample(nrow(draw.df), N),]

loss.df <- loss.df[sample(nrow(loss.df), N),]

raw.df <- rbind(draw.df, loss.df)

raw.df <- raw.df[sample(nrow(raw.df)),]

head(raw.df)

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 11/23

In fact, R's built-in logistic regression automatically performs one-hot encoding on factor variables prior to running the algorithm (but we're going to

ignore that and implement the encoding ourselves instead). To learn more about one-hot encoding, Wikipedia has a good page on it at

https://en.wikipedia.org/wiki/Dummy_variable_(statistics) (https://en.wikipedia.org/wiki/Dummy_variable_(statistics)) (note that this technique goes by

several names, including one-hot encoding, dummy variables, indicator variables, and a few others listed on the Wikipedia page).

When we perform one-hot encoding, we take the dataset and, for each column we wish to encode:

Find how many different values the factor variable can take, using the levels() function; call this number

Create new columns in the data-frame, corresponding to each level of the factor. For each row, set the value to 1 if the original variable's

value corresponded to the column, and 0 otherwise. If the variable took the last value of the factor (the -th value), just set them all to 0. We use

columns to avoid the "dummy variable trap" (https://en.wikipedia.org/wiki/Dummy_variable_(statistics)

(https://en.wikipedia.org/wiki/Dummy_variable_(statistics)))

Don't forget to delete the original factor column from the data frame when you're done

So, prepare our dataset as follows:

1. Write a function to.factors, which takes the variables listed below and returns a data-frame identical to df, except that the columns listed in

col.names have been converted to factor columns (you are allowed to use R's built-in as.factor() function here)

df is a data frame for which we wish to convert some variables to factors

col.names is a vector of strings, where each string corresponds to the name of a column in the data-frame which we want to convert to a

factor column using the to.factor() function

2. Write a function one.hot.encode, which takes two variables (df and to.encode) and returns a data-frame identical to df, except that the factor

columns listed in to.encode have been replaced with one-hot encoding columns as described above

df is a data frame which we wish to apply one-hot encoding to

to.encode is a vector of strings, where each string corresponds to the name of a column in the data-frame which we want to convert from a

factor column to a one-hot encoding

The column names for the one-hot encoding should take the following form:

original_col_name.factor_name

For example, the column encoding when "wk_file" = "b" should have the name "wk_file_b", and the column encoding when "wr_rank" = 6 should have

the name "wr_rank_6".

The to.factors function should look like this:

k

k − 1

k

k − 1

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 12/23

to.factors <- function(df, col.names) {

# Your code here

}

The one.hot.encode function should look like this:

one.hot <- function(df, to.encode) {

# Your code here

}

In [ ]:

In [ ]:

Question 2.b (3 marks)

Important note: the rest of the questions in Question 2 rely on the dataset created in Question 2.a. To ensure you are not disadvantaged

when answering the rest of these questions, we have provided another dataset ("krkopt_preprocessed.csv") which is the expected result

from the code in Question 2.a. You must use this new dataset for the remainder of Question 2 (the code cell immediately below this

question will import the data as "df.processed") You can compare your output from Question 2.a to this dataset if you want to check your

code for Question 2.a.

We now have a dataset which has been preprocessed and is ready for use with R's built-in glm() function.

Perform the following steps in the code block two blocks down (under the block that reads "krkopt_preprocessed.csv"):

1. Use glm() to create a linear regression

# your code here

preprocess <- function(df, factor.names) {

fact.df <- to.factors(df, factor.names)

one.hot.df <- one.hot(fact.df, factor.names)

return (one.hot.df)

}

factor.names <- c("wk_file", "wk_rank", "wr_file", "wr_rank", "bk_file", "bk_rank")

df.student.preprocessed <- preprocess(raw.df, factor.names)

head(df.student.preprocessed)

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 13/23

2. Use the predict() function to find the estimated probabilities for the training set (using type="response") Store these probabilities in a variable

called "probs"

3. Convert these probabilities into categorical predictions, and find the accuracy of the model. Store these predictions in a variable called

"preds".

The code block underneath this one plots the predicted probability of success (on the x-axis) vs. whether the print was a success or not (on the y-

axis).

Do not import any libraries to do any of the above tasks, as they are doable in R without importing other libraries.

In [ ]:

In [ ]:

In [ ]:

Question 2.c (4 marks)

The above model seems to do reasonably well, but let's see if we can do better using some variable selection.

You've been discussing your model with some of your colleagues, and they have made several suggestions:

One colleague has suggested that it does not matter where the rook or the white king is (only where the black king is); that is, we only need to

consider the following variables:

bk_rank_1, bk_rank_2, ..., bk_rank_7

bk_file_a, bk_file_b, ..., bk_file_g

Another colleague has suggested that it only matters where the white rook and king are (and not where the black king is); that is, we only need

to consider the following variables:

wr_rank_1, wr_rank_2, ..., wr_rank_7

wr_file_a, wr_file_b, ..., wr_rank_g,

wk_rank_1, wk_rank_2, wk_rank_3

wk_file_a, wk_file_b, wk_file_c

df.processed <- read.csv("krkopt_preprocessed.csv")

head(df.processed)

# your code here

plot(df.processed$result ~ probs)

table(preds, df.processed$result)

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 14/23

Your opponent has heard what you've been up to, and they have suggested that it only matters where the white rook is (not the black or white

king king). You don't trust them, but you'll give it a try... you only need to consider the following variables:

wr_rank_1, wr_rank_2, wr_rank_3

wr_file_a, wr_file_b, wk_file_c

For each of these three proposals, create a linear model which only contains those variables. Then calculate the accuracy of each of these models

using the evaluate function provided.

In [ ]:

In [ ]:

Question 2.d (1 mark)

We can also use an automated, step-wise process to determine remove variables, using R's built-in step() function.

The code to do this is already given to you in the first code cell below, as a function called "to.stepwise". You need to call this function in the second

code cell below, two times. Each time will use a different value of k:

fit.aic: uses k = 2

fit.bic: uses k = log(N) where your dataset has N rows

You should then run the third code cell below, which will call your evaluate() function on fit.aic and fit.bic. Take a look at how the AIC and BIC models

do in comparison to the three models we came up with above (we'll discuss this in the next question).

In [ ]:

evaluate <- function(fit, truth) {

probs <- predict(fit, type="response")

preds <- probs > 0.5

accuracy <- mean(preds == truth)

return (accuracy)

}

# your code here

to.stepwise <- function(df, k) {

fit <- glm(result ~ ., family="binomial", data=df)

return (step(fit, k=k))

}

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 15/23

In [ ]:

In [ ]:

Question 2.e (2 marks)

The step-wise AIC method, if implemented correctly, should find the formula

result ~ wk_file_a + wk_file_b + wk_file_c + wk_rank_1 + wk_rank_2 +

wr_file_a + wr_file_c + wr_file_e + wr_rank_3 + wr_rank_4 +

wr_rank_5 + wr_rank_7 + bk_file_a + bk_file_b + bk_file_c +

bk_file_d + bk_file_e + bk_file_f + bk_file_g + bk_rank_1 +

bk_rank_2 + bk_rank_4 + bk_rank_5 + bk_rank_6 + bk_rank_7

with accuracy at around 0.86.

The step-wise BIC method, if implemented correctly, should find the formula

result ~ wk_file_a + wk_file_b + wk_file_c + wk_rank_1 + wr_rank_3 +

wr_rank_4 + wr_rank_5 + bk_file_a + bk_file_c + bk_file_d +

bk_file_e + bk_file_f + bk_file_g + bk_rank_1 + bk_rank_2 +

bk_rank_5 + bk_rank_7

with accuracy at around 0.856.

Given all this information, were any of your colleagues right? Or were they all wrong? Also, can you draw a conclusion from the formula found by BIC

as to which pieces are important to have in the right position (out of the white king, the black king or the white rook)? Be sure to justify your answer.

YOUR ANSWER HERE

Question 2.f (2 marks)

You have finished your model, and have decided to use the AIC model (which should be saved as a variable named "fit.aic"). You now want to apply

your model to the chess position you find yourself in against your colleague.

# your code here

evaluate(fit.aic, df.processed$result)

evaluate(fit.bic, df.processed$result)

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 16/23

You are in the following position (it's your move, i.e. black's move):

Note: the above image will only show if you include the file "chess_pos.png" (contained in the "data.zip" file you've been given on Moodle) in the

same folder as this Jupyter notebook, and then re-run this Markdown cell by putting it into edit mode and then running it. Alternatively, you can just

look at the file itself.

This position has the following variable values:

wk_file = c

wk_rank = 3

wr_file = f

wr_rank = 5

bk_file = e

bk_rank = 5

You will need to convert this position into a dataframe of one row with the one-hot encoding we used above, and then pass this into the predict

function using the model "fit.aic" and using type="response" for the predict function to get a probability.

Do this in the cell below and see what result you get. Then, in the Markdown cell below that, state whether your model indicates that you truly should

resign in this position (i.e. the position is lost) or whether your opponent was bluffing (i.e. the position does not seem to be winnable in less than 10

moves).

Note: you might be tempted to create a simple dataframe with one row and convert this to one-hot encoding using the functions you have

defined above - this probably won't work, and we recommend that you create the dataframe using some other method. If you're curious,

try to think about why this is the case.

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 17/23

Hint: This Stack-Overflow post might help: https://stackoverflow.com/questions/10689055/create-an-empty-data-frame

(https://stackoverflow.com/questions/10689055/create-an-empty-data-frame). In particular, consider how to use the line

empty_df <- df[FALSE,]

In [ ]:

YOUR ANSWER HERE

Question 2.g (2 marks)

In our code above (and also in the code in Question 3), we train our models using all of our training data, and also evaluate the models on our

training data. Describe a problem with this approach, and give an example of how we could fix this (other than leave-one-out cross-validation, which

is used in Question 3).

YOUR ANSWER HERE

Question 3 - Linear Regression (20 marks)

Just as you give your opponent the finishing move that draws your chess game (which is no small feat against an Australian champion - you have the

admiration of your colleagues), the engineers come running into your office in a panic. They tell you that their deadline for finishing their 3D printing

process is tomorrow, but that they are still having trouble solving a problem with their printing process.

They have printed 50 test cases, with varying degrees of success. They have found that a large number of their cases turn out very rough to the

touch; they are concerned that management will see this and scrap the 3D printing project altogether (and their jobs with it!).

You calmly ask them to send over all the data they have, and get to work developing a linear regression model to try to identify which component(s)

of the process are causing the roughness problem.

Important Note: In this question, you might see the error message "prediction from a rank-deficient fit may be misleading..." show up. You

can ignore this for the purposes of this assignment (try to think why we might be seeing a "rank-deficient" fit - there's a discussion at

https://stats.stackexchange.com/questions/35071/what-is-rank-deficiency-and-how-to-deal-with-it

(https://stats.stackexchange.com/questions/35071/what-is-rank-deficiency-and-how-to-deal-with-it) if you're curious).

# your code here

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 18/23

Background information This dataset is derived from a real dataset, which was uploaded to Kaggle on 9/14/2018 by Ahmet Okudan. The Kaggle

upload indicates that the dataset "comes from research by TR/Selcuk University Mechanical Engineering department". More information can be found

at https://www.kaggle.com/afumetto/3dprinter (https://www.kaggle.com/afumetto/3dprinter)

In [ ]:

Question 3.a (1 mark)

First things first - we need to take a look at our data. Write a function "plot.variables" which takes two arguments - var.x and var.y. It then plots var.x vs

var.y. Make sure your plots have a reasonable label (e.g. var.x vs var.y, substituting in the actual values of var.x and var.y) as well as reasonable x-

axis and y-axis labels (e.g. var.x on the x-axis and var.y on the y-axis, again substituting the actual values of var.x and var.y).

Your function should look like this:

plot.variables <- function(var.x, var.y) {

# Your code here

}

Then run the second code cell below, which will plot out all your other variables vs roughness.

In [ ]:

In [ ]:

Question 3.b (1 mark)

data.3d <- read.csv("data.csv")

data.3d <- data.3d[,-c(11, 12)]

head(data.3d)

# your code here

par(mfrow=c(3, 2))

options(repr.plot.width=6, repr.plot.height=6)

for (name in names(data.3d)) {

if (name != "roughness") {

plot.variables(name, "roughness")

}

}

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 19/23

Now we have an idea of what our dataset looks like, we can start working on creating a linear model. Using R's built-in lm() function, create a linear

model of roughness vs all the other variables in our dataset. Save this model in a variable called "fit.lm"

In [ ]:

In [ ]:

Question 3.c (1 mark)

It would be nice to get an idea of how our estimates compare with the true roughness values using a graph. In the code cell below, use R to create a

plot of the predicted roughness (using our linear model) on the x-axis vs the actual roughness on the y-axis.

Like in Question 3.a, make sure to use a reasonable label for both your plot and your axes.

In [ ]:

Question 3.d (2 marks)

The graph seems to indicate that our model does reasonably well, but we would like to have more mathematically-sound way of determining how

close our model gets, on average.

Write a function, "mean.sq.error", which takes two arguments (model and df) and returns the mean squared error of the predictions determined by

model and df.

Then write another function, "mean.abs.error", which takes two arguments (model and df) and returns the mean absolute error of the predictions

determined by model and df.

Your MSE function should look like this:

mean.sq.error <- function(model, df) {

# Your code here

}

Your MAE function should look like this:

# your code here

fit.lm

# your code here

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 20/23

mean.abs.error <- function(model, df) {

# Your code here

}

You may assume that "roughness" is the name of the target variable for the dataframe passed into your function (i.e., you can hard-code this into the

function).

In [ ]:

In [ ]:

Question 3.e (1 mark)

We need to narrow down what could be the cause for print roughness; we will use some variable selection to do this. Like in Question 2.d, we've

given you the code for running step-wise variable selection, although (again, like in Question 2.d) you need to provide the values of k for the AIC (k =

2) and BIC (k = log(N) for N data points in our dataset). You need to call this function in the second code cell below, two times. Each time will use a

different value of k:

fit.lm.aic: uses k = 2

fit.lm.bic: uses k = log(N) where your dataset has N rows

You should then run the third code cell below, which will call your mean.sq.error() and mean.abs.error() functions on fit.lm.aic and fit.lm.bic. Take a

look at how the AIC and BIC models do in comparison to the three models we came up with above (although you don't have to discuss this in your

answer).

In [ ]:

In [ ]:

# your code here

mean.sq.error(fit.lm, data.3d)

mean.abs.error(fit.lm, data.3d)

to.stepwise <- function(df, k) {

fit <- lm(roughness ~ ., data=df)

return (step(fit, k=k))

}

# your code here

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 21/23

In [ ]:

Question 3.f (4 marks)

Both the AIC and BIC models identify the same formula, and therefore remove the same variables:

roughness ~ layer_height + nozzle_temperature + bed_temperature + print_speed + material

Use the summary() function on "fit.lm", "fit.lm.aic", and "fit.lm.bic" in the cell block below.

In the markdown cell below the code cell, answer the following three questions:

1. Do the variables removed by AIC and BIC all have large p-values?

2. What does it means to have a large p-value?

3. Why it is reasonable that your answer to the first of these three questions ("Do the variables removed by AIC and BIC all have large p-values?")

would be what we expect to happen? You can answer this using intuition; you don't need to mathematically prove it.

In [ ]:

YOUR ANSWER HERE

Question 3.g (3 marks)

It is a problem that we are training and evaluating on the same dataset (in Question 2.g, we asked you to discuss why this is the case).

One way to mitigate this problem is by using "leave-one-out cross validation", or LOOCV (https://en.wikipedia.org/wiki/Cross-

validation_(statistics)#Leave-one-out_cross-validation) (https://en.wikipedia.org/wiki/Cross-validation_(statistics)#Leave-one-out_cross-validation)). In

LOOCV, for a dataset of size we train different models, where each model trains on all but one of the points in our dataset. We then evaluate theN N

mean.sq.error(fit.lm, data.3d)

mean.sq.error(fit.lm.aic, data.3d)

mean.sq.error(fit.lm.bic, data.3d)

mean.abs.error(fit.lm, data.3d)

mean.abs.error(fit.lm.aic, data.3d)

mean.abs.error(fit.lm.bic, data.3d)

# your code here

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 22/23

squared error from each model predicting the point we left out (hence the name "leave-one-out" CV), and average the final result.

In the code cell below, implement a function "cv.loo", which takes as argument a dataset "df" and returns the average squared prediction error from

LOOCV done on that dataset. You may assume that "roughness" is the name of the target variable for the dataframe passed into your function (i.e.,

you can hard-code this into the function).

Your function should look something like this:

cv.loo <- function (df) {

# Your code here

}

In [ ]:

In [ ]:

Question 3.h (2 marks)

The LOOCV error calculated in Question 3.g, if calculated correctly, was significantly higher than the training MSE. Why might this be the case?

Describe one approach we might take to reduce this problem.

YOUR ANSWER HERE

Question 3.i (1 mark)

In the code above, we use the LOOCV technique to help solve the problem (discussed in Question 2.g) where we evaluate on the trianing set.

However, aside from validation techniques like this, it is also common to set aside another testing dataset, which we use at the very end of training

our model as a final evaluation. For simplicity, we did not do this here (although perhaps we should have). Why do you think this is considered good

practice? You might like to do some research on this (just be careful not to accidentally plaigarise your answer from another source).

YOUR ANSWER HERE

# your code here

cv.loo(data.3d)

04/10/2019 asgn3

localhost:8888/notebooks/Dropbox/monash_academic/FIT5197/assignments/A3/final/asgn3.ipynb 23/23

Question 3.j (4 marks)

The engineers are eagerly awaiting your analysis of their situation. Summarize your findings from the previous parts of this question, being sure to

address the following points:

1. Which variables do not appear to be related to roughness?

2. Which variables do appear to be related to roughness?

3. What are some (at least three) courses of action that the engineers could take to reduce the roughness of their 3D prints? You might need to do

a little research on 3D printing to understand what each variable means.

YOUR ANSWER HERE