辅导案例-QBUS2820-Assignment 1

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

1

QBUS2820 Predictive Analytics
Semester 2, 2020

Individual Assignment 1

Key information

1. Required submissions:
a. ONE written report (word or pdf format, through Canvas- Assignments-
Report Submission (Individual Assignment 1)).
b. ONE Jupyter Notebook .ipynb file (through Canvas- Assignments- Upload
Your Program Code Files (Individual Assignment 1)).
2. Due date/time: Friday 16-October-2020, 2pm (report and Jupyter notebook
submission). The late penalty for the assignment is 5% of the assigned mark per day,
starting after 2pm on the due date. The closing date/time Monday 26-October-2020,
2pm is the last date/time on which an assessment will be accepted for marking.
3. Weight: 30% of the final mark.
4. Length: The main text of your report (including Task A and Task B) should have a
maximum of 15 pages. Especially for Task A, you should write a complete report
including sections such as business context, problem formulation, data processing,
EDA, and feature engineering, methodology, analysis, conclusions and limitations,
etc.
5. If you wish to include additional material, you can do so by creating an appendix.
There is no page limit for the appendix. Keep in mind that making good use of your
audience’s time is an essential business skill. Every sentence, table and figure have to
count. Extraneous and/or wrong material will reduce your mark no matter the quality
of the assignment.
6. Anonymous marking: As the anonymous marking policy of the University, please
only include your student ID in the submitted report, and do NOT include your
name. The file name of your report should follow the following format. Replace
"SID" with your Student ID. Example: SIDQbus2820Assignment1S22020.
7. Presentation of the assignment is part of the assignment. Markers will assign 5
marks for clarity of writing and presentation. Numbers with decimals should be
reported to the four-decimal point.

Key rules:
• Carefully read the requirements for each part of the assignment.
• Please follow any further instructions announced on Canvas.
• You must use Python for the assignment. Use "random_state= 1" when needed, e.g.
when using “train_test_split” function of Python. For all other parameters that are not
specified in the questions, use the default values of corresponding Python functions.
• Reproducibility is fundamental in data analysis, so that you will be required to submit a
Jupyter Notebook that generates your results. Not submitting your code will lead to a loss of
50% of the assignment marks.
2

• Failure to read information and follow instructions may lead to a loss of marks.
Furthermore, note that it is your responsibility to be informed of the University of Sydney
and Business School rules and guidelines, and follow them.
• Referencing: Harvard Referencing System. (You may find the details at:
http://libguides.library.usyd.edu.au/c.php?g=508212&p=3476130)

Task A. Moneyball (35 Marks)

You will work on the NBA salary dataset.

Note: This task does not require prior knowledge of basketball. You should not add any
personal subjective assumptions about the data based on your existing knowledge. This can
lead to inaccurate results. You should use the techniques that we learnt and you discovered to
complete the prediction task.

1. Problem description

Select three models to predict NBA player salary from performance statistics.

The selected three models need to follow the below rules:
• one model is a linear regression model;
• one model is a kNN regression model;
• one model is a model that we have not covered in the QBUS2820 unit so far. You
need to self-explore and self-study this model, since the ability of self-study is
critical in the field of machine learning which is evolving quickly.

As a consultant working for a sports analytics company, the NBA league approached you to
develop predictive models to predict NBA salaries based on machine learning techniques. To
enable this task, you were provided with a dataset containing highly detailed performance of
the NBA players.

As part of the contract, you need to write a report according to the details below.

2. Understanding the data

You can download the “NBA_Train.csv” and “NBA_test.csv” data for the Canvas. The
response is the SALARY($Millions) column in the dataset.

NBA glossary link below and the glossary Table at the end of the Task A can help you
understand the meaning of the variables better:
https://stats.nba.com/help/glossary

You should use the given test set to evaluate the performance of your work. The
performance/scoring metric is: Root Mean Squared Error (RMSE), for the test set.

Your target the of the test set RMSE is: less than 4.1 ($Millions).

Please note at least two of your three presented models must achieve the above target.
3

3. Written report

The purpose of the report is to describe, explain, and justify your solution to the client with
polished presentation. Be concise and objective. Find ways to say more with less. When it
doubts, put it in the appendix. You can refer to the file “TaskA_instructions” on more
detailed instructions on how to work on Task A including writing the report.
Suggested outline:

1. (2 marks) Introduction: write a few paragraphs stating the business problem and
summarising your final solution and results. Use plain English and avoid technical
language as much as possible in this section (it should be for a wide audience).
2. (2 marks) Data processing and exploratory data analysis: provide key information
about the data, discuss potential issues, and highlight interesting facts that are useful
for the rest of your analysis.
3. (3 marks) Feature engineering.
4. (14 marks) Methodology of the linear regression and kNN regression models: present
your selected models, your rationale, how you fit them, model selection process, some
interpretations, etc.
5. (7 marks) Methodology of the model that is not covered in our unit: why you choose
this model, briefly present how the model works, explain how the model is different to
linear regression and kNN regression, how you fit the model, model selection process,
some interpretations, etc.
6. (5 marks) Report and interpret the test set performance that meets the required
criteria.
7. (2 marks) Final analysis, conclusion, limitations and future work suggestions.
The NBA glossary Table.

Sports Reference LLC, 2016a.
4

Task B. Gradient Ascent (25 Marks)

Instructions

You will work on the Boston housing dataset, to predict the median value of owner-
occupied homes in $1000’s as the target y (column “medv”), using linear regression.

You can download the data set “Boston_housing_data.csv” and the corresponding data
description file “Boston_housing_data_description” on Canvas.

1. (5 marks) You decide to use this dataset for a linear regression task, that is to use the
existing information to predict the “medv”. But before that, you want to first conduct
exploratory data analysis (EDA).

After EDA, please choose three most relevant features and present your results
including:
• How your EDA is conducted;
• Carefully explain your feature selection rules and criteria;
• There is no need to train a regression model for this question.

2. (20 marks) Based on slides from 27 to 32 of week 5 slides, write your own Gradient
Ascent algorithm, to estimate the parameters of the given regression problem with the
intercept.

Use three features: “rm”, “dis” and “tax”, and = [0, … ,0] as your initialization point. You
may write a python function named such as Gradient_Ascent_Algo, with various
inputs, e.g. data matrix , target , an initial parameter vector (0) = [0, … ,0], learning
rate, the number of GD iterations , stopping criteria and other arguments you see
appropriate.

You need to:
• find the optimal learning rate for your own Gradient Ascent algorithm and explain
why this is the optimal value;
• use leave one out cross validation to select the optimal learning rate;
• design and justify your approach.

You can use the “Reference code 1” to load and standardize your data and use the “Reference
code 2” as a template to start your Gradient Ascent algorithm.

Reference code 1:

import pandas as pd
import numpy as np
import matplotlib.pylab as plt

data = pd.read_csv('Boston_housing_data.csv')
X = data[['rm','dis','tax']]
y = np.array(data[["medv"]])
5

# Standardization
X = (X-X.mean(axis=0))/X.std(axis=0)
y = (y-y.mean(axis=0))/y.std(axis=0)

# Adding constant
X = np.column_stack((np.ones(len(X)), X))

Reference code 2:

"""
Build the gradient ascent function for MLE
"""
# m denotes the number of training examples here, not the number of features
def Gradient_Aescent_Algo(X, y, beta, alpha, numIterations):
# sample size
N = len(X)
XTrans = X.transpose()
# create a vector to save all the likelihood values at each iteration
likelihood_values = np.zeros((numIterations,1))
beta_values = np.zeros((numIterations,2))

for i in range(0, numIterations):
# predicted values from the model
f_X = np.dot(X, beta)

# calculte the likelihood

# save all the likelihood values at each iteration

# calcualte the gradient using matrix representation

# update the parameters simulteneously with learning rate alpha
beta = beta + alpha * gradient
# save all the estimated parametes at each step
beta_values[i,:]= beta.transpose()
return beta, likelihood_values, beta_values

欢迎咨询51作业君