158.222-2020 Semester 1 Massey University
Page 1 of 10
ASSIGNMENTS 2 AND 3

Deadline: Hand in by midnight 3 May 2020
Evaluation: Assignment 2: 10% of your final course grade.
Assignment 3: 10% of your final course grade.
Late Submission: Refer to the course guide.
Work: This assignment is to be done individually.
Purpose: Implement the entire data science/analytics workflow. Use regression techniques to solve real-
world problems. Gain skills in extracting data from the web using APIs and web scraping. Build on
the data wrangling, data visualization and introductory data analysis skills gained up to this point
as well as problem formulation and presentation of findings. Gain skills in kNN regression
modelling and supervised and unsupervised learning.
Learning outcomes 1 - 5 from the course outline.

Please note that all data manipulation must be written in Python code in the Jupyter Notebook
environment. No marks will be awarded for any data wrangling that is completed in Excel.
Assignments 2 and 3 will need to be submitted separately, even though they have the same due
date. Make sure that you create a separate notebook for each assignment.
These assignments will take longer than you think, so…
Do not leave starting these assignments until the last minute. You have the tools you need to
start now.
As of the week 5 lecture, you will have been introduced to tools that will assist you in completing
Assignment 2.
By week 7 (before semester break) you will be able to complete most of Assignment 3, except for
part 3, which you will be able to complete after the week 8 lecture.


****************
*** Plagiarism ***
****************

It is mandatory that any assessment items that you submit during your University
study are your own work. Massey University takes a firm stance on academic
misconduct, such as plagiarism and any form of cheating.

Plagiarism is the copying or paraphrasing of another person’s work, whether
published or unpublished, without clearly acknowledging it. It includes copying the
work of other students and reusing work previously submitted by yourself for another
course. It also includes the copying of code from unacknowledged sources.

Academic integrity breaches impact students, as they disadvantage honest students
and undermine the credibility of your qualification. Plagiarism and cheating in tests
and exams will be penalised; they are likely to lead to loss of marks for that item of
assessment and may lead to an automatic failing grade for the course and/or
exclusion from re-enrolment at the University.

Please see the Academic Integrity Guide for Students on the University website for
more information. The Guide steps you through the University Academic Integrity
Policy and Procedures. For example, you will find definitions of academic integrity
misconduct, such as plagiarism; how misconduct is determined and managed; and
where to find resources and assistance to help develop the skills of academic writing,
exam preparation and time management. These skills will help you approach
university study with academic integrity.


ASSIGNMENT 2: DATA ACQUISITION AND REGRESSION
In Assignment 2 you will be integrating data from two sources:
• The World Happiness Index

and one of:
• The World Bank API
• A web-scraped source of your choosing

Your goal is to build Regression models for predicting happiness, following a good process, including:
• careful selection of explanatory variables (features) through engaging your critical thinking in choosing data
sources, exploratory data analysis and optional feature set expansion;
• good problem formulation;
• good model experimentation (including explanation of your experimentation approach), and
• thoughtful model interpretation

TASK 1: DATA ACQUISITION AND INTEGRATION (25 MARKS)
a) Static Data: Import Table 2.1 of the World Happiness Report data (1 mark)
You can download the “WHRData.xls” static dataset from the Stream site. This dataset is from the 2020 World
Happiness Report. You can learn more about this report here:
https://worldhappiness.report/ed/2020/
Data definitions and other variable documentation can be found here:
https://happiness-report.s3.amazonaws.com/2020/WHR20_Ch2_Statistical_Appendix.pdf
You should familiarise yourself with the data documentation before proceeding. As a bare minimum, you will
need to identify which variable represents ‘Happiness’.
Note: if you are unable to meet the challenges laid out in Task 1 b) and c) you will still be able to continue
with Tasks 2 and 3 with only the static dataset.
b) Dynamic data (14 marks)
Do ONE of either option 1 or option 2:
OPTION 1:
API Data: Identify, import and wrangle indicators of your choosing from the World Bank API
The World Bank API is briefly introduced in Lecture 5. Your task is to identify and import 5 or more World Bank
indicators (features) that you would like to have as options for inclusion in your models for predicting
happiness.
• Identify: To identify 5 or more appropriate indicators, you will need to explore the World Bank API
documentation and figure out for yourself how to find which indicators are available and then how to
identify and request them. Finding your own way through the documentation is a deliberate part of this
challenge. Briefly explain your process and why you chose your features. These links will provide you with
a start:
https://datahelpdesk.worldbank.org/knowledgebase/articles/889392-api-documentation
https://datahelpdesk.worldbank.org/knowledgebase/articles/898599-api-indicator-queries
• Import and wrangle your chosen indicators so that they are in the right shape for integration with the WHR
data. In Lecture 5, only one indicator is imported. To import many indicators in a tidy fashion (i.e. without
repeating code) will possibly involve the use of a loop and/or function, depending on your approach.

Note that by default you may not be returned all the data you require - you may have to set arguments to
obtain the full range (keep an eye out for the ‘per_page’ argument). Also note that you can specify a date
range.
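As a sketch of what a tidy multi-indicator import could look like (the indicator codes, column names and helper functions below are illustrative assumptions, not the required approach), one way to avoid repeating code is to build the query URL and the reshaping logic once and loop over a dict of indicators. The offline demonstration at the end uses a hard-coded sample shaped like the API's JSON payload, since no network call is made here:

```python
import pandas as pd

# Hypothetical indicator codes -- look up your own in the World Bank catalogue.
INDICATORS = {"NY.GDP.PCAP.CD": "gdp_per_capita", "SP.DYN.LE00.IN": "life_expectancy"}

def wb_url(indicator, start=2015, end=2019):
    """Build a World Bank API query URL. 'per_page' is raised so one
    request returns every row instead of only the default first page."""
    return (f"https://api.worldbank.org/v2/country/all/indicator/{indicator}"
            f"?date={start}:{end}&format=json&per_page=20000")

def tidy(records, name):
    """Reshape the list of record dicts from the API's JSON payload
    into a (country, year, value) frame with one named value column."""
    rows = [{"country": r["country"]["value"], "year": int(r["date"]), name: r["value"]}
            for r in records]
    return pd.DataFrame(rows)

# With network access you would loop over INDICATORS, e.g.:
#   records = requests.get(wb_url(code)).json()[1]   # element 0 is paging metadata
#   frames.append(tidy(records, name))
# and then merge the frames on ['country', 'year'].

# Offline demonstration on a sample shaped like the API's response:
sample = [{"country": {"value": "New Zealand"}, "date": "2019", "value": 42084.0},
          {"country": {"value": "New Zealand"}, "date": "2018", "value": 41109.0}]
df = tidy(sample, "gdp_per_capita")
print(df)
```

The loop-over-a-dict design is one choice among several; a function per indicator or a list of codes would work equally well, provided the code is not repeated per indicator.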
Task 1b) Option 1 marking:
❖ 6 marks for identification of features and explanation of why you chose them. We are looking for your
curiosity and initiative in exploring the World Bank API and figuring out how to use it to effectively identify
appropriate indicators. 0/6 marks will be awarded if you simply import a subset of the indicators that you
have been given codes for in Lecture 5.
❖ 8 marks for the import and wrangling of the data – the more elegant and tidy the solution, the higher the
marks

OPTION 2:
Web-scraped data: Source, import, parse and wrangle web data
• Source: Go to the internet and find another data source with which to expand your feature set that:
o can be web-scraped,
o you think may improve your predictive model, and
o can be meaningfully integrated with the WHR data and your World Bank data.
In case it is not obvious, you will be looking for data that can be linked on both country name and one
or more years of the data you have already acquired.

• Import, parse, wrangle: Scrape the data and wrangle it into the shape required for later
integration.

• Explain: Include a brief explanation of your wrangling process at the beginning of wrangling.
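As one possible sketch of the parse step (the HTML snippet, table id and column names below are invented stand-ins for whatever source you choose; in practice the string would come from requests.get(url).text), parsing a scraped table with BeautifulSoup might look like:

```python
from bs4 import BeautifulSoup
import pandas as pd

# A tiny stand-in for a scraped page -- in practice this string would be
# fetched from your chosen source with the requests module.
html = """
<table id="stats">
  <tr><th>Country</th><th>Year</th><th>Value</th></tr>
  <tr><td>Norway</td><td>2019</td><td>7.49</td></tr>
  <tr><td>Kenya</td><td>2019</td><td>4.58</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", id="stats").find_all("tr")
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
data = [[td.get_text(strip=True) for td in tr.find_all("td")] for tr in rows[1:]]

df = pd.DataFrame(data, columns=header)
df["Year"] = df["Year"].astype(int)      # wrangle types now to ease later joins
df["Value"] = df["Value"].astype(float)
print(df)
```

Note that scraped values arrive as strings, so converting the join keys (country, year) to consistent types is part of the wrangling, not an afterthought.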

Task 1b) Option 2 marking:

❖ 3 marks for finding an appropriate and good quality data source and explanation of why you chose it.
❖ 8 marks for effective and tidy import/parse/wrangle code
❖ 3 marks for briefly explaining your wrangling process before you import your data

c) Integration: By whichever means appropriate, clean labels and integrate the two datasets from a), b) into one
dataframe (10 marks)
• Inspect and clean labels for integration: To integrate your data without losing rows, you will need to
make sure the labels you are joining on are compatible. This may involve some data
cleaning/updating using good old-fashioned gruntwork. For instance, the same country can have two
different names in two different datasets (e.g. Democratic People's Republic of Korea vs North Korea).
Do some data checks pre and post-integration to ensure you have not lost data. Data loss due to some
countries being present in one dataset but genuinely not in another is acceptable.

• Include a brief explanation of your process at the beginning.

• Integrate your data into one dataframe.
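A minimal sketch of the check/clean/merge pattern (the frames and column names below are toy assumptions standing in for your real WHR and dynamic datasets):

```python
import pandas as pd

# Toy stand-ins for the WHR and World Bank frames.
whr = pd.DataFrame({"country": ["North Korea", "New Zealand"], "year": [2019, 2019],
                    "happiness": [3.0, 7.3]})
wb = pd.DataFrame({"country": ["Democratic People's Republic of Korea", "New Zealand"],
                   "year": [2019, 2019], "gdp_per_capita": [640.0, 42084.0]})

# Pre-check: which labels appear in one frame but not the other?
only_whr = set(whr["country"]) - set(wb["country"])
only_wb = set(wb["country"]) - set(whr["country"])
print(only_whr, only_wb)

# Old-fashioned gruntwork: a hand-built mapping for mismatched names.
fixes = {"Democratic People's Republic of Korea": "North Korea"}
wb["country"] = wb["country"].replace(fixes)

merged = whr.merge(wb, on=["country", "year"], how="inner")

# Post-check: confirm no rows were lost to label mismatches.
assert len(merged) == len(whr)
print(merged)
```

With the labels cleaned beforehand, the final integration is indeed a single merge call.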

Task 1c) Marking:

❖ 6 marks for checking label compatibility for integration (via scripting) and, if required,
cleaning/updating those labels
❖ 2 marks for briefly explaining your process
❖ 2 marks for the final integration (at this point the final integration should be a straightforward line, or
few lines, of code)

TASK 2: DATA CLEANING AND EXPLORATORY DATA ANALYSIS (EDA) (24 MARKS)
a) EDA – data quality inspection (8 marks)
• Explore: Explore your data with a view to looking for data quality issues. This could involve looking at
summary statistics, plots, inspection of nulls and duplicates – whatever you think is appropriate, there
is no single correct way of doing this. Clean your data if and as required and save the cleaned dataset
to csv.

• Explain: Include a brief explanation of your process at the beginning.
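A brief sketch of this inspection step (the frame and its problems are illustrative; your own checks should follow from whatever you find in your data):

```python
import pandas as pd
import numpy as np

# Toy frame with the kinds of problems to look for.
df = pd.DataFrame({"country": ["Fiji", "Fiji", "Chad", "Peru"],
                   "year": [2019, 2019, 2019, 2019],
                   "happiness": [6.1, 6.1, np.nan, 5.7]})

print(df.describe())          # summary statistics
print(df.isnull().sum())      # nulls per column
print(df.duplicated().sum())  # count of fully duplicated rows

cleaned = df.drop_duplicates().dropna(subset=["happiness"])
cleaned.to_csv("cleaned.csv", index=False)   # saved for reuse in Assignment 3
```

Whether to drop, fill or keep problem rows is a judgment call to make and explain case by case; this sketch simply shows the inspect-then-act shape of the step.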

Task 2a) marking:

❖ 6 marks for your code/outputs: Did you produce outputs appropriate for inspecting and addressing data
quality issues?
❖ 2 marks for briefly explaining your process.

b) EDA – the search for good predictors (16 marks)
• Explore: Explore your data with the goal of finding explanatory variables/features that could be good
predictors of your target variable (Happiness). This should include:
o Inspection of correlations between features
o Pairs plot/scatter matrix
o Any other visualisation that you deem appropriate

• Explain: Include a brief explanation of your process at the beginning.

• Inspect and transform: Inspect your chosen subset of potential explanatory variables more closely with
some visualisations and/or summary statistics. Do any of them look like they need transformation to
conform to a normal distribution? Transform any variables that need transformation with an
appropriate transformation for normality (e.g. log, square, quarter root etc). Go back and check
correlations as required – searching for predictors will likely be an iterative process.

• Discuss: Briefly discuss your findings, e.g. “I have chosen this subset of variables as good candidates for
model predictors because…” (warning: do not copy and paste this text into your report, we will
deduct marks if you do.) It is also OK to choose variables for reasons other than them being the best
possible predictors – perhaps you are curious as to whether a given variable would have any effect in a
model.
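The transform-and-recheck loop above can be sketched as follows (the feature is synthetic; with real data you would compare skewness, or simply histograms, before and after the candidate transformation):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# A right-skewed synthetic feature, standing in for something GDP-like.
x = pd.Series(rng.lognormal(mean=9, sigma=1, size=500))

before = x.skew()
x_log = np.log(x)        # log transform; use np.log1p if zeros are present
after = x_log.skew()
print(f"skew before: {before:.2f}, after log: {after:.2f}")
```

A skewness near zero after transformation suggests the variable is closer to normal; a histogram or Q-Q plot gives the same information visually.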

Note: You are looking for features that are well correlated with the target variable. You are also looking out for
features that are highly correlated with each other. Be aware that although a model can retain predictive power
when it includes highly correlated explanatory features (multicollinearity), the effects of those correlated features
will mask each other. Where there is multicollinearity, interpretation of specific feature coefficients is
uncertain. Bear this in mind later when interpreting your models.
Note: You may find that all your chosen explanatory variables end up coming from the same data source. That
is OK.
Task 2b) marking:

❖ 12 marks for your code/outputs (explore, inspect and transform): Did you produce outputs appropriate
for finding good predictors? Did you transform where appropriate? Is your code elegant and concise?
❖ 4 marks for your words (explain and discuss): Did you explain your process and discuss your findings?
Are your words elegant and concise?

**BONUS QUESTION**
Up to 10 marks will be awarded for feature set expansion via the creation of derived variable(s) that make a significant
and novel contribution to your final model. How you do this is completely up to you and being a bonus question, no
further guidance will be given. Ingenuity and initiative will be rewarded. As this is an extension task, a very high
standard is set for achieving maximum marks.
TASK 3: MODELLING (44 MARKS)
Build the best regression model you can, with Happiness as the target variable, within whichever bounds you set yourself
in your problem formulation.
• Formulate a problem: You know ‘Happiness’ is your target variable, but what else are you interested in with
respect to this problem? Would you like to simply find the model with the most predictive power? Are you
interested in understanding how particular features of interest to you affect Happiness? Or perhaps you are
interested in finding the most parsimonious model possible, while still retaining predictive power? Another
approach is to look at models for a particular group or groups. Perhaps you would like to filter your dataset to
include only OECD countries? Or perhaps you would like to build different models for developed, developing
and underdeveloped countries? (the World Bank API has this data). Maybe you have some other ideas? Briefly
explain how you will be approaching this regression problem. This will help you to focus your experimentation.

• Experiment: Explore different regression models in a way that is appropriate to your problem formulation.
Experiment with linear and multiple linear regression as appropriate. Consider a form of the step-wise
algorithm. Optionally, look at a polynomial regression (this is not expected).

o Do not use joint plots as a substitute for regression modelling. Zero marks will be given to any model
experimentation that relies on joint plots.
o Do use a module for modelling, and do not code up your regression model from scratch.
o Do consider ‘Year’ as a feature to include in your model.
o Do display model statistics

Note: If you are interested in the predictive power of your model, your best model is likely to include multiple
explanatory variables so don’t waste time bulking out the assignment with single variable models.

Note: when you have more than one explanatory variable in your model, you will not be able to produce the
regression plots from Lecture 4 because they are two dimensional (target vs one explanatory). That is OK.
There are other ways to visualise if you want to produce plots, for instance you could use a visualisation to
compare certain model summary statistics (like RMSE, prob(F), RSq) that you have collated into a dataframe
from multiple different model outputs.

• Write elegant code: Experimenting with many different models will involve repetition of code so employ loops
and functions for model creation and evaluation. Functions and loops = less code = easier to read reports and
easier and more effective experimentation.

• Evaluate/interpret: To compare models, model outputs must be interpreted. For instance, the probability of
the F statistic tells us whether there is a significant relationship between the response and explanatory variables
as expressed by the model. R-Squared tells us about the strength of that relationship (and how good our model
would be for prediction). Consider the coefficients for your explanatory variables – are they significant and
doing heavy lifting in the model, or are they surprisingly superfluous? Can the coefficients be interpreted or is
multicollinearity an issue? You may like to calculate RMSE and interpret that in context.

• Present preferred/final model: settle on a preferred or final model for further inspection.

o Residuals: Produce a plot of residuals and fitted values and explain whether it is likely that this model
fulfils the necessary assumptions of homoscedasticity (homoscedastic residuals should not fan out) and
linearity (the residuals should randomly scatter around the fitted line and not follow a curved shape).
You could find code for this online, or you could look up the code in the exercise hints for Lecture 4.
For the purposes of this assignment you are not expected to analyse the residuals beyond a visual
inspection. We would usually inspect residuals before interpreting any model output. That
requirement is waived here to pare down the scope.
o Describe what the coefficients of the model mean, remembering to mention what units they are in
(eg sealevel = 0.58*temp_celsius : ‘for every degree Celsius increase in average global temperature,
sea level rises by 58 centimetres’).
o Explain how reliable the model was. Was it a good fit and good for prediction? How did the residuals
look, do you think they conformed well enough with assumptions? Could you recommend this
predictive model to a client?
o Optional - Plot the confidence intervals and prediction bands for that model and describe what they
tell you (there are no extra marks for this option)

Note: As we do not delve deeply into statistics in this course, and to keep the assignment scope manageable, we will
not be holding your work in this assignment to a high statistical standard (for instance, looking for outliers, high
leverage points, inspection of residuals etc). We are more interested in you demonstrating some curiosity, your
ability to use the tools provided and showing that you can select good predictive features and evaluate a model.
Task 3 Marking:
❖ 4 marks for problem formulation
❖ 14 marks for model experimentation
❖ 5 marks for elegance of code (use of loops/functions)
❖ 10 marks for appropriate interpretations
❖ 11 marks for presentation of preferred model:
o Residuals plot – 4 marks
o Interpretation of residuals plot – 2 marks
o Coefficient explanation – 2 marks
o Discussion of model reliability – 3 marks

TASK 4: PRESENTATION - ‘REPORT-ERIZE’ YOUR WORK (7 MARKS)
Go back through what you have done and turn your Assignment 2 work into something that looks like a report that you
could hand to a client (a technically savvy client as you still need to include your scripting for marking). Include a brief
introduction that describes the modelling problem you formulated and a brief description of the datasets that you use,
and a conclusion. Use formatted mark down boxes that include headings. It is OK to include text that clearly delineates
the different tasks of the assignment (eg ‘Task 1b’). In fact, any formatting that makes the task of marking easier would
be most appreciated.
Clear out any unnecessary code and outputs that clutter your work. Run your text through a spell checker extension. See
the end of assignment 3 for more tips on how to tidy up a report.
HAND-IN:
Zip-up all your notebooks, python files and dataset(s) into a single file. Submit this file via Stream. Make sure that your
Jupyter notebook has been run with all outputs visible. Download an HTML version of your notebook (with outputs
showing) and include this in your zip file.

ASSIGNMENT 3 STARTS ON THE NEXT PAGE

ASSIGNMENT 3: KNN REGRESSION, SUPERVISED AND UNSUPERVISED LEARNING
PROJECT OUTLINE
In this project you will be producing another Jupyter Notebook report. This project requires that you apply techniques
taught so far to build either kNN regression models or supervised learning models. You will also build unsupervised
learning models. You will be using the dataset you developed in Assignment 2, which you may optionally expand. If
you choose the supervised learning option, you may use a different dataset of your choosing.
You do not need to repeat any of the analysis from assignment 2. Consider assignment 3 to be an extension of the work
you did in assignment 2.
You may nonetheless find that further data wrangling and analysis is required to pick and use features for modelling in
assignment 3. If that is the case, then this should be included and will be considered in the marking.
TASK 1 – IMPORT THE CSV YOU SAVED IN TASK 2A) OF ASSIGNMENT 2 (NO MARKS FOR THIS)
TASK 2 – BUILD KNN REGRESSION MODELS OR SUPERVISED LEARNING MODELS (50 MARKS)
OPTION 1 – KNN REGRESSION MODELS
• Formulate: Using your assignment 2 dataset, creatively formulate a problem that enables you to perform kNN
regression for prediction. It is acceptable if this problem is the same problem you explored in your regression
analysis in assignment 2. Describe this problem in your introduction.
• Model: Experiment with models for this prediction containing different subsets of features.
Modelling expectations (what we are looking for when marking):
o Scaling of all input variables
o train/test split for all models so that the models can be meaningfully evaluated (train with the training
data, evaluate with the testing data). This is not explicitly done in the kNN regression lecture, but it is
done in the supervised learning lecture. Some guidance on how to achieve this with kNN regression (if
you need it) is included in the appendix.
o Experimentation with feature subsets
o Experimentation with model parameters - different distance metrics and different values of k
• Evaluation and interpretation - Generate, interpret and compare evaluation metrics for your various models.
Ideally, this will involve some visualisations such as plotting metric performance for different models. Consider
questions such as which values of k are most robust for the size of your dataset and your problem domain?
• Discussion - How reliable are your prediction models? Could you recommend any to a client? Would you expect
this model to preserve its accuracy on data beyond the range it was built on?

Note: As with assignment 2, there are plots produced in the kNN regression lecture that cannot be reproduced for
multi-variable models. Please do not let this prevent you from producing models with many variables, as they will
give you the best results. Again, you could consider plotting other model outputs instead.
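The modelling expectations above (scaling, a train/test split, and experimentation over feature subsets and parameters) can be sketched roughly as follows, on synthetic stand-in data with assumed feature names:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = pd.DataFrame({"gdp": rng.normal(size=300), "health": rng.normal(size=300)})
y = 1.2 * X["gdp"] + 0.7 * X["health"] + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Scale using statistics from the training split only, then apply to both.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

scores = {}
for k in (3, 5, 10, 20):
    for metric in ("euclidean", "manhattan"):
        model = KNeighborsRegressor(n_neighbors=k, metric=metric).fit(X_train_s, y_train)
        scores[(k, metric)] = model.score(X_test_s, y_test)   # test-set R^2

results = pd.Series(scores).sort_values(ascending=False)
print(results)
```

Fitting the scaler on the training split alone avoids leaking test-set information into the model; the collated scores series is then a natural input for the comparison plots suggested above.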
OPTION 2 – SUPERVISED LEARNING
• Formulate: Using your assignment 2 dataset, or another dataset of your choosing, creatively formulate a
classification problem, for which you can build supervised learning models. Describe this problem by way of
introduction.
• Explore features: Explore the ability of features to discriminate between your chosen or derived class labels. For
instance, as in the lecture, you could plot histograms or box plots of different features by class label and see if
the distributions are noticeably different. Consider exploring other types of plots.
• Model: Create models using different subsets of features for prediction.
Modelling expectations (what we are looking for when marking):
o Scaling of all input variables
o train/test split for all models so that the models can be meaningfully evaluated (train with the training
data, evaluate with the testing data). There is guidance for how to apply a train test split in this
context in the supervised learning lecture.
o Experimentation with feature subsets and feature selection
o Evaluation and interpretation - Generate, interpret and compare evaluation metrics for your various
models. Ideally, this will involve some visualisations - such as plotting metric performance across
different models. Even better, try cross-validation.
• Discussion - How reliable are your prediction models? Could you recommend any to a client? Would you expect
this model to preserve its accuracy on data beyond the range it was built on?

Note: Feel free to derive and generate new features based on the ones that exist.
Note: If you would like to use your assignment 2 dataset for Option 2, you will need categories to predict. There are
many ways of doing this – you could see whether there are appropriate categorical variables from the World Bank
API or other data sources that you could integrate into the dataset. Alternatively, and more simply, you could derive
labels from an existing variable. For instance, you could create ‘high’, ‘medium’ and ‘low’ happiness labels
according to happiness score (or do something similar with any variable of your choosing that you would like to
predict). If you use an entirely new dataset, we would expect some EDA.
Note: You can use Python’s scikit-learn module for machine learning or try using other algorithms. There are many
other Python implementations of machine learning algorithms, such as neural networks (e.g. PyBrain), which are not
implemented in scikit-learn; you may use these if you wish.
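The derive-labels-then-classify route described above can be sketched as follows (synthetic data, assumed feature names, and kNN as one arbitrary classifier choice among the supervised learners covered):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(3)
df = pd.DataFrame({"gdp": rng.normal(size=300), "health": rng.normal(size=300)})
df["happiness"] = 1.5 * df["gdp"] + 0.8 * df["health"] + rng.normal(scale=0.4, size=300)

# Derive class labels from the continuous score (tertiles -> low/medium/high).
df["label"] = pd.qcut(df["happiness"], q=3, labels=["low", "medium", "high"])

X, y = df[["gdp", "health"]], df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=42, stratify=y)
scaler = StandardScaler().fit(X_train)        # scale on training data only
clf = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_train), y_train)

pred = clf.predict(scaler.transform(X_test))
print("accuracy:", accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
```

Stratifying the split keeps the class proportions comparable across train and test sets, which matters when labels are derived from quantiles as they are here.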
TASK 3 – BUILD UNSUPERVISED LEARNING MODELS (40 MARKS)
• Feature selection: Choose different subsets of features from your assignment 2 dataset for clustering.
• Scale all of the input variables that you will be using
• Perform cluster analyses using the feature sets.
• Visualise, evaluate and interpret your results (there is some basic guidance on visualising clusters in the
appendix)
• Discuss
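The scale-then-cluster steps above might be sketched like this (two synthetic blobs stand in for country feature vectors; feature names are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(11)
# Two synthetic blobs standing in for country feature vectors.
df = pd.DataFrame(np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))]),
                  columns=["gdp", "health"])

X = StandardScaler().fit_transform(df)           # scale before clustering
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
df["cluster"] = km.labels_

print(df.groupby("cluster").mean())              # interpret: what defines each cluster?
```

Comparing per-cluster feature means, as in the final line, is one simple starting point for the evaluate-and-interpret step; the appendix describes a visual alternative.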

TASK 4: PRESENTATION - ‘REPORT-ERIZE’ YOUR WORK (10 MARKS)
Refer to Task 4, Assignment 2 for what to do here.
Assignment 3 Requirements:
The Python code in the submitted notebooks must be entirely self-contained and all the experiments and the graphs
must be replicable. Do not use absolute paths, but instead use relative paths if you need to. Consider hiding away some
of your Python code in your notebook by putting it into .py files that you can import and call. This will help the
readability of your final notebook by keeping the Python code from distracting from your actual findings and discussions.
Do not dump dataframe contents in the notebook – show only 5-10 lines at a time – as this severely affects readability.
You may install and use any additional Python packages you wish that will help you with this project. When submitting
your project, include a README file that specifies what additional python packages you have installed in order to make
your project repeatable on my computer, should I need to install extra modules.
Marking criteria - Marks will be awarded for different components of the project using the following rubric:
Component Marks Requirements and expectations
Task 2 50 Implementation of a train/test split. Scaling of variables. Quality of experimentation,
analysis, interpretation and evaluation of results and conclusions. Appropriate use of
visualisations in evaluations (and feature selection as appropriate to the option chosen).
Task 3 40 Scaling of variables. Quality of experimentation, analysis, interpretation of results and
conclusions. Appropriate use of visualisations in evaluations.
Task 4 10 Report structure, tidiness of code and outputs
Hand-in: Zip-up all your notebooks, python files and dataset(s) into a single file. Submit this file via stream.
APPENDIX
TRAIN TEST SPLIT WITH KNN REGRESSION
Do a train test split of your data before doing any kNN modelling (you will get spurious accuracy results otherwise). To
achieve this, you would need to do something like this:

from sklearn.model_selection import train_test_split  # note: the old 'sklearn.cross_validation' module has been removed
import numpy as np

X = df_std  # the standardised explanatory features
y = np.array(df['Target_variable'])
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

If you plan to borrow the ‘calculate_regression_goodness_of_fit’ function for your analysis, you would need to change
this line:

y_mean = y.mean()
to this:
y_mean = ys.mean()

and then let your common sense guide you as to the other necessary changes to the example code - think about where
the training sets should be used and where the testing sets should be used (hint, the training sets should be used for the
model fit, the testing sets should be used for prediction and goodness of fit calculation)
VISUALISING CLUSTERS
Look at 2 or 3 different features at a time in scatter plots with points coloured according to cluster and see if you can
discern which features were important in defining the clustering (there are other ways of doing this, but for the purposes
of this assignment we are satisfied if you simply look at some visualisations). There is no guarantee that you will be able
to see a clear difference but have a go and show what you have done. Try to describe the effect that each feature has on
the clustering, if it is discernible.
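A possible function for iterating these plots over your feature pairs is sketched below (the frame and labels are synthetic stand-ins; in the example, only 'feat_a' separates the clusters, which is the kind of pattern to look for):

```python
import itertools
import matplotlib
matplotlib.use("Agg")            # file-based backend; no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
# Synthetic example: the clusters differ only along 'feat_a'.
df = pd.DataFrame({"feat_a": np.concatenate([rng.normal(0, 1, 80), rng.normal(5, 1, 80)]),
                   "feat_b": rng.normal(0, 1, 160),
                   "feat_c": rng.normal(0, 1, 160)})
labels = np.repeat([0, 1], 80)   # stand-in for a fitted model's labels_

def plot_feature_pairs(data, labels):
    """One scatter per pair of features, coloured by cluster label."""
    pairs = list(itertools.combinations(data.columns, 2))
    fig, axes = plt.subplots(1, len(pairs), figsize=(4 * len(pairs), 4))
    for ax, (fx, fy) in zip(np.atleast_1d(axes), pairs):
        ax.scatter(data[fx], data[fy], c=labels, cmap="coolwarm", alpha=0.6)
        ax.set_xlabel(fx)
        ax.set_ylabel(fy)
    fig.savefig("cluster_pairs.png")
    return fig

fig = plot_feature_pairs(df, labels)
```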

The examples below are artificial, but I provide them to give you the general idea. In the example on the left, the feature
Y is important in defining the clusters, not X. In the example on the right, X is important in defining the clusters, not Y:

You will likely need to do iterations of plots like this through your features to get an idea of which features may have
been important in defining your clusters (functions are your friend). In reality, it will be a combination of them. If one
feature is really dominant, you should double check that you have scaled your variables.
