辅导案例-FIT1043-Assignment 2

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

FIT1043 Assignment 2 Description
Due date: Friday 16 Oct 2020 - 11:55 pm
This is an individual assessment and worth 20% of your total mark for FIT1043. You
need to use Python in this assignment.
Aim
This assignment aims to analyse, visualise, and model data using Python. It will test your
ability to:
1. Use Python to transform data and extract information
2. Use various graphical and non-graphical tools for performing exploratory data
analysis and visualisation
3. Understand the strengths and weaknesses of certain visualisation tools
4. Implement and interpret the results of machine learning models in Python

Data
1. The data set “CityPairs.csv” contains information about the scheduled operations of
international airlines operating to and from Australia. Data mainly shows the
passengers, freight and mail carried between city pairs connected by a single flight
number service. The description of each data column is shown in Table 1.
Table1: Description of the “CityPairs” columns
Column Name Description
Month A unique identifier for Year-Month
AustralianPort Australian port where traffic is uplifted or
discharged within a single flight number
ForeignPort The foreign port where traffic is uplifted or
discharged within a single flight number
Country Based on the international uplift or discharge port
within a single flight number
Passengers_In Number of passengers inbound to Australia
Freight_In_(tonnes) Freight inbound to Australia in tonnes
Mail_In_(tonnes) Mail inbound to Australia in tonnes
Passengers_Out Number of passengers outbound from Australia
Freight_Out_(tonnes) Freight outbound from Australia in tonnes
Mail_Out_(tonnes) Mail outbound from Australia in tonnes
Passengers_Total Passengers_In+ Passengers_Out
Freight_Total_(tonnes) Freight_In_(tonnes)+Freight_Out_(tonnes)
Mail_Total_(tonnes) Mail_In_(tonnes)+Mail_Out_(tonnes)
Year Year ranging from 1985 to 2016
Month_num Month ranging from 1 to 12
2. The data set “ClusteringData.csv” contains information about the ‘suicide rate’ and
‘GDP per capita’.
Hand-in Requirements
Please hand in two files including a PDF file containing your answer, and a Jupyter notebook
file (.ipynb) containing your Python code to all the questions respectively. You need to
consider the following cases for your submission:
1. PDF file should contain:
Answers to the questions. Make sure to include screenshots/images of the graphs
you generate in your report (you will need to use screen-capture functionality to
create appropriate images). Moreover, please include your Python code, not the
screenshot of your codes, to justify your answers to all the questions. The Turnitin
would not be generated if you include a screenshot of your codes and you will lose
20% of the assignment mark if you include a screenshot of the codes instead of
writing/copying your codes.
To generate a pdf report, you can use Word to write your report, but you need to
convert it to PDF before your submission. Alternatively, an easier way is to generate
a pdf version of your Jupyter notebook by using Ctrl+P in the Jupyter notebook.
This pdf file is a mandatory requirement to check the Turnitin by Monash
University.

2. Ipynb file should contain:
Your Python codes for this assignment.

You will need to submit two separate files. “Zip”, “rar” or any other similar file
compression format have a penalty of 10% of your assignment mark as they can not be
processed by Turnitin system automatically.
You will be penalized by 5% of the assignment mark (5% out of 20 marks) if you submit
after the due date for every day that you are late. If you could not submit your assignment
before the due date, please make sure to submit your files at most 7 days after the
assignment due date, we do not mark assignments which will be submitted after 23th of
October 11:55 pm.

Assignment Tasks
There are three parts that you need to complete for this assignment. Parts A and B involve
graphing and fitting regression lines to the “CityPairs.csv” data in Python, as well as analysing
the results by answering further questions. Part C involves creating a proper clustering model
for “ClusteringData.csv” data. There are two challenge questions which you need to answer
if you want to get 85 and above.

Part A - Analysing Mail Flow in Australian Capital Cities
We will investigate the amount of traffic flowing to/from different Australian State Capitals in
the following questions. The Australian State Capital ports are as follows: Adelaide, Brisbane,
Darwin, Hobart, Melbourne, Perth, and Sydney.
Remember to add proper labels and titles to all plots in this assignment.
1. We want to explore the mail traffic flow of each Australian capital port. To calculate
the mail traffic flow, you need to calculate the total Mail_In for each of the ports as
well as the total Mail_Out for each of them. Create a bar chart which shows the total
Mail_In and total Mail_Out for each of the ports and answer the following questions.
1.1. Which city has the largest amount of mail flowing in?
1.2. Can you properly compare the values for all cities by looking at the plot? Why?
1.3. Why do you think the mail traffic amount is significantly higher for some of the
ports?
2. Create a line chart to show the trend of total mail traffic to each of the following ports
against year: Perth and Brisbane.
2.1. How was the mail package traffic to each of the Brisbane and Perth ports in the
mid-80s?
2.2. How was the total mail traffic to Brisbane and Perth in 2016? (You should
analyse the data and see if 2016 has anything different from the previous years
or not. Make sure to use the output of the function “describe()” of Pandas data
frame to answer this question).
Part B1 - Linear Regression and Prediction
We explore the annual freight at all Australian state capital ports against the year in this part.
Create a scatter plot in Python showing the total annual freight in tonnes at all Australian
state capital ports against the year and answer the following questions.
1. Does the data show a clear pattern? Describe the relationship you observe.
2. Are there any outliers? If so, use the IQR rule to remove the outliers. IQR rule
is a simple method which can help you to detect the outliers based on the output
of function “describe()”.
3. Create a simple linear regression to model the relationship between Year and
the Total Freight. Does the linear fit look to be a good fit? Justify your answer.
4. How fast is the total amount of freight increasing each year? [Hint: Think about
what parameter in the regression model represents the rate of change]
5. What does the linear model predict for the total freight volume at Australian
state capital ports in 2020?
6. Try fitting the linear model only to the data from the year 2005 onwards. What
happens to the prediction for 2020? Which prediction could you trust more?
Why?
Part B2 - Comparing Traffic Volumes
We explore the distribution of total monthly passenger traffic at all of the Australian ports in
this question.
1. You first need to calculate the total number of Passengers_In and the total
number of Passengers_Out for each unique month over the years for all the
Australian ports. (i.e consider each month of each year as a unique month to
calculate the total number of passengers. You can make use of the column
“Month” for this). Next, create histograms to check the distribution of monthly
Passengers_In and monthly Passengers_Out. Describe the distributions. Can
you see any outliers in the plots? Discuss your answer.
2. Use boxplots to visualise the information of question 1. How many outliers
can you see in the plots? Use the IQR rule to show the data points which are
considered as outliers of monthly Passengers_In and monthly Passengers_Out)
3. As you can see in question 2, the information which is provided by a boxplot is
so similar to the information which we saw in the output of the function
“describe()”. However, they are not the same. What are the differences between
the information which are shown in a box plot and the output of function
“describe()”?
Part C - Clustering task
Use the dataset ‘ClusteringData.csv’ to cluster the dataset using KMeans. Try different values
of K and visualise your clusters. What is the best value of K based on your visualisation? Why
do you think it is the best value for K? Describe the clusters which you see in your visualisation
for the best value of K.
Challenge1:
An important challenge in KMeans is about how to evaluate the quality of clusters. As there is
no dependent variable for clustering models, we cannot check the accuracy or error of our
model and we need other approaches to check the quality of the models. There are many other
metrics/approaches which you can use to evaluate the performance of a clustering model;
Silhouette score is one of them.
1. Explain how the Silhouette score works and what is the meaning of having the
following results as a Silhouette score?
1.1. Silhouette Score is 0.02
1.2. Silhouette Score is -0.06
1.3. Silhouette Score is 0.97
1.4. Silhouette score is -0.9
2. Implement Silhouette score in Sklearn and find out the best number of clusters (K)
based on the silhouette score for ‘ClusteringData.csv’.
Challenge 2:
Imagine you are going to start working as a data scientist in a company. One of your tasks is
to check if you can create a linear/polynomial regression model for a small dataset which has
an independent variable and a dependent variable. As you have a small dataset, you would first
plot the data and see the following relationship between your dependent and independent
dataset.
Then, you would decide to use a transformation and transform the data to use a regression
model. After applying the transformation, you plot the data again and this time you see the
following relationship.
1. What is the transformation which you used?
2. Do you think your decision to transform the data with that transformation is a good
idea? Discuss your answer.
Good Luck!

欢迎咨询51作业君