Faculty of Information Technology
Semester 1, 2021

FIT5145: Introduction to Data Science
Assignment 1: Description
Due Date: 11:55 PM, Sunday 28 March 2021
The aim of this assignment is to investigate and visualise data using various data science tools.
It will test your ability to:
1. Read data files in R and extract related data from those files;
2. Clean and process data into the required formats;
3. Use various graphical and non-graphical tools to perform exploratory data analysis
and visualisation;
4. Use basic tools for managing and processing data; and
5. Communicate your findings in your report.
Assessment Details:
● Assessment Type: Individual Assignment
● Total marks: 10%
● Due Date: 11:55 PM, Sunday 28 March 2021. Please note that we do not accept
submissions after 4 April 2021 (i.e., 7 days after the due date).
Submission Details:
You will need to submit two separate files (Important Note: submitting a zip file will incur a
penalty of 10%):
1. A report in PDF containing your (a) code, (b) answer, and (c) explanation used to
answer each question. Note that you can use Word or other word processing software
to format your submission. Just save the final copy as a PDF before submitting. Make
sure to include screenshots/images of the graphs you generate in order to justify your
answers to all the questions. Marks will be assigned to reports based on their correctness
and clarity. For example, higher marks will be given to reports containing graphs
with appropriately labelled axes. Make sure that a Turnitin score is generated
properly for your PDF file (we only need the Turnitin score for the PDF file, not the R
code). PDF files which do not have a Turnitin score will be penalised by 20% of
the mark.
2. The R code as an RMarkdown file that you wrote to analyse and plot the data.

Task A: Investigating Natural Increase in Australia's population
In this task, you are required to visualise the relationship between the births, deaths, total
fertility rate (TFR), net overseas migration (NOM) and net interstate migration (NIM) for the
different Australian states/territories, and gain insights into how these relationships and trends
change over time. The data files used in this task were originally downloaded from the Australian
Bureau of Statistics. We have extracted the data from the original files and transformed them
into a simpler format. Please download the data from Moodle:
● Births.csv - This file contains yearly data regarding the recorded number of births by
Australian state/territory of registration between 1977 and 2016.
● Deaths.csv - This file contains yearly data regarding the recorded number of deaths by
Australian state/territory of registration between 1977 and 2016.
● TFR.csv - This file contains yearly data on the recorded average number of births per
woman over her lifetime by each state/territory between 1971 and 2016.
● NOM.csv - This data file contains yearly data on the net gain or loss of population
through immigration (migrant arrivals) to Australia and emigration (migrant
departures) from Australia, for the period between 1977 and 2016.
● NIM.csv - This data file contains yearly data on the net gain or loss of population
through the movement of people from one state or territory to another, for the period
between 1977 and 2016.
A1. Investigating the Births, Deaths and TFR Data
1. Plot the number of births recorded in each Australian state/territory over
different years.
a. Describe the trend in the number of births for Queensland and Tasmania for the
period 1977 to 2016.
b. Draw a bar chart to show the number of births in each Australian state in 2016.
2. Inspect the data on Total Fertility Rate (TFR.csv) for Queensland and Northern
Territory.
a. What was the minimum value for TFR recorded in the dataset for Queensland
and when did that occur?
b. What was the corresponding TFR value for Northern Territory in the same year?
3. Next, plot the natural growth in Australia's population over different years. For this,
you will need to aggregate the total births and deaths by year. (HINT: Natural growth
in a population is the difference between the total numbers of births and deaths in that
population. For instance, Natural Growth of Australia’s Population = Total Births in
Australia - Total Deaths in Australia). Describe the trend in the natural growth of Australia's
population over time.
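
The hint in step 3 can be sketched in base R as follows. This is a minimal illustration on toy data, not the expected solution: the long layout and the column names year, state and count are assumptions, since the actual structure of Births.csv and Deaths.csv is not described here, and the real files may need reshaping first.

```r
# Toy data standing in for Births.csv and Deaths.csv. The long layout
# (year, state, count) is an assumption, not the real file structure.
births <- data.frame(
  year  = c(2015, 2015, 2016, 2016),
  state = c("QLD", "TAS", "QLD", "TAS"),
  count = c(100, 20, 110, 25)
)
deaths <- data.frame(
  year  = c(2015, 2015, 2016, 2016),
  state = c("QLD", "TAS", "QLD", "TAS"),
  count = c(60, 15, 65, 15)
)

# Sum over states to get one national total per year.
total_births <- aggregate(count ~ year, data = births, FUN = sum)
total_deaths <- aggregate(count ~ year, data = deaths, FUN = sum)

# Align the two totals by year, then take the difference.
growth <- merge(total_births, total_deaths, by = "year",
                suffixes = c("_births", "_deaths"))
growth$natural_growth <- growth$count_births - growth$count_deaths
```

The resulting growth data frame has one row per year, so it can be passed to a line plot (for example with plot() and type = "l") with the year on the x-axis and natural growth on the y-axis.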

A2. Investigating the Migration Data (NOM and NIM)
1. Let’s look at the Net Overseas Migration (NOM) data in different states over time.
a. Use R to plot the NOM to Victoria, Tasmania and Western Australia over time.
Explain and compare the trend in all three states (VIC, TAS and WA).
b. Plot the Net Overseas Migration (NOM) to Australia over time. Do you find
the trend strange? Explain the reasoning behind your answer (Hint: You might go online
to find contributing factors to this trend).
2. Now let's look at the relationship between Net Overseas Migration (NOM) and Net
Interstate Migration (NIM).
a. Use R to combine the data from the different files into a single table. The
resulting table should contain the NOM and NIM values for each of the states
for a given year. What are the first year and last year for the combined data?
b. Now that you have the data combined, we can see whether there is a
relationship between NOM and NIM. Plot the values against each other using
a scatter plot. Can you see any relationship between NOM and NIM?
c. Try selecting and plotting the data for Victoria only, using a scatter plot. Can you
see a relationship now? If so, explain the relationship.
d. Finally, plot the Net Interstate Migration (NIM) for Queensland and New South
Wales over different years. Note that the graphs for both QLD and NSW should be on
the same plot. Compare these two states on the plot. What can you infer from
the trend you see for these two states? Discuss your findings.
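
The combining step in question 2a can be sketched with base R's merge(). This is only an illustration under stated assumptions: the toy values, the long layout and the column names year, state, nom and nim are hypothetical, and the real NOM.csv and NIM.csv may first need reshaping into this form.

```r
# Toy data. The long layout (year, state, value) and the column names
# are assumptions, not the real structure of NOM.csv and NIM.csv.
nom <- data.frame(
  year  = c(2014, 2015, 2015, 2016, 2016),
  state = c("VIC", "VIC", "QLD", "VIC", "QLD"),
  nom   = c(50, 55, 30, 60, 35)
)
nim <- data.frame(
  year  = c(2015, 2015, 2016, 2016, 2017),
  state = c("VIC", "QLD", "VIC", "QLD", "VIC"),
  nim   = c(-5, 10, -2, 12, 1)
)

# Inner join: keep only the year/state pairs present in both tables.
combined <- merge(nom, nim, by = c("year", "state"))

# First and last year covered by the combined table.
year_range <- range(combined$year)

# Subset for one state, ready for a scatter plot of nom against nim.
vic <- combined[combined$state == "VIC", ]
```

Because merge() performs an inner join by default, years that appear in only one of the two tables are dropped, which is why the first and last year of the combined data can differ from those of the individual files.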

Task B: Exploratory Analysis of Tweets about Bushfires in Australia

In this task, you are presented with some pre-processed tweets about bushfires in Australia.
Please download the data from Moodle:
● twitter_data.csv
Please refer to Table 1 if you want to know the meaning of each feature/column. For example,
nFollows shows the number of followers that a Twitter user has. A Twitter user who has more
than a thousand followers can be considered as a popular user. It should be noted that NOT
every tweet in the dataset is relevant to the bushfires in Australia, as represented by the value
in the last column (1 denotes relevant and 0 irrelevant tweet).

Table 1: Description of Columns in the Data File

You are required to investigate the features of the twitter dataset. Please clearly label and
explain your R code used to answer each question.

B1. Investigating the Data
Please make sure to understand the dataset and its variables properly before answering
the following questions. You need to have a good insight into the dataset to be able to
understand some of the questions properly and avoid confusion.
1. How many tweets are there altogether in the data file? How many of these tweets were
posted from a verified account?
2. Draw a histogram showing the distribution of #entities extracted from the tweets. Set an
appropriate bin size to present this information.
3. Compute the descriptive statistics (mean, std, quartile1, median, and max) of #entities of
relevant (i.e., relevanceJudge=1) and non-relevant (i.e., with relevanceJudge=0) tweets in
the dataset. Explain any interesting findings.
4. What is the average length of the tweets (in characters) that are judged as relevant? What
is the average length of a non-relevant tweet?
5. To gain further insights into the twitter age of the users, it would be better to group the
twitterAge into categorical bins. Create a new column to show the twitter age group in your
dataframe based on twitterAge by converting it into the following groupings or categories
['0-1', '1-2', '2-3', '3-4', '4-5', '5+'], in which '0-1' refers to ages equal to or greater than 0
and less than 1. We use the same logic for the other age groups.

a. Generate boxplots summarising the distribution of each twitter age group against
their median tweet length. What do you observe? Is there much variation in tweet
length across the age groups?
b. Which age group has the lowest median tweet length and which one has the highest?
State these median values.
c. According to the current bushfire tweet dataset, which age group is the most active on
twitter (has posted the most tweets)? (Note: Each record in the dataframe is a tweet).
d. Create a plot showing the total number of tweets posted by each age group (from
Part [c] above).
e. Which age group on average has the highest number of followers on twitter?
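
The binning described in question 5 can be sketched with base R's cut(). This is a minimal example on toy data, not the expected solution: twitterAge is the column named in the task, while the column name length and all values are hypothetical (check Table 1 for the real column names).

```r
# Toy data. twitterAge comes from the task description; the column
# name "length" and all values are assumptions made for illustration.
tweets <- data.frame(
  twitterAge = c(0.5, 1.0, 1.7, 2.2, 4.9, 5.0, 8.3),
  length     = c(120, 80, 100, 140, 90, 60, 110)
)

# right = FALSE makes each bin closed on the left and open on the
# right, so 1.0 falls in "1-2" and 5.0 falls in "5+".
tweets$ageGroup <- cut(
  tweets$twitterAge,
  breaks = c(0, 1, 2, 3, 4, 5, Inf),
  labels = c("0-1", "1-2", "2-3", "3-4", "4-5", "5+"),
  right  = FALSE
)

# Number of tweets per age group (each row is one tweet).
tweets_per_group <- table(tweets$ageGroup)

# Median tweet length per age group; empty groups yield NA.
median_length <- tapply(tweets$length, tweets$ageGroup, median)
```

Setting right = FALSE matches the stated rule that '0-1' covers ages from 0 up to, but not including, 1. The ageGroup factor can then be used as the grouping variable in boxplot() or barplot().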
