Faculty of Information Technology
Semester 1, 2021
FIT5145: Introduction to Data Science
Assignment 1: Description
Due Date: 11:55 PM, Sunday 28 March 2021

The aim of this assignment is to investigate and visualise data using various data science tools. It will test your ability to:
1. Read data files in R and extract related data from those files;
2. Clean and process data into the required formats;
3. Use various graphical and non-graphical tools to perform exploratory data analysis and visualisation;
4. Use basic tools for managing and processing data; and
5. Communicate your findings in your report.

Assessment Details:
● Assessment Type: Individual Assignment
● Total marks: 10%
● Due Date: 11:55 PM, Sunday 28 March 2021. Please note that we do not accept submissions after 4 April 2021 (i.e., 7 days after the due date).

Submission Details:
You will need to submit two separate files (Important Note: Zip file submission will incur a penalty of 10%):
1. A report in PDF containing your (a) code, (b) answer, and (c) explanation used to answer each question. Note that you can use Word or other word processing software to format your submission. Just save the final copy to a PDF before submitting. Make sure to include screenshots/images of the graphs you generate in order to justify your answers to all the questions. (Marks will be assigned to reports based on their correctness and clarity; for example, higher marks will be given to reports containing graphs with appropriately labelled axes.) Make sure that the Turnitin score is generated properly for your PDF file (we only need the Turnitin score for the PDF file, not the R code). PDF files which do not have a Turnitin score will be penalised by 20% of the mark.
2. The R code as an RMarkdown file that you wrote to analyse and plot the data.

Task A: Investigating Natural Increase in Australia's population

In this task, you are required to visualise the relationship between the births, deaths, total fertility rate (TFR), net overseas migration (NOM) and net interstate migration (NIM) for the different Australian states/territories, and gain insights into how these relations and trends change over time. The data files used in this task were originally downloaded from the Australian Bureau of Statistics. We have extracted the data from the original files and transformed them into a simpler format. Please download the data from Moodle:
● Births.csv - This file contains yearly data regarding the recorded number of births by Australian state/territory of registration between 1977 and 2016.
● Deaths.csv - This file contains yearly data regarding the recorded number of deaths by Australian state/territory of registration between 1977 and 2016.
● TFR.csv - This file contains yearly data on the recorded average number of births per woman over her lifetime by each state/territory between 1971 and 2016.
● NOM.csv - This data file contains yearly data on the net gain or loss of population through immigration (migrant arrivals) to Australia and emigration (migrant departures) from Australia, for the period between 1977 and 2016.
● NIM.csv - This data file contains yearly data on the net gain or loss of population through the movement of people from one state or territory to another, for the period between 1977 and 2016.

A1. Investigating the Births, Deaths and TFR Data
1. Plot the number of births recorded in each Australian state/territory over the years.
   a. Describe the trend in the number of births for Queensland and Tasmania for the period 1977 to 2016.
   b. Draw a bar chart to show the number of births in each Australian state in 2016.
2. Inspect the data on Total Fertility Rate (TFR.csv) for Queensland and Northern Territory.
   a. What was the minimum value for TFR recorded in the dataset for Queensland, and when did that occur?
   b. What was the corresponding TFR value for Northern Territory in the same year?
3. Next, plot the natural growth in Australia's population over the years. For this, you will need to aggregate the total births and deaths by year. (HINT: Natural growth in a population is the difference between the total numbers of births and deaths in a population. For instance, Natural Growth of Australia's Population = Total Births in Australia - Total Deaths in Australia.) Describe the trend in natural growth in the Australian population over time.

A2. Investigating the Migration Data (NOM and NIM)
1. Let's look at the Net Overseas Migration (NOM) data in different states over time.
   a. Use R to plot the NOM to Victoria, Tasmania and Western Australia over time. Explain and compare the trend in all three states (VIC, TAS and WA).
   b. Plot the Net Overseas Migration (NOM) to Australia over time. Do you find the trend strange? Explain the reasoning behind your answer (Hint: You might go online to find contributing factors to this trend).
2. Now let's look at the relationship between Net Overseas Migration (NOM) and Net Interstate Migration (NIM).
   a. Use R to combine the data from the different files into a single table. The resulting table should contain the NOM and NIM values for each of the states for a given year. What are the first year and last year for the combined data?
   b. Now that you have the data combined, we can see whether there is a relationship between NOM and NIM. Plot the values against each other using a scatter plot. Can you see any relationship between NOM and NIM?
   c. Try selecting and plotting the data for Victoria only, using a scatter plot. Can you see a relationship now? If so, explain the relationship.
   d. Finally, plot the Net Interstate Migration (NIM) for Queensland and New South Wales over the years.
      Note that the graphs for both QLD and NSW should be on the same plot. Compare these two states on the plot. What can you infer from the trend you see for these two states? Discuss your findings.

Task B: Exploratory Analysis of Tweets about Bushfires in Australia

In this task, you are presented with some pre-processed tweets about bushfires in Australia. Please download the data from Moodle:
● twitter_data.csv

Please refer to Table 1 if you want to know the meaning of each feature/column. For example, nFollows shows the number of followers that a Twitter user has. A Twitter user who has more than a thousand followers can be considered a popular user. It should be noted that NOT every tweet in the dataset is relevant to the bushfires in Australia, as represented by the value in the last column (1 denotes a relevant tweet and 0 an irrelevant tweet).

Table 1: Description of Columns in the Data File

You are required to investigate the features of the Twitter dataset. Please clearly label and explain the R code used to answer each question.

B1. Investigating the Data
Please make sure you understand the dataset and its variables properly before answering the following questions. You need a good insight into the dataset to understand some of the questions properly and avoid confusion.
1. How many tweets are there altogether in the data file? How many of these tweets were posted from a verified account?
2. Draw a histogram showing the distribution of #entities extracted from the tweets. Set an appropriate bin size to present this information.
3. Compute the descriptive statistics (mean, std, quartile1, median, and max) of #entities of relevant (i.e., relevanceJudge=1) and non-relevant (i.e., relevanceJudge=0) tweets in the dataset. Explain any interesting findings.
4. What is the average length of the tweets (in characters) that are judged as relevant? What is the average length of a non-relevant tweet?
5. To gain further insights into the Twitter age of the users, it would be better to group the twitterAge into categorical bins. Create a new column to show the Twitter age group in your dataframe based on twitterAge by converting it into the following groupings or categories ['0-1', '1-2', '2-3', '3-4', '4-5', '5+'], in which '0-1' refers to ages equal to or older than 0 and younger than 1. We use the same logic for the other age groups.
   a. Generate boxplots summarising the distribution of each Twitter age group against their median tweet length. What do you observe? Is there much variation in tweet length across the age groups?
   b. Which age group has the lowest median tweet length and which one has the highest? State these median values.
   c. According to the current bushfire tweet dataset, which age group is most active on Twitter (has posted the most tweets)? (Note: Each record in the dataframe is a tweet.)
   d. Create a plot showing the total number of tweets posted by each age group (from Part [c] above).
   e. Which age group on average has the highest number of followers on Twitter?
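
The binning rule in B1 question 5 (each group includes its lower bound and excludes its upper bound) can be sketched in R as follows. This is only an illustrative sketch, not a prescribed solution: the data frame `tweets`, its values, and the column names `tweetLength` and `ageGroup` are assumptions made up for the example; only `twitterAge` and the group labels come from the task description.

```r
# Toy data: values are invented for illustration and are NOT taken from
# twitter_data.csv. `tweetLength` and `ageGroup` are hypothetical column names.
tweets <- data.frame(
  twitterAge  = c(0.0, 0.5, 1.0, 2.7, 4.99, 5.0, 9.3),
  tweetLength = c(80, 120, 95, 140, 60, 110, 130)
)

# right = FALSE makes every bin left-closed and right-open:
# [0,1), [1,2), [2,3), [3,4), [4,5), [5,Inf)
tweets$ageGroup <- cut(
  tweets$twitterAge,
  breaks = c(0, 1, 2, 3, 4, 5, Inf),
  labels = c("0-1", "1-2", "2-3", "3-4", "4-5", "5+"),
  right  = FALSE
)

# Number of tweets per age group (each row is one tweet)
groupCounts <- table(tweets$ageGroup)

# Median tweet length per age group (NA for a group with no tweets)
groupMedians <- tapply(tweets$tweetLength, tweets$ageGroup, median)

print(groupCounts)
print(groupMedians)
```

With `right = FALSE`, an age of exactly 1 falls into '1-2' rather than '0-1', which matches the "equal to or older than 0 and younger than 1" wording. Values outside the breaks (for example, a negative age) would become `NA` and may need separate handling.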