程序代写案例-ETC1010-Assignment 1

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

Assignment 1 ETC1010 - 5510
New South Wales Crime Incidents Report
Your name
Monday, April 12 2021
Instructions to Students
This assignment is designed to simulate a scenario in which you are taking over someone’s existing work and
continuing with it to draw some further insights.
This is a real world dataset taken from the New South Wales Bureau of Crime Statistics and Research.
The data can be found here at https://www.bocsar.nsw.gov.au/Documents/Datasets/SuburbData.zip.
Specifically, the data file called “SuburbData2019csv” located in your data folder inside the RStudio project
will be used for this assignment.
You have just joined a consulting company as a data scientist. To give you some experience and guidance,
you are performing a quick summary of the data while answering a number questions that the chief business
analytics leader has. This is not a formal report, but rather something you are giving to your manager that
describes the data with some interesting insights.
Please make sure you read the hints throughout the assignment to help guide you on the tasks.
The points allocated for each of the elements in the assignment are marked next to the code for each question.
Marking + Grades
• This assignment will be worth 10% of your total grade, and is marked out of 116 marks total. Due
on: Friday 26 March.
For this assignment, you will need to upload the following into Moodle:
- Your Rmd file,
- The rendered html file, and
- The PDF rendered file.
How to find help from R functions?
Remember, you can look up the help file for functions by typing: ?function_name. For example, ?mean.
Feel free to google questions you have about how to do other kinds of plots, and post on the “Assignment
Discussion Forum” any questions you have about the assignment.
How to complete this assignment?
To complete the assignment, you will need to fill in the blanks with appropriate function names, arguments,
or other names. These sections are marked with ___. At a minimum, your assignment should be able
to be “knitted” using the Knit button for your Rmarkdown document.
If you want to look at what the assignment looks like in progress with some of the R codes remaining invalid
in the R code chunks, remember that you can set the R chunk options to eval = FALSE like so:
1
```{r this-chunk-will-not-run, eval = FALSE} `r''`
ggplot()
```
If you use eval = FALSE or cache = TRUE, please remember to ensure that you have set to eval
= TRUE when you submit the assignment, to ensure all your R codes run.
There are a few tricky bits that might require you to look back into your previous R code chunks (that is
intentionally done for you to understand how things work within an Rmd file!)
You will be completing this assignment INDIVIDUALLY.
Due Date
This assignment is due in by close of business (5pm) on Friday, 26 March 2021. You will submit the assignment
via Moodle. Please make sure you add your name on the YAML part of this Rmd file.
Treatment
You work as a data scientist in the well-named consulting company, “Consulting for You”.
It’s your second day at the company, and you’re taken to your desk. Your boss says to you:
We have a data set with the crime statistics in New South Wales for the past years!
We’ve got a meeting coming up soon to get insights about the crime in NSW. We want you to
tell us about this data set and what we can do with it.
You’re in with the new hires of data scientists here. We’d like you to take a look at the data and
tell me what the spreadsheet tells us. I’ve written some questions on the report for you to answer.
Most importantly, can you get this to me by 5pm, Friday, 26 March 2021.
Please read below and answer all the questions (ensure that you can knit the file to produce an
html file and a PDF file to hand them in to me via Moodle):
Load all the libraries that you need here
library(tidyverse)
Reading and preparing data
crime_dat <- read_csv("data/SuburbData2019.csv")
# I am selecting here only a portion of the data
# to reduce computation times.
crime_data <-crime_dat %>%
select(-c(`Jan 1995`:`Jan 2010`)) %>%
dplyr::filter(Suburb %in% c("Chifley",
"Redfern",
"Clare",
"Coogee",
"Paddington",
"Redfern",
"Zetland",
2
"Claymore",
"Congo",
"Yenda",
"Young",
"Yarra",
"Woodcroft",
"Woodhill",
"Warri",
"Waterloo",
"Randwick"))
Question 1: Display the first 10 rows of the data set
Hint: Check ?head in your R console
head(crime_data, 10) # 1pt
## # A tibble: 10 x 122
## Suburb `Offence categor~ Subcategory `Feb 2010` `Mar 2010` `Apr 2010`
##
## 1 Chifley Homicide Murder * 0 0 0
## 2 Chifley Homicide Attempted murder 0 0 0
## 3 Chifley Homicide Murder accessory,~ 0 0 0
## 4 Chifley Homicide Manslaughter * 0 0 0
## 5 Chifley Assault Domestic violence~ 1 0 1
## 6 Chifley Assault Non-domestic viol~ 2 0 0
## 7 Chifley Assault Assault Police 0 0 0
## 8 Chifley Sexual offences Sexual assault 1 0 0
## 9 Chifley Sexual offences Indecent assault,~ 0 0 0
## 10 Chifley Abduction and ki~ Abduction and kid~ 0 0 0
## # ... with 116 more variables: May 2010 , Jun 2010 , Jul 2010 ,
## # Aug 2010 , Sep 2010 , Oct 2010 , Nov 2010 ,
## # Dec 2010 , Jan 2011 , Feb 2011 , Mar 2011 ,
## # Apr 2011 , May 2011 , Jun 2011 , Jul 2011 ,
## # Aug 2011 , Sep 2011 , Oct 2011 , Nov 2011 ,
## # Dec 2011 , Jan 2012 , Feb 2012 , Mar 2012 ,
## # Apr 2012 , May 2012 , Jun 2012 , Jul 2012 ,
## # Aug 2012 , Sep 2012 , Oct 2012 , Nov 2012 ,
## # Dec 2012 , Jan 2013 , Feb 2013 , Mar 2013 ,
## # Apr 2013 , May 2013 , Jun 2013 , Jul 2013 ,
## # Aug 2013 , Sep 2013 , Oct 2013 , Nov 2013 ,
## # Dec 2013 , Jan 2014 , Feb 2014 , Mar 2014 ,
## # Apr 2014 , May 2014 , Jun 2014 , Jul 2014 ,
## # Aug 2014 , Sep 2014 , Oct 2014 , Nov 2014 ,
## # Dec 2014 , Jan 2015 , Feb 2015 , Mar 2015 ,
## # Apr 2015 , May 2015 , Jun 2015 , Jul 2015 ,
## # Aug 2015 , Sep 2015 , Oct 2015 , Nov 2015 ,
## # Dec 2015 , Jan 2016 , Feb 2016 , Mar 2016 ,
## # Apr 2016 , May 2016 , Jun 2016 , Jul 2016 ,
## # Aug 2016 , Sep 2016 , Oct 2016 , Nov 2016 ,
## # Dec 2016 , Jan 2017 , Feb 2017 , Mar 2017 ,
## # Apr 2017 , May 2017 , Jun 2017 , Jul 2017 ,
## # Aug 2017 , Sep 2017 , Oct 2017 , Nov 2017 ,
## # Dec 2017 , Jan 2018 , Feb 2018 , Mar 2018 ,
3
## # Apr 2018 , May 2018 , Jun 2018 , Jul 2018 ,
## # Aug 2018 , ...
Question 2: How many variables and observations do we have?
Hint: Look for help ?dim in your R console and remember that variables are in columns and observations in
rows. dim() returns the number of rows and the number of columns in the data set (in that order)
dim(crime_data) # 1pt
## [1] 992 122
The number of variables are 122 (1pt) and the number of rows are 992 (1pt)
Question 3: What are the names of the first 20 variables in this
data set?
names(crime_data)[1:20] # 1pt
## [1] "Suburb" "Offence category" "Subcategory" "Feb 2010"
## [5] "Mar 2010" "Apr 2010" "May 2010" "Jun 2010"
## [9] "Jul 2010" "Aug 2010" "Sep 2010" "Oct 2010"
## [13] "Nov 2010" "Dec 2010" "Jan 2011" "Feb 2011"
## [17] "Mar 2011" "Apr 2011" "May 2011" "Jun 2011"
Question 4: Rename the variable of “Offence category” to “Of-
fence_category” and show the names of the first 4 variables in the
data set
crime <- crime_data %>%
rename(Offence_category = `Offence category`) # 1pt
names(crime)[1:4] # 1pt
## [1] "Suburb" "Offence_category" "Subcategory" "Feb 2010"
Question 5: Change the “crime” data (“SuburbData2019csv”) into
long format so that all the years are grouped together into a variable
called “year” and the corresponding incidents count into a variable
called “incidents”
crime_long <- crime %>%
pivot_longer(cols = `Feb 2010`:`Dec 2019`, # 2pt
names_to = "year", # 1pt
values_to = "incidents") # 1pt
head(crime_long) # 1pt
## # A tibble: 6 x 5
4
## Suburb Offence_category Subcategory year incidents
##
## 1 Chifley Homicide Murder * Feb 2010 0
## 2 Chifley Homicide Murder * Mar 2010 0
## 3 Chifley Homicide Murder * Apr 2010 0
## 4 Chifley Homicide Murder * May 2010 0
## 5 Chifley Homicide Murder * Jun 2010 0
## 6 Chifley Homicide Murder * Jul 2010 0
Question 6: Separate the column “year” into two columns with
names “Month” and “Year”. Display the first 3 lines of the data
set to show the updated data set
crime_long_new <- crime_long %>%
separate(col = year, # 1pt
into = c("Month", "Year"), " " ) # 2pt
head(crime_long_new, n= 3) # 1pt
## # A tibble: 3 x 6
## Suburb Offence_category Subcategory Month Year incidents
##
## 1 Chifley Homicide Murder * Feb 2010 0
## 2 Chifley Homicide Murder * Mar 2010 0
## 3 Chifley Homicide Murder * Apr 2010 0
Question 7: If you look at the data crime_long_new, you will
notice that the variable of “Year” is coded as character. In this
section, we are going to convert the variable of “Year” to a numeric
variable
crime_long_new %>%
mutate(Year = as.numeric(Year)) # 1pt
## # A tibble: 118,048 x 6
## Suburb Offence_category Subcategory Month Year incidents
##
## 1 Chifley Homicide Murder * Feb 2010 0
## 2 Chifley Homicide Murder * Mar 2010 0
## 3 Chifley Homicide Murder * Apr 2010 0
## 4 Chifley Homicide Murder * May 2010 0
## 5 Chifley Homicide Murder * Jun 2010 0
## 6 Chifley Homicide Murder * Jul 2010 0
## 7 Chifley Homicide Murder * Aug 2010 0
## 8 Chifley Homicide Murder * Sep 2010 0
## 9 Chifley Homicide Murder * Oct 2010 0
## 10 Chifley Homicide Murder * Nov 2010 0
## # ... with 118,038 more rows
5
head(crime_long_new) # 1pt
## # A tibble: 6 x 6
## Suburb Offence_category Subcategory Month Year incidents
##
## 1 Chifley Homicide Murder * Feb 2010 0
## 2 Chifley Homicide Murder * Mar 2010 0
## 3 Chifley Homicide Murder * Apr 2010 0
## 4 Chifley Homicide Murder * May 2010 0
## 5 Chifley Homicide Murder * Jun 2010 0
## 6 Chifley Homicide Murder * Jul 2010 0
Question 8: Display the years in the data set. How many years are
included in this data set?
Remember that you can learn more about what these functions by typing: ?unique or ?length
into the R console.
unique(crime_long_new$Year) # 1pt
## [1] "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017" "2018" "2019"
# length tell us the length or longitude of a variable or a vector
length(unique(crime_long_new$Year)) # 1pt
## [1] 10
Question 9: How many different suburbs are there in the data set?
length(unique(crime_long_new$Suburb)) # 1pt
## [1] 16
n_distinct(crime_long_new$Suburb) # 1pt
## [1] 16
Question 10: How many incidents do we have per “Offence_category”
in total for 2019?
crime_long_new %>%
dplyr::filter(Year == "2019") %>% # 1pt
count(Offence_category, wt = incidents) # 1pt
## # A tibble: 21 x 2
## Offence_category n
## *
## 1 Abduction and kidnapping 1
## 2 Against justice procedures 1950
## 3 Arson 60
## 4 Assault 1396
## 5 Betting and gaming offences 1
## 6 Blackmail and extortion 2
6
## 7 Disorderly conduct 429
## 8 Drug offences 1416
## 9 Homicide 2
## 10 Intimidation, stalking and harassment 566
## # ... with 11 more rows
Question 11: Which is the “Offence_category” with highest num-
ber of incidents in 2019?
crime_long_new %>%
dplyr::filter(Year == "2019") %>% # 1pt
count(Offence_category, wt = incidents, sort = TRUE) # 1pt
## # A tibble: 21 x 2
## Offence_category n
##
## 1 Theft 4061
## 2 Against justice procedures 1950
## 3 Drug offences 1416
## 4 Assault 1396
## 5 Malicious damage to property 1093
## 6 Intimidation, stalking and harassment 566
## 7 Transport regulatory offences 517
## 8 Disorderly conduct 429
## 9 Liquor offences 356
## 10 Sexual offences 273
## # ... with 11 more rows
Question 12: How many offences are there in each Subcategory of
the “Offence_category” of Homicide?
crime_long_new %>%
dplyr::filter(Offence_category == "Homicide") %>% # 1pt
group_by(Subcategory) %>% # 1pt
summarise(Number_of_incidents = sum(incidents)) # 1pt
## # A tibble: 4 x 2
## Subcategory Number_of_incidents
## *
## 1 Attempted murder 3
## 2 Manslaughter * 1
## 3 Murder * 14
## 4 Murder accessory, conspiracy 1
7
Question 13: Select the suburb called “Paddington” and calculate
the number of incidents for “Offence_category” of “Drug offences”
then calculate the total number of incidents for each Subcategory.
Finally, show a table arranged by “Number_of_ incidents” (high
to low)
Paddington <- crime_long_new %>%
dplyr::filter( Suburb == "Paddington", # 2pt
Offence_category == "Drug offences") %>% # 1pt
group_by(Subcategory) %>% # 1pt
summarise(Number_of_incidents = sum(incidents)) %>% # 1pt
arrange(-Number_of_incidents) # 1pt
head(Paddington) # 1pt
## # A tibble: 6 x 2
## Subcategory Number_of_incidents
##
## 1 Possession and/or use of cannabis 154
## 2 Possession and/or use of cocaine 111
## 3 Possession and/or use of other drugs 82
## 4 Other drug offences 73
## 5 Dealing, trafficking in cocaine 68
## 6 Possession and/or use of amphetamines 57
Question 14: Let’s have a look at the changes over time for “Pos-
session and/or use of cannabis” in the suburb of Paddington
To answer this question, we need to first filter the “Suburb” and the “Subcategory”. Then, group incident by
year and finally sum the number of incidents for each year
Paddington_cannabis <- crime_long_new %>%
dplyr::filter( Suburb == "Paddington", # 1pt
Subcategory == "Possession and/or use of cannabis") %>% # 1pt
group_by(Year) %>% # 1pt
summarise(Number_of_incidents = sum(incidents)) %>% # 1pt
mutate(Year = as.numeric(Year)) # 1pt
head(Paddington_cannabis,3) # 1pt
## # A tibble: 3 x 2
## Year Number_of_incidents
##
## 1 2010 17
## 2 2011 17
## 3 2012 15
8
Question 15: Create a line plot to display the trend of the incidents
that you calculated for Paddington
On the x-axis you should have “Year” and on the y-axis you should display “Number_of_incidents”
ggplot(Paddington_cannabis, aes( x = Year, y = Number_of_incidents)) + # 2pt
geom_line() # 1pt
10
15
20
25
2010.0 2012.5 2015.0 2017.5
Year
N
um
be
r_
of
_i
nc
id
en
ts
Question 16: Create the same plot as in Question 15 but now in-
clude also the suburb called “Randwick” (you will see two trends in
the same plot). Make sure that the variable of “Suburb” is defined
as a factor
both_cannabis <- crime_long_new %>%
dplyr::filter(Suburb %in% c("Paddington", "Randwick"), # 1pt
Subcategory == "Possession and/or use of cannabis") %>% # 1pt
group_by(Year, Suburb) %>% # 1pt
summarise(Number_of_incidents = sum(incidents)) %>% # 1pt
mutate(Year = as.numeric(Year), # 1pt
Suburb = as.factor(Suburb)) # 1pt
## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.
ggplot(both_cannabis, aes( x = Year, # 1pt
y = Number_of_incidents, # 1pt
9
color = Suburb)) + # 1pt
geom_line() # 1pt
10
20
30
40
2010.0 2012.5 2015.0 2017.5
Year
N
um
be
r_
of
_i
nc
id
en
ts
Suburb
Paddington
Randwick
Question 17: Let’s now look at the total number of crime incidents
in NSW and create a plot to visualize the trend
crime_long_new %>%
dplyr::select( Year, # 1pt
incidents) %>% # 1pt
group_by(Year) %>% # 1pt
summarise(Number_of_incidents = sum(incidents)) %>% # 1pt
mutate(Year = as.numeric(Year)) %>% # 1pt
ggplot(aes(x = Year, y = Number_of_incidents )) + # 1pt
geom_line() # 1pt
10
11000
11500
12000
12500
13000
2010.0 2012.5 2015.0 2017.5
Year
N
um
be
r_
of
_i
nc
id
en
ts
Question 18: Now, let’s change the background color of the plot to
white using the theme_bw()
crime_long_new %>%
dplyr::select( Year, # 1pt
incidents) %>% # 1pt
group_by(Year) %>% # 1pt
summarise(Number_of_incidents = sum(incidents)) %>% # 1pt
mutate(Year = as.numeric(Year)) %>% # 1pt
ggplot(aes(x = Year, y = Number_of_incidents )) + # 1pt
geom_line() + # 1pt
theme_bw() # 1pt
11
11000
11500
12000
12500
13000
2010.0 2012.5 2015.0 2017.5
Year
N
um
be
r_
of
_i
nc
id
en
ts
Question 19: Let’s change the line color to green and replace it
with a dotted line
crime_long_new %>%
dplyr::select( Year, # 1pt
incidents) %>% # 1pt
group_by(Year) %>% # 1pt
summarise(Number_of_incidents = sum(incidents)) %>% # 1pt
mutate(Year = as.numeric(Year)) %>% # 1pt
ggplot(aes(x = Year, y = Number_of_incidents )) + # 1pt
geom_line(linetype = "dotted", color ="green") # 1pt
12
11000
11500
12000
12500
13000
2010.0 2012.5 2015.0 2017.5
Year
N
um
be
r_
of
_i
nc
id
en
ts
Question 20: Now, let’s look at the total number of crime incidents
for the suburbs of Redfern, Coogee, and Zetland by creating a bar
plot where we have the incidents per suburb by year next to each
other
comparison_data<- crime_long_new %>%
dplyr::select(Suburb, # 1pt
Year, # 1pt
incidents) %>% # 1pt
dplyr::filter( Suburb %in% c("Redfern", "Coogee", "Zetland")) %>% # 1pt
group_by(Year, Suburb) %>% # 1pt
summarise(Number_of_incidents = sum(incidents)) # 1pt
## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.
ggplot(comparison_data, aes(x = Year, # 1pt
y = Number_of_incidents, # 1pt
fill = Suburb)) + # 1pt
geom_bar(stat = "identity", # 1pt
position = "dodge") + # 1pt
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) # 1pt
13
02000
4000
20
10
20
11
20
12
20
13
20
14
20
15
20
16
20
17
20
18
20
19
Year
N
um
be
r_
of
_i
nc
id
en
ts
Suburb
Coogee
Redfern
Zetland
Question 21: Change the x and y-axis labels to “Years” and "
Incidents", respectively, for the figure in Question 20 and use the
black and white theme
ggplot(comparison_data, aes(x = Year, # 1pt
y = Number_of_incidents, # 1pt
fill = Suburb)) + # 1pt
geom_bar(stat = "identity", # 1pt
position = "dodge") + # 1pt
theme_bw() + # 1pt
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + # 1pt
xlab("Years") + # 1pt
ylab("Incidents") # 1pt
14
02000
4000
20
10
20
11
20
12
20
13
20
14
20
15
20
16
20
17
20
18
20
19
Years
In
ci
de
nt
s
Suburb
Coogee
Redfern
Zetland
Question 22: Add the following title to the figure constructed in
Question 21: “Number of criminal incidents”
ggplot(comparison_data, aes(x = Year, # 1pt
y = Number_of_incidents, # 1pt
fill = Suburb)) + # 1pt
geom_bar(stat = "identity", # 1pt
position = "dodge") + # 1pt
theme_bw() + # 1pt
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + # 1pt
xlab("Years") + # 1pt
ylab("Incidents") + # 1pt
ggtitle("Number of criminal incidents") # 1pt
15
02000
4000
20
10
20
11
20
12
20
13
20
14
20
15
20
16
20
17
20
18
20
19
Years
In
ci
de
nt
s
Suburb
Coogee
Redfern
Zetland
Number of criminal incidents
Question 23: By using “facet_wrap”, create a line plot to show the
trends for “Number_of_incidents” for each of the three suburbs
ggplot(comparison_data, aes(x = Year, # 1pt
y = Number_of_incidents, # 1pt
group =Suburb)) + # 1pt
geom_line() + # 1pt
facet_wrap(~Suburb) + # 1pt
theme() + # 1pt
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) # 1pt
16
Coogee Redfern Zetland
20
10
20
11
20
12
20
13
20
14
20
15
20
16
20
17
20
18
20
19
20
10
20
11
20
12
20
13
20
14
20
15
20
16
20
17
20
18
20
19
20
10
20
11
20
12
20
13
20
14
20
15
20
16
20
17
20
18
20
19
0
2000
4000
Year
N
um
be
r_
of
_i
nc
id
en
ts
Question 24: Transform the data set named comparison_data into
a wide format where the suburbs of Coogee, Redfern, and Zetland
are displayed as columns
comparison_data %>%
pivot_wider(id_cols = Year, # 1pt
names_from = Suburb, # 1pt
values_from = Number_of_incidents) # 1pt
## # A tibble: 10 x 4
## # Groups: Year [10]
## Year Coogee Redfern Zetland
##
## 1 2010 897 3225 197
## 2 2011 1189 3822 318
## 3 2012 877 3959 380
## 4 2013 885 4440 312
## 5 2014 762 4400 359
## 6 2015 912 4674 562
## 7 2016 1016 5623 493
## 8 2017 1011 4411 526
## 9 2018 1013 4102 572
## 10 2019 1119 4052 621
17

欢迎咨询51作业君