Homework Assignment 2
COM EM 747
Spring 2021
Assignment by:
Rebecca Auger
Chris Wells
Due: Submit as a text or Word document on Blackboard by 5:00pm, Friday, March 5.
This assignment builds your learning in R, dplyr and ggplot. To have a little fun, you will work with a
dataset of NHL player statistics for the majority of the assignment.
Please note:
• Questions with (Q) require a brief written response.
• Questions with (C) require that you report the code used to answer the question.
Part 1: Warm-Up
Remember our old friend mtcars? The dplyr package includes an expanded number of datasets to play
around with, including a dataset of Star Wars characters called “starwars”.
1. Take a look at the starwars dataset. (With dplyr loaded, you can just type “starwars”.) Does this
dataset follow the guidelines of “tidy” data? Why or why not? (Q)
2. Using ggplot, make a bar chart showing how many characters are from each homeworld in the dataset.
What is the code, and what does your chart look like? (You can take a screenshot, or use the “Export”
drop down menu to save as an image or copy to clipboard.) (Hint: to achieve this, you only need to
use ggplot() with geom_bar()) (C)
3. The chart looks very cluttered, as it is now. If our goal here is to look at homeworlds that lots of
Star Wars characters have come from, we might want to filter out planets that are the homeworld of
only one character. Note: to do this, we will need several dplyr functions, which we will then pass to
ggplot() to create the bar plot.
• Below is some code that would accomplish this goal if it were in the correct order.
• Reorder the code block to create a bar chart that only includes homeworlds that occur more than
once in the dataset. (C)
filter(count>=2)%>%
ggplot(aes(x=homeworld,y=count))+
group_by(homeworld)%>%
starwars%>%
geom_bar(stat="identity")
summarise(count=n())%>%
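As a reminder of the general pattern (summarise first, then plot the pre-computed values), here is a minimal sketch on our old friend mtcars rather than starwars. The names cyl_counts and p and the cut-off of 10 are arbitrary choices for illustration; they are not part of the assignment.

```r
library(dplyr)
library(ggplot2)

# Count the cars in each cylinder class, then keep only the classes
# with at least 10 cars (an arbitrary cut-off for this sketch).
cyl_counts <- mtcars %>%
  group_by(cyl) %>%
  summarise(count = n()) %>%
  filter(count >= 10)

# By default geom_bar() tallies rows itself (stat = "count").
# Because the counts are already computed, stat = "identity" tells it
# to use the y values as the bar heights instead.
p <- cyl_counts %>%
  ggplot(aes(x = factor(cyl), y = count)) +
  geom_bar(stat = "identity")

print(p)
```

Note that the dplyr steps are chained with %>%, while the ggplot2 layers are added with +.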
Part 2: NHL Data
Load the NHL Player Statistics dataset from Blackboard.
Originally, this dataset is from naturalstattrick.com, a website with an abundance of NHL player data
available for free.
Don’t worry if you aren’t familiar with hockey; this assignment isn’t designed to test your knowledge of the
game. And the first step in any data analysis process is understanding the context of your data. But if you
have questions, always feel free to ask!
Here is some descriptive information about the dataset:
• Each row catalogues one player’s time with one team. Sometimes players are traded mid-way through
a season, which means that part of their season’s record may be for one team at the start of a season
and a different team at the end of the season. By separating individual player seasons into multiple
rows, the dataset allows users to look at both player and team statistics.
• The dataset contains statistics for one season, the 2018-2019 NHL season (the last season with a regular
schedule).
• The dataset’s variable names, with brief descriptions, are as follows:
– Player: player’s full first and last name
– Team: team the player was playing for when the stats were recorded and aggregated
– Position: playing position (Center (C), Left-Wing (L), Right-Wing (R), Defenseman (D))
– GP: games played
– TOI: time on ice (in minutes)
– Goals: goals scored by the player
– Total.Assists: assists credited to the player for helping another player score (includes primary and
secondary assists)
– Total.Points: in hockey, “points” are the sum of goals and assists
– Shots: shots on goal; attempts to score
– PIM: penalties in minutes
– Total.Penalties: number of penalties
– Penalties.Drawn: number of penalties drawn by the player (i.e., the opposing team was
penalized)
– Giveaways: puck was lost to the other team
– Takeaways: puck was taken from the other team
– Hits: number of physical checks laid by the player
– Hits.Taken: number of physical checks taken by the player
– Faceoffs.Won: a faceoff is similar to a basketball tip-off, but takes place before every round of
play. This statistic indicates the number of times the player was able to gain control of the puck
before the opposing player
– Faceoffs.Lost: number of times player lost control of the puck to the opposing player in the faceoff
– General.Position: Forward (F), which includes center, left-wing, and right-wing; or Defenseman (D)
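To make the row structure concrete, here is a small sketch using a made-up tibble. The players, teams, and numbers in toy_nhl are invented for illustration and do not come from the real dataset; only a few of the columns described above are included.

```r
library(dplyr)

# Hypothetical rows: "Player A" was traded mid-season, so he has one
# row per team. All values here are invented.
toy_nhl <- tribble(
  ~Player,    ~Team, ~GP, ~Goals, ~Total.Assists, ~Total.Points,
  "Player A", "AAA",  40,     10,             15,            25,
  "Player A", "BBB",  35,      8,             12,            20,
  "Player B", "AAA",  82,     30,             40,            70
)

# Players who appear in more than one row played for more than one team.
multi_team <- toy_nhl %>%
  count(Player, name = "rows") %>%
  filter(rows > 1)

# Points are defined as goals plus assists, so this should hold per row.
points_ok <- all(toy_nhl$Total.Points == toy_nhl$Goals + toy_nhl$Total.Assists)

print(multi_team)
print(points_ok)
```

A check like this on the real data is a quick way to see how many players changed teams before you start answering the questions below.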
1. A consequence of the data structure is that if a player changed teams in mid-season, he will appear in
two separate rows–one for each team. What step would be necessary to make observations based on
player performance over the full season as opposed to player performance on a specific team? (Q)
1b. What dplyr function would accomplish this step? Provide a code snippet that would result in
a tibble where each row is a player’s performance over the whole season. (Hint: the resulting tibble
should have the same columns as the original data.) (Hint 2: the resulting tibble should have 906
rows.) (C)
2. Suppose you are trying to identify the best players in the league, and you have decided to use total
points to assess performance.
2a. Keeping in mind that we want to look at each player’s season-level performance (i.e., keep using
your code from 1b), create a histogram of the distribution of total points for all players (i.e. bars should
represent the number of players achieving a certain number of points). Report the code used to create
the histogram. What do you notice about the distribution of the histogram? (C)(Q)
2b. Choose a reasonable cut-off point that will help you identify who the very highest-scoring players
are. What code would return the list/table of these players? (C)
2c. Create a bar chart with only these players: the players’ names should be on the x-axis, and the
height of the bars should indicate the total points they scored. (C)
2d. There are 82 games in a hockey season, and you might notice that many of the players with high
total points have played almost all of them and have over 1000 minutes of time on ice. Maybe you
are interested in the relationship between points and time on ice. Make a scatterplot comparing total
points to total time on ice. Report your code. What basic relationship do you observe? What are two
different causal explanations for this relationship? (C)(Q)
2e. Interesting... but hockey is a team sport, which means different players have different roles.
Now build on your scatterplot: using the same x and y, now use color to indicate what position each
player plays on the ice (use the variable General.Position, not Position). Report your code and briefly
explain how this change helps you understand the relationships in the graph. (Hint: in the variable
General.Position, D stands for Defense and F stands for Forward.) (C)(Q)
2f. Now for a quick subset. When exploring data, we often want to be able to zoom in on datapoints
that look especially different or interesting. Here, you might be interested in those 3 defensemen who
scored quite a few points. If you wanted to zoom in on these individuals, and created a table that
showed those three players (in rows) and their team, total points and time on ice, what dplyr commands
would you use? (C)
3. Maybe you don’t really care about particular NHL players; instead, you want to compare the performance of
teams. Say we want to create a bar chart indicating the total goals scored by each team, another
indicating the average goals scored by each player on each team, and a box plot so we can understand
the distribution of goals on a team level.
3a. First, we want to think about what we want our graph to look like. If we want to compare teams
with each other based on all of the goals they have scored, what will be measured on the x-axis
and y-axis of the chart? (Q)
3b. To begin setting up your data, use group_by and summarise to create a tibble with teams as rows
and a value for each row showing how many goals the team scored. (C)
3c. Now, pipe this expression into ggplot and create a bar chart with the correct x and y axes, keeping
in mind the default statistic used in geom_bar, and how to change it. (Accuracy check: There should
be 32 bars on the graph, one for each team.) (C)
3d. Now let’s take a look at average goals scored per player. You can use the code you produced for
3c, and make a minor alteration to display the mean number of goals scored by members of each team.
(C)
4. Suppose you don’t care about the rest of the NHL; the only thing that really matters to you is the
Boston Bruins.
4a. Report the code used to make a tibble with only players from the Boston Bruins (BOS) included
(C)
4b. Using pipes (%>%), report the code necessary to go from the original dataset to a scatterplot of
Boston Bruins players’ total points versus time on ice (C)
4c. Building on Healy’s suggestions for datapoint labeling, choose three of the more notable datapoints
from 4b and label them with the player’s names. (C)
5. Total points are one way to think about player skill. Another might be the accuracy of player shots.
Percentages are common measures of accuracy, usually calculated as the proportion of attempts that
succeeded. In the case of hockey, shot percentage might be a useful measure. There is no measure of
shot percentage in the current data, but you can create it:
5a. Use the variables Shots and Goals to create a new variable, shot percentage. What you want to
calculate is the percentage of shots that scored goals. (Hint: mutate() could be useful). (C)
5b. Let’s graph that new variable somehow. Pipe your result from 5a into a ggplot() statement. You
choose what you want to graph with it: you could graph it on its own as a histogram or bar plot, or
you could use another variable to create a scatterplot. What have you found? (C)(Q)
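As a reminder of how a percentage is built from two columns, here is a minimal sketch on made-up numbers. The tibble toy_shots and the column name Shot.Pct are assumptions for illustration, not names from the real dataset. The sketch also shows one possible way to handle a player with zero shots, where the ratio is undefined.

```r
library(dplyr)

# Hypothetical shooting records; all values are invented.
toy_shots <- tribble(
  ~Player,    ~Shots, ~Goals,
  "Player A",    200,     30,
  "Player B",     50,      5,
  "Player C",      0,      0
)

# Percentage of shots that became goals. A player with no shots has no
# meaningful percentage, so record NA instead of dividing by zero.
toy_shots <- toy_shots %>%
  mutate(Shot.Pct = if_else(Shots > 0, 100 * Goals / Shots, NA_real_))

print(toy_shots)
```

Whether to keep, drop, or flag players with very few shots is a judgment call worth mentioning in your written response, since a handful of shots can produce extreme percentages.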