辅导案例-141B 2020

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

Homework 3
Statistics 141B 2020
Duncan Temple Lang
Due 5pm Thursday, 19th November
1 Data
We will use data about many, many aspects of baseball. This data has been compiled by Sean
Lahman and he has kindly made them available for use by many. Jeff Knecht has made the data,
up to 2011, available as an SQLite database. It is available via cloning a git repository (https:
//github.com/jknecht/lahmann-2013.sqlite) You can also retrieve from Canvas at https://
canvas.ucdavis.edu/files/10465512/download?download_frd=1. There are 24 regular tables
in this database (and one additional table). Documentation for each of the tables is available at
http://seanlahman.com/files/database/readme2013.txt.
You can also work with a version of the data that includes 6 additional years. See https://github.
com/WebucatorTraining/lahman-baseball-mysql/raw/master/lahmansbaseballdb.sqlite. Note
that the names of the Master table has changed and please explicitly note if you are using this
version.
You will need to install the RSQLite and DBI packages.
2 Report Structure
For this assignment, your report will focus on each question separately rather than having a nar-
rative of the overall, high-level process.
For each question,
• include the question
• map the (intentionally) casually written question to more precisely identify the relevant vari-
ables you need and the rows from the different tables, describing how you are interpreting
some of the ambiguous terms
• show the SQL and R code you use to get the relevant data
• interpret the results, justifying with numerical or graphical summaries.
• optionally, show the R code for doing this directly from the data frames not using SQL.
3 Questions
There are numerous questions below. You don’t have to answer all of them. You need to complete
13 questions. You can do more questions and receive bonus marks for those.
As always, you are encouraged to pose your own questions and explore those. State such questions
clearly and indicate why it might be interesting. Then explore the data and interpret your findings,
providing justification and the code.
Do as much of the computations in SQL and then bring the resulting table back to R to finish the
computations. Not all of the questions can be answered (easily) directly in SQL alone.
Give the answer and show the SQL and R code used to answer each question.
There are bonus points for validating the same calculations in R to verify the results using the
tables directly imported into R.
1. What years does the data cover? are there data for each of these years?
2. How many (unique) people are included in the database? How many are players, managers, etc?
3. ? How many players became managers?
4. How many players are there in each year, from 2000 to 2013? Do all teams have the same
number of players?
5. What team won the World Series in 2010? Include the name of the team, the league and division.
6. What team lost the World Series each year? Again, include the name of the team, league and
division.
7. Compute the table of World Series winners for all years, again with the name of the team, league
and division.
8. Compute the table that has both the winner and runner-up for the World Series in each tuple/row
for all years, again with the name of the team, league and division, and also the number games the
losing team won in the series. The first 3 rows look like
2013|BOS|Boston Red Sox|AL|E|SLN|St. Louis Cardinals|NL|C|3
2012|SFN|San Francisco Giants|NL|W|DET|Detroit Tigers|AL|C|0
2011|SLN|St. Louis Cardinals|NL|C|TEX|Texas Rangers|AL|W|3
9. Do you see a relationship between the number of games won in a season and winning the World
Series?
10. In 2003, what were the three highest salaries? (We refer here to unique salaries, i.e., there may
be several players getting the exact same amount.) Find the players who got any of these 3 salaries
with all of their details?
11. For 2010, compute the total payroll of each of the different teams. Next compute the team
payrolls for all years in the database for which we have salary information. Display these in a plot.
12. Explore the change in salary over time. Use a plot. Identify the teams that won the world
series or league on the plot. How does salary relate to winning the league and/or world series.
13. Compare payrolls for the teams that are in the same leagues, and then in the same divisions.
Are there any interesting characteristics? Have certain teams always had top payrolls over the
years? Is there a connection between payroll and performance?
14. Which player has hit the most home runs? Show the number per year.
15. Has the distribution of home runs for players increased over the years?
16. Do players who hit more home runs receive higher salaries?
2
17. Are certain baseball parks better for hitting home runs?
18. What’s the distribution of double plays? triple plays?
19. What pitchers have a large number of double or triple plays? Again, give their details (names,
team, year, ...).
20. How many games do pitchers start in a season? Plot this against games finished in a season.
21. How many games do pitchers win in a season?
22. How are wins related to hits, strikeouts, walks, homeruns and earned runs?
23. What are the top ten collegiate producers of major league baseball players? How many colleges
are represented in the database?
24. What players have pitched in the post season and also hit a home run in their career?
When answering the questions, try to summarize the results in convenient and informative form
(e.g. tables and/or plots) that illustrate the key features.
4 Sample Answer
We’lll use the cs database from StackExchange to illustrate a question.
Question: How many questions and responses are there?
Questions and answers are in the Posts. We do not include comments to address this question.
The PostTypeId field identifies the type of the post, e.g, Question, Answer, Tag wiki excerpt, etc.
We count the number of each of these post types with a GROUP BY operation. Then we restrict the
results to be just for Question and Answer, dropping the other types of posts.
We need to get the label describing the post type rather than just using the Id, e.g., 1, 2, 3, ...
These don’t directly tell us which is a question, an answer, .... We get these by JOINing with the
PostTypeIdMap table.
The SQL for this is
SELECT PostTypeId, value, COUNT(PostTypeId)
FROM Posts AS p, PostTypeIdMap AS m
WHERE PostTypeId = m.Id
GROUP BY PostTypeId
HAVING value IN (’Question’, ’Answer’)
Alternatively, we can restrict the tuples to count to be just the Question and Answer values
SELECT PostTypeId, value, COUNT(PostTypeId)
FROM Posts AS p, PostTypeIdMap AS m
WHERE PostTypeId = m.Id
AND value IN (’Question’, ’Answer’)
GROUP BY PostTypeId
PostTypeId value COUNT(PostTypeId)
1 1 Question 37243
2 2 Answer 43500
3
There was no point in includinge the PostTypeId column except for a sanity check.
Note: we send these queries from R to the database via dbGetQuery() which I didn’t show here.
To do this in R, match the values PostTypeId field in Posts to the id column in PostTypeIdMap.
This then allows us to get the corresponding element of the value column in PostTypeIdMap to
get the post type label. Then we use table() to get the counts.
posts = dbGetQuery(cs, "SELECT * FROM Posts")
postTypeMap = dbGetQuery(cs, "SELECT * FROM PostTypeIdMap")
m = match(posts$PostTypeId, postTypeMap$id)
table(postTypeMap$value[m])
table(postTypeMap$value[m])[c("Question", "Answer")]
4

欢迎咨询51作业君