Assignment 3
COM EM 747
Spring 2021
Assignment by:
Rebecca Auger
Chris Wells
1. Load the packages you will need for this assignment.
library(tidyverse)
library(tidytext)
library(stringr)
library(ggplot2)
2. Read in the data
Use the sample of the Meghan Markle tweets data we have prepared for you: markle_fixed.csv
This dataset is based on the same one we used for our text processing exercise.
You will probably find the key to that exercise very helpful in this assignment!
Read the file into an object in your R session.
3. Fixing dates
Note that this file already has an ‘id’ column and the columns ‘date_of_tweet’ and ‘datetime’, so you do
not need to create those or take a sample.
However, because R has read ‘date_of_tweet’ and ‘datetime’ from a csv file, it has not classed them as date
variables. Use the lines below to convert them to date and date-time classes. (Note that we have called our
data object ‘mm’; you may have named yours differently.)
mm$date_of_tweet <- as.Date(mm$date_of_tweet)
mm$datetime <- as.POSIXct(mm$datetime)
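As a minimal sketch of why this conversion is needed, the lines below write a tiny made-up table to a temporary csv file, read it back with base R's read.csv(), and check the column classes before and after converting. The rows, the object name mm_toy, and the tz = "UTC" setting are illustrative assumptions, not part of the assignment data.

```r
# Toy stand-in for the assignment file (hypothetical rows, for illustration only)
toy <- data.frame(
  id = 1:2,
  date_of_tweet = c("2021-03-08", "2021-03-09"),
  datetime = c("2021-03-08 01:15:00", "2021-03-09 14:30:00"),
  stringsAsFactors = FALSE
)
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# read.csv() brings the date columns back as plain text
mm_toy <- read.csv(path, stringsAsFactors = FALSE)
class(mm_toy$date_of_tweet)

# Convert to proper date and date-time classes
# (tz = "UTC" is an assumption; use the time zone that fits your data)
mm_toy$date_of_tweet <- as.Date(mm_toy$date_of_tweet)
mm_toy$datetime <- as.POSIXct(mm_toy$datetime, tz = "UTC")
class(mm_toy$date_of_tweet)
class(mm_toy$datetime)
```

If you read the file with a different function (for example readr's read_csv()), the columns may already be parsed as dates, so it is worth checking class() before converting.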
4. Moving to lower case
In this assignment, you are going to create several different sets of data based on words that tweets contain.
To do this, you want to be able to search within tweets without having to deal with capitalization.
The function tolower() takes any character vector and returns a copy of it with all letters in lower case.
(Check out ‘?tolower’ to see more about this function.)
You can use tolower() to create a new column in your dataset that is a copy of the original tweet text, but
in which all letters are lower case.
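A minimal sketch of this step on made-up data is shown below. The column names text and text_lower are assumptions for illustration; use whatever your tweet text column is actually called.

```r
# Two hypothetical tweets (not from the assignment data)
toy <- data.frame(
  id = 1:2,
  text = c("The QUEEN spoke to Oprah", "Meghan Markle on RACISM"),
  stringsAsFactors = FALSE
)

# Keep the original text and add a lower-case copy alongside it
toy$text_lower <- tolower(toy$text)
toy$text_lower
```

Keeping the original column means you can still read the tweets as they were written while searching in the lower-case copy.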
5. Create subsets of the data
Your task in this assignment is to compare the sentiment of tweets mentioning different topics involved in
the Meghan-Harry-Oprah interview. The topics you should consider are:
   1. The queen
   2. Oprah Winfrey
   3. Racism
You should create subsets of your data such that each subset is a dataframe of tweets that contain the
keywords:
   1. “queen”
   2. “oprah” OR “winfrey”
   3. “racism” OR “racist”
As a hint, recall from the text exercise that the following script will return rows from a dataframe where a
set of characters (here just the symbol ‘@’) appears in the field ‘word’.
See also the text exercise for more detail.
mm %>% filter(str_detect(word, "@"))
You will want to perform three searches like this, each time saving the resulting rows into a new object.
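The pattern can be sketched on made-up data as follows. In a regular expression the | symbol means OR, so a single str_detect() call can cover two keywords. The object names and the text_lower column are assumptions for illustration.

```r
library(dplyr)
library(stringr)

# Four hypothetical lower-cased tweets (not from the assignment data)
toy <- tibble(
  id = 1:4,
  text_lower = c(
    "the queen has not commented",
    "oprah winfrey asked the hard questions",
    "that headline was racist",
    "what an interview"
  )
)

# One keyword
queen_toy <- toy %>% filter(str_detect(text_lower, "queen"))

# Either of two keywords: "|" is the regular-expression OR
oprah_toy <- toy %>% filter(str_detect(text_lower, "oprah|winfrey"))
racism_toy <- toy %>% filter(str_detect(text_lower, "racism|racist"))
```

Note that str_detect() matches substrings, so "queen" would also match a word such as "queens".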
6. Unnest the tokens from the tweets, and remove stopwords.
Now we will clean up each subset to make it ready for sentiment analysis.
For each subset, you will want to (1) unnest its tokens and save the result to a new object; (2) remove
stopwords.
Hint: see the text exercise for how best to do this with tweets.
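A minimal sketch of the two steps on made-up data is shown below. It uses the default word tokenizer in unnest_tokens() and the stop_words table that ships with tidytext; the text exercise may use a tweet-specific approach instead, so treat this only as an illustration of the general shape. The object and column names are assumptions.

```r
library(dplyr)
library(tidytext)

# Two hypothetical lower-cased tweets (not from the assignment data)
queen_toy <- tibble(
  id = 1:2,
  text_lower = c(
    "the queen has not commented on the interview",
    "i love the queen"
  )
)

tidy_queen_toy <- queen_toy %>%
  unnest_tokens(word, text_lower) %>%   # (1) one row per word
  anti_join(stop_words, by = "word")    # (2) drop rows whose word is a stopword

tidy_queen_toy
```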
7. Applying sentiment scores
Now we are ready to assign each word a sentiment score. Recall from the text exercise that this can be
done by using an inner join between the tidy text dataframe and the expression ‘get_sentiments("afinn")’.
In the text exercise, it looked like this:
sentiment_mm <- tidy_mm %>%
  inner_join(get_sentiments("afinn"))
You will need to do this with each of your three tidy dataframes.
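To see what the inner join does without downloading anything, the sketch below uses a tiny hand-made lexicon with the same two columns as the AFINN table (word and value). The lexicon, its scores, and the object names are hypothetical stand-ins chosen for illustration; in the assignment you would join against get_sentiments("afinn") instead.

```r
library(dplyr)

# Hypothetical tidy tweet data: one row per word
tidy_toy <- tibble(
  id = c(1L, 1L, 2L, 2L),
  word = c("love", "queen", "hate", "interview")
)

# Hypothetical mini-lexicon standing in for get_sentiments("afinn")
toy_lexicon <- tibble(
  word = c("love", "hate"),
  value = c(3, -3)
)

# inner_join keeps only words found in BOTH tables,
# so words without a score ("queen", "interview") drop out
sentiment_toy <- tidy_toy %>%
  inner_join(toy_lexicon, by = "word")

sentiment_toy
```

Writing by = "word" makes the join column explicit. The first call to get_sentiments("afinn") may ask you to confirm a one-time download of the lexicon.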
8. Calculating average sentiment
You now have three different tidy dataframes, with sentiment scores for each word. In aggregate, those
sentiment scores represent the sentiment expressed in tweets that contained the mention of a specific topic:
either the queen, or Oprah, or racism.
Calculate the average sentiment of tweets containing each of those keywords.
What do you notice about the average sentiment scores of tweets about these topics?
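One possible way to define the average, sketched on made-up scores, is to sum word scores within each tweet and then take the mean across tweets. This is an assumption about what "average sentiment of tweets" means here; averaging over words instead would give a different number. The object names and values are hypothetical.

```r
library(dplyr)

# Hypothetical scored words for three tweets
sentiment_toy <- tibble(
  id = c(1L, 1L, 2L, 3L, 3L),
  value = c(3, -1, -3, 2, 2)
)

# Step 1: one sentiment score per tweet
tweet_scores <- sentiment_toy %>%
  group_by(id) %>%
  summarize(sentiment = sum(value))

# Step 2: average of the per-tweet scores
avg_toy <- tweet_scores %>%
  summarize(avg_sentiment = mean(sentiment))

avg_toy
```

Keep in mind that tweets with no words in the lexicon disappear at the inner join, so they do not count toward this average.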
9. Sentiment over time
We finally might wish to consider how sentiment about these topics has changed over time.
Here is an example of the code you might use to calculate sentiment on a day-by-day basis. The first block
plots aggregate (total) sentiment per day; the second plots average sentiment per tweet per day (as shown
at the bottom of the text exercise).
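# Aggregate sentiment: total score across all tweets on each day
queen_sent %>%
  group_by(date_of_tweet, id) %>%
  summarize(sentiment = sum(value)) %>%          # one score per tweet
  summarize(day_sentiment = sum(sentiment)) %>%  # total over the day's tweets
  ggplot(aes(x = date_of_tweet, y = day_sentiment)) +
  geom_line()

# Average sentiment: mean score per tweet on each day
queen_sent %>%
  group_by(date_of_tweet, id) %>%
  summarize(sentiment = sum(value)) %>%              # one score per tweet
  summarize(day_sentiment = sum(sentiment)/n()) %>%  # total divided by number of tweets
  ggplot(aes(x = date_of_tweet, y = day_sentiment)) +
  geom_line()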
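The two consecutive summarize() calls work because, by default, each summarize() removes the last grouping variable. A minimal sketch on made-up scores (hypothetical names and values, no plotting) shows the intermediate results:

```r
library(dplyr)

# Hypothetical scored words: two tweets on one day, one tweet on the next
toy_sent <- tibble(
  date_of_tweet = as.Date(c("2021-03-08", "2021-03-08", "2021-03-08", "2021-03-09")),
  id = c(1L, 1L, 2L, 3L),
  value = c(3, -1, -4, 2)
)

# First summarize: one row per tweet; the 'id' grouping is dropped afterwards
per_tweet <- toy_sent %>%
  group_by(date_of_tweet, id) %>%
  summarize(sentiment = sum(value))

group_vars(per_tweet)  # the data are still grouped by date_of_tweet

# Second summarize: one row per day, computing both versions side by side
per_day <- per_tweet %>%
  summarize(
    day_total = sum(sentiment),        # aggregate sentiment
    day_mean  = sum(sentiment) / n()   # average sentiment per tweet
  )

per_day
```

Here n() counts the tweets remaining in each day's group, which is why dividing by it gives a per-tweet average.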
Create the same plots for the other two topics. What do you notice about the patterns of sentiment for
these three topics? Why do you think they display the patterns they have?
Keep in mind:
   1. That ALL of the tweets were selected based on particular keywords, namely the hashtag
      #meghanmarkle
   2. That the SCALE of the graphs may be quite different between different topics.
   3. That the two types of graph represent somewhat different things. Which is most appropriate to
      the kinds of comparisons you want to make?