MS IN STATISTICS
COMPUTER LANGUAGE EXAM - FALL 2019
Grading Criteria
Your work will be graded as PASS/FAIL based on (in order of importance): correctness,
style, and efficiency.
Submission Instructions
Prepare the items listed below and put them into a compressed (.zip) file:
1. a copy of your problem 1 R script as a .R file
2. a copy of your problem 2 SAS program saved as a .sas file
Name the .zip file CLE_<full name>_<user ID>.zip. Your full name should have
no spaces, and each separator should be an underscore, not a hyphen or a space. Upload
this file to the Fall 2019 CLE UVaCollab page. For example, my submission would be
CLE_TaylorBrown_trb5me.zip.
Double Check Your Two Files Before You Submit!
To facilitate grading, please check the following before you submit your work:
1. you do not rename data files before reading them in, or read in data files that I do not
provide,
2. both of your submitted scripts/programs can run from start to finish with no
intervention or custom starting point/state,
3. important variables are neither modified nor destroyed at any point in the script after
they are created.
Specific Instructions for R scripts
1. Your code submission should not refer to global file locations (use setwd() interactively
in the console so your code refers to files using only their names),
2. comment out or remove calls to install.packages (assume I have all of the software
you use).
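A minimal sketch of a script header that follows these two rules. The package name in the commented-out line and the file.exists() guard are illustrative assumptions, not requirements of the exam.

```r
# install.packages("ggplot2")  # hypothetical package; left commented out

# setwd() is run interactively in the console, never inside the script, so the
# script refers to the data file by its name only.
dataFile <- "nondeletedpostssubset.csv"

# Guard added only so this sketch runs even when the file is absent.
if (file.exists(dataFile)) {
  myDF <- read.csv(dataFile, stringsAsFactors = FALSE)
}
```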
Specific Instructions for SAS scripts
1. Try to keep references to file paths at the beginning of the script (e.g. with libname
or filename).
The Data
Both problems make use of data from the Stack Exchange Data Explorer
(https://data.stackexchange.com/). The file called nondeletedpostssubset.csv provides
information about user posts on https://stats.stackexchange.com/.
Problem 1
You must use R to complete this portion.
1. Read in the data file nondeletedpostssubset.csv. Store it as a data frame, and make
sure to call it myDF. Also, make sure that you remove the column named ClosedDate.
2. Create a logical vector, not belonging to myDF, that describes whether or not each
element of myDF’s ViewCount vector is strictly larger than 8000. Make sure to
appropriately handle missing values; in other words, this vector should contain TRUEs,
FALSEs, and NAs. Store the result as a factor vector named viewCountCategories.
3. Create six new/separate data sets called noTags, oneTag, twoTags, threeTags, fourTags,
fiveTags. Each of these will contain only posts/rows from myDF that have zero tags,
one tag, two tags, three tags, four tags and five tags, respectively. Notice that each
post/row from myDF might have up to five tags, but that there are many tags to choose
from when posting a question.
4. “Extract” all the individual tags from myDF’s Tags column. Call the resulting character
vector cleanTags, and make sure there is only one or zero tag(s) represented in each
element of cleanTags. To be clear, an empty space character in cleanTags should
correspond with an untagged post/row in myDF. For posts/rows in myDF with multiple
tags, each tag should be in a separate element of cleanTags. Note that this means
that cleanTags will have more elements than myDF has rows; each row of the
latter will correspond to at least one element of cleanTags, and possibly more. Many
tags will have duplicates in cleanTags.
5. Sort myDF’s AnswerCount vector by its CreationDate vector, and assign to a separate
vector called orderedCreationDate. This orderedCreationDate vector should not
be a part of myDF. Finally, make a histogram of this new variable you have created. Be
careful of the class/mode of CreationDate.
6. What are the average scores (in the Score column) for each category in myDF’s PostTypeId
column? Store your result in a data frame named aveScoresByPostType. Have its first
column be the group label, and its second column contain the averages.
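As an illustration only, the sketch below walks through the kinds of operations these tasks call for, using a small made-up data frame. The name toyDF, all of its values, the assumed <tag1><tag2> layout of the Tags column, the assumed date format, and the choice to treat a missing Tags entry as untagged are assumptions, not facts about nondeletedpostssubset.csv; check each against the actual file.

```r
# Made-up stand-in for myDF; all values and formats below are assumptions.
toyDF <- data.frame(
  PostTypeId   = c(1, 2, 1, 2),
  Score        = c(4, 10, 2, 0),
  ViewCount    = c(9000, 120, NA, 8000),
  AnswerCount  = c(3, 0, 1, 2),
  CreationDate = c("2019-03-01 10:00:00", "2018-07-15 08:30:00",
                   "2019-01-20 12:45:00", "2018-12-31 23:59:59"),
  Tags         = c("<r><regression>", "", "<anova>", NA),
  stringsAsFactors = FALSE
)

# A comparison with NA yields NA, so the factor keeps missing values as NA.
viewFlag <- factor(toyDF$ViewCount > 8000)

# Treat a missing Tags entry as "no tags" (an assumption about the data).
tagStrings <- ifelse(is.na(toyDF$Tags), "", toyDF$Tags)

# Count tags per row by counting "<"; gregexpr() returns -1 when nothing matches.
hits <- gregexpr("<", tagStrings, fixed = TRUE)
tagCounts <- vapply(hits, function(m) sum(m > 0), integer(1))
toyOneTag <- toyDF[tagCounts == 1, ]

# Pull out the individual tags; untagged rows contribute a single " ".
tagList <- regmatches(tagStrings, gregexpr("[^<>]+", tagStrings))
tagList[lengths(tagList) == 0] <- list(" ")
flatTags <- unlist(tagList)

# Convert the date strings before ordering so the sort is chronological.
dates <- as.POSIXct(toyDF$CreationDate, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
answersByDate <- toyDF$AnswerCount[order(dates)]

# Group means: the first column is the group label, the second the average.
meanScores <- aggregate(Score ~ PostTypeId, data = toyDF, FUN = mean)
```

Counting a delimiter character is one of several ways to get the number of tags per row; splitting the string and taking lengths would work equally well if the real delimiter differs from the one assumed here.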
Problem 2
You must use SAS to complete this portion.
1. Read in the data file nondeletedpostssubset.csv. Make sure that you remove the
column named ClosedDate, and store the data set with a name of myData.
2. Create a new variable in myData that describes whether or not each element of myData’s
ViewCount variable is strictly larger than 8000. Make sure to appropriately handle
missing values; in other words, this variable’s elements should be either true, false, or
missing (as character strings). Store your result as a variable named viewCountCategories.
3. Create six new/separate data sets called noTags, oneTag, twoTags, threeTags, fourTags,
fiveTags. Each of these will contain posts/rows from myData that have precisely zero
tags, one tag, two tags, three tags, four tags and five tags, respectively. If you’re
having difficulty counting the number of times a character appears in each row of a
column/variable, you might consider using SAS’s COUNT function.
4. Create a new data set called myOrderedData, which will be the data set myData ordered
by its variable CreationDate. Finally, make a histogram of this new variable you have
created. Be careful not to overwrite myData.
5. What are the average scores (in the Score column) for each category in myData’s
PostTypeId column? Store your results in a separate data set called aveScoresByPostType
that has only two columns. Have its first column be the group label, and its second
column contain the averages.
6. Tabulate myData’s answerCount column. In other words, count up the number of
times/rows each answer frequency appears in answerCount. Write out this information
to a new data set called answerCountTabulation, and make sure it contains only two
columns: one column for the number of answers a post received, and a second column
for how many posts received that many answers. Also, make sure your data set
answerCountTabulation is sorted in descending order by the second column.