辅导案例-STAD29

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

STAD29: Statistics for the Life and Social Sciences
Lecture notes
Lecture notes STAD29: Statistics for the Life and Social Sciences 1 / 802
Course Outline
Section 1
Course Outline
Lecture notes STAD29: Statistics for the Life and Social Sciences 2 / 802
Course Outline
Course and instructor
Lecture: Wednesday 14:00-16:00 in HW 215. Optional computer lab
Monday 16:00-17:00 in BV 498.
Instructor: Ken Butler
Office: IC 471.
E-mail: [email protected]
Office hours: Wednesday 11:00-12:00. Or make an appointment.
E-mail always good.
Course website: link.
Using Quercus for assignments/grades only; using website for
everything else.
Lecture notes STAD29: Statistics for the Life and Social Sciences 3 / 802
Course Outline
Texts
There is no official text for this course.
You may find “R for Data Science”, link helpful for R background.
I will refer frequently to my book of Problems and Solutions in Applied
Statistics (PASIAS), link.
Both of these resources are and will remain free.
Lecture notes STAD29: Statistics for the Life and Social Sciences 4 / 802
Course Outline
Programs, prerequisites and exclusions
Prerequisites:
For undergrads: STAC32. Not negotiable.
For grad students, a first course in statistics, and some training in
regression and ANOVA. The less you know, the more you’ll have to
catch up!
This course is a required part of Applied Statistics minor.
Exclusions: this course is not for Math/Statistics/CS
majors/minors. It is for students in other fields who wish to learn
some more advanced statistical methods. The exclusions in the
Calendar reflect this.
If you are in one of those programs, you won’t get program credit for
this course, or for any future STA courses you take.
Lecture notes STAD29: Statistics for the Life and Social Sciences 5 / 802
Course Outline
Computing
Computing: big part of the course, not optional. You will need to
demonstrate that you can use R to analyze data, and can critically
interpret the output.
For grad students who have not come through STAC32, I am happy to
offer extra help to get you up to speed.
Lecture notes STAD29: Statistics for the Life and Social Sciences 6 / 802
Course Outline
Assessment 1/2
Grading: (2 hour) midterm, (3 hour) final exam. Assignments most
weeks, due Tuesday at 11:59pm. Graduate students (STA 1007) also
required to complete a project using one or more of the techniques
learned in class, on a dataset from their field of study. Projects due on
the last day of classes.
Assessment:
STAD29 STA 1007
Assignments 20% 20%
Midterm exam 30% 20%
Project - 20%
Final exam 50% 40%
Lecture notes STAD29: Statistics for the Life and Social Sciences 7 / 802
Course Outline
Assessment 2/2
Assessments missed with documentation will cause a re-weighting of
other assessments of same type. No make-ups.
You must pass the final exam to guarantee passing the course. If
you fail the final exam but would otherwise have passed the course,
you receive a grade of 45.
Lecture notes STAD29: Statistics for the Life and Social Sciences 8 / 802
Course Outline
Plagiarism
This link defines academic offences at this university. Read it. You are
bound by it.
Plagiarism defined (at the end) as
The wrongful appropriation and purloining, and publication as
one’s own, of the ideas, or the expression of the ideas … of another.
The code and explanations that you write and hand in must be yours
and yours alone.
When you hand in work, it is implied that it is your work. Handing in
work, with your name on it, that was actually done by someone else is
an academic offence.
If I am suspicious that anyone’s work is plagiarized, I will take action.
Lecture notes STAD29: Statistics for the Life and Social Sciences 9 / 802
Course Outline
Getting help
The English Language Development Centre supports all students in
developing better Academic English and critical thinking skills needed
in academic communication. Make use of the personalized support in
academic writing skills development. Details and sign-up information:
link.
Students with diverse learning styles and needs are welcome in this
course. In particular, if you have a disability/health consideration that
may require accommodations, please feel free to approach the
AccessAbility Services Office as soon as possible. I will work with you
and AccessAbility Services to ensure you can achieve your learning goals
in this course. Enquiries are confidential. The UTSC AccessAbility
Services staff are available by appointment to assess specific needs,
provide referrals and arrange appropriate accommodations: (416)
287-7560 or by e-mail: [email protected].
Lecture notes STAD29: Statistics for the Life and Social Sciences 10 / 802
Course Outline
Course material
Dates and times
Regression-like things
review of (multiple) regression
logistic regression (including multi-category responses)
survival analysis
ANOVA-like things
more ANOVA
multivariate ANOVA
repeated measures
Multivariate methods
discriminant analysis
cluster analysis
(multidimensional scaling)
principal components
factor analysis
Miscellanea
(time series), multiway frequency tables
Lecture notes STAD29: Statistics for the Life and Social Sciences 11 / 802
Dates and Times
Section 2
Dates and Times
Lecture notes STAD29: Statistics for the Life and Social Sciences 12 / 802
Dates and Times
Packages for this section
library(tidyverse)
library(lubridate)
Lecture notes STAD29: Statistics for the Life and Social Sciences 13 / 802
Dates and Times
Dates
Dates represented on computers as “days since an origin”, typically Jan
1, 1970, with a negative date being before the origin:
mydates <- c("1970-01-01", "2007-09-04", "1931-08-05")
(somedates <- tibble(text = mydates) %>%
mutate(
d = as.Date(text),
numbers = as.numeric(d)
))
## # A tibble: 3 x 3
## text d numbers
##
## 1 1970-01-01 1970-01-01 0
## 2 2007-09-04 2007-09-04 13760
## 3 1931-08-05 1931-08-05 -14029
Lecture notes STAD29: Statistics for the Life and Social Sciences 14 / 802
Dates and Times
Doing arithmetic with dates
Dates are “actually” numbers, so can add and subtract (difference is
2007 date in d minus others):
somedates %>% mutate(plus30 = d + 30, diffs = d[2] - d)
## # A tibble: 3 x 5
## text d numbers plus30 diffs
##
## 1 1970-01-01 1970-01-01 0 1970-01-31 13760 da…
## 2 2007-09-04 2007-09-04 13760 2007-10-04 0 da…
## 3 1931-08-05 1931-08-05 -14029 1931-09-04 27789 da…
Lecture notes STAD29: Statistics for the Life and Social Sciences 15 / 802
Dates and Times
Reading in dates from a file
read_csv and the others can guess that you have dates, if you format
them as year-month-day, like column 1 of this .csv:
date,status,dunno
2011-08-03,hello,August 3 2011
2011-11-15,still here,November 15 2011
2012-02-01,goodbye,February 1 2012
Then read them in:
my_url <- "http://www.utsc.utoronto.ca/~butler/c32/mydates.csv"
ddd <- read_csv(my_url)
## Parsed with column specification:
## cols(
## date = col_date(format = ""),
## status = col_character(),
## dunno = col_character()
## )
read_csv guessed that the 1st column is dates, but not 3rd.
Lecture notes STAD29: Statistics for the Life and Social Sciences 16 / 802
Dates and Times
The data as read in
ddd
## # A tibble: 3 x 3
## date status dunno
##
## 1 2011-08-03 hello August 3 2011
## 2 2011-11-15 still here November 15 2011
## 3 2012-02-01 goodbye February 1 2012
Lecture notes STAD29: Statistics for the Life and Social Sciences 17 / 802
Dates and Times
Dates in other formats
Preceding shows that dates should be stored as text in format
yyyy-mm-dd (ISO standard).
To deal with dates in other formats, use package lubridate and
convert. For example, dates in US format with month first:
tibble(usdates = c("05/27/2012", "01/03/2016", "12/31/2015")) %>%
mutate(iso = mdy(usdates))
## # A tibble: 3 x 2
## usdates iso
##
## 1 05/27/2012 2012-05-27
## 2 01/03/2016 2016-01-03
## 3 12/31/2015 2015-12-31
Lecture notes STAD29: Statistics for the Life and Social Sciences 18 / 802
Dates and Times
Trying to read these as UK dates
tibble(usdates = c("05/27/2012", "01/03/2016", "12/31/2015")) %>%
mutate(uk = dmy(usdates))
## Warning: 2 failed to parse.
## # A tibble: 3 x 2
## usdates uk
##
## 1 05/27/2012 NA
## 2 01/03/2016 2016-03-01
## 3 12/31/2015 NA
For UK-format dates with month second, one of these dates is legit,
but the other two make no sense.
Lecture notes STAD29: Statistics for the Life and Social Sciences 19 / 802
Dates and Times
Our data frame’s last column:
Back to this:
ddd
## # A tibble: 3 x 3
## date status dunno
##
## 1 2011-08-03 hello August 3 2011
## 2 2011-11-15 still here November 15 2011
## 3 2012-02-01 goodbye February 1 2012
Month, day, year in that order.
Lecture notes STAD29: Statistics for the Life and Social Sciences 20 / 802
Dates and Times
so interpret as such
(ddd %>% mutate(date2 = mdy(dunno)) -> d4)
## # A tibble: 3 x 4
## date status dunno date2
##
## 1 2011-08-03 hello August 3 2011 2011-08-03
## 2 2011-11-15 still here November 15 2011 2011-11-15
## 3 2012-02-01 goodbye February 1 2012 2012-02-01
Lecture notes STAD29: Statistics for the Life and Social Sciences 21 / 802
Dates and Times
Are they really the same?
Column date2 was correctly converted from column dunno:
d4 %>% mutate(equal = identical(date, date2))
## # A tibble: 3 x 5
## date status dunno date2 equal
##
## 1 2011-08-03 hello August 3 20… 2011-08-03 TRUE
## 2 2011-11-15 still he… November 15… 2011-11-15 TRUE
## 3 2012-02-01 goodbye February 1 … 2012-02-01 TRUE
The two columns of dates are all the same.
Lecture notes STAD29: Statistics for the Life and Social Sciences 22 / 802
Dates and Times
Making dates from pieces
Starting from this file:
year month day
1970 1 1
2007 9 4
1940 4 15
my_url <- "http://www.utsc.utoronto.ca/~butler/c32/pieces.txt"
dates0 <- read_delim(my_url, " ")
## Parsed with column specification:
## cols(
## year = col_double(),
## month = col_double(),
## day = col_double()
## )
Lecture notes STAD29: Statistics for the Life and Social Sciences 23 / 802
Dates and Times
Making some dates
dates0
## # A tibble: 3 x 3
## year month day
##
## 1 1970 1 1
## 2 2007 9 4
## 3 1940 4 15
dates0 %>%
unite(dates, day, month, year) %>%
mutate(d = dmy(dates)) -> newdates
Lecture notes STAD29: Statistics for the Life and Social Sciences 24 / 802
Dates and Times
The results
newdates
## # A tibble: 3 x 2
## dates d
##
## 1 1_1_1970 1970-01-01
## 2 4_9_2007 2007-09-04
## 3 15_4_1940 1940-04-15
unite glues things together with an underscore between them (if you
don’t specify anything else). Syntax: first thing is new column to be
created, other columns are what to make it out of.
unite makes the original variable columns year, month, day disappear.
The column dates is text, while d is a real date.
Lecture notes STAD29: Statistics for the Life and Social Sciences 25 / 802
Dates and Times
Extracting information from dates
newdates %>%
mutate(
mon = month(d),
day = day(d),
weekday = wday(d, label = T)
)
## # A tibble: 3 x 5
## dates d mon day weekday
##
## 1 1_1_1970 1970-01-01 1 1 Thu
## 2 4_9_2007 2007-09-04 9 4 Tue
## 3 15_4_1940 1940-04-15 4 15 Mon
Lecture notes STAD29: Statistics for the Life and Social Sciences 26 / 802
Dates and Times
Dates and times
Standard format for times is to put the time after the date, hours,
minutes, seconds:
(dd <- tibble(text = c(
"1970-01-01 07:50:01", "2007-09-04 15:30:00",
"1940-04-15 06:45:10", "2016-02-10 12:26:40"
)))
## # A tibble: 4 x 1
## text
##
## 1 1970-01-01 07:50:01
## 2 2007-09-04 15:30:00
## 3 1940-04-15 06:45:10
## 4 2016-02-10 12:26:40
Lecture notes STAD29: Statistics for the Life and Social Sciences 27 / 802
Dates and Times
Converting text to date-times:
Then get from this text using ymd_hms:
dd %>% mutate(dt = ymd_hms(text))
## # A tibble: 4 x 2
## text dt
##
## 1 1970-01-01 07:50:01 1970-01-01 07:50:01
## 2 2007-09-04 15:30:00 2007-09-04 15:30:00
## 3 1940-04-15 06:45:10 1940-04-15 06:45:10
## 4 2016-02-10 12:26:40 2016-02-10 12:26:40
Lecture notes STAD29: Statistics for the Life and Social Sciences 28 / 802
Dates and Times
Timezones
Default timezone is “Universal Coordinated Time”. Change it via tz=
and the name of a timezone:
dd %>%
mutate(dt = ymd_hms(text, tz = "America/Toronto")) -> dd
dd %>% mutate(zone = tz(dt))
## # A tibble: 4 x 3
## text dt zone
##
## 1 1970-01-01 07:50… 1970-01-01 07:50:01 America/Tor…
## 2 2007-09-04 15:30… 2007-09-04 15:30:00 America/Tor…
## 3 1940-04-15 06:45… 1940-04-15 06:45:10 America/Tor…
## 4 2016-02-10 12:26… 2016-02-10 12:26:40 America/Tor…
Lecture notes STAD29: Statistics for the Life and Social Sciences 29 / 802
Dates and Times
Extracting time parts
As you would expect:
dd %>%
select(-text) %>%
mutate(
h = hour(dt),
sec = second(dt),
min = minute(dt),
zone = tz(dt)
)
## # A tibble: 4 x 5
## dt h sec min zone
##
## 1 1970-01-01 07:50:01 7 1 50 America/Tor…
## 2 2007-09-04 15:30:00 15 0 30 America/Tor…
## 3 1940-04-15 06:45:10 6 10 45 America/Tor…
## 4 2016-02-10 12:26:40 12 40 26 America/Tor…Lecture notes STAD29: Statistics for the Life and Social Sciences 30 / 802
Dates and Times
Same times, but different time zone:
dd %>%
select(dt) %>%
mutate(oz = with_tz(dt, "Australia/Sydney"))
## # A tibble: 4 x 2
## dt oz
##
## 1 1970-01-01 07:50:01 1970-01-01 22:50:01
## 2 2007-09-04 15:30:00 2007-09-05 05:30:00
## 3 1940-04-15 06:45:10 1940-04-15 21:45:10
## 4 2016-02-10 12:26:40 2016-02-11 04:26:40
In more detail:
## [1] "1970-01-01 22:50:01 AEST"
## [2] "2007-09-05 05:30:00 AEST"
## [3] "1940-04-15 21:45:10 AEST"
## [4] "2016-02-11 04:26:40 AEDT"Lecture notes STAD29: Statistics for the Life and Social Sciences 31 / 802
Dates and Times
How long between date-times?
We may need to calculate the time between two events. For example,
these are the dates and times that some patients were admitted to and
discharged from a hospital:
admit,discharge
1981-12-10 22:00:00,1982-01-03 14:00:00
2014-03-07 14:00:00,2014-03-08 09:30:00
2016-08-31 21:00:00,2016-09-02 17:00:00
Lecture notes STAD29: Statistics for the Life and Social Sciences 32 / 802
Dates and Times
Do they get read in as date-times?
These ought to get read in and converted to date-times:
my_url <- "http://www.utsc.utoronto.ca/~butler/c32/hospital.csv"
stays <- read_csv(my_url)
## Parsed with column specification:
## cols(
## admit = col_datetime(format = ""),
## discharge = col_datetime(format = "")
## )
and so it proves.
Lecture notes STAD29: Statistics for the Life and Social Sciences 33 / 802
Dates and Times
Subtracting the date-times
In the obvious way, this gets us an answer:
stays %>% mutate(stay = discharge - admit)
## # A tibble: 3 x 3
## admit discharge stay
##
## 1 1981-12-10 22:00:00 1982-01-03 14:00:00 568.0 hou…
## 2 2014-03-07 14:00:00 2014-03-08 09:30:00 19.5 hou…
## 3 2016-08-31 21:00:00 2016-09-02 17:00:00 44.0 hou…
Number of hours; hard to interpret.
Lecture notes STAD29: Statistics for the Life and Social Sciences 34 / 802
Dates and Times
Days
Fractional number of days would be better:
# stays %>%
# mutate(stay_days = (discharge - admit) / ddays(1))
stays %>%
mutate(
stay_days = as.period(admit %--% discharge) / days(1))
## estimate only: convert to intervals for accuracy
## # A tibble: 3 x 3
## admit discharge stay_days
##
## 1 1981-12-10 22:00:00 1982-01-03 14:00:00 23.7
## 2 2014-03-07 14:00:00 2014-03-08 09:30:00 0.812
## 3 2016-08-31 21:00:00 2016-09-02 17:00:00 1.83
Lecture notes STAD29: Statistics for the Life and Social Sciences 35 / 802
Dates and Times
Completed days
Pull out with day() etc, as for a date-time
stays %>%
mutate(
stay = as.period(admit %--% discharge),
stay_days = day(stay),
stay_hours = hour(stay)
) %>%
select(starts_with("stay"))
## # A tibble: 3 x 3
## stay stay_days stay_hours
##
## 1 23d 16H 0M 0S 23 16
## 2 19H 30M 0S 0 19
## 3 1d 20H 0M 0S 1 20
Lecture notes STAD29: Statistics for the Life and Social Sciences 36 / 802
Dates and Times
Comments
Date-times are stored internally as seconds-since-something, so that
subtracting two of them will give, internally, a number of seconds.
Just subtracting the date-times is displayed as a time (in units that R
chooses for us).
Functions ddays(1), dminutes(1) etc. will give number of seconds
in a day or a minute, thus dividing by them will give (fractional) days,
minutes etc. This works for things like days/minutes with equal
numbers of seconds, but not months/years.
Better: convert to a “period”, then divide by days(1), months(1)
etc.
These ideas useful for calculating time from a start point until an event
happens (in this case, a patient being discharged from hospital).
Lecture notes STAD29: Statistics for the Life and Social Sciences 37 / 802
Review of (multiple) regression
Section 3
Review of (multiple) regression
Lecture notes STAD29: Statistics for the Life and Social Sciences 38 / 802
Review of (multiple) regression
Regression
Use regression when one variable is an outcome (response, ).
See if/how response depends on other variable(s), explanatory,
1, 2,….
Can have one or more than one explanatory variable, but always one
response.
Assumes a straight-line relationship between response and explanatory.
Ask:
is there a relationship between and ’s, and if so, which ones?
what does the relationship look like?
Lecture notes STAD29: Statistics for the Life and Social Sciences 39 / 802
Review of (multiple) regression
Packages
library(MASS) # for Box-Cox, later
library(tidyverse)
library(broom)
Lecture notes STAD29: Statistics for the Life and Social Sciences 40 / 802
Review of (multiple) regression
A regression with one
13 children, measure average total sleep time (ATST, mins) and age (years)
for each. See if ATST depends on age. Data in sleep.txt, ATST then
age. Read in data:
my_url <- "http://www.utsc.utoronto.ca/~butler/d29/sleep.txt"
sleep <- read_delim(my_url, " ")
## Parsed with column specification:
## cols(
## atst = col_double(),
## age = col_double()
## )
Lecture notes STAD29: Statistics for the Life and Social Sciences 41 / 802
Review of (multiple) regression
Check data
summary(sleep)
## atst age
## Min. :461.8 Min. : 4.400
## 1st Qu.:491.1 1st Qu.: 7.200
## Median :528.3 Median : 8.900
## Mean :519.3 Mean : 9.058
## 3rd Qu.:532.5 3rd Qu.:11.100
## Max. :586.0 Max. :14.000
Make scatter plot of ATST (response) vs. age (explanatory) using code
overleaf:
Lecture notes STAD29: Statistics for the Life and Social Sciences 42 / 802
Review of (multiple) regression
The scatterplot
ggplot(sleep, aes(x = age, y = atst)) + geom_point()
l
l
l
l
l
l
l
l
l
l
l
ll
500
550
4 6 8 10 12 14
age
a
ts
t
Figure 1: plot of chunk suggo
Lecture notes STAD29: Statistics for the Life and Social Sciences 43 / 802
Review of (multiple) regression
Correlation
Measures how well a straight line fits the data:
with(sleep, cor(atst, age))
## [1] -0.9515469
1 is perfect upward trend, −1 is perfect downward trend, 0 is no trend.
This one close to perfect downward trend.
Can do correlations of all pairs of variables:
cor(sleep)
## atst age
## atst 1.0000000 -0.9515469
## age -0.9515469 1.0000000
Lecture notes STAD29: Statistics for the Life and Social Sciences 44 / 802
Review of (multiple) regression
Lowess curve
Sometimes nice to guide the eye: is the trend straight, or not?
Idea: lowess curve. “Locally weighted least squares”, not affected by
outliers, not constrained to be linear.
Lowess is a guide: even if straight line appropriate, may wiggle/bend a
little. Looking for serious problems with linearity.
Add lowess curve to plot using geom_smooth:
Lecture notes STAD29: Statistics for the Life and Social Sciences 45 / 802
Review of (multiple) regression
Plot with lowess curve
ggplot(sleep, aes(x = age, y = atst)) + geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
l
l
l
l
l
l
l
l
l
l
l
ll
450
500
550
600
4 6 8 10 12 14
age
a
ts
t
Figure 2: plot of chunk icko
Lecture notes STAD29: Statistics for the Life and Social Sciences 46 / 802
Review of (multiple) regression
The regression
Scatterplot shows no obvious curve, and a pretty clear downward trend. So
we can run the regression:
sleep.1 <- lm(atst ~ age, data = sleep)
Lecture notes STAD29: Statistics for the Life and Social Sciences 47 / 802
Review of (multiple) regression
The output
summary(sleep.1)
##
## Call:
## lm(formula = atst ~ age, data = sleep)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.011 -9.365 2.372 6.770 20.411
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 646.483 12.918 50.05 2.49e-14 ***
## age -14.041 1.368 -10.26 5.70e-07 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.15 on 11 degrees of freedom
## Multiple R-squared: 0.9054, Adjusted R-squared: 0.8968
## F-statistic: 105.3 on 1 and 11 DF, p-value: 5.7e-07
Lecture notes STAD29: Statistics for the Life and Social Sciences 48 / 802
Review of (multiple) regression
Conclusions
The relationship appears to be a straight line, with a downward trend.
-tests for model as a whole and -test for slope (same) both confirm
this (P-value 5.7 × 10−7 = 0.00000057).
Slope is −14, so a 1-year increase in age goes with a 14-minute
decrease in ATST on average.
R-squared is correlation squared (when one anyway), between 0 and
1 (1 good, 0 bad).
Here R-squared is 0.9054, pleasantly high.
Lecture notes STAD29: Statistics for the Life and Social Sciences 49 / 802
Review of (multiple) regression
Doing things with the regression output
Output from regression (and eg. -test) is all right to look at, but hard
to extract and re-use information from.
Package broom extracts info from model output in way that can be
used in pipe (later):
tidy(sleep.1)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
##
## 1 (Intercept) 646. 12.9 50.0 2.49e-14
## 2 age -14.0 1.37 -10.3 5.70e- 7
Lecture notes STAD29: Statistics for the Life and Social Sciences 50 / 802
Review of (multiple) regression
also one-line summary of model:
glance(sleep.1)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df
##
## 1 0.905 0.897 13.2 105. 5.70e-7 2
## # … with 5 more variables: logLik , AIC ,
## # BIC , deviance , df.residual
Lecture notes STAD29: Statistics for the Life and Social Sciences 51 / 802
Review of (multiple) regression
Broom part 2
sleep.1 %>% augment(sleep) %>% slice(1:8)
## # A tibble: 8 x 9
## atst age .fitted .se.fit .resid .hat .sigma .cooksd
##
## 1 586 4.4 585. 7.34 1.30 0.312 13.8 0.00320
## 2 462. 14 450. 7.68 11.8 0.341 13.0 0.319
## 3 491. 10.1 505. 3.92 -13.6 0.0887 13.0 0.0568
## 4 565 6.7 552. 4.87 12.6 0.137 13.1 0.0844
## 5 462 11.5 485. 4.95 -23.0 0.141 11.3 0.294
## 6 532. 9.6 512. 3.72 20.4 0.0801 12.0 0.114
## 7 478. 12.4 472. 5.85 5.23 0.198 13.7 0.0243
## 8 515. 8.9 522. 3.65 -6.32 0.0772 13.6 0.0105
## # … with 1 more variable: .std.resid
Useful for plotting residuals against an -variable.
Lecture notes STAD29: Statistics for the Life and Social Sciences 52 / 802
Review of (multiple) regression
CI for mean response and prediction intervals
Once useful regression exists, use it for prediction:
To get a single number for prediction at a given , substitute into
regression equation, eg. age 10: predicted ATST is
646.48 − 14.04(10) = 506 minutes.
To express uncertainty of this prediction:
CI for mean response expresses uncertainty about mean ATST for all
children aged 10, based on data.
Prediction interval expresses uncertainty about predicted ATST for a
new child aged 10 whose ATST not known. More uncertain.
Also do above for a child aged 5.
Lecture notes STAD29: Statistics for the Life and Social Sciences 53 / 802
Review of (multiple) regression
Intervals
Make new data frame with these values for age
my.age <- c(10, 5)
ages.new <- tibble(age = my.age)
ages.new
## # A tibble: 2 x 1
## age
##
## 1 10
## 2 5
Feed into predict:
pc <- predict(sleep.1, ages.new, interval = "c")
pp <- predict(sleep.1, ages.new, interval = "p")
Lecture notes STAD29: Statistics for the Life and Social Sciences 54 / 802
Review of (multiple) regression
The intervals
Confidence intervals for mean response:
cbind(ages.new, pc)
## age fit lwr upr
## 1 10 506.0729 497.5574 514.5883
## 2 5 576.2781 561.6578 590.8984
Prediction intervals for new response:
cbind(ages.new, pp)
## age fit lwr upr
## 1 10 506.0729 475.8982 536.2475
## 2 5 576.2781 543.8474 608.7088
Lecture notes STAD29: Statistics for the Life and Social Sciences 55 / 802
Review of (multiple) regression
Comments
Age 10 closer to centre of data, so intervals are both narrower than
those for age 5.
Prediction intervals bigger than CI for mean (additional uncertainty).
Technical note: output from predict is R matrix, not data frame, so
Tidyverse bind_cols does not work. Use base R cbind.
Lecture notes STAD29: Statistics for the Life and Social Sciences 56 / 802
Review of (multiple) regression
That grey envelope
Marks confidence interval for mean for all :
ggplot(sleep, aes(x = age, y = atst)) + geom_point() +
geom_smooth(method = "lm") +
scale_y_continuous(breaks = seq(420, 600, 20))
l
l
l
l
l
l
l
l
l
l
l
ll
440
460
480
500
520
540
560
580
600
4 6 8 10 12 14
age
a
ts
t
Figure 3: plot of chunk unnamed-chunk-41
Lecture notes STAD29: Statistics for the Life and Social Sciences 57 / 802
Review of (multiple) regression
Diagnostics
How to tell whether a straight-line regression is appropriate?
Before: check scatterplot for straight trend.
After: plot residuals (observed minus predicted response) against
predicted values. Aim: a plot with no pattern.
Lecture notes STAD29: Statistics for the Life and Social Sciences 58 / 802
Review of (multiple) regression
Residual plot
Not much pattern here — regression appropriate.
ggplot(sleep.1, aes(x = .fitted, y = .resid)) + geom_point()
l
l
l
l
l
l
l
l
l
l
ll
l
−20
−10
0
10
20
480 520 560
.fitted
.
re
si
d
Figure 4: plot of chunk akjhkadjfhjahnkkk
Lecture notes STAD29: Statistics for the Life and Social Sciences 59 / 802
Review of (multiple) regression
An inappropriate regression
Different data:
my_url <- "http://www.utsc.utoronto.ca/~butler/d29/curvy.txt"
curvy <- read_delim(my_url, " ")
## Parsed with column specification:
## cols(
## xx = col_double(),
## yy = col_double()
## )
Lecture notes STAD29: Statistics for the Life and Social Sciences 60 / 802
Review of (multiple) regression
Scatterplot
ggplot(curvy, aes(x = xx, y = yy)) + geom_point()
l
l
l
l
l
l
l
l
l
l
4
8
12
16
0.0 2.5 5.0 7.5
xx
yy
Figure 5: plot of chunk unnamed-chunk-42
Lecture notes STAD29: Statistics for the Life and Social Sciences 61 / 802
Review of (multiple) regression
Regression line, anyway
curvy.1 <- lm(yy ~ xx, data = curvy)
summary(curvy.1)
##
## Call:
## lm(formula = yy ~ xx, data = curvy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.582 -2.204 0.000 1.514 3.509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.5818 1.5616 4.855 0.00126 **
## xx 0.9818 0.2925 3.356 0.00998 **
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.657 on 8 degrees of freedom
## Multiple R-squared: 0.5848, Adjusted R-squared: 0.5329
## F-statistic: 11.27 on 1 and 8 DF, p-value: 0.009984
Lecture notes STAD29: Statistics for the Life and Social Sciences 62 / 802
Review of (multiple) regression
Residual plot
ggplot(curvy.1, aes(x = .fitted, y = .resid)) + geom_point()
l
l
l
l
l l
l
l
l l
−2
0
2
7.5 10.0 12.5 15.0
.fitted
.
re
si
d
Figure 6: plot of chunk altoadige
Lecture notes STAD29: Statistics for the Life and Social Sciences 63 / 802
Review of (multiple) regression
No good: fixing it up
Residual plot has curve: middle residuals positive, high and low ones
negative. Bad.
Fitting a curve would be better. Try this:
curvy.2 <- lm(yy ~ xx + I(xx^2), data = curvy)
Adding xx-squared term, to allow for curve.
Another way to do same thing: specify how model changes:
curvy.2a <- update(curvy.1, . ~ . + I(xx^2))
Lecture notes STAD29: Statistics for the Life and Social Sciences 64 / 802
Review of (multiple) regression
Regression 2
tidy(curvy.2)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
##
## 1 (Intercept) 3.9 0.773 5.04 0.00149
## 2 xx 3.74 0.400 9.36 0.0000331
## 3 I(xx^2) -0.307 0.0428 -7.17 0.000182
glance(curvy.2) #
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df
##
## 1 0.950 0.936 0.983 66.8 2.75e-5 3
## # … with 5 more variables: logLik , AIC ,
## # BIC , deviance , df.residual
Lecture notes STAD29: Statistics for the Life and Social Sciences 65 / 802
Review of (multiple) regression
Comments
xx-squared term definitely significant (P-value 0.000182), so need this
curve to describe relationship.
Adding squared term has made R-squared go up from 0.5848 to
0.9502: great improvement.
This is a definite curve!
Lecture notes STAD29: Statistics for the Life and Social Sciences 66 / 802
Review of (multiple) regression
The residual plot now
No problems any more:
ggplot(curvy.2, aes(x = .fitted, y = .resid)) + geom_point()
l
l
l
l
l l
l
l
l
l
−1.0
−0.5
0.0
0.5
1.0
6 9 12 15
.fitted
.
re
si
d
Figure 7: plot of chunk unnamed-chunk-47
Lecture notes STAD29: Statistics for the Life and Social Sciences 67 / 802
Review of (multiple) regression
Another way to handle curves
Above, saw that changing (adding 2) was a way of handling curved
relationships.
Another way: change (transformation).
Can guess how to change , or might be theory:
example: relationship = (exponential growth):
take logs to get ln = ln + .
Taking logs has made relationship linear (ln as response).
Or, estimate transformation, using Box-Cox method.
Lecture notes STAD29: Statistics for the Life and Social Sciences 68 / 802
Review of (multiple) regression
Box-Cox
Install package MASS via install.packages("MASS") (only need to
do once)
Every R session you want to use something in MASS, type
library(MASS)
Lecture notes STAD29: Statistics for the Life and Social Sciences 69 / 802
Review of (multiple) regression
Some made-up data
my_url <- "http://www.utsc.utoronto.ca/~butler/d29/madeup.csv"
madeup <- read_csv(my_url)
madeup
## # A tibble: 8 x 3
## row x y
##
## 1 1 0 17.9
## 2 2 1 33.6
## 3 3 2 82.7
## 4 4 3 31.2
## 5 5 4 177.
## 6 6 5 359.
## 7 7 6 469.
## 8 8 7 583.
Seems to be faster-than-linear growth, maybe exponential growth.
Lecture notes STAD29: Statistics for the Life and Social Sciences 70 / 802
Review of (multiple) regression
Scatterplot: faster than linear growth
ggplot(madeup, aes(x = x, y = y)) + geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
l l
l
l
l
l
l
l
0
250
500
0 2 4 6
x
y
Figure 8: plot of chunk dsljhsdjlhf
Lecture notes STAD29: Statistics for the Life and Social Sciences 71 / 802
Review of (multiple) regression
Running Box-Cox
library(MASS) first.
Feed boxcox a model formula with a squiggle in it, such as you would
use for lm.
Output: a graph (next page):
boxcox(y ~ x, data = madeup)
Lecture notes STAD29: Statistics for the Life and Social Sciences 72 / 802
Review of (multiple) regression
The Box-Cox output
−2 −1 0 1 2
−
20
−
15
−
10
−
5
λ
lo
g−
Li
ke
lih
oo
d
95%
Figure 9: plot of chunk trentoLecture notes STAD29: Statistics for the Life and Social Sciences 73 / 802
Review of (multiple) regression
Comments
(lambda) is the power by which you should transform to get the
relationship straight (straighter). Power 0 is “take logs”
Middle dotted line marks best single value of (here about 0.1).
Outer dotted lines mark 95% CI for , here −0.3 to 0.7, approx.
(Rather uncertain about best transformation.)
Any power transformation within the CI supported by data. In this
case, log ( = 0) and square root ( = 0.5) good, but no
transformation ( = 1) not.
Pick a “round-number” value of like 2, 1, 0.5, 0,−0.5,−1. Here 0
and 0.5 good values to pick.
Lecture notes STAD29: Statistics for the Life and Social Sciences 74 / 802
Review of (multiple) regression
Did transformation straighten things?
Plot transformed against . Here, log:
ggplot(madeup, aes(x = x, y = log(y))) + geom_point() +
geom_smooth()
l
l
l
l
l
l
l
l
2
4
6
8
0 2 4 6
x
lo
g(y
)
Figure 10: plot of chunk unnamed-chunk-50
Looks much straighter.
Lecture notes STAD29: Statistics for the Life and Social Sciences 75 / 802
Review of (multiple) regression
Regression with transformed
madeup.1 <- lm(log(y) ~ x, data = madeup)
glance(madeup.1)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df
##
## 1 0.883 0.864 0.501 45.3 5.24e-4 2
## # … with 5 more variables: logLik , AIC ,
## # BIC , deviance , df.residual
tidy(madeup.1)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
##
## 1 (Intercept) 2.91 0.323 8.99 0.000106
## 2 x 0.520 0.0773 6.73 0.000524
R-squared now decently high.
Lecture notes STAD29: Statistics for the Life and Social Sciences 76 / 802
Review of (multiple) regression
Multiple regression
What if more than one ? Extra issues:
Now one intercept and a slope for each : how to interpret?
Which -variables actually help to predict ?
Different interpretations of “global” -test and individual -tests.
R-squared no longer correlation squared, but still interpreted as “higher
better”.
In lm line, add extra s after ~.
Interpretation not so easy (and other problems that can occur).
Lecture notes STAD29: Statistics for the Life and Social Sciences 77 / 802
Review of (multiple) regression
Multiple regression example
Study of women and visits to health professionals, and how the number of
visits might be related to other variables:
timedrs: number of visits to health professionals (over course of study)
phyheal: number of physical health problems
menheal: number of mental health problems
stress: result of questionnaire about number and type of life changes
timedrs response, others explanatory.
Lecture notes STAD29: Statistics for the Life and Social Sciences 78 / 802
Review of (multiple) regression
The data
my_url <-
"http://www.utsc.utoronto.ca/~butler/d29/regressx.txt"
visits <- read_delim(my_url, " ")
## Parsed with column specification:
## cols(
## subjno = col_double(),
## timedrs = col_double(),
## phyheal = col_double(),
## menheal = col_double(),
## stress = col_double()
## )
Lecture notes STAD29: Statistics for the Life and Social Sciences 79 / 802
Review of (multiple) regression
Check data
visits
## # A tibble: 465 x 5
## subjno timedrs phyheal menheal stress
##
## 1 1 1 5 8 265
## 2 2 3 4 6 415
## 3 3 0 3 4 92
## 4 4 13 2 2 241
## 5 5 15 3 6 86
## 6 6 3 5 5 247
## 7 7 2 5 6 13
## 8 8 0 4 5 12
## 9 9 7 5 4 269
## 10 10 4 3 9 391
## # … with 455 more rows
Lecture notes STAD29: Statistics for the Life and Social Sciences 80 / 802
Review of (multiple) regression
Fit multiple regression
visits.1 <- lm(timedrs ~ phyheal + menheal + stress,
data = visits)
glance(visits.1)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df
##
## 1 0.219 0.214 9.71 43.0 1.56e-24 4
## # … with 5 more variables: logLik , AIC ,
## # BIC , deviance , df.residual
Lecture notes STAD29: Statistics for the Life and Social Sciences 81 / 802
Review of (multiple) regression
The slopes
Model as a whole strongly significant even though R-sq not very big (lots of
data). At least one of the ’s predicts timedrs.
tidy(visits.1)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
##
## 1 (Intercept) -3.70 1.12 -3.30 1.06e- 3
## 2 phyheal 1.79 0.221 8.08 5.60e-15
## 3 menheal -0.00967 0.129 -0.0749 9.40e- 1
## 4 stress 0.0136 0.00361 3.77 1.85e- 4
The physical health and stress variables initely help to predict the number of
visits, but with those in the model we don’t need menheal. However, look
at prediction of timedrs from menheal by itself:
Lecture notes STAD29: Statistics for the Life and Social Sciences 82 / 802
Review of (multiple) regression
Just menheal
visits.2 <- lm(timedrs ~ menheal, data = visits)
glance(visits.2)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df
##
## 1 0.0653 0.0633 10.6 32.4 2.28e-8 2
## # … with 5 more variables: logLik , AIC ,
## # BIC , deviance , df.residual
tidy(visits.2)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
##
## 1 (Intercept) 3.82 0.870 4.38 0.0000144
## 2 menheal 0.667 0.117 5.69 0.0000000228
Lecture notes STAD29: Statistics for the Life and Social Sciences 83 / 802
Review of (multiple) regression
menheal by itself
menheal by itself does significantly help to predict timedrs.
But the R-sq is much less (6.5% vs. 22%).
So other two variables do a better job of prediction.
With those variables in the regression (phyheal and stress), don’t
need menheal as well.
Lecture notes STAD29: Statistics for the Life and Social Sciences 84 / 802
Review of (multiple) regression
Investigating via correlation
Leave out first column (subjno):
visits %>% select(-subjno) %>% cor()
## timedrs phyheal menheal stress
## timedrs 1.0000000 0.4395293 0.2555703 0.2865951
## phyheal 0.4395293 1.0000000 0.5049464 0.3055517
## menheal 0.2555703 0.5049464 1.0000000 0.3697911
## stress 0.2865951 0.3055517 0.3697911 1.0000000
phyheal most strongly correlated with timedrs.
Not much to choose between other two.
But menheal has higher correlation with phyheal, so not as much to
add to prediction as stress.
Goes to show things more complicated in multiple regression.
Lecture notes STAD29: Statistics for the Life and Social Sciences 85 / 802
Review of (multiple) regression
Residual plot (from timedrs on all)
ggplot(visits.1, aes(x = .fitted, y = .resid)) + geom_point()
l
l
l
l
l
l
ll
ll
l
l l
l
l
ll
l
l
l
l
l
l
l
l
ll
ll
l
l l
l
l l
ll
l l
l
l
l
l
l
ll l
l
l
l
ll
l
l
ll
l l
l
l l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
ll lll l
l l
l
l
l
l
l
l
ll l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l l l
l
l
l
ll
l
l
l
l
l
l
ll l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l lll l
lll
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l
l
l ll
ll
l
l
l
l
l
l
l
l
ll
l
l
l
l l
l
lll
l
l
ll
l
l
l l
ll l
l
l
l
l
l
l
l
l
l
l
ll l
l
ll l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
lll
l
l
l l l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l l
l
l
ll
l
l
l
l
l
l
l
l
lll
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
ll
0
25
50
0 10 20
.fitted
.
re
si
d
Figure 11: plot of chunk iffy8Lecture notes STAD29: Statistics for the Life and Social Sciences 86 / 802
Review of (multiple) regression
Comment
Apparently random. But…
Lecture notes STAD29: Statistics for the Life and Social Sciences 87 / 802
Review of (multiple) regression
Normal quantile plot of residuals
ggplot(visits.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()
l l l l l
lllll
lllllllllllllllll
lllllllllll
llllllllllll
lllllllllllllllllll
lllllllllllllllllllllll
lllllllllllllllllll
lllllllllllllll
lllllllll
lllll
llll
lll
llll
lll
l
ll
l
llll
l
ll
l
lll
l
ll
l
l
l l l
l
l
0
25
50
−2 0 2
theoretical
sa
m
pl
e
Figure 12: plot of chunk unnamed-chunk-58Lecture notes STAD29: Statistics for the Life and Social Sciences 88 / 802
Review of (multiple) regression
Absolute residuals
Is there trend in size of residuals (fan-out)? Plot absolute value of residual
against fitted value (graph next page):
g <- ggplot(visits.1, aes(x = .fitted, y = abs(.resid))) +
geom_point() + geom_smooth()
Lecture notes STAD29: Statistics for the Life and Social Sciences 89 / 802
Review of (multiple) regression
The plot
l
l
l
l
l
l
ll
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
ll
l
l l
l
l
l
ll
l
l
l
l
l
l
l
ll
ll
l
l
l
l
l
l
ll
l l
l
l ll l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l l l
l
ll
l
l
ll ll
l
l
l
l
l
ll ll
l
ll
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l l
l
l l
l
l
l
ll
l
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
ll
l
l
l
l
l
l
l ll
l l
lll l
l
l
l
l
l
ll
l
l
l
l l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
ll
l
l
ll
l
l
l l
ll l
l
l
l
l
l l
l
l
l
l
l
l
ll
l
l
l
l l l
l
l
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
lll
l
l
l l l
l
l
l
l
ll l
l
l
l
l
l
l
l
l
l
l
ll
l
l
ll
l
l
l l
l
lll l
l
l
l
ll l
ll
0
20
40
60
0 10 20
.fitted
a
bs
(.r
es
id)
Figure 13: plot of chunk unnamed-chunk-60
Lecture notes STAD29: Statistics for the Life and Social Sciences 90 / 802
Review of (multiple) regression
Comments
On the normal quantile plot:
highest (most positive) residuals are way too high
distribution of residuals skewed to right (not normal at all)
On plot of absolute residuals:
size of residuals getting bigger as fitted values increase
predictions getting more variable as fitted values increase
that is, predictions getting less accurate as fitted values increase, but
predictions should be equally accurate all way along.
Both indicate problems with regression, of kind that transformation of
response often fixes: that is, predict function of response timedrs
instead of timedrs itself.
Lecture notes STAD29: Statistics for the Life and Social Sciences 91 / 802
Review of (multiple) regression
Box-Cox transformations
Taking log of timedrs and having it work: lucky guess. How to find
good transformation?
Box-Cox again.
Extra problem: some of timedrs values are 0, but Box-Cox expects all
+. Note response for boxcox:
boxcox(timedrs + 1 ~ phyheal + menheal + stress, data = visits)
Lecture notes STAD29: Statistics for the Life and Social Sciences 92 / 802
Review of (multiple) regression
Try 1
−2 −1 0 1 2
−
24
00
−
20
00
−
16
00
λ
lo
g−
Li
ke
lih
oo
d
95%
Figure 14: plot of chunk unnamed-chunk-62
Lecture notes STAD29: Statistics for the Life and Social Sciences 93 / 802
Review of (multiple) regression
Comments on try 1
Best: just less than zero.
Hard to see scale.
Focus on in (−0.3, 0.1):
my.lambda <- seq(-0.3, 0.1, 0.01)
my.lambda
## [1] -0.30 -0.29 -0.28 -0.27 -0.26 -0.25 -0.24 -0.23 -0.22
## [10] -0.21 -0.20 -0.19 -0.18 -0.17 -0.16 -0.15 -0.14 -0.13
## [19] -0.12 -0.11 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04
## [28] -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05
## [37] 0.06 0.07 0.08 0.09 0.10
Lecture notes STAD29: Statistics for the Life and Social Sciences 94 / 802
Review of (multiple) regression
Try 2
boxcox(timedrs + 1 ~ phyheal + menheal + stress,
lambda = my.lambda,
data = visits
)
−0.3 −0.2 −0.1 0.0 0.1
−
13
15
−
13
05
λ
lo
g−
Li
ke
lih
oo
d
95%
Figure 15: plot of chunk unnamed-chunk-64
Lecture notes STAD29: Statistics for the Life and Social Sciences 95 / 802
Review of (multiple) regression
Comments
Best: just about −0.07.
CI for about (−0.14, 0.01).
Only nearby round number: = 0, log transformation.
Lecture notes STAD29: Statistics for the Life and Social Sciences 96 / 802
Review of (multiple) regression
Fixing the problems
Try regression again, with transformed response instead of original one.
Then check residual plot to see that it is OK now.
visits.3 <- lm(log(timedrs + 1) ~ phyheal + menheal + stress,
data = visits
)
timedrs+1 because some timedrs values 0, can’t take log of 0.
Won’t usually need to worry about this, but when response could be
zero/negative, fix that before transformation.
Lecture notes STAD29: Statistics for the Life and Social Sciences 97 / 802
Review of (multiple) regression
Output
summary(visits.3)
##
## Call:
## lm(formula = log(timedrs + 1) ~ phyheal + menheal + stress, data = visits)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.95865 -0.44076 -0.02331 0.42304 2.36797
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3903862 0.0882908 4.422 1.22e-05 ***
## phyheal 0.2019361 0.0173624 11.631 < 2e-16 ***
## menheal 0.0071442 0.0101335 0.705 0.481
## stress 0.0013158 0.0002837 4.638 4.58e-06 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7625 on 461 degrees of freedom
## Multiple R-squared: 0.3682, Adjusted R-squared: 0.3641
## F-statistic: 89.56 on 3 and 461 DF, p-value: < 2.2e-16
Lecture notes STAD29: Statistics for the Life and Social Sciences 98 / 802
Review of (multiple) regression
Comments
Model as a whole strongly significant again
R-sq higher than before (37% vs. 22%) suggesting things more linear
now
Same conclusion re menheal: can take out of regression.
Should look at residual plots (next pages). Have we fixed problems?
Lecture notes STAD29: Statistics for the Life and Social Sciences 99 / 802
Review of (multiple) regression
Residuals against fitted values
ggplot(visits.3, aes(x = .fitted, y = .resid)) +
geom_point()
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l ll
l l
l
l
l
l
l
l
l
ll
l
l
ll
l
l
l
l
l
l
l
l l
l
l
l
l
l
ll
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll l
l
l
l
l
l
ll
l
l
l
l
l
l
l l
l
l
lll
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
−2
−1
0
1
2
1 2 3 4
.fitted
.
re
si
d
Figure 16: plot of chunk unnamed-chunk-67
Lecture notes STAD29: Statistics for the Life and Social Sciences 100 / 802
Review of (multiple) regression
Normal quantile plot of residuals
ggplot(visits.3, aes(sample = .resid)) + stat_qq() + stat_qq_line()
l l l
l
l ll
lll
lll
llllll
lllll
lll
llll
llll
lllll
lllll
llll
llll
llllll
lllll
llllll
lllll
lllllll
lllllll
lllllll
llllll
llllll
lllll
lllll
lllll
lllll
lllll
lllll
llll
lll
lll
llll
ll
lll
ll
lll
ll
llllll
l
l
llll
l l
l l
l
−2
−1
0
1
2
−2 0 2
theoretical
sa
m
pl
e
Figure 17: plot of chunk unnamed-chunk-68
Lecture notes STAD29: Statistics for the Life and Social Sciences 101 / 802
Review of (multiple) regression
Absolute residuals against fitted
ggplot(visits.3, aes(x = .fitted, y = abs(.resid))) +
geom_point() + geom_smooth()
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l l l
l
l
l
l
ll
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
lll
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
lll
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l l
l
l
l l
l
l
l
ll
l
l
l
l
l
l
l
l
l l
l
0.0
0.5
1.0
1.5
2.0
1 2 3 4
.fitted
a
bs
(.r
es
id)
Figure 18: plot of chunk unnamed-chunk-69
Lecture notes STAD29: Statistics for the Life and Social Sciences 102 / 802
Review of (multiple) regression
Comments
Residuals vs. fitted looks a lot more random.
Normal quantile plot looks a lot more normal (though still a little
right-skewness)
Absolute residuals: not so much trend (though still some).
Not perfect, but much improved.
Lecture notes STAD29: Statistics for the Life and Social Sciences 103 / 802
Review of (multiple) regression
Testing more than one at once
The -tests test only whether one variable could be taken out of the
regression you’re looking at.
To test significance of more than one variable at once, fit model with
and without variables
then use anova to compare fit of models:
visits.5 <- lm(log(timedrs + 1) ~ phyheal + menheal + stress,
data = visits)
visits.6 <- lm(log(timedrs + 1) ~ stress, data = visits)
Lecture notes STAD29: Statistics for the Life and Social Sciences 104 / 802
Review of (multiple) regression
Results of tests
anova(visits.6, visits.5)
## Analysis of Variance Table
##
## Model 1: log(timedrs + 1) ~ stress
## Model 2: log(timedrs + 1) ~ phyheal + menheal + stress
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 463 371.47
## 2 461 268.01 2 103.46 88.984 < 2.2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Models don’t fit equally well, so bigger one fits better.
Or “taking both variables out makes the fit worse, so don’t do it”.
Taking out those ’s is a mistake. Or putting them in is a good idea.
Lecture notes STAD29: Statistics for the Life and Social Sciences 105 / 802
Review of (multiple) regression
The punting data
Data set punting.txt contains 4 variables for 13 right-footed football
kickers (punters): left leg and right leg strength (lbs), distance punted (ft),
another variable called “fred”. Predict punting distance from other variables:
left right punt fred
170 170 162.50 171
130 140 144.0 136
170 180 174.50 174
160 160 163.50 161
150 170 192.0 159
150 150 171.75 151
180 170 162.0 174
110 110 104.83 111
110 120 105.67 114
120 130 117.58 126
140 120 140.25 129
130 140 150.17 136
150 160 165.17 154
Lecture notes STAD29: Statistics for the Life and Social Sciences 106 / 802
Review of (multiple) regression
Reading in
Separated by multiple spaces with columns lined up:
my_url <- "http://www.utsc.utoronto.ca/~butler/d29/punting.txt"
punting <- read_table(my_url)
## Parsed with column specification:
## cols(
## left = col_double(),
## right = col_double(),
## punt = col_double(),
## fred = col_double()
## )
Lecture notes STAD29: Statistics for the Life and Social Sciences 107 / 802
Review of (multiple) regression
The data
punting
## # A tibble: 13 x 4
## left right punt fred
##
## 1 170 170 162. 171
## 2 130 140 144 136
## 3 170 180 174. 174
## 4 160 160 164. 161
## 5 150 170 192 159
## 6 150 150 172. 151
## 7 180 170 162 174
## 8 110 110 105. 111
## 9 110 120 106. 114
## 10 120 130 118. 126
## 11 140 120 140. 129
## 12 130 140 150. 136
## 13 150 160 165. 154
Lecture notes STAD29: Statistics for the Life and Social Sciences 108 / 802
Review of (multiple) regression
Regression and output
punting.1 <- lm(punt ~ left + right + fred, data = punting)
glance(punting.1)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df
##
## 1 0.778 0.704 14.7 10.5 0.00267 4
## # … with 5 more variables: logLik , AIC ,
## # BIC , deviance , df.residual
tidy(punting.1)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
##
## 1 (Intercept) -4.69 29.1 -0.161 0.876
## 2 left 0.268 2.11 0.127 0.902
## 3 right 1.05 2.15 0.490 0.636
## 4 fred -0.267 4.23 -0.0632 0.951
Lecture notes STAD29: Statistics for the Life and Social Sciences 109 / 802
Review of (multiple) regression
Comments
Overall regression strongly significant, R-sq high.
None of the ’s significant! Why?
-tests only say that you could take any one of the ’s out without
damaging the fit; doesn’t matter which one.
Explanation: look at correlations.
Lecture notes STAD29: Statistics for the Life and Social Sciences 110 / 802
Review of (multiple) regression
The correlations
cor(punting)
## left right punt fred
## left 1.0000000 0.8957224 0.8117368 0.9722632
## right 0.8957224 1.0000000 0.8805469 0.9728784
## punt 0.8117368 0.8805469 1.0000000 0.8679507
## fred 0.9722632 0.9728784 0.8679507 1.0000000
All correlations are high: ’s with punt (good) and with each other
(bad, at least confusing).
What to do? Probably do just as well to pick one variable, say right
since kickers are right-footed.
Lecture notes STAD29: Statistics for the Life and Social Sciences 111 / 802
Review of (multiple) regression
Just right
punting.2 <- lm(punt ~ right, data = punting)
anova(punting.2, punting.1)
## Analysis of Variance Table
##
## Model 1: punt ~ right
## Model 2: punt ~ left + right + fred
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 11 1962.5
## 2 9 1938.2 2 24.263 0.0563 0.9456
No significant loss by dropping other two variables.
Lecture notes STAD29: Statistics for the Life and Social Sciences 112 / 802
Review of (multiple) regression
Comparing R-squareds
summary(punting.1)$r.squared
## [1] 0.7781401
summary(punting.2)$r.squared
## [1] 0.7753629
Basically no difference. In regression (over), right significant:
Lecture notes STAD29: Statistics for the Life and Social Sciences 113 / 802
Review of (multiple) regression
Regression results
tidy(punting.2)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
##
## 1 (Intercept) -3.69 25.3 -0.146 0.886
## 2 right 1.04 0.169 6.16 0.0000709
Lecture notes STAD29: Statistics for the Life and Social Sciences 114 / 802
Review of (multiple) regression
But…
Maybe we got the form of the relationship with left wrong.
Check: plot residuals from previous regression (without left) against
left.
Residuals here are “punting distance adjusted for right leg strength”.
If there is some kind of relationship with left, we should include in
model.
Plot of residuals against original variable: augment from broom.
Lecture notes STAD29: Statistics for the Life and Social Sciences 115 / 802
Review of (multiple) regression
Augmenting punting.2
punting.2 %>% augment(punting) -> punting.2.aug
punting.2.aug %>% slice(1:8)
## # A tibble: 8 x 11
## left right punt fred .fitted .se.fit .resid .hat
##
## 1 170 170 162. 171 174. 5.29 -11.1 0.157
## 2 130 140 144 136 142. 3.93 1.72 0.0864
## 3 170 180 174. 174 184. 6.60 -9.49 0.244
## 4 160 160 164. 161 163. 4.25 0.366 0.101
## 5 150 170 192 159 174. 5.29 18.4 0.157
## 6 150 150 172. 151 153. 3.73 19.0 0.0778
## 7 180 170 162 174 174. 5.29 -11.6 0.157
## 8 110 110 105. 111 111. 7.38 -6.17 0.305
## # … with 3 more variables: .sigma , .cooksd ,
## # .std.resid
Lecture notes STAD29: Statistics for the Life and Social Sciences 116 / 802
Review of (multiple) regression
Residuals against left
ggplot(punting.2.aug, aes(x = left, y = .resid)) +
geom_point()
l
l
l
l
ll
l
l
l
l
l
l
l
−10
0
10
20
120 140 160 180
left
.
re
si
d
Figure 19: plot of chunk basingstoke
Lecture notes STAD29: Statistics for the Life and Social Sciences 117 / 802
Review of (multiple) regression
Comments
There is a curved relationship with left.
We should add left-squared to the regression (and therefore put left
back in when we do that):
punting.3 <- lm(punt ~ left + I(left^2) + right,
data = punting
)
Lecture notes STAD29: Statistics for the Life and Social Sciences 118 / 802
Review of (multiple) regression
Regression with left-squared
summary(punting.3)
##
## Call:
## lm(formula = punt ~ left + I(left^2) + right, data = punting)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.3777 -5.3599 0.0459 4.5088 13.2669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.623e+02 9.902e+01 -4.669 0.00117 **
## left 6.888e+00 1.462e+00 4.710 0.00110 **
## I(left^2) -2.302e-02 4.927e-03 -4.672 0.00117 **
## right 7.396e-01 2.292e-01 3.227 0.01038 *
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.931 on 9 degrees of freedom
## Multiple R-squared: 0.9352, Adjusted R-squared: 0.9136
## F-statistic: 43.3 on 3 and 9 DF, p-value: 1.13e-05
Lecture notes STAD29: Statistics for the Life and Social Sciences 119 / 802
Review of (multiple) regression
Comments
This was definitely a good idea (R-squared has clearly increased).
We would never have seen it without plotting residuals from
punting.2 (without left) against left.
Negative slope for leftsq means that increased left-leg strength only
increases punting distance up to a point: beyond that, it decreases
again.
Lecture notes STAD29: Statistics for the Life and Social Sciences 120 / 802
Logistic regression (ordinal/nominal response)
Section 4
Logistic regression (ordinal/nominal response)
Lecture notes STAD29: Statistics for the Life and Social Sciences 121 / 802
Logistic regression (ordinal/nominal response)
Logistic regression
When response variable is measured/counted, regression can work well.
But what if response is yes/no, lived/died, success/failure?
Model probability of success.
Probability must be between 0 and 1; need method that ensures this.
Logistic regression does this. In R, is a generalized linear model with
binomial “family”:
glm(y ~ x, family="binomial")
Begin with simplest case.
Lecture notes STAD29: Statistics for the Life and Social Sciences 122 / 802
Logistic regression (ordinal/nominal response)
Packages
library(MASS)
library(tidyverse)
library(broom)
library(nnet)
Lecture notes STAD29: Statistics for the Life and Social Sciences 123 / 802
Logistic regression (ordinal/nominal response)
The rats, part 1
Rats given dose of some poison; either live or die:
dose status
0 lived
1 died
2 lived
3 lived
4 died
5 died
Lecture notes STAD29: Statistics for the Life and Social Sciences 124 / 802
Logistic regression (ordinal/nominal response)
Read in:
my_url <- "http://www.utsc.utoronto.ca/~butler/d29/rat.txt"
rats <- read_delim(my_url, " ")
## Parsed with column specification:
## cols(
## dose = col_double(),
## status = col_character()
## )
glimpse(rats)
## Observations: 6
## Variables: 2
## $ dose 0, 1, 2, 3, 4, 5
## $ status "lived", "died", "lived", "lived", "died",…
Lecture notes STAD29: Statistics for the Life and Social Sciences 125 / 802
Logistic regression (ordinal/nominal response)
Basic logistic regression
Make response into a factor first:
rats2 <- rats %>% mutate(status = factor(status))
then fit model:
status.1 <- glm(status ~ dose, family = "binomial", data = rats2)
Lecture notes STAD29: Statistics for the Life and Social Sciences 126 / 802
Logistic regression (ordinal/nominal response)
Output
summary(status.1)
##
## Call:
## glm(formula = status ~ dose, family = "binomial", data = rats2)
##
## Deviance Residuals:
## 1 2 3 4 5 6
## 0.5835 -1.6254 1.0381 1.3234 -0.7880 -0.5835
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.6841 1.7979 0.937 0.349
## dose -0.6736 0.6140 -1.097 0.273
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8.3178 on 5 degrees of freedom
## Residual deviance: 6.7728 on 4 degrees of freedom
## AIC: 10.773
##
## Number of Fisher Scoring iterations: 4
Lecture notes STAD29: Statistics for the Life and Social Sciences 127 / 802
Logistic regression (ordinal/nominal response)
Interpreting the output
Like (multiple) regression, get tests of significance of individual ’s
Here not significant (only 6 observations).
“Slope” for dose is negative, meaning that as dose increases,
probability of event modelled (survival) decreases.
Lecture notes STAD29: Statistics for the Life and Social Sciences 128 / 802
Logistic regression (ordinal/nominal response)
Output part 2: predicted survival probs
p <- predict(status.1, type = "response")
cbind(rats, p)
## dose status p
## 1 0 lived 0.8434490
## 2 1 died 0.7331122
## 3 2 lived 0.5834187
## 4 3 lived 0.4165813
## 5 4 died 0.2668878
## 6 5 died 0.1565510
Lecture notes STAD29: Statistics for the Life and Social Sciences 129 / 802
Logistic regression (ordinal/nominal response)
The rats, more
More realistic: more rats at each dose (say 10).
Listing each rat on one line makes a big data file.
Use format below: dose, number of survivals, number of deaths.
dose lived died
0 10 0
1 7 3
2 6 4
3 4 6
4 2 8
5 1 9
6 lines of data correspond to 60 actual rats.
Saved in rat2.txt.
Lecture notes STAD29: Statistics for the Life and Social Sciences 130 / 802
Logistic regression (ordinal/nominal response)
These data
my_url <- "http://www.utsc.utoronto.ca/~butler/d29/rat2.txt"
rat2 <- read_delim(my_url, " ")
## Parsed with column specification:
## cols(
## dose = col_double(),
## lived = col_double(),
## died = col_double()
## )
rat2
## # A tibble: 6 x 3
## dose lived died
##
## 1 0 10 0
## 2 1 7 3
## 3 2 6 4
## 4 3 4 6
## 5 4 2 8
## 6 5 1 9
Lecture notes STAD29: Statistics for the Life and Social Sciences 131 / 802
Logistic regression (ordinal/nominal response)
Create response matrix:
Each row contains multiple observations.
Create two-column response:
#survivals in first column,
#deaths in second.
response <- with(rat2, cbind(lived, died))
response
## lived died
## [1,] 10 0
## [2,] 7 3
## [3,] 6 4
## [4,] 4 6
## [5,] 2 8
## [6,] 1 9
Response is R matrix:
class(response)
## [1] "matrix"
Lecture notes STAD29: Statistics for the Life and Social Sciences 132 / 802
Logistic regression (ordinal/nominal response)
Fit logistic regression
using response you just made:
rat2.1 <- glm(response ~ dose,
family = "binomial",
data = rat2
)
Lecture notes STAD29: Statistics for the Life and Social Sciences 133 / 802
Logistic regression (ordinal/nominal response)
Output
summary(rat2.1)
##
## Call:
## glm(formula = response ~ dose, family = "binomial", data = rat2)
##
## Deviance Residuals:
## 1 2 3 4 5 6
## 1.3421 -0.7916 -0.1034 0.1034 0.0389 0.1529
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.3619 0.6719 3.515 0.000439 ***
## dose -0.9448 0.2351 -4.018 5.87e-05 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 27.530 on 5 degrees of freedom
## Residual deviance: 2.474 on 4 degrees of freedom
## AIC: 18.94
##
## Number of Fisher Scoring iterations: 4
Lecture notes STAD29: Statistics for the Life and Social Sciences 134 / 802
Logistic regression (ordinal/nominal response)
Predicted survival probs
p <- predict(rat2.1, type = "response")
cbind(rat2, p)
## dose lived died p
## 1 0 10 0 0.9138762
## 2 1 7 3 0.8048905
## 3 2 6 4 0.6159474
## 4 3 4 6 0.3840526
## 5 4 2 8 0.1951095
## 6 5 1 9 0.0861238
Lecture notes STAD29: Statistics for the Life and Social Sciences 135 / 802
Logistic regression (ordinal/nominal response)
Comments
Significant effect of dose.
Effect of larger dose is to decrease survival probability (“slope”
negative; also see in decreasing predictions.)
Lecture notes STAD29: Statistics for the Life and Social Sciences 136 / 802
Logistic regression (ordinal/nominal response)
Multiple logistic regression
With more than one , works much like multiple regression.
Example: study of patients with blood poisoning severe enough to
warrant surgery. Relate survival to other potential risk factors.
Variables, 1=present, 0=absent:
survival (death from sepsis=1), response
shock
malnutrition
alcoholism
age (as numerical variable)
bowel infarction
See what relates to death.
Lecture notes STAD29: Statistics for the Life and Social Sciences 137 / 802
Logistic regression (ordinal/nominal response)
Read in data
my_url <-
"http://www.utsc.utoronto.ca/~butler/d29/sepsis.txt"
sepsis <- read_delim(my_url, " ")
## Parsed with column specification:
## cols(
## death = col_double(),
## shock = col_double(),
## malnut = col_double(),
## alcohol = col_double(),
## age = col_double(),
## bowelinf = col_double()
## )
Lecture notes STAD29: Statistics for the Life and Social Sciences 138 / 802
Logistic regression (ordinal/nominal response)
The data
sepsis
## # A tibble: 106 x 6
## death shock malnut alcohol age bowelinf
##
## 1 0 0 0 0 56 0
## 2 0 0 0 0 80 0
## 3 0 0 0 0 61 0
## 4 0 0 0 0 26 0
## 5 0 0 0 0 53 0
## 6 1 0 1 0 87 0
## 7 0 0 0 0 21 0
## 8 1 0 0 1 69 0
## 9 0 0 0 0 57 0
## 10 0 0 1 0 76 0
## # … with 96 more rows
Lecture notes STAD29: Statistics for the Life and Social Sciences 139 / 802
Logistic regression (ordinal/nominal response)
Fit model
sepsis.1 <- glm(death ~ shock + malnut + alcohol + age +
bowelinf,
family = "binomial",
data = sepsis
)
Lecture notes STAD29: Statistics for the Life and Social Sciences 140 / 802
Logistic regression (ordinal/nominal response)
Output part 1
tidy(sepsis.1)
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
##
## 1 (Intercept) -9.75 2.54 -3.84 0.000124
## 2 shock 3.67 1.16 3.15 0.00161
## 3 malnut 1.22 0.728 1.67 0.0948
## 4 alcohol 3.35 0.982 3.42 0.000635
## 5 age 0.0922 0.0303 3.04 0.00237
## 6 bowelinf 2.80 1.16 2.40 0.0162
All P-values fairly small
but malnut not significant: remove.
Lecture notes STAD29: Statistics for the Life and Social Sciences 141 / 802
Logistic regression (ordinal/nominal response)
Removing malnut
sepsis.2 <- update(sepsis.1, . ~ . - malnut)
tidy(sepsis.2)
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
##
## 1 (Intercept) -8.89 2.32 -3.84 0.000124
## 2 shock 3.70 1.10 3.35 0.000797
## 3 alcohol 3.19 0.917 3.47 0.000514
## 4 age 0.0898 0.0292 3.07 0.00211
## 5 bowelinf 2.39 1.07 2.23 0.0260
Everything significant now.
Lecture notes STAD29: Statistics for the Life and Social Sciences 142 / 802
Logistic regression (ordinal/nominal response)
Comments
Most of the original ’s helped predict death. Only malnut seemed
not to add anything.
Removed malnut and tried again.
Everything remaining is significant (though bowelinf actually became
less significant).
All coefficients are positive, so having any of the risk factors (or being
older) increases risk of death.
Lecture notes STAD29: Statistics for the Life and Social Sciences 143 / 802
Logistic regression (ordinal/nominal response)
Predictions from model without “malnut”
A few chosen at random:
sepsis.pred <- predict(sepsis.2, type = "response")
d <- data.frame(sepsis, sepsis.pred)
myrows <- c(4, 1, 2, 11, 32)
slice(d, myrows)
## death shock malnut alcohol age bowelinf sepsis.pred
## 1 0 0 0 0 26 0 0.001415347
## 2 0 0 0 0 56 0 0.020552383
## 3 0 0 0 0 80 0 0.153416834
## 4 1 0 0 1 66 1 0.931290137
## 5 1 0 0 1 49 0 0.213000997
Lecture notes STAD29: Statistics for the Life and Social Sciences 144 / 802
Logistic regression (ordinal/nominal response)
Comments
Survival chances pretty good if no risk factors, though decreasing with
age.
Having more than one risk factor reduces survival chances dramatically.
Usually good job of predicting survival; sometimes death predicted to
survive.
Lecture notes STAD29: Statistics for the Life and Social Sciences 145 / 802
Logistic regression (ordinal/nominal response)
Assessing proportionality of odds for age
An assumption we made is that log-odds of survival depends linearly on
age.
Hard to get your head around, but basic idea is that survival chances
go continuously up (or down) with age, instead of (for example) going
up and then down.
In this case, seems reasonable, but should check:
Lecture notes STAD29: Statistics for the Life and Social Sciences 146 / 802
Logistic regression (ordinal/nominal response)
Residuals vs. age
ggplot(augment(sepsis.2), aes(x = age, y = .resid)) +
geom_point()
l
l
l
l
l
l
l
l
l
l
l
l
l
ll l ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
−1
0
1
2
3
25 50 75
age
.
re
si
d
Figure 20: plot of chunk virtusentella
Lecture notes STAD29: Statistics for the Life and Social Sciences 147 / 802
Logistic regression (ordinal/nominal response)
Comments
No apparent problems overall.
Confusing “line” across: no risk factors, survived.
Lecture notes STAD29: Statistics for the Life and Social Sciences 148 / 802
Logistic regression (ordinal/nominal response)
Probability and odds
For probability , odds is /(1 − ):
Prob. Odds log-odds in words
0.5 0.5/0.5 = 1/1 = 1.00 0.00 “even money”
0.1 0.1/0.9 = 1/9 = 0.11 −2.20 “9 to 1”
0.4 0.4/0.6 = 1/1.5 = 0.67 −0.41 “1.5 to 1”
0.8 0.8/0.2 = 4/1 = 4.00 1.39 “4 to 1 on”
Gamblers use odds: if you win at 9 to 1 odds, get original stake back
plus 9 times the stake.
Probability has to be between 0 and 1
Odds between 0 and infinity
Log-odds can be anything: any log-odds corresponds to valid
probability.
Lecture notes STAD29: Statistics for the Life and Social Sciences 149 / 802
Logistic regression (ordinal/nominal response)
Odds ratio
Suppose 90 of 100 men drank wine last week, but only 20 of 100
women.
Prob of man drinking wine 90/100 = 0.9, woman 20/100 = 0.2.
Odds of man drinking wine 0.9/0.1 = 9, woman 0.2/0.8 = 0.25.
Ratio of odds is 9/0.25 = 36.
Way of quantifying difference between men and women: “odds of
drinking wine 36 times larger for males than females”.
Lecture notes STAD29: Statistics for the Life and Social Sciences 150 / 802
Logistic regression (ordinal/nominal response)
Sepsis data again
Recall prediction of probability of death from risk factors:
sepsis.2.tidy <- tidy(sepsis.2)
sepsis.2.tidy
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
##
## 1 (Intercept) -8.89 2.32 -3.84 0.000124
## 2 shock 3.70 1.10 3.35 0.000797
## 3 alcohol 3.19 0.917 3.47 0.000514
## 4 age 0.0898 0.0292 3.07 0.00211
## 5 bowelinf 2.39 1.07 2.23 0.0260
Slopes in column estimate.
Lecture notes STAD29: Statistics for the Life and Social Sciences 151 / 802
Logistic regression (ordinal/nominal response)
Multiplying the odds
Can interpret slopes by taking “exp” of them. We ignore intercept.
sepsis.2.tidy %>%
mutate(exp_coeff=exp(estimate)) %>%
select(term, exp_coeff)
## # A tibble: 5 x 2
## term exp_coeff
##
## 1 (Intercept) 0.000137
## 2 shock 40.5
## 3 alcohol 24.2
## 4 age 1.09
## 5 bowelinf 10.9
Lecture notes STAD29: Statistics for the Life and Social Sciences 152 / 802
Logistic regression (ordinal/nominal response)
Interpretation
## # A tibble: 5 x 2
## term exp_coeff
##
## 1 (Intercept) 0.000137
## 2 shock 40.5
## 3 alcohol 24.2
## 4 age 1.09
## 5 bowelinf 10.9
These say “how much do you multiply odds of death by for increase of
1 in corresponding risk factor?” Or, what is odds ratio for that factor
being 1 (present) vs. 0 (absent)?
Eg. being alcoholic vs. not increases odds of death by 24 times
One year older multiplies odds by about 1.1 times. Over 40 years,
about 1.0940 = 31 times.
Lecture notes STAD29: Statistics for the Life and Social Sciences 153 / 802
Logistic regression (ordinal/nominal response)
Odds ratio and relative risk
Relative risk is ratio of probabilities.
Above: 90 of 100 men (0.9) drank wine, 20 of 100 women (0.2).
Relative risk 0.9/0.2=4.5. (odds ratio was 36).
When probabilities small, relative risk and odds ratio similar.
Eg. prob of man having disease 0.02, woman 0.01.
Relative risk 0.02/0.01 = 2.
Lecture notes STAD29: Statistics for the Life and Social Sciences 154 / 802
Logistic regression (ordinal/nominal response)
Odds ratio vs. relative risk
Odds for men and for women:
(od1 <- 0.02 / 0.98) # men
## [1] 0.02040816
(od2 <- 0.01 / 0.99) # women
## [1] 0.01010101
Odds ratio
od1 / od2
## [1] 2.020408
Very close to relative risk of 2.
Lecture notes STAD29: Statistics for the Life and Social Sciences 155 / 802
Logistic regression (ordinal/nominal response)
More than 2 response categories
With 2 response categories, model the probability of one, and prob of
other is one minus that. So doesn’t matter which category you model.
With more than 2 categories, have to think more carefully about the
categories: are they
ordered : you can put them in a natural order (like low, medium, high)
nominal : ordering the categories doesn’t make sense (like red, green,
blue).
R handles both kinds of response; learn how.
Lecture notes STAD29: Statistics for the Life and Social Sciences 156 / 802
Logistic regression (ordinal/nominal response)
Ordinal response: the miners
Model probability of being in given category or lower.
Example: coal-miners often suffer disease pneumoconiosis. Likelihood
of disease believed to be greater among miners who have worked
longer.
Severity of disease measured on categorical scale: none, moderate, 3
severe.
Lecture notes STAD29: Statistics for the Life and Social Sciences 157 / 802
Logistic regression (ordinal/nominal response)
Miners data
Data are frequencies:
Exposure None Moderate Severe
5.8 98 0 0
15.0 51 2 1
21.5 34 6 3
27.5 35 5 8
33.5 32 10 9
39.5 23 7 8
46.0 12 6 10
51.5 4 2 5
Lecture notes STAD29: Statistics for the Life and Social Sciences 158 / 802
Logistic regression (ordinal/nominal response)
Reading the data
Data in aligned columns with more than one space between, so:
my_url <- "http://www.utsc.utoronto.ca/~butler/d29/miners-tab.txt"
freqs <- read_table(my_url)
## Parsed with column specification:
## cols(
## Exposure = col_double(),
## None = col_double(),
## Moderate = col_double(),
## Severe = col_double()
## )
Lecture notes STAD29: Statistics for the Life and Social Sciences 159 / 802
Logistic regression (ordinal/nominal response)
The data
freqs
## # A tibble: 8 x 4
## Exposure None Moderate Severe
##
## 1 5.8 98 0 0
## 2 15 51 2 1
## 3 21.5 34 6 3
## 4 27.5 35 5 8
## 5 33.5 32 10 9
## 6 39.5 23 7 8
## 7 46 12 6 10
## 8 51.5 4 2 5
Lecture notes STAD29: Statistics for the Life and Social Sciences 160 / 802
Logistic regression (ordinal/nominal response)
Tidying and row proportions
freqs %>%
gather(Severity, Freq, None:Severe) %>%
group_by(Exposure) %>%
mutate(proportion = Freq / sum(Freq)) -> miners
Lecture notes STAD29: Statistics for the Life and Social Sciences 161 / 802
Logistic regression (ordinal/nominal response)
Result
miners
## # A tibble: 24 x 4
## # Groups: Exposure [8]
## Exposure Severity Freq proportion
##
## 1 5.8 None 98 1
## 2 15 None 51 0.944
## 3 21.5 None 34 0.791
## 4 27.5 None 35 0.729
## 5 33.5 None 32 0.627
## 6 39.5 None 23 0.605
## 7 46 None 12 0.429
## 8 51.5 None 4 0.364
## 9 5.8 Moderate 0 0
## 10 15 Moderate 2 0.0370
## # … with 14 more rows
Lecture notes STAD29: Statistics for the Life and Social Sciences 162 / 802
Logistic regression (ordinal/nominal response)
Plot proportions against exposure
ggplot(miners, aes(x = Exposure, y = proportion,
colour = Severity)) +
geom_point() + geom_smooth(se = F)
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l l
l
l
l
0.00
0.25
0.50
0.75
1.00
10 20 30 40 50
Exposure
pr
op
or
tio
n Severity
l
l
l
Moderate
None
Severe
Lecture notes STAD29: Statistics for the Life and Social Sciences 163 / 802
Logistic regression (ordinal/nominal response)
Reminder of data setup
miners
## # A tibble: 24 x 4
## # Groups: Exposure [8]
## Exposure Severity Freq proportion
##
## 1 5.8 None 98 1
## 2 15 None 51 0.944
## 3 21.5 None 34 0.791
## 4 27.5 None 35 0.729
## 5 33.5 None 32 0.627
## 6 39.5 None 23 0.605
## 7 46 None 12 0.429
## 8 51.5 None 4 0.364
## 9 5.8 Moderate 0 0
## 10 15 Moderate 2 0.0370
## # … with 14 more rows
Lecture notes STAD29: Statistics for the Life and Social Sciences 164 / 802
Logistic regression (ordinal/nominal response)
Creating an ordered factor
Problem: on plot, Severity categories in wrong order.
In the data frame, categories in correct order.
Package forcats (in tidyverse) has functions for creating factors to
specifications.
fct_inorder takes levels in order they appear in data:
miners %>%
mutate(sev_ord = fct_inorder(Severity)) -> miners
To check:
levels(miners$sev_ord)
## [1] "None" "Moderate" "Severe"
Lecture notes STAD29: Statistics for the Life and Social Sciences 165 / 802
Logistic regression (ordinal/nominal response)
New data frame
miners
## # A tibble: 24 x 5
## # Groups: Exposure [8]
## Exposure Severity Freq proportion sev_ord
##
## 1 5.8 None 98 1 None
## 2 15 None 51 0.944 None
## 3 21.5 None 34 0.791 None
## 4 27.5 None 35 0.729 None
## 5 33.5 None 32 0.627 None
## 6 39.5 None 23 0.605 None
## 7 46 None 12 0.429 None
## 8 51.5 None 4 0.364 None
## 9 5.8 Moderate 0 0 Moderate
## 10 15 Moderate 2 0.0370 Moderate
## # … with 14 more rows
Lecture notes STAD29: Statistics for the Life and Social Sciences 166 / 802
Logistic regression (ordinal/nominal response)
Improved plot
ggplot(miners, aes(x = Exposure, y = proportion,
colour = sev_ord)) +
geom_point() + geom_smooth(se = F)
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l l
l
l
l
0.00
0.25
0.50
0.75
1.00
10 20 30 40 50
Exposure
pr
op
or
tio
n sev_ord
l
l
l
None
Moderate
Severe
Figure 21: plot of chunk unnamed-chunk-114
Lecture notes STAD29: Statistics for the Life and Social Sciences 167 / 802
Logistic regression (ordinal/nominal response)
Fitting ordered logistic model
Use function polr from package MASS. Like glm.
sev.1 <- polr(sev_ord ~ Exposure,
weights = Freq,
data = miners
)
Lecture notes STAD29: Statistics for the Life and Social Sciences 168 / 802
Logistic regression (ordinal/nominal response)
Output: not very illuminating
summary(sev.1)
##
## Re-fitting to get Hessian
## Call:
## polr(formula = sev_ord ~ Exposure, data = miners, weights = Freq)
##
## Coefficients:
## Value Std. Error t value
## Exposure 0.0959 0.01194 8.034
##
## Intercepts:
## Value Std. Error t value
## None|Moderate 3.9558 0.4097 9.6558
## Moderate|Severe 4.8690 0.4411 11.0383
##
## Residual Deviance: 416.9188
## AIC: 422.9188
Lecture notes STAD29: Statistics for the Life and Social Sciences 169 / 802
Logistic regression (ordinal/nominal response)
Does exposure have an effect?
Fit model without Exposure, and compare using anova. Note 1 for model
with just intercept:
sev.0 <- polr(sev_ord ~ 1, weights = Freq, data = miners)
anova(sev.0, sev.1)
## Likelihood ratio tests of ordinal regression models
##
## Response: sev_ord
## Model Resid. df Resid. Dev Test
## 1 1 369 505.1621
## 2 Exposure 368 416.9188 1 vs 2
## Df LR stat. Pr(Chi)
## 1
## 2 1 88.24324 0
Exposure definitely has effect on severity of disease.
Lecture notes STAD29: Statistics for the Life and Social Sciences 170 / 802
Logistic regression (ordinal/nominal response)
Another way
What (if anything) can we drop from model with exposure?
drop1(sev.1, test = "Chisq")
## Single term deletions
##
## Model:
## sev_ord ~ Exposure
## Df AIC LRT Pr(>Chi)
## 422.92
## Exposure 1 509.16 88.243 < 2.2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05
## '.' 0.1 ' ' 1
Nothing. Exposure definitely has effect.
Lecture notes STAD29: Statistics for the Life and Social Sciences 171 / 802
Logistic regression (ordinal/nominal response)
Predicted probabilities
Make new data frame out of all the exposure values (from original data
frame), and predict from that:
sev.new <- tibble(Exposure = freqs$Exposure)
pr <- predict(sev.1, sev.new, type = "p")
miners.pred <- cbind(sev.new, pr)
miners.pred
## Exposure None Moderate Severe
## 1 5.8 0.9676920 0.01908912 0.01321885
## 2 15.0 0.9253445 0.04329931 0.03135614
## 3 21.5 0.8692003 0.07385858 0.05694115
## 4 27.5 0.7889290 0.11413004 0.09694093
## 5 33.5 0.6776641 0.16207145 0.16026444
## 6 39.5 0.5418105 0.20484198 0.25334756
## 7 46.0 0.3879962 0.22441555 0.38758828
## 8 51.5 0.2722543 0.21025011 0.51749563
Lecture notes STAD29: Statistics for the Life and Social Sciences 172 / 802
Logistic regression (ordinal/nominal response)
Comments
Model appears to match data: as exposure goes up, prob of None goes
down, Severe goes up (sharply for high exposure).
Like original data frame, this one nice to look at but not tidy. We
want to make graph, so tidy it.
Also want the severity values in right order.
Usual gather, plus a bit:
miners.pred %>%
gather(Severity, probability, -Exposure) %>%
mutate(sev_ord = fct_inorder(Severity)) -> preds
Lecture notes STAD29: Statistics for the Life and Social Sciences 173 / 802
Logistic regression (ordinal/nominal response)
Some of the gathered predictions
preds %>% slice(1:15)
## Exposure Severity probability sev_ord
## 1 5.8 None 0.96769203 None
## 2 15.0 None 0.92534455 None
## 3 21.5 None 0.86920028 None
## 4 27.5 None 0.78892903 None
## 5 33.5 None 0.67766411 None
## 6 39.5 None 0.54181046 None
## 7 46.0 None 0.38799618 None
## 8 51.5 None 0.27225426 None
## 9 5.8 Moderate 0.01908912 Moderate
## 10 15.0 Moderate 0.04329931 Moderate
## 11 21.5 Moderate 0.07385858 Moderate
## 12 27.5 Moderate 0.11413004 Moderate
## 13 33.5 Moderate 0.16207145 Moderate
## 14 39.5 Moderate 0.20484198 Moderate
## 15 46.0 Moderate 0.22441555 Moderate
Lecture notes STAD29: Statistics for the Life and Social Sciences 174 / 802
Logistic regression (ordinal/nominal response)
Plotting predicted and observed proportions
Plot:
predicted probabilities, lines (shown) joining points (not shown)
data, just the points.
Unfamiliar process: data from two different data frames:
g <- ggplot(preds, aes(
x = Exposure, y = probability,
colour = sev_ord
)) + geom_line() +
geom_point(data = miners, aes(y = proportion))
Idea: final geom_point uses data in miners rather than preds,
-variable for plot is proportion from that data frame, but
-coordinate is Exposure, as it was before, and colour is Severity
as before. The final geom_point “inherits” from the first aes as
needed.
Lecture notes STAD29: Statistics for the Life and Social Sciences 175 / 802
Logistic regression (ordinal/nominal response)
The plot: data match model
g
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l l
l
l
l
0.00
0.25
0.50
0.75
1.00
10 20 30 40 50
Exposure
pr
ob
ab
ilit
y sev_ord
l
l
l
None
Moderate
Severe
Figure 22: plot of chunk unnamed-chunk-125
Lecture notes STAD29: Statistics for the Life and Social Sciences 176 / 802
Logistic regression (ordinal/nominal response)
Unordered responses
With unordered (nominal) responses, can use generalized logit.
Example: 735 people, record age and sex (male 0, female 1), which of
3 brands of some product preferred.
Data in mlogit.csv separated by commas (so read_csv will work):
my_url <- "http://www.utsc.utoronto.ca/~butler/d29/mlogit.csv"
brandpref <- read_csv(my_url)
## Parsed with column specification:
## cols(
## brand = col_double(),
## sex = col_double(),
## age = col_double()
## )
Lecture notes STAD29: Statistics for the Life and Social Sciences 177 / 802
Logistic regression (ordinal/nominal response)
The data
brandpref
## # A tibble: 735 x 3
## brand sex age
##
## 1 1 0 24
## 2 1 0 26
## 3 1 0 26
## 4 1 1 27
## 5 1 1 27
## 6 3 1 27
## 7 1 0 27
## 8 1 0 27
## 9 1 1 27
## 10 1 0 27
## # … with 725 more rows
Lecture notes STAD29: Statistics for the Life and Social Sciences 178 / 802
Logistic regression (ordinal/nominal response)
Bashing into shape, and fitting model
sex and brand not meaningful as numbers, so turn into factors:
brandpref <- brandpref %>%
mutate(sex = factor(sex)) %>%
mutate(brand = factor(brand))
We use multinom from package nnet. Works like polr.
brands.1 <- multinom(brand ~ age + sex, data = brandpref)
## # weights: 12 (6 variable)
## initial value 807.480032
## iter 10 value 702.976983
## final value 702.970704
## converged
Lecture notes STAD29: Statistics for the Life and Social Sciences 179 / 802
Logistic regression (ordinal/nominal response)
Can we drop anything?
Unfortunately drop1 seems not to work:
drop1(brands.1, test = "Chisq", trace = 0)
## trying - age
## Error in if (trace) {: argument is not interpretable as logical
so fall back on fitting model without what you want to test, and
comparing using anova.
Lecture notes STAD29: Statistics for the Life and Social Sciences 180 / 802
Logistic regression (ordinal/nominal response)
Do age/sex help predict brand? 1/2
Fit models without each of age and sex:
brands.2 <- multinom(brand ~ age, data = brandpref)
## # weights: 9 (4 variable)
## initial value 807.480032
## iter 10 value 706.796323
## iter 10 value 706.796322
## final value 706.796322
## converged
brands.3 <- multinom(brand ~ sex, data = brandpref)
## # weights: 9 (4 variable)
## initial value 807.480032
## final value 791.861266
## converged
Lecture notes STAD29: Statistics for the Life and Social Sciences 181 / 802
Logistic regression (ordinal/nominal response)
Do age/sex help predict brand? 2/2
anova(brands.2, brands.1)
## Likelihood ratio tests of Multinomial Models
##
## Response: brand
## Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
## 1 age 1466 1413.593
## 2 age + sex 1464 1405.941 1 vs 2 2 7.651236 0.02180495
anova(brands.3, brands.1)
## Likelihood ratio tests of Multinomial Models
##
## Response: brand
## Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
## 1 sex 1466 1583.723
## 2 age + sex 1464 1405.941 1 vs 2 2 177.7811 0
Lecture notes STAD29: Statistics for the Life and Social Sciences 182 / 802
Logistic regression (ordinal/nominal response)
Do age/sex help predict brand? 3/3
age definitely significant (second anova)
sex seems significant also (first anova)
Keep both.
Lecture notes STAD29: Statistics for the Life and Social Sciences 183 / 802
Logistic regression (ordinal/nominal response)
Another way to build model
Start from model with everything and run step:
step(brands.1, trace = 0)
## trying - age
## trying - sex
## Call:
## multinom(formula = brand ~ age + sex, data = brandpref)
##
## Coefficients:
## (Intercept) age sex1
## 2 -11.77469 0.3682075 0.5238197
## 3 -22.72141 0.6859087 0.4659488
##
## Residual Deviance: 1405.941
## AIC: 1417.941
Final model contains both age and sex so neither could be removed.
Lecture notes STAD29: Statistics for the Life and Social Sciences 184 / 802
Logistic regression (ordinal/nominal response)
Predictions: all possible combinations
Create data frame with various age and sex:
ages <- c(24, 28, 32, 35, 38)
sexes <- factor(0:1)
new <- crossing(age = ages, sex = sexes)
new
## # A tibble: 10 x 2
## age sex
##
## 1 24 0
## 2 24 1
## 3 28 0
## 4 28 1
## 5 32 0
## 6 32 1
## 7 35 0
## 8 35 1
## 9 38 0
## 10 38 1
Lecture notes STAD29: Statistics for the Life and Social Sciences 185 / 802
Logistic regression (ordinal/nominal response)
Making predictions
p <- predict(brands.1, new, type = "probs")
probs <- cbind(new, p)
or
p %>% as_tibble() %>%
bind_cols(new) -> probs
Lecture notes STAD29: Statistics for the Life and Social Sciences 186 / 802
Logistic regression (ordinal/nominal response)
The predictions
probs
## # A tibble: 10 x 5
## `1` `2` `3` age sex
##
## 1 0.948 0.0502 0.00181 24 0
## 2 0.915 0.0819 0.00279 24 1
## 3 0.793 0.183 0.0236 28 0
## 4 0.696 0.271 0.0329 28 1
## 5 0.405 0.408 0.187 32 0
## 6 0.291 0.495 0.214 32 1
## 7 0.131 0.397 0.472 35 0
## 8 0.0840 0.432 0.484 35 1
## 9 0.0260 0.239 0.735 38 0
## 10 0.0162 0.252 0.732 38 1
Young males (sex=0) prefer brand 1, but older males prefer brand 3.
Females similar, but like brand 1 less and brand 2 more.
Lecture notes STAD29: Statistics for the Life and Social Sciences 187 / 802
Logistic regression (ordinal/nominal response)
Making a plot
Plot fitted probability against age, distinguishing brand by colour and
gender by plotting symbol.
Also join points by lines, and distinguish lines by gender.
I thought about facetting, but this seems to come out clearer.
First need tidy data frame, by familiar process:
probs %>%
gather(brand, probability, -(age:sex)) -> probs.long
Lecture notes STAD29: Statistics for the Life and Social Sciences 188 / 802
Logistic regression (ordinal/nominal response)
The tidy data (random sample of rows)
probs.long %>% sample_n(10)
## # A tibble: 10 x 4
## age sex brand probability
##
## 1 38 0 2 0.239
## 2 28 1 1 0.696
## 3 38 1 2 0.252
## 4 32 0 3 0.187
## 5 32 0 1 0.405
## 6 24 1 2 0.0819
## 7 35 0 2 0.397
## 8 38 1 3 0.732
## 9 24 1 1 0.915
## 10 35 1 3 0.484
Lecture notes STAD29: Statistics for the Life and Social Sciences 189 / 802
Logistic regression (ordinal/nominal response)
The plot
ggplot(probs.long, aes(
x = age, y = probability,
colour = brand, shape = sex
)) +
geom_point() + geom_line(aes(linetype = sex))
l
l
l
l
ll
l
l
l
l l
l
l
l
0.00
0.25
0.50
0.75
24 28 32 36
age
pr
ob
ab
ilit
y
brand
l
l
l
1
2
3
sex
l 0
1
Figure 23: plot of chunk unnamed-chunk-140
Lecture notes STAD29: Statistics for the Life and Social Sciences 190 / 802
Logistic regression (ordinal/nominal response)
Digesting the plot
Brand vs. age: younger people (of both genders) prefer brand 1, but
older people (of both genders) prefer brand 3. (Explains significant age
effect.)
Brand vs. sex: females (dashed) like brand 1 less than males (solid),
like brand 2 more (for all ages).
Not much brand difference between genders (solid and dashed lines of
same colours close), but enough to be significant.
Model didn’t include interaction, so modelled effect of gender on brand
same for each age, modelled effect of age same for each gender.
Lecture notes STAD29: Statistics for the Life and Social Sciences 191 / 802
Logistic regression (ordinal/nominal response)
Alternative data format
Summarize all people of same brand preference, same sex, same age on one
line of data file with frequency on end:
1 0 24 1
1 0 26 2
1 0 27 4
1 0 28 4
1 0 29 7
1 0 30 3
...
Whole data set in 65 lines not 735! But how?
Lecture notes STAD29: Statistics for the Life and Social Sciences 192 / 802
Logistic regression (ordinal/nominal response)
Getting alternative data format
brandpref %>%
group_by(age, sex, brand) %>%
summarize(Freq = n()) %>%
ungroup() -> b
b %>% slice(1:6)
## # A tibble: 6 x 4
## age sex brand Freq
##
## 1 24 0 1 1
## 2 26 0 1 2
## 3 27 0 1 4
## 4 27 1 1 4
## 5 27 1 3 1
## 6 28 0 1 4
Lecture notes STAD29: Statistics for the Life and Social Sciences 193 / 802
Logistic regression (ordinal/nominal response)
Fitting models, almost the same
Just have to remember weights to incorporate frequencies.
Otherwise multinom assumes you have just 1 obs on each line!
Again turn (numerical) sex and brand into factors:
b %>%
mutate(sex = factor(sex)) %>%
mutate(brand = factor(brand)) -> bf
b.1 <- multinom(brand ~ age + sex, data = bf, weights = Freq)
b.2 <- multinom(brand ~ age, data = bf, weights = Freq)
Lecture notes STAD29: Statistics for the Life and Social Sciences 194 / 802
Logistic regression (ordinal/nominal response)
P-value for sex identical
anova(b.2, b.1)
## Likelihood ratio tests of Multinomial Models
##
## Response: brand
## Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
## 1 age 126 1413.593
## 2 age + sex 124 1405.941 1 vs 2 2 7.651236 0.02180495
Same P-value as before, so we haven’t changed anything important.
Lecture notes STAD29: Statistics for the Life and Social Sciences 195 / 802
Logistic regression (ordinal/nominal response)
Including data on plot
Everyone’s age given as whole number, so maybe not too many
different ages with sensible amount of data at each:
b %>%
group_by(age) %>%
summarize(total = sum(Freq))
## # A tibble: 14 x 2
## age total
##
## 1 24 1
## 2 26 2
## 3 27 9
## 4 28 15
## 5 29 19
## 6 30 23
## 7 31 40
## 8 32 333
## 9 33 55
## 10 34 64
## 11 35 35
## 12 36 85
## 13 37 22
## 14 38 32Lecture notes STAD29: Statistics for the Life and Social Sciences 196 / 802
Logistic regression (ordinal/nominal response)
Comments and next
Not great (especially at low end), but live with it.
Need proportions of frequencies in each brand for each age-gender
combination. Mimic what we did for miners:
b %>%
group_by(age, sex) %>%
mutate(proportion = Freq / sum(Freq)) -> brands
Lecture notes STAD29: Statistics for the Life and Social Sciences 197 / 802
Logistic regression (ordinal/nominal response)
Checking proportions for age 32
brands %>% filter(age == 32)
## # A tibble: 6 x 5
## # Groups: age, sex [2]
## age sex brand Freq proportion
##
## 1 32 0 1 48 0.407
## 2 32 0 2 51 0.432
## 3 32 0 3 19 0.161
## 4 32 1 1 62 0.288
## 5 32 1 2 117 0.544
## 6 32 1 3 36 0.167
First three proportions (males) add up to 1.
Last three proportions (females) add up to 1.
So looks like proportions of right thing.
Lecture notes STAD29: Statistics for the Life and Social Sciences 198 / 802
Logistic regression (ordinal/nominal response)
Attempting plot
Take code from previous plot and:
remove geom_point for fitted values
add geom_point with correct data= and aes to plot data.
g <- ggplot(probs.long, aes(
x = age, y = probability,
colour = brand, shape = sex
)) +
geom_line(aes(linetype = sex)) +
geom_point(data = brands, aes(y = proportion))
Data seem to correspond more or less to fitted curves:
Lecture notes STAD29: Statistics for the Life and Social Sciences 199 / 802
Logistic regression (ordinal/nominal response)
The plot
g
l l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
0.00
0.25
0.50
0.75
1.00
24 28 32 36
age
pr
ob
ab
ilit
y
brand
l
l
l
1
2
3
sex
l 0
1
Figure 24: plot of chunk unnamed-chunk-147
Lecture notes STAD29: Statistics for the Life and Social Sciences 200 / 802
Logistic regression (ordinal/nominal response)
But…
Some of the plotted points based on a lot of people, and some only a
few.
Idea: make the size of plotted point bigger if point based on a lot of
people (in Freq).
Hope that larger points then closer to predictions.
Code:
g <- ggplot(probs.long, aes(
x = age, y = probability,
colour = brand, shape = sex
)) +
geom_line(aes(linetype = sex)) +
geom_point(
data = brands,
aes(y = proportion, size = Freq)
)
Lecture notes STAD29: Statistics for the Life and Social Sciences 201 / 802
Logistic regression (ordinal/nominal response)
The plot
g
l l l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
0.00
0.25
0.50
0.75
1.00
24 28 32 36
age
pr
ob
ab
ilit
y
brand
l
l
l
1
2
3
sex
l 0
1
Freq
l
l
l
30
60
90
Figure 25: plot of chunk unnamed-chunk-149
Lecture notes STAD29: Statistics for the Life and Social Sciences 202 / 802
Logistic regression (ordinal/nominal response)
Trying interaction between age and gender
b.4 <- update(b.1, . ~ . + age:sex)
## # weights: 15 (8 variable)
## initial value 807.480032
## iter 10 value 704.811229
## iter 20 value 702.582802
## final value 702.582761
## converged
anova(b.1, b.4)
## Likelihood ratio tests of Multinomial Models
##
## Response: brand
## Model Resid. df Resid. Dev Test Df
## 1 age + sex 124 1405.941
## 2 age + sex + age:sex 122 1405.166 1 vs 2 2
## LR stat. Pr(Chi)
## 1
## 2 0.7758861 0.678451
No evidence that effect of age on brand preference differs for the two
genders.
Lecture notes STAD29: Statistics for the Life and Social Sciences 203 / 802
Survival analysis
Section 5
Survival analysis
Lecture notes STAD29: Statistics for the Life and Social Sciences 204 / 802
Survival analysis
Survival analysis
So far, have seen:
response variable counted or measured (regression)
response variable categorized (logistic regression)
and have predicted response from explanatory variables.
But what if response is time until event (eg. time of survival after
surgery)?
Additional complication: event might not have happened at end of
study (eg. patient still alive). But knowing that patient has “not died
yet” presumably informative. Such data called censored.
Enter survival analysis, in particular the “Cox proportional hazards
model”.
Explanatory variables in this context often called covariates.
Lecture notes STAD29: Statistics for the Life and Social Sciences 205 / 802
Survival analysis
Example: still dancing?
12 women who have just started taking dancing lessons are followed for
up to a year, to see whether they are still taking dancing lessons, or
have quit. The “event” here is “quit”.
This might depend on:
a treatment (visit to a dance competition)
woman’s age (at start of study).
Lecture notes STAD29: Statistics for the Life and Social Sciences 206 / 802
Survival analysis
Data
Months Quit Treatment Age
1 1 0 16
2 1 0 24
2 1 0 18
3 0 0 27
4 1 0 25
7 1 1 26
8 1 1 36
10 1 1 38
10 0 1 45
12 1 1 47
Lecture notes STAD29: Statistics for the Life and Social Sciences 207 / 802
Survival analysis
About the data
months and quit are kind of combined response:
Months is number of months a woman was actually observed dancing
quit is 1 if woman quit, 0 if still dancing at end of study.
Treatment is 1 if woman went to dance competition, 0 otherwise.
Fit model and see whether Age or Treatment have effect on survival.
Want to do predictions for probabilities of still dancing as they depend
on whatever is significant, and draw plot.
Lecture notes STAD29: Statistics for the Life and Social Sciences 208 / 802
Survival analysis
Packages (for this section)
Install packages survival and survminer if not done.
Load survival, survminer, broom and tidyverse:
library(tidyverse)
library(survival)
library(survminer)
library(broom)
Lecture notes STAD29: Statistics for the Life and Social Sciences 209 / 802
Survival analysis
Read data
Column-aligned:
url <- "http://www.utsc.utoronto.ca/~butler/d29/dancing.txt"
dance <- read_table(url)
## Parsed with column specification:
## cols(
## Months = col_double(),
## Quit = col_double(),
## Treatment = col_double(),
## Age = col_double()
## )
Lecture notes STAD29: Statistics for the Life and Social Sciences 210 / 802
Survival analysis
The data
dance
## # A tibble: 12 x 4
## Months Quit Treatment Age
##
## 1 1 1 0 16
## 2 2 1 0 24
## 3 2 1 0 18
## 4 3 0 0 27
## 5 4 1 0 25
## 6 5 1 0 21
## 7 11 1 0 55
## 8 7 1 1 26
## 9 8 1 1 36
## 10 10 1 1 38
## 11 10 0 1 45
## 12 12 1 1 47
Lecture notes STAD29: Statistics for the Life and Social Sciences 211 / 802
Survival analysis
Examine response and fit model
Response variable (has to be outside data frame):
mth <- with(dance, Surv(Months, Quit))
mth
## [1] 1 2 2 3+ 4 5 11 7 8 10 10+ 12
Then fit model, predicting mth from explanatories:
dance.1 <- coxph(mth ~ Treatment + Age, data = dance)
Lecture notes STAD29: Statistics for the Life and Social Sciences 212 / 802
Survival analysis
Output looks a lot like regression
summary(dance.1)
## Call:
## coxph(formula = mth ~ Treatment + Age, data = dance)
##
## n= 12, number of events= 10
##
## coef exp(coef) se(coef) z Pr(>|z|)
## Treatment -4.44915 0.01169 2.60929 -1.705 0.0882 .
## Age -0.36619 0.69337 0.15381 -2.381 0.0173 *
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## Treatment 0.01169 85.554 7.026e-05 1.9444
## Age 0.69337 1.442 5.129e-01 0.9373
##
## Concordance= 0.964 (se = 0.039 )
## Likelihood ratio test= 21.68 on 2 df, p=2e-05
## Wald test = 5.67 on 2 df, p=0.06
## Score (logrank) test = 14.75 on 2 df, p=6e-04
Lecture notes STAD29: Statistics for the Life and Social Sciences 213 / 802
Survival analysis
Conclusions
Use = 0.10 here since not much data.
Three tests at bottom like global F-test. Consensus that something
predicts survival time (whether or not dancer quit and how long it
took).
Age (definitely), Treatment (marginally) both predict survival time.
Lecture notes STAD29: Statistics for the Life and Social Sciences 214 / 802
Survival analysis
Model checking
With regression, usually plot residuals against fitted values.
Not quite same here (nonlinear model), but “martingale residuals”
should have no pattern vs. “linear predictor”.
ggcoxdiagnostics from package survminer makes plot, to which
we add smooth. If smooth trend more or less straight across, model
OK.
Martingale residuals can go very negative, so won’t always look normal.
Lecture notes STAD29: Statistics for the Life and Social Sciences 215 / 802
Survival analysis
Martingale residual plot for dance data
This looks good (with only 12 points):
ggcoxdiagnostics(dance.1) + geom_smooth(se = F)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
l
l
l
l
l
l
l l
l
ll
l
residuals
−5 0 5
−1
0
1
Linear Predictions
R
es
id
ua
ls
(ty
pe
=
ma
rtin
ga
le
)
Figure 26: plot of chunk unnamed-chunk-158
Lecture notes STAD29: Statistics for the Life and Social Sciences 216 / 802
Survival analysis
Predicted survival probs
The function we use is called surv_fit, though actually works rather
like predict.
First create a data frame of values to predict from. We’ll do all
combos of ages 20 and 40, treatment and not, using crossing to get
all the combos:
treatments <- c(0, 1)
ages <- c(20, 40)
dance.new <- crossing(Treatment = treatments, Age = ages)
dance.new
## # A tibble: 4 x 2
## Treatment Age
##
## 1 0 20
## 2 0 40
## 3 1 20
## 4 1 40
Lecture notes STAD29: Statistics for the Life and Social Sciences 217 / 802
Survival analysis
The predictions
One prediction for each time for each combo of age and treatment in
dance.new:
s <- survfit(dance.1, newdata = dance.new, data = dance)
summary(s)
## Call: survfit(formula = dance.1, newdata = dance.new, data = dance)
##
## time n.risk n.event survival1 survival2 survival3 survival4
## 1 12 1 8.76e-01 1.00e+00 9.98e-01 1.000
## 2 11 2 3.99e-01 9.99e-01 9.89e-01 1.000
## 4 8 1 1.24e-01 9.99e-01 9.76e-01 1.000
## 5 7 1 2.93e-02 9.98e-01 9.60e-01 1.000
## 7 6 1 2.96e-323 6.13e-01 1.70e-04 0.994
## 8 5 1 0.00e+00 2.99e-06 1.35e-98 0.862
## 10 4 1 0.00e+00 3.61e-20 0.00e+00 0.593
## 11 2 1 0.00e+00 0.00e+00 0.00e+00 0.000
## 12 1 1 0.00e+00 0.00e+00 0.00e+00 0.000
Lecture notes STAD29: Statistics for the Life and Social Sciences 218 / 802
Survival analysis
Conclusions from predicted probs
Older women more likely to be still dancing than younger women
(compare “profiles” for same treatment group).
Effect of treatment seems to be to increase prob of still dancing
(compare “profiles” for same age for treatment group vs. not)
Would be nice to see this on a graph. This is ggsurvplot from
package survminer:
s1 <- do.call(survfit, list(formula=dance.1,
newdata=dance.new,
data=dance))
g <- ggsurvplot(s1, conf.int = F)
Lecture notes STAD29: Statistics for the Life and Social Sciences 219 / 802
Survival analysis
“Strata” (groups)
uses “strata” thus (dance.new):
## # A tibble: 4 x 2
## Treatment Age
##
## 1 0 20
## 2 0 40
## 3 1 20
## 4 1 40
Lecture notes STAD29: Statistics for the Life and Social Sciences 220 / 802
Survival analysis
Plotting survival probabilities
g
+
+
+
+
0.00
0.25
0.50
0.75
1.00
0 3 6 9 12
Time
Su
rv
iva
l p
ro
ba
bi
lity
Strata + + + +1 2 3 4
Figure 27: plot of chunk survival-plot
Lecture notes STAD29: Statistics for the Life and Social Sciences 221 / 802
Survival analysis
Discussion
Survivor curve farther to the right is better (better chance of surviving
longer).
Best is age 40 with treatment, worst age 20 without.
Appears to be:
age effect (40 better than 20)
treatment effect (treatment better than not)
In analysis, treatment effect only marginally significant.
Lecture notes STAD29: Statistics for the Life and Social Sciences 222 / 802
Survival analysis
A more realistic example: lung cancer
When you load in an R package, get data sets to illustrate functions in
the package.
One such is lung. Data set measuring survival in patients with
advanced lung cancer.
Along with survival time, number of “performance scores” included,
measuring how well patients can perform daily activities.
Sometimes high good, but sometimes bad!
Variables below, from the data set help file (?lung).
Lecture notes STAD29: Statistics for the Life and Social Sciences 223 / 802
Survival analysis
The variables
Lecture notes STAD29: Statistics for the Life and Social Sciences 224 / 802
Survival analysis
Uh oh, missing values
lung %>% slice(1:16)
## inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1 3 306 2 74 1 1 90 100 1175 NA
## 2 3 455 2 68 1 0 90 90 1225 15
## 3 3 1010 1 56 1 0 90 90 NA 15
## 4 5 210 2 57 1 1 90 60 1150 11
## 5 1 883 2 60 1 0 100 90 NA 0
## 6 12 1022 1 74 1 1 50 80 513 0
## 7 7 310 2 68 2 2 70 60 384 10
## 8 11 361 2 71 2 2 60 80 538 1
## 9 1 218 2 53 1 1 70 80 825 16
## 10 7 166 2 61 1 2 70 70 271 34
## 11 6 170 2 57 1 1 80 80 1025 27
## 12 16 654 2 68 2 2 70 70 NA 23
## 13 11 728 2 68 2 1 90 90 NA 5
## 14 21 71 2 60 1 NA 60 70 1225 32
## 15 12 567 2 57 1 1 80 70 2600 60
## 16 1 144 2 67 1 1 80 90 NA 15
Lecture notes STAD29: Statistics for the Life and Social Sciences 225 / 802
Survival analysis
A closer look
summary(lung)
## inst time status age sex
## Min. : 1.00 Min. : 5.0 Min. :1.000 Min. :39.00 Min. :1.000
## 1st Qu.: 3.00 1st Qu.: 166.8 1st Qu.:1.000 1st Qu.:56.00 1st Qu.:1.000
## Median :11.00 Median : 255.5 Median :2.000 Median :63.00 Median :1.000
## Mean :11.09 Mean : 305.2 Mean :1.724 Mean :62.45 Mean :1.395
## 3rd Qu.:16.00 3rd Qu.: 396.5 3rd Qu.:2.000 3rd Qu.:69.00 3rd Qu.:2.000
## Max. :33.00 Max. :1022.0 Max. :2.000 Max. :82.00 Max. :2.000
## NA's :1
## ph.ecog ph.karno pat.karno meal.cal wt.loss
## Min. :0.0000 Min. : 50.00 Min. : 30.00 Min. : 96.0 Min. :-24.000
## 1st Qu.:0.0000 1st Qu.: 75.00 1st Qu.: 70.00 1st Qu.: 635.0 1st Qu.: 0.000
## Median :1.0000 Median : 80.00 Median : 80.00 Median : 975.0 Median : 7.000
## Mean :0.9515 Mean : 81.94 Mean : 79.96 Mean : 928.8 Mean : 9.832
## 3rd Qu.:1.0000 3rd Qu.: 90.00 3rd Qu.: 90.00 3rd Qu.:1150.0 3rd Qu.: 15.750
## Max. :3.0000 Max. :100.00 Max. :100.00 Max. :2600.0 Max. : 68.000
## NA's :1 NA's :1 NA's :3 NA's :47 NA's :14
Lecture notes STAD29: Statistics for the Life and Social Sciences 226 / 802
Survival analysis
Remove obs with any missing values
lung %>% drop_na() -> lung.complete
lung.complete %>%
select(meal.cal:wt.loss) %>%
slice(1:10)
## meal.cal wt.loss
## 1 1225 15
## 2 1150 11
## 3 513 0
## 4 384 10
## 5 538 1
## 6 825 16
## 7 271 34
## 8 1025 27
## 9 2600 60
## 10 1150 -5
Missing values seem to be gone.
Lecture notes STAD29: Statistics for the Life and Social Sciences 227 / 802
Survival analysis
Check!
summary(lung.complete)
## inst time status age sex
## Min. : 1.00 Min. : 5.0 Min. :1.000 Min. :39.00 Min. :1.000
## 1st Qu.: 3.00 1st Qu.: 174.5 1st Qu.:1.000 1st Qu.:57.00 1st Qu.:1.000
## Median :11.00 Median : 268.0 Median :2.000 Median :64.00 Median :1.000
## Mean :10.71 Mean : 309.9 Mean :1.719 Mean :62.57 Mean :1.383
## 3rd Qu.:15.00 3rd Qu.: 419.5 3rd Qu.:2.000 3rd Qu.:70.00 3rd Qu.:2.000
## Max. :32.00 Max. :1022.0 Max. :2.000 Max. :82.00 Max. :2.000
## ph.ecog ph.karno pat.karno meal.cal wt.loss
## Min. :0.0000 Min. : 50.00 Min. : 30.00 Min. : 96.0 Min. :-24.000
## 1st Qu.:0.0000 1st Qu.: 70.00 1st Qu.: 70.00 1st Qu.: 619.0 1st Qu.: 0.000
## Median :1.0000 Median : 80.00 Median : 80.00 Median : 975.0 Median : 7.000
## Mean :0.9581 Mean : 82.04 Mean : 79.58 Mean : 929.1 Mean : 9.719
## 3rd Qu.:1.0000 3rd Qu.: 90.00 3rd Qu.: 90.00 3rd Qu.:1162.5 3rd Qu.: 15.000
## Max. :3.0000 Max. :100.00 Max. :100.00 Max. :2600.0 Max. : 68.000
No missing values left.
Lecture notes STAD29: Statistics for the Life and Social Sciences 228 / 802
Survival analysis
Model 1: use everything except inst
names(lung.complete)
## [1] "inst" "time" "status" "age" "sex" "ph.ecog" "ph.karno"
## [8] "pat.karno" "meal.cal" "wt.loss"
Event was death, goes with status of 2:
resp <- with(lung.complete, Surv(time, status == 2))
lung.1 <- coxph(resp ~ . - inst - time - status,
data = lung.complete
)
“Dot” means “all the other variables”.
Lecture notes STAD29: Statistics for the Life and Social Sciences 229 / 802
Survival analysis
summary of model 1: too tiny to see!
summary(lung.1)
## Call:
## coxph(formula = resp ~ . - inst - time - status, data = lung.complete)
##
## n= 167, number of events= 120
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 1.080e-02 1.011e+00 1.160e-02 0.931 0.35168
## sex -5.536e-01 5.749e-01 2.016e-01 -2.746 0.00603 **
## ph.ecog 7.395e-01 2.095e+00 2.250e-01 3.287 0.00101 **
## ph.karno 2.244e-02 1.023e+00 1.123e-02 1.998 0.04575 *
## pat.karno -1.207e-02 9.880e-01 8.116e-03 -1.488 0.13685
## meal.cal 2.835e-05 1.000e+00 2.594e-04 0.109 0.91298
## wt.loss -1.420e-02 9.859e-01 7.766e-03 -1.828 0.06748 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0109 0.9893 0.9881 1.0341
## sex 0.5749 1.7395 0.3872 0.8534
## ph.ecog 2.0950 0.4773 1.3479 3.2560
## ph.karno 1.0227 0.9778 1.0004 1.0455
## pat.karno 0.9880 1.0121 0.9724 1.0038
## meal.cal 1.0000 1.0000 0.9995 1.0005
## wt.loss 0.9859 1.0143 0.9710 1.0010
##
## Concordance= 0.653 (se = 0.029 )
## Likelihood ratio test= 28.16 on 7 df, p=2e-04
## Wald test = 27.5 on 7 df, p=3e-04
## Score (logrank) test = 28.31 on 7 df, p=2e-04
Lecture notes STAD29: Statistics for the Life and Social Sciences 230 / 802
Survival analysis
Overall significance
The three tests of overall significance:
glance(lung.1) %>% select(starts_with("p.value"))
## # A tibble: 1 x 3
## p.value.log p.value.sc p.value.wald
##
## 1 0.000205 0.000193 0.000271
All strongly significant. Something predicts survival.
Lecture notes STAD29: Statistics for the Life and Social Sciences 231 / 802
Survival analysis
Coefficients for model 1
tidy(lung.1) %>% select(term, p.value) %>% arrange(p.value)
## # A tibble: 7 x 2
## term p.value
##
## 1 ph.ecog 0.00101
## 2 sex 0.00603
## 3 ph.karno 0.0457
## 4 wt.loss 0.0675
## 5 pat.karno 0.137
## 6 age 0.352
## 7 meal.cal 0.913
sex and ph.ecog definitely significant here
age, pat.karno and meal.cal definitely not
Take out definitely non-sig variables, and try again.
Lecture notes STAD29: Statistics for the Life and Social Sciences 232 / 802
Survival analysis
Model 2
lung.2 <- update(lung.1, . ~ . - age - pat.karno - meal.cal)
tidy(lung.2) %>% select(term, p.value)
## # A tibble: 4 x 2
## term p.value
##
## 1 sex 0.00409
## 2 ph.ecog 0.000112
## 3 ph.karno 0.101
## 4 wt.loss 0.108
Lecture notes STAD29: Statistics for the Life and Social Sciences 233 / 802
Survival analysis
Compare with first model:
anova(lung.2, lung.1)
## Analysis of Deviance Table
## Cox model: response is resp
## Model 1: ~ sex + ph.ecog + ph.karno + wt.loss
## Model 2: ~ (inst + time + status + age + sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss) - inst - time - status
## loglik Chisq Df P(>|Chi|)
## 1 -495.67
## 2 -494.03 3.269 3 0.352
No harm in taking out those variables.
Lecture notes STAD29: Statistics for the Life and Social Sciences 234 / 802
Survival analysis
Model 3
Take out ph.karno and wt.loss as well.
lung.3 <- update(lung.2, . ~ . - ph.karno - wt.loss)
tidy(lung.3) %>% select(term, estimate, p.value)
## # A tibble: 2 x 3
## term estimate p.value
##
## 1 sex -0.510 0.00958
## 2 ph.ecog 0.483 0.000266
Lecture notes STAD29: Statistics for the Life and Social Sciences 235 / 802
Survival analysis
Check whether that was OK
anova(lung.3, lung.2)
## Analysis of Deviance Table
## Cox model: response is resp
## Model 1: ~ sex + ph.ecog
## Model 2: ~ sex + ph.ecog + ph.karno + wt.loss
## loglik Chisq Df P(>|Chi|)
## 1 -498.38
## 2 -495.67 5.4135 2 0.06675 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Just OK.
Lecture notes STAD29: Statistics for the Life and Social Sciences 236 / 802
Survival analysis
Commentary
OK (just) to take out those two covariates.
Both remaining variables strongly significant.
Nature of effect on survival time? Consider later.
Picture?
Lecture notes STAD29: Statistics for the Life and Social Sciences 237 / 802
Survival analysis
Plotting survival probabilities
Create new data frame of values to predict for, then predict:
sexes <- c(1, 2)
ph.ecogs <- 0:3
lung.new <- crossing(sex = sexes, ph.ecog = ph.ecogs)
lung.new
## # A tibble: 8 x 2
## sex ph.ecog
##
## 1 1 0
## 2 1 1
## 3 1 2
## 4 1 3
## 5 2 0
## 6 2 1
## 7 2 2
## 8 2 3
Lecture notes STAD29: Statistics for the Life and Social Sciences 238 / 802
Survival analysis
Making the plot
s <- survfit(lung.3, data = lung.complete, newdata = lung.new)
s1 <- do.call(survfit, list(formula = lung.3,
data = lung.complete,
newdata = lung.new))
g <- ggsurvplot(s1, conf.int = F)
Lecture notes STAD29: Statistics for the Life and Social Sciences 239 / 802
Survival analysis
The plot
g
+
++++++++++++++++ +
++++
+ ++++
+ +++ + +
+
+++++++++++++++++ +
++++
+ ++++
+ +++ + +
+
+
+++++++++++++++++ +
++++
+ ++++
+ +++ + +
+
+
++++++++++++++++ +
++++ + ++++
+
+++++++++++++++ +
++++
+ ++++
+ +++ + +
0.00
0.25
0.50
0.75
1.00
0 250 500 750 1000
Time
Su
rv
iva
l p
ro
ba
bi
lity
Strata
+
+
+
+
+
+
+
+
1
2
3
4
5
6
7
8
Figure 28: plot of chunk unnamed-chunk-180
Lecture notes STAD29: Statistics for the Life and Social Sciences 240 / 802
Survival analysis
Discussion of survival curves
Best survival is teal-blue curve, stratum 5, females with ph.ecog score
0.
Next best: blue, stratum 6, females with score 1, and red, stratum 1,
males score 0.
Worst: green, stratum 4, males score 3.
For any given ph.ecog score, females have better predicted survival
than males.
For both genders, a lower score associated with better survival.
Lecture notes STAD29: Statistics for the Life and Social Sciences 241 / 802
Survival analysis
The coefficients in model 3
tidy(lung.3) %>% select(term, estimate, p.value)
## # A tibble: 2 x 3
## term estimate p.value
##
## 1 sex -0.510 0.00958
## 2 ph.ecog 0.483 0.000266
sex coeff negative, so being higher sex value (female) goes with less
hazard of dying.
ph.ecog coeff positive, so higher ph.ecog score goes with more
hazard of dying
Two coeffs about same size, so being male rather than female
corresponds to 1-point increase in ph.ecog score. Note how survival
curves come in 3 pairs plus 2 odd.
Lecture notes STAD29: Statistics for the Life and Social Sciences 242 / 802
Survival analysis
Martingale residuals for this model
No problems here:
ggcoxdiagnostics(lung.3) + geom_smooth(se = F)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
l
l l
l
l l ll
residuals
−0.5 0.0 0.5 1.0
−2.5
0.0
2.5
Linear Predictions
R
es
id
ua
ls
(ty
pe
=
ma
rtin
ga
le
)
Figure 29: plot of chunk unnamed-chunk-182
Lecture notes STAD29: Statistics for the Life and Social Sciences 243 / 802
Survival analysis
When the Cox model fails
Invent some data where survival is best at middling age, and worse at
high and low age:
age <- seq(20, 60, 5)
survtime <- c(10, 12, 11, 21, 15, 20, 8, 9, 11)
stat <- c(1, 1, 1, 1, 0, 1, 1, 1, 1)
d <- tibble(age, survtime, stat)
y <- with(d, Surv(survtime, stat))
Small survival time 15 in middle was actually censored, so would have
been longer if observed.
Lecture notes STAD29: Statistics for the Life and Social Sciences 244 / 802
Survival analysis
Fit Cox model
y.1 <- coxph(y ~ age, data = d)
summary(y.1)
## Call:
## coxph(formula = y ~ age, data = d)
##
## n= 9, number of events= 8
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.01984 1.02003 0.03446 0.576 0.565
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.02 0.9804 0.9534 1.091
##
## Concordance= 0.545 (se = 0.105 )
## Likelihood ratio test= 0.33 on 1 df, p=0.6
## Wald test = 0.33 on 1 df, p=0.6
## Score (logrank) test = 0.33 on 1 df, p=0.6
Lecture notes STAD29: Statistics for the Life and Social Sciences 245 / 802
Survival analysis
Martingale residuals
Down-and-up indicates incorrect relationship between age and survival:
ggcoxdiagnostics(y.1) + geom_smooth(se = F)
l
l
l
l
l
l
l
l
l
residuals
−0.4 −0.2 0.0 0.2 0.4
−2
−1
0
1
2
Linear Predictions
R
es
id
ua
ls
(ty
pe
=
ma
rtin
ga
le
)
Figure 30: plot of chunk unnamed-chunk-185
Lecture notes STAD29: Statistics for the Life and Social Sciences 246 / 802
Survival analysis
Attempt 2
Add squared term in age:
y.2 <- coxph(y ~ age + I(age^2), data = d)
tidy(y.2) %>% select(term, estimate, p.value)
## # A tibble: 2 x 3
## term estimate p.value
##
## 1 age -0.380 0.116
## 2 I(age^2) 0.00483 0.0977
(Marginally) helpful.
Lecture notes STAD29: Statistics for the Life and Social Sciences 247 / 802
Survival analysis
Martingale residuals this time
Not great, but less problematic than before:
ggcoxdiagnostics(y.2) + geom_smooth(se = F)
l
l
l
l
l
l
l
l
l
residuals
−0.5 0.0 0.5 1.0
−2
−1
0
1
2
Linear Predictions
R
es
id
ua
ls
(ty
pe
=
ma
rtin
ga
le
)
Figure 31: plot of chunk unnamed-chunk-187
Lecture notes STAD29: Statistics for the Life and Social Sciences 248 / 802
Analysis of variance
Section 6
Analysis of variance
Lecture notes STAD29: Statistics for the Life and Social Sciences 249 / 802
Analysis of variance
Analysis of variance
Analysis of variance used with:
counted/measured response
categorical explanatory variable(s)
that is, data divided into groups, and see if response significantly
different among groups
or, see whether knowing group membership helps to predict response.
Typically two stages:
-test to detect any differences among/due to groups
if -test significant, do multiple comparisons to see which groups
significantly different from which.
Need special multiple comparisons method because just doing (say)
two-sample -tests on each pair of groups gives too big a chance of
finding “significant” differences by accident.
Lecture notes STAD29: Statistics for the Life and Social Sciences 250 / 802
Analysis of variance
Packages
These:
library(tidyverse)
library(broom)
library(car) # for Levene's text
Lecture notes STAD29: Statistics for the Life and Social Sciences 251 / 802
Analysis of variance
Example: Pain threshold and hair colour
Do people with different hair colour have different abilities to deal with
pain?
Men and women of various ages divided into 4 groups by hair colour:
light and dark blond, light and dark brown.
Each subject given a pain sensitivity test resulting in pain threshold
score: higher score is higher pain tolerance.
19 subjects altogether.
Lecture notes STAD29: Statistics for the Life and Social Sciences 252 / 802
Analysis of variance
The data
In hairpain.txt:
hair pain
lightblond 62
lightblond 60
lightblond 71
lightblond 55
lightblond 48
darkblond 63
darkblond 57
darkblond 52
darkblond 41
darkblond 43
lightbrown 42
lightbrown 50
lightbrown 41
lightbrown 37
darkbrown 32
darkbrown 39
darkbrown 51
darkbrown 30
darkbrown 35
Lecture notes STAD29: Statistics for the Life and Social Sciences 253 / 802
Analysis of variance
Summarizing the groups
my_url <- "http://www.utsc.utoronto.ca/~butler/d29/hairpain.txt"
hairpain <- read_delim(my_url, " ")
hairpain %>%
group_by(hair) %>%
summarize(
n = n(),
xbar = mean(pain),
s = sd(pain)
)
## # A tibble: 4 x 4
## hair n xbar s
##
## 1 darkblond 5 51.2 9.28
## 2 darkbrown 5 37.4 8.32
## 3 lightblond 5 59.2 8.53
## 4 lightbrown 4 42.5 5.45
Brown-haired people seem to have lower pain tolerance.
Lecture notes STAD29: Statistics for the Life and Social Sciences 254 / 802
Analysis of variance
Boxplot
ggplot(hairpain, aes(x = hair, y = pain)) + geom_boxplot()
l
30
40
50
60
70
darkblond darkbrown lightblond lightbrown
hair
pa
in
Figure 32: plot of chunk tartuffoLecture notes STAD29: Statistics for the Life and Social Sciences 255 / 802
Analysis of variance
Assumptions
Data should be:
normally distributed within each group
same spread for each group
darkbrown group has upper outlier (suggests not normal)
darkblond group has smaller IQR than other groups.
But, groups small.
Shrug shoulders and continue for moment.
Lecture notes STAD29: Statistics for the Life and Social Sciences 256 / 802
Analysis of variance
Testing equality of SDs
via Levene’s test in package car:
leveneTest(pain ~ hair, data = hairpain)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 0.3927 0.76
## 15
No evidence (at all) of difference among group SDs.
Possibly because groups small.
Lecture notes STAD29: Statistics for the Life and Social Sciences 257 / 802
Analysis of variance
Analysis of variance
hairpain.1 <- aov(pain ~ hair, data = hairpain)
summary(hairpain.1)
## Df Sum Sq Mean Sq F value Pr(>F)
## hair 3 1361 453.6 6.791 0.00411 **
## Residuals 15 1002 66.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
P-value small: the mean pain tolerances for the four groups are not all
the same.
Which groups differ from which, and how?
Lecture notes STAD29: Statistics for the Life and Social Sciences 258 / 802
Analysis of variance
Multiple comparisons
Which groups differ from which? Multiple comparisons method. Lots.
Problem: by comparing all the groups with each other, doing many
tests, have large chance to (possibly incorrectly) reject 0 ∶ groups
have equal means.
4 groups: 6 comparisons (1 vs 2, 1 vs 3, …, 3 vs 4). 5 groups: 10
comparisons. Thus 6 (or 10) chances to make mistake.
Get “familywise error rate” of 0.05 (whatever), no matter how many
comparisons you’re doing.
My favourite: Tukey, or “honestly significant differences”: how far
apart might largest, smallest group means be (if actually no
differences). Group means more different: significantly different.
Lecture notes STAD29: Statistics for the Life and Social Sciences 259 / 802
Analysis of variance
Tukey
TukeyHSD:
TukeyHSD(hairpain.1)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = pain ~ hair, data = hairpain)
##
## $hair
## diff lwr upr p adj
## darkbrown-darkblond -13.8 -28.696741 1.0967407 0.0740679
## lightblond-darkblond 8.0 -6.896741 22.8967407 0.4355768
## lightbrown-darkblond -8.7 -24.500380 7.1003795 0.4147283
## lightblond-darkbrown 21.8 6.903259 36.6967407 0.0037079
## lightbrown-darkbrown 5.1 -10.700380 20.9003795 0.7893211
## lightbrown-lightblond -16.7 -32.500380 -0.8996205 0.0366467
Lecture notes STAD29: Statistics for the Life and Social Sciences 260 / 802
Analysis of variance
The old-fashioned way
List group means in order
Draw lines connecting groups that are not significantly different:
darkbrown lightbrown darkblond lightblond
37.4 42.5 51.2 59.2
-------------------------
---------------
lightblond significantly higher than everything except darkblond
(at = 0.05).
darkblond in middle ground: not significantly less than lightblond,
not significantly greater than darkbrown and lightbrown.
More data might resolve this.
Looks as if blond-haired people do have higher pain tolerance, but not
completely clear.
Lecture notes STAD29: Statistics for the Life and Social Sciences 261 / 802
Analysis of variance
Some other multiple-comparison methods
Work any time you do tests at once (not just ANOVA).
Bonferroni: multiply all P-values by .
Holm: multiply smallest P-value by , next-smallest by − 1, etc.
False discovery rate: multiply smallest P-value by /1, 2nd-smallest
by /2, …, -th smallest by /.
Stop after non-rejection.
Lecture notes STAD29: Statistics for the Life and Social Sciences 262 / 802
Analysis of variance
Example
P-values 0.005, 0.015, 0.03, 0.06 (4 tests all done at once) Use
= 0.05.
Bonferroni:
Multiply all P-values by 4 (4 tests).
Reject only 1st null.
Holm:
Times smallest P-value by 4: 0.005 ∗ 4 = 0.020 < 0.05, reject.
Times next smallest by 3: 0.015 ∗ 3 = 0.045 < 0.05, reject.
Times next smallest by 2: 0.03 ∗ 2 = 0.06 > 0.05, do not reject. Stop.
Lecture notes STAD29: Statistics for the Life and Social Sciences 263 / 802
Analysis of variance
…Continued
With P-values 0.005, 0.015, 0.03, 0.06:
False discovery rate:
Times smallest P-value by 4: 0.005 ∗ 4 = 0.02 < 0.05: reject.
Times second smallest by 4/2: 0.015 ∗ 4/2 = 0.03 < 0.05, reject.
Times third smallest by 4/3: 0.03 ∗ 4/3 = 0.04 < 0.05, reject.
Times fourth smallest by 4/4: 0.06 ∗ 4/4 = 0.06 > 0.05, do not reject.
Stop.
Lecture notes STAD29: Statistics for the Life and Social Sciences 264 / 802
Analysis of variance
pairwise.t.test
attach(hairpain)
pairwise.t.test(pain, hair, p.adj = "none")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: pain and hair
##
## darkblond darkbrown lightblond
## darkbrown 0.01748 - -
## lightblond 0.14251 0.00075 -
## lightbrown 0.13337 0.36695 0.00817
##
## P value adjustment method: none
pairwise.t.test(pain, hair, p.adj = "holm")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: pain and hair
##
## darkblond darkbrown lightblond
## darkbrown 0.0699 - -
## lightblond 0.4001 0.0045 -
## lightbrown 0.4001 0.4001 0.0408
##
## P value adjustment method: holm
Lecture notes STAD29: Statistics for the Life and Social Sciences 265 / 802
Analysis of variance
pairwise.t.test part 2
pairwise.t.test(pain, hair, p.adj = "fdr")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: pain and hair
##
## darkblond darkbrown lightblond
## darkbrown 0.0350 - -
## lightblond 0.1710 0.0045 -
## lightbrown 0.1710 0.3670 0.0245
##
## P value adjustment method: fdr
pairwise.t.test(pain, hair, p.adj = "bon")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: pain and hair
##
## darkblond darkbrown lightblond
## darkbrown 0.1049 - -
## lightblond 0.8550 0.0045 -
## lightbrown 0.8002 1.0000 0.0490
##
## P value adjustment method: bonferroni
Lecture notes STAD29: Statistics for the Life and Social Sciences 266 / 802
Analysis of variance
Comments
P-values all adjusted upwards from “none”.
Required because 6 tests at once.
Highest P-values for Bonferroni: most “conservative”.
Prefer Tukey or FDR or Holm.
Tukey only applies to ANOVA, not to other cases of multiple testing.
Lecture notes STAD29: Statistics for the Life and Social Sciences 267 / 802
Analysis of variance
Rats and vitamin B
What is the effect of dietary vitamin B on the kidney?
A number of rats were randomized to receive either a B-supplemented
diet or a regular diet.
Desired to control for initial size of rats, so classified into size classes
lean and obese.
After 20 weeks, rats’ kidneys weighed.
Variables:
Response: kidneyweight (grams).
Explanatory: diet, ratsize.
Read in data:
my_url <- "http://www.utsc.utoronto.ca/~butler/d29/vitaminb.txt"
vitaminb <- read_delim(my_url, " ")
## Parsed with column specification:
## cols(
## ratsize = col_character(),
## diet = col_character(),
## kidneyweight = col_double()
## )
Lecture notes STAD29: Statistics for the Life and Social Sciences 268 / 802
Analysis of variance
The data
vitaminb
## # A tibble: 28 x 3
## ratsize diet kidneyweight
##
## 1 lean regular 1.62
## 2 lean regular 1.8
## 3 lean regular 1.71
## 4 lean regular 1.81
## 5 lean regular 1.47
## 6 lean regular 1.37
## 7 lean regular 1.71
## 8 lean vitaminb 1.51
## 9 lean vitaminb 1.65
## 10 lean vitaminb 1.45
## # … with 18 more rows
Lecture notes STAD29: Statistics for the Life and Social Sciences 269 / 802
Analysis of variance
Grouped boxplot
ggplot(vitaminb, aes(
x = ratsize, y = kidneyweight,
fill = diet
)) + geom_boxplot()
l
1.5
2.0
2.5
3.0
lean obese
ratsize
ki
dn
ey
we
ig
ht diet
regular
vitaminb
Figure 33: plot of chunk unnamed-chunk-197
Lecture notes STAD29: Statistics for the Life and Social Sciences 270 / 802
Analysis of variance
What’s going on?
Calculate group means:
summary <- vitaminb %>%
group_by(ratsize, diet) %>%
summarize(mean = mean(kidneyweight))
summary
## # A tibble: 4 x 3
## # Groups: ratsize [2]
## ratsize diet mean
##
## 1 lean regular 1.64
## 2 lean vitaminb 1.53
## 3 obese regular 2.64
## 4 obese vitaminb 2.67
Rat size: a large and consistent effect.
Diet: small/no effect (compare same rat size, different diet).
Effect of rat size same for each diet: no interaction.
Lecture notes STAD29: Statistics for the Life and Social Sciences 271 / 802
Analysis of variance
ANOVA with interaction
vitaminb.1 <- aov(kidneyweight ~ ratsize * diet,
data = vitaminb
)
summary(vitaminb.1)
## Df Sum Sq Mean Sq F value Pr(>F)
## ratsize 1 8.068 8.068 141.179 1.53e-11 ***
## diet 1 0.012 0.012 0.218 0.645
## ratsize:diet 1 0.036 0.036 0.638 0.432
## Residuals 24 1.372 0.057
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Significance/nonsignificance as we expected.
Note no significant interaction (can be removed).
Lecture notes STAD29: Statistics for the Life and Social Sciences 272 / 802
Analysis of variance
Interaction plot
Plot mean of response variable against one of the explanatory, using
other one as groups. Start from summary:
g <- ggplot(summary, aes(
x = ratsize, y = mean,
colour = diet, group = diet
)) +
geom_point() + geom_line()
For this, have to give both group and colour.
Lecture notes STAD29: Statistics for the Life and Social Sciences 273 / 802
Analysis of variance
The interaction plot
g
l
l
l
l
1.5
1.8
2.1
2.4
2.7
lean obese
ratsize
m
e
a
n
diet
l
l
regular
vitaminb
Figure 34: plot of chunk unnamed-chunk-201
Lines basically parallel, indicating no interaction.
Lecture notes STAD29: Statistics for the Life and Social Sciences 274 / 802
Analysis of variance
Take out interaction
vitaminb.2 <- update(vitaminb.1, . ~ . - ratsize:diet)
summary(vitaminb.2)
## Df Sum Sq Mean Sq F value Pr(>F)
## ratsize 1 8.068 8.068 143.256 7.59e-12 ***
## diet 1 0.012 0.012 0.221 0.643
## Residuals 25 1.408 0.056
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
No Tukey for diet: not significant.
No Tukey for ratsize: only two sizes, and already know that obese
rats have larger kidneys than lean ones.
Bottom line: diet has no effect on kidney size once you control for size
of rat.
Lecture notes STAD29: Statistics for the Life and Social Sciences 275 / 802
Analysis of variance
The auto noise data
In 1973, the President of Texaco cited an automobile filter developed by
Associated Octel Company as effective in reducing pollution. However,
questions had been raised about the effects of filter silencing. He referred to
the data included in the report (and below) as evidence that the silencing
properties of the Octel filter were at least equal to those of standard
silencers.
u <- "http://www.utsc.utoronto.ca/~butler/d29/autonoise.txt"
autonoise <- read_table(u)
## Parsed with column specification:
## cols(
## noise = col_double(),
## size = col_character(),
## type = col_character(),
## side = col_character()
## )
Lecture notes STAD29: Statistics for the Life and Social Sciences 276 / 802
Analysis of variance
The data
autonoise
## # A tibble: 36 x 4
## noise size type side
##
## 1 840 M Std R
## 2 770 L Octel L
## 3 820 M Octel R
## 4 775 L Octel R
## 5 825 M Octel L
## 6 840 M Std R
## 7 845 M Std L
## 8 825 M Octel L
## 9 815 M Octel L
## 10 845 M Std R
## # … with 26 more rows
Lecture notes STAD29: Statistics for the Life and Social Sciences 277 / 802
Analysis of variance
Making boxplot
Make a boxplot, but have combinations of filter type and engine size.
Use grouped boxplot again, thus:
g <- autonoise %>%
ggplot(aes(x = size, y = noise, fill = type)) +
geom_boxplot()
Lecture notes STAD29: Statistics for the Life and Social Sciences 278 / 802
Analysis of variance
The boxplot
See difference in engine noise between Octel and standard is larger for
medium engine size than for large or small.
Some evidence of differences in spreads (ignore for now):
g
760
780
800
820
840
L M S
size
n
o
is
e
type
Octel
Std
Figure 35: plot of chunk unnamed-chunk-206
Lecture notes STAD29: Statistics for the Life and Social Sciences 279 / 802
Analysis of variance
ANOVA
autonoise.1 <- aov(noise ~ size * type, data = autonoise)
summary(autonoise.1)
## Df Sum Sq Mean Sq F value Pr(>F)
## size 2 26051 13026 199.119 < 2e-16 ***
## type 1 1056 1056 16.146 0.000363 ***
## size:type 2 804 402 6.146 0.005792 **
## Residuals 30 1962 65
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The interaction is significant, as we suspected from the boxplots.
The within-group spreads don’t look very equal, but only based on 6
obs each.
Lecture notes STAD29: Statistics for the Life and Social Sciences 280 / 802
Analysis of variance
Tukey: ouch!
autonoise.2 <- TukeyHSD(autonoise.1)
autonoise.2$`size:type`
## diff lwr upr p adj
## M:Octel-L:Octel 51.6666667 37.463511 65.869823 6.033496e-11
## S:Octel-L:Octel 52.5000000 38.296844 66.703156 4.089762e-11
## L:Std-L:Octel 5.0000000 -9.203156 19.203156 8.890358e-01
## M:Std-L:Octel 75.8333333 61.630177 90.036489 4.962697e-14
## S:Std-L:Octel 55.8333333 41.630177 70.036489 9.002910e-12
## S:Octel-M:Octel 0.8333333 -13.369823 15.036489 9.999720e-01
## L:Std-M:Octel -46.6666667 -60.869823 -32.463511 6.766649e-10
## M:Std-M:Octel 24.1666667 9.963511 38.369823 1.908995e-04
## S:Std-M:Octel 4.1666667 -10.036489 18.369823 9.454142e-01
## L:Std-S:Octel -47.5000000 -61.703156 -33.296844 4.477636e-10
## M:Std-S:Octel 23.3333333 9.130177 37.536489 3.129974e-04
## S:Std-S:Octel 3.3333333 -10.869823 17.536489 9.787622e-01
## M:Std-L:Std 70.8333333 56.630177 85.036489 6.583623e-14
## S:Std-L:Std 50.8333333 36.630177 65.036489 8.937329e-11
## S:Std-M:Std -20.0000000 -34.203156 -5.796844 2.203265e-03
Lecture notes STAD29: Statistics for the Life and Social Sciences 281 / 802
Analysis of variance
Interaction plot
This time, don’t have summary of mean noise for each size-type
combination.
One way is to compute summaries (means) first, and feed into ggplot
as in vitamin B example.
Or, have ggplot compute them for us, thus:
g <- ggplot(autonoise, aes(
x = size, y = noise,
colour = type, group = type
)) +
stat_summary(fun.y = mean, geom = "point") +
stat_summary(fun.y = mean, geom = "line")
Lecture notes STAD29: Statistics for the Life and Social Sciences 282 / 802
Analysis of variance
Interaction plot
The lines are definitely not parallel, showing that the effect of type is
different for medium-sized engines than for others:
g
l
l l
l
l
l
770
790
810
830
L M S
size
n
o
is
e
type
l
l
Octel
Std
Figure 36: plot of chunk unnamed-chunk-210
Lecture notes STAD29: Statistics for the Life and Social Sciences 283 / 802
Analysis of variance
If you don’t like that…
…then compute the means first, in a pipeline:
autonoise %>%
group_by(size, type) %>%
summarize(mean_noise = mean(noise)) %>%
ggplot(aes(
x = size, y = mean_noise, group = type,
colour = type
)) + geom_point() + geom_line()
l
l
l
l
l
l
770
790
810
830
L M S
size
m
e
a
n
_
n
o
is
e
type
l
l
Octel
Std
Figure 37: plot of chunk unnamed-chunk-211
Lecture notes STAD29: Statistics for the Life and Social Sciences 284 / 802
Analysis of variance
Simple effects for auto noise example
In auto noise example, weren’t interested in all comparisons between
car size and filter type combinations.
Wanted to demonstrate (lack of) difference between filter types for
each car type.
These are called simple effects of one variable (filter type) conditional
on other variable (car type).
To do this, pull out just the data for small cars, compare noise for the
two filter types. Then repeat for medium and large cars. (Three
one-way ANOVAs.)
Lecture notes STAD29: Statistics for the Life and Social Sciences 285 / 802
Analysis of variance
Do it using dplyr tools
Small cars:
autonoise %>%
filter(size == "S") %>%
aov(noise ~ type, data = .) %>%
summary()
## Df Sum Sq Mean Sq F value Pr(>F)
## type 1 33.3 33.33 0.548 0.476
## Residuals 10 608.3 60.83
No filter difference for small cars.
For Medium, change S to M and repeat.
Lecture notes STAD29: Statistics for the Life and Social Sciences 286 / 802
Analysis of variance
Simple effect of filter type for medium cars
{
autonoise %>%
filter(size == "M") %>%
aov(noise ~ type, data = .) %>%
summary()
## Df Sum Sq Mean Sq F value Pr(>F)
## type 1 1752.1 1752.1 68.93 8.49e-06 ***
## Residuals 10 254.2 25.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
}
There is an effect of filter type for medium cars. Look at means to investigate
(over).
Lecture notes STAD29: Statistics for the Life and Social Sciences 287 / 802
Analysis of variance
Mean noise for each filter type
…for medium engine size:
autonoise %>%
filter(size == "M") %>%
group_by(type) %>%
summarize(m = mean(noise))
## # A tibble: 2 x 2
## type m
##
## 1 Octel 822.
## 2 Std 846.
Octel filters produce less noise for medium cars.
Lecture notes STAD29: Statistics for the Life and Social Sciences 288 / 802
Analysis of variance
Large cars
Large cars:
autonoise %>%
filter(size == "L") %>%
aov(noise ~ type, data = .) %>%
summary()
## Df Sum Sq Mean Sq F value Pr(>F)
## type 1 75 75 0.682 0.428
## Residuals 10 1100 110
No significant difference again.
Lecture notes STAD29: Statistics for the Life and Social Sciences 289 / 802
Analysis of variance
Or…
use glance from broom:
autonoise %>%
filter(size == "L") %>%
aov(noise ~ type, data = .) %>%
glance()
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df
##
## 1 0.0638 -0.0298 10.5 0.682 0.428 2
## # … with 5 more variables: logLik , AIC ,
## # BIC , deviance , df.residual
P-value same as from summary output.
Lecture notes STAD29: Statistics for the Life and Social Sciences 290 / 802
Analysis of variance
All at once, using split/apply/combine
The “split” part:
autonoise %>%
group_by(size) %>%
nest()
## # A tibble: 3 x 2
## # Groups: size [3]
## size data
## >
## 1 M [12 × 3]
## 2 L [12 × 3]
## 3 S [12 × 3]
Now have three rows, with the data frame for each size encoded as one
element of this data frame.
Lecture notes STAD29: Statistics for the Life and Social Sciences 291 / 802
Analysis of variance
Apply
Write function to do aov on a data frame with columns noise and
type, returning P-value:
aov_pval <- function(x) {
noise.1 <- aov(noise ~ type, data = x)
gg <- glance(noise.1)
gg$p.value
}
Test it:
autonoise %>%
filter(size == "L") %>%
aov_pval()
## [1] 0.428221
Check.
Lecture notes STAD29: Statistics for the Life and Social Sciences 292 / 802
Analysis of variance
Combine
Apply this function to each of the nested data frames (one per engine
size):
autonoise %>%
group_by(size) %>%
nest() %>%
mutate(p_val = map_dbl(data, ~ aov_pval(.)))
## # A tibble: 3 x 3
## # Groups: size [3]
## size data p_val
## >
## 1 M [12 × 3] 0.00000849
## 2 L [12 × 3] 0.428
## 3 S [12 × 3] 0.476
map_dbl because aov_pval returns a decimal number (a dbl).
Investigate what happens if you use map instead.Lecture notes STAD29: Statistics for the Life and Social Sciences 293 / 802
Analysis of variance
Tidy up
The data column was stepping-stone to getting answer. Don’t need it
any more:
simple_effects <- autonoise %>%
group_by(size) %>%
nest() %>%
mutate(p_val = map_dbl(data, ~ aov_pval(.))) %>%
select(-data)
simple_effects
## # A tibble: 3 x 2
## # Groups: size [3]
## size p_val
##
## 1 M 0.00000849
## 2 L 0.428
## 3 S 0.476
Lecture notes STAD29: Statistics for the Life and Social Sciences 294 / 802
Analysis of variance
Simultaneous tests
When testing simple effects, doing several tests at once. (In this case,
3.)
Have to adjust P-values for this. Eg. Holm:
simple_effects %>%
arrange(p_val) %>%
mutate(multiplier = 4 - row_number()) %>%
mutate(p_val_adj = p_val * multiplier)
## # A tibble: 3 x 4
## # Groups: size [3]
## size p_val multiplier p_val_adj
##
## 1 M 0.00000849 3 0.0000255
## 2 L 0.428 3 1.28
## 3 S 0.476 3 1.43
* No change in rejection decisions.
* Octel filters sig. better in terms of noise for medium cars, and not sig. different for other
sizes.
* Octel filters never significantly worse than standard ones.
Lecture notes STAD29: Statistics for the Life and Social Sciences 295 / 802
Analysis of variance
Confidence intervals
Perhaps better way of assessing simple effects: look at confidence
intervals rather than tests.
Gives us sense of accuracy of estimation, and thus whether
non-significance might be lack of power: “absence of evidence is not
evidence of absence”.
Works here because two filter types, using t.test for each engine
type.
Want to show that the Octel filter is equivalent to or better than the
standard filter, in terms of engine noise.
Lecture notes STAD29: Statistics for the Life and Social Sciences 296 / 802
Analysis of variance
Equivalence and noninferiority
Known as “equivalence testing” in medical world. A good read: link.
Basic idea: decide on size of difference that would be considered
“equivalent”, and if CI entirely inside ±, have evidence in favour of
equivalence.
We really want to show that the Octel filters are “no worse” than the
standard one: that is, equivalent or better than standard filters.
Such a “noninferiority test” done by checking that upper limit of CI,
new minus old, is less than . (This requires careful thinking about (i)
which way around the difference is and (ii) whether a higher or lower
value is better.)
Lecture notes STAD29: Statistics for the Life and Social Sciences 297 / 802
Analysis of variance
CI for small cars
Same idea as for simple effect test:
autonoise %>%
filter(size == "S") %>%
t.test(noise ~ type, data = .) %>%
.[["conf.int"]]
## [1] -14.517462 7.850795
## attr(,"conf.level")
## [1] 0.95
Lecture notes STAD29: Statistics for the Life and Social Sciences 298 / 802
Analysis of variance
CI for medium cars
autonoise %>%
filter(size == "M") %>%
t.test(noise ~ type, data = .) %>%
.[["conf.int"]]
## [1] -30.75784 -17.57549
## attr(,"conf.level")
## [1] 0.95
Lecture notes STAD29: Statistics for the Life and Social Sciences 299 / 802
Analysis of variance
CI for large cars
autonoise %>%
filter(size == "L") %>%
t.test(noise ~ type, data = .) %>%
.[["conf.int"]]
## [1] -19.270673 9.270673
## attr(,"conf.level")
## [1] 0.95
Lecture notes STAD29: Statistics for the Life and Social Sciences 300 / 802
Analysis of variance
Or, all at once: split/apply/combine
ci_func <- function(x) {
tt <- t.test(noise ~ type, data = x)
tt$conf.int
}
autonoise %>%
nest(-size) %>%
mutate(ci = map(data, ~ ci_func(.))) %>%
unnest(ci) -> cis
## Warning: All elements of `...` must be named.
## Did you want `data = c(noise, type, side)`?
Lecture notes STAD29: Statistics for the Life and Social Sciences 301 / 802
Analysis of variance
Results
cis
## # A tibble: 6 x 3
## size data ci
## >
## 1 M [12 × 3] -30.8
## 2 M [12 × 3] -17.6
## 3 L [12 × 3] -19.3
## 4 L [12 × 3] 9.27
## 5 S [12 × 3] -14.5
## 6 S [12 × 3] 7.85
Lecture notes STAD29: Statistics for the Life and Social Sciences 302 / 802
Analysis of variance
Procedure
Function to get CI of difference in noise means for types of filter on
input data frame
Group by size, nest (mini-df per size)
Calculate CI for each thing in data (ie. each size). map: CI is two
numbers long
unnest ci column to see two numbers in each CI.
Lecture notes STAD29: Statistics for the Life and Social Sciences 303 / 802
Analysis of variance
CIs and noninferiority test
Suppose we decide that a 20 dB difference would be considered
equivalent. (I have no idea whether that is reasonable.)
Intervals:
hilooo=rep(c("lower", "upper"), 3)
hilooo
## [1] "lower" "upper" "lower" "upper" "lower" "upper"
cis %>%
mutate(hilo = rep(c("lower", "upper"), 3)) %>%
pivot_wider(names_from=hilo, values_from=ci)
## # A tibble: 3 x 4
## size data lower upper
## >
## 1 M [12 × 3] -30.8 -17.6
## 2 L [12 × 3] -19.3 9.27
## 3 S [12 × 3] -14.5 7.85Lecture notes STAD29: Statistics for the Life and Social Sciences 304 / 802
Analysis of variance
Comments
In all cases, upper limit of CI is less than 20 dB. The Octel filters are
“noninferior” to the standard ones.
Caution: we did 3 procedures at once again. The true confidence level
is not 95%. (Won’t worry about that here.)
Lecture notes STAD29: Statistics for the Life and Social Sciences 305 / 802
Analysis of variance
Contrasts in ANOVA
Sometimes, don’t want to compare all groups, only some of them.
Might be able to specify these comparisons ahead of time; other
comparisons of no interest.
Wasteful to do ANOVA and Tukey.
Lecture notes STAD29: Statistics for the Life and Social Sciences 306 / 802
Analysis of variance
Example: chainsaw kickback
From link.
Forest manager concerned about safety of chainsaws issued to field
crew. 4 models of chainsaws, measure “kickback” (degrees of
deflection) for 5 of each:
A B C D
-----------
42 28 57 29
17 50 45 29
24 44 48 22
39 32 41 34
43 61 54 30
So far, standard 1-way ANOVA: what differences are there among
models?
Lecture notes STAD29: Statistics for the Life and Social Sciences 307 / 802
Analysis of variance
chainsaw kickback (2)
But: models A and D are designed to be used at home, while models B
and C are industrial models.
Suggests these comparisons of interest:
home vs. industrial
the two home models A vs. D
the two industrial models B vs. C.
Don’t need to compare all the pairs of models.
Lecture notes STAD29: Statistics for the Life and Social Sciences 308 / 802
Analysis of variance
What is a contrast?
Contrast is a linear combination of group means.
Notation: for (population) mean of group , and so on.
In example, compare two home models: 0 ∶ − = 0.
Compare two industrial models: 0 ∶ − = 0.
Compare average of two home models vs. average of two industrial
models: 0 ∶
1
2( + ) −
1
2( + ) = 0 or
0 ∶ 0.5 − 0.5 − 0.5 + 0.5 = 0.
Note that coefficients of contrasts add to 0, and right-hand side is 0.
Lecture notes STAD29: Statistics for the Life and Social Sciences 309 / 802
Analysis of variance
Contrasts in R
Comparing two home models A and D ( − = 0):
c.home <- c(1, 0, 0, -1)
Comparing two industrial models B and C ( − = 0):
c.industrial <- c(0, 1, -1, 0)
Comparing home average vs. industrial average
(0.5 − 0.5 − 0.5 + 0.5 = 0):
c.home.ind <- c(0.5, -0.5, -0.5, 0.5)
Lecture notes STAD29: Statistics for the Life and Social Sciences 310 / 802
Analysis of variance
Orthogonal contrasts
What happens if we multiply the contrast coefficients one by one?
c.home * c.industrial
## [1] 0 0 0 0
c.home * c.home.ind
## [1] 0.5 0.0 0.0 -0.5
c.industrial * c.home.ind
## [1] 0.0 -0.5 0.5 0.0
in each case, the results add up to zero. Such contrasts are called
orthogonal.
Lecture notes STAD29: Statistics for the Life and Social Sciences 311 / 802
Analysis of variance
Orthogonal contrasts (2)
Compare these:
c1 <- c(1, -1, 0)
c2 <- c(0, 1, -1)
sum(c1 * c2)
## [1] -1
Not zero, so c1 and c2 are not orthogonal.
Orthogonal contrasts are much easier to deal with.
Can use non-orthogonal contrasts, but more trouble (beyond us).
Lecture notes STAD29: Statistics for the Life and Social Sciences 312 / 802
Analysis of variance
Read in data
url <- "http://www.utsc.utoronto.ca/~butler/d29/chainsaw.txt"
chain.wide <- read_table(url)
chain.wide
## # A tibble: 5 x 4
## A B C D
##
## 1 42 28 57 29
## 2 17 50 45 29
## 3 24 44 48 22
## 4 39 32 41 34
## 5 43 61 54 30
Lecture notes STAD29: Statistics for the Life and Social Sciences 313 / 802
Analysis of variance
Tidying
Need all the kickbacks in one column:
chain.wide %>%
pivot_longer(A:D, names_to = "model", names_ptypes = list(model=factor()),
values_to = "kickback") -> chain
Lecture notes STAD29: Statistics for the Life and Social Sciences 314 / 802
Analysis of variance
Starting the analysis (2)
The proper data frame (tiny):
chain
## # A tibble: 20 x 2
## model kickback
##
## 1 A 42
## 2 B 28
## 3 C 57
## 4 D 29
## 5 A 17
## 6 B 50
## 7 C 45
## 8 D 29
## 9 A 24
## 10 B 44
## 11 C 48
## 12 D 22
## 13 A 39
## 14 B 32
## 15 C 41
## 16 D 34
## 17 A 43
## 18 B 61
## 19 C 54
## 20 D 30
Lecture notes STAD29: Statistics for the Life and Social Sciences 315 / 802
Analysis of variance
Setting up contrasts
m <- cbind(c.home, c.industrial, c.home.ind)
m
## c.home c.industrial c.home.ind
## [1,] 1 0 0.5
## [2,] 0 1 -0.5
## [3,] 0 -1 -0.5
## [4,] -1 0 0.5
contrasts(chain$model) <- m
Lecture notes STAD29: Statistics for the Life and Social Sciences 316 / 802
Analysis of variance
ANOVA as if regression
chain.1 <- lm(kickback ~ model, data = chain)
summary(chain.1)
##
## Call:
## lm(formula = kickback ~ model, data = chain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.00 -7.10 0.60 6.25 18.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.450 2.179 17.649 6.52e-12 ***
## modelc.home 2.100 3.081 0.682 0.50524
## modelc.industrial -3.000 3.081 -0.974 0.34469
## modelc.home.ind -15.100 4.357 -3.466 0.00319 **
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.743 on 16 degrees of freedom
## Multiple R-squared: 0.4562, Adjusted R-squared: 0.3542
## F-statistic: 4.474 on 3 and 16 DF, p-value: 0.01833Lecture notes STAD29: Statistics for the Life and Social Sciences 317 / 802
Analysis of variance
Conclusions
tidy(chain.1) %>% select(term, p.value)
## # A tibble: 4 x 2
## term p.value
##
## 1 (Intercept) 6.52e-12
## 2 modelc.home 5.05e- 1
## 3 modelc.industrial 3.45e- 1
## 4 modelc.home.ind 3.19e- 3
Two home models not sig. diff. (P-value 0.51)
Two industrial models not sig. diff. (P-value 0.34)
Home, industrial models are sig. diff. (P-value 0.0032).
Lecture notes STAD29: Statistics for the Life and Social Sciences 318 / 802
Analysis of variance
Means by model
The means:
chain %>%
group_by(model) %>%
summarize(mean.kick = mean(kickback)) %>%
arrange(desc(mean.kick))
## # A tibble: 4 x 2
## model mean.kick
##
## 1 C 49
## 2 B 43
## 3 A 33
## 4 D 28.8
Home models A & D have less kickback than industrial ones B & C.
Makes sense because industrial users should get training to cope with
additional kickback.
Lecture notes STAD29: Statistics for the Life and Social Sciences 319 / 802
Analysis of covariance
Section 7
Analysis of covariance
Lecture notes STAD29: Statistics for the Life and Social Sciences 320 / 802
Analysis of covariance
Analysis of covariance
ANOVA: explanatory variables categorical (divide data into groups)
traditionally, analysis of covariance has categorical ’s plus one
numerical (“covariate”) to be adjusted for.
lm handles this too.
Simple example: two treatments (drugs) (a and b), with before and
after scores.
Does knowing before score and/or treatment help to predict after
score?
Is after score different by treatment/before score?
Lecture notes STAD29: Statistics for the Life and Social Sciences 321 / 802
Analysis of covariance
Data
Treatment, before, after:
a 5 20
a 10 23
a 12 30
a 9 25
a 23 34
a 21 40
a 14 27
a 18 38
a 6 24
a 13 31
b 7 19
b 12 26
b 27 33
b 24 35
b 18 30
b 22 31
b 26 34
b 21 28
b 14 23
b 9 22
Lecture notes STAD29: Statistics for the Life and Social Sciences 322 / 802
Analysis of covariance
Packages
tidyverse and broom:
library(tidyverse)
library(broom)
Lecture notes STAD29: Statistics for the Life and Social Sciences 323 / 802
Analysis of covariance
Read in data
url <- "http://www.utsc.utoronto.ca/~butler/d29/ancova.txt"
prepost <- read_delim(url, " ")
prepost %>% sample_n(9) # randomly chosen rows
## # A tibble: 9 x 3
## drug before after
##
## 1 b 7 19
## 2 b 24 35
## 3 a 23 34
## 4 a 6 24
## 5 b 26 34
## 6 a 14 27
## 7 b 14 23
## 8 b 12 26
## 9 b 27 33
Lecture notes STAD29: Statistics for the Life and Social Sciences 324 / 802
Analysis of covariance
Making a plot
ggplot(prepost, aes(x = before, y = after, colour = drug)) +
geom_point()
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
20
25
30
35
40
10 20
before
a
fte
r
drug
l
l
a
b
Figure 38: plot of chunk ancova-plot
Lecture notes STAD29: Statistics for the Life and Social Sciences 325 / 802
Analysis of covariance
Comments
As before score goes up, after score goes up.
Red points (drug A) generally above blue points (drug B), for
comparable before score.
Suggests before score effect and drug effect.
Lecture notes STAD29: Statistics for the Life and Social Sciences 326 / 802
Analysis of covariance
The means
prepost %>%
group_by(drug) %>%
summarize(
before_mean = mean(before),
after_mean = mean(after)
)
## # A tibble: 2 x 3
## drug before_mean after_mean
##
## 1 a 13.1 29.2
## 2 b 18 28.1
Mean “after” score slightly higher for treatment A.
Mean “before” score much higher for treatment B.
Greater improvement on treatment A.
Lecture notes STAD29: Statistics for the Life and Social Sciences 327 / 802
Analysis of covariance
Testing for interaction
prepost.1 <- lm(after ~ before * drug, data = prepost)
anova(prepost.1)
## Analysis of Variance Table
##
## Response: after
## Df Sum Sq Mean Sq F value Pr(>F)
## before 1 430.92 430.92 62.6894 6.34e-07 ***
## drug 1 115.31 115.31 16.7743 0.0008442 ***
## before:drug 1 12.34 12.34 1.7948 0.1990662
## Residuals 16 109.98 6.87
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interaction not significant. Will remove later.
Lecture notes STAD29: Statistics for the Life and Social Sciences 328 / 802
Analysis of covariance
Predictions, with interaction included
Make combinations of before score and drug:
new <- crossing(
before = c(5, 15, 25),
drug = c("a", "b")
)
new
## # A tibble: 6 x 2
## before drug
##
## 1 5 a
## 2 5 b
## 3 15 a
## 4 15 b
## 5 25 a
## 6 25 b
Lecture notes STAD29: Statistics for the Life and Social Sciences 329 / 802
Analysis of covariance
Do predictions:
pred <- predict(prepost.1, new)
preds <- bind_cols(new, pred = pred)
preds
## # A tibble: 6 x 3
## before drug pred
##
## 1 5 a 21.3
## 2 5 b 18.7
## 3 15 a 31.1
## 4 15 b 25.9
## 5 25 a 40.8
## 6 25 b 33.2
Lecture notes STAD29: Statistics for the Life and Social Sciences 330 / 802
Analysis of covariance
Making a plot with lines for each drug
g <- ggplot(prepost,
aes(x = before, y = after, colour = drug)) +
geom_point() + geom_line(data = preds, aes(y = pred))
Here, final line:
joins points by lines for different data set (preds rather than prepost),
different (pred rather than after),
but same (x=before inherited from first aes).
Last line could (more easily) be
geom_smooth(method = "lm", se = F)
which would work here, but not for later plot.
Lecture notes STAD29: Statistics for the Life and Social Sciences 331 / 802
Analysis of covariance
The plot
Lines almost parallel, but not quite.
Non-parallelism (interaction) not significant:
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
20
25
30
35
40
10 20
before
a
fte
r
drug
l
l
a
b
Figure 39: plot of chunk nachwazzo
Lecture notes STAD29: Statistics for the Life and Social Sciences 332 / 802
Analysis of covariance
Taking out interaction
prepost.2 <- update(prepost.1, . ~ . - before:drug)
anova(prepost.2)
## Analysis of Variance Table
##
## Response: after
## Df Sum Sq Mean Sq F value Pr(>F)
## before 1 430.92 430.92 59.890 5.718e-07 ***
## drug 1 115.31 115.31 16.025 0.0009209 ***
## Residuals 17 122.32 7.20
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Take out non-significant interaction.
before and drug strongly significant.
Do predictions again and plot them.
Lecture notes STAD29: Statistics for the Life and Social Sciences 333 / 802
Analysis of covariance
Predicted values again (no-interaction model)
pred <- predict(prepost.2, new)
preds <- bind_cols(new, pred = pred)
preds
## # A tibble: 6 x 3
## before drug pred
##
## 1 5 a 22.5
## 2 5 b 17.3
## 3 15 a 30.8
## 4 15 b 25.6
## 5 25 a 39.0
## 6 25 b 33.9
Each increase of 10 in before score results in 8.3 in predicted after score, the
same for both drugs.
Lecture notes STAD29: Statistics for the Life and Social Sciences 334 / 802
Analysis of covariance
Making a plot, again
g <- ggplot(
prepost,
aes(x = before, y = after, colour = drug)
) +
geom_point() +
geom_line(data = preds, aes(y = pred))
Exactly same as before, but using new predictions.
Lecture notes STAD29: Statistics for the Life and Social Sciences 335 / 802
Analysis of covariance
The no-interaction plot of predicted values
g
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
l
20
25
30
35
40
10 20
before
a
fte
r
drug
l
l
a
b
Figure 40: plot of chunk cabazzo
Lines now parallel. No-interaction model forces them to have the same
slope.
Lecture notes STAD29: Statistics for the Life and Social Sciences 336 / 802
Analysis of covariance
Different look at model output
anova(prepost.2) tests for significant effect of before score and of
drug, but doesn’t help with interpretation.
summary(prepost.2) views as regression with slopes:
summary(prepost.2)
##
## Call:
## lm(formula = after ~ before + drug, data = prepost)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6348 -2.5099 -0.2038 1.8871 4.7453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.3600 1.5115 12.147 8.35e-10 ***
## before 0.8275 0.0955 8.665 1.21e-07 ***
## drugb -5.1547 1.2876 -4.003 0.000921 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.682 on 17 degrees of freedom
## Multiple R-squared: 0.817, Adjusted R-squared: 0.7955
## F-statistic: 37.96 on 2 and 17 DF, p-value: 5.372e-07
Lecture notes STAD29: Statistics for the Life and Social Sciences 337 / 802
Analysis of covariance
Understanding those slopes
tidy(prepost.2)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
##
## 1 (Intercept) 18.4 1.51 12.1 8.35e-10
## 2 before 0.827 0.0955 8.66 1.21e- 7
## 3 drugb -5.15 1.29 -4.00 9.21e- 4
before ordinary numerical variable; drug categorical.
lm uses first category druga as baseline.
Intercept is prediction of after score for before score 0 and drug A.
before slope is predicted change in after score when before score
increases by 1 (usual slope)
Slope for drugb is change in predicted after score for being on drug B
rather than drug A. Same for any before score (no interaction).
Lecture notes STAD29: Statistics for the Life and Social Sciences 338 / 802
Analysis of covariance
Summary
ANCOVA model: fits different regression line for each group, predicting
response from covariate.
ANCOVA model with interaction between factor and covariate allows
different slopes for each line.
Sometimes those lines can cross over!
If interaction not significant, take out. Lines then parallel.
With parallel lines, groups have consistent effect regardless of value of
covariate.
Lecture notes STAD29: Statistics for the Life and Social Sciences 339 / 802
Multivariate ANOVA
Section 8
Multivariate ANOVA
Lecture notes STAD29: Statistics for the Life and Social Sciences 340 / 802
Multivariate ANOVA
Multivariate analysis of variance
Standard ANOVA has just one response variable.
What if you have more than one response?
Try an ANOVA on each response separately.
But might miss some kinds of interesting dependence between the
responses that distinguish the groups.
Lecture notes STAD29: Statistics for the Life and Social Sciences 341 / 802
Multivariate ANOVA
Packages
library(car)
library(tidyverse)
Lecture notes STAD29: Statistics for the Life and Social Sciences 342 / 802
Multivariate ANOVA
Small example
Measure yield and seed weight of plants grown under 2 conditions: low
and high amounts of fertilizer.
Data (fertilizer, yield, seed weight):
url <- "http://www.utsc.utoronto.ca/~butler/d29/manova1.txt"
hilo <- read_delim(url, " ")
## Parsed with column specification:
## cols(
## fertilizer = col_character(),
## yield = col_double(),
## weight = col_double()
## )
2 responses, yield and seed weight.
Lecture notes STAD29: Statistics for the Life and Social Sciences 343 / 802
Multivariate ANOVA
The data
hilo
## # A tibble: 8 x 3
## fertilizer yield weight
##
## 1 low 34 10
## 2 low 29 14
## 3 low 35 11
## 4 low 32 13
## 5 high 33 14
## 6 high 38 12
## 7 high 34 13
## 8 high 35 14
Lecture notes STAD29: Statistics for the Life and Social Sciences 344 / 802
Multivariate ANOVA
Boxplot for yield for each fertilizer group
ggplot(hilo, aes(x = fertilizer, y = yield)) + geom_boxplot()
30
32
34
36
38
high low
fertilizer
yie
ld
Figure 41: plot of chunk ferto
Yields overlap for fertilizer groups.Lecture notes STAD29: Statistics for the Life and Social Sciences 345 / 802
Multivariate ANOVA
Boxplot for weight for each fertilizer group
ggplot(hilo, aes(x = fertilizer, y = weight)) + geom_boxplot()
10
11
12
13
14
high low
fertilizer
w
e
ig
ht
Figure 42: plot of chunk casteldisangro
Weights overlap for fertilizer groups.Lecture notes STAD29: Statistics for the Life and Social Sciences 346 / 802
Multivariate ANOVA
ANOVAs for yield and weight
hilo.y <- aov(yield ~ fertilizer, data = hilo)
summary(hilo.y)
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 1 12.5 12.500 2.143 0.194
## Residuals 6 35.0 5.833
hilo.w <- aov(weight ~ fertilizer, data = hilo)
summary(hilo.w)
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 1 3.125 3.125 1.471 0.271
## Residuals 6 12.750 2.125
Neither response depends significantly on fertilizer. But…
Lecture notes STAD29: Statistics for the Life and Social Sciences 347 / 802
Multivariate ANOVA
Plotting both responses at once
Have two response variables (not more), so can plot the response
variables against each other, labelling points by which fertilizer group
they’re from.
First, create data frame with points (31, 14) and (38, 10) (why? Later):
d <- tribble(
~line_x, ~line_y,
31, 14,
38, 10
)
Then plot data as points, and add line through points in d:
g <- ggplot(hilo, aes(x = yield, y = weight,
colour = fertilizer)) + geom_point() +
geom_line(data = d,
aes(x = line_x, y = line_y, colour = NULL))
Lecture notes STAD29: Statistics for the Life and Social Sciences 348 / 802
Multivariate ANOVA
The plot
l
l
l
l
l
l
l
l
10
11
12
13
14
30 32 34 36 38
yield
w
e
ig
ht
fertilizer
l
l
high
low
Figure 43: plot of chunk charlecombeLecture notes STAD29: Statistics for the Life and Social Sciences 349 / 802
Multivariate ANOVA
Comments
Graph construction:
Joining points in d by line.
geom_line inherits colour from aes in ggplot.
Data frame d has no fertilizer (previous colour), so have to unset.
Results:
High-fertilizer plants have both yield and weight high.
True even though no sig difference in yield or weight individually.
Drew line separating highs from lows on plot.
Lecture notes STAD29: Statistics for the Life and Social Sciences 350 / 802
Multivariate ANOVA
MANOVA finds multivariate differences
Is difference found by diagonal line significant? MANOVA finds out.
response <- with(hilo, cbind(yield, weight))
hilo.1 <- manova(response ~ fertilizer, data = hilo)
summary(hilo.1)
## Df Pillai approx F num Df den Df Pr(>F)
## fertilizer 1 0.80154 10.097 2 5 0.01755 *
## Residuals 6
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes! Difference between groups is diagonally, not just up/down
(weight) or left-right (yield). The yield-weight combination matters.
Lecture notes STAD29: Statistics for the Life and Social Sciences 351 / 802
Multivariate ANOVA
Strategy
Create new response variable by gluing together columns of responses,
using cbind.
Use manova with new response, looks like lm otherwise.
With more than 2 responses, cannot draw graph. What then?
If MANOVA test significant, cannot use Tukey. What then?
Use discriminant analysis (of which more later).
Lecture notes STAD29: Statistics for the Life and Social Sciences 352 / 802
Multivariate ANOVA
Another way to do MANOVA
Install (once) and load package car:
library(car)
Lecture notes STAD29: Statistics for the Life and Social Sciences 353 / 802
Multivariate ANOVA
Another way…
hilo.2.lm <- lm(response ~ fertilizer, data = hilo)
hilo.2 <- Manova(hilo.2.lm)
hilo.2
##
## Type II MANOVA Tests: Pillai test statistic
## Df test stat approx F num Df den Df Pr(>F)
## fertilizer 1 0.80154 10.097 2 5 0.01755 *
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Same result as small-m manova.
Manova will also do repeated measures, coming up later.
Lecture notes STAD29: Statistics for the Life and Social Sciences 354 / 802
Multivariate ANOVA
Another example: peanuts
Three different varieties of peanuts (mysteriously, 5, 6 and 8) planted
in two different locations.
Three response variables: y, smk and w.
u <- "http://www.utsc.utoronto.ca/~butler/d29/peanuts.txt"
peanuts.orig <- read_delim(u, " ")
## Parsed with column specification:
## cols(
## obs = col_double(),
## location = col_double(),
## variety = col_double(),
## y = col_double(),
## smk = col_double(),
## w = col_double()
## )
Lecture notes STAD29: Statistics for the Life and Social Sciences 355 / 802
Multivariate ANOVA
The data
peanuts.orig
## # A tibble: 12 x 6
## obs location variety y smk w
##
## 1 1 1 5 195. 153. 51.4
## 2 2 1 5 194. 168. 53.7
## 3 3 2 5 190. 140. 55.5
## 4 4 2 5 180. 121. 44.4
## 5 5 1 6 203 157. 49.8
## 6 6 1 6 196. 166 45.8
## 7 7 2 6 203. 166. 60.4
## 8 8 2 6 198. 162. 54.1
## 9 9 1 8 194. 164. 57.8
## 10 10 1 8 187 165. 58.6
## 11 11 2 8 202. 167. 65
## 12 12 2 8 200 174. 67.2
Lecture notes STAD29: Statistics for the Life and Social Sciences 356 / 802
Multivariate ANOVA
Setup for analysis
peanuts <- peanuts.orig %>%
mutate(
location = factor(location),
variety = factor(variety)
)
response <- with(peanuts, cbind(y, smk, w))
head(response)
## y smk w
## [1,] 195.3 153.1 51.4
## [2,] 194.3 167.7 53.7
## [3,] 189.7 139.5 55.5
## [4,] 180.4 121.1 44.4
## [5,] 203.0 156.8 49.8
## [6,] 195.9 166.0 45.8
Lecture notes STAD29: Statistics for the Life and Social Sciences 357 / 802
Multivariate ANOVA
Analysis (using Manova)
peanuts.1 <- lm(response ~ location * variety, data = peanuts)
peanuts.2 <- Manova(peanuts.1)
peanuts.2
##
## Type II MANOVA Tests: Pillai test statistic
## Df test stat approx F num Df den Df
## location 1 0.89348 11.1843 3 4
## variety 2 1.70911 9.7924 6 10
## location:variety 2 1.29086 3.0339 6 10
## Pr(>F)
## location 0.020502 *
## variety 0.001056 **
## location:variety 0.058708 .
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Lecture notes STAD29: Statistics for the Life and Social Sciences 358 / 802
Multivariate ANOVA
Comments
Interaction not quite significant, but main effects are.
Combined response variable (y,smk,w) definitely depends on location
and on variety
Weak dependence of (y,smk,w) on the location-variety combination.
Understanding that dependence beyond our scope right now.
Lecture notes STAD29: Statistics for the Life and Social Sciences 359 / 802
Repeated measures by profile analysis
Section 9
Repeated measures by profile analysis
Lecture notes STAD29: Statistics for the Life and Social Sciences 360 / 802
Repeated measures by profile analysis
Repeated measures by profile analysis
More than one response measurement for each subject. Might be
measurements of the same thing at different times
measurements of different but related things
Generalization of matched pairs (“matched triples”, etc.).
Variation: each subject does several different treatments at different
times (called crossover design).
Expect measurements on same subject to be correlated, so
assumptions of independence will fail.
Called repeated measures. Different approaches, but profile analysis
uses Manova (set up right way).
Another approach uses mixed models (random effects).
Lecture notes STAD29: Statistics for the Life and Social Sciences 361 / 802
Repeated measures by profile analysis
Packages
library(car)
library(tidyverse)
Lecture notes STAD29: Statistics for the Life and Social Sciences 362 / 802
Repeated measures by profile analysis
Example: histamine in dogs
8 dogs take part in experiment.
Dogs randomized to one of 2 different drugs.
Response: log of blood concentration of histamine 0, 1, 3 and 5
minutes after taking drug. (Repeated measures.)
Data in dogs.txt, column-aligned.
Lecture notes STAD29: Statistics for the Life and Social Sciences 363 / 802
Repeated measures by profile analysis
Read in data
my_url <- "http://www.utsc.utoronto.ca/~butler/d29/dogs.txt"
dogs <- read_table(my_url)
## Parsed with column specification:
## cols(
## dog = col_character(),
## drug = col_character(),
## x = col_character(),
## lh0 = col_double(),
## lh1 = col_double(),
## lh3 = col_double(),
## lh5 = col_double()
## )
Lecture notes STAD29: Statistics for the Life and Social Sciences 364 / 802
Repeated measures by profile analysis
Setting things up
dogs
## # A tibble: 8 x 7
## dog drug x lh0 lh1 lh3 lh5
##
## 1 A Morphine N -3.22 -1.61 -2.3 -2.53
## 2 B Morphine N -3.91 -2.81 -3.91 -3.91
## 3 C Morphine N -2.66 0.34 -0.73 -1.43
## 4 D Morphine N -1.77 -0.56 -1.05 -1.43
## 5 E Trimethaphan N -3.51 -0.48 -1.17 -1.51
## 6 F Trimethaphan N -3.51 0.05 -0.31 -0.51
## 7 G Trimethaphan N -2.66 -0.19 0.07 -0.22
## 8 H Trimethaphan N -2.41 1.14 0.72 0.21
response <- with(dogs, cbind(lh0, lh1, lh3, lh5))
dogs.1 <- lm(response ~ drug, data = dogs)
Lecture notes STAD29: Statistics for the Life and Social Sciences 365 / 802
Repeated measures by profile analysis
The repeated measures MANOVA
Get list of response variable names; we call them times. Save in data
frame.
times <- colnames(response)
times.df <- data.frame(times)
dogs.2 <- Manova(dogs.1,
idata = times.df,
idesign = ~times
)
dogs.2
##
## Type II Repeated Measures MANOVA Tests: Pillai test statistic
## Df test stat approx F num Df den Df Pr(>F)
## (Intercept) 1 0.76347 19.3664 1 6 0.004565 **
## drug 1 0.34263 3.1272 1 6 0.127406
## times 1 0.94988 25.2690 3 4 0.004631 **
## drug:times 1 0.89476 11.3362 3 4 0.020023 *
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Lecture notes STAD29: Statistics for the Life and Social Sciences 366 / 802
Repeated measures by profile analysis
Wide and long format
Interaction significant. Pattern of response over time different for the
two drugs.
Want to investigate interaction.
Lecture notes STAD29: Statistics for the Life and Social Sciences 367 / 802
Repeated measures by profile analysis
The wrong shape
But data frame has several observations per line (“wide format”):
dogs %>% slice(1:6)
## # A tibble: 6 x 7
## dog drug x lh0 lh1 lh3 lh5
##
## 1 A Morphine N -3.22 -1.61 -2.3 -2.53
## 2 B Morphine N -3.91 -2.81 -3.91 -3.91
## 3 C Morphine N -2.66 0.34 -0.73 -1.43
## 4 D Morphine N -1.77 -0.56 -1.05 -1.43
## 5 E Trimethaphan N -3.51 -0.48 -1.17 -1.51
## 6 F Trimethaphan N -3.51 0.05 -0.31 -0.51
Plotting works with data in “long format”: one response per line.
The responses are log-histamine at different times, labelled
lh-something. Call them all lh and put them in one column, with the
time they belong to labelled.
Lecture notes STAD29: Statistics for the Life and Social Sciences 368 / 802
Repeated measures by profile analysis
Running gather, try 1
dogs %>% gather(time, lh, lh0:lh5)
## # A tibble: 32 x 5
## dog drug x time lh
##
## 1 A Morphine N lh0 -3.22
## 2 B Morphine N lh0 -3.91
## 3 C Morphine N lh0 -2.66
## 4 D Morphine N lh0 -1.77
## 5 E Trimethaphan N lh0 -3.51
## 6 F Trimethaphan N lh0 -3.51
## 7 G Trimethaphan N lh0 -2.66
## 8 H Trimethaphan N lh0 -2.41
## 9 A Morphine N lh1 -1.61
## 10 B Morphine N lh1 -2.81
## # … with 22 more rows
Lecture notes STAD29: Statistics for the Life and Social Sciences 369 / 802
Repeated measures by profile analysis
Getting the times
Not quite right: for the times, we want just the numbers, not the letters lh
every time. Want new variable containing just number in time:
parse_number.
dogs %>%
gather(timex, lh, lh0:lh5) %>%
mutate(time = parse_number(timex))
## # A tibble: 32 x 6
## dog drug x timex lh time
##