STAD29: Statistics for the Life and Social Sciences
Lecture notes

Section 1: Course Outline

Course and instructor

Lecture: Wednesday 14:00-16:00 in HW 215. Optional computer lab: Monday 16:00-17:00 in BV 498.
Instructor: Ken Butler. Office: IC 471. E-mail:
[email protected]. Office hours: Wednesday 11:00-12:00, or make an appointment; e-mail is always good. Course website: link. Using Quercus for assignments/grades only; using the website for everything else.

Texts

There is no official text for this course. You may find "R for Data Science" (link) helpful for R background. I will refer frequently to my book of Problems and Solutions in Applied Statistics (PASIAS), link. Both of these resources are and will remain free.

Programs, prerequisites and exclusions

Prerequisites: for undergrads, STAC32 (not negotiable); for grad students, a first course in statistics, and some training in regression and ANOVA. The less you know, the more you'll have to catch up! This course is a required part of the Applied Statistics minor. Exclusions: this course is not for Math/Statistics/CS majors/minors. It is for students in other fields who wish to learn some more advanced statistical methods. The exclusions in the Calendar reflect this. If you are in one of those programs, you won't get program credit for this course, or for any future STA courses you take.

Computing

Computing is a big part of the course, not optional. You will need to demonstrate that you can use R to analyze data, and can critically interpret the output. For grad students who have not come through STAC32, I am happy to offer extra help to get you up to speed.

Assessment 1/2

Grading: (2 hour) midterm, (3 hour) final exam. Assignments most weeks, due Tuesday at 11:59pm. Graduate students (STA 1007) are also required to complete a project using one or more of the techniques learned in class, on a dataset from their field of study.
Projects due on the last day of classes.

Assessment:

               STAD29   STA 1007
Assignments      20%       20%
Midterm exam     30%       20%
Project           -        20%
Final exam       50%       40%

Assessment 2/2

Assessments missed with documentation will cause a re-weighting of other assessments of the same type. No make-ups. You must pass the final exam to guarantee passing the course. If you fail the final exam but would otherwise have passed the course, you receive a grade of 45.

Plagiarism

This link defines academic offences at this university. Read it. You are bound by it. Plagiarism is defined (at the end) as "The wrongful appropriation and purloining, and publication as one's own, of the ideas, or the expression of the ideas … of another." The code and explanations that you write and hand in must be yours and yours alone. When you hand in work, it is implied that it is your work. Handing in work, with your name on it, that was actually done by someone else is an academic offence. If I am suspicious that anyone's work is plagiarized, I will take action.

Getting help

The English Language Development Centre supports all students in developing better Academic English and the critical thinking skills needed in academic communication. Make use of the personalized support in academic writing skills development. Details and sign-up information: link.

Students with diverse learning styles and needs are welcome in this course. In particular, if you have a disability/health consideration that may require accommodations, please feel free to approach the AccessAbility Services Office as soon as possible. I will work with you and AccessAbility Services to ensure you can achieve your learning goals in this course. Enquiries are confidential.
The UTSC AccessAbility Services staff are available by appointment to assess specific needs, provide referrals and arrange appropriate accommodations: (416) 287-7560 or by e-mail:
[email protected].

Course material

Dates and times

Regression-like things:
- review of (multiple) regression
- logistic regression (including multi-category responses)
- survival analysis

ANOVA-like things:
- more ANOVA
- multivariate ANOVA
- repeated measures

Multivariate methods:
- discriminant analysis
- cluster analysis
- (multidimensional scaling)
- principal components
- factor analysis

Miscellanea:
- (time series)
- multiway frequency tables

Section 2: Dates and Times

Packages for this section

library(tidyverse)
library(lubridate)

Dates

Dates are represented on computers as "days since an origin", typically Jan 1, 1970, with a negative date being before the origin:

mydates <- c("1970-01-01", "2007-09-04", "1931-08-05")
(somedates <- tibble(text = mydates) %>%
  mutate(
    d = as.Date(text),
    numbers = as.numeric(d)
  ))

## # A tibble: 3 x 3
##   text       d          numbers
##
## 1 1970-01-01 1970-01-01       0
## 2 2007-09-04 2007-09-04   13760
## 3 1931-08-05 1931-08-05  -14029

Doing arithmetic with dates

Dates are "actually" numbers, so we can add and subtract them (diffs is the 2007 date in d minus the others):

somedates %>% mutate(plus30 = d + 30, diffs = d[2] - d)

## # A tibble: 3 x 5
##   text       d          numbers plus30     diffs
##
## 1 1970-01-01 1970-01-01       0 1970-01-31 13760 da…
## 2 2007-09-04 2007-09-04   13760 2007-10-04     0 da…
## 3 1931-08-05 1931-08-05  -14029 1931-09-04 27789 da…

Reading in dates from a file

read_csv and the others can guess that you have dates, if you format them as year-month-day, like column 1 of this .csv:

date,status,dunno
2011-08-03,hello,August 3 2011
2011-11-15,still here,November 15 2011
2012-02-01,goodbye,February 1 2012

Then read them in:

my_url <- "http://www.utsc.utoronto.ca/~butler/c32/mydates.csv"
ddd <- read_csv(my_url)

## Parsed with column specification:
## cols(
##   date = col_date(format = ""),
##   status = col_character(),
##   dunno = col_character()
## )

read_csv guessed that the 1st column is dates, but not the 3rd.

The data as read in

ddd

## # A tibble: 3 x 3
##   date       status     dunno
##
## 1 2011-08-03 hello      August 3 2011
## 2 2011-11-15 still here November 15 2011
## 3 2012-02-01 goodbye    February 1 2012

Dates in other formats

The preceding shows that dates should be stored as text in the format yyyy-mm-dd (the ISO standard). To deal with dates in other formats, use the package lubridate and convert.
For example, dates in US format with the month first:

tibble(usdates = c("05/27/2012", "01/03/2016", "12/31/2015")) %>%
  mutate(iso = mdy(usdates))

## # A tibble: 3 x 2
##   usdates    iso
##
## 1 05/27/2012 2012-05-27
## 2 01/03/2016 2016-01-03
## 3 12/31/2015 2015-12-31

Trying to read these as UK dates

tibble(usdates = c("05/27/2012", "01/03/2016", "12/31/2015")) %>%
  mutate(uk = dmy(usdates))

## Warning: 2 failed to parse.
## # A tibble: 3 x 2
##   usdates    uk
##
## 1 05/27/2012 NA
## 2 01/03/2016 2016-03-01
## 3 12/31/2015 NA

For UK-format dates with the month second, one of these dates is legit, but the other two make no sense.

Our data frame's last column

Back to this:

ddd

## # A tibble: 3 x 3
##   date       status     dunno
##
## 1 2011-08-03 hello      August 3 2011
## 2 2011-11-15 still here November 15 2011
## 3 2012-02-01 goodbye    February 1 2012

Month, day, year in that order.

so interpret as such

(ddd %>% mutate(date2 = mdy(dunno)) -> d4)

## # A tibble: 3 x 4
##   date       status     dunno            date2
##
## 1 2011-08-03 hello      August 3 2011    2011-08-03
## 2 2011-11-15 still here November 15 2011 2011-11-15
## 3 2012-02-01 goodbye    February 1 2012  2012-02-01

Are they really the same?

Column date2 was correctly converted from column dunno:

d4 %>% mutate(equal = identical(date, date2))

## # A tibble: 3 x 5
##   date       status    dunno        date2      equal
##
## 1 2011-08-03 hello     August 3 20… 2011-08-03 TRUE
## 2 2011-11-15 still he… November 15… 2011-11-15 TRUE
## 3 2012-02-01 goodbye   February 1 … 2012-02-01 TRUE

The two columns of dates are all the same.
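The same ambiguous string parsed both ways makes the US/UK difference clear. This is a small sketch of my own, not from the notes:

```r
library(lubridate)

# "01/03/2016" is ambiguous: Jan 3 (US order) or Mar 1 (UK order)?
amb <- "01/03/2016"

mdy(amb)  # month first: 2016-01-03
dmy(amb)  # day first:   2016-03-01
```

Neither parse fails here, which is exactly why you have to know which format the file uses before choosing the parser.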
Making dates from pieces

Starting from this file:

year month day
1970 1 1
2007 9 4
1940 4 15

my_url <- "http://www.utsc.utoronto.ca/~butler/c32/pieces.txt"
dates0 <- read_delim(my_url, " ")

## Parsed with column specification:
## cols(
##   year = col_double(),
##   month = col_double(),
##   day = col_double()
## )

Making some dates

dates0

## # A tibble: 3 x 3
##    year month   day
##
## 1  1970     1     1
## 2  2007     9     4
## 3  1940     4    15

dates0 %>%
  unite(dates, day, month, year) %>%
  mutate(d = dmy(dates)) -> newdates

The results

newdates

## # A tibble: 3 x 2
##   dates     d
##
## 1 1_1_1970  1970-01-01
## 2 4_9_2007  2007-09-04
## 3 15_4_1940 1940-04-15

unite glues things together with an underscore between them (if you don't specify anything else). Syntax: the first thing is the new column to be created; the other columns are what to make it out of. unite makes the original variable columns year, month, day disappear. The column dates is text, while d is a real date.
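As an alternative to unite-and-parse, lubridate also has make_date, which builds dates directly from the numeric pieces. A sketch of mine, using the same three dates typed in directly:

```r
library(tidyverse)
library(lubridate)

# Same three dates as in pieces.txt, entered by hand
dates0 <- tribble(
  ~year, ~month, ~day,
   1970,      1,    1,
   2007,      9,    4,
   1940,      4,   15
)

# make_date skips the intermediate text column that unite creates
dates0 %>% mutate(d = make_date(year, month, day))
```

One difference from the unite approach: the original year, month, day columns are kept.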
Extracting information from dates

newdates %>% mutate(
  mon = month(d),
  day = day(d),
  weekday = wday(d, label = T)
)

## # A tibble: 3 x 5
##   dates     d            mon   day weekday
##
## 1 1_1_1970  1970-01-01     1     1 Thu
## 2 4_9_2007  2007-09-04     9     4 Tue
## 3 15_4_1940 1940-04-15     4    15 Mon

Dates and times

Standard format for times is to put the time after the date: hours, minutes, seconds:

(dd <- tibble(text = c(
  "1970-01-01 07:50:01",
  "2007-09-04 15:30:00",
  "1940-04-15 06:45:10",
  "2016-02-10 12:26:40"
)))

## # A tibble: 4 x 1
##   text
##
## 1 1970-01-01 07:50:01
## 2 2007-09-04 15:30:00
## 3 1940-04-15 06:45:10
## 4 2016-02-10 12:26:40

Converting text to date-times

Then get a date-time from this text using ymd_hms:

dd %>% mutate(dt = ymd_hms(text))

## # A tibble: 4 x 2
##   text                dt
##
## 1 1970-01-01 07:50:01 1970-01-01 07:50:01
## 2 2007-09-04 15:30:00 2007-09-04 15:30:00
## 3 1940-04-15 06:45:10 1940-04-15 06:45:10
## 4 2016-02-10 12:26:40 2016-02-10 12:26:40

Timezones

Default timezone is "Universal Coordinated Time".
Change it via tz= and the name of a timezone:

dd %>% mutate(dt = ymd_hms(text, tz = "America/Toronto")) -> dd
dd %>% mutate(zone = tz(dt))

## # A tibble: 4 x 3
##   text              dt                  zone
##
## 1 1970-01-01 07:50… 1970-01-01 07:50:01 America/Tor…
## 2 2007-09-04 15:30… 2007-09-04 15:30:00 America/Tor…
## 3 1940-04-15 06:45… 1940-04-15 06:45:10 America/Tor…
## 4 2016-02-10 12:26… 2016-02-10 12:26:40 America/Tor…

Extracting time parts

As you would expect:

dd %>%
  select(-text) %>%
  mutate(
    h = hour(dt),
    sec = second(dt),
    min = minute(dt),
    zone = tz(dt)
  )

## # A tibble: 4 x 5
##   dt                      h   sec   min zone
##
## 1 1970-01-01 07:50:01     7     1    50 America/Tor…
## 2 2007-09-04 15:30:00    15     0    30 America/Tor…
## 3 1940-04-15 06:45:10     6    10    45 America/Tor…
## 4 2016-02-10 12:26:40    12    40    26 America/Tor…

Same times, but different time zone:

dd %>%
  select(dt) %>%
  mutate(oz = with_tz(dt, "Australia/Sydney"))

## # A tibble: 4 x 2
##   dt                  oz
##
## 1 1970-01-01 07:50:01 1970-01-01 22:50:01
## 2 2007-09-04 15:30:00 2007-09-05 05:30:00
## 3 1940-04-15 06:45:10 1940-04-15 21:45:10
## 4 2016-02-10 12:26:40 2016-02-11 04:26:40

In more detail:

## [1] "1970-01-01 22:50:01 AEST"
## [2] "2007-09-05 05:30:00 AEST"
## [3] "1940-04-15 21:45:10 AEST"
## [4] "2016-02-11 04:26:40 AEDT"

How long between date-times?

We may need to calculate the time between two events. For example, these are the dates and times that some patients were admitted to and discharged from a hospital:

admit,discharge
1981-12-10 22:00:00,1982-01-03 14:00:00
2014-03-07 14:00:00,2014-03-08 09:30:00
2016-08-31 21:00:00,2016-09-02 17:00:00

Do they get read in as date-times?
These ought to get read in and converted to date-times:

my_url <- "http://www.utsc.utoronto.ca/~butler/c32/hospital.csv"
stays <- read_csv(my_url)

## Parsed with column specification:
## cols(
##   admit = col_datetime(format = ""),
##   discharge = col_datetime(format = "")
## )

and so it proves.

Subtracting the date-times

In the obvious way, this gets us an answer:

stays %>% mutate(stay = discharge - admit)

## # A tibble: 3 x 3
##   admit               discharge           stay
##
## 1 1981-12-10 22:00:00 1982-01-03 14:00:00 568.0 hou…
## 2 2014-03-07 14:00:00 2014-03-08 09:30:00  19.5 hou…
## 3 2016-08-31 21:00:00 2016-09-02 17:00:00  44.0 hou…

Number of hours; hard to interpret.

Days

A fractional number of days would be better:

# stays %>%
#   mutate(stay_days = (discharge - admit) / ddays(1))
stays %>% mutate(
  stay_days = as.period(admit %--% discharge) / days(1))

## estimate only: convert to intervals for accuracy
## # A tibble: 3 x 3
##   admit               discharge           stay_days
##
## 1 1981-12-10 22:00:00 1982-01-03 14:00:00    23.7
## 2 2014-03-07 14:00:00 2014-03-08 09:30:00     0.812
## 3 2016-08-31 21:00:00 2016-09-02 17:00:00     1.83

Completed days

Pull out with day() etc., as for a date-time:

stays %>% mutate(
  stay = as.period(admit %--% discharge),
  stay_days = day(stay),
  stay_hours = hour(stay)
) %>% select(starts_with("stay"))

## # A tibble: 3 x 3
##   stay          stay_days stay_hours
##
## 1 23d 16H 0M 0S        23         16
## 2 19H 30M 0S            0         19
## 3 1d 20H 0M 0S          1         20

Comments

Date-times are stored internally as seconds-since-something, so that subtracting two of them will give, internally, a number of seconds.
Just subtracting the date-times is displayed as a time (in units that R chooses for us). Functions ddays(1), dminutes(1) etc. give the number of seconds in a day or a minute, so dividing by them gives (fractional) days, minutes etc. This works for things like days and minutes that always contain the same number of seconds, but not months or years. Better: convert to a "period", then divide by days(1), months(1) etc. These ideas are useful for calculating the time from a start point until an event happens (in this case, a patient being discharged from hospital).

Section 3: Review of (multiple) regression

Regression

Use regression when one variable is an outcome (response, y). See if/how the response depends on other variable(s), explanatory, x1, x2, …. Can have one or more than one explanatory variable, but always one response. Assumes a straight-line relationship between response and explanatory. Ask: is there a relationship between y and the x's, and if so, which ones? What does the relationship look like?

Packages

library(MASS) # for Box-Cox, later
library(tidyverse)
library(broom)

A regression with one x

13 children, measure average total sleep time (ATST, mins) and age (years) for each. See if ATST depends on age. Data in sleep.txt, ATST then age.
Read in data:

my_url <- "http://www.utsc.utoronto.ca/~butler/d29/sleep.txt"
sleep <- read_delim(my_url, " ")

## Parsed with column specification:
## cols(
##   atst = col_double(),
##   age = col_double()
## )

Check data

summary(sleep)

##       atst            age
##  Min.   :461.8   Min.   : 4.400
##  1st Qu.:491.1   1st Qu.: 7.200
##  Median :528.3   Median : 8.900
##  Mean   :519.3   Mean   : 9.058
##  3rd Qu.:532.5   3rd Qu.:11.100
##  Max.   :586.0   Max.   :14.000

Make a scatter plot of ATST (response) vs. age (explanatory) using the code below:

The scatterplot

ggplot(sleep, aes(x = age, y = atst)) + geom_point()

[Figure 1: scatterplot of atst against age]

Correlation

Measures how well a straight line fits the data:

with(sleep, cor(atst, age))

## [1] -0.9515469

1 is a perfect upward trend, −1 is a perfect downward trend, 0 is no trend. This one is close to a perfect downward trend. Can do correlations of all pairs of variables:

cor(sleep)

##            atst        age
## atst  1.0000000 -0.9515469
## age  -0.9515469  1.0000000

Lowess curve

Sometimes nice to guide the eye: is the trend straight, or not? Idea: lowess curve. "Locally weighted least squares", not affected by outliers, not constrained to be linear. Lowess is a guide: even if a straight line is appropriate, it may wiggle/bend a little. Looking for serious problems with linearity.
Add the lowess curve to the plot using geom_smooth:

Plot with lowess curve

ggplot(sleep, aes(x = age, y = atst)) + geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

[Figure 2: scatterplot of atst against age with lowess curve]

The regression

Scatterplot shows no obvious curve, and a pretty clear downward trend. So we can run the regression:

sleep.1 <- lm(atst ~ age, data = sleep)

The output

summary(sleep.1)

##
## Call:
## lm(formula = atst ~ age, data = sleep)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -23.011  -9.365   2.372   6.770  20.411
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  646.483     12.918   50.05 2.49e-14 ***
## age          -14.041      1.368  -10.26 5.70e-07 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.15 on 11 degrees of freedom
## Multiple R-squared: 0.9054, Adjusted R-squared: 0.8968
## F-statistic: 105.3 on 1 and 11 DF, p-value: 5.7e-07

Conclusions

The relationship appears to be a straight line, with a downward trend. The F-test for the model as a whole and the t-test for the slope (the same thing here) both confirm this (P-value 5.7 × 10^-7 = 0.00000057). The slope is −14, so a 1-year increase in age goes with a 14-minute decrease in ATST on average. R-squared is the correlation squared (when there is one x, anyway), between 0 and 1 (1 good, 0 bad). Here R-squared is 0.9054, pleasantly high.
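With one x, R-squared really is the correlation squared, and it is easy to check that directly. A sketch using small made-up numbers, not the sleep data:

```r
# Made-up data standing in for the sleep data
d <- data.frame(age  = c(4, 6, 8, 10, 12, 14),
                atst = c(590, 560, 530, 505, 480, 450))
fit <- lm(atst ~ age, data = d)

r   <- cor(d$atst, d$age)        # correlation between y and x
rsq <- summary(fit)$r.squared    # R-squared from the regression

c(r^2, rsq)                      # the two values agree
```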
Doing things with the regression output

Output from regression (and e.g. a t-test) is all right to look at, but hard to extract and re-use information from. Package broom extracts info from model output in a way that can be used in a pipe (later):

tidy(sleep.1)

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##
## 1 (Intercept)    646.      12.9       50.0 2.49e-14
## 2 age            -14.0      1.37     -10.3 5.70e- 7

also a one-line summary of the model:

glance(sleep.1)

## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df
##
## 1     0.905         0.897  13.2      105. 5.70e-7     2
## # … with 5 more variables: logLik , AIC ,
## #   BIC , deviance , df.residual

Broom part 2

sleep.1 %>% augment(sleep) %>% slice(1:8)

## # A tibble: 8 x 9
##    atst   age .fitted .se.fit .resid   .hat .sigma .cooksd
##
## 1  586    4.4    585.    7.34   1.30 0.312    13.8 0.00320
## 2  462.  14      450.    7.68  11.8  0.341    13.0 0.319
## 3  491.  10.1    505.    3.92 -13.6  0.0887   13.0 0.0568
## 4  565    6.7    552.    4.87  12.6  0.137    13.1 0.0844
## 5  462   11.5    485.    4.95 -23.0  0.141    11.3 0.294
## 6  532.   9.6    512.    3.72  20.4  0.0801   12.0 0.114
## 7  478.  12.4    472.    5.85   5.23 0.198    13.7 0.0243
## 8  515.   8.9    522.    3.65  -6.32 0.0772   13.6 0.0105
## # … with 1 more variable: .std.resid

Useful for plotting residuals against an x-variable.

CI for mean response and prediction intervals

Once a useful regression exists, use it for prediction. To get a single number for the prediction at a given x, substitute into the regression equation, e.g. age 10: predicted ATST is 646.48 − 14.04(10) = 506 minutes.
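Substituting into the equation by hand matches what predict gives. Again a sketch on made-up numbers, not the actual sleep regression:

```r
# Made-up data standing in for the sleep data
d <- data.frame(age  = c(4, 6, 8, 10, 12, 14),
                atst = c(590, 560, 530, 505, 480, 450))
fit <- lm(atst ~ age, data = d)

# intercept + slope * 10, done by hand from the coefficients
by_hand <- coef(fit)[1] + coef(fit)[2] * 10
# the same prediction via predict()
by_predict <- predict(fit, data.frame(age = 10))

c(by_hand, by_predict)  # identical values
```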
To express the uncertainty of this prediction: the CI for mean response expresses uncertainty about the mean ATST for all children aged 10, based on the data. The prediction interval expresses uncertainty about the predicted ATST for a new child aged 10 whose ATST is not known. More uncertain. Also do the above for a child aged 5.

Intervals

Make a new data frame with these values for age:

my.age <- c(10, 5)
ages.new <- tibble(age = my.age)
ages.new

## # A tibble: 2 x 1
##     age
##
## 1    10
## 2     5

Feed into predict:

pc <- predict(sleep.1, ages.new, interval = "c")
pp <- predict(sleep.1, ages.new, interval = "p")

The intervals

Confidence intervals for mean response:

cbind(ages.new, pc)

##   age      fit      lwr      upr
## 1  10 506.0729 497.5574 514.5883
## 2   5 576.2781 561.6578 590.8984

Prediction intervals for new response:

cbind(ages.new, pp)

##   age      fit      lwr      upr
## 1  10 506.0729 475.8982 536.2475
## 2   5 576.2781 543.8474 608.7088

Comments

Age 10 is closer to the centre of the data, so both intervals are narrower than those for age 5. Prediction intervals are bigger than the CI for the mean (additional uncertainty). Technical note: the output from predict is an R matrix, not a data frame, so Tidyverse bind_cols does not work. Use base R cbind.
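If you want to stay in the tidyverse anyway, one workaround (my sketch, not the notes' approach) is to turn the matrix into a tibble first, after which bind_cols works:

```r
library(tidyverse)

# Made-up data standing in for the sleep data
d <- tibble(age  = c(4, 6, 8, 10, 12, 14),
            atst = c(590, 560, 530, 505, 480, 450))
fit <- lm(atst ~ age, data = d)
ages.new <- tibble(age = c(10, 5))

pc <- predict(fit, ages.new, interval = "c")  # an R matrix

# as_tibble() on the matrix gives columns fit, lwr, upr;
# then bind_cols behaves as usual
ages.new %>% bind_cols(as_tibble(pc))
```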
That grey envelope

Marks the confidence interval for the mean response for all x:

ggplot(sleep, aes(x = age, y = atst)) + geom_point() +
  geom_smooth(method = "lm") +
  scale_y_continuous(breaks = seq(420, 600, 20))

[Figure 3: scatterplot of atst against age with regression line and confidence band]

Diagnostics

How to tell whether a straight-line regression is appropriate? Before: check the scatterplot for a straight trend. After: plot the residuals (observed minus predicted response) against the predicted values. Aim: a plot with no pattern.

Residual plot

Not much pattern here: regression appropriate.

ggplot(sleep.1, aes(x = .fitted, y = .resid)) + geom_point()

[Figure 4: residuals against fitted values for sleep.1]

An inappropriate regression

Different data:

my_url <- "http://www.utsc.utoronto.ca/~butler/d29/curvy.txt"
curvy <- read_delim(my_url, " ")

## Parsed with column specification:
## cols(
##   xx = col_double(),
##   yy = col_double()
## )

Scatterplot

ggplot(curvy, aes(x = xx, y = yy)) + geom_point()

[Figure 5: scatterplot of yy against xx]

Regression line, anyway

curvy.1 <- lm(yy ~ xx, data = curvy)
summary(curvy.1)

##
## Call:
## lm(formula = yy ~ xx, data = curvy)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -3.582 -2.204  0.000  1.514  3.509
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   7.5818     1.5616   4.855  0.00126 **
## xx            0.9818     0.2925   3.356  0.00998 **
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.657 on 8 degrees of freedom
## Multiple R-squared: 0.5848, Adjusted R-squared: 0.5329
## F-statistic: 11.27 on 1 and 8 DF, p-value: 0.009984

Residual plot

ggplot(curvy.1, aes(x = .fitted, y = .resid)) + geom_point()

[Figure 6: residuals against fitted values for curvy.1]

No good: fixing it up

The residual plot has a curve: middle residuals positive, high and low ones negative. Bad. Fitting a curve would be better. Try this:

curvy.2 <- lm(yy ~ xx + I(xx^2), data = curvy)

Adding an xx-squared term, to allow for a curve.
Another way to do the same thing: specify how the model changes:

curvy.2a <- update(curvy.1, . ~ . + I(xx^2))

Regression 2

tidy(curvy.2)

## # A tibble: 3 x 5
##   term        estimate std.error statistic   p.value
##
## 1 (Intercept)    3.9      0.773       5.04 0.00149
## 2 xx             3.74     0.400       9.36 0.0000331
## 3 I(xx^2)       -0.307    0.0428     -7.17 0.000182

glance(curvy.2)

## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df
##
## 1     0.950         0.936 0.983      66.8 2.75e-5     3
## # … with 5 more variables: logLik , AIC ,
## #   BIC , deviance , df.residual

Comments

The xx-squared term is definitely significant (P-value 0.000182), so we need this curve to describe the relationship. Adding the squared term has made R-squared go up from 0.5848 to 0.9502: a great improvement. This is a definite curve!

The residual plot now

No problems any more:

ggplot(curvy.2, aes(x = .fitted, y = .resid)) + geom_point()

[Figure 7: residuals against fitted values for curvy.2]

Another way to handle curves

Above, we saw that changing x (adding x^2) was a way of handling curved relationships. Another way: change y (a transformation). Can guess how to change y, or there might be theory. Example: the relationship y = a b^x (exponential growth): take logs to get ln y = ln a + (ln b) x. Taking logs has made the relationship linear (ln y as response). Or, estimate the transformation, using the Box-Cox method.
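To see why logs work for exponential growth, here is a sketch of mine with exactly exponential made-up data, y = a b^x. On the log scale, lm recovers ln a as the intercept and ln b as the slope:

```r
a <- 2
b <- 1.5
x <- 0:7
y <- a * b^x               # exact exponential growth, no noise

fit <- lm(log(y) ~ x)      # a straight line on the log scale
coef(fit)                  # intercept = log(a), slope = log(b)
```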
Box-Cox

Install package MASS via install.packages("MASS") (only need to do this once). Every R session in which you want to use something in MASS, type library(MASS).

Some made-up data

my_url <- "http://www.utsc.utoronto.ca/~butler/d29/madeup.csv"
madeup <- read_csv(my_url)
madeup

## # A tibble: 8 x 3
##     row     x     y
##
## 1     1     0  17.9
## 2     2     1  33.6
## 3     3     2  82.7
## 4     4     3  31.2
## 5     5     4 177.
## 6     6     5 359.
## 7     7     6 469.
## 8     8     7 583.

Seems to be faster-than-linear growth, maybe exponential growth.

Scatterplot: faster than linear growth

ggplot(madeup, aes(x = x, y = y)) + geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

[Figure 8: scatterplot of y against x with lowess curve]

Running Box-Cox

library(MASS) first. Feed boxcox a model formula with a squiggle in it, such as you would use for lm. Output: a graph (next page):

boxcox(y ~ x, data = madeup)

The Box-Cox output

[Figure 9: log-likelihood plotted against λ, with dotted lines marking the best value and the 95% CI]

Comments

λ (lambda) is the power by which you should transform y to get the relationship straight (or straighter). Power 0 is "take logs". The middle dotted line marks the best single value of λ (here about 0.1). The outer dotted lines mark a 95% CI for λ, here −0.3 to 0.7, approx. (Rather uncertain about the best transformation.)
Any power transformation within the CI is supported by the data. In this case, log (λ = 0) and square root (λ = 0.5) are good, but no transformation (λ = 1) is not. Pick a "round-number" value of λ like 2, 1, 0.5, 0, −0.5, −1. Here 0 and 0.5 are good values to pick.

Did the transformation straighten things?

Plot the transformed y against x. Here, log:

ggplot(madeup, aes(x = x, y = log(y))) + geom_point() +
  geom_smooth()

[Figure 10: log(y) against x with lowess curve]

Looks much straighter.

Regression with transformed y

madeup.1 <- lm(log(y) ~ x, data = madeup)
glance(madeup.1)

## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df
##
## 1     0.883         0.864 0.501      45.3 5.24e-4     2
## # … with 5 more variables: logLik , AIC ,
## #   BIC , deviance , df.residual

tidy(madeup.1)

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##
## 1 (Intercept)    2.91     0.323       8.99 0.000106
## 2 x              0.520    0.0773      6.73 0.000524

R-squared is now decently high.

Multiple regression

What if more than one x? Extra issues: Now one intercept and a slope for each x: how to interpret? Which x-variables actually help to predict y? Different interpretations of the "global" F-test and the individual t-tests. R-squared is no longer the correlation squared, but is still interpreted as "higher better". In the lm line, add the extra x's after the ~. Interpretation is not so easy (and other problems can occur).
Multiple regression example

Study of women and visits to health professionals, and how the number of visits might be related to other variables:
timedrs: number of visits to health professionals (over the course of the study)
phyheal: number of physical health problems
menheal: number of mental health problems
stress: result of a questionnaire about number and type of life changes
timedrs is the response; the others are explanatory.

The data

my_url <- "http://www.utsc.utoronto.ca/~butler/d29/regressx.txt"
visits <- read_delim(my_url, " ")
## Parsed with column specification:
## cols(
##   subjno = col_double(),
##   timedrs = col_double(),
##   phyheal = col_double(),
##   menheal = col_double(),
##   stress = col_double()
## )

Check data

visits
## # A tibble: 465 x 5
##    subjno timedrs phyheal menheal stress
##  1      1       1       5       8    265
##  2      2       3       4       6    415
##  3      3       0       3       4     92
##  4      4      13       2       2    241
##  5      5      15       3       6     86
##  6      6       3       5       5    247
##  7      7       2       5       6     13
##  8      8       0       4       5     12
##  9      9       7       5       4    269
## 10     10       4       3       9    391
## # … with 455 more rows

Fit multiple regression

visits.1 <- lm(timedrs ~ phyheal + menheal + stress, data = visits)
glance(visits.1)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic  p.value    df
## 1     0.219         0.214  9.71      43.0 1.56e-24     4
## # … with 5 more variables: logLik, AIC, BIC, deviance, df.residual

The slopes

Model as a whole strongly significant even though R-squared is not very big (lots of data).
At least one of the x's predicts timedrs.
tidy(visits.1)
## # A tibble: 4 x 5
##   term        estimate std.error statistic  p.value
## 1 (Intercept) -3.70      1.12      -3.30   1.06e- 3
## 2 phyheal      1.79      0.221      8.08   5.60e-15
## 3 menheal     -0.00967   0.129     -0.0749 9.40e- 1
## 4 stress       0.0136    0.00361    3.77   1.85e- 4

The physical health and stress variables definitely help to predict the number of visits, but with those in the model we don't need menheal.
However, look at the prediction of timedrs from menheal by itself:

Just menheal

visits.2 <- lm(timedrs ~ menheal, data = visits)
glance(visits.2)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df
## 1    0.0653        0.0633  10.6      32.4 2.28e-8     2
## # … with 5 more variables: logLik, AIC, BIC, deviance, df.residual
tidy(visits.2)
## # A tibble: 2 x 5
##   term        estimate std.error statistic      p.value
## 1 (Intercept)    3.82      0.870      4.38 0.0000144
## 2 menheal        0.667     0.117      5.69 0.0000000228

menheal by itself

menheal by itself does significantly help to predict timedrs.
But the R-squared is much less (6.5% vs. 22%).
So the other two variables do a better job of prediction.
With those variables in the regression (phyheal and stress), we don't need menheal as well.

Investigating via correlation

Leave out the first column (subjno):
visits %>% select(-subjno) %>% cor()
##           timedrs   phyheal   menheal    stress
## timedrs 1.0000000 0.4395293 0.2555703 0.2865951
## phyheal 0.4395293 1.0000000 0.5049464 0.3055517
## menheal 0.2555703 0.5049464 1.0000000 0.3697911
## stress  0.2865951 0.3055517 0.3697911 1.0000000

phyheal is most strongly correlated with timedrs.
Not much to choose between the other two.
But menheal has a higher correlation with phyheal, so it has less to add to the prediction than stress does.
Goes to show that things are more complicated in multiple regression.

Residual plot (from timedrs on all x's)

ggplot(visits.1, aes(x = .fitted, y = .resid)) + geom_point()

[Figure 11: residuals against fitted values]

Comment

Apparently random. But…

Normal quantile plot of residuals

ggplot(visits.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()

[Figure 12: normal quantile plot of residuals, bending well above the line at the upper end]

Absolute residuals

Is there a trend in the size of the residuals (fan-out)?
Plot the absolute value of the residual against the fitted value (graph next page):
g <- ggplot(visits.1, aes(x = .fitted, y = abs(.resid))) + geom_point() + geom_smooth()

The plot

[Figure 13: absolute residuals against fitted values, with an upward-trending smooth]

Comments

On the normal quantile plot:
the highest (most positive) residuals are way too high
the distribution of residuals is skewed to the right (not normal at all)
On the plot of absolute residuals:
the size of the residuals is getting bigger as the fitted values increase
predictions are getting more variable as the fitted values increase
that is, predictions are getting less accurate as the fitted values increase, but predictions should be equally accurate all the way along.
Both indicate problems with the regression, of the kind that a transformation of the response often fixes: that is, predict a function of the response timedrs instead of timedrs itself.

Box-Cox transformations

Taking the log of timedrs and having it work would be a lucky guess.
How to find a good transformation? Box-Cox again.
Extra problem: some of the timedrs values are 0, but Box-Cox expects all values to be positive. Note the response for boxcox:
boxcox(timedrs + 1 ~ phyheal + menheal + stress, data = visits)

Try 1

[Figure 14: Box-Cox log-likelihood against λ over (−2, 2), with the maximum just below 0]

Comments on try 1

Best λ: just less than zero. Hard to see the scale. Focus on λ in (−0.3, 0.1):
my.lambda <- seq(-0.3, 0.1, 0.01)
my.lambda
##  [1] -0.30 -0.29 -0.28 -0.27 -0.26 -0.25 -0.24 -0.23 -0.22
## [10] -0.21 -0.20 -0.19 -0.18 -0.17 -0.16 -0.15 -0.14 -0.13
## [19] -0.12 -0.11 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04
## [28] -0.03 -0.02 -0.01  0.00  0.01  0.02  0.03  0.04  0.05
## [37]  0.06  0.07  0.08  0.09  0.10

Try 2

boxcox(timedrs + 1 ~ phyheal + menheal + stress,
  lambda = my.lambda, data = visits
)

[Figure 15: zoomed Box-Cox plot over (−0.3, 0.1), maximum about −0.07, 95% CI roughly (−0.14, 0.01)]

Comments

Best λ: just about −0.07.
CI for λ: about (−0.14, 0.01).
Only nearby round number: λ = 0, the log transformation.

Fixing the problems

Try the regression again, with the transformed response instead of the original one.
Then check the residual plot to see that it is OK now.
visits.3 <- lm(log(timedrs + 1) ~ phyheal + menheal + stress,
  data = visits
)
timedrs + 1 because some timedrs values are 0, and we can't take the log of 0.
You won't usually need to worry about this, but when the response could be zero or negative, fix that before transformation.
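A two-line sketch of why the +1 shift is needed before taking logs (the values below are illustrative, not the real data):

```r
# Sketch: log(0) is -Inf, so shift the response before transforming.
timedrs_vals <- c(0, 1, 3, 13)     # a few illustrative counts, including a zero
log(timedrs_vals)[1]               # -Inf: this would break the regression
shifted <- log(timedrs_vals + 1)   # finite for every value; 0 maps to 0
```

Any positive shift works in principle; +1 is the conventional choice for counts because it sends 0 to 0.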
Output

summary(visits.3)
##
## Call:
## lm(formula = log(timedrs + 1) ~ phyheal + menheal + stress, data = visits)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.95865 -0.44076 -0.02331  0.42304  2.36797
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3903862  0.0882908   4.422 1.22e-05 ***
## phyheal     0.2019361  0.0173624  11.631  < 2e-16 ***
## menheal     0.0071442  0.0101335   0.705    0.481
## stress      0.0013158  0.0002837   4.638 4.58e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7625 on 461 degrees of freedom
## Multiple R-squared: 0.3682, Adjusted R-squared: 0.3641
## F-statistic: 89.56 on 3 and 461 DF, p-value: < 2.2e-16

Comments

Model as a whole strongly significant again.
R-squared higher than before (37% vs. 22%), suggesting things are more linear now.
Same conclusion re menheal: it can come out of the regression.
Should look at residual plots (next pages). Have we fixed the problems?
Residuals against fitted values

ggplot(visits.3, aes(x = .fitted, y = .resid)) + geom_point()

[Figure 16: residuals against fitted values for the transformed-response model]

Normal quantile plot of residuals

ggplot(visits.3, aes(sample = .resid)) + stat_qq() + stat_qq_line()

[Figure 17: normal quantile plot of residuals, much closer to the line than before]

Absolute residuals against fitted

ggplot(visits.3, aes(x = .fitted, y = abs(.resid))) + geom_point() + geom_smooth()
[Figure 18: absolute residuals against fitted values, smooth nearly flat]

Comments

Residuals vs. fitted looks a lot more random.
Normal quantile plot looks a lot more normal (though still a little right-skewness).
Absolute residuals: not so much trend (though still some).
Not perfect, but much improved.

Testing more than one x at once

The t-tests test only whether one variable could be taken out of the regression you're looking at.
To test the significance of more than one variable at once, fit the model with and without the variables, then use anova to compare the fit of the models:
visits.5 <- lm(log(timedrs + 1) ~ phyheal + menheal + stress, data = visits)
visits.6 <- lm(log(timedrs + 1) ~ stress, data = visits)

Results of tests

anova(visits.6, visits.5)
## Analysis of Variance Table
##
## Model 1: log(timedrs + 1) ~ stress
## Model 2: log(timedrs + 1) ~ phyheal + menheal + stress
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
## 1    463 371.47
## 2    461 268.01  2    103.46 88.984 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The models don't fit equally well, so the bigger one fits better.
Or: "taking both variables out makes the fit worse, so don't do it".
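As a check on where anova's numbers come from, here is the F statistic recomputed by hand from the residual sums of squares and degrees of freedom in the table above:

```r
# Sketch: the F statistic for comparing nested models, from RSS and df.
rss_small <- 371.47; df_small <- 463   # model with stress only
rss_big   <- 268.01; df_big   <- 461   # model with all three x's

f_stat <- ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)
f_stat                                  # about 89, matching the anova table
p_value <- pf(f_stat, df_small - df_big, df_big, lower.tail = FALSE)
p_value                                 # tiny, as in the table
```

The numerator is the extra sum of squares per extra parameter; the denominator is the big model's residual mean square.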
Taking out those x's is a mistake. Or: putting them in is a good idea.

The punting data

Data set punting.txt contains 4 variables for 13 right-footed football kickers (punters): left leg and right leg strength (lbs), distance punted (ft), and another variable called "fred".
Predict punting distance from the other variables:

left right   punt fred
 170   170 162.50  171
 130   140 144.00  136
 170   180 174.50  174
 160   160 163.50  161
 150   170 192.00  159
 150   150 171.75  151
 180   170 162.00  174
 110   110 104.83  111
 110   120 105.67  114
 120   130 117.58  126
 140   120 140.25  129
 130   140 150.17  136
 150   160 165.17  154

Reading in

Separated by multiple spaces with the columns lined up:
my_url <- "http://www.utsc.utoronto.ca/~butler/d29/punting.txt"
punting <- read_table(my_url)
## Parsed with column specification:
## cols(
##   left = col_double(),
##   right = col_double(),
##   punt = col_double(),
##   fred = col_double()
## )

The data

punting
## # A tibble: 13 x 4
##     left right  punt  fred
##  1   170   170  162.   171
##  2   130   140  144    136
##  3   170   180  174.   174
##  4   160   160  164.   161
##  5   150   170  192    159
##  6   150   150  172.   151
##  7   180   170  162    174
##  8   110   110  105.   111
##  9   110   120  106.   114
## 10   120   130  118.   126
## 11   140   120  140.   129
## 12   130   140  150.   136
## 13   150   160  165.
154

Regression and output

punting.1 <- lm(punt ~ left + right + fred, data = punting)
glance(punting.1)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df
## 1     0.778         0.704  14.7      10.5 0.00267     4
## # … with 5 more variables: logLik, AIC, BIC, deviance, df.residual
tidy(punting.1)
## # A tibble: 4 x 5
##   term        estimate std.error statistic p.value
## 1 (Intercept)   -4.69      29.1     -0.161   0.876
## 2 left           0.268      2.11     0.127   0.902
## 3 right          1.05       2.15     0.490   0.636
## 4 fred          -0.267      4.23    -0.0632  0.951

Comments

Overall regression strongly significant, R-squared high.
None of the x's significant! Why?
The t-tests only say that you could take any one of the x's out without damaging the fit; it doesn't matter which one.
Explanation: look at the correlations.

The correlations

cor(punting)
##            left     right      punt      fred
## left  1.0000000 0.8957224 0.8117368 0.9722632
## right 0.8957224 1.0000000 0.8805469 0.9728784
## punt  0.8117368 0.8805469 1.0000000 0.8679507
## fred  0.9722632 0.9728784 0.8679507 1.0000000

All the correlations are high: the x's with punt (good) and with each other (bad, or at least confusing).
What to do? We would probably do just as well to pick one variable, say right, since the kickers are right-footed.

Just right

punting.2 <- lm(punt ~ right, data = punting)
anova(punting.2, punting.1)
## Analysis of Variance Table
##
## Model 1: punt ~ right
## Model 2: punt ~ left + right + fred
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     11 1962.5
## 2      9 1938.2  2    24.263 0.0563 0.9456

No significant loss by dropping the other two variables.
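One standard way to put a number on the multicollinearity seen in that correlation matrix is the variance inflation factor (VIF). The sketch below computes a VIF by hand on simulated stand-in data with the same flavour (left and right leg strength strongly related); the variable names mimic the punting example but the values are invented.

```r
# Sketch: variance inflation factor, computed by hand on simulated data.
set.seed(1)
left  <- round(rnorm(13, 145, 20))
right <- left + round(rnorm(13, 0, 8))   # force a strong correlation with left
fred  <- round((left + right) / 2)       # nearly a combination of the others

# VIF for 'left' = 1 / (1 - R^2), where R^2 comes from regressing
# left on the other explanatory variables
r2  <- summary(lm(left ~ right + fred))$r.squared
vif <- 1 / (1 - r2)
vif   # much larger than 1: left is nearly determined by the other x's
```

A VIF near 1 means a predictor is unrelated to the others; large values (often flagged above 5 or 10) mean its slope's standard error is badly inflated, which is exactly why none of the punting t-tests came out significant.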
Comparing R-squareds

summary(punting.1)$r.squared
## [1] 0.7781401
summary(punting.2)$r.squared
## [1] 0.7753629

Basically no difference. In the smaller regression (over), right is significant:

Regression results

tidy(punting.2)
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
## 1 (Intercept)    -3.69    25.3      -0.146 0.886
## 2 right           1.04     0.169     6.16  0.0000709

But…

Maybe we got the form of the relationship with left wrong.
Check: plot the residuals from the previous regression (without left) against left.
The residuals here are "punting distance adjusted for right leg strength".
If there is some kind of relationship with left, we should include it in the model.
Plot of residuals against an original variable: augment from broom.

Augmenting punting.2

punting.2 %>% augment(punting) -> punting.2.aug
punting.2.aug %>% slice(1:8)
## # A tibble: 8 x 11
##    left right  punt  fred .fitted .se.fit .resid   .hat
## 1   170   170  162.   171    174.    5.29 -11.1  0.157
## 2   130   140  144    136    142.    3.93   1.72 0.0864
## 3   170   180  174.   174    184.    6.60  -9.49 0.244
## 4   160   160  164.   161    163.    4.25   0.366 0.101
## 5   150   170  192    159    174.    5.29  18.4  0.157
## 6   150   150  172.   151    153.    3.73  19.0  0.0778
## 7   180   170  162    174    174.    5.29 -11.6  0.157
## 8   110   110  105.   111    111.    7.38  -6.17 0.305
## # … with 3 more variables: .sigma, .cooksd, .std.resid

Residuals against left

ggplot(punting.2.aug, aes(x = left, y = .resid)) + geom_point()
[Figure 19: residuals from punting.2 against left, showing a curved (up-then-down) pattern]

Comments

There is a curved relationship with left.
We should add left-squared to the regression (and therefore put left back in when we do that):
punting.3 <- lm(punt ~ left + I(left^2) + right, data = punting)

Regression with left-squared

summary(punting.3)
##
## Call:
## lm(formula = punt ~ left + I(left^2) + right, data = punting)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -11.3777  -5.3599   0.0459   4.5088  13.2669
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.623e+02  9.902e+01  -4.669  0.00117 **
## left         6.888e+00  1.462e+00   4.710  0.00110 **
## I(left^2)   -2.302e-02  4.927e-03  -4.672  0.00117 **
## right        7.396e-01  2.292e-01   3.227  0.01038 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.931 on 9 degrees of freedom
## Multiple R-squared: 0.9352, Adjusted R-squared: 0.9136
## F-statistic: 43.3 on 3 and 9 DF, p-value: 1.13e-05

Comments

This was definitely a good idea (R-squared has clearly increased).
We would never have seen it without plotting the residuals from punting.2 (without left) against left.
The negative slope for left-squared means that increased left-leg strength only increases punting distance up to a point: beyond that, it decreases again.
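A note on the I() in I(left^2): inside a model formula, ^ has a special meaning (interaction crossing), so a literal squared term has to be insulated with I(). A small sketch on simulated data with a genuine curved relationship (values invented for illustration):

```r
# Sketch: fitting a quadratic term with I() in an lm formula.
set.seed(2)
x <- seq(100, 180, length.out = 20)
y <- -400 + 7 * x - 0.02 * x^2 + rnorm(20, 0, 3)   # true curvature -0.02

fit <- lm(y ~ x + I(x^2))   # I() protects ^ from its formula meaning
coef(fit)[["I(x^2)"]]       # estimated curvature, negative as built in
```

Writing y ~ x + x^2 without I() would silently fit just y ~ x, which is an easy mistake to make.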
Section 4: Logistic regression (ordinal/nominal response)

Logistic regression

When the response variable is measured or counted, regression can work well.
But what if the response is yes/no, lived/died, success/failure?
Model the probability of success.
A probability must be between 0 and 1; we need a method that ensures this.
Logistic regression does this. In R, it is a generalized linear model with binomial "family":
glm(y ~ x, family = "binomial")
Begin with the simplest case.

Packages

library(MASS)
library(tidyverse)
library(broom)
library(nnet)

The rats, part 1

Rats given a dose of some poison; each either lives or dies:
dose status
0 lived
1 died
2 lived
3 lived
4 died
5 died

Read in:

my_url <- "http://www.utsc.utoronto.ca/~butler/d29/rat.txt"
rats <- read_delim(my_url, " ")
## Parsed with column specification:
## cols(
##   dose = col_double(),
##   status = col_character()
## )
glimpse(rats)
## Observations: 6
## Variables: 2
## $ dose   0, 1, 2, 3, 4, 5
## $ status "lived", "died", "lived", "lived", "died",…

Basic logistic regression

Make the response into a factor first:
rats2 <- rats %>% mutate(status = factor(status))
then fit the model:
status.1 <- glm(status ~ dose, family = "binomial", data = rats2)
Output

summary(status.1)
##
## Call:
## glm(formula = status ~ dose, family = "binomial", data = rats2)
##
## Deviance Residuals:
##       1        2        3        4        5        6
##  0.5835  -1.6254   1.0381   1.3234  -0.7880  -0.5835
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   1.6841     1.7979   0.937    0.349
## dose         -0.6736     0.6140  -1.097    0.273
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 8.3178 on 5 degrees of freedom
## Residual deviance: 6.7728 on 4 degrees of freedom
## AIC: 10.773
##
## Number of Fisher Scoring iterations: 4

Interpreting the output

As in (multiple) regression, we get tests of significance of the individual x's.
Here not significant (only 6 observations).
The "slope" for dose is negative, meaning that as the dose increases, the probability of the event modelled (survival) decreases.

Output part 2: predicted survival probs

p <- predict(status.1, type = "response")
cbind(rats, p)
##   dose status         p
## 1    0  lived 0.8434490
## 2    1   died 0.7331122
## 3    2  lived 0.5834187
## 4    3  lived 0.4165813
## 5    4   died 0.2668878
## 6    5   died 0.1565510

The rats, more

More realistic: more rats at each dose (say 10).
Listing each rat on its own line makes a big data file.
Use the format below: dose, number of survivals, number of deaths.
dose lived died
0 10 0
1 7 3
2 6 4
3 4 6
4 2 8
5 1 9
The 6 lines of data correspond to 60 actual rats.
Saved in rat2.txt.
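To see what predict(..., type = "response") is doing for the single-rat model status.1 above, here is a sketch that recomputes its fitted probabilities by hand from the summary coefficients, using base R's inverse logit plogis:

```r
# Sketch: predicted probabilities from the logistic coefficients by hand.
# Coefficients taken from the summary(status.1) output above.
intercept <- 1.6841
slope <- -0.6736
dose <- 0:5

# Inverse logit: plogis(z) = 1 / (1 + exp(-z))
p <- plogis(intercept + slope * dose)
round(p, 4)   # matches the predict(..., type = "response") column
```

The model is linear on the log-odds scale; plogis turns each fitted log-odds back into a probability, guaranteed to land between 0 and 1.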
These data

my_url <- "http://www.utsc.utoronto.ca/~butler/d29/rat2.txt"
rat2 <- read_delim(my_url, " ")
## Parsed with column specification:
## cols(
##   dose = col_double(),
##   lived = col_double(),
##   died = col_double()
## )
rat2
## # A tibble: 6 x 3
##    dose lived  died
## 1     0    10     0
## 2     1     7     3
## 3     2     6     4
## 4     3     4     6
## 5     4     2     8
## 6     5     1     9

Create response matrix

Each row contains multiple observations.
Create a two-column response: number of survivals in the first column, number of deaths in the second.
response <- with(rat2, cbind(lived, died))
response
##      lived died
## [1,]    10    0
## [2,]     7    3
## [3,]     6    4
## [4,]     4    6
## [5,]     2    8
## [6,]     1    9
The response is an R matrix:
class(response)
## [1] "matrix"

Fit logistic regression

using the response you just made:
rat2.1 <- glm(response ~ dose, family = "binomial", data = rat2)

Output

summary(rat2.1)
##
## Call:
## glm(formula = response ~ dose, family = "binomial", data = rat2)
##
## Deviance Residuals:
##       1        2        3        4        5        6
##  1.3421  -0.7916  -0.1034   0.1034   0.0389   0.1529
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   2.3619     0.6719   3.515 0.000439 ***
## dose         -0.9448     0.2351  -4.018 5.87e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 27.530 on 5 degrees of freedom
## Residual deviance:  2.474 on 4 degrees of freedom
## AIC: 18.94
##
## Number of Fisher Scoring iterations: 4

Predicted survival probs

p <- predict(rat2.1, type = "response")
cbind(rat2, p)
##   dose lived died         p
## 1    0    10    0 0.9138762
## 2    1     7    3 0.8048905
## 3    2     6    4 0.6159474
## 4    3     4    6 0.3840526
## 5    4     2    8 0.1951095
## 6    5     1    9 0.0861238

Comments

Significant effect of dose.
The effect of a larger dose is to decrease the survival probability (the "slope" is negative; also seen in the decreasing predictions).

Multiple logistic regression

With more than one x, this works much like multiple regression.
Example: a study of patients with blood poisoning severe enough to warrant surgery. Relate survival to other potential risk factors.
Variables, 1 = present, 0 = absent:
survival (death from sepsis = 1), the response
shock
malnutrition
alcoholism
age (as a numerical variable)
bowel infarction
See what relates to death.
Read in data

my_url <- "http://www.utsc.utoronto.ca/~butler/d29/sepsis.txt"
sepsis <- read_delim(my_url, " ")
## Parsed with column specification:
## cols(
##   death = col_double(),
##   shock = col_double(),
##   malnut = col_double(),
##   alcohol = col_double(),
##   age = col_double(),
##   bowelinf = col_double()
## )

The data

sepsis
## # A tibble: 106 x 6
##    death shock malnut alcohol   age bowelinf
##  1     0     0      0       0    56        0
##  2     0     0      0       0    80        0
##  3     0     0      0       0    61        0
##  4     0     0      0       0    26        0
##  5     0     0      0       0    53        0
##  6     1     0      1       0    87        0
##  7     0     0      0       0    21        0
##  8     1     0      0       1    69        0
##  9     0     0      0       0    57        0
## 10     0     0      1       0    76        0
## # … with 96 more rows

Fit model

sepsis.1 <- glm(death ~ shock + malnut + alcohol + age + bowelinf,
  family = "binomial", data = sepsis
)

Output part 1

tidy(sepsis.1)
## # A tibble: 6 x 5
##   term        estimate std.error statistic  p.value
## 1 (Intercept)  -9.75      2.54       -3.84 0.000124
## 2 shock         3.67      1.16        3.15 0.00161
## 3 malnut        1.22      0.728       1.67 0.0948
## 4 alcohol       3.35      0.982       3.42 0.000635
## 5 age           0.0922    0.0303      3.04 0.00237
## 6 bowelinf      2.80      1.16        2.40 0.0162

All the P-values are fairly small, but malnut is not significant: remove it.

Removing malnut

sepsis.2 <- update(sepsis.1, . ~ .
- malnut)
tidy(sepsis.2)
## # A tibble: 5 x 5
##   term        estimate std.error statistic  p.value
## 1 (Intercept)  -8.89      2.32       -3.84 0.000124
## 2 shock         3.70      1.10        3.35 0.000797
## 3 alcohol       3.19      0.917       3.47 0.000514
## 4 age           0.0898    0.0292      3.07 0.00211
## 5 bowelinf      2.39      1.07        2.23 0.0260

Everything significant now.

Comments

Most of the original x's helped predict death. Only malnut seemed not to add anything.
Removed malnut and tried again.
Everything remaining is significant (though bowelinf actually became less significant).
All the coefficients are positive, so having any of the risk factors (or being older) increases the risk of death.

Predictions from model without malnut

A few chosen at random:
sepsis.pred <- predict(sepsis.2, type = "response")
d <- data.frame(sepsis, sepsis.pred)
myrows <- c(4, 1, 2, 11, 32)
slice(d, myrows)
##   death shock malnut alcohol age bowelinf sepsis.pred
## 1     0     0      0       0  26        0 0.001415347
## 2     0     0      0       0  56        0 0.020552383
## 3     0     0      0       0  80        0 0.153416834
## 4     1     0      0       1  66        1 0.931290137
## 5     1     0      0       1  49        0 0.213000997

Comments

Survival chances are pretty good if there are no risk factors, though decreasing with age.
Having more than one risk factor reduces survival chances dramatically.
The model usually does a good job of predicting survival; sometimes a patient who died was predicted to survive.

Assessing proportionality of odds for age

An assumption we made is that the log-odds of survival depends linearly on age.
Hard to get your head around, but basic idea is that survival chances go continuously up (or down) with age, instead of (for example) going up and then down. In this case, seems reasonable, but should check: Lecture notes STAD29: Statistics for the Life and Social Sciences 146 / 802 Logistic regression (ordinal/nominal response) Residuals vs. age ggplot(augment(sepsis.2), aes(x = age, y = .resid)) + geom_point() [Figure 20: scatterplot of residuals (.resid) against age] Lecture notes STAD29: Statistics for the Life and Social Sciences 147 / 802 Logistic regression (ordinal/nominal response) Comments No apparent problems overall. Confusing “line” across: no risk factors, survived. Lecture notes STAD29: Statistics for the Life and Social Sciences 148 / 802 Logistic regression (ordinal/nominal response) Probability and odds For probability p, odds is p/(1 − p):

Prob.  Odds                      log-odds   in words
0.5    0.5/0.5 = 1/1 = 1.00       0.00      “even money”
0.1    0.1/0.9 = 1/9 = 0.11      −2.20      “9 to 1”
0.4    0.4/0.6 = 1/1.5 = 0.67    −0.41      “1.5 to 1”
0.8    0.8/0.2 = 4/1 = 4.00       1.39      “4 to 1 on”

Gamblers use odds: if you win at 9 to 1 odds, get original stake back plus 9 times the stake. Probability has to be between 0 and 1. Odds between 0 and infinity. Log-odds can be anything: any log-odds corresponds to valid probability. Lecture notes STAD29: Statistics for the Life and Social Sciences 149 / 802 Logistic regression (ordinal/nominal response) Odds ratio Suppose 90 of 100 men drank wine last week, but only 20 of 100 women. Prob of man drinking wine 90/100 = 0.9, woman 20/100 = 0.2. Odds of man drinking wine 0.9/0.1 = 9, woman 0.2/0.8 = 0.25. Ratio of odds is 9/0.25 = 36. Way of quantifying difference between men and women: “odds of drinking wine 36 times larger for males than females”.
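As a quick arithmetic check, the odds and odds ratio for this wine example can be computed directly in R (nothing here beyond the numbers on the slide):

```r
p_men <- 0.9                            # prob. a man drank wine
p_women <- 0.2                          # prob. a woman drank wine
odds_men <- p_men / (1 - p_men)         # 0.9/0.1 = 9
odds_women <- p_women / (1 - p_women)   # 0.2/0.8 = 0.25
odds_men / odds_women                   # odds ratio: 36
```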
Lecture notes STAD29: Statistics for the Life and Social Sciences 150 / 802 Logistic regression (ordinal/nominal response) Sepsis data again Recall prediction of probability of death from risk factors: sepsis.2.tidy <- tidy(sepsis.2) sepsis.2.tidy ## # A tibble: 5 x 5 ## term estimate std.error statistic p.value ## ## 1 (Intercept) -8.89 2.32 -3.84 0.000124 ## 2 shock 3.70 1.10 3.35 0.000797 ## 3 alcohol 3.19 0.917 3.47 0.000514 ## 4 age 0.0898 0.0292 3.07 0.00211 ## 5 bowelinf 2.39 1.07 2.23 0.0260 Slopes in column estimate. Lecture notes STAD29: Statistics for the Life and Social Sciences 151 / 802 Logistic regression (ordinal/nominal response) Multiplying the odds Can interpret slopes by taking “exp” of them. We ignore intercept. sepsis.2.tidy %>% mutate(exp_coeff=exp(estimate)) %>% select(term, exp_coeff) ## # A tibble: 5 x 2 ## term exp_coeff ## ## 1 (Intercept) 0.000137 ## 2 shock 40.5 ## 3 alcohol 24.2 ## 4 age 1.09 ## 5 bowelinf 10.9 Lecture notes STAD29: Statistics for the Life and Social Sciences 152 / 802 Logistic regression (ordinal/nominal response) Interpretation ## # A tibble: 5 x 2 ## term exp_coeff ## ## 1 (Intercept) 0.000137 ## 2 shock 40.5 ## 3 alcohol 24.2 ## 4 age 1.09 ## 5 bowelinf 10.9 These say “how much do you multiply odds of death by for increase of 1 in corresponding risk factor?” Or, what is odds ratio for that factor being 1 (present) vs. 0 (absent)? Eg. being alcoholic vs. not increases odds of death by 24 times. One year older multiplies odds by about 1.1 times. Over 40 years, about 1.09^40 ≈ 31 times. Lecture notes STAD29: Statistics for the Life and Social Sciences 153 / 802 Logistic regression (ordinal/nominal response) Odds ratio and relative risk Relative risk is ratio of probabilities. Above: 90 of 100 men (0.9) drank wine, 20 of 100 women (0.2). Relative risk 0.9/0.2=4.5. (odds ratio was 36). When probabilities small, relative risk and odds ratio similar. Eg. prob of man having disease 0.02, woman 0.01.
Relative risk 0.02/0.01 = 2. Lecture notes STAD29: Statistics for the Life and Social Sciences 154 / 802 Logistic regression (ordinal/nominal response) Odds ratio vs. relative risk Odds for men and for women: (od1 <- 0.02 / 0.98) # men ## [1] 0.02040816 (od2 <- 0.01 / 0.99) # women ## [1] 0.01010101 Odds ratio od1 / od2 ## [1] 2.020408 Very close to relative risk of 2. Lecture notes STAD29: Statistics for the Life and Social Sciences 155 / 802 Logistic regression (ordinal/nominal response) More than 2 response categories With 2 response categories, model the probability of one, and prob of other is one minus that. So doesn’t matter which category you model. With more than 2 categories, have to think more carefully about the categories: are they ordered: you can put them in a natural order (like low, medium, high) nominal: ordering the categories doesn’t make sense (like red, green, blue). R handles both kinds of response; learn how. Lecture notes STAD29: Statistics for the Life and Social Sciences 156 / 802 Logistic regression (ordinal/nominal response) Ordinal response: the miners Model probability of being in given category or lower. Example: coal-miners often suffer disease pneumoconiosis. Likelihood of disease believed to be greater among miners who have worked longer. Severity of disease measured on categorical scale: none, moderate, severe.
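“Category or lower” can be written down precisely. In the parameterization used by MASS’s polr (fitted shortly), with one intercept ζ_j per boundary between adjacent categories and a single slope β:

```latex
\operatorname{logit} P(Y \le j) = \zeta_j - \beta x
```

The minus sign means that a positive β shifts probability toward the higher (here, more severe) categories as x increases.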
Lecture notes STAD29: Statistics for the Life and Social Sciences 157 / 802 Logistic regression (ordinal/nominal response) Miners data Data are frequencies:

Exposure  None  Moderate  Severe
    5.8     98         0       0
   15.0     51         2       1
   21.5     34         6       3
   27.5     35         5       8
   33.5     32        10       9
   39.5     23         7       8
   46.0     12         6      10
   51.5      4         2       5

Lecture notes STAD29: Statistics for the Life and Social Sciences 158 / 802 Logistic regression (ordinal/nominal response) Reading the data Data in aligned columns with more than one space between, so: my_url <- "http://www.utsc.utoronto.ca/~butler/d29/miners-tab.txt" freqs <- read_table(my_url) ## Parsed with column specification: ## cols( ## Exposure = col_double(), ## None = col_double(), ## Moderate = col_double(), ## Severe = col_double() ## ) Lecture notes STAD29: Statistics for the Life and Social Sciences 159 / 802 Logistic regression (ordinal/nominal response) The data freqs ## # A tibble: 8 x 4 ## Exposure None Moderate Severe ## ## 1 5.8 98 0 0 ## 2 15 51 2 1 ## 3 21.5 34 6 3 ## 4 27.5 35 5 8 ## 5 33.5 32 10 9 ## 6 39.5 23 7 8 ## 7 46 12 6 10 ## 8 51.5 4 2 5 Lecture notes STAD29: Statistics for the Life and Social Sciences 160 / 802 Logistic regression (ordinal/nominal response) Tidying and row proportions freqs %>% gather(Severity, Freq, None:Severe) %>% group_by(Exposure) %>% mutate(proportion = Freq / sum(Freq)) -> miners Lecture notes STAD29: Statistics for the Life and Social Sciences 161 / 802 Logistic regression (ordinal/nominal response) Result miners ## # A tibble: 24 x 4 ## # Groups: Exposure [8] ## Exposure Severity Freq proportion ## ## 1 5.8 None 98 1 ## 2 15 None 51 0.944 ## 3 21.5 None 34 0.791 ## 4 27.5 None 35 0.729 ## 5 33.5 None 32 0.627 ## 6 39.5 None 23 0.605 ## 7 46 None 12 0.429 ## 8 51.5 None 4 0.364 ## 9 5.8 Moderate 0 0 ## 10 15 Moderate 2 0.0370 ## # … with 14 more rows Lecture notes STAD29: Statistics for the Life and Social Sciences 162 / 802 Logistic regression (ordinal/nominal response) Plot proportions against exposure ggplot(miners, aes(x =
Exposure, y = proportion, colour = Severity)) + geom_point() + geom_smooth(se = F) [Plot: proportion against Exposure, coloured by Severity; legend appears in the order Moderate, None, Severe] Lecture notes STAD29: Statistics for the Life and Social Sciences 163 / 802 Logistic regression (ordinal/nominal response) Reminder of data setup miners ## # A tibble: 24 x 4 ## # Groups: Exposure [8] ## Exposure Severity Freq proportion ## ## 1 5.8 None 98 1 ## 2 15 None 51 0.944 ## 3 21.5 None 34 0.791 ## 4 27.5 None 35 0.729 ## 5 33.5 None 32 0.627 ## 6 39.5 None 23 0.605 ## 7 46 None 12 0.429 ## 8 51.5 None 4 0.364 ## 9 5.8 Moderate 0 0 ## 10 15 Moderate 2 0.0370 ## # … with 14 more rows Lecture notes STAD29: Statistics for the Life and Social Sciences 164 / 802 Logistic regression (ordinal/nominal response) Creating an ordered factor Problem: on plot, Severity categories in wrong order. In the data frame, categories in correct order. Package forcats (in tidyverse) has functions for creating factors to specifications.
fct_inorder takes levels in order they appear in data: miners %>% mutate(sev_ord = fct_inorder(Severity)) -> miners To check: levels(miners$sev_ord) ## [1] "None" "Moderate" "Severe" Lecture notes STAD29: Statistics for the Life and Social Sciences 165 / 802 Logistic regression (ordinal/nominal response) New data frame miners ## # A tibble: 24 x 5 ## # Groups: Exposure [8] ## Exposure Severity Freq proportion sev_ord ## ## 1 5.8 None 98 1 None ## 2 15 None 51 0.944 None ## 3 21.5 None 34 0.791 None ## 4 27.5 None 35 0.729 None ## 5 33.5 None 32 0.627 None ## 6 39.5 None 23 0.605 None ## 7 46 None 12 0.429 None ## 8 51.5 None 4 0.364 None ## 9 5.8 Moderate 0 0 Moderate ## 10 15 Moderate 2 0.0370 Moderate ## # … with 14 more rows Lecture notes STAD29: Statistics for the Life and Social Sciences 166 / 802 Logistic regression (ordinal/nominal response) Improved plot ggplot(miners, aes(x = Exposure, y = proportion, colour = sev_ord)) + geom_point() + geom_smooth(se = F) [Figure 21: proportion against Exposure, coloured by sev_ord; legend now in the correct order None, Moderate, Severe] Lecture notes STAD29: Statistics for the Life and Social Sciences 167 / 802 Logistic regression (ordinal/nominal response) Fitting ordered logistic model Use function polr from package MASS. Like glm. sev.1 <- polr(sev_ord ~ Exposure, weights = Freq, data = miners ) Lecture notes STAD29: Statistics for the Life and Social Sciences 168 / 802 Logistic regression (ordinal/nominal response) Output: not very illuminating summary(sev.1) ## ## Re-fitting to get Hessian ## Call: ## polr(formula = sev_ord ~ Exposure, data = miners, weights = Freq) ## ## Coefficients: ## Value Std. Error t value ## Exposure 0.0959 0.01194 8.034 ## ## Intercepts: ## Value Std.
Error t value ## None|Moderate 3.9558 0.4097 9.6558 ## Moderate|Severe 4.8690 0.4411 11.0383 ## ## Residual Deviance: 416.9188 ## AIC: 422.9188 Lecture notes STAD29: Statistics for the Life and Social Sciences 169 / 802 Logistic regression (ordinal/nominal response) Does exposure have an effect? Fit model without Exposure, and compare using anova. Note ~1 for model with just intercept: sev.0 <- polr(sev_ord ~ 1, weights = Freq, data = miners) anova(sev.0, sev.1) ## Likelihood ratio tests of ordinal regression models ## ## Response: sev_ord ## Model Resid. df Resid. Dev Test ## 1 1 369 505.1621 ## 2 Exposure 368 416.9188 1 vs 2 ## Df LR stat. Pr(Chi) ## 1 ## 2 1 88.24324 0 Exposure definitely has effect on severity of disease. Lecture notes STAD29: Statistics for the Life and Social Sciences 170 / 802 Logistic regression (ordinal/nominal response) Another way What (if anything) can we drop from model with exposure? drop1(sev.1, test = "Chisq") ## Single term deletions ## ## Model: ## sev_ord ~ Exposure ## Df AIC LRT Pr(>Chi) ## <none> 422.92 ## Exposure 1 509.16 88.243 < 2.2e-16 *** ## --- ## Signif. codes: ## 0 '***' 0.001 '**' 0.01 '*' 0.05 ## '.' 0.1 ' ' 1 Nothing. Exposure definitely has effect.
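The “LR stat.” in the anova output is just the difference in residual deviances between the two models, compared to a chi-squared distribution on the difference in degrees of freedom (1 here). A quick sketch, reusing the deviances printed above:

```r
dev0 <- 505.1621                 # residual deviance, sev_ord ~ 1
dev1 <- 416.9188                 # residual deviance, sev_ord ~ Exposure
lr_stat <- dev0 - dev1           # 88.2433, matching "LR stat." above
pchisq(lr_stat, df = 1, lower.tail = FALSE)   # P-value: essentially zero
```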
Lecture notes STAD29: Statistics for the Life and Social Sciences 171 / 802 Logistic regression (ordinal/nominal response) Predicted probabilities Make new data frame out of all the exposure values (from original data frame), and predict from that: sev.new <- tibble(Exposure = freqs$Exposure) pr <- predict(sev.1, sev.new, type = "p") miners.pred <- cbind(sev.new, pr) miners.pred ## Exposure None Moderate Severe ## 1 5.8 0.9676920 0.01908912 0.01321885 ## 2 15.0 0.9253445 0.04329931 0.03135614 ## 3 21.5 0.8692003 0.07385858 0.05694115 ## 4 27.5 0.7889290 0.11413004 0.09694093 ## 5 33.5 0.6776641 0.16207145 0.16026444 ## 6 39.5 0.5418105 0.20484198 0.25334756 ## 7 46.0 0.3879962 0.22441555 0.38758828 ## 8 51.5 0.2722543 0.21025011 0.51749563 Lecture notes STAD29: Statistics for the Life and Social Sciences 172 / 802 Logistic regression (ordinal/nominal response) Comments Model appears to match data: as exposure goes up, prob of None goes down, Severe goes up (sharply for high exposure). Like original data frame, this one nice to look at but not tidy. We want to make graph, so tidy it. Also want the severity values in right order. 
Usual gather, plus a bit: miners.pred %>% gather(Severity, probability, -Exposure) %>% mutate(sev_ord = fct_inorder(Severity)) -> preds Lecture notes STAD29: Statistics for the Life and Social Sciences 173 / 802 Logistic regression (ordinal/nominal response) Some of the gathered predictions preds %>% slice(1:15) ## Exposure Severity probability sev_ord ## 1 5.8 None 0.96769203 None ## 2 15.0 None 0.92534455 None ## 3 21.5 None 0.86920028 None ## 4 27.5 None 0.78892903 None ## 5 33.5 None 0.67766411 None ## 6 39.5 None 0.54181046 None ## 7 46.0 None 0.38799618 None ## 8 51.5 None 0.27225426 None ## 9 5.8 Moderate 0.01908912 Moderate ## 10 15.0 Moderate 0.04329931 Moderate ## 11 21.5 Moderate 0.07385858 Moderate ## 12 27.5 Moderate 0.11413004 Moderate ## 13 33.5 Moderate 0.16207145 Moderate ## 14 39.5 Moderate 0.20484198 Moderate ## 15 46.0 Moderate 0.22441555 Moderate Lecture notes STAD29: Statistics for the Life and Social Sciences 174 / 802 Logistic regression (ordinal/nominal response) Plotting predicted and observed proportions Plot: for the predicted probabilities, lines (shown) joining points (not shown); for the data, just the points. Unfamiliar process: data from two different data frames: g <- ggplot(preds, aes( x = Exposure, y = probability, colour = sev_ord )) + geom_line() + geom_point(data = miners, aes(y = proportion)) Idea: final geom_point uses data in miners rather than preds, y-variable for plot is proportion from that data frame, but x-coordinate is Exposure, as it was before, and colour is sev_ord as before. The final geom_point “inherits” from the first aes as needed.
Lecture notes STAD29: Statistics for the Life and Social Sciences 175 / 802 Logistic regression (ordinal/nominal response) The plot: data match model g [Figure 22: predicted probabilities as lines and observed proportions as points, against Exposure, coloured by sev_ord] Lecture notes STAD29: Statistics for the Life and Social Sciences 176 / 802 Logistic regression (ordinal/nominal response) Unordered responses With unordered (nominal) responses, can use generalized logit. Example: 735 people, record age and sex (male 0, female 1), which of 3 brands of some product preferred. Data in mlogit.csv separated by commas (so read_csv will work): my_url <- "http://www.utsc.utoronto.ca/~butler/d29/mlogit.csv" brandpref <- read_csv(my_url) ## Parsed with column specification: ## cols( ## brand = col_double(), ## sex = col_double(), ## age = col_double() ## ) Lecture notes STAD29: Statistics for the Life and Social Sciences 177 / 802 Logistic regression (ordinal/nominal response) The data brandpref ## # A tibble: 735 x 3 ## brand sex age ## ## 1 1 0 24 ## 2 1 0 26 ## 3 1 0 26 ## 4 1 1 27 ## 5 1 1 27 ## 6 3 1 27 ## 7 1 0 27 ## 8 1 0 27 ## 9 1 1 27 ## 10 1 0 27 ## # … with 725 more rows Lecture notes STAD29: Statistics for the Life and Social Sciences 178 / 802 Logistic regression (ordinal/nominal response) Bashing into shape, and fitting model sex and brand not meaningful as numbers, so turn into factors: brandpref <- brandpref %>% mutate(sex = factor(sex)) %>% mutate(brand = factor(brand)) We use multinom from package nnet. Works like polr. brands.1 <- multinom(brand ~ age + sex, data = brandpref) ## # weights: 12 (6 variable) ## initial value 807.480032 ## iter 10 value 702.976983 ## final value 702.970704 ## converged Lecture notes STAD29: Statistics for the Life and Social Sciences 179 / 802 Logistic regression (ordinal/nominal response) Can we drop anything?
Unfortunately drop1 seems not to work: drop1(brands.1, test = "Chisq", trace = 0) ## trying - age ## Error in if (trace) {: argument is not interpretable as logical so fall back on fitting model without what you want to test, and comparing using anova. Lecture notes STAD29: Statistics for the Life and Social Sciences 180 / 802 Logistic regression (ordinal/nominal response) Do age/sex help predict brand? 1/2 Fit models without each of age and sex: brands.2 <- multinom(brand ~ age, data = brandpref) ## # weights: 9 (4 variable) ## initial value 807.480032 ## iter 10 value 706.796323 ## iter 10 value 706.796322 ## final value 706.796322 ## converged brands.3 <- multinom(brand ~ sex, data = brandpref) ## # weights: 9 (4 variable) ## initial value 807.480032 ## final value 791.861266 ## converged Lecture notes STAD29: Statistics for the Life and Social Sciences 181 / 802 Logistic regression (ordinal/nominal response) Do age/sex help predict brand? 2/2 anova(brands.2, brands.1) ## Likelihood ratio tests of Multinomial Models ## ## Response: brand ## Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi) ## 1 age 1466 1413.593 ## 2 age + sex 1464 1405.941 1 vs 2 2 7.651236 0.02180495 anova(brands.3, brands.1) ## Likelihood ratio tests of Multinomial Models ## ## Response: brand ## Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi) ## 1 sex 1466 1583.723 ## 2 age + sex 1464 1405.941 1 vs 2 2 177.7811 0 Lecture notes STAD29: Statistics for the Life and Social Sciences 182 / 802 Logistic regression (ordinal/nominal response) Do age/sex help predict brand? 3/3 age definitely significant (second anova) sex seems significant also (first anova) Keep both. 
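As with polr, the LR statistics in these anova tables can be reproduced by hand from the residual deviances. For the test of sex (first anova), there are 2 degrees of freedom because sex adds one coefficient to each of the two logit equations:

```r
lr_stat <- 1413.593 - 1405.941                # 7.652, matching "LR stat."
pchisq(lr_stat, df = 2, lower.tail = FALSE)   # approx 0.0218, matching Pr(Chi)
```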
Lecture notes STAD29: Statistics for the Life and Social Sciences 183 / 802 Logistic regression (ordinal/nominal response) Another way to build model Start from model with everything and run step: step(brands.1, trace = 0) ## trying - age ## trying - sex ## Call: ## multinom(formula = brand ~ age + sex, data = brandpref) ## ## Coefficients: ## (Intercept) age sex1 ## 2 -11.77469 0.3682075 0.5238197 ## 3 -22.72141 0.6859087 0.4659488 ## ## Residual Deviance: 1405.941 ## AIC: 1417.941 Final model contains both age and sex so neither could be removed. Lecture notes STAD29: Statistics for the Life and Social Sciences 184 / 802 Logistic regression (ordinal/nominal response) Predictions: all possible combinations Create data frame with various age and sex: ages <- c(24, 28, 32, 35, 38) sexes <- factor(0:1) new <- crossing(age = ages, sex = sexes) new ## # A tibble: 10 x 2 ## age sex ## ## 1 24 0 ## 2 24 1 ## 3 28 0 ## 4 28 1 ## 5 32 0 ## 6 32 1 ## 7 35 0 ## 8 35 1 ## 9 38 0 ## 10 38 1 Lecture notes STAD29: Statistics for the Life and Social Sciences 185 / 802 Logistic regression (ordinal/nominal response) Making predictions p <- predict(brands.1, new, type = "probs") probs <- cbind(new, p) or p %>% as_tibble() %>% bind_cols(new) -> probs Lecture notes STAD29: Statistics for the Life and Social Sciences 186 / 802 Logistic regression (ordinal/nominal response) The predictions probs ## # A tibble: 10 x 5 ## `1` `2` `3` age sex ## ## 1 0.948 0.0502 0.00181 24 0 ## 2 0.915 0.0819 0.00279 24 1 ## 3 0.793 0.183 0.0236 28 0 ## 4 0.696 0.271 0.0329 28 1 ## 5 0.405 0.408 0.187 32 0 ## 6 0.291 0.495 0.214 32 1 ## 7 0.131 0.397 0.472 35 0 ## 8 0.0840 0.432 0.484 35 1 ## 9 0.0260 0.239 0.735 38 0 ## 10 0.0162 0.252 0.732 38 1 Young males (sex=0) prefer brand 1, but older males prefer brand 3. Females similar, but like brand 1 less and brand 2 more. 
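These predictions can be reproduced by hand from the coefficients in the step output above: brand 1 is the baseline, each other brand gets its own linear predictor, and the probabilities come from exponentiating and normalizing (the sex1 term drops out for a male, sex = 0). A sketch for a 24-year-old male:

```r
eta2 <- -11.77469 + 0.3682075 * 24   # log-odds of brand 2 vs. brand 1
eta3 <- -22.72141 + 0.6859087 * 24   # log-odds of brand 3 vs. brand 1
e <- exp(c(0, eta2, eta3))           # baseline brand 1 has linear predictor 0
probs <- e / sum(e)
probs   # approx 0.948 0.0502 0.0018, matching the age 24, sex 0 row
```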
Lecture notes STAD29: Statistics for the Life and Social Sciences 187 / 802 Logistic regression (ordinal/nominal response) Making a plot Plot fitted probability against age, distinguishing brand by colour and gender by plotting symbol. Also join points by lines, and distinguish lines by gender. I thought about facetting, but this seems to come out clearer. First need tidy data frame, by familiar process: probs %>% gather(brand, probability, -(age:sex)) -> probs.long Lecture notes STAD29: Statistics for the Life and Social Sciences 188 / 802 Logistic regression (ordinal/nominal response) The tidy data (random sample of rows) probs.long %>% sample_n(10) ## # A tibble: 10 x 4 ## age sex brand probability ##