Predicting Overall Popular Vote of the Liberal Party in the Next Federal Election in Canada

STA304 - Assignment 3

Jiajie Zou, Ruochen Zhao, Xinlong Lin

November 5, 2021

Contents
0.1 Introduction
0.2 Data
0.3 Methods
0.4 Results
0.5 Conclusions
Bibliography

0.1 Introduction

The next Canadian federal election will be held soon (Pammett and Dornan 2016). The outcome is of interest to all citizens and residents of Canada. The federal election is a countrywide election across 10 provinces and 3 territories to elect members to the House of Commons (Pammett and Dornan 2016). In this analysis we examined individual-level survey data and post-stratified census data to predict the overall popular vote of the Liberal Party of Canada (also known as the Liberals) in this election (Pammett and Dornan 2016). The Liberal Party of Canada is the oldest and longest-serving active federal political party in Canada (Jeffrey 2010). The party has dominated federal politics for much of Canada’s history (Clarkson 2014) and held power for almost 60 years of the 20th century (Clarkson 2014). The party supports the ideology of liberalism and generally sits at the centre to centre-left of the western political spectrum (Jeffrey 2010).

The research question is whether the Liberals will receive about the same share of the popular vote in the next federal election as they did in the last election, which was 33% (Raynauld, Turcotte, and Gillies 2021). Accordingly, our research hypothesis is that the Liberals will receive 33% of the popular vote in the next federal election. A party's popular vote is the total number of votes cast for that party (Raynauld, Turcotte, and Gillies 2021). The aim of this analysis is therefore to predict the percentage of total votes the Liberals will receive in the next federal election.

We chose multilevel regression with post-stratification (MRP) as our statistical method. The outcome variable of interest was whether a voter would vote for the Liberals, which is a binary outcome (Downes et al. 2018). We first fit a multivariable, multilevel logistic regression model to predict the outcome from a few demographic variables. We then post-stratified the selected sample on the variables in the logistic regression model, assigning individuals to cells defined by combinations of these variables. Next, we used the logistic regression model to predict the probability of voting for the Liberals in each cell. Finally, we combined the predicted probabilities of all cells to compute the Liberals’ overall popular vote. The survey dataset was the Canadian Election Study (CES) 2019 - Phone Survey, and the census dataset was the 2017 General Social Survey (GSS) on the Family (Canada 2020). We then compared the post-stratified prediction of the popular vote with the hypothesized value of 33%.

The Data section provides numerical, textual, and graphical descriptions of the census and survey datasets and the important variables in them. The Methods section covers the statistical methods and analysis techniques used in this study. The Results section presents and explains our analysis results.
The Conclusions section closes the study with a summary of key findings and a discussion of the overall study and analysis.

0.2 Data

Census data

The census dataset was retrieved from the 2017 General Social Survey (GSS) on the Family. The 2017 GSS, conducted from February 2, 2017 to November 30, 2017, is a sample survey with a cross-sectional design (Canada 2020). The target population comprised all non-institutionalized persons 15 years of age and older living in the 10 provinces of Canada (Canada 2020). The survey uses a sampling frame, created in 2013, that combines telephone numbers (landline and cellular) with Statistics Canada’s Address Register, and data are collected by telephone (Canada 2020).

The important role family plays in people’s lives cannot be disputed. Today’s families, however, must navigate changing marital, family, and professional trajectories. While our understanding of families in Canada has improved considerably over the past few years, the future of families remains a topic of great interest as families become more diverse. The GSS on the Family is intended to inform researchers about the different types and characteristics of families in Canada (Canada 2020). The survey collected a large amount of data on each respondent, along with related information about each member of the respondent’s household. The response rate was 52.4% (Canada 2020), which we considered sufficient for the sample to be broadly representative of the target population.

Survey data

The survey data were retrieved from the Canadian Election Study (CES) 2019 - Phone Survey (Stephenson et al. 2020). Data collection for this survey took place in 2 stages. During the 2019 Canadian federal election campaign, telephone interviews were conducted with Canadian citizens 18 years of age or older (Stephenson et al. 2020).
Respondents to this survey were contacted by phone and later interviewed (Stephenson et al. 2020). The survey asked about the respondent’s demographic characteristics, their perspectives on Canadian politics, their opinions of different political parties in Canada, their voting records, and which party they intended to vote for in the federal election (Stephenson et al. 2020).

Data cleaning

We processed the data for the variables we chose to use in the MRP analysis: age, sex, place of birth, marriage history, and province. We rounded age to the nearest integer in both the census and the survey datasets. We retained only respondents recorded as male or female in both datasets, since only one person did not identify as either. We used 2 categories for place of birth, born in Canada and born outside Canada, because the majority of respondents in both datasets were born in Canada. In both datasets, we categorized marriage history as ever married or never married (2 categories), which represents an individual’s marriage history concisely. Individuals who were separated, divorced, common-law, or widowed were treated as never married, since we treated marriage as a legal, official status. Both datasets contained the 10 provinces, so no cleaning was required for province. The response variable is whether the respondent would vote for the Liberals, so we coded it as a binary variable equal to “1” if the respondent would vote for the Liberals and “0” if not.

Below we provide a detailed description of the selected variables.

• Age is a numerical variable that records the age of the respondent.
• Sex is a binary variable that records whether the respondent is male or female.
• Place of birth is a binary variable that records whether the respondent was born in or outside of Canada.
• Marriage history is a binary variable that records whether the respondent was ever married.
• Province is a categorical variable with 10 levels, one for each of the 10 provinces of Canada.
• The response variable is a binary variable that equals “1” if the respondent would vote for the Liberals and “0” if not.

For modeling, we removed all variables not listed above from both datasets. We also retained only observations with no missing values in the above variables, because we wanted to avoid any bias that imputing missing values might introduce.

Data summaries

Table 1 shows the proportions of males and females in the survey and the census datasets. The proportion of males is 0.575 in the survey dataset and 0.456 in the census dataset. The proportion of females is 0.425 in the survey dataset and 0.544 in the census dataset. The sex composition of the survey dataset is noticeably different from that of the census dataset.

Table 1: Proportions of males and females in the survey and the census datasets

Dataset   Male proportion   Female proportion
Survey    0.575             0.425
Census    0.456             0.544

Table 2 shows the proportions of respondents born in and outside Canada in the survey and the census datasets. The proportion of people born in Canada is 0.858 in the survey dataset and 0.800 in the census dataset. The proportion of people born outside of Canada is 0.142 in the survey dataset and 0.200 in the census dataset. The difference in place of birth between the two datasets is small.

Table 2: Proportions of place of birth in the survey and the census datasets

Dataset   Born in Canada proportion   Born outside Canada proportion
Survey    0.858                       0.142
Census    0.800                       0.200

Table 3 shows the proportions of marriage history in the survey and the census datasets. The proportion of people who were ever married is 0.691 in the survey dataset and 0.697 in the census dataset. The proportion of people who were never married is 0.309 in the survey dataset and 0.303 in the census dataset.
The proportions of marriage history are nearly identical in the two datasets.

Table 3: Proportions of marriage history for the survey and the census datasets

Dataset   Ever married proportion   Never married proportion
Survey    0.691                     0.309
Census    0.697                     0.303

Figure 1 displays the age distributions in the survey and the census datasets. The age distributions are very similar, except that the survey dataset includes respondents over 80 years old while the census dataset does not. The age distributions were close to uniform, close to symmetric, and not multi-modal. Figure 2 presents the popular vote for the Liberals in the survey dataset. We observe that almost 25% of the respondents replied that they would vote for the Liberals.

Figure 1: Distribution of age in the census dataset (left) and the survey dataset (right). [Histograms; x-axis: Age, y-axis: Frequency.]

Figure 2: Distribution of the sample popular vote for Liberals in the survey dataset. [Bar chart; x-axis: No / Yes, y-axis: Percentage.]

0.3 Methods

0.3.1 Model Specifics

We employed a multilevel logistic regression model to predict our outcome: whether an individual would vote for the Liberals in the next federal election. Logistic regression is a regression model for categorical outcomes (Wright 1995). It is useful for predicting a binary outcome from a set of predictor variables (Wright 1995). A binary outcome has only two possible cases: either the event occurs (1) or it does not occur (0). Predictor variables are the variables that might affect the outcome (Wright 1995). Logistic regression is appropriate in our situation because the outcome is binary. The predictor variables in our model were age, sex, marriage history, and place of birth. Taken together, these variables represent the majority of the different groups and subgroups within the Canadian population (Hosmer, Lemeshow, and Sturdivant 2000).
Using combinations of these variables, we were able to capture the political opinion of these different groups and, more broadly, of the entire Canadian population. The model provides the log odds of the outcome. The odds of an event are the probability of the event occurring divided by the probability of the event not occurring. The regression coefficient of a predictor is the change in log odds associated with a one-unit increase in that predictor, holding the other predictors fixed (Hosmer, Lemeshow, and Sturdivant 2000). For post-stratification, we convert the estimated log odds from the logistic regression model into probabilities of voting for the Liberals.

Multilevel models are statistical models whose parameters vary at more than 1 level, most often an individual level and a group level (McCulloch and Neuhaus 2005). They are especially appropriate for research designs in which data on subjects are organized at more than 1 level. The units of analysis are often individuals (lower level) who are nested within groups (higher level) (Demidenko 2013). The random intercept model is the most widely used type of multilevel model (McCulloch and Neuhaus 2005). In a random intercept model, intercepts are assumed to vary across groups, so the predicted outcome for each individual depends on the intercept of the group the individual belongs to together with individual-level predictors (Demidenko 2013). In our analysis, respondents live in Canada and are naturally grouped by the province where they are located. Thus, each province was modeled with its own random intercept, shared by all individuals located in that province. Model summaries were examined to determine whether each predictor is statistically significant in predicting whether an individual is going to vote for the Liberals.
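
The relationship between probability, odds, and log odds described above, and the way a province-level intercept shifts every prediction within that province, can be sketched in a few lines. The analysis in this report was carried out in R; the Python below is only an illustration. The function names, the shared log-odds value, and the two province deviations are hypothetical placeholders, not estimates from this study.

```python
import math

def odds(p):
    """Odds of an event: P(event occurs) / P(event does not occur)."""
    return p / (1.0 - p)

def log_odds(p):
    """Log odds (logit) of a probability p, with 0 < p < 1."""
    return math.log(odds(p))

def inv_logit(x):
    """Convert log odds back into a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical values for illustration only (not from the fitted model).
fixed_part = -1.0                       # log odds from individual-level predictors
province_shift = {"A": 0.3, "B": -0.3}  # province-level intercept deviations

# The same individual profile placed in two provinces: only the intercept
# differs, so the predicted probability moves up or down with the province.
prob_by_province = {
    name: inv_logit(fixed_part + shift)
    for name, shift in province_shift.items()
}

for name, prob in prob_by_province.items():
    print(name, round(prob, 3))
```

Working on the log-odds scale keeps the model additive, while the inverse logit guarantees that every prediction lies strictly between 0 and 1.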

The equation of the multilevel logistic regression model is:

$$\log\left(\frac{p}{1-p}\right) = \beta_{0j} + \beta_1 X_{\text{male}} + \beta_2 X_{\text{born outside Canada}} + \beta_3 X_{\text{ever married}}$$

$$\beta_{0j} = r_{00} + r_{01} + W_j + \mu_{0j}$$

where

• $p$ is the probability of voting for the Liberals.
• $X_{\text{male}} = 1$ if the respondent is male; $= 0$ if the respondent is female.
• $X_{\text{born outside Canada}} = 1$ if the respondent was born outside Canada; $= 0$ if the respondent was born in Canada.
• $X_{\text{ever married}} = 1$ if the respondent was ever married; $= 0$ if the respondent was never married.
• $\beta_1$ is the difference in log odds of voting for the Liberals for males versus females.
• $\beta_2$ is the difference in log odds of voting for the Liberals for those born outside Canada versus those born in Canada.
• $\beta_3$ is the difference in log odds of voting for the Liberals for those who were ever married versus those who were never married.
• $\beta_{0j}$ is the random intercept for the $j$th province.
• $W_j = 1$ if the respondent was located in the $j$th province.
• $\mu_{0j}$ is the statistical noise in the random intercept of the $j$th province.

0.3.2 Post-Stratification

Post-stratification is a widely used method in sampling and survey analysis for integrating population distributions of variables with survey estimates (Buttice and Highton 2013). The basic technique splits the sample into cells according to combinations of different variables (each distinct combination forms a cell) and calculates a post-stratification estimate as a weighted combination of the cell estimates (Holt and Smith 1979). Commonly estimated quantities include means, proportions, and totals. When the cell estimates come from a multilevel regression model, as is often the case, the technique is called multilevel regression and post-stratification (MRP) (Buttice and Highton 2013). Post-stratification is appropriate when the distributions of particular variables in the sample do not resemble those in the underlying population (Holt and Smith 1979).
This is often the case when mapping a sub-national or smaller-scale survey to a nationwide or large-scale census, which is what we did in mapping the CES 2019 phone survey to the 2017 General Social Survey (our census data). We found large differences between the survey and the census datasets in the proportions of males and females. Additionally, the distributions of province and place of birth differ slightly. In the presence of such differences, post-stratification is a suitable method.

First, we divided individuals in the census dataset into cells defined by distinct combinations of age, sex, place of birth, marriage history, and province. Next, we predicted the probability of voting for the Liberals in each cell with our multilevel logistic regression model. Finally, we combined the estimated probabilities into one aggregate, population-wide probability of voting for the Liberals, analogous to the overall popular vote for the Liberals, using the formula:

$$\hat{y}^{PS} = \frac{\sum_j N_j \hat{y}_j}{\sum_j N_j}$$

where

• $N_j$ is the total number of individuals in the $j$th cell.
• $\hat{y}_j$ is the predicted probability of voting for the Liberals for the $j$th cell.

All analysis for this report was programmed using R version 4.1.1.

0.4 Results

Table 4 contains the multilevel logistic regression model summary for predicting whether an individual would vote for the Liberals. The table reports the regression coefficient estimates, their standard errors, and P-values. Age, place of birth, and marriage history are statistically significant in predicting whether an individual is going to vote for the Liberals. For each one-year increase in age, the log odds of voting for the Liberals increase by 0.010. This means that older individuals are more inclined to vote for the Liberals. Individuals who were born outside of Canada had 0.567 higher log odds of voting for the Liberals than individuals who were born in Canada.
This implies that individuals who were born outside Canada were more likely to vote for the Liberals than individuals who were born in Canada. The log odds of voting for the Liberals for individuals who were ever married were 0.249 lower than for individuals who were never married. This implies that individuals who were ever married were less likely to vote for the Liberals than individuals who were never married. These results do not surprise us, since the Liberals have been more supportive of the less wealthy (those who were never married were more likely to have a lower household income than those who were married), immigrants and refugees (those who were born outside Canada), and the elderly, by giving them better financial and healthcare support (Wilson 2011).

Table 4: Multilevel logit regression model summary.

               Estimate   Standard Error   P value
(Intercept)    -1.597     0.216            0.000
age             0.010     0.003            0.001
sexMale        -0.157     0.091            0.084
birthOutside    0.567     0.122            0.000
marriageYes    -0.249     0.110            0.024

The regression and post-stratification estimate of the overall popular vote for the Liberals in the next Canadian federal election is 0.258. This result is plausible for two reasons. First, the census dataset is much smaller than the voting population of Canada: about 66% of Canada’s 27 million registered voters voted in the 2019 federal election (Raynauld, Turcotte, and Gillies 2021). Second, indicating an intention to vote for the Liberals does not necessarily mean a respondent will actually vote for them; preferences can change.

The post-stratification estimate of the overall popular vote of the Liberals is 25.8%, much lower than the hypothesized value of 33.0%. Nevertheless, this estimate directly addresses the research question and meets the goal of the study. We aimed to predict the overall popular vote of the Liberals, and we obtained an estimate through MRP. We also addressed the hypothesis by comparing our estimate to the hypothesized value.
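
To make the coefficients in Table 4 and the post-stratification formula more concrete, here is a minimal Python sketch (the report itself used R). The fixed-effect values are copied from Table 4; everything else is an assumption for illustration: the function names are hypothetical, the province random-intercept deviation is set to 0 because the report does not list those values, and the three cells with their counts are invented and are not the census cells used in the analysis.

```python
import math

# Fixed-effect estimates copied from Table 4 (log-odds scale).
COEF = {
    "intercept": -1.597,
    "age": 0.010,
    "male": -0.157,
    "born_outside": 0.567,
    "ever_married": -0.249,
}

def inv_logit(x):
    """Convert log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_prob(age, male, born_outside, ever_married, province_effect=0.0):
    """Predicted probability of voting Liberal for one demographic profile.

    province_effect stands in for the province's random-intercept deviation;
    0.0 is an assumption because the report does not list these values.
    """
    eta = (COEF["intercept"]
           + COEF["age"] * age
           + COEF["male"] * male
           + COEF["born_outside"] * born_outside
           + COEF["ever_married"] * ever_married
           + province_effect)
    return inv_logit(eta)

def poststratify(cells):
    """Weighted average sum(N_j * y_hat_j) / sum(N_j) over (N_j, y_hat_j) pairs."""
    total = sum(n for n, _ in cells)
    return sum(n * y for n, y in cells) / total

# Exponentiating a coefficient turns a log-odds difference into an odds ratio.
odds_ratio_born_outside = math.exp(COEF["born_outside"])

# Hypothetical cells: (count N_j, (age, male, born_outside, ever_married)).
hypothetical_cells = [
    (120, (30, 1, 0, 0)),
    (200, (45, 0, 0, 1)),
    (80, (60, 0, 1, 1)),
]
cells = [(n, predict_prob(*profile)) for n, profile in hypothetical_cells]
estimate = poststratify(cells)

print(round(odds_ratio_born_outside, 3), round(estimate, 3))
```

Because the estimate is a count-weighted average, it always lies between the smallest and largest cell probabilities, and cells with more people pull it more strongly toward their own prediction.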
Overall, our results were useful for addressing the research question.

0.5 Conclusions

The research goal of this study was to predict the overall popular vote of the Liberals in the next Canadian federal election. Our hypothesis was that the Liberal Party of Canada would receive approximately the same popular vote percentage as in the 2019 federal election, which was 33%. We employed multilevel logistic regression and post-stratification using age, sex, place of birth, marriage history, and province to obtain a final estimate of the overall popular vote.

Our results show that the predicted overall popular vote for the Liberals is 25.8%. They indicate that older individuals are more inclined to vote for the Liberals. Individuals who were born outside Canada were more inclined to vote for the Liberals than individuals who were born in Canada. Individuals who were ever married were less inclined to vote for the Liberals than individuals who were never married. These results are not surprising, since the Liberals have been more supportive of these groups.

The post-stratification estimate of the overall popular vote of the Liberals is 25.8%, substantially lower than the hypothesized overall popular vote of 33.0%. This is likely due in part to the relatively small survey dataset used to fit the logistic regression model and to the census dataset not being much larger. For this reason, our estimate mainly reflects the percentage of people who indicated they wanted to vote for the Liberals in the CES 2019 phone survey, rather than the overall popular vote in the target voting population.

There were several limitations to our study. First, we had small sample sizes, which most likely led to bias and inaccuracy in the estimation process. We used data from only one small survey (CES 2019). Both the survey and the census datasets were vulnerable to non-response bias and non-sampling errors (Berg 2005).
Furthermore, we did not assess the predictive ability of our multilevel logistic regression model, because doing so is challenging given the model’s hierarchical structure and the statistical noise from the random intercept (McCulloch and Neuhaus 2005). Lastly, we are not confident that the census population in the 2017 GSS effectively represents the population of registered voters in Canada.

As next steps, we could collect and combine data from more surveys and censuses in the hope of enlarging our sample size to reduce bias and estimation imprecision. We would also look for a suitable metric, if circumstances permit, to assess the regression model’s predictive power and get a sense of how well the model predicts the outcome using the chosen variables. With more datasets, we could also find more variables to use in our MRP process, which could in turn improve the model's predictive power and the accuracy of the MRP estimate.

In this study, we predicted the overall popular vote for the Liberal Party of Canada in the next Canadian federal election with an MRP model. We predicted the overall popular vote to be 25.8%, lower than what the Liberals received in the last federal election. This result is still interesting and useful because it gives a rough estimate of what could happen.

Bibliography

Allaire, J. 2012. “RStudio: Integrated Development Environment for R.” Boston, MA 770: 394.
Auguie, Baptiste. 2017. gridExtra: Miscellaneous Functions for "Grid" Graphics. https://CRAN.R-project.org/package=gridExtra.
Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.
Berg, Nathan. 2005. “Non-Response Bias.”
Buttice, Matthew K, and Benjamin Highton. 2013. “How Does Multilevel Regression and Poststratification Perform with Conventional National Surveys?” Political Analysis, 449–67.
Canada, Statistics. 2020. “General Social Survey Cycle 31: Family, 2017.” Abacus Data Network. https://doi.org/11272.1/AB2/G3DUFG.
Clarkson, Stephen. 2014. The Big Red Machine: How the Liberal Party Dominates Canadian Politics. UBC Press.
Çetinkaya-Rundel, Mine, David Diez, Andrew Bray, Albert Y. Kim, Ben Baumer, Chester Ismay, Nick Paterno, and Christopher Barr. 2021. Openintro: Data Sets and Supplemental Functions from ’OpenIntro’ Textbooks and Labs. https://CRAN.R-project.org/package=openintro.
Demidenko, Eugene. 2013. Mixed Models: Theory and Applications with R. John Wiley & Sons.
Downes, Marnie, Lyle C Gurrin, Dallas R English, Jane Pirkis, Dianne Currier, Matthew J Spittal, and John B Carlin. 2018. “Multilevel Regression and Poststratification: A Modeling Approach to Estimating Population Quantities from Highly Selected Survey Samples.” American Journal of Epidemiology 187 (8): 1780–90.
Holt, David, and TM Fred Smith. 1979. “Post Stratification.” Journal of the Royal Statistical Society: Series A (General) 142 (1): 33–46.
Hosmer, David W., Stanley Lemeshow, and Rodney X. Sturdivant. 2000. Applied Logistic Regression. Wiley New York.
Jeffrey, Brooke. 2010. Divided Loyalties: The Liberal Party of Canada, 1984-2008. University of Toronto Press.
McCulloch, Charles E, and John M Neuhaus. 2005. “Generalized Linear Mixed Models.” Encyclopedia of Biostatistics 4.
Pammett, Jon H, and Christopher Dornan. 2016. The Canadian Federal Election of 2015. Dundurn.
R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Raynauld, Vincent, André Turcotte, and Jamie Gillies. 2021. “Introduction: The 2019 Canadian Federal Election.” In Political Marketing in the 2019 Canadian Federal Election, 1–10. Springer.
Stephenson, Laura B, Allison Harell, Daniel Rubenson, and Peter John Loewen. 2020. “2019 Canadian Election Study - Phone Survey.” Harvard Dataverse. https://doi.org/10.7910/DVN/8RHLG1.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain Francois, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, and Dana Seidel. 2020. Scales: Scale Functions for Visualization. https://CRAN.R-project.org/package=scales.
Wilson, Trevor. 2011. The Downfall of the Liberal Party, 1914-1935. Faber & Faber.
Wright, Raymond E. 1995. “Logistic Regression.”
Xie, Yihui, Joseph J Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. CRC Press.
Zhu, Hao. 2021. kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.