Predicting Overall Popular Vote of The Liberal Party in the Next
Federal Election in Canada.
STA304 - Assignment 3
Jiajie Zou, Ruochen Zhao, Xinlong Lin
November 5, 2021
Contents
0.1 Introduction
0.2 Data
0.3 Methods
0.4 Results
0.5 Conclusions
Bibliography
0.1 Introduction
The next Canadian federal election will be held soon (Pammett and Dornan 2016), and its outcome is of
interest to all citizens and residents of Canada. A federal election is a countrywide election across the
10 provinces and 3 territories to elect members of the federal government of Canada (Pammett and Dornan
2016). In this analysis we used individual-level survey data and post-stratified census data to predict the
overall popular vote of the Liberal Party of Canada (also known as the Liberals) in this election (Pammett
and Dornan 2016). The Liberal Party of Canada is the oldest and longest-serving active federal political
party in Canada (Jeffrey 2010). The party has dominated federal politics for much of Canada’s history
(Clarkson 2014) and held power for almost 60 years of the 20th century (Clarkson 2014). The party supports
the ideology of liberalism and generally sits at the centre to centre-left of the Western political spectrum
(Jeffrey 2010).
The research question is whether the Liberals will receive about the same share of the popular vote in the
next federal election as they did in the last election, which was 33% (Raynauld, Turcotte, and Gillies 2021).
Our research hypothesis is therefore that the Liberals will receive 33% of the popular vote in the next
federal election. The popular vote is the total number of votes cast for a political party (Raynauld,
Turcotte, and Gillies 2021); we report it as a percentage of all votes. The aim of this analysis is thus to
predict the percentage of total votes the Liberals will receive in the next federal election. We chose
multilevel regression with post-stratification (MRP) as the statistical method. The outcome variable of
interest is whether a voter would vote for the Liberals, which is a binary outcome (Downes et al. 2018).
We first fit a multivariable, multilevel logistic regression model to predict our outcome variable from a
few demographic variables. We then post-stratified on the variables in the logistic regression model by
assigning individuals in the census dataset to cells defined by combinations of these variables. Next, we
used the logistic regression model to predict the probability of voting for the Liberals in each cell.
Finally, we combined the predicted probabilities of all cells to compute the Liberals’ overall popular vote
and compared this post-stratified prediction with the hypothesized value of 33%. The survey dataset was the
Canadian Election Study (CES) 2019 - Phone Survey and the census dataset was the 2017 General Social Survey
(GSS) on the Family (Canada 2020).
The Data section provides numerical, textual, and graphical descriptions of the census and survey datasets
and the important variables in them. The Methods section covers the statistical methods and analysis
techniques used in this study. The Results section presents and explains our analysis results. The
Conclusions section summarizes the key findings and discusses the overall study and analysis.
0.2 Data
Census data
The census dataset was retrieved from the 2017 General Social Survey (GSS) on the Family. The 2017 GSS,
conducted from February 2, 2017 to November 30, 2017, is a sample survey with a cross-sectional design
(Canada 2020). The target population comprised all non-institutionalized persons over 15 years of age living
in the 10 provinces of Canada (Canada 2020). The survey uses a sampling frame, created in 2013, that
combines telephone numbers with Statistics Canada’s Address Register, and data were collected by
telephone (landline and cell) (Canada 2020).
Family plays an important role in people’s lives. Today’s families, however, must navigate changing
marital, family, and professional trajectories. While our understanding of families in Canada has improved
considerably over the past few years, the future of families remains a topic of great interest as families
become more diverse. The GSS on the Family is intended to inform researchers about the different types and
characteristics of families in Canada (Canada 2020).
The survey collected a large amount of data on each respondent, along with related information about each
member of the respondent’s household. The response rate was 52.4% (Canada 2020), which we considered
sufficient for the sample to represent the target population.
Survey data
The survey data were retrieved from the Canadian Election Study (CES) 2019 - Phone Survey (Stephenson
et al. 2020). Data collection for this survey had 2 stages. During the 2019 Canadian federal election
campaign, telephone interviews were conducted with Canadian citizens over 18 years old (Stephenson et al.
2020). Respondents were contacted by phone and later interviewed (Stephenson et al. 2020). The survey asked
about respondents’ demographic characteristics, their perspectives on Canadian politics, their opinions of
the different political parties in Canada, their voting records, and the party they intended to vote for in
the federal election (Stephenson et al. 2020).
Data cleaning
We processed the data for the variables we chose to use in the MRP analysis: age, sex, place of birth,
marriage history, and province. We rounded age to the nearest integer in both the census and the survey
datasets. We retained only males and females in both datasets, since only one person did not identify as
male or female. We used 2 categories for place of birth, born in Canada and born outside Canada, because the
majority of respondents in the two datasets were born in Canada. In both datasets, we categorized marriage
history as ever married or never married (2 categories), which concisely represents an individual’s
marriage history. Individuals who were separated, divorced, common-law, or widowed were treated as never
married, since we treated marriage as a legal, official status. Both datasets covered the same 10 provinces,
so province required no cleaning. The response variable is whether the respondent would vote for the
Liberals, which we coded as a binary variable equal to “1” if the respondent would vote for the Liberals
and “0” if not.
Below we describe the selected variables in detail.
• Age is a numerical variable that records the age of the respondent.
• Sex is a binary variable that records whether the respondent is male or female.
• Place of birth is a binary variable that records whether the respondent was born in or outside of Canada.
• Marriage history is a binary variable that records whether the respondent was ever married.
• Province is a categorical variable with 10 levels, one for each of the 10 provinces of Canada.
• The response variable is a binary variable equal to “1” if the respondent would vote for the Liberals
and “0” if not.
For modeling, we removed all variables not listed above from both datasets. We also retained only
observations with no missing values in the above variables, since we did not want to introduce bias
through imputation of missing values.
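The cleaning rules above can be sketched in code. The analysis in this report was done in R; the following is only a minimal Python illustration of the same rules. The record field names (`age`, `sex`, `birthplace`, `marital_status`, `province`, `vote`) and the category labels are hypothetical and do not correspond to the actual CES or GSS variable names.

```python
from typing import Optional

# Hypothetical raw labels treated as "never married"; the real CES/GSS codings differ.
NEVER_MARRIED_LABELS = {"single", "separated", "divorced", "common-law", "widowed"}


def clean_record(raw: dict) -> Optional[dict]:
    """Apply the cleaning rules to one respondent; return None if the record is dropped."""
    required = ("age", "sex", "birthplace", "marital_status", "province")
    # Drop observations with any missing value (no imputation).
    if any(raw.get(key) is None for key in required):
        return None
    # Keep only respondents recorded as male or female.
    if raw["sex"] not in ("male", "female"):
        return None
    cleaned = {
        # Nearest integer (note: Python rounds exact halves to the even integer).
        "age": round(raw["age"]),
        "male": int(raw["sex"] == "male"),                   # 1 = male, 0 = female
        "born_outside": int(raw["birthplace"] != "Canada"),  # 1 = born outside Canada
        # Separated, divorced, common-law and widowed count as never married.
        "ever_married": int(raw["marital_status"] not in NEVER_MARRIED_LABELS),
        "province": raw["province"],
    }
    # The response exists only in the survey data: 1 = would vote Liberal, 0 = otherwise.
    if "vote" in raw:
        if raw["vote"] is None:
            return None
        cleaned["vote_liberal"] = int(raw["vote"] == "Liberal")
    return cleaned


example = clean_record({
    "age": 41.6, "sex": "female", "birthplace": "India",
    "marital_status": "married", "province": "Ontario", "vote": "Liberal",
})
print(example)
```

Returning `None` for dropped records keeps the "complete cases only" rule explicit, so the caller can filter both datasets with the same function.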
Data summaries
Table 1 shows the proportions of males and females in the survey and the census datasets.
The proportion of males is 0.575 in the survey dataset and 0.456 in the census dataset. The proportion of
females is 0.425 in the survey dataset and 0.544 in the census dataset. The proportions of males and females
in the survey dataset are noticeably different from those in the census dataset.
Table 1: Proportions of males and females in the survey and the census datasets
Dataset    Male proportion    Female proportion
Survey     0.575              0.425
Census     0.456              0.544
Table 2 shows the proportions of respondents born in and outside Canada in the survey and the census
datasets.
The proportion of people born in Canada is 0.858 in the survey dataset and 0.800 in the census dataset. The
proportion of people born outside Canada is 0.142 in the survey dataset and 0.200 in the census dataset.
The difference in place of birth between the two datasets is small (about 0.06).
Table 2: Proportions of place of birth in the survey and the census datasets
Dataset    Born in Canada proportion    Born outside Canada proportion
Survey     0.858                        0.142
Census     0.800                        0.200
Table 3 shows the proportions of marriage history in the survey and the census datasets.
The proportion of people who were ever married is 0.691 in the survey dataset and 0.697 in the census
dataset. The proportion of people who were never married is 0.309 in the survey dataset and 0.303 in the
census dataset. The proportions of marriage history are nearly identical in the two datasets.
Table 3: Proportions of marriage history in the survey and the census datasets
Dataset    Ever married proportion    Never married proportion
Survey     0.691                      0.309
Census     0.697                      0.303
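To make the comparison across Tables 1 to 3 concrete, the following minimal Python sketch computes the gap between the survey and the census for each reported proportion. The numbers are copied from the tables; the dictionary layout and names are illustrative assumptions and not part of the original analysis, which was done in R.

```python
# Proportions reported in Tables 1 to 3, as (survey, census) pairs.
reported = {
    "male": (0.575, 0.456),
    "born_in_canada": (0.858, 0.800),
    "ever_married": (0.691, 0.697),
}

# Gap between survey and census in percentage points (survey minus census).
gaps = {
    name: round((survey - census) * 100, 1)
    for name, (survey, census) in reported.items()
}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f} percentage points")
```

The gap for sex is by far the largest of the three, and this is the kind of imbalance that post-stratification is meant to correct.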
Figure 1 displays the age distributions in the survey and the census datasets. The age distributions are
very similar, except that the survey dataset includes respondents over 80 years old whereas the census
dataset does not. The age distributions are close to uniform, close to symmetric, and not
multi-modal.
Figure 2 presents the popular vote for the Liberals in the survey dataset. We observe that almost 25% of the
respondents replied that they would vote for the Liberals.
[Figure 1 panels: x-axis Age, y-axis Frequency]
Figure 1: Distribution of age in the census dataset (left) and the survey dataset (right).
[Figure 2: x-axis No / Yes, y-axis Percentage (0% to 100%)]
Figure 2: Distribution of the sample popular vote for Liberals in the survey dataset.
0.3 Methods
0.3.1 Model Specifics
We employed a multilevel logistic regression model to predict our outcome: whether an individual would vote
for the Liberals in the next federal election. Logistic regression is a regression model used for
classification (Wright 1995). It is useful for predicting a binary outcome from a set of predictor
variables (Wright 1995). A binary outcome has only two possible cases: either the event occurs (1) or it
does not occur (0). Predictor variables are those variables that might affect the outcome (Wright 1995).
Logistic regression is appropriate in our situation because the outcome is binary. The predictor variables
in our model were age, sex, marriage history, and place of birth. Together, these variables represent the
majority of the different groups and subgroups within the Canadian population (Hosmer, Lemeshow, and
Sturdivant 2000). Using combinations of these variables, we were able to integrate the political opinions
of these different groups and, more broadly, of the entire Canadian population. The model provides the log
odds of the outcome. The odds of an event are the probability of the event occurring divided by the
probability of the event not occurring. The regression coefficient of a predictor quantifies the change in
log odds when the predictor increases by one unit (Hosmer, Lemeshow, and Sturdivant 2000). For
post-stratification, we convert the log odds estimated by the logistic regression model into probabilities
of voting for the Liberals.
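As a small illustration of the relationship between probability, odds, and log odds described above, the sketch below converts between them. It is a generic Python example written for illustration; the function names are assumptions, and the analysis itself was done in R.

```python
import math


def odds(p: float) -> float:
    """Odds of an event: P(event) / P(no event)."""
    return p / (1.0 - p)


def log_odds(p: float) -> float:
    """Log odds (logit) of a probability strictly between 0 and 1."""
    return math.log(odds(p))


def inverse_logit(eta: float) -> float:
    """Convert log odds back into a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


p = 0.25                      # e.g. a 25% chance of voting Liberal
eta = log_odds(p)             # log(0.25 / 0.75)
print(odds(p), eta, inverse_logit(eta))
```

The round trip through `log_odds` and `inverse_logit` returns the original probability, which is the conversion used when cell-level log odds are turned into probabilities for post-stratification.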
Multilevel models are statistical models whose parameters vary at more than 1 level, most often an
individual level and a group level (McCulloch and Neuhaus 2005). They are particularly appropriate for
research designs in which data on subjects are organized at more than 1 level. The units of analysis are
often individuals (lower level) who are nested within groups (higher level) (Demidenko 2013). The random
intercept model is the most widely used type of multilevel model (McCulloch and Neuhaus 2005). In a random
intercept model, intercepts are assumed to vary across groups, so the predicted outcome for each individual
depends on the intercept of the group the individual belongs to together with individual-level predictors
(Demidenko 2013). In our analysis, respondents are in Canada and naturally grouped by the province in which
they are located. Thus, each province was modeled to have its own random intercept that is shared by all
individuals located in that province.
Model summaries were examined to determine if each predictor is statistically significant in predicting
whether an individual is going to vote for Liberals.
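The equations of the multilevel logistic regression model are:
$$\log\left(\frac{p}{1-p}\right) = \beta_{0j} + \beta_1 X_{\text{male}} + \beta_2 X_{\text{born outside Canada}} + \beta_3 X_{\text{ever married}} + \beta_4 X_{\text{age}}$$
$$\beta_{0j} = r_{00} + r_{01} W_j + \mu_{0j}$$
where
• $p$ is the probability of voting for Liberals.
• $X_{\text{male}}$ = 1 if the respondent is male and = 0 if the respondent is female.
• $X_{\text{born outside Canada}}$ = 1 if the respondent was born outside Canada and = 0 if the respondent was
born in Canada.
• $X_{\text{ever married}}$ = 1 if the respondent was ever married and = 0 if the respondent was never married.
• $X_{\text{age}}$ is the age of the respondent in years (age is one of the predictors listed above and
reported in the model summary).
• $\beta_1$ is the difference in log odds of voting for Liberals for males versus females.
• $\beta_2$ is the difference in log odds of voting for Liberals for those born outside Canada versus those
born in Canada.
• $\beta_3$ is the difference in log odds of voting for Liberals for those who were ever married versus those
who were never married.
• $\beta_4$ is the change in log odds of voting for Liberals for a one-year increase in age.
• $\beta_{0j}$ is the random intercept for the jth province.
• $r_{00}$ is the overall intercept and $r_{01}$ is the coefficient of $W_j$.
• $W_j$ = 1 if the respondent was located in the jth province and = 0 otherwise.
• $\mu_{0j}$ is the statistical noise in the random intercept of the jth province.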
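To illustrate how a random intercept enters the prediction, the sketch below evaluates a model of this form for two respondents who differ only in province. It is a minimal Python illustration and not the fitted model: the province intercepts and slope values are made-up numbers chosen for the example, and the names `PROVINCE_INTERCEPTS`, `SLOPES`, and `predict_probability` are hypothetical.

```python
import math

# Made-up province-level intercepts; real values would come from the fitted model.
PROVINCE_INTERCEPTS = {"Province A": -1.0, "Province B": -2.0}

# Made-up slopes for the individual-level indicators.
SLOPES = {"male": -0.2, "born_outside": 0.5, "ever_married": -0.3}


def predict_probability(province: str, indicators: dict) -> float:
    """Probability of voting Liberal for one respondent under a random intercept model."""
    # Linear predictor on the log-odds scale: province intercept plus slopes times indicators.
    eta = PROVINCE_INTERCEPTS[province]
    for name, slope in SLOPES.items():
        eta += slope * indicators.get(name, 0)
    # The inverse logit maps log odds to a probability.
    return 1.0 / (1.0 + math.exp(-eta))


respondent = {"male": 1, "born_outside": 0, "ever_married": 1}
p_a = predict_probability("Province A", respondent)
p_b = predict_probability("Province B", respondent)
print(f"Province A: {p_a:.3f}, Province B: {p_b:.3f}")
```

The two respondents share every individual-level predictor, so the difference in predicted probability comes only from the province intercept.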
0.3.2 Post-Stratification
Post-stratification is a widely used method in sampling and survey analysis for integrating population
distributions of variables with survey estimates (Buttice and Highton 2013). The technique splits the
sample into cells according to combinations of different variables (each distinct combination forms a
cell) and calculates a post-stratification estimate as a weighted combination of the cell estimates (Holt
and Smith 1979). Commonly estimated quantities include means, proportions, and totals. When the cell
estimates come from a multilevel regression model, as is often done, the technique is called multilevel
regression and post-stratification (MRP) (Buttice and Highton 2013).
Post-stratification is appropriate when the distributions of particular variables in the sample do not
resemble those in the underlying population (Holt and Smith 1979). This is often the case when mapping a
smaller-scale survey to a nationwide or large-scale census, as we did in mapping the CES 2019 phone survey
to the 2017 General Social Survey (census). We found large differences in the proportions of males and
females between the survey and the census datasets. Additionally, the distributions of province and place
of birth differ slightly. In the presence of such differences, post-stratification is a suitable method.
First we divided individuals in the census dataset into different cells. The cells were created by distinct
combinations of age, sex, place of birth, marriage history, and province. We next predicted the probability
of voting for Liberals in each cell with our multilevel logistic regression model. At the end, we combined the
estimated probabilities into one aggregate, population-wide probability of voting for Liberals, analogous to
the overall popular vote for Liberals, using the formula:
$$\hat{y}^{PS} = \frac{\sum_j N_j \hat{y}_j}{\sum_j N_j}$$
where
• $N_j$ is the total number of individuals in the jth cell
• $\hat{y}_j$ is the predicted probability of voting for Liberals for the jth cell
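The post-stratification formula can be sketched in a few lines of code. The example below is a minimal Python illustration (the analysis in this report was done in R); the cell counts and predicted probabilities are invented for the example, and `poststratify` is a hypothetical helper name.

```python
from typing import List, Tuple


def poststratify(cells: List[Tuple[int, float]]) -> float:
    """Weighted average of cell predictions, weighted by cell size N_j."""
    total = sum(n for n, _ in cells)
    if total == 0:
        raise ValueError("cells must contain at least one individual")
    return sum(n * y_hat for n, y_hat in cells) / total


# Invented cells: (N_j, predicted probability of voting Liberal in cell j).
cells = [
    (400, 0.20),
    (350, 0.30),
    (250, 0.28),
]

estimate = poststratify(cells)
print(f"Post-stratified estimate: {estimate:.3f}")
```

Larger cells pull the estimate toward their predicted probability, which is how the census composition, rather than the survey composition, determines the final figure.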
All analysis for this report was programmed using R version 4.1.1.
0.4 Results
Table 4 contains the summary of the multilevel logistic regression model for predicting whether an
individual would vote for the Liberals. The table reports the regression coefficient estimates, their
standard errors, and P-values. Age, place of birth, and marriage history are statistically significant
predictors of whether an individual would vote for the Liberals at the 5% level, while sex is not
(P = 0.084). For a one-year increase in age, the log odds of voting for the Liberals increase by 0.010. This
means that older individuals are more inclined to vote for the Liberals. Individuals who were born outside
Canada had 0.567 higher log odds of voting for the Liberals than individuals who were born in Canada, which
implies that they were more likely to vote for the Liberals. The log odds of voting for the Liberals for
individuals who were ever married were 0.249 lower than for individuals who were never married, which
implies that ever-married individuals were less likely to vote for the Liberals. These results do not
surprise us, since the Liberals have been more supportive of the less wealthy (those who were never married
were more likely to have a lower household income than those who were married), immigrants and refugees
(those who were born outside Canada), and the elderly, giving them better financial and healthcare support
(Wilson 2011).
Table 4: Multilevel logistic regression model summary.
Term            Estimate    Standard Error    P value
(Intercept)     -1.597      0.216             0.000
age              0.010      0.003             0.001
sexMale         -0.157      0.091             0.084
birthOutside     0.567      0.122             0.000
marriageYes     -0.249      0.110             0.024
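To make the log-odds coefficients easier to interpret, the sketch below converts the Table 4 estimates into odds ratios and into a predicted probability for one example respondent. It is a minimal Python illustration (the analysis in this report was done in R). It uses the fixed-effect estimates only and sets the province random intercept to zero, which is an assumption made for the example; the respondent profile is hypothetical.

```python
import math

# Fixed-effect estimates copied from Table 4 (log-odds scale).
COEFS = {
    "intercept": -1.597,
    "age": 0.010,
    "male": -0.157,
    "born_outside": 0.567,
    "ever_married": -0.249,
}

# Odds ratios: exp(coefficient) is the multiplicative change in the odds.
odds_ratios = {name: math.exp(b) for name, b in COEFS.items() if name != "intercept"}


def predicted_probability(age: int, male: int, born_outside: int, ever_married: int) -> float:
    """Predicted probability of voting Liberal, ignoring the province random intercept."""
    eta = (
        COEFS["intercept"]
        + COEFS["age"] * age
        + COEFS["male"] * male
        + COEFS["born_outside"] * born_outside
        + COEFS["ever_married"] * ever_married
    )
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical respondent: 40-year-old woman, born in Canada, never married.
p_example = predicted_probability(age=40, male=0, born_outside=0, ever_married=0)

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.3f}")
print(f"Example respondent: {p_example:.3f}")
```

An odds ratio above 1 (born outside Canada, age) indicates higher odds of voting for the Liberals and an odds ratio below 1 (ever married, male) indicates lower odds, matching the signs of the estimates in the table.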
The multilevel regression and post-stratification estimate of the overall popular vote for the Liberals in
the next Canadian federal election is 0.258. This result is reasonable and not surprising. First, the
census dataset is much smaller than the voting population of Canada: about 66% of Canada’s 27 million
registered voters voted in the 2019 federal election (Raynauld, Turcotte, and Gillies 2021). In addition,
indicating that one would vote for the Liberals does not necessarily mean that one actually will;
preferences can change.
The post-stratification estimate of the overall popular vote of the Liberals is 25.8%, 7.2 percentage
points lower than the hypothesized value of 33.0%. Nevertheless, this estimate directly addresses the
research question and attains the goal of the study: we aimed to predict the overall popular vote of the
Liberals and obtained an estimate through MRP. We also addressed the hypothesis by comparing our estimate
with the hypothesized value. Overall, the results are useful as a first estimate of the Liberals’ popular
vote.
0.5 Conclusions
The research goal of this study was to predict the overall popular vote of Liberals in the next Canadian
federal election and our hypothesis was that the Liberal Party of Canada would get approximately the same
popular vote percentage as they did in the 2019 federal election, which was 33%. We employed multilevel
logistic regression and post-stratification using age, sex, place of birth, marriage history and province to get
a final estimate of the overall popular vote. Our results show that the predicted overall popular vote for
Liberals is 25.8%.
Our results indicate that older individuals are more inclined to vote for the Liberals. Individuals who
were born outside Canada were more inclined to vote for the Liberals than individuals who were born in
Canada. Individuals who were ever married were less inclined to vote for the Liberals than individuals who
were never married. These results are not surprising, since the Liberals have been more supportive of these
groups.
The post-stratification estimate of the overall popular vote of the Liberals is 25.8%, substantially lower
than the hypothesized overall popular vote of 33.0%. A likely reason is that we used a relatively small
survey dataset to construct the logistic regression model, and the census dataset itself is not much
bigger. As a result, our estimate largely reflects the percentage of people who indicated that they wanted
to vote for the Liberals in the CES 2019 phone survey rather than the overall popular vote in the target
voting population.
There were several limitations to our study. First, we had small sample sizes, which most likely led to
bias and inaccuracy in the estimation process; we used data from only one small survey (CES 2019). Second,
both the survey and the census datasets were vulnerable to non-response bias and non-sampling errors (Berg
2005). Furthermore, we did not assess the predictive ability of our multilevel logistic regression model,
because doing so is challenging given the model’s hierarchical structure and the statistical noise from
the random intercept (McCulloch and Neuhaus 2005). Lastly, we are not confident that the census population
in the 2017 GSS effectively represents the population of registered voters in Canada.
As next steps, we could collect and combine data from more surveys and censuses to enlarge our sample size
and reduce bias and estimation imprecision. We would also look for a suitable metric to assess the
regression model’s predictive power, to get a sense of how well the model predicts the outcome using the
chosen variables. With more datasets, we could also identify more variables to use in the MRP process,
which could in turn improve the model’s predictive power and the accuracy of the MRP estimate.
In this study, we predicted the overall popular vote for the Liberal Party of Canada in the next Canadian
federal election with an MRP model. We predicted the overall popular vote to be 25.8%, lower than what the
Liberals received in the last federal election. Nevertheless, this result is still interesting and useful
because it gives a rough estimate of what could happen.
Bibliography
Allaire, J. 2012. “RStudio: Integrated Development Environment for R.” Boston, MA 770: 394.
Auguie, Baptiste. 2017. gridExtra: Miscellaneous Functions for "Grid" Graphics.
https://CRAN.R-project.org/package=gridExtra.
Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models
Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.
Berg, Nathan. 2005. “Non-Response Bias.”
Buttice, Matthew K, and Benjamin Highton. 2013. “How Does Multilevel Regression and Poststratification
Perform with Conventional National Surveys?” Political Analysis, 449–67.
Canada, Statistics. 2020. “General Social Survey Cycle 31: Family, 2017.” Abacus Data Network.
https://doi.org/11272.1/AB2/G3DUFG.
Clarkson, Stephen. 2014. The Big Red Machine: How the Liberal Party Dominates Canadian Politics. UBC
press.
Çetinkaya-Rundel, Mine, David Diez, Andrew Bray, Albert Y. Kim, Ben Baumer, Chester Ismay, Nick
Paterno, and Christopher Barr. 2021. Openintro: Data Sets and Supplemental Functions from ’OpenIntro’
Textbooks and Labs. https://CRAN.R-project.org/package=openintro.
Demidenko, Eugene. 2013. Mixed Models: Theory and Applications with R. John Wiley & Sons.
Downes, Marnie, Lyle C Gurrin, Dallas R English, Jane Pirkis, Dianne Currier, Matthew J Spittal, and
John B Carlin. 2018. “Multilevel Regression and Poststratification: A Modeling Approach to Estimating
Population Quantities from Highly Selected Survey Samples.” American Journal of Epidemiology 187
(8): 1780–90.
Holt, David, and TM Fred Smith. 1979. “Post Stratification.” Journal of the Royal Statistical Society:
Series A (General) 142 (1): 33–46.
Hosmer, David W., Stanley Lemeshow, and Rodney X. Sturdivant. 2000. Applied Logistic Regression.
Wiley New York.
Jeffrey, Brooke. 2010. Divided Loyalties: The Liberal Party of Canada, 1984-2008. University of Toronto
Press.
McCulloch, Charles E, and John M Neuhaus. 2005. “Generalized Linear Mixed Models.” Encyclopedia of
Biostatistics 4.
Pammett, Jon H, and Christopher Dornan. 2016. The Canadian Federal Election of 2015. Dundurn.
R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R
Foundation for Statistical Computing. https://www.R-project.org/.
Raynauld, Vincent, André Turcotte, and Jamie Gillies. 2021. “Introduction: The 2019 Canadian Federal
Election.” In Political Marketing in the 2019 Canadian Federal Election, 1–10. Springer.
Stephenson, Laura B, Allison Harell, Daniel Rubenson, and Peter John Loewen. 2020. “2019 Canadian
Election Study - Phone Survey.” Harvard Dataverse. https://doi.org/10.7910/DVN/8RHLG1.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain
Francois, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software
4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, and Dana Seidel. 2020. Scales: Scale Functions for Visualization.
https://CRAN.R-project.org/package=scales.
Wilson, Trevor. 2011. The Downfall of the Liberal Party, 1914-1935. Faber & Faber.
Wright, Raymond E. 1995. “Logistic Regression.”
Xie, Yihui, Joseph J Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. CRC
Press.
Zhu, Hao. 2021. kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax.
https://CRAN.R-project.org/package=kableExtra.