Data Analysis for Investments
Professor Guofu Zhou1
Olin Business School
Washington University in St. Louis
St. Louis, MO 63130
E-mail: [email protected]
The lecture notes are in-depth optional readings for students
Course Use Only; All Rights Reserved (please do not distribute)
Current version: December, 2021
1 The lecture notes for Fin 532B, Data Analysis for Investments. ©2005 and 2021 by Guofu Zhou.
Contents
1 Properties and Models of Stock Returns 1
1.1 Multiple-period returns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Expected returns vs realized returns . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Mean, std, and confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Mode and median . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Skewness and kurtosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Univariate distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.6.1 Uniform distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.6.2 Normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6.3 Lognormal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.6.4 χ2-distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.6.5 t-distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6.6 A skewed normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.6.7 F -distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.7 Multivariate distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.7.1 Mean and variance of linear transformations . . . . . . . . . . . . . . . . . . . 21
1.7.2 Bivariate normal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.7.3 Multivariate normal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.7.4 Multivariate t . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.7.5 Wishart distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.8 Simple Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.8.1 Univariate linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.8.2 Multiple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
© Zhou, 2021 Page 1
1.8.3 Autocorrelations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
1.8.4 Time series models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2 Portfolio Choice 1: Mean-variance Theory 35
2.1 Ad hoc rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.1.1 Equal-weighting: 1/N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.1.2 Value-weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.1.3 Volatility-weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.1.4 Risk parity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.1.5 Global minimum-variance portfolio . . . . . . . . . . . . . . . . . . . . . . . . 41
2.2 MV Optimal portfolio: Riskfree asset case . . . . . . . . . . . . . . . . . . . . . . . . 45
2.2.1 One risky asset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.2.2 N = 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.2.3 Multiple risky assets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.2.4 Two-fund separation theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.2.5 Parameter estimation by sample moments . . . . . . . . . . . . . . . . . . . . 56
2.2.6 Practical implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.2.7 MV frontier and utility maximization . . . . . . . . . . . . . . . . . . . . . . 60
2.2.8 Alternative formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.2.9 Links to regression and machine learning . . . . . . . . . . . . . . . . . . . . 62
2.3 Tracking error minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.4 Information ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.5 How to outperform with alpha asset? . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.6 Fundamental Law of active portfolio management . . . . . . . . . . . . . . . . . . . 70
2.6.1 IR = IC√N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
2.6.2 A casino example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.6.3 A proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
2.7 MV Optimal portfolio: No rf case . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
2.7.1 Variance minimization given µp . . . . . . . . . . . . . . . . . . . . . . . . . . 75
2.7.2 Two-fund separation: No rf case . . . . . . . . . . . . . . . . . . . . . . . . . 79
2.7.3 Utility maximization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
2.7.4 Optimality of ad hoc rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
2.7.5 Links to linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3 Portfolio Choice 2: Constraints and Extensions 84
3.1 Practical constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.2 Quadratic programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.3 Asset allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.3.1 Stocks and bonds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.3.2 Multi-asset classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.4 Large set of individual stocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.5 Estimation risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.5.1 The plug-in rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.5.2 Errors in using a model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.5.3 Estimation errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.5.4 Analytical assessment∗ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.5.5 Correlation shrinkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.5.6 Combination of 1/N with plug-in . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.5.7 Backtesting: A comparison of rules . . . . . . . . . . . . . . . . . . . . . . . . 98
3.5.8 A Bayesian solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.6 Transaction costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.7 Model uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.7.1 Perturbation of the normal model . . . . . . . . . . . . . . . . . . . . . . . . 103
3.7.2 Model averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.8 Alternative objective functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.8.1 Kelly’s criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.8.2 Higher moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.8.3 Other utilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4 Simulation, Bootstrap and Shrinkage 110
4.1 Sampling from distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.1.1 Univariate case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.1.2 Bivariate case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.1.3 Cholesky decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.1.4 Singular value decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.2 Monte Carlo integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.2.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.2.2 VaR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.2.3 Option pricing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.3 Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
4.3.1 Estimating standard error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
4.3.2 Estimating confidence interval . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.3.3 Bootstrapping portfolio weights . . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.4 Shrinkage estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.4.1 Sample averages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.4.2 Mean shrinkage: Stein estimators . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.4.3 Covariance shrinkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
4.4.4 Use of correlation shrinkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.4.5 Eigenvalue adjustment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.4.6 Exponentially weighted moving averages . . . . . . . . . . . . . . . . . . . . . 133
4.4.7 GS covariance matrix estimator . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5 Factor Models 1: Known Factors 139
5.1 The CAPM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.1.1 Proof 1: preference assumption . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.1.2 Proof 2: return assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.1.3 Market model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.1.4 Some truths on Alpha . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.1.5 Claims of the CAPM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.1.6 GRS test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.1.7 CAPM and market efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.1.8 Fama-MacBeth 2-pass regressions . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.1.9 Stochastic discount factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.1.10 GMM test and others . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
5.2 Spanning tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.3 Fama-French 3- and 5-factor models . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
5.4 Additional factor models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
5.5 Non-traded factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
5.6 How to construct factors? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
5.6.1 Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
5.6.2 Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.6.3 Cross-section regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.6.4 Machine learning methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
5.6.5 Time series vs cross section . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
5.7 Uses of factor models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
5.7.1 Capital budgeting/Expected return estimation . . . . . . . . . . . . . . . . . 163
5.7.2 Smart beta and factor investing . . . . . . . . . . . . . . . . . . . . . . . . . . 164
5.7.3 Hedging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
5.7.4 Measuring performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
6 Factor Models 2: Unknown Factors 167
6.1 Latent factor model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.2 Principal components analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.2.1 Eigenvalue and eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.2.2 PCs: data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.2.3 PCs: random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
6.2.4 PCA factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
6.2.5 The theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.2.6 High-dimensional PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.3 Asymptotic PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.4 Covariance matrix estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.4.1 Invertibility problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.4.2 Factor-model based estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
6.5 Both explicit and latent factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
6.6 All-inclusive factor model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
6.6.1 Time series factor model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
6.6.2 Fundamental factor model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
6.6.3 All types of factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
6.7 Factor analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
7 Performance and Style 188
7.1 Performance measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
7.1.1 Alpha . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
7.1.2 Sharpe ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
7.1.3 Sortino ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
7.1.4 Information ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
7.1.5 Treynor ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
7.1.6 Treynor and Black appraisal ratio . . . . . . . . . . . . . . . . . . . . . . . . 191
7.1.7 Graham-Harvey volatility-matched return . . . . . . . . . . . . . . . . . . . . 191
7.1.8 Maximum drawdown and Calmar ratio . . . . . . . . . . . . . . . . . . . . . . 191
7.2 Sharpe ratio: further analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
7.2.1 Asymptotic standard error . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
7.2.2 Test the difference between two SRs . . . . . . . . . . . . . . . . . . . . . . . 193
7.3 Portfolio-based style analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
7.4 Return-based style analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
7.5 Hedge fund styles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
8 Anomalies and Behavioral Finance 196
8.1 Anomalies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
8.1.1 Size and January effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
8.1.2 The weekend effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
8.1.3 The value effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
8.1.4 The momentum effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
8.1.5 Closed-end fund puzzle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
8.1.6 Mutual fund persistence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
8.1.7 IPOs abnormal returns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
8.1.8 Technical analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
8.2 Are the anomalies real? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
8.3 Limits to arbitrage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
8.4 Behavioral finance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
9 Predictability 1: Time Series 204
9.1 Market efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
9.2 Random walk? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
9.3 Limits to predictability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
9.4 Predictive regressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
9.4.1 Basic model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
9.4.2 Out-of-sample performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
9.4.3 Statistical significance/tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
9.4.4 Economic significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
9.5 Forecasting with many predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
9.5.1 Forecast combination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
9.5.2 PCA or PCR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
9.5.3 sPCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
9.5.4 Partial least squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
9.5.5 PLS: m > 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
9.6 Common time-series predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
9.6.1 Macro economic variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
9.6.2 Technical variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
9.6.3 Investor sentiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
9.6.4 Investor attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
9.6.5 Short interest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
9.6.6 Corporate activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
9.6.7 Option market . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
9.6.8 Others . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
9.7 Mixed-frequency predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
9.8 Nowcasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
10 Machine Learning Tools 229
10.1 What is Machine Learning? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
10.2 Types of Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
10.2.1 Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
10.2.2 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
10.2.3 Reinforcement learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
10.3 A short literature review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
10.4 Why penalized regressions? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
10.4.1 Bias-variance tradeoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
10.4.2 Prediction error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
10.4.3 Problems with many regressors . . . . . . . . . . . . . . . . . . . . . . . . . . 233
10.5 LASSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
10.5.1 The idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
10.5.2 The code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
10.5.3 The theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
10.6 Cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
10.7 Ridge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
10.7.1 The idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
10.7.2 The code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
10.7.3 The theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
10.8 Enet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
10.9 C-LASSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
10.10 E-LASSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
10.11 Neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
10.11.1 No hidden layer: linear regression . . . . . . . . . . . . . . . . . . . . . . . . . 248
10.11.2 One hidden layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
10.11.3 Gradient descent: A search algorithm . . . . . . . . . . . . . . . . . . . . . . 251
10.11.4 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
10.12 Genetic algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
10.13 Ensemble Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
10.13.1 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
10.13.2 Stacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
10.13.3 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
11 Predictability 2: Cross Section 257
11.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
11.2 Cross-section regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
11.3 OLS estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
11.4 E-LASSO estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
11.5 Weighted cross section regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
12 Bayesian Estimation 262
12.1 Bayes Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
12.1.1 Conditional events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
12.1.2 Conditional densities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
12.2 Classical vs Bayesian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
12.2.1 σ2 known . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
12.2.2 σ2 unknown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
12.3 Informative priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
12.3.1 σ2 known . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
12.3.2 σ2 unknown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
12.4 Predictive distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
12.5 Bayesian regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
12.6 Bayesian CAPM test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
13 Black-Litterman Model 278
13.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
13.2 Single risky asset case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
13.3 Multiple risky asset case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
13.4 Alternative approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
14 References 283
1 Properties and Models of Stock Returns
In this section, we review the statistical properties of equity returns and the associated
models. Here we are mainly concerned with the univariate time series of an individual
stock's return, leaving the more complex multivariate case to later sessions.
1.1 Multiple-period returns
Let Pt be a stock price at time t, say today, and Pt−1 the price last period (could be yesterday or
last month).
There are three commonly used notions of return:
• Gross return

R∗t = (Pt + Dt)/Pt−1, (1.1)

the gross payoff per dollar invested at t − 1. For example, if you buy a stock at $100 last
year (time t − 1), it pays $2 in dividends at the end of the year (time t), and the price
today (time t) is $103, then your gross return is 1.05, i.e., 105%.
• Simple return, or simply the return,

Rt = (Pt + Dt − Pt−1)/Pt−1 = (Pt + Dt)/Pt−1 − 1 = R∗t − 1, (1.2)

i.e., the net percentage gain on the money invested. It is 5% in the previous example. One
often decomposes the return into two terms,

Rt = (Pt − Pt−1)/Pt−1 + Dt/Pt−1 = capital gain (loss) + dividend yield, (1.3)

so in the earlier example there is a 3% capital gain and a 2% dividend yield.
• Continuously compounded return, or log return:

rt = log((Pt + Dt)/Pt−1), (1.4)
c© Zhou, 2021 Page 1
1.1 Multiple-period returns
which says the investment grows at the continuously compounded rate rt. To see this, assume
Dt = 0; then the above equation implies

Pt = Pt−1 e^rt, (1.5)

i.e., the price appreciates at rate rt when there are no dividends.
• Simple vs. continuous: There are a few notable differences between simple and continuous
returns. First, the simple return lies in the range [−1, +∞), while the continuous return lies
in (−∞, +∞). So, theoretically speaking, the distribution of the simple return cannot be
symmetric, while that of the continuous return can be, and hence we cannot assume that
the simple return is normally distributed, but we can do so for the continuous return, as we
do in option pricing. However, most empirical studies still use the normality assumption for
simple returns as an approximation.

Second, the simple return is always greater than, or (rarely) equal to, the continuous return.
In our earlier example, the simple return is 5%, while the continuous return is only 4.88%
(= log(105/100)). Computing cumulative wealth with simple returns can be misleading.
For example, suppose a non-dividend-paying stock goes up from 100 to 200, and then falls
back to 100. The average simple return is 25% (= (100% − 50%)/2), but no value is created
because the stock drops back to 100. The average continuous return measures the value
correctly:

(1/2) [log(200/100) + log(100/200)] = (1/2) × 0 = 0.
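The 100 → 200 → 100 example can be verified numerically; here is a minimal Python sketch using only the price path given in the text:

```python
import math

# Price path of a non-dividend-paying stock: 100 -> 200 -> 100.
prices = [100.0, 200.0, 100.0]

# Simple returns: R_t = P_t / P_{t-1} - 1
simple = [p1 / p0 - 1 for p0, p1 in zip(prices, prices[1:])]

# Continuously compounded (log) returns: r_t = log(P_t / P_{t-1})
log_ret = [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

avg_simple = sum(simple) / len(simple)  # 0.25: looks like a 25% average gain
avg_log = sum(log_ret) / len(log_ret)   # 0.0: correctly shows no value created
```

The average simple return of 25% is misleading, while the average log return of zero reflects that the ending wealth equals the starting wealth.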
Although less popular, there are two other measures of gains in investments.

• The net gain,

gt = Pt + Dt − Pt−1, (1.6)

which is simply the gain in dollar value. For stocks, this series is unstable, and the return is
the preferred series to model. However, returns for trading futures cannot be defined as we
do for stocks here, because the cost of buying futures is arguably zero. So gt is the usual
object of study for futures contracts. To make it stable, it is often divided by the notional
value of the contract (implicitly assuming a leverage ratio).
• The return with margin,

R∗∗t = m(Pt + Dt)/Pt−1 − m = mRt, (1.7)
which is the return when $1 of your own money is used to buy $m worth of stock, i.e., m/Pt−1
shares (ignoring the interest charged on the margin loan). When there is no use of margin,
m = 1. In the US, one can in general use $1 to buy $2 worth of stocks (m = 2), a margin of
50% (of the purchased assets).
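The margin return can be checked from first principles, under the same simplification (interest on the margin loan ignored); the function name below is ours, not from the text:

```python
def levered_return(r: float, m: float) -> float:
    """Return on $1 of capital used to buy $m worth of stock on margin.

    End wealth = m * (1 + r) - (m - 1): the stock payoff minus the
    repayment of the m - 1 dollars borrowed (interest ignored).
    """
    end_wealth = m * (1 + r) - (m - 1)
    return end_wealth - 1  # works out to m * r

# A 5% unlevered return becomes roughly 10% at 50% margin (m = 2).
print(levered_return(0.05, 2))
```

With m = 1 the function reduces to the ordinary simple return, as it should.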
Suppose now you invest $1, and earn 10% in year 1 and 20% in year 2. Then your wealth in
year 2 is
W2 = (1 + .10)(1 + .20) = 1.32, (1.8)
and your two-year return is 32%. The implicit assumption is that dividends, if there are any, are
reinvested into the same asset. What is your average annual return?
There are two common averages, arithmetic and geometric ones.
• Arithmetic Average: For an investment over T periods with R1, . . . , RT the returns in
periods 1, . . . , T, the arithmetic average return is defined as

Ra = (R1 + R2 + · · · + RT)/T. (1.9)
In our previous example, T = 2, and Ra = 15%.
• Geometric Average: Note that the end-of-period wealth is

WT = (1 + R1)(1 + R2) · · · (1 + RT). (1.10)

The geometric average, Rg, is defined as the constant return that compounds to the same
end-of-period wealth,

(1 + Rg)^T = (1 + R1)(1 + R2) · · · (1 + RT). (1.11)

In our earlier example, Rg = √1.32 − 1 ≈ 14.9%.
• Arithmetic vs. Geometric: Mathematically, Ra is always greater than Rg unless all the
period-by-period returns are equal (only in that rare case are the two equal). Theoretically,
the more volatile the returns, the greater the difference between them.
In practice, some investors mistakenly use the arithmetic average, which is a proxy for the
expected return, to compound wealth. But this can be very inaccurate. For example, over the
period 1926–2002, the average annual returns in the US stock market were Ra = 17.74% and
Rg = 11.64%, respectively. If we use them to compound an investment of $10,000 for thirty
years, then we have

$10,000 × (1 + 17.74%)^30 ≈ $1,341,900 vs. $10,000 × (1 + 11.64%)^30 ≈ $272,020,

which are totally different. In pensions, expected returns are often used to discount future
liabilities, which will likely understate the true liabilities substantially.
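The averages and the compounding comparison above can be reproduced in a few lines of Python (the figures are the ones from the text):

```python
# Two-year example: 10% in year 1, 20% in year 2.
returns = [0.10, 0.20]

r_a = sum(returns) / len(returns)       # arithmetic average: 0.15
wealth = 1.0
for r in returns:
    wealth *= 1 + r                     # end-of-period wealth: 1.32
r_g = wealth ** (1 / len(returns)) - 1  # geometric average: ~0.149

# 1926-2002 averages, compounded over 30 years on $10,000:
w_a = 10_000 * (1 + 0.1774) ** 30  # ~ $1,341,900 (arithmetic, overstated)
w_g = 10_000 * (1 + 0.1164) ** 30  # ~ $272,020 (geometric)
```

The roughly fivefold gap between the two terminal wealth figures is exactly the inaccuracy the text warns about.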
In portfolio management and many investment contexts, we model and analyze returns using
simple returns and arithmetic averages. The primary reason is statistical consistency: if we
assume individual returns are normally distributed, then so are their portfolio returns. However,
if we assume individual returns are log-normally distributed (as we do for pricing options in the
Black-Scholes model), then their portfolio returns are no longer log-normally distributed. In
practice, just remembering that compounding should use continuous returns or geometric
averages will be sufficient.
Jacquier, Kane and Marcus (2003, 2005) point out that statistical estimates of the average
return can be substantially biased, upward or downward, as estimates of the long-term expected
return (for example, for investment horizons of 40 years, the forecasts can easily differ by a
factor of 2!). The question, then, is whether one can derive an unbiased estimator. Jacquier,
Kane and Marcus (2005) do obtain such an unbiased estimator by assuming the variance is
known, but that assumption is clearly unrealistic. Kan and Zhou (2009) solve the problem
completely by providing a new unbiased estimator without that assumption. Moreover, they
provide an unbiased estimator for a range of wealth levels, which adds more relevant information.
1.2 Expected returns vs realized returns
• At the start of the period, today (time t), future variables are unknown, and we can only
calculate their expected values. So the expected return is

E[Rt+1] = (E[Pt+1] + E[Dt+1])/Pt − 1. (1.12)
• At the end of the period (time t+ 1), however, the realized return can be computed based on
the observed price and dividend,
Rt+1 =
Pt+1 +Dt+1
Pt
− 1, (1.13)
© Zhou, 2021
The point is that the two can be quite different. For example, at the beginning of the year, I may
expect a 10% return, but the realized return at the end of the year can actually be −20%!
Another point is that a present value (PV) model of the stock price can be derived from (1.12).
To see this, we can rewrite the equation as

    P_t = (E[P_{t+1}] + E[D_{t+1}]) / (1 + r),    (1.14)

where r = E[R_{t+1}], the expected return or discount rate. The above equation says that the stock
price today is the expected payoff next period discounted back to today. Assume r is constant for
simplicity. Applying it to P_{t+1}, we get

    P_{t+1} = (E[P_{t+2}] + E[D_{t+2}]) / (1 + r).    (1.15)

Combining the two, we have

    P_t = E[D_{t+1}]/(1 + r) + E[D_{t+2}]/(1 + r)^2 + E[P_{t+2}]/(1 + r)^2.    (1.16)

By using the same logic, we eventually get

    P_t = E[D_{t+1}]/(1 + r) + E[D_{t+2}]/(1 + r)^2 + E[D_{t+3}]/(1 + r)^3 + · · · ,    (1.17)
which says that the stock price today is the sum of its discounted expected future cash flows. Thus,
changes in expectations about future dividends or about the discount rate will cause changes in the
current stock price.
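As a quick numerical check of (1.17), the sketch below discounts a long stream of constant expected dividends; the dividend level D and discount rate r are hypothetical. With constant dividends, the sum converges to the perpetuity value D/r:

```python
# Price as the sum of discounted expected dividends, per (1.17), truncated at H.
r = 0.10     # discount rate (assumed)
D = 5.0      # constant expected dividend per period (assumed)
H = 5_000    # truncation horizon; the discarded tail is negligible

P = sum(D / (1 + r) ** k for k in range(1, H + 1))
print(P)     # converges to the perpetuity value D / r = 50
```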
1.3 Mean, std, and confidence intervals
As the stock return R_t is random over time, we sometimes emphasize this fact by using the
notation R̃_t. Usually we assume that the stock return is independently and identically distributed
(iid) over time. Denote the density function by f(x).
The properties of the distribution are often examined by looking at the first two moments, the
mean and variance. Mathematically, they are defined by

    μ = E(R_t) = ∫_{−∞}^{+∞} x f(x) dx    (1.18)

and

    σ^2 = E(R_t − μ)^2 = ∫_{−∞}^{+∞} (x − μ)^2 f(x) dx,    (1.19)
where, for simplicity, we assumed the range of the integrals is (−∞, ∞). The mean is the same
as the expected value, and the standard deviation, also known as volatility in finance, is simply

    σ = Vol = √variance.

The mean summarizes the center of the mass of the distribution, while the standard deviation tells
how far most of the mass is away from the center. The mean and variance are the most important
quantities of any distribution.
Given data/observations of returns,

    R_1, . . . , R_T,
how do we estimate the mean and variance? We often use the sample mean and variance (estimating
the integrals by sums, called sample analogues),

    μ̂ = (1/T) Σ_{t=1}^T R_t,    (1.20)

    σ̂^2 = 1/(T − 1) Σ_{t=1}^T (R_t − μ̂)^2,    (1.21)
where T is the sample size. The above sample averages are intuitive approximations of the integrals.
Statistically, both of them are unbiased estimators,

    E(μ̂) = μ,    E(σ̂^2) = σ^2,
i.e., their expected values are equal to the true parameters. It says that if you estimate the
parameters over many data sets, you will be right on average. However, given a set of data, you
will have estimation errors. The confidence intervals below quantify such errors.
Note that the following is also a popular estimator of the variance,

    ŝ^2 = (1/T) Σ_{t=1}^T (R_t − μ̂)^2.    (1.22)
Mathematically, this is the maximum likelihood estimator, the one that maximizes the likelihood
(the joint density function) of the data. Numerically, however, the difference between the two is
very small when T is greater than, say, 100.
In Python, numpy.std(Data) computes the standard deviation of the data using the denominator
T, the default. To use the denominator T − 1, we simply set the parameter ddof (which stands for
Delta Degrees of Freedom) to 1, i.e., we use numpy.std(Data, ddof=1).
How accurate are the estimates? Although the estimators are unbiased, they have variances.
If the data are iid normal, and if the variance of the data, σ^2, is known, then standard textbooks
tell us the popular 95% confidence interval is

    [ μ̂ − 1.96 σ/√T,  μ̂ + 1.96 σ/√T ],    (1.23)

which means that the interval has 95% probability of containing the true μ. To see why, it is easy
to show that

    μ̂ ∼ N(μ, σ^2/T).    (1.24)
As is well known, for a standard normal z̃, its 95% probability interval is determined from

    0.95 = Prob(−1.96 < z̃ < 1.96).    (1.25)

In our case, based on (1.24), if we standardize μ̂ by letting

    z̃ = (μ̂ − μ) / (σ/√T),

then z̃ must follow the standard normal, and hence (1.25) implies

    μ̂ − 1.96 σ/√T < μ < μ̂ + 1.96 σ/√T,
which is the proof. Note that, to get the 90% or 99% confidence intervals, we simply replace 1.96
by 1.65 or 2.58 (the 90% and 99% cutoffs of the standard normal). If you want to have greater
confidence of covering μ, the interval must be wider.
However, σ is unknown in the real world, but it can be estimated by σ̂. Then the 95% confidence
interval is approximately

    [ μ̂ − 1.96 σ̂/√T,  μ̂ + 1.96 σ̂/√T ].    (1.26)

Since σ̂ introduces error in estimating σ, the true confidence interval should be wider than this one.
Nevertheless, under normality and if the sample size is greater than 50, the above is very accurate.
The reason is that the exact confidence interval is now determined by

    t̃ ≡ (μ̂ − μ) / (σ̂/√T).

Statistically, t̃ so defined has an exact t-distribution, and hence the true confidence interval is

    [ μ̂ + t_{.025} σ̂/√T,  μ̂ + t_{.975} σ̂/√T ],    (1.27)
where t_{.025} and t_{.975} are the lower and upper endpoints of the 95% interval for the t-distribution,

    0.95 = Prob(t_{.025} < t̃ < t_{.975}),    (1.28)

where the degree of freedom of the t-distribution is ν = T − 1. When the sample size is T = 50,
t_{.975} is about 2.01 (t_{.025} is about −2.01), and the normal confidence interval is a good
approximation. As T increases, t_{.975} becomes closer to 1.96 and reaches it in the limit, because
the t-distribution approaches the normal as the degree of freedom increases to infinity.
The following Python code makes the above easy to implement:

import numpy as np
import scipy.stats

alpha = 0.05                              # significance level = 5%
T = len(Data)                             # sample size
df = T - 1                                # degrees of freedom = sample size - 1
t95 = scipy.stats.t.ppf(1 - alpha/2, df)  # t critical value for 95%
s = np.std(Data, ddof=1)                  # sample standard deviation of Data
xbar = np.mean(Data)                      # sample mean
lower = xbar - t95 * (s / np.sqrt(T))
upper = xbar + t95 * (s / np.sqrt(T))
Example 1.1 Suppose that with T = 5 data points, you obtain the sample mean and standard
deviation (see the optional Python code),

    μ̂ = 0.10,    σ̂ = 0.1768.
Then, the true confidence interval, (1.27), is
[−0.1195, 0.3195],
and the normal approximation, (1.26), is
[−0.0550, 0.2549].
In this case, as the sample size is small, the difference between the two is large.
Now suppose that T = 120 (e.g., 10 years of monthly data), then the confidence intervals are
[0.0680, 0.1320]
and
[0.0683, 0.1316]
which are much tighter and closer, as they should be as T becomes larger. ♠
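The numbers in Example 1.1 can be reproduced with a few lines of Python; this is a sketch, and the helper function mean_ci is ours, not part of any library:

```python
import numpy as np
from scipy import stats

def mean_ci(mu_hat, s_hat, T, alpha=0.05):
    """Exact (t-based) and approximate (normal) confidence intervals for the mean."""
    se = s_hat / np.sqrt(T)
    t_crit = stats.t.ppf(1 - alpha / 2, df=T - 1)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return ((mu_hat - t_crit * se, mu_hat + t_crit * se),
            (mu_hat - z_crit * se, mu_hat + z_crit * se))

t_ci, z_ci = mean_ci(0.10, 0.1768, T=5)
print(t_ci)   # about (-0.1195, 0.3195)
print(z_ci)   # about (-0.0550, 0.2550)
```

Rerunning with T = 120 gives the much tighter intervals quoted in the example.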
It should be emphasized that the above confidence interval is exact only when the data are iid
normal. When normality is violated, as it often is, the interval is only an approximation.
Theoretically, as the sample size becomes large, it will be more accurate. When the sample size is
small, the bootstrap procedure can be used to improve the accuracy.
What is the 95% confidence interval for σ? Now we need the statistical result that

    (T − 1) σ̂^2 / σ^2 ∼ χ^2_{T−1},    (1.29)

that is, the ratio of the variance estimator to the true variance, after multiplying by T − 1, has a
chi-squared distribution with T − 1 degrees of freedom. Then, solving from the above, the 95%
confidence interval for σ^2 is

    [ (T − 1) σ̂^2 / χ^2_{0.975},  (T − 1) σ̂^2 / χ^2_{0.025} ],    (1.30)

where χ^2_{0.025} and χ^2_{0.975} are the lower and upper endpoints of the 95% confidence interval
for the chi-squared distribution. The confidence interval for σ is then obtained by taking square
roots on both sides of (1.30).
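A sketch of (1.30) in Python; the function sigma_ci is illustrative, and the inputs σ̂ = 0.1768 with T = 120 are borrowed from Example 1.1:

```python
import numpy as np
from scipy import stats

def sigma_ci(s_hat, T, alpha=0.05):
    """Interval for sigma: square roots of the variance interval in (1.30)."""
    chi_lo = stats.chi2.ppf(alpha / 2, df=T - 1)       # chi2_{0.025}
    chi_hi = stats.chi2.ppf(1 - alpha / 2, df=T - 1)   # chi2_{0.975}
    lo = np.sqrt((T - 1) * s_hat**2 / chi_hi)
    hi = np.sqrt((T - 1) * s_hat**2 / chi_lo)
    return lo, hi

lo, hi = sigma_ci(0.1768, T=120)
print(lo, hi)   # an interval containing the point estimate 0.1768
```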
1.4 Mode and median
Besides moments, the mode and median are also of use at times. The mode of a distribution is
defined as the value of x at which the density function has a peak, or the greatest value. The mode
is not necessarily unique. The density function of a continuous distribution can have multiple local
maxima, and then it is commonly referred to as multimodal (as opposed to unimodal).
However, most distributions used in finance have a unique mode. For example, the normal
distribution has only one mode, and it is the mean. However, the mode is generally different from
the mean, especially for asymmetric distributions.
The median is an x value, say x0, such that the probability of x being greater or less than it is
exactly 50%,

    ∫_{−∞}^{x0} f(x) dx = 0.5 = ∫_{x0}^{+∞} f(x) dx.    (1.31)

For discrete distributions or for a set of data, the median is the central number ordered from the
smallest to the largest. If the number of data points is even, there are two numbers in the middle,
and the median is their average.
In Python, Numpy has a function for the median, but not for the mode. So the easiest way is to
use another package that does both:

import statistics as stat   # import the package
stat.mode(Data)             # the output is the mode of the data
stat.median(Data)           # the output is the median of the data
1.5 Skewness and kurtosis
In the real world, the mean and variance do not summarize the complete properties of the data,
and we need more measures, the third and fourth centered moments,

    μ3 = E(R_t − μ)^3 = ∫_{−∞}^{+∞} (x − μ)^3 f(x) dx    (1.32)

and

    μ4 = E(R_t − μ)^4 = ∫_{−∞}^{+∞} (x − μ)^4 f(x) dx.    (1.33)

Then, skewness and kurtosis are defined as the standardized third and fourth centered moments,

    Skewness = μ3 / σ^3    (1.34)

and

    Kurtosis = μ4 / σ^4.    (1.35)
Since they are divided by the powers of the standard deviation, they will be invariant to scaling of
the returns. Economically, if you double your holding of the asset, you will have the same skewness
or kurtosis.
If the skewness is positive, it means that there is relatively more mass on the right side of
the mean. For symmetric distributions around its expected value like the standard normal, the
skewness is zero. The kurtosis measures how fat the tail of the distribution is. For the standard
normal distribution, the kurtosis is 3.
With data available, they can clearly be estimated by their sample counterparts,

    γ̂3 = (1/T) Σ_{t=1}^T (R_t − μ̂)^3 / σ̂^3,    (1.36)

    γ̂4 = (1/T) Σ_{t=1}^T (R_t − μ̂)^4 / σ̂^4,    (1.37)
where 1/T, similar to the variance case, may be replaced by other scalars to make them unbiased.
But this makes little numerical difference when the sample size is large, say greater than 100.
In Python, we can use scipy.stats.skew and scipy.stats.kurtosis to compute the skewness and
kurtosis from data. As in the standard deviation case, the above simple moment estimators are
biased. To compute the unbiased estimates, we simply specify the parameter bias=False; the
default is the above formulas with biased estimates. Note also that scipy.stats.kurtosis reports
excess kurtosis (kurtosis minus 3) by default, so its output is close to zero for normally
distributed data.
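For instance, on simulated normal data the estimates should be near 0 and 3; a sketch (fisher=False asks scipy for the kurtosis itself rather than the excess kurtosis):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.standard_normal(100_000)   # simulated normal "returns"

g3 = stats.skew(data, bias=False)                    # near 0 for normal data
g4 = stats.kurtosis(data, fisher=False, bias=False)  # near 3; fisher=False keeps the 3
print(g3, g4)
```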
Mathematically (see Doane and Seward, 2011, Joanes and Gill, 1998, and references therein),
the following adjusted Fisher–Pearson standardized moment coefficient,

    g3 = [√(T(T − 1)) / (T − 2)] γ̂3,

is the unbiased skewness estimator, and

    g4 = [(T + 1)T / ((T − 1)(T − 2)(T − 3))] Σ_{t=1}^T (R_t − μ̂)^4 / σ̂^4
           − 3(T − 1)^2 / ((T − 2)(T − 3)) + 3

is the unbiased kurtosis estimator, where σ̂^2 is the unbiased variance estimator in (1.21), so
that σ̂^4 = (σ̂^2)^2.
Statistically, any random variable is completely determined by its moment generating function,

    g(t) = E(e^{tx}) = ∫_{−∞}^{+∞} e^{tx} f(x) dx.    (1.38)

In other words, knowing f(x) we know g(t), and knowing g(t), we can recover f(x). Since

    e^x = 1 + x + x^2/2! + x^3/3! + · · · + x^n/n! + · · · ,

it follows that

    g(t) = 1 + t E(x) + (t^2/2!) E(x^2) + (t^3/3!) E(x^3) + · · · + (t^n/n!) E(x^n) + · · · ,

where E(x), E(x^2), E(x^3), E(x^4), ... are the moments of x, which are related to the earlier
centered moments (with the mean of x subtracted). The point is that the first 4 moments summarize
almost all features of a distribution. In practice, moments of order higher than 4 are almost never
used. Since normality is a common assumption, the skewness and kurtosis are useful in telling us
how the data deviate from the normality assumption (see Section 1.6.2). They are also relevant to
portfolio choice (see Section 3.8.2).
1.6 Univariate distributions
In this subsection, we provide a short review of common univariate distributions that are often
used in finance.
1.6.1 Uniform distribution
While this distribution is not as widely used as the normal distribution, it is the simplest continuous
distribution that is useful for understanding others. This distribution can be used to simulate other
distributions or can be used in a Bayesian setup for describing diffuse priors.
A random variable u has a standard uniform distribution, denoted as U(0, 1), if it has equal
probability of being any number in [0, 1]. Since it is equally likely over [0, 1], the density
function must be a constant over the range. Then its density function must be

    f(x) = 1 if x ∈ [0, 1];  0 otherwise,    (1.39)

which follows from the fact that the integral should be 1, and so the constant is 1. In Bayesian
analysis (to be discussed), if our prior belief is that the expected return on a stock is equally
likely to be anywhere from 0% to 100%, then we can model this belief by using U(0, 1).
There are two important properties of the standard uniform distribution. First, if u is a random
number from U(0, 1), then x = G^{−1}(u) will be a random number from G(x), where G(x) is the
cumulative distribution function of any continuous distribution. Hence, the standard uniform
distribution helps to obtain random numbers from any other continuous distribution. Second, if
u follows U(0, 1), so does 1 − u. This property can be used in Monte Carlo simulations to reduce
variance.
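Both properties are easy to check by simulation; the sketch below uses the standard normal inverse cdf (scipy's ppf) for G^{−1}:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
u = rng.uniform(size=200_000)   # draws from U(0, 1)

# First property: x = G^{-1}(u) is a draw from G; here G is the standard
# normal cdf, whose inverse is scipy's ppf.
x = stats.norm.ppf(u)
print(x.mean(), x.std())        # close to 0 and 1

# Second property: 1 - u is also U(0, 1).
print((1 - u).mean())           # close to 1/2
```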
The cumulative distribution function of a U(0, 1) random variable also has a simple form. It is
clear that

    F(x) = Prob(u ≤ x) = 0 if x < 0;  x if x ∈ [0, 1];  1 if x > 1.    (1.40)

For example, the probability for u ≤ 0.5 is clearly 50%.
In general, we can consider a uniform distribution over any finite interval [a, b]. The density
function is

    f(x) = 1/(b − a) if x ∈ [a, b];  0 otherwise,    (1.41)

and the cumulative distribution function is

    F(x) = Prob(u ≤ x) = 0 if x < a;  (x − a)/(b − a) if x ∈ [a, b];  1 if x > b.    (1.42)

Moreover, the n-th moment can be solved analytically,

    E(u^n) = (b^{n+1} − a^{n+1}) / ((n + 1)(b − a)).    (1.43)

In particular, the mean is E(u) = (b + a)/2 and the variance is (b − a)^2/12.
1.6.2 Normal distribution
The normal distribution is the most used distribution, not only in finance but also in statistics.
Its density function is

    f(x) = 1/(σ√(2π)) e^{−(x−μ)^2/(2σ^2)},    (1.44)

where μ is the mean and σ^2 is the variance. When a random variable x̃ (a stock return, say)
follows a normal distribution, we often write it as

    x̃ ∼ N(μ, σ^2).    (1.45)

In simulations, a computer language such as Python often provides a random number from the
standard normal,

    z̃ ∼ N(0, 1).    (1.46)

Then a random number with the desired mean μ and variance σ^2 can be computed from

    x̃ = μ + σ z̃.
The cumulative distribution function (cdf) of the standard normal distribution, usually denoted
by Φ(x) in the statistical literature, is

    Φ(x) ≡ Prob(z < x) = (1/√(2π)) ∫_{−∞}^x e^{−t^2/2} dt,    (1.47)
which is the probability that the standard normal random variable is below a fixed constant x.
With Python, one can easily compute the density and cdf. For example, running the commands
below at the Spyder prompt,

import scipy.stats

scipy.stats.norm(0, 1).pdf(0)
scipy.stats.norm(0, 1).cdf(0)
scipy.stats.norm(0, 1).cdf(1.96)

you will get the value of the density at 0, 0.3989, the probability of being less than 0, 50%, and
that of being less than 1.96, 97.5%.
There are some simple facts about the normal distribution. If a set of data is randomly drawn
from the normal distribution, then 68% of the data fall within one standard deviation from the
mean, 95% within two standard deviations, and 99.7% within three standard deviations.
For example, mathematically,

    Prob(−2 < z̃ < 2) ≈ 95%,

where z̃ follows the standard normal (note: the above equality is approximate and easier to
remember; the exact equality requires replacing the 2 by 1.96). These facts are easily verified with
Python.
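A quick check of the 68/95/99.7 rule from the cdf Φ:

```python
from scipy import stats

# Probability mass within k standard deviations of the mean, from the normal cdf.
for k in (1, 2, 3):
    p = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(k, round(p, 4))   # 1 0.6827, 2 0.9545, 3 0.9973
```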
The normal distribution has its higher central moments analytically available,

    E(x − μ)^m = 0 if m is odd;  σ^m (m − 1)!! if m is even,

where (m − 1)!! = (m − 1)(m − 3) · · · 3 · 1, the double factorial. In particular, E(x − μ)^4 = 3σ^4
and E(x − μ)^6 = 15σ^6. It follows that the normal distribution has skewness zero and kurtosis 3:

    Skewness = 0,    (1.48)
    Kurtosis = 3,    (1.49)

which are obtained from the third and fourth central moments by dividing by σ^3 and σ^4, respectively.
In practice, the mean and variance are unknown, but can be easily estimated by using the
sample mean and sample variance (see (1.20) and (1.21)). How good are the estimates? The
confidence intervals discussed there answer this question.
Is the normal distribution a good assumption for a given set of data? The common tests
examine whether the sample skewness and kurtosis are too far from those of the normal distribution.
Asymptotically, if the data are normally distributed and iid, the sample skewness and kurtosis
should converge to 0 and 3, with the following distributions,

    γ̂3 ∼ N(0, 6/T),    (1.50)

    γ̂4 ∼ N(3, 24/T).    (1.51)

In other words, as the sample size T increases, they should be close to 0 and 3, respectively. How
close is close? This is judged by the confidence intervals from the above asymptotic distributions.
Hence, if the estimated skewness and kurtosis are far away from the values implied by the above
asymptotic distributions, we can reject the null hypothesis that the data are normal.
Example 1.2 As demonstrated in class, we can compute and obtain

    γ̂3 = −0.4551,    γ̂4 = 6.3448,

for the CRSP stock index based on monthly returns from January 1934 to December 2011
(T = 78 × 12 = 936). The standard errors are

    √(6/T) = 0.0801,    √(24/T) = 0.1601,

respectively. Then the 95% confidence intervals are [−0.1569, 0.1569] and [2.6861, 3.3139]. Since
the estimates are outside of these intervals, we reject the hypothesis that the index is normally
distributed. ♠
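The logic of Example 1.2 can be packaged as a simple check; a sketch in which normality_check is our own illustrative function and the simulated data stand in for actual returns:

```python
import numpy as np
from scipy import stats

def normality_check(data, z=1.96):
    """Reject normality at the 5% level if the sample skewness or kurtosis
    falls outside the asymptotic bands implied by (1.50)-(1.51)."""
    T = len(data)
    g3 = stats.skew(data)
    g4 = stats.kurtosis(data, fisher=False)   # kurtosis itself, not excess
    skew_ok = abs(g3) <= z * np.sqrt(6 / T)
    kurt_ok = abs(g4 - 3) <= z * np.sqrt(24 / T)
    return skew_ok and kurt_ok

rng = np.random.default_rng(2)
print(normality_check(rng.standard_normal(936)))   # normal data: passes most of the time
print(normality_check(rng.exponential(size=936)))  # heavily skewed data: should fail
```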
1.6.3 Lognormal distribution
Stock returns are often measured in terms of simple returns. Theoretically, there is a
potential problem with the use of a normal distribution because simple returns are asymmetric
and bounded below by −100%. In many applications, −100% is in the far left tail and may be safely
ignored, say for daily returns. Nevertheless, when the size of the return is large, say for annual
returns, that can be an important issue.
In this case, we often use the continuously compounded return,

    r_t = log((P_t + D_t) / P_{t−1}).

When we assume r_t is normal, we say the price P_t is lognormally distributed because its logarithm
is normally distributed. In the famous Black-Scholes formula for option prices, the stock price is
assumed to be lognormal. Mathematically, if y is lognormal, or the log of y is normal,

    log(y) ∼ N(μ, σ^2),    (1.52)

then its density function is given by

    g(y) = 1/(y σ√(2π)) e^{−(log y − μ)^2/(2σ^2)},

and its mean and variance are

    E(y) = e^{μ + σ^2/2},    σ_y^2 = (e^{σ^2} − 1) e^{2μ + σ^2}.
So the lognormal and normal distributions are quite different. Also, the normal distribution is
symmetric, and the lognormal is obviously not.
1.6.4 χ2-distribution
The chi-squared distribution is widely used in chi-square tests for the goodness of fit of an
observed model. Most asymptotic tests, such as tests of whether a set of parameters are equal to
some prescribed values, or the likelihood ratio test, have a χ^2-distribution.
Statistically, it is defined as the distribution of the sum of squared standard normal deviates:

    χ^2 = X_1^2 + X_2^2 + · · · + X_n^2,    (1.53)

where the X_i's are independent standard normal random variables, and n is known as the degrees
of freedom of the χ^2-distribution.
Its summary statistics are

    Mean = n,    (1.54)
    Variance = 2n,    (1.55)
    Skewness = √(8/n),    (1.56)
    Kurtosis = 3 + 12/n.    (1.57)

Its density function is

    f(x) = 1/(2^{n/2} Γ(n/2)) x^{n/2 − 1} e^{−x/2},    (1.58)

where Γ(·) is the Gamma function,

    Γ(z) ≡ ∫_0^∞ x^{z−1} e^{−x} dx,

with the properties that

    Γ(1/2) = √π,    Γ(z + 1) = z Γ(z),
and, for any positive integer n, Γ(n) = (n − 1)!. For example, Γ(1) = 1, Γ(2) = 1! = 1, Γ(3) =
2! = 2, and Γ(4) = 3! = 6 (note that Γ(0) is undefined, or +∞).
An interesting fact about the χ^2 is that the sum of squared deviations of standard normal
variables from their mean still follows a χ^2, that is,

    z = (X_1 − X̄)^2 + (X_2 − X̄)^2 + · · · + (X_n − X̄)^2 ∼ χ^2_{n−1},

where X̄ is the mean of the X_i's. Note that the degree of freedom is reduced by 1. This result
can be extended to a linear regression model: the sum of the squared fitted residuals (errors) will
be χ^2-distributed up to a scale, which is the variance of the residual and can be consistently
estimated by the sample variance. The degree of freedom goes down by the number of regressors.
1.6.5 t-distribution
The t-distribution is the most used distribution in finance for testing hypotheses. It can also be
used as a model for stock returns. Indeed, if the return data have fatter tails than the normal,
the normality hypothesis will be rejected. Then the t-distribution is a good alternative candidate.
Statistically, the t-distribution is the ratio of a standard normal to the square root of a χ^2
scaled by its degrees of freedom, that is,

    X / √(Z/ν) ∼ t(ν),    (1.59)
where X ∼ N(0, 1), Z ∼ χ^2_ν, and ν is also known as the degree of freedom of the t. Note that
√ν is used to scale X/√Z in the definition. The reason is that √Z has a value around √ν (as χ^2_ν
has a mean of ν), so the scaling makes X divided by a value around 1, not changing its variance
by much unless ν is small (see the moments below).
Historically, the t-distribution was motivated by the analysis of sampling accuracy. Let
X_1, . . . , X_n be iid samples from a general normal distribution N(μ, σ^2). We have the sample mean

    X̄ = (1/n) Σ_{i=1}^n X_i

and sample variance

    s^2 = 1/(n − 1) Σ_{i=1}^n (X_i − X̄)^2.

If σ is known, then we have

    (X̄ − μ) / (σ/√n) ∼ N(0, 1),

i.e., we can obtain the confidence interval on the true and unknown mean μ by scaling the standard
normal distribution. However, σ is unknown in practice, but can be estimated by s. Replacing σ
by s, the term then has a t-distribution,

    (X̄ − μ) / (s/√n) ∼ t(n − 1),

and so we can use the t-distribution to determine the confidence interval. In particular, we can
test the hypothesis μ = 0. This can be extended to many models, such as the linear regression,
to test whether a slope is zero or not.
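A sketch of the test of μ = 0 on simulated monthly returns (the return parameters are made up); the hand-computed t-statistic agrees with scipy's one-sample t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
returns = 0.01 + 0.05 * rng.standard_normal(120)  # simulated monthly returns

# t-statistic for H0: mu = 0, from the definition above ...
T = len(returns)
t_stat = returns.mean() / (returns.std(ddof=1) / np.sqrt(T))

# ... which matches scipy's one-sample t-test.
t_scipy, p_value = stats.ttest_1samp(returns, popmean=0.0)
print(t_stat, t_scipy, p_value)
```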
The density function is

    f(x) = Γ[(ν + 1)/2] / (Γ(1/2) Γ(ν/2)) × 1/(σ√ν) × (1 + (x − μ)^2/(σ^2 ν))^{−(ν+1)/2},    (1.60)

where ν is the degree of freedom. Its summary statistics are

    Mean = μ,    (1.61)
    Variance = σ^2 ν/(ν − 2), (ν > 2),    (1.62)
    Skewness = 0,    (1.63)
    Kurtosis = 3 + 6/(ν − 4), (ν > 4).    (1.64)

It is seen that the t is symmetric and cannot capture any skewness in the data. However, whatever
the level of kurtosis (above 3), the t-distribution can model it as long as ν is close enough to 4.
On the other hand, when ν goes to infinity, the kurtosis approaches 3 and the distribution converges
to the normal.
1.6.6 A skewed normal distribution
In practice, the normal distribution is often rejected, and the t-distribution is a better alternative
of the data. However, it is not as used often because it is more complex, and also because it usually
will not change the results that much.
Note that both normal and t distributions are symmetric. In certain applications or for certainty
stocks, the skewness can be very important, but are completely ignored by both normal and t. In
this case, a distribution with non-zero skewness is needed.
However, a skewed distribution is more complex to construct. For example,

    x = μ + σ (z − E(z)) / √(var(z))    (1.65)

has a skewed normal distribution (see Azzalini, 1985, and Azzalini and Valle, 1996), where the
density of z is given by

    g(z) = 2 φ(z) Φ(λz),    (1.66)

where φ and Φ are the standard normal density and distribution function, respectively, and λ
controls the skewness.
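scipy implements the Azzalini skew normal as scipy.stats.skewnorm, with its shape parameter a playing the role of λ. The sketch below standardizes draws of z as in (1.65), using sample moments in place of E(z) and var(z), with hypothetical values for μ and σ:

```python
import numpy as np
from scipy import stats

lam = 4.0                     # lambda, the shape (skewness) parameter
rng = np.random.default_rng(4)
z = stats.skewnorm.rvs(a=lam, size=200_000, random_state=rng)

# Standardize z as in (1.65); sample moments stand in for E(z) and var(z).
mu, sigma = 0.01, 0.05        # target mean and volatility (assumed)
x = mu + sigma * (z - z.mean()) / z.std()
print(x.mean(), x.std(), stats.skew(x))  # mean ~ mu, std ~ sigma, positive skewness
```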
Ideally, the choice of a statistical model for stock returns should satisfy three criteria:
1. consistency: fits the past data
2. testability: one should be able to test hypotheses of whether the model fits the data
3. parsimony: few parameters and tractable
But such a model is difficult to find. The study of skewed and skewed-t distributions, especially in
the multivariate case and in asset allocation applications, is a subject of ongoing research.
1.6.7 F -distribution
The F-distribution is defined as a ratio of two χ^2 random variables, each adjusted by its degrees
of freedom,

    z ≡ (X_1/d_1) / (X_2/d_2) ∼ F(d_1, d_2),    (1.67)
where X_1 and X_2 are independently χ^2-distributed with degrees of freedom d_1 and d_2,
respectively.
In a univariate regression with multiple regressors, the t-distribution is often used to test
whether a slope is zero. However, the t-test is no longer applicable to multivariate hypotheses,
which are common in finance as regressions are usually run on many assets jointly. The
F-distribution can thus be regarded as an extension of the squared t to multivariate hypothesis
testing; indeed, [t(d)]^2 is the same as F(1, d).
The density function is

    f(x; d_1, d_2) = 1/B(d_1/2, d_2/2) × (d_1/d_2)^{d_1/2} x^{d_1/2 − 1}
                       (1 + (d_1/d_2) x)^{−(d_1+d_2)/2},    (1.68)

where B(x, y) is the beta function,

    B(x, y) = ∫_0^1 t^{x−1} (1 − t)^{y−1} dt,

which can be computed from the Gamma function via B(x, y) = Γ(x)Γ(y)/Γ(x + y).
Its first two moments are

    Mean = d_2/(d_2 − 2), (d_2 > 2),    (1.69)

    Variance = 2 (d_2/(d_2 − 2))^2 (d_1 + d_2 − 2) / (d_1 (d_2 − 4)), (d_2 > 4).    (1.70)

It is interesting that the mean depends only on the degrees of freedom in the denominator, while
the variance depends on both.
1.7 Multivariate distributions
In investments, we are often interested in a set of assets rather than a single one. Hence, multivariate
distributions are critically useful for modeling the returns of many assets. Multivariate normal is
the most commonly used multivariate distribution in finance.
To gain some insight, we will first review a general property of linear transformations of a
vector of random variables, then discuss the bivariate normal, and finally the multivariate normal
and multivariate t distributions.
1.7.1 Mean and variance of linear transformations
In various applications, it is necessary to compute the mean and covariance matrix of linear
transformations of a random vector. Because of this, we list the formulas below.
Let X be an n-vector of random variables,

    X = (X_1, . . . , X_n)′.

It is often of interest to know the distributional properties of its linear transformation,

    Y = AX + B,    (1.71)

where A and B are constants, an m × n matrix and an m-vector, respectively. It is well known
(check your stats texts) that the mean and covariance matrix of Y are

    E[Y] = A E[X] + B,    (1.72)

    Var[Y] = A Var[X] A′.    (1.73)

The proof of the first equation is trivial, as constants can be factored out in taking expectations.
The second equation follows from

    Var[Y] = E([AX − Aμ][AX − Aμ]′) = A (E[X − μ][X − μ]′) A′ = A Var[X] A′,

where, with μ = E[X] the mean, the first and last equalities are valid by definition.
In finance, we are often interested in a portfolio of stock returns. Taking X as the vector of
returns, and A = w′ = (w_1, . . . , w_n), the row vector of portfolio weights (with B = 0), Equations
(1.72) and (1.73) provide the mean and variance of the portfolio return,

    E[Y] = w′ E[X] = w_1 μ_1 + w_2 μ_2 + · · · + w_n μ_n,    (1.74)

    Var[Y] = w′ Σ w,    (1.75)

where w is the n × 1 vector of weights, and Σ is the covariance matrix of X. These two formulas
are very useful for portfolio decisions.
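A two-asset sketch of (1.74)–(1.75), with hypothetical inputs:

```python
import numpy as np

mu = np.array([0.08, 0.12])        # expected returns (assumed)
Sigma = np.array([[0.04, 0.01],
                  [0.01, 0.09]])   # covariance matrix (assumed)
w = np.array([0.6, 0.4])           # portfolio weights

port_mean = w @ mu                 # w' mu, per (1.74)
port_var = w @ Sigma @ w           # w' Sigma w, per (1.75)
print(port_mean, port_var, np.sqrt(port_var))
```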
Equations (1.72) and (1.73) are true regardless of the distribution of X. In other words, they
are true whether the elements of X have normal or t or χ^2 distributions. They are also very useful
for simulations. Computers often generate random numbers that are standardized, then the two
equations help to transform them into random variables of arbitrary mean and variance. Chapter
4 provides the details.
1.7.2 Bivariate normal
The simplest bivariate normal distribution is the distribution of two independent standard normal
variables. In this case, each one has the standard normal density (Equation (1.44) with μ = 0 and
σ = 1). Due to independence, their joint density function is the product of the individual densities,

    f(x1, x2) = (1/√(2π)) e^{−x1^2/2} × (1/√(2π)) e^{−x2^2/2} = (1/(2π)) e^{−(x1^2 + x2^2)/2},    (1.76)
which completely determines the distribution.
In general, for two variables,

    X = (X1, X2)′,

with arbitrary mean and covariance matrix,

    μ = (μ1, μ2)′,    Σ = [ σ1^2     ρσ1σ2
                            ρσ1σ2    σ2^2  ],

where ρ is the correlation, the bivariate normal density function is

    f(x1, x2) = (1/(2π)) |Σ|^{−1/2} exp[−(1/2)(x − μ)′ Σ^{−1} (x − μ)],    (1.77)

where |Σ| is the determinant of the matrix Σ and x = (x1, x2)′.
To make sense of the density, recall that the determinant and inverse of any 2 × 2 matrix

    A = [ a  b
          c  d ]

can be written out explicitly,

    det(A) = ad − bc,    A^{−1} = (1/det(A)) [  d  −b
                                               −c   a ].    (1.78)

Then

    |Σ|^{−1/2} = (σ1^2 σ2^2 − ρ^2 σ1^2 σ2^2)^{−1/2} = 1 / (σ1 σ2 √(1 − ρ^2))
and

    (x − μ)′ Σ^{−1} (x − μ)
      = 1/(σ1^2 σ2^2 (1 − ρ^2)) (x1 − μ1, x2 − μ2) [  σ2^2    −ρσ1σ2
                                                     −ρσ1σ2    σ1^2  ] (x1 − μ1, x2 − μ2)′
      = 1/(1 − ρ^2) [ (x1 − μ1)^2/σ1^2 − 2ρ(x1 − μ1)(x2 − μ2)/(σ1σ2) + (x2 − μ2)^2/σ2^2 ],

where the last line is from multiplying the previous expression out and combining the terms. Hence,
if needed, the bivariate normal density is straightforward to compute.
An important property is the conditional distribution. Denote by X and Y two stock returns
following a bivariate normal. Conditional on stock X going up, should stock Y go up?
First, Y conditional on X is still normally distributed. The conditional mean is (derivations
are not given here)

    E[Y | X = x] = μ_Y + ρ σ_Y (x − μ_X)/σ_X.    (1.79)

This formula makes intuitive sense. If X goes up relative to its mean (higher than expected), Y
will tend to do so too if the two are positively correlated.
The conditional variance is

    Var[Y | X = x] = σ_Y^2 (1 − ρ^2).    (1.80)

This formula makes intuitive sense too. Without knowing X, the variance of Y is σ_Y^2. Information
on X helps to reduce the variance, and the reduction depends on their correlation.
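Both formulas can be verified by simulation; a sketch with hypothetical parameters, conditioning on draws whose x falls in a narrow window around a fixed value x0:

```python
import numpy as np

mu_x, mu_y = 0.05, 0.08                  # means (assumed)
sig_x, sig_y, rho = 0.20, 0.25, 0.6      # volatilities and correlation (assumed)
Sigma = np.array([[sig_x**2, rho * sig_x * sig_y],
                  [rho * sig_x * sig_y, sig_y**2]])

rng = np.random.default_rng(5)
draws = rng.multivariate_normal([mu_x, mu_y], Sigma, size=1_000_000)
x, y = draws[:, 0], draws[:, 1]

# Condition on x falling in a narrow window around x0 (one sigma above its mean).
x0 = mu_x + sig_x
sel = np.abs(x - x0) < 0.01
cond_mean, cond_std = y[sel].mean(), y[sel].std()

theory_mean = mu_y + rho * (x0 - mu_x) / sig_x * sig_y   # (1.79)
theory_std = sig_y * np.sqrt(1 - rho**2)                 # square root of (1.80)
print(cond_mean, theory_mean, cond_std, theory_std)
```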
1.7.3 Multivariate normal
The density function of an n-dimensional multivariate normal, X ∼ N(μ, Σ), is

    f(x) = (2π)^{−n/2} |Σ|^{−1/2} exp[−(1/2)(x − μ)′ Σ^{−1} (x − μ)],    (1.81)

where μ is an n-vector of means, Σ is the n × n covariance matrix, and x = (x_1, . . . , x_n)′.
Although this is more complex than the bivariate case, it is still easy to compute, if needed, in
practice by using a computer.
One of the most important properties of the normal is that all conditional and marginal
distributions are normally distributed too. Let X ∼ N(μ, Σ), and partition X, μ and Σ as

    X = (X_1′, X_2′)′,    μ = (μ_1′, μ_2′)′,    Σ = [ Σ_11  Σ_12
                                                      Σ_21  Σ_22 ],    (1.82)

where X_1 and μ_1 are k-vectors and Σ_11 is a k × k matrix. Then, the conditional distribution of
X_1 given X_2 is still normal, with mean and covariance matrix

    E[X_1 | X_2] = μ_1 + Σ_12 Σ_22^{−1} (X_2 − μ_2),    (1.83)

    Var[X_1 | X_2] = Σ_11 − Σ_12 Σ_22^{−1} Σ_21.    (1.84)

In other words, given a joint distribution of normal random variables, we can get their conditional
means and variances easily from the above formulas, which determine the entire distribution in the
normal case.
1.7.4 Multivariate t
The multivariate t-distribution is an extension of the univariate t to an n-dimensional vector.
Assume Y is an n-dimensional normal, Y ∼ N(0,Σ), and u is an independent χ²ν random variable.
We call the distribution of X,
X ≡ µ + Y/√(u/ν), (1.85)
a multivariate t with ν degrees of freedom. The density function is
f(x) = Γ[(ν + n)/2] / (ν^{n/2} π^{n/2} Γ(ν/2) |Σ|^{1/2}) × [1 + (1/ν)(x − µ)′Σ⁻¹(x − µ)]^{−(ν+n)/2}. (1.86)
Note that the covariance matrix of the multivariate t is [ν/(ν − 2)]Σ, not Σ. Although the multivariate t
is not used by many, Tu and Zhou (2004) find that it is a much better model for stock return data than
the multivariate normal. Although it is symmetric, a skewness test is unable to reject it for the data.
1.7.5 Wishart distribution
The Wishart distribution is an extension of the χ²-distribution to n > 1 dimensions. Mathematically, it
is defined via sums of outer products of independent normally distributed vectors. Let Z1, Z2, . . . , ZT be T
independent random n-vectors, each of which follows a multivariate normal distribution with zero
mean and the same covariance matrix,
Zt ∼ N(0,Σ),
i.e., the Zt's are T independent random draws from the same multivariate normal distribution. We
call the distribution of the n × n matrix A below Wishart,
A = Z1Z′1 + Z2Z′2 + · · · + ZTZ′T = ZZ′ ∼ W(T,Σ), (1.87)
where Z = (Z1, . . . , ZT) is the n × T matrix of draws and T > n.
When n = 1,
A = z²1 + z²2 + · · · + z²T
is clearly a χ²-distribution with T degrees of freedom, scaled by Σ (a scalar in this case).
Suppose now that X1, X2, . . . , XT are independent N(µ,Σ) random vectors. The sample
covariance matrix is
S = [1/(T − 1)] Σ_{i=1}^T (Xi − X̄)(Xi − X̄)′,
where X̄ is the sample mean. It is well known that S is an unbiased estimator of Σ,
E(S) = Σ.
Moreover, the covariance between any two elements of S is
Cov(sij, skl) = (1/(T − 1))(σikσjl + σilσjk),
which is useful for computing the standard errors of the elements of S.
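The unbiasedness of S is easy to confirm by simulation; the sketch below averages the sample covariance matrix over many independent samples (the sample size and Σ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])
T, n_reps = 10, 50_000

# Draw n_reps independent samples, each of size T, from N(0, Sigma)
X = rng.multivariate_normal(np.zeros(2), Sigma, size=(n_reps, T))
Xc = X - X.mean(axis=1, keepdims=True)              # de-mean each sample
S = np.einsum("rti,rtj->rij", Xc, Xc) / (T - 1)     # sample covariance of each sample
S_bar = S.mean(axis=0)                              # average over samples

print(S_bar)  # close to Sigma, consistent with E(S) = Sigma
```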
1.8 Simple Models
Consider now a statistical model, or data-generating process, for a series of observed stock returns.
As mentioned earlier, one of the key assumptions we often make is the iid (independently and
identically distributed) assumption. This is the assumption underlying the simple linear regression
models used in many applications.
In this subsection, we review some of the most important properties of the linear regression.
Then we discuss ways to relax the iid assumption.
1.8.1 Univariate linear regression
To understand linear regressions well, the best starting point is the univariate linear regression,
ỹ = α + βx̃ + ε̃, (1.88)
where we want to model a linear relation between a random variable ỹ and a random variable x̃, such
as a stock return and the market return, with α and β as the parameters and ε̃ as the random
error.
The linear regression is usually written in terms of observations,
yi = α + βxi + εi, i = 1, . . . , n, (1.89)
where
yi is called the dependent variable, regressand, or left-hand variable;
xi is the independent variable, explanatory variable, regressor, or right-hand variable;
α is the intercept, a regression coefficient;
β is the slope, a regression coefficient;
εi is the residual, disturbance, or error (usually assumed to have mean 0 and to be homoscedastic:
uncorrelated across observations with identical variance; sometimes assumed normally distributed,
but generally just iid);
n is the number of observations, or sample size (in finance we usually use T instead of n).
How do we get the parameter estimates? We want to find estimated values αˆ and βˆ of the true but
unknown parameters α and β that provide the "best" fit, in some sense, for the data points. The
most common objective is to minimize the sum of squared errors,
Q(α, β) = Σ_{i=1}^n (yi − α − βxi)², (1.90)
which is why the resulting solution is called the ordinary least-squares (OLS) estimator.
Taking the first-order derivatives of Q(α, β) with respect to α and β and setting them to zero, the
solutions are
αˆ = ȳ − βˆx̄, (1.91)
βˆ = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)², (1.92)
where ȳ and x̄ are the sample means of the data. These are the formulas used in standard OLS
packages to compute the estimators.
The above formulas are quite easy to understand intuitively. Multiplying the random-variable
form of the linear regression (1.88) by the deviation of x̃ from its mean, x̃ − µx, and then taking
expectations on both sides, we obtain
E[(x̃ − µx)ỹ] = βE[(x̃ − µx)x̃],
that is,
β = cov(x̃, ỹ)/var(x̃),
which says that beta is the covariance between x̃ and ỹ divided by the variance of x̃ (recall, perhaps
from your Investments class, that in the CAPM regression beta is the covariance between the stock
and the market divided by the market variance). The earlier βˆ is simply the sample analogue of β.
Similarly, taking expectations in (1.88), we have
α = E(ỹ) − βE(x̃),
so αˆ is the sample analogue of α.
It may be noted that sometimes the regression is run without the intercept, i.e.,
ỹ = βx̃ + ε̃, (1.93)
is assumed to be the true model. In this case, the OLS estimator of β is
βˆ = Σ_{i=1}^n xiyi / Σ_{i=1}^n x²i,
which is clearly different from the case with the intercept: previously βˆ was computed from
de-meaned data, now from the raw data. Moreover, without the intercept, the covariance-over-variance
interpretation of β is generally no longer available. We will focus on the main case
with an intercept in the regression.
In practice, and in data science books, vector and matrix notations are common. Let y and x be
the vectors of observations,
y = (y1, . . . , yn)′, x = (x1, . . . , xn)′,
then the regression can be written as
y = α1n + βx + ε, (1.94)
where 1n is an n-vector of ones and ε is the n-vector of residuals. Let
X = [1n x]
be the n × 2 data matrix of the constant and the regressor; then the regression is often written in
the vector and matrix form
y = X(α, β)′ + ε. (1.95)
Moreover, the OLS estimator can also be written in matrix form,
(αˆ, βˆ)′ = (X′X)⁻¹X′y, (1.96)
which is the well-known analytical formula for the OLS estimator, and it generalizes easily
to the case of multiple regressors (see Section 1.8.2).
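Formula (1.96) is straightforward to implement; the sketch below simulates data from known parameters, recovers them via (X′X)⁻¹X′y, and confirms that the slope equals the sample covariance over the sample variance (all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
alpha_true, beta_true, sigma = 0.5, 1.5, 0.2

x = rng.normal(size=n)
y = alpha_true + beta_true * x + sigma * rng.normal(size=n)

X = np.column_stack([np.ones(n), x])        # n x 2 matrix [1_n, x]
coef = np.linalg.solve(X.T @ X, X.T @ y)    # (X'X)^{-1} X'y, equation (1.96)

# The slope equals cov(x, y)/var(x), the sample analogue of (1.92)
beta_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)
print(coef, beta_hat)
```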
How good is the fit? This is usually judged by the R², a measure of the proportion of the
variation in y that is explained by the variation in x:
R² = 1 − Σ_{i=1}^n (yi − ŷi)² / Σ_{i=1}^n (yi − ȳ)² = 1 − Variance_residual/Variance_total, (1.97)
where ȳ is the sample mean and ŷi = αˆ + βˆxi are the fitted values. It is clear that R² is between 0
and 1,
0 ≤ R² ≤ 1.
When it is 1, x explains y perfectly; in that case, the two must be perfectly correlated. When
R² is zero, x has nothing to do with y. Of course, in practice R² lies strictly between 0 and 1
and is not that extreme. Typically, in a CAPM regression for a large stock, an R² of 80% or 90%
is not uncommon. However, in predictive regressions of current values on past ones, the R² is
very low, in the range of 0 to 5%.
Mathematically, when you add an additional regressor to the regression, the R² will increase
by design, as more variables always help to explain more in-sample. However, out-of-sample (when you
apply the model to future or new data), it is typically the case that too many regressors
do worse, which is called over-fitting in statistics. Therefore, the adjusted R² is proposed to penalize the
number of regressors,
R²adj = 1 − (1 − R²)(n − 1)/(n − K − 1), (1.98)
where K is the number of regressors. So, everything else equal, the greater the K, the lower the
R²adj. Of course, the greater the R²adj, the better the fit of the linear regression. Note that 1 is
still an upper bound for R²adj that is unachievable in practice and can only be approached, while
R²adj can also take negative values.
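Both measures are a few lines of code; the sketch below computes (1.97) and (1.98) for an OLS fit on simulated data (all inputs are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 200, 1
x = rng.normal(size=n)
y = 0.1 + 0.8 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
coef = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ coef

r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)  # (1.97)
r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - K - 1)                  # (1.98)
print(r2, r2_adj)
```

Since (n − 1)/(n − K − 1) > 1 for K ≥ 1, the adjusted R² is always below the plain R² whenever the fit is imperfect.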
How accurate are the OLS estimators compared with the true values? This depends on the
assumptions we make on the linear regression model,
yi = α + βxi + εi, i = 1, . . . , n. (1.99)
There are 3 key assumptions:
1) εi has mean 0 and E(εi | xi) = 0;
2) the (yi, xi)'s are iid;
3) yi and xi have finite 4th moments (large outliers are unlikely).
Many books impose a normality assumption on εi. In that case, E(εi | xi) = 0 is equivalent to xi
and εi being uncorrelated. Normality is much stronger than Assumption 1) and is unnecessary asymp-
totically. However, the zero-mean assumption is always necessary to guarantee convergence of the
estimators. The condition E(εi | xi) = 0 ensures identification of the slope. Otherwise, there are missing
regressors correlated with the included one, and the OLS slope will fail to converge to the true β. An
example is the following regression:
Salary = a + b × Education + c × Ability + ε.
If Ability, which is correlated with Education, is missing, then the residual, c × Ability + ε, will
be correlated with Education. In this case, the OLS regression will likely produce a larger b than
otherwise, attributing too much effect to Education.
Assumption 2) is technical, and may be weakened to allow dependence of the data over time as
long as certain stationarity assumptions hold. Assumption 3) is important to ensure certain degrees
of accuracy.
Under those assumptions, the OLS estimators converge to the true parameters as the
sample size becomes large. We then have asymptotic confidence intervals for the estimators, and
asymptotic t-ratio tests.
It will be useful to analyze the often-assumed ideal case in which the residuals are iid normal,
εi ∼ N(0, σ²).
From (1.96), it is straightforward to show that the OLS estimator is jointly normally distributed too,
(αˆ, βˆ)′ ∼ N((α, β)′, σ²(X′X)⁻¹). (1.100)
In particular, it implies that
αˆ ∼ N(α, (1/n)(1 + θ²x)σ²), (1.101)
where
θx = x̄/std(x),
i.e., θx is the sample mean divided by the sample standard deviation of the regressor x. The above
formula says that αˆ is an unbiased estimator of α, and it is normally distributed with variance
(1/n)(1 + θ²x)σ². As the sample size n grows, αˆ becomes more accurate.
In practice, σ² is unknown, but it can be estimated using the realized residuals. Let σˆ² be the
estimator. Then, due to the error in estimating σ², the standardized alpha, αˆ/(σˆ√((1 + θ²x)/n)),
follows a t-distribution instead of a normal even if the residuals are assumed normal here. This is the
traditional and popular t-ratio, which is often used to test whether or not α is zero.
What is the impact of using too many regressors in the OLS regression? The more the re-
gressors, the higher the R², or the better the in-sample fit. But the adjusted R² may go down.
Importantly, the estimation error tends to grow with the number of regressors, and so do the forecasting
errors. In general, too good an in-sample fit of the model leads to worse out-of-sample forecasting (see
Section 10.4.3).
1.8.2 Multiple linear regression
When there are multiple regressors, say K of them, the linear regression model becomes
yi = β0 + β1xi1 + · · · + βKxiK + εi, i = 1, . . . , n. (1.102)
Let
y = (y1, . . . , yn)′, X = [1n x],
where x is the n × K matrix of regressors, so that the i-th row of X is (1, xi1, . . . , xiK); then the
regression can be written as
y = Xβ + ε, (1.103)
where
β = (β0, β1, . . . , βK)′
is a (K + 1)-vector of the parameters.
It is easy to verify that the OLS estimator still has the same form as before,
βˆ = (X′X)⁻¹X′y. (1.104)
While almost all results on the univariate regression carry through to the multiple-regressor case,
there is one important error arising from using K regressors. To see this, let us define the L2 norm,
a commonly used notation in both statistics and data science. For any vector a = (a1, . . . , am)′, its
squared norm, or squared L2 distance, is
||a||² = a²1 + a²2 + · · · + a²m, (1.105)
which is the sum of the squares of the components of the vector. The norm itself is the square
root,
||a|| = √(a²1 + a²2 + · · · + a²m).
With the new notation, we can write the sum of squared errors (see (1.90)) as ||y − Xβ||², and the OLS
estimator βˆ as the solution to
min_β ||y − Xβ||².
Under the assumption that the errors are iid,
εi ∼ IID(0, σ²),
the key result concerns the expected error in estimating the true betas (see, e.g., Giraud, 2015,
p. 8),
E[||βˆ − β||²] = (K + 1)σ², (1.106)
when the data X are standardized to have orthonormal columns. The above equation says that the
expected squared error is proportional to σ², with scalar K + 1. When there is one regressor, the error
is 2σ², but it grows to 100σ² when there are 99 regressors! This says that regressions will
not work well if there are too many regressors, posing a challenge that calls for machine learning
methods for dimension reduction (see Chapter 10 and references therein).
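Result (1.106) can be checked by simulation: with orthonormal columns, X′X = I, so βˆ = β + X′ε and the expected squared error is (K + 1)σ². A sketch under those assumptions (the dimensions and parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, sigma = 100, 9, 0.5        # K + 1 = 10 coefficients

# Orthonormal design: Q from a QR decomposition has orthonormal columns
Q, _ = np.linalg.qr(rng.normal(size=(n, K + 1)))
beta = rng.normal(size=K + 1)

n_reps = 20_000
err2 = np.empty(n_reps)
for r in range(n_reps):
    y = Q @ beta + sigma * rng.normal(size=n)
    beta_hat = Q.T @ y           # (X'X)^{-1} X'y reduces to X'y since X'X = I
    err2[r] = np.sum((beta_hat - beta) ** 2)

print(err2.mean())  # close to (K + 1) * sigma**2 = 2.5
```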
1.8.3 Autocorrelations
The fundamental assumption made so far is the iid assumption. There has been a vast amount of
research that relaxes this assumption by fitting the data with various time series models, such as
ARMA, ARCH, and GARCH.
A common way of examining whether the data are time-dependent is to compute the sample
autocorrelations of the data,
ρˆτ = Σ_{t=1}^{T−τ} (Rt − µˆ)(Rt+τ − µˆ) / Σ_{t=1}^{T} (Rt − µˆ)². (1.107)
If the data are independently distributed over time, R̃t should be independent of R̃t+τ, and so
the computed ρˆτ should be close to zero. For a large sample size T, ρˆτ is approximately normally
distributed,
ρˆτ ∼ N(0, 1/T), (1.108)
if the data are iid. So the standard error of ρˆτ is roughly 1/√T, and if ρˆτ is away from zero by 2
standard deviations, we may reject the independence assumption.
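Equation (1.107) and the 2/√T rule are simple to apply; the sketch below computes the first few sample autocorrelations of simulated iid returns, for which no lag should be significant (the data are synthetic):

```python
import numpy as np

def autocorr(r, tau):
    """Sample autocorrelation at lag tau, as in equation (1.107)."""
    d = r - r.mean()
    return np.sum(d[:-tau] * d[tau:]) / np.sum(d ** 2)

rng = np.random.default_rng(5)
T = 2_000
returns = rng.normal(0.0005, 0.01, size=T)   # iid "daily returns"

band = 2.0 / np.sqrt(T)                      # two-standard-error band
rhos = [autocorr(returns, tau) for tau in range(1, 6)]
print(rhos, band)
```

For iid data, each estimate should lie within roughly ±band; for real daily returns, some short lags often fall outside it.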
1.8.4 Time series models
The simplest model for stock returns is
Rt = µ + εt, εt ∼ N(0, σ²). (1.109)
It says that the returns are iid normal with constant mean µ and variance σ². This is the same
model mentioned earlier in (1.44).
In the real world, even if we believe that the long-term expected return is constant, the expected
return may change over time conditional on information variables. For example, conditional on the
oil price being high or low, our expected return on the stock market may be different. A simple
model to reflect this may be
Rt = µ + αzt−1 + βRt−1 + εt, εt ∼ N(0, σ²), (1.110)
where zt−1 is the information variable. The above model also allows for dependence of
the returns on their own past. In general, zt−1 can be a vector of available information, and extra lags of
returns can also be included in the above regression, to get a model like
Rt = µ + α′zt−1 + β1Rt−1 + · · · + βpRt−p + εt + γ1εt−1 + · · · + γqεt−q, εt ∼ N(0, σ²), (1.111)
which is the standard ARMA(p, q) time series model plus regressors.
In equation (1.110), the expected return conditional on {zt−1, Rt−1},
E[Rt | zt−1, Rt−1] = µ + αzt−1 + βRt−1, (1.112)
changes over time and varies with the information variables. However, the conditional volatility is
constant,
Var[Rt | zt−1, Rt−1] = Var[εt] = σ². (1.113)
This is unrealistic in applications.
To model the time-varying volatility, Engle (1982), in Nobel-prize-winning work, proposes to use
Rt = µ + εt, εt | It ∼ N(0, σ²t), (1.114)
σ²t = a0 + a1ε²t−1 + · · · + apε²t−p, (1.115)
a0 > 0, a1, . . . , ap ≥ 0, (1.116)
where It stands for the information available before the time-t return is realized. Notice that Rt in
(1.114) is, for simplicity, assumed to have a constant mean µ. However, the variance of Rt conditional
on past information is σ²t, a function of time. In other words, the conditional volatility σt is now
time-varying, which introduces heteroscedasticity across time. How does this volatility, say daily, change
over time? Equation (1.115) assumes that it depends on the shocks to the returns over the previous
day and up to p days back. A surprisingly large drop in the stock market yesterday is likely to increase
the volatility (vol) today. Since the dependence is of a regression type, the model is known as an
autoregressive conditional heteroscedasticity model, or ARCH(p).
Bollerslev (1986) generalizes ARCH(p) into GARCH(p, q) by adding q past volatilities to the
vol regression,
Rt = µ + εt, εt | It ∼ N(0, σ²t), (1.117)
σ²t = a0 + a1ε²t−1 + · · · + apε²t−p + b1σ²t−1 + · · · + bqσ²t−q, (1.118)
a0 > 0, a1, . . . , ap, b1, . . . , bq ≥ 0. (1.119)
The simplest GARCH model is GARCH(1,1),
Rt = µ + εt, εt | It ∼ N(0, σ²t), (1.120)
σ²t = ω + aε²t−1 + bσ²t−1, (1.121)
ω > 0, a, b ≥ 0, a + b < 1, (1.122)
which has only one lag of each regressor. It has only three parameters, and hence is easy to
estimate in practice. Software in many programming languages, such as Excel, Matlab, R, or
Python, is available for the estimation. GARCH(1,1) is the generic or 'vanilla' GARCH model
used by many financial institutions.
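To see how (1.120)–(1.122) generate time-varying volatility, one can simulate the recursion directly; the sketch below uses made-up parameter values in the typical daily range and checks the sample vol against the long-run vol implied by (1.126):

```python
import math
import random

random.seed(6)

# Hypothetical daily GARCH(1,1) parameters satisfying a + b < 1
mu, omega, a, b = 0.0005, 2e-6, 0.08, 0.90

T = 100_000
var_t = omega / (1.0 - a - b)       # start at the long-run variance, equation (1.126)
returns = []
for _ in range(T):
    eps = math.sqrt(var_t) * random.gauss(0.0, 1.0)
    returns.append(mu + eps)
    var_t = omega + a * eps**2 + b * var_t   # the recursion (1.121)

long_run_vol = math.sqrt(omega / (1.0 - a - b))
sample_vol = math.sqrt(sum((r - mu) ** 2 for r in returns) / T)
print(long_run_vol, sample_vol)
```

The simulated path shows volatility clustering (quiet and turbulent stretches), while the sample vol stays near the stationary level of 1% per day.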
Technically, GARCH(1,1) is useful for two reasons. First, complex GARCH models require the
estimation of more parameters, which turns out to be unnecessary, as the maximum likelihood function
becomes flat as parameters are added. Second, GARCH(1,1) does capture most of the salient features
of the data, which ARCH(p) fails to do. To see this, we apply (1.121) recursively and get
σ²t = ω + aε²t−1 + b(ω + aε²t−2 + bσ²t−2) (1.123)
= ω + aε²t−1 + b(ω + aε²t−2 + b(ω + aε²t−3 + b(· · · ))) (1.124)
= ω/(1 − b) + a(ε²t−1 + bε²t−2 + b²ε²t−3 + · · · ). (1.125)
This says that the GARCH(1,1) model is in effect an ARCH model of infinite order, with coefficients
declining exponentially in weighting the past shocks.
The condition a + b < 1 in equation (1.122) is the stability condition of the model. If it
holds, GARCH(1,1) is a stationary process for which the unconditional or long-term vol exists, and is
equal to
σ² = Var(Rt) = ω/(1 − a − b). (1.126)
For stock returns, the stability condition is satisfied, though the estimates of a + b are close to 1.
For currencies/exchange rates and commodity prices, the estimates are even closer to 1.
If a + b = 1, we can reparameterize GARCH(1,1) into
Rt = µ + εt, εt | It ∼ N(0, σ²t), (1.127)
σ²t = ω + (1 − λ)ε²t−1 + λσ²t−1, (1.128)
ω > 0, 0 ≤ λ ≤ 1. (1.129)
This is known as an integrated GARCH, or I-GARCH.
In contrast to ARCH or GARCH, the simplest model for time-varying volatilities (usually daily)
is
σ²t = (1 − λ)R²t−1 + λσ²t−1. (1.130)
Note that the variance estimated from the single observation Rt−1 is simply R²t−1, if we ignore
the virtually zero mean of the daily return. So the right-hand side is a weighted average of the vol
estimated from the most recent data point and yesterday's vol. RiskMetrics fixes the value of λ at
0.94, so that no estimation is needed in their applications. If we apply (1.130) recursively, as in
the GARCH case, it is easy to see that
σ²t = (1 − λ)(R²t−1 + λR²t−2 + λ²R²t−3 + · · · ) = (1 − λ) Σ_{i=1}^∞ λ^{i−1} R²t−i, (1.131)
i.e., the vol is an infinite exponentially weighted moving average (widely known as EWMA) of
the squared returns. For further reading on time series models, see Alexander (2001) and your
Econometrics texts.
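The EWMA recursion (1.130) is a two-line update; the sketch below runs it with the RiskMetrics λ = 0.94 on a synthetic return series:

```python
import random

random.seed(7)
lam = 0.94                          # the RiskMetrics decay parameter
daily_returns = [random.gauss(0.0, 0.01) for _ in range(1_000)]

var_t = daily_returns[0] ** 2       # initialize with the first squared return
for r in daily_returns[1:]:
    var_t = (1.0 - lam) * r**2 + lam * var_t   # equation (1.130)

ewma_vol = var_t ** 0.5
print(ewma_vol)  # an exponentially weighted estimate of the ~1% daily vol
```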
2 Portfolio Choice 1: Mean-variance Theory
In this chapter, we discuss strategies for selecting a portfolio among N risky securities, perhaps
with the addition of the riskfree asset. We first present a few ad hoc rules. Then we derive and discuss
the optimal portfolio rules under the popular mean-variance framework, in which investors care
only about the means and covariances of the assets. In other words, we examine mainly the case
in which the portfolio risk and return are determined by the asset means and covariances only.
2.1 Ad hoc rules
In this subsection, we discuss 5 ad hoc portfolio selection rules: equal-weighting, value-weighting,
volatility-weighting, risk parity, and the global minimum-variance portfolio. Although these rules
are not optimal under standard assumptions such as mean-variance utility, they are widely used
in practice in different contexts.
All 5 rules are strategies for investing in N risky assets only. When there is a riskfree
asset, the rules should be modified based on the needs of the application.
2.1.1 Equal-weighting: 1/N
Suppose there are N assets with returns R1, . . . , RN. The equal-weighting rule divides your
money equally across the assets, with weights
wnaive = (1/N, 1/N, . . . , 1/N)′, (2.1)
i.e., the portfolio weights across assets are the same and equal to 1/N. This is known as the
naive rule because it is a simple and intuitive way of investing. The resulting portfolio return is
Rp,e = (1/N)R1 + (1/N)R2 + · · · + (1/N)RN, (2.2)
which simply adds the returns with weight 1/N, i.e., the average return across assets (different from
the usual average asset return, which is computed across time).
Example 2.1 There are 3 stocks with prices $20, $40, and $50, respectively. Then, based on
the 1/N rule, we invest 1/3 of our money in each, regardless of how expensive or how good each
company is. If we have $3,000, then we invest $1,000 in each stock. ♠
In practice, investors often use the 1/N rule in placing bets on ideas, on sectors of the stock
market, or on asset classes. The 1/N rule is simple, and it is useful when the estimation errors in the
asset expected returns are large. But better strategies are available even when we worry about
estimation risk, a topic discussed later in Section 3.5.
2.1.2 Value-weighting
Let V1, . . . , VN be the values of the N assets. The value-weighted portfolio return is
Rp,V = [V1/(V1 + · · · + VN)]R1 + [V2/(V1 + · · · + VN)]R2 + · · · + [VN/(V1 + · · · + VN)]RN, (2.3)
where the portfolio weight on asset i is
wi = Vi/(V1 + · · · + VN), i = 1, 2, . . . , N, (2.4)
and the weights sum to one.
Example 2.2 If there are 2 stocks whose market values are $20,000 and $80,000, respectively, then
our portfolio weights on the 2 stocks are
w1 = 20,000/100,000 = 0.20; w2 = 80,000/100,000 = 0.80.
Note that only the information on the values of the firms is used, regardless of the current prices
or economic outlooks of the companies. ♠
Value-weighting is very popular in practice. Almost all stock indices are value-weighted (the Dow is
an exception: it is price-weighted, holding an equal number of shares of each stock in the index);
in particular, the S&P 500 is. In a value-weighted portfolio, one holds all the assets, with the holding
of each proportional to its value relative to the total market value. Almost all index funds invest their
money via value-weighting, so no research or stock analysis is needed, which is why their
costs are low (virtually zero except for bookkeeping and occasional trading due to dividend
reinvestment, redemptions, or additions/removals of stocks in the indices).
2.1.3 Volatility-weighting
Volatility is one of the most important factors for portfolio selection. Let σ1, . . . , σN be the volatil-
ities (standard deviations) of the N assets. The volatility-weighted portfolio weights the assets in
proportion to the inverses of their variances,
Rp,σ² = (1/σ²1)R1 + (1/σ²2)R2 + · · · + (1/σ²N)RN. (2.5)
It is clear that the greater the volatility of an asset, the less weight we put on that asset. Since
the above weights do not sum to one, the normalized weights are
wi = (1/σ²i) / (1/σ²1 + · · · + 1/σ²N), i = 1, 2, . . . , N, (2.6)
so that the sum is 1 by construction. Hence, the volatility-weighted portfolio is fully determined by
Rp,σ² = w1R1 + w2R2 + · · · + wNRN, (2.7)
which is the portfolio return per dollar invested based on volatility information alone.
Note that the weight on the first asset is inversely related to σ²1, not to σ1! Hence, the
commonly known volatility-weighting defined above should really be called inverse-variance-weighting.
The true volatility-weighting, or inverse-volatility-weighting, portfolio is
Rp,σ = w1R1 + w2R2 + · · · + wNRN, (2.8)
with
wi = (1/σi) / (1/σ1 + · · · + 1/σN), i = 1, 2, . . . , N. (2.9)
In contrast to the earlier case, it is the volatility (the square root of the variance) that determines the
portfolio.
The question is why people use variances to inversely weight the assets. The reason is that
such weights are optimal in minimizing the portfolio risk if the asset returns are independent of
one another (see Section 2.1.5). In general, the optimal portfolio is related to the inverse of the
covariance matrix (see Sections 2.2 and 2.7), not directly and linearly related to the volatilities per
se.
Example 2.3 If two stocks have 20% and 40% volatility, respectively, then
w1 = (1/.2²) / (1/.2² + 1/.4²) = 80%
and w2 = 1 − w1 = 20%. Note that the weight is a nonlinear function of the volatilities. Here the
second stock has twice the volatility of the first, but its weight is not half of the former's; it is only
1/4 of it. However, the true (inverse) volatility-weighting gives
w1 = (1/.2) / (1/.2 + 1/.4) = 66.67%,
lower than before. The reason is that, in terms of volatility, the second stock is twice as big, but in
terms of variance it is 4 times as large (.4²/.2² = 4), so you invest more in the first. ♠
Some active funds use volatility weighting to effectively reduce the volatility of a portfolio.
If the stock returns are independent, the strategy generates the portfolio with the minimum
volatility. However, if the stocks are correlated, as they are in the real world, volatility weighting
will not theoretically attain the minimum-volatility portfolio, because the correlation information can
be used to reduce risk further (see Section 2.1.5). In the real world, the estimation of the correlations is
noisy, so it is unclear which of the two strategies will do better. That is why some managers still
use volatility weighting for certain investments.
It should be noted that the 1/N rule is a special case of volatility weighting: when all the
volatilities are taken as equal, clearly wi = 1/N for all i. So volatility weighting is more
general than 1/N, and it tends to do slightly better than 1/N since it incorporates volatility
information into decision making, and volatility can usually be estimated fairly accurately.
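The two weighting schemes in (2.6) and (2.9) differ only in the exponent on σ; the sketch below reproduces the numbers of Example 2.3:

```python
def normalized_weights(vols, power):
    """Weights proportional to 1/sigma**power, normalized to sum to one."""
    raw = [1.0 / v**power for v in vols]
    total = sum(raw)
    return [x / total for x in raw]

vols = [0.20, 0.40]
inv_var = normalized_weights(vols, 2)   # inverse-variance weighting, eq. (2.6)
inv_vol = normalized_weights(vols, 1)   # inverse-volatility weighting, eq. (2.9)
print(inv_var, inv_vol)  # [0.8, 0.2] and [0.667, 0.333] approximately
```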
2.1.4 Risk parity
Risk parity is a portfolio rule that puts equal weight on the risk contribution of each asset, so it
is also known as the equal-risk or equally-weighted risk portfolio. Note that the volatilities of the
assets can differ from one another; it is just that each asset, with its weight, contributes equally to
the total volatility of the portfolio.
For simplicity, consider first the two-asset case with volatilities σ1 and σ2 and correlation ρ.
The portfolio is
Rp = w1R1 + w2R2, (2.10)
where R1 and R2 are the returns of the assets. Recall from any standard investments text that the
portfolio volatility is
σp = √(w²1σ²1 + 2ρw1w2σ1σ2 + w²2σ²2). (2.11)
The contribution to σp of the first asset is its weight times its per-unit contribution, that is,
C1 = w1 × ∂σp/∂w1 = (w²1σ²1 + ρw1w2σ1σ2)/σp. (2.12)
Similarly, or by symmetry, the risk contribution of the second asset is
C2 = (w²2σ²2 + ρw1w2σ1σ2)/σp. (2.13)
For the two to have equal contributions, we want C1 = C2, i.e.,
w²1σ²1 = w²2σ²2 = (1 − w1)²σ²2.
Hence the solution is
w1 = σ1⁻¹/(σ1⁻¹ + σ2⁻¹), w2 = σ2⁻¹/(σ1⁻¹ + σ2⁻¹). (2.14)
Note that in the two-asset case the correlation plays no role: risk parity is the same as the
true (inverse-) volatility-weighting. But this will not be true in general when there are N > 2 assets.
Example 2.4 Suppose that there are two stocks whose volatilities are 20% and 40%, respectively.
Then the weights are
w1 = (1/.2)/(1/.2 + 1/.4) = 67%, w2 = 33%.
One can verify that .67 × .2 ≈ .33 × .4 ≈ .13 (up to rounding errors, as we rounded w1
and w2), i.e., both assets contribute equally to the risk of the portfolio. ♠
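A quick numerical check of (2.11)–(2.14): with the volatilities of Example 2.4 and an arbitrarily chosen correlation (which plays no role in the two-asset case), the two risk contributions come out exactly equal:

```python
import math

s1, s2, rho = 0.20, 0.40, 0.5   # rho is an arbitrary choice here

# Risk-parity weights from equation (2.14)
w1 = (1 / s1) / (1 / s1 + 1 / s2)
w2 = 1.0 - w1

# Portfolio vol (2.11) and the risk contributions (2.12)-(2.13)
sp = math.sqrt(w1**2 * s1**2 + 2 * rho * w1 * w2 * s1 * s2 + w2**2 * s2**2)
c1 = (w1**2 * s1**2 + rho * w1 * w2 * s1 * s2) / sp
c2 = (w2**2 * s2**2 + rho * w1 * w2 * s1 * s2) / sp

print(w1, c1, c2)  # w1 = 2/3; c1 equals c2, and c1 + c2 equals sp
```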
A typical allocation advice from investment advisors is to invest about 60% in stocks and 40%
in bonds. This implies that the bulk of the portfolio risk comes from the stock portion, given
the much greater stock volatility. To see this, assume the two have 20% and 12% volatilities with no
correlation. Then
σp = √(0.6² × 0.2² + 0.4² × 0.12²) = 12.92%, C1 = 11.14%, C2 = 1.78%,
and so the stock risk share is C1/σp ≈ 86%. In contrast, the risk parity portfolio has weights
w1 = 37.5%, w2 = 62.5%.
With these weights, the stocks account for half of the risk of the entire portfolio.
In applications, risk parity managers attempt to equalize risk across asset classes such as stocks,
bonds, commodities, real estate, and currencies. During the 2008–09 financial crisis, stocks lost about
50% while bonds were up, so the risk parity portfolio clearly did better. In the long run,
however, it will under-perform the traditional portfolio, as the mean return of bonds is lower than
that of the stocks. Some portfolio managers argue that one can use leverage to increase the return of
the entire portfolio so as to be comparable to, or beat, the traditional asset allocation. But whether this
is true is unclear; theoretically, it seems unlikely to succeed across all market regimes.
When N > 2, in the special but unrealistic case in which the correlation between every pair of assets
is the same, the weight on the i-th asset is analytically obtainable,
wi = σi⁻¹ / Σ_{j=1}^N σj⁻¹. (2.15)
This is the same as the true (inverse-) volatility-weighting.
However, when the correlations differ across the assets, as is the case in the real world, the
correlations will matter, and there are no simple formulas for the portfolio weights.
Denote by Σ the covariance matrix of the assets. It is well known that the volatility of the portfolio
with weights w is
σ(w) = √(w′Σw). (2.16)
The risk contribution of asset i is
σi(w) = wi × ∂σ(w)/∂wi = wi(Σw)i/√(w′Σw),
where (Σw)i denotes the i-th element of the vector Σw. Equal contribution means
σi(w) = σ(w)/N, implying that
wi = w′Σw/(N(Σw)i). (2.17)
Note that wi also appears on the right-hand side, so the above is not an analytical solution. To
find the wi's that make the equation hold, we can solve the minimization
problem
min_w Σ_{i=1}^N [wi − w′Σw/(N(Σw)i)]², (2.18)
subject to the constraint that all the weights sum to 1. The solution has to be found numerically,
via Python, Matlab, or R. Maillard, Roncalli, and Teiletche (2010) provide further properties of the
equal-risk portfolio.
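One simple numerical approach is a damped fixed-point iteration on wi ∝ 1/(Σw)i, which equalizes the contributions wi(Σw)i at convergence; this is a heuristic sketch rather than a guaranteed algorithm. It uses an equal-correlation covariance matrix, for which the analytic answer is the inverse-vol weighting of (2.15):

```python
import numpy as np

# Illustrative covariance matrix: vols 20%, 30%, 40% with common correlation 0.5,
# so the analytic risk-parity answer is inverse-vol weighting, as in (2.15)
vols = np.array([0.2, 0.3, 0.4])
corr = np.full((3, 3), 0.5) + 0.5 * np.eye(3)
Sigma = np.outer(vols, vols) * corr

# Damped fixed-point iteration: at the solution, w_i * (Sigma w)_i is the same
# for all assets, so w_i is proportional to 1/(Sigma w)_i
w = np.ones(3) / 3
for _ in range(500):
    w_new = 1.0 / (Sigma @ w)
    w_new /= w_new.sum()
    w = 0.5 * (w + w_new)

contrib = w * (Sigma @ w)             # proportional to each asset's risk share
print(w, contrib / contrib.sum())
```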
2.1.5 Global minimum-variance portfolio
In practice, asset expected returns/means are difficult to estimate, and so many investors/fund
managers simply ignore the means (which typically may not differ much across similar stocks)
and focus on minimizing the risk, to obtain the minimum-risk portfolio, called the global
minimum-variance portfolio (GMV).
Consider first the case of two risky assets. In terms of the earlier notation, we want to minimize
σp, which is equivalent to minimizing its square (the variance),
σ²p = w²1σ²1 + 2ρw1w2σ1σ2 + w²2σ²2. (2.19)
Plugging in w2 = 1 − w1 and then taking the derivative with respect to w1, we have
dσ²p/dw1 = 2w1σ²1 + 2ρ(1 − w1)σ1σ2 − 2ρw1σ1σ2 − 2(1 − w1)σ²2 = 0.
Solving for w1, we obtain
w1 = (σ²2 − ρσ1σ2)/(σ²1 − 2ρσ1σ2 + σ²2). (2.20)
This is the weight on the first asset that minimizes the portfolio risk (the weight on the second
asset is w2 = 1 − w1). The formula makes intuitive sense: if the second asset is riskier, i.e., σ²2 is
larger, we should put more weight on the first asset.
Example 2.5 Suppose the vol of the first stock is 20%, that of the second is 40%, and the correlation is
20%. Then
w1 = (0.4² − 0.2 × 0.2 × 0.4)/(0.2² − 2 × 0.2 × 0.2 × 0.4 + 0.4²) = 0.8571,
that is, you invest 85.71% of your money in the first asset, and the rest of your money in the
second. Your portfolio will then have the minimum risk possible. What is the minimum risk? This
will be computed below. ♠
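Formula (2.20), together with (2.19), also yields the minimized risk; the sketch below uses 20% and 40% vols with an assumed 20% correlation:

```python
import math

s1, s2, rho = 0.20, 0.40, 0.20

# GMV weight on asset 1, equation (2.20)
w1 = (s2**2 - rho * s1 * s2) / (s1**2 - 2 * rho * s1 * s2 + s2**2)
w2 = 1.0 - w1

# The minimized portfolio volatility, from (2.19)
sp = math.sqrt(w1**2 * s1**2 + 2 * rho * w1 * w2 * s1 * s2 + w2**2 * s2**2)
print(w1, sp)  # w1 = 0.8571; sp is just below 20%, the vol of the safer stock
```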
It will be useful to consider a few special case. If ρ = 1, i.e., the two stocks are perfectly positively
correlated, we can eliminate the risk entirely by buying one and shorting another. Indeed, assume
σ2 > σ1 without loss of generality. We have
w1 =
σ2
σ2 − σ1 , w2 = −
σ1
σ2 − σ1 ,
Then we long the first and short the second, and the portfolio risk is zero. Now if ρ = −1, i.e., the
two stocks are perfectly negatively correlated, then we can buy both,
w1 = σ2/(σ1 + σ2),  w2 = σ1/(σ1 + σ2),
to minimize the risk to zero.
In practice, |ρ| = 1 is impossible, so we will rule this case out in what follows. Then the
minimum-risk portfolio must have positive risk, or equivalently the covariance matrix Σ must be
positive definite, and in particular invertible.
If ρ = 0, the above becomes the inverse variance weighting strategy, w1 = σ2²/(σ1² + σ2²). If the
variances are equal, the portfolio becomes the equal-weighted one (w1 = 1/2).
When there are N > 2 risky assets, one can still derive, in a similar fashion just with matrix
algebra, the weights of the minimum-variance portfolio, known also as global minimum-variance
portfolio (GMV),
wg = Σ⁻¹1N / (1′N Σ⁻¹ 1N), (2.21)
where 1N is an N ×1 vector of ones, and Σ is the covariance matrix of the N assets and is assumed
invertible. Although matrix inversion is involved, the portfolio weights on the N assets, wg, are
straightforward to compute using Python, Matlab or R.
Note that the minimized variance risk of the GMV portfolio is

Var(Rp) = 1/(1′N Σ⁻¹ 1N) > 0, (2.22)

which is derived by simply plugging wg into the portfolio risk formula. Note that the risk cannot
be eliminated completely, only minimized. The reason is that the invertibility of Σ assumes
that no asset is redundant, that is, no asset return can be written as a linear combination of the
others. This in particular rules out any perfect correlation between any pair of assets. For stocks,
this assumption is clearly true in practice. As a result, there is always non-zero risk for any stock
portfolio. If the variance were zero, the portfolio return would be a constant, and we could then
solve for one stock return as a linear combination of the rest, contradicting the invertibility
assumption.
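For general N, (2.21) and (2.22) take one line of linear algebra each. A minimal NumPy sketch, where the three-asset covariance matrix is a made-up illustration (solving a linear system is preferable to forming Σ⁻¹ explicitly):

```python
import numpy as np

# Hypothetical covariance matrix of N = 3 assets (for illustration only).
Sigma = np.array([[0.04, 0.01, 0.02],
                  [0.01, 0.09, 0.03],
                  [0.02, 0.03, 0.16]])
ones = np.ones(3)

# GMV weights, equation (2.21): wg = Sigma^{-1} 1 / (1' Sigma^{-1} 1).
x = np.linalg.solve(Sigma, ones)      # x = Sigma^{-1} 1
wg = x / (ones @ x)

# Minimized variance, equation (2.22): 1 / (1' Sigma^{-1} 1).
var_gmv = 1.0 / (ones @ x)

print(wg.round(4), round(var_gmv, 5))
```

The GMV variance is necessarily no larger than the variance of any single asset, since holding one asset alone is itself a feasible fully invested portfolio.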
It is of interest to see how the above formula works when N = 2. Given the covariance matrix,
we can find analytically its inverse, based on (1.78) and the discussions there,
Σ⁻¹ = (1/det(Σ)) [σ2², −ρσ1σ2 ; −ρσ1σ2, σ1²],  Σ = [σ1², ρσ1σ2 ; ρσ1σ2, σ2²],

where the determinant det(Σ) = σ1²σ2² − ρ²σ1²σ2² > 0 under the assumption that |ρ| < 1 (so that Σ
is invertible). Then

Σ⁻¹1N = (1/det(Σ)) [σ2² − ρσ1σ2 ; σ1² − ρσ1σ2],  1′N Σ⁻¹ 1N = (σ1² − 2ρσ1σ2 + σ2²)/det(Σ).
The first element of their ratio is exactly the weight on the first asset as given by (2.20), and the
second element is the weight on the second asset. Moreover, we obtain the minimized variance risk
Var(Rp) = σ1²σ2²(1 − ρ²)/(σ1² − 2ρσ1σ2 + σ2²). (2.23)
This formula can be easily computed by hand instead of using Python.
Example 2.6 (continued from Example 2.5) The variance risk is

Var(Rp) = 0.2² ∗ 0.3² ∗ (1 − 0.5²)/(0.2² − 2 ∗ 0.5 ∗ 0.2 ∗ 0.3 + 0.3²) = 0.03857143,

and so the vol is √Var(Rp) = 19.64%, not much lower than 20%, the vol of the less risky asset. ♠
There are two important remarks on the GMV portfolio. First, it is the portfolio that has the
lowest risk among all possible portfolios, regardless of the values of the expected returns on the
stocks. However, given information on the stock expected returns, one can design a portfolio
for a desired level of expected portfolio return, say 12% per year, with the minimum risk
permissible. This portfolio will have no smaller risk than the GMV, by definition of the latter, but
it has the minimum risk among those portfolios that earn a 12% expected return per year. The next
two subsections address this issue for the cases without a riskless asset and with it, respectively.

Second, in practice, it is never easy to estimate the expected returns. The GMV avoids this
problem by not using expected returns at all, but it still requires estimating the covariance
matrix. When N is large, it is difficult to get a good estimate of Σ. This issue will be discussed
further in Sections 6.4, 4.4.3, and 4.4.7.
It may also be noted that the inverse volatility weighting is a special case of the GMV when
the assets are assumed uncorrelated. In this case, Σ is a diagonal matrix,
Σ = diag(σ1², σ2², . . . , σN²), (2.24)
and its inverse is obvious,

Σ⁻¹ = diag(1/σ1², 1/σ2², . . . , 1/σN²). (2.25)
Then, multiplying out the terms, the GMV is indeed the same as the volatility-weighting (or more
accurately, inverse variance-weighting), in the zero correlation case.
2.2 MV Optimal portfolio: Riskfree asset case
The mean-variance framework is not only used by many practitioners, but is also useful for
understanding a variety of issues involved in portfolio choice and asset pricing. Now we
assume that there is a riskfree asset available in addition to the N risky assets. This is the case most
investment books focus on.

For pedagogical reasons, we consider first the single risky asset case, then the two risky assets case,
and finally the multiple assets case. We also provide an alternative and equivalent formulation that
maximizes return for a given level of risk.
2.2.1 One risky asset
Consider the problem of an investor who allocates money between the stock index and the
money market. Let rt and rf be the returns on the market and the riskfree investment, respectively.
Then the return on the portfolio is

Rpt = wrt + (1 − w)rf, (2.26)

where w is the fraction invested in the risky asset and (1 − w) is the fraction invested in the riskfree asset.
If the investor's initial wealth is W0, then the next period wealth is W = W0(1 + Rpt).
Rewrite Rpt as
Rpt = w(rt − rf ) + rf = wRt + rf , (2.27)
where
Rt ≡ rt − rf (2.28)
is known as the excess return or return in excess of the riskfree rate. In most asset pricing models,
we assume that there is the riskfree asset, approximated by the Treasury bill returns in practice
when the investment horizon is short, say a month. As a result, most empirical research uses excess
returns on asset rather than the original or raw returns.
The popular assumption in portfolio analysis is that the market excess return is iid normally
distributed:
Rt = µ + εt, (2.29)

where εt has a normal distribution with mean zero and variance σ², and µ is the expected excess return
on the market. It is then easy to verify that the mean and variance of the portfolio are

E[Rpt] = wµ + rf,  Var[Rpt] = w²σ². (2.30)
In the mean-variance framework, the investor is assumed to care only about the mean and variance
of the portfolio, preferring higher mean and lower variance.

Note that a preference must be specified to determine the optimal portfolio. Assume the standard
mean-variance utility,

U(w) = E[Rpt] − (γ/2)Var[Rpt] = rf + wµ − (γ/2)w²σ², (2.31)
where γ is the coefficient of relative risk aversion, i.e., the trade-off parameter between risk and
return. Then the investor chooses w to maximize U(w). Taking the derivative and setting it to
zero, we get the first-order condition (FOC):

dU(w)/dw = µ − γwσ² = 0,

and hence the optimal choice is

w = (1/γ)(µ/σ²), (2.32)
which is proportional to the mean-variance ratio of the asset. The formula is intuitively clear.
The greater the expected return or the lower the risk, the more money the investor invests in
the risky asset. On the other hand, everything else being equal, the more risk-averse the investor
is (larger γ), the less money is invested in the risky asset.
Example 2.7 Assume that the riskfree asset earns 3% (per year) and the risky asset has an
expected return of 12% and a volatility, σ, of 20%. Then the portfolio return is

Rpt = wrt + (1 − w) × 3% = w(rt − 3%) + 3%,

and µ = E(rt − 3%) = 9%. If γ = 2.8, then

w = (1/2.8)(0.09/0.2²) = 0.8036,

which says that we put 80.36% into the risky asset and the remainder into the riskfree asset. ♠
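Formula (2.32) is one line of code; a minimal Python sketch with the inputs of Example 2.7:

```python
# Inputs of Example 2.7.
rf = 0.03        # riskfree rate
mu = 0.09        # expected excess return of the risky asset
sigma = 0.20     # volatility of the risky asset
gamma = 2.8      # risk aversion

# Equation (2.32): optimal weight on the risky asset.
w = mu / (gamma * sigma**2)

# Mean and variance of the optimal portfolio, equation (2.33).
mean_p = mu**2 / (gamma * sigma**2) + rf
var_p = mu**2 / (gamma**2 * sigma**2)

print(round(w, 4))  # 0.8036
```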
The mean and variance of the portfolio are, based on (2.30),

E[Rpt] = (1/γ)(µ²/σ²) + rf,  Var[Rpt] = (1/γ²)(µ²/σ²), (2.33)

in terms of µ and σ², the parameters of the asset returns. These formulas are useful in assessing
portfolio risk and return in practice.
How do investors assess the performance of a portfolio? The Sharpe ratio, originated by William
Sharpe in 1966 and revised in 1994, is the most widely used yardstick in practice,
Sharpe Ratio = E[Rp − rf] / √Var[Rp − rf], (2.34)
where Rp is the return on an asset or a portfolio. That is, the Sharpe ratio is the ratio of the excess
return to its standard deviation, or the risk premium one earns on the portfolio per unit of risk.
In our one risky asset case here, it is clear that
Sharpe Ratio = µ/σ. (2.35)
It is interesting that, no matter how one chooses his/her portfolio, one gets the same Sharpe ratio.
However, this is only true in the case of one risky asset. When there are N > 1 risky assets, different
portfolios will have different Sharpe ratios. To get the portfolio with the greatest Sharpe ratio, one
has to choose the weights optimally.
There are two remarks. First, the Sharpe ratio is often reported in practice in annualized form,
Sharpe Ratioₐ = √L × E[Rp − rf] / √Var[Rp − rf], (2.36)
where L is the number of periods per year. For example, if the return is daily or monthly, we
should annualize it with L = 252 (trading days) and L = 12, respectively, which is obtained by
annualizing the return with L and the standard deviation with √L. Second, the above formula is
an ex ante measure that is based on expectations. In practice, the realized or ex-post Sharpe ratio
is reported, which is computed based on the same equation as above but with the realized returns
on the portfolio and riskfree rate.
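As an illustration, the ex-post annualized Sharpe ratio of a monthly series might be computed as follows (the four monthly excess returns are made up for the example):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical monthly excess returns (portfolio return minus riskfree rate).
excess = [0.02, -0.01, 0.03, 0.01]

# Ex-post monthly Sharpe ratio: sample mean over sample standard deviation.
sr_monthly = mean(excess) / stdev(excess)

# Annualize with L = 12, as in (2.36): multiply by sqrt(L).
sr_annual = sqrt(12) * sr_monthly

print(round(sr_annual, 3))  # 2.535
```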
Mathematically, maximizing the Sharpe ratio is equivalent to maximizing the mean-variance
utility when N > 1, as shown below. However, although the optimal portfolios can be different
across individual investors with different risk tolerances, the Sharpe ratio is the same for all investors
as long as they hold optimal portfolios! This is different from the result in the case where the riskless
asset is not available.
The last point can be understood intuitively. When there is no riskfree asset, investors should
select portfolios in the mean-variance frontier (see Section 2.7) and different portfolios in the frontier
have different Sharpe ratios. However, when investors can invest in the riskfree asset, they all hold
a combination of the riskfree asset and the same tangency portfolio, the point where the line
starting at the riskfree rate touches the frontier. The Sharpe ratio of this combination is the same
regardless of how one allocates between the two. This is similar to the one risky asset case.
The portfolio solution (2.32) is unconstrained, where no restrictions on w are imposed. In
practice, a no-short-selling constraint is often imposed, which requires w ≥ 0. In addition, borrowing
at the riskfree rate is usually not feasible. So, a common restriction is 0 ≤ w ≤ 1. In this case, if the solution
(2.32) falls into this range, it will be the optimal one. If not, either 0 or 1 will be the solution. Of
course, for a hedge fund or a large investor, some limited shorts and borrowing may be possible.
Then the constraint may be written as a ≤ w ≤ b for some constants a and b, and we just search
the solution in this range.
2.2.2 N = 2
Consider now the case with N = 2 assets. Let rt and Rt be the returns and excess returns,
respectively,

rt = (r1, r2)′,  Rt ≡ (r1 − rft, r2 − rft)′ = rt − rft·1₂,  1₂ ≡ (1, 1)′, (2.37)
where rft is the riskfree rate. Let µ be the expected excess return and Σ the covariance matrix
of the excess returns,

µ = (µ1, µ2)′,  Σ = [σ1², ρσ1σ2 ; ρσ1σ2, σ2²].
Then the portfolio return is
Rpt = w1r1 + w2r2 + (1 − w1 − w2)rft = w′rt + (1 − w′1₂)rft = w′Rt + rft,

where w = (w1, w2)′ are the portfolio weights on the risky assets. Note that the sum of w1 and
w2 is no longer required to equal 1 because the remainder goes to the riskfree asset. If
their sum is less than 1, the rest of the money is invested in the riskfree asset. If it is greater than 1, the
difference from 1 is the amount borrowed at the riskfree rate.
The variance risk is

Var(Rpt) = w1²σ1² + 2ρw1w2σ1σ2 + w2²σ2² = [w1 w2] [σ1², ρσ1σ2 ; ρσ1σ2, σ2²] [w1 ; w2] = w′Σw.
The investor is assumed to choose w so as to maximize the same mean-variance objective
function
U(w) = E[Rpt] − (γ/2)Var[Rpt] = rft + w′µ − (γ/2)w′Σw. (2.38)
The solution (see end of this subsection) to the optimization is
w∗ = (1/γ)Σ⁻¹µ = (1/γ) [σ1², ρσ1σ2 ; ρσ1σ2, σ2²]⁻¹ [µ1 ; µ2]. (2.39)
Example 2.8 Assume that there are N = 2 risky assets. The excess returns (the returns minus
the riskfree rate) have the expected return and covariance matrix:
µ = [µ1 ; µ2] = [0.10 ; 0.20],  Σ = [0.3², 0.5 × 0.3 × 0.4 ; 0.5 × 0.3 × 0.4, 0.4²].
Assume rf = 3% and γ = 3. Then our portfolio is
Rpt = w1r1 + w2r2 + (1− w1 − w2)× 3% = w′rt + (1− w′1N )rf .
Our optimal choice of w is

w∗ = [w∗1 ; w∗2] = (1/3) [0.3², 0.5 × 0.3 × 0.4 ; 0.5 × 0.3 × 0.4, 0.4²]⁻¹ [0.10 ; 0.20]
   = (1/3) [14.82, −5.56 ; −5.56, 8.33] [0.10 ; 0.20] = [0.123 ; 0.370],
where the inverse of the matrix and the product of the matrix with the vector can be easily computed
using Python, Matlab or R. ♠
Note that if the correlation is zero, the inversion of Σ is trivial,

[σ1², 0 ; 0, σ2²]⁻¹ = [1/σ1², 0 ; 0, 1/σ2²],

and so

[w∗1 ; w∗2] = [(1/γ)(µ1/σ1²) ; (1/γ)(µ2/σ2²)],  or  w∗i = (1/γ)(µi/σi²),  i = 1, 2.
This says that, when the two assets are uncorrelated, we can apply our portfolio formula to each
of them separately, as if we have one asset at a time.
It can be verified that the squared Sharpe ratio of the optimal portfolio is
(Sharpe Ratio)² = µ′Σ⁻¹µ = [µ1 ; µ2]′ [σ1², ρσ1σ2 ; ρσ1σ2, σ2²]⁻¹ [µ1 ; µ2]. (2.40)
When the two assets are uncorrelated, it has a much simpler form,
(Sharpe Ratio)² = µ′Σ⁻¹µ = (µ1/σ1)² + (µ2/σ2)², (2.41)
i.e., the portfolio squared Sharpe ratio is simply the sum of the individual ones, in the special
uncorrelated case.
Proof of (2.39): The first-order conditions are

∂U(w)/∂w1 = µ1 − γ(w1σ1² + ρw2σ1σ2) = 0,
∂U(w)/∂w2 = µ2 − γ(ρw1σ1σ2 + w2σ2²) = 0.

In matrix form, this is

µ − γΣw = 0.

Solving for w, by multiplying both sides by Σ⁻¹ and dividing by γ, yields the formula. Q.E.D
2.2.3 Multiple risky assets
Consider now the general case. Let rt be the returns on N risky assets. We define
Rt ≡ rt − rft1N (2.42)
as the excess return similarly, where 1N is an N -vector of ones.
The common assumption on the probability distribution of Rt is that the excess return Rt is
iid multivariate normal with mean µ and covariance matrix Σ. Given the portfolio weights w, an
N × 1 vector, on the risky assets, the return on the portfolio at time t is
Rpt = w′rt + (1 − w′1N)rf = w′Rt + rf. (2.43)
The investor is assumed to choose w so as to maximize the same mean-variance objective
function
U(w) = E[Rpt] − (γ/2)Var[Rpt] = rf + w′µ − (γ/2)w′Σw. (2.44)
The solution to the problem is similarly obtained as:
w∗ = (1/γ)Σ⁻¹µ, (2.45)
which is the optimal portfolio weights.
Proof: Define df/dw as an N-vector formed by df/dwi for any function f = f(w1, . . . , wN), that is,
a vector formed by taking the derivative with respect to one variable at a time. Then it can be
verified that

d(w′µ)/dw = µ,  d(w′Σw)/dw = 2Σw. (2.46)
Hence, the first-order condition is

dU(w)/dw = µ − γΣw = 0.
Multiplying Σ−1 on both sides and simplifying the expression, we get (2.45). Q.E.D
With the optimal portfolio weights, the maximized expected utility is
U(w∗) = rf + (1/(2γ)) µ′Σ⁻¹µ = rf + θ²/(2γ), (2.47)

where θ² = µ′Σ⁻¹µ. That is the maximum utility that the investor can obtain when the portfolio
weights w∗ are computed based on the true parameters. In practice, however, the parameters have
to be estimated, and the estimation errors can significantly degrade performance, an issue to
be examined later.
Example 2.9 Assume that there are N = 3 risky assets. The excess returns (the returns minus
the riskfree rate) have the expected return and covariance matrix:
µ = [µ1 ; µ2 ; µ3] = [0.10 ; 0.20 ; 0.30],
Σ = [0.3², 0.5 × 0.3 × 0.4, 0.1 ; 0.5 × 0.3 × 0.4, 0.4², 0.1 ; 0.1, 0.1, 0.5²].
Assume rf = 3% and γ = 3 as before. Then our portfolio is
Rpt = w1r1 + w2r2 + w3r3 + (1− w1 − w2 − w3)× 3% = w′rt + (1− w′1N )rf .
Our optimal choice of w is

w∗ = [w∗1 ; w∗2 ; w∗3] = (1/3) [0.3², 0.5 × 0.3 × 0.4, 0.1 ; 0.5 × 0.3 × 0.4, 0.4², 0.1 ; 0.1, 0.1, 0.5²]⁻¹ [0.10 ; 0.20 ; 0.30]
   = (1/3) [21.43, −3.57, −7.14 ; −3.57, 8.93, −2.14 ; −7.14, −2.14, 7.71] [0.10 ; 0.20 ; 0.30] = [−0.238 ; 0.262 ; 0.390],
where the inverse of the matrix and the product of the matrix with the vector can be easily
computed using Python, Matlab or R. The answer makes intuitive sense. The third asset is
very attractive with an expected return of 30%, so we want to buy more of it. But its risk is high,
so we short the first asset to offset a substantial amount of the risk. ♠
Although the optimal portfolio formula has problems in practical applications (see Section
2.2.6), it is very important and will be used throughout the lectures to provide insights on optimal
investments. Below are two analytical examples:
Example 2.10 Consider the popular 1/N portfolio rule that invests $1 fully by putting 1/N into
each asset (see Section 2.1.1). This effectively assumes that each asset has the same expected
return, say µ0, and volatility, say σ0, with the assets uncorrelated. Then Σ is a diagonal matrix
with σ0² on the diagonal, and so, by (2.45), the optimal portfolio weights are

w∗ = (1/γ)(µ0/σ0²) 1N.
Although this is a scaled or leveraged position in the 1/N portfolio, it is unlikely to equal it exactly
(only when (1/γ)(µ0/σ0²) = 1/N). Hence, the widely used 1/N portfolio is not theoretically optimal
when there is a riskless asset, even if the risky assets have the same expected return and risk and
are uncorrelated.
Nevertheless, the 1/N rule is useful in practice as it does not require the estimation of the expected
asset returns and covariance matrix, which are noisy and, with the noisy estimates, the optimal
portfolio rule usually performs poorly. Some solutions will be discussed later (see Section 3.5). ♠
Example 2.11 To have a better understanding of the optimal portfolio weights, consider the
special case when the assets are uncorrelated. In this case, Σ will be a diagonal matrix, and so the
portfolio weights are
w∗ = (1/γ) [µ1/σ1² ; . . . ; µN/σN²].

Then we can write the weight on the i-th asset as

w∗i = (1/γ)(µi/σi)(1/σi).
Note that µi/σi is the Sharpe ratio of asset i. The formula says that, when the assets are uncorrelated,
we scale each asset's Sharpe ratio by a factor of 1/σi. For two assets with the same Sharpe ratio, we
invest more in the one with lower risk. If the risk is half as large, we double the investment. ♠
The expected return and variance of the optimal portfolio are

µp = E[Rpt] = w∗′µ + rf = (1/γ) µ′Σ⁻¹µ + rf, (2.48)

Var[Rpt] = w∗′Σw∗ = (1/γ²) µ′Σ⁻¹µ, (2.49)
Hence, the squared Sharpe ratio is

(Sharpe Ratio)² = (E[Rpt] − rf)²/Var[Rpt] = µ′Σ⁻¹µ. (2.50)

To summarize, the Sharpe ratio of the optimal portfolio is

Sharpe Ratio = √(µ′Σ⁻¹µ). (2.51)
When N = 1, it reduces to (2.35).
It is interesting that the Sharpe ratio is independent of risk aversion. Risk-averse investors invest
less in the tangency portfolio (see next subsection), and aggressive ones invest more. But both
portfolios are efficient, and they achieve the same Sharpe ratio. However, an investor in practice,
who often holds an inefficient portfolio or a portfolio under constraints (such as no short-selling), will
only obtain a Sharpe ratio lower than the theoretical maximum √(µ′Σ⁻¹µ).
Note that, theoretically, the Sharpe ratio is the same for all investors no matter what their
risk aversion is, as long as they choose the optimal portfolios, which differ across investors only by
a scaling factor determined by risk aversion. However, if an investor chooses the portfolio by another
rule, not the optimal portfolio, he will have a lower Sharpe ratio. So, that all investors have the same
Sharpe ratio is true only if they all behave rationally and choose their optimal portfolios. In practice,
however, investors will not have the same Sharpe ratios. This is because their asset universes may not
be the same. Moreover, even if the asset universe is the same, they may not agree on the same true
parameters, µ or Σ, and then their portfolios can be quite different from each other and from the
optimal portfolio.
Consider now the case when the asset returns are uncorrelated (Example 2.11). In this case,
the Sharpe ratio formula simplifies to
(Sharpe Ratio)² = ∑_{i=1}^N (µi/σi)², (2.52)
that is, the square of the portfolio Sharpe ratio is the sum of the squares of the individual Sharpe
ratios. In other words, when the asset returns are uncorrelated, each asset contributes to the
portfolio performance in terms of its own Sharpe ratio. The greater the individual Sharpe ratio,
the greater the contribution.
2.2.4 Two-fund separation theorem
Since here we assume that the riskless asset is available, the sum of the components of w∗ (the
weights on the risky assets) will not equal one in general. When it is less than 1, we put the rest
of the money into the riskfree asset. When it is greater than 1, we borrow money at the riskfree
rate to invest in the risky assets.
To understand it better, let

wη = Σ⁻¹µ / (1′N Σ⁻¹µ). (2.53)

It is clear that the weights sum to 1, w′η1N = 1. Then

Rη = w′η rt (2.54)

is a fully invested portfolio, or a fund, of the risky assets. We will show below that it is an efficient
portfolio, tangent to the mean-variance frontier along the line starting from (0, rf), known as the
tangency portfolio; we show later that it is the market portfolio under some further conditions (see
Section 5.1.1).
The optimal portfolio weights can be written as

w∗ = (1/γ)Σ⁻¹µ = (c/γ) wη, (2.55)

where c = 1′N Σ⁻¹µ is a scalar, a constant given the parameters. Then the optimal portfolio
return is

Rpt = (c/γ) Rη + (1 − c/γ) rf. (2.56)
This is the two-fund separation theorem, known also as mutual fund separation. It says that, if
investors have mean-variance utility and agree on all the expected returns and the covariance matrix
of the assets, they will all choose a portfolio of two funds, Rη and rf, out of all the possible
combinations of individual stocks. The allocation between the two funds depends on their risk
aversion. If they are aggressive (small γ), they invest more in the tangency portfolio (market
portfolio). If they are conservative (large γ), they invest less. In the extreme case that γ = +∞,
they put all money into the riskfree asset.
Now we want to show that Rη is the tangency portfolio. First, it must be an efficient portfolio.
When γ = c, the investor invests all the money in Rη. If Rη were not efficient, there would be
a portfolio of risky assets that does better, and the investor would be better off with that portfolio,
a contradiction. Second, Equation (2.56) is a line connecting (0, rf) with Rη. If this line were not
tangent at Rη, there would be a portfolio on the frontier that lies above this line. Then that portfolio
would perform better, contradicting the fact that all the optimal solutions are on the line.
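The two-fund separation result can be verified numerically. A minimal sketch, reusing the inputs of Example 2.9: the tangency weights wη of (2.53) sum to one, and the optimal weights (2.45) equal (c/γ)wη as in (2.55):

```python
import numpy as np

gamma = 3.0
mu = np.array([0.10, 0.20, 0.30])
Sigma = np.array([[0.3**2,      0.5*0.3*0.4, 0.1],
                  [0.5*0.3*0.4, 0.4**2,      0.1],
                  [0.1,         0.1,         0.5**2]])

x = np.linalg.solve(Sigma, mu)   # Sigma^{-1} mu
c = x.sum()                      # c = 1' Sigma^{-1} mu
w_eta = x / c                    # tangency portfolio, equation (2.53)
w_star = x / gamma               # optimal portfolio, equation (2.45)

# w* is a scaled position (c/gamma) in the tangency fund, equation (2.55).
print(w_eta.round(3), round(c / gamma, 3))
```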
2.2.5 Parameter estimation by sample moments
To implement the mean-variance optimal portfolio, we have to provide µ and Σ, the population
parameters of the return data-generating process. But they are unknown and have to be estimated
in practice.
Consider now how to estimate them from data. Suppose there are T periods of observed excess
returns data ΦT = {R1, R2, · · · , RT }, and we would like to form a portfolio for period T+1. Under
the standard assumption that the excess return Rt is i.i.d., the common sample estimates are
µ̂ = (1/T) ∑_{t=1}^T Rt, (2.57)

Σ̂ = (1/(T − 1)) ∑_{t=1}^T (Rt − µ̂)(Rt − µ̂)′, (2.58)

which are known as sample moments, as they result from replacing the theoretical integrals
by sample averages.
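With the T observed excess returns stacked in a T × N array, the sample moments (2.57) and (2.58) are one line each; a small NumPy sketch with simulated data for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 120, 3
R = rng.normal(0.01, 0.05, size=(T, N))   # simulated T x N excess returns

mu_hat = R.mean(axis=0)                   # equation (2.57)

# Equation (2.58): divide the sum of outer products by T - 1.
dev = R - mu_hat
Sigma_hat = dev.T @ dev / (T - 1)

# np.cov with rowvar=False uses the same T - 1 divisor.
assert np.allclose(Sigma_hat, np.cov(R, rowvar=False))
```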
Statistically, the estimators are unbiased,

E[µ̂] = µ,  E[Σ̂] = Σ,

which means that the average of the estimates over an infinite number of data sets will equal the
true parameters. However, given any sample size T, the estimates will only be around the true
parameters with random errors, and their standard deviations will provide an indication of how
large the errors are (see the confidence intervals, Section 1.3).
While we will examine estimation errors in Section 3.5, it is important to point out here that
it is the inverse of the covariance matrix that is used for computing the optimal portfolio weights,
and the inverse of Σ̂ is a biased estimator of Σ⁻¹ (though Σ̂ is unbiased for Σ),

E[Σ̂⁻¹] = ((T − 1)/(T − N − 2)) Σ⁻¹, (2.59)

which is well known in statistics (see, e.g., Anderson, 1984, p. 270). It says that the inverse of Σ̂
over-estimates Σ⁻¹ on average. As a result, one will over-invest in the risky assets if one uses Σ̂⁻¹ to
estimate Σ⁻¹. If T = 120 and N = 10, one will over-invest by about 10% (as 119/108 ≈ 1.10).
Hence, in practice, a better estimate of the inverse of the covariance matrix is

Σ̃⁻¹ = ((T − N − 2)/(T − 1)) Σ̂⁻¹,

or, equivalently, one can use

Σ̃ = (1/(T − N − 2)) ∑_{t=1}^T (Rt − µ̂)(Rt − µ̂)′ (2.60)
as an estimator for Σ for the purpose of obtaining Σ−1 or computing the optimal portfolio.
Technically, why does taking the inverse destroy unbiasedness? This is because of Jensen's inequality:

E[g(x̃)] ≥ g(E[x̃]),

for any convex function g(·). Consider for example the case of N = 1. σ̂² is an unbiased estimator
of σ². Let g(x) = 1/x for x > 0; it is clear that g′ < 0 and g′′ > 0, so g is a convex function. Then, from
Jensen's inequality,

E[1/σ̂²] ≥ 1/E(σ̂²) = 1/σ².

Since g is not linear and σ̂² is not a constant, the inequality holds strictly. That is the intuition for
why the inverse is no longer unbiased.
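The degrees-of-freedom adjustment can be wrapped in a small helper; a minimal sketch, assuming the excess returns are stored in a T × N NumPy array:

```python
import numpy as np

def adjusted_inverse_cov(R):
    """Bias-adjusted estimator of Sigma^{-1}, per (2.59)-(2.60):
    scale the inverse sample covariance by (T - N - 2)/(T - 1)."""
    T, N = R.shape
    Sigma_hat = np.cov(R, rowvar=False)
    return (T - N - 2) / (T - 1) * np.linalg.inv(Sigma_hat)

# With T = 120 and N = 10, the unadjusted inverse over-estimates
# Sigma^{-1} by the factor (T - 1)/(T - N - 2) = 119/108 on average.
print(round(119 / 108, 4))  # 1.1019
```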
2.2.6 Practical implementation
Suppose that we have T = 360, i.e., 30 years of monthly data. How do we know how well the
theoretical investment rule

w∗ = (1/γ)Σ⁻¹µ

would have performed in the past?
One way is to estimate the parameters with all the data to obtain an estimate of w∗, call it ŵ,
and then apply it to all the past data to obtain the (estimated) optimal portfolio returns

Rpt = ŵ′rt + (1 − ŵ′1N)rf,  t = 1, 2, . . . , 360.
Then we can examine the Sharpe ratio of this portfolio, etc.
The above in-sample procedure is simple, and is good for pedagogical purposes. It is in-sample
because it assumes that one knows all the data in the analysis. The argument is that, although
the true parameters are unknown, using all the data will give us the best estimate and then
the performance would be closer to the one which can be achieved by those who were using the
true parameters. But in the real world, no one knows the true parameters. Moreover, the true
parameters may change over time too, and so a simple one-shot estimate may not work well.
Feasibility is the major objection to in-sample analysis because it cannot be carried out in
reality. Indeed, in the first month, you only have the data for that month, and the other 359
months of data, not yet available, cannot be used for estimating the parameters. Therefore,
we cannot invest in the first month. To see the past performance of a realistic investment,
we need to divide the data into two periods: say Months 1 to 120 are used as the training data to
estimate the parameters and compute the weights, call them ŵ(120). Then we can start investing
in Month 120 and move forward.
In Month 121, we could continue to use ŵ(120) as the weights for our optimal portfolio, but then
we do not make use of the new data. Since more data generally lead to more accurate estimates,
as people almost always do in the real world, we update our estimates of the moments, and
hence the portfolio weights, with the additional data point in Month 121 to obtain a new estimate
ŵ(121). Similarly, in Month 122, we update the weights to ŵ(122) for our optimal portfolio in Month
122. In general, we can compute portfolio weights over time to obtain

ŵ(120), ŵ(121), . . . , ŵ(359), ŵ(360).
However, ŵ(360) need not be computed, as we would not invest in Month 360: the resulting return
would require data from Month 361, which is unavailable.
With the above weights, we can then compute the returns,

Rpt = ŵ(t−1)′rt + (1 − ŵ(t−1)′1N)rf,  t = 121, 122, . . . , 360.

That is, we invest in Month 120 and get the return in Month 121, ŵ(120)′r121 + (1 − ŵ(120)′1N)rf;
then we invest in Month 121 and get the return in Month 122, and so on. The last return is in
Month 360, the result of investing in Month 359 with weights ŵ(359). All of these returns are the ones that can be
used for performance analysis in terms of Sharpe ratio, etc.
The above procedure is known as a recursive one: it estimates the parameters recursively
using all available data. An alternative is to estimate the parameters with a fixed window of past
data, say the past 120 months only. For example, in Month 121, we use data from Month 2 to
Month 121, and in Month 122, we use data from Month 3 to Month 122. This is known as a rolling
procedure. In contrast to the in-sample analysis, both recursive and rolling procedures are
out-of-sample assessments, which do not use future information and are feasible in real time.
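The rolling procedure above can be sketched in a few lines. In this Python sketch, the simulated return data, the 120-month window, γ, and the riskfree rate are all illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, window, gamma, rf = 360, 3, 120, 3.0, 0.003
R = rng.normal(0.005, 0.04, size=(T, N))   # simulated monthly excess returns

oos = []                                   # out-of-sample portfolio returns
for t in range(window, T):
    hist = R[t - window:t]                 # rolling window through month t-1
    mu_hat = hist.mean(axis=0)
    Sigma_hat = np.cov(hist, rowvar=False)
    w = np.linalg.solve(Sigma_hat, mu_hat) / gamma   # equation (2.45)
    oos.append(w @ R[t] + rf)              # realized return in month t

oos = np.array(oos)
sharpe_annual = np.sqrt(12) * (oos.mean() - rf) / oos.std(ddof=1)
print(len(oos), round(sharpe_annual, 3))
```

Replacing the rolling slice `R[t - window:t]` with `R[:t]` turns this into the recursive procedure.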
In practice, more advanced estimation methods can be used. Additional data, such as daily or
fundamental information, may be utilized too. Moreover, a more general mean-variance problem in
practice imposes a range constraint on the weights (see Example 2.11):

ai ≤ wi ≤ bi,  i = 1, 2, . . . , N, (2.61)

i.e., for each asset i, the position has to be between ai and bi. For example, if a1 = 0 and b1 = 0.10,
we cannot short-sell the first asset and cannot invest more than 10% of our money in it.
In this case, no analytical formula is available for the optimal portfolio weights, but they
can usually be solved for numerically with ease. Note that one cannot simply truncate the
unconstrained solution, the analytical formula, when N > 1. The reason is that once one sets a
weight at a bound, all the other weights must be re-selected to optimize the objective, and setting
a weight at the upper or lower bound is often not optimal either. Numerically, however, one can
solve the above constrained problem, even with more complex constraints, easily using available
quadratic programming packages in Python. These issues are further discussed later.
In summary, there are important limitations of the optimal portfolio formula:

1) practical constraints have to be imposed in the real world. The constrained portfolio is
obviously different (see Section 3) and has to be solved numerically.

2) the formula implies roughly 50% short positions in a large portfolio (when N is large), and hence
it is difficult to implement.

3) the mean and covariance matrix have to be estimated in practice, which is difficult, and the
optimal portfolio is sensitive to even a small change in the inputs;

– an expected return of 10% vs 8% can cause the portfolio weights to change by much more
than 2%;

– the invertibility of the sample covariance matrix requires T ≥ N + 2, which is violated
if N = 1000 stocks and T = 120 with 10 years of monthly data (the solution is to be
discussed later).

4) it should be noted that any portfolio rule, except value-weighting, requires costly portfolio
rebalancing.
2.2.7 MV frontier and utility maximization
When there is no riskfree asset, we can define an optimal portfolio as one that minimizes risk for
a given level of return or maximizes the expected utility (review your Investment Theory class or
see Section 2.7 for details). The mean-variance (MV) efficient portfolio frontier is a concave curve,
the plot of the expected return of the optimal portfolio vs the risk.
When there is a riskfree asset, as we assume now, investors will in general choose
a portfolio of risky assets and also invest or borrow at the riskfree rate. The new mean-variance
efficient portfolio frontier is a line connecting the riskfree rate to a portfolio tangent to the frontier
(known as the tangency portfolio), and this line consists of all the possible portfolios an investor may
choose.
The utility maximization identifies exactly which point on the line an investor will choose given
her/his risk aversion. The optimal portfolio formula, equation (2.45),

w∗ = (1/γ) Σ⁻¹µ,

is a scaled version of the tangency portfolio because its weights do not sum to 1, as shown in Equation
(2.53). The difference from 1 is invested in the riskfree asset.
As mentioned before, the Sharpe ratio of all efficient portfolios is the same, though the portfolios
may differ in their exposures to risky assets. However, when there is no riskless asset, all the
portfolios on the traditional mean-variance frontier are efficient, but they have different Sharpe
ratios. Now, in the presence of the riskless asset, the only efficient portfolio from the frontier is
the tangency portfolio. Investors will choose a combination of it with the riskfree asset, and no other
risky assets, to obtain the best possible Sharpe ratio.
2.2.8 Alternative formulation
Instead of maximizing the expected utility, one can maximize the expected return for a given level
of risk (or minimize the risk for a given level of return). Mathematically, both approaches are
equivalent.
To see the equivalence, let σ² be a given level of risk. Then the maximization problem is

max_{w : w′Σw = σ²} E[Rpt] = µ′w + rf, (2.62)

whose solution is

wa = (σ/√(µ′Σ⁻¹µ)) Σ⁻¹µ. (2.63)
Proof: The Lagrangian of the objective function is

L = µ′w + rf − (λ/2)(w′Σw − σ²), (2.64)

where λ is the multiplier that transforms the constrained optimization problem into an unconstrained
one. Taking first derivatives with respect to all the wi's and λ (recall (2.46)), and setting
them to zero, we get the first-order conditions (FOC):

µ − λΣw = 0, (2.65)
w′Σw − σ² = 0. (2.66)
Multiplying (2.65) by µ′Σ⁻¹ and by w′ respectively, we get

µ′Σ⁻¹µ − λµ′w = 0, (2.67)
w′µ − λw′Σw = 0. (2.68)

These two equations imply, using the fact that µ′w = w′µ and the constraint (2.66),

µ′Σ⁻¹µ = λµ′w = λw′µ = λ²w′Σw = λ²σ².

Hence, we can solve for λ:

λ = √(µ′Σ⁻¹µ)/σ. (2.69)

Plugging this back into (2.65), we get w as in (2.63). Q.E.D.
Comparing (2.63) with the standard formula (2.45), we have

γ = √(µ′Σ⁻¹µ)/σ. (2.70)

This means that, if one wants a fixed level of portfolio risk σ, the effective risk aversion is
given above. On the other hand, for a given risk aversion γ, the risk of the portfolio is

σ = √(µ′Σ⁻¹µ)/γ. (2.71)
When γ > 0 takes all possible risk aversion values, σ will take all the possible risk levels. So
mathematically, the two optimization problems are equivalent.
However, it should be noted that the equivalence assumes the true parameters µ and Σ are
known. But they are unknown in practice, so the two approaches are no longer equivalent when
there is parameter uncertainty. The reason is that, given a certain level of risk aversion, one does
not know exactly what risk level to take as µ and Σ are unknown. Conversely, if one sets a risk
level, this will not necessarily be consistent with his/her risk aversion given that we do not know µ
and Σ.
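When the true µ and Σ are treated as known, the equivalence of the two formulations is easy to check numerically; the two-asset inputs below are hypothetical:

```python
import numpy as np

# Hypothetical moments for two risky assets (excess returns).
mu = np.array([0.06, 0.09])
Sigma = np.array([[0.04, 0.012],
                  [0.012, 0.0625]])

sigma_target = 0.15                        # desired portfolio volatility
theta = mu @ np.linalg.solve(Sigma, mu)    # mu' Sigma^{-1} mu
w_a = sigma_target / np.sqrt(theta) * np.linalg.solve(Sigma, mu)   # eq. (2.63)

# Implied risk aversion from (2.70): gamma = sqrt(mu' Sigma^{-1} mu) / sigma.
gamma = np.sqrt(theta) / sigma_target
w_star = np.linalg.solve(Sigma, mu) / gamma                        # eq. (2.45)

# The two formulations coincide, and the portfolio risk equals the target.
print(np.allclose(w_a, w_star), np.sqrt(w_a @ Sigma @ w_a))
```

The identity w_a′Σw_a = σ² holds exactly here because the true parameters are used; as the text notes, with estimated µ and Σ the equivalence breaks down.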
2.2.9 Links to regression and machine learning
Jobson and Korkie (1983) establish an interesting link between the optimal portfolio and a linear
regression. Britten-Jones (1999), based on the regression framework, provides ways to test hypothe-
ses on portfolio weights. The regression framework can also be used to estimate portfolio weights
when there are a large number of assets (see, e.g., Ross and Zhou (2021)).
Consider the regression of a constant on the asset excess returns,

1T = Xβ + ε, (2.72)

where 1T is a T-vector of 1's, X is a T × N matrix of the N assets' excess return data, and β is
the N × 1 vector of regression coefficients. The least-squares estimator has the usual formula,

βˆ = (X′X)⁻¹X′1T. (2.73)

In what follows, we will show that

βˆ = Σˆ⁻¹µˆ / (1 + µˆ′Σˆ⁻¹µˆ), (2.74)

where the mean and covariance matrix of the excess returns are estimated using slightly
different formulas from (2.57) and (2.58),

µˆ = X′1T/T, (2.75)
Σˆ = (X − 1Tµˆ′)′(X − 1Tµˆ′)/T, (2.76)

where µˆ is the same (just in vector notation), and Σˆ is obtained by dividing by T instead of by T − 1.
Recall the very important optimal portfolio weights formula, (2.45); its estimate is
clearly

wˆ∗ = (1/γ) Σˆ⁻¹µˆ.

Comparing this with the beta expression, we see that βˆ is the same as the estimated optimal portfolio
weights if γ = 1 + µˆ′Σˆ⁻¹µˆ. Hence, we can recover wˆ∗ from βˆ up to a scalar.
Proof of (2.74): Using a standard matrix inversion formula (Sherman–Morrison), we have

(X′X/T)⁻¹ = (Σˆ + µˆµˆ′)⁻¹ = Σˆ⁻¹ − Σˆ⁻¹µˆµˆ′Σˆ⁻¹ / (1 + µˆ′Σˆ⁻¹µˆ).

Then, using the OLS formula and simplifying, we get the desired result. Q.E.D.
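The link in (2.74) is easy to verify numerically; the excess returns below are simulated, not real data:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 600, 4
X = rng.normal(0.01, 0.05, size=(T, N))   # simulated excess returns

# Regress a vector of ones on the excess returns (no intercept), eq. (2.72).
beta_hat = np.linalg.solve(X.T @ X, X.T @ np.ones(T))

# Sample moments with 1/T scaling, eqs. (2.75)-(2.76).
mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / T

# Closed form (2.74): beta_hat = Sigma^{-1} mu / (1 + mu' Sigma^{-1} mu).
s = np.linalg.solve(Sigma_hat, mu_hat)
beta_formula = s / (1.0 + mu_hat @ s)
print(np.allclose(beta_hat, beta_formula))
```

The two vectors agree to machine precision, confirming that the OLS slope is the estimated optimal portfolio weight vector up to the scalar 1 + µˆ′Σˆ⁻¹µˆ.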
Ao, Li and Zheng (2019) recently find another link to a linear regression, based on which a wide
range of machine learning tools can be applied to portfolio choice.

Recall that the best squared Sharpe ratio investors can possibly obtain is

θ = µ′Σ⁻¹µ,

and they obtain it by choosing the optimal portfolio

w∗ = (1/γ) Σ⁻¹µ,

given a level of risk aversion γ. Alternatively, given a desired level of risk σ, the investors can
choose the portfolio

wa = (σ/√θ) Σ⁻¹µ. (2.77)
The problem is that it is very difficult to get an accurate estimate of Σ−1µ in practice when N is
large, so the analytical portfolio formulas have limited value in a high dimensional case.
Ao, Li and Zheng (2019) show that, given a level of desired risk σ, the optimal portfolio weights
are a solution of the linear regression problem below:

min_w E(rc − w′R)², rc ≡ σ(1 + θ)/√θ,

where rc is a fixed constant in the optimization problem. The important message from this new
formulation is that we do not have to use the analytical formulas, which are not reliable in high
dimension. Instead, we solve for the weights from the above problem by various dimension reduction
and model fitting techniques, opening a door for wide ML (machine learning) applications.
Note that this is a two-stage procedure. First, we estimate rc, which is a single constant and can
be estimated more accurately than w from the analytical formulas. Then, in the
second stage, we solve for the weights using various econometric or machine learning methods.
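A minimal sketch of the two-stage procedure, on simulated data, with plain least squares standing in for the machine-learning method of the second stage (in a genuinely high-dimensional setting one would substitute Lasso, ridge, or another dimension-reduction technique):

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 500, 5
R = rng.normal(0.008, 0.04, size=(T, N))   # simulated excess returns

# Stage 1: estimate the squared Sharpe ratio theta and the response constant rc.
mu_hat = R.mean(axis=0)
Sigma_hat = np.cov(R, rowvar=False)
theta_hat = mu_hat @ np.linalg.solve(Sigma_hat, mu_hat)
sigma_target = 0.05                        # desired risk level (hypothetical)
r_c = sigma_target * (1.0 + theta_hat) / np.sqrt(theta_hat)

# Stage 2: regress the constant response r_c on the returns (no intercept);
# the fitted coefficients are the portfolio weights.
w, *_ = np.linalg.lstsq(R, np.full(T, r_c), rcond=None)
print(w)
```

The point of the sketch is only the mechanics: a scalar is estimated first, and the weights then come out of a standard regression fit rather than from the unstable analytical formula Σ⁻¹µ.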
2.3 Tracking error minimization
In practice, big institutional money is invested across asset classes, such as bonds, stocks, currencies
and commodities. Within the stock portfolio, there are usually two styles of investment. The first
is passive management, which aims to earn returns identical to an index, such as the S&P 500. This is
basically a buy-and-hold strategy. There is a growing trend toward doing so, as outperforming the market
is not an easy task, and those who promise they can often fail badly. The second is active management,
where managers trade frequently to beat the market or to outperform certain benchmarks.
The simplest way to generate the index return is to hold the index, i.e., to buy all the stocks
in the index in proportion to the weights defining the index. Alternatively, one can replicate the
index using fewer stocks based on mean-variance portfolio theory, capitalization, or stratified
sampling methods.1
However, passive investments are still not free. Someone has to manage them. As stocks come
in or out of the index, trading has to take place. The same is true for dividend reinvestment and for
money flowing in and out of the index funds. So, one has to pay a fee to invest in a passive index fund in
practice. Vanguard, a leader in index funds, created one of the first index funds in 1975.
As of today, it manages over $5 trillion in assets. Its index funds charge some of the lowest fees,
but still 0.05%, or 5 basis points, as of 2013. This can still be enormous if the base, or assets under
management (AUM), is large.
Active portfolio managers today are often required to beat an index with minimal additional
risk net of all transaction costs. The active return is defined as

Active Return = Total Return on Managed Portfolio − Total Return on Index. (2.78)
The idea is that if you, an active portfolio manager, can beat the index (with perhaps some level
of given risk) by achieving positive active returns, I can pay you a fee. For example, if your track
records show that you can beat the index by 1-3% per year with risk no greater than the index by
1See, e.g., Fabozzi (1999, Ch 14) for the latter two methods.
2%, I can pay you, say, 50 or 70 basis points. You will gain by making more money than by managing
a pure index fund (with the same AUM), and I will also gain relative to investing in a pure index
fund. Even if the market is perfectly efficient, one can theoretically beat the market index by
taking higher risk. So, in practice, managers who try to beat an index are allowed to take higher
risk, but there is a limit. Tracking error limits are used as "risk budgets" to control the risk that
the managers can take. The question is then whether any gain is commensurate with the risk taken.
Note that the error limits are in terms of volatility, not return. The reason is that the active return
is very difficult to estimate in the real world, because estimated expected returns can be very
different from realized ones. In contrast, volatilities are more stable.
Let w̄ be the portfolio weights of a benchmark index, R an N-vector of the asset returns, and
w the weights of an actively managed tracking portfolio. The tracking error (here defined as a variance)
is

Tracking Error = TE ≡ Var[w′R − w̄′R] = (w − w̄)′V(w − w̄), (2.79)

where V is the covariance matrix of the underlying assets R. If your managed active portfolio has
volatility close to that of the index, say within a 2% difference, and if it has a substantially higher
return, you may be a good active manager.
The TE optimization problem of practical interest is

min_w TE = (w − w̄)′V(w − w̄) (2.80)
s.t. w′1N = 1, E[w′R − w̄′R] = g,

which minimizes the TE while achieving a given target, g, of expected performance relative to the
benchmark. Recall that the constraint w′1N = 1 is the standard one implying that the money is
fully invested, so the weights sum to 1. In practice, some pension or institutional investors may
want to take, say, 4% greater risk than the market in order to earn a greater expected return. This
is a valid objective even if the market is fully efficient or the market index is unbeatable (with the
same risk). To understand this, suppose the market has a 12% annual return and 20% volatility. If
investors want to take only 20% market risk, the easiest thing they can do is earn the 12% market
return by buying the index. If they are willing to take 24% market risk, they should earn a greater
expected return, say 15% (rather than 12%). However, it is often inefficient or infeasible for them to
do this by themselves, and so they have to hire active managers to obtain and manage such a portfolio
for them.
Roll (1992) provides an analytical solution to the above TE problem,

wr = w̄ + (g/d) V⁻¹(µ − µ0 1N), (2.81)

where d is one of the four efficient set constants,

a = µ′V⁻¹µ, b = µ′V⁻¹1N, c = 1′N V⁻¹1N, d = a − b²/c, (2.82)

and µ and µ0 are the expected returns on R and on the global minimum variance portfolio
(µ0 = b/c), respectively.
However, in the real world, the above optimization problem usually has constraints, such as
position limits and short-selling restrictions, and hence there are no analytical formulas for the
solutions. However, since minimizing the TE is the same as minimizing a quadratic function of
w,

TE = (1/2) w′(2V)w − (2w̄′V)w + w̄′V w̄,

where w̄ and V are constants, quadratic programming can be used to solve the problem under
general constraints (see the next chapter).
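The analytical solution (2.81) can be cross-checked against a generic numerical solver, which is also what one would use once position limits or short-selling restrictions are added. The moments, benchmark weights, and target g below are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical inputs: three assets, a benchmark, and a 2% active-return target.
mu = np.array([0.07, 0.10, 0.05])
V = np.array([[0.05, 0.01, 0.00],
              [0.01, 0.08, 0.01],
              [0.00, 0.01, 0.04]])
w_bar = np.array([0.5, 0.3, 0.2])
g = 0.02

ones = np.ones(3)
Vi_mu = np.linalg.solve(V, mu)
Vi_1 = np.linalg.solve(V, ones)
a, b, c = mu @ Vi_mu, mu @ Vi_1, ones @ Vi_1   # efficient set constants (2.82)
d = a - b**2 / c
mu0 = b / c                                    # global minimum-variance return
w_roll = w_bar + (g / d) * np.linalg.solve(V, mu - mu0 * ones)   # eq. (2.81)

# Numerical cross-check; bounds or extra constraints could be added here.
cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},
        {'type': 'eq', 'fun': lambda w: (w - w_bar) @ mu - g})
res = minimize(lambda w: (w - w_bar) @ V @ (w - w_bar), x0=w_bar, constraints=cons)
print(w_roll, res.x)
```

The analytical weights satisfy both constraints exactly, and the numerical solution matches them, which is a quick sanity check on (2.81).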
The TE optimization allows managers to beat the index while controlling the tracking error.
However, there is a hidden problem with the TE criterion. The problem is that the variance of the
tracking portfolio is

Var[wr′R] = Var[w̄′R] + 2(w − w̄)′V w̄ + (w − w̄)′V(w − w̄), (2.83)

i.e., the variance of the tracking portfolio can be quite large relative to the index if the second term
is sizable. In other words, if the TE is 4% risk (the square-root value), the actual active portfolio
variance can exceed the market's by more than (4%)² if the second term is not zero. Thus, the
common TE optimization is not perfect, and one needs to be cautious about its understatement of the
true risk. However, this may not be an issue, as the active portfolio usually has little correlation
with the market.
If there is a concern about the understatement in the TE optimization, one can solve for the active
portfolio by fixing the total risk of the tracking portfolio at a given level σp²:

Var[wr′R] = wr′V wr = σp², (2.84)

and then maximize the expected return. Jorion (2003) provides an analytical solution to this op-
timization problem under the standard full investment (weights sum to 1) and total risk
constraints. Again, however, if short-selling or other practical constraints are imposed, the op-
timization problem has to be, and can easily be, solved numerically using quadratic programming
tools.
2.4 Information ratio
How do we assess the performance of a portfolio manager whose goal is to beat the S&P 500? The
information ratio, also known as the appraisal ratio, is the most widely used measure in practice. Some
hedge funds even use it as a metric for calculating a performance fee.

The information ratio measures the performance of a portfolio relative to a benchmark index,

IR = E(Rp − RB) / σ(Rp − RB), (2.85)

i.e., the ratio of the expected active return of a fund to its standard deviation relative to RB, where
RB is the return on a benchmark index the fund manager attempts to beat, and Rp is the fund
return.
It is clear that the greater the IR, the smarter the fund manager. Recall that σ(Rp − RB) is
the tracking error. Given a tracking error allowance, the portfolio should outperform the index (as
it usually takes a calculated greater risk than the index), but the question is by how much. The
ratio states precisely the expected return per unit of tracking error. In practice, according to
Grinold and Kahn (1999, p. 114), top-quartile investment managers typically achieve annualized
information ratios of about 0.5. This means that, if the fund manager uses a risk aversion of γ = 1,
she/he can beat the index by about 1.25% per year (Grinold and Kahn, 1999, p. 114).
It is worth noting that the Sharpe ratio of the fund is

Sharpe Ratio = E(Rp − rf) / σ(Rp − rf), (2.86)

where rf is the riskfree rate. The IR is simply obtained by replacing rf with RB. Comparing
the Sharpe ratio of a manager who is to outperform a conservative utility index with that of one who
is to outperform a high-tech index does not make sense (as the latter will likely have a higher
Sharpe ratio by simply holding the index). How much they beat their benchmarks should be the
criterion, so we should use the IR.
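Computing the IR from return data is straightforward; the monthly fund and benchmark returns below are simulated, and the annualization uses the usual √12 convention:

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulated monthly returns: a benchmark and a fund with some active return.
r_bench = rng.normal(0.008, 0.04, size=120)
r_fund = r_bench + rng.normal(0.003, 0.01, size=120)

active = r_fund - r_bench
# Information ratio (2.85): mean active return per unit of tracking error.
ir_monthly = active.mean() / active.std(ddof=1)
ir_annual = ir_monthly * np.sqrt(12)
print(round(ir_annual, 2))
```

A manager with an annualized IR near 0.5 would be top quartile by the rule of thumb cited above.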
2.5 How to outperform with alpha asset?
In practice, one often asks the question: how can I improve my portfolio if I find an asset that
has a positive alpha?

Let R be the excess return on an asset that has a positive alpha (a negative alpha works
too, because shorting the asset then yields a positive alpha), that is, in the benchmark model
regression,

R = α + βRB + ε, (2.87)

where α > 0 is the asset's alpha relative to the benchmark portfolio you hold, β is the
asset's beta, and ε is the residual with zero mean. Our objective is to find a portfolio of RB and R
such that its performance is better than the benchmark's, with a greater Sharpe ratio, for example.
We often consider the part of the asset that is uncorrelated with the benchmark,

r = R − βRB, (2.88)

known to fund managers as the residual return, the return without benchmark risk. Note that

E[r] = α, var[r] = σε²,

where σε² is the variance of ε. The residual return or residual asset is tradable, as one can buy a
unit of the underlying asset while shorting a β portion of the benchmark. Mathematically, finding an
optimal portfolio of RB and R is the same as finding an optimal portfolio of RB and r.
It is just that the portfolio formula for the latter is simpler.
Consider now a portfolio of RB and r,

Rp = w1RB + w2r. (2.89)

Note that the two assets are uncorrelated, and so, based on our optimal portfolio formula, we have

(w1, w2)′ = (1/γ) [ var[RB], 0 ; 0, var[r] ]⁻¹ (E[RB], E[r])′ = ( (1/γ) E[RB]/var[RB], (1/γ) α/var[r] )′,
that is, our weight on the benchmark portfolio remains the same, but with an additional
investment in the residual asset whose weight takes the usual form (a mean-variance ratio), as if we
were investing in it alone!
Recalling our formula for the squared Sharpe ratio, (2.51), we now have

(Sharpe Ratio)² = (E[RB])²/var[RB] + α²/var[r], (2.90)

so the squared Sharpe ratio of the residual asset adds directly to the squared Sharpe ratio of our portfolio.
The greater it is, the more it helps the performance of our portfolio.
Our analysis above shows that two assets with the same alpha do not contribute
equally to the portfolio. The one with the smaller residual variance contributes more, because it is
the Sharpe ratio of the residual asset that matters, not the alpha value alone.
Example 2.12 As in Example 2.7, we assume that the riskfree asset earns 3% (per year), and your
benchmark (say the market) has an expected return of 12% (excess return 9%) and a volatility of
20%. Then, assuming γ = 2.8, the optimal portfolio (between the riskfree and benchmark assets) is

w1 = (1/2.8) × (0.09/0.20²) = 0.8036.

Now you have an alpha portfolio (e.g., the return on an investment based on a number of ideas), and
assume the alpha is 5% and the residual volatility is 15%. Then you will continue to hold the
benchmark with weight w1, but at the same time, invest

w2 = (1/2.8) × (0.05/0.15²) = 0.7937.
So you need to borrow money (0.5972 = w1 + w2 − 1) to invest. The squared Sharpe ratio of
your portfolio will be

(Sharpe Ratio)² = 0.09²/0.20² + 0.05²/0.15² = 0.3136. (2.91)

So the Sharpe ratio is 0.56, improving about 25% from 0.45, the Sharpe ratio without the alpha
portfolio.

In practice, the borrowing may not be feasible. In this case, the optimal portfolio should be
solved under the constraints (see the next section on how to solve such problems), and the resulting
Sharpe ratio will be lower than 0.56. ♠
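The numbers in Example 2.12 can be reproduced in a few lines:

```python
# Reproducing Example 2.12: adding an uncorrelated alpha asset to the benchmark.
gamma = 2.8
mu_B, sig_B = 0.09, 0.20     # benchmark excess return and volatility
alpha, sig_r = 0.05, 0.15    # alpha and residual volatility of the alpha asset

w1 = mu_B / sig_B**2 / gamma         # weight on the benchmark
w2 = alpha / sig_r**2 / gamma        # weight on the residual (alpha) asset
borrow = w1 + w2 - 1                 # amount borrowed at the riskfree rate

sr2 = (mu_B / sig_B)**2 + (alpha / sig_r)**2   # eq. (2.90)
print(round(w1, 4), round(w2, 4), round(borrow, 4), round(sr2, 4))
```

This confirms w1 = 0.8036, w2 = 0.7937, and a squared Sharpe ratio of 0.3136, i.e., a Sharpe ratio of 0.56.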
2.6 Fundamental Law of active portfolio management
2.6.1 IR = IC√N
If one has no forecasting skills at all, one cannot possibly beat the market by taking the same level
of risk. Suppose that one does have some skill. The question is then how to translate this skill
into active return efficiently. To this question, Grinold (1989) proposes the fundamental law of
active portfolio management (FLAM).
Note that the value-added of a portfolio is measured by the information ratio (IR), the
performance of the active portfolio per unit of active risk. In its simplest and most intuitive form,
the FLAM states that the value-added or performance of an active manager, IR, is proportional to
the information coefficient (IC) and the square root of the market breadth (BR),

IR = IC √N, (2.92)

where IC, the information coefficient (skill), is measured as the correlation between the
return forecasts and the actual future returns, assumed constant across assets and over
time, and N is the number of assets (the market breadth here).
In words, following Romero and Balch (2014), the FLAM says that

performance = skill × √breadth, (2.93)

where annual performance depends on skill and breadth: skill is a measure of how well a
manager forecasts future returns, and breadth represents the number of investment decisions (trades)
the manager makes each year. The law suggests that as a manager's skill increases, or as the manager
makes more use of the skill, more money will be made. That is not surprising, but what may be
surprising is that, to double performance, one has to double the skill, or, at the same level of skill,
quadruple the trading.
According to Romero and Balch (2014), Warren Buffett had 92% of his fund's money invested
in only 12 stocks in September 2010. So he has high skill and applies it to a limited number of
stocks. Why does he not apply it to more? It is likely that his high level of skill is not portable. On
the other hand, a hedge fund that uses machine learning to forecast future stock returns may have
lower skill per trade, but that skill can be applied to almost any stock. As a result, based on the
FLAM, both can enjoy great performance in their funds.
In short, the FLAM says that the IR is linearly related to skill. If a manager doubles his/her
forecasting accuracy (IC), then the IR doubles. If the accuracy can be doubled at a research cost, and
if the cost is lower than the value-added, it should be doubled. Applying the same level of skill to a
portfolio of 500 stocks will generate 10 times more value than applying it to 5 stocks. As a result,
a small degree of predictability can potentially help an active manager make significant gains
in beating the benchmark if this predictability can be used repeatedly many times during the year
or applied to many assets.
2.6.2 A casino example
To better understand the FLAM, consider a casino game of tossing an unfair coin (playing a
slot machine with odds favoring the casino is similar in the abstract). Suppose the payoff to the
casino is:

payoff = −1 (heads, with 49% probability) or +1 (tails, with 51% probability). (2.94)

Clearly the expected value of the game is

µ = 0.49 × (−1) + 0.51 × 1 = 0.02, (2.95)

with variance

σ² = 0.49 × (−1 − 0.02)² + 0.51 × (1 − 0.02)² = 0.9996, (2.96)

and so the risk (standard deviation) is σ = 0.9998.
Do you think the casino should play the game if it can play only once? The answer is no, despite
the game having a positive expected value. The reason is that the return-risk trade-off is poor, with a
Sharpe ratio (assuming a zero riskfree rate) of

SR = µ/σ ≈ 0.02,

while the stock market has a Sharpe ratio of about 0.5. Intuitively, you are risking $1 with high
probability to make only a small expected profit of 0.02.
Do you think the casino should play the game if it can play a large number of times, N? The
answer is then absolutely yes! Because now the Sharpe ratio

SR = µN/(σ√N) = 0.02√N
can be very large (remember that the expected return and the variance both grow linearly in N, while
the risk grows only at the rate √N).

In terms of the FLAM, 0.02 is the skill, the expected win per game, and N is the breadth. For a fixed
skill level, the greater the N, the more profitable the strategy of playing N games.
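The √N scaling can be checked by simulating the casino's total payoff over many independent "nights" of N games each:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 0.02, 0.9998     # per-game mean and standard deviation
n_nights = 200_000

for N in (1, 100, 10_000):
    # Total payoff over N games: wins pay +1, losses -1, so total = 2*wins - N.
    wins = rng.binomial(N, 0.51, size=n_nights)
    total = 2 * wins - N
    sr_sim = total.mean() / total.std(ddof=1)
    sr_theory = mu * np.sqrt(N) / sigma    # the FLAM analogue: skill * sqrt(breadth)
    print(N, round(sr_sim, 2), round(sr_theory, 2))
```

The simulated Sharpe ratio tracks 0.02√N closely: roughly 0.02 for a single game, 0.2 for 100 games, and 2 for 10,000 games per night.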
2.6.3 A proof
Now let us see why the law is true. Consider a managed portfolio with return Rp in excess of the
risk free rate. Let RB be the excess return on the benchmark portfolio, then we have
Rp = RB +RA, (2.97)
where RA is the return on the active portfolio. The proof essentially generalizes the analysis of
Section 2.5.
Suppose that the benchmark consists of N risky assets. We can always decompose the excess
return on the i-th asset as

Ri = αi + βiRB + εi, i = 1, . . . , N, (2.98)

where αi is the asset's alpha, βi is its beta, and εi is the residual with zero mean conditional on
available information. This is the market model we discussed before. Mathematically, the return
decomposition is simply a projection of Ri onto 1 and RB. Then

ri ≡ Ri − βiRB, (2.99)

is known to fund managers as the residual return, the return without benchmark risk.2 It is clear that
investing in Ri is equivalent to investing in ri in the sense that their weights differ only by
a weight on the market.
Let rit be the residual return at time t, and αˆit be our forecasted return based on prior infor-
mation It−1, that is,
E[rit | It−1] = αˆit. (2.100)
2This is often assumed by practitioners. However, theoretically, the projection only guarantees zero correlation with
RB, and so ε can be dependent on RB without a normality assumption on returns.
Let µp be the expected return on a portfolio of the rit's with portfolio weights wi. Then, using (2.100),

µp = ∑_{i=1}^N wi αˆit, (2.101)

and the variance is

σp² = ∑_{i=1}^N wi² σi², (2.102)

where σi² is the variance of rit conditional on the information, and the residual returns are
assumed uncorrelated here (if necessary, more factors can be added to make the residual returns
uncorrelated; see factor models later).
Consider now the value-added, or risk-adjusted return,

U = µp − (γ/2) σp².

The optimal portfolio choice, clear from the first-order condition, is

wi = (1/γ) (αˆit/σi²), (2.103)
which is the standard formula for uncorrelated assets (Example 2.11). Suppose now that we keep
σp², which is the TE, constant over time. Plugging (2.103) into (2.102), the implied squared risk
aversion is

γ² = (∑_{i=1}^N αˆit²/σi²) / σp².

Plugging this back into (2.103), we then get µp:

µp = σp × √(∑_{i=1}^N αˆit²/σi²). (2.104)
So the final task is to simplify the last term.
Assume that rit and αˆit are normally distributed. Since they have a correlation of IC, we can
write

αˆit = IC × σi × zit, (2.105)

where zit is a standard normal variable whose correlation with the standardized residual return rit/σi
equals IC, and αˆit is assumed to have a zero unconditional mean (which is reasonable, as stocks have
zero alphas in the long run). Then it can be verified that indeed

IC = corr(αˆit, rit).
Therefore, by (2.104), we have

µp/σp = IC × √(∑_{i=1}^N zit²) = IC × √(χ²_N), (2.106)

where the last term follows from the definition of a chi-squared distribution. From statistics, the
expectation of its square root can be computed as a ratio of Gamma functions, and an application
of Stirling's approximation then yields
E[√(χ²_N)] = √(N − 1) [1 + 1/(4N) + O(1/N²)], (2.107)

where O(1/N²) indicates errors of order 1/N². Note that E[χ²_N] = N is obvious but E[√(χ²_N)]
is not; (2.107) says that, for chi-squared variables, the latter is well approximated by √(N − 1),
or by √N (the −1 can be ignored when N is large). Then (2.106) implies the FLAM after taking
the expectation over the conditional information, or over time.
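The accuracy of the approximation (2.107) can be checked against the exact Gamma-function expression:

```python
import math

# Exact mean of sqrt(chi2_N): sqrt(2) * Gamma((N+1)/2) / Gamma(N/2),
# computed via log-gamma to avoid overflow for large N.
def exact(N):
    return math.sqrt(2) * math.exp(math.lgamma((N + 1) / 2) - math.lgamma(N / 2))

for N in (5, 50, 500):
    approx = math.sqrt(N - 1) * (1 + 1 / (4 * N))   # (2.107) without the O(1/N^2) term
    print(N, round(exact(N), 4), round(approx, 4))
```

Even for moderate N the two values agree to several decimal places, so replacing E[√(χ²_N)] by √N in the FLAM costs little.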
Our proof above follows closely Ye (2008). While the FLAM has received enormous attention
for its key insights into portfolio strategy design and performance evaluation (see, e.g., Chincarini
and Kim, 2006, and Qian, Hua, and Sorensen, 2006), subsequent studies show that the FLAM
states only the idealized gain. Once realistic constraints are imposed, the gain is much smaller than
predicted (see, e.g., Clarke, de Silva, and Thorley, 2002). Zhou (2008a, b) analyzes how estimation and
the optimal use of conditional information affect the gain. Ding and Martin (2017) provide the latest
analysis.
2.7 MV Optimal portfolio: No rf case
In the real world, most if not all equity funds require 100% investment in risky assets, so it
is of interest to consider the mean-variance optimal portfolio without the riskfree asset. This case is
also often discussed before the riskfree asset case in most investment texts. Since it is
technically more complex, and it is not essential for most of the earlier results, we have postponed the
discussion until now.
Denote now by rpt the (raw) returns on the N risky assets, and let

µ0 = E[rpt]

be the expected return vector. We use the notation µ0 to avoid confusion with µ, which denotes the
expected excess return (the return minus the riskfree rate). We still use the same notation, Σ, to
denote the covariance matrix, which is theoretically identical whether we use raw or excess
returns, because the riskfree rate is a constant and does not affect the covariance. However, in
the real world the riskfree rate changes: though constant within each period (say a month), it varies
across periods, and hence the estimated covariances in the two cases can differ numerically.
This has no impact on the theory.
We will obtain the optimal portfolio in two ways. The first is the familiar variance minimization
and the second is from mean-variance utility maximization.
2.7.1 Variance minimization given µp
Standard investment texts solve the mean-variance optimal portfolio by minimizing the risk for a
given level of return. Mathematically, this is to solve portfolio weights w from
min_w (1/2) w′Σw (2.108)
s.t. w′1N = 1, w′µ0 = µp,
where 1N is a vector of 1s, and µp is the given level of return. The risk here is captured by the
variance of the portfolio, w′Σw. Minimizing the variance is mathematically equivalent to minimizing
its square-root, the volatility.
To understand the matrix notation, consider the case where we have only two risky assets, N = 2.
Then the asset expected returns and the covariance matrix are

µ0 = (µ01, µ02)′, Σ = [ σ1², ρσ1σ2 ; ρσ1σ2, σ2² ].
The portfolio variance risk is

σp² = w1²σ1² + 2ρw1w2σ1σ2 + w2²σ2² = (w1, w2) Σ (w1, w2)′ = w′Σw,

half of which is the objective function in (2.108). Assume as usual that we are fully invested; then

w1 + w2 = (w1, w2)(1, 1)′ = w′1₂ = 1,
which is the first restriction, also known as the budget constraint. The second restriction is on the
investment objective. Suppose that we want our portfolio to have an expected return of 15%;
then

w1µ01 + w2µ02 = 15%

must be satisfied by our portfolio weights.
The solution is well known. Based on a standard optimization procedure (derivation given below),
the optimal weights are

w = c1Σ⁻¹1N + c2Σ⁻¹µ0, (2.109)

where 1N is an N × 1 vector of ones, and c1 and c2 are constants,

c1 = (c − bµp)/∆, c2 = (aµp − b)/∆, (2.110)

with

a = 1′NΣ⁻¹1N, b = 1′NΣ⁻¹µ0, c = µ0′Σ⁻¹µ0, (2.111)

and ∆ = ac − b² > 0, all of which are constants independent of µp.
Numerically, given the asset raw expected returns and their covariance matrix, µ0 and Σ, as
well as the desired level of expected portfolio return, we can compute a, b, and c, which are the 3
key coefficients determining the optimal portfolio. Indeed, with their values, we can compute easily
∆, c1 and c2. Then the optimal portfolio weights are computed from Equation (2.109).
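For concreteness, the computation just described can be carried out for a hypothetical two-asset example:

```python
import numpy as np

# Two risky assets, raw expected returns and covariance matrix (hypothetical).
mu0 = np.array([0.10, 0.18])
Sigma = np.array([[0.04, 0.012],
                  [0.012, 0.09]])
mu_p = 0.15                      # target expected portfolio return

ones = np.ones(2)
Si_1 = np.linalg.solve(Sigma, ones)
Si_mu = np.linalg.solve(Sigma, mu0)
a, b, c = ones @ Si_1, ones @ Si_mu, mu0 @ Si_mu        # eq. (2.111)
delta = a * c - b**2
c1, c2 = (c - b * mu_p) / delta, (a * mu_p - b) / delta  # eq. (2.110)
w = c1 * Si_1 + c2 * Si_mu                               # eq. (2.109)

# The weights sum to 1 and deliver the target return; the minimized
# variance follows the parabola (2.112).
var_p = (a * mu_p**2 - 2 * b * mu_p + c) / delta
print(w, w.sum(), w @ mu0, w @ Sigma @ w, var_p)
```

The checks at the end confirm both frontier identities: the budget and return constraints hold, and the realized variance matches (2.112).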
The minimized portfolio variance is

σp² = w′Σw = w′Σ(c1Σ⁻¹1N + c2Σ⁻¹µ0) = c1w′1N + c2w′µ0 = c1 + c2µp
    = (aµp² − 2bµp + c)/∆, (2.112)

which is the familiar mean-variance frontier, or parabola: as the expected return increases, so must
the risk, and it increases in a parabolic pattern. Note that investors will only choose the upper
part of the mean-variance frontier, the efficient frontier. Every portfolio on the lower part of the
frontier is the mirror image (about the minimum-variance return) of a portfolio on the upper part
with the same risk but a higher expected return, which is why the lower part will never be chosen.
Technically, the existence of the mean-variance frontier requires two conditions:

a) Σ is nonsingular, i.e., there are no redundant assets;
b) at least two assets have different expected returns.
If Σ is singular, the inversion breaks down. This happens only when one of the assets is a linear
combination of the others; in particular, the condition rules out perfect correlation between any two
assets. Next, if all the assets have the same expected return, any portfolio of them will have that
same expected return, and it is then impossible to form a portfolio with any other target return.
Under the two conditions, one can obtain an optimal portfolio with any target expected return,
with the risk given by (2.112). For example, one can design a portfolio with a monthly expected
return of 100%, but its risk will be enormous. Such a portfolio may also be unattainable in the
real world, because it would require large short positions (negative weights), which are difficult
to implement. More on practical portfolio constraints will be discussed in the next chapter.
Classic graduate texts, such as Ingersoll (1987) and Huang and Litzenberger (1988), offer in-depth
discussions of the mean-variance frontier as well as the proofs. For completeness, we provide the
derivation here.
Proof of (2.109): Let
\[
L = \frac{1}{2}w'\Sigma w - \eta(w'1_N - 1) - \lambda(w'\mu_0 - \mu_p)
\]
be the Lagrangian (objective function with constraints), where η and λ are additional parameters
to choose to reflect the constraints. Define df/dw as an N -vector formed by df/dwi for any function
f = f(w1, . . . , wN ). Then it can be verified that
\[
\frac{d(w'\mu_0)}{dw} = \mu_0, \qquad \frac{d(w'\Sigma w)}{dw} = 2\Sigma w. \qquad (2.113)
\]
Hence, the first-order conditions are
\[
\frac{\partial L}{\partial w} = \Sigma w - \eta 1_N - \lambda\mu_0 = 0, \qquad (2.114)
\]
\[
\frac{\partial L}{\partial \eta} = w'1_N - 1 = 0, \qquad (2.115)
\]
\[
\frac{\partial L}{\partial \lambda} = w'\mu_0 - \mu_p = 0. \qquad (2.116)
\]
Equation (2.114) provides
\[
w = \eta\Sigma^{-1}1_N + \lambda\Sigma^{-1}\mu_0. \qquad (2.117)
\]
Multiplying both sides of this equation by \(\mu_0'\) and using (2.116), we have
\[
\mu_p = \eta\mu_0'\Sigma^{-1}1_N + \lambda\mu_0'\Sigma^{-1}\mu_0. \qquad (2.118)
\]
Multiplying both sides of (2.117) by \(1_N'\) and using (2.115), we have
\[
1 = \eta 1_N'\Sigma^{-1}1_N + \lambda 1_N'\Sigma^{-1}\mu_0. \qquad (2.119)
\]
Now equations (2.118) and (2.119) are two linear equations in η and λ. Since two linear equations
in two unknowns can be solved analytically (see (1.78)), we obtain η and λ, and hence w∗ from
(2.117), which is the same solution given earlier. Q.E.D.
When the given expected portfolio return µp is taken as
µp = b/a,
the resulting portfolio has the minimum risk, which is evident from the first-order condition:
\[
\frac{d\sigma_p^2}{d\mu_p} = \frac{2a\mu_p - 2b}{\Delta} = 0
\quad \Longleftrightarrow \quad \mu_p = b/a.
\]
The portfolio is known as the global minimum-variance (GMV) portfolio, whose weights are
\[
w_g = \frac{\Sigma^{-1}1_N}{1_N'\Sigma^{-1}1_N}, \qquad (2.120)
\]
which is the same as (2.21) discussed earlier. Here we see an alternative derivation.
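A quick numerical check of (2.120), again with hypothetical inputs: the GMV weights sum to one, earn µg = b/a, and have no larger variance than any other fully invested portfolio:

```python
import numpy as np

# Hypothetical two-asset inputs
mu0 = np.array([0.10, 0.15])
Sigma = np.array([[0.04, 0.012],
                  [0.012, 0.09]])
ones = np.ones(2)
Sinv = np.linalg.inv(Sigma)

a = ones @ Sinv @ ones
b = ones @ Sinv @ mu0

w_g = (Sinv @ ones) / a   # GMV weights, Equation (2.120)
mu_g = b / a              # GMV expected return

# A fully invested perturbation of w_g must carry at least as much variance
w_alt = w_g + np.array([0.1, -0.1])
```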
To implement portfolio selection, as in the riskless-asset case, µ0 and Σ have to be estimated,
say from historical data. Suppose there are T periods of observed raw-return data
ΦT = {r1, r2, · · · , rT } and we would like to form a portfolio for period T + 1. Under the common
assumption that rt is i.i.d., the standard estimates are
\[
\hat\mu_0 = \frac{1}{T}\sum_{t=1}^T r_t, \qquad (2.121)
\]
\[
\hat\Sigma = \frac{1}{T-1}\sum_{t=1}^T (r_t - \hat\mu_0)(r_t - \hat\mu_0)'. \qquad (2.122)
\]
Mathematically, these are the same estimators as before, (2.57) and (2.58), and they share the
same properties. The only difference is that previously the returns were measured as excess
returns, whereas now they are simply raw returns.
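In code, the estimators (2.121)-(2.122) are just the sample mean and sample covariance of the return history; the sketch below uses simulated returns as a stand-in for real data:

```python
import numpy as np

# Simulated stand-in for T periods of raw returns on N assets
rng = np.random.default_rng(0)
T, N = 240, 3
R = rng.multivariate_normal(mean=[0.010, 0.008, 0.012],
                            cov=np.diag([0.05**2, 0.04**2, 0.06**2]),
                            size=T)

mu_hat = R.mean(axis=0)                       # (2.121): sample mean
Sigma_hat = np.cov(R, rowvar=False, ddof=1)   # (2.122): divides by T - 1
```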
It should be mentioned that practical portfolio choice may involve many constraints beyond the
weights summing to 1, such as no short sales and position limits. If the investment policy is
to hold a large portion of assets in the market index and a small portion in an active portfolio
such as the optimal portfolio here, combining the two will often not violate the constraints. For
example, although w∗ typically requires shorting about 50% of the assets, the fund can often
simply under-weight some assets in the index, without violating the constraints, if the active
portion is only, say, 20% of the assets.
However, if the optimal portfolio is a standalone portfolio that is not allowed to go short, then
the above analytical formula no longer applies, and a numerical solution is the only approach.
This will be addressed in the next chapter.
2.7.2 Two-fund separation: No rf case
Analytically, the optimal portfolio formula (2.109) can be written as a portfolio of two other frontier
portfolios,
w∗ = (c1a)wg + (c2b)wq, (2.123)
where (c1a) + (c2b) = 1, wg is the GMV portfolio and portfolio wq is defined by
\[
w_q = \frac{\Sigma^{-1}\mu_0}{1_N'\Sigma^{-1}\mu_0}. \qquad (2.124)
\]
Graphically, wq is the tangency portfolio of the line from the origin (see, e.g., Ingersoll (1987)
for a proof). Equation (2.123) is the Two-fund Separation Theorem for the case of no riskfree
asset. It says that any optimal portfolio is a portfolio of two funds, wg and wq. In an ideal
mean-variance economy, offering the two funds is sufficient to meet all investors' demands in the
absence of a riskfree rate.
In fact, any two distinct frontier portfolios can serve as the two funds. The reason is that both
of them satisfy (2.123); inverting the relation shows that wg and wq are portfolios of those two,
and hence every frontier portfolio is a portfolio of them as well.
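The two-fund separation (2.123) can be verified numerically: for the same hypothetical two-asset inputs, the frontier portfolio from (2.109) coincides with the (c1a, c2b) mix of the two funds:

```python
import numpy as np

mu0 = np.array([0.10, 0.15])
Sigma = np.array([[0.04, 0.012],
                  [0.012, 0.09]])
ones = np.ones(2)
mu_p = 0.12
Sinv = np.linalg.inv(Sigma)

a = ones @ Sinv @ ones
b = ones @ Sinv @ mu0
c = mu0 @ Sinv @ mu0
Delta = a * c - b**2
c1 = (c - b * mu_p) / Delta
c2 = (a * mu_p - b) / Delta

w_star = c1 * (Sinv @ ones) + c2 * (Sinv @ mu0)  # frontier portfolio (2.109)
w_g = (Sinv @ ones) / a                          # first fund: GMV
w_q = (Sinv @ mu0) / b                           # second fund, Equation (2.124)

# Two-fund separation (2.123): mixing weights c1*a and c2*b sum to one
w_mix = (c1 * a) * w_g + (c2 * b) * w_q
```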
Define µp/σp as the Sharpe ratio of a portfolio in the absence of the riskfree asset; then wq is
the unique frontier portfolio that maximizes this Sharpe ratio. This is true because wq is the
tangency portfolio from the origin and the mean-variance frontier lies underneath the tangent
line. Interestingly, in the no-riskfree-asset case, all frontier portfolios are optimal, and investors
choose different ones depending on their risk tolerance (see the next subsection), so they achieve
different Sharpe ratios. In contrast, as you will learn, when the riskfree asset is available, although
the optimal portfolios can differ, they all have the same Sharpe ratio (though it is defined
differently, using the riskfree rate).
There is one more interesting property of the portfolio wq. Its expected portfolio return is 1,
E[w′qR] = 1.
This property will be used later for a link to a linear regression.
2.7.3 Utility maximization
Now we assume that the investor chooses his/her portfolio weights w so as to maximize his/her
quadratic utility function,
\[
\max_{w:\; w'1_N = 1} \; U(w) = E[r_{pt}] - \frac{\gamma}{2}\,\mathrm{Var}[r_{pt}]
= w'\mu_0 - \frac{\gamma}{2}w'\Sigma w, \qquad (2.125)
\]
where γ is the risk-aversion parameter of the investor: the greater its value, the more risk averse
the investor, as risk is penalized more. Note that we still assume, as almost all studies do, that
the investor is fully invested in the risky assets, so that the weights sum to 1,
\[
w_1 + w_2 + \cdots + w_N = w'1_N = 1,
\]
where \(1_N\) is an N × 1 vector of ones.
However, we no longer have a constraint on the expected return of the optimal portfolio. In
fact, the investor does not know what level of expected return she or he should choose. Intuitively,
if the investor can tolerate a high degree of risk, then she/he will choose a portfolio with greater
risk and hence greater expected return. The levels of risk and expected return are therefore
completely determined by the risk aversion, i.e., by the utility function. This is why, in contrast
with the variance-minimization formulation, we no longer impose any restriction on the expected
portfolio return.
The optimal weights are
\[
w^* = w_g + \frac{1}{\gamma}w_z, \qquad (2.126)
\]
where
\[
w_g = \frac{\Sigma^{-1}1_N}{1_N'\Sigma^{-1}1_N}, \qquad
w_z = \Sigma^{-1}(\mu_0 - 1_N\mu_g),
\]
with \(\mu_g = \mu_0'\Sigma^{-1}1_N / 1_N'\Sigma^{-1}1_N\) the expected return on the global minimum-variance
(GMV) portfolio wg. Equation (2.126) says that holding the optimal portfolio is the same as
investing in two funds, wg and wz (as these two are themselves portfolios, or funds). Since
investors here invest 100% in the risky assets, they always hold 100% of wg. Depending on their
degrees of risk aversion, their exposures to wz vary.
Note that wz is a zero-investment portfolio satisfying \(1_N'w_z = 0\). It is clear from (2.126) that
any optimal portfolio is a linear combination of wg and wz. Mathematically, one can show that
maximizing the quadratic utility function is equivalent to the usual portfolio risk minimization for a
given level of return. Indeed, when the risk aversion is infinite, the investor chooses the GMV
portfolio. As the risk aversion decreases, the optimal portfolio from (2.126) traces out the upper
mean-variance frontier.
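A small check of (2.126) with hypothetical inputs: the resulting w* is fully invested, wz is zero-investment, and w* attains a higher utility than nearby fully invested portfolios:

```python
import numpy as np

mu0 = np.array([0.10, 0.15])
Sigma = np.array([[0.04, 0.012],
                  [0.012, 0.09]])
ones = np.ones(2)
gamma = 3.0
Sinv = np.linalg.inv(Sigma)

w_g = (Sinv @ ones) / (ones @ Sinv @ ones)  # GMV fund
mu_g = w_g @ mu0
w_z = Sinv @ (mu0 - ones * mu_g)            # zero-investment fund
w_star = w_g + w_z / gamma                  # Equation (2.126)

def utility(w):
    # Quadratic utility (2.125)
    return w @ mu0 - 0.5 * gamma * (w @ Sigma @ w)
```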
In practice, utility maximization is critical, as it tells us which portfolio to buy for an investor
or a fund manager. In contrast, the mean-variance frontier itself does not provide such information:
it only says to choose a portfolio from the frontier, without telling which one.
Proof of (2.126): It is similar to the proof before. Now the Lagrangian is
\[
L = w'\mu_0 - \frac{\gamma}{2}w'\Sigma w - \eta(w'1_N - 1).
\]
The first-order conditions are
\[
\frac{\partial L}{\partial w} = \mu_0 - \gamma\Sigma w - \eta 1_N = 0, \qquad (2.127)
\]
\[
\frac{\partial L}{\partial \eta} = w'1_N - 1 = 0. \qquad (2.128)
\]
Equation (2.127) provides
\[
w = \frac{1}{\gamma}\Sigma^{-1}(\mu_0 - \eta 1_N)
= \frac{1}{\gamma}\Sigma^{-1}\mu_0 - \frac{\eta}{\gamma}\Sigma^{-1}1_N. \qquad (2.129)
\]
Multiplying both sides by \(1_N'\) and using (2.128), we have
\[
1 = \frac{1}{\gamma}1_N'\Sigma^{-1}\mu_0 - \frac{\eta}{\gamma}1_N'\Sigma^{-1}1_N, \qquad (2.130)
\]
and hence
\[
\eta = \frac{1_N'\Sigma^{-1}\mu_0 - \gamma}{1_N'\Sigma^{-1}1_N}
= -\gamma\left(1_N'\Sigma^{-1}1_N\right)^{-1} + \mu_g. \qquad (2.131)
\]
Plugging this into (2.129), we obtain the result. Q.E.D.
2.7.4 Optimality of ad hoc rules
Let us consider two special cases of utility maximization. This helps to see conditions under which
some of the earlier ad hoc rules are optimal.
Consider first the popular 1/N rule that puts equal money across the risky assets. If we assume equal
expected returns across the assets (utility maximization allows this possibility) and Σ is diagonal
with equal volatilities, then \(w_g = (1/N)1_N\) and \(w_z = 0_N\), and so
\[
w^* = w_g + \frac{1}{\gamma}w_z = \frac{1}{N}1_N, \qquad (2.132)
\]
which is the usual equal-weighted portfolio. Indeed, when the assets are independent of each
other and have the same mean and variance, full and equal diversification is possible, and so the
equal-weighted portfolio is optimal.
Consider next inverse volatility-weighting. If we assume Σ is diagonal but allow different
elements, and if we still assume that the expected asset returns are equal, then \(w_z = 0_N\),
and wg reduces to the inverse-variance weights,
\[
w_{gi} = \frac{1/\sigma_i^2}{1/\sigma_1^2 + \cdots + 1/\sigma_N^2}, \qquad i = 1, \ldots, N.
\]
This is also intuitive: when the assets are independent of each other and have the same means,
the weight on each asset is inversely proportional to its own variance, normalized by the sum of
the inverse variances across all assets.
Consider finally the GMV portfolio. It is clear that \(w^* = w_g\) if and only if \(w_z = 0_N\), i.e., if
and only if all the expected returns are equal. Hence, when the expected returns on a set of
stocks/assets are roughly the same (perhaps after grouping stocks by their expected returns), the
GMV may be a good rule to apply, without having to provide the noisy expected-return estimates.
Note that estimation errors are unlikely to affect the ranking of the stocks by estimated expected
returns when the errors are highly correlated. For example, if all stocks have an error of 5% in
their expected returns, the true ranking is unaffected. That may be the reason why the GMV
rule is popular in practice.
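The reductions above are easy to confirm numerically. With equal (assumed) expected returns and a diagonal Σ, wz vanishes and w* collapses to the inverse-variance weights:

```python
import numpy as np

# Hypothetical setting: equal means, diagonal covariance with unequal volatilities
N = 4
mu0 = np.full(N, 0.10)                    # equal expected returns
sig = np.array([0.10, 0.15, 0.20, 0.30])
Sigma = np.diag(sig**2)
ones = np.ones(N)
gamma = 3.0
Sinv = np.linalg.inv(Sigma)

w_g = (Sinv @ ones) / (ones @ Sinv @ ones)
mu_g = w_g @ mu0
w_z = Sinv @ (mu0 - ones * mu_g)          # should vanish with equal means
w_star = w_g + w_z / gamma                # Equation (2.126)

w_inv_var = (1 / sig**2) / np.sum(1 / sig**2)  # inverse-variance weights
```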
2.7.5 Links to linear regression
Based on the earlier result of Jobson and Korkie (1983), it is easy to see that there is also an
interesting relation between the optimal mean-variance portfolio wq and a linear regression.
Assume we have iid asset returns and the sample size is T . Consider the regression of a constant
on the asset returns,
\[
1_T = X\beta + \epsilon, \qquad (2.133)
\]
where \(1_T\) is a T-vector of ones, X is a T × N matrix of the N assets' returns, and β is the
N × 1 vector of regression coefficients. Note that, since there is no riskfree rate, the returns
here are raw returns.
Let \(\hat w_q\) be the estimate of wq from the data, with µ0 and Σ estimated by \(\hat\mu_0\) and \(\hat\Sigma\) from (2.121)
and (2.122); then
\[
\hat w_q = \frac{\hat\beta}{1_N'\hat\beta}, \qquad (2.134)
\]
where \(\hat\beta\) is the OLS regression estimate. In other words, the regression slopes are proportional to
the optimal portfolio weights. The term \(1_N'\hat\beta\) is the sum of all the slopes; dividing by it makes
the vector \(\hat\beta/1_N'\hat\beta\) sum to 1, so that it is a vector of portfolio weights.
Proof: The least-squares estimator has the usual formula,
βˆ = (X ′X)−1X ′1T .
In matrix form, we can write (2.121) and (2.122) as
\[
\hat\mu_0 = X'1_T/T, \qquad (2.135)
\]
\[
\hat\Sigma = (X - 1_T\hat\mu_0')'(X - 1_T\hat\mu_0')/T, \qquad (2.136)
\]
where, for convenience, (2.136) uses the divisor T rather than the T − 1 of (2.122); the difference
is a scalar factor that does not affect the normalized weights below.
Then, using a standard matrix inversion formula, we have
\[
(X'X/T)^{-1} = (\hat\Sigma + \hat\mu_0\hat\mu_0')^{-1}
= \hat\Sigma^{-1} - \frac{\hat\Sigma^{-1}\hat\mu_0\hat\mu_0'\hat\Sigma^{-1}}{1 + \hat\mu_0'\hat\Sigma^{-1}\hat\mu_0}.
\]
Hence, since \(X'1_T = T\hat\mu_0\), we obtain
\[
\hat\beta = \frac{\hat\Sigma^{-1}\hat\mu_0}{1 + \hat\mu_0'\hat\Sigma^{-1}\hat\mu_0}.
\]
This is clearly proportional to \(\hat w_q\). To make it sum to 1, we divide it by \(1_N'\hat\beta\), yielding
exactly \(\hat w_q\). Q.E.D.
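The relation (2.134) can be checked on simulated data: regressing a vector of ones on the returns (with no intercept) and rescaling the slopes to sum to one reproduces ŵq; the scalar divisor in Σ̂ cancels in the normalization:

```python
import numpy as np

# Simulated T x N return matrix (hypothetical parameters)
rng = np.random.default_rng(1)
T, N = 600, 3
mu_true = np.array([0.010, 0.012, 0.008])
cov_true = np.array([[0.0025, 0.0005, 0.0003],
                     [0.0005, 0.0036, 0.0004],
                     [0.0003, 0.0004, 0.0016]])
X = rng.multivariate_normal(mu_true, cov_true, size=T)

# OLS of a constant on the returns, regression (2.133), no intercept
beta_hat, *_ = np.linalg.lstsq(X, np.ones(T), rcond=None)
w_from_ols = beta_hat / beta_hat.sum()       # Equation (2.134)

# Direct estimate of w_q from sample moments
mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X, rowvar=False, ddof=1)  # T-1 divisor; scale cancels below
w_q_hat = np.linalg.solve(Sigma_hat, mu_hat)
w_q_hat = w_q_hat / w_q_hat.sum()
```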
Britten-Jones (1999), based on the regression framework, provides ways to test hypotheses on
portfolio weights. Brides (2009) extends the relation further. Consider the regression
\[
\eta 1_T = X\beta + \epsilon, \qquad (2.137)
\]
with the portfolio constraint that \(1_N'\beta = 1\). When η = 1, this is exactly the case studied before,
and the slope must be the portfolio \(\hat w_q\), as here we have imposed the constraint so that the
estimated betas are the OLS betas scaled to have their sum equal to 1.
Now when η is any constant, the slope is clearly a function of η, βˆ = βˆ(η), whose explicit
expression is complex under the constraint 1′Nβ = 1. The interesting result is that βˆ(η) must be
the estimated optimal portfolio weights whose expected return is η.
Mathematically, it can be verified that
\[
\epsilon'\epsilon/T = (\eta - \hat\mu_e)^2 + \beta'\hat\Sigma\beta,
\]
where \(\hat\mu_e = \beta'\hat\mu_0\) is the estimated expected return of the portfolio with weights β. In minimizing
the mean-squared error \(\epsilon'\epsilon/T\), the solution makes the first term zero and the second term as
small as possible. This says exactly that the OLS betas provide the minimal risk given the expected
return η. As η varies over (0, ∞), \(\hat\beta(\eta)\) traces out all the possible upper-frontier portfolios.
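The decomposition of ε'ε/T is easy to verify numerically for any weights β; here Σ̂ is the T-divisor estimate and µ̂e = β'µ̂0 (all inputs are made-up):

```python
import numpy as np

# Simulated returns and an arbitrary fully invested beta
rng = np.random.default_rng(2)
T, N = 400, 3
X = rng.normal(0.01, 0.05, size=(T, N))
beta = np.array([0.5, 0.3, 0.2])   # any weights summing to one
eta = 1.2

mu_hat = X.mean(axis=0)
Sigma_T = np.cov(X, rowvar=False, ddof=0)   # 1/T divisor, as in (2.136)

eps = eta * np.ones(T) - X @ beta           # regression residuals
lhs = eps @ eps / T                         # mean-squared error
mu_e = beta @ mu_hat
rhs = (eta - mu_e)**2 + beta @ Sigma_T @ beta
```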
3 Portfolio Choice 2: Constraints and Extensions
In this section, we first discuss portfolio choice decisions under practical constraints. Next, we
examine alternative objectives other than the mean-variance utility. Then, we consider modeling
errors: estimation errors and model misspecification/uncertainty.
3.1 Practical constraints
There are many restrictions in real-world stock investing. The first is the no-short-sale
restriction: many funds face it, as they cannot short securities. Hedge funds, however, can
typically short freely, and some funds are allowed to run 130/30 strategies (some even up to
150/50), where 130/30 means the fund can go long and short at the same time, with up to
130% exposure in its long positions and 30% in its short positions. Even so, it is costly to borrow
stocks to short sell, and it is sometimes almost impossible to short certain stocks.
Suppose we have 5 stocks. If no short sales are allowed, the following constraints on the portfolio
weights are implied:
w ≥ 0, or wi ≥ 0, for i = 1, 2, . . . , 5. (3.1)
Position limits are another common restriction. A fund manager may not be allowed to put too
much money into a single stock, perhaps by internal rules. There are at least three reasons. The
first is to force diversification. The second is to limit exposures to certain industries or ideas. The
third is to reduce trading costs, as it is difficult to get in or out of a stock if one owns too many
shares of the company. In our 5-stock example, if we impose a cap of 50% on any one company, then
w ≤ 0.50, or wi ≤ 0.50, for i = 1, 2, . . . , 5. (3.2)
In practice, it is difficult for an equity fund manager to borrow money. To ensure no borrowing,
we need the sum of the weights on the risky assets to be no greater than 1,
w1 + w2 + · · ·+ w5 ≤ 1, (3.3)
for our 5 stock example.
Another issue is transaction costs. Suppose the manager rebalances the portfolio monthly,
trading the stocks each month to maintain the desired weights. If it is too expensive to trade
stock 1 (which may be a very illiquid stock), then the following constraint may be imposed:
−0.10 ≤ w01 − w1 ≤ 0.10, (3.4)
where w01 is the previous month's weight. This effectively imposes both a lower and an upper
bound on w1.
One desired objective is to form a portfolio with certain attributes or characteristics. For
example, it may be desirable to set the beta of the portfolio to 1.5. In this case, we impose
1.5 = βp = w1β1 + w2β2 + · · ·+ w5β5. (3.5)
One may also similarly impose restrictions on earnings-to-price (E/P) or size or sector exposures.
There are hedge funds specializing in long-short strategies, and some other funds may also do
this to some degree. One particular idea is pairs trading: for every dollar invested long in stock 1,
we short an equal dollar amount of stock 2. In this case, the constraint is
w1 + w2 = 0.
Suppose now we have a portfolio of 2n stocks and want to go long half of them and short the
other half; then
w1 + w2 + · · ·+ wn = 1,
wn+1 + wn+2 + · · ·+ w2n = −1.
Note that the net investment in the portfolio is zero; it is theoretically called a zero-cost portfolio
(in practice, trading costs and short-selling costs are not negligible).
It should be noted that the analytical formulas provided in the previous section are no longer
valid once we impose such restrictions; it is generally impossible to derive closed-form solutions
for the constrained case. We have to solve for the optimal portfolios numerically, which is the
topic of the next subsection.
3.2 Quadratic programming
Quadratic programming is the process of solving an optimization problem, minimizing a quadratic
function of multiple variables subject to linear constraints on the variables. Mathematically, we
solve
\[
\min_x \; \Pi = \frac{1}{2}x'Qx + q'x \qquad (3.6)
\]
\[
\text{s.t.} \quad Gx \le h, \qquad Ax = b,
\]
where \(x = (x_1, x_2, \ldots, x_n)'\) is the vector of variables. The problem is well understood in mathematics and
computer science, and algorithms are well developed to solve it numerically via Python, Matlab or
R. Note that the above constraints contain upper and lower bounds on the weights as special cases
as demonstrated below.
It is critically important to understand the link between our mean-variance utility maximization
and quadratic programming. Recall that our objective function is
\[
U(w) = r_f + w'\mu - \frac{\gamma}{2}w'\Sigma w, \qquad (3.7)
\]
which is the case with a riskfree asset. Without the riskfree asset, there is no rf term. Mathematically,
maximizing U(w) is the same as minimizing −U(w),
\[
-U(w) = -r_f + \frac{\gamma}{2}w'\Sigma w - \mu'w, \qquad (3.8)
\]
as \(w'\mu = \mu'w\). Since rf is a constant that does not affect the optimal solution, we can ignore it.
Comparing (3.8) with (3.6), we have
\[
Q = \gamma\Sigma, \qquad q = -\mu, \qquad x = w.
\]
Hence, utility maximization is a quadratic programming problem.
Moreover, the practical constraints can easily be incorporated into the standard constraints of
quadratic programming. For example, consider two assets with no short sales and a limit of 80%
on each. Then we want to have
\[
0 \le w_1 \le 0.8, \qquad 0 \le w_2 \le 0.8.
\]
Let
\[
G_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad
h_1 = \begin{bmatrix} 0.8 \\ 0.8 \end{bmatrix}.
\]
It is clear that
\[
G_1x = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
= \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \le h_1
\]
reflects the upper bounds. Let
\[
G_2 = \begin{bmatrix} -1 & 0 \\ 0 & -1 \end{bmatrix}, \qquad
h_2 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.
\]
It is clear that
\[
G_2x = \begin{bmatrix} -1 & 0 \\ 0 & -1 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
= \begin{bmatrix} -x_1 \\ -x_2 \end{bmatrix} \le h_2
\]
reflects the lower bounds. Hence, if we stack G1 and G2 together, and h1 and h2 together,
\[
G = \begin{bmatrix} G_1 \\ G_2 \end{bmatrix}, \qquad
h = \begin{bmatrix} h_1 \\ h_2 \end{bmatrix},
\]
then Gx ≤ h reflects both the upper and lower bounds.
The equality constraints are even easier. For example, if we want to impose
w1 + w2 = 1,
we simply let
A = [1 1], b = 1;
then Ax = b reflects the constraint. As another example, suppose we want to fix the beta of
the portfolio at 1.5,
w1 × 0.8 + w2 × 2.2 = 1.5,
where 0.8 and 2.2 are the individual betas. Then
A = [0.8 2.2], b = 1.5.
If there are many assets and many equality constraints are imposed, we simply stack the A's
and b's together, as we did for G.
In short, mean-variance utility maximization under various practical constraints does not have
an analytical solution, but it maps perfectly into quadratic programming, and hence can be solved
easily in practice with computer algorithms in Python, Matlab, or R. However, there are some
important issues that cannot be solved by quadratic programming, as discussed below.
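As a sketch of the link, problem (3.6) with Q = γΣ, q = −µ, the budget constraint, and bounds 0 ≤ wi ≤ 0.8 can be handed to a general-purpose solver; the example below uses scipy's SLSQP routine as a stand-in for a dedicated QP solver, with made-up three-asset inputs:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical inputs for three risky assets
mu = np.array([0.08, 0.12, 0.10])
Sigma = np.array([[0.040, 0.010, 0.008],
                  [0.010, 0.090, 0.012],
                  [0.008, 0.012, 0.0625]])
gamma = 3.0
n = len(mu)

def neg_utility(w):
    # -U(w) as in (3.8), dropping the constant r_f
    return 0.5 * gamma * w @ Sigma @ w - mu @ w

res = minimize(neg_utility,
               x0=np.full(n, 1.0 / n),                 # start from 1/N
               method="SLSQP",
               bounds=[(0.0, 0.8)] * n,                # no short sales, 80% caps
               constraints=[{"type": "eq",
                             "fun": lambda w: w.sum() - 1.0}])  # full investment
w_opt = res.x
```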
3.3 Asset allocation
Asset allocation in practice is usually the advice wealth advisors or consultants give to their clients
on how to allocate their investments over a small set of asset classes.
It is often about long-term strategic asset allocation, where a fixed proportion is suggested for
each asset class and the portfolio is rebalanced quarterly or annually. The second allocation
strategy, dynamic asset allocation, occasionally changes the weights on the asset classes over
time based on expectations of the future, thus requiring accurate market predictions. The third is
tactical asset allocation, where investors are more active in adjusting weights on assets, sectors, or
individual stocks that show the most potential for perceived gains, while the original asset mix is
formulated much as in strategic and dynamic portfolios. Market timing can be viewed as the
extreme form of the latter, jumping in or out of the market based on active forecasts.
3.3.1 Stocks and bonds
The simplest asset allocation problem is to split the money between stocks (say S&P500 index)
and bonds (a bond index), and the money is usually invested into funds suggested by advisors or
ETFs (exchange traded funds).
Based on Ferri (2010, p. 76), the two assets above earned 9.7% and 7.7% annual returns over
1973–2009. However, the risk of the stock market is much higher: 18.8% versus only 5.5% for the
bond market.
Assume that the two are uncorrelated (from Ferri, 2010, p. 58, they had a 49% correlation during
1990–1994, but only 16% during 1995–1999; the correlation then turned negative, −46% during
2000–2004 and −20% during 2005–2009). Then the minimum-risk portfolio puts 100% weight on
bonds, as short sales are typically not allowed, which can be verified from the GMV formula (2.20).
So the minimum risk is 5.5%. If an investor cannot tolerate this level of risk, short-term T-bills
may be the only way to go, and they barely beat inflation in the long term (see your Investments
text or Ferri, 2010, p. 27). If the investor can take more than 5.5% risk, then an allocation to the
stock market makes sense. A naive 50% allocation will earn an expected return of
\[
R_p = 0.5 \times 9.7 + 0.5 \times 7.7 = 8.7\%,
\]
and, since the two assets are assumed uncorrelated, a risk of
\[
\sigma_p = \sqrt{0.5^2 \times 18.8^2 + 0.5^2 \times 5.5^2} = 9.79\%.
\]
If the investor can take more risk, then more money can be allocated to stocks to earn a greater
expected return. However, the additional return may carry too much risk for an average investor.
As a result, the typical recommendation is to invest 60–70% in the stock market and put the rest
into bonds.
Note that the money here is supposed to be left for long-term investment, so cash may already
have been set aside for short-term liquidity. This is the reason why there is no riskfree investment
(in T-bills) here.
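The 50/50 arithmetic above, with the weights squared in the portfolio variance because the two assets are assumed uncorrelated:

```python
import numpy as np

# Ferri (2010) figures: stocks 9.7% return / 18.8% vol, bonds 7.7% / 5.5%
mu = np.array([9.7, 7.7])    # percent per year
sig = np.array([18.8, 5.5])  # percent per year
w = np.array([0.5, 0.5])     # naive 50/50 split

r_p = w @ mu                              # expected portfolio return, percent
sigma_p = np.sqrt(np.sum(w**2 * sig**2))  # uncorrelated assets: squared weights
```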
3.3.2 Multi-asset classes
Suppose that the investor has enough wealth that the money can be further invested in different
asset classes beyond stocks and bonds, and within each class, subclasses can be considered. For
example, within equities, the funds may be divided between growth stocks and international
markets for aggressive wealth growth and diversification. Within bonds, one may consider other
fixed-income investments such as U.S. corporate bonds and international bonds.
The other asset classes may include: commodities (precious metals, nonferrous metals, agriculture,
energy, etc.), real estate, collectibles (such as art, coins, or stamps), insurance products (annuities,
life settlements, catastrophe bonds, personal life insurance products, etc.), derivatives (such as
options, collateralized debt, and futures), foreign currency, venture capital, private equity,
distressed securities, and hedge funds.
Mathematically, the optimal allocation can be solved in the same way, by imposing realistic
constraints and using quadratic programming. With 10 or 20 assets, the covariance matrix is
relatively easy to estimate, so there is usually no problem with the implementation. In practice,
some naive or rounded values are often provided to investors.
3.4 Large set of individual stocks
Consider now the case where we need to invest our money in a large number of assets, say
thousands of stocks.
In this case, it is difficult to get a good covariance matrix estimator. The sample covariance
matrix is usually useless, as it is often not invertible: the condition T ≥ N + 2 is violated when
N, in the thousands, exceeds T, the sample size (the number of time periods here). Note that
even if T can be made large, data from long ago may have little relevance today. We discuss
below two solutions to the problem.
The first method is to take a two-stage approach. In the first stage, we divide stocks into a few,
or a few dozen, categories. The division sets up an asset allocation problem: allocating funds
across a few dozen groups of stocks. Within each group, we may simply invest in the group
indices (though one may further choose stocks to outperform them), such as industry indices. In
the second stage, we decide how much to invest in each of the groups. To do so, we can use
either a naive portfolio rule, such as 1/N or any ad hoc rule discussed in the previous section, or
an optimized rule. This does not require inverting a large covariance matrix. Indeed, many
institutional investors operate in this fashion in practice (see, e.g., Platanakis, Sutcliffe and Ye,
2021).
The second method is to impose a factor structure or to use shrinkage, so that the covariance
matrix estimate becomes invertible. A simple factor approach is discussed in Section 6.4, where
the residual covariance matrix is assumed diagonal. Relaxing this assumption, Fan, Liao and
Mincheva (2013) provide a more general POET estimator (Principal Orthogonal complEment
Thresholding). POET is well known, and its R code is available on the web. Ledoit and Wolf
(2013, 2017, 2020) provide some of the most useful alternative shrinkage estimators.
3.5 Estimation risk
Portfolio optimization generally refers to selecting the best portfolio (asset allocation) out of the set
of all possible portfolios according to some objective function. The mean-variance portfolio theory
is a particular case.
3.5.1 The plug-in rule
In the case where the riskless asset is present and there are no portfolio constraints, we have a
simple analytical formula for the optimal portfolio. However, the formula involves unknown
parameters. To apply it, we first estimate the parameters; then, replacing the parameters by
their estimates, we obtain the so-called plug-in optimal portfolio rule, or plug-in rule for short.
Specifically, in our mean-variance portfolio context, since the true parameters µ and Σ are
unknown, the true theoretical optimal portfolio cannot be obtained. To implement the mean-variance
portfolio theory of Markowitz (1952), the optimal portfolio weights are usually estimated by
following a two-step procedure.
Suppose there are T periods of observed excess-return data ΦT = {R1, R2, · · · , RT } (since we
assume a riskless asset is available, all our portfolio choice formulas depend only on excess
returns), and we would like to form a portfolio for period T + 1. In the first step, the mean and
covariance matrix of the asset returns are estimated from the observed data. Under the
assumption that Rt is i.i.d., the standard estimates are
\[
\hat\mu = \frac{1}{T}\sum_{t=1}^T R_t, \qquad (3.9)
\]
\[
\hat\Sigma = \frac{1}{T-1}\sum_{t=1}^T (R_t - \hat\mu)(R_t - \hat\mu)', \qquad (3.10)
\]
which are the same estimators given in Section 2.2.5. These estimates extend the univariate
sample mean and sample variance, (1.20) and (1.21), to the multiple-asset, or high-dimensional,
case.
In the second step, the sample estimates are then treated as if they were the true parameters
and simply plugged into the theoretical formula (2.45) to compute the optimal portfolio weights,
\[
\hat w = \frac{1}{\gamma}\hat\Sigma^{-1}\hat\mu. \qquad (3.11)
\]
This is known as the plug-in rule, obtained by plugging in the sample estimates. Statistically,
the above moment estimators are efficient and converge to the true parameters as the sample
size T goes to infinity.
However, in practice, the sample size is small and limited, so there are substantial estimation
errors in both the expected returns and the covariance matrix. This issue is the focus of the
remaining subsections and will be examined further in Chapter 4.
3.5.2 Errors in using a model
In general, portfolio optimization assumes a model for the data generating process. It is important
to remember that “All models are wrong, but some are useful” (George Box, Statistician). Hence,
there are three types of errors in a model:
1. Errors in fitting the data
Our models are built to fit the past data, and the models are never perfect as there are
assumed random errors even within the models.
2. Errors in parameter estimates
Given the assumed random errors, there are additional errors resulting from estimating the
assumed true but unknown parameters.
3. Errors in capturing the changing world
The models are built from and for the past data, but the future may move into unforeseen
regimes or crises, or behaviors may shift.
3.5.3 Estimation errors
Here we focus on estimation errors. In particular, even if the assumed data-generating process is
true, there are errors in estimating the expected returns and the covariance matrix due to
limited data. Because of these errors, the plug-in rule often performs poorly.
Example 3.1 Similar to Example 2.8, assume that there are N = 2 risky assets with monthly
expected return and monthly covariance matrix:
\[
\mu = \frac{1}{12}\begin{bmatrix} 0.10 \\ 0.20 \end{bmatrix}, \qquad
\Sigma = \frac{1}{12}\begin{bmatrix}
0.3^2 & 0.5\times 0.3\times 0.4 \\
0.5\times 0.3\times 0.4 & 0.4^2
\end{bmatrix},
\]
and γ = 3. Then the optimal weights at the true parameters are
\[
w = \frac{1}{\gamma}\Sigma^{-1}\mu = \begin{bmatrix} 0.123 \\ 0.370 \end{bmatrix}.
\]
If T = 120 (10 years of data), then the standard errors of the estimated expected returns are
\[
\frac{0.30/12}{\sqrt{120}} = 0.002282, \qquad \frac{0.40/12}{\sqrt{120}} = 0.003043.
\]
Since values within 2 standard deviations are quite likely, let us say we have made errors of 1
standard deviation. Then we use the true values plus the errors, i.e.,
\[
\hat\mu = \mu + \begin{bmatrix} 0.002282 \\ 0.003043 \end{bmatrix},
\]
to compute the weights,
\[
\hat w = \frac{1}{\gamma}\Sigma^{-1}\hat\mu = \begin{bmatrix} 0.191 \\ 0.421 \end{bmatrix},
\]
which are quite different from w. That is the problem caused by errors in µ alone. ♠
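Example 3.1 can be reproduced in a few lines (the standard errors are taken as stated in the example):

```python
import numpy as np

# Example 3.1: two assets, monthly parameters, gamma = 3
mu = np.array([0.10, 0.20]) / 12
Sigma = np.array([[0.3**2, 0.5 * 0.3 * 0.4],
                  [0.5 * 0.3 * 0.4, 0.4**2]]) / 12
gamma = 3.0

w_true = np.linalg.solve(Sigma, mu) / gamma      # approx (0.123, 0.370)

# One-standard-error perturbation of the means, as in the example
se = np.array([0.30 / 12, 0.40 / 12]) / np.sqrt(120)
mu_hat = mu + se
w_hat = np.linalg.solve(Sigma, mu_hat) / gamma   # approx (0.191, 0.421)
```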
In practice, we also have errors in estimating Σ, which makes the problem worse than in the
example. In addition, as N becomes large, the problem becomes more severe. The issue has been
known to practitioners for a long time. For example, Michaud and Michaud (2008) emphasize
that it is difficult to estimate the inputs (mean and covariance matrix) of the portfolio
optimization, and that even small changes in the inputs can lead to very large changes in the
optimized portfolio weights.
Brown (1976, 1978), Klein and Bawa (1976), and Bawa, Brown and Klein (1979) are early
academic studies of the problem. Kan and Zhou (2007), DeMiguel, Garlappi, and Uppal (2009),
Tu and Zhou (2011), and Pedersen, Babu, and Levine (2020) are examples of more recent studies.
In the subsections that follow, we first provide a theoretical analysis of the problem, and then
discuss some of the solutions.
3.5.4 Analytical assessment∗
To understand the impact of estimation errors, consider the loss of expected utility. Recall from
the previous chapter that, if an investor knows the true µ and Σ, the optimal portfolio is
\[
w^* = \frac{1}{\gamma}\Sigma^{-1}\mu, \qquad (3.12)
\]
which yields a maximum utility of
\[
U(w^*) = \frac{1}{2\gamma}\mu'\Sigma^{-1}\mu, \qquad (3.13)
\]
and a maximum Sharpe ratio of
\[
\text{Sharpe Ratio} = \sqrt{\mu'\Sigma^{-1}\mu}.
\]
These results hold only if the investor knows the true parameters. But no one knows the true
parameters in the real world.
To see the consequence of not knowing the true parameters, consider a case in which an investor does not know µ, but knows Σ. The assumption of a known Σ is made to simplify the formulas below. Also, in practice, one can potentially use high-frequency data to estimate Σ with greater accuracy than relying on the sample covariance matrix. Assume that the investor uses the sample mean µˆ to estimate µ; then he/she can only invest based on the plug-in rule,
\[
\hat w_{\text{plug}} = \frac{1}{\gamma}\Sigma^{-1}\hat\mu, \qquad (3.14)
\]
and the expected utility is only
\[
E[U(\hat w)] = \frac{1}{2\gamma}\,\mu'\Sigma^{-1}\mu - \frac{1}{2\gamma}\,\frac{N}{T}, \qquad (3.15)
\]
which is lower than previously. The Sharpe ratio is lower too,
\[
\text{Sharpe Ratio} = \frac{\mu'\Sigma^{-1}\mu}{\sqrt{\mu'\Sigma^{-1}\mu + N/T}}. \qquad (3.16)
\]
Both (3.15) and (3.16) follow from Kan and Zhou (2007) by assuming that the data are iid normal.
Equation (3.15) says that the investor will get less utility than someone who knows the true parameters, and the loss depends critically on N, the number of stocks, and T, the sample size (number of periods of data). If the data are monthly, say T = 120 (10 years of data), then, if N = 50, the expected utility will be negative! (as µ′Σ−1µ is the monthly squared Sharpe ratio and its value usually is far less than 0.5). The investor would be better off putting the money into the riskfree asset, as the likelihood of choosing a bad portfolio from inaccurate estimation is too high, given the estimates from the limited data.
However, as the sample size T goes to infinity, the utilities will be the same, since the estimated parameters then converge to the true parameters. In the real world, T is finite, and so there is always an issue of whether the data provide sufficient information for a given application.
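A quick numerical illustration of (3.15) shows how fast the expected utility deteriorates with N; the squared Sharpe ratio below is an assumed illustrative value, not a number from the notes.

```python
# Expected utility of the plug-in rule, eq. (3.15), for illustrative values.
gamma, T = 3.0, 120
theta2 = 0.04   # assumed monthly squared Sharpe ratio, mu' Sigma^{-1} mu

for N in (5, 25, 50):
    EU = theta2 / (2 * gamma) - N / (2 * gamma * T)
    print(N, round(EU, 4))
```

With these values the utility is already negative at N = 5 (since N/T = 0.042 exceeds the assumed θ² = 0.04), and far more so at N = 50, consistent with the warning in the text.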
3.5.5 Correlation shrinkage
Pedersen, Babu, and Levine (2020) emphasize that the poor performance problem is primarily due to errors in estimating the small-eigenvalue portfolios, although the latter issue seems to have been known to practitioners for years. For example, Chen and Yuan (2016) propose to use a factor model to eliminate small-eigenvalue portfolios.
To see the impact of the small-eigenvalue portfolios, consider the assets after the linear transformation by principal components analysis (PCA) (see Section 6.2 on PCA),
\[
R_{PCA} = A'(R - \mu), \qquad (3.17)
\]
where A is the matrix of standardized eigenvectors of the covariance matrix Σ such that Σ = AΛA′, with Λ the diagonal matrix of eigenvalues in decreasing order (see Equation (6.27)), and so R_{PCA} are N portfolios of the original assets with the eigenvectors as weights. Then investing in all the N original assets is the same as investing in their N independent portfolios R_{PCA}. Let µ^P_1, ..., µ^P_N be the expected returns on the latter, and σ^P_1, ..., σ^P_N be their volatility risks. Since the variance of the j-th component of the PCA is λ_j (σ^P_j = \sqrt{\lambda_j}) and the PCA portfolios are uncorrelated, we have, from Example 2.11, that the weight on the j-th portfolio is
\[
w^P_j = \frac{1}{\gamma}\,\frac{\mu^P_j}{\sqrt{\lambda_j}}\,\frac{1}{\sqrt{\lambda_j}}. \qquad (3.18)
\]
This says that the portfolio weight is a product of three terms: the inverse risk aversion, the Sharpe ratio of the asset, and 1/\sqrt{\lambda_j}. The first two terms are typically bounded. But the last term can be too large if the eigenvalues are small. Note that the eigenvalues are ordered here, λ_1 ≥ λ_2 ≥ ... ≥ λ_N. Indeed, the small eigenvalues are often underestimated, or estimated to be too small, in the real world, especially if N is large (high dimensionality). As a result, the optimal portfolio will load up too heavily on the principal components associated with small eigenvalues.
Hence, any errors in the mean will have a huge impact on the performance of the estimated optimal portfolio too.
Because of the issue above, Pedersen, Babu, and Levine (2020) provide a simple solution to the small-eigenvalue problem. They propose to shrink the correlations of the asset returns, and call their solution Enhanced Portfolio Optimization. Contemporaneously, Menchero and Li (2020) also provide a similar shrinkage solution, but they focus on risk forecasting.
Let Σ be the covariance matrix as usual. We can express it mathematically as a product of its standard deviations and its correlation matrix. When N = 2, it is easy to verify that
\[
\Sigma = \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 \\ \rho_{21}\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}
= \begin{pmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{pmatrix}
\begin{pmatrix} 1 & \rho_{12} \\ \rho_{21} & 1 \end{pmatrix}
\begin{pmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{pmatrix}.
\]
In general,
\[
\Sigma = \begin{pmatrix}
\sigma_1 & 0 & \cdots & 0 \\
0 & \sigma_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_N
\end{pmatrix}
\begin{pmatrix}
1 & \rho_{12} & \cdots & \rho_{1N} \\
\rho_{21} & 1 & \cdots & \rho_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{N1} & \rho_{N2} & \cdots & 1
\end{pmatrix}
\begin{pmatrix}
\sigma_1 & 0 & \cdots & 0 \\
0 & \sigma_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_N
\end{pmatrix}
= D_\sigma \Omega D_\sigma, \qquad (3.19)
\]
where σ_i^2 is the variance of asset i and ρ_{ij} is the correlation between asset i and asset j, D_σ is the diagonal matrix of the σ_i's, and Ω is the correlation matrix. Denote by Ωˆ the estimated correlation
matrix. The shrinkage estimator is defined as
\[
\hat\Omega_\eta = (1-\eta)\hat\Omega + \eta I_N, \qquad \eta\in[0,1], \qquad (3.20)
\]
where I_N is the identity matrix. Then the covariance matrix is estimated by
\[
\Sigma_\eta = D_\sigma \Omega_\eta D_\sigma. \qquad (3.21)
\]
If η = 0, there is no shrinkage and we simply use the original Ωˆ. If η = 1, we replace it by the identity matrix, effectively ignoring all the correlations, so that the eigenvalues are estimated much more accurately as the asset variances. The shrinkage toward zero correlations is intuitive because the true correlations among assets tend to be small while the estimates tend to be high in practice.
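Equations (3.20)–(3.21) can be coded in a few lines; the function below is a minimal sketch (the function name is made up, not from the notes):

```python
import numpy as np

def shrink_covariance(Sigma_hat, eta=0.5):
    """Correlation shrinkage, eqs. (3.20)-(3.21): shrink the sample
    correlation matrix toward the identity by a factor eta."""
    sd = np.sqrt(np.diag(Sigma_hat))          # standard deviations
    D = np.diag(sd)
    Omega_hat = Sigma_hat / np.outer(sd, sd)  # sample correlation matrix
    Omega_eta = (1 - eta) * Omega_hat + eta * np.eye(len(sd))
    return D @ Omega_eta @ D                  # eq. (3.21)
```

With η = 0 the input is returned unchanged; with η = 1 the result is diagonal, keeping only the asset variances; η = 0.5 halves every off-diagonal correlation.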
How to choose η? One can choose a value of η that worked well in the past and keep updating it over time. Pedersen, Babu, and Levine (2020) find that a simple choice of η = 50% works well in a number of data sets. We will compare this rule with others later in Section 3.5.7 in our data applications.
3.5.6 Combination of 1/N with plug-in
Due to estimation errors, many investors use ad hoc rules (see Section 2.1) in practice. Consistent with this, DeMiguel, Garlappi, and Uppal (2009) show that the simple 1/N investment rule can actually outperform most estimated optimized rules (which are optimal only if there were no estimation errors), including the previous plug-in rule. Tu and Zhou (2011) provide improved portfolio rules by combining the 1/N rule with estimated optimized rules in the riskfree asset case. Kan, Wang, and Zhou (2020) provide new improved portfolio rules in the no-riskless-asset case. The estimation errors are more severe in a large portfolio (N large), for which Ao, Li, and Zheng (2019) introduce some effective methods.
In what follows, we focus on the rule proposed by Tu and Zhou (2011), which is simple and effective. First, instead of using ŵ, we use a scaled one:
\[
\bar w = \frac{1}{\gamma}\tilde\Sigma^{-1}\hat\mu, \qquad (3.22)
\]
where \tilde\Sigma = \frac{T-1}{T-N-2}\hat\Sigma. The scaled w̄ is unbiased, and performs better than ŵ theoretically. Indeed, it outperforms ŵ in almost all empirical applications.
Let w_e = 1_N/N be the 1/N rule that invests 1/N of every dollar into each asset. Tu and Zhou (2011) consider a combination of w_e with w̄,
\[
\hat w_C = (1-\delta)w_e + \delta \bar w. \qquad (3.23)
\]
Intuitively, this is portfolio diversification. Instead of investing in either w′_e R or w̄′R alone, we invest in a portfolio of the two, and this should do better in general.
Indeed, theoretically, there exists δ > 0 such that ŵ_C dominates both w_e and w̄, i.e., it performs better unless the true parameters take special values such that Σ^{-1}µ/γ = 1_N/N. In the latter case, δ = 0, and ŵ_C becomes w_e. But how do we obtain δ?
In practice, δ can be estimated, but the performance of using the estimated value will not be as good as using the true δ, as it weakens due to errors in the estimation. Nevertheless, it tends to perform better than w̄ in most applications. The estimate of δ is
\[
\hat\delta = \hat\pi_1/(\hat\pi_1 + \hat\pi_2), \qquad (3.24)
\]
with π̂_1 and π̂_2 given by
\[
\hat\pi_1 = w_e'\hat\Sigma w_e - \frac{2}{\gamma}w_e'\hat\mu + \frac{1}{\gamma^2}\tilde\theta^2, \qquad (3.25)
\]
\[
\hat\pi_2 = \frac{1}{\gamma^2}(c_1-1)\tilde\theta^2 + \frac{c_1}{\gamma^2}\frac{N}{T}, \qquad (3.26)
\]
where θ̃² is an estimator of θ² = µ′Σ^{-1}µ and c_1 = (T-2)(T-N-2)/((T-N-1)(T-N-4)), with T > N + 4.
A natural estimator of θ² is its sample counterpart,
\[
\hat\theta^2 = \hat\mu'\hat\Sigma^{-1}\hat\mu. \qquad (3.27)
\]
But θ̂² can be a heavily biased estimator of θ² when T is small. Hence, we use the estimator below, proposed by Kan and Zhou (2007),
\[
\tilde\theta^2 = \frac{(T-N-2)\hat\theta^2 - N}{T}
+ \frac{2(\hat\theta^2)^{N/2}(1+\hat\theta^2)^{-(T-2)/2}}{T\,B_{\hat\theta^2/(1+\hat\theta^2)}(N/2,\,(T-N)/2)}, \qquad (3.28)
\]
where
\[
B_x(a,b) = \int_0^x y^{a-1}(1-y)^{b-1}\,dy \qquad (3.29)
\]
is the incomplete beta function. The first part is the unbiased estimator of θ², and the second part is an adjustment to improve the unbiased estimator when it is too small.
A simple combination of w_e with w̄ is the naive diversification, δ = 1/2,
\[
\hat w_{\text{naive}} = \frac{1}{2}w_e + \frac{1}{2}\bar w. \qquad (3.30)
\]
In practice, this rule works well. It is a special case of model averaging (see Section 3.7.2), and it is much simpler than the optimal combination ŵ_C since no estimation of δ is needed. However, ŵ_C has the theoretical advantage that it converges to the true rule w* as the sample size goes to infinity. In contrast, ŵ_naive never will.
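The pieces (3.22)–(3.28) can be assembled into a short function. The sketch below is not the authors' code: the function name is made up, the incomplete beta B_x(a,b) is recovered from SciPy's regularized version, and δ̂ is clipped to [0,1] as a practical safeguard (the clipping is not part of equation (3.24)).

```python
import numpy as np
from scipy.special import betainc, beta as beta_fn

def tu_zhou_combination(R, gamma=3.0):
    """Sketch of the Tu-Zhou (2011) combination rule, eqs. (3.22)-(3.28).
    R is a T x N matrix of excess returns; the function name is hypothetical."""
    T, N = R.shape
    mu = R.mean(axis=0)
    S = (R - mu).T @ (R - mu) / T                  # MLE covariance, eq. (3.36)
    Sigma_hat = T * S / (T - 1)                    # unbiased for Sigma
    Sigma_tilde = T * S / (T - N - 2)              # unbiased for the inverse
    w_bar = np.linalg.solve(Sigma_tilde, mu) / gamma   # scaled plug-in, eq. (3.22)
    w_e = np.ones(N) / N                           # 1/N rule

    # Adjusted estimator of theta^2 = mu' Sigma^{-1} mu, eqs. (3.27)-(3.28)
    th2 = mu @ np.linalg.solve(Sigma_hat, mu)
    x = th2 / (1.0 + th2)
    Bx = betainc(N / 2, (T - N) / 2, x) * beta_fn(N / 2, (T - N) / 2)
    th2_adj = ((T - N - 2) * th2 - N) / T \
        + 2 * th2 ** (N / 2) * (1 + th2) ** (-(T - 2) / 2) / (T * Bx)

    # Combination weight, eqs. (3.24)-(3.26)
    c1 = (T - 2) * (T - N - 2) / ((T - N - 1) * (T - N - 4))
    pi1 = w_e @ Sigma_hat @ w_e - (2 / gamma) * (w_e @ mu) + th2_adj / gamma**2
    pi2 = (c1 - 1) * th2_adj / gamma**2 + c1 * N / (gamma**2 * T)
    delta = pi1 / (pi1 + pi2)
    delta = min(max(delta, 0.0), 1.0)   # practical clip, not in eq. (3.24)
    return (1 - delta) * w_e + delta * w_bar, delta
```

The naive rule (3.30) corresponds to fixing delta = 1/2 instead of estimating it.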
3.5.7 Backtesting: A comparison of rules
Backtesting often refers to testing a model or strategy based on historical data. It provides insights into how well the model would have performed had it been used. However, it should be remembered that a model that worked well in the past may not work well in the future.
As an example of backtesting, we provide here a detailed implementation and a comparison of 5 portfolio investment rules: 1/N, the plug-in, the GMV, the correlation shrinkage of Pedersen, Babu, and Levine (2020), and the combination rule of Tu and Zhou (2011).
Suppose that we have data from period 1 to T. We have to use some of the sample for initial estimation. Assume that we use data from 1 to T_0 as the initial sample to obtain the parameters, say T_0 = 120, i.e., 10 years of monthly observations. Then we can invest starting from time T_0.
Note that the 1/N rule,
\[
\hat w^{(1)} = \frac{1}{N}1_N, \qquad (3.31)
\]
can start at any time as it does not require estimation, but we set it to start at T_0 for comparison with the other rules. Based on the data, the plug-in rule is easily computed at T_0,
\[
\hat w^{(2)} = \frac{1}{\gamma}\hat\Sigma^{-1}\hat\mu, \qquad (3.32)
\]
based on formula (3.11), where µˆ and Σˆ are estimated by using data up to T_0 [or replacing T by T_0 in (3.9) and (3.10)], and the GMV rule is computed from
\[
\hat w^{(3)} = \frac{\hat\Sigma^{-1}1_N}{1_N'\hat\Sigma^{-1}1_N}, \qquad (3.33)
\]
based on formula (2.21).
The correlation shrinkage of Pedersen, Babu, and Levine (2020) is computed like the plug-in rule,
\[
\hat w^{(4)} = \frac{1}{\gamma}\hat\Sigma_\eta^{-1}\hat\mu, \qquad \hat\Sigma_\eta = \hat D_\sigma\hat\Omega_\eta\hat D_\sigma, \qquad (3.34)
\]
except that now the covariance matrix is estimated from shrinkage with η = 1/2. The combination rule of Tu and Zhou (2011) is
\[
\hat w^{(5)} = (1-\hat\delta)w_e + \hat\delta\bar w, \qquad (3.35)
\]
where w̄ uses \tilde\Sigma = (T-1)\hat\Sigma/(T-N-2), and δ̂, though somewhat complex, can still be computed in a few steps.
At time T_0 + 1, we have one more data point, from 1 to T_0 + 1. Then we re-estimate the parameters and use them to re-compute the portfolio weights to determine the investment at T_0 + 1. We do the same at T_0 + 2, and so on, till time T_0 + (T − T_0) = T. Note that we then have returns on the various investment rules available from T_0 + 1, T_0 + 2, ..., to T, and we can use them to assess the performance of the rules, such as by comparing their Sharpe ratios.
Recall from Section 2.2.6 that the above estimation is known as recursive estimation, with time-varying windows or with data recursively available. At T_0, we use T_0 data points. At T_0 + 1, we use one more data point, and so on. The length of data, or sample size, increases over time. See the Python codes for implementation.
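The recursive (expanding-window) loop just described can be sketched as follows; `rule` stands for any of the weight functions above, and the helper names here are made up for illustration:

```python
import numpy as np

def backtest_recursive(R, rule, T0=120):
    """Expanding-window backtest sketch: at each t >= T0, estimate the rule
    with data 1..t and realize its return in period t+1."""
    T, N = R.shape
    rets = []
    for t in range(T0, T):
        w = rule(R[:t])           # weights from data up to time t
        rets.append(w @ R[t])     # realized portfolio return at t+1
    return np.array(rets)

# Two of the five rules as weight functions (hypothetical names)
def one_over_n(X):
    return np.ones(X.shape[1]) / X.shape[1]

def plug_in(X, gamma=3.0):
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    return np.linalg.solve(Sigma, mu) / gamma

def sharpe(r):
    return r.mean() / r.std(ddof=1)
```

Running both rules on the same return matrix and comparing `sharpe()` of the two out-of-sample return series mirrors the comparison described in the text.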
Alternatively, one can also do the estimation with a fixed window size. For example, we use T_0 = 120 data points at T_0, and still use 120 data points at T_0 + 1 (the data from 2, 3, ..., T_0, T_0 + 1), and continue to do so till time T. A reason for doing this is that data that are too old may not capture well what is happening recently or in the near future. This procedure is known as rolling estimation with a window size of 120.
To summarize, it is important in practice to consider alternative investment rules, because all of them are approximations and none is always better than the others. For example, the 1/N rule is good for cases in which there are many assets whose moments are difficult to estimate and whose expected returns are likely equal. But it clearly performs poorly when applied to the case of one riskfree asset and one risky asset. Hence, for a given application, it is important to have a complete list of investment strategies (the above plus more, such as value-weighting and additional combination rules), and to find out which ones are better than others. Then the better ones may be applied directly or used after further combining.
Note that the portfolio weights of the 4 estimated portfolio rules can be too large in some cases. To make them more realistic, we also compare the rules under the constraint that the weight on each asset is less than a bound,
\[
|\hat w^{(i)}_j| \le b, \qquad j = 1,\ldots,N, \quad i = 1,2,3,4,
\]
where b is the limit on long or short positions in each asset. For the GMV and the plug-in rules, we can solve the constrained problems via quadratic programming. For the constrained Tu-Zhou rule, we can use the previous δ̂ as an approximation to obtain it as the combination of the 1/N rule and the solved constrained plug-in rule. The combination weights will clearly satisfy the above bound because both of the underlying rules do.
3.5.8 A Bayesian solution
The expected stock returns, or means, are known to be difficult to estimate. The sample mean is the most common estimator, but some shrinkage estimators will be discussed later. For Σ, there are simple alternative estimators. The maximum likelihood estimator is
\[
\hat S = \frac{1}{T}\sum_{t=1}^T (R_t - \hat\mu)(R_t - \hat\mu)'. \qquad (3.36)
\]
Its relations to our other estimators are:
• Unbiased estimator of Σ:
\[
\hat\Sigma = T\hat S/(T-1), \qquad (3.37)
\]
which is unbiased in that E[Σˆ] = Σ. However, Σˆ and Sˆ are numerically almost identical when T = 120.
• Unbiased estimator of Σ^{-1}:
\[
\tilde\Sigma = T\hat S/(T-N-2), \qquad (3.38)
\]
which satisfies E[Σ˜^{-1}] = Σ^{-1}.
• Bayesian rule under a diffuse prior:
\[
\hat\Sigma_{\text{Bayes}} = (T+1)\hat S/(T-N-2), \qquad (3.39)
\]
which is the implied Σ estimator of the Bayesian optimal portfolio weights, the solution to
\[
\hat w_{\text{Bayes}} = \arg\max_w \int_{R_{T+1}} U(w)\,p(R_{T+1}\,|\,\Phi_T)\,dR_{T+1} \qquad (3.40)
\]
\[
= \arg\max_w \int_{R_{T+1}}\int_\mu\int_\Sigma U(w)\,p(R_{T+1},\mu,\Sigma\,|\,\Phi_T)\,d\mu\,d\Sigma\,dR_{T+1}, \qquad (3.41)
\]
where U(w) is the utility of holding a portfolio w at time T + 1, p(R_{T+1}|Φ_T) is the predictive density, and
\[
p(R_{T+1},\mu,\Sigma\,|\,\Phi_T) = p(R_{T+1}\,|\,\mu,\Sigma,\Phi_T)\,p(\mu,\Sigma\,|\,\Phi_T), \qquad (3.42)
\]
where p(µ,Σ|Φ_T) is the posterior density of µ and Σ, and the prior is diffuse,
\[
p_0(\mu,\Sigma) \propto |\Sigma|^{-\frac{N+1}{2}}. \qquad (3.43)
\]
Notice that the Bayesian approach maximizes the average expected utility over the distribution of the parameters. The solution under the diffuse prior is almost the same as using the inverse unbiased estimator Σ˜.
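The four estimators above differ only by a scalar multiple of the MLE Ŝ; a quick computation with illustrative T and N makes their ordering visible:

```python
# Scale factors of the covariance estimators of Section 3.5.8,
# relative to the MLE S_hat; T and N are illustrative values.
T, N = 120, 10
factors = {
    "S_hat (MLE)":                    1.0,
    "Sigma_hat (unbiased for Sigma)": T / (T - 1),
    "Sigma_tilde (unbiased inverse)": T / (T - N - 2),
    "Sigma_Bayes (diffuse prior)":    (T + 1) / (T - N - 2),
}
for name, f in factors.items():
    print(f"{name}: {f:.4f} x S_hat")
```

For T = 120 and N = 10 the factors are about 1.000, 1.008, 1.111, and 1.120, which is why Ŝ and Σ̂ are numerically almost identical while Σ̃ and Σ̂_Bayes inflate the covariance noticeably and give nearly identical portfolios.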
The performance of the various estimators, in terms of expected utility, is ordered as
\[
E[J(\hat S)] < E[J(\hat\Sigma)] < E[J(\tilde\Sigma)] < E[J(\hat\Sigma_{\text{Bayes}})].
\]
However, they are still far away from being optimal. Tu and Zhou (2010) use an informative prior
and the results are substantially better in general. Some shrinkage estimators of both µ and Σ will
be discussed later (see Section 4.4).
3.6 Transaction costs
Transaction costs are important in practice. Novy-Marx and Velikov (2016) is a recent study on the cost of trading anomalies, that is, non-market factors such as size and momentum, and other long-short portfolios. They divide the anomalies into three turnover groups: low, mid, and high. They find that the cost of trading low-turnover strategies (such as size, gross profitability, and value) is generally quite low, often less than 10 bp per month, because these strategies require only annual rebalancing. The cost for mid-turnover strategies (14–35% turnover on each side, such as momentum and idiosyncratic volatility) is from 20 to 57 bp, and for high-turnover strategies (≥ 90% per month), over 1%. However, Frazzini et al. (2015) argue that the transaction costs are lower if the position is traded over 3 days.
For individual investors, the price impact is negligible and the cost is mainly commissions (there are two other tiny fees, the SEC fee and the FINRA Trading Activity Fee (TAF), which are regulatory fees charged on the sale (only) of any security: $22.10 per million for the SEC and $0.0000119 per share for the TAF, all rounded up to the nearest penny). Zero commissions, initiated by the Robinhood brokerage (Robinhood.com), have forced many brokerages today to offer zero commissions on online orders (broker-assisted orders are charged $25 or so, and options and other complex instruments still require fees). Individual investors can also trade US stocks algorithmically for free via APIs directly from Alpaca and Interactive Brokers, among others.
3.7 Model uncertainty
There are many forms of model uncertainty. We consider two cases here. In the first case, we provide more robust estimates of the parameters that improve the earlier sample moment estimates (see, e.g., Meucci, 2005, for more such analysis). In the second case, where the true model is unknown, we discuss the popular model averaging as an effective approach for using many candidate models.
3.7.1 Perturbation of the normal model
Assume that the true data distribution falls into the class of distributions
\[
\{G \mid G = (1-\epsilon)F_N + \epsilon W\}, \qquad (3.44)
\]
where F_N is the normal distribution, W is an arbitrary distribution, and ε is a constant between 0 and 1. The equation says that the true distribution, which may not be normal, lies in a neighborhood of the commonly assumed normal one. The question we ask is how this will affect our parameter estimates.
Perret-Gentil and Victoria-Feser (2004) prove two interesting results. First, the asymptotic bias of the optimal portfolio weights depends only on the asymptotic biases of µˆ and Σˆ. Second, the bias can potentially be infinite even though the data may deviate from normality by only a small amount.
Then, how do we estimate the parameters µ and Σ in such a way that the bias is small? The solution is to use weighted averages, rather than the simple averages or the standard sample moments (see (3.9) and (3.10)), to estimate µ and Σ:
\[
\hat\mu = \frac{1}{T}\sum_{t=1}^T w^m_t R_t, \qquad (3.45)
\]
\[
\hat V = \frac{1}{T}\sum_{t=1}^T \frac{w^{v_1}_t (R_t - \hat\mu)(R_t - \hat\mu)'}{w^{v_2}_t}, \qquad (3.46)
\]
where the weights w^m_t, w^{v_1}_t, and w^{v_2}_t depend on two control parameters (see Perret-Gentil and Victoria-Feser for details). As it turns out, the above estimates are more robust than the standard sample moments.
3.7.2 Model averaging
When there are multiple estimates or models, one of the popular decision methods is the so-called maxmin rule, which maximizes the min (worst scenario) of the objective function so that the worst possible loss is minimized.
In the mean-variance framework, suppose the investor is provided by J experts with estimates of the mean and covariance matrix of asset returns:
\[
\mu_j, \ \Sigma_j, \qquad j = 1,2,\ldots,J. \qquad (3.47)
\]
Which of them should the investor use?
The maxmin rule suggests that the investor choose the portfolio weights to maximize the worst-case utility:
\[
\max_w \Big[\min_j Q_j(w)\Big], \qquad (3.48)
\]
where Q_j(w) is
\[
Q_j(w) = w'\mu_j - \frac{\gamma}{2}w'\Sigma_j w, \qquad (3.49)
\]
i.e., the objective function evaluated at the estimated parameters µ_j and Σ_j.
A naive Bayesian procedure may assign probability λ_j to expert j's estimate; then the optimal portfolio weights are
\[
w = \frac{1}{\gamma}\Big(\sum_{j=1}^J \lambda_j\Sigma_j\Big)^{-1}\Big(\sum_{j=1}^J \lambda_j\mu_j\Big), \qquad (3.50)
\]
where the probabilities satisfy 0 ≤ λ_j ≤ 1 and \sum_{j=1}^J \lambda_j = 1.³
Formally, the Bayesian model averaging approach proceeds from a set of priors over J models,
\[
p_0(M_j) = \text{prior probability of Model } j, \qquad j = 1,2,\ldots,J. \qquad (3.51)
\]
After observing the data R, the posterior probability is given by
\[
p(M_j\,|\,R) = \frac{p_0(M_j)\,p(R\,|\,M_j)}{\sum_{j=1}^J p_0(M_j)\,p(R\,|\,M_j)}, \qquad (3.52)
\]
where p(R|M_j) is the marginal likelihood computed by
\[
p(R\,|\,M_j) = \int p_0(\theta_j\,|\,M_j)\,p(R\,|\,\theta_j,M_j)\,d\theta_j, \qquad (3.53)
\]
where p_0(θ_j|M_j) is the prior density of the parameter θ_j and p(R|θ_j,M_j) is the likelihood, conditional on model j being true.
Then, the predictive density of the return is that from each model weighted by the posterior probabilities,
\[
p(r_{T+1}\,|\,R) = \sum_{j=1}^J p(r_{T+1}\,|\,M_j,R)\,p(M_j\,|\,R), \qquad (3.54)
\]
3Lutgens and Schotman (2004) provide more discussion on combining various estimates.
which, when combined with the objective function, provides the Bayesian optimal portfolio choice. In the mean-variance case, the portfolio choice depends only on the predictive mean and variance:
\[
E^*_M = \sum_{j=1}^J E^*_j\, p(M_j\,|\,R), \qquad (3.55)
\]
\[
V^*_M = \sum_{j=1}^J V^*_j\, p(M_j\,|\,R) + \sum_{j=1}^J (E^*_j - E^*_M)(E^*_j - E^*_M)'\, p(M_j\,|\,R), \qquad (3.56)
\]
where E^*_j and V^*_j are the predictive moments from model j.⁴
However, equal averaging, i.e., equal weights on the estimates or models, is simple and popular in practice, and it usually works well.
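Equations (3.55)–(3.56) say that the predictive variance of a model average is the average within-model variance plus the dispersion of the model means. A minimal sketch (the function name is made up):

```python
import numpy as np

def bma_moments(E_list, V_list, probs):
    """Predictive mean and covariance under model averaging, eqs. (3.55)-(3.56):
    probability-weighted moments plus a between-model dispersion term."""
    E_M = sum(p * E for p, E in zip(probs, E_list))
    V_M = sum(p * V for p, V in zip(probs, V_list))        # within-model part
    V_M = V_M + sum(p * np.outer(E - E_M, E - E_M)         # between-model part
                    for p, E in zip(probs, E_list))
    return E_M, V_M
```

For example, with two one-asset models with predictive means 0 and 1, unit variances, and equal probabilities, the mixture mean is 0.5 and the mixture variance is 1.25; the extra 0.25 reflects disagreement between the models.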
3.8 Alternative objective functions
The mean-variance objective function is the focus here, as it is the most widely used in practice. We consider some alternatives in this subsection.
3.8.1 Kelly’s criterion
The Kelly criterion (known also as the Kelly strategy, Kelly bet, ...) provides an optimal gambling method, or a formula, whose suggested fixed proportional bet will lead almost surely to the greatest possible wealth, compared to any other strategy, in the long run. Mathematically, its objective is to maximize the expected geometric growth rate.
How should one place bets in an advantageous game? The Kelly criterion provides an answer: bet a predetermined fraction of wealth. Algorithmic trading often codes it into the program, some hedge funds use it for their trading strategies, and Warren Buffett and Bill Gross are reported to use it too. This is in fact not surprising because it is close to the mean-variance utility with risk aversion γ = 1, and is exactly maximizing the expected log utility of wealth.
However, as γ is around 3 for a typical investor, the Kelly criterion (with γ = 1) seems too aggressive to many investors. As a result, half Kelly is often used in practice, which is half of the usual Kelly bet and is equivalent to setting γ = 2. In comparison with the history of mean-variance portfolio theory, Kelly (1956) proposed his criterion four years after Markowitz (1952) proposed his portfolio theory. There are various extensions of the Kelly criterion, of which MacLean, Thorp, and Ziemba (2011) provide a survey of the literature.

⁴The website http://www.research.att.com/volinsky/bma.html provides a long list of papers on Bayesian model averaging.
Consider a gamble. Suppose that
• p is the winning probability of a bet;
• M is the money you win per $1 bet (if you win, you get 1 + M back);
• L is the loss (if you lose, you get 1 − L back).
To maximize the terminal wealth (assuming that you can play the game over and over), the Kelly criterion says that one should invest K%, the Kelly percentage, of one's wealth:
\[
K\% = \frac{pM - (1-p)L}{M\times L} = \frac{\text{Expected Return}}{M\times L}, \qquad (3.57)
\]
which simplifies to
\[
K\% = p - \frac{1-p}{M}, \qquad \text{if } L = 1. \qquad (3.58)
\]
Example 3.2 Suppose your trading strategy has a 50% chance to triple your money and a 50% chance to lose it all. How much money should you invest in it each time?
Now we have
\[
p = 50\%, \qquad M = 2, \qquad L = 1.
\]
Kelly's rule says you should invest
\[
K\% = .5 - \frac{1-.5}{2} = 25\%.
\]
Note that the expected mean and variance of the gamble are
\[
\mu = .5\times 2 + .5\times(-1) = .5,
\]
\[
\sigma^2 = .5\times(2-.5)^2 + .5\times(-1-.5)^2 = 2.25.
\]
Assuming the risk aversion is γ = 1 and the riskfree rate is zero, the optimal investment from mean-variance utility theory is
\[
w = \frac{1}{\gamma}\frac{\mu}{\sigma^2} = \frac{.5}{2.25} = 22.22\%,
\]
which is quite close to Kelly's solution. ♠
Note that for a typical risk aversion of 3, one invests only 22.22%/3 = 7.4%. This means that a Kelly investor is generally very aggressive and endures much greater risk.
Example 3.3 Suppose you are offered to play a coin-tossing game many times. The coin has a 60% chance of heads and 40% of tails. Heads you win $1, and tails you lose $1. What is the best strategy for placing your bets, if you have $100 to start with, to maximize your long-term gains?
It is clear that
\[
p = 60\%, \qquad M = 1, \qquad L = 1,
\]
and so
\[
K\% = .6 - \frac{1-.6}{1} = 20\%.
\]
Also we have
\[
\mu = .6\times 1 + .4\times(-1) = .20,
\]
\[
\sigma^2 = .6\times(1-.2)^2 + .4\times(-1-.2)^2 = 0.96,
\]
then
\[
w = \frac{1}{\gamma}\frac{\mu}{\sigma^2} = \frac{.2}{0.96} = 0.2083,
\]
again quite close to Kelly's solution. ♠
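A small simulation of the coin game in Example 3.3 confirms that the Kelly fraction maximizes the long-run growth rate; the function below is a sketch with made-up names and an illustrative number of rounds:

```python
import numpy as np

def avg_log_growth(k, p=0.6, M=1.0, L=1.0, T=100000, seed=0):
    """Average per-period log growth when betting a fraction k of wealth
    each round of the coin game (win M per $1 bet, lose L)."""
    rng = np.random.default_rng(seed)
    wins = rng.random(T) < p
    growth = np.where(wins, np.log(1 + k * M), np.log(1 - k * L))
    return growth.mean()
```

Betting the Kelly fraction k = 0.2 yields a higher average log growth than either under-betting (say k = 0.05) or over-betting (say k = 0.5); over-betting even makes the growth rate negative, despite the favorable odds.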
Proof of (3.57): Consider the discrete case only, and consider one period first. Let W_0 be the wealth today, W_1 the wealth after the gamble, and R the return on playing the game. Then
\[
W_1 = W_0(1+R), \qquad \log W_1 = \log W_0 + \log(1+R).
\]
We choose k, the fraction of wealth invested, to maximize the expected log wealth:
\[
\max_k\ E[\log W_1] = \log W_0 + p\log(1+kM) + (1-p)\log(1-kL).
\]
The first-order condition is
\[
\frac{pM}{1+kM} - \frac{(1-p)L}{1-kL} = 0,
\]
i.e.,
\[
pM(1-kL) = (1-p)L(1+kM).
\]
Solving for k yields (3.57). Now consider T periods. Assuming iid (independent and identically distributed) payoffs, the terminal wealth is
\[
\log W_T = \log W_0 + [p\log(1+k_1 M) + (1-p)\log(1-k_1 L)]
+ [p\log(1+k_2 M) + (1-p)\log(1-k_2 L)]
+ \cdots
+ [p\log(1+k_T M) + (1-p)\log(1-k_T L)].
\]
Therefore, maximizing the expected log terminal wealth is the same as maximizing the expected log wealth in each period! Hence the solution is the same. Q.E.D.
3.8.2 Higher moments
The mean-variance portfolio theory applies theoretically if the stock returns are normally distributed or if investors care only about the mean and variance of returns. In the real world, the returns are not normally distributed, as was noted at least as early as Kendall and Hill (1953) and Mandelbrot (1963). Clearly there is no reason why other moments should not matter in the utility function of investors. The mean-variance assumption is made for simplicity and tractability.
Beyond the first two moments, the following four-moment utility is often used,
\[
U = \mu - \frac{\gamma}{2}\sigma^2 + \gamma_3\,\frac{\sigma^3\,\text{Skew}}{6} - \gamma_4\,\frac{\sigma^4(\text{Kurt}-3)}{720}, \qquad (3.59)
\]
where the γ's are parameters. It is generally true that, everything else equal, investors prefer positive skewness and dislike kurtosis.
However, it is in general very difficult to solve the portfolio optimization problem with the above utility. Samuelson (1970) and Arditti (1971) are the early studies. Jurczenko and Maillet (2006) have an excellent collection of articles. Jiang et al. (2020) examine the empirical evidence on asymmetry, including skewness. Mehlawat, Gupta, and Khan (2021) provide references on some of the latest advances.
3.8.3 Other utilities
The mean-variance utility function defined over the expected return and variance, equation (2.44), is equivalent to the following quadratic utility function defined over wealth,
\[
U(W_{t+1}) = aW_{t+1} - bW_{t+1}^2, \qquad (3.60)
\]
where W_{t+1} = (1+R_{p,t+1})W_t and W_t is the initial wealth at time t. While the quadratic utility is popular, there are two others that are simple and popular too:
• Exponential utility:
\[
U(W_{t+1}) = -\exp(-\theta W_{t+1}), \qquad (3.61)
\]
• Power utility:
\[
U(W_{t+1}) = \frac{W_{t+1}^{1-\gamma}}{1-\gamma}, \qquad (3.62)
\]
a special case of which is the log utility: U(W_{t+1}) = log(W_{t+1}). Additional utility functions may be found in asset pricing books such as Huang and Litzenberger (1988) and Cochrane (2001). It is clear that the portfolio decisions will in general differ under different utility functions. However, as the quadratic utility is a second-order approximation of smooth utilities, the differences may not be dramatic.
A fundamental limitation of the portfolio choice problem studied thus far is its short-term nature, i.e., the decision is over one period. Clearly, in practice, investors care about long-term well-being. For example, an investor might want to maximize the terminal wealth T periods from today,
\[
W_{t+T} = (1+R_{p,t+1})(1+R_{p,t+2})\cdots(1+R_{p,t+T})W_t. \qquad (3.63)
\]
Samuelson (1969) and Merton (1969, 1971) show that the portfolio choice will be myopic (the same as the one-period decision) for power utility with iid returns, or for log utility with arbitrary returns. However, recent studies, as summarized by Campbell and Viceira (2003), show that investors' long-term portfolio choice should vary with changing economic conditions and changing labor income (which was not modeled previously). In particular, while cash is safer for short-term investors, it is not so for long-term ones, as long-term investors have to re-invest the cash at uncertain interest rates. People whose labor income is fairly uncorrelated with the stock market should invest more in equities when they are young. See Campbell and Viceira (2003) for more discussion on the intertemporal decisions of a long-term investor.
Other than maximizing utility functions, one can also use alternative criteria for the optimality of portfolio choice. Some of these criteria are reviewed by Meucci (2005, Ch. 5).
4 Simulation, Bootstrap and Shrinkage
In this section, we study how to draw random samples from multivariate distributions used to model multiple stock returns. We also discuss the bootstrap, which resamples from the data to obtain standard errors or test sizes that are typically more accurate than those from asymptotic theory. Then we discuss shrinkage estimation of the means and covariances of asset returns.
4.1 Sampling from distributions
In investment analysis and derivatives valuation, simulation is very important. To make it easy to understand, we will show how to draw random samples from univariate, bivariate, and multivariate distributions.
4.1.1 Univariate case
To start, consider the simplest common question of how to simulate a normally-distributed monthly
return on a stock with 12% annual return and 20% annual volatility.
Note that almost any computer programming language has a function to generate standard normal variables. In Python, the following code

import numpy as np
x = np.random.randn(m, n)    # generate random variables from N(0,1)

generates an m × n matrix of independent samples from the standard normal. To get only one sample, we simply change (m,n) to (1,1).
Statistically, assume that ε follows the standard normal distribution with mean zero and variance one,
\[
\epsilon \sim N(0,1), \qquad (4.1)
\]
which is the one any computer can simulate. Then
\[
R = \frac{12\%}{12} + \sigma\epsilon \sim N(1\%, \sigma^2), \qquad \sigma = \frac{20\%}{\sqrt{12}}, \qquad (4.2)
\]
has the desired distribution. In other words, we have to make a linear transformation of the standard normal, adding the mean and scaling by the standard deviation, to obtain the return with the desired mean and variance. In terms of Python code, we have
desired mean and variance. In terms of Python code, we have
1
2 e1 = np.random.randn (1,1) # Generate a random from N(0,1)
3 R = 0.12/12 + (0.2/np.sqrt (12))*e1
Then the R will be what we need. In applications we may need to simulate many such returns. We
can either simulate e1 as a vector to begin with, or use a loop and simulate e1 one at a time.
Suppose that you have generated 10 returns by modifying the above code. The next time you run the program, you will get another set of 10 returns. Often you want to get the same 10 returns, or someone else wants to re-check your results with the same returns. How do you do that? You add a seed function to the code:

np.random.seed(1234)                     # set the random numbers the same each run
e1 = np.random.randn(10, 1)              # generate 10 random variables from N(0,1)
R = 0.12/12*np.ones((10, 1)) + (0.2/np.sqrt(12))*e1   # now R is 10 by 1
What the seed function does is allow you to get the random numbers from the same starting point, determined by the input 1234, via the built-in seed function. The reason we can do this is that the random numbers provided by computers are pseudo-random: they behave almost like purely random numbers, but are generated deterministically. You can imagine that all these numbers lie on a gigantic circle and the computer picks them up sequentially one by one (though they look like random outcomes). The seed function simply chooses a starting point on the circle, which is arbitrary but fixed.
4.1.2 Bivariate case
In practice, generating two stock returns with a given covariance structure is very common and
important. If one simulates the stock returns separately, the returns will be independent! This
is not what we want, as stock returns are often correlated in the real world.
Consider first the simplified problem of generating two standardized random variables (zero
means and unit standard deviations) with a given correlation ρ. We will generate two random
numbers first (we know they are independent), and then get a new set of two random numbers
which are correlated at level ρ.
Assume ε1 and ε2 follow the standard normal distribution with mean zero and variance one,

εi ∼ N(0, 1), i = 1, 2. (4.3)

Then the following linear transformation of them,

x1 = ε1, (4.4)
x2 = ρ ε1 + √(1 − ρ²) ε2, (4.5)

will be a bivariate normal with mean zero, standard deviation 1, and correlation ρ,

x = (x1, x2)′ ∼ N( (0, 0)′, [ 1 ρ; ρ 1 ] ), (4.6)

where [ 1 ρ; ρ 1 ] denotes the 2 × 2 matrix with rows separated by semicolons.
This is easy to verify. For example,

E(x1x2) = E[ε1(ρ ε1 + √(1 − ρ²) ε2)] = ρ, (4.7)

or, more elegantly by matrix algebra, we compute the covariance matrix of the x's,

Var(x) = E(xx′) = [ 1 0; ρ √(1 − ρ²) ] E(εε′) [ 1 ρ; 0 √(1 − ρ²) ] (4.8)
= [ 1 0; ρ √(1 − ρ²) ] [ 1 ρ; 0 √(1 − ρ²) ] = [ 1 ρ; ρ 1 ], (4.9)

which says x has the covariance matrix expression as in (4.6).
To get a bivariate normal random variable with arbitrary means and variances, we simply shift
and scale x,

y1 = µ1 + σ1 ε1, (4.10)
y2 = µ2 + σ2[ρ ε1 + √(1 − ρ²) ε2], (4.11)

then y will be bivariate normal with means µ1 and µ2 and covariance matrix

Var(y) = [ σ1² ρσ1σ2; ρσ1σ2 σ2² ]. (4.12)
In Python, instead of doing the above from scratch, we can use a ready-made function to simulate
from the multivariate normal distribution directly,

y = np.random.multivariate_normal(mean, cov, (m,))
will generate an m × 2 matrix, with independent rows and each row follows N(mean, cov). The
program does the earlier transformation for us. However, it is useful to understand the transformation,
as it is generally applicable for altering the covariance structure of any vector with an
arbitrary distribution.
4.1.3 Cholesky decomposition
In general, we can make a similar, but more complex, transformation to simulate a random sample
from an arbitrary n-dimensional normal distribution. Of course, for the multivariate normal, we can
use the above Python code for any n without carrying out the Cholesky decomposition ourselves.
But understanding the Cholesky decomposition is generally useful.
Let

µ ≡ E[y] = (µ1, µ2, . . . , µn)′,

V ≡ Var(y) = [ var(y1) cov(y1, y2) . . . cov(y1, yn); cov(y2, y1) var(y2) . . . cov(y2, yn); . . . ; cov(yn, y1) cov(yn, y2) . . . var(yn) ]. (4.13)
Our objective is to get a random sample from a multivariate normal distribution with the above
mean and covariance matrix.
As before, we first generate n standard normal variables, ε = (ε1, . . . , εn)′. Then we transform
them to get a new n-vector with the desired distribution.

The key is to obtain the Cholesky decomposition of the covariance matrix. Mathematically, there
exists a lower-triangular matrix L such that

LL′ = V, (4.14)

which is known as the Cholesky decomposition or Cholesky factorization of the covariance matrix V.
For example, when n = 2, and if

V = [ 1 ρ; ρ 1 ],

then

LL′ = V, L = [ 1 0; ρ √(1 − ρ²) ],

because

[ 1 0; ρ √(1 − ρ²) ] × [ 1 0; ρ √(1 − ρ²) ]′ = [ 1 0; ρ √(1 − ρ²) ] [ 1 ρ; 0 √(1 − ρ²) ] = [ 1 ρ; ρ 1 ],

which is the bivariate case we studied earlier.
When n > 2, there is no simple closed-form formula for L. But in practice, many software packages
compute L, which is a lower-triangular matrix with positive diagonal elements because, to be the
covariance matrix of non-singular random variables, V must be a positive definite matrix.
So we can always make the following transformation,

y = µ + Lε. (4.15)

Mathematically, it can be verified that

Var(y) = E[(y − µ)(y − µ)′] = L E[εε′] L′ = LL′ = V,

i.e., the covariance matrix of y is indeed V. (Recall that in the univariate case it is σ × σ = σ², so
the Cholesky decomposition works like taking a square root of the variance.)
The y defined in (4.15) is the sample we need. This procedure works not only for normal
distributions, but also for a general distribution, to get the desired covariance matrix. However, only
the normal and, more generally, the elliptical distributions have the property that their linear transformations
remain in the same class of distributions. A counterexample is the lognormal: a linear combination
of two lognormal variables will no longer be lognormal.
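A minimal numpy sketch of the transformation (4.15); the mean vector and covariance matrix below are assumed for illustration:

```python
import numpy as np

np.random.seed(0)
mu = np.array([0.01, 0.02, 0.015])          # assumed means
V = np.array([[0.04, 0.01, 0.00],
              [0.01, 0.09, 0.02],
              [0.00, 0.02, 0.16]])          # assumed covariance matrix

L = np.linalg.cholesky(V)                   # lower-triangular L with LL' = V
eps = np.random.randn(3, 200000)            # standard normal draws
y = mu[:, None] + L @ eps                   # eq. (4.15), applied column by column

print(np.cov(y))                            # close to V
```

The sample covariance of the simulated y recovers V, confirming that the single matrix multiplication by L is all that is needed.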
4.1.4 Singular value decomposition
Singular value decomposition (SVD), which is widely used, is a general decomposition applying to any m × n
matrix M,

M = UDV, (4.16)

where D is an r × r diagonal matrix for some r > 0, and U and V are m × r and r × n matrices
such that U′U = V V′ = Ir, with Ir the identity matrix. Note that SVD also applies to
complex matrices, but we consider only real matrices (matrices of real numbers) here.
In particular, when M is a covariance matrix, the SVD becomes the eigenvalue or spectral decomposition.
Based on (6.27), we can define the square root of a covariance matrix formally,

Σ^(1/2) = [A1, . . . , An] diag(√λ1, √λ2, . . . , √λn) [A1, . . . , An]′ = A√Λ A′, (4.17)

where the Ai are the eigenvectors and the λi the eigenvalues of Σ. Then it follows that

Σ^(1/2) Σ^(1/2) = A√Λ A′A√Λ A′ = A√Λ √Λ A′ = Σ,

since A′A = In.
Hence, the Cholesky decomposition is not the only way to generate a given covariance matrix,
because Σ^(1/2) can do the same. However, the Cholesky decomposition is computationally the
most efficient, as it requires much less time than computing Σ^(1/2).
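Under the same conventions, Σ^(1/2) can be sketched via the eigenvalue decomposition, using numpy's eigh routine for symmetric matrices (the 2 × 2 matrix below is an assumed example):

```python
import numpy as np

Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])                  # an assumed covariance matrix

lam, A = np.linalg.eigh(Sigma)                  # eigenvalues lam, orthogonal A
Sigma_half = A @ np.diag(np.sqrt(lam)) @ A.T    # eq. (4.17)

print(Sigma_half @ Sigma_half)                  # recovers Sigma
```

Squaring Sigma_half reproduces Sigma, which is exactly the property used above.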
4.2 Monte Carlo integration
Simulation has a number of applications in finance. Here we illustrate with two examples: one
on estimating risk, and the other on option valuation.
4.2.1 Theory
Monte Carlo integration is a simulation approach to computing an expected value,

µ = E[f(x̃)] = ∫ f(x)g(x) dx, (4.18)

where x̃ is a random variable with density function g(x). The integral is the expected value of
f(x̃), a general function of the random variable x̃.
For example, in option pricing, g(x) may be the lognormal density of the stock price at expiration,
and f(x̃) the present value of the payoff of a European option. The option price is then
the expected value, which requires evaluating the integral under the risk-neutral distribution
of the terminal stock price. The famous Black-Scholes formula is an outcome of this integral.
The advantage of using Monte Carlo simulation is that it can compute the option value with
non-standard payoff functions or options that depend on high-dimensional random variables. In
these cases, while analytical valuations are often impossible, the Monte Carlo approach is as easy as
in the simple case. However, this is true only for European options. With American options, due
to early exercise, the Monte Carlo method has to be adapted and can become rather complex.
The Monte Carlo method simply uses the average value of the function at simulated samples to
approximate the true expected value,

µˆ = [f(x1) + f(x2) + · · · + f(xn)]/n, (4.19)

where n is the number of simulated samples, and x1, x2, . . . , xn are independent random samples of x̃
from its distribution with density g(x).
By the Law of Large Numbers, µˆ must converge to µ as n grows large. In practice,
n = 10,000 is good for many applications.
What is the error? Let Sn denote the numerator of the right-hand side of (4.19); the central
limit theorem says that the standardized Sn converges to a standard normal distribution,

(Sn − nµ)/(σ√n) =⇒ N(0, 1), σ² ≡ var[f(x̃)], (4.20)

i.e., σ is the standard deviation of the random function f(x̃). The above equation implies that

µˆ = µ + (σ/√n) z + o(1/√n), (4.21)

where z is a standard normal random variable and o(1/√n) denotes a term of higher order than 1/√n,
i.e., a term that converges to 0 after dividing by 1/√n. In other words, the error of the Monte
Carlo integration is random, but its magnitude in terms of standard deviation is roughly σ/√n.
So, roughly speaking, the error is

MC Error = µˆ − µ ≈ Problem Difficulty / √(Simulation Size).

This makes intuitive sense. The greater the variance of the random function, the more difficult it is to
find its expected value. Given the difficulty level σ, the error converges to zero at a rate of 1/√n.
Suppose n = 10, 000, then the error is typically 1% of σ.
Since σ is unknown, it has to be estimated too. With the same simulated samples, it can be
estimated by

σˆ = √[ (1/n) Σ_{i=1}^{n} f(xi)² − µˆ² ]. (4.22)

Then the Monte Carlo error is estimated by σˆ/√n, and we can construct an approximate 95%
confidence interval,

[ µˆ − 1.96 σˆ/√n, µˆ + 1.96 σˆ/√n ],

for the true but unknown µ = E[f(x̃)].
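As a toy illustration of (4.19)–(4.22), assume f(x) = x² with x̃ standard normal, so the true value is E[x̃²] = 1; the choice of f, n, and the seed is an assumption for illustration only:

```python
import numpy as np

np.random.seed(42)
n = 10000
x = np.random.randn(n)        # samples from g(x), the standard normal
f = x**2                      # f(x) = x^2, so mu = E[x^2] = 1

mu_hat = f.mean()                                 # eq. (4.19)
sigma_hat = np.sqrt((f**2).mean() - mu_hat**2)    # eq. (4.22)
se = sigma_hat / np.sqrt(n)                       # Monte Carlo standard error

ci = (mu_hat - 1.96*se, mu_hat + 1.96*se)         # approximate 95% interval
print(mu_hat, se, ci)
```

Here σ = √2 ≈ 1.41, so with n = 10,000 the standard error is about 0.014, consistent with the 1%-of-σ rule of thumb above.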
4.2.2 VaR
Consider first how to compute the VaR (value-at-risk) of a portfolio. VaR provides a single number,
a measure of the total risk of a portfolio of various financial assets. It answers the question: at
what loss level are we X% confident that the loss will not be exceeded in N business days? In
practice, people often compute VaR with X = 99 and N = 10.

Mathematically, the portfolio is a function of random variables, and we need to compute a cutoff
point in the distribution of the value of the portfolio such that there is a 99% probability that
the value of the portfolio is greater than this cutoff.
Suppose we have a portfolio of three stocks whose returns follow normal distributions. Let µ and Σ be
the expected returns and covariance matrix, and w the portfolio weights. Assume µ and Σ are
annualized; then we first need to compute the 10-day expected returns and covariance matrix,
which are

µ10day = (10/252) × µ, Σ10day = (10/252) × Σ.
Then the portfolio return over the 10 days is

Rp = w1r1 + w2r2 + w3r3, (r1, r2, r3)′ ∼ N(µ10day, Σ10day).

We can then generate hundreds of thousands of random returns from this distribution. The worst
1% return cutoff is the VaR. The in-class exercise and its Python code show all the details.
Indeed, with 99% confidence, we should not lose more than that amount. Note that the procedure
applies to any distribution as long as we can generate samples from it. If the portfolio
has options or derivatives, we first generate the underlying risk variables/shocks, and then compute
the returns.
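The procedure can be sketched as follows; all parameter values (means, covariances, weights) are assumed for illustration:

```python
import numpy as np

np.random.seed(1234)
mu = np.array([0.08, 0.10, 0.12])            # assumed annualized expected returns
Sigma = np.array([[0.04, 0.01, 0.01],
                  [0.01, 0.09, 0.02],
                  [0.01, 0.02, 0.16]])       # assumed annualized covariance matrix
w = np.array([0.4, 0.3, 0.3])                # portfolio weights

mu10 = (10/252)*mu                           # 10-day mean
Sig10 = (10/252)*Sigma                       # 10-day covariance

R = np.random.multivariate_normal(mu10, Sig10, 100000)  # simulated 10-day returns
Rp = R @ w                                   # portfolio returns
VaR99 = -np.percentile(Rp, 1)                # worst 1% cutoff, as a positive loss
print(VaR99)
```

The 1st percentile of the simulated portfolio returns is the return cutoff; its negative is the 10-day 99% VaR as a positive loss number.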
4.2.3 Option pricing
Simulation, or Monte Carlo simulation as it is often called, can be easily applied to value all European
options. It can also be used to value American options, but the procedure is very complex.
Consider, for example, the valuation of a standard call option on a non-dividend-paying stock
with parameters
S = 50, X = 50, T = 0.25, r = 10%, σ = 30%,
i.e., the current price is 50, the riskfree rate is 10% (continuously compounded), the volatility is 30% (of the
continuous stock return), the strike price is 50, and the expiration is 3 months. The call price is easy
to compute from the Black-Scholes formula:
C = S N(d1) − X e^(−rT) N(d2), (4.23)

where

d1 = [ln(S/X) + (r + σ²/2)T] / √(σ²T), d2 = d1 − √(σ²T), (4.24)
and N(d) is the standard normal cumulative distribution function. Indeed, it is straightforward to code the formula in
Python:
import numpy as np
import scipy.stats as si

# define a function to compute the standard call with no dividend
def BS_call(S, X, T, r, sigma):
    # S: spot price; X: strike; T: time to maturity; r: riskfree rate; sigma: vol
    d1 = (np.log(S / X) + (r + 0.5 * sigma ** 2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - np.sqrt((sigma ** 2) * T)
    N1 = si.norm.cdf(d1, 0.0, 1.0)
    N2 = si.norm.cdf(d2, 0.0, 1.0)
    call = S * N1 - X * np.exp(-r * T) * N2
    return call
Then a value of 3.6104 is obtained.
Alternatively, one can easily compute the price by Monte Carlo. Recall from option pricing that,
to get the Black-Scholes formula, the stock price is assumed to follow a geometric Brownian motion,
i.e., the stock price is lognormally distributed and the returns over disjoint intervals are independently
normally distributed, and in particular,

ln(ST/S) ∼ N[µT, σ²T], (4.25)

where µ = r − σ²/2 is the risk-neutral expected return. That is,

ST = S e^(µT + σ√T z̃), (4.26)

where z̃ follows the standard normal distribution. Hence, we can draw M = 10,000, say,
random numbers z1, z2, . . . , zM from the normal, and then compute M random terminal prices

ST^m = S e^(µT + σ√T zm), m = 1, 2, . . . , M. (4.27)
At each terminal price, the call is clearly worth the present value of the payoff,

cm = e^(−rT) (ST^m − X)+. (4.28)

[Recall that (S − X)+ is defined as S − X if S > X and 0 otherwise, the payoff of the option.] Then,
the average value over all the simulated prices is

c = (c1 + c2 + · · · + c10,000)/10,000, (4.29)
which is the Monte Carlo approximation of the true call price. The greater M is, the greater
the accuracy. Note that, numerically, you can drop the discounting term e^(−rT) in (4.28) and
discount once at the end in (4.29), which saves some computational time. But this may be
confusing to an inexperienced programmer, and one may simply use the above formulas.
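The Monte Carlo computation of the same call can be sketched as follows; M and the seed are arbitrary choices:

```python
import numpy as np

np.random.seed(123)
S, X, T, r, sigma = 50, 50, 0.25, 0.10, 0.30
M = 200000

mu = r - 0.5*sigma**2                     # risk-neutral drift, eq. (4.25)
z = np.random.randn(M)
ST = S*np.exp(mu*T + sigma*np.sqrt(T)*z)  # eq. (4.26): terminal prices
c = np.exp(-r*T)*np.maximum(ST - X, 0)    # eq. (4.28): discounted payoffs

print(c.mean())                           # close to the Black-Scholes value 3.6104
```

With a large M, the simulated average agrees with the Black-Scholes price to within the Monte Carlo error of roughly σ_payoff/√M.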
What is the advantage of Monte Carlo? It is generally applicable, while formulas may not be
available for many non-standard options. As a theoretical example, consider an option with a payoff
of

f(ST) = (ST^π − X)+ (4.30)

at maturity. The payoff is similar to the standard option, but differs in that now the stock price
is raised to the power of the irrational number π. In this case, there is no formula, as the usual
technique for obtaining the Black-Scholes formula fails. However, the Monte Carlo method is as easy as above:
simply replace the payoff function with the new one in computing the present value cm.
In the previous examples, the option is a function of the terminal price alone and hence drawing
the stock price at time T is sufficient. However, many options have complex payoff functions that
require drawing the price path over time. An example is the lookback call option that allows the
holder to exercise with the proceeds equal to the difference between the highest price during the
option’s life and the strike price.
Let C be the call price of the lookback. The option is not a standard option and cannot
be evaluated by the Black-Scholes formula. A complex analytical formula does exist, but
simulation is much easier to compute. If one defines the payoff as the maximum price minus its mean,
then no formula is available at all, yet the simulation approach is still easy, with only a little added complexity.
We now need to draw a path of the stock prices.
Making the same geometric Brownian motion (log-normal) assumption on the stock price, the
stock price is log-normally distributed conditional on its past:

ln(St+1) ∼ N[ln(St) + µ(1/12), σ²(1/12)].

So,

ln(S1) ∼ N[ln(S0) + µ(1/12), σ²(1/12)],
ln(S2) ∼ N[ln(S1) + µ(1/12), σ²(1/12)],
etc., and

ln(S12) ∼ N[ln(S11) + µ(1/12), σ²(1/12)],

where S0 = S is the current stock price.

That is, we now draw a stock price path: next month, S1 (draw y1 from N[ln(S0) + µ(1/12), σ²(1/12)]
and solve ln S1 = y1 to get S1 = e^(y1)); the second month, S2; and so on. We now have the simulated
stock prices from the 1st month to the end of the year: S1, . . . , S12. Let S*1 be the maximum of these
prices; then the payoff of the call option is

c1 = S*1 − X,

if S*1 > X, and zero otherwise.
Next, we can repeat the above process and get another path of the stock price, and get S∗2 and
c2. Continuing in this way 10 times, we get c1, c2, . . . , c10.
Recall that the call price should be equal to its expected payoff discounted back to today:
C = e−rT × (Expected payoff).
By the Law of Large Numbers, the expected payoff is the probability limit of the average of the
payoffs of the simulations, and so

c = e^(−rT) [ (c1 + c2 + · · · + c10)/10 ]

should be an approximation of the true price, computed over the 12 monthly intervals.
Here we have simulated the path of the stock price at monthly intervals. In practice, to achieve
high accuracy, we may have to simulate the path at much smaller intervals, say daily or hourly. In
addition, we would not do just 10 simulations as done here. Usually, 10,000 simulations
give quite accurate results.
Formally, we can obtain the price path at interval 1/n (n = 12 in the above) by simulating n
prices one after another from:

ln(St+1) ∼ N[ln(St) + µ(1/n), σ²(1/n)], t = 0, 1, 2, . . . , (n − 1).
Based on these prices, we can evaluate the payoff of the option, ci (in the i-th simulation). Suppose
we do m simulations in total (m = 10 in the above); then we get m payoffs, c1, c2, . . . , cm, and the
call price is given by:

Cm = e^(−rT) (c1 + c2 + · · · + cm)/m. (4.31)
When n and m are large enough, the call price computed above will converge to the true
theoretical price under the standard diffusion (geometric Brownian motion, or log-normal
distribution) assumption. Theoretically, the rate of convergence is of order (1/√m + 1/n).
The above examples are options on a single stock. The same approach applies to options on
multiple stocks or portfolios. In that case, the only difference is that we draw random samples from the
multivariate distribution, based on the Cholesky decomposition.
The Monte Carlo simulation approach relies on the risk-neutral valuation principle. It applies
to virtually any European option. With some complex modifications, it can be applied to American
options. Overall, its simplicity and generality make it appealing to a great number of practitioners.
Glasserman (2004), for example, provides more theory, such as various related methods with
accelerated convergence, and Hilpisch (2015) has extensive examples of Python implementations.
4.3 Bootstrap
The bootstrap method, introduced by Efron (1979), is a computation-intensive method for estimating
the distribution of an estimator or a test statistic by resampling the data at hand. It treats
the data as if they were the population. Under mild regularity conditions, the bootstrap method
generally yields an approximation to the sampling distribution of an estimator or test statistic that
is at least as accurate as the approximation obtained from traditional first-order asymptotic theory
(see, e.g., Horowitz (1997)).
4.3.1 Estimating standard error
The idea of bootstrap is very simple. Consider the problem of estimating the standard error of an
estimator. Suppose we have iid excess return data,
x1, . . . , xT ,
and compute the Sharpe ratio

κˆ = x̄/s,

where x̄ is the sample mean and s is the sample standard deviation,

s² = (1/(T − 1)) Σ_{t=1}^{T} (xt − x̄)².
But how accurate is it? Even if the data are normally distributed, the standard error of κˆ is not easy
to derive.
However, it can be easily estimated via a bootstrap:

1. Draw T data points, (x*1, . . . , x*T), randomly from the original data set with replacement;

2. Compute the Sharpe ratio for the data drawn, κˆ*, and save the result as yj = κˆ*;

3. Repeat the above B times, say, B = 10,000, to obtain all the yj's (j = 1, 2, . . . , B);

4. Compute the standard deviation of the yj's.
The standard deviation is the bootstrap approximation of the standard error of κˆ.
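The four steps can be sketched as follows; the simulated "data" and the value of B are illustrative assumptions:

```python
import numpy as np

np.random.seed(0)
x = 0.01 + 0.05*np.random.randn(120)     # assumed iid excess returns, T = 120

def sharpe(a):
    return a.mean() / a.std(ddof=1)

B = 10000
T = len(x)
y = np.empty(B)
for j in range(B):
    xs = np.random.choice(x, size=T, replace=True)  # step 1: resample
    y[j] = sharpe(xs)                               # step 2: Sharpe ratio
se = y.std(ddof=1)                                  # step 4: std of the y's
print(sharpe(x), se)
```

The printed se is the bootstrap estimate of the standard error of κˆ, obtained without any distributional derivation.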
Why does it work? Statistically, the variance of κˆ is defined as the integral of the squared
difference from the true Sharpe ratio,

var(κˆ) = ∫ (κˆ − µ/σ)² dF(x1, . . . , xT) ≈ ∫ (κˆ − µ/σ)² dF*(x1, . . . , xT), (4.32)

where F(x1, . . . , xT) is the true and generally unknown distribution of the data, and F*(x1, . . . , xT)
is the empirical distribution that assigns equal probability to each data point, defined by

F*(x = xt) = 1/T, t = 1, . . . , T. (4.33)
In other words, we obtain the approximation in (4.32) by using the so-called bootstrap plug-in principle:
replacing the unknown distribution by the empirical one, we can then evaluate any statistic based on
the latter.

To compute (4.32), we use Monte Carlo simulation to draw, say, B = 10,000 samples from
F*(x1, . . . , xT); this is exactly what the re-sampling with replacement does! Then the variance is
approximated by simulation,

var(κˆ) ≈ (1/(B − 1)) Σ_{j=1}^{B} (κˆ*j − µ/σ)².
Replacing the unknown µ/σ by its bootstrap average, we obtain the result of Step 4.
The above procedure is also often applied to bias correction. The idea is that a statistic, such
as the standard deviation, can be computed based on the data or based on the bootstrapped data.
Under certain regularity conditions, the bootstrapped estimator will be better, and the difference between
the two is called the bias correction.
However, it should be pointed out that if the iid assumption is violated, or even under iid
if the skewness or kurtosis is high, there is no guarantee that the bootstrapped statistic will
always be better. Without the iid assumption, a block bootstrap (see, e.g., Shao and Tu, 1995,
Chapter 9) should be used to capture serial correlations. Indeed, while the iid assumption is not a bad
one for many asset returns, it is unlikely to be always true for the returns on a trading
strategy or a managed fund.
4.3.2 Estimating confidence interval
Now let us keep the iid assumption, but relax normality. Then the usual confidence interval (see
Section 1.3) is questionable. In this case, the bootstrap can be used to find a confidence interval
which can be more accurate in small samples. The procedure has four easy steps:

1. Draw T data points, (x*1, . . . , x*T), randomly from the original data set with replacement;

2. Compute the sample mean for the data drawn, xˆ*, and save the result yj = xˆ*;

3. Repeat the above B times, say, B = 10,000, to obtain all the yj's (j = 1, 2, . . . , B);

4. Compute the 2.5% and 97.5% percentiles, δ0.025 and δ0.975, of the yj's.

The result, [δ0.025, δ0.975], is our estimate of the 95% confidence interval (e.g., if B = 1000, δ0.025 is
the 25th value and δ0.975 is the 975th value after the yj's are sorted from lowest to highest;
if B = 10000, the 250th and 9750th).
The above is known as the bootstrap percentile approach. Interestingly, it does not use the sample
mean xˆ at all. This approach may be used to approximate a confidence interval for any statistic.
However, as argued by Rice (2007, p. 285), it will be much more accurate to use
the centered bootstrap:
1. Draw T data points, (x*1, . . . , x*T), randomly from the original data set with replacement;

2. Compute the mean for the data drawn, xˆ*, and save the result yj = xˆ* − xˆ;

3. Repeat the above B times, say, B = 10,000, to obtain all the yj's (j = 1, 2, . . . , B);

4. Compute the 2.5% and 97.5% percentiles, η0.025 and η0.975, of the yj's.
The result,

[xˆ − η0.975, xˆ − η0.025], (4.34)

is our estimate of the 95% confidence interval. Although this looks quite different from the previous
one, as pointed out by Rice, the two methods are equivalent if the bootstrap distribution is symmetric.

To understand the expression (4.34), note that the centered bootstrap mathematically uses
the distribution of xˆ* − xˆ to approximate that of xˆ − µ, because xˆ is the true mean of the bootstrapped
data, and µ is the true mean of the data. Hence,

0.95 = Prob(η0.025 < xˆ* − xˆ < η0.975)
≈ Prob(η0.025 < xˆ − µ < η0.975)
= Prob(η0.025 − xˆ < −µ < η0.975 − xˆ)
= Prob(xˆ − η0.975 < µ < xˆ − η0.025),

which is exactly (4.34), the 95% probability interval covering µ.
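A sketch of the centered bootstrap interval (4.34); the data below are simulated as an assumption for illustration:

```python
import numpy as np

np.random.seed(1)
x = 0.01 + 0.05*np.random.randn(200)     # assumed iid sample, T = 200
xbar = x.mean()

B, T = 10000, len(x)
y = np.empty(B)
for j in range(B):
    xs = np.random.choice(x, size=T, replace=True)
    y[j] = xs.mean() - xbar              # centered statistic

eta_lo, eta_hi = np.percentile(y, [2.5, 97.5])
ci = (xbar - eta_hi, xbar - eta_lo)      # eq. (4.34)
print(ci)
```

Note the reversal: the upper percentile of the centered statistic determines the lower endpoint of the interval, exactly as derived above.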
The analysis above shows that, if the distributions of xˆ* − xˆ and xˆ − µ are not close, then the
bootstrap can be inaccurate. However, even if they are not, the distributions of the standardized
versions, (xˆ* − xˆ)/σˆ* and (xˆ − µ)/σ, could be, where σ and σˆ* are the corresponding standard deviations.
Suppose σˆ* is an estimate of the standard deviation from a bootstrap as we did before. Then we can do
another, studentized bootstrap:

1. Draw T data points, (x*1, . . . , x*T), randomly from the original data set with replacement;

2. Compute the mean for the data drawn, xˆ*, and save yj = (xˆ* − xˆ)/σˆ*;

3. Repeat the above B times, say, B = 10,000, to obtain all the yj's (j = 1, 2, . . . , B);

4. Compute the 2.5% and 97.5% percentiles, τ0.025 and τ0.975, of the yj's.
The 95% confidence interval is

[xˆ − σˆ*τ0.975, xˆ − σˆ*τ0.025]. (4.35)

Note that the studentized bootstrap is so named because (xˆ* − xˆ)/σˆ* approximates a t-distribution.
It is also computationally more demanding, as it requires a bootstrap to compute σˆ* first. However,
today the computational time is rarely a concern, and the greater accuracy makes the
studentized bootstrap more useful in practice.
For more discussions on bootstrap, see Efron (1979), Shao and Tu (1995), Horowitz (1997) and
Rice (2007). For applying the bootstrap to test the CAPM, see Chou and Zhou (2006).
4.3.3 Bootstrapping portfolio weights
Similar to applying the bootstrap to estimating a standard deviation, we can also apply it to estimating
any function of the parameters, the optimal portfolio weights in particular.

Consider the case in which we have no constraints and the riskfree asset is available; then the
optimal portfolio formula is (see (2.45)),

w* = (1/γ) Σ^(−1) µ. (4.36)

When data are available, we can compute the sample mean and sample covariance matrix (assumed
invertible here) to get the plug-in rule (see (3.11)),

wˆ = (1/γ) Σˆ^(−1) µˆ. (4.37)
As we discussed before, this rule often does not do well in practice due to estimation errors.
Since the bootstrap can improve small-sample performance, it is natural to bootstrap the data to
get a bootstrapped rule. Let χ = (x1, · · · , xT) be the original data set. We can re-sample with
replacement to get another set, χ(1) = (x1^(1), · · · , xT^(1)). With this data set, we get another plug-in
rule, wˆ(1). Continuing this n times, we obtain

wˆboot = [wˆ(1) + wˆ(2) + · · · + wˆ(n)]/n, (4.38)
which is known as the bootstrapped portfolio investment rule.
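A sketch of (4.38); the simulated return history, γ, and the number of resamples are all assumptions for illustration:

```python
import numpy as np

np.random.seed(2)
T, N, gamma = 240, 3, 3.0
mu_true = np.array([0.010, 0.012, 0.008])
V_true = np.diag([0.04, 0.05, 0.03])
X = np.random.multivariate_normal(mu_true, V_true, T)   # assumed return history

def plug_in(data):
    mu = data.mean(axis=0)
    V = np.cov(data, rowvar=False)
    return np.linalg.solve(V, mu) / gamma               # eq. (4.37)

n_boot = 500
w = np.zeros(N)
for _ in range(n_boot):
    idx = np.random.randint(0, T, size=T)               # resample rows with replacement
    w += plug_in(X[idx])
w_boot = w / n_boot                                     # eq. (4.38)
print(plug_in(X), w_boot)
```

In a real application X would be an actual return history; the printed pair compares the plug-in weights with their bootstrapped average.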
While it is unclear whether they were the first, Michaud and Michaud (2008) apply the bootstrap
to obtain a resampled efficient frontier, which is simply the average of the frontiers from the re-sampled
data. They filed a U.S. patent for it.

Note that the plug-in rule is usually the worst performer in many applications, so beating it
does not say much. Theoretically, there is no reason that the bootstrapped rule should outperform
the other rules we examined earlier. Samples of typical applications will be given in the Python
example.
4.4 Shrinkage estimation
The mean and covariance matrix, µ and V, of asset returns Rt are fundamental to many financial
decisions, such as asset allocation or the computation of VaR. In this subsection, we discuss the properties
of the sample average estimators, which are also known as moment estimators or maximum
likelihood estimators. Then we discuss shrinkage estimators that provide improved performance.
4.4.1 Sample averages
Under the assumption that Rt is i.i.d. (or stationary in general), the mean and covariance matrix,
µ and V, can simply be estimated by their sample analogues,

µˆ = (1/T) Σ_{t=1}^{T} Rt, (4.39)

Vˆ = (1/T) Σ_{t=1}^{T} (Rt − µˆ)(Rt − µˆ)′, (4.40)

where Rt is an n-vector of asset returns, and hence µˆ is also an n-vector, and Vˆ is an n × n matrix.
It should be noted that the sample covariance matrix will not be invertible if T ≤ n.
In the high-dimensional case when n is large, even when T > n, the estimator Vˆ can be very
inaccurate, with many eigenvalues that are too small (see Section 6.2.6).
The above estimators are known as moment estimators because we simply replace the theoretical
expectations (moments), i.e., the right-hand sides of

µ = E[Rt], (4.41)

V = E[(Rt − µ)(Rt − µ)′], (4.42)

by their sample counterparts as estimators of the left-hand sides.

They are also known as maximum likelihood (ML) estimators, which are the most efficient
among all unbiased estimators (achieving the so-called Cramer-Rao bound).5
By the Law of Large Numbers, µˆ and Vˆ must converge to µ and V as the sample size T increases
to infinity. The question is how to assess the accuracy.

Asymptotically, for independently and identically distributed returns,

µˆ ∼ N(µ, (1/T)V). (4.43)

So the square roots of the diagonal elements of Vˆ/T provide the standard errors for µˆ, which
indicate how far the estimates might deviate from the true µ.
However, it is more difficult to assess the standard errors for Vˆ. One of the difficulties is that
Vˆ is an n × n (almost surely positive definite) symmetric matrix, of which there are n(n + 1)/2
distinct elements. Under the normality assumption6, it is known that

Vˆ ∼ Wn(T − 1, V/T), (4.44)

where Wn(T − 1, V/T) denotes a Wishart distribution with T − 1 degrees of freedom and covariance
matrix V/T. In general, a Wishart distribution, denoted by Wn(ν, Σ), is defined as the distribution
of a sum of ν matrices,

W = X1X1′ + X2X2′ + · · · + XνXν′, (4.45)

where X1, . . . , Xν are ν independent normal vectors, Xi ∼ N(0, Σ). It is a generalization of the
usual chi-squared distribution to n-dimensional space.
To write the standard errors for the covariances, we need to introduce two popular matrix
operators. The first one is vec, which vectorizes any matrix into a vector by stacking the columns
one on top of the other,

vec(A) ≡ (A1′, A2′, . . . , An′)′, if A = (A1, A2, . . . , An), (4.46)

where A is an m × n matrix with columns A1, A2, . . . , An.

5Exercise: when n = 1, verify that the ML estimator is indeed as given above.
6We refer the general case to Muirhead (1982, Chapter 1).
The second operator, ⊗, is known as the Kronecker product, which turns two matrices into a larger
matrix,

A ⊗ B ≡ [ a1,1B a1,2B . . . a1,nB; a2,1B a2,2B . . . a2,nB; . . . ; am,1B am,2B . . . am,nB ]. (4.47)

Then, the covariance matrix of the sample covariances is approximately

Var[vec(Vˆ)] = (1/T)(In² + Knn)(Vˆ ⊗ Vˆ), (4.48)

where In² is the identity matrix of order n² and Knn is the commutation matrix such that vec(A) =
Knn vec(A′) for any order-n matrix A.
4.4.2 Mean shrinkage: Stein estimators
In estimating µ using µˆ, a standard measure of the loss of efficiency is the squared error,

Loss(µˆ, µ) = (µˆ − µ)′(µˆ − µ) = Σ_{j=1}^{N} (µˆj − µj)², (4.49)

where N is the number of assets. Geometrically, it is the squared distance between the two points µˆ and µ.
The closer µˆ is to µ, the smaller the loss. For a long time, µˆ was considered the best
estimator, until Stein (1955) published his path-breaking paper proving the contrary. In other
words, the sample mean does not have the smallest expected loss.
In general, we can consider an estimator of the form,

µˆS = (1 − α)µˆ + α bˆ. (4.50)

This is known as the James-Stein shrinkage estimator, which shrinks the estimator toward a target
estimator bˆ. When α = 0, there is no bias, but the MSE can be high. When α ≠ 0 but small, there
is bias, but the variance of the estimator can be smaller. Hence, it is a matter of trade-off between the bias and
the variance.
Under the multivariate normality assumption, the optimal choice of α is

α = (1/T) (Nλ̄ − 2λ1) / [(µˆ − bˆ)′(µˆ − bˆ)], N > 2, (4.51)

where λ̄ is the average of the eigenvalues of V and λ1 is the largest. So, when the sample size T is
small, α weights heavily toward the target. However, α becomes smaller and smaller as the sample
size gets large, so that the estimator is essentially µˆ for large T.
It is of interest to see a special case in which the asset returns are independent with unit variances, and bˆ = 0. Then V is the identity matrix, and both λ̄ and λ₁ equal 1, so we have
\[
\hat\mu^S_j = \left(1 - \frac{1}{T}\,\frac{N-2}{\sum_{j=1}^{N}\hat\mu_j^2}\right)\hat\mu_j,
\]
that is, Stein's estimator of the j-th asset mean is the usual sample mean shrunk toward 0 by the first term. As the sample size T becomes large, it gets closer to the sample mean.
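To see the dominance numerically, here is a small Monte Carlo sketch of the special case above (independent assets with unit variances, shrinking toward bˆ = 0). All parameter values are made up for illustration.

```python
import numpy as np

def stein_vs_sample(n_trials=200, N=25, T=60, seed=0):
    """Average squared-error loss of the sample mean vs. the Stein estimator
    when returns are independent with unit variance and the target is zero."""
    rng = np.random.default_rng(seed)
    loss_sample = loss_stein = 0.0
    for _ in range(n_trials):
        mu = rng.normal(0.0, 0.3, N)                  # true means (hypothetical)
        mu_hat = (mu + rng.normal(0.0, 1.0, (T, N))).mean(axis=0)
        alpha = (N - 2) / (T * (mu_hat @ mu_hat))     # shrinkage weight from the text
        mu_stein = (1.0 - alpha) * mu_hat
        loss_sample += np.sum((mu_hat - mu) ** 2)
        loss_stein += np.sum((mu_stein - mu) ** 2)
    return loss_sample / n_trials, loss_stein / n_trials
```

On average the Stein estimator should incur the smaller loss, even though it is biased, which is exactly the bias-variance trade-off discussed above.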
There are two popular choices for bˆ in practice. The first is to use the average sample mean across assets, known as the average of the averages,
\[
\hat b = \left(\frac{1}{N}\sum_{i=1}^{N}\tilde\mu_i\right)\times 1_N, \tag{4.52}
\]
where µ̃ᵢ is some prior mean estimate for asset i and 1_N is an N-vector of ones, so that bˆ is an N-vector with each element equal to the average of the asset means, i.e., the grand mean across assets and time.⁷ Another choice is due to Jorion (1986), who suggests
\[
\hat b = \frac{1_N'\hat V^{-1}\hat\mu}{1_N'\hat V^{-1}1_N}\times 1_N. \tag{4.53}
\]
For additional estimators and the theory, see Lehmann and Casella (1998), Maruyama (2004) and
Kan and Zhou (2007).
⁷Theoretically, µ̃ᵢ should be independent of µ̂ᵢ, but in practice one may take them to be the same, in which case the James-Stein estimator shrinks toward the grand mean.
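Both targets are one-liners in code. The sketch below simulates a return matrix for illustration and, as the footnote allows in practice, takes the prior means µ̃ᵢ equal to the sample means µ̂ᵢ.

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 120, 5
R = rng.normal(0.01, 0.05, (T, N))   # hypothetical monthly returns
mu_hat = R.mean(axis=0)
V_hat = np.cov(R, rowvar=False)
ones = np.ones(N)

# eq. (4.52): grand-mean target (prior means taken equal to the sample means)
b_grand = mu_hat.mean() * ones

# eq. (4.53): Jorion (1986) target, the mean of the global minimum-variance portfolio
V_inv = np.linalg.inv(V_hat)
b_jorion = (ones @ V_inv @ mu_hat) / (ones @ V_inv @ ones) * ones
```

Both targets are N-vectors with identical elements, differing only in how the common value is computed.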
4.4.3 Covariance shrinkage
The covariance matrix is important not only for optimal portfolio construction, but also for forecasting the risk of a portfolio. For the latter, if w is one's vector of portfolio weights, then regardless of how w is chosen, the risk of the portfolio, say next month, is
\[
\sigma_p^2 = w'\Sigma w.
\]
Since Σ is unknown, it must be estimated today to make the forecast. However, Menchero and Li (2020) show that shrinkage may not be needed for risk forecasting, though it is always important for portfolio selection.
Traditionally, however, the shrinkage is applied directly to the covariance matrix. Following Meucci (2005, p. 208), who in turn follows Ledoit and Wolf (2003), the shrinkage estimator shrinks the sample covariance matrix toward a target,
\[
\hat V^S = (1-\alpha)\hat V + \alpha\hat C, \tag{4.54}
\]
where the target is
\[
\hat C = \frac{\sum_{i=1}^{N}\hat\lambda_i}{N}\times I_N, \tag{4.55}
\]
with λ̂ᵢ the i-th largest eigenvalue of V̂ and \(I_N\) the identity matrix of order N; and the weight is
\[
\alpha = \frac{1}{T}\,\frac{\frac{1}{T}\sum_{t=1}^{T}\mathrm{tr}\!\left[(R_t R_t' - \hat V)^2\right]}{\mathrm{tr}\!\left[(\hat V - \hat C)^2\right]}, \tag{4.56}
\]
where “tr” is the trace operator that takes the trace (sum of diagonal elements) of a matrix.
In practice, due to concerns about parameter stability, the effective sample size cannot be very large, and hence the shrinkage approach adds value by producing better estimates. Very often we need to estimate the covariance matrix in high dimensions, which is a topic of ongoing research. The PCA and factor analysis, discussed in Sections 6.2.6 and 6.4, are relevant. Pourahmadi (2013) provides an easily accessible analysis. Recently, Ledoit and Wolf (2017) provide yet another shrinkage estimator. However, the empirical evidence from Pedersen, Babu, and Levine (2020) appears to suggest that shrinking the correlation matrix is better than shrinking the covariance matrix in portfolio optimization.
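Equations (4.54)-(4.56) translate almost line for line into code. The sketch below assumes a T × N matrix of zero-mean returns and is illustrative, not Ledoit and Wolf's production estimator.

```python
import numpy as np

def shrink_cov(R):
    """Shrink the sample covariance toward a scaled identity, eqs. (4.54)-(4.56)."""
    T, N = R.shape
    V = R.T @ R / T                              # sample covariance, means taken as zero
    C = (np.trace(V) / N) * np.eye(N)            # eq. (4.55): average eigenvalue x identity
    num = np.mean([np.trace((np.outer(r, r) - V) @ (np.outer(r, r) - V)) for r in R])
    den = np.trace((V - C) @ (V - C))
    alpha = min(1.0, num / (T * den))            # eq. (4.56), clipped to [0, 1] for safety
    return (1.0 - alpha) * V + alpha * C, alpha

rng = np.random.default_rng(2)
V_s, alpha = shrink_cov(rng.normal(0.0, 0.02, (60, 8)))
```

Since both V̂ and Ĉ are positive semi-definite and α lies in [0, 1], the shrunk matrix stays a valid covariance matrix.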
4.4.4 Use of correlation shrinkage
Recall that Pedersen, Babu, and Levine (2020) argue for the use of correlation shrinkage. As the covariance matrix can be decomposed as a product of the vol matrix, the correlation matrix, and the vol matrix, we can shrink the correlation matrix toward the identity matrix (so that the correlations are shrunk toward zero). The details are given in Section 3.5.5.

While it is known that small eigenvalues cause problems for portfolios, it is less obvious why making the correlations smaller helps. Intuitively, the smaller the correlations, the closer the covariance matrix is to a diagonal matrix. Since a diagonal matrix has the original asset variances as its eigenvalues, it is then unlikely that the estimated smallest eigenvalue can be too small.
Below we show that this can in fact be the case. Assume that there are two assets, and consider the simple case in which both assets have variance 1, so that their covariance matrix is the same as the correlation matrix,
\[
\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \tag{4.57}
\]
where ρ is the correlation. In this case, it is easy to see that
\[
\det(\Sigma) = 1 - \rho^2 = \lambda_1\lambda_2 \qquad \text{and} \qquad 2 = \lambda_1 + \lambda_2
\]
imply that
\[
\lambda_1 = 1 + \rho, \qquad \lambda_2 = 1 - \rho
\]
are the two eigenvalues. Assume ρ > 0, so that λ₂ is the smaller eigenvalue. Then it is clear that when the correlation is over-estimated in practice, λ₂ will be under-estimated, i.e., too small.
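A quick numerical check of the two-asset algebra above, with arbitrary example correlations:

```python
import numpy as np

def eigs_for_rho(rho):
    """Eigenvalues (ascending) of the 2x2 correlation matrix in eq. (4.57)."""
    Sigma = np.array([[1.0, rho], [rho, 1.0]])
    return np.linalg.eigvalsh(Sigma)

# over-estimating rho = 0.6 as 0.8 pushes the smallest eigenvalue down from 0.4 to 0.2
lam_true = eigs_for_rho(0.6)
lam_over = eigs_for_rho(0.8)
```

The eigenvalues are exactly 1 − ρ and 1 + ρ, so an upward error in ρ translates one-for-one into a downward error in the smallest eigenvalue.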
4.4.5 Eigenvalue adjustment
Since the instability of the covariance matrix is caused by under-estimated small eigenvalues, why not adjust them directly?
Yao, Zheng and Bai (2015, Section 12.5) provide the statistical theory. Recall the eigenvalue decomposition (6.27),
\[
\Sigma = [A_1, \ldots, A_n]
\begin{pmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_n
\end{pmatrix}
[A_1, \ldots, A_n]', \tag{4.58}
\]
where Aᵢ is the eigenvector corresponding to eigenvalue λᵢ, and the eigenvectors are orthogonal to each other with unit length. The idea is to divide the eigenvalues into a few groups, replace the values in each group by an equal estimate, and then use the above formula to compute an estimated Σ. See Yao, Zheng and Bai (2015) for the details.
Under fairly general conditions, the estimator will perform much better than the sample covariance matrix. The eigenvalues are also known as the spectrum of the matrix, and so the above is also called the spectrum-corrected estimator of the covariance matrix.
In particular, as suggested by López de Prado (2020a), one can find a cut-off eigenvalue λₘ, retain the first m estimated eigenvalues, and replace all the remaining small eigenvalues by their average,
\[
\bar\lambda_s = \frac{1}{n-m}\sum_{j=m+1}^{n}\lambda_j.
\]
The choice of m, which is close to the number of factors, may in practice be made through trial and error. How well this procedure works is an empirical question, as there is no theory yet.
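A minimal sketch of this eigenvalue-clipping idea in Python; the cut-off m = 2 below is an arbitrary illustrative choice, and the return data are simulated.

```python
import numpy as np

def clip_eigenvalues(V, m):
    """Keep the m largest eigenvalues of V, replace the remaining small ones
    by their average, and rebuild V from the decomposition (4.58)."""
    lam, A = np.linalg.eigh(V)                   # ascending eigenvalues
    lam, A = lam[::-1], A[:, ::-1]               # reorder to descending
    lam_adj = lam.copy()
    if m < len(lam):
        lam_adj[m:] = lam[m:].mean()             # equalize the small eigenvalues
    return A @ np.diag(lam_adj) @ A.T

rng = np.random.default_rng(3)
X = rng.normal(0.0, 0.02, (80, 6))
V = X.T @ X / 80
V_clip = clip_eigenvalues(V, m=2)
```

Because the replaced eigenvalues are averaged rather than dropped, the trace (total variance) of the matrix is preserved.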
4.4.6 Exponentially weighted moving averages
Motivated by the idea that recent data are more important than older data, we may want to weight recent observations more heavily in computing our parameter estimates. To do so, we assign a weight of \(w_t = 1\) to the most recent observation, and
\[
w_{t-1} = \lambda w_t, \quad w_{t-2} = \lambda w_{t-1}, \quad \ldots, \tag{4.59}
\]
successively for previous data, where λ is a prespecified constant. This replaces an equal-weighted average of the time series with a weighted average whose weights are
\[
1, \ \lambda, \ \lambda^2, \ \ldots, \ \lambda^{t-2}, \ \lambda^{t-1}.
\]
The magnitude of λ indicates how information decays. If λ = 0, we care only about today's observation. If λ = 1, we weight all past observations equally. Typically, 0 < λ < 1. The choice of λ is driven by the application and by calibration results.
To understand this, consider estimating the expected return of a stock with 3 past observations. The usual (equal-weighted) sample mean is
\[
\hat\mu = \frac{R_t + R_{t-1} + R_{t-2}}{3}.
\]
However, if we believe the recent data are more informative, we may use
\[
\hat\mu^W = \frac{R_t + 0.9R_{t-1} + 0.9^2 R_{t-2}}{1 + 0.9 + 0.9^2}
= \frac{R_t + \lambda R_{t-1} + \lambda^2 R_{t-2}}{1 + \lambda + \lambda^2},
\]
which is the weighted mean with λ = 0.9, so earlier observations receive less weight. Note that the weighted return is divided by the sum of the weights because only in this way do the weights on the returns sum to 1.
In practice, however, the use of the weighted mean on the returns is not common, because returns are noisy and mean-reverting, so over-weighting the more recent ones can be counterproductive. On the other hand, covariances are much more persistent, and so they are often estimated with the above so-called exponentially weighted moving averages (EWMA) of the data.
Why do we call the above weighting exponential? Writing \(\lambda^m = e^b\) and taking logs on both sides gives \(b = m\log\lambda\), that is,
\[
\lambda^m = e^{m\log\lambda}.
\]
Since \(\log\lambda < 0\) under the assumption 0 < λ < 1, this says that the weight \(\lambda^m\) decays exponentially in m, at rate \(|\log\lambda|\).
In practice, as mentioned by Menchero and Li (2020), one can use daily data from 10 to 150 days to estimate the covariance matrix. The variance is estimated by
\[
\hat\sigma_i^2 = \frac{R_{i,t}^2 + \lambda R_{i,t-1}^2 + \lambda^2 R_{i,t-2}^2 + \cdots + \lambda^{t-1}R_{i,1}^2}{1 + \lambda + \lambda^2 + \cdots + \lambda^{t-1}}, \tag{4.60}
\]
and the covariance by
\[
\hat\sigma_{ij} = \frac{R_{i,t}R_{j,t} + \lambda R_{i,t-1}R_{j,t-1} + \lambda^2 R_{i,t-2}R_{j,t-2} + \cdots + \lambda^{t-1}R_{i,1}R_{j,1}}{1 + \lambda + \lambda^2 + \cdots + \lambda^{t-1}}, \tag{4.61}
\]
where the daily mean is taken as zero, as in almost all practical computations.
Now, imagine that we have an infinite amount of data. Noting that \(1 + \lambda + \lambda^2 + \cdots = 1/(1-\lambda)\), we can write \(\hat\sigma_i^2\) as \((1-\lambda)\) times the sum of the terms \(\lambda^m R_{i,t-m}^2\), which can be split into the first term plus the rest,
\[
\hat\sigma^2_{i,t+1} = (1-\lambda)R_{i,t}^2 + \lambda\hat\sigma^2_{i,t}, \tag{4.62}
\]
where we add the time subscripts to indicate that the previous variance estimate \(\hat\sigma^2_{i,t}\) is based on data up to t − 1. Similarly, for the covariance we have
\[
\hat\sigma_{ij,t+1} = (1-\lambda)R_{i,t}R_{j,t} + \lambda\hat\sigma_{ij,t}. \tag{4.63}
\]
Both (4.62) and (4.63) say that we can update the estimates recursively from the past. The role of λ is easy to see from (4.62): the first term captures the volatility reaction to current market events, and the second term captures persistence. No matter what happens today, \(\lambda\hat\sigma^2_{i,t}\) states that high volatility estimated yesterday is likely to cause high volatility tomorrow. The greater the λ, the greater the persistence.
A rule-of-thumb choice of λ is between 0.75 and 0.98 for most markets (Alexander, 2001, p. 60). Generally, λ is greater for long-term forecasts and smaller for short-term forecasts. Mathematically, the EWMA is equivalent to an IGARCH model without an intercept.
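The recursion (4.62) is essentially one line of code. A sketch, with λ = 0.94 as an arbitrary illustrative choice within the 0.75-0.98 range:

```python
import numpy as np

def ewma_var(returns, lam=0.94):
    """EWMA variance via the recursion (4.62), with daily means taken as zero.
    Initialized at the first squared return."""
    var = returns[0] ** 2
    for r in returns[1:]:
        var = (1.0 - lam) * r ** 2 + lam * var
    return var

v = ewma_var(np.array([0.01, -0.02, 0.015]))
```

As a sanity check, a constant return series gives back the squared return exactly, since \((1-\lambda)r^2 + \lambda r^2 = r^2\).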
4.4.7 GS covariance matrix estimator
The covariance matrix is estimated so far by fixing the sampling frequency. Suppose we are inter-
ested in the monthly covariance matrix. The estimates are then computed by using monthly data,
and the accuracy increases as more monthly data are used. Theoretically, the estimate converges
to the true value as the sample size goes to infinity.
However, in practice, the covariances or volatilities change over time. So the data too long ago
may not be relevant, implying that the sample size may not be that large. Researchers at Goldman
Sachs (see Litterman, 2003, Chapter 16) suggest to use daily data to improve the accuracy.
To estimate this month's vol, the idea is to use not only the information in many past months, but also the more frequent observations within those months. Here we use daily data.
Theoretically, as shown by Merton (1980), if the stock prices follow a diffusion process or iid
lognormal, then the use of more frequent data will not help in estimating the mean, but it can make the variance estimate as accurate as desired. To see why the frequency does not matter for the mean, suppose T is the time length, and we have prices at the beginning and at the end, \(P_0\) and \(P_T\). Then the expected (continuously compounded) return per unit of time is estimated by
\[
\hat\mu = \frac{\log(P_T/P_0)}{T}.
\]
Now assume that we have n daily prices available over \([0, T]\): \(P_0, P_h, P_{2h}, \ldots, P_{nh} = P_T\), with \(h = T/n\). Then the average daily return is
\[
\hat\mu_d = \frac{\log(P_h/P_0) + \log(P_{2h}/P_h) + \cdots + \log(P_{nh}/P_{(n-1)h})}{n} \tag{4.64}
\]
\[
= \frac{\log\left[(P_h/P_0)\times(P_{2h}/P_h)\times\cdots\times(P_{nh}/P_{(n-1)h})\right]}{n}
= \frac{\log(P_T/P_0)}{n}. \tag{4.65}
\]
If T is measured in years, then µ̂ is the estimated annual return, and it is the same as the annualized daily return, \((n/T)\hat\mu_d\). In short, the daily observations do not matter except for the beginning and ending prices, \(P_0\) and \(P_T\). Hence, the only way to raise the accuracy of the expected-return estimate is to increase the length of the history, T.
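The telescoping in (4.64)-(4.65) is easy to verify numerically: the annualized average of daily log returns depends only on the first and last prices. The price path below is simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
T_years, n = 10, 2520                               # ten years of daily data (hypothetical)
daily_log_ret = rng.normal(0.0003, 0.01, n)
prices = 100.0 * np.exp(np.concatenate(([0.0], np.cumsum(daily_log_ret))))

mu_endpoints = np.log(prices[-1] / prices[0]) / T_years          # uses only P_0 and P_T
mu_from_daily = np.mean(np.diff(np.log(prices))) * n / T_years   # annualized daily mean
```

The two estimates coincide to machine precision, exactly as the algebra says.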
Let \(r_i(m)\) be the monthly return of asset i, and assume that there are p daily returns available, \(r_{i,t}(d)\), t = 1, 2, …, p. Then the monthly return can be written as a sum of the daily returns,⁸
\[
r_i(m) = \sum_{t=1}^{p} r_{i,t}(d). \tag{4.66}
\]
We can also write this for another asset j,
\[
r_j(m) = \sum_{s=1}^{p} r_{j,s}(d). \tag{4.67}
\]
Then the covariance between i and j is given by the cross products of the right-hand sides,
\[
\mathrm{Cov}[r_i(m), r_j(m)] = \sum_{t=1}^{p}\sum_{s=1}^{p}\mathrm{Cov}[r_{i,t}, r_{j,s}]. \tag{4.68}
\]
Notice that this formula is true for any two assets. If i = j, it provides the formula for the monthly variance of asset i.

There is one subtle point to be made about the usual variance transformation formula from one frequency to another. Usually we aggregate to get the monthly variance or covariance by multiplying the daily one by the number of
⁸For easy reference, we use almost identical notation to that of Litterman (2003, Chapter 16).
business days within the month,
\[
\sigma_m^2 = p\times\sigma_d^2.
\]
But this formula is correct only if the data are iid. This can be seen by rewriting (4.68) as
\[
\begin{aligned}
\mathrm{Cov}[r_i(m), r_j(m)] ={}& p\times\mathrm{Cov}[r_{i,t}, r_{j,t}] \\
&+ (p-1)\times\left(\mathrm{Cov}[r_{i,t+1}, r_{j,t}] + \mathrm{Cov}[r_{i,t}, r_{j,t+1}]\right) \\
&+ (p-2)\times\left(\mathrm{Cov}[r_{i,t+2}, r_{j,t}] + \mathrm{Cov}[r_{i,t}, r_{j,t+2}]\right) \\
&+ \cdots \\
&+ 1\times\left(\mathrm{Cov}[r_{i,t+p-1}, r_{j,t}] + \mathrm{Cov}[r_{i,t}, r_{j,t+p-1}]\right). \tag{4.69}
\end{aligned}
\]
Note that the first term \(\mathrm{Cov}[r_{i,t}, r_{j,t}] = \mathrm{Cov}[r_{i,1}, r_{j,1}] = \cdots = \mathrm{Cov}[r_{i,p}, r_{j,p}]\) (assuming the daily covariance is constant within the month), and the factor p indicates that we collect all p same-day terms for the two assets together. Similarly, the second term collects all the cross products of returns on dates one day apart, and so on. Therefore, beyond the first term, the other terms matter too if the data are not iid.
Now, given sample data of T daily returns, we can estimate the first term in the above formula by
\[
\widehat{\mathrm{Cov}}[r_{i,t}, r_{j,t}] = \frac{1}{T}\sum_{s=1}^{T} r_{i,s}r_{j,s}, \tag{4.70}
\]
and any of the other (p − 1) terms by
\[
\widehat{\mathrm{Cov}}[r_{i,t}, r_{j,t+k}] = \frac{1}{T-k}\sum_{s=1}^{T-k} r_{i,s}r_{j,s+k}. \tag{4.71}
\]
Then the right-hand side of (4.69) provides the estimate of the monthly covariance or volatility. Note that (4.70) differs from the standard statistical estimation formula,
\[
\widehat{\mathrm{Cov}}[r_{i,t}, r_{j,t}] = \frac{1}{T}\sum_{s=1}^{T}(r_{i,s} - \bar r_i)(r_{j,s} - \bar r_j). \tag{4.72}
\]
The reason is that the daily means are small and can be taken as zeros without consequence for daily covariance and volatility estimation.
For easy programming, the above estimator can be written in a simpler matrix form. Assume
there are T daily returns for all N assets, which can be written as a T × N matrix,
\[
R(d) = \begin{pmatrix}
r_{1,1}(d) & r_{2,1}(d) & \cdots & r_{N,1}(d) \\
r_{1,2}(d) & r_{2,2}(d) & \cdots & r_{N,2}(d) \\
\vdots & \vdots & \ddots & \vdots \\
r_{1,T}(d) & r_{2,T}(d) & \cdots & r_{N,T}(d)
\end{pmatrix}. \tag{4.73}
\]
Then the monthly covariance matrix estimator can be written as
\[
S(m) = p\times S_0(d) + \sum_{k=1}^{q}(p-k)\times\left[S_k(d) + S_k(d)'\right], \tag{4.74}
\]
where q is the order of serial correlation, and
\[
S_0(d) = \frac{1}{T}R(d)'R(d), \qquad S_k(d) = \frac{1}{T}R(d)'R_k(d),
\]
are the matrix forms of (4.70) and (4.71), with \(R_k(d)\) defined as \(R(d)\) lagged by k rows, i.e., with the rows shifted down by k and the first k rows treated as zeros. Note that we may average S(m) over past months to obtain the covariance estimator for the current month.
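A sketch of (4.74) in Python, assuming a T × N matrix of zero-mean daily returns; the choices p = 21 trading days and q = 2 lags are arbitrary illustrative defaults.

```python
import numpy as np

def gs_monthly_cov(R, p=21, q=2):
    """Monthly covariance matrix from daily returns with q serial-correlation
    corrections, eq. (4.74); R_k is R lagged by k days with zeros on top."""
    T, N = R.shape
    S0 = R.T @ R / T
    S = p * S0
    for k in range(1, q + 1):
        Rk = np.vstack([np.zeros((k, N)), R[:-k]])   # first k rows treated as zeros
        Sk = R.T @ Rk / T
        S += (p - k) * (Sk + Sk.T)
    return S

rng = np.random.default_rng(5)
S_m = gs_monthly_cov(rng.normal(0.0, 0.01, (500, 4)))
```

Symmetrizing each lag term as \(S_k + S_k'\) keeps the resulting estimator symmetric, as a covariance matrix must be.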
Finally, based on the EWMA, the GS estimate of the daily covariance is
\[
\widehat{\mathrm{Cov}}[r_{i,t}, r_{j,t}]
= \frac{\sum_{s=1}^{T} w_s r_{i,s}r_{j,s}}{\sum_{s=1}^{T} w_s}
= \frac{\sum_{s=1}^{T}\left(w_s^{1/2}r_{i,s}\right)\left(w_s^{1/2}r_{j,s}\right)}{\sum_{s=1}^{T} w_s}. \tag{4.75}
\]
Other terms can be written out similarly, and we thus obtain the monthly covariance estimator by
(4.74).
In matrix form, we can weight the returns as
\[
\hat R(d) = \begin{pmatrix}
(1-\delta)^{\frac{T-1}{2}}r_{1,1} & (1-\delta)^{\frac{T-1}{2}}r_{2,1} & \cdots & (1-\delta)^{\frac{T-1}{2}}r_{N,1} \\
\vdots & \vdots & \ddots & \vdots \\
(1-\delta)^{\frac{1}{2}}r_{1,T-1} & (1-\delta)^{\frac{1}{2}}r_{2,T-1} & \cdots & (1-\delta)^{\frac{1}{2}}r_{N,T-1} \\
r_{1,T} & r_{2,T} & \cdots & r_{N,T}
\end{pmatrix}, \tag{4.76}
\]
where δ = 1 − λ is the notation used by GS for the decay parameter (so 1 − δ = λ). Then the weighted monthly covariance matrix estimator is
\[
\hat S(m) = p\times\hat S_0(d) + \sum_{k=1}^{q}(p-k)\times\left[\hat S_k(d) + \hat S_k(d)'\right], \tag{4.77}
\]
where
\[
\hat S_0(d) = \hat R(d)'\hat R(d)\Big/\sum_{t=1}^{T} w_t, \qquad
\hat S_k(d) = \hat R(d)'\hat R_k(d)\Big/\sum_{t=1}^{T} w_t.
\]
It may be noted that the weights \(w_t\) are applied directly to the returns when computing the expected return, but only through their square roots when computing the covariances. The reason is that, in the latter case, the weights should fall on the products of returns, so that the more recent covariances are weighted more heavily, rather than on the returns themselves.
5 Factor Models 1: Known Factors
Factor models for stock returns are popular. There are two types. In the first, the factors are assumed known and directly observable from financial markets; they can be time-series factors, like the market index, or cross-sectional factors, like firm fundamentals/characteristics. In the second, the factors are assumed to be unknown random variables (also known as latent variables), whose realizations are not directly observed but can be estimated from the data. This section focuses on the first type, and the next section deals with the second.
5.1 The CAPM
In this subsection, we focus on testing the CAPM. For completeness, we first prove the CAPM for mean-variance utility investors, and then for investors with arbitrary utilities when the returns are normal (this can be extended to elliptical distributions). Then we move to the tests. We examine first the widely used tests based on pricing errors (alphas), and then tests based on cross-sectional analysis and stochastic discount factors.
5.1.1 Proof 1: preference assumption
Theoretically, the CAPM is valid under two assumptions:

• Perfect market: All investors have the same full information and so hold the same true beliefs about the means and covariances of stock returns; there are no transaction costs or taxes, so trading and mispricing can be corrected without cost; all can borrow and lend at the riskfree rate.

• Preference or return restrictions: All investors are rational, with either mean-variance utility or a concave utility function; in the latter case, the stock returns are assumed to be normally distributed (this can be extended to elliptical distributions).
We will prove the CAPM under the mean-variance preference in this subsection, and leave the proof for the other case to the next subsection. The proofs below follow standard texts such as Ingersoll (1987) and Huang and Litzenberger (1988). Berk (1997) examines the necessary and sufficient conditions for the CAPM.

Under mean-variance utility, the CAPM follows from the Two-fund Separation Theorem and the market-clearing condition. The latter says that demand must equal supply in the market: all the stocks bought by investors must equal the supply of existing shares, or the total value bought must equal the total value of the shares:
\[
\sum_{j=1}^{I} W_j R_\eta = \begin{pmatrix} W_m^1 \\ \vdots \\ W_m^N \end{pmatrix}, \tag{5.1}
\]
where I is the number of investors and the \(W_j\) are their wealth levels, \(R_\eta\) is the tangency portfolio they all hold (see Section 2.7.2), and \(W_m^i\) is the total market value of stock i.
Since the vector of total stock values is the market's total wealth times the market portfolio,
\[
\begin{pmatrix} W_m^1 \\ \vdots \\ W_m^N \end{pmatrix}
= W_m^{\mathrm{total}}\begin{pmatrix} W_m^1/W_m^{\mathrm{total}} \\ \vdots \\ W_m^N/W_m^{\mathrm{total}} \end{pmatrix}
= W_m^{\mathrm{total}} w_m, \tag{5.2}
\]
where \(w_m\), defined by the above equation as the vector of stock values as fractions of the total market value, is exactly the vector of market portfolio weights. Hence, Equation (5.1) says that the tangency portfolio must be the market portfolio in equilibrium when demand equals supply.
Example 5.1 Suppose that there are only N = 2 stocks in the market, with market values of $100 and $200. If there are I = 3 investors with wealth $50, $100, and $150 invested in their stock portfolios, then it must be the case that
\[
50R_\eta + 100R_\eta + 150R_\eta = \begin{pmatrix} 100 \\ 200 \end{pmatrix}
= 300\begin{pmatrix} 1/3 \\ 2/3 \end{pmatrix} = 300\,w_m,
\]
where \(w_m = (1/3, 2/3)'\) is the market portfolio, and so \(R_\eta\) must be the same as \(w_m\). ♠
Let \(R_q\) be the return on any portfolio with weights \(w_q\) fully invested in the risky assets, and \(R_m\) be that of the market. Since \(R_m\) is the tangency portfolio, \(w_m = \Sigma^{-1}(\mu - r_f 1_N)/\gamma\), and so we have
\[
\mathrm{cov}(R_q, R_m) = w_q'\Sigma w_m = w_q'(\mu - r_f 1_N)/\gamma
= \left(E[R_q] - r_f\right)/\gamma, \tag{5.3}
\]
using \(w_q'\mu = E[R_q]\) and \(w_q'1_N = 1\). Letting \(R_q\) be stock i and the market, respectively, we obtain
\[
\mathrm{cov}(R_i, R_m) = \left(E[R_i] - r_f\right)/\gamma, \tag{5.4}
\]
\[
\mathrm{cov}(R_m, R_m) = \left(E[R_m] - r_f\right)/\gamma. \tag{5.5}
\]
Now, taking the ratio of the two equations above and multiplying both sides by \(E[R_m] - r_f\), we have
\[
E[R_i] - r_f = \frac{\mathrm{cov}(R_i, R_m)}{\mathrm{cov}(R_m, R_m)}\left(E[R_m] - r_f\right), \tag{5.6}
\]
or
\[
E[R_i] = r_f + \beta_i\left(E[R_m] - r_f\right), \qquad i = 1, 2, \ldots, N, \tag{5.7}
\]
where \(\beta_i = \mathrm{cov}(R_i, R_m)/\mathrm{cov}(R_m, R_m)\) is stock i's beta. This is exactly the CAPM, stating that the expected return of any stock is the riskfree rate plus beta times the market risk premium, \(E[R_m] - r_f\).
The security market line (SML) is a plot of the CAPM relation: the expected return of an individual security as a function of its systematic, non-diversifiable risk β. It says that the greater the beta, the greater the expected return. In contrast, common wisdom says the greater the risk, the greater the return. This is not true under the CAPM, which states that only systematic risk gets compensated by the market. When the CAPM is not true, assets above the SML clearly earn positive alphas and those below earn negative alphas; buying positive-alpha assets or shorting negative-alpha assets helps to beat the market.

The capital market line (CML), a concept related to the CAPM, is the tangent line drawn from the point of the risk-free asset to the tangency portfolio, which is the market portfolio under the CAPM conditions. In terms of portfolio selection, the CML says that all investors should choose a portfolio along the tangent line, i.e., a mix of the riskfree asset and the market portfolio, though the mix can vary across investors. No matter what mix an investor chooses along the CML, the Sharpe ratio is the same for all investors.
Mathematically, it states that
\[
\frac{E(R_q) - r_f}{\sigma_q} = \frac{E(R_m) - r_f}{\sigma_m},
\]
for any portfolio \(R_q\) on the line. This implies that, for a given level of desired risk \(\sigma_q\) that an investor wants to take, the expected return is
\[
E(R_q) = r_f + \frac{\sigma_q}{\sigma_m}\left[E(R_m) - r_f\right].
\]
Note that this holds only on the CML and is not true for a general portfolio \(R_q\).

Note that the slope of the CML is the Sharpe ratio of the market portfolio. If the CAPM is true, all efficient portfolios earn the same market Sharpe ratio. In the real world, the CAPM is not exactly true, and hence one implication is that an investor should buy assets whose Sharpe ratios plot above the CML. However, one would not necessarily sell assets whose Sharpe ratios plot below it, because all inefficient portfolios lie underneath.
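As a worked example of the CML formula, with hypothetical numbers \(r_f = 2\%\), \(E(R_m) = 8\%\), and \(\sigma_m = 18\%\), an investor targeting half the market's risk earns:

```python
rf, mu_m, sigma_m = 0.02, 0.08, 0.18                  # hypothetical inputs
sigma_q = 0.09                                        # half the market's volatility
expected_q = rf + (sigma_q / sigma_m) * (mu_m - rf)   # CML expected return
```

Here `expected_q` works out to 0.05: the riskfree rate plus half the 6% market risk premium, and the Sharpe ratio of the mix equals that of the market, as the CML requires.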
5.1.2 Proof 2: return assumption
Let \(\tilde r_j\) be the random future return on asset j, j = 1, 2, …, N, and \(w_{ij}\) be the portfolio weights of investor i; then the random terminal wealth can be written as
\[
\tilde W_i = W_i^0\left[1 + r_f + \sum_j w_{ij}(\tilde r_j - r_f)\right].
\]
The investor's problem is to maximize the expected utility of wealth, \(E[u_i(\tilde W_i)]\), and the first-order condition is
\[
E\left[u_i'(\tilde W_i)(\tilde r_j - r_f)\right] = 0. \tag{5.8}
\]
Since \(\mathrm{cov}(\tilde a, \tilde b) = E(\tilde a - \bar a)(\tilde b - \bar b) = E(\tilde a\tilde b) - \bar a\bar b\) for any pair of random variables \(\tilde a\) and \(\tilde b\) with means \(\bar a\) and \(\bar b\), we have
\[
\mathrm{cov}\left(u_i'(\tilde W_i), \tilde r_j\right) = -E\left[u_i'(\tilde W_i)\right]E(\tilde r_j - r_f),
\]
so we can solve for the expected excess return on the asset,
\[
E(\tilde r_j - r_f) = -\,\mathrm{cov}\left(u_i'(\tilde W_i), \tilde r_j\right)\Big/E\left[u_i'(\tilde W_i)\right]. \tag{5.9}
\]
For any pair of normal random variables \(\tilde x\) and \(\tilde y\), Stein's Lemma states that the covariance of any (smooth) function of \(\tilde x\) with \(\tilde y\) can be factored into
\[
\mathrm{cov}(g(\tilde x), \tilde y) = E[g'(\tilde x)]\,\mathrm{cov}(\tilde x, \tilde y).
\]
Since we assume that the returns are normally distributed, so is wealth, and hence we can apply Stein's Lemma to rewrite (5.9) as
\[
\frac{1}{\theta_i}E(\tilde r_j - r_f) = \mathrm{cov}(\tilde W_i, \tilde r_j), \tag{5.10}
\]
where \(\theta_i \equiv -E\left[u_i''(\tilde W_i)\right]/E\left[u_i'(\tilde W_i)\right]\). Summing this equation over all individuals, we have
\[
\left(\sum_{i=1}^{I}\frac{1}{\theta_i}\right)E(\tilde r_j - r_f) = \mathrm{cov}(\tilde W_m, \tilde r_j) = W_m^0\,\mathrm{cov}(\tilde r_m, \tilde r_j), \tag{5.11}
\]
where \(\tilde W_m = W_m^0(1 + \tilde r_m)\) is the future market wealth, with \(W_m^0\) the initial market wealth and \(\tilde r_m\) the market return. Multiplying the above equation by the market portfolio weights and summing over j, we obtain
\[
\left(\sum_{i=1}^{I}\frac{1}{\theta_i}\right)E(\tilde r_m - r_f) = W_m^0\,\mathrm{cov}(\tilde r_m, \tilde r_m). \tag{5.12}
\]
Finally, taking the ratio of the two equations above and then multiplying both sides by \(E(\tilde r_m - r_f)\), we immediately obtain the CAPM.
5.1.3 Market model
Suppose there are N stocks. A single-factor model is the simplest model one can use to explain the returns on the stocks,
\[
r_{it} = \alpha_i + \beta_i f_t + \epsilon_{it}, \qquad t = 1, \ldots, T, \tag{5.13}
\]
for i = 1, 2, …, N. That is, we have one regression of each stock on the factor, and there are in total N regressions.

In practice, it is often the excess asset returns that are used, i.e., the total returns minus the riskfree rate. The Market Model is a regression of the asset excess return on the market,
\[
r_{it} = \alpha_i + \beta_i r_{mt} + \epsilon_{it}, \qquad t = 1, \ldots, T, \tag{5.14}
\]
where the factor is now the market factor (in practice, the excess return on a market index), \(\epsilon_{it}\) is the residual, which is uncorrelated with the market \(r_{mt}\) and has a zero mean, and T is the number of time-series observations.
The market model relates asset returns to that of the market. Based on the market model, we can always decompose the asset risk (variance) as
\[
\mathrm{var}[r_{it}] = \beta_i^2\,\mathrm{var}[r_{mt}] + \mathrm{var}[\epsilon_{it}], \tag{5.15}
\]
i.e., the asset variance is the sum of its market risk and its residual risk (idiosyncratic risk).
Moreover, the regression model implies
\[
\beta_i = \frac{\mathrm{cov}(r_{it}, r_{mt})}{\mathrm{var}(r_{mt})},
\]
i.e., beta is the ratio of the covariance between the stock and the market to the market variance. Also,
\[
\alpha_i = E(r_{it}) - \beta_i E(r_{mt}),
\]
i.e., alpha is the expected asset return minus what is implied by the CAPM.

Note that we can always run the above market model regression, or project \(r_{it}\) on \(r_{mt}\). Mathematically, there is nothing one can say about how big or how small the alpha should be. Hence, the market model regression itself has nothing to do with the CAPM or any economic theory. However, if the CAPM is true, as will become clear below, the alpha should be zero under certain fairly general economic assumptions.
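The moment formulas above agree exactly with OLS, as the following simulated sketch verifies (all parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
T = 240
rm = rng.normal(0.006, 0.045, T)                   # market excess returns
ri = 0.001 + 1.2 * rm + rng.normal(0.0, 0.03, T)   # eq. (5.14): alpha=0.001, beta=1.2

beta_hat = np.cov(ri, rm, ddof=0)[0, 1] / np.var(rm)   # beta = cov / var
alpha_hat = ri.mean() - beta_hat * rm.mean()           # alpha = mean - beta * market mean

slope, intercept = np.polyfit(rm, ri, 1)               # OLS gives the same numbers
```

Population covariance divided by population variance is exactly the OLS slope, so the two routes coincide to machine precision.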
5.1.4 Some truths on Alpha
Similarly, regardless of any economic theory, there are some simple truths about the alphas in the market model regression.

It is an accounting identity that the value-weighted sum of all the stock alphas must equal zero, regardless of whether the CAPM is true or not. This is because, if we multiply the market model by the value weight of the firm, \(w_i\), we get
\[
w_i r_{it} = w_i\alpha_i + w_i\beta_i r_{mt} + w_i\epsilon_{it}, \tag{5.16}
\]
which is true for every stock i. Assume there are N stocks in the market. Summing the above over all the stocks,
\[
\sum_{i=1}^{N} w_i r_{it} = \sum_{i=1}^{N}(w_i\alpha_i) + \left[\sum_{i=1}^{N} w_i\beta_i\right] r_{mt} + \sum_{i=1}^{N}(w_i\epsilon_{it}).
\]
Since the left-hand side is simply the market (value-weighted index), we must then have index = 0 + 1 × index + 0, so
\[
\sum_{i=1}^{N}(w_i\alpha_i) = 0, \qquad \sum_{i=1}^{N} w_i\beta_i = 1, \qquad \sum_{i=1}^{N} w_i\epsilon_{it} = 0.
\]
Hence, the value-weighted sum of the alphas is 0, and that of the betas is 1.
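The accounting identity can be verified in-sample with simulated data: when the market is defined as the value-weighted index of the same N stocks, the fitted OLS alphas and betas satisfy the identity exactly (the weights and returns below are made up).

```python
import numpy as np

rng = np.random.default_rng(7)
T, N = 120, 10
w = rng.random(N)
w /= w.sum()                                # value weights summing to one
R = rng.normal(0.01, 0.06, (T, N))          # excess returns on the N stocks
rm = R @ w                                  # the market IS the value-weighted index

betas = np.array([np.cov(R[:, i], rm, ddof=0)[0, 1] for i in range(N)]) / np.var(rm)
alphas = R.mean(axis=0) - betas * rm.mean()
```

The identity holds mechanically: the value-weighted sum of covariances with the index is the index's own variance, giving Σwᵢβᵢ = 1 and hence Σwᵢαᵢ = 0.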
In practice, people often say that the sum of the alphas is 0. This is not exactly true, but only approximately, as the exact identity requires weighting the alphas by firm value, not equal-weighting by 1/N. Clearly, the fact that the (value-weighted) sum of all the alphas is zero does not imply that the alphas are zero individually: some can be positive and some negative while their sum is zero. If the CAPM is true, it makes the much stronger assertion that every single alpha is zero!

Consider now portfolio alphas, in contrast to the stock alphas above. Since a portfolio's alpha is the portfolio-weighted combination of the individual alphas, and since the sum of the portfolios of all investors is the market, the (value-weighted) sum of the portfolio alphas of all investors must equal zero too. This helps to explain why it is difficult to beat the market.

If one investor earns a positive alpha on his or her portfolio, someone else must earn a negative alpha. While all investors can make money in the stock market theoretically, the competition for alpha is a zero-sum game. If one has no information, or does not want to risk a negative alpha, buying and holding the market index earns the market return with zero alpha.
5.1.5 Claims of the CAPM
The capital asset pricing model (CAPM) is about pricing stocks. Let \(R_{it}\) be the return on stock i. Under certain assumptions, the CAPM holds, stating that, at any time t,
\[
E[R_{it}] = r_{ft} + \beta_i\left(E[R_{mt}] - r_{ft}\right), \tag{5.17}
\]
that is, the expected return on a stock is the riskfree rate plus the stock's beta times the market risk premium, \(E[R_{mt}] - r_{ft}\).

Recall that we often work with excess returns. Let \(r_{it} = R_{it} - r_{ft}\) be firm i's return in excess of the riskfree asset, and \(r_{mt} = R_{mt} - r_{ft}\) be the market excess return; then the CAPM relation can be written simply as
\[
E[r_{it}] = \beta_i\mu_m, \tag{5.18}
\]
where \(\mu_m = E[R_{mt}] - r_{ft}\) is the market risk premium, i.e., the expected market excess return. In practice, \(R_{mt}\) is often taken as the return on a broad market index (such as the S&P 500), and then \(\mu_m\) is the expected excess return on the index.
It should be noted that the CAPM is about the expected return, not the risk. It says that, across stocks, the greater the beta (the systematic market risk exposure), the greater the expected return. In fact, the expected excess return is simply a linear function of beta: beta times the expected market excess return. Contrary to the belief of many investors, the conventional wisdom of high risk, high return is not true: the CAPM says that only systematic risk gets compensated, and idiosyncratic risk does not.
5.1.6 GRS test
Recall that, if the CAPM is true, then, taking expectations on both sides of (5.14), we have
\[
H_0: \ \alpha_i = 0, \qquad i = 1, \ldots, N, \tag{5.19}
\]
for all the stocks, where N is the total number of stocks. Hence, a test of the CAPM is a test of whether all the alphas in the market model are zero.

The alpha is also known as the pricing error. When it is positive, the CAPM under-values the asset; when it is negative, there is over-valuation. Here we have only one factor, the market factor. In general, when there is more than one factor, alpha still measures the pricing error, but relative to the multi-factor model instead of the CAPM.
To test the CAPM, we first estimate the alphas and betas. The estimation can be done equation by equation by OLS for each asset. Then the null hypothesis can be tested using the well-known Gibbons, Ross and Shanken (1989) test,
\[
\mathrm{GRS} \equiv \frac{T-N-1}{N}\,\frac{\hat\alpha'\hat\Sigma^{-1}\hat\alpha}{1+\hat\theta_m^2} \sim F_{N,\,T-N-1}, \tag{5.20}
\]
where α̂ is the vector of estimated alphas, Σ̂ is the estimated N × N residual covariance matrix, whose (i, j) element is given by
\[
\hat\sigma(i,j) = \frac{1}{T}\sum_{t=1}^{T}(r_{it} - \hat\alpha_i - \hat\beta_i r_{mt})(r_{jt} - \hat\alpha_j - \hat\beta_j r_{mt}), \tag{5.21}
\]
\(\hat\theta_m\) is the Sharpe ratio of the market \(r_m\) (the ratio of the mean excess return to its standard deviation), and \(F_{N,T-N-1}\) is the F-distribution with N and T − N − 1 degrees of freedom. We reject the null when the GRS statistic is large relative to the random fluctuations measured by F.
As we shall see in the slides, the CAPM is rejected strongly by the data.
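A sketch of the GRS statistic in (5.20)-(5.21), in plain NumPy. The simulated data below are constructed to satisfy the null (all alphas zero), so the statistic should behave like a draw from F(N, T − N − 1), which has mean near one.

```python
import numpy as np

def grs_stat(R, rm):
    """GRS statistic, eq. (5.20). R: T x N asset excess returns; rm: T market
    excess returns. Compare the value to the F(N, T-N-1) distribution."""
    T, N = R.shape
    X = np.column_stack([np.ones(T), rm])
    Theta, *_ = np.linalg.lstsq(X, R, rcond=None)    # row 0: alphas, row 1: betas
    alphas = Theta[0]
    resid = R - X @ Theta
    Sigma = resid.T @ resid / T                      # eq. (5.21)
    theta_m_sq = (rm.mean() / rm.std()) ** 2         # squared market Sharpe ratio
    return (T - N - 1) / N * (alphas @ np.linalg.solve(Sigma, alphas)) / (1 + theta_m_sq)

rng = np.random.default_rng(8)
T, N = 240, 5
rm = rng.normal(0.006, 0.045, T)
betas = rng.uniform(0.5, 1.5, N)
R = rm[:, None] * betas + rng.normal(0.0, 0.03, (T, N))   # zero alphas by construction
stat = grs_stat(R, rm)
```

The quadratic form \(\hat\alpha'\hat\Sigma^{-1}\hat\alpha\) is computed via `np.linalg.solve` rather than an explicit inverse, which is numerically preferable.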
However, a rejection of the CAPM simply indicates that the market factor alone cannot price the assets; the market can still be very important in explaining the returns (with non-zero betas). Indeed, in the usual multi-factor extensions of the CAPM, the market factor is always by far the most important factor, and, in conjunction with other factors, it can price assets fairly well in many applications.

Note that to test whether the alpha of an individual firm is zero, \(\alpha_i = 0\), it is straightforward to apply the popular t-ratio test in a univariate linear regression. But this is different from the CAPM test, which requires all the alphas to be zero simultaneously; it is a multi-dimensional test. If one company has a zero alpha, this does not imply that the CAPM is true in general, although it may be true just for this company. On the other hand, if the CAPM is rejected for one company by a t-ratio test with a P-value of 5%, this does not imply that the CAPM is invalid either: because the rejection is not absolute, at a 5% P-value about 5 rejections will occur among every 100 firms by chance alone even if the CAPM is true. Hence, statistically, the GRS test, which tests all the pricing errors being zero jointly, is the right tool for the multiple restrictions here, and it is statistically the most powerful test.
What is the statistical intuition of the GRS test? Consider the case of two stocks. We want to
design a single test statistic to test both α1 and α2 being zero. One possible statistic is
J = \hat{\alpha}_1^2 + \hat{\alpha}_2^2.
If both α1 and α2 are zero, J should be small, as with the GRS. If we find J is too large empirically,
we can reject the hypothesis that the alphas are zero. The problem is that it is very difficult to find
the distribution of J, so we have to modify it. The idea is to standardize both αˆ1 and αˆ2, so that
their joint distribution is tractable. In doing so, we get the GRS test. In the one-dimensional case,
we standardize the alpha and get the t distribution; the F distribution is thus an extension of the t
to the higher-dimensional case.
What is the economic intuition of the GRS test? If the CAPM is true, and if we invest in all
the assets including the market, we cannot do better than investing in the market alone. Let θˆq be
the Sharpe ratio of the former. Gibbons, Ross and Shanken (1989) show that

\hat{\alpha}'\hat{\Sigma}^{-1}\hat{\alpha} = \hat{\theta}_q^2 - \hat{\theta}_m^2.

Hence, the GRS test measures how close the market Sharpe ratio θˆm is to θˆq. If the difference
is sizable, we reject the CAPM.
In matrix form, the market model can be written as,
R = XΘ + E,
where R, T × N , is the returns on N assets in excess of the riskless rate return; X, T × 2, is a
matrix whose columns are ones and the excess returns on the given portfolio rm; Θ, 2 × N , is a
matrix whose rows are the α and β′s respectively.
Technically, in the context of multivariate testing, rmt is often treated as fixed and E is assumed
to follow a multivariate normal distribution,
vec(E) ∼ N(0, I ⊗ Σ), (5.22)
for the GRS to hold. Furthermore, to guarantee the non-singularity of Σ, we need to assume that rm
is not a linear function of the N asset excess returns. This effectively implies that the given index
portfolio contains other asset returns which are not included on the left-hand side of the market
model.
5.1.7 CAPM and market efficiency
There are two concepts of market efficiency. The usual one comprises the three forms of market
efficiency: weak-form, semi-strong-form, and strong-form. The weak-form efficiency
hypothesis says that historical prices and trading volume provide no information for making abnor-
mal profits, implying in particular that technical analysis and common trading signals are useless
(though the latter is debatable in practice). The semi-strong form states that all publicly available
information (beyond historical prices) is still useless for making abnormal profits, suggesting that
fundamental analysis based on public firm valuation data, such as earnings and growth, is a waste
of time and money (with which fund managers may not agree). The strong form concerns private
information: it says that stock prices reflect all information, public and private, so that there is no
over- or under-valuation and no possibility of making abnormal profits. In the real
world, none of the hypotheses is absolutely true, but market efficiency does serve as a useful reminder
that it is difficult to beat the market.
If the CAPM is true, the market must be efficient, as then all the assets are correctly priced by
the CAPM. In fact, if any known asset pricing model is true, the same conclusion holds. But if we
reject the CAPM or another known model, this says nothing about whether the market is efficient. It simply
states that the given model cannot price all the assets correctly. In other words, according to the
model, there are over- and under-valued assets.
Another meaning of market efficiency is whether the market portfolio, as approximated by the
value-weighted stock index, is an efficient portfolio in the mean-variance frontier. Theoretically,
the CAPM is true if and only if the market portfolio is efficient.
5.1.8 Fama-MacBeth 2-pass regressions
Fama and MacBeth (1973) propose a 2-pass regression approach for estimating factor risk premia
and for testing validity of factor models, the CAPM in particular. The procedure has two steps:
1. A time series regression is run on an asset return to obtain the asset’s betas or exposures to
the risk factors; (The first-pass)
2. A cross-section regression is run for all asset returns on their estimated betas to determine
the risk premia of the factors. (The second-pass)
Consider the case of testing the CAPM. In the first pass, we estimate the market model,
r_{it} = \alpha_i + \beta_i r_{mt} + \epsilon_{it}, \quad t = 1, \ldots, T,    (5.23)
to obtain βˆi. For each firm i, we run the above regression over time; βˆi is its estimated market risk
exposure. Running the regression N times gives the betas for all firms.
In the second pass, at each time t, we run a regression across the firms on their betas,

r_{it} = \gamma_0 + \gamma_1\hat{\beta}_{it} + \epsilon_{it}, \quad i = 1, \ldots, N,    (5.24)

where βˆit is the estimated beta at time t from the first-pass regression. More explicitly,
\begin{pmatrix} R_{IBM,t} \\ R_{Apple,t} \\ \vdots \\ R_{Google,t} \end{pmatrix}
= \gamma_0 + \gamma_1
\begin{pmatrix} \hat{\beta}_{IBM,t} \\ \hat{\beta}_{Apple,t} \\ \vdots \\ \hat{\beta}_{Google,t} \end{pmatrix}
+ \begin{pmatrix} \epsilon_{1t} \\ \epsilon_{2t} \\ \vdots \\ \epsilon_{Nt} \end{pmatrix}.    (5.25)
If the CAPM is true, we should have

\gamma_0 = 0, \qquad \gamma_1 = E[r_{mt}],

that is, the slope of the second-pass regression is the market risk premium.
In practice, the estimates of γ0 and γ1 will be noisy. Assuming they are constant over time, we use
their averages over time as the estimates. For example, suppose we have 15 years of monthly
data. We can start the estimation in the first month after 5 years, month 61, to get the beta (we need
the first 5 years of data to estimate the first beta), and then the gamma. Using the 5-year rolling
window, we can similarly obtain the beta and gamma in the following month (month 62), and so on up
to the last month, for a total of 120 betas per firm. The average is then given by
\bar{\gamma}_1 = \frac{1}{120}\sum_{t=1}^{120}\hat{\gamma}_{1t},    (5.26)
where γˆ1t is the estimate in month t. Statistically, γ¯1 is a better estimate than
any one of the γˆ1t's.
Fama and MacBeth (1973) suggest the following t-statistic to test whether γ1 is signifi-
cantly different from zero,

t\text{-stat} = \frac{\bar{\gamma}_1}{\mathrm{std}(\hat{\gamma}_{1t})/\sqrt{T}},

where std(γˆ1t) is the sample standard deviation of the monthly estimates and T is their number
(120 here); the denominator is known as the Fama and MacBeth (1973) standard error for the
estimated risk premium. Statistically, the t-test tends to over-reject the null because this standard
error under-estimates the true one. The reason is an errors-in-variables problem in the second pass:
the estimated betas, which carry estimation errors from the first pass, are used as regressors instead
of the true ones. Shanken (1992) provides the corrected standard error. Shanken and Zhou (2007)
provide further a specification test of the factor model. Kan, Robotti and Shanken (2013) have
more general discussions on the two-pass procedure.
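The two passes can be sketched in a few lines of NumPy. This is a minimal illustration (function name and the 60-month window are illustrative choices, not from the text): rolling betas in the first pass, a cross-sectional regression each month in the second, and the Fama-MacBeth t-statistic from the time series of slopes.

```python
import numpy as np

def fama_macbeth(R, rm, window=60):
    """Fama-MacBeth 2-pass sketch. R: T x N excess returns; rm: length-T
    market excess return. Returns the average slope and its t-statistic."""
    T, N = R.shape
    slopes = []
    for t in range(window, T):
        # first pass: betas from a rolling time series regression
        X = np.column_stack([np.ones(window), rm[t - window:t]])
        betas = np.linalg.lstsq(X, R[t - window:t], rcond=None)[0][1]  # N betas
        # second pass: cross-sectional regression of month-t returns on the betas
        Z = np.column_stack([np.ones(N), betas])
        gamma = np.linalg.lstsq(Z, R[t], rcond=None)[0]
        slopes.append(gamma[1])                    # risk premium estimate for month t
    slopes = np.asarray(slopes)
    gbar = slopes.mean()                           # gamma_bar, as in (5.26)
    tstat = gbar / (slopes.std(ddof=1) / np.sqrt(len(slopes)))
    return gbar, tstat
```

With 15 years of monthly data and a 5-year window, the loop produces exactly the 120 monthly slope estimates described above. Note that this simple t-statistic carries the errors-in-variables problem discussed in the text; Shanken's correction is not implemented here.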
5.1.9 Stochastic discount factor
As Cochrane (2001) shows, almost all asset pricing models can be written in a stochastic discount
factor (SDF) form,
1 = E(mR), (5.27)
where m is the SDF and R is the gross return. It says that, if an asset provides a random payoff R
next period, its price today is $1 when the payoff is discounted by m. Since the payoff is uncertain,
we take the expected value of the discounted payoff.
To see how it works, consider the T-bill with interest r, or gross return R = 1 + r. If we pay one
dollar today to buy it, we have

1 = \frac{1+r}{m} = \frac{1+r}{1+r},
so our discount rate is m = 1 + r (equivalently, in the multiplicative form of Equation (5.27), the
SDF is the reciprocal, 1/(1 + r)). Now imagine that the bill is instead a corporate bond with
the same promised interest. We will not pay $1 today for it: the discount rate will be greater than 1 + r,
giving a lower price, say 50 cents today. Rescaling the units, we obtain Equation (5.27) again
for pricing the bond, where every dollar now buys 2 units of the bond. It works similarly for all
other assets.
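A one-line numeric check of the riskless case may help. Note that in the multiplicative form (5.27), where the payoff is multiplied by m inside the expectation, the riskless SDF is the reciprocal 1/(1 + r); the m = 1 + r in the text is the gross discount rate in the division form.

```python
# Numeric check of 1 = E[mR] (Eq. 5.27) for the riskless asset.
r = 0.05                 # T-bill interest rate (illustrative)
R = 1.0 + r              # gross riskless return
m = 1.0 / (1.0 + r)      # SDF implied by 1 = m * R when R is certain
assert abs(m * R - 1.0) < 1e-12
```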
If the CAPM is true, it can be shown that the SDF will be of a very simple form, a linear
function of the market,
m = λ0 + λ1rmt,
where λ0 and λ1 are parameters. Cochrane (2001) shows how to test the CAPM in the SDF
framework. Kan and Zhou (1999) compare this methodology with the traditional beta tests like
the GRS, and find that the SDF method can be less efficient. Later, Jagannathan and Wang (2002)
show that this inefficiency can be remedied by adding moment conditions on the factors, so that
the SDF and the traditional approach become asymptotically equivalent. However, adding
the factor moment conditions makes the implementation of the SDF tests difficult, as there will no
longer be analytical solutions for the parameter estimates.
5.1.10 GMM test and others
The GRS test is ideal if the data are normally distributed. However, real-world data are not
normally distributed. When the normality assumption is violated, the GRS test statistic
no longer has an F distribution, and hence the P-value and the test results may be in doubt.
The simplest approach is to use a bootstrap procedure. The idea is to resample B data sets of
the same sample length T from the original data with replacement, computing the GRS statistic
for each of the B bootstrapped data sets. The upper 5% percentile (i.e., the 95th percentile) of
the bootstrapped statistics will then be a good estimate of the
5% critical value under the iid assumption. Without the iid assumption, the bootstrap procedure
will be more complex; Chou and Zhou (2006) provide the details. The bootstrap is widely used in
finance to estimate standard errors and to test trading strategies. Section 7.2 and Section 4.3 have
more discussions.
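The iid bootstrap just described can be sketched generically for any statistic. The function name and the absolute-t-statistic example below are illustrative; note that the data are centered first so that the resampling is done under the null.

```python
import numpy as np

def bootstrap_critical_value(data, statistic, B=2000, q=0.95, seed=0):
    """IID bootstrap: resample the T observations with replacement B times,
    recompute the statistic each time, and return the upper-q percentile
    as an estimated critical value."""
    rng = np.random.default_rng(seed)
    T = len(data)
    draws = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, T, size=T)    # sample T observations with replacement
        draws[b] = statistic(data[idx])
    return np.quantile(draws, q)

# illustrative use: 5% critical value of an absolute t-statistic for a zero mean
rng = np.random.default_rng(1)
sample = 0.01 + 0.05 * rng.standard_normal(240)
centered = sample - sample.mean()           # impose the null before resampling
abs_t = lambda x: abs(x.mean() / (x.std(ddof=1) / np.sqrt(len(x))))
cv = bootstrap_critical_value(centered, abs_t, q=0.95)  # close to the normal 1.96
```

For the GRS statistic, `statistic` would refit the market model on each resampled panel; the same resampling indices must be applied jointly to all assets and the market to preserve the cross-sectional dependence.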
Hansen (1982) provides the generalized method of moments (GMM), a Nobel prize winning work,
which can be used to test almost any economic model under very general statistical assumptions.
In our context, as long as the residuals are stationary, the test is valid. However, without normality,
almost all tests, GMM included, are valid at most asymptotically, meaning true only as the sample size
goes to infinity. Hence, reliability in finite samples is an issue for such tests. Simulations may
be run to help assess the reliability, and the bootstrap may still be used to improve the accuracy.
There are two ways to implement the GMM test for the CAPM. The first is to construct a χ2
test from the distribution of αˆ under stationarity assumptions, and the second is to impose the null
on the model and use the GMM overidentification test directly. MacKinlay and Richardson (1991)
is the first to apply the GMM to test the CAPM. Harvey and Zhou (1993) provide further results.
A Bayesian approach to the test of CAPM has been taken by Shanken (1987) and Harvey
and Zhou (1990). The first obtains the posterior odds-ratio from a given prior on the correlation
between the market portfolio and the proxy, while the second, based on a full Bayesian specification
of the market model, conducts both a posterior analysis and odds-ratio testing with several priors
on the behavior of the parameters.
5.2 Spanning tests
Huberman and Kandel (1987) introduce the idea of a mean-variance spanning test. The question
is whether the mean-variance optimal portfolio of a set of given assets can be improved by adding a
set of new assets. In other words, the question is equivalent to whether the investment opportunity
set of all of the assets can be spanned by the set of given assets. Theoretically, the "mean-variance"
qualifier may be removed to consider general spanning, but this case is too complex. Hence, almost
all studies, including those below, focus only on mean-variance spanning.
We start with the simplest case of two risky assets with returns RA and RB, where RB is the
return on the given benchmark asset. Our question is whether adding RA improves our optimal
portfolio with the original set of {RB}. For example, we may ask whether China exposure (as summarized
by the China market index return RA) offers any diversification advantage over the US market index.
Assume that there is no borrowing or lending (no riskfree asset available). Consider the regres-
sion (sometimes called a projection) of RA on RB,

R_A = \alpha + \beta R_B + \epsilon,    (5.28)

where ε is the residual, uncorrelated with RB. If

R_A = 0 + 1 \times R_B + \epsilon,    (5.29)

it is easy to show that RA adds no value to the portfolio. This is intuitively obvious: based on
(5.29), RA has the same expected return as RB but greater variance risk, and so it is dominated
by RB and adds no investment value. On the other hand, consider the case
R_A = 0 + 1.5 \times R_B + \epsilon.    (5.30)

Buying 1.5 units of RB would replicate RA up to noise, but this is not possible since we assume no
borrowing. So adding RA allows aggressive investors to hold, in effect, 1.5 units of asset B instead
of one.
Therefore, the spanning hypothesis is
α = 0, β = 1 (5.31)
in the case of one test asset and one benchmark asset. There is mean-variance spanning if and only
if the above parametric restriction holds in the regression (5.28). In practice, one has data on
both assets, and can then run the regression to do the test. Since it is a joint test of the alpha
and the beta, an F test is needed instead of the often used t-ratio test, which applies only to a single
parameter.
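This joint F test can be sketched directly: compare the residual sum of squares of the unrestricted regression (5.28) with that of the restricted model imposing α = 0 and β = 1. The function name and simulated data are illustrative, and the sketch assumes homoskedastic iid residuals.

```python
import numpy as np
from scipy import stats

def spanning_F_test(RA, RB):
    """F-test of H0: alpha = 0 and beta = 1 in RA = alpha + beta*RB + eps,
    i.e., mean-variance spanning with one test and one benchmark asset."""
    T = len(RA)
    X = np.column_stack([np.ones(T), RB])
    coef = np.linalg.lstsq(X, RA, rcond=None)[0]
    resid_u = RA - X @ coef
    rss_u = resid_u @ resid_u             # unrestricted residual sum of squares
    rss_r = (RA - RB) @ (RA - RB)         # restricted: alpha = 0, beta = 1
    q = 2                                 # number of restrictions, as in (5.31)
    F = ((rss_r - rss_u) / q) / (rss_u / (T - 2))
    return F, stats.f.sf(F, q, T - 2)     # statistic and P-value
```

A large F (small P-value) rejects spanning, indicating that the test asset improves the mean-variance frontier of the benchmark.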
A more complex case is with 2 benchmark assets (or 2 factors). Consider now the regression

R_1 = \alpha_1 + \beta_{11} f_1 + \beta_{12} f_2 + \epsilon_1,    (5.32)

where ε1 is the residual, uncorrelated with the factors. Then a similar argument as earlier shows that
the spanning hypothesis is
α1 = 0, β11 + β12 = 1. (5.33)
The second restriction is not surprising because we want to hold the factors to replicate the asset
without borrowing.
Suppose now we have two test assets instead of one; then we add one more regression,

R_2 = \alpha_2 + \beta_{21} f_1 + \beta_{22} f_2 + \epsilon_2.    (5.34)
It is easy to see that the spanning hypothesis is then Equation (5.33) plus

\alpha_2 = 0, \qquad \beta_{21} + \beta_{22} = 1,    (5.35)

the latter of which pertains to the second asset. In general, we can have N test assets and K
benchmark assets, and the spanning hypothesis can be summarized as

\alpha = 0_N, \qquad \beta\,1_K = 1_N,

where α is the N-vector of intercepts and β is the N × K matrix of betas of the N test assets. How
do we test the above hypothesis?
DeRoon and Nijman (2001), and Kan and Zhou (2012) summarize the statistical procedures.
It should be noted that the spanning hypothesis simplifies greatly when there is a riskfree
asset with return Rf. With Rf, we consider only excess returns. For example, in the one test
asset and one benchmark asset case, we run

r_A = \alpha + \beta r_B + \epsilon,    (5.36)

where rA = RA − Rf and rB = RB − Rf are the excess returns and ε is the residual, uncorrelated
with rB. Then the spanning hypothesis is simply

\alpha = 0.    (5.37)
The beta no longer matters because we can borrow or lend to replicate RA with RB. Testing the
CAPM can be regarded as a special case here: an asset cannot improve on the market portfolio if and
only if its alpha is zero. If the CAPM is true, the market portfolio is efficient and no assets can
help do better, and so all the asset alphas must be zero.
5.3 Fama-French 3- and 5-factor models
Due to the failure of the CAPM, Fama and French (1993, 1996) advocate the following three-factor
model,
R_{it} - r_{ft} = \alpha_i + \beta_{i1}(f_{M,t} - r_{ft}) + \beta_{i2} f_{SMB,t} + \beta_{i3} f_{HML,t} + \epsilon_{it},    (5.38)
c© Zhou, 2021 Page 154
5.4 Additional factor models
where fM is the return on the market factor, fSMB is the SMB spread return, fHML is the HML
spread return, and rft is the 30-day T-bill rate. In their tests of the above model, Fama and French
(1993, 1996) take the Rit's to be the returns on the 25 stock portfolios formed on size and book-to-
market.
We can test a multi-factor model similarly to the CAPM case. Consider the above Fama and
French (1993) 3-factor model. The estimation can be done equation by equation by OLS to
obtain the 3 betas for each asset. If the 3-factor model is true, then we again have
H0 : αi = 0, i = 1, . . . , N. (5.39)
But this is true only for tradable factors (see next subsection for more explanations). The null
hypothesis can be tested by using the K-factor version of the Gibbons, Ross and Shanken’s (1989)
test,
GRS \equiv \frac{T - N - K}{N} \cdot \frac{\hat{\alpha}'\hat{\Sigma}^{-1}\hat{\alpha}}{1 + \bar{f}'\hat{\Omega}^{-1}\bar{f}} \sim F_{N,\,T-N-K},    (5.40)
where f¯ and Ω̂ are the sample mean and sample covariance matrix of the K = 3 factors. As before,
we reject the null when the GRS statistic is large.
One can also test the three-factor model using the Fama-MacBeth 2-pass procedure. In the
first pass, one runs regression (5.38) to obtain the three betas. Then, in the second pass, the
following cross-sectional regression is run on the betas across assets,

r_{it} = \gamma_0 + \gamma_1\hat{\beta}_{i1} + \gamma_2\hat{\beta}_{i2} + \gamma_3\hat{\beta}_{i3} + \epsilon_{it}, \quad i = 1, \ldots, N,    (5.41)
to obtain the risk premia estimates (the slopes) on the three factors. Similar t-stats can be defined
and the significance of the risk premia examined as before.
Improving on their 3-factor model, Fama and French (2015) propose a 5-factor model by
adding a profitability factor and an investment factor. The GMM and other tests
discussed previously can also be used to test them.
5.4 Additional factor models
Hou, Xue, and Zhang (2015) provide a 4-factor model, which is similar and competitive to the
Fama and French (2015) 5-factor model. Stambaugh and Yuan (2017) provide a 4-factor model:
the Mkt, Size and two mispricing factors (MGMT and PERF), and Daniel, Hirshleifer, and Sun
(2020) provide a 3-factor model: Mkt and two behavioral factors (PEAD and FIN). Han, Zhou
and Zhu (2016) propose a trend factor that has an average return of about 1.61% per month, more
than twice that of the momentum factor, with more than double the Sharpe ratio. Liu, Zhou, and Zhu
(2020a) add volume information and construct a new trend factor particularly suitable for China,
where about 80% of trading volume is generated by individual investors.
In addition, the hundreds of anomalies surveyed by Harvey, Liu and Zhu (2016) and Hou,
Xue and Zhang (2019) are all candidates for additional factors. The search for factors is
endless.
Starting from the twelve distinct risk factors in Fama and French (1993, 2015), Hou, Xue, and
Zhang (2015), Stambaugh and Yuan (2017), and Daniel, Hirshleifer, and Sun (2020), Chib, Zhao
and Zhou (2020) construct and compare 4,095 possible combinations, and find that the model
with the risk factors, Mkt, Size, MOM, ROE, MGMT, and PEAD, performs the best in terms of
Bayesian posterior probability, out-of-sample predictability, and Sharpe ratio. A more extensive
model comparison of 8,388,607 factor models, constructed from the twelve winners plus eleven
principal components of anomalies unexplained by the winners, shows the benefit of incorporating
information in genuine anomalies in explaining the cross-section of expected equity returns.
5.5 Non-traded factors
To understand the indeterminacy of factor risk premia in the regression model, consider again the
market model. Let λm be the market risk premium, then the CAPM says,
E[rit] = βiλm, (5.42)
that is, if an asset has double the market risk, its excess return (its return beyond the riskfree rate)
is expected to earn double the risk premium. The above equation is true for any traded asset; in
particular, since the market is tradable and has a beta of one, we have
E[rmt] = 1× λm = λm, (5.43)
which says that the market risk premium λm = E[rmt] = E[Rmt − rft], that is, the market return
in excess of the riskfree rate.
Now let f be a systematic factor, such as consumption growth, that affects the asset returns.
For simplicity, assume it is the only factor. Then we have
E[rit] = βiλf , (5.44)
where λf is the risk premium, or the reward for taking the factor exposure. However, since f is not
traded, we no longer have an equation like (5.43) to tie down λf. More complex procedures with
additional assumptions are needed to determine the value of λf.
Indeed, Giglio and Xiu (2021) suggest a three-step approach. First, extract a suitable
number of factors with the PCA method to be explained below (see Section 6.2). Second,
estimate the factor risk premia by running a cross-sectional regression of the average
returns on the factor loadings. Finally, run a time series regression of f on the
factors to get its loadings, which yield the risk premium on f by multiplying the loadings by the
factor risk premia.
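The first (PCA) step can be sketched with an SVD of the demeaned return panel. This is a generic illustration of statistical factor extraction, not Giglio and Xiu's (2021) exact implementation; the function name is illustrative.

```python
import numpy as np

def pca_factors(R, K):
    """Extract K statistical factors from a T x N return panel via SVD
    of the demeaned returns; returns factor realizations and loadings."""
    Rd = R - R.mean(axis=0)                          # demean each asset's returns
    U, S, Vt = np.linalg.svd(Rd, full_matrices=False)
    factors = U[:, :K] * S[:K]                       # T x K factor realizations
    loadings = Vt[:K].T                              # N x K loadings; Rd ≈ factors @ loadings.T
    return factors, loadings
```

With K = N the decomposition is exact; with small K it keeps the components explaining the most return variance, which then serve as the benchmark factors in the second and third steps.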
5.6 How to construct factors?
How do we obtain systematic factors beyond the market factor? These factors are often obtained
as spread or long-short portfolios, which are tradable, formed from firm characteristics. Some studies also
use macroeconomic variables, such as industrial production, as factors (see Chen, Roll and Ross,
1986, for a classic, and Rapach and Zhou, 2019, for recent work). In practice, the tradable
factors usually work better than macroeconomic factors. One can also extract factors statistically
from asset returns, which will be discussed in the next chapter. Here we focus on forming factors from
firm characteristics.
5.6.1 Sorting
Sorting stocks into decile portfolios by a firm characteristic is one of the most common ways of
constructing new factors. For example, to obtain the size factor, we can sort all stocks (except, say,
those priced below $1) into 10 portfolios each month by their capitalization (size), then buy
the stocks in the lowest decile (small) and short those in the highest decile (large). The resulting
zero-cost spread, or long-short, portfolio will capture well the performance due to size. The return on
this portfolio each month will be the return on the size factor,
fsize = R1 −R10, (5.45)
where R1 and R10 are returns on the lowest and largest decile portfolios, respectively.
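One month of this sort can be sketched in pandas. The function name and toy inputs are illustrative; the sketch uses equal-weighted decile means.

```python
import numpy as np
import pandas as pd

def decile_spread(returns, characteristic):
    """One month of a sort-based factor: rank stocks into deciles by the
    characteristic and return the equal-weighted decile-1-minus-decile-10
    spread, as in (5.45) for size."""
    df = pd.DataFrame({"ret": returns, "char": characteristic})
    df["decile"] = pd.qcut(df["char"], 10, labels=False) + 1   # 1 = lowest values
    means = df.groupby("decile")["ret"].mean()
    return means.loc[1] - means.loc[10]
```

Repeating this each month and averaging the spreads gives the factor's return series; value weighting would replace the equal-weighted group means by capitalization-weighted means.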
In applications, R1 and R10 can be either equal- or value-weighted returns. With equal weighting,
the spread portfolio usually performs better, but it will be more influenced by small- and mid-cap
firms. From a feasibility point of view, value-weighted returns are preferred, as more money can be
invested in the spread portfolio without tilting too heavily toward small- and mid-cap stocks or affecting
the prices too much.
While decile portfolios are popular, quintiles and sorting into three or even two groups are also
often seen. Generally speaking, the average return (over time) of the spread portfolio is greater
with decile portfolios than in the other cases, because deciles create more dispersion in the stocks'
factor exposures.
While univariate sorting is widely used, bivariate sorting is sometimes also employed. For
example, Fama and French's (1993) well known size and book-to-market factors, posted today on
French's website, are based on a bivariate sort on size and book-to-market.
Specifically, according to the website, they construct 6 portfolios (all value-weighted) at the
end of each June, which are the intersections of 2 portfolios formed on size (market equity, ME)
and 3 portfolios formed on the ratio of book equity to market equity (BE/ME). The size breakpoint
for year t is the median NYSE market equity at the end of June of year t; the BE/ME breakpoints
are the 30th and 70th NYSE percentiles. BE/ME for June of year t is the book equity for the last
fiscal year end in t− 1 divided by ME for December of t− 1. Then the size factor is defined as SMB
(Small Minus Big), the average return on the three small portfolios minus the average return on
the three big portfolios,
SMB = \frac{1}{3}(\text{Small Value} + \text{Small Neutral} + \text{Small Growth}) - \frac{1}{3}(\text{Big Value} + \text{Big Neutral} + \text{Big Growth}),    (5.46)

and the book-to-market factor is defined as HML (High Minus Low), the average return on the two value
portfolios minus the average return on the two growth portfolios,

HML = \frac{1}{2}(\text{Small Value} + \text{Big Value}) - \frac{1}{2}(\text{Small Growth} + \text{Big Growth}).    (5.47)
Occasionally, sorting based on 3 characteristics is done. However, sorting becomes increasingly complex
as the number of characteristics increases. The solution is to use a method that accomplishes a similar
task and yet is easy to implement. There are two popular methods: the first is a naive scoring
approach, and the second is the cross-section regression (CSR) approach. Both are discussed below.
5.6.2 Scoring
Suppose that we have 8 firm characteristics. We give each stock a score of 1 to 10 for each
characteristic, where 10 indicates the best and 1 the worst. Then we have 8 scores for each stock.
Adding the scores together, we get one aggregate score for each stock. We can then buy the stocks
with the top 10% of scores and sell those in the bottom 10%, forming the zero-cost spread
portfolio. If only investing is of concern, we just buy the top 10%.
Instead of scoring from 1 to 10, a z-score, the number of standard deviations from the
mean, is also often used. One can compute

z_{i,k} = \frac{c_{i,k} - \bar{c}_k}{\mathrm{std}(c_k)},
where ci,k is firm i's k-th characteristic, c¯k is its mean across firms, and std(ck) is its standard
deviation across firms. The z-score is also known as a standard score, and can be placed on a normal
distribution curve. The aggregate z-score is defined by

\bar{z}_i = \frac{1}{K}(z_{i,1} + z_{i,2} + \cdots + z_{i,K}),
where K is the number of characteristics.
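The z-score aggregation above takes two lines of NumPy. This is a minimal sketch with equal weights across the K characteristics; the function name is illustrative.

```python
import numpy as np

def aggregate_zscores(C):
    """C: N x K array of characteristics (rows are firms). Standardize each
    characteristic across firms, then average the K z-scores per firm."""
    z = (C - C.mean(axis=0)) / C.std(axis=0)   # z-score for each characteristic
    return z.mean(axis=1)                      # equal-weighted aggregate score per firm
```

The aggregate scores are then sorted, buying the top decile and selling the bottom as with the simple 1-to-10 scores.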
Although scoring can in principle be used to construct systematic factors common to all stocks,
like the market factor, it is perhaps better suited for selecting stocks that satisfy a number of desired
criteria or characteristics. Scoring is easy to implement, but it has its weaknesses. The
most important is that it weights all characteristics equally, which is clearly not true in practice:
certain characteristics are more important than others. The CSR below does not suffer from this problem.
5.6.3 Cross-section regression
The cross-section regression (CSR) approach typically runs a regression of stock returns on one
or more firm characteristics across stocks. This can be used not only to construct systematic
factors (for understanding risk exposures), but also to forecast stock returns (for selecting stocks or
sectors). In practice, such regressions are known as fundamental factor models and characteristics-
based models.
Consider, for example, the question of how firm size affects future stock returns. We run a CSR of
firm returns on size,

R_{i,t} = a + b\,\mathrm{size}_{i,t-1} + \epsilon_i, \quad i = 1, 2, \ldots, N,    (5.48)
where N is the number of firms. In the above regression, the time is fixed, and the regression is
run across firms. In terms of data, the regression may be written, say, as
\begin{pmatrix} R_{IBM,t} \\ R_{Apple,t} \\ \vdots \\ R_{Google,t} \end{pmatrix}
= a + b
\begin{pmatrix} \mathrm{size}_{IBM,t-1} \\ \mathrm{size}_{Apple,t-1} \\ \vdots \\ \mathrm{size}_{Google,t-1} \end{pmatrix}
+ \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{pmatrix}.    (5.49)
Again, the time here is fixed, and we ask how the returns across firms are predicted by their sizes.
The performance of the CSR can be assessed as usual by the magnitude of the slope, its
significance and the R2 of the regression. Most important of all, we should assess the economic
performance. Based on Equation (5.48), we can compute the estimated coefficients at each time,
and then forecast the return for the next period,
\hat{R}_{i,t+1} = \hat{a}_t + \hat{b}_t\,\mathrm{size}_{i,t}, \quad i = 1, 2, \ldots, N,    (5.50)
where aˆt and bˆt are the estimated predictive coefficients at time t. We can then buy the stocks with
the top 10% highest predicted returns and sell the bottom 10% with the lowest. The performance
of this spread portfolio over time is the economic value the CSR brings to the table.
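One step of this procedure can be sketched in NumPy: estimate (aˆt, bˆt) from the CSR (5.48), then form the forecasts (5.50). The function name and inputs are illustrative.

```python
import numpy as np

def csr_forecast(R_t, chars_lag, chars_now):
    """One step of the CSR approach: estimate (a, b) by regressing month-t
    returns on month t-1 characteristics across firms, then forecast next
    month's returns from the current characteristics."""
    N = len(R_t)
    X = np.column_stack([np.ones(N), chars_lag])
    a, b = np.linalg.lstsq(X, R_t, rcond=None)[0]   # cross-sectional OLS, as in (5.48)
    return a + b * chars_now                        # forecasts, as in (5.50)
```

Sorting on these forecasts, going long the top decile and short the bottom, gives the spread portfolio whose performance over time measures the CSR's economic value; with K characteristics, `chars_lag` and `chars_now` simply become N x K matrices.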
Based on the CSR, we can also construct a systematic size factor as
fsize = f1 − f10, (5.51)
where f1 is the return on a long position in the stocks with the highest predicted returns, and f10 is
that on the stocks with the lowest. This size factor is clearly very closely related to the earlier size
factor constructed by sorting stocks on size. Indeed, they are mathematically equivalent (assuming bˆt > 0).
However, the CSR is more flexible, as it can control for additional variables for better investment performance.
For example, if we think that stocks are affected by the market, size and idiosyncratic volatility
(IVol), then we run the CSR

R_{i,t} = a + b_1\,\beta_{i,t-1} + b_2\,\mathrm{size}_{i,t-1} + b_3\,\mathrm{IVol}_{i,t-1} + \epsilon_i, \quad i = 1, 2, \ldots, N.    (5.52)
This uses all the information in the three firm characteristics to predict the future returns optimally;
one can then buy the stocks with the highest expected returns and short those with the lowest.
Sorting or scoring cannot achieve what the CSR does.
If there are K > 1 characteristics, we simply run a multiple CSR,

R_{i,t} = a + b_1 C_{i,1,t-1} + b_2 C_{i,2,t-1} + \cdots + b_K C_{i,K,t-1} + \epsilon_i, \quad i = 1, 2, \ldots, N,    (5.53)
where Ci,k,t−1 is firm i’s k-th characteristic at time t − 1. The CSR finds the best (linear) pre-
dictability from the K characteristics collectively and weights their importance according to their
individual predictive power. Chapter 11 provides more discussions and the detailed procedures for
implementing CSR.
To assess the importance of one particular characteristic, one can examine its risk premium,
or compare the performance of the above regression with and without it.
However, whether some characteristics can be removed is a difficult econometric problem.
5.6.4 Machine learning methods
As extensions of the CSR, various machine learning methods (see Chapter 10) can be applied to
forecast the cross section of stock returns. One can then sort stocks based on the expected returns,
and the resulting long-short portfolio will be a factor that represents all the characteristics used
to forecast the returns.
The application of machine learning methods to finding factors is an active research direction;
see, for example, Coqueret and Guida (2020), Jurczenko (2020), Han et al. (2021) and Neuhierl et
al. (2021). It is also related to factor investing, discussed in the next section.
5.6.5 Time series vs cross section
There is often confusion between time series and cross-sectional regressions, and between time series
and cross-sectional factors. Let us make the distinctions clear here.
In a CSR, we ask how well one predictor predicts returns across firms, or how well
one variable explains the performance of students (e.g., the hours each worked for Prof. Zhou's class).
In contrast, a time series regression asks how well one predictor predicts returns over time, or
how well the market factor explains a return over time (the market, size and book-to-market factors are
all time series factors).
In terms of equation, a time series regression regresses an asset return over time,
R_{IBM,t} = a + b\,x_{t-1} + \epsilon_t, \quad t = 1, 2, \ldots, T,    (5.54)
where T is sample size, say 120 for 10 years of monthly data. In terms of data, the regression is,
\begin{pmatrix} R_{IBM,1} \\ R_{IBM,2} \\ \vdots \\ R_{IBM,T} \end{pmatrix}
= a + b
\begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{T-1} \end{pmatrix}
+ \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_T \end{pmatrix}.    (5.55)
In contrast to the CSR, here there is only one stock, and we examine how a variable xt−1 predicts
RIBM,t over time. If we use xt instead of xt−1 in the regression, then we examine how well xt,
such as the market factor, explains RIBM,t, as both variables occur at the same time.
Now let us examine the difference between an aggregate-level factor, such as the market factor,
and firm-characteristic factors such as earnings per share. The former is a systematic risk factor, and
each firm's exposure is measured by the beta from a time series regression on the factor. The
latter is a firm-level factor, and each firm's exposure is measured directly by its observed value, such
as earnings per share. The two do not necessarily coexist. For example, the industrial production
factor is a well known systematic factor, and it is difficult to come up with a measure at the firm
level other than the regression beta. On the other hand, the quality of corporate governance may
be well measured at the firm level by a ranking of 1 through 10, but a systematic governance factor
seems unknown (one may construct a spread portfolio from the ranking, though it may not earn a
significant risk premium). Nevertheless, for some factors, such as size and book-to-market, we do
have both systematic risk factors at the aggregate level and individual measures at the firm level.
© Zhou, 2021 Page 162
5.7 Uses of factor models
It will be useful to discuss some common uses of factor models.
5.7.1 Capital budgeting/Expected return estimation
First, factor models are useful for capital budgeting. Assuming the factor model is true, one obtains the expected
return on a firm given the systematic factors. One can interpret it as the return investors
expect to get for taking the systematic risks, regardless of whether the alphas are truly zero.
Combining it with other information, one can get the WACC for valuing projects.
Intuitively, if the total risk premium from the systematic risk exposures is 10% (the sum of the
betas times the factor risk premia), then the company should require at least a 10% return on a project
with the same risk. Otherwise, the shareholders can maximize their value by investing in the stock
market rather than in the project.
In portfolio choice, it is critically important to provide accurate estimates of the expected
returns and covariances. Historical averages are important, but they are backward-looking. In
practice, macroeconomic outlooks can often lead to forward-looking estimates of the performance of
the market and various factors. With the factor model, these can generate forward-looking estimates
of the expected returns. The model can also help in estimating the stock covariances if the factor
covariances are known.
For example, consider the use of factor models for forecasting stock returns. Suppose you run
a time series regression of a stock or your portfolio on the factors, say equation (5.60). If you have
forecasted returns for the market to be 3%, and the size factor to be 2% next month, you can
compute the forecasted return on your stock or portfolio by replacing the factors by their predicted
values in (5.60). When forecasting the return over a longer horizon, say a year, you add up the returns from
next month onward to get the annual return, $r_{p,1\to 12} = r_{p,t+1} + \cdots + r_{p,t+12}$.
Clearly the same slopes/coefficients apply,
$$r_{p,1\to 12} = 3\% \times 12 + 1.1 \times r_{m,1\to 12} + 0.7 \times f_{size,1\to 12}, \qquad (5.56)$$
so the only differences are to scale the intercept by 12 and to replace the factor returns by their
next year’s predicted returns.
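The arithmetic above can be sketched as follows; all factor forecasts here are hypothetical, illustrative numbers:

```python
# Plugging hypothetical factor forecasts into an estimated monthly model
# r_p = 3% + 1.1 * r_m + 0.7 * f_size (illustrative coefficients).
alpha, beta_m, beta_size = 0.03, 1.1, 0.7

# One-month-ahead forecast with market forecast 3% and size forecast 2%:
r_next = alpha + beta_m * 0.03 + beta_size * 0.02

# Annual forecast per (5.56): scale the intercept by 12 and plug in the
# factors' predicted annual returns (assumed to be 8% and 5% here):
r_annual = alpha * 12 + beta_m * 0.08 + beta_size * 0.05
```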
5.7 Uses of factor models
5.7.2 Smart beta and factor investing
Smart beta in practice generally means an investment strategy that deviates from holding the
value-weighted market index. The latter is a passive strategy whose beta (relative to the index)
is 1. Smart beta strategies typically hold a portfolio of passive investments combined with some
exposures of active investments, particularly with factor investing. Factor investing is an investment
strategy used by many fund managers to beat a benchmark index. The idea is to tilt your portfolio
towards some factors, where the factors are the so-called fundamental factors that are firm specific.
The case for smart beta, as discussed by Ghayur, Heaney and Platt (2019), rests on three
potential drawbacks of the value-weighted index:
1. Concentration: Large firms dominate the index, and some sectors may have excessive repre-
sentation in the index (e.g., during the internet bubble, tech weight in the S&P 500 Index
increased from 13% in 1998 to more than 30% at the start of 2000).
2. Volatility: high concentration tends to generate high volatility.
3. Propensity: Value-weighting tends to overweight overvalued stocks and underweight under-
valued stocks, and so the index may lose more when the mispricing inevitably corrects.
Another drawback is that cap-weighted investing cannot address any firm specific investment ob-
jective such as ESG (environmental, social and corporate governance).
Cap-weighted index investing is still the primary approach in practice, since it is easy to
implement and is consistent with the CAPM (it is the best strategy in an ideal efficient-market world in which
all investors are smart, have the same information, and have quadratic utility preferences). It is
a buy-and-hold passive strategy with minimal fees. This is why index funds keep
growing over time.
The most popular factors are size, value, momentum, quality, and low volatility, though there
are hundreds of potential factors. Each firm characteristic can be a potential factor. Han et al
(2021) examine up to 299 firm specific factors, and Neuhierl et al (2021) consider in addition firm
option characteristics which seem entirely new in the factor investing literature.
To understand more on factor investing, consider a couple of examples. Suppose that the stock
returns are driven by the market and size factor.
$$r_{it} = \gamma_0 + \gamma_1 \beta_{mi} + \gamma_2\, \mathrm{Size}_i + \text{other factors} + v_{it}, \qquad (5.57)$$
where $v_{it}$ is the residual. There are many ways to tilt your portfolio toward size. The simplest one is to
use a standard equal-weighted spread portfolio. Based on the size of each firm, $\mathrm{Size}_i$, we can sort
stocks into decile portfolios. The spread portfolio is simply long the smallest decile and short the
largest decile. Then, effectively, our portfolio is
w = ρwm + (1− ρ)wLS ,
where ρ is the proportion in the market, say 80%, and wLS is the weight of the spread portfolio with
values of 1/m’s or −1/m’s on the long and short deciles and zeros for other stocks, and m is the
number of stocks in each decile. Although $w_{LS}$ has negative components, $w$ is typically nonnegative
in practice, as ρ is not far from 100%, so that there are no short sales in the end.
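The tilt can be sketched as follows, with a hypothetical simulated universe of market caps (the names `w_m`, `w_ls`, and `rho` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical universe: n stocks with simulated market caps, used both for
# the value-weighted market weights and as the size characteristic.
n = 100
cap = rng.lognormal(mean=10.0, sigma=1.0, size=n)
w_m = cap / cap.sum()                      # value-weighted market weights

# Sort by size into deciles; long the smallest decile, short the largest.
order = np.argsort(cap)
m = n // 10                                # number of stocks per decile
w_ls = np.zeros(n)
w_ls[order[:m]] = 1.0 / m                  # long the smallest decile
w_ls[order[-m:]] = -1.0 / m                # short the largest decile

rho = 0.8                                  # 80% in the market
w = rho * w_m + (1 - rho) * w_ls           # tilted portfolio
```

Note that the spread portfolio is zero-cost, so the tilted weights sum to ρ, with the remaining 1 − ρ of capital funding the long-short positions.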
An alternative approach is to simply hold the factor portfolio or the long-short portfolio, which
can be implemented in practice by buying the equity Smart Beta (SB) ETFs or strategic-Beta
Exchange-Traded Products that match the factor of interest (this is often feasible). Then you hold
the market index and this ETF, whose return is
R = ρRm + (1− ρ)RETF ,
where RETF is the return on the ETF.
Another more quantitative approach is to use the factor model,
$$r_{it} = \alpha_i + \beta_{mi} R_{mt} + \beta_{si} R_{size,t} + \epsilon_{it}, \qquad (5.58)$$
to obtain the betas of all the stocks. Suppose you have 10 stocks you want to buy, and you want to
load up on the size factor, say with a loading of 2 for your portfolio, as you expect the factor
to have a good reward next period. By equation (5.60) and the like, you can get the size beta for all
your stocks, say $\beta_1 = 0.7, \beta_2, \ldots, \beta_{10}$. Then you want
$$0.7\, w_1 + \beta_2 w_2 + \cdots + \beta_{10} w_{10} = 2.$$
In the above equation, the betas are known, and you need to solve for the portfolio weights. You may
impose in addition that the weights sum to 1, and other conditions. Then, applying a quadratic
program, you can solve for the weights to meet your needs.
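A minimal sketch of such a quadratic program follows; the betas other than $\beta_1 = 0.7$ are hypothetical, and the objective (minimizing $w'w$ subject to the two linear constraints) is one simple choice with a closed-form solution via the KKT conditions:

```python
import numpy as np

# Hypothetical size betas for 10 stocks (beta_1 = 0.7 as in the text; the
# rest are made up). We want the portfolio's size loading to be 2 and the
# weights to sum to 1, minimizing w'w subject to those constraints.
beta = np.array([0.7, 0.9, 1.2, 0.5, 1.5, 0.8, 1.1, 0.6, 1.3, 1.0])
n = beta.size

A = np.vstack([np.ones(n), beta])          # constraints: sum(w) = 1, beta'w = 2
b = np.array([1.0, 2.0])

# KKT system for min w'w s.t. A w = b:
K = np.block([[2.0 * np.eye(n), A.T],
              [A, np.zeros((2, 2))]])
sol = np.linalg.solve(K, np.concatenate([np.zeros(n), b]))
w = sol[:n]                                # the minimum-norm feasible weights
```

In practice one would add further conditions (e.g., no short sales or position limits), which generally require a numerical quadratic-programming solver rather than this closed form.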
Ang (2014) and Ghayur, Heaney and Platt (2019) provide more extensive discussions on the motivation
and practice of factor investing. Jurczenko (2020) and Coqueret and Guida (2021) provide
state-of-the-art applications of machine learning tools to factor investing.
5.7.3 Hedging
This is related to factor investing. Instead of taking factor risks, you eliminate them. Suppose
equation (5.56) describes your portfolio. If you are concerned about the factor risks next month,
then you can short 1.1 units of the index and 0.7 units of the size factor (using futures or ETFs in
practice) per dollar of your portfolio. Then you can remove the factor
risk exposures without having to liquidate your entire portfolio.
5.7.4 Measuring performance
Another use of factor models is to use them for evaluating a fund manager’s performance. Consider
first the one factor case. Suppose we have
$$r_{pt} = \alpha + \beta r_{mt} + \epsilon_t, \qquad t = 1, \ldots, T, \qquad (5.59)$$
where rpt is the excess return of the actively managed portfolio. With data, suppose we have an
estimated model,
$$r_{pt} = 5\% + 1.1 \times r_{mt} + \epsilon_t, \qquad (5.60)$$
which says that the manager earns 5% extra, the alpha, after adjusting for market risk. Hence, in terms
of the market factor, the manager seems to have skill. If the 5% were zero, then he would not
have had any skill, as you could buy 1.1 units of the market index to replicate the performance.
Now suppose we consider further a size factor, and the estimated model is
$$r_{pt} = 3\% + 1.1 \times r_{mt} + 0.7 \times f_{size,t}. \qquad (5.61)$$
Then, accounting for the additional factor, the manager earns only 3% alpha. In practice, quite a
few commonly traded factors may be used to assess the alpha. If the alpha becomes zero, the investor
can buy the factors in suitable proportions to replicate the fund performance.
The unexplained positive alpha may be a measure of skill. However, in practice, the CAPM is the
most widely used model for fund performance evaluation.
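Estimating these alphas is just OLS with an intercept. A sketch on simulated, hypothetical fund data (with a true alpha of 3% built in):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical simulated fund: monthly excess returns load on the market
# and size factors with a true alpha of 3%.
T = 600
r_m = rng.normal(0.006, 0.04, T)           # market factor
f_size = rng.normal(0.002, 0.03, T)        # size factor
r_p = 0.03 + 1.1 * r_m + 0.7 * f_size + rng.normal(0.0, 0.02, T)

# One-factor (CAPM) alpha, as in (5.59): the intercept of r_p on r_m.
X1 = np.column_stack([np.ones(T), r_m])
alpha_capm = np.linalg.lstsq(X1, r_p, rcond=None)[0][0]

# Two-factor alpha: part of the return is now attributed to size exposure.
X2 = np.column_stack([np.ones(T), r_m, f_size])
alpha_2f = np.linalg.lstsq(X2, r_p, rcond=None)[0][0]
```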
6 Factor Models 2: Unknown Factors
Both the CAPM and Fama-French 3-factor models assume that we know the driving forces of the
stock returns: a) the number of factors; b) the specific form of the factors. This is clearly not true
in the real world. In this section, we provide statistical methods for estimating both the number of
factors and the factors themselves.
6.1 Latent factor model
To start, we may agree that there is one factor that determines all the stock returns, but the factor
may not necessarily be the market portfolio or stock index. That is, we consider the following
one-factor model,
$$r_{it} = \alpha_i + \beta_i f_t + \epsilon_{it}, \qquad t = 1, \ldots, T, \qquad (6.1)$$
where the factor ft is latent or unobservable. This is very similar to the market model regression
except now the factor is unknown and has to be estimated from data.
Before estimating, it is important to understand the identification problem in a latent factor
model. In the case where the factor is latent, it can only be identified up to a scale. This is because if $f_t$ is the factor, a new factor $f^*_t = c f_t$ works the same as (6.1), with $\beta^*_i = \beta_i / c$:
$$r_{it} = \alpha_i + \frac{\beta_i}{c} (c f_t) + \epsilon_{it},$$
where $c \neq 0$ is a constant. So, in a latent factor model, once we find one factor, we can use any
scale of it. Another related issue is that we can ‘standardize’ or set the factor mean as zero,
E[ft] = 0. (6.2)
This will not affect the model either. Indeed, if $E[f_t] \neq 0$, use the new factor $f^*_t = f_t - E[f_t]$. We
still have mathematically the same factor model,
$$r_{it} = \alpha^*_i + \beta_i f^*_t + \epsilon_{it},$$
if we define the new alpha as
$$\alpha^*_i = \alpha_i + \beta_i E[f_t].$$
The reason for setting the factor mean as zero is to simplify the task of finding the factor.
In general, let f1t, . . . , fKt be K latent factors (or systematic risks) of the stock market. Then,
a K-factor model for the returns on N assets is:
$$r_{it} = \alpha_i + b_{i1} f_{1t} + \cdots + b_{iK} f_{Kt} + \epsilon_{it}, \qquad t = 1, \ldots, T, \qquad (6.3)$$
where $b_{i1}, \ldots, b_{iK}$ are the factor loadings on the risks, and $\epsilon_{it}$ is the specific factor or idiosyncratic
risk. The factor model is often written in vector form,
$$r_{it} = \alpha_i + \beta' f_t + \epsilon_{it}, \qquad (6.4)$$
with vector notations for the betas and factors,
$$\beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_K \end{bmatrix}, \qquad f_t = \begin{bmatrix} f_{1t} \\ \vdots \\ f_{Kt} \end{bmatrix}.$$
Following the convention, we set all the factor means as zero, then
αi = Erit,
which can be estimated by the sample mean of the returns. Hence, in a latent factor model, the
major task is to estimate K, the betas and the factors.
Note that the scale-invariance property becomes invariance to any nonsingular linear transformation when $K > 1$. In this case, if $f_t$ is a factor, then $f^*_t = C f_t$ is a new one, where $C$ is any nonsingular $K \times K$ matrix. Let $\beta^* = (C^{-1})'\beta$; then
$$\beta^{*\prime} f^*_t = \beta' C^{-1} C f_t = \beta' I_K f_t = \beta' f_t,$$
so the factor model is unchanged.
Ross (1977) shows that, in the absence of riskless arbitrage opportunities, there exists an ap-
proximate linear relationship between the expected asset returns and their risk exposures to the
latent factors,
Erit ≈ λ0 + bi1λ1 + · · ·+ biKλK , (6.5)
as the number of assets satisfying (6.3) increases to infinity, where λ0 is the intercept of the pricing
relationship and λk is the risk premium on the k-th factor, k = 1, . . . ,K. The relationship (6.5) is
known as the implication of the APT (arbitrage pricing theory). When K = 1, the APT says that
Erit ≈ λ0 + bi1λ1, (6.6)
that is, the greater the beta, the greater the expected return. This is very similar to the CAPM,
but is fundamentally different, because the factor is not necessarily the market factor, but the one
that is systematic to all stocks and estimated from data.
How do we estimate the number of factors and the factors themselves? There are three common approaches:
principal components analysis (PCA), which is the most popular; asymptotic PCA (aPCA), which is
computationally preferred with a large number of assets; and traditional factor analysis. We will focus
on the first two, which are the most useful.
6.2 Principal components analysis
Principal components analysis (PCA) is a general dimension-reduction approach. In this section,
we first review the concepts of eigenvalues and eigenvectors, then provide the details
for computing the principal components (PCs). Finally, we explain the theory behind them.
6.2.1 Eigenvalue and eigenvectors
First, let us review the concepts of eigenvalues and eigenvectors. Consider a 2 × 2 matrix
$$\Sigma = \begin{bmatrix} 2.05 & 1.95 \\ 1.95 & 2.05 \end{bmatrix}. \qquad (6.7)$$
Any vector $(a_1, a_2)'$ satisfying
$$\Sigma \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} = \lambda \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \qquad (6.8)$$
is called an eigenvector, and $\lambda$ the associated eigenvalue. Here we have
$$\begin{bmatrix} 2.05 & 1.95 \\ 1.95 & 2.05 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = 4 \times \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$
and
$$\begin{bmatrix} 2.05 & 1.95 \\ 1.95 & 2.05 \end{bmatrix} \begin{bmatrix} 1 \\ -1 \end{bmatrix} = 0.1 \times \begin{bmatrix} 1 \\ -1 \end{bmatrix}.$$
So 4 and 0.1 are two eigenvalues, and
$$A_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad A_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$
are the eigenvectors. Note that the eigenvectors are not unique as their scaled vectors will also be
eigenvectors. However, once they are standardized,
$$a_1^2 + a_2^2 = 1,$$
they will be unique up to a sign. In our example here,
$$A_1 = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}, \qquad A_2 = \begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix}$$
are the standardized ones (scaled to make the sum of squared components equal to one). Clearly,
$$A_1^* = -\begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}, \qquad A_2^* = -\begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix}$$
are also standardized eigenvectors. But they are essentially the same as $A_1$ and $A_2$, except for the sign.
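The 2 × 2 example can be checked numerically; numpy's `eigh` is a convenient routine for symmetric matrices:

```python
import numpy as np

# Numerical check of the 2x2 example: numpy's eigh returns eigenvalues in
# ascending order, with standardized (unit-length) eigenvectors as columns.
Sigma = np.array([[2.05, 1.95],
                  [1.95, 2.05]])
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Each column v satisfies Sigma @ v = lambda * v:
assert np.allclose(Sigma @ eigvecs, eigvecs * eigvals)
```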
In finance, the covariance matrix of n assets, Σ, is of great importance for determining the risk of
any portfolio of the assets. It has two important properties: 1) symmetry; 2) positive definiteness (it is
in particular invertible, ruling out redundant assets, i.e., those that are linear combinations of others).
Symmetry means that the transpose of the matrix equals itself. Positive definiteness means
that, for any nonzero n-vector $\eta \neq 0$, we have
η′Ση > 0.
We can scale η so that its elements sum to 1 without affecting the above inequality; the inequality
then says that the risk of any fully invested portfolio of the assets is not zero.
Intuitively, the inequality, or positive definiteness, must be true when there are no redundant assets:
any portfolio of the risky assets must be risky. Otherwise, there is at least one portfolio that has no risk,
$$w_1 r_1 + w_2 r_2 + \cdots + w_n r_n = 0.$$
We can then solve for one asset as a linear combination of the other assets, implying this asset is redundant,
contradicting our assumption.
The no-redundancy assumption will always be made in this section; it is often assumed
implicitly in portfolio theory without even being mentioned. In practice, returns on different stocks
are unlikely to be redundant because, for any given stock, linear combinations of other
stocks cannot perfectly replicate it.
Under the no-redundancy assumption, the covariance matrix Σ of n assets is positive definite,
and so, mathematically, it will be invertible, and will have exactly n positive eigenvalues (some could be
equal, similar to repeated roots of a quadratic equation) and n standardized eigenvectors
associated with them (unique up to signs).
6.2.2 PCs: data
For a large data set of n variables, the PCA re-packages them into n components, PCs, ordered in
such a way that the first (newly created) component contains the maximum amount of variation, the
second component is orthogonal to the first and contains the second-largest amount of variation,
and so on. The last component contains the smallest amount of variation. The idea is that we can
focus on the first K important components, while dropping the remaining, less important ones.
Let X be a T × n matrix of the data, where n is the dimensionality and T is the sample size.
Suppose that X is de-meaned (the sample means are subtracted from the data, as researchers often do
in applying PCA). Then the n × n matrix
$$\hat{\Sigma} \equiv X'X/T \qquad (6.9)$$
is the sample covariance matrix. It has n eigenvalues,
λ1 ≥ λ2 ≥ · · · ≥ λn
and n eigenvectors $A_1, \ldots, A_n$, each of which is an n-vector. The first PC, in terms of data, is defined
as
P1t = A11X1t +A12X2t + · · ·+A1nXnt = A′1Xt, t = 1, 2, . . . , T, (6.10)
which is a weighted sum of the data (a portfolio if Xt are returns) with the first eigenvector A1 as
weights. Hence, the first PC, P1, is simply a repackage of the original data. Mathematically, it has
the property that
var(P1) = λ1, (6.11)
that is, its variance is the same as the largest eigenvalue (the proof is in the next subsection).
Similarly, the second PC is defined by
P2t = A21X1t +A22X2t + · · ·+A2nXnt = A′2Xt, t = 1, 2, . . . , T, (6.12)
and so on. The variance of the j-th PC is equal to the j-th eigenvalue,
$$\mathrm{var}(P_j) = \lambda_j, \qquad j = 1, 2, \ldots, n, \qquad (6.13)$$
where $\lambda_j$ is the j-th largest eigenvalue of $\hat{\Sigma}$.
The second important property of the PCs is that they are orthogonal to each other. This
means that the original data
X = [X1, X2, . . . , Xn]
are transformed into orthogonal data (PCs)
P = [P1, P2, . . . , Pn].
The orthogonality means that the PCs are uncorrelated when the PCA is applied to stock returns,
which simplifies the covariance structure and makes the optimal portfolio simple in terms of
the PCs. In many applications, we may care only about the data directions that have the most variation,
i.e., the first K principal components. Then we reduce the problem of studying, say, n = 1000 variables to
a study of K, say K = 5, linear combinations of the original variables. This is dimension reduction.
The third property of the PCs is that they are invariant (the same) under any orthogonal transformation
of the data. Suppose we apply the PCA to a new data set that is
an orthogonal transformation of the old,
$$X^* = XC,$$
where the matrix C is orthogonal, i.e., $C'C = I_n$. The eigenvalues will remain the same, but the
eigenvectors will be multiplied by $C'$. Based on (6.24), the PCs will be unchanged.
As to the choice of K, the number of factors, one usually examines the sum of the first K
eigenvalues. If this sum is 95% of the sum of all the eigenvalues, K may be adequate, as the
K factors can explain about 95% of the variation of the returns. In general, K factors explain
a fraction
$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_K}{\lambda_1 + \lambda_2 + \cdots + \lambda_n}$$
of the total variance.
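The computations above can be sketched with numpy on simulated, hypothetical return data:

```python
import numpy as np

rng = np.random.default_rng(3)

# A sketch on simulated (hypothetical) return data: de-mean, form the
# sample covariance (6.9), and build the PCs as in (6.10).
T, n = 240, 5
X = rng.normal(0.0, 0.05, (T, n))
X = X - X.mean(axis=0)                     # de-mean each column

Sigma_hat = X.T @ X / T                    # sample covariance matrix
eigvals, A = np.linalg.eigh(Sigma_hat)     # ascending order
eigvals, A = eigvals[::-1], A[:, ::-1]     # reorder: largest eigenvalue first

P = X @ A                                  # T x n matrix of PCs

# Fraction of total variance explained by the first K components:
K = 2
explained = eigvals[:K].sum() / eigvals.sum()
```

Each column of `P` has variance equal to the corresponding eigenvalue, and the columns are uncorrelated in sample.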
6.2.3 PCs: random variables
The PCA can also be stated in terms of random variables (the population). It finds n linear
combinations of the original n random variables. In contrast to the original ones, the new variables
(the PC components) are orthogonal to each other, the first component has the largest variance, the
second component has the second largest, and so on. So, as far as variances are concerned, we can study the
K new variables instead of the original n ones, which is especially advantageous when n is large.
Let x = (x1, x2, . . . , xn) be an n-vector of de-meaned random variables, say n de-meaned stock
returns. Denote its covariance matrix by
Σ = Var(x), (6.14)
which is n× n. We now define the PCs in terms of Σ, the population parameter.
Let $A_1 = (A_{11}, A_{12}, \ldots, A_{1n})'$ be the first eigenvector of Σ,
ΣA1 = λ1A1, (6.15)
where λ1 is the largest eigenvalue. Then, the first PC is defined as a linear combination of the
original variables,
P1 ≡ A11x1 +A12x2 + · · ·+A1nxn = A′1x, (6.16)
which says that the first PC is determined by the first eigenvector, whose elements serve as the
weights on the original variables.
One can define the second PCA factor using the second eigenvector,
P2 ≡ A21x1 +A22x2 + · · ·+A2nxn, (6.17)
and so on. In short, given the original n random variables (assets), we can repackage them to
obtain n particular new random variables (linear combinations of the original assets), A′jx’s, i.e.,
the n PC components.
Why do we do that? In practice, it is often the case that the first K (say, K=5) are the
most important. Then, as an approximation, we can replace the n original assets by the K PC
components. Imagine there are thousands of stocks. With PCA, we reduce the dimensionality from
thousands to K.
Example 6.1 For the covariance matrix
$$\Sigma = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}, \qquad (6.18)$$
the first standardized eigenvector is
$$A_1 = \begin{bmatrix} 0.526 \\ 0.851 \end{bmatrix}, \qquad (6.19)$$
and hence the first PCA component is
$$P_1 = 0.526\, x_1 + 0.851\, x_2, \qquad (6.20)$$
where $x_1$ and $x_2$ are the de-meaned original variables. ♠
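Example 6.1 can be verified numerically (the sign flip is needed because eigenvectors are unique only up to sign):

```python
import numpy as np

# Check Example 6.1: the first standardized eigenvector of Sigma.
Sigma = np.array([[2.0, 1.0],
                  [1.0, 3.0]])
eigvals, eigvecs = np.linalg.eigh(Sigma)
A1 = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue
A1 = A1 * np.sign(A1[0])                   # fix the sign (unique only up to sign)
```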
Similar to the case with data, for the random variables $P_1, \ldots, P_n$, there are three properties:
a) the j-th component has the j-th largest variance,
$$\mathrm{var}(P_j) = \lambda_j,$$
where $\lambda_j$ is the j-th largest eigenvalue of Σ; b) they are uncorrelated; c) they are invariant if the
original x are transformed via an orthogonal matrix.
In practice, the population parameter Σ is unknown and is often estimated from data, say by the
sample covariance matrix,
$$\hat{\Sigma} = X'X/T. \qquad (6.21)$$
For fixed n, if the sample size T is large enough, the estimates will converge to the population parameters.
Then, PCA applied to the data will be the same as PCA applied to the population. However, in practice,
PCA is often applied to the data as in the previous subsection.
6.2.4 PCA factors
Principal components analysis (PCA) is a general dimension reduction approach without imposing
a factor structure on the data, and so it is more general than a factor model. But the PCA can be
applied to estimate factors of a factor model.
Assume that there are K factors that drive returns in the factor model, Eq. (6.3), whose vector
form is
$$r_{it} = \alpha_i + \beta' f_t + \epsilon_{it}, \qquad \alpha_i = E[r_{it}], \qquad (6.22)$$
or
$$x_{it} = \beta' f_t + \epsilon_{it}, \qquad (6.23)$$
where $x_{it} = r_{it} - E[r_{it}]$ is the de-meaned return. The important question is how we estimate
the factors. PCA is one of the most popular approaches to estimating $f_t$.
If we stack the first K PC components as a K × 1 vector at any time t,
$$F_t = [P_{1t}, P_{2t}, \ldots, P_{Kt}]' = \Phi' X_t, \qquad K \times 1, \qquad (6.24)$$
where $X_t$, $n \times 1$, is the vector of de-meaned stock returns, and $\Phi = [A_1, \ldots, A_K]$ is an $n \times K$ matrix of the
first K eigenvectors, then $F_t$ is the PCA estimate of the realizations of the K factors at time t. When
K = 1, it is simply
$$F_t = A_{11} X_{1t} + A_{12} X_{2t} + \cdots + A_{1n} X_{nt} = A_1' X_t,$$
where $A_1 = (A_{11}, A_{12}, \ldots, A_{1n})'$ is the first eigenvector. This is similar to the case of the market
factor. It is a random variable of asset returns that fluctuates over time. But in terms of data, say
current month, it is a weighted average of the realized returns. While the market factor uses the
firm values as the weights, the PCA factor uses the first eigenvector as the weights. Note that the
weights of the PCA factor do not sum to 1; instead, they are scaled to ensure that the factor has a variance of
$\lambda_1$.
Now, if we stack all the factor observations as a T × K matrix, it follows that
$$\hat{F} = X\Phi \qquad (6.25)$$
is an estimate of all the factor realizations in the K-factor model. Here we put a hat on F
to emphasize that it is an estimate, rather than the true realization of the factors, which is not
observable. In short, the PCA estimates of the K factors are simply the first K PCs; that is, we use
the first K PCs as the factors.
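A sketch of this factor extraction, on data simulated from a hypothetical one-factor model, shows that the estimated factor tracks the true latent factor closely (up to scale and sign):

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulate a (hypothetical) one-factor model r_it = alpha_i + beta_i f_t + e_it,
# then estimate the factor path by F_hat = X Phi, with Phi the first
# eigenvector of X'X/T, as in (6.25).
T, n = 360, 50
f = rng.normal(0.0, 0.04, T)               # latent factor
beta = rng.uniform(0.5, 1.5, n)            # loadings
R = 0.01 + np.outer(f, beta) + rng.normal(0.0, 0.02, (T, n))

X = R - R.mean(axis=0)                     # de-meaned returns
eigvals, A = np.linalg.eigh(X.T @ X / T)
Phi = A[:, [-1]]                           # first eigenvector (K = 1)
F_hat = X @ Phi                            # estimated factor realizations

# The factor is identified only up to scale and sign, so compare by
# (absolute) correlation with the true factor.
corr = np.corrcoef(F_hat.ravel(), f)[0, 1]
```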
How to interpret a PCA factor economically? This is an issue for which there are no perfect
answers. One way is to examine its correlations with known economic variables. For example, if the
first PCA factor has a 90% correlation with the market factor, and the second has 80% with inflation, we
may interpret the first factor as primarily the market and the second as largely inflation. Another
way is to run a regression of the PCA factor on known variables. If the second factor has a slope of 80%
on inflation and 15% on GDP, we can attribute its effects to inflation and GDP.
Consider now how to determine the number of factors, given that the factor model is true. For
any number K, we can use the first K PCs as the factors. Then we run the time series regression,
(6.3), on the factors to get the estimated mean-squared errors, $\hat{\sigma}_i^2$, for each stock i. Let
$$V(K) = \frac{T}{n(T-K)} \sum_{i=1}^{n} \hat{\sigma}_i^2 + K\, g(n), \qquad (6.26)$$
where
$$g(n) = n^{-1/4} \log(n).$$
The first term, up to a scale, measures how well the factors fit the linear regression. The smaller
it is, the better the fit. Increasing the number of factors will always improve the fit. However, the
greater number of factors will introduce more parameters and greater estimation errors, so we add
the second term to penalize a larger K. The optimal trade-off between the two is theoretically the
right number of factors. In other words, we choose such a K, K∗, to minimize V (K).
Econometrically, Bai (2003) is the first to provide the statistical properties of the estimated
factors, given that the factor model is true. Bai and Ng (2002) provide the criterion for selecting
the number of factors. The criterion above is taken from Zaffaroni (2019), who proves that the $K^*$
so chosen converges to the true value as the number of assets n increases to infinity. Empirically,
we compute V(K) for K = 1, 2, . . . , 30 (say), and find the $K^*$ that makes V(K) the smallest.
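A minimal sketch of this selection, on data simulated from a hypothetical 3-factor model (we take $\hat{\sigma}_i^2$ to be the mean-squared residual of stock i on the first K PCs):

```python
import numpy as np

rng = np.random.default_rng(5)

# Choose K by minimizing the criterion (6.26) on simulated data from a
# (hypothetical) 3-factor model.
T, n, K_true = 240, 60, 3
F = rng.normal(0.0, 1.0, (T, K_true))      # true factors
B = rng.normal(0.0, 2.0, (n, K_true))      # loadings
X = F @ B.T + rng.normal(0.0, 1.0, (T, n))
X = X - X.mean(axis=0)

eigvals, A = np.linalg.eigh(X.T @ X / T)
A = A[:, ::-1]                             # eigenvectors, largest eigenvalue first

def V(K):
    Fk = X @ A[:, :K]                      # first K PCs as factors
    coef = np.linalg.lstsq(Fk, X, rcond=None)[0]
    resid = X - Fk @ coef
    sigma2 = (resid ** 2).mean(axis=0)     # MSE for each stock
    return T / (n * (T - K)) * sigma2.sum() + K * n ** (-0.25) * np.log(n)

K_star = min(range(1, 11), key=V)          # search K = 1, ..., 10
```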
It should be mentioned that the above factor estimator is computationally efficient when n < T,
but this is rarely the case in practice. A mathematically equivalent but computationally more efficient
estimator is given in Section 6.3. The asymptotic theory on factor selection is the same, and that
on the estimators is equivalent too, apart from a linear transformation.
PCA has wide applications in finance (see, e.g., Alexander, 2001), used not only in the equity
market, but also in other asset classes. For example, Litterman and Scheinkman (1991) show that
three yield factors, the level, slope and curvature, from the PCAs explain bond returns well. Jolliffe
(2002) discusses various theoretical aspects of the PCA and its uses in other areas.
Another wide use of the PCA is to extract a few predictors out of many. When there are
many predictors, running a regression on all of them is not efficient, as the estimation errors can be
large. Instead, running a regression on the first PC (or the first few) can do a much better job
in forecasting out-of-sample. For example, Baker and Wurgler (2006) use the first PC of six
proxies as their famous investor sentiment index, and Neely et al. (2014) use PCs of technical
indicators to predict stock market returns.
6.2.5 The theory
For a covariance matrix, the most important property is that it can be decomposed into a product
of three terms: eigenvectors, eigenvalues, and eigenvectors. That is, Σ can be written as (see a
Linear Algebra text for the proof)
$$\Sigma = A \Lambda A' = [A_1, \ldots, A_n] \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} [A_1, \ldots, A_n]' \qquad (6.27)$$
$$= \lambda_1 A_1 A_1' + \lambda_2 A_2 A_2' + \cdots + \lambda_n A_n A_n',$$
where $\Lambda$ is the diagonal matrix of eigenvalues, $A_i$ is the eigenvector corresponding to eigenvalue $\lambda_i$, and the eigenvectors are orthogonal to
each other with unit length. There are exactly n eigenvectors and n eigenvalues (though the eigenvalues
could be equal, like the roots of an n-th order polynomial). The above is known as the Eigenvalue
Decomposition Theorem, or the Spectral Theorem.
The decomposition holds for any symmetric matrices. The eigenvalues are greater than zero
for positive definite or nonsingular covariance matrices. The eigenvalue decomposition is a special
case of the singular value decomposition (SVD).
Now we are ready to understand more about the PCA. It is enough to carry out the analysis
in terms of the population or random variables. Statistically, the PCA is motivated by finding a linear
combination of the variables that has the maximum variance. In other words, we want to find $a$ such
that
$$P_1 = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n = a' x \qquad (6.28)$$
explains the most variation of the underlying random vector $x = (x_1, \ldots, x_n)'$ (here we consider
the PCA in terms of the population, and we assume x has zero mean), or
$$\max_a \ \mathrm{Var}(a'x) = a' \Sigma a, \qquad (6.29)$$
where $a \equiv (a_1, a_2, \ldots, a_n)'$ is standardized such that
$$a'a = a_1^2 + \cdots + a_n^2 = 1.$$
This means that the vector $a$ has unit length (if a is unrestricted, the maximum would be infinity, attained by scaling a up).
Mathematically, we want to maximize the following function,
$$f(a) = a' \Sigma a - \lambda (a'a - 1) = \sum_{i,j} a_i \sigma_{ij} a_j - \lambda \Big( \sum_i a_i^2 - 1 \Big), \qquad (6.30)$$
where $\lambda$ is the Lagrange multiplier. The first-order conditions are
$$\frac{\partial f(a)}{\partial a} = 2\Sigma a - 2\lambda a = 0.$$
This is the same as
$$\Sigma a = \lambda a,$$
which, following our definition, says that $\lambda$ must be an eigenvalue and $a$ must be the associated
eigenvector. Suppose that $a$ is the i-th eigenvector. Based on the Eigenvalue Decomposition
Theorem and the orthogonality,
$$a' \Sigma a = \lambda_i A_i' a = \lambda_i.$$
Therefore, to maximize $a' \Sigma a$, $a$ must be the first eigenvector, and the maximum is exactly equal
to $\lambda_1$, the largest eigenvalue.
In other words, $P_1$ is a random variable that is a linear combination of the original random
variables, and it has the maximum variance, $\lambda_1$, when the combination coefficients form the first
standardized eigenvector. Similarly, the second PC is defined the same way, to maximize the
variance, but it is required to be uncorrelated with $P_1$. It can be shown that its combination
coefficients must be the second standardized eigenvector and its maximum variance is equal to the
second largest eigenvalue. The remaining PCs are obtained similarly. Anderson (1984, Chapter 11), a
classic of multivariate statistics, provides more properties of the PCs.
In practice, PCA requires only the computation of the eigenvalues and eigenvectors of Σ, which
is straightforward to do with many available packages. Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$ be the n eigenvalues.
Put them into a diagonal matrix, and put the associated eigenvectors into a matrix A; then the
i-th PC is defined as $P_i = A_i' x$. Recall that we have already set the mean of x to zero, so that $\mu_x$, the
mean of x, does not enter the PC as in (6.16). In matrix form, all the PCs can be expressed as
$$P = \begin{bmatrix} P_1 \\ P_2 \\ \vdots \\ P_n \end{bmatrix} = \begin{bmatrix} A_1' x \\ A_2' x \\ \vdots \\ A_n' x \end{bmatrix} = A' x. \qquad (6.31)$$
Hence, as an approximation, the covariance matrix can be modeled by the first few, say K,
components, ignoring the remaining insignificant $\lambda_i$'s, $i = K+1, \ldots, n$. Since the eigenvectors
are normalized here, $A'A = AA' = I$, (6.31) clearly implies $x = AP$, or, using only the first K
PCs,
$$x_1 \approx a_{11} P_1 + a_{12} P_2 + \cdots + a_{1K} P_K, \qquad (6.32)$$
$$x_2 \approx a_{21} P_1 + a_{22} P_2 + \cdots + a_{2K} P_K, \qquad (6.33)$$
$$\vdots$$
$$x_n \approx a_{n1} P_1 + a_{n2} P_2 + \cdots + a_{nK} P_K, \qquad (6.34)$$
which says that the study of the original, complex, and potentially large number of (n) variables
can be reduced to the study of only K linear combinations of the variables; that is, the first K PCs
can be taken approximately as the factors. For example, the term structure of interest rates is
complex, but it can often be reduced to the study of 3 PCs. See, e.g., Alexander (2001).
What are the statistical properties of the estimated PCs? Beyond the above population
motivation, we are in practice more interested in the estimation accuracy of the eigenvalues and
eigenvectors, based on an estimated covariance matrix. In general, the estimate of the largest
eigenvalue tends to be larger than the largest true eigenvalue, and the estimate of the smallest tends to be
smaller than the smallest true eigenvalue. But they are consistent: for fixed n, as the sample
size increases, the estimated values will converge to the true values theoretically. However, if n is
too large relative to T, this will not be true (see Section 6.2.6).
Finally, we mention two formulas for the determinant and the trace in terms of the eigenvalues, which will be useful later,

det(Σ) = λ1λ2 · · · λn, (6.35)
tr(Σ) = λ1 + λ2 + · · · + λn. (6.36)

Both are consequences of the Eigenvalue Decomposition Theorem. Indeed, given the Theorem, the determinant of Σ is the product of three determinants: those of A, of the diagonal eigenvalue matrix, and of A′. The determinants of A and A′ multiply to 1 since A is orthogonal, while that of the eigenvalue matrix is clearly the product above. Since the trace is invariant under cyclic permutations, tr(ABC) = tr(CAB), the trace formula follows from the Theorem too.
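Numerically, the decomposition and the two identities are easy to verify. A minimal Python sketch on simulated data (the data and dimensions here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))        # 200 observations of 4 simulated returns
Sigma = np.cov(X, rowvar=False)

lam, A = np.linalg.eigh(Sigma)           # eigenvalues ascending, eigenvectors in columns
lam, A = lam[::-1], A[:, ::-1]           # reorder so that lam_1 >= ... >= lam_n

P = (X - X.mean(axis=0)) @ A             # the PCAs P = A'x, one row per observation

# the sample covariance of P is diagonal, with the eigenvalues on the diagonal
print(np.allclose(np.cov(P, rowvar=False), np.diag(lam)))   # True
# identities (6.35) and (6.36)
print(np.isclose(np.linalg.det(Sigma), np.prod(lam)))       # True
print(np.isclose(np.trace(Sigma), lam.sum()))               # True
```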
6.2.6 High-dimensional PCA
Theoretically, PCA works well for fixed n and large sample size T. However, if the dimensionality n is large relative to T, the traditional PCA runs into problems. For example, if we use 240 monthly observations (20 years) to extract factors out of 50 industries, there is likely a problem, as 50/240 = 20.83%, implying that n is not small relative to T. In this case, the PCA is known as high-dimensional PCA, and the estimation errors can be very high.
To understand the problem, consider the simple case where the data are generated from the iid standard normal, so the true covariance matrix is the identity matrix. Let λˆ1 and λˆn be the estimated largest and smallest eigenvalues. Then (see, e.g., Yao, Zheng and Bai, 2015), if T goes to infinity but N goes to infinity too, with N/T → η > 0,

λˆ1 −→ (1 + √η)², (6.37)
λˆn −→ (1 − √η)².

This says that the estimated eigenvalues are biased even asymptotically! In other words, even if the sample size is large, as long as n is a fixed fraction of T, the largest eigenvalue will be over-estimated and the smallest one under-estimated.
Applying the asymptotic theory with N = 50 and T = 240, λˆ1 converges to 2.12 and λˆn converges to 0.30, both of which are far from 1, the true value. Note that the estimated trace (the sum of all the eigenvalues) will still be close to N, the true value; the estimated eigenvalues are simply spread out, with their average close to the true average. The larger the η,
the more the spread. See Johnstone and Paul (2018) for a recent review of the issues, and Wu, Qin and Zhu (2020) and the references therein for the latest solutions.
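The asymptotic bias is easy to see in simulation. A small Python sketch under this subsection's iid standard normal assumption; the values quoted in the comments are the asymptotic limits above, which the finite-sample eigenvalues should be near:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 240, 50                         # 20 years of monthly data, 50 "industries"
X = rng.standard_normal((T, N))        # iid N(0,1): true covariance is the identity

eig = np.linalg.eigvalsh(X.T @ X / T)  # eigenvalues of the sample covariance
eta = N / T

print(eig.max())                       # near (1 + sqrt(eta))^2 ≈ 2.12, not 1
print(eig.min())                       # near (1 - sqrt(eta))^2 ≈ 0.30, not 1
print(eig.mean())                      # the average is still close to 1
```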
6.3 Asymptotic PCA
Asymptotic PCA is more in the spirit of most applications in finance, where n increases to infinity and can be much greater than T. In this case, it is computationally much more efficient to estimate the PCA factors from the T × T data matrix,

Πˆ = XX′, (6.38)

where X, T × n, is the matrix of demeaned returns, so that Πˆ is T × T.
Recall the K-factor model,

xit = β′ift + εit, i = 1, . . . , n, (6.39)

which is a re-write of (6.4). In terms of data, or in matrix form, we have

X = Fβ′ + ε, (6.40)

where β is the n × K matrix of the loadings and F is the T × K matrix of the factor observations to be estimated.
For example, if n = 5000 and T = 60, then Πˆ, 60 by 60, is a much smaller matrix than Σˆ, which is n × n, or 5000 by 5000. Hence the computation of the eigenvalues and eigenvectors is much easier for Πˆ than for Σˆ. This is why aPCA is computationally much more efficient than PCA when n is much larger than T.
Let η1, η2, . . . , ηK be the first K eigenvectors of Πˆ. Then the factor and loading estimates are

Fˇ = √T [η1, η2, . . . , ηK],  βˇ = (1/T) X′Fˇ. (6.41)

Note that each ηi is T × 1 and Fˇ is T × K, matching the dimensionality of the factor matrix F to be estimated.
Mathematically, the factors extracted from either the PCA or the aPCA are equivalent,
Fˇ = Fˆ V −1/2, (6.42)
where V is the K × K diagonal matrix consisting of the first K largest eigenvalues of X′X/(nT). Hence, use of either factor estimate yields essentially the same factor model, as the factors can only be identified up to a nonsingular linear transformation.
Bai (2003) shows that, under the assumption that the K-factor model is true, if n becomes large and √N/T → 0 (that means N cannot be too large relative to T), then the estimated factor converges to the true factor up to a rotation,

√N (Fˇt − HF0t) −→ N(0, VF), (6.43)

where H is some rotation matrix and VF is the asymptotic covariance matrix.
Connor and Korajczyk (1988) propose the aPCA and apply it to as many as 1745 assets to
extract factors. Bai and Ng (2008) and Bai and Wang (2016) provide reviews of various extensions.
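A minimal Python sketch of the aPCA recipe (6.38)–(6.41), tested on simulated one-factor data; the function name and the simulation design are illustrative assumptions, not part of the original papers:

```python
import numpy as np

def apca_factors(X, K):
    """Asymptotic PCA: extract K factors from the T x T matrix XX',
    which is cheap when n >> T.  X is T x n of demeaned returns."""
    T = X.shape[0]
    Pi = X @ X.T                              # T x T matrix, eq. (6.38)
    w, V = np.linalg.eigh(Pi)                 # eigenvalues in ascending order
    eta = V[:, ::-1][:, :K]                   # top-K eigenvectors
    F = np.sqrt(T) * eta                      # factor estimates, eq. (6.41)
    beta = X.T @ F / T                        # n x K loading estimates
    return F, beta

# toy check on simulated one-factor data with n >> T
rng = np.random.default_rng(1)
T, n = 60, 500
f = rng.standard_normal((T, 1))               # the true latent factor
b = rng.standard_normal((n, 1))               # true loadings
X = f @ b.T + 0.1 * rng.standard_normal((T, n))
X -= X.mean(axis=0)
F, beta = apca_factors(X, K=1)
corr = np.corrcoef(F[:, 0], f[:, 0])[0, 1]    # identified only up to sign
print(abs(corr))                              # close to 1
```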
6.4 Covariance matrix estimation
The invertibility of the covariance matrix is critical for our optimal portfolio formula, and for implementing it via quadratic programming too. However, the usual estimator, the sample covariance matrix, can be singular when the sample size is smaller than the number of assets. We discuss this problem and its solutions in detail below.
6.4.1 Invertibility problem
Recall that the sample covariance matrix is

S = (1/(T − 1)) Σ_{i=1}^{T} (Xi − X̄)(Xi − X̄)′,

where T is the sample size and X1, X2, . . . , XT are the observations over time. It is easy to see that a necessary condition for S to be nonsingular is

T ≥ N, (6.44)

where N is the dimensionality of X, or the number of assets. The proof is easy: the rank of S must be less than or equal to T, as it is a sum of T terms each of rank 1; and since S is N × N, its rank is exactly N if it is invertible, so N ≤ T.
For example, with N = 500 assets, and T = 240 (20 years of monthly data), then the above
condition is violated, and hence it is impossible to compute the inverse of the sample covariance
matrix. In practice, we can have thousands of stocks, and hence the sample covariance matrix runs
into problems. It can be applied only to a limited number of asset classes, not individual securities.
Condition (6.44) is easy to prove, but it is only necessary, not sufficient. A more stringent necessary condition is

T ≥ N + 1. (6.45)

Indeed, it is clear that

S = UU′/(T − 1), U = [X1 − X̄, X2 − X̄, . . . , XT − X̄],

so N = rank(S) ≤ rank(U) when S is nonsingular. Since U1T = 0 (the demeaned observations sum to zero), rank(U) ≤ T − 1, and the necessity follows. If the data are randomly drawn from a distribution with a nonsingular covariance matrix, condition (6.45) appears to be sufficient for S to be nonsingular almost surely. In general, however, the condition is only necessary, and there is no guarantee, because the data can come from a lower-dimensional space.
6.4.2 Factor-model based estimator
The key to the various factor analyses is that we can eventually model the asset returns using a "good" factor model,

r̃it − rft = µi + βi1f̃1t + · · · + βiK f̃Kt + ε̃it, (6.46)

where the K factors capture all the systematic risks (more than 20 factors in some practitioners' models), so that we can assume the residuals are uncorrelated.
Taking the covariance on both sides of the factor model, we obtain the return covariance matrix

Σ = β′Σfβ + Σε, (6.47)

where Σf is the K × K covariance matrix of the factors, β is the K × n matrix of loadings, and Σε is the diagonal covariance matrix of the residuals.

The above Σ can be inverted easily even if there are a large number of assets and a relatively small data size. Indeed, the inverse can be computed analytically from the well-known Sherman-Morrison-Woodbury matrix identity,

Σ⁻¹ = Σε⁻¹ − Σε⁻¹β′[Σf⁻¹ + βΣε⁻¹β′]⁻¹βΣε⁻¹, (6.48)

which is well defined as long as the factor covariance matrix is invertible, i.e., Σf⁻¹ exists. This is clearly not a problem in practice, as the number of factors is usually small, say less than 30.
In short, inversion of the covariance matrix is essential for applying the mean-variance portfolio theory. Without imposing a factor structure, the standard sample covariance matrix, equation (3.10), is not invertible unless T > N + K. Moreover, even when it is invertible because T > N + K, if N is large relative to T (say N/T = 0.3), it is still a poor estimator of the true covariance matrix. The above factor model with uncorrelated residuals is a solution to this problem, and it always works.
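The Woodbury inversion (6.48) is easy to verify numerically. A Python sketch with a simulated diagonal-residual factor structure (all parameter values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 3                                   # many assets, few factors

beta = rng.standard_normal((K, N))              # K x N loadings, as in (6.47)
Sigma_f = np.diag([0.04, 0.02, 0.01])           # K x K factor covariance
Sigma_e = np.diag(rng.uniform(0.01, 0.05, N))   # diagonal residual covariance

Sigma = beta.T @ Sigma_f @ beta + Sigma_e       # eq. (6.47)

# Woodbury inverse, eq. (6.48): only K x K and diagonal inversions are needed
Se_inv = np.diag(1.0 / np.diag(Sigma_e))
middle = np.linalg.inv(np.linalg.inv(Sigma_f) + beta @ Se_inv @ beta.T)
Sigma_inv = Se_inv - Se_inv @ beta.T @ middle @ beta @ Se_inv

print(np.allclose(Sigma_inv @ Sigma, np.eye(N)))   # True
```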
Note that the factor model is not the only way to estimate the covariance matrix. In Chapter 4, we discussed two other approaches. The first is to apply a shrinkage approach to reduce the dimensionality, which may even include recent machine learning methods. The second is to use high-frequency data (such as daily returns). In practice, all of these can be used to resolve the invertibility problem. Further analysis of performance may determine what is best for the problem at hand, as none of the methods completely dominates all the others.
6.5 Both explicit and latent factors
In the previous chapter, we analyzed factor models in which the factors are known or explicit, such as the market factor model and the Fama-French 3-factor model. In this chapter, we examine latent factors. Now we consider a more general factor model in which both types of factors are present. Mathematically, we have

rit = αi + f′tβi + G′tβgi + εit, (6.49)

where rit are excess returns, ft are the K latent factors to be estimated, Gt are the L known factors, and βi and βgi are the K × 1 and L × 1 loadings on the factors.
The known or explicit factors, Gt, may include common systematic and macroeconomic factors, where the latter are measured as surprises in macroeconomic variables that help explain returns. For example, we may have an explicit macroeconomic factor model,

Rit − rft = αi + βig[GDPt − Et−1(GDPt)] + βif [IFt − Et−1(IFt)] + εit, t = 1, . . . , T, (6.50)

where Et−1(GDPt) and Et−1(IFt) are the past expected GDP and inflation, so their differences from the realized values are the surprises or unexpected changes that can affect the market, and βig and βif are the individual stock sensitivities to such changes. Of course, we can add common systematic factors, such as the Fama-French 3 factors; then we will have an explicit factor model with L = 5 factors.
The combined factor model (6.49) is quite intuitive. We start with explicit factors that are known to affect stock returns, such as the Fama-French 3 factors, GDP, and inflation, to obtain the set Gt. Since the L factors in Gt may not account for all the systematic risks in the market, we add K unknown statistical factors, estimated from the data, to capture the missing systematic effects.
The estimation of the mixed factor model usually takes two steps. In the first step, a regression of the asset returns on the known factors Gt is run to obtain αˆi and βˆgi. Then the unexplained returns are

uit = rit − αˆi − G′tβˆgi, (6.51)

which are the differences of the asset returns from their fitted values using the observed factors. These are the returns after removing the effects of Gt.
Then, in the second step, a factor estimation approach, such as the PCA, is used to estimate the latent factors from

uit = f′tβi + vit. (6.52)
With the factor estimates from here, we can plug them back into (6.49) to determine the expected
asset returns and their covariance matrix.
The above procedure combines the explicit factors (such as the market and GDP) with statistical
factors (estimated from PCA). Conceptually, with both information sets, the factor model should
work better than otherwise.
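The two-step procedure can be sketched in Python as follows; the helper name and the simulated data are illustrative assumptions, and the second step here extracts the latent factor via a PCA on the residual matrix:

```python
import numpy as np

def two_step_factors(R, G, K):
    """Two-step estimation of the mixed model (6.49), a minimal sketch:
    step 1: time-series OLS of each asset's excess return on known factors G;
    step 2: PCA on the residuals to extract K latent factors.
    R is T x n excess returns, G is T x L observed factors."""
    T, n = R.shape
    Z = np.column_stack([np.ones(T), G])           # add intercept for alpha
    coef, *_ = np.linalg.lstsq(Z, R, rcond=None)   # (L+1) x n coefficients
    U = R - Z @ coef                               # unexplained returns, eq. (6.51)
    w, V = np.linalg.eigh(U @ U.T)                 # PCA via the T x T matrix
    F = np.sqrt(T) * V[:, ::-1][:, :K]             # K latent factor estimates
    return coef[0], coef[1:], F                    # alphas, G-loadings, latent factors

# toy usage with simulated data: L known factors plus one hidden factor
rng = np.random.default_rng(2)
T, n, L, K = 120, 50, 2, 1
G = rng.standard_normal((T, L))
f = rng.standard_normal((T, K))                    # the true latent factor
R = G @ rng.standard_normal((L, n)) + f @ rng.standard_normal((K, n)) \
    + 0.1 * rng.standard_normal((T, n))
alpha, bg, F = two_step_factors(R, G, K)
```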
6.6 All-inclusive factor model
An all-inclusive factor model is one that combines both the time series factors and the firm fundamental factors, resulting in a cross-section regression model that includes all possible factor effects.
6.6.1 Time series factor model
A general time series factor model may be written as

rit = αi + βi1f1t + · · · + βiKfKt + εit, t = 1, . . . , T, (6.53)

where f1t, . . . , fKt are either explicitly known or estimated from the data. The Fama-French 3 factors, the GDP and inflation factors, and the statistical (PCA) factors can all be included in the above equation.
The key is that the regression is run over time (which is why it is called a time series model) for any given stock. What we learn are the exposures, the betas, of the stock to the various systematic factors. If the alpha is zero, the systematic factors fully explain the expected return. Of course, this is not true for all stocks in the real world.
6.6.2 Fundamental factor model
In contrast to a time series factor model, a fundamental factor model often refers to a cross-section regression on firm-specific variables or firm characteristics that are relevant to the changes in stock
prices. Examples of such factors are price-to-earnings ratio, market capitalization, and financial
leverage.
A simple example of the fundamental factor model is the cross-section regression

Ri = c0 + c1Sizei + c2Profiti + εi, i = 1, 2, . . . , N, (6.54)

where c0, c1, c2 are regression coefficients, and Sizei and Profiti are firm size and profitability.
The key here is that the regression is run in the cross section, over firms. The slopes c1 and c2 are the same across firms, reflecting equal compensation for the firm characteristics. If the purpose is to explain the returns or to attribute risk, both the explanatory variables and the dependent variable are measured at the same time t. However, if the purpose is to forecast future returns using the firm characteristics, then the characteristics are measured at time t − 1 while the returns are at t. Chapter 11 provides more discussion and the detailed procedures for implementing such factor models.
6.6.3 All types of factors
Clearly, we should incorporate all relevant information to either explain or forecast stock returns. An all-inclusive factor model, or a generalized fundamental factor model, is one that combines both the time series factors and the firm fundamental factors, providing a cross-section regression model that considers all possible factor effects.
An example is a cross-section regression of returns on two sets of variables,

Ri = c0 + c1βmi + c2βgi + c3Sizei + c4Profiti + εi, i = 1, 2, . . . , N, (6.55)

where c0, c1, c2, c3, c4 are regression coefficients that are the same across firms. The regression is run across firms, and hence N, the number of firms, plays the role of T in a typical regression model.
There are in general two sets of explanatory variables. The first captures systematic or macro factor exposures, such as the market factor and the GDP factor, where the exposures are measured by the beta sensitivities to the factors. The second set consists of directly observable firm characteristics, such as size, profitability, and so on.
In terms of data, if we have 1000 firms, we can write the above in matrix form as

[ R1    ]   [ 1  X1,1     X1,2     X1,3     X1,4    ] [ c0 ]   [ ε1    ]
[ R2    ] = [ 1  X2,1     X2,2     X2,3     X2,4    ] [ c1 ] + [ ε2    ]
[ ...   ]   [ ...                                   ] [ ...]   [ ...   ]
[ R1000 ]   [ 1  X1000,1  X1000,2  X1000,3  X1000,4 ] [ c4 ]   [ ε1000 ]   (6.56)
where each Xi,j is firm characteristic j for firm i. The parameters can be estimated by using the
standard OLS regression. Suppose that the returns are measured at time t. As mentioned before,
if the purpose is to explain the returns, the explanatory variables are also measured at t. However,
often forecasting returns is of interest. In this case, the explanatory variables are measured and
available at t− 1, so that we use previous information to forecast the future returns.
Haugen and Baker (1996) appear to be the first to analyze a large set of explanatory variables in the above model. Lewellen (2015) provides a more recent and comprehensive analysis. Chapter 11 provides more discussion and the detailed procedures for implementation.
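The estimation in (6.56) is just OLS on the stacked data. A Python sketch with simulated characteristics and hypothetical coefficient values (everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1000
X = rng.standard_normal((N, 4))        # beta_m, beta_g, Size, Profit for each firm
c_true = np.array([0.01, 0.05, 0.02, -0.03, 0.04])   # hypothetical c0..c4
R = c_true[0] + X @ c_true[1:] + 0.1 * rng.standard_normal(N)

Z = np.column_stack([np.ones(N), X])   # design matrix as in (6.56)
c_hat, *_ = np.linalg.lstsq(Z, R, rcond=None)
print(c_hat)                           # close to c_true
```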
6.7 Factor analysis
The maximum likelihood (ML) method is the well-established approach for estimating the parameters in traditional factor analysis. In other words, the maximum likelihood estimators of B and V are obtained by maximizing the likelihood function of the observations. However, no analytical expressions are available for the estimators, so an iterative numerical approach has to be taken for the maximization.
Nevertheless, with the use of the EM algorithm,9 we can obtain the ML estimator iteratively. The first is the E step: we compute the expectation of the complete-data log-likelihood function, i.e., the density of the returns data with the factors treated as if they were known. Then, in the second, M step, we maximize the expected value obtained in the first step over the parameters. This leads to the following:

B∗′ = [δSδ′ + ∆]⁻¹(Sδ′)′, K × N (6.57)
V∗ = diag(S − (Sδ′)[δSδ′ + ∆]⁻¹(Sδ′)′), N × N (6.58)
9McLachlan and Krishnan (1997) provide a detailed introduction to the EM algorithm.
where 'diag' takes the diagonal elements of a matrix,

S ≡ (1/T) Σ_{t=1}^{T} (rt − µˆ)(rt − µˆ)′,  µˆ ≡ (1/T) Σ_{t=1}^{T} rt, (6.59)
δ ≡ B′(BB′ + V)⁻¹, K × N (6.60)
∆ ≡ I − B′(BB′ + V)⁻¹B, K × K, (6.61)

in which the inverse (BB′ + V)⁻¹ is computed from Woodbury's identity:

(BB′ + V)⁻¹ = V⁻¹ − V⁻¹B(I + B′V⁻¹B)⁻¹B′V⁻¹ (6.62)
(so that no inversion of any N × N matrix is needed). Given initial estimates of B and V, δ and ∆ can be computed, and hence so can B∗ and V∗, which are the values of B and V for the next iteration. Continuing this process, the limit is the maximum likelihood estimator of B and V.
Because any rotation of the factors will also make the factor model hold, a common identification
condition is to impose a diagonal restriction on
Jˆ = Bˆ′Vˆ −1Bˆ. (6.63)
Under this restriction, the factors can be estimated as
fˆt = Jˆ⁻¹Bˆ′Vˆ⁻¹(rt − r̄). (6.64)
There are also alternative estimators, for which Seber (1984) provides more details.
Lehmann and Modest (1988) are the first to apply such an approach in finance. A Bayesian factor analysis of the APT can be found in Geweke and Zhou (1996). However, factor analysis is generally very difficult to implement, especially when there are a large number of assets. So PCA and aPCA are the major methods for estimating factors in practice.
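For completeness, the EM iteration (6.57)–(6.62) can be sketched in a few lines of Python. This is a minimal illustration on simulated data under the diagonal-V factor model, not a production implementation; the initialization and iteration count are arbitrary choices:

```python
import numpy as np

def fa_em(R, K, iters=200):
    """EM for the factor model Sigma = B B' + V with V diagonal, a sketch
    following (6.57)-(6.62); R is T x N returns, B is N x K, V is N x N."""
    T, N = R.shape
    mu = R.mean(axis=0)
    S = (R - mu).T @ (R - mu) / T                      # eq. (6.59)
    B = np.linalg.svd(S)[0][:, :K]                     # crude initial loadings
    V = np.diag(np.diag(S))                            # initial residual variances
    for _ in range(iters):
        Vinv = np.diag(1.0 / np.diag(V))
        # Woodbury, eq. (6.62): only a K x K matrix is inverted
        Sinv = Vinv - Vinv @ B @ np.linalg.inv(np.eye(K) + B.T @ Vinv @ B) @ B.T @ Vinv
        delta = B.T @ Sinv                             # eq. (6.60)
        Delta = np.eye(K) - B.T @ Sinv @ B             # eq. (6.61)
        M = np.linalg.inv(delta @ S @ delta.T + Delta)
        B = (S @ delta.T) @ M                          # eq. (6.57), transposed
        V = np.diag(np.diag(S - (S @ delta.T) @ M @ (S @ delta.T).T))  # eq. (6.58)
    return B, V, S

# simulated two-factor data; at convergence B B' + V should approximate S
rng = np.random.default_rng(4)
T, N, K = 2000, 8, 2
B0 = rng.standard_normal((N, K))
R = rng.standard_normal((T, K)) @ B0.T + 0.5 * rng.standard_normal((T, N))
B, V, S = fa_em(R, K)
```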
7 Performance and Style
In this section, we examine first performance measures, and then investment styles.
7.1 Performance measures
There are many performance measures based on the returns of a portfolio. Although none of them is perfect, alphas and Sharpe ratios are the most widely used in practice.
7.1.1 Alpha
The most widely used alpha for assessing the performance of a portfolio is the CAPM alpha, also
known as Jensen’s alpha,
αp = E[R̃p − rf ] − βpE[R̃M − rf ], (7.1)

which in practice is easily computed as the intercept of the regression of the portfolio excess return on the market excess return:

R̃p − rf = αp + βp(R̃M − rf ) + ε̃p, (7.2)

where ε̃p is the regression residual.
Multiple factors are also used from time to time. For example, if the Fama and French (1993) 3 factors are used on the right-hand side of the regression, the alpha is known as the Fama-French (1993) 3-factor alpha.
7.1.2 Sharpe ratio
Recall from (2.34) that the Sharpe ratio is defined as

Sharpe ratio = E[R̃p − rf ] / sp, (7.3)

where sp is the standard deviation of the excess return of a given portfolio Rp, and rf is the riskfree rate. Since an individual asset is a special case of a portfolio, the definition also applies to any asset.
7.1.3 Sortino ratio
The Sortino ratio, proposed in the 1980s, is a modification of the Sharpe ratio. The volatility may not capture what investors are concerned about, as it penalizes upside and downside movements equally,

s²p = (1/(T − 1)) Σ_{t=1}^{T} (xt − x̄)², (7.4)

where xt = Rpt − rf and T is the sample size.

Presumably, investors love to see the asset return jump up, but not down. So the Sortino ratio is defined in terms of only the downside volatility (the volatility of returns in excess of a target, counted only when negative),

Sortino ratio = E[R̃p − Rb] / s⁻p, (7.5)

where Rb is the return on a target asset or index, and

(s⁻p)² = (1/(T − 1)) Σ_{t=1}^{T} min(Rpt − Rbt, 0)²,

which effectively uses only the observations with Rpt − Rbt < 0, i.e., the underperforming returns. s⁻p, also known as the downside deviation, is a well-known downside risk measure.
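A minimal Python sketch of (7.5), assuming a constant target return; the sample numbers are purely illustrative:

```python
import numpy as np

def sortino_ratio(rp, rb):
    """Sortino ratio as in (7.5): mean excess return over the target,
    divided by the downside deviation (T - 1 divisor, as in the text)."""
    d = np.asarray(rp, dtype=float) - np.asarray(rb, dtype=float)
    T = d.size
    downside = np.sqrt(np.sum(np.minimum(d, 0.0) ** 2) / (T - 1))
    return d.mean() / downside

# four hypothetical monthly returns against a 0% target
print(round(sortino_ratio([0.02, -0.01, 0.03, -0.02], [0, 0, 0, 0]), 4))  # 0.3873
```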
7.1.4 Information ratio
Recall from (2.85) that the information ratio is similar to the Sharpe ratio, except that a benchmark index replaces rf ,

IR = E(Rp − RB) / σ(Rp − RB), (7.6)

where RB is the return on a benchmark index that the fund manager attempts to beat, and Rp is the raw fund return.
7.1.5 Treynor ratio
The Treynor ratio is similar to the Sharpe ratio, except that the beta risk replaces the volatility,

Treynor ratio = E[R̃p − rf ] / βp, (7.7)

where βp is the CAPM beta of the portfolio.
7.1.6 Treynor and Black appraisal ratio
The Treynor and Black appraisal ratio measures alpha per unit of volatility risk,

TB appraisal ratio = αp / sp. (7.8)
In other words, for two fund managers with the same alpha, the one who has a lower volatility on
his/her portfolio is preferred.
7.1.7 Graham-Harvey volatility-matched return
The Graham-Harvey volatility-matched return, also known as M2, is defined as

GH = R̃p − R̃q, (7.9)

where R̃q is the return of a portfolio of S&P 500 futures and T-bills whose volatility is set equal to that of the given portfolio p (by adjusting the weight on T-bills). If a fund under-performs the volatility-matched market portfolio, the GH measure is negative. The intuition is that if an investor had a target level of volatility equal to the fund's, then the investor would have been better off holding a fixed-weight combination of S&P 500 futures and Treasury bills than holding the fund. Graham and Harvey (1996, 1997) provide the M2 and another related measure.
7.1.8 Maximum drawdown and Calmar ratio
A drawdown is the loss of a portfolio between a peak (a new high) and a subsequent valley. The maximum drawdown, more commonly referred to as Max DD, is the maximum peak-to-valley loss since the investment's inception or since a given time (typically 3 years). If the Max DD holds in the future, it caps the loss of an investor who buys the fund at the peak and sells at the bottom.

The Calmar ratio is defined as a fund's annual return divided by its Max DD, or the annualized return adjusted for the Max DD risk. Both of these performance measures are particularly popular in the world of commodity trading advisors.
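Max DD is easy to compute from a return series. A minimal Python sketch (the three-period example is illustrative):

```python
import numpy as np

def max_drawdown(returns):
    """Maximum peak-to-valley fractional loss of a cumulative wealth path;
    `returns` is a sequence of per-period returns."""
    wealth = np.cumprod(1.0 + np.asarray(returns, dtype=float))
    peaks = np.maximum.accumulate(wealth)      # running peak of the wealth path
    return np.max(1.0 - wealth / peaks)        # largest fractional drop from a peak

# wealth goes 1.10, 0.88, 0.968 -> the worst drop is from 1.10 to 0.88, i.e. 20%
print(max_drawdown([0.10, -0.20, 0.10]))       # ≈ 0.2
```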
Ideally, if the portfolio weights are known, better performance measures might be proposed.
See Grinblatt and Titman (1995) for a review of the related issues. For some recent performance
measures, see Christopherson, Ferson and Turner (1999) and Cohen, Coval and Pastor (2005).
7.2 Sharpe ratio: further analysis
Sharpe ratio (SR) is widely used and very important. Hence, it will be useful to examine its
accuracy (standard error), as well as to test whether two trading strategies have the same Sharpe
ratio or not.
7.2.1 Asymptotic standard error
Recall that the Sharpe ratio is defined as

SR = µ / σ, (7.10)

where µ is the expected excess return on a trading strategy, an asset or a portfolio, and σ is the standard deviation. This is not observable, but is estimated from the data,

ŜR = µˆ / σˆ, (7.11)

where

µˆ = (1/T) Σ_{t=1}^{T} Rt,  σˆ² = (1/T) Σ_{t=1}^{T} (Rt − µˆ)²,

with T the sample size and the Rt's the realized excess returns. (σˆ² could be computed by dividing by T − 1, which is more accurate in a statistical sense, but makes no difference in the asymptotic theory.)
The question is how close ŜR is to the true SR. Lo (2002) shows that, if Rt is iid normal, then asymptotically

ŜR − SR asy∼ N(0, (1 + SR²/2)/T). (7.12)

In other words, the 95% confidence interval for SR is approximately

[ŜR − 1.96σˆSR, ŜR + 1.96σˆSR], (7.13)

where

σˆ²SR = (1 + ŜR²/2)/T, (7.14)
that is, σˆSR is the estimated standard error of ŜR, or the square root of the asymptotic variance after replacing the unknown SR by ŜR.

The importance of computing the standard error of ŜR is as follows: a strategy with an SR of 1 is impressive, but if T = 12 the standard error can be 0.355 (see Lo's paper), so the confidence interval is [0.31, 1.69], and there is a lot of uncertainty about whether the SR is greater than 0.50, roughly what the market has. Of course, as T increases (with a longer track record), the confidence interval will shrink.
Under IID but not necessarily normal returns, Mertens (2002) shows that the asymptotic theory still holds, but we need the adjustment

σˆ²SR = (1 + ŜR²/2 − κ3ŜR + ((κ4 − 3)/4)ŜR²)/T, (7.15)

where κ3 and κ4 are the skewness and kurtosis of the Rt's.
Relaxing the IID assumption, Christie (2005) and Opdyke (2007) show that Mertens’s adjust-
ment is still valid under the more general assumption that returns are stationary and ergodic. Pav
(2021) provides the latest comprehensive discussions on SR.
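Lo's interval is a one-liner. A Python sketch reproducing the T = 12, ŜR = 1 example above (the function name is illustrative):

```python
import numpy as np

def sr_conf_interval(sr_hat, T):
    """95% CI for the Sharpe ratio under Lo's (2002) iid-normal
    asymptotics, eqs. (7.12)-(7.14)."""
    se = np.sqrt((1 + 0.5 * sr_hat**2) / T)
    return se, (sr_hat - 1.96 * se, sr_hat + 1.96 * se)

se, (lo, hi) = sr_conf_interval(1.0, 12)
print(round(se, 3), round(lo, 2), round(hi, 2))   # 0.354 0.31 1.69
```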
7.2.2 Test the difference between two SRs
In practice, we are often interested in whether one trading strategy truly outperforms another, or whether one fund manager has skills superior to another's. If we use the Sharpe ratio (similarly for the information ratio) to measure performance, the null hypothesis is

H0 : µa/σa = µb/σb. (7.16)

Our question is whether there is enough statistical evidence to reject the null of no difference in the Sharpe ratios.
Jobson and Korkie (1981) propose the following test statistic,

zab = (σˆbµˆa − σˆaµˆb) / √θˆ  asy∼ N(0, 1), (7.17)

where

θˆ = (1/T)[2σˆ²aσˆ²b − 2σˆaσˆbσˆab + 0.5µˆ²aσˆ²b + 0.5µˆ²bσˆ²a − (µˆaµˆbσˆ²ab)/(σˆaσˆb)],
and µˆa, µˆb, σˆ²a, σˆ²b and σˆab are the sample means, variances and covariance, and T is the sample size.

Note that the above test holds asymptotically under the assumption that returns are distributed independently and identically (IID) over time with a normal distribution. This assumption is often violated in real data. Ledoit and Wolf (2008) provide a bootstrap approach that relaxes the assumption (with Matlab code posted on Wolf's web site). Section 4.3 provides more discussion of the bootstrap.
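A Python sketch of the test statistic (7.17); the simulated strategies are illustrative, and by construction zab = −zba:

```python
import numpy as np

def jobson_korkie_z(ra, rb):
    """Jobson-Korkie (1981) z-statistic, eq. (7.17), for H0: equal Sharpe
    ratios; ra, rb are arrays of excess returns of the two strategies."""
    ra, rb = np.asarray(ra, dtype=float), np.asarray(rb, dtype=float)
    T = ra.size
    ma, mb = ra.mean(), rb.mean()
    va, vb = ra.var(), rb.var()                  # 1/T sample variances
    sa, sb = np.sqrt(va), np.sqrt(vb)
    sab = np.mean((ra - ma) * (rb - mb))         # sample covariance
    theta = (2 * va * vb - 2 * sa * sb * sab
             + 0.5 * ma**2 * vb + 0.5 * mb**2 * va
             - ma * mb * sab**2 / (sa * sb)) / T
    return (sb * ma - sa * mb) / np.sqrt(theta)

# usage: two simulated strategies with equal true Sharpe ratios (0.2 per period)
rng = np.random.default_rng(6)
ra = 0.01 + 0.05 * rng.standard_normal(600)
rb = 0.02 + 0.10 * rng.standard_normal(600)
z = jobson_korkie_z(ra, rb)
```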
7.3 Portfolio-based style analysis
To assess risk and performance, both individual and institutional investors are interested in the styles of fund management, such as whether a fund is domestic or international, growth or value, sector or index.
A simple style analysis is Morningstar’s style box which classifies funds by size and growth.
• Size:
Every month, Morningstar classifies all US stocks in its database according to their market capitalizations, or the total market value of all outstanding stock shares. It then ranks them by market capitalization: the top 70% are large-capitalization ("large cap") stocks, the next 20% are mid-cap, and the smallest 10% are small-cap. As of April 2002, stocks with market caps of more than $8.85 billion are considered large cap; between $1.56 billion and $8.85 billion, mid cap; and less than $1.56 billion, small cap.10
• Growth and Value:
Morningstar’s “value” of a stock is based on five scores. The first, weighted 50%, is ranking by
the forward price-to-earnings ratio (P/E), which is obtained by dividing the stock price by its
projected earnings per share for next year, in its cap group. The other 50% of the value score
comes from rankings from four equally weighted historical measures: price-to-sales (P/S),
price-to-book (P/B), price-to-cash flow (P/C), and dividend yield.
The growth score is obtained similarly: 50% comes from the ranking of the long-term projected
10Morningstar discusses the details on its web site: http://www.morningstar.com. They had not updated these April 2002 numbers as of 9/12/05.
earnings growth rate against stocks in the same cap, and the other 50% from rankings of the
historical earnings, sales, cash flow, and book value growth in its market cap band.
A stock's style score is then obtained by subtracting its value score from its growth score, resulting in scores that range from -100 to 100. A stock with a score of -100 would be a high-yielding, low-growth stock, while one with 100 would have no yield and very high growth. Stocks in the middle are classified as "core" stocks. The classification can vary over time with changes in the market, but on average each style includes about one third of all the stocks in each market cap band.
The market cap of a fund is the weighted average of the market caps of the stocks it owns. If its market cap is at least as big as the top 70% of US capitalization, the fund is classified as a large-cap fund; if it falls in the next-largest 20%, mid-cap; and the rest, small-cap. Similarly, if a fund's net style score, the weighted average over its stocks, equals or exceeds the "growth threshold" (normally about 25 for large-cap stocks), it is classified as growth; if its score equals or falls below the "value threshold" (normally about 15 for large-cap stocks), it is classified as value; and the others are "blend."
Portfolio-based analysis requires information on the holdings of a fund, which may be difficult to obtain and subject to changes. Another problem is that a domestic fund may own stocks whose earnings are largely determined by foreign economies, and hence it can be highly correlated with an international fund. In effect, what matters is not the label of a style, but rather its returns. In William Sharpe's words, "If it acts like a duck, assume it's a duck." So, we may classify funds by their return correlations with given styles.
7.4 Return-based style analysis
Sharpe (1988, 1992) proposes a way to measure the effective style of a fund portfolio Rpt. Let
F1t, . . . , FKt be the returns on K > 1 style benchmark index portfolios. Run the following regression
Rpt = b1F1t + b2F2t + · · · + bKFKt + εt, t = 1, . . . , T, (7.18)

where the bi's are interpreted as the style exposures. The regression essentially finds the portfolio of the style benchmarks that best explains the return on the fund. Typically, we impose the following
restrictions on the parameters:
b1 + b2 + · · ·+ bK = 1 (7.19)
bj ≥ 0, j = 1, 2, . . . ,K. (7.20)
The first restriction says that the coefficients must be the weights of a portfolio, and the second rules out short sales.
Coggin and Fabozzi (2003) provide a collection of studies on styles, while Kim, White and Stone
(2005) analyze the statistical properties of the estimates.
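The constrained regression (7.18)–(7.20) can be sketched with a generic quadratic optimizer; this uses scipy's SLSQP solver and simulated style indices, both illustrative choices rather than Sharpe's original algorithm:

```python
import numpy as np
from scipy.optimize import minimize

def style_weights(Rp, F):
    """Sharpe's return-based style analysis, eq. (7.18): least squares of
    fund returns Rp (T,) on style benchmarks F (T, K), subject to the
    weights summing to one and being nonnegative, (7.19)-(7.20)."""
    K = F.shape[1]
    obj = lambda b: np.sum((Rp - F @ b) ** 2)
    cons = ({'type': 'eq', 'fun': lambda b: b.sum() - 1.0},)
    res = minimize(obj, np.full(K, 1.0 / K), bounds=[(0, None)] * K,
                   constraints=cons, method='SLSQP')
    return res.x

# toy check: a fund that is 60/40 in two style indices should be recovered
rng = np.random.default_rng(7)
F = 0.04 * rng.standard_normal((120, 2))            # two simulated style indices
Rp = F @ np.array([0.6, 0.4]) + 0.001 * rng.standard_normal(120)
print(np.round(style_weights(Rp, F), 2))            # ≈ [0.6, 0.4]
```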
7.5 Hedge fund styles
See Brown and Goetzmann (2001).
8 Anomalies and Behavioral Finance
In this section, we first discuss various anomalies, and then some of the issues concerning the limits of arbitrage and behavioral finance.
8.1 Anomalies
Anomalies here mean abnormal stock returns that cannot be explained by asset pricing models; most are abnormal relative to the CAPM. This section draws heavily on the excellent survey by Schwert (2003). Dong et al. (2021) show that anomalies collectively predict the market, and the references therein provide a glimpse of recent research on anomalies.
8.1.1 Size and January effect
Banz (1981) and Reinganum (1981) found that small firms on the New York Stock Exchange (NYSE) earned higher average returns than predicted by the CAPM. Keim (1983) and Reinganum (1983) showed that much of the abnormal return to small firms (measured relative to the CAPM) occurs during the first two weeks in January, the "January effect" or "turn-of-the-year effect." Roll (1983) explained the effect by a scenario in which the high volatility of small firms causes large short-term capital losses that investors may want to realize for income tax purposes before the end of the year. This selling pressure may reduce the prices of small-cap stocks in December, leading to a rebound in early January as investors repurchase them. Interestingly, as Schwert (2003) showed, the size effect seems to have disappeared since the publication of the papers that discovered it.
8.1.2 The weekend effect
French (1980) observed that the average return on the S&P composite portfolio was reliably negative over weekends from 1953–77. However, like the size effect, the weekend effect seems to have disappeared, or at least substantially attenuated, since it was first published in 1980.
8.1.3 The value effect
Around the same time as the early size effect papers, Basu (1977, 1983) found that firms with high earnings/price (E/P) ratios earned positive abnormal returns relative to the CAPM. More recently, Fama and French (1992, 1993) proposed their famous 3-factor model, arguing that size and book value, which is closely related to E/P, are two risk factors (as measured by spread portfolios based on size and book/market) in addition to market risk. However, the value effect has also disappeared, or at least attenuated.
8.1.4 The momentum effect
Jegadeesh and Titman (1993) find that recent winners (portfolios formed on the past year of returns) outperform recent losers, the "momentum" effect. However, over a longer horizon of 3-5 years, DeBondt and Thaler (1995) found that past losers (low stock returns in the past 3-5 years) have higher average returns than past winners (high stock returns in the past 3-5 years), the "contrarian" effect. What is different here is that "the momentum effect seems to persist, but may reflect predictable variation in risk premiums that are not yet understood." (Schwert, 2003, p. 949).
8.1.5 Closed-end fund puzzle
A closed-end fund often trades at less than the value of its underlying assets, the “closed-end fund
discount” anomaly.
8.1.6 Mutual fund persistence
Hendricks, Patel and Zeckhauser (1993) found that there is short-run persistence in mutual fund performance. The "cold-hands" phenomenon is particularly strong: poor performance seems more likely to persist than would be expected by chance.
8.1.7 IPO abnormal returns
The first-day returns on IPOs are about 20% or so, and in China about 100%! However, over the long term, say 3 years, Ritter (1991) found that their performance is in fact lower than that of comparable firms.
8.1.8 Technical analysis
Technical analysis uses past prices and perhaps other past data to predict future market movements; momentum, high-frequency and algorithmic trading can be viewed as special cases. Traditionally, technical analysis focuses on trading indicators, and the price moving average is one of the most widely used and perhaps most useful indicators. For example, a 5-day moving average is defined by
MV5 = (P_t + P_{t−1} + P_{t−2} + P_{t−3} + P_{t−4}) / 5,
which is the average price over the past 5 days. When today's price is above MV5, it indicates a positive price trend. The 20- and 200-day moving averages are the most popular ones used by traders.
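The calculation can be sketched as follows, using made-up prices:

```python
# A minimal moving-average trend signal, assuming a list of daily closing
# prices (hypothetical numbers). The signal is 1 (bullish) when today's
# price is above the 5-day moving average MV5, else 0.
prices = [100.0, 101.5, 99.8, 102.2, 103.1, 104.0, 102.5, 105.3]

def moving_average(p, n):
    """Average of the last n prices, ending at the final element of p."""
    return sum(p[-n:]) / n

mv5 = moving_average(prices, 5)
signal = 1 if prices[-1] > mv5 else 0
```

With these illustrative prices, the last close sits above its 5-day average, so the signal is bullish; a 20- or 200-day version only changes the window length `n`.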
Broadly speaking, data science today as applied to finance/trading is part of technical analysis,
which is just more sophisticated than traditional technical analysis. It uses more advanced math-
ematical/statistical tools to extract information from past data, with the same goal of predicting
returns and making profits.
In practice, all major brokerage firms publish technical commentaries on the markets, and many of their advisory services are based on technical analysis. Many top traders and investors use it partially or exclusively (see, e.g., Schwager, 1989, and Lo and Hasanhodzic, 2009). Commodities and currency trading are known to rely on it heavily. All rule-based trading, such as trend-following or systematic trading (see, e.g., Covel and Ritholtz, 2017, and Hurst, Ooi and Pedersen, 2017), is part of technical trading, though it goes by different names, likely for marketing purposes.
However, many academics are skeptics. If the market is efficient in the weak form, all past information should be useless in predicting future returns to make any abnormal profits. Although weak-form efficiency is unlikely to hold completely, it does point out that, due to competition for profits, it is difficult to find any simple rules that make abnormal profits: as more people discover or become aware of them and use them, the profits are likely to disappear.
But the view that technical analysis has no value is challenged by many studies. For stocks, Brock, Lakonishok, and LeBaron (1992) provide evidence on the predictive power of price moving averages on the Dow. Lo, Mamaysky, and Wang (2000) further strengthen the evidence with an automated pattern recognition analysis. Recently, Neely, Rapach, Tu and Zhou (2014), Han, Yang and Zhou (2013) and Han, Zhou and Zhu (2016) provide more extensive evidence. The first paper finds that technical analysis can predict the stock market as well as fundamentals can, and it can offer significant economic gains over the strategy that ignores this predictability. The second finds that an application of a moving average timing strategy of technical
analysis to portfolios sorted by volatility generates investment timing portfolios that outperform the
buy-and-hold strategy substantially. For high volatility portfolios, the abnormal returns, relative
to the CAPM and the Fama-French three-factor models, are of great economic significance (the
annualized alphas are over 20%!), and are greater than those from the well known momentum strat-
egy. Moreover, the abnormal performance cannot be explained by market timing ability, investor
sentiment, default and liquidity risks. Similar results also hold if the portfolios are sorted based
on other proxies of information uncertainty. The third paper shows that technical analysis can be
used to capture a trend factor that combines short-, intermediate- and long-term price trends. The
trend factor performs far better than existing factors. Han, Liu, Zhou and Zhu (2021) provide a
review of technical analysis.
Theoretically, why can technical analysis work? While many studies show that past prices can
have predictive power on future returns (see, e.g., the references in Han, Zhou and Zhu (2016)),
which implies that technical analysis can be useful, few provide the economic reasons and make
the point explicitly in terms of technical indicators. Zhu and Zhou (2009) and Han, Zhou and Zhu
(2016) provide theoretical models that justify directly the value of using the price moving averages
as predictors of the stock market trends. Why is it possible to observe trends in the stock market?
There are two simple reasons:
• Due to differences in the timing of receiving information or in the speed of reaction to information, good news about a stock or the market is not incorporated into prices instantaneously. It may take minutes, days or months depending on the nature of the news, such as earnings, corporate structure or business cycles.
• Due to liquidity and uncertainty, informed traders who need to move large positions will have to move slowly over time. For example, it is often reported that large investors or hedge funds may buy or sell a security over a month or more.
Of course, predicting the start and end of a trend is very difficult, if not impossible. Most empirical studies (including all of the above) are about identifying a trend after it starts, and recognizing the end of a trend after it is over.
The great investor Warren Buffett once said to be "fearful when others are greedy, and greedy when others are fearful." This statement is a contrarian view on the stock market. When others are greedy, prices are flying high, and one should be cautious. When others are fearful, it may present a good buying opportunity. This may be a useful way to predict the start and end of a trend, but the idea needs to be quantified and tested.
Why can some well-known technical rules still work today? One answer is that there are frictions that prevent investors from following them. The first friction is the hurdle of following the rule: it may be too risky (remember, there are no profitable risk-free trading rules!), may incur high transaction costs, and may require great discipline and patience. For example, suppose Buffett's rule is true that buying when others are panic selling is profitable. It still requires buyers to withstand the loss if the temporary selling continues, and to hold the position, likely for a long time. Another friction is that investors may have other ideas/strategies that they perceive as more attractive. For example, it is known, and perhaps most will agree, that buying and holding the market index is a simple long-term strategy that can make one retire rich, but many investors simply refuse it and instead opt
for active trading or highly speculative investments, either to fulfill their dream of getting rich quickly (running the risk of losing a lot) or to get the thrill/entertainment of winning and losing (running the risk of paying too high a price for it). These are additional reasons why technical analysis or rule-based trading can work.
8.2 Are the anomalies real?
As Schwert (2003) nicely puts it, "Some interesting questions arise when perceived market inefficiencies or anomalies seem to disappear after they are documented in the finance literature: Does their disappearance reflect sample selection bias, so that there was never an anomaly in the first place? Or does it reflect the actions of practitioners who learn about the anomaly and trade so that profitable transactions vanish?" These are big issues for future research.
To understand the sample selection bias, note that researchers are likely to focus attention on "surprising" results. It seems likely that there is a bias toward the publication of findings that challenge existing theories, which could lead to the over-discovery of "anomalies". To mitigate the sample selection bias, one has to examine whether the anomaly persists in new, independent samples.
8.3 Limits to arbitrage
Limits to arbitrage emphasizes the difficulties of doing arbitrage in practice. While it seems riskfree or almost riskfree to arbitrage derivatives mispricing and futures prices relative to the underlying assets, arbitraging "mispriced" securities is risky. For example, if Ford, whose fundamental value is $20, is mispriced at $15, a buyer faces at least three risks in doing the arbitrage.11
The first is fundamental risk: buying Ford exposes the buyer to changes in its fundamental value over time. To eliminate this, one may sell a substitute security or a replicating portfolio (RP) of Ford. But perfect substitutes are difficult or impossible to find.
The second is noise trader risk: the risk that the mispricing may become even worse due to the noise/irrational traders who caused the mispricing in the first place. A profitable arbitrage (in which one buys Ford and sells an RP against it) relies on the convergence of Ford's price to the
11Both this and the next section rely heavily on Barberis and Thaler (2003).
value of the RP. In reality, the RP may stay the same while Ford's price keeps going down, i.e., the mispricing worsens. In theory, one can hold the position as long as it takes to converge. But in reality, large arbitrageurs and funds manage money for others, and their performance is judged on a short-term basis.12 If the mispricing of the arb trades worsens and yields bad returns, investors may decide that the arbitrageur is incompetent and withdraw their funds. If this happens, the arbitrageur will be forced to liquidate his position prematurely at a loss. Fear of such premature liquidation makes the arbitrageur less aggressive in combating the mispricing to begin with.
The third is implementation risk: the costs and risks involved in carrying the arbitrage to convergence. The point is that, in the presence of these three risks, it is not easy for arbitrageurs like hedge funds to exploit market inefficiencies.
In fact, sometimes it might be optimal for the big money to trade in the same direction as the noise traders, thereby exacerbating the mispricing rather than correcting it. For example, De Long et al. (1990) model an economy with positive feedback traders, who buy more of an asset this period if it performed well last period. When these noise traders push an asset's price above its fundamental value, arbitrageurs do not sell or short the asset, but rather buy it to attract more feedback traders next period, leading to still higher prices, at which they can exit at a profit.
Griffin, Harris and Topaloglu (2003) find exactly such behavior in the rise and fall of the NASDAQ: "Our evidence supports the view that institutions contributed more than individuals to the Nasdaq rise and fall."
8.4 Behavioral finance
Behavioral finance relies on investors' behavioral biases/psychology to explain anomalies or mispricing. If there were no limits to arbitrage, there would be no mispricing, and standard (rational) asset pricing models would be sufficient. So, in this sense, behavioral finance also relies on limits to arbitrage or imperfections of the market.
As summarized by Barberis and Thaler (2003), the common behavioral biases are: overconfidence, optimism and wishful thinking, representativeness, conservatism, belief perseverance and so on. (Some of these biases may be good traits in life, but not in trading or in the non-emotional valuation of securities.) Prospect theory, proposed by Kahneman and Tversky (1979) and Tversky
12Shleifer and Vishny (1997) explore the effects of this so-called agency problem.
and Kahneman (1992), models investor decision making by the following utility function,
π(p)v(x) + π(q)v(y),  x ≤ 0 ≤ y,  (8.1)
for a risky payoff of x with probability p or y with probability q, where π is a probability weighting function. In contrast with the usual expected utility theory, in which the utility u(W) is a concave function over the entire range of wealth W, here v is convex in x (losses) and concave in y (gains), implying that investors are risk-seeking over losses. This is motivated by the classic example of how most people choose in the following gambles. If presented with a choice between
A : payoff = 1000, probability = 50% (nothing else) (8.2)
and
B : payoff = 500, probability = 100%, (8.3)
most will choose B, the same as classic risk-averse investors would do. On the other hand, if presented with
C : payoff = −1000, probability = 50% (nothing else) (8.4)
and
D : payoff = −500, probability = 100%, (8.5)
most will choose C, opposite to what classic risk-averse investors would do.
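These choices can be reproduced with a prospect-theory calculation. The sketch below uses the Tversky-Kahneman (1992) functional forms with their commonly cited parameter estimates (curvature 0.88, loss aversion 2.25, probability weighting 0.61 for gains and 0.69 for losses); the specific parameter values are assumptions for illustration:

```python
# Prospect-theory evaluation of the four gambles above, using illustrative
# Tversky-Kahneman (1992) parameter estimates.
def v(x, alpha=0.88, lam=2.25):
    """Value function: concave over gains, convex and steeper over losses."""
    return x ** alpha if x >= 0 else -lam * (-x) ** alpha

def w(p, g):
    """Inverse-S probability weighting function."""
    return p ** g / (p ** g + (1 - p) ** g) ** (1 / g)

V_A = w(0.5, 0.61) * v(1000)   # 1000 with probability 50%
V_B = v(500)                   # 500 for sure
V_C = w(0.5, 0.69) * v(-1000)  # -1000 with probability 50%
V_D = v(-500)                  # -500 for sure
```

With these parameters, V_B > V_A (risk aversion over gains) while V_C > V_D (risk seeking over losses), matching the choices described above.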
Behavioral finance offers explanations for the anomalies. On closed-end funds, Lee, Shleifer and Thaler (1991) argue in a simple model that at some times fund investors are too optimistic, while at other times they are too pessimistic. Changes in their sentiment cause the difference between prices and net asset values.
Insufficient diversification refers to the phenomenon that investors diversify their portfolio holdings much less than is recommended by standard models of portfolio choice. In particular, investors exhibit a strong "home bias": French and Poterba (1991) and others found that investors in the USA, Japan and the UK allocate 94%, 98%, and 82% of their overall equity investment, respectively, to domestic equities. Ambiguity and familiarity offer an explanation for insufficient diversification.
Overtrading is a common mistake of individual investors. For example, Barber and Odean (2000) examine trading activity from 1991 to 1996 and find that investors would have done a lot better if they had traded less. The behavioral explanation for such excessive trading is overconfidence:
people believe that they have superior information and skills to justify a trade, whereas in fact they
do not.
The disposition effect is another common mistake of individual investors, who are reluctant to sell assets at a loss. For example, Odean (1998) finds that individual investors in his sample tend to sell stocks that have gone up in price since purchase, rather than those that have gone down. There are two behavioral explanations. First, the effect may be due to an irrational belief in mean-reversion. Second, it may be caused by loss aversion, or risk-seeking behavior when at a loss: the investors would like to gamble that the stock will eventually come back so as to avoid the painful loss (which in the real world often leads to even greater losses later, e.g., for those who hung on to dot-com stocks). Interestingly, Coval and Shumway (2000) find that professional traders also have
this problem. In the Treasury Bond futures pit at the CBOT, traders with profits (losses) by the
middle of the trading day will take less (more) risk in their afternoon trading.
9 Predictability 1: Time Series
In this section, we first discuss market efficiency, the rejection of the random walk, and the limits to predictability. Then, we examine various approaches, including recently developed ones, that are used to predict asset returns.
9.1 Market efficiency
As you recall, there are three forms of market efficiency, all of which suggest that a security's price equals its "fundamental value", or that no abnormal returns can be made relative to one of the information sets: past history, public information, and private information. An explanation for this is that any mispricing will be quickly corrected by smarter traders and investors. However, in practice, due to limits to arbitrage and information asymmetry, the anomalies discussed in the previous section are arguably difficult to explain.
Two important questions arise: 1) Is there really no predictability? 2) If there is predictability, what is its degree and how profitable is it?
9.2 Random walk?
Early studies of market efficiency focus on a random walk model of stock prices:
x_t = μ + x_{t−1} + ε_t,  ε_t ∼ N(0, σ²),  (9.1)
where x_t = log(P_t) is the log stock price. It says that tomorrow's log price is today's plus a drift and a normal random noise, or that the continuously compounded return is normal with mean μ and variance σ². This is exactly the lognormal assumption underlying the Black-Scholes formula. If the random walk model is true, the market must be efficient. However, if the market is efficient, the random walk model is not necessarily true.
Equation (9.1) says that the time series (x_t − x_{t−1}) is iid, and hence the mean and variance are estimated consistently by the sample analogues,
μ̂ = (1/T) Σ_{t=1}^{T} (x_t − x_{t−1}) = (1/T) Σ_{t=1}^{T} log(P_t/P_{t−1}),  (9.2)
and
σ̂²_a = (1/T) Σ_{t=1}^{T} (x_t − x_{t−1} − μ̂)².  (9.3)
To test (9.1), notice that it implies
x_t = 2μ + x_{t−2} + (ε_t + ε_{t−1}),  (9.4)
so the sample variance of x_t − x_{t−2} − 2μ should estimate 2σ², or (dividing the result by 2) we have an alternative variance estimator
σ̂²_b = (1/T) Σ_{k=1}^{T/2} (x_{2k} − x_{2k−2} − 2μ̂)².  (9.5)
Intuitively, if (9.1) is true, both estimators should converge to σ², and hence their ratio,
J_r = σ̂²_b / σ̂²_a,  (9.6)
should converge to 1. Indeed, Lo and MacKinlay (1988) show that
√T (J_r − 1)  asy∼  N(0, 2),  (9.7)
which says that the deviation of the ratio from 1, scaled by √T, is asymptotically normally distributed with mean 0 and variance 2. Since J_r is based on the ratio of two variances, it is known as the variance-ratio test.
Hence, if one finds from real data that √T (J_r − 1) is significantly different from 0 as judged by the above asymptotic normal distribution, one can reject the null hypothesis that (9.1) is true. Lo and MacKinlay reject the random walk hypothesis for US stock market indices.
If a stock's price or the market index follows a random walk, then there is no predictability whatsoever. Lo and MacKinlay's rejection of the random walk hypothesis opened the door for studying predictability.
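The variance-ratio computation of (9.2)-(9.7) can be sketched on a simulated random walk, where the null is true by construction, so J_r should be near 1 (all parameter values are made up):

```python
# Variance-ratio test sketch on a simulated random walk.
import math
import random

random.seed(42)
T = 10000
mu, sigma = 0.0003, 0.01   # assumed drift and volatility of log prices
x = [0.0]
for _ in range(T):
    x.append(x[-1] + mu + random.gauss(0, sigma))

# One-period estimates (9.2)-(9.3)
d1 = [x[t] - x[t - 1] for t in range(1, T + 1)]
mu_hat = sum(d1) / T
var_a = sum((d - mu_hat) ** 2 for d in d1) / T

# Two-period estimate (9.5), using non-overlapping differences
var_b = sum((x[2 * k] - x[2 * k - 2] - 2 * mu_hat) ** 2
            for k in range(1, T // 2 + 1)) / T

J_r = var_b / var_a
z = math.sqrt(T) * (J_r - 1) / math.sqrt(2)  # approx. standard normal under H0
```

Under the null, z behaves like a standard normal draw; on real data with mean-reverting or trending prices, |z| would be large and the random walk rejected.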
9.3 Limits to predictability
In the real world, competition for profits is clearly a strong force that eliminates any obvious predictability. One important point to make is that no matter how many resources investors put into studying the market, or no matter how hard they try, they may not get what they want, namely to beat the market: in fact, half of them will fail!
The reason is that the returns of all investors must sum to the market return,
(1/I) Σ_{i=1}^{I} R_i = index return,
where I is the total number of investors, and R_i is the return of investor i. It implies that roughly 50% will outperform and the other 50% will under-perform the index. If all investors are smart, half of them will still fail to predict the market, no matter what the latest AI software they use. Indeed, suppose they have all learned the best mathematical model from machine learning given all the past data; then the same (or a similar) model will no longer work if everyone uses it. The market will move in a way that is unpredictable (at least by that model), so that about half of the investors will lose.
This is unlike predicting the weather or earthquakes, where if we are all smart, we can all predict correctly. Not so with the market: to win, you need to beat the other 50% of players! Beating the market is a zero-sum game.
Ross (2005) provides the first theoretical bound on the degree of predictability when asset pricing models are true. However, his bound is too wide. Zhou (2010) and Huang and Zhou (2017) derive much tighter and binding bounds.
9.4 Predictive regressions
9.4.1 Basic model
When running regressions of current economic variables on past ones, we obtain the so-called predictive regressions. This is the simplest set-up in most applications.
For simplicity, consider running a regression of a single variable y_t on a predictor variable x_{t−1},
y_t = α + β x_{t−1} + ε_t,  t = 1, 2, . . . , T.  (9.8)
For example, when y_t is the return on a portfolio of common stocks and x_t is a dividend yield, book-to-price ratio, or a function of interest rates, Fama and Schwert (1977), Rozeff (1984), Keim and Stambaugh (1986), Campbell (1987), Fama and French (1988), Kothari and Shanken (1997) and Pontiff and Schall (1998), among others, find that β is significantly different from zero, i.e., there is predictability. Rapach, Strauss, and Zhou (2010, 2013) provide some of the most recent evidence on predictability, and Rapach and Zhou (2013, 2021) survey the literature.
It should be noted that the R² of the predictive regression is typically very small, usually less than 5%. This simply says that it is difficult to predict stock returns, or financial time series in general. Another point is that the OLS estimate of β is in general biased and has a sampling distribution that differs from that in the standard OLS regression. The reason is that the predictor x_{t−1} is a time series that is usually correlated over time. Stambaugh (1999) discusses the associated econometric theory.
To better understand the predictive regression, it is useful to contrast it with an explanatory regression, which regresses the current variable y_t on the current variable z_t,
y_t = α + β z_t + ε_t.  (9.9)
For example, the CAPM or market model regression uses the current excess return on the market to explain the excess return on a stock. Although this regression typically has a high R², say 80%, especially for large stocks, it has little use in forecasting excess stock returns unless one can forecast the market.
9.4.2 Out-of-sample performance
How do we assess the degree of predictability? Traditionally, one examines the statistical significance of the slope coefficient or the regression R² in the previous regression by using all the data, that is, running the regression from the beginning to the end of the sample period. This is problematic and is subject to a look-ahead bias, since it uses all the data to estimate the slope. As a result, the forecasts could not have been made and used in real time.
Recently, researchers focus more on the out-of-sample R² measure,
R²_OS = 1 − [Σ_{t=T0}^{T} (r_t − r̂_t)²] / [Σ_{t=T0}^{T} (r_t − r̄_t)²],  (9.10)
where r̂_t is the forecasted return from the predictive regression estimated through period t−1 only, r̄_t is the historical average forecast estimated from the sample mean through period t−1, and T0 is the first period the forecast is available (assume that one needs T0 − 1 data points for estimating the predictive regression). By definition, the out-of-sample R² uses the historical average forecast as the simple benchmark. Any model with predictability should do better than the sample mean r̄_t.
In practice, the regression has to be done at each time t, either recursively using all available data (an expanding window) or using data going back a fixed length, say ten years (a rolling fixed window). A positive R²_OS indicates that the predictive regression forecast beats the simple historical average. Hence, R²_OS > 0 implies predictability. A test of this hypothesis is discussed in the next subsection.
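A minimal expanding-window sketch of the out-of-sample R² of (9.10), with a single hypothetical predictor and data simulated to contain genuine predictability (all coefficients are made up):

```python
# Expanding-window out-of-sample R^2 sketch with one simulated predictor.
import random

random.seed(1)
T, T0 = 400, 120
x = [random.gauss(0, 1) for _ in range(T)]   # predictor
r = [0.0] * T
for t in range(1, T):
    r[t] = 0.5 * x[t - 1] + random.gauss(0, 1)  # return predicted by lagged x

def ols(ys, xs):
    """Simple OLS of ys on xs; returns (intercept, slope)."""
    n = len(ys)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
         / sum((xi - mx) ** 2 for xi in xs))
    return my - b * mx, b

sse_model, sse_mean = 0.0, 0.0
for t in range(T0, T):
    ys = [r[s] for s in range(1, t)]          # data through period t-1 only
    xs = [x[s - 1] for s in range(1, t)]
    a, b = ols(ys, xs)
    r_hat = a + b * x[t - 1]                  # predictive-regression forecast
    r_bar = sum(ys) / len(ys)                 # historical-average benchmark
    sse_model += (r[t] - r_hat) ** 2
    sse_mean += (r[t] - r_bar) ** 2

R2_OS = 1.0 - sse_model / sse_mean
```

Because the simulated returns really do load on the lagged predictor, R2_OS comes out positive here; with real monthly return data, values of even a fraction of a percent can matter economically.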
Welch and Goyal (2008) find in their comprehensive analysis that a large list of potential predictors from the literature are unable to deliver consistently superior out-of-sample forecasts of the U.S. equity premium relative to a simple forecast based on the historical average. The reason is that the regression model is not stable: the regression parameters change over time. So neither recursive nor rolling regressions can provide good out-of-sample forecasts. However, Rapach, Strauss, and Zhou (2013) and subsequent studies do find predictive power using either innovative methods or new predictors.
9.4.3 Statistical significance/tests
Computing both the in-sample R² and the out-of-sample R²_OS is valuable, but their values are estimated with errors. So statistical tests that account for these errors must be used to assert predictability.
For in-sample analysis, standard t-ratios or the Newey and West (1987) heteroskedasticity-adjusted standard error estimate can be used to assess the statistical significance of the slopes. Alternatively, one can use a confidence interval for the R², which is not analytically tractable for general distributions, but can be computed via the bootstrap (see, e.g., Huang et al (2020, p. 783)).
For out-of-sample analysis, the hypothesis of interest is
H_0: R²_OS ≤ 0  (9.11)
vs the alternative R²_OS > 0. This is often tested with Clark and West's (2007) test (see, e.g., Rapach et al (2010, p. 828)), which is an adjusted version of the Diebold and Mariano (1995) and West (1996) statistic, what they call the MSPE-adjusted statistic.
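A minimal sketch of the MSPE-adjusted calculation, assuming hypothetical series of realized returns, benchmark (historical-mean) forecasts, and model forecasts; the data-generating numbers are made up for illustration:

```python
# Clark-West MSPE-adjusted statistic sketch on hypothetical forecasts.
import math
import random

random.seed(3)
n = 200
signal = [random.gauss(0, 1) for _ in range(n)]
r = [s + random.gauss(0, 1) for s in signal]   # returns with predictability
f_bench = [0.0] * n                            # benchmark ("mean") forecasts
f_model = [0.9 * s for s in signal]            # model forecasts

# Adjusted loss differential, period by period
f = [(ri - b) ** 2 - ((ri - m) ** 2 - (b - m) ** 2)
     for ri, b, m in zip(r, f_bench, f_model)]
fbar = sum(f) / n
se = math.sqrt(sum((fi - fbar) ** 2 for fi in f) / (n - 1)) / math.sqrt(n)
cw_stat = fbar / se   # compare with a one-sided normal critical value, e.g., 1.645
```

Because the model forecasts here track the true signal closely, the statistic comfortably exceeds the 5% one-sided critical value.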
9.4.4 Economic significance
A strategy or proposition can be statistically significant but may not have sizable economic value. In practice, it is the economic value that is of key interest. Hence, for a given degree of predictability, an important question is whether it can bring any significant economic value.
How would asset allocation benefit from predictability? Kandel and Stambaugh (1996) and Barberis (2000) are examples of this line of research, finding that there are economic gains from using predictability. Of course, the degree of significance will vary from application to application.
Consider a mean-variance investor who allocates wealth between the stock market and the money market, and ask how the investor can benefit from the forecast. Recall from our earlier portfolio theory that the investor's optimal portfolio weight at t is
w_t = (1/γ) r̂_{t+1} / σ̂²_{t+1},  (9.12)
where γ is the investor’s coefficient of relative risk aversion, rˆt+1 is a predicted excess return (our
forecast), and σˆ2t+1 is a forecast of the excess return variance. In practice, we may restrict wt
to lie between −0.5 and 1.5, which imposes realistic portfolio constraints and produces better-
behaved portfolio weights given the well-known sensitivity of mean-variance optimal weights to
return forecasts.
The investor’s realized utility or certainty equivalent return is computed from
CER = R̄_p − (γ/2) σ̄²_p,  (9.13)
where R̄_p and σ̄²_p are the mean and variance, respectively, of the portfolio return over the forecast
evaluation period. The CER is the risk-free rate of return that an investor would be willing to
accept in lieu of holding the risky portfolio. The utility gain of using the forecast is then
Gain = CER − CER0,  (9.14)
where CER0 is computed in the same way as CER except that the excess return forecast is replaced by the no-predictability constant estimated from the historical average, i.e., the default expected excess return without using any predictors. The utility gain is then the difference in CER between the investor who uses the predictive regression forecast to guide asset allocation and the one who uses the prevailing mean benchmark forecast. We usually annualize the CER gain so that it can be interpreted as the annual portfolio management fee that the investor would be willing to pay to have access to the predictive regression forecast in place of the prevailing historical mean forecast. This is a common measure of the economic value of return predictability.
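A sketch of the CER gain computation of (9.12)-(9.14), under assumed values of γ, the variance forecast, and a made-up monthly return signal whose predictive power is deliberately exaggerated so that the gain is visible:

```python
# CER gain sketch: timing strategy (9.12) vs. constant-mean benchmark.
import random

random.seed(9)
gamma = 3.0       # assumed risk aversion
n = 240           # months in the evaluation period
sig2 = 0.02 ** 2  # assumed variance forecast for the excess return

# Hypothetical expected-return signal and realized excess returns; the
# predictability here is deliberately strong, purely for illustration.
signal = [random.gauss(0.005, 0.02) for _ in range(n)]
excess = [s + random.gauss(0, 0.02) for s in signal]

def cer(forecasts):
    """Realized certainty equivalent (9.13) of the timing strategy (9.12)."""
    rets = []
    for f, r in zip(forecasts, excess):
        w = max(-0.5, min(1.5, f / (gamma * sig2)))  # truncated weight
        rets.append(w * r)
    m = sum(rets) / n
    v = sum((x - m) ** 2 for x in rets) / n
    return m - 0.5 * gamma * v

gain = cer(signal) - cer([0.005] * n)  # forecast vs. constant-mean benchmark
annualized_gain = 12 * gain            # interpretable as an annual fee
```

With real data the forecast would come from an estimated predictive regression, and the annualized gain would typically be a few percent at most.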
Campbell and Thompson (2008) show that out-of-sample forecasts can be improved to beat the historical average both statistically and economically, once the usual predictive regressions are modified by placing some theoretically motivated restrictions on the coefficients. For example, we can require the regression slope on inflation to be negative: if we get a positive estimate, we simply set it to zero. They show further that, in terms of out-of-sample analysis, an R²_OS of 0.5% for monthly data typically implies economically significant utility gains.
Unfortunately, though, their approach works only for a few of the economic variables. At the beginning of the forecasting period, investors have no way of knowing which few out of the many will have good out-of-sample performance in the future. Hence, their study does not provide convincing evidence on out-of-sample predictability of the market.
Rapach, Strauss, and Zhou (2010) seem to be the first to provide consistent out-of-sample evidence of predictability, by applying the forecast combination method to all of the predictors. In addition, they show that predictability is concentrated in recessions. Henkel, Martin, and Nardari (2011) have a similar finding, and Cujean and Hasler (2017) explain it theoretically.
9.5 Forecasting with many predictors
The predictive regression works well with a few predictors. Standard time series models, such as AR and ARMA models (see, e.g., Tsay, 2010, Box, Jenkins, Reinsel and Ljung, 2016, and Brockwell and Davis, 2016), may improve the performance in this low-dimensional case. But none of these methods appears effective for a large number of predictors, the high-dimensional case. The latter is of great interest in finance. The reason is that the signal-to-noise ratio of a single predictor is often very low in predicting asset returns, and the hypothesis is that, with many low-signal predictors, predictability can be improved by incorporating all the information. This requires innovative methods.
We discuss four methods in this section, while leaving shrinkage approaches, such as LASSO, which are part of the popular machine learning toolkit, to the next chapter.
9.5.1 Forecast combination
Forecast combination is perhaps the simplest forecasting approach in the presence of many predictors. To see how it works, suppose we have 20 predictors. A standard regression forecast is obtained by running a regression on all 20 of them,
y_t = α + β_1 x_{t−1,1} + · · · + β_20 x_{t−1,20} + ε_t.  (9.15)
In practice, the sample size T is often small, and so the above regression can usually fit well in-sample but perform poorly out-of-sample (the over-fitting problem).
Forecast combination is an effective solution for many practical problems. Instead of running the regression on all the predictors, we run it on each one of them at a time, 20 times in total,
y_t = α_j + β_j x_{t−1,j} + ε_t,  j = 1, . . . , 20.  (9.16)
We get the coefficient estimates α̂_j and β̂_j, and then obtain 20 forecasts, one from each individual regression,

ŷ_{t+1,j} = α̂_j + β̂_j x_{t,j}, j = 1, . . . , 20.
The (final) forecast by using the forecast combination method is

ŷ_{t+1,comb} = (ŷ_{t+1,1} + ŷ_{t+1,2} + · · · + ŷ_{t+1,20}) / 20, (9.17)
which is an average of the individual forecasts. Statistically, this provides diversification and
shrinkage, and is robust to distributional assumptions.
As it turns out, the average forecast is an excellent one in practice. Bates and Granger (1969)
and others are early studies using the approach. Rapach, Strauss and Zhou (2010) show that it
can be effectively used to predict the stock market.
One can also consider using different weights on the individual forecasts. However, the simple average (equal weighting) tends to work well in most applications. Rapach, Strauss and Zhou (2010) show that it can provide consistently statistically and economically significant out-of-sample gains.
Mathematically, if the individual forecasts are unbiased, so is their average, and the average will have a smaller variance. This is similar to a standard portfolio selection problem: a portfolio of independent stocks will generally have smaller risk. However, if bad forecasts enter the average, the average forecast will clearly not be good. Hence, the implicit assumption behind using the average forecast is that all the individual forecasts are reasonably good, so that their average improves on them. Later we introduce a method, C-LASSO, that is designed to improve the average forecast by selecting the good forecasts out of many (which may include bad ones) and then averaging only the selected ones.
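As a concrete illustration, the combination forecast of equations (9.16)-(9.17) can be sketched in Python as follows. The function name and data layout are our own assumptions, not code from the papers cited above:

```python
import numpy as np

def combination_forecast(y, X):
    """Equal-weight forecast combination, as in equations (9.16)-(9.17).

    y : (T,) array of the target, y[t]
    X : (T, n) array of predictors; X[t-1, j] is used to predict y[t]
    Returns the combined forecast of y at time T+1.
    """
    T, n = X.shape
    forecasts = np.empty(n)
    for j in range(n):
        # univariate regression y_t = alpha_j + beta_j * x_{t-1,j} + eps_t
        A = np.column_stack([np.ones(T - 1), X[:-1, j]])
        alpha, beta = np.linalg.lstsq(A, y[1:], rcond=None)[0]
        forecasts[j] = alpha + beta * X[-1, j]   # individual forecast for T+1
    return forecasts.mean()                      # simple average, eq. (9.17)
```

Note that each predictor enters the final forecast only through its own univariate fit, which is what keeps the procedure stable when T is small.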
9.5.2 PCA or PCR
When there are many predictors, principal components analysis (PCA) is also a popular approach. It extracts a few components out of the many predictors and then forms the forecast based on these few composite predictors (linear combinations of the original ones). PCA thus reduces the dimensionality from many predictors to a few. Since the PCA is used here in a regression context, the procedure is also known as principal components regression (PCR).
Consider now a regression on n predictors (n is large),
y_t = α + β_1 x_{t−1,1} + · · · + β_n x_{t−1,n} + ε_t. (9.18)
Suppose n = 50 and we want to find only one predictor out of the 50. Which one to choose?
The PCA analysis on all the predictors x_{t,k} suggests that the first principal component is the dominant variable. Hence, it is a logical choice as the sole predictor,
f_t = ψ_1 x_{t,1} + · · · + ψ_{50} x_{t,50} = ψ′x_t, (9.19)
where ψ is the 50-vector given by the first eigenvector, corresponding to the largest eigenvalue, of X′X/T, and X is the T × 50 matrix of data on the predictors, de-meaned (the sample mean is subtracted so that each column has zero mean). Then, instead of running a regression on 50 predictors as in (9.18), we run a regression on only one predictor,
y_t = γ_0 + γ f_{t−1} + e_t, (9.20)
which is more stable and often provides better out-of-sample forecasts than (9.18).
Example 9.1 As demonstrated in class, the PCA sentiment index is,
St = 0.90x1 + 0.72x2 + 0.70x3 + 0.71x4 + 0.14x5,
where the loadings/coefficients are the first eigenvector of the sample covariance matrix of the de-
meaned 5 sentiment proxies xi’s. Since they are computed using all the data, it is an in-sample
result. In practice, we need a training sample period, say 120 months, to compute the loadings at
month 120, and then re-compute those in month 121 by adding the data in month 121. So, for the
realistic out-of-sample index, the loadings are estimated each month and so they vary over time. ♠
Of course, the PCA analysis usually suggests K important components. Then we can in general choose K predictors, each of which is a linear combination of the original predictors. Mathematically, the predictors x = (x_1, . . . , x_n)′ are equivalent to their n principal components (linear orthogonal transformations),

(f_1, . . . , f_n)′ = (Ψ_1, . . . , Ψ_n)′ x = Ψ′x,

where Ψ_j is the j-th eigenvector of X′X/T for j = 1, . . . , n. Then (9.18) can be written equivalently as
y_t = α + γ_1 f_{t−1,1} + · · · + γ_n f_{t−1,n} + ε_t. (9.21)

Note that the n new PCA predictors are uncorrelated and the regression is likely dominated by, say, the first K of them (while the rest have negligible variances). If we drop the terms after f_K, we obtain the forecast based on only K factors,

y_t = α + γ_1 f_{t−1,1} + · · · + γ_K f_{t−1,K} + e_t, (9.22)
which is the PCR in the general case. In matrix form,

Y = α 1_T + Xβ + ε, (9.23)

where Y is the T-vector of observations on the dependent variable, 1_T is the T-vector of ones, and X is the T × n matrix of data on the predictors. The PCA or PCR simply replaces the T × n data matrix X by a T × K matrix of data, i.e., K linear combinations of the original data, known as factors,

F = XΦ, T × K, (9.24)

where Φ is an n × K matrix of eigenvectors corresponding to the K largest eigenvalues of the sample covariance matrix of the predictors, X′X/T, with the data de-meaned. In vector form, let F_t be the K × 1 vector of factors at time t; then
F_t = Φ′X_t,
where X_t is the n × 1 vector of predictors at time t. Then, instead of running the large regression (9.23), we run

Y = α 1_T + FΛ + ε, (9.25)

on K < n variables, where the vector of loading coefficients Λ is K × 1. Since K can be much smaller than
n in practice, the PCA/PCR reduces the dimensionality substantially and can also often do much
better in forecasting than using too many predictors in the regression directly.
Stock and Watson (2002) provide a rigorous justification for the above procedure. Assume the data-generating process is

y_t = α + F_t′Λ + ε_t, (9.26)
X_t = β_F F_t + e_t, (9.27)
where F_t is a K-vector of latent factors and β_F is n × K. The first equation says that y_t is related to the K latent factors, and the second says that the n predictors are also related to F_t. In other words, one can interpret F_t as the true but latent predictors, and the second equation simply states how they are related to the observed ones. The above model is known as a factor forecasting model, which is popular in macroeconomics, where one can forecast GDP with many predictors that are related to the K driving factors.
Given the second equation, we can solve for the latent factors and their loadings by minimizing the model mean-squared error,

min_{β_F, F} (1/(nT)) Σ_{i=1}^{n} Σ_{t=1}^{T} (x_{it} − β_{F,i}′F_t)².
Mathematically, Stock and Watson (2002) show that, up to a rotation, the factors are given as above, and, as n becomes large, the PCA estimator of F converges to the true but unobserved K predictors.
Python code makes this easy to implement:

import numpy as np
from sklearn.preprocessing import scale   # scale gives each column zero mean and unit variance
from sklearn.decomposition import PCA

pca = PCA()
X_new = pca.fit_transform(scale(X))
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
Note that while the PCA is invariant to any orthogonal transformation of the data, it is sensitive to scaling. In practice, the PCA is therefore usually applied to scaled data with zero mean and unit variance. The output X_new is the transformed data, the F above with K = n, and the loadings are the eigenvectors multiplied by the square roots of the corresponding eigenvalues. We may use only the first few columns of X_new as our predictors, determining the optimal K by examining the eigenvalues, as in standard PCA analysis, or by cross-validation (see Section 10.6).
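Putting the pieces together, a minimal in-sample sketch of a PCR forecast might look as follows. The function name and data layout are hypothetical, and, as noted elsewhere in the chapter, a genuine out-of-sample exercise would re-estimate the scaling and the components recursively:

```python
import numpy as np
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def pcr_forecast(y, X, K):
    """Forecast y_{T+1} by regressing y_t on the first K PCs of the lagged predictors.

    y : (T,) target; X : (T, n) predictors aligned so X[t-1] predicts y[t].
    """
    F = PCA(n_components=K).fit_transform(scale(X))  # T x K factor matrix
    reg = LinearRegression().fit(F[:-1], y[1:])      # regression (9.22)
    return reg.predict(F[-1:])[0]                    # forecast for T+1
```

The key design choice is that only K regression slopes are estimated, no matter how large n is.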
9.5.3 sPCA
Huang, Jiang, Li, Tong and Zhou (2020) propose a scaled PCA (sPCA) approach to improve on the PCA. The idea is that the PCA does not use any information about what is to be predicted, and, as a result, the principal components can contain noise that is unrelated to the target. To mitigate this problem, we run a regression on each one of the predictors, as we do for the combination approach,

y_t = α_j + β_j x_{t−1,j} + ε_t, j = 1, . . . , n. (9.28)
Assume the predictors are standardized. The usual PCA uses

(x_1, x_2, . . . , x_n)
to find the PC components. In contrast, the sPCA uses the scaled predictors,
(β̂_1 x_1, β̂_2 x_2, . . . , β̂_n x_n).
So sPCA weights each predictor by its relevance to the target. The more useful ones will have
greater weights, and the less useful ones will have smaller weights.
In the language of machine learning, PCA is an unsupervised learning method that does not use any information about the forecasting target. In contrast, sPCA is supervised, using the relevant information to over- or under-weight the predictors. As a result, it is not surprising that sPCA typically outperforms PCA in practice.
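A minimal sketch of the sPCA construction, under our own naming and data-layout assumptions (this is an illustration, not the authors' code), is:

```python
import numpy as np
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA

def spca_factors(y, X, K=1):
    """Scaled PCA: weight each standardized predictor by its predictive slope
    from the univariate regression (9.28), then extract principal components.

    y : (T,) target; X : (T, n) predictors aligned so X[t-1] predicts y[t].
    Returns a (T, K) matrix of sPCA factors.
    """
    Z = scale(X)                                        # standardized predictors
    betas = np.array([np.polyfit(Z[:-1, j], y[1:], 1)[0]
                      for j in range(Z.shape[1])])      # slope of y_t on x_{t-1,j}
    return PCA(n_components=K).fit_transform(Z * betas) # PCs of (beta_j * x_j)
```

A predictor that is irrelevant to the target gets a near-zero slope and therefore contributes little variance to the extracted components.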
9.5.4 Partial least squares
The partial least squares (PLS) method, pioneered by Wold (1966, 1975) and extended by Kelly and Pruitt (2013, 2014), provides another way, besides the PCA, to extract a few predictors as linear combinations of many original predictors. As it turns out, it is particularly useful in many finance applications.
Following Hastie (2018, p. 81), the idea is similar to the PCA: we want to replace the original forecasting equation by one using K ≪ n predictors,

y_t = α + γ_1 z_{t−1,1} + · · · + γ_K z_{t−1,K} + e_t, (9.29)

where each z_k is a linear combination of (x_1, . . . , x_n) and the z_k's are mutually uncorrelated. Consider, for example,
how we obtain the first PLS predictor,
z_1 = φ_{1,1} x_1 + · · · + φ_{1,n} x_n, (9.30)

where φ_{1,1}, . . . , φ_{1,n} are the linear combination coefficients to be determined. Unlike the PCA, we now
want to use the information in y_t. The simplest and most intuitive way is to weight each x_j by its correlation with y_t: the greater the correlation, the more important the predictor is for forecasting. Since the predictors are often standardized to have zero mean and unit variance in PLS, we can let

φ_{1,j} = cov(x_j, y), (9.31)
which is easily estimated by the sample covariance between xt−1,j and yt. Then z1 can be computed,
and the one-factor PLS regression is
y_t = α + γ_1 z_{t−1,1} + e_t, (9.32)
which is easily run. However, it should be noted that, for out-of-sample forecasting, this regression
has to be run recursively and the predictors have to be standardized at each time t. In other words,
one has to make sure no future information is used at time t.
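The one-factor PLS forecast can be sketched as follows. This is an in-sample illustration with hypothetical names; as just noted, a true out-of-sample version would standardize the predictors and re-estimate everything recursively at each t:

```python
import numpy as np

def pls_one_factor_forecast(y, X):
    """One-factor PLS forecast: weight each standardized predictor by its
    sample covariance with the target, as in (9.30)-(9.32).

    y : (T,) target; X : (T, n) predictors aligned so X[t-1] predicts y[t].
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize predictors
    phi = np.array([np.cov(Z[:-1, j], y[1:])[0, 1]  # phi_{1,j} = cov(x_j, y), eq (9.31)
                    for j in range(Z.shape[1])])
    z = Z @ phi                                     # z_1 in eq (9.30)
    slope, intercept = np.polyfit(z[:-1], y[1:], 1) # regression (9.32)
    return intercept + slope * z[-1]
```

Since z_1 enters a regression with a free slope, its overall scale is irrelevant; only the relative covariance weights matter.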
Interestingly, there is a simple link between the PLS and the average forecast in the one-factor case. Since the predictors are standardized, φ_{1,i} is the slope of the regression of y on x_i, and so the forecast based on x_i is

ŷ_{it} = ȳ + φ_{1,i} x_{t−1,i},

where ȳ is the sample mean up to time t. Hence, the average forecast is

ŷ_t^{av} = ȳ + z_1/n,

where the 1/n scaling is immaterial because it is absorbed by the regression slope.
Moreover, the PLS regression (9.32) can be written as

y_t = (α − γ_1 ȳ) + γ_1 (ȳ + z_{t−1,1}) + e_t, (9.33)

so the PLS can be interpreted as a blend of the average (combination) forecast with the sample mean (the intercept term is a function of the sample mean, since α = E(y) when the predictors are standardized).
The link to the average forecast is also derived by Lin, Wu and Zhou (2018), as a special case of an iterated combination method. Let r̂^{MC}_{t+1} be the standard mean combination forecast. Consider re-combining it with the historical average forecast,

r_{t+1} = (1 − δ) r̄ + δ r̂^{MC}_{t+1} + u_{t+1}, (9.34)
where u_{t+1} is the noise. Mathematically, our objective is to solve the following optimization problem,

min_δ E_t(r_{t+1} − r̂_{t+1})² = E_t[r_{t+1} − (1 − δ) r̄ − δ r̂^{MC}_{t+1}]². (9.35)
In the special case of δ = 0, the combination implies that r̂^{MC}_{t+1} has no information whatsoever. When δ = 1, it is unnecessary to use the information in r̄ to improve r̂^{MC}_{t+1}. Theoretically, there exists a δ that makes the iterated combination better than either r̄ or r̂^{MC}_{t+1}. The optimal δ can be solved easily
from the first-order condition of the objective function,
δ* = cov_t(r_{t+1} − r̄, r̂^{MC}_{t+1} − r̄) / var_t(r̂^{MC}_{t+1} − r̄). (9.36)
In Lin, Wu and Zhou's (2018) applications, δ* is generally greater than 1. Mathematically, the iterated combination starting from the average forecast is equivalent to the PLS in the one-factor case.
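A sample-analogue sketch of the optimal δ in (9.36) might look as follows; the function name is our own, and the conditional moments are replaced by simple sample moments:

```python
import numpy as np

def iterated_combination_delta(r, r_mc, r_bar):
    """Sample analogue of delta* in (9.36).

    r     : (T,) realized returns r_{t+1}
    r_mc  : (T,) mean combination forecasts
    r_bar : scalar or (T,) historical average forecasts
    """
    num = np.cov(r - r_bar, r_mc - r_bar)[0, 1]  # cov(r - rbar, rhat - rbar)
    den = np.var(r_mc - r_bar, ddof=1)           # var(rhat - rbar)
    return num / den
```

When the combination forecast equals the realized return exactly, the formula gives δ* = 1, so the historical average receives no extra weight; noisier forecasts push δ* away from 1.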
Huang et al. (2015) provide an empirical application of the PLS, extracting an investor sentiment index that is relevant for forecasting the stock market. Assume that the true sentiment S_t is unobservable, though it is related to the stock return in the standard prediction regression,

R_{t+1} = α + β S_t + ε_{t+1}, (9.37)
where ε_{t+1} is the residual, unforecastable and unrelated to S_t. Let x_t = (x_{1,t}, . . . , x_{N,t})′ denote an N × 1 vector of individual investor sentiment proxies at period t, such as the closed-end fund discount rate, share turnover, and the number of initial public offerings (IPOs). Assume that the proxies are related to the true sentiment by
x_{i,t} = η_{i,0} + η_{i,1} S_t + η_{i,2} E_t + e_{i,t}, i = 1, . . . , N, (9.38)

where η_{i,1} is the factor loading that summarizes the sensitivity of sentiment proxy x_{i,t} to movements in S_t, E_t is the common approximation-error component of all the proxies that is irrelevant to returns, and e_{i,t} is the idiosyncratic noise associated with measure i only.
The PLS extracts S_t from the above equations based on data on the proxies. Mathematically, the T × 1 vector of the estimated investor sentiment index, S^{PLS} = (S^{PLS}_1, . . . , S^{PLS}_T)′, can be computed as

S^{PLS} = X J_N X′ J_T R (R′ J_T X J_N X′ J_T R)^{−1} R′ J_T R, (9.39)

where X denotes the T × N matrix of the standardized (each column has zero mean and unit variance) individual investor sentiment measures, X = (x_1′, . . . , x_T′)′, R = (R_2, . . . , R_{T+1})′ is a T × 1 vector of excess stock returns, J_T = I_T − (1/T) ι_T ι_T′ and J_N = I_N − (1/N) ι_N ι_N′. Mathematically, this is the same index as we obtained earlier. Huang et al. (2015) show that the PLS index works much better than the popular PCA investor sentiment index of Baker and Wurgler (2006).
How do we get the second PLS factor, and so on? With z_1, our forecast is

ŷ^{(1)} = ȳ + γ_1 z_1,

where γ_1 is the regression slope on z_1. Let

x_j^{(1)} = x_j − [cov(z_1, x_j)/var(z_1)] z_1,

i.e., x_j^{(1)} is the previous predictor after removing the effect of z_1 (so that the newly extracted predictor z_2 will be uncorrelated with z_1). Then we compute, similarly to z_1,
z_2 = φ_{2,1} x_1^{(1)} + · · · + φ_{2,n} x_n^{(1)}, (9.40)
with

φ_{2,j} = cov(x_j^{(1)}, y − ŷ^{(1)}), (9.41)

where y − ŷ^{(1)} is the part of the target not yet captured by the first factor.
Then we obtain a new forecast,

ŷ^{(2)} = ŷ^{(1)} + γ_2 z_2. (9.42)

We keep iterating until step K + 1, when ŷ^{(K+1)} differs little from ŷ^{(K)}; then we use K factors, and the final forecast is clearly a function of z_1, . . . , z_K.
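The iteration above can be sketched as follows; names and data layout are our own, and note how each deflation step makes the remaining predictors, and hence the next factor, exactly orthogonal to the current one:

```python
import numpy as np

def pls_factors(y, X, K):
    """Extract K PLS factors by repeated covariance weighting and deflation,
    following the scheme in (9.40)-(9.42).

    y : (T,) target; X : (T, n) predictors aligned so X[t-1] predicts y[t].
    Returns a (T, K) matrix whose columns are z_1, ..., z_K.
    """
    Xk = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized predictors
    ydm = y[1:] - y[1:].mean()                  # de-meaned target
    factors = []
    for _ in range(K):
        phi = Xk[:-1].T @ ydm / (len(y) - 1)    # phi_{k,j} ~ cov(x_j^{(k-1)}, y)
        z = Xk @ phi                            # current factor z_k
        factors.append(z)
        # deflate: x_j <- x_j - [cov(z, x_j)/var(z)] z, so the next factor is
        # uncorrelated with z
        Xk = Xk - np.outer(z, Xk.T @ z / (z @ z))
    return np.column_stack(factors)
```

After the deflation, every column of Xk is exactly orthogonal to z, which is what guarantees z_2 ⟂ z_1 in the next pass.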
Theoretically, Helland and Almøy (1994) provide an asymptotic theory for the PLS with n fixed while T goes to infinity, while most later theories require both to be large. Cook and Forzani (2019) provide some of the latest analysis, and Cook and Forzani (2021) provide a nonlinear extension of the PLS. Kelly and Pruitt (2013, 2014) extend the PLS and provide an asymptotic theory in which both n and T go to infinity.
9.5.5 PLS: m > 1

Previously, we had only one target variable to forecast. When Y is multivariate, the PLS algorithm is more complex. There are various modifications, but the popular and primary ones are the original NIPALS (Wold, 1975) and the later SIMPLS (de Jong, 1993). Both become the same in the one-dimensional case.
To motivate the problem, we may want to use the same set of variables to forecast m stock returns simultaneously. For example, when m = 2, we have

y_{1t} = α_1 + β_{11} x_{t−1,1} + · · · + β_{1n} x_{t−1,n} + ε_{1t}, (9.43)
y_{2t} = α_2 + β_{21} x_{t−1,1} + · · · + β_{2n} x_{t−1,n} + ε_{2t}, (9.44)
that is, we want to use the same x_j's to predict the two targets y_1 and y_2. In general, we can write the problem in matrix form,

Y = Xβ + ε, (9.45)

where Y is the T × m matrix of observations on the m dependent variables, X is T × n as before, and β is the n × m matrix of regression coefficients. Note that the alphas are zero in the above equation because, following common practice, we assume that both Y and X are de-meaned. Our objective is still to seek a lower-dimensional matrix F, T × K, and run

Y = FΛ + e, (9.46)
to obtain a stable out-of-sample forecast.
How do we reduce the high-dimensional X to a low-dimensional F? The PLS algorithm essentially makes the following decomposition of the data,

X = V W + E_1, (9.47)
Y = U Q + E_2, (9.48)

where V and U are T × K (known as scores or factors), W is K × n and Q is K × m (known as orthogonal loading matrices), and E_1 and E_2 are errors. The factors are chosen so that the correlation between V and U is maximized.
Consider how to obtain the first PLS factor/component. The original algorithm is difficult to understand (though it may be computationally efficient), so here we follow the eigenvalue approach (see, e.g., Ng, 2013) to convey the ideas. Let w_1 and q_1 be n- and m-vectors; we want to choose them to maximize

f(w_1, q_1) = corr(x′w_1, y′q_1),

where x is the n × 1 vector of predictors and y is the m × 1 vector of dependent variables. The solutions w_1 and q_1 are not unique unless we normalize them, say, to unit length,

‖w_1‖² = w_1′w_1 = 1, ‖q_1‖² = q_1′q_1 = 1.
Then w_1 and q_1 are the normalized first eigenvectors (corresponding to the largest eigenvalues) of two matrices,

X′Y Y′X w_1 = λ w_1, (9.49)
Y′X X′Y q_1 = γ q_1. (9.50)

Then it is clear that

V_1 = X w_1, U_1 = Y q_1,

which compose the first terms in decompositions (9.47) and (9.48). If we care about only one PLS factor, V_1 can serve the purpose.
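To make the eigenvalue construction concrete, here is a small sketch; the helper is our own, and X and Y are assumed to be de-meaned data matrices:

```python
import numpy as np

def first_pls_directions(X, Y):
    """First PLS weight vectors w1, q1 from the eigenvalue equations (9.49)-(9.50).

    X : (T, n) de-meaned predictors; Y : (T, m) de-meaned targets.
    The scores are then V1 = X @ w1 and U1 = Y @ q1.
    """
    A = X.T @ Y @ Y.T @ X             # n x n matrix in (9.49), symmetric PSD
    B = Y.T @ X @ X.T @ Y             # m x m matrix in (9.50), symmetric PSD
    w1 = np.linalg.eigh(A)[1][:, -1]  # unit eigenvector of the largest eigenvalue
    q1 = np.linalg.eigh(B)[1][:, -1]
    return w1, q1
```

Both A and B are symmetric positive semi-definite, so `eigh` applies and the largest-eigenvalue eigenvectors are well defined up to sign, which the subsequent regression absorbs.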
To get the second factor, we update X and Y with

X := X − V_1 w_1′, (9.51)
Y := Y − U_1 q_1′, (9.52)
to remove the effects of the first factor, and then repeat the same process to obtain the second factor. We can continue in the same way to get the remaining factors, until V_K stops changing in value.
The Python code that implements the PLS is as simple as that for the PCR:

from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=K)  # number of PLS components, say K = 2
pls.fit(X, Y)
X_new = pls.transform(X)
Y_pred = pls.predict(X)
The transformed X_new is what we use to forecast all the y's, and the last output is the in-sample forecast. To make the code work in practice for out-of-sample forecasting, we need to run the above recursively over time, or train the model in a training period and then use the fitted parameters in the subsequent test period without re-estimation.
Theoretically, it is of interest to see how decompositions (9.47) and (9.48) work in the one-target case (m = 1). In this case, Y′XX′Y is a number. Let w_1 = cX′Y, where c is the normalization constant that makes ‖w_1‖ = 1; then equation (9.49) becomes

X′Y (Y′X X′Y) c = λ c X′Y, (9.53)

so λ = Y′XX′Y. In other words, X′Y is the loading (up to scale), which is exactly the slope vector of the combination forecast. Adding back the mean in the regression gives the same PLS factor as before.
Note also that the PCA factor is obtained simply from (9.49) without the Y,

X′X w_1 = λ w_1,

so w_1 is the first eigenvector of X′X, which is the same as that of X′X/T since the scaling does not affect the eigenvector, and the first PCA factor is X w_1. The second PCA factor simply replaces w_1 by w_2, the second eigenvector. In contrast, the second PLS factor is more difficult to obtain, as one has to run the entire process all over again on the updated X and Y.
9.6 Common time-series predictors
There are many time-series predictors that researchers have used to predict the stock market or
individual stock returns over time. Here we focus on some of the major ones that are used to predict
the market or major indices/sectors. There are even more predictors that are used for cross-sectional prediction of individual assets or asset classes (see Chapter 11).
9.6.1 Macroeconomic variables
The following 15 well-known macroeconomic predictors are used by Welch and Goyal (2008) and many others:
1. Dividend-price ratio (log), D/P : Difference between the log of dividends paid on the S&P
500 index and the log of stock prices (S&P 500 index), where dividends are measured using
a one-year moving sum.
2. Dividend yield (log), D/Y : Difference between the log of dividends and the log of lagged
stock prices.
3. Earnings-price ratio (log), E/P : Difference between the log of earnings on the S&P 500 index
and the log of stock prices, where earnings are measured using a one-year moving sum.
4. Dividend-payout ratio (log), D/E: Difference between the log of dividends and the log of
earnings.
5. Stock variance, SV AR: Sum of squared daily returns on the S&P 500 index.
6. Book-to-market ratio, B/M : Ratio of book value to market value for the Dow Jones Industrial
Average.
7. Net equity expansion, NTIS: Ratio of twelve-month moving sums of net issues by NYSE-
listed stocks to total end-of-year market capitalization of NYSE stocks.
8. Treasury bill rate, TBL: Interest rate on a 3-month Treasury bill (secondary market).
9. Long-term yield, LTY : Long-term government bond yield.
10. Long-term return, LTR: Return on long-term government bonds.
11. Term spread, TMS: Difference between the long-term yield and the Treasury bill rate.
12. Default yield spread, DFY : Difference between BAA- and AAA-rated corporate bond yields.
13. Default return spread, DFR: Difference between long-term corporate bond and long-term
government bond returns.
14. Inflation, INFL: Calculated from the CPI (all urban consumers); following Welch and Goyal (2008), since inflation data are released in the following month, we need to use a suitable lag for inflation.
15. Investment-to-capital ratio, I/K: Ratio of aggregate (private nonresidential fixed) investment
to aggregate capital for the entire economy.
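As a small illustration, a few of these predictors can be computed from hypothetical input series as follows; the variable names are placeholders for the underlying data, not references to any actual data file:

```python
import numpy as np

def simple_goyal_predictors(dividends_12m, prices, tbl, lty, baa, aaa):
    """Sketch of three Welch-Goyal predictors; all inputs are aligned 1-D arrays.

    dividends_12m : one-year moving sum of S&P 500 dividends
    prices        : S&P 500 index level
    tbl, lty      : T-bill rate and long-term government bond yield
    baa, aaa      : BAA- and AAA-rated corporate bond yields
    """
    dp = np.log(dividends_12m) - np.log(prices)  # dividend-price ratio (log)
    tms = lty - tbl                              # term spread
    dfy = baa - aaa                              # default yield spread
    return dp, tms, dfy
```

The remaining predictors are constructed analogously from the series described in the list above.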
9.6.2 Technical variables
Technical indicators, such as moving averages of prices, have been widely used by practitioners
who use past price and volume patterns to identify price trends believed to persist into the future.
Neely, Rapach, Tu and Zhou (2014) examine 14 technical indicators in three categories and find
that they are as important as macroeconomic variables. Moreover, they are complementary to the predictive power of macroeconomic variables, and so the use of both can improve the predictability of the market substantially.
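As an example of how such an indicator is typically defined, here is a generic moving-average trend rule; this is a common textbook form, and the exact indicators examined by Neely et al. (2014) differ in their details:

```python
import numpy as np

def ma_buy_signal(prices, short=3, long=12):
    """Generic moving-average rule: signal 1 (buy) when the short moving
    average is at or above the long moving average, else 0 (sell)."""
    p = np.asarray(prices, dtype=float)
    return int(p[-short:].mean() >= p[-long:].mean())
```

In an uptrend, recent prices exceed the long-run average and the rule signals a buy; in a downtrend it signals a sell.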
9.6.3 Investor sentiment
Baker and Wurgler (2006) propose 6 proxies for investor sentiment and use them to explain re-
turns on small stocks, young stocks, high volatility stocks, unprofitable stocks, non-dividend-paying
stocks, extreme growth stocks, and distressed stocks. However, their sentiment index (the first prin-
cipal component of the proxies) does not predict the market. Huang, Jiang, Tu and Zhou (2015)
construct a sentiment index, using partial least squares (PLS) instead of PCA, and find that the
resulting index is a powerful predictor of the stock market.
Jiang, Lee, Martin and Zhou (2019) construct a manager sentiment index, extending the scope
of investor sentiment, based on the aggregated textual tone of conference calls and financial statements, and find that it negatively predicts future aggregate earnings and cross-sectional stock returns,
particularly for those firms that are either hard to value or difficult to arbitrage. In addition, Chen,
Tang, Yao, and Zhou (2021) propose an employee sentiment index and find its negative predictive
power on the stock market.
Edmans, Fernandez-Perez, Garel and Indriawan (2021) recently propose a real-time, continuous
measure of national sentiment based on the positivity of songs that individuals choose to listen to.
The music sentiment measure is language-free and thus comparable globally. They find that it is positively correlated with same-week stock market returns and negatively correlated with next-week returns. This is consistent with the notion that sentiment-induced mispricing will eventually be corrected by the market. On sentiment in general, Zhou (2018) provides a review of the literature.
9.6.4 Investor attention
Chen, Tang, Yao and Zhou (2020) propose an investor attention index based on 12 individual attention proxies in the literature, and find it has significant power in predicting the stock market risk premium,
both in-sample and out-of-sample. Moreover, the index can deliver sizable economic gains for mean-
variance investors in asset allocation. They explain that the predictive power of investor attention
primarily stems from the reversal of temporary price pressure and from the stronger forecasting
ability for high-variance stocks.
9.6.5 Short interest
The finance literature largely agrees that short sellers are informed traders who earn excess returns
in compensation for processing firm-specific information (see, e.g., Boehmer, Jones, and Zhang,
2008). Rapach, Ringgenberg and Zhou (2016) construct a short interest index and find that it
is a strong predictor of the aggregate stock returns, outperforming a host of popular return pre-
dictors from the literature in both in-sample and out-of-sample tests. They further find that the
predictability of the short sellers is due to their informed anticipations of future aggregate cash
flows. The information content of short selling thus appears more economically important than previously thought.
Recently, Chen, Da and Huang (2021) propose a measure of short selling efficiency (SSE), the slope coefficient from cross-sectionally regressing abnormal short interest on a mispricing
score. They find that SSE ,significantly and negatively predicts stock market returns both in-
sample and out-of-sample, suggesting that mispricing gets corrected after short sales are executed
on the right stocks. They also show conceptually and empirically that SSE has favorable predictive
ability over aggregate short interest, as SSE reduces the effect of noise in short interest and better captures the amount of aggregate short-selling capital devoted to overpricing.
9.6.6 Corporate activities
While all the above predictors, except the manager sentiment, have little to do with what firms are actually doing, Lie, Meng, Qian and Zhou (2017) focus on an aggregate index of corporate activities, and find that it has substantially greater predictive power both in- and out-of-sample and yields much greater economic gains for a mean-variance investor than the macroeconomic predictors. The predictive ability of the corporate index stems from its information content about future cash flows. Cross-sectionally, the corporate index performs particularly well for stocks with great information asymmetry. The corporate activities cover five major categories of corporate or managerial activities: aggregate security issues, share repurchases, corporate investments, merger activity and payments, and insider trading, with 13 measures:
• Percentage of stock payment, COMPCT: the aggregate amount of stock payment divided by
the sum of the aggregate amount of stock payment and cash payment (in percentage points);
• Total stock payment (log), COM: the natural log of the aggregate amount of stock payment (the dollar amounts, in millions, are deflated to 1986 dollars);
• Net Transactions, NT: the aggregate number of open market purchases minus the aggregate
number of open market sales (in thousands);
• Net Dollar Amount, NDA: the aggregate amount of open market purchases minus the ag-
gregate amount of open market sales (the dollar amounts, in billions, are deflated to 1986
dollars);
• Ratio of Net Purchases, RT: the aggregate number of open market purchases divided by the
sum of the aggregate number of open market purchases and the aggregate number of open
market sales (in percentage points);
• Ratio of Net Purchasing Dollar Amount, RDA: the aggregate amount of open market pur-
chases divided by the sum of the aggregate amount of open market purchases and the aggre-
gate amount of open market sales (in percentage points).
• CAPX scaled by ME, CAPXME: aggregate capital expenditures scaled by total market capi-
talization (in percentage points);
• CAPX scaled by AT, CAPXAT: aggregate capital expenditures scaled by average total assets
(in percentage points).
• Change in net operating asset scaled by ME, ALME: The change in net operating asset plus
R&D scaled by total market capitalization (in percentage points);
• Change in net operating asset scaled by AT, ALAT: The change in net operating asset plus
R&D scaled by average total assets (in percentage points).
• Total Equity Issuance (log), E: the natural log of equity issuance (the dollar amounts, in millions, are deflated to 1986 dollars);
• Ratio of Equity Issuance, S: equity issuance scaled by the sum of equity and debt issuance
(in percentage points).
• Aggregate share repurchases (log), REP: the natural log of aggregate share repurchases (in millions of 1986 dollars).
9.6.7 Option market
Bollerslev, Tauchen, and Zhou (2009) show that the difference between implied and realized vari-
ance, or the variance risk premium, can predict the market. In the recovery literature, Ross (2015)
pioneers a theory to recover the entire physical distribution of market returns from options writ-
ten on the S&P 500 index. Subsequent studies focus on recovering asset expected returns from
option prices under normal market conditions and over a relatively long period. In particular,
Martin (2017) provides an estimate of the future expected market return. Extending this framework to events such as the Federal Open Market Committee (FOMC) meetings, Liu, Tang and Zhou
(2021) provide a method to estimate the conditional market risk premium.
The option market is forward-looking and, due to the embedded leverage, an ideal venue for informed trading; therefore, there are likely many option predictors. However, perhaps because the data are relatively difficult to process, option predictors are under-studied so far and may yield more research in the future.
9.6.8 Others
Dong, Li, Rapach and Zhou (2021) find that there is a link between cross-sectional predictability and time-series predictability. In particular, they use 100 representative anomaly portfolio returns to forecast the market excess return, and show that, for the 1985:01–2017:12 out-of-sample period, a C-Mean forecast based on the 100 anomalies generates an out-of-sample R² = 0.89% (significant at the 1% level) and an annualized CER gain of 289 basis points for a mean-variance investor with a relative risk aversion coefficient of three. Economically, they attribute the predictive power to asymmetric limits of arbitrage and overpricing correction persistence.
Chang, Chu, Tu, Zhang and Zhou (2021) propose an environmental, social, and governance
(ESG) index. They find that it has significant power in predicting the stock market risk premium,
both in- and out-of-sample, and delivers sizable economic gains for mean-variance investors in asset
allocation. Although the index is extracted by using the PLS method, its predictability is robust
to using alternative machine learning tools. They find further that the aggregate of the environmental variables captures short-term forecasting power, while that of the social or governance variables captures long-term power. The predictive power of the ESG index stems from both cash flow and discount rate channels.
In the bond market, there are also many studies on predictability. Based on a linear combination of five forward rates, Cochrane and Piazzesi (2005) find a much higher predictive R², between 30% and 35%, for the risk premia on short-term bonds with maturities ranging from two to five years (unlike stocks, whose predictive R²'s are very small). Interestingly, Ludvigson and Ng (2009) demonstrate that the impressive predictive power found by Cochrane and Piazzesi (2005) can be improved further with five additional macroeconomic factors estimated from a set of 132 macroeconomic variables that measure a wide range of economic activities. Goh, Jiang, Tu and Zhou (2012) show, however, that the high predictability only generates economic gains comparable to those in the stock market. The reason is that the bond risk premia are much smaller than the stock market risk premia. They also provide another intriguing result: the technical indicators of the bond market predict much better than Ludvigson and Ng's (2009) five macro factors estimated from the set of 132 macroeconomic variables (in contrast, in the stock market, as shown by Neely, Rapach, Tu and Zhou (2014), their predictive powers are comparable).
© Zhou, 2021 Page 227
The above predictors are largely at the aggregated level and are time series predictors used to
predict the market return or other economic variables over time. On the other hand, there are
many firm characteristics that can be used to forecast returns in the cross section. This will be
discussed later.
9.7 Mixed-frequency predictors
There are times when the predictors are observed at different frequencies: some are available
monthly and some quarterly, for example. The question is how the monthly information helps to
provide better quarterly forecasts. Conversely, one can also ask how to use quarterly information
to improve monthly forecasts. There are mainly three approaches, which Ghysels and Marcellino's
(2018) book discusses in detail.
9.8 Nowcasting
It should be noted that most forecasting studies are based on low-frequency economic data:
published works are mostly at the monthly frequency, followed by the quarterly frequency when
accounting data are used. Higher-frequency studies are much rarer. Jiang, Li and Wang (2020) is an
example of forecasting daily returns using firm news, and Gao, Han, Li and Zhou (2018) is an example
of intraday forecasting.
Nowcasting in economics is about forecasting the present, the very near future and explaining
the very recent past. The term is a contraction of “now” and “forecasting,” and has long been used
in weather forecasting for a very short-term mesoscale period of up to two hours according to the
World Meteorological Organization, and up to six hours according to some others in the field.
It has recently become popular in economics to provide a real time assessment of the economy
such as GDP, which is usually determined after a long delay and is also subject to revisions.
See, e.g., Bok, Caratelli, Giannone, Sbordone and Tambalotti (2017) and references therein.
López de Prado (2020b) argues for the importance of nowcasting in explaining the substantial losses many
quantitative firms suffered as a result of the COVID-19 selloff. This is understandable as the low
frequency forecasting implicitly assumes the stationarity of the predictive model over a long period
of time (over years in monthly forecasting, for example), which is not true (if everyone uses the same or
similar models, the models will break down too; see Section 9.3), especially during sudden extreme
shocks in the market or in the economy. As a result, forecasting in a very short time window, say
a few hours away, is likely more accurate than a forecast of what is going to happen a month from
today.
Perhaps nowcasting is most useful for using current and recent past data to identify a particular
regime in a timely fashion, say a quick shift from a normal market state to a crisis state. Then money
managers can react with their backup plans more effectively.
10 Machine Learning Tools
In this chapter, we apply some of the machine learning tools to finance, but focus primarily on
asset return predictions.
10.1 What is Machine Learning?
There are various definitions. We use the one most closely related to finance applications. Machine
learning (ML) is using machines (computers) to learn from data. So, ML is a particular form of
learning that involves both computers (codes/programs) and data. In finance, we often have or
assume a statistical model, such as a normal distribution, for the data, and so this part of ML is also
known as statistical learning.
There are mainly three types of ML: unsupervised learning, supervised learning, and reinforce-
ment learning. We explain all three briefly below.
10.2 Types of Machine Learning
10.2.1 Unsupervised learning
Unsupervised learning is to find patterns or hidden structures in data sets. Given 1000 stock
returns, are there any clusters, or can their dimensionality be reduced? What distribution,
data-generating process, or statistical model fits the data? In short, it is purely data analysis
without a user-specified objective.
10.2.2 Supervised learning
Supervised learning is to find a model or relation between data sets. For example, suppose we want to
forecast stock market returns using a set of economic indicators. How the economic indicators are
related to the returns is the question of interest. Finding the parameters of a linear regression of the
returns on the indicators is a common example of supervised learning. Here we have an objective of
minimizing the forecasting errors, and this objective determines/supervises the learning results
based on the data.
10.2.3 Reinforcement learning
Reinforcement learning (RL) is to find the best sequence of actions that will generate the optimal
outcome based on a reward/utility function and data. A robot trading system is an example of RL:
it monitors the stock market in real time and places buy and sell orders to maximize the terminal
return/profit.
10.3 A short literature review
Deisenroth, Faisal and Ong (2020) provide an excellent introduction along with the needed
mathematics. Bishop (2006) and Murphy (2012) offer deeper and yet easily accessible introductions. The
well-known text of Hastie, Tibshirani, and Friedman (2009) provides a more formal analysis. For
recent theory and applications, see, e.g., the books by Anthony and Bartlett (2009), Shalev-Shwartz
and Ben-David (2014), and Shi and Iyengar (2020). On Python implementations, the books of
Géron (2019) and Raschka and Mirjalili (2019) seem the best.
Machine learning (ML) tools have been receiving increasing attention from both hedge funds and
academic researchers in recent years. In finance, Rapach, Strauss and Zhou (2013) is perhaps the first
major study (published in a top finance journal) that applies LASSO (the least absolute shrinkage
and selection operator, Tibshirani 1996) to select predictors from a large set (“big data”) of
candidates for forecasting global stock markets monthly. Chinco, Clark-Joseph, and Ye (2019), perhaps
the first in the recent wave, use LASSO to analyze cross-firm return predictability at the one-minute
horizon. Kozak, Nagel and Santosh (2020) provide a Bayesian LASSO approach to shrink dimen-
sionality. Feng, Giglio, and Xiu (2020) focus on choosing factors, and Freyberger, Neuhierl, and
Weber (2020) study nonlinear effects. Gu, Kelly and Xiu (2020) apply a comprehensive set of ML
tools, including generalized linear models, dimension reduction, boosted regression trees, random
forests, and neural networks, to forecast individual stocks and their aggregates.
Han, He, Rapach and Zhou (2020) use combination and combination-LASSO methods to
identify which firm characteristics drive US stock returns, while Jiang, Tang and Zhou (2018)
study such issues for the Chinese stock market. Filippou, Taylor, Rapach and Zhou (2020) apply
LASSO and neural networks to predict foreign exchange rates, and Guo, Lin, Wu and Zhou conduct an
ML study of corporate bonds. Guida (2019) and Jurczenko (2020) provide collections of papers on ML
theory and its applications in finance, while Dixon, Halperin and Bilokon (2020) focus primarily on
explaining neural networks and their applications. López de Prado (2018, 2020a) analyzes some of
the practical issues (read these books, like others, with caution, as some claims may not be true).
Nagel (2021) discusses some of the major asset pricing studies and research issues. Giglio, Kelly
and Xiu (2021) provide a survey of recent advances.
10.4 Why penalized regressions?
Penalized regressions or similar methods are particularly useful in finance. To understand why
penalized regressions are of interest, we need to discuss first the bias-variance decomposition of an
estimator.
10.4.1 Bias-variance tradeoff
For simplicity, consider the predictive regression model,

    y_t = β x_{t−1} + ε_t,   t = 1, . . . , T,   (10.1)

where ε_t is iid normal with zero mean and variance σ².
The mean-squared error (MSE) of any estimator of β, say β̂, is defined as

    MSE ≡ E(β̂ − β)² = Bias²(β̂) + Var(β̂),   (10.2)

where the second equality follows from summing the two terms,

    Bias²(β̂) ≡ (Eβ̂ − β)² = (Eβ̂)² + β² − 2βEβ̂,   (10.3)
    Var(β̂) ≡ E(β̂ − Eβ̂)² = Eβ̂² − (Eβ̂)².   (10.4)

The MSE tells us how accurate our estimator is. The squared bias is simply the squared difference between
the expected value and the true value, and the variance measures how much the estimator can fluctuate
around its mean from sample to sample.
Equation (10.2) is known as the bias-variance decomposition and it shows that, to minimize the
MSE, there is in general a tradeoff between bias and variance. In other words, for some estimators
(we have many ways to estimate parameters), the first term is small, but the second is large; and
for other estimators, the reverse may be true. As far as the MSE is concerned, we want the sum
to be minimal.
The popular OLS estimator has zero bias, but it has a certain variance. To reduce the variance
of β̂, we may impose an upper bound on it. The estimator will then be biased, but its MSE can
potentially be smaller. Indeed, this is the case for many penalized regressions that impose restrictions
on the beta coefficients, leading generally to smaller MSEs.
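The tradeoff can be illustrated with a small simulation; all numbers below (the true β = 0.1, noise σ = 1, and the shrinkage factor 0.5) are illustrative assumptions, not from the model above:

```python
import numpy as np

rng = np.random.default_rng(0)
T, beta, sigma, n_sims = 50, 0.1, 1.0, 20000
x = rng.standard_normal(T)                # regressors, held fixed across simulations
sxx = np.sum(x ** 2)

err_ols, err_shrunk = [], []
for _ in range(n_sims):
    y = beta * x + sigma * rng.standard_normal(T)
    b_ols = np.sum(x * y) / sxx           # unbiased OLS slope
    b_shrunk = 0.5 * b_ols                # biased, but with 1/4 of the variance
    err_ols.append((b_ols - beta) ** 2)
    err_shrunk.append((b_shrunk - beta) ** 2)

mse_ols, mse_shrunk = np.mean(err_ols), np.mean(err_shrunk)
# The bias introduced by shrinking is small relative to the variance saved,
# so the shrunk estimator attains the lower MSE here.
```

With these settings, Var(β̂) ≈ σ²/T = 0.02 while the shrunk estimator adds a squared bias of only (0.5β)² = 0.0025, so its MSE is smaller; when the true β is large relative to the noise, the comparison reverses.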
10.4.2 Prediction error
The next question is why the MSE is important. This is because we want, in practice, our prediction
to be as close to the future realized value as possible, i.e., to minimize the prediction error. As it
turns out, the prediction error is tied to the MSE.
To see why, given an estimator β̂, the next-period predicted and true values are

    ŷ_{T+1} = β̂ x_T,   (10.5)
    y_{T+1} = β x_T + ε_{T+1},   (10.6)

respectively. At time T , we know our prediction ŷ_{T+1} but not y_{T+1}, as ε_{T+1} is random to us, and
so the expected mean-squared error of our prediction is

    Prediction Error = E(y_{T+1} − ŷ_{T+1})² = E(β − β̂)² × x_T² + σ².   (10.7)

Hence, to reduce the prediction error, we need to reduce the MSE of β̂. It is the MSE that matters
for prediction accuracy, not the bias alone.
10.4.3 Problems with many regressors
The bias-variance tradeoff becomes more important when there are many regressors or many betas,
because the estimation risk in the betas will then be greater, and hence their impact on the
prediction error will be greater too. In this case, imposing constraints on betas often helps in most
practical problems.
To see these points mathematically, consider the standard regression with n regressors (we count
the constant here),

    y_t = α + β_1 x_{t,1} + β_2 x_{t,2} + · · · + β_{n−1} x_{t,n−1} + ε_t,   ε_t ∼ N(0, σ²).   (10.8)

In vector form,

    Y = Xβ + e,   (10.9)

where Y is a T -vector of observations on the dependent variable, X is a T × n matrix of observations
on the regressors, and β is an n-vector of the regression coefficients. Recall that the common
OLS estimator is

    β̂_OLS = (X′X)⁻¹X′Y.   (10.10)

Note that the first column of X has all ones as the regression has an intercept.

The covariance of the OLS estimator is well known,

    cov(β̂_OLS) = σ²(X′X)⁻¹,   (10.11)
where σ² is the variance of the model residual under the standard iid assumption. When T is large,
X′X/T is close to the covariance matrix of the regressors, a constant matrix Σ_x, so that

    cov(β̂_OLS) ≈ σ² Σ_x⁻¹ / T.   (10.12)

The total variance of the estimators, the trace of this matrix, then grows at a rate proportional
to n. In other words, everything else equal, the more regressors there are, the less accurate the
estimates.
The expected mean-squared error of prediction is (see, e.g., Hastie, et al., 2009, p. 26, or one
can prove it directly),

    Expected Prediction Error = E(y_{T+1} − ŷ_{T+1})² = σ² n/T,   (10.13)

which grows proportionally to n.

Hence, when there are too many predictors relative to the sample size (T ), the standard linear
regression will not perform well, both in estimating the parameters and in forecasting.
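A quick simulation illustrates the point behind (10.13). The sketch below uses pure-noise regressors (true slopes all zero, an assumption made only for this illustration) and shows that the out-of-sample error of OLS deteriorates sharply as the number of regressors n approaches the sample size T:

```python
import numpy as np

rng = np.random.default_rng(0)
T, T_test, sigma = 100, 2000, 1.0

def ols_test_mse(n):
    # regressors carry no signal: the true slopes are all zero
    X = rng.standard_normal((T, n))
    y = sigma * rng.standard_normal(T)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    X_new = rng.standard_normal((T_test, n))
    y_new = sigma * rng.standard_normal(T_test)
    return np.mean((y_new - X_new @ beta_hat) ** 2)

mse_small, mse_large = ols_test_mse(2), ols_test_mse(90)
# mse_small stays near sigma^2, while mse_large is several times bigger:
# fitting 90 useless slopes with 100 observations badly hurts out-of-sample accuracy
```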
10.5 LASSO
Tibshirani (1996) proposes LASSO (the least absolute shrinkage and selection operator) to improve
on OLS. Today, it is one of the most useful ML methods in finance, as it helps to select a few
important variables out of potentially hundreds to predict a stock, the stock market, or the default
of a loan.
10.5.1 The idea
Consider, for example, the following predictive regression of the market return y_t on 200 predictors,

    y_t = α + β_1 x_{t−1,1} + β_2 x_{t−1,2} + · · · + β_200 x_{t−1,200} + ε_t,   t = 1, . . . , T.   (10.14)

The problem is that T is usually not large. If T ≤ 200, the above regression is infeasible, as the usual
OLS estimator

    β = (X′X)⁻¹X′Y   (10.15)
is undefined because X′X will not be invertible, where β denotes all the coefficients (including the
intercept), X is the T × 201 data matrix of the constant and the x's, and Y is a T -vector of the y's.
Suppose T > 200 so that the regression is numerically feasible. Then there is the well-known
over-fitting problem: the regression can fit well in-sample due to the use of many
variables/predictors, but it can perform very poorly out-of-sample. For example, Welch and Goyal
(2008) and Rapach and Zhou (2020) show that the OLS out-of-sample prediction is essentially garbage
if all the 14 or 12 predictors there are used, respectively. The reason is that the estimation accuracy
is low (see Section 10.4.3), so the parameters (intercept and slopes) are far from the truth, and they
do not work well for out-of-sample forecasting.
The objective of LASSO is to select the most important predictors out of the 200. In so doing,
LASSO imposes a bound on the sum of the absolute values of the regression coefficients,

    |β_1| + |β_2| + · · · + |β_200| ≤ C.

When the constant C is chosen small enough, most of the regression coefficients are forced to be
zero, and those left over are the most important ones.

Suppose that LASSO selects 5 variables, say, x_2, x_9, x_105, x_119, x_188. Then we run an OLS
regression only on them, instead of on all 200 variables, to form our forecast. Hence, LASSO
is a data-driven approach that searches for sparsity to identify the minimal number of predictors.
From a forecasting accuracy point of view, the restrictions reduce the variance of the parameter
estimates at the cost of some bias, but they can often reduce the MSE of the parameter estimates,
yielding improved forecasts with greater accuracy. The optimal choice of C will be discussed in the
next subsection.
Since the regression is to minimize the average squared error of the residuals, LASSO solves
the same problem,

    min_{β_0, β_j's}  (1/T) Σ_{t=1}^T ( y_t − β_0 − Σ_{j=1}^n β_j x_{t−1,j} )²   subject to   Σ_{j=1}^n |β_j| ≤ C,

with the additional constraint on the betas, where β_0 denotes the previous α for notational
convenience. It is often referred to as a regularized or penalized regression, which imposes constraints
or information to make a problem more tractable. Mathematically, the constrained problem is
equivalent to an unconstrained problem with a Lagrange multiplier,

    β_LASSO = arg min_β [ (1/(2T)) Σ_{t=1}^T ( y_t − β_0 − Σ_{j=1}^n β_j x_{t−1,j} )² + λ Σ_{j=1}^n |β_j| ],   (10.16)

where λ is the Lagrange multiplier. This is a quadratic programming problem with certain
constraints. There is no analytical formula for the solution in general, but it can be solved easily by
various algorithms. In practice, software packages are readily available in Matlab, R, or Python.

Note that LASSO is a constrained regression. The usual OLS regression has no constraints,
and is a special case of LASSO with C = +∞ or λ = 0 (mathematically, C and λ are inversely
related). The smaller the C (or the larger the λ), the stronger the constraints on the betas, forcing
them closer to zero. This is why LASSO is also called a shrinkage estimator: it shrinks the betas
toward zero. How do we choose C or λ in practice? One often uses cross-validation (see the next
subsection) to make the choice that minimizes the prediction error. Note further that dividing
the first term by 2T is simply for mathematical convenience and does not change the solution for
the betas. In the optimization, when we set the derivatives with respect to the betas to zero,
the 2 cancels out (see an example below), making the end formula more elegant, in terms of λ
rather than λ/2. The definition (10.16) is consistent with our Python codes.
Mathematically, an l_q norm on β is defined by

    ||β||_q = ( Σ_{j=1}^n |β_j|^q )^{1/q}.   (10.17)

Then the LASSO problem is often written in a short but more abstract form,

    β_LASSO = arg min_β [ ||y − Xβ||² + λ* ||β||_1 ],

where

    ||y − Xβ||² = Σ_{t=1}^T ( y_t − β_0 − Σ_{j=1}^n β_j x_{t−1,j} )²,   λ* = 2Tλ,

which is why LASSO is known as imposing constraints on the betas with the l_1 norm. Note that
|| · ||² is the square of the l_2 norm || · ||_2. Since the l_2 norm is the most widely used, its
subscript 2 is often omitted for simplicity.
To gain some intuition about the LASSO estimator, consider the special case of a univariate
regression without an intercept,

    y_t = β x_t + ε_t,   t = 1, . . . , T.   (10.18)
In this case, we want to solve for β to minimize

    f(β) = (1/(2T)) Σ_{t=1}^T (y_t − β x_t)² + λ |β|.

The first-order condition is

    f′(β) = −(1/T) Σ_{t=1}^T (y_t − β x_t) x_t + λ sign(β) = 0,

where sign(·) is the sign function, so that sign(β) = 1 or −1 if β > 0 or β < 0. Assume β > 0; then
we solve from the above,

    β̂_LASSO = β̂ − λ / [ (1/T) Σ_{t=1}^T x_t² ],   (10.19)

where β̂ is the standard OLS estimator (without the constraint),

    β̂ = [ (1/T) Σ_{t=1}^T y_t x_t ] / [ (1/T) Σ_{t=1}^T x_t² ].

So β̂_LASSO equals β̂ if λ = 0 (no constraints), and it shrinks toward zero as λ increases to β̂ and
beyond (β̂_LASSO is defined as zero if the right-hand side of (10.19) is negative, because that
equation was solved by assuming β > 0, so its estimator must satisfy β̂_LASSO ≥ 0).

In particular, if the data x's are normalized, i.e.,

    (1/T) Σ_{t=1}^T x_t² = 1,

then it is clear from (10.19) that

    β̂_LASSO = β̂ − λ,   if β̂ > 0.   (10.20)

The LASSO estimator simply reduces the OLS estimator by the amount λ. If λ = β̂, β̂_LASSO = 0. If
λ > β̂, β̂_LASSO is set to zero.

Similarly, if β̂ < 0, say β̂ = −2, we have

    β̂_LASSO = β̂ + λ,   if β̂ < 0,   (10.21)

that is, we add λ to the OLS estimator to make it closer to 0. But if λ > |β̂| = 2, we set β̂_LASSO
to zero. If β̂ = 0 instead of −2, we obviously set β̂_LASSO = 0. Overall, β̂_LASSO always has
the same sign (positive or negative) as β̂, and it is just smaller in absolute value.
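The closed-form shrinkage rule in (10.20)-(10.21) can be checked numerically against sklearn. The sketch below uses a toy data set (an illustrative assumption) constructed so that (1/T) Σ x_t² = 1 and the OLS slope is exactly 3; sklearn's Lasso, with the intercept turned off to match (10.18), then returns β̂ − λ:

```python
import numpy as np
from sklearn.linear_model import Lasso

# toy data with (1/T) * sum(x_t^2) = 1; the OLS slope is exactly 3
x = np.array([1.0, -1.0, 1.0, -1.0])
y = np.array([3.0, -3.0, 3.0, -3.0])

lam = 0.5
b_ols = np.mean(x * y) / np.mean(x ** 2)             # = 3.0
b_soft = np.sign(b_ols) * max(abs(b_ols) - lam, 0)   # soft-thresholding rule (10.20)

lasso = Lasso(alpha=lam, fit_intercept=False)
lasso.fit(x.reshape(-1, 1), y)
# lasso.coef_[0] agrees with b_soft = 2.5
```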
The above simple relation between β̂_LASSO and β̂ is also true when n > 1, as long as the
regressors are normalized or the columns of X are orthonormal. Of course, there is no such relation
for a general X matrix, and we have to use a numerical algorithm to search for the solution.
Nevertheless, as in the simple case above, the LASSO estimator is always a piecewise linear function
of λ. Moreover, the problem is convex. Mathematically, this makes it easy to find the numerical
solution with the code below.
10.5.2 The code
Python has the greatest number of high-quality packages for machine learning. LASSO is
easily implemented by using sklearn (Scikit-learn). The key codes are:

    from sklearn import linear_model

    alpha = 0.5
    lasso = linear_model.Lasso(alpha=alpha)
    lasso.fit(x, y)

    print(lasso.intercept_)  # the intercept
    print(lasso.coef_)       # the slopes
The code uses alpha (what we call λ) as the input. Given a value of α = λ, it computes the beta
parameters from the definition,

    β_LASSO = arg min_β [ (1/(2T)) Σ_{t=1}^T ( y_t − β_0 − Σ_{j=1}^n β_j x_{t−1,j} )² + λ Σ_{j=1}^n |β_j| ].

Technically, this is a constrained quadratic programming problem. The last two statements simply
print the estimates. Note again that imposing an α value is equivalent to imposing a C on the
coefficients. In other words, for a given α, there is an implied C, but they are inversely related:
as α goes from 0 to +∞, C goes from +∞ to 0. Moreover, the choice of alpha can be done via
cross-validation. See the class codes for details.
10.5.3 The theory
In the standard linear regression,

    y_t = α + β′x_t + ε_t,   t = 1, . . . , T,   (10.22)

where β is an n-vector of slopes/coefficients on n variables and T is the sample size, it is easy to
show that the expected prediction error of the OLS estimator β̂_OLS is (see (10.13))

    E||X(β̂_OLS − β)||²/T = σ² n/T = σ² × (# of parameters)/(# of observations),   (10.23)

where X is the T × n matrix of the data on the x_t's, treated as fixed (rather than random) here
for simplicity. So we must have a large enough time series sample relative to the number of
parameters to make the prediction error small. The above also indicates how close the OLS estimator
can be to the true parameter.
In short, for the OLS estimator to work, n/T must be small. Traditionally, we assume n is
fixed and T is large, so this is fine. But in the big-data context, n can be close to T , and sometimes
even larger than T , so the OLS estimator cannot work.
In the context of the LASSO estimator, under certain conditions, we have

    E||X(β̂_LASSO − β)||²/T = O( s_0 log n / T ),   (10.24)

where O(·) means that the left-hand side is bounded by what is inside the brackets, and s_0 is the
number of true non-zero parameters. See Bühlmann and van de Geer (2011) for details.
There are two important messages. First, given the sample size, it is impossible to estimate
more non-zero parameters than observations: we need s_0 to be small relative to T . Second,
although the number of variables n can be much larger than the number of observations T (as we
assume many of them have zero slopes), it cannot be exponentially larger; that is, the log of the
number of variables cannot be too high relative to the sample size, i.e.,

    (log n)/T

should be small or converge to 0. Otherwise, there is no theory that can guarantee the validity
of the LASSO.
10.6 Cross-validation
An important question is how to choose λ in practice. Cross-validation is widely used to select
λ by examining how well the resulting estimator performs over test data sets. In other words, we
want to validate the procedure across data sets.
The simplest way to understand it is to start from leave-one-out cross-validation (LOOCV).
Suppose we have data

    (x_1, y_1), (x_2, y_2), . . . , (x_n, y_n).

We leave the first observation out, and use all the remaining data,

    (x_2, y_2), . . . , (x_n, y_n),

for the estimation or training of the model. Then we can forecast y_1 to get ŷ_1 based on the (n − 1)
data points. Let

    MSE_1 = (y_1 − ŷ_1)²

be the squared error of our forecast.

Similarly, we can compute MSE_2 = (y_2 − ŷ_2)² by leaving out (x_2, y_2) while using all the remaining
(n − 1) data points. Successively, we can compute the average MSE across the data sets,

    CV_n = (1/n) Σ_{i=1}^n MSE_i.
The LOOCV procedure is to find a tuning parameter λ that minimizes this overall error.

In general, a K-fold cross-validation (K-CV) approach works in three steps:

1) Divide the data into K separate sets of roughly equal size,

    Data_1, Data_2, . . . , Data_K.

2) For k = 1, 2, . . . , K, estimate the model by excluding only the k-th fold, Data_k, and compute
the predictive MSE on that fold. Then compute the total error

    CV_K = (1/K) Σ_{k=1}^K MSE_k.
3) Search over the tuning parameter to minimize CV_K.
It may be noted that the beta estimates in LOOCV have the least variance compared with
those from K-CV, as each estimation uses the largest possible sample. However, its predictive
assessment may be limited because the MSEs are likely very noisy, as each is computed from a
single data point. In addition, it has to estimate the model n times, which can be time-consuming
if the model is difficult to estimate. In general, a moderate K is a better choice, and K = 5 or 10
is commonly used (LOOCV corresponds to K = n).
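The three K-CV steps can be sketched with sklearn's KFold; the data-generating process and the λ grid below are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
T, n = 200, 10
X = rng.standard_normal((T, n))
beta = np.zeros(n)
beta[:2] = [0.5, -0.3]                      # only two true non-zero slopes
y = X @ beta + 0.5 * rng.standard_normal(T)

lambdas = [0.001, 0.01, 0.05, 0.1, 0.5]     # candidate tuning parameters
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

cv_errors = []
for lam in lambdas:
    fold_mses = []
    for train_idx, test_idx in kfold.split(X):
        model = Lasso(alpha=lam).fit(X[train_idx], y[train_idx])
        resid = y[test_idx] - model.predict(X[test_idx])
        fold_mses.append(np.mean(resid ** 2))   # predictive MSE on the held-out fold
    cv_errors.append(np.mean(fold_mses))        # CV_K for this lambda

best_lambda = lambdas[int(np.argmin(cv_errors))]
```

Here the heaviest penalty, λ = 0.5, wipes out the true signal, so the cross-validated error picks a smaller value.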
10.7 Ridge
LASSO may have a problem when the predictors are highly correlated. Intuitively, if two predictors
are highly correlated, it is difficult to select one of them over the other. A simple fix may be to
retain them both, or to retain a linear combination of them.
Hoerl and Kennard (1970) propose a ridge regression to deal with highly correlated regressors
in a general regression, 26 years before LASSO was proposed. We will consider ridge by itself in
this section, and will combine it with LASSO in the E-net section.
10.7.1 The idea
Consider the standard predictive regression

    y_t = α + β_1 x_{t−1,1} + β_2 x_{t−1,2} + · · · + β_n x_{t−1,n} + ε_t,   t = 1, . . . , T.   (10.25)

In vector form,

    Y = Xβ + e,   (10.26)

where Y is a T -vector of observations on the dependent variable, X is a T × (n + 1) matrix of
observations on the regressors, and β is an (n + 1)-vector of the regression coefficients. Recall that
the common OLS estimator is

    β̂_OLS = (X′X)⁻¹X′Y.   (10.27)

Note that the first column of X will be all ones if the regression has an intercept.
The problem is that when the regressors are highly correlated, or when the columns of X are close
to being linearly dependent (multicollinearity), the matrix X′X will be close to singular
(non-invertible). In this case, the inverse (X′X)⁻¹ will be very large, and so will the OLS estimator.

The ridge estimator is defined by

    β̂_ridge = (X′X + λ I)⁻¹X′Y,   (10.28)

where I is the identity matrix of order n + 1, and λ ≥ 0 is the shrinkage parameter. So β̂_ridge is
obtained by adding the matrix λ I (the “ridge”) to X′X, making the result X′X + λ I more stable and
easily invertible.

In the special case of λ = 0, β̂_ridge reduces to the OLS estimator. In general, it shrinks the
estimates toward zero. To see this, consider the case when X is orthonormal, or X′X = I; then

    β̂_ridge = β̂_OLS / (1 + λ),

which clearly shrinks all the OLS estimates toward 0 as λ gets larger.
Mathematically, if we solve the standard MSE minimization by imposing the l_q norm constraint
with q = 2,

    β_ridge = arg min_β [ (1/T) Σ_{t=1}^T ( y_t − β_0 − Σ_{j=1}^n β_j x_{t−1,j} )² + λ Σ_{j=1}^n β_j² ],   (10.29)

the solution is the ridge estimator. Although this differs from the LASSO only by replacing
q = 1 with q = 2, the behavior of the estimator is totally different. Note that we now divide the first
term by T instead of 2T , because setting the derivatives with respect to the betas to zero in the
optimization cancels the 2's from both terms.
To understand the impact of multicollinearity, let us consider a simple example where T = 3
and n = 1, with

    X = [ 1   1
          1   1
          1   1+η ].

It is clear that when η = 0, there is exact collinearity; when η is small, X is close to collinear. Then

    X′X = [ 3      3+η
            3+η    2+(1+η)² ].
Its inverse is (see formula (1.78)),

    (X′X)⁻¹ = (1/det(X′X)) [ 2+(1+η)²   −(3+η)
                             −(3+η)      3      ],

with

    det(X′X) = 3[2 + (1 + η)²] − (3 + η)² = 2η².

So, when η is small, the inverse must be large, driven by the determinant. By the relation between
the determinant and the eigenvalues (see (6.35)),

    det(X′X) = λ_1 λ_2,

so the smallest eigenvalue of X′X must be small too. Hence the following statements, which
practitioners often use, are equivalent: a) X is nearly collinear; b) X′X is nearly singular; c) some
eigenvalues of X′X are too small; d) (X′X)⁻¹ is too large.
Once we add the ridge term into X′X, we have, from the definition of eigenvalues, that

    det(X′X + λ I) = (λ_1 + λ)(λ_2 + λ).

That is, the eigenvalues of X′X + λ I are the original ones shifted up by the amount λ.

Note that when n > T , i.e., the number of regressors or variables is greater than the sample
size, the OLS estimator is undefined, as X′X must be singular in this case. However, the ridge
estimator is still well defined. This is because, if λ > 0, the determinant of X′X + λ I stays away
from zero. This implies that X′X + λ I is comfortably invertible, and hence the estimator is well
behaved, at least numerically.
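The η example can be verified numerically; η = 0.01 and λ = 1 below are illustrative choices:

```python
import numpy as np

eta, lam = 0.01, 1.0
X = np.array([[1.0, 1.0],
              [1.0, 1.0],
              [1.0, 1.0 + eta]])
XtX = X.T @ X

# det(X'X) equals 2*eta**2 here, tiny when eta is small
det_xtx = np.linalg.det(XtX)

# adding the ridge shifts every eigenvalue up by lam
eigs = np.linalg.eigvalsh(XtX)
eigs_ridge = np.linalg.eigvalsh(XtX + lam * np.eye(2))
# eigs_ridge == eigs + lam, so X'X + lam*I is comfortably invertible
```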
10.7.2 The code
The key codes are exactly those for the LASSO, except for replacing the word Lasso by Ridge:

    from sklearn import linear_model

    alpha = 0.5
    ridge = linear_model.Ridge(alpha=alpha)
    ridge.fit(x, y)

    print(ridge.intercept_)  # the intercept
    print(ridge.coef_)       # the slopes
Again, the code uses alpha (what we call λ) as the input; it solves for the betas from the
definition,

    β_ridge = arg min_β [ Σ_{t=1}^T ( y_t − β_0 − Σ_{j=1}^n β_j x_{t−1,j} )² + λ Σ_{j=1}^n β_j² ],

whose solution is exactly the formula (10.28). See the class codes for details.
10.7.3 The theory
Consider the case where the OLS estimator is well defined,

    β̂_OLS = (X′X)⁻¹X′Y,   (10.30)

though X′X may be close to singular. It is well known that the OLS estimator is unbiased,

    E[β̂_OLS] = β,   (10.31)

i.e., its expected value is the true parameter. In other words, if we compute the OLS estimator for
10,000 data sets, the average should converge to the true value.
Since the ridge estimator shrinks the unbiased OLS estimator toward zero, it must be biased. What
is its advantage then? It can have a much smaller variance. Indeed, the covariance of the OLS
estimator is well known,

    cov(β̂_OLS) = σ²(X′X)⁻¹,   (10.32)

where σ² is the variance of the model residual under the standard iid assumption. It explodes as
X′X becomes near singular. It is easier to see this from the trace,

    tr[cov(β̂_OLS)] = σ² tr[(X′X)⁻¹] = σ² Σ_{i=1}^{n+1} 1/λ_i,   (10.33)

where the λ_i's are the eigenvalues of X′X, which can be very small when X′X is near singular.
In contrast, it can be shown that

    tr[cov(β̂_ridge)] = σ² tr[(X′X + λI)⁻¹X′X(X′X + λI)⁻¹] = σ² Σ_{i=1}^{n+1} λ_i/(λ_i + λ)²,   (10.34)

which will not explode even if the λ_i's are very small or even zero. Hence, the ridge estimator
trades bias for smaller variance.
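The variance reduction can be checked numerically on a near-collinear X like the one from the earlier example; σ² = 1, λ = 0.1, and η = 0.01 are illustrative choices:

```python
import numpy as np

sigma2, lam, eta = 1.0, 0.1, 0.01
X = np.array([[1.0, 1.0],
              [1.0, 1.0],
              [1.0, 1.0 + eta]])
XtX = X.T @ X
ridge_inv = np.linalg.inv(XtX + lam * np.eye(2))

tr_ols = sigma2 * np.trace(np.linalg.inv(XtX))             # huge: X'X is near singular
tr_ridge = sigma2 * np.trace(ridge_inv @ XtX @ ridge_inv)  # stays bounded
```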
10.8 Enet
When n is large, the ridge estimator can shrink the coefficients, but it cannot shrink them to exactly
zero. Hence, it cannot be used effectively to reduce dimensionality or to select variables. On the
other hand, LASSO tends to be more aggressive in setting many betas to zero. In particular,
it tends to select an arbitrary one among highly correlated variables (by setting the other betas to
zero), while ridge tends to select the group by keeping all the betas but making them small.

Zou and Hastie (2005) propose a combination of LASSO and ridge, known as the Elastic Net (E-net),
to take advantage of both. While it will set some coefficients to zero like the LASSO, it will be
less aggressive, and it will use the ridge feature to tame only the large coefficients.
Mathematically, the E-net estimator solves the same MSE problem,

    β_Elastic = arg min_β [ (1/(2T)) Σ_{t=1}^T ( y_t − β_0 − Σ_{j=1}^n β_j x_{t−1,j} )² + λ Σ_{j=1}^n |β_j| + (η/2) Σ_{j=1}^n β_j² ],   (10.35)

by imposing both the LASSO and ridge constraints, or imposing both l_1 and l_2 norm restrictions
on the betas. Because it has two constraints, it now has two parameters, λ and η, chosen to
determine the severity of the constraint on each. In practice, the E-net tends to do better than either
LASSO or ridge alone (which may be expected, as it contains each as a special case when η or λ
is zero), and it is widely used today for forecasting.
The key codes are similar to those for the LASSO,

    from sklearn import linear_model

    alpha = 0.5
    psi = 0.3
    enet = linear_model.ElasticNet(alpha=alpha, l1_ratio=psi)
    enet.fit(x, y)

    print(enet.intercept_)  # the intercept
    print(enet.coef_)       # the slopes
Note that the code uses α and ψ as inputs, and it solves

    β_Elastic = arg min_β [ (1/(2T)) Σ_{t=1}^T ( y_t − β_0 − Σ_{j=1}^n β_j x_{t−1,j} )² + αψ Σ_{j=1}^n |β_j| + (α/2)(1 − ψ) Σ_{j=1}^n β_j² ],   (10.36)

which is exactly (10.35) except with a different parameterization,

    α = λ + η,   ψ = λ/(λ + η).

See the class codes for details.
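The mapping between the (λ, η) parameterization in (10.35) and the (α, ψ) parameterization in (10.36) amounts to matching the coefficients on the l_1 and l_2 terms; a minimal sketch, with illustrative values λ = 0.3 and η = 0.2:

```python
# Match the penalty weights in (10.35) and (10.36):
#   alpha * psi = lam        (coefficient on the l1 term)
#   alpha * (1 - psi) = eta  (coefficient on the l2 term)
lam, eta = 0.3, 0.2

alpha = lam + eta        # the alpha passed to ElasticNet
psi = lam / (lam + eta)  # the l1_ratio passed to ElasticNet
```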
10.9 C-LASSO
Following Han, He, Rapach and Zhou (2020), and more closely Rapach and Zhou (2020), we can define
a time series version of C-LASSO as follows, while the details for the cross section and extensions can be
found in Han, et al. (2019). Diebold and Shin (2019) were the first to explore this line of ideas, though
their procedure is quite different from ours.
The C-LASSO, or Combination-LASSO, is to use the idea of the combination forecast method
first, and then use LASSO to select the most important forecasts out of all the forecasts based on
all the predictors individually.
Suppose we have 200 predictors, which implies that we have 200 forecasts, one based on each of the predictors. Now we consider a regression of the realized returns on the forecasts,
\[
y_t = \alpha + \theta_1\hat y_{t-1,1} + \theta_2\hat y_{t-1,2} + \cdots + \theta_{200}\hat y_{t-1,200} + \epsilon_t, \qquad t = 1,\ldots,T. \tag{10.37}
\]
We use the LASSO to select the most important forecasts in (10.37). This regression on forecasts will in general be more robust than the regression on predictors, though it is not the most efficient; it works well when the true regression parameters change over time. In implementation, we impose the nonnegativity restriction θ_j ≥ 0, because the return forecasts should be positively related to the realized returns. In contrast to the usual LASSO, which is applied to predictors, the C-LASSO is applied to forecasts. This is only the first, selection step of C-LASSO.
After selection, the final forecast is the average of the selected ones. For example, if 10 forecasts are selected out of the 200, the C-LASSO forecast is
\[
\hat y_t^{\text{C-LASSO}} = \frac{\hat y_{t-1,1} + \hat y_{t-1,3} + \cdots + \hat y_{t-1,180}}{10}, \tag{10.38}
\]
where we assume that the first, the third, ..., and the 180th forecasts (10 in total) are the selected ones. The same idea can be applied to extend the bridge and the Elastic Net to yield C-Ridge and C-Enet.

The C-LASSO in general improves substantially on the average forecast by selecting and using only the good forecasts in the average, rather than averaging all the forecasts.
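The selection-then-average idea can be sketched as follows on simulated forecasts (variable names and tuning values are illustrative, not from the class codes):

```python
# A hedged sketch of C-LASSO: step 1 selects forecasts with a nonnegative
# LASSO; step 2 equal-weights the selected forecasts, as in (10.38).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
T, K = 240, 20
y = rng.standard_normal(T)                      # realized returns
forecasts = 0.1 * rng.standard_normal((T, K))   # individual forecasts
forecasts[:, :3] += 0.3 * y[:, None]            # first three track y; rest is noise

# Step 1: LASSO of realized returns on the forecasts, with theta_j >= 0
sel = Lasso(alpha=0.01, positive=True, fit_intercept=True).fit(forecasts, y)
chosen = np.flatnonzero(sel.coef_ > 0)          # indices of selected forecasts

# Step 2: the C-LASSO forecast is the average of the selected forecasts
c_lasso_forecast = forecasts[:, chosen].mean(axis=1)
print(chosen)
```

In this toy setup the informative forecasts (the first three columns) should dominate the selection.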
10.10 E-LASSO
E-LASSO, or encompassing LASSO, is motivated by two ideas (see Rapach and Zhou, 2020, and Han et al., 2021). First, based on forecast encompassing, there is likely a gain from combining the C-LASSO with the OLS. Second, it belongs to the ensemble approach of machine learning, which combines algorithms (e.g., Zhou 2012).
In general, we define the E-LASSO forecast as a simple linear combination of the C-LASSO and OLS forecasts,
\[
\hat y_t^{\text{E-LASSO}} = \lambda\,\hat y_t^{\text{C-LASSO}} + (1-\lambda)\,\hat y_t^{\text{OLS}}, \tag{10.39}
\]
where λ is data-driven, computed as the value that minimizes the forecasting error of \(\hat y_t^{\text{E-LASSO}}\) over the past M periods (say M = 36 in monthly forecasting applications).
Dong et al. (2021) and Han et al. (2021), among others, find that the E-LASSO indeed tends to do better than both the C-LASSO and the OLS in most applications.
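Choosing λ over the past M periods can be sketched as follows (the grid search, window length, and simulated forecasts are illustrative assumptions):

```python
# A minimal sketch of picking the E-LASSO weight lambda in (10.39) by
# minimizing the MSE of the combined forecast over the past M observations.
import numpy as np

def elasso_weight(y_past, f_classo, f_ols, grid=np.linspace(0.0, 1.0, 101)):
    """Return the lambda in [0, 1] minimizing past combined-forecast MSE."""
    errors = [np.mean((y_past - (lam * f_classo + (1 - lam) * f_ols)) ** 2)
              for lam in grid]
    return grid[int(np.argmin(errors))]

# toy example with M = 36 months of past realized returns and forecasts
rng = np.random.default_rng(2)
y_past = rng.standard_normal(36)
f_classo = 0.8 * y_past + 0.1 * rng.standard_normal(36)  # accurate forecast
f_ols = 0.2 * rng.standard_normal(36)                    # pure-noise forecast
lam = elasso_weight(y_past, f_classo, f_ols)
print(lam)   # should lean heavily toward the C-LASSO forecast
```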
10.11 Neural network
The LASSO and the previous models are extensions of the ordinary least squares (OLS) regression that can handle many regressors. They work well only if the true data come from a linear model. In practice, however, many dependent variables depend on others in a nonlinear fashion.

The neural network (NN) is a major class of models that extend the OLS to allow for nonlinear relations. It weights the data linearly into a layer of new data sets, makes a nonlinear transformation, then weights them into another layer of data, makes another nonlinear transformation, and so on, until finally producing the observed output data. It is motivated by the biological neural network, and so the nodes of the network are called neurons.
Mathematically, any smooth function can be approximated by a suitable NN (Hornik, Stinchcombe, and White 1989; Cybenko 1989). In other words, if we use a set of predictors to predict the market return, and if the true function is highly nonlinear but smooth, then, given enough data, we can build a suitable NN that approximates the true but unknown function to arbitrary accuracy. This is the theoretical reason why the NN is widely used in practice and has growing applications in finance; Gu, Kelly and Xiu (2020) is an example. Klaas (2019), Géron (2019), and Gulli, Kapoor and Pal (2019), among many others, provide standard Python codes for implementing NNs.
A deep neural network (DNN) is an NN (sometimes called an artificial NN or ANN) with multiple layers between the input and output layers. Deep learning (also known as deep structured learning) is part of the broader family of machine learning methods based on artificial neural networks.

The neural network is perhaps best understood by going through some examples.
10.11.1 No hidden layer: linear regression
Consider the prediction of \(y_t\) using two predictors, \(z_1\) and \(z_2\). We have the usual simple predictive regression,
\[
y_t = \alpha + \beta_1 z_{1,t-1} + \beta_2 z_{2,t-1} + \epsilon_t, \qquad t = 1,2,\ldots,T. \tag{10.40}
\]
Recall that, if the parameters were known, we would compute our forecast from
\[
\hat y_t = \alpha + \beta_1 z_{1,t-1} + \beta_2 z_{2,t-1},
\]
which says that our forecast is a linear function of the predictors. It is mathematically more convenient to express it in terms of the dot product (as many ML books do),
\[
\hat y_t = \theta_1 x_{1t} + \theta_2 x_{2t} + \theta_3 x_{3t} = \theta\cdot x_t, \tag{10.41}
\]
where
\[
\theta = (\alpha,\beta_1,\beta_2) = (\theta_1,\theta_2,\theta_3), \tag{10.42}
\]
\[
x_t = (1, z_{1,t-1}, z_{2,t-1}) = (x_{1t},x_{2t},x_{3t}), \tag{10.43}
\]
and the last equality in (10.41) follows from the definition of the dot product.
In terms of an NN, we map \(x_t\), the attributes, into an output using the weights θ. Denoting the output by \(y_1\) and dropping the time subscripts for brevity, we have

[Diagram: an input layer with nodes \(x_1\), \(x_2\), \(x_3\), connected directly to a single node \(y_1\) in the output layer.]
In other words, the OLS can be viewed as an NN with 3 nodes, or neurons, in the input layer, no hidden layers, and one output in the output layer. In a multivariate regression, we would have multiple outputs, and so multiple y's in the output layer.

Note that there are 3 parameters. We seek the parameters that make the forecasts as close to the actual data (the training sample) as possible. This is often done by minimizing the mean-squared error,
\[
\min_{\theta}\; L \equiv \sum_{t=1}^{T}\big[y_t - (\theta_1 x_{1t} + \theta_2 x_{2t} + \theta_3 x_{3t})\big]^2 = \sum_{t=1}^{T}(y_t - \theta\cdot x_t)^2. \tag{10.44}
\]
The solution is the well known OLS regression coefficients and is analytically available.
To summarize, in the no-hidden-layer case, the NN is simply the usual OLS regression. For a general NN, however, the solution is not available in closed form. Instead, we need to search for it numerically using optimization algorithms, of which gradient descent is the most common, to be discussed later.
10.11.2 One hidden layer
Consider now an NN with one hidden layer. Suppose we map the 3 inputs into 4 nodes of a hidden layer, and then into an output. Graphically, we have
[Diagram: an input layer with nodes \(x_1\), \(x_2\), \(x_3\), fully connected to a hidden layer of four nodes, which in turn connect to a single node \(y_1\) in the output layer.]
The key is that the data in the hidden layer are nonlinear functions of linear combinations of the input data. For example, the top and bottom hidden nodes are
\[
x_1^1 = f(\theta_1^{1,1}x_1 + \theta_2^{1,1}x_2 + \theta_3^{1,1}x_3),
\]
\[
x_4^1 = f(\theta_1^{1,4}x_1 + \theta_2^{1,4}x_2 + \theta_3^{1,4}x_3),
\]
where f is a nonlinear activation function that maps the linear combination of the previous layer's data into a nonlinear relation, with the \(\theta_k^{1,j}\)'s as parameters. For example, for \(\theta_2^{1,4}\), the superscripts indicate the first layer and the fourth node.
The forecast is then
\[
\hat y = \theta_0^2 + \theta_1^2 x_1^1 + \theta_2^2 x_2^1 + \theta_3^2 x_3^1 + \theta_4^2 x_4^1
= \theta_0^2 + W_2^\top x^1
= \theta_0^2 + W_2^\top f(W_1^\top x),
\]
where \(W_2\) and \(W_1\) are the coefficient matrices in the second and first steps, and x is the vector of input data.
So it is clear that the data are transformed linearly at each step, and at the hidden layer the results are further transformed by a nonlinear function before being used in the next step. In this way, one can generate an NN with an arbitrary number of steps.
The rectified linear unit (ReLU) is one of the most popular activation functions, used by Gu, Kelly, and Xiu (2020) and others in finance. It is defined as
\[
f(x) = \begin{cases} 0, & \text{if } x < 0;\\ x, & \text{otherwise.}\end{cases}
\]
Intuitively, the activation function activates a neuronal connection in response to a sufficiently
strong signal, thereby relaying the signal forward through the network.
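The forward pass just described can be sketched in a few lines of numpy; the weights below are arbitrary illustrative numbers, not estimates:

```python
# Three inputs, one hidden layer of four ReLU neurons, one linear output:
# y_hat = theta_0^2 + W2' f(W1' x), with f the ReLU.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)                 # f(x) = max(x, 0)

x = np.array([1.0, 0.5, -0.2])                # input layer: x1, x2, x3
W1 = np.arange(12.0).reshape(4, 3) / 10.0     # 3 x 4 = 12 hidden weights
W2 = np.array([0.3, -0.1, 0.2, 0.4])          # 4 output weights
b2 = 0.05                                     # output intercept theta_0^2

hidden = relu(W1 @ x)                         # x^1 = f(W1' x), shape (4,)
y_hat = b2 + W2 @ hidden                      # scalar forecast
print(y_hat)
```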
Note that, in the above one-hidden-layer NN, there are 3 × 4 = 12 parameters to arrive at the hidden layer, and then 4 more parameters at the output, so there are 16 weight parameters in total to estimate (17 including the output intercept \(\theta_0^2\)). In an NN with m layers, there are Km parameters if each step has K parameters, so the number of parameters grows very fast. In finance, for time series forecasting, 1-2 layers are probably enough due to data limitations. In cross-section forecasting, one may apply up to 5 layers, as in Gu, Kelly, and Xiu (2020).
10.11.3 Gradient descent: A search algorithm
It will be of interest to see how the numerical estimation of the parameters is done mathematically,
which can provide deeper insight into the Python packages.
Consider the simple OLS case. Based on (10.44), the derivative with respect to any parameter is
\[
L_j \equiv \frac{\partial L}{\partial \theta_j} = -2\sum_{t=1}^{T}(y_t-\theta\cdot x_t)\,x_{jt}.
\]
Mathematically, at the optimal value of any parameter, the derivative with respect to it should be zero. In practice, however, we do not know the optimal parameter values; they are what we need to find.
The idea is that we can start from any initial guess \(\theta_j^0\), for j = 1, 2, 3. We then compute a new updated (iterated) value
\[
\theta_j^1 = \theta_j^0 - \rho L_j, \qquad j = 1,2,3, \tag{10.45}
\]
where ρ > 0 is a small constant. The reason is that, if \(\theta_j^0\) is not optimal, then \(L_j \ne 0\). Suppose \(L_j > 0\); then \(\theta_j^0\) is to the right of the optimal value (imagine a U-shaped function with the minimum in the middle), and so we have to move to the left, which is exactly what the above algorithm does.
If \(L_j\) is not zero, one can iterate
\[
\theta_j^{m+1} = \theta_j^m - \rho_m L_j, \qquad m = 1,2,\cdots, \tag{10.46}
\]
until it converges, where \(\rho_m\) is sometimes called the learning rate. If it is too small, the algorithm may converge slowly and may need many training examples to do so. If it is too large, the iterated values \(\theta_j^{m+1}\) may change too fast and end up oscillating around the optimal value.
The above first-order iterative optimization algorithm is known as gradient descent, and is generally attributed to Cauchy, who first suggested it in 1847. To understand its name, imagine that you walk down to the bottom (minimum) of a mountain. The direction pointing to the bottom is the gradient (the first-order derivatives) at that point, and you descend accordingly. However, depending on the mountain, there is a possibility that you get stuck in some hole (i.e., a local minimum or saddle point). In practice, many problems do have a well behaved global minimum. Even when the solution is not well behaved, multiple starting points or alternative models are useful for checking whether the solution is a local minimum or not. If it is, additional search is necessary.
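The iteration (10.46) applied to the OLS problem (10.44) can be sketched as follows; the data, learning rate, and iteration count are illustrative, and the gradient is averaged over T for a scale-free step size:

```python
# Gradient descent for OLS: theta^{m+1} = theta^m - rho * L_j,
# converging to the analytical least-squares solution.
import numpy as np

rng = np.random.default_rng(3)
T = 500
X = np.column_stack([np.ones(T), rng.standard_normal((T, 2))])  # 1, z1, z2
theta_true = np.array([0.5, 1.0, -2.0])
y = X @ theta_true + 0.1 * rng.standard_normal(T)

theta = np.zeros(3)        # initial guess theta^0
rho = 0.1                  # learning rate
for _ in range(500):
    grad = -2.0 / T * X.T @ (y - X @ theta)   # L_j, averaged over T
    theta = theta - rho * grad                # one descent step
print(theta)               # close to the analytical OLS coefficients
```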
In a general NN, the output is a compound function of the parameters. For example, in the one-hidden-layer case, we have
\[
\hat y = \sum_{i=1}^{4}\theta_i^2\, f\Big(\sum_{j=1}^{3}\theta_j^{1,i}x_j\Big).
\]
Suppose now we add one more hidden layer, with 5 nodes, before reaching the output, and we use the same activation function; then
\[
\hat y = \sum_{i=1}^{5}\theta_i^3\, f\Big(\sum_{k=1}^{4}\theta_k^{2,i}\, f\Big(\sum_{j=1}^{3}\theta_j^{1,k}x_j\Big)\Big),
\]
where f is applied to a linear function of itself (a compound function). Despite the complex look, the output is easily computed recursively, and the first-order derivatives follow from the chain rule. Gradient descent can then be applied to search for the parameter values that deliver the best fit of the model.
10.11.4 Remarks
The number of nodes in each hidden layer and the number of hidden layers can take any values, driven by the data in applications. Various algorithms have been developed to estimate these numbers and the associated parameters. Theoretically, a large enough NN should capture almost any complex decision function. Hence, NN-type methods are currently the preferred approach for complex
machine learning problems such as computer vision and natural language processing. However, because of its many layers, it is one of the least transparent, least interpretable, and most heavily parameterized machine learning tools. In addition, it generally requires a large sample size for convergence, limiting its applications in time series forecasting, where the time series in finance is often not long enough.
Gu, Kelly and Xiu (2020), among others, find that the NN does better than the LASSO and other regression-type methods for predicting stock returns, because it captures important nonlinearity. But this line of papers is based on a balanced, large panel of both cross-section and time series data. In contrast, Filippou, Taylor, Rapach and Zhou (2020) find no gains from NNs for foreign exchange, due to the small sample size in both the time and cross-section dimensions (or due to the absence of nonlinearity). In a setting similar to Gu, Kelly and Xiu (2020) but with new predictors added over time, the NN can no longer be applied in such an unbalanced panel model; however, LASSO and C-LASSO remain effective (see Han, He, Rapach and Zhou, 2020). Dixon, Halperin and Bilokon (2020) discuss more advanced neural networks and their applications in finance.
10.12 Genetic algorithm
Like gradient descent, genetic programming (GP) is a general search algorithm for finding the optimal solution of an objective function. But its search idea is more heuristic and is based on principles of natural genetic processes.

Liu, Zhou and Zhu (2020b) appear to be the first to apply it to forecast the cross section of stock returns. Like the NNs, the GP captures nonlinearity and interactions, and so it performs better than linear regression-based methods such as the LASSO. However, the GP appears to require smaller sample sizes than the NNs. More importantly, it can be used to maximize an arbitrary economic objective, such as the Sharpe ratio, directly; in contrast, other approaches are often designed only for model fitting. The drawbacks of the GP are that it is computationally demanding, and so incapable of handling problems with many predictors, and that it is complex and difficult to apply, as available packages are limited. See Liu, Zhou and Zhu (2020b) and the references therein for further reading.
10.13 Ensemble Learning
Ensemble learning means learning from a combination of models or algorithms. The combination forecast (Section 9.5.1) is the simplest example of ensemble learning: the forecast based on each predictor is a model, and rather than relying on any single model, we use the average forecast across the models as our new forecast.

The 1/N portfolio rule (Section 2.1.1) is also an example of ensemble learning, one that diversifies over assets. Bayesian model averaging (Section 3.7.2) is an important example, where various models are weighted by their posterior probabilities; it is used in a wide range of complex decision making. Why does ensemble learning work? Each model is unlikely to capture the real world fully. By pooling all the models together, the final model is likely to capture more aspects of the problem, and so to perform better. Another, more technical reason is that each model is evaluated based on its own assumptions about the true data-generating process, which may themselves not be true.
There are many specific methods of ensemble learning. Below we discuss three of the most
popular ones.
10.13.1 Bagging
Bagging is a way to use bootstrap to improve performance. It is also known as bootstrap aggregation
or bagging averages.
To understand it, consider the case in which we have a predictive model to forecast a future return, \(R_{T+1}\). Based on our model, let the forecast be
\[
\hat R_{T+1} = f(X_T), \tag{10.47}
\]
where \(X_T\) denotes all the data up to T (today).
Rather than relying on the single forecast above, we bootstrap the data B times with replacement to obtain B sets of data, \(X_T^{(1)},\ldots,X_T^{(B)}\), each of which allows a re-estimation of the model to yield a new forecast \(f(X_T^{(b)})\), for b = 1, 2, ..., B. Then the bagging forecast is
\[
\hat R_{T+1}^{\text{Bagging}} = \frac{1}{B}\sum_{b=1}^{B} f\big(X_T^{(b)}\big), \tag{10.48}
\]
i.e., a simple average of the bootstrapped forecasts.

Bagging attempts to make the data more representative in the model, so it can typically help to improve an unstable model. The bootstrapped portfolio investment rule (Section 4.3.3), or the re-sampled frontier discussed by Michaud and Michaud (2008), is an example of bagging in portfolio choice.
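The bootstrap-and-average scheme of (10.48) can be sketched as follows; the simple linear forecasting model and simulated data are illustrative choices:

```python
# Bagging a return forecast: refit the model on B bootstrap samples
# (drawn with replacement) and average the resulting forecasts.
import numpy as np

rng = np.random.default_rng(4)
T, B = 120, 200
x = rng.standard_normal(T)                # predictor
y = 0.5 * x + rng.standard_normal(T)      # returns to be forecast
x_today = 0.3                             # predictor value at time T

forecasts = np.empty(B)
for b in range(B):
    idx = rng.integers(0, T, size=T)      # bootstrap with replacement
    slope, intercept = np.polyfit(x[idx], y[idx], 1)
    forecasts[b] = intercept + slope * x_today

bagging_forecast = forecasts.mean()       # (10.48): average over bootstraps
print(bagging_forecast)
```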
10.13.2 Stacking
Stacking is a way to use cross-validation to improve performance. In contrast with equal weights or posterior probability weights, it is more data-driven, placing smaller weights on models that have high empirical bias.
Suppose now that there are M models, \(f_1(x),\ldots,f_M(x)\), that we use to forecast an outcome y. Our objective is to find the best weights such that
\[
f^{\text{Stack}}(x) = \sum_{m=1}^{M} w_m f_m(x) \tag{10.49}
\]
is the best forecast by some metric.
Consider the popular quadratic objective, in which we want to find \(w = (w_1,\ldots,w_M)'\) to minimize the mean-squared error (MSE),
\[
\hat w = \arg\min_{w}\sum_{i=1}^{T}\Big(y_i - \sum_{m=1}^{M} w_m f_m^{(-i)}(x_i)\Big)^2, \tag{10.50}
\]
where \(f_m^{(-i)}(x_i)\) is the model \(f_m\) re-estimated without the i-th observation \(x_i\), an idea from cross-validation that yields more robust models.
The above optimization over w is a simple quadratic programming problem without constraints. In practice, one can also impose the restriction that the weights are positive and sum to 1, which is easily solved in a manner similar to constrained portfolio problems.
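The unconstrained version of (10.50) can be sketched for two toy models (a sample-mean forecast and a simple OLS forecast; both models and the data are illustrative):

```python
# Stacking two models: build leave-one-out predictions f_m^{(-i)}(x_i),
# then solve the unconstrained least-squares problem (10.50) for the weights.
import numpy as np

rng = np.random.default_rng(5)
T = 100
x = rng.standard_normal(T)
y = 1.0 + 0.8 * x + 0.3 * rng.standard_normal(T)

loo = np.empty((T, 2))            # leave-one-out predictions, one column per model
for i in range(T):
    keep = np.arange(T) != i      # drop observation i, then refit each model
    loo[i, 0] = y[keep].mean()                      # model 1: sample mean
    b, a = np.polyfit(x[keep], y[keep], 1)          # model 2: simple OLS
    loo[i, 1] = a + b * x[i]

# w = argmin sum_i (y_i - sum_m w_m f_m^{(-i)}(x_i))^2
w, *_ = np.linalg.lstsq(loo, y, rcond=None)
print(w)      # most weight should go to the better (OLS) model
```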
10.13.3 Boosting
Boosting is one of the most powerful and popular ways to improve a model, and there are many versions (see, e.g., Hastie, Tibshirani, and Friedman, 2009). In what follows, we focus on a regression type that seems most relevant to the finance problems of our interest here.
Consider the problem of improving a fit function F (x) on data (x1, y1), (x2, y2), . . . , (xT , yT ).
The errors are
y1 − F (x1), y2 − F (x2), . . . , yT − F (xT ).
Our objective is to find a function h(x) so that
F1(x) = F (x) + h(x)
has smaller MSE.
Mathematically, the MSE is
\[
J = \frac{1}{T}\sum_{i=1}^{T}\big[y_i - F(x_i)\big]^2. \tag{10.51}
\]
Although the fitted values \(F(x_1), F(x_2),\ldots,F(x_T)\) are just numbers, we can view them as parameters when thinking about how they affect the loss. Then, taking derivatives, we have the gradient
\[
g(x_i) \equiv \frac{\partial J}{\partial F(x_i)} = \frac{2}{T}\big[F(x_i)-y_i\big].
\]
Now we fit a function h(x) to \((x_1,-g(x_1)),\ldots,(x_T,-g(x_T))\), so that h(x) is close to \(-g(x)\) at all the \(x_i\)'s. Then it is clear that
\[
F_1(x) = F(x) + \rho h(x) \tag{10.52}
\]
will be an improvement over F(x) for small enough ρ > 0.
The reason is that, in optimization, the negative gradient is the direction in which to move closer to the optimum. For example, consider finding the minimum of
\[
f(x) = x^2/2 - x.
\]
Suppose we are at the value \(x_0 = 2\). Then \(f'(x_0) = 1\), and
\[
x_1 = x_0 - \rho f'(x_0) = 2 - \rho
\]
will clearly be closer to the minimizer x = 1 for small enough ρ > 0. Of course, one can also choose ρ > 0 to minimize the error. The gradient algorithm works in the same way.
Hence, the gradient boosting algorithm can be summarized in 4 general steps: 1) compute the negative gradient for the current fit \(F_m\) under the chosen metric; 2) fit the data with the negative gradient to get \(h_m\); 3) solve the one-dimensional optimization problem for the step size; 4) update the fit to \(F_{m+1}\). In practice, the iteration stops when there is no further significant improvement.
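The four steps above can be sketched with shallow regression trees as the base learner h (an illustrative choice, and a fixed step size in place of the one-dimensional search):

```python
# Gradient boosting for the MSE metric: the negative gradient is the
# residual y - F, so each round fits a small tree to the current residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
x = rng.uniform(-2, 2, size=(300, 1))
y = np.sin(2 * x[:, 0]) + 0.2 * rng.standard_normal(300)

F = np.full(300, y.mean())       # initial fit F_0
rho = 0.1                        # small fixed step size
for m in range(200):
    neg_grad = y - F             # 1) negative gradient of the MSE
    h = DecisionTreeRegressor(max_depth=2).fit(x, neg_grad)  # 2) fit h_m
    F = F + rho * h.predict(x)   # 3)-4) step and update: F_{m+1} = F_m + rho h_m

print(np.mean((y - F) ** 2))     # in-sample MSE after boosting
```

The in-sample MSE should fall well below the variance of y, since the trees capture the nonlinear sine pattern.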
11 Predictability 2: Cross Section
In this chapter, we discuss cross-section forecasts at length, and also provide detailed implementation procedures.
11.1 Overview
Cross-section forecasts focus on predicting the relative performance of firms, and the cross-section regression (CSR) is run over the number of firms, N, which is usually large in practice, say N = 10,000 firms. In the CSR, we can use only the current observations on the predictors to forecast the next-period returns on the N firms; time series data can help improve forecasting accuracy, say by smoothing the estimates over time, but are not required. CSR forecasts are useful for fund managers in picking stocks to buy or overweight, and to short or underweight.
In contrast, time series predictability amounts to predicting an asset's return over time, and the time series forecasting regression is run over the number of available time periods, T, for that asset. Usually T is small, say T = 120 for ten years of monthly data. An investment strategy of getting into or out of the stock market is called market timing, and time series forecasting methods are useful in this context. But it should be remembered that such predictability is small and time-varying.
Cross-section forecasts are popular in practice. Since N is large, OLS is the popular estimation approach. However, when there are many predictors, the OLS still tends to overfit; see Han et al. (2021) and Neuhierl et al. (2021) for ways to deal with that problem. Coqueret and Guida (2020) and Jurczenko (2020) provide additional applications of machine learning
methods. We will discuss some of the estimation procedures below.
11.2 Cross-section regression
To better understand the CSR, consider the size effect. We know that large firms tend to have lower average returns than small firms, so the size of a firm is a predictor of its future return. Whether the stock market is up or down next month (its time series behavior), small firms tend to outperform large firms on average.

A simple way to exploit this is to buy small firms and short large ones, obtaining a portfolio with a positive alpha if the size effect persists. To be more precise, we can run the CSR on size, assuming N = 1000 firms,
firms,
\[
R_{i,t} = \alpha + \beta\,\text{Size}_{i,t-1} + \epsilon_{i,t}, \qquad i = 1,2,\ldots,1000, \tag{11.1}
\]
where α and β are the regression coefficients (the same across firms in the CSR), and \(\text{Size}_i\) is the firm size of firm i (usually in logs and standardized; see Section 11.3 for implementation details). Suppose our estimated regression is
\[
R_{i,t} = \frac{15\%}{12} - \frac{5\%}{12}\,\text{Size}_{i,t-1} + \hat\epsilon_{i,t}, \tag{11.2}
\]
where we assume the data are monthly, so that α = 15%/12 and its annualized value is 15%. Assume that the size variable is standardized across firms. Then the above equation tells us that, if a firm's size is one unit larger than the others', its expected return will be 5% (annualized) lower than theirs. In the above model, it is evident that the smaller the size, the greater the expected return. If we divide all the stocks into 10 groups, known as decile portfolios, by the expected returns estimated from (11.2), it will be equivalent to sorting the stocks by size.
Clearly, in practice more factors than size alone affect expected stock returns. If we add profitability, we can run the CSR on both,
\[
R_{i,t} = \alpha + \beta_s\,\text{Size}_{i,t-1} + \beta_p\,\text{Profit}_{i,t-1} + \epsilon_{i,t}, \qquad i = 1,2,\ldots,1000. \tag{11.3}
\]
If our estimated regression is
\[
R_{i,t} = \frac{15\%}{12} - \frac{5\%}{12}\,\text{Size}_{i,t-1} + \frac{7\%}{12}\,\text{Profit}_{i,t-1} + \hat\epsilon_{i,t}, \tag{11.4}
\]
then a large firm may be desirable if its profitability is high. So we should consider both factors, and the total contribution is what matters to the expected stock return. In this case, if we divide all the stocks into 10 decile portfolios by the expected returns estimated from (11.4), the result will not be the same as sorting the stocks by size or by profitability. In fact, sorting cannot fully capture the two effects, but the CSR is one valid approach that does.
More generally, if we now have four factors,
\[
R_{i,t} = \alpha + \beta_1 X_{i,1,t-1} + \beta_2 X_{i,2,t-1} + \beta_3 X_{i,3,t-1} + \beta_4 X_{i,4,t-1} + \epsilon_{i,t}, \qquad i = 1,2,\ldots,1000, \tag{11.5}
\]
we can write this CSR in matrix form,
\[
\begin{pmatrix} R_{1,t}\\ R_{2,t}\\ \vdots\\ R_{1000,t}\end{pmatrix}
=
\begin{pmatrix}
1 & X_{1,1} & X_{1,2} & X_{1,3} & X_{1,4}\\
1 & X_{2,1} & X_{2,2} & X_{2,3} & X_{2,4}\\
\vdots & \vdots & \vdots & \vdots & \vdots\\
1 & X_{1000,1} & X_{1000,2} & X_{1000,3} & X_{1000,4}
\end{pmatrix}
\begin{pmatrix}\alpha\\ \beta_1\\ \vdots\\ \beta_4\end{pmatrix}
+
\begin{pmatrix}\epsilon_{1,t}\\ \epsilon_{2,t}\\ \vdots\\ \epsilon_{1000,t}\end{pmatrix}, \tag{11.6}
\]
where each \(X_{i,j}\) is firm characteristic j for firm i. Note that the returns are measured at time t and the explanatory variables at t − 1, since we use past information to forecast the future return. In implementation, to forecast next month's return, which is not yet known, we run the CSR of current returns on the past month's predictors to obtain the regression coefficients; then, based on them and the current predictor values, we compute our forecast. Since the above equation is a linear regression, we can use OLS to estimate the parameters; the details are given in the next subsection.
It is worthwhile to contrast the time series regression (TSR) with the CSR. Suppose we want to predict the stock market return \(R_{m,t+1}\) using four predictors and have T = 240 monthly observations available; then we run the TSR,
\[
R_{m,t} = \alpha + \beta_1 x_{1,t-1} + \beta_2 x_{2,t-1} + \beta_3 x_{3,t-1} + \beta_4 x_{4,t-1} + \epsilon_t, \qquad t = 1,2,\ldots,240. \tag{11.7}
\]
To predict the return at T + 1, in terms of data we have
\[
\begin{pmatrix} R_{m,240}\\ R_{m,239}\\ \vdots\\ R_{m,1}\end{pmatrix}
=
\begin{pmatrix}
1 & x_{1,239} & x_{2,239} & x_{3,239} & x_{4,239}\\
1 & x_{1,238} & x_{2,238} & x_{3,238} & x_{4,238}\\
\vdots & \vdots & \vdots & \vdots & \vdots\\
1 & x_{1,0} & x_{2,0} & x_{3,0} & x_{4,0}
\end{pmatrix}
\begin{pmatrix}\alpha\\ \beta_1\\ \vdots\\ \beta_4\end{pmatrix}
+
\begin{pmatrix}\epsilon_{240}\\ \epsilon_{239}\\ \vdots\\ \epsilon_{1}\end{pmatrix}. \tag{11.8}
\]
We estimate the regression coefficients and then plug them into (11.7) to obtain the forecast. In comparison with the previous CSR, the dependent variable is a time series of the market return, not a cross section of returns at one point in time; the same is true for the explanatory variables. So, although OLS can be applied in both cases, it is applied to cross-section data and time series data, respectively.
Time series predictability is about how predictors forecast returns over time, with the predictive regression as the typical set-up; most machine learning tools are readily applicable when there are many predictors. In contrast, cross-section predictability is about the relative predictability among asset returns: it predicts that some assets will have greater returns than others, regardless of the ups and downs of the stock market. Most machine learning tools, developed for time series, can be adapted easily to the cross section by treating the number of cross-section observations as if it were the number of time series periods (e.g., Han, He, Rapach and Zhou, 2021, and Freyberger, Neuhierl, and Weber, 2020).
In empirical applications, the degree of time series predictability is low. On the other hand, cross-section predictability is much stronger, yielding sizable economic profits (see, e.g., Gu, Kelly and Xiu, 2020, and Han, He, Rapach and Zhou, 2021). Hence, the cross-section regression is popular in practice.
11.3 OLS estimation
In real-world implementations of the CSR, the first issue is to clean the data and make them usable. Typically, there are missing data in back-testing, and one often uses ad hoc interpolation or the cross-section mean to replace them. Often the firm characteristics, such as size, value, momentum, and quality, enter the right-hand side of (11.6) in standardized form, i.e., as z-scores,
\[
\text{z-score} = \frac{x - \mu}{\sigma}, \tag{11.9}
\]
which standardizes the raw data x (say, the size of each firm in the cross section), where μ is the mean of the characteristic across firms and σ is its standard deviation. In addition, the data may be trimmed so that scores above 3 are set to 3 and those below −3 are set to −3. This prevents the results from being driven by a few extremely large or small firms.
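The standardization and trimming just described amount to two lines of numpy (the raw values below are a hypothetical cross section of a characteristic):

```python
# z-score (11.9) across firms, then trim at +/- 3 to limit outlier influence.
import numpy as np

size = np.array([1.2, 3.4, 2.1, 15.0, 0.5, 2.8])   # raw characteristic
z = (size - size.mean()) / size.std()              # z-score (11.9)
z = np.clip(z, -3.0, 3.0)                          # trim at +/- 3
print(z)
```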
The OLS is the standard procedure, applied each period to obtain the coefficient estimates. Haugen and Baker (1996) appear to be the first to do so; Lewellen (2015) provides a more recent and comprehensive analysis; Han et al. (2017) show how to obtain an interpretable factor from a group of proxies.
The procedure takes two steps. First, at time t, we run the OLS regression
\[
R_{i,t} = \alpha + \beta_1 X_{i,1,t-1} + \beta_2 X_{i,2,t-1} + \beta_3 X_{i,3,t-1} + \beta_4 X_{i,4,t-1} + \epsilon_{i,t}, \qquad i = 1,2,\ldots,1000, \tag{11.10}
\]
to obtain the coefficient estimates
\[
\hat\beta_{1,t},\ \hat\beta_{2,t},\ \hat\beta_{3,t},\ \hat\beta_{4,t}.
\]
These are sufficient for us to compute the forecasted, or expected, value at t + 1 as
\[
E[R_{i,t+1}] = \hat\beta_{1,t}X_{i,1,t} + \hat\beta_{2,t}X_{i,2,t} + \hat\beta_{3,t}X_{i,3,t} + \hat\beta_{4,t}X_{i,4,t}. \tag{11.11}
\]
Note that we have ignored the alpha, as it does not matter for ranking stocks by expected returns; it simply adds the same constant to all of them. However, in practice, due to model instability, the estimates are usually smoothed.
In the second step, we smooth the estimates over the past year (or another period as appropriate), taking the average of the coefficient estimates as our final estimates,
\[
\bar\beta_{j,t} = \frac{1}{12}\sum_{s=1}^{12}\hat\beta_{j,t+1-s}. \tag{11.12}
\]
Then the expected returns are computed from
\[
E[R_{i,t+1}] = \bar\beta_{1,t}X_{i,1,t} + \bar\beta_{2,t}X_{i,2,t} + \bar\beta_{3,t}X_{i,3,t} + \bar\beta_{4,t}X_{i,4,t}. \tag{11.13}
\]
In practice, the averaged betas work much better than betas estimated from a single period alone.
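The two-step procedure can be sketched end-to-end on simulated data (the sample sizes, noise level, and true betas are illustrative assumptions):

```python
# Step 1: run a cross-section OLS each month; step 2: average the slope
# estimates over 12 months as in (11.12), then form expected returns (11.13).
import numpy as np

rng = np.random.default_rng(7)
N, months, K = 1000, 12, 4
X = rng.standard_normal((months, N, K))        # standardized characteristics
beta_true = np.array([-0.05, 0.07, 0.02, -0.01]) / 12   # monthly slopes

betas = np.empty((months, K))
for t in range(months):
    R = X[t] @ beta_true + 0.05 * rng.standard_normal(N)  # month-t returns
    Xt = np.column_stack([np.ones(N), X[t]])              # add intercept
    coef, *_ = np.linalg.lstsq(Xt, R, rcond=None)
    betas[t] = coef[1:]                                   # drop the alpha

beta_bar = betas.mean(axis=0)                  # smoothed estimates (11.12)
expected = X[-1] @ beta_bar                    # expected returns (11.13)
print(beta_bar)
```

Ranking `expected` and splitting into deciles then yields the long-short portfolios discussed next.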
A typical way to use the forecasts is to divide the stocks into 10 decile groups. Then, buying the group with the highest expected returns and shorting the one with the lowest is likely to be a profitable trading strategy. If no shorting is allowed, one can simply overweight the high-expected-return stocks and underweight the low-expected-return ones.
When there are too many factors, the OLS estimation is likely to have an overfitting problem that makes out-of-sample performance deteriorate. Machine learning tools are well suited to such a problem and may be applied; see Han et al. (2021) and the references therein.
11.4 E-LASSO estimation
As mentioned before, the E-LASSO and other ML tools can be applied to the CSR by taking N as T. However, since we now have both cross-section and time series information, the latter can be used to
improve the accuracy. See the papers cited at the beginning of this chapter.
11.5 Weighted cross section regression
In the cross-section regression model, the OLS estimation effectively weights the companies equally. In practice, we may want to weight larger firms more heavily, as they are more important.
For example, Green, Hand, and Zhang (2017) and Han et al. (2021) use the log market-cap as the weight across firms. A Bloomberg white paper uses the square root,
\[
w_i = \sqrt{\frac{\text{market-cap}_i}{\sum_{i=1}^{N}\text{market-cap}_i}},
\]
where \(\text{market-cap}_i\) is the market capitalization of firm i.
Mathematically, the OLS estimation finds the slopes that minimize the mean-squared error,
\[
\text{MSE} = \sum_{i=1}^{N}(y_i - x_i'\beta)^2,
\]
and the solution is the standard formula
\[
\hat\beta = (X'X)^{-1}X'y.
\]
A weighted OLS minimizes the weighted MSE,
\[
\text{MSE}^* = \sum_{i=1}^{N}w_i(y_i - x_i'\beta)^2,
\]
where \(w_i\) is the weight. The solution for the slopes is
\[
\hat\beta = (X'WX)^{-1}X'Wy,
\]
where W is the diagonal matrix formed by the \(w_i\)'s. In Python, this is easily done with statsmodels: import statsmodels.api as sm; sm.WLS(y, X, weights=w).
12 Bayesian Estimation
In this section, we introduce the Bayesian method to prepare for later applications. The key idea of the Bayesian method is to view parameters as random variables whose properties we learn via their posterior distribution, derived from Bayes' Theorem. In contrast, the usual (so-called classical) statistical method views the parameters as constants that we learn about by examining their sample estimates.
12.1 Bayes Theorem
There are two versions of Bayes' Theorem: one in terms of events, another in terms of densities. Both are widely used, especially the latter, because densities are what data analysis works with.
12.1.1 Conditional events
For any two events A and B, elementary probability theory says that
\[
P(A,B) = P(A)P(B|A) = P(B)P(A|B), \tag{12.1}
\]
which says that the probability of a joint event is the marginal probability times the conditional probability. It then follows that
\[
P(A|B) = \frac{P(A)P(B|A)}{P(B)}, \tag{12.2}
\]
which is known as Bayes' Theorem. The key is its interpretation. If P(A) is our initial belief about an event before knowing B, and B is the evidence, then P(A|B) is our updated belief in light of the evidence. In this case, P(A) is called the prior, and P(A|B) the posterior.
By the law (or formula) of total probability,

P(B) = P(A)P(B|A) + P(A^c)P(B|A^c),

where A^c is the complement of A, i.e., everything except A. Then the Bayes Theorem can be written as

P(A|B) = \frac{P(A)P(B|A)}{P(A)P(B|A) + P(A^c)P(B|A^c)}, (12.3)

which has many applications.
Example 12.1 You are interested in the probability that the market will go up next month. You forecast that the market has a 60% chance of going up. Now an expert says it will go up. The expert's forecasting accuracy is 90%: if the market goes up, 90% of his forecasts say up; if the market goes down, 90% say down. What is the probability that the market goes up conditional on the expert's "up" call? With A the up-market event and B the expert's "up" call,

P(A|B) = \frac{0.6 \times 0.9}{0.6 \times 0.9 + 0.4 \times 0.1} = 93\%, (12.4)

which is the updated probability in light of the expert's opinion. ♠
Here is another example.
Example 12.2 A medical test for whether someone has been infected by a virus has a 95% true positive rate and a 90% true negative rate. Only 1% of the population is actually infected. What is the probability that a random person who tests positive is really infected?

Now let A be the event that the person is infected, and B the event that the test is positive. Then

P(A|B) = \frac{0.01 \times 0.95}{0.01 \times 0.95 + 0.99 \times 0.10} = 8.76\%, (12.5)

which is totally different from 95%! However, if we conduct the test only when the person has some symptoms, and if the population with the symptoms is infected at a rate of 20%, then P(A|B) becomes .20 × .95/(.20 × .95 + .80 × .10) = 70.37%, much greater than before! ♠
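Formula (12.3) is a one-line function. A minimal sketch that reproduces the numbers in Examples 12.1 and 12.2:

```python
def bayes_update(prior, p_evidence_given_A, p_evidence_given_notA):
    """Posterior P(A|B) via Bayes Theorem and the law of total probability (12.3)."""
    num = prior * p_evidence_given_A
    return num / (num + (1 - prior) * p_evidence_given_notA)

# Example 12.2: 1% base rate, 95% true positive rate, 90% true negative rate.
p_general = bayes_update(0.01, 0.95, 0.10)   # about 8.76%
p_symptom = bayes_update(0.20, 0.95, 0.10)   # about 70.37%, with the 20% base rate
```

The same function gives the 93% of Example 12.1 via `bayes_update(0.6, 0.9, 0.1)`.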
Yet another famous example:

Example 12.3 Suppose that there are two boxes filled with millions of poker chips. The first box has 70% red and 30% blue chips, and the second box has 70% blue and 30% red. Assume now that one of the two boxes is chosen randomly, a dozen chips are drawn from it, and the sample result is 8 red chips and 4 blue. What is the chance that the chips came from the first box?

Let A and B be the events that the first and second boxes were chosen, respectively, and let S be the sample/data. Prior to the draw, it is clearly reasonable to assume that

p(A) = 50\%, \quad p(B) = 50\%.
By simple combinatorics, we have

p(S|A) = \binom{12}{8} 0.7^8 \times 0.3^4 = 0.231, \quad p(S|B) = \binom{12}{8} 0.3^8 \times 0.7^4 = 0.008.

Then,

P(A|S) = \frac{0.5 \times 0.231}{0.5 \times 0.231 + 0.5 \times 0.008} = 97\%, (12.6)

which is totally different from the 70–80% that most people would guess (Edwards, 1968)! See Benjamin (2018, pp. 50–51) and references therein for more details. ♠
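The binomial likelihoods and the posterior in Example 12.3 can be checked with a few lines of standard-library Python:

```python
from math import comb

def box_likelihood(p_red, n_red, n_blue):
    """Binomial probability of drawing n_red red and n_blue blue chips
    from a box whose red share is p_red."""
    n = n_red + n_blue
    return comb(n, n_red) * p_red**n_red * (1 - p_red)**n_blue

lik_A = box_likelihood(0.7, 8, 4)   # first box: 70% red
lik_B = box_likelihood(0.3, 8, 4)   # second box: 30% red
post_A = 0.5 * lik_A / (0.5 * lik_A + 0.5 * lik_B)   # posterior of the first box
```

This reproduces p(S|A) ≈ 0.231, p(S|B) ≈ 0.008, and P(A|S) ≈ 97%.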
12.1.2 Conditional densities
Consider now densities, which are "probabilities" over small intervals. Let p(θ, y) be the joint distribution of the parameters, θ, and the data, y.

Standard probability analysis says that the joint density of any two (or two sets of) random variables is the marginal density times the conditional density,

p(\theta, y) = p(y)\, p(\theta \,|\, y) = p(\theta)\, p(y \,|\, \theta). (12.7)

This implies that

p(\theta \,|\, y) = \frac{p(\theta)\, p(y \,|\, \theta)}{p(y)}. (12.8)
If we interpret y as data, then p(y) is a constant conditional on observing y, and so the above can be written as

p(\theta \,|\, y) \propto p(\theta)\, p(y \,|\, \theta), (12.9)

which is the Bayes Theorem in terms of density functions. It says that the posterior density of θ conditional on the data is proportional, ∝, to the product of the prior density of θ and the likelihood function of the data.

The objective is to learn about θ. In Bayesian analysis, all we know about θ is its density function. There are two densities. One is the prior, summarizing all we know before observing the data. The other is the posterior, telling us all about θ after observing the data. That is our updated learning about θ with the data y.
The key assumption in Bayesian analysis is to view both the data and the parameters as random variables, and the key insight is that we can update our learning with data.

Before observing data, we have some prior notion of what the likely values of θ are, which is summarized or expressed by our prior density p(θ). In what follows, as in the literature, the prior will be denoted p0(θ) to emphasize that it is a prior. Then, after observing the data, we have updated learning about θ, which is the posterior density, p(θ | y). This is our learning conditional on the data.
The application of the Bayesian method has three steps:
1. Provide p0(θ) to reflect our prior belief;
2. Compute the likelihood function (joint density of data);
3. Obtain the posterior density
p(θ | y) ∝ p0(θ)× likelihood function. (12.10)
Based on the posterior density, we can learn about θ by computing its posterior mean, variance, confidence interval, etc.
12.2 Classical vs Bayesian
The classical statistical framework treats parameters as true but unknown constants and uses the data, a random sample, to learn about them. In contrast, the Bayesian set-up treats parameters as random variables and learns their distributions by using the random data. The difference between them is best understood by studying an example.
For simplicity, we first assume that the variance of the data, σ², is known. The unknown-σ² case is examined later. The difference in the results is the difference between the normal and the t distributions, and so, if the sample size is reasonably large (say ≥ 50), they yield almost the same results.
12.2.1 σ2 known
Example 12.4 Given T independent observations,

y = (y_1, y_2, \ldots, y_T)',

on a random variable y which has a normal distribution with unknown mean,

y \sim N(\mu, 1), (12.11)

and known variance 1 (a simplification). What can we learn about the mean µ?
The classical approach:

With data (y_1, y_2, \ldots, y_T)', we estimate µ by its sample mean,

\hat\mu = \frac{1}{T}\sum_{t=1}^T y_t.

From (12.11), we have

\hat\mu \sim N(\mu, 1/T), (12.12)

i.e., \hat\mu has a normal distribution.
The result says that:

• the estimator has the parameter as its mean;

• the variance is 1/T – as the sample size T gets large, we get on average an increasingly accurate estimate of the true mean;

• any hypothesis testing about µ can be done based on (12.12).
The Bayesian approach: Recall the three steps.

1. Assume a diffuse prior (µ can be any real number):

p_0(\mu) \propto 1. (12.13)

To understand why this represents a diffuse prior, consider how to express a prior that µ lies in [−1, 1] with all values equally likely. What we want is

p_0(\mu) \propto \begin{cases} c, & \text{if } \mu \in [-1, 1]; \\ 0, & \text{otherwise.} \end{cases}

Since the density has an integral of 1, we have

1 = \int_{-1}^{1} p_0(\mu)\, d\mu = \int_{-1}^{1} c\, d\mu = 2c,

so we get c = 1/2. Similarly, if we want the prior to be equally likely over [−M, M], then c = 1/(2M). Since a constant c has no impact on the posterior, what matters is the range of p_0; so we can simply use p_0(µ) = 1 over [−M, M], and zero otherwise, which is called an improper prior as its integral is not 1 (it is not strictly a density). Theoretically, it is still valid for posterior analysis. Letting M go to infinity, we have the diffuse prior given by (12.13).
2. The likelihood function, or density of the data, is

p(y \,|\, \mu) = \left(\frac{1}{\sqrt{2\pi}}\right)^T \exp\left[-\frac{1}{2}\sum_{t=1}^T (y_t - \mu)^2\right]. (12.14)
To understand it, consider 2 data points. Their joint density is

p(y_1, y_2) = p(y_1)p(y_2) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(y_1-\mu)^2} \cdot \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(y_2-\mu)^2} = \left(\frac{1}{\sqrt{2\pi}}\right)^2 e^{-\frac{1}{2}(y_1-\mu)^2 - \frac{1}{2}(y_2-\mu)^2},

where the first equality follows from independence, the second from the normality assumption, and the third uses a property of exponential functions.
3. The posterior density is then

p(\mu \,|\, y) \propto p_0(\mu)\, p(y \,|\, \mu) \propto \exp\left[-\frac{1}{2}\sum_{t=1}^T (y_t - \mu)^2\right]. (12.15)
Now let us simplify the posterior density so that we can learn its implications for µ. Since

\sum_{t=1}^T (y_t - \mu)^2 = T\mu^2 - 2T\mu\hat\mu + \sum_{t=1}^T y_t^2 = T(\mu - \hat\mu)^2 + \sum_{t=1}^T (y_t - \hat\mu)^2, (12.16)

and the second term on the right-hand side is a constant (which can be ignored in the posterior density because it contributes only a proportionality constant), we can write the posterior density as

p(\mu \,|\, y) \propto \exp\left[-\frac{T}{2}(\mu - \hat\mu)^2\right], (12.17)

which is exactly a normal density function in µ.
which is exactly a normal density function on µ.
The posterior density says that

• the posterior mean of µ is \hat\mu;

• the posterior variance is 1/T – as the sample size T gets large, the distribution of µ concentrates more and more tightly around \hat\mu;

• any hypothesis testing/assessment about µ can be based on (12.17).
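The claim behind (12.17) can be verified numerically: under the flat prior, normalizing exp[−½Σ(y_t − µ)²] over a grid of µ should reproduce the N(\hat\mu, 1/T) density. A stdlib-only sketch on simulated data (sample size, seed, and grid are illustrative choices):

```python
import math
import random

random.seed(1)
T, mu_true = 50, 0.3
y = [random.gauss(mu_true, 1.0) for _ in range(T)]
mu_hat = sum(y) / T

# Unnormalized posterior exp[-(1/2) * sum_t (y_t - mu)^2] on a grid of mu,
# under the diffuse prior p0(mu) ∝ 1.
h = 0.001
grid = [mu_hat - 1 + i * h for i in range(2001)]
post = [math.exp(-0.5 * sum((yt - m) ** 2 for yt in y)) for m in grid]
Z = sum(post) * h                      # numerical normalizing constant
post = [p / Z for p in post]

# Closed form implied by (12.17): mu | y ~ N(mu_hat, 1/T).
normal = [math.sqrt(T / (2 * math.pi)) * math.exp(-0.5 * T * (m - mu_hat) ** 2)
          for m in grid]
max_err = max(abs(p - q) for p, q in zip(post, normal))
post_mean = sum(m * p * h for m, p in zip(grid, post))
```

The pointwise gap `max_err` is tiny, and the numerically computed posterior mean matches the sample mean \hat\mu.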
To summarize, the example shows that classical and Bayesian inference are quantitatively the same under the diffuse prior. This is not surprising because the classical approach does not use any prior information; when neither approach uses prior information, they should yield fundamentally the same conclusion. What, then, are the potential advantages of the Bayesian approach?

Its advantage is the ability to use informative priors (see later examples). Technically, it also has the advantage of computing the exact distribution of functions of interest for a finite sample size, which the classical framework may not be able to provide and so has to rely on asymptotic distributions or bootstraps, which may not be reliable or accurate when the sample size is small.
However, the Bayesian approach has its potential disadvantages. Although it can use informative priors, this is also the root cause of arguments about the appropriateness of priors: using an incorrect prior can clearly be worse than using no prior at all. Moreover, for tractability, Bayesian analysis often makes restrictive assumptions on the data-generating process (such as iid normality), while classical analysis can usually accommodate much more general assumptions.
12.2.2 σ2 unknown
Assume still that the data are normally distributed,

y \sim N(\mu, \sigma^2). (12.18)

Previously, σ² was assumed known for simplicity. Now we assume that σ² is unknown. Since σ² can usually be estimated fairly accurately in many applications, the results of the known case are not much different from those of the unknown case. As noted earlier, this is so when the sample size is reasonably large.

When σ is unknown, the posterior distribution of µ is no longer normal, but a t distribution under a standard diffuse prior on σ, which is commonly assumed.
The key is to note that the diffuse prior on σ, known as Jeffreys' prior, is

p_0(\sigma) \propto \frac{1}{\sigma}, \quad \sigma > 0, (12.19)

because it was Jeffreys who first showed that it represents noninformativeness about σ. Then, assuming a diffuse prior on µ that is independent of σ, the joint prior density of µ and σ is

p_0(\mu, \sigma) \propto \frac{1}{\sigma}, \quad \sigma > 0, (12.20)

which is the common diffuse prior in statistics in this context.
The likelihood function of the normally distributed data is

p(y \,|\, \mu, \sigma) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^T \exp\left[-\frac{1}{2\sigma^2}\sum_{t=1}^T (y_t - \mu)^2\right],

and hence the posterior is

p(\mu, \sigma \,|\, y) \propto \sigma^{-(T+1)} \exp\left[-\frac{1}{2\sigma^2}\sum_{t=1}^T (y_t - \mu)^2\right]. (12.21)
In comparison with our earlier analysis under the diffuse prior with known σ, we have simply added the terms involving σ.

Since we are interested in the posterior mean of µ, we have to integrate σ out of the joint density to obtain the density of µ alone. To do so, we need a formula from calculus,

\int_0^{+\infty} x^{-(n+1)} e^{-a x^{-2}}\, dx = \frac{1}{2} a^{-n/2} \Gamma(n/2),

where Γ(·) is the Gamma function,

\Gamma(z+1) = z\Gamma(z), \quad \Gamma(1) = 1, \quad \Gamma(1/2) = \sqrt{\pi}.
Then the integration of (12.21) over σ gives

p(\mu \,|\, y) \propto \left(\sum_{t=1}^T (y_t - \mu)^2\right)^{-T/2}, (12.22)

which is not yet a recognizable distribution. Let

s^2 = \frac{1}{T-1}\sum_{t=1}^T (y_t - \hat\mu)^2, (12.23)
the sample variance of the data (dividing by T − 1 rather than T makes it unbiased, which makes little numerical difference when T ≥ 30). From (12.16), we can write

\sum_{t=1}^T (y_t - \mu)^2 = T(\mu - \hat\mu)^2 + (T-1)s^2.

Plugging this into (12.22) and factoring out \nu s^2, we have

p(\mu \,|\, y) \propto \left(1 + \frac{(\mu - \hat\mu)^2}{\nu (s/\sqrt{T})^2}\right)^{-(\nu+1)/2}, (12.24)

where ν = T − 1.
The above equation says that the posterior distribution of µ is t-distributed with mean \hat\mu (recall the definition in (1.60)). Interestingly, in the classical analysis, \hat\mu is also t-distributed with mean µ and degrees of freedom ν. Again, under the diffuse prior, the classical and Bayesian analyses reach the same conclusion, though interpreted differently (parameters are regarded as random variables in the Bayesian framework, but as constants in the classical one).
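The integration step can be checked numerically: integrating the joint posterior (12.21) over σ on a grid should reproduce, up to a constant, the t kernel in (12.24). A stdlib-only sketch with simulated data (seed, grid limits, and step size are ad hoc choices):

```python
import math
import random

random.seed(2)
T = 10
y = [random.gauss(0.0, 1.0) for _ in range(T)]
mu_hat = sum(y) / T
nu = T - 1
s2 = sum((yt - mu_hat) ** 2 for yt in y) / nu

def post_mu_numeric(mu, ds=0.002, lo=0.05, n=10000):
    """Integrate sigma out of the joint posterior (12.21) by a Riemann sum."""
    S = sum((yt - mu) ** 2 for yt in y)
    total = 0.0
    for i in range(n):
        sig = lo + i * ds
        total += sig ** (-(T + 1)) * math.exp(-S / (2 * sig * sig)) * ds
    return total

def t_kernel(mu):
    """The kernel of (12.24) with nu = T - 1 degrees of freedom."""
    return (1 + (mu - mu_hat) ** 2 / (nu * s2 / T)) ** (-(nu + 1) / 2)

base = post_mu_numeric(mu_hat)
ratio = post_mu_numeric(mu_hat + 0.4) / base   # should match t_kernel(mu_hat + 0.4)
```

Because both sides are known only up to proportionality, the comparison is done through density ratios relative to the value at \hat\mu.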
12.3 Informative priors
As mentioned earlier, easily incorporating prior information is an important advantage of the
Bayesian analysis. The example below illustrates the main idea, while more complex examples will
be analyzed in later applications.
We now extend the previous example to a more realistic situation by using a general normal prior density,

p_0(\mu) = \frac{1}{\sqrt{2\pi}\,\sigma_0}\, e^{-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}}, (12.25)

where µ_0 is our prior mean and σ_0 is our prior standard deviation. For instance, when we examine the expected return on the market, we may set µ_0 = 10% and σ_0 = 15%, i.e., we use the prior

\mu \sim N(10\%, 15\%^2).

This says that the future expected return on the asset is likely to be 10%, but with a standard deviation of 15%. Although this prior is not perfect, it should be better than no prior at all in practice. It does reflect some sort of long-term view on the stock market.
12.3.1 σ2 known
Then the posterior density is

p(\mu \,|\, y) \propto p_0(\mu)\, p(y \,|\, \mu)
\propto \exp\left(-\left[\frac{(\mu-\mu_0)^2}{2\sigma_0^2} + \frac{1}{2\sigma^2}\sum_{t=1}^T (y_t - \mu)^2\right]\right)
\propto \exp\left(-\left[\frac{(\mu-\mu_0)^2}{2\sigma_0^2} + \frac{T}{2\sigma^2}(\mu - \hat\mu)^2\right]\right), (12.26)

where the last line uses the decomposition (12.16), as in the diffuse prior case (12.15). Now
\frac{(\mu-\mu_0)^2}{a} + \frac{(\mu-\hat\mu)^2}{b} \propto \frac{\mu^2 - 2\mu\mu_0}{a} + \frac{\mu^2 - 2\mu\hat\mu}{b} (12.27)
\propto \frac{b+a}{ab}\left[\mu^2 - 2\mu\left(\frac{\mu_0}{a} + \frac{\hat\mu}{b}\right)\frac{ab}{b+a}\right]. (12.28)
Taking a = \sigma_0^2 and b = \sigma^2/T, we obtain the posterior density

p(\mu \,|\, y) \propto \exp\left[-\left(\frac{\sigma_0^2 + \sigma^2/T}{2\sigma_0^2 \sigma^2/T}\right)\left(\mu - \frac{\sigma_0^2 \hat\mu + \mu_0 \sigma^2/T}{\sigma_0^2 + \sigma^2/T}\right)^2\right], (12.29)

where σ² is treated as known, and can be replaced by a sample variance estimate.
Equation (12.29) says that µ has a normal density. The mean, (\sigma_0^2\hat\mu + \mu_0\sigma^2/T)/(\sigma_0^2 + \sigma^2/T), can be written as

E\mu = w\mu_0 + (1-w)\hat\mu, \quad w = \frac{\sigma^2/T}{\sigma_0^2 + \sigma^2/T}, (12.30)

which is a weighted average of the prior mean µ_0 and the sample mean \hat\mu. The greater the sample variance (the less informative the data), the more weight is placed on the prior. However, as more and more data arrive (T becomes large), the data speak for themselves and the prior has no impact (unless σ_0 = 0, which disregards the data).
In the case of estimating the market expected return, even if the sample mean is negative (due to using bear-market data), the prior mean of µ_0 = 10% helps pull the expected return toward positive territory. In contrast, the classical analysis estimates the expected return by the sample mean, which can be inadequate with a very limited data size. This is an advantage of the Bayesian approach.
The posterior variance of µ is (\sigma_0^2\sigma^2/T)/(\sigma_0^2 + \sigma^2/T), or

\mathrm{Var}(\mu) = \left(\frac{1}{\sigma_0^2} + \frac{T}{\sigma^2}\right)^{-1}, (12.31)
which says that the posterior precision is the sum of the prior precision and the data precision. If either the prior or the data is informative, the posterior precision is high, and vice versa. As the sample size gets large, the posterior variance approaches zero, and we then learn the exact mean.
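The updating formulas (12.30)–(12.31) fit in a few lines. A minimal sketch using the 10%/15% prior from the text; the bear-market sample numbers (a −5% sample mean, σ = 20%, T = 3) are hypothetical, chosen to make the shrinkage visible:

```python
def normal_posterior(mu0, sig0, sample_mean, sigma, T):
    """Posterior mean and variance of mu under a N(mu0, sig0^2) prior
    when the data variance sigma^2 is treated as known, eqs. (12.30)-(12.31)."""
    w = (sigma ** 2 / T) / (sig0 ** 2 + sigma ** 2 / T)  # weight on the prior
    mean = w * mu0 + (1 - w) * sample_mean
    var = 1.0 / (1.0 / sig0 ** 2 + T / sigma ** 2)       # precisions add
    return mean, var

# Prior N(10%, 15%^2); a bear-market sample mean of -5% from T = 3 annual
# observations with sigma = 20% (hypothetical numbers).
m, v = normal_posterior(0.10, 0.15, -0.05, 0.20, 3)
```

Here the prior pulls the posterior mean back into slightly positive territory (m ≈ 0.6%), and as σ_0 grows (a diffuse prior) the posterior mean converges to the sample mean.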
12.3.2 σ2 unknown
In this case, there is an issue of what prior to impose on σ². While there are many choices, we follow Zellner (1971, pp. 70–72) and use an initial sample to set the prior; the posterior will then be of the same form as before, making it easier to analyze.
Specifically, we use the earlier posterior, (12.21), as our prior,

p(\mu, \sigma \,|\, y_1) \propto \sigma^{-(n_1+1)} \exp\left[-\frac{1}{2\sigma^2}\sum_{t=1}^{n_1}(y_{1t} - \mu)^2\right]
\propto \sigma^{-(n_1+1)} \exp\left\{-\frac{1}{2\sigma^2}\left[\nu_1 s_1^2 + n_1(\mu - \mu_1)^2\right]\right\}, (12.32)

where n_1 is the initial sample size, ν_1 = n_1 − 1, and µ_1 and s_1^2 are the sample mean and variance of the initial sample.
Given a new sample of size n_2, the likelihood function is

l(\mu, \sigma \,|\, y_2) \propto \sigma^{-n_2} \exp\left[-\frac{1}{2\sigma^2}\sum_{t=1}^{n_2}(y_{2t} - \mu)^2\right]. (12.33)
Then the posterior density is

p(\mu, \sigma \,|\, y_1, y_2) \propto \sigma^{-(n_1+n_2+1)} \exp\left\{-\frac{1}{2\sigma^2}\left[\sum_{t=1}^{n_1}(y_{1t} - \mu)^2 + \sum_{t=1}^{n_2}(y_{2t} - \mu)^2\right]\right\}
\propto \sigma^{-(n+1)} \exp\left\{-\frac{1}{2\sigma^2}\left[\nu s^2 + n(\mu - \hat\mu)^2\right]\right\}, (12.34)

where n = n_1 + n_2, ν = n − 1, and \hat\mu and s² are the sample mean and variance based on all the data. Mathematically, this density has exactly the same form as (12.21), and so the earlier analysis can be used to obtain the marginal posterior densities.
12.4 Predictive distribution
In applications, it is the future value or return of a random variable, not the past values, that is of great interest. Consider again Example 12.4. In the classical approach, the future value, \tilde{y}_{T+1} (the tilde emphasizes that it is a random variable not yet observed), is naturally predicted by the sample mean, with standard error 1.

In the Bayesian framework, we need the predictive density of \tilde{y}_{T+1} conditional on the data,

p(\tilde{y}_{T+1} \,|\, y) = \int p(\tilde{y}_{T+1}, \mu \,|\, y)\, d\mu = \int p(\tilde{y}_{T+1} \,|\, \mu, y)\, p(\mu \,|\, y)\, d\mu, (12.35)

which says that the predictive density is obtained by integrating the density of \tilde{y}_{T+1}, conditional on the parameter and the data, against the posterior density of the parameter.
By the assumption on the data-generating process (12.11), we have

p(\tilde{y}_{T+1} \,|\, \mu, y) \propto \exp\left[-\frac{1}{2}(y_{T+1} - \mu)^2\right],

and by the earlier posterior result (12.17), we can compute

p(\tilde{y}_{T+1} \,|\, y) \propto \int_{-\infty}^{+\infty} \exp\left[-\frac{1}{2}(y_{T+1} - \mu)^2 - \frac{T}{2}(\mu - \hat\mu)^2\right] d\mu
\propto \exp\left[-\frac{T}{2(T+1)}(y_{T+1} - \hat\mu)^2\right], (12.36)
which says that the predictive density is normal. If one uses the posterior mean as the point prediction, the Bayesian approach provides the same mean prediction as the classical one,

E_t(\tilde{y}_{T+1}) = \hat\mu,

conditional on information available at t, but the predictive variance is (T+1)/T, slightly greater than 1, the variance under the classical approach. This is due to incorporating the estimation error in µ.
In general, the Bayesian predictive point estimate need not equal the classical one. For example, an informative prior in the previous example produces a different predictive point estimate. In addition, the predictive density is usually not normal. For instance, if the data variance σ² is unknown, the predictive density in the previous example is t-distributed. All these issues can be found in Zellner (1971), who provides an excellent guide to the Bayesian approach.
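The predictive variance (T+1)/T in (12.36) can be confirmed by simulation: draw µ from its posterior N(\hat\mu, 1/T), then a future observation from N(µ, 1). A stdlib-only sketch with a made-up \hat\mu and a deliberately small T so the extra 1/T term is visible:

```python
import random
import statistics

random.seed(3)
T, mu_hat = 4, 0.2            # small T makes the estimation risk visible
draws = []
for _ in range(200_000):
    mu = random.gauss(mu_hat, (1.0 / T) ** 0.5)  # posterior draw of mu, (12.17)
    draws.append(random.gauss(mu, 1.0))          # future observation given mu
pred_mean = statistics.mean(draws)                # close to mu_hat
pred_var = statistics.pvariance(draws)            # close to (T + 1) / T = 1.25
```

The simulated predictive mean matches \hat\mu while the variance exceeds 1, exactly the estimation-error effect discussed above.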
12.5 Bayesian regression
Previously, the statistical model was about the mean only,

y_t = \mu + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma^2), (12.37)

where the previous notation x_t is replaced by y_t. In this subsection, we consider an extension that includes a regressor,

y_t = \alpha + \beta x_t + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma^2), (12.38)

which is important as we often analyze a stock return relative to the market.
Now assume all the parameters are unknown; then the diffuse prior is

p_0(\alpha, \beta, \sigma) \propto \frac{1}{\sigma}, \quad \sigma > 0, (12.39)

which is an extension of (12.20). The posterior is

p(\alpha, \beta, \sigma \,|\, D) \propto \sigma^{-(T+1)} \exp\left[-\frac{1}{2\sigma^2}\sum_{t=1}^T (y_t - \alpha - \beta x_t)^2\right], (12.40)

where D = (y, x) denotes all the data.
Now we want to make sense of (12.40). By purely subtracting and adding, we have

\sum (y_t - \alpha - \beta x_t)^2 = \sum\left(y_t - \hat\alpha - \hat\beta x_t - [(\alpha - \hat\alpha) + (\beta - \hat\beta)x_t]\right)^2. (12.41)

Now let \hat\alpha and \hat\beta be the OLS estimators,

\hat\alpha = \bar{y} - \hat\beta\bar{x}, \quad \hat\beta = \frac{\sum (x_t - \bar{x})(y_t - \bar{y})}{\sum (x_t - \bar{x})^2},

where \bar{y} and \bar{x} are the sample means. Expanding (12.41) and using the orthogonality conditions of the OLS, we have

\sum (y_t - \alpha - \beta x_t)^2 = \nu s^2 + T(\alpha - \hat\alpha)^2 + (\beta - \hat\beta)^2 \sum x_t^2 + 2(\alpha - \hat\alpha)(\beta - \hat\beta)\sum x_t,

where

s^2 = \frac{1}{T-2}\sum (y_t - \hat\alpha - \hat\beta x_t)^2, (12.42)

is the sample residual variance. In contrast to (12.23), here s² is made unbiased by dividing by (T − 2), because 2 degrees of freedom are lost to the constant and the variable x used to explain y (in general, we should divide by T − K − 1 if there are K variables plus the constant).
Based on the above decomposition, we know from (12.40) that, conditional on σ, (α, β) are jointly normally distributed with mean (\hat\alpha, \hat\beta)' and covariance matrix

\mathrm{Cov}\begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \sigma^2 \begin{pmatrix} T & \sum x_t \\ \sum x_t & \sum x_t^2 \end{pmatrix}^{-1} = \sigma^2 \begin{pmatrix} \dfrac{\sum x_t^2}{T\sum(x_t-\bar{x})^2} & \dfrac{-\bar{x}}{\sum(x_t-\bar{x})^2} \\ \dfrac{-\bar{x}}{\sum(x_t-\bar{x})^2} & \dfrac{1}{\sum(x_t-\bar{x})^2} \end{pmatrix}. (12.43)

These results extend the earlier one-variable case in which σ² is known.
Since σ² is unknown in practice, we need to integrate it out to get the marginal distributions,

p(\alpha \,|\, D) \propto \left[\nu + \frac{\sum(x_t - \bar{x})^2}{s^2 \sum x_t^2 / T}\,(\alpha - \hat\alpha)^2\right]^{-(\nu+1)/2}, (12.44)

p(\beta \,|\, D) \propto \left[\nu + \frac{\sum(x_t - \bar{x})^2}{s^2}\,(\beta - \hat\beta)^2\right]^{-(\nu+1)/2}, (12.45)

where ν = T − 2 (again, we now have (T − 2) vs. the earlier (T − 1) due to one less degree of freedom).
For informative priors and multivariate regressions, see Zellner (1971) for further discussions.
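The marginal posteriors (12.44)–(12.45) are Student-t around the OLS estimates, with scale parameters that coincide with the classical standard errors implied by (12.43). A numpy sketch that computes those scales on simulated data (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
T = 120
x = rng.normal(size=T)
y = 0.5 + 1.5 * x + rng.normal(size=T)   # true alpha = 0.5, beta = 1.5

xbar, ybar = x.mean(), y.mean()
Sxx = ((x - xbar) ** 2).sum()
beta_hat = ((x - xbar) @ (y - ybar)) / Sxx
alpha_hat = ybar - beta_hat * xbar
nu = T - 2
s2 = ((y - alpha_hat - beta_hat * x) ** 2).sum() / nu     # (12.42)

# Posterior t scales implied by (12.44)-(12.45), nu = T - 2 degrees of freedom.
scale_alpha = np.sqrt(s2 * (x ** 2).sum() / (T * Sxx))
scale_beta = np.sqrt(s2 / Sxx)
```

Under the diffuse prior, a Bayesian interval \hat\beta ± t-quantile × scale_beta numerically matches the classical confidence interval, only the interpretation differs.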
12.6 Bayesian CAPM test
Recall that we have a multivariate regression model for the asset excess returns in testing the CAPM,

r_{it} = \alpha_i + \beta_i r_{mt} + \epsilon_{it}, \quad i = 1, \ldots, N, (12.46)

where r_{it} is the return on asset i in excess of the return on a Treasury bill, r_{mt} is the excess return on the market portfolio, and \epsilon_{it} is the disturbance.

A key assumption about the disturbances is that they are correlated contemporaneously but not across time:

E[\epsilon_{it}\epsilon_{js}] = \begin{cases} \sigma_{ij}, & \text{if } t = s; \\ 0, & \text{otherwise.} \end{cases} (12.47)

This is understandable. If the CAPM underprices one technology stock, it is likely to do so for another, so the residuals of the two stocks are likely correlated at a given time t. Over time, all stock returns are difficult to forecast, and iid can be a good assumption.
The contemporaneous correlation implies that we cannot study the univariate regression of each stock in isolation; the information in other stocks is useful. However, it should be noted that the parameters, the alphas and betas, are still obtained from each company's univariate regression. It is just that their standard errors are affected by the other companies.
In the Bayesian framework, a confidence region for the alphas can be computed:

\hat\alpha_i - h\sqrt{\mathrm{var}[\alpha_i]} < \alpha_i < \hat\alpha_i + h\sqrt{\mathrm{var}[\alpha_i]}, \quad i = 1, \ldots, N, (12.48)

where \hat\alpha_i is the OLS estimator and h is a number chosen so that the region's probability is, say, 95%. Then we can examine whether all the alphas lie inside or outside of the region, providing intuition on how the alphas differ from zero, or on the degree of validity of the CAPM.
In the Bayesian framework, it is also convenient (under the normality assumption) to compute the exact distribution of

\lambda = \alpha' \Sigma^{-1} \alpha, (12.49)

where Σ is the covariance matrix of the residuals. Clearly, the greater the alphas (in absolute value), the greater λ. In fact, λ measures the extra money one can earn if the CAPM is not true.

The conditional distribution formula of the multivariate normal distribution can be used to obtain the marginal distribution of the alphas, which makes the above two computations possible. Harvey and Zhou (1990) provide all the details.
In general, we can assume that the stocks and factors are jointly normal; then their conditional moments are related to the parameters of the multivariate regression of X_1 (the stocks) on X_2 (the factors),

X_{1t} = \alpha + B X_{2t} + E_t, (12.50)

where E_t is a vector of model disturbances with zero means and a non-singular covariance matrix Σ, with the relationships

\alpha = \mu_1 - B\mu_2, \quad B = V_{12} V_{22}^{-1}, (12.51)

and

\Sigma = V_{11} - B V_{22} B'. (12.52)

Then the procedure for testing the CAPM can be applied, yielding a Bayesian framework for testing multi-factor models.
13 Black-Litterman Model
Since its publication, the Black and Litterman (1992) asset allocation model has gained wide application at many financial institutions. In this section, we first discuss its motivation, then the details of the model in one and N dimensions. Finally, we discuss some of its problems and offer a few alternatives.
13.1 Motivations
While the mean-variance optimal portfolio is an elegant framework, there are many problems with its use in practice (see, e.g., Michaud (1998) or our earlier discussions). In particular, Black and Litterman (1992) find that it recommends large short positions in many assets when no constraints are imposed, and corner solutions with zero weights in many assets when no-short-selling constraints are imposed. To solve this problem, Black and Litterman propose to combine parameter estimates with what is suggested by asset pricing theory – the CAPM, or the equilibrium values. Their solution also provides a way to incorporate priors (cutting-edge research/info, Wall Street buzz) into the portfolio optimization process.
13.2 Single risky asset case
For ease of understanding, we first discuss the Black and Litterman model in the single risky asset case, and leave the more complex case to the next subsection.

Consider asset allocation between a risky and a riskless asset. The first question we ask is: what is the expected return likely to be?

In equilibrium, investors as a whole hold all of the stocks, in proportion to the market portfolio or value-weighted index. Let w_e be the weight of the market portfolio, π the expected excess return (or risk premium), and γ the average risk aversion of the world. The key assumption of Black and Litterman (1992) is that, in equilibrium, if all investors hold the same view, then their demand for the risky asset is exactly equal to the outstanding supply, which is given by the optimal portfolio weight formula (2.32),

w_e = \frac{1}{\gamma}\frac{\pi}{\sigma^2}. (13.1)

In other words, the equilibrium risk premium satisfies

\pi = \gamma \sigma^2 w_e, (13.2)

where π is a constant, being the equilibrium value.
Assume as usual that the excess return is normally distributed,

R_t = \mu + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma^2). (13.3)

Recall that a Bayesian views all parameters as random variables. Hence, µ is naturally assumed to be normally distributed with mean π,

\mu = \pi + \epsilon_t^e, \quad \epsilon_t^e \sim N(0, \kappa\sigma^2), (13.4)

where κ is a scalar indicating how close µ is to its equilibrium value.

On the other hand, an investor may have a view that

\mu = \mu_0 + \epsilon_t^v, \quad \epsilon_t^v \sim N(0, \omega^2). (13.5)

For example, if µ_0 > π, the investor believes the risk premium will be higher than the equilibrium value.
Now, regarding the equilibrium relationship as the likelihood function and the view as the prior,13 the Bayes Theorem provides (in exactly the same way as the earlier combining of prior information in Section 12.3) the posterior density:

p(\mu \,|\, y) \propto p_0(\mu)\, p(y \,|\, \mu) (13.6)
\propto \exp\left(-\left[\frac{(\mu-\mu_0)^2}{2\omega^2} + \frac{(\mu-\pi)^2}{2\kappa\sigma^2}\right]\right) (13.7)
\propto \exp\left[-\left(\frac{\omega^2 + \kappa\sigma^2}{2\omega^2\kappa\sigma^2}\right)\left(\mu - \frac{\omega^2\pi + \mu_0\kappa\sigma^2}{\omega^2 + \kappa\sigma^2}\right)^2\right], (13.8)

so the posterior mean is

\bar\mu = \frac{\omega^2\pi + \mu_0\kappa\sigma^2}{\omega^2 + \kappa\sigma^2} = \frac{(\kappa\sigma^2)^{-1}\pi + (\omega^2)^{-1}\mu_0}{(\kappa\sigma^2)^{-1} + (\omega^2)^{-1}}, (13.9)
13The same result is obtained here if one changes the role of the two.
and the variance is

\bar\theta^2 = \frac{\omega^2\kappa\sigma^2}{\omega^2 + \kappa\sigma^2} = \frac{1}{(\kappa\sigma^2)^{-1} + (\omega^2)^{-1}}. (13.10)

Again, these are weighted averages of the prior and equilibrium values.
The posterior density says that

\mu = \bar\mu + \epsilon_t^c, \quad \epsilon_t^c \sim N(0, \bar\theta^2). (13.11)

Combining this with (13.3), we have

R_t = \mu + \epsilon_t = \bar\mu + (\epsilon_t + \epsilon_t^c), (13.12)

so the Bayesian updated mean is \bar\mu and the variance is

\bar\sigma^2 = \sigma^2 + \bar\theta^2,

where \epsilon_t and \epsilon_t^c are assumed independent. Hence, using again the earlier optimal portfolio formula (2.32), we get the Bayesian optimal portfolio weight after updating,

w^* = \frac{1}{\gamma}\frac{\bar\mu}{\bar\sigma^2}, (13.13)

which is to apply the standard formula using the Bayesian parameter estimates.
There are a few interesting facts. First, if the investor believes 100% in the equilibrium risk premium, i.e., κ = 0, then clearly \bar\mu = π and \bar\sigma^2 = σ², implying w^* = w_e; that is, the investor holds the equilibrium market portfolio. Second, if the view is held with certainty, ω = 0, then \bar\mu = µ_0, \bar\sigma^2 = σ², and the investor invests more or less in the risky asset depending on whether µ_0 is greater or smaller than π. Third, if ω > 0 and κ > 0, the investor will invest less than the market even if µ_0 = π. This is because the risk of the asset has gone up, \bar\sigma^2 > σ², when the investor is unsure of its expected return. As the investor is risk-averse, the amount invested in the risky asset must go down, so w^* < w_e.
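The single-asset formulas (13.9), (13.10), and (13.13) fit in a few lines. A sketch with hypothetical inputs that illustrates the third fact: when µ_0 = π but κ > 0 and ω > 0, the posterior mean is unchanged yet w* < w_e because \bar\sigma^2 > σ²:

```python
def bl_single(pi, sigma2, mu0, omega2, kappa, gamma):
    """Single-asset Black-Litterman update, eqs. (13.9), (13.10), (13.13)."""
    prec_eq = 1.0 / (kappa * sigma2)   # precision of the equilibrium relation
    prec_view = 1.0 / omega2           # precision of the investor's view
    mu_bar = (prec_eq * pi + prec_view * mu0) / (prec_eq + prec_view)   # (13.9)
    theta2 = 1.0 / (prec_eq + prec_view)                                # (13.10)
    w_star = mu_bar / (gamma * (sigma2 + theta2))                       # (13.13)
    return mu_bar, w_star

# Hypothetical inputs: pi = 6%, sigma = 20%, view mu0 = pi, kappa = 0.05, omega = 2%.
mu_bar, w_star = bl_single(0.06, 0.04, 0.06, 0.0004, 0.05, 3.0)
w_e = 0.06 / (3.0 * 0.04)    # equilibrium weight from (13.1)
```

With these numbers, mu_bar stays at 6% while w_star falls just below w_e; a near-dogmatic view (ω² tiny) drives mu_bar to µ_0 instead.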
13.3 Multiple risky asset case
In the multivariate case, the excess returns of n > 1 risky assets are

R_t = \mu + \epsilon_t, \quad \epsilon_t \sim N(0, \Sigma). (13.14)

In contrast with (13.3), R_t is now an n-vector and Σ is an n × n matrix. Analogously, the n-vector of equilibrium risk premia satisfies

\Pi = \gamma \Sigma w_e. (13.15)

Thus, the distribution of µ is

\mu = \Pi + \epsilon_t^e, \quad \epsilon_t^e \sim N(0, \kappa\Sigma), (13.16)

where κ is the same scalar indicating how close µ is to its equilibrium value.
However, the views on µ are more complex than in the single risky asset case. First, there can be K views, 0 < K ≤ n. Second, a view is not necessarily on an element of µ, but on a portfolio of them. For example, the first view can be stated as: a portfolio with weights P_1 = (p_{11}, p_{12}, \ldots, p_{1n})' has a prior mean µ_{01}, i.e.,

P_1'\mu = p_{11}\mu_1 + p_{12}\mu_2 + \cdots + p_{1n}\mu_n = \mu_{01} + \epsilon_{1t}^v, \quad \epsilon_{1t}^v \sim N(0, \Omega_{11}). (13.17)
With K views, we can write all K equations in a simple matrix form,

P\mu = \mu_0 + \epsilon_t^v, \quad \epsilon_t^v \sim N(0, \Omega), (13.18)

where

P = \begin{pmatrix} P_1' \\ P_2' \\ \vdots \\ P_K' \end{pmatrix}, \quad \mu_0 = \begin{pmatrix} \mu_{01} \\ \mu_{02} \\ \vdots \\ \mu_{0K} \end{pmatrix}, \quad \epsilon_t = \begin{pmatrix} \epsilon_{1t} \\ \epsilon_{2t} \\ \vdots \\ \epsilon_{Kt} \end{pmatrix}, (13.19)

that is, P is a K × n matrix summarizing the views, µ_0 is a K-vector of the prior means, and \epsilon_t is the residual vector. The covariance matrix of the residuals, Ω, is often assumed diagonal, unless the errors of the views are correlated.
By the same logic as in the n = 1 case, the posterior density is

p(\mu \,|\, y) \propto p_0(\mu)\, p(y \,|\, \mu) (13.20)
\propto \exp\left(-\left[\frac{1}{2}(P\mu - \mu_0)'\Omega^{-1}(P\mu - \mu_0) + \frac{1}{2}(\mu - \Pi)'(\kappa\Sigma)^{-1}(\mu - \Pi)\right]\right). (13.21)

By matrix algebra, it can be verified that the quadratic terms satisfy

(P\mu - \mu_0)'\Omega^{-1}(P\mu - \mu_0) + (\mu - \Pi)'(\kappa\Sigma)^{-1}(\mu - \Pi) (13.22)
= \mu'[P'\Omega^{-1}P + (\kappa\Sigma)^{-1}]\mu - 2[\mu_0'\Omega^{-1}P + \Pi'(\kappa\Sigma)^{-1}]\mu + C (13.23)
= (\mu - \bar\mu)'\bar\Theta^{-1}(\mu - \bar\mu) + C, (13.24)
where C is a generic constant,

\bar\mu = [(\kappa\Sigma)^{-1} + P'\Omega^{-1}P]^{-1}[(\kappa\Sigma)^{-1}\Pi + P'\Omega^{-1}\mu_0], (13.25)

\bar\Theta = [(\kappa\Sigma)^{-1} + P'\Omega^{-1}P]^{-1}. (13.26)

This says that the posterior density is normal with mean \bar\mu and covariance matrix \bar\Theta. Then the associated Bayesian portfolio weights are computed as before, with covariance matrix \bar\Sigma = \Sigma + \bar\Theta.
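The master formulas (13.25)–(13.26) translate directly into numpy; the two-asset covariance matrix and the single relative view below are made up for illustration:

```python
import numpy as np

def black_litterman(Sigma, w_e, gamma, kappa, P, mu0, Omega):
    """Posterior mean and covariance of mu, eqs. (13.25)-(13.26)."""
    Pi = gamma * Sigma @ w_e                         # equilibrium premia (13.15)
    A = np.linalg.inv(kappa * Sigma)                 # (kappa * Sigma)^{-1}
    B = P.T @ np.linalg.inv(Omega)                   # P' Omega^{-1}
    Theta = np.linalg.inv(A + B @ P)                 # (13.26)
    mu_bar = Theta @ (A @ Pi + B @ mu0)              # (13.25)
    return mu_bar, Theta

# Two assets; one view: asset 1 outperforms asset 2 by 2%, view variance 0.001.
Sigma = np.array([[0.04, 0.01], [0.01, 0.09]])
w_e = np.array([0.6, 0.4])
P = np.array([[1.0, -1.0]])
mu0 = np.array([0.02])
Omega = np.array([[0.001]])
mu_bar, Theta = black_litterman(Sigma, w_e, 3.0, 0.05, P, mu0, Omega)
```

Here the equilibrium spread is Π_1 − Π_2 = −4.2% while the view says +2%, and the posterior spread lands between the two; as Ω grows, the view loses influence and \bar\mu reverts to Π.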
13.4 Alternative approaches
One of the problems with the Black and Litterman model is that there is no model for the data-generating process. Ideally, prior information on the expected returns, including the equilibrium priors, should be combined with the likelihood function of the data-generating process. Pástor and Stambaugh (2000) and Tu and Zhou (2004, 2010) are examples of research in this direction. Zhou (2009) provides a general framework.
14 References
Alexander, C., 2001, Market Models: A Guide to Financial Data Analysis, Wiley.
Amemiya, T., 1985, Advanced econometrics, Harvard University Press, MA.
Anderson, T.W., 1984, An Introduction to Multivariate Statistical Analysis, 2ed, Wiley.
Ang, A., 2014, Asset Management: A Systematic Approach to Factor Investing, Oxford University Press.
Anthony, M., and P. Bartlett, 2009, Neural Network Learning: Theoretical Foundations, Cambridge University Press.
Ao, M., Y., Li, and X. Zheng, 2019, Approaching mean-variance efficiency for large portfolios, Review of Financial
Studies 32, 2890–2919.
Arditti, F., 1971, Another look at mutual fund performance, Journal of Financial and Quantitative Analysis 6,
909–912.
Azzalini, A., 1985, A class of distributions which includes the normal ones, Scandinavian Journal of Statistics 12, 171–178.
Azzalini, A., and A. Dalla Valle, 1996, The multivariate skew-normal distribution, Biometrika 83, 715–726.
Bai, J., 2003, Inferential theory for factor models of large dimensions, Econometrica 71, 135–172.
Bai, J., and S. Ng, 2002, Determining the number of factors in approximate factor models, Econometrica 70, 191–221.
Bai, J., and S. Ng, 2008, Large dimensional factor models, Foundations and Trends in Econometrics 3, 89–163.
Bai, J., and P. Wang, 2016, Econometric analysis of large factor models, Annual Review of Economics 8, 53–80.
Baker, M., and J. Wurgler, 2006, Investor sentiment and the cross-section of stock returns, Journal of Finance 61, 1645–1680.
Barber, B., and T. Odean, 2000, Trading is hazardous to your wealth: the common stock performance of individual
investors, Journal of Finance 55, 773–806.
Barberis, N., 2000, Investing for the long run when returns are predictable. Journal of Finance 55, 225–264.
Barberis, N. and R. Thaler, 2003, A survey of behavioral finance, Chapter 18, Handbook of the Economics of
Finance, eds. George Constantinides, Milton Harris, and Rene Stulz, North-Holland, 937–972.
Bartlett, M. S., 1947, Multivariate analysis. Journal of the Royal Statistical Society (Suppl.) 9, 176–190.
Bates, J. M., and C. W. J. Granger, 1969, The combination of forecasts, Operational Research Quarterly 20, 451–468.
Bawa, V. S., S. J. Brown, and R. W. Klein, 1979, Estimation risk and optimal portfolio choice. North-Holland,
Amsterdam.
Benjamin, D., 2018, Errors in probabilistic reasoning and judgment biases, NBER working paper.
Berk, J., 1997, Necessary conditions for the CAPM, Journal of Economic Theory 73, 245–257.
Bishop, C., 2006, Pattern recognition and machine learning. Springer.
Black, F., 1972, Capital market equilibrium with restricted borrowing. Journal of Business 45, 444–454.
Black, F., Litterman, R., 1992, Global portfolio optimization, Financial Analysts Journal 48, 28–43.
Boehmer, E., C. Jones, and X. Zhang, 2008. Which shorts are informed? Journal of Finance 63, 491–527.
Bok, B., D. Caratelli, D. Giannone, A. Sbordone, and A. Tambalotti, 2017, Macroeconomic nowcasting and forecasting with big data, working paper.
Bollerslev, T., 1986, Generalized autoregressive conditional heteroskedasticity, Journal of Econometrics 31, 307–327.
Bollerslev, T., Chou, R.Y., Kroner, K.F., 1992, ARCH modeling in Finance: a selective review of the theory and
empirical evidence, Journal of Econometrics 52, 5–59.
Bollerslev, T., 2001, Financial econometrics: Past developments and future challenges, Journal of Econometrics 100,
41–51.
Bollerslev, T., G. Tauchen, and H. Zhou, 2009, Expected stock returns and variance risk premia, Review of Financial
Studies 22, 4463–4492.
Box, G.E.P., G. Jenkins, G. Reinsel, and G. Ljung, 2016, Time Series Analysis: Forecasting and Control, 5th ed., Wiley.
Brides, P., 2009, Examining portfolio optimisation as a regression problem, MSc thesis, Financial Engineering, Birkbeck, University of London.
Britten-Jones, M., 1999, The sampling error in estimates of mean-variance efficient portfolio weights, Journal of
Finance 54, 655–671.
Brock, W., Lakonishok, J., LeBaron, B., 1992. Simple technical trading rules and the stochastic properties of stock
returns. Journal of Finance 47, 1731–1764.
Brockwell, P., and R. Davis, 2016, Introduction to Time Series and Forecasting, 3ed, Springer.
Brown, S. J., 1976, Optimal portfolio choice under uncertainty, Ph.D. dissertation, University of Chicago.
Brown, S. J., 1978, The portfolio choice problem: comparison of certainty equivalence and optimal Bayes portfolios,
Communications in Statistics-Simulation and Computation 7, 321–334.
Bühlmann, P., and S. van de Geer, 2011, Statistics for High-Dimensional Data, Springer.
Campbell, J.Y. and S.B. Thompson, 2008, Predicting the equity premium out of sample: Can anything beat the
historical average?, Review of Financial Studies 21, 1509–1531.
Campbell, John Y. and Luis M. Viceira, 2003, Strategic asset allocation: portfolio choice for long-term investors,
Oxford University Press.
Chang, R., Chu, L., Tu, J., Zhang, B., and Zhou, G., 2021, ESG and the market return, working paper.
Chen, J., Tang, G., Yao, J., and Zhou, G., 2020, Investor attention and stock returns, Journal of Financial and
Quantitative Analysis (forthcoming).
Chen, J., Tang, G., Yao, J., and Zhou, G., 2021, Employee sentiment and stock returns, working paper.
Chen, J., and M. Yuan, 2016, Efficient portfolio selection in a large market, Journal of Financial Econometrics 14,
496–524.
Chen, N.-F., R. Roll, and S. A. Ross, 1986, Economic forces and the stock market, Journal of Business 59, 383–403.
Chen, Y., Z. Da and D. Huang, 2021, Short selling efficiency, Journal of Financial Economics (forthcoming).
Chib, S., L. Zhao and G. Zhou, 2021, Winners from winners: A tale of risk factors, working paper.
Chincarini, Ludwig B., and Daehwan Kim, 2006, Quantitative Equity Portfolio Management, New York: McGraw-
Hill.
Chinco, A., A. D. Clark-Joseph, and M. Ye, 2019, Sparse signals in the cross-section of returns. Journal of Finance
74, 449–492.
Chou, P., G. Zhou, 2006, Using bootstrap to test portfolio efficiency, Annals of Economics and Finance 7, 217–249.
Christie, S., 2005, Is the Sharpe Ratio Useful in Asset Allocation? MAFC Research Papers No.31, Applied Finance
Centre, Macquarie University.
Christopherson, Jon A., Wayne Ferson and Andrew L. Turner, 1999, Performance evaluation using conditional
alphas and betas, Journal of Portfolio Management 26, 59–72.
Clark, T. E., and K. D. West, 2007, Approximately normal tests for equal predictive accuracy in nested models, Journal of Econometrics 138, 291–311.
Clarke, R., H. de Silva, and S. Thorley, 2002, Portfolio constraints and the fundamental law of active management, Financial Analysts Journal 58, 48–66.
Cochrane, J. H. 2001. Asset pricing, Princeton University Press.
Cochrane, J.H., and M. Piazzesi, 2005, Bond risk premia, American Economic Review 95, 138–160.
Coggin, T. Daniel, and Frank J. Fabozzi, 2003, The Handbook of Equity Style Management, Wiley.
Cohen, Randolph, Joshua Coval and Lubos Pastor, 2005, Judging fund managers by the company they keep, Journal
of Finance 60, 1057–1096.
Connor, G. and R. Korajczyk, 1988, Risk and return in an equilibrium APT: An application of a new methodology,
Journal of Financial Economics 21, 255–289.
Connor, G. and R. A. Korajczyk, 1995, The arbitrage pricing theory and multifactor models of asset returns, in
Handbooks in Operations Research and Management Science: Finance, Volume 9, edited by R. A. Jarrow, et
al, North-Holland.
Cook, R. D., Forzani, L., 2019. Partial least squares prediction in high-dimensional regression, Annals of Statistics
47, 884–908.
Cook, R. D., Forzani, L., 2021, PLS Regression Algorithms in the Presence of Nonlinearity, working paper.
Coqueret, G., and T. Guida, 2021, Machine Learning for Factor Investing, CRC Press.
Coval, J., and T. Shumway, 2005, Do behavioral biases affect prices?, Journal of Finance 60, 1–34.
Covel, M., and B. Ritholtz, 2017, Trend Following: How to Make a Fortune in Bull, Bear and Black Swan Markets,
Wiley, 5th edition.
Cujean, J., and M. Hasler, 2017, Why does return predictability concentrate in bad times? Journal of Finance 72, 2717–2758.
Cybenko, G. 1989, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and
Systems 2, 303–314.
Daniel, K., D. Hirshleifer, and L. Sun, 2020, Short-and long-horizon behavioral factors, Review of Financial Studies
33, 1673–1736.
de Jong, S., 1993, SIMPLS: an alternative approach to partial least squares regression, Chemometrics and Intelligent Laboratory Systems 18, 251–263.
DeRoon, F. A. and T. E. Nijman, 2001, Testing for mean-variance spanning: a survey, Journal of Empirical Finance
8, 111–155.
Deisenroth, M., A. Faisal, and C. Ong, 2020, Mathematics for Machine Learning, Cambridge University Press.
DeMiguel, V., L. Garlappi, and R. Uppal, 2009, Optimal versus naive diversification: How inefficient is the 1/N
portfolio strategy? Review of Financial Studies 22, 1915–1953.
Den Haan, W.J., and A. Levin, 1997, A practitioner’s guide to robust covariance matrix estimation, in Handbook
of Statistics 15, G.S. Maddala and C.R. Rao, eds., Elsevier (Amsterdam), pp.299–342.
Diebold, F. X., and R. S. Mariano. 1995, Comparing predictive accuracy, Journal of Business and Economic
Statistics 13, 253–263.
Diebold, F. X. and M. Shin, 2019, Machine learning for regularized survey forecast combination: Partially-egalitarian
LASSO and its derivatives, International Journal of Forecasting 35, 1679–1691.
Ding, Z., and R. Martin, 2017, The fundamental law of active management: Redux, Journal of Empirical Finance
43, 91–114.
Dixon, M., I. Halperin, and P. Bilokon, 2020, Machine Learning in Finance: From Theory to Practice, Springer.
Doane, P., and L. Seward, 2011, Measuring skewness: a forgotten statistic, Journal of Statistics Education 19, 1–18.
Dong, X., Li, Y., Rapach, D. and Zhou, G., 2021, Anomalies and the Expected Market Return, Journal of Finance
(forthcoming).
Edmans, A., A. Fernandez-Perez, A. Garel and I. Indriawan, 2021, Music sentiment and stock returns around the
world, Journal of Financial Economics, forthcoming.
Efron, B., 1979, Bootstrap methods: Another look at the jackknife, Annals of Statistics 7, 1–26.
Engle, Robert F., 1982, Autoregressive conditional heteroskedasticity with estimates of the variance of United
Kingdom inflation, Econometrica 50, 987–1007.
Fabozzi, Frank J., Petter N. Kolm, Dessislava Pachamanova, and Sergio M. Focardi, 2007, Robust Portfolio Opti-
mization and Management, New York: Wiley.
Fabozzi Frank J., Dashan Huang and Guofu Zhou, 2010, Robust Portfolios: Contributions from Operations Research
and Finance, Annals of Operations Research 176, 191–220.
Fama, E. F., MacBeth, J. D., 1973, Risk, return, and equilibrium: Empirical tests, Journal of Political Economy 81, 607–636.
Fama, E.F., French, K.R., 1993, Common risk factors in the returns on stocks and bonds, Journal of Financial
Economics 33, 3–56.
Fama, E.F., French, K.R., 2015, A five-factor asset pricing model, Journal of Financial Economics 116, 1–22.
Fan, J., Liao, Y., and M. Mincheva, 2013, Large covariance estimation by thresholding principal orthogonal complements, Journal of the Royal Statistical Society (Series B, Statistical Methodology) 75, 603–680.
Feng, G., S. Giglio, and D. Xiu, 2020, Taming the factor zoo: A test of new factors, Journal of Finance 75, 1327–1370.
Ferri, R., 2010, All About Asset Allocation, 2e, McGraw-Hill.
Filippou, I., M. Taylor, and G. Zhou, 2020, Exchange Rate Prediction with Machine Learning and a Smart Carry
Portfolio, Working paper.
Frazzini, A., Israel, R., Moskowitz, T., 2015, Trading costs of asset pricing anomalies, Working paper.
French, K., and J. Poterba, 1991, Investor diversification and international equity markets, American Economic
Review 81, 222–226.
Freyberger, J., A. Neuhierl, and M. Weber, 2020, Dissecting characteristics nonparametrically, Review of Financial
Studies 33, 2326–2377.
Gao, L., Y. Han, Z., Li, and G. Zhou, 2018, Market intraday momentum, Journal of Financial Economics 129,
394–414.
Géron, A., 2019, Hands-on Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd edition, O'Reilly Media.
Geweke, J., Zhou, G., 1996, Measuring the pricing error of the arbitrage pricing theory. Review of Financial Studies
9, 557–587.
Ghayur, K., R. Heaney, and S. Platt, 2019, Equity Smart Beta and Factor Investing for Practitioners, Wiley.
Ghysels, E., and M. Marcellino, 2018, Applied Economic Forecasting using Time Series Methods, Oxford U. Press.
Gibbons, M., S. Ross and J. Shanken, 1989, A test of the efficiency of a given portfolio, Econometrica 57, 1121–1152.
Giglio, S. and D. Xiu, 2021, Asset pricing with omitted factors, Journal of Political Economy 129, 1947–1990.
Giglio, S., Kelly, B., D., Xiu, 2021, Factor models, machine learning, and asset pricing, working paper.
Giraud, C., 2015, Introduction to high-dimensional statistics. CRC Press.
Glassermann, P., 2004, Monte Carlo Methods in Financial Engineering, Springer-Verlag.
Goh, Jeremy, Fuwei Jiang, Jun Tu and Guofu Zhou, 2012, Forecasting bond risk premia using technical indicators,
Washington University in St Louis, Working paper.
Guo, X., H. Lin, C. Wu, and G. Zhou, 2020, Extracting information from corporate yield curve: A machine learning
approach, Working paper.
Graham, John R., and Campbell Harvey, 1996, Market timing ability and volatility implied in investment newslet-
ters’ asset allocation recommendations, Journal of Financial Economics 42, 397–421.
Graham, John R., and Campbell Harvey, 1997, Grading the performance of market timing newsletters, Financial
Analysts Journal, 54–66.
Griffin, John M., Jeffrey H. Harris and Selim Topaloglu, 2003, Investor behavior over the rise and fall of Nasdaq,
working paper, Yale University.
Grinblatt, Mark S, and Sheridan Titman, 1995, Performance evaluation, in Handbook in Operations Research and
Management Science, Vol. 9: Finance, Jarrow, R., Maksimovic, V., and Ziemba, W. (Eds.), Elsevier Science,
581–609.
Grinold, Richard C., 1989, The fundamental law of active management, The Journal of Portfolio Management 15,
30–37.
Grinold, Richard C and Ronald N. Kahn, 1999, Active portfolio management: quantitative theory and applications,
McGraw-Hill.
Gu, S., B. Kelly, and D. Xiu, 2020, Empirical asset pricing via machine learning, Review of Financial Studies 33,
2223–2273.
Guida, T., 2019, Big Data and Machine Learning in Quantitative Investment, Wiley.
Gulli, A., Kapoor, A., Pal, S., 2019, Deep Learning with TensorFlow 2 and Keras, 2nd ed., Packt.
Hall, P., 1992, The Bootstrap and Edgeworth Expansion. Springer-Verlag, New York.
Hilpisch, Y., 2015, Derivatives analytics with Python: data analysis, models, simulation, calibration and hedging,
Wiley.
Horowitz, J., 1995, Bootstrap methods in econometrics: Theory and numerical performance, in Advances in Economics and Econometrics: Theory and Applications III, edited by D. M. Kreps and K. F. Wallis, 188–222, Cambridge University Press.
Han, Y., K. Yang, and G. Zhou, 2013, A new anomaly: the cross-sectional profitability of technical analysis, Journal
of Financial and Quantitative Analysis 48, 1433–1461.
Han, Y., G. Zhou, and Y. Zhu, 2016, A trend factor: any economic gains from using information over investment
horizons? Journal of Financial Economics 122, 352–375.
Han, Y., A. He, D. E. Rapach, and G. Zhou, 2021, What firm characteristics drive US stock returns? Manuscript.
Han, Y., Y. Liu, G. Zhou, and Y. Zhu, 2021, Technical Analysis in the Stock Market: A Review. Manuscript.
Hansen, L. P., 1982, Large sample properties of generalized method of moments estimators, Econometrica 50,
1029–1054.
Harvey, C. R., and G. Zhou, 1990, Bayesian inference in asset pricing tests, Journal of Financial Economics 26,
221–254.
Harvey, C., and G. Zhou, 1993, International asset pricing with alternative distributional specifications, Journal of Empirical Finance 1, 107–131.
Harvey, C., Liu, Y., and H. Zhu, 2016, ... and the cross-section of expected returns, Review of Financial Studies
29, 5–68.
Hastie, T., R. Tibshirani, and J. Friedman, 2009, The Elements of Statistical Learning, 2nd edition, Springer.
Haugen, R., N. Baker, 1996, Commonality in the determinants of expected stock returns, Journal of Financial
Economics 41, 401–439.
Helland, I., and T. Almøy, 1994, Comparison of prediction methods when only a few components are relevant, Journal of the American Statistical Association 89, 583–591.
Henkel, S., J. S. Martin, and F. Nardari, 2011, Time-varying short-horizon predictability, Journal of Financial
Economics 99, 560–580.
Hoerl, A. E. and R. W. Kennard, 1970, Ridge Regression: Applications to Nonorthogonal Problems, Technometrics
12, 69–82.
Hornik, K., M. Stinchcombe, and H. White, 1989, Multilayer feedforward networks are universal approximators,
Neural Networks 2, 359–366.
Hou, Kewei, Chen Xue, and Lu Zhang, 2015, Digesting anomalies: An investment approach, Review of Financial
Studies 28, 650–705.
Huang, D., Jiang, F., J. Tu and G. Zhou, 2015, Investor sentiment aligned: a powerful predictor of stock returns,
Review of Financial Studies 28, 791–837.
Huang, D., and G. Zhou, 2017, Upper bounds on return predictability, Journal of Financial and Quantitative Analysis 52, 401–425.
Huang, D., J. Li, and L. Wang, 2020, Time-series momentum: is it there? Journal of Financial Economics 135,
774–794.
Huang, D., Jiang, F., Li, K., Tong, G., and G. Zhou, 2020, Scaled PCA: A new approach to dimension reduction, Management Science (forthcoming).
Huang, Chi-fu, and Robert H. Litzenberger, 1988, Foundations for Financial Economics, North-Holland.
Huberman, G. and S. Kandel, 1987, Mean-variance spanning, Journal of Finance 42, 873–888.
Hurst, B., Y. Ooi, and L. Pedersen, 2017, A century of evidence on trend-following investing, Journal of Portfolio Management 44, 15–29.
Ingersoll, J., 1987, Theory of Financial Decision Making, Rowman and Littlefield.
Jacquier, E., Kane, A., and Marcus, A. J., 2003, Geometric or arithmetic mean: a reconsideration, Financial
Analysts Journal 59, 46–53.
Jacquier, E., Kane, A., and Marcus, A. J., 2005, Optimal estimation of the risk premium for the long run and asset
allocation: a case of compounded estimation risk, Journal of Financial Econometrics 3, 37–55.
Jiang, F., J. Lee, X. Martin and G. Zhou, 2019, Manager sentiment and stock returns, Journal of Financial Economics
132, 126–149.
Jiang, L., K. Wu, G. Zhou, Y. Zhu, 2020, Stock return asymmetry: beyond skewness, Journal of Financial and
Quantitative Analysis 55, 357–386.
Jagannathan, R., and Z. Wang, 2002, Empirical evaluation of asset pricing models: A comparison of the SDF and
beta models, Journal of Finance 57, 2337–2368.
Jiang, F., G. Tang, and G. Zhou, 2018, Firm characteristics and Chinese stocks, Journal of Management Science
and Engineering 3, 259–283.
Jiang, H., Z. Li and H. Wang, 2020, Pervasive underreaction: Evidence from high-frequency data, working paper.
Joanes, D., and C. Gill, 1998, Comparing measures of sample skewness and kurtosis, Journal of the Royal Statistical Society, Series D 47, 183–189.
Jobson, J. D. and B. M. Korkie, 1981, Performance hypothesis testing with the Sharpe and Treynor measures.
Journal of Finance 36, 889–908.
Jobson, J. D. and B. M. Korkie, 1983, Statistical inference in two-parameter portfolio theory with multiple regression
software, Journal of Financial and Quantitative Analysis 18, 189–197.
Johnstone, I., and D. Paul, 2018, PCA in High Dimensions: An Orientation, Proceedings of the IEEE 106, 1277–
1292.
Jolliffe, I. T., 2002, Principal Components Analysis, 2nd edition, Springer.
Jorion, P., 1986, Bayes-Stein estimation for portfolio analysis, Journal of Financial and Quantitative Analysis 21,
279–292.
Jorion, P., 2003, Portfolio optimization with tracking-error constraints, Financial Analysts Journal 59, 70–82.
Jurczenko, E., and B. Maillet, 2006, Multi-moment Asset Allocation and Pricing Models, Wiley.
Jurczenko, E., 2020, Machine Learning for Asset Management, Wiley.
Tversky, A., and D. Kahneman, 1974, Judgment under uncertainty: heuristics and biases, Science 185, 1124–1131.
Kan, R., and G. Zhou, 1999, A critique of the stochastic discount factor methodology, Journal of Finance 54,
1021–1048.
Kan, R., C. Robotti, and J. Shanken, 2013, Pricing model performance and the two-pass cross-sectional regression
methodology, Journal of Finance 68, 2617–2649.
Kan, R., X. Wang, and G. Zhou, 2021, Optimal portfolio choice with estimation risk: no risk-free asset case,
Management Science (forthcoming).
Kan, R., and G. Zhou, 2007, Optimal portfolio choice with parameter uncertainty, Journal of Financial and Quan-
titative Analysis 42, 621–656.
Kan, R., and G. Zhou, 2009, What will the likely range of my wealth be? Financial Analysts Journal 65 (4), 68–77.
Kan, R., and G. Zhou, 2012, Tests of mean-variance spanning, Annals of Economics and Finance 13, 145–193.
Kandel, S., Stambaugh, R.F., 1996, On the predictability of stock returns: An asset-allocation perspective. Journal
of Finance 51, 385–424.
Kelly, J. L., 1956, A new interpretation of information rate, Bell System Technical Journal 35, 917–926.
Kelly, B., Pruitt, S., 2013, Market expectations in the cross-section of present values, Journal of Finance 68, 1721–
1756.
Kelly, B., Pruitt, S., 2015, The three-pass regression filter: A new approach to forecasting using many predictors,
Journal of Econometrics 186, 294–316.
Kendall, M., and A. Hill, 1953, The analysis of economic time-series, Part I: prices, Journal of the Royal Statistical Society, Series A 116, 11–34.
Kim, T., H. White, and D. Stone, 2005, Asymptotic and Bayesian Confidence Intervals for Sharpe-Style Weights,
Journal of Financial Econometrics 3, 315–343.
Klein, R., and V. Bawa, 1976, The effect of estimation risk on optimal portfolio choice, Journal of Financial Economics 3, 215–231.
Klaas, J., 2019, Machine Learning for Finance, Packt.
Kozak, S., S. Nagel, and S. Santosh, 2020, Shrinking the cross section, Journal of Financial Economics 135, 271–292.
Ledoit, O. and Wolf, M., 2003, Improved estimation of the covariance matrix of stock returns with an application
to portfolio selection, Journal of Empirical Finance 10, 603–621.
Ledoit, O. and Wolf, M., 2017, Nonlinear shrinkage of the covariance matrix for portfolio selection: Markowitz
meets goldilocks, Review of Financial Studies 30, 4349–4388.
Ledoit, O. and Wolf, M., 2020, Analytical nonlinear shrinkage of large-dimensional covariance matrices, Annals of
Statistics 48, 3043–3065.
Lehmann, E.L., and G. Casella, 1998, Theory of Point Estimation (Springer-Verlag, New York).
Lee, C., A. Shleifer and R. Thaler, 1991, Investor sentiment and the closed-end fund puzzle, Journal of Finance 46,
75–110.
Lehmann, B. N., and D. M. Modest, 1988, The empirical foundations of the arbitrage pricing theory, Journal of
Financial Economics 21, 213–254.
Leibowitz, Martin L., 1996, Return targets and shortfall risks: studies in strategic asset allocation, Irwin Professional
Pub.
Lewellen, J., 2015, The cross-section of expected stock returns, Critical Finance Review 4, 1–44.
Lie, E., Meng, B., Qian, Y., and G. Zhou, 2017, Corporate activities and the market risk premium, working paper.
Lin, H., C. Wu, and G. Zhou, 2018, Forecasting corporate bond returns: an iterated combination approach, Man-
agement Science 64, 4218–4238.
Litterman, B., J. Scheinkman, 1991, Common factors affecting bond returns, Journal of Fixed Income 1, 54–61.
Liu, H., X. Tang, and G. Zhou, 2021, Recovering the FOMC risk premium, working paper.
Liu, Y., G. Zhou, and Y. Zhu, 2020a, Trend factor in China, working paper.
Liu, Y., G. Zhou, and Y. Zhu, 2020b, Maximizing the Sharpe Ratio: A Genetic Programming Approach, working
paper.
Lo, Andrew W., and Craig MacKinlay, 1988, Stock market prices do not follow random walks: evidence from a
simple specification test, Review of Financial Studies 1, 41–66.
Lo, A. W., and J. Hasanhodzic, 2009, The Heretics of Finance: Conversations with Leading Practitioners of Technical Analysis, Bloomberg Press.
Lo, A. W., Mamaysky, H., Wang, J., 2000. Foundations of technical analysis: Computational algorithms, statistical
inference, and empirical implementation. Journal of Finance 55, 1705–1770.
López de Prado, M., 2018, Advances in Financial Machine Learning, Wiley.
López de Prado, M., 2020a, Machine Learning for Asset Managers, Cambridge University Press.
López de Prado, M., 2020b, Three quant lessons from COVID-19, presentation slides.
Ludvigson, S.C., and S. Ng., 2007. The Empirical risk-return relation: A factor analysis approach. Journal of
Financial Economics 83, 171–222.
MacKinlay, A. C., and M. P. Richardson, 1991. Using generalized method of moments to test mean-variance
efficiency. Journal of Finance 46, 511–527.
MacLean, L., E. Thorp, and W. Ziemba, 2011, The Kelly Capital Growth Investment Criterion: Theory and Practice, World Scientific.
Maillard, S., T. Roncalli, and J. Teiletche, 2010, The properties of equally weighted risk contribution portfolios, Journal of Portfolio Management 36, 60–70.
Mandelbrot, B., 1963, New methods in statistical economics, Journal of Political Economy 71, 421–440.
Maruyama, Y., 2004, Stein’s idea and minimax admissible estimation of a multivariate normal mean, Journal of
Multivariate Analysis 88, 320–334.
Markowitz, Harry M., 1952, Portfolio selection, Journal of Finance 7, 77–91.
Martin, I., 2017. What is the expected return on the market? Quarterly Journal of Economics 132, 367–433.
Menchero, J., and P. Li, 2020, Correlation shrinkage: implications for risk forecasting, Journal of Investment Management 18, 92–108.
McLachlan, G., and T. Krishnan, 1997, The EM algorithm and Extensions, Wiley.
Mehlawat, M., P. Gupta, A. Khan, 2021, Portfolio optimization using higher moments in an uncertain random
environment, Information Sciences 567, 348–374.
Memmel, C., 2003, Performance Hypothesis Testing with the Sharpe Ratio, Finance Letters 1, 21–23.
Merton, R., 1969, Lifetime portfolio selection under uncertainty: The continuous-time case, Review of Economics
and Statistics 51, 247–257.
Merton, R., 1971, Optimum consumption and portfolio rules in a continuous-time model, Journal of Economic
Theory 3, 373–413.
Merton, R., 1973, An intertemporal capital asset pricing model, Econometrica 41, 867–887.
Merton, R., 1980, On estimating the expected return on the market: An exploratory investigation, Journal of
Financial Economics 8, 323–361.
Mertens, E., 2002, Comments on variance of the IID estimator in Lo (2002), Technical report, Working Paper
University of Basel, Wirtschaftswissenschaftliches Zentrum, Department of Finance.
Michaud, Richard, 1998, Efficient Asset Management: A Practical Guide to Stock Portfolio Optimization and Asset Allocation, Harvard Business School Press.
Michaud, R., and R. Michaud, 2008, Efficient Asset Management: A Practical Guide to Stock Portfolio Optimization and Asset Allocation, 2nd ed., Oxford University Press.
Muirhead, Robb J., 1982, Aspects of Multivariate Statistical Theory (Wiley, New York).
Murphy, K., 2012, Machine Learning: A Probabilistic Perspective. MIT Press.
Nagel, S., 2021. Machine Learning in Asset Pricing. Princeton: Princeton University Press.
Neely, C.J., D.E. Rapach, J. Tu, and G. Zhou, 2014. Forecasting the equity premium: The role of technical
indicators, Management Science 60, 1772–1791.
Ng, K.S., 2013, A simple explanation of partial least squares, working paper.
Neuhierl, A., X. Tang., R. Varneskov and G. Zhou, 2021, Expected stock returns from option characteristics, working
paper.
Newey, W. K., and K. D. West. 1987, A simple, positive semi-definite, heteroskedasticity and autocorrelation
consistent covariance matrix, Econometrica 55, 703–708.
Novy-Marx, R., and M. Velikov, 2016, A taxonomy of anomalies and their trading costs, Review of Financial Studies 29, 104–147.
Odean, T., 1998, Are investors reluctant to realize their losses?, Journal of Finance 53, 1775–1798.
Opdyke, J., 2007, Comparing Sharpe ratios: So where are the p-values? Journal of Asset Management 8, 308–336.
Pav, S., 2021, A Short Sharpe Course, working paper (SSRN).
Pástor, Ľ., Stambaugh, R.F., 2000, Comparing asset pricing models: an investment perspective. Journal of Financial Economics 56, 335–381.
Pedersen, L., A. Babu, and A. Levine, 2020, Enhanced portfolio optimization, working paper.
Platanakis, E., C. Sutcliffe and X. Ye, 2021, Horses for courses: Mean-variance for asset allocation and 1/N for
stock selection, European Journal of Operational Research 288, 302–317.
Pourahmadi, M., 2013, High-dimensional Covariance Estimation, Wiley.
Qian, Edward, Ronald Hua, and Eric Sorensen, 2007, Quantitative Equity Portfolio Management: Modern Tech-
niques and Applications, New York: Chapman & Hall.
Rapach, D., Ringgenberg, M., and G. Zhou, 2016, Short interest and aggregate stock returns, Journal of Financial Economics 121, 46–65.
Rapach, D., J. Strauss, and G. Zhou, 2010, Out-of-sample equity premium prediction: Combination forecasts and
links to the real economy, Review of Financial Studies 23, 821–862.
Rapach, D., J. Strauss, and G. Zhou, 2013, International stock return predictability: What is the role of the United
States? Journal of Finance 68, 1633–1662.
Rapach, D., and G. Zhou, 2013, Forecasting stock returns, in Handbook of Economic Forecasting, Volume 2, edited by G. Elliott and A. Timmermann, North-Holland, 328–383.
Rapach, D., and G. Zhou, 2019, Sparse macro factors, working paper.
Rapach, D., and G. Zhou, 2020, Time-series and cross-sectional stock return forecasting: new machine learning
methods, in Machine Learning in Asset Management, edited by Emmanuel Jurczenko, Wiley, 1–33.
Rapach, D., and G. Zhou, 2021, Asset Pricing: Time-Series Predictability, working paper.
Raschka, S., and V. Mirjalili, 2019, Python Machine Learning: Machine Learning and Deep Learning with Python,
scikit-learn, and TensorFlow 2, 3rd edition, Packt Publishing Ltd.
Rice, J., 2007, Mathematical Statistics and Data Analysis, 3e, Thomson Higher Education.
Ritter, Jay R., 1991, The long-run performance of initial public offerings, Journal of Finance 46, 3–27.
Roll, R., 1992, A mean-variance analysis of tracking error, Journal of Portfolio Management 18, 13–22.
Romero, P., and T. Balch, 2014, What Hedge Funds Really Do: An Introduction to Portfolio Management, Business
Expert Press.
Ross, S. A., 1976, The arbitrage theory of capital asset pricing, Journal of Economic Theory 13, 341–360.
Ross, S. A., 2005, Neoclassical Finance. Princeton University Press.
Ross, S., 2015. The recovery theorem, Journal of Finance 70, 615–648.
Samuelson, P., 1969, Lifetime portfolio selection by dynamic stochastic programming, Review of Economics and
Statistics 51, 239–246.
Samuelson, P., 1970, The fundamental approximation theorem of portfolio analysis in terms of means, variances
and higher moments, Review of Economic Studies 37, 537–542.
Schwager, J. D., 1989. Market Wizards. John Wiley & Sons, Hoboken, New Jersey.
Schwert, G. W., 2003, Anomalies and market efficiency, Chapter 15, Handbook of the Economics of Finance, eds. George Constantinides, Milton Harris, and Rene Stulz, North-Holland, 937–972.
Seber, G.A.F., 1984, Multivariate Observations, Wiley.
Shalev-Shwartz, S., and S. Ben-David, 2014, Understanding Machine Learning: From Theory to Algorithms. Cam-
bridge University Press.
Shanken, J., 1987, A Bayesian approach to testing portfolio efficiency, Journal of Financial Economics 19, 195–215.
Shanken, Jay, 1992, On the estimation of beta-pricing models, Review of Financial Studies 5, 1–33.
Shanken, Jay, and Guofu Zhou, 2007, Estimating and testing beta pricing models: Alternative methods and their
performance in simulations, Journal of Financial Economics 84, 40–86.
Shao, J., and D. Tu, 1995, The Jackknife and Bootstrap. Springer-Verlag, New York.
Sharpe, W. F., 1988, Determining a fund’s effective asset mix, Investment Management Review 2, 59–69.
Sharpe, W. F., 1992, Management style and performance measurement, Journal of Portfolio Management 18, 7–19.
Shi, B., and S. S. Iyengar, 2020, Mathematical Theories of Machine Learning, Springer.
Shleifer, A., and R. Vishny, 1997, The limits of arbitrage, Journal of Finance 52, 35–55.
Stambaugh, Robert F., 1999, Predictive regressions, Journal of Financial Economics 54, 375–421.
Stambaugh, R. F., and Y. Yuan, 2017, Mispricing factors, Review of Financial Studies 30, 1270–1315.
Stein, Charles, 1956, Inadmissibility of the usual estimator for the mean of a multivariate normal distribution,
Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 197–206
(University of California Press, Berkeley).
Stock, J., and M. W. Watson, 2002, Forecasting using principal components from a large number of predictors,
Journal of the American Statistical Association 97, 1167–1179.
Tibshirani, R., 1996, Regression shrinkage and selection via the LASSO, Journal of the Royal Statistical Society, Series B (Methodological) 58, 267–288.
Tsay, R., 2010, Analysis of Financial Time Series, 3rd ed., Wiley.
Tu, J., and G. Zhou, 2004, Data-Generating process uncertainty: What difference does it make in portfolio decisions?
Journal of Financial Economics 72, 385–421.
Tu, J., and G. Zhou, 2010, Incorporating economic objectives into Bayesian priors: Portfolio choice under parameter
uncertainty, Journal of Financial and Quantitative Analysis 45, 959–986.
Tu, J., and G. Zhou, 2011, Markowitz meets Talmud: A combination of sophisticated and naive diversification
strategies, Journal of Financial Economics 99, 204–215.
Tversky, A., and D. Kahneman, 1992, Advances in prospect theory: cumulative representation of uncertainty,
Journal of Risk and Uncertainty 5, 297–323.
Esposito Vinzi, V., W. Chin, J. Henseler, and H. Wang, 2010, Handbook of Partial Least Squares, Springer.
Welch, I., and A. Goyal, 2008, A comprehensive look at the empirical performance of equity premium prediction, Review of Financial Studies 21, 1455–1508.
Wold, H., 1966, Estimation of principal components and related models by iterative least squares, in P. R. Krishnaiah (ed.), Multivariate Analysis, 391–420. New York: Academic Press.
c© Zhou, 2021 Page 294
Wold, H., 1975, Path models with latent variables: The NIPALS approach. In H. M. Blalock, A. Aganbegian, F. M. Borodkin, R. Boudon, and V. Capecchi (Eds.), Quantitative Sociology: International Perspectives on Mathematical and Statistical Model Building, pp. 307–357. Academic Press.
Wu, Y., Y. Qin, and Mu Zhu, 2020, High-dimensional covariance matrix estimation using a low-rank and diagonal
decomposition, The Canadian Journal of Statistics 48, 308–337.
Yao, J., S. Zheng, and Z. Bai, 2015, Large Sample Covariance Matrices and High-Dimensional Data Analysis, Cambridge University Press.
Ye, J., 2008, How variation in signal quality affects performance, Financial Analysts Journal 64, 48–61.
Yiu, K. F. C., 2004, Optimal portfolios under a value-at-risk constraint, Journal of Economic Dynamics &amp; Control 28, 1317–1334.
Zaffaroni, P., 2019, Factor models for asset pricing, working paper.
Zellner, Arnold, and V. Karuppan Chetty, 1965, Prediction and decision problems in regression models from the Bayesian point of view, Journal of the American Statistical Association 60, 608–616.
Zellner, Arnold, 1971, An introduction to Bayesian inference in econometrics (Wiley, New York).
Zhou, G., 1993, Asset pricing tests under alternative distributions, Journal of Finance 48, 1927–1942.
Zhou, G., 2008a, On the fundamental law of active portfolio management: What happens if our estimates are
wrong? Journal of Portfolio Management 34 (3), 26–33.
Zhou, G., 2008b, On the fundamental law of active portfolio management: How to make conditional investments unconditionally optimal? Journal of Portfolio Management 35 (1), 12–21.
Zhou, G., 2009, Beyond Black-Litterman: Letting the data speak, Journal of Portfolio Management 36, 36–45.
Zhou, G., 2010, How much stock return predictability can we expect from an asset pricing model? Economics
Letters 108, 184–186.
Zhou, G., 2018, Measuring investor sentiment, Annual Review of Financial Economics 10, 239–259.
Zhou, Z., 2012, Ensemble Methods: Foundations and Algorithms. New York: CRC Press.
Zhu, Y., and G. Zhou, 2009, Technical analysis: An asset allocation perspective on the use of moving averages, Journal of Financial Economics 92, 519–544.
Zou, H. and T. Hastie, 2005, Regularization and Variable Selection via the Elastic Net, Journal of the Royal
Statistical Society, Series B (Statistical Methodology) 67, 301–320.