Data Analysis for Investments

Professor Guofu Zhou¹
Olin Business School
Washington University in St. Louis
St. Louis, MO 63130
E-mail: zh[email protected]

The lecture notes are in-depth optional readings for the students.
Course Use Only; All Rights Reserved (please do not distribute).
Current version: December, 2021

¹ The lecture notes for Fin 532B, Data Analysis for Investments. © 2005 and 2021 by Guofu Zhou.

Contents

1 Properties and Models of Stock Returns 1
  1.1 Multiple-period returns 1
  1.2 Expected returns vs realized returns 4
  1.3 Mean, std, and confidence intervals 5
  1.4 Mode and median 9
  1.5 Skewness and kurtosis 10
  1.6 Univariate distributions 12
    1.6.1 Uniform distribution 12
    1.6.2 Normal distribution 13
    1.6.3 Lognormal distribution 15
    1.6.4 χ²-distribution 16
    1.6.5 t-distribution 17
    1.6.6 A skewed normal distribution 19
    1.6.7 F-distribution 19
  1.7 Multivariate distributions 20
    1.7.1 Mean and variance of linear transformations 21
    1.7.2 Bivariate normal 22
    1.7.3 Multivariate normal 23
    1.7.4 Multivariate t 24
    1.7.5 Wishart distribution 24
  1.8 Simple Models 25
    1.8.1 Univariate linear regression 26
    1.8.2 Multiple linear regression 30
    1.8.3 Autocorrelations 32
    1.8.4 Time series models 32
2 Portfolio Choice 1: Mean-variance Theory 35
  2.1 Ad hoc rules 36
    2.1.1 Equal-weighting: 1/N 36
    2.1.2 Value-weighting 37
    2.1.3 Volatility-weighting 37
    2.1.4 Risk parity 39
    2.1.5 Global minimum-variance portfolio 41
  2.2 MV Optimal portfolio: Riskfree asset case 45
    2.2.1 One risky asset 45
    2.2.2 N = 2 48
    2.2.3 Multiple risky assets 51
    2.2.4 Two-fund separation theorem 54
    2.2.5 Parameter estimation by sample moments 56
    2.2.6 Practical implementation 57
    2.2.7 MV frontier and utility maximization 60
    2.2.8 Alternative formulation 60
    2.2.9 Links to regression and machine learning 62
  2.3 Tracking error minimization 64
  2.4 Information ratio 67
  2.5 How to outperform with alpha asset? 68
  2.6 Fundamental Law of active portfolio management 70
    2.6.1 IR = IC√N 70
    2.6.2 A casino example 71
    2.6.3 A proof 72
  2.7 MV Optimal portfolio: No rf case 74
    2.7.1 Variance minimization given µ_p 75
    2.7.2 Two-fund separation: No rf case 79
    2.7.3 Utility maximization 80
    2.7.4 Optimality of ad hoc rules 81
    2.7.5 Links to linear regression 82
3 Portfolio Choice 2: Constraints and Extensions 84
  3.1 Practical constraints 84
  3.2 Quadratic programming 86
  3.3 Asset allocation 88
    3.3.1 Stocks and bonds 88
    3.3.2 Multi-asset classes 89
  3.4 Large set of individual stocks 90
  3.5 Estimation risk 91
    3.5.1 The plug-in rule 91
    3.5.2 Errors in using a model 92
    3.5.3 Estimation errors 92
    3.5.4 Analytical assessment* 94
    3.5.5 Correlation shrinkage 95
    3.5.6 Combination of 1/N with plug-in 97
    3.5.7 Backtesting: A comparison of rules 98
    3.5.8 A Bayesian solution 101
  3.6 Transaction costs 102
  3.7 Model uncertainty 102
    3.7.1 Perturbation of the normal model 103
    3.7.2 Model averaging 103
  3.8 Alternative objective functions 105
    3.8.1 Kelly's criterion 105
    3.8.2 Higher moments 108
    3.8.3 Other utilities 109
4 Simulation, Bootstrap and Shrinkage 110
  4.1 Sampling from distributions 110
    4.1.1 Univariate case 110
    4.1.2 Bivariate case 112
    4.1.3 Cholesky decomposition 113
    4.1.4 Singular value decomposition 115
  4.2 Monte Carlo integration 115
    4.2.1 Theory 116
    4.2.2 VaR 117
    4.2.3 Option pricing 118
  4.3 Bootstrap 122
    4.3.1 Estimating standard error 122
    4.3.2 Estimating confidence interval 124
    4.3.3 Bootstrapping portfolio weights 126
  4.4 Shrinkage estimation 127
    4.4.1 Sample averages 127
    4.4.2 Mean shrinkage: Stein estimators 129
    4.4.3 Covariance shrinkage 131
    4.4.4 Use of correlation shrinkage 132
    4.4.5 Eigenvalue adjustment 132
    4.4.6 Exponentially weighted moving averages 133
    4.4.7 GS covariance matrix estimator 135
5 Factor Models 1: Known Factors 139
  5.1 The CAPM 139
    5.1.1 Proof 1: preference assumption 139
    5.1.2 Proof 2: return assumption 142
    5.1.3 Market model 143
    5.1.4 Some truths on Alpha 144
    5.1.5 Claims of the CAPM 145
    5.1.6 GRS test 146
    5.1.7 CAPM and market efficiency 148
    5.1.8 Fama-MacBeth 2-pass regressions 149
    5.1.9 Stochastic discount factor 150
    5.1.10 GMM test and others 151
  5.2 Spanning tests 152
  5.3 Fama-French 3- and 5-factor models 154
  5.4 Additional factor models 155
  5.5 Non-traded factors 156
  5.6 How to construct factors? 157
    5.6.1 Sorting 157
    5.6.2 Scoring 159
    5.6.3 Cross-section regression 159
    5.6.4 Machine learning methods 161
    5.6.5 Time series vs cross section 162
  5.7 Uses of factor models 163
    5.7.1 Capital budgeting/Expected return estimation 163
    5.7.2 Smart beta and factor investing 164
    5.7.3 Hedging 166
    5.7.4 Measuring performance 166
6 Factor Models 2: Unknown Factors 167
  6.1 Latent factor model 167
  6.2 Principal components analysis 169
    6.2.1 Eigenvalue and eigenvectors 169
    6.2.2 PCs: data 171
    6.2.3 PCs: random variables 173
    6.2.4 PCA factors 174
    6.2.5 The theory 177
    6.2.6 High-dimensional PCA 180
  6.3 Asymptotic PCA 180
  6.4 Covariance matrix estimation 182
    6.4.1 Invertibility problem 182
    6.4.2 Factor-model based estimator 183
  6.5 Both explicit and latent factors 184
  6.6 All-inclusive factor model 185
    6.6.1 Time series factor model 185
    6.6.2 Fundamental factor model 185
    6.6.3 All types of factors 186
  6.7 Factor analysis 187
7 Performance and Style 188
  7.1 Performance measures 189
    7.1.1 Alpha 189
    7.1.2 Sharpe ratio 189
    7.1.3 Sortino ratio 190
    7.1.4 Information ratio 190
    7.1.5 Treynor ratio 190
    7.1.6 Treynor and Black appraisal ratio 191
    7.1.7 Graham-Harvey volatility-matched return 191
    7.1.8 Maximum drawdown and Calmar ratio 191
  7.2 Sharpe ratio: further analysis 192
    7.2.1 Asymptotic standard error 192
    7.2.2 Test the difference between two SRs 193
  7.3 Portfolio-based style analysis 194
  7.4 Return-based style analysis 195
  7.5 Hedge fund styles 196
8 Anomalies and Behavior Finance 196
  8.1 Anomalies 196
    8.1.1 Size and January effect 196
    8.1.2 The weekend effect 197
    8.1.3 The value effect 197
    8.1.4 The momentum effect 197
    8.1.5 Closed-end fund puzzle 198
    8.1.6 Mutual fund persistence 198
    8.1.7 IPOs abnormal returns 198
    8.1.8 Technical analysis 198
  8.2 Are the anomalies real? 201
  8.3 Limits to arbitrage 201
  8.4 Behavior finance 202
9 Predictability 1: Time Series 204
  9.1 Market efficiency 204
  9.2 Random walk? 205
  9.3 Limits to predictability 206
  9.4 Predictive regressions 207
    9.4.1 Basic model 207
    9.4.2 Out-of-sample performance 208
    9.4.3 Statistical significance/tests 208
    9.4.4 Economic significance 209
  9.5 Forecasting with many predictors 211
    9.5.1 Forecast combination 211
    9.5.2 PCA or PCR 212
    9.5.3 sPCA 215
    9.5.4 Partial least squares 216
    9.5.5 PLS: m > 1 219
  9.6 Common time-series predictors 222
    9.6.1 Macro economic variables 222
    9.6.2 Technical variables 223
    9.6.3 Investor sentiment 223
    9.6.4 Investor attention 224
    9.6.5 Short interest 224
    9.6.6 Corporate activities 225
    9.6.7 Option market 226
    9.6.8 Others 227
  9.7 Mixed-frequency predictors 228
  9.8 Nowcasting 228
10 Machine Learning Tools 229
  10.1 What is Machine Learning? 229
  10.2 Types of Machine Learning 230
    10.2.1 Unsupervised learning 230
    10.2.2 Supervised learning 230
    10.2.3 Reinforcement learning 230
  10.3 A short literature review 230
  10.4 Why penalized regressions? 231
    10.4.1 Bias-variance tradeoff 232
    10.4.2 Prediction error 232
    10.4.3 Problems with many regressors 233
  10.5 LASSO 234
    10.5.1 The idea 234
    10.5.2 The code 238
    10.5.3 The theory 239
  10.6 Cross-validation 240
  10.7 Ridge 241
    10.7.1 The idea 241
    10.7.2 The code 243
    10.7.3 The theory 244
  10.8 Enet 245
  10.9 C-LASSO 246
  10.10 E-LASSO 247
  10.11 Neural network 247
    10.11.1 No hidden layer: linear regression 248
    10.11.2 One hidden layer 249
    10.11.3 Gradient descent: A search algorithm 251
    10.11.4 Remarks 252
  10.12 Genetic algorithm 253
  10.13 Ensemble Learning 254
    10.13.1 Bagging 254
    10.13.2 Stacking 255
    10.13.3 Boosting 256
11 Predictability 2: Cross Section 257
  11.1 Overview 257
  11.2 Cross-section regression 258
  11.3 OLS estimation 260
  11.4 E-LASSO estimation 261
  11.5 Weighted cross section regression 262
12 Bayesian Estimation 262
  12.1 Bayes Theorem 263
    12.1.1 Conditional events 263
    12.1.2 Conditional densities 265
  12.2 Classical vs Bayesian 266
    12.2.1 σ² known 266
    12.2.2 σ² unknown
  12.3 Informative priors 271
    12.3.1 σ² known 272
    12.3.2 σ² unknown 273
  12.4 Predictive distribution 273
  12.5 Bayesian regression 275
  12.6 Bayesian CAPM test 276
13 Black-Litterman Model 278
  13.1 Motivations 278
  13.2 Single risky asset case 278
  13.3 Multiple risky asset case 280
  13.4 Alternative approaches 282
14 References 283

1 Properties and Models of Stock Returns

In this section, we examine and review the statistical properties of (primarily) equity returns and the associated models. Here we are mainly concerned with the univariate time series of an individual stock's returns, leaving the more complex multivariate case to later sessions.

1.1 Multiple-period returns

Let P_t be the stock price at time t, say today, and P_{t-1} the price last period (which could be yesterday or last month). There are three commonly used notions of return:

• Gross return:

    R_t^* = \frac{P_t + D_t}{P_{t-1}},    (1.1)

the total payoff per dollar invested at t − 1. For example, if you buy a stock at $100 last year (time t − 1), it pays $2 in dividends at the end of the year (time t), and the price today (time t) is $103, then your gross return is 1.05.
• Simple return, or simply return:

    R_t = \frac{P_t + D_t - P_{t-1}}{P_{t-1}} = \frac{P_t + D_t}{P_{t-1}} - 1 = R_t^* - 1,    (1.2)

i.e., the net percentage gain on the money invested. It is 5% in the previous example. One often decomposes the return into two terms:

    R_t = \frac{P_t - P_{t-1}}{P_{t-1}} + \frac{D_t}{P_{t-1}} = \text{capital gain (loss)} + \text{dividend yield}.    (1.3)

Then there is a 3% capital gain and a 2% dividend yield in the earlier example.

• Continuously compounded return, or log return:

    r_t = \log\left(\frac{P_t + D_t}{P_{t-1}}\right),    (1.4)

which says the investment grows at the continuously compounded rate r_t. To see this, assume D_t = 0; then the above equation implies

    P_t = P_{t-1} e^{r_t},    (1.5)

i.e., the price appreciates at rate r_t when there are no dividends.

• Simple vs. continuous: There are a few notable differences between the simple and the continuous (log) returns. First, the simple return lies in [−1, +∞), while the log return lies in (−∞, +∞). So, theoretically speaking, the distribution of the simple return cannot be symmetric over the whole real line while that of the log return can, and hence we cannot assume that the simple return is normally distributed, but we can do so for the log return, as we do in option pricing. However, most empirical studies still use the normality assumption for simple returns as an approximation. Second, the simple return is always greater than or equal to the log return, with equality only when the return is zero. In our earlier example, the simple return is 5%, while the log return is only 4.88% (= log(105/100)). Third, computing cumulative wealth with simple returns can be misleading. For example, suppose a non-dividend-paying stock goes up from 100 to 200, and then falls back to 100. The average simple return is 25% (= (100% − 50%)/2), but no value is created because the stock drops back to 100. The average log return measures the value correctly:

    \frac{1}{2}\left[\log(200/100) + \log(100/200)\right] = \frac{1}{2} \times 0 = 0.

Although not popular, there are two other measures of gains in investments.

• The net gain:

    g_t = P_t + D_t - P_{t-1},    (1.6)

which is simply the dollar gain in value.
For stocks, this series is unstable (it scales with the price level), and the return is the preferred series to model. However, returns on futures cannot be defined as we do for stocks here, because the cost of entering a futures position is arguably zero. So g_t is the usual object of study for futures contracts. To make it stable, it is often divided by the notional value of the contract (implicitly assuming a leverage ratio).

• The return with margin:

    R_t^{**} = \frac{m(P_t + D_t)}{P_{t-1}} - 1 = mR_t^* - 1,    (1.7)

which is the return when $1 is used to buy $m worth of stock, i.e., m/P_{t-1} shares (ignoring the interest charge on the margin loan). When there is no use of margin, m = 1 and R_t^{**} = R_t. In the US, one can in general use $1 to buy $2 worth of stocks (m = 2), a margin of 50% (of the purchased assets).

Suppose now you invest $1, and earn 10% in year 1 and 20% in year 2. Then your wealth in year 2 is

    W_2 = (1 + 0.10)(1 + 0.20) = 1.32,    (1.8)

and your two-year return is 32%. The implicit assumption is that dividends, if there are any, are reinvested in the same asset. What is your average annual return? There are two common averages, the arithmetic and the geometric.

• Arithmetic average: For an investment over T periods with R_1, ..., R_T as the returns from today to time 1, ..., and from time T − 1 to T, the arithmetic average return is defined as

    R_a = \frac{R_1 + R_2 + \cdots + R_T}{T}.    (1.9)

In our previous example, T = 2 and R_a = 15%.

• Geometric average: Note that the end-of-period wealth is

    W_T = (1 + R_1)(1 + R_2)\cdots(1 + R_T).    (1.10)

The geometric average R_g is defined as the constant return that compounds to the end-of-period wealth,

    (1 + R_g)^T = (1 + R_1)(1 + R_2)\cdots(1 + R_T).    (1.11)

In our earlier example, R_g = \sqrt{1.32} - 1 \approx 14.9\%.

• Arithmetic vs. geometric: Mathematically, R_a is always greater than R_g unless all the period-by-period returns are equal (in that rare case, the two coincide). Theoretically, the more volatile the returns, the greater the difference.
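The worked examples above can be reproduced in a few lines of Python (a minimal sketch; the variable names are ours, not from the text):

```python
import math

# Return definitions (1.1)-(1.4): buy at $100 (time t-1), receive a $2
# dividend, and observe a price of $103 (time t).
P_prev, P, D = 100.0, 103.0, 2.0

gross = (P + D) / P_prev          # gross return (1.1): 1.05
simple = gross - 1                # simple return (1.2): 5%
log_ret = math.log(gross)         # log return (1.4): about 4.88%
cap_gain = (P - P_prev) / P_prev  # capital gain component of (1.3): 3%
div_yield = D / P_prev            # dividend yield component of (1.3): 2%

# Arithmetic vs geometric averages (1.9)-(1.11) for the 10%/20% example.
R = [0.10, 0.20]
Ra = sum(R) / len(R)                 # arithmetic average: 15%
W = math.prod(1 + r for r in R)      # end-of-period wealth (1.10): 1.32
Rg = W ** (1 / len(R)) - 1           # geometric average: about 14.89%
```

Note that math.prod requires Python 3.8 or later; on older versions, functools.reduce with operator.mul does the same job.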
In practice, some investors mistakenly use the arithmetic average, which is a proxy for the expected return, to compound wealth. This can be very inaccurate. For example, over the period 1926–2002, the average annual returns of the US stock market were R_a = 17.74% and R_g = 11.64%, respectively. If we use them to compound an investment of $10,000 for thirty years, then we have

    10{,}000 \times (1 + 17.74\%)^{30} \approx 1{,}341{,}900, \qquad 10{,}000 \times (1 + 11.64\%)^{30} \approx 272{,}020,

which are totally different. In pensions, expected returns are often used to discount future liabilities, which will likely understate the true liabilities substantially.

In portfolio management and many investment contexts, we model and analyze returns using simple returns and arithmetic averages. The primary reason is statistical consistency: if we assume individual returns are normally distributed, so are the returns of their portfolios. However, if we assume individual returns are lognormally distributed (as we do for pricing options in the Black-Scholes model), then their portfolios will no longer be lognormally distributed. In practice, just remembering that compounding should use log returns or geometric averages is sufficient.

Jacquier, Kane and Marcus (2003, 2005) point out that statistical estimates of the average return, used as forecasts of the long-term expected return, can be substantially biased upward or downward (for example, for investment horizons of 40 years, the difference in forecasts can easily exceed a factor of 2!). The question, then, is whether one can derive an unbiased estimator. Jacquier, Kane and Marcus (2005) do obtain such an unbiased estimator by assuming the variance is known, but that assumption is clearly unrealistic. Kan and Zhou (2009) solve the problem completely by providing a new unbiased estimator without that assumption.
Moreover, they provide an unbiased estimator for a range of wealth levels, which seems to add more relevant information.

1.2 Expected returns vs realized returns

• At the start of the period, today (time t), future variables are unknown, and we can only form expectations about them. The expected return is

    E[R_{t+1}] = \frac{E[P_{t+1}] + E[D_{t+1}]}{P_t} - 1.    (1.12)

• At the end of the period (time t + 1), however, the realized return can be computed from the observed price and dividend,

    R_{t+1} = \frac{P_{t+1} + D_{t+1}}{P_t} - 1.    (1.13)

The point is that the two can be quite different. For example, at the beginning of the year I may expect a 10% return, but the realized return at the end of the year can actually be −20%!

Another point is that a present value (PV) model of the stock price can be derived from (1.12). To see this, we can rewrite the equation as

    P_t = \frac{E[P_{t+1}] + E[D_{t+1}]}{1 + r},    (1.14)

where r = E[R_{t+1}] is the expected return, or discount rate. The equation says that the stock price today is the expected payoff next period discounted back to today. Assume r is constant for simplicity. Applying the same equation at time t + 1, we get

    P_{t+1} = \frac{E[P_{t+2}] + E[D_{t+2}]}{1 + r}.    (1.15)

Combining the two, we have

    P_t = \frac{E[D_{t+1}]}{1 + r} + \frac{E[D_{t+2}]}{(1 + r)^2} + \frac{E[P_{t+2}]}{(1 + r)^2}.    (1.16)

Iterating the same logic, we eventually get

    P_t = \frac{E[D_{t+1}]}{1 + r} + \frac{E[D_{t+2}]}{(1 + r)^2} + \frac{E[D_{t+3}]}{(1 + r)^3} + \cdots,    (1.17)

which says that the stock price today is the sum of its discounted expected future cash flows. Thus, changes in expectations about future dividends or about the discount rate will change the current stock price.

1.3 Mean, std, and confidence intervals

As the stock return R_t is random over time, we sometimes emphasize this fact with the notation \tilde{R}_t. Usually we assume that the stock return is independently and identically distributed (iid) over time. Denote the density function by f(x).
The properties of the distribution are often examined by looking at the first two moments, the mean and variance. Mathematically, they are defined by
\[
\mu = E(R_t) = \int_{-\infty}^{+\infty} x f(x)\,dx \tag{1.18}
\]
and
\[
\sigma^2 = E(R_t-\mu)^2 = \int_{-\infty}^{+\infty} (x-\mu)^2 f(x)\,dx, \tag{1.19}
\]
where, for simplicity, we assume the range of the integrals is $(-\infty,\infty)$. The mean is the same as the expected value, and the standard deviation, also known as volatility in finance, is simply $\sigma = \mathrm{Vol} = \sqrt{\text{variance}}$. The mean summarizes the center of mass of the distribution, while the standard deviation tells how far most of the mass is away from the center. The mean and variance are the most important quantities of any distribution.

Given data/observations of returns $R_1,\ldots,R_T$, how do we estimate the mean and variance? We often use the sample mean and variance (estimating the integrals by sums, called sample analogues),
\[
\hat\mu = \frac{1}{T}\sum_{t=1}^T R_t, \tag{1.20}
\]
\[
\hat\sigma^2 = \frac{1}{T-1}\sum_{t=1}^T (R_t-\hat\mu)^2, \tag{1.21}
\]
where $T$ is the sample size. The above sample averages are intuitive approximations of the integrals. Statistically, both of them are unbiased estimators,
\[
E(\hat\mu) = \mu, \qquad E(\hat\sigma^2) = \sigma^2,
\]
i.e., their expected values are equal to the true parameters. This says that if you estimate the parameters over many data sets, you will be right on average. However, given one set of data, you will have estimation errors. The confidence intervals below quantify such errors.

Note that the following is also a popular estimator of the variance,
\[
\hat s^2 = \frac{1}{T}\sum_{t=1}^T (R_t-\hat\mu)^2. \tag{1.22}
\]
Mathematically, this is the maximum likelihood estimator, which maximizes the density function of the data. Numerically, however, the difference between the two is very small when $T$ is greater than, say, 100. In Python, np.std(Data) computes the standard deviation of the data by using the denominator $T$, the default.
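The two denominators can be compared directly on a toy sample (the numbers below are made up for the example):

```python
import numpy as np

# Comparing the variance estimators (1.21) and (1.22) on made-up returns.
R = np.array([0.02, -0.01, 0.05, 0.03, 0.01])   # five toy returns
mu_hat = np.mean(R)                  # sample mean, Eq. (1.20)
s_T = np.std(R)                      # denominator T (numpy's default), Eq. (1.22)
s_Tm1 = np.std(R, ddof=1)            # denominator T - 1, Eq. (1.21)
print(mu_hat, s_T, s_Tm1)            # the ddof=1 estimate is slightly larger
```

With only five observations the two standard deviations differ visibly; with 100 or more observations the difference is negligible, as noted above.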
To use the denominator $T-1$, we simply set the parameter ddof (which stands for Delta Degrees of Freedom) to 1, i.e., we use np.std(Data, ddof=1).

How accurate are the estimates? Although the estimators are unbiased, they have variances. If the data are iid normal, and if the variance of the data, $\sigma^2$, is known, then standard textbooks tell us the popular 95% confidence interval is
\[
\left[\hat\mu - 1.96\frac{\sigma}{\sqrt T},\; \hat\mu + 1.96\frac{\sigma}{\sqrt T}\right], \tag{1.23}
\]
which means that the interval has a 95% probability of containing the true $\mu$. To see why, it is easy to show that
\[
\hat\mu \sim N\!\left(\mu, \frac{1}{T}\sigma^2\right). \tag{1.24}
\]
As is well known, for a standard normal $\tilde z$, its 95% probability interval is determined from
\[
0.95 = \mathrm{Prob}(-1.96 < \tilde z < 1.96). \tag{1.25}
\]
In our case, based on (1.24), if we standardize $\hat\mu$ by letting
\[
\tilde z = \frac{\hat\mu-\mu}{\sigma/\sqrt T},
\]
then $\tilde z$ must follow the standard normal, and hence (1.25) implies
\[
\hat\mu - 1.96\frac{\sigma}{\sqrt T} < \mu < \hat\mu + 1.96\frac{\sigma}{\sqrt T},
\]
which is the proof. Note that, to get the 90% or 99% confidence intervals, we simply replace 1.96 by 1.65 or 2.58 (the corresponding standard normal cutoffs). If you want greater confidence of covering $\mu$, the interval must be wider.

However, $\sigma$ is unknown in the real world, but it can be estimated by $\hat\sigma$. Then the 95% confidence interval is approximately
\[
\left[\hat\mu - 1.96\frac{\hat\sigma}{\sqrt T},\; \hat\mu + 1.96\frac{\hat\sigma}{\sqrt T}\right]. \tag{1.26}
\]
Since $\hat\sigma$ introduces error in estimating $\sigma$, the true confidence interval should be wider than this one. Nevertheless, under normality and if the sample size is greater than 50, the above is very accurate. The reason is that the exact confidence interval is now determined by
\[
\tilde t \equiv \frac{\hat\mu-\mu}{\hat\sigma/\sqrt T}.
\]
Statistically, $\tilde t$ so defined has an exact t-distribution, and hence the true confidence interval is
\[
\left[\hat\mu + t_{.025}\frac{\hat\sigma}{\sqrt T},\; \hat\mu + t_{.975}\frac{\hat\sigma}{\sqrt T}\right], \tag{1.27}
\]
where $t_{.025}$ and $t_{.975}$ are the lower and upper cutoffs of the 95% interval for the t-distribution,
\[
0.95 = \mathrm{Prob}(t_{.025} < \tilde t < t_{.975}), \tag{1.28}
\]
and the degree of freedom of the t-distribution is $\nu = T-1$. When the sample size is $T=50$, $t_{.975}$ is about 2.01 ($t_{.025}$ is about $-2.01$), and the normal confidence interval is a good approximation. As $T$ increases, $t_{.975}$ becomes closer to 1.96 and reaches it in the limit, because the t-distribution approaches the normal as the degree of freedom increases to infinity.

The following Python code makes the above easy to implement:

```python
import numpy as np
import scipy.stats

# Data: a NumPy array of returns (assumed defined)
alpha = 0.05                                # significance level = 5%
T = len(Data)                               # sample size
df = T - 1                                  # degree of freedom = sample size - 1
t975 = scipy.stats.t.ppf(1 - alpha/2, df)   # t critical value for 95%
s = np.std(Data, ddof=1)                    # sample standard deviation of Data
xbar = np.mean(Data)                        # sample mean
lower = xbar - t975 * (s / np.sqrt(T))
upper = xbar + t975 * (s / np.sqrt(T))
```

Example 1.1 Suppose that with $T=5$ data points, you obtain sample mean and standard deviation (see the optional Python code) $\hat\mu = 0.10$, $\hat\sigma = 0.1768$. Then the exact confidence interval, (1.27), is $[-0.1195, 0.3195]$, and the normal approximation, (1.26), is $[-0.0550, 0.2549]$. In this case, as the sample size is small, the difference between the two is large. Now suppose that $T=120$ (e.g., 10 years of monthly data); then the confidence intervals are $[0.0680, 0.1320]$ and $[0.0683, 0.1316]$, which are much tighter and closer to each other, as they should be as $T$ becomes larger. ♠

It should be emphasized that the above confidence interval is exactly true only when the data are iid normal. When normality is violated, as it often is, the interval is only an approximation.
Theoretically, as the sample size becomes large, the approximation becomes more accurate. When the sample size is small, the bootstrap procedure can be used to improve the accuracy.

What is the 95% confidence interval for $\sigma$? Now we need the statistical result that
\[
\frac{(T-1)\hat\sigma^2}{\sigma^2} \sim \chi^2_{T-1}, \tag{1.29}
\]
that is, the ratio of the variance estimator to the true variance, after multiplying by $T-1$, has a chi-squared distribution with $T-1$ degrees of freedom. Then, solving from the above, the 95% confidence interval for $\sigma^2$ is
\[
\left[\frac{(T-1)\hat\sigma^2}{\chi^2_{0.975}},\; \frac{(T-1)\hat\sigma^2}{\chi^2_{0.025}}\right], \tag{1.30}
\]
where $\chi^2_{0.025}$ and $\chi^2_{0.975}$ are the lower and upper cutoffs of the 95% interval for the chi-squared distribution. The confidence interval for $\sigma$ itself is then obtained by taking square roots on both sides of (1.30).

1.4 Mode and median

Besides moments, the mode and median are also of use at times. The mode of a distribution is defined as the value of $x$ at which the density function has a peak, or its greatest value. The mode is not necessarily unique. The density function of a continuous distribution can have multiple local maxima, and such a distribution is commonly referred to as multimodal (as opposed to unimodal). However, most distributions used in finance have a unique mode. For example, the normal distribution has only one mode, and it is the mean. In general, though, the mode is different from the mean, especially for asymmetric distributions.

The median is an $x$ value, say $x_0$, such that the probability of $x$ being greater or less than it is exactly 50%,
\[
\int_{-\infty}^{x_0} f(x)\,dx = 0.5 = \int_{x_0}^{+\infty} f(x)\,dx. \tag{1.31}
\]
For discrete distributions or for a set of data, the median is the central number from the smallest to the largest. If the number of data points is even, there will be two numbers in the middle, and the median is the average of those two.

In Python, Numpy has a function for the median, but not for the mode.
So the easiest way is to use another package that does both:

```python
import statistics as stat   # import the package

stat.mode(Data)             # the output is the mode of the data
stat.median(Data)           # the output is the median of the data
```

1.5 Skewness and kurtosis

In the real world, the mean and variance do not summarize all the properties of the data, and we need more measures: the third and fourth centered moments,
\[
\mu_3 = E(R_t-\mu)^3 = \int_{-\infty}^{+\infty} (x-\mu)^3 f(x)\,dx \tag{1.32}
\]
and
\[
\mu_4 = E(R_t-\mu)^4 = \int_{-\infty}^{+\infty} (x-\mu)^4 f(x)\,dx. \tag{1.33}
\]
Then skewness and kurtosis are defined as the standardized third and fourth centered moments,
\[
\mathrm{Skewness} = \frac{\mu_3}{\sigma^3} \tag{1.34}
\]
and
\[
\mathrm{Kurtosis} = \frac{\mu_4}{\sigma^4}. \tag{1.35}
\]
Since they are divided by powers of the standard deviation, they are invariant to scaling of the returns. Economically, if you double your holding of an asset, you will have the same skewness and kurtosis.

If the skewness is positive, it means that there is relatively more mass on the right side of the mean. For distributions symmetric around their expected values, like the standard normal, the skewness is zero. The kurtosis measures how fat the tails of the distribution are. For the standard normal distribution, the kurtosis is 3.

With data available, they can clearly be estimated by their sample counterparts,
\[
\hat\gamma_3 = \frac{1}{T}\sum_{t=1}^T (R_t-\hat\mu)^3/\hat\sigma^3, \tag{1.36}
\]
\[
\hat\gamma_4 = \frac{1}{T}\sum_{t=1}^T (R_t-\hat\mu)^4/\hat\sigma^4, \tag{1.37}
\]
where $1/T$, as in the variance case, may be replaced by other scalars to reduce the bias. But such adjustments make little numerical difference when the sample size is large, say greater than 100. In Python, we can use scipy.stats.skew and scipy.stats.kurtosis to compute the skewness and kurtosis from data. As in the standard deviation case, the above simple moment estimators are biased. To compute the adjusted estimates, we simply specify the parameter bias=False. The default is the above formulas, with biased estimates.
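As an illustration (with simulated rather than real returns), the scipy functions can be run as follows; note that scipy's kurtosis reports *excess* kurtosis by default, so fisher=False is needed to get the raw kurtosis:

```python
import numpy as np
from scipy import stats

# Sample skewness and kurtosis of simulated normal data. bias=False applies
# the adjusted (Fisher-Pearson) formulas; fisher=False returns raw kurtosis
# (about 3 for normal data) instead of excess kurtosis (about 0).
rng = np.random.default_rng(1)
data = rng.standard_normal(10_000)
print(stats.skew(data, bias=False))                     # near 0
print(stats.kurtosis(data, fisher=False, bias=False))   # near 3
```

For real return data, simply replace the simulated array with the observed series.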
The Python functions compute the adjusted estimators when bias=False is set (but note that scipy.stats.kurtosis subtracts 3 by default, so its output is close to zero for normally distributed data). Mathematically (see Doane and Seward, 2011, Joanes and Gill, 1998, and references therein), the adjusted Fisher–Pearson standardized moment coefficient
\[
g_3 = \frac{\sqrt{T(T-1)}}{T-2}\,\hat\gamma_3
\]
is the adjusted skewness estimator, and
\[
g_4 = \frac{(T+1)T(T-1)}{(T-2)(T-3)}\,\frac{\sum_{t=1}^T (R_t-\hat\mu)^4}{\left(\sum_{t=1}^T (R_t-\hat\mu)^2\right)^2} - \frac{3(T-1)^2}{(T-2)(T-3)} + 3
\]
is the adjusted kurtosis estimator.

Statistically, any random variable is completely determined by its moment generating function (when it exists),
\[
g(t) = E(e^{tx}) = \int_{-\infty}^{+\infty} e^{tx} f(x)\,dx. \tag{1.38}
\]
In other words, knowing $f(x)$ we know $g(t)$, and knowing $g(t)$, we can recover $f(x)$. Since
\[
e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots + \frac{x^n}{n!} + \cdots,
\]
it follows that
\[
g(t) = 1 + tE(x) + \frac{t^2}{2!}E(x^2) + \frac{t^3}{3!}E(x^3) + \cdots + \frac{t^n}{n!}E(x^n) + \cdots,
\]
where $E(x), E(x^2), E(x^3), E(x^4), \ldots$ are the moments of $x$, which are related to the earlier centered moments (with the mean subtracted from $x$). The point is that the first four moments summarize almost all features of a distribution. In practice, moments of order higher than 4 are almost never used.

Since normality is a common assumption, the skewness and kurtosis are useful for testing how the data deviate from the normality assumption (see Section 1.6.2). They are relevant to portfolio choice too (see Section 3.8.2).

1.6 Univariate distributions

In this subsection, we provide a short review of common univariate distributions that are often used in finance.

1.6.1 Uniform distribution

While this distribution is not as widely used as the normal distribution, it is the simplest continuous distribution and is useful for understanding others. It can be used to simulate other distributions, or in a Bayesian setup for describing diffuse priors.
A random variable $u$ has a standard uniform distribution, denoted $U(0,1)$, if it is equally likely to take any value in $[0,1]$. Since it is equally likely over $[0,1]$, the density function must be a constant over that range. Then its density function must be
\[
f(x) = \begin{cases} 1, & x\in[0,1];\\ 0, & \text{otherwise}, \end{cases} \tag{1.39}
\]
which follows from the fact that the integral should be 1, and so the constant is 1. In Bayesian analysis (to be discussed), if our prior belief is that the expected return on a stock is equally likely to be anywhere from 0% to 100%, then we can model this belief by using $U(0,1)$.

There are two important properties of the standard uniform distribution. First, if $u$ is a random number from $U(0,1)$, then $x = G^{-1}(u)$ will be a random number from $G(x)$, where $G(x)$ is the cumulative distribution function of any continuous distribution. Hence, the standard uniform distribution helps to obtain random numbers from any other continuous distribution. Second, if $u$ follows $U(0,1)$, so does $1-u$. This property can be used in Monte Carlo simulations to reduce variance.

The cumulative distribution function of a $U(0,1)$ random variable also has a simple form. It is clear that
\[
F(x) = \mathrm{Prob}(u\le x) = \begin{cases} 0, & x<0;\\ x, & x\in[0,1];\\ 1, & x>1. \end{cases} \tag{1.40}
\]
For example, the probability that $u\le 0.5$ is clearly 50%.

In general, we can consider a uniform distribution over any finite interval $[a,b]$. The density function is
\[
f(x) = \begin{cases} \frac{1}{b-a}, & x\in[a,b];\\ 0, & \text{otherwise}, \end{cases} \tag{1.41}
\]
and the cumulative distribution function is
\[
F(x) = \mathrm{Prob}(u\le x) = \begin{cases} 0, & x<a;\\ \frac{x-a}{b-a}, & x\in[a,b];\\ 1, & x>b. \end{cases} \tag{1.42}
\]
Moreover, the $n$-th moment can be solved analytically,
\[
E(u^n) = \frac{b^{n+1}-a^{n+1}}{(n+1)(b-a)}. \tag{1.43}
\]
In particular, the mean is $E(u) = (b+a)/2$ and the variance is $(b-a)^2/12$.
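Both the moment formula (1.43) and the inverse-transform property can be checked in a few lines; a sketch with simulated draws:

```python
import numpy as np
from scipy import stats

# Check U(a, b) moments against (1.43), then use the inverse-transform
# property: G^{-1}(u) turns uniform draws into draws from G. Here G is the
# standard normal cdf, whose inverse is stats.norm.ppf.
a, b = 0.0, 1.0
rng = np.random.default_rng(2)
u = rng.uniform(a, b, size=200_000)
print(u.mean(), (b + a) / 2)          # sample vs exact mean
print(u.var(), (b - a) ** 2 / 12)     # sample vs exact variance
x = stats.norm.ppf(u)                 # standard normal draws via the uniform
print(x.mean(), x.std())              # near 0 and 1
```

The same recipe works for any continuous distribution whose inverse cdf is available, which is exactly why the uniform is the workhorse of simulation.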
1.6.2 Normal distribution

The normal distribution is the most used not only in finance but also in statistics. Its density function is
\[
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \tag{1.44}
\]
where $\mu$ is the mean and $\sigma^2$ is the variance. When a random variable $\tilde x$ (a stock return) follows a normal distribution, we often write
\[
\tilde x \sim N(\mu, \sigma^2). \tag{1.45}
\]
In simulations, a computer language such as Python often provides a random number from the standard normal,
\[
\tilde z \sim N(0,1); \tag{1.46}
\]
then a random number $\tilde x$ computed from $\tilde x = \mu + \sigma\tilde z$ has the desired mean $\mu$ and variance $\sigma^2$.

The cumulative distribution function (cdf) of the standard normal distribution, usually denoted $\Phi(x)$ in the statistical literature, is
\[
\Phi(x) \equiv \mathrm{Prob}(z < x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^x e^{-t^2/2}\,dt, \tag{1.47}
\]
which is the probability that the standard normal random variable is below a fixed constant $x$. With Python, one can easily compute the density and cdf. For example, running the commands below at the Spyder prompt,

```python
import scipy.stats

scipy.stats.norm(0, 1).pdf(0)      # density at 0
scipy.stats.norm(0, 1).cdf(0)      # probability of being less than 0
scipy.stats.norm(0, 1).cdf(1.96)   # probability of being less than 1.96
```

you will get the value of the density at 0, 0.3989, the probability of being less than 0, 50%, and that of being less than 1.96, 97.5%.

There are some simple facts about the normal distribution. If a set of data is randomly drawn from the normal distribution, then 68% of the data fall within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations. For example, mathematically,
\[
\mathrm{Prob}(-2 < \tilde z < 2) \approx 95\%,
\]
where $\tilde z$ follows the standard normal (the equality is approximate but easier to remember; the exact equality requires replacing the 2 by 1.96). These facts are easily verified with Python.

The normal distribution has its higher central moments analytically available,
\[
E(x-\mu)^m = \begin{cases} 0, & m \text{ odd};\\ \sigma^m (m-1)!!, & m \text{ even}, \end{cases}
\]
where $(m-1)!! = (m-1)(m-3)\cdots 3\cdot 1$, the double factorial.
In particular, $E(x-\mu)^4 = 3\sigma^4$ and $E(x-\mu)^6 = 15\sigma^6$. It follows that the normal distribution has skewness zero and kurtosis 3:
\[
\mathrm{Skewness} = 0, \tag{1.48}
\]
\[
\mathrm{Kurtosis} = 3, \tag{1.49}
\]
which are obtained from the third and fourth central moments by dividing by $\sigma^3$ and $\sigma^4$, respectively.

In practice, the mean and variance are unknown, but they can easily be estimated by the sample mean and sample variance (see (1.20) and (1.21)). How good are the estimates? The confidence intervals discussed there answer this question.

Is the normal distribution a good assumption for a given set of data? The common tests examine whether the sample skewness and kurtosis are too far from those of the normal distribution. Asymptotically, if the data are normally distributed and iid, the sample skewness and kurtosis should converge to 0 and 3, with the following distributions:
\[
\hat\gamma_3 \sim N\!\left(0, \frac{6}{T}\right), \tag{1.50}
\]
\[
\hat\gamma_4 \sim N\!\left(3, \frac{24}{T}\right). \tag{1.51}
\]
In other words, as the sample size $T$ increases, they should be close to 0 and 3, respectively. How close is close? This is judged by the confidence intervals from the above asymptotic distributions. Hence, if the estimated skewness and kurtosis are far away from what the above asymptotic distributions allow, we can reject the null hypothesis that the data are normal.

Example 1.2 As demonstrated in class, we can compute and obtain
\[
\hat\gamma_3 = -0.4551, \qquad \hat\gamma_4 = 6.3448,
\]
for the CRSP stock index based on monthly returns from January 1934 to December 2011 ($T = 78\times 12 = 936$). The standard errors are
\[
\sqrt{6/T} = 0.0801, \qquad \sqrt{24/T} = 0.1601,
\]
respectively. Then the 95% confidence intervals are $[-0.1569, 0.1569]$ and $[2.6861, 3.3139]$. Since the estimates are outside of these intervals, we reject the hypothesis that the index is normally distributed. ♠

1.6.3 Lognormal distribution

Stock returns are often measured in terms of simple returns.
Theoretically, there is a potential problem with the use of a normal distribution, because simple returns are asymmetric and bounded below by $-100\%$. In many applications, $-100\%$ is in the far left tail and may be safely ignored, say for daily returns. Nevertheless, when the size of the return is large, say for annual returns, this can be an important issue. In that case, we often use the continuously compounded return,
\[
r_t = \log\!\left(\frac{P_t + D_t}{P_{t-1}}\right).
\]
When we assume $r_t$ is normal, we say the price $P_t$ is lognormally distributed because its logarithm is normally distributed. In the famous Black-Scholes formula for option prices, the stock price is assumed to be lognormal.

Mathematically, if $y$ is lognormal, i.e., the log of $y$ is normal,
\[
\log(y) \sim N(\mu, \sigma^2), \tag{1.52}
\]
then its density function is given by
\[
g(y) = \frac{1}{y}\,\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(\log y-\mu)^2}{2\sigma^2}},
\]
and its mean and variance are
\[
E(y) = e^{\mu+\frac{\sigma^2}{2}}, \qquad \sigma_y^2 = \left(e^{\sigma^2}-1\right)e^{2\mu+\sigma^2}.
\]
So the lognormal and normal distributions are quite different. Also, the normal distribution is symmetric, and the lognormal obviously is not.

1.6.4 χ²-distribution

The chi-squared distribution is widely used in chi-square tests for the goodness of fit of an observed model. Most asymptotic tests, such as tests of whether a set of parameters equals some prescribed values or the likelihood ratio test, have a $\chi^2$-distribution. Statistically, it is defined as the distribution of a sum of squared standard normal deviates:
\[
\chi^2 = X_1^2 + X_2^2 + \cdots + X_n^2, \tag{1.53}
\]
where the $X_i$'s are independent standard normal random variables, and $n$ is known as the degrees of freedom of the $\chi^2$-distribution.

Its summary statistics are
\[
\mathrm{Mean} = n, \tag{1.54}
\]
\[
\mathrm{Variance} = 2n, \tag{1.55}
\]
\[
\mathrm{Skewness} = \sqrt{8/n}, \tag{1.56}
\]
\[
\mathrm{Kurtosis} = 3 + \frac{12}{n}. \tag{1.57}
\]
(The products $n(n+2)$, $n(n+2)(n+4)$ and $n(n+2)(n+4)(n+6)$ are the second, third and fourth raw moments $E[(\chi^2)^k]$ for $k=2,3,4$, not the variance, skewness and kurtosis.)
Its density function is
\[
f(x) = \frac{1}{2^{n/2}\,\Gamma\!\left(\frac n2\right)}\, x^{n/2-1} e^{-x/2}, \quad x>0, \tag{1.58}
\]
where $\Gamma(\cdot)$ is the Gamma function,
\[
\Gamma(z) \equiv \int_0^\infty x^{z-1} e^{-x}\,dx,
\]
with the properties that $\Gamma(\frac12)=\sqrt\pi$, $\Gamma(z+1)=z\Gamma(z)$, and, for any positive integer $n$, $\Gamma(n)=(n-1)!$. For example, $\Gamma(1)=1$, $\Gamma(2)=1!=1$, $\Gamma(3)=2!=2$, and $\Gamma(4)=3!=6$ (note that $\Gamma(0)$ is undefined, or $+\infty$).

An interesting fact about the $\chi^2$ is that the sum of squared deviations of normal variables from their sample mean still follows a $\chi^2$, that is,
\[
z = (X_1-\bar X)^2 + (X_2-\bar X)^2 + \cdots + (X_n-\bar X)^2 \sim \chi^2_{n-1},
\]
where $\bar X$ is the mean of the $X_i$'s. Note that the degree of freedom is reduced by 1. This result extends to the linear regression model: the sum of squared fitted residuals (errors) is $\chi^2$-distributed up to a scale, namely the variance of the residual, which can be consistently estimated by the sample variance. The degree of freedom goes down by the number of regressors.

1.6.5 t-distribution

The t-distribution is the most used distribution in finance for testing hypotheses. It can also be used as a model for stock returns. Indeed, if the return data have fatter tails than the normal, the normality hypothesis will be rejected; then the t-distribution is a good alternative candidate. Statistically, the t-distribution is the ratio of a standard normal to the square root of a scaled $\chi^2$, that is,
\[
\frac{X}{\sqrt{Z/\nu}} \sim t(\nu), \tag{1.59}
\]
where $X \sim N(0,1)$, $Z \sim \chi^2_\nu$, and $\nu$ is known as the degree of freedom of the $t$. Note that $\sqrt\nu$ is used to scale $X/\sqrt Z$ in the definition. The reason is that $\sqrt Z$ has a value around $\sqrt\nu$ (as $\chi^2_\nu$ has mean $\nu$), so the scaling makes $X$ divided by a value around 1, not changing its variance by much unless $\nu$ is small (see the moments below).

Historically, the t-distribution was motivated by the analysis of sampling accuracy. Let $X_1,\ldots,X_n$ be iid samples from a general normal distribution $N(\mu,\sigma^2)$. We have the sample mean
\[
\bar X = \frac1n \sum_{i=1}^n X_i
\]
and sample variance
\[
s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i-\bar X)^2.
\]
If $\sigma$ is known, then we have
\[
\frac{\bar X-\mu}{\sigma/\sqrt n} \sim N(0,1),
\]
i.e., we can obtain the confidence interval for the true and unknown mean $\mu$ by scaling the standard normal distribution. In practice, however, $\sigma$ is unknown, but it can be estimated by $s$. Replacing $\sigma$ by $s$, the term has a t-distribution,
\[
\frac{\bar X-\mu}{s/\sqrt n} \sim t(n-1),
\]
and so we can use the t-distribution to determine the confidence interval. In particular, we can test the hypothesis $\mu = 0$. This extends to many models, such as the linear regression, to test whether a slope is zero or not.

The density function of the $t$-distribution with location $\mu$ and scale $\sigma$ is
\[
f(x) = \frac{\Gamma[(\nu+1)/2]}{\Gamma(1/2)\,\Gamma(\nu/2)}\,\frac{1}{\sigma\sqrt\nu}\left(1 + \frac{(x-\mu)^2}{\nu\sigma^2}\right)^{-(\nu+1)/2}, \tag{1.60}
\]
where $\nu$ is the degree of freedom. Its summary statistics are
\[
\mathrm{Mean} = \mu, \tag{1.61}
\]
\[
\mathrm{Variance} = \sigma^2\,\frac{\nu}{\nu-2}, \quad (\nu>2), \tag{1.62}
\]
\[
\mathrm{Skewness} = 0, \tag{1.63}
\]
\[
\mathrm{Kurtosis} = 3 + \frac{6}{\nu-4}, \quad (\nu>4). \tag{1.64}
\]
It is seen that the $t$ is symmetric and cannot capture any skewness in the data. However, for whatever level of kurtosis, the t-distribution can match it as long as $\nu$ is close enough to 4. On the other hand, as $\nu$ goes to infinity, the excess kurtosis goes to zero and the distribution converges to the normal.

1.6.6 A skewed normal distribution

In practice, the normal distribution is often rejected, and the t-distribution is a better alternative for the data. However, it is not used as often, because it is more complex, and also because it usually does not change the results that much. Note that both the normal and t distributions are symmetric. In certain applications or for certain stocks, the skewness can be very important, but it is completely ignored by both the normal and the t. In such cases, a distribution with non-zero skewness is needed. However, a skewed distribution is more complex to construct.
For example,
\[
x = \mu + \sigma\,\frac{z - E(z)}{\sqrt{\mathrm{var}(z)}} \tag{1.65}
\]
is a skewed normal random variable (see Azzalini, 1985, and Azzalini and Dalla Valle, 1996), where the density of $z$ is given by
\[
g(z) = 2\,\phi(z)\,\Phi(\lambda z), \tag{1.66}
\]
where $\phi$ and $\Phi$ are the standard normal density and distribution function, respectively.

Ideally, the choice of a statistical model for stock returns should satisfy three criteria:
1. consistency: it fits the past data;
2. testability: one should be able to test hypotheses of whether the model fits the data;
3. parsimony: it has few parameters and is tractable.
But such a model is difficult to find. The study of skewed normal and skewed t distributions, especially in the multivariate case and in asset allocation applications, is a subject of ongoing research.

1.6.7 F-distribution

The F-distribution is defined as a ratio of two $\chi^2$ random variables, each adjusted for its degrees of freedom,
\[
z \equiv \frac{X_1/d_1}{X_2/d_2} \sim F(d_1,d_2), \tag{1.67}
\]
where $X_1$ and $X_2$ are independently $\chi^2$-distributed with degrees of freedom $d_1$ and $d_2$, respectively.

In a univariate regression with multiple regressors, the t-distribution is often used for testing whether a single slope is zero. However, the t-test is no longer applicable for joint hypotheses in multivariate regressions, which are common in finance as regressions are usually run on many assets. The F-distribution can be regarded as an extension of the squared t to multivariate hypothesis testing; indeed, $[t(d)]^2$ has the same distribution as $F(1,d)$.

The density function is
\[
f(x; d_1, d_2) = \frac{1}{B\!\left(\frac{d_1}{2},\frac{d_2}{2}\right)}\left(\frac{d_1}{d_2}\right)^{\frac{d_1}{2}} x^{\frac{d_1}{2}-1}\left(1 + \frac{d_1}{d_2}x\right)^{-\frac{d_1+d_2}{2}}, \tag{1.68}
\]
where $B(x,y)$ is the beta function,
\[
B(x,y) = \int_0^1 t^{x-1}(1-t)^{y-1}\,dt,
\]
which can be computed from the Gamma function via $B(x,y) = \Gamma(x)\Gamma(y)/\Gamma(x+y)$. Its first two moments are
\[
\mathrm{Mean} = \frac{d_2}{d_2-2}, \quad (d_2>2), \tag{1.69}
\]
\[
\mathrm{Variance} = 2\left(\frac{d_2}{d_2-2}\right)^2 \frac{d_1+d_2-2}{d_1(d_2-4)}, \quad (d_2>4). \tag{1.70}
\]
It is interesting that the mean depends only on the degrees of freedom in the denominator, but the variance depends on both.
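The relation between the squared t and $F(1,d)$ is easy to verify numerically; a quick sketch with scipy:

```python
from scipy import stats

# If X ~ t(d) then X^2 ~ F(1, d), since P(X^2 < c) = P(|X| < sqrt(c)).
# Hence the squared 97.5% t critical value equals the 95% F(1, d) cutoff.
for d in (5, 10, 60):
    t_crit = stats.t.ppf(0.975, d)
    f_crit = stats.f.ppf(0.95, 1, d)
    print(d, t_crit**2, f_crit)       # the two columns agree
```

This is exactly why a two-sided t-test on one slope and an F-test of the same single restriction give identical p-values.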
1.7 Multivariate distributions

In investments, we are often interested in a set of assets rather than a single one. Hence, multivariate distributions are critically useful for modeling the returns of many assets. The multivariate normal is the most commonly used multivariate distribution in finance. To gain more insight, we will first review a general property of linear transformations of a vector of random variables, then discuss the bivariate normal, and finally the multivariate normal and multivariate t distributions.

1.7.1 Mean and variance of linear transformations

In various applications, it is necessary to compute the mean and covariance matrix of linear transformations of a random vector. Because of this, we list the formulas below. Let $X$ be an $n$-vector of random variables,
\[
X = \begin{pmatrix} X_1\\ \vdots\\ X_n \end{pmatrix}.
\]
It is often of interest to know the distributional properties of its linear transformation,
\[
Y = AX + B, \tag{1.71}
\]
where $A$ and $B$ are constants, an $m\times n$ matrix and an $m$-vector, respectively. It is well known (check your statistics texts) that the mean and covariance matrix of $Y$ are
\[
E[Y] = A\,E[X] + B, \tag{1.72}
\]
\[
\mathrm{var}[Y] = A\,\mathrm{var}[X]\,A'. \tag{1.73}
\]
The proof of the first equation is trivial, as constants can be factored out when taking expectations. The second equation follows from
\[
\mathrm{var}[Y] = E\!\left([AX-A\mu][AX-A\mu]'\right) = A\left(E[X-\mu][X-\mu]'\right)A' = A\,\mathrm{var}[X]\,A',
\]
where, with $\mu = E[X]$ the mean, the first and last equalities hold by definition.

In finance, we are often interested in a portfolio of stock returns. Taking $X$ as the vector of returns and $A = (w_1,\ldots,w_n) = w'$ as the portfolio weights, Equations (1.72) and (1.73) provide the mean and variance of the portfolio,
\[
E[Y] = A\,E[X] = w_1\mu_1 + w_2\mu_2 + \cdots + w_n\mu_n, \tag{1.74}
\]
\[
\mathrm{var}[Y] = A\,\mathrm{var}[X]\,A' = w'\Sigma w, \tag{1.75}
\]
where $w$ is an $n\times 1$ vector of the weights, and $\Sigma$ is the covariance matrix of $X$. These two formulas are very useful for portfolio decisions.
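A minimal sketch of (1.74) and (1.75) with made-up inputs for two assets:

```python
import numpy as np

# Portfolio mean w'mu and variance w'Sigma w, Eqs. (1.74)-(1.75).
# The weights, means, and covariance matrix below are assumed for illustration.
w = np.array([0.6, 0.4])                      # portfolio weights
mu = np.array([0.10, 0.06])                   # expected returns
Sigma = np.array([[0.04, 0.006],
                  [0.006, 0.02]])             # covariance matrix
port_mean = w @ mu
port_var = w @ Sigma @ w
print(port_mean, np.sqrt(port_var))           # portfolio mean and volatility
```

Note that the portfolio volatility is below the weighted average of the individual volatilities, which is the diversification effect captured by the off-diagonal terms of $\Sigma$.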
Equations (1.72) and (1.73) are true regardless of the distribution of $X$. In other words, they hold whether the elements of $X$ have normal, t, or $\chi^2$ distributions. They are also very useful for simulations. Computers often generate random numbers that are standardized; the two equations then help to transform them into random variables with arbitrary mean and variance. Chapter 4 provides the details.

1.7.2 Bivariate normal

The simplest bivariate normal distribution is the distribution of two independent standard normal variables. In this case, each one has the standard normal density (Equation (1.44) with $\mu=0$ and $\sigma=1$). Due to independence, their joint density function is the product of the individual densities,
\[
f(x_1,x_2) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x_1^2}{2}} \times \frac{1}{\sqrt{2\pi}}e^{-\frac{x_2^2}{2}} = \frac{1}{2\pi}e^{-\frac12(x_1^2+x_2^2)}, \tag{1.76}
\]
which completely determines the distribution.

In general, for two variables
\[
X = \begin{pmatrix} X_1\\ X_2 \end{pmatrix},
\]
with arbitrary mean and covariance matrix,
\[
\mu = \begin{pmatrix} \mu_1\\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2\\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix},
\]
where $\rho$ is the correlation, the bivariate normal density function is
\[
f(x_1,x_2) = \frac{1}{2\pi}|\Sigma|^{-1/2}\exp\!\left[-\frac12(x-\mu)'\Sigma^{-1}(x-\mu)\right], \tag{1.77}
\]
where $|\Sigma|$ is the determinant of the matrix $\Sigma$ and $x = (x_1,x_2)'$.

To make sense of the density, recall that the determinant and inverse of any $2\times 2$ matrix
\[
A = \begin{pmatrix} a & b\\ c & d \end{pmatrix}
\]
can be written out explicitly,
\[
\det(A) = ad - bc, \qquad A^{-1} = \frac{1}{\det(A)}\begin{pmatrix} d & -b\\ -c & a \end{pmatrix}. \tag{1.78}
\]
Then
\[
|\Sigma|^{-1/2} = \left(\sigma_1^2\sigma_2^2 - \rho^2\sigma_1^2\sigma_2^2\right)^{-1/2} = \frac{1}{\sigma_1\sigma_2\sqrt{1-\rho^2}}
\]
and
\[
(x-\mu)'\Sigma^{-1}(x-\mu) = \frac{1}{1-\rho^2}\left(\frac{(x_1-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2}\right),
\]
where the last expression comes from multiplying the matrices out and combining terms. Hence, if needed, the bivariate normal density is straightforward to compute.

An important property is the conditional distribution. Denote by $X$ and $Y$ two stock returns following a bivariate normal.
If stock X goes up, should stock Y go up too? First, Y conditional on X is still normally distributed. The conditional mean is (derivations are not given here)
\[
E[Y\,|\,X=x] = \mu_Y + \rho\,\frac{x-\mu_X}{\sigma_X}\,\sigma_Y. \tag{1.79}
\]
This formula makes intuitive sense. If X goes up relative to its mean (higher than expected), Y will too if the two are positively correlated. The conditional variance is
\[
\mathrm{Var}[Y\,|\,X=x] = \sigma_Y^2(1-\rho^2). \tag{1.80}
\]
This formula makes intuitive sense too. Without knowing X, the variance of Y is $\sigma_Y^2$. Information on X helps to reduce the variance, and the reduction depends on their correlation.

1.7.3 Multivariate normal

The density function of an $n$-dimensional multivariate normal, $X \sim N(\mu,\Sigma)$, is
\[
f(x) = (2\pi)^{-n/2}|\Sigma|^{-1/2}\exp\!\left[-\frac12(x-\mu)'\Sigma^{-1}(x-\mu)\right], \tag{1.81}
\]
where $\mu$ is an $n$-vector of means, $\Sigma$ is the $n\times n$ covariance matrix, and $x = (x_1,\ldots,x_n)'$. Although this is more complex than the bivariate case, it is still easy to compute in practice, if needed, by using a computer.

One of the most important properties of the normal is that all conditional and marginal distributions are normally distributed too. Let $X \sim N(\mu,\Sigma)$, and partition $X$, $\mu$ and $\Sigma$ as
\[
X = \begin{pmatrix} X_1\\ X_2 \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_1\\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \tag{1.82}
\]
where $X_1$ and $\mu_1$ are $k$-vectors and $\Sigma_{11}$ is a $k\times k$ matrix. Then the conditional distribution of $X_1$ given $X_2$ is still normal, with mean and covariance matrix
\[
E[X_1\,|\,X_2] = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(X_2-\mu_2), \tag{1.83}
\]
\[
\mathrm{Var}[X_1\,|\,X_2] = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}. \tag{1.84}
\]
In other words, given a joint distribution of normal random variables, we can get their conditional means and variances easily from the above formulas, which determine the entire distribution in the normal case.

1.7.4 Multivariate t

The multivariate t-distribution is an extension of the univariate t to an $n$-dimensional vector.
Assume $Y$ is an $n$-dimensional normal, $Y \sim N(0,\Sigma)$, and $u$ is an independent $\chi^2_\nu$ random variable. We call the distribution of
\[
X \equiv \mu + \frac{Y}{\sqrt{u/\nu}} \tag{1.85}
\]
a multivariate t with mean $\mu$ and degrees of freedom $\nu$. The density function is
\[
f(x) = \frac{\Gamma[(\nu+n)/2]}{\nu^{n/2}\pi^{n/2}\Gamma(\nu/2)|\Sigma|^{1/2}}\left(1 + \frac1\nu(x-\mu)'\Sigma^{-1}(x-\mu)\right)^{-(\nu+n)/2}. \tag{1.86}
\]
Note that the covariance matrix of the multivariate t is $\nu/(\nu-2)\,\Sigma$, not $\Sigma$. Although the multivariate t is not used by many, Tu and Zhou (2004) find that it is a much better model than the multivariate normal. Although it is symmetric, a skewness test is unable to reject it for the data.

1.7.5 Wishart distribution

The Wishart distribution is an extension of the $\chi^2$-distribution to $n>1$ dimensions. Mathematically, it is defined via products of independent normally distributed vectors. Let $Z_1, Z_2, \ldots, Z_T$ be $T$ independent random $n$-vectors, each of which follows a multivariate normal distribution with zero mean and the same covariance matrix, $Z_t \sim N(0,\Sigma)$; i.e., the $Z_t$'s are $T$ independent draws from the same multivariate normal distribution. We call the distribution of the $n\times n$ matrix $A$ below Wishart,
\[
A = Z_1Z_1' + Z_2Z_2' + \cdots + Z_TZ_T' = Z'Z \sim W(T,\Sigma), \tag{1.87}
\]
where $Z$ is the $T\times n$ matrix whose $t$-th row is $Z_t'$, and $T>n$. When $n=1$, $A = z_1^2 + z_2^2 + \cdots + z_T^2$ is clearly a $\chi^2$-distribution with $T$ degrees of freedom, scaled by $\Sigma$.

Suppose now that $X_1, X_2, \ldots, X_T$ are independent $N(\mu,\Sigma)$ random vectors. The sample covariance matrix is
\[
S = \frac{1}{T-1}\sum_{i=1}^T (X_i-\bar X)(X_i-\bar X)',
\]
where $\bar X$ is the sample mean. It is well known that $S$ is an unbiased estimator of $\Sigma$, $E(S) = \Sigma$. Moreover, the covariance of any two elements of $S$ is
\[
\mathrm{Cov}(s_{ij}, s_{kl}) = \frac{1}{T-1}\left(\sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}\right),
\]
which is useful for computing the standard errors of the elements of $S$.

1.8 Simple Models

Consider now a statistical model or data-generating process for a series of observed stock returns. As mentioned earlier, one of the key assumptions we often make is the iid (independently and identically distributed) assumption.
This is the assumption underlying the simple linear regression models used in many applications. In this subsection, we review some of the most important properties of the linear regression. Then we discuss ways to relax the iid assumption.

1.8.1 Univariate linear regression

To understand linear regressions well, the best starting point is the univariate linear regression,

\tilde y = \alpha + \beta \tilde x + \tilde\epsilon,  (1.88)

where we model a linear relation between a random variable ỹ and a random variable x̃, such as a stock return and the market return, with α and β as the parameters and ε̃ as the random error. The linear regression is usually written in terms of observations,

y_i = \alpha + \beta x_i + \epsilon_i,  i = 1, ..., n,  (1.89)

where y_i is called the dependent variable, regressand, or left-hand variable; x_i is the independent variable, explanatory variable, regressor, or right-hand variable; α is the intercept and β the slope, the regression coefficients; ε_i is the residual, disturbance, or error (usually assumed to have mean 0 and to be homoscedastic, i.e., uncorrelated with identical variance across observations; sometimes even assumed normally distributed; but generally assumed iid); and n is the number of observations, the sample size (in finance we usually use T instead of n).

How do we obtain the parameter estimates? We want to find estimates α̂ and β̂ of the true but unknown parameters α and β that provide the "best" fit, in some sense, to the data points. The most common objective is to minimize the sum of squared errors,

Q(\alpha, \beta) = \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2,  (1.90)

which is why the resulting solution is called the ordinary least-squares (OLS) estimator. Taking the first-order derivatives of Q(α, β) with respect to α and β and setting them to zero, the solutions are

\hat\alpha = \bar y - \hat\beta \bar x,  (1.91)

\hat\beta = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2},  (1.92)

where ȳ and x̄ are the sample means of the data. It is these formulas that standard OLS packages use to compute the estimators.
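The OLS formulas (1.91) and (1.92) can be coded directly; the data below are made up for illustration, with y lying exactly on the line 2 + 3x so the fit recovers the intercept and slope exactly.

```python
# A minimal sketch of the OLS formulas (1.91)-(1.92):
# alpha_hat = ybar - beta_hat*xbar,
# beta_hat = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2).
def ols(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
           / sum((xi - xbar) ** 2 for xi in x)
    alpha = ybar - beta * xbar
    return alpha, beta

# Exact linear data: y = 2 + 3x, so the fit recovers alpha = 2 and beta = 3.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0 + 3.0 * xi for xi in x]
print(ols(x, y))  # -> (2.0, 3.0)
```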
The above formulas are quite easy to understand intuitively. Multiplying the random-variable form of the linear regression (1.88) by the deviation of x̃ from its mean, x̃ − µ_x, and then taking expectations on both sides, we obtain E[(x̃ − µ_x)ỹ] = βE[(x̃ − µ_x)x̃], that is,

\beta = \frac{cov(\tilde x, \tilde y)}{var(\tilde x)},

which says that beta is the covariance between x̃ and ỹ divided by the variance of x̃ (recall, perhaps from your Investments class, that in the CAPM regression beta is the covariance between the stock and the market divided by the market variance). The earlier β̂ is simply the sample analogue of β. Similarly, taking expectations in (1.88), we have α = E(ỹ) − βE(x̃), so α̂ is the sample analogue of α.

It may be noted that sometimes the regression is run without the intercept, i.e.,

\tilde y = \beta \tilde x + \tilde\epsilon,  (1.93)

is assumed to be the true model. In this case, the OLS estimator of β is

\hat\beta = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2},

which is clearly different from the case with the intercept. Previously, β̂ was computed from de-meaned data; now it uses the raw data. Moreover, without the intercept, the covariance-over-variance interpretation of β is generally no longer available. We will, however, focus on the main case with an intercept in the regression.

In practice, and in data science books, vector and matrix notations are common. Let y = (y_1, ..., y_n)' and x = (x_1, ..., x_n)' be the vectors of the data/observations. Then the regression can be written as

y = \alpha 1_n + \beta x + \epsilon,  (1.94)

where 1_n is an n-vector of 1s and ε is an n-vector of residuals. Let

X = [1_n \; x]

be the n × 2 data matrix of the constant and the regressor. Then the regression is often written in vector-matrix form,

y = X (\alpha, \beta)' + \epsilon.
(1.95)

Moreover, the OLS estimator can also be written in matrix form,

(\hat\alpha, \hat\beta)' = (X'X)^{-1} X'y,  (1.96)

which is the well-known analytical formula for the OLS estimator, and which generalizes easily to the case with multiple regressors (see Section 1.8.2).

How good is the fit? This is usually judged by the R², a measure of the proportion of the variation in y that is explained by the variation in x:

R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{\sum_{i=1}^n (y_i - \bar y)^2} = 1 - \frac{Variance_{residual}}{Variance_{total}},  (1.97)

where ȳ is the sample mean and ŷ_i = α̂ + β̂x_i are the fitted values. It is clear that R² is between 0 and 1, 0 ≤ R² ≤ 1. When it is 1, x explains y perfectly; in that case they must be perfectly correlated. When the R² is zero, x has nothing to do with y. Of course, in practice, R² falls strictly between 0 and 1 and is not that extreme. Typically, in the CAPM regression for a large stock's return, an R² of 80% or 90% is not uncommon. However, in predictive regressions of current values on past ones, the R² is very low, in the range of 0 to 5%.

Mathematically, when you add an additional regressor to the regression, the R² increases by design, as more variables always help to explain more in-sample. However, out-of-sample (when you apply the model to future or new data), it is typically the case that too many regressors do worse, which is called over-fitting in statistics. Therefore, the adjusted R² is proposed to penalize the number of regressors,

R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-K-1},  (1.98)

where K is the number of regressors. So, everything else equal, the greater the K, the lower the R²_adj. Of course, the greater the R²_adj, the better the fit of the linear regression. Note that 1 is still the upper bound for R²_adj, which is unachievable in practice and can only be approached, while R²_adj can also take negative values.

How accurate are the OLS estimators compared with the true values?
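A sketch of the matrix-form estimator (1.96) and the R² of (1.97) for the single-regressor case, where X'X is 2 × 2 and can be inverted by hand; the data are illustrative (roughly y = 2x).

```python
# A sketch of the matrix-form OLS estimator (X'X)^{-1}X'y and the R^2 of (1.97),
# written out for one regressor so the 2x2 matrix X'X is inverted directly.
def ols_matrix(x, y):
    n = len(x)
    sx, sxx = sum(x), sum(xi * xi for xi in x)
    sy, sxy = sum(y), sum(xi * yi for xi, yi in zip(x, y))
    # X'X = [[n, sx], [sx, sxx]], X'y = (sy, sxy)'.
    det = n * sxx - sx * sx
    alpha = (sxx * sy - sx * sxy) / det
    beta = (-sx * sy + n * sxy) / det
    fitted = [alpha + beta * xi for xi in x]
    ybar = sy / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return alpha, beta, 1.0 - ss_res / ss_tot

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # illustrative data, roughly y = 2x
a, b, r2 = ols_matrix(x, y)      # a = 0.05, b = 1.99, r2 about 0.997
```

The near-perfect R² reflects the almost exact linear relation in the toy data; real stock-return regressions rarely fit this well.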
This depends on the assumptions we make on the linear regression model,

y_i = \alpha + \beta x_i + \epsilon_i,  i = 1, ..., n.  (1.99)

There are 3 key assumptions: 1) ε_i has mean 0 and E(ε_i | x_i) = 0; 2) the (y_i, x_i)'s are iid; 3) y_i and x_i have finite 4th moments (large outliers are unlikely).

Many books impose a normality assumption on ε_i. Under joint normality, E(ε_i | x_i) = 0 is equivalent to x_i and ε_i being uncorrelated. Normality is much stronger than Assumption 1) and is unnecessary asymptotically. However, the zero-mean assumption is always necessary to guarantee convergence of the estimators. The condition E(ε_i | x_i) = 0 ensures identification of the slope; otherwise there are missing regressors and the OLS slope will fail to converge to the true value. An example is the following regression,

Salary = a + b × Education + c × Ability + ε.

If Ability, which is correlated with Education, is omitted, then the composite residual, c × Ability + ε, will be correlated with Education. In this case, the OLS regression will likely produce a larger b than otherwise, attributing more of the effect to Education.

Assumption 2) is technical, and may be weakened to allow dependence of the data over time as long as certain stationarity assumptions hold. Assumption 3) is important to ensure a certain degree of accuracy.

Under these assumptions, the OLS estimators converge to the true parameters as the sample size becomes large. We then have asymptotic confidence intervals for the estimators, and asymptotic t-ratio tests.

It will be useful to analyze the often-assumed ideal case in which the residuals are iid normal, ε_i ∼ N(0, σ²). From (1.96), it is straightforward to show that the OLS estimator is jointly normally distributed too,

(\hat\alpha, \hat\beta)' \sim N\left( (\alpha, \beta)', \; \sigma^2 (X'X)^{-1} \right).  (1.100)

In particular, it implies that

\hat\alpha \sim N\left( \alpha, \; \frac{1}{n}(1 + \theta_x^2)\sigma^2 \right),  (1.101)

where θ_x = x̄ / std(x), i.e., θ_x is the sample mean divided by the sample standard deviation of the regressor x.
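The omitted-variable problem in the Salary example above can be seen in a small simulation; all coefficients and correlations here are hypothetical numbers chosen for illustration, not estimates from any data.

```python
import random

# An illustrative simulation (hypothetical numbers) of the omitted-variable problem
# in Salary = a + b*Education + c*Ability + eps: when Ability, which is correlated
# with Education, is left out, the OLS slope on Education is biased upward.
random.seed(0)
n = 20000
ability = [random.gauss(0, 1) for _ in range(n)]
education = [0.5 * ab + random.gauss(0, 1) for ab in ability]  # correlated with Ability
salary = [50 + 2.0 * ed + 3.0 * ab + random.gauss(0, 1)
          for ed, ab in zip(education, ability)]

def slope(x, y):
    # The univariate OLS slope (1.92).
    xbar, ybar = sum(x) / len(x), sum(y) / len(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
           / sum((xi - xbar) ** 2 for xi in x)

# Omitting Ability: the slope converges to b + c*cov(Edu, Abl)/var(Edu) = 3.2, not 2.
b_short = slope(education, salary)
```

The short regression attributes part of Ability's effect to Education, so the estimated slope settles near 3.2 rather than the true b = 2.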
The above formula says that α̂ is an unbiased estimator of α, normally distributed with variance (1/n)(1 + θ_x²)σ². As the sample size n grows, α̂ becomes more accurate. In practice, σ² is unknown, but it can be estimated from the realized residuals. Let σ̂² be the estimator. Then, due to the error in estimating σ², the standardized alpha,

\hat\alpha \Big/ \left( \hat\sigma \sqrt{(1 + \theta_x^2)/n} \right),

follows a t-distribution instead of a normal, even when the residuals are assumed normal here. This is the traditional and popular t-ratio, which is often used to test whether or not α is zero.

What is the impact of using too many regressors in the OLS regression? The more the regressors, the higher the R², i.e., the better the in-sample fit. But the adjusted R² may well go down. Importantly, the estimation errors tend to grow with the number of regressors, and so do the forecasting errors. In general, too good an in-sample fit of the model leads to worse out-of-sample forecasting (see Section 10.4.3).

1.8.2 Multiple linear regression

When there are multiple regressors, say K of them, the linear regression model becomes

y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_K x_{iK} + \epsilon_i,  i = 1, ..., n.  (1.102)

Let y = (y_1, ..., y_n)' and let

X = [1_n \; x],  with i-th row (1, x_{i1}, ..., x_{iK}),

be the n × (K + 1) data matrix. Then the regression can be written as

y = X\beta + \epsilon,  (1.103)

where β = (β_0, β_1, ..., β_K)' is a (K + 1)-vector of the parameters. It is easy to verify that the OLS estimator still has the same form as before,

\hat\beta = (X'X)^{-1} X'y.  (1.104)

While almost all results on the univariate regression carry through to the multiple-regressor case, there is one important issue: the estimation error arising from using K regressors. To see this, let us define the L2 norm, a commonly used notation in both statistics and data science. For any vector a = (a_1, ..., a_m)', its squared norm or squared L2 distance is

||a||^2 = a_1^2 + a_2^2 + \cdots + a_m^2,  (1.105)

which is the sum of the squares of the components of the vector.
The norm itself is the square root,

||a|| = \sqrt{a_1^2 + a_2^2 + \cdots + a_m^2}.

With the new notation, we can write the sum of squared errors (see (1.90)) as ||y − Xβ||², and the OLS estimator β̂ as the solution to

\min_\beta ||y - X\beta||^2.

Under the assumption that the errors are iid, ε_i ∼ IID(0, σ²), the key result is about the expected error in estimating the true betas (see, e.g., Giraud, 2015, p. 8),

E\left[ ||\hat\beta - \beta||^2 \right] = (K + 1)\sigma^2,  (1.106)

when the data X are standardized to have orthonormal columns. The above equation says that the expected error is proportional to σ² with a scalar K + 1. When there is one regressor, the error is 2σ², but the error grows to 100σ² when there are 99 regressors! This says that regressions will not work well if there are too many regressors, posing a challenge that calls for machine learning methods for dimension reduction (see Chapter 10 and references therein).

1.8.3 Autocorrelations

The fundamental assumption made so far is the iid assumption. There is a vast amount of research that relaxes this assumption by fitting the data with various time series models, such as ARMA, ARCH and GARCH. A common way of examining whether the data are time-dependent is to compute the sample autocorrelations of the data,

\hat\rho_\tau = \frac{\sum_{t=1}^{T-\tau} (R_t - \hat\mu)(R_{t+\tau} - \hat\mu)}{\sum_{t=1}^{T} (R_t - \hat\mu)^2}.  (1.107)

If the data are independently distributed over time, R̃_t should be independent of R̃_{t+τ}, and so the computed ρ̂_τ should be close to zero. For a large sample size T, ρ̂_τ is approximately normally distributed,

\hat\rho_\tau \sim N(0, 1/T),  (1.108)

if the data are iid. So the standard error of ρ̂_τ is roughly 1/√T, and if ρ̂_τ is away from zero by more than 2 standard errors, we may reject the independence assumption.

1.8.4 Time series models

The simplest model for stock returns is

R_t = \mu + \epsilon_t,  \epsilon_t \sim N(0, \sigma^2).  (1.109)

It says that the returns are iid normal with constant mean µ and variance σ². This is the same model mentioned earlier in (1.44).
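The sample autocorrelation (1.107) and the ±2/√T band implied by (1.108) can be sketched as follows; the simulated "daily returns" are illustrative iid draws.

```python
import math, random

# A sketch of the sample autocorrelation (1.107) and the +/- 2/sqrt(T) band of (1.108).
def autocorr(r, tau):
    T = len(r)
    mu = sum(r) / T
    num = sum((r[t] - mu) * (r[t + tau] - mu) for t in range(T - tau))
    den = sum((x - mu) ** 2 for x in r)
    return num / den

random.seed(1)
returns = [random.gauss(0.0005, 0.01) for _ in range(2000)]  # illustrative iid "daily returns"
band = 2.0 / math.sqrt(len(returns))
rho1 = autocorr(returns, 1)
# For iid data, rho1 should fall inside (-band, band) with high probability.
```

As a sanity check, a perfectly alternating series such as (1, −1, 1, −1, ...) has a lag-1 autocorrelation of −(T − 1)/T, essentially −1.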
In the real world, even if we believe that the long-term expected return is constant, the expected return may change over time conditional on information variables. For example, conditional on the oil price being high or low, our expected return on the stock market may be different. A simple model to reflect this is

R_t = \mu + \alpha z_{t-1} + \beta R_{t-1} + \epsilon_t,  \epsilon_t \sim N(0, \sigma^2),  (1.110)

where z_{t−1} is the information variable. The above model also allows for dependence of the returns on their own past. In general, z_{t−1} can be a vector of available information, and extra lags of returns can be included in the regression, to get a model like

R_t = \mu + \alpha' z_{t-1} + \beta_1 R_{t-1} + \cdots + \beta_p R_{t-p} + \epsilon_t + \gamma_1 \epsilon_{t-1} + \cdots + \gamma_q \epsilon_{t-q},  \epsilon_t \sim N(0, \sigma^2),  (1.111)

which is the standard ARMA(p, q) time series model plus regressors.

In equation (1.110), the expected return conditional on {z_{t−1}, R_{t−1}},

E[R_t \,|\, z_{t-1}, R_{t-1}] = \mu + \alpha z_{t-1} + \beta R_{t-1},  (1.112)

changes over time and varies with the information variables. However, the conditional volatility is constant,

Var[R_t \,|\, z_{t-1}, R_{t-1}] = Var[\epsilon_t] = \sigma^2.  (1.113)

This is unrealistic in applications. To model the time-varying volatility, Engle (1982), in Nobel prize-winning work, proposes to use

R_t = \mu + \epsilon_t,  \epsilon_t \,|\, I_t \sim N(0, \sigma_t^2),  (1.114)
\sigma_t^2 = a_0 + a_1 \epsilon_{t-1}^2 + \cdots + a_p \epsilon_{t-p}^2,  (1.115)
a_0 > 0,  a_1, ..., a_p \ge 0,  (1.116)

where I_t stands for all available information at time t. Notice that R_t in (1.114) is, for simplicity, assumed to have a constant mean µ. However, the variance of R_t conditional on past information is σ_t², a function of time. In other words, the conditional volatility σ_t is now time-varying, which introduces heteroscedasticity of the variance across time. How does this volatility, say daily, change over time? Equation (1.115) assumes that it depends on the shocks to the returns over the previous day and up to p past days. A surprisingly large drop in the stock market yesterday is likely to increase the volatility (vol) today.
Since the dependence is of regression type, the model is known as the autoregressive conditional heteroscedasticity model, or ARCH(p).

Bollerslev (1986) generalizes ARCH(p) into GARCH(p, q) by adding q past volatilities to the vol regression,

R_t = \mu + \epsilon_t,  \epsilon_t \,|\, I_t \sim N(0, \sigma_t^2),  (1.117)
\sigma_t^2 = a_0 + a_1 \epsilon_{t-1}^2 + \cdots + a_p \epsilon_{t-p}^2 + b_1 \sigma_{t-1}^2 + \cdots + b_q \sigma_{t-q}^2,  (1.118)
a_0 > 0,  a_1, ..., a_p, b_1, ..., b_q \ge 0.  (1.119)

The simplest GARCH model is GARCH(1,1),

R_t = \mu + \epsilon_t,  \epsilon_t \,|\, I_t \sim N(0, \sigma_t^2),  (1.120)
\sigma_t^2 = \omega + a \epsilon_{t-1}^2 + b \sigma_{t-1}^2,  (1.121)
\omega > 0,  a, b \ge 0,  a + b < 1,  (1.122)

which has only one lag in each of the regressors. It has only three parameters, and hence is easy to estimate in practice. Software in many environments, such as Excel, Matlab, R or Python, is available for the estimation. GARCH(1,1) is the generic or 'vanilla' GARCH model used by many financial institutions.

Technically, GARCH(1,1) is useful for two reasons. First, more complex GARCH models require the estimation of more parameters, which turns out to be unnecessary, as the maximum likelihood function becomes flat with more parameters. Second, GARCH(1,1) captures most of the salient features of the data, which ARCH(p) fails to do. To see this, we apply (1.121) recursively and get

\sigma_t^2 = \omega + a\epsilon_{t-1}^2 + b(\omega + a\epsilon_{t-2}^2 + b\sigma_{t-2}^2)  (1.123)
 = \omega + a\epsilon_{t-1}^2 + b(\omega + a\epsilon_{t-2}^2 + b(\omega + a\epsilon_{t-3}^2 + b(\cdots)))  (1.124)
 = \frac{\omega}{1-b} + a(\epsilon_{t-1}^2 + b\epsilon_{t-2}^2 + b^2\epsilon_{t-3}^2 + \cdots).  (1.125)

This says that the GARCH(1,1) model is in effect an ARCH model of infinite order, with coefficients that decline exponentially in weighting the past shocks.

The condition a + b < 1 in equation (1.122) is the stability condition of the model. If it holds, GARCH(1,1) is a stationary process for which the unconditional or long-term vol exists, and it is equal to

\sigma^2 = Var(R_t) = \frac{\omega}{1 - a - b}.  (1.126)

For stock returns, the stability condition is satisfied, though the estimates of a + b are close to 1.
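The recursion (1.121) is straightforward to simulate; the sketch below uses illustrative parameter values (not estimates) and checks that the long-run sample variance is in the neighborhood of the unconditional variance ω/(1 − a − b) of (1.126).

```python
import random

# A sketch of simulating GARCH(1,1), equations (1.120)-(1.121), and comparing the
# long-run sample variance to the unconditional variance omega/(1-a-b) of (1.126).
# The parameter values below are illustrative, not estimates from any data.
random.seed(42)
mu, omega, a, b = 0.0, 2e-6, 0.08, 0.90
sigma2 = omega / (1 - a - b)          # start at the unconditional variance
returns = []
for _ in range(100000):
    eps = (sigma2 ** 0.5) * random.gauss(0, 1)
    returns.append(mu + eps)
    sigma2 = omega + a * eps ** 2 + b * sigma2   # next period's conditional variance

long_run = omega / (1 - a - b)        # approx 1e-4, i.e., about 1% daily vol
sample_var = sum(r * r for r in returns) / len(returns)
```

Because a + b = 0.98 here, volatility is highly persistent, so even 100,000 observations give only a rough match to the unconditional variance; this persistence is exactly what the near-unity estimates for stock returns imply.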
For currency/exchange-rate and commodity prices, however, the estimates are even closer to 1. If a + b = 1, we can reparameterize GARCH(1,1) into

R_t = \mu + \epsilon_t,  \epsilon_t \,|\, I_t \sim N(0, \sigma_t^2),  (1.127)
\sigma_t^2 = \omega + (1 - \lambda)\epsilon_{t-1}^2 + \lambda \sigma_{t-1}^2,  (1.128)
\omega > 0,  0 \le \lambda \le 1.  (1.129)

This is known as the integrated GARCH or I-GARCH model.

In contrast to ARCH or GARCH, the simplest model for time-varying volatilities (usually daily) is

\sigma_t^2 = (1 - \lambda) R_{t-1}^2 + \lambda \sigma_{t-1}^2.  (1.130)

Note that the variance estimated from the single observation R_{t−1} is simply R²_{t−1} if we ignore the virtually zero mean of the daily return. So the right-hand side is a weighted average of the vol estimated from the most recent observation and yesterday's vol. RiskMetrics fixes the value of λ at 0.94, so that no estimation is needed in their applications. If we apply (1.130) recursively, as in the GARCH case, it is easy to see that

\sigma_t^2 = (1-\lambda)(R_{t-1}^2 + \lambda R_{t-2}^2 + \lambda^2 R_{t-3}^2 + \cdots) = (1-\lambda)\sum_{i=1}^{\infty} \lambda^{i-1} R_{t-i}^2,  (1.131)

i.e., the vol is an infinite exponentially weighted moving average (widely known as EWMA) of the squared returns.

For further reading on time series models, see Alexander (2001) and your Econometrics texts.

2 Portfolio Choice 1: Mean-variance Theory

In this chapter, we discuss strategies for selecting a portfolio among N risky securities, perhaps with the addition of the riskfree asset. We provide first a few ad hoc rules. Then we derive and discuss the optimal portfolio rules under the popular mean-variance framework, in which investors care only about the means and covariances of the assets, which determine the portfolio risk and return. In other words, we examine mainly the case in which the portfolio risk and return are determined by the asset means and covariances only.

2.1 Ad hoc rules

In this subsection, we discuss 5 ad hoc portfolio selection rules: equal-weighting, value-weighting, volatility-weighting, risk parity, and the global minimum-variance portfolio.
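The EWMA update (1.130) with the RiskMetrics value λ = 0.94 can be sketched in a few lines; the daily returns below are made-up numbers for illustration.

```python
# A sketch of the RiskMetrics EWMA update (1.130) with lambda = 0.94: each day's
# variance estimate is a weighted average of yesterday's squared return and
# yesterday's variance estimate.
def ewma_vol(returns, lam=0.94, init_var=None):
    var = init_var if init_var is not None else returns[0] ** 2
    for r in returns:
        var = (1 - lam) * r ** 2 + lam * var
    return var ** 0.5

daily = [0.01, -0.02, 0.015, -0.005, 0.03]   # illustrative daily returns
vol = ewma_vol(daily)                        # about 1.3% daily vol
```

With λ this close to 1, a single large return (the 3% move at the end) lifts the vol estimate only modestly, which is the smoothing the recursion (1.131) makes explicit.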
Although these rules are not optimal under standard assumptions such as the mean-variance utility, they are widely used in practice in different contexts. All 5 rules are strategies for investing in N risky assets only. When there is a riskfree asset, the rules should be modified based on the needs of the application.

2.1.1 Equal-weighting: 1/N

Suppose there are N assets with returns R_1, ..., R_N. The equal-weighting rule divides your money equally across the assets, with weights

w_{naive} = \left( \frac{1}{N}, \frac{1}{N}, ..., \frac{1}{N} \right)',  (2.1)

so that the portfolio weights are the same across assets and equal to 1/N. This is known as the naive rule because it is a simple and intuitive way of investing. The resulting portfolio return is

R_{p,e} = \frac{1}{N} R_1 + \frac{1}{N} R_2 + \cdots + \frac{1}{N} R_N,  (2.2)

which simply adds the returns with weight 1/N, i.e., the average return across assets (different from the usual average return of an asset, which is computed across time).

Example 2.1 There are 3 stocks with prices $20, $40 and $50, respectively. Then, based on the 1/N rule, we invest 1/3 of our money into each, regardless of how expensive or how good each company is. If we have $3000, then we invest $1000 into each stock. ♠

In practice, investors often use the 1/N rule in placing bets on ideas, on sectors of the stock market, or on asset classes. The 1/N rule is simple, and is useful when the estimation errors in the asset expected returns are large. But better strategies are available even when we worry about estimation risk, a topic discussed later in Section 3.5.

2.1.2 Value-weighting

Let V_1, ..., V_N be the values of the N assets. The value-weighted portfolio is

R_{p,V} = \frac{V_1}{V_1 + \cdots + V_N} R_1 + \frac{V_2}{V_1 + \cdots + V_N} R_2 + \cdots + \frac{V_N}{V_1 + \cdots + V_N} R_N,  (2.3)

where the portfolio weight on asset i is

w_i = \frac{V_i}{V_1 + \cdots + V_N},  i = 1, 2, ..., N,  (2.4)

and the weights sum to one.
Example 2.2 If there are 2 stocks whose market values are $20,000 and $80,000, respectively, then our portfolio weights on the 2 stocks are

w_1 = \frac{20{,}000}{100{,}000} = 0.20;  w_2 = \frac{80{,}000}{100{,}000} = 0.80.

Note that only information on the values of the firms is used, regardless of the current prices or economic outlooks of the companies. ♠

Value-weighting is very popular in practice. Almost all stock indices are value-weighted (the Dow is an exception; it is price-weighted, equivalent to holding an equal number of shares of each stock in the index), and in particular the S&P 500 is. In a value-weighted portfolio, one holds all the assets, and the holding of each is proportional to its value relative to the total market value. Almost all index funds invest their money via value-weighting, so no research or stock analysis is needed, which is why their costs are low (virtually zero except for bookkeeping and occasional trading due to dividend reinvestment, redemptions, or additions/removals of stocks in the indices).

2.1.3 Volatility-weighting

Volatility is one of the most important factors for portfolio selection. Let σ_1, ..., σ_N be the volatilities (standard deviations) of the N assets. The volatility-weighted portfolio puts weights proportional to the inverses of the variances,

R_{p,\sigma^2} = \frac{1}{\sigma_1^2} R_1 + \frac{1}{\sigma_2^2} R_2 + \cdots + \frac{1}{\sigma_N^2} R_N.  (2.5)

It is clear that the greater the volatility of an asset, the smaller the weight we put on that asset. Since the above weights do not sum to one, we use the normalized weights

w_i = \frac{1/\sigma_i^2}{1/\sigma_1^2 + \cdots + 1/\sigma_N^2},  i = 1, 2, ..., N,  (2.6)

whose sum is 1 by construction. Hence, the volatility-weighted portfolio is fully determined by

R_{p,\sigma^2} = w_1 R_1 + w_2 R_2 + \cdots + w_N R_N,  (2.7)

which is the portfolio per dollar invested based on volatility information alone. Note that the weight on the first asset is inversely related to σ_1², not to σ_1! Hence, the commonly known volatility-weighting defined above should really be called inverse-variance-weighting.
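The two weighting schemes discussed so far, 1/N in (2.1) and value weighting in (2.4), can be sketched directly on Example 2.2's numbers:

```python
# A small sketch of the 1/N rule (2.1) and value weighting (2.4),
# using Example 2.2's market values of $20,000 and $80,000.
def equal_weights(n):
    return [1.0 / n] * n

def value_weights(values):
    total = sum(values)
    return [v / total for v in values]

print(value_weights([20000.0, 80000.0]))  # -> [0.2, 0.8]
print(equal_weights(3))                   # three stocks: 1/3 each
```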
The true volatility-weighting, or inverse-volatility-weighting, portfolio is

R_{p,\sigma} = w_1 R_1 + w_2 R_2 + \cdots + w_N R_N,  (2.8)

with

w_i = \frac{1/\sigma_i}{1/\sigma_1 + \cdots + 1/\sigma_N},  i = 1, 2, ..., N.  (2.9)

In contrast to the earlier case, it is the volatility (the square root of the variance) that determines the weights. The question is why people use variances to inversely weight the assets. The reason is that such weights are optimal in minimizing the portfolio risk if the asset returns are independent of one another (see Section 2.1.5). In general, the optimal portfolio is related to the inverse of the covariance matrix (see Sections 2.2 and 2.7), not directly and linearly related to the volatilities per se.

Example 2.3 If two stocks have 20% and 40% volatility, respectively, then

w_1 = \frac{1/.2^2}{1/.2^2 + 1/.4^2} = 80\%,  w_2 = 1 - w_1 = 20\%.

Note that the weight is a nonlinear function of the volatilities. Here the second stock has twice the volatility of the first, but its weight is not half of the former's; it is only 1/4 of it. However, the true (inverse) volatility-weighting is

w_1 = \frac{1/.2}{1/.2 + 1/.4} = 66.67\%,

lower than before. The reason is that, in terms of volatility, the second stock is twice as big; but in terms of variance, it is 4 times as large (.4²/.2² = 4), so you invest more in the first. ♠

Some active funds use volatility weighting to effectively reduce the volatility of a portfolio. If the stock returns are independent, the strategy generates the portfolio with the minimum volatility. However, if the stocks are correlated, as they are in the real world, volatility weighting will not attain the theoretical minimum-volatility portfolio, because the correlation information can be used to reduce risk further (see Section 2.1.5). In the real world, the estimation of the correlations is noisy, so it is unclear which of the two strategies will do better. That is why some managers still use volatility weighting for certain investments.
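The contrast between (2.6) and (2.9) is easy to verify with Example 2.3's volatilities of 20% and 40%:

```python
# A sketch of the two weighting schemes: inverse-variance (2.6) vs. inverse-volatility
# (2.9), using Example 2.3's volatilities of 20% and 40%.
def inverse_variance_weights(vols):
    inv = [1.0 / v ** 2 for v in vols]
    s = sum(inv)
    return [x / s for x in inv]

def inverse_vol_weights(vols):
    inv = [1.0 / v for v in vols]
    s = sum(inv)
    return [x / s for x in inv]

w_var = inverse_variance_weights([0.2, 0.4])  # [0.8, 0.2]
w_vol = inverse_vol_weights([0.2, 0.4])       # [2/3, 1/3]
```

The inverse-variance rule penalizes the riskier stock much more heavily, which is the nonlinearity the example highlights.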
It should be noted that the 1/N rule is a special case of volatility weighting: when all the volatilities are taken as equal, clearly w_i = 1/N for all i. So volatility weighting is more general than 1/N, and it tends to do slightly better than 1/N since it incorporates volatility information into the decision making, and volatility can usually be estimated fairly accurately.

2.1.4 Risk parity

Risk parity is a portfolio rule that equalizes the risk contribution of each asset, so it is also known as the equal risk or equally-weighted risk portfolio. Note that the volatilities of the assets can differ from one another; it is just that each asset, with its weight, contributes equally to the total volatility of the portfolio.

For simplicity, consider first the two-asset case with volatilities σ_1 and σ_2 and correlation ρ. The portfolio is

R_p = w_1 R_1 + w_2 R_2,  (2.10)

where R_1 and R_2 are the returns on the assets. Recall from any standard investments text that the portfolio volatility is

\sigma_p = \sqrt{w_1^2 \sigma_1^2 + 2\rho w_1 w_2 \sigma_1 \sigma_2 + w_2^2 \sigma_2^2}.  (2.11)

The contribution of the first asset to σ_p is its weight times its per-unit contribution, that is,

C_1 = w_1 \times \frac{\partial \sigma_p}{\partial w_1} = \frac{w_1^2 \sigma_1^2 + \rho w_1 w_2 \sigma_1 \sigma_2}{\sigma_p}.  (2.12)

Similarly, or by symmetry, the risk contribution of the second asset is

C_2 = \frac{w_2^2 \sigma_2^2 + \rho w_1 w_2 \sigma_1 \sigma_2}{\sigma_p}.  (2.13)

For equal contributions, we want C_1 = C_2, i.e., w_1²σ_1² = w_2²σ_2² = (1 − w_1)²σ_2². Hence the solution is

w_1 = \frac{\sigma_1^{-1}}{\sigma_1^{-1} + \sigma_2^{-1}},  w_2 = \frac{\sigma_2^{-1}}{\sigma_1^{-1} + \sigma_2^{-1}}.  (2.14)

Note that in the two-asset case the correlation plays no role: risk parity is the same as the (inverse) true volatility-weighting. This will not be true in general when there are N > 2 assets.

Example 2.4 Suppose that there are two stocks whose volatilities are 20% and 40%, respectively. Then the weights are

w_1 = \frac{1/.2}{1/.2 + 1/.4} = 67\%,  w_2 = 33\%.
One can verify that .67 × .2 ≈ .13 ≈ .33 × .4 (up to rounding, as we rounded w_1 and w_2 to two digits), i.e., both assets contribute the same amount of risk to the portfolio. ♠

A typical allocation advice from investment advisors is to invest about 60% in stocks and 40% in bonds. This implies that the bulk of the portfolio risk is from the stock portion of the portfolio, given the much greater stock volatility. To see this, assume the two have 20% and 12% volatilities with no correlation. Then

\sigma_p = \sqrt{0.6^2 \times 0.2^2 + 0.4^2 \times 0.12^2} = 12.92\%,  C_1 = 11.14\%,  C_2 = 1.78\%,

and so the stock risk share is C_1/(C_1 + C_2) = 86%. In contrast, the risk parity portfolio has weights

w_1 = 37.5\%,  w_2 = 62.5\%.

With these weights, the stock portion carries exactly half of the risk of the entire portfolio.

In applications, risk parity managers attempt to equalize risk across asset classes such as stocks, bonds, commodities, real estate and currencies. During the recent financial crisis, stocks lost about 50% while bonds were up, so the risk parity portfolio clearly did better. In the long run, however, it will under-perform the traditional portfolio, as the mean return of bonds is lower than that of stocks. Some portfolio managers argue that one can use leverage to increase the return on the entire portfolio to be comparable to, or to beat, the traditional asset allocation. But whether this is true or not is unclear; theoretically, it seems unlikely to succeed over all market regimes.

When N > 2, in the special but unrealistic case in which the correlations among all the assets are the same, the weight on the i-th asset is analytically obtainable,

w_i = \frac{\sigma_i^{-1}}{\sum_{j=1}^N \sigma_j^{-1}}.  (2.15)

This is the same as the (inverse) true volatility-weighting. However, when the correlations differ across the assets, as is the case in the real world, the correlations matter, and there are no simple formulas for the portfolio weights.
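The risk contributions (2.12) and (2.13) for the 60/40 stock/bond example can be sketched as follows, under the same assumption of 20% and 12% volatilities and zero correlation:

```python
import math

# A sketch of the risk contributions (2.12)-(2.13) for the 60/40 stock/bond example,
# assuming 20% and 12% volatilities and zero correlation.
def risk_contributions(w1, w2, s1, s2, rho):
    sp = math.sqrt(w1**2 * s1**2 + 2 * rho * w1 * w2 * s1 * s2 + w2**2 * s2**2)
    c1 = (w1**2 * s1**2 + rho * w1 * w2 * s1 * s2) / sp
    c2 = (w2**2 * s2**2 + rho * w1 * w2 * s1 * s2) / sp
    return sp, c1, c2

sp, c1, c2 = risk_contributions(0.6, 0.4, 0.20, 0.12, 0.0)
stock_share = c1 / (c1 + c2)          # about 86% of total risk comes from stocks

# Risk parity weights (2.14) equalize the two contributions.
w1 = (1 / 0.20) / (1 / 0.20 + 1 / 0.12)   # 37.5% in stocks
sp2, d1, d2 = risk_contributions(w1, 1 - w1, 0.20, 0.12, 0.0)
```

Note that C_1 + C_2 = σ_p by construction, so the contributions decompose the portfolio volatility exactly.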
Denote by Σ the covariance matrix of the assets. It is well known that the volatility of the portfolio with weights w is

\sigma(w) = \sqrt{w' \Sigma w}.  (2.16)

The risk contribution of asset i is

\sigma_i(w) = w_i \times \frac{\partial \sigma(w)}{\partial w_i} = \frac{w_i (\Sigma w)_i}{\sqrt{w' \Sigma w}},

where (Σw)_i denotes the i-th element of the vector Σw. Equal contributions require σ_i(w) = σ(w)/N, implying that

w_i = \frac{w' \Sigma w}{N (\Sigma w)_i}.  (2.17)

Note that w_i also appears on the right-hand side, so the above is not an analytical solution. To find the w_i's that make the equation hold, we can solve the following minimization problem,

\min_w \sum_{i=1}^N \left[ w_i - \frac{w' \Sigma w}{N (\Sigma w)_i} \right]^2,  (2.18)

subject to the constraint that all the weights sum to 1. The solution has to be found numerically, via Python, Matlab or R. Maillard, Roncalli, and Teiletche (2010) provide further properties of the equal risk portfolio.

2.1.5 Global minimum-variance portfolio

In practice, asset expected returns/means are difficult to estimate, and so many investors/fund managers simply ignore the means (they typically may not differ too much across similar stocks) and focus on minimizing the risk, to obtain the minimum-risk portfolio, called the global minimum-variance (GMV) portfolio.

Consider first the case of two risky assets. In terms of the earlier notation, we want to minimize σ_p, which is equivalent to minimizing its square (the variance),

\sigma_p^2 = w_1^2 \sigma_1^2 + 2\rho w_1 w_2 \sigma_1 \sigma_2 + w_2^2 \sigma_2^2.  (2.19)

Plugging in w_2 = 1 − w_1 and then taking the derivative with respect to w_1, we have

\frac{d\sigma_p^2}{dw_1} = 2w_1 \sigma_1^2 + 2\rho(1 - w_1)\sigma_1\sigma_2 - 2\rho w_1 \sigma_1 \sigma_2 - 2(1 - w_1)\sigma_2^2 = 0.

Solving for w_1, we obtain

w_1 = \frac{\sigma_2^2 - \rho\sigma_1\sigma_2}{\sigma_1^2 - 2\rho\sigma_1\sigma_2 + \sigma_2^2}.  (2.20)

This is the weight on the first asset that minimizes the portfolio risk (the weight on the second asset is w_2 = 1 − w_1). The formula makes intuitive sense: if the second asset is riskier, i.e., σ_2² is larger, we should put more weight on the first asset.
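As a numerical sketch for N > 2, the condition (2.17) can also be attacked with a simple damped fixed-point iteration rather than the least-squares problem (2.18); this iteration is a common heuristic, not the method of the text, and the 3-asset covariance matrix below is illustrative.

```python
import math

# A numerical sketch of risk parity for N > 2 assets: a damped fixed-point iteration
# whose fixed point satisfies w_i (Sigma w)_i = constant, i.e., equal risk
# contributions as in (2.17). The 3-asset covariance matrix is illustrative.
def mat_vec(S, w):
    return [sum(S[i][j] * w[j] for j in range(len(w))) for i in range(len(S))]

def risk_parity(S, iters=1000):
    n = len(S)
    w = [1.0 / n] * n
    for _ in range(iters):
        Sw = mat_vec(S, w)
        w = [math.sqrt(w[i] / Sw[i]) for i in range(n)]  # damped multiplicative update
        total = sum(w)
        w = [x / total for x in w]                        # renormalize to sum to 1
    return w

# Vols of 10%, 20%, 30% with a common pairwise correlation of 0.3.
vols = [0.10, 0.20, 0.30]
S = [[(1.0 if i == j else 0.3) * vols[i] * vols[j] for j in range(3)]
     for i in range(3)]
w = risk_parity(S)
Sw = mat_vec(S, w)
contrib = [w[i] * Sw[i] for i in range(3)]   # variance contributions, all equal
```

For a diagonal Σ, the update converges in a single step to the inverse-volatility weights, consistent with (2.15).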
Example 2.5 Suppose the vol of the first stock is 20%, that of the second is 30%, and the correlation is 50%. Then

w_1 = \frac{0.3^2 - 0.5 \times 0.2 \times 0.3}{0.2^2 - 2 \times 0.5 \times 0.2 \times 0.3 + 0.3^2} = 0.8571,

that is, you invest 85.71% of your money in the first asset and the rest in the second. Your portfolio will then have the minimum risk possible. What is the minimum risk? It will be computed below. ♠

It is useful to consider a few special cases. If ρ = 1, i.e., the two stocks are perfectly positively correlated, we can eliminate the risk entirely by buying one and shorting the other. Indeed, assume σ_2 > σ_1 without loss of generality. We have

w_1 = \frac{\sigma_2}{\sigma_2 - \sigma_1},  w_2 = -\frac{\sigma_1}{\sigma_2 - \sigma_1},

so we long the first and short the second, and the portfolio risk is zero. Now if ρ = −1, i.e., the two stocks are perfectly negatively correlated, then we can buy both,

w_1 = \frac{\sigma_2}{\sigma_1 + \sigma_2},  w_2 = \frac{\sigma_1}{\sigma_1 + \sigma_2},

to reduce the risk to zero.

In practice, |ρ| = 1 is impossible, so we rule this case out in what follows. Then the minimum-risk portfolio must have positive risk; equivalently, the covariance matrix Σ must be positive definite, and in particular invertible. If ρ = 0, the formula above reduces to the (inverse-variance) volatility-weighting strategy. If, in addition, the variances are equal, the portfolio becomes the equal-weighted one (w_1 = 1/2).

When there are N > 2 risky assets, one can still derive, in a similar fashion but with matrix algebra, the weights of the minimum-variance portfolio, known also as the global minimum-variance (GMV) portfolio,

w_g = \frac{\Sigma^{-1} 1_N}{1_N' \Sigma^{-1} 1_N},  (2.21)

where 1_N is an N × 1 vector of ones, and Σ is the covariance matrix of the N assets, assumed invertible. Although matrix inversion is involved, the portfolio weights on the N assets, w_g, are straightforward to compute using Python, Matlab or R. Note that the minimized variance of the GMV portfolio is

Var(R_p) = 1/(1_N' \Sigma^{-1} 1_N) > 0,  (2.22)

which is derived by simply plugging w_g into the portfolio risk formula.
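The two-asset GMV weight (2.20) and the corresponding minimized variance are a one-liner each; the sketch below uses Example 2.5's numbers (vols of 20% and 30%, correlation 50%).

```python
import math

# A sketch of the two-asset GMV weight (2.20) and its minimized variance,
# using Example 2.5's numbers: vols 20% and 30%, correlation 50%.
def gmv_two_assets(s1, s2, rho):
    denom = s1**2 - 2 * rho * s1 * s2 + s2**2
    w1 = (s2**2 - rho * s1 * s2) / denom
    var = s1**2 * s2**2 * (1 - rho**2) / denom
    return w1, var

w1, var = gmv_two_assets(0.20, 0.30, 0.5)
gmv_vol = math.sqrt(var)   # w1 = 6/7 = 0.8571..., vol about 19.64%
```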
Note that the risk cannot be eliminated completely, only minimized. The reason is that the invertibility of Σ requires that no asset be redundant, that is, no asset return can be a linear combination of the others; this in particular rules out perfect correlation between any pair of assets. For stocks, this assumption clearly holds in practice. As a result, there is always non-zero risk for any stock portfolio: if the portfolio variance were zero, the portfolio return would be identically constant, and we could then solve for one stock return in terms of the rest, contradicting the invertibility assumption.

It is of interest to see how the above formula works when N = 2. Given the covariance matrix, we can find its inverse analytically, based on (1.78) and the discussion there,

\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix},  \Sigma^{-1} = \frac{1}{\det(\Sigma)} \begin{pmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{pmatrix},

where the determinant det(Σ) = σ_1²σ_2² − ρ²σ_1²σ_2² > 0 under the assumption |ρ| < 1 (so that Σ is invertible). Then

\Sigma^{-1} 1_N = \frac{1}{\det(\Sigma)} \begin{pmatrix} \sigma_2^2 - \rho\sigma_1\sigma_2 \\ \sigma_1^2 - \rho\sigma_1\sigma_2 \end{pmatrix},  1_N' \Sigma^{-1} 1_N = \frac{\sigma_1^2 - 2\rho\sigma_1\sigma_2 + \sigma_2^2}{\det(\Sigma)}.

The first element of their ratio is exactly the weight on the first asset given by (2.20), and the second element is the weight on the second asset. Moreover, we obtain the minimized variance

Var(R_p) = \frac{\sigma_1^2 \sigma_2^2 (1 - \rho^2)}{\sigma_1^2 - 2\rho\sigma_1\sigma_2 + \sigma_2^2}.  (2.23)

This formula can easily be computed by hand instead of using Python.

Example 2.6 (continuing Example 2.5) The variance is

Var(R_p) = \frac{0.2^2 \times 0.3^2 (1 - 0.5^2)}{0.2^2 - 2 \times 0.5 \times 0.2 \times 0.3 + 0.3^2} = 0.03857143,

and so the vol is √Var(R_p) = 19.64%. It is not reduced much from 20%. ♠

There are two important remarks on the GMV portfolio. First, it is the portfolio that has the lowest risk among all possible portfolios, regardless of the values of the expected returns on the stocks.
However, given information on the stocks' expected returns, one can design a portfolio for a desired level of expected return, say 12% per year, with the minimum risk permissible. This portfolio will have no smaller risk than the GMV, by definition of the latter, but it has the minimum risk among those portfolios that earn a 12% expected return per year. The next two subsections address this issue for the cases without a riskless asset and with it, respectively.

Second, in practice, it is never easy to estimate expected returns. The GMV avoids this problem by not using expected returns at all, but it still requires estimating the covariance matrix. When N is large, it is difficult to get a good estimate of Σ. This issue will be discussed further in Sections 6.4, 4.4.3, and 4.4.7.

It may also be noted that inverse volatility weighting is a special case of the GMV when the assets are assumed uncorrelated. In this case, Σ is a diagonal matrix,

Σ = diag(σ_1^2, σ_2^2, . . . , σ_N^2),  (2.24)

and its inverse is obvious,

Σ^{-1} = diag(1/σ_1^2, 1/σ_2^2, . . . , 1/σ_N^2).  (2.25)

Then, multiplying out the terms, the GMV is indeed the same as volatility weighting (or, more accurately, inverse variance weighting) in the zero-correlation case.

2.2 MV Optimal portfolio: Riskfree asset case

The mean-variance framework is not only used by many practitioners, but is also useful for understanding a variety of issues involved in portfolio choice and asset pricing. Now we assume that there is a riskfree asset available in addition to the N risky assets. This is the case most investment books focus on. For pedagogical reasons, we consider first the single risky asset case, then the two risky assets case, and finally the multiple assets case.
We also provide an alternative, equivalent formulation that maximizes return for a given level of risk.

2.2.1 One risky asset

Consider the problem of an investor who allocates money between the stock index and the money market. Let r_t and r_f be the returns on the market and the riskfree investment, respectively. Then the return on the portfolio is

R_pt = w r_t + (1 − w) r_f,  (2.26)

where w is the fraction invested in the risky asset and (1 − w) is that invested in the riskfree asset. If the investor's initial wealth is W_0, then the next-period wealth is W = W_0(1 + R_pt). Rewrite R_pt as

R_pt = w(r_t − r_f) + r_f = w R_t + r_f,  (2.27)

where

R_t ≡ r_t − r_f  (2.28)

is known as the excess return, or the return in excess of the riskfree rate. In most asset pricing models, we assume that there is a riskfree asset, approximated in practice by the Treasury bill return when the investment horizon is short, say a month. As a result, most empirical research uses excess returns on assets rather than the original or raw returns.

The popular assumption in portfolio analysis is that the market excess return is iid normally distributed:

R_t = μ + ε_t,  (2.29)

where ε_t is normally distributed with mean zero and variance σ^2, and μ is the expected excess return on the market. It is then easy to verify that the mean and variance of the portfolio are

E[R_pt] = wμ + r_f,  Var[R_pt] = w^2σ^2.  (2.30)

In the mean-variance framework, the investor is assumed to care only about the mean and variance of the portfolio, preferring higher mean and lower variance. Note that a preference must be specified to determine the optimal portfolio. Assume the standard mean-variance utility,

U(w) = E[R_pt] − (γ/2) Var[R_pt] = r_f + wμ − (γ/2) w^2σ^2,  (2.31)

where γ is the coefficient of relative risk aversion, i.e., the trade-off parameter between risk and return. Then the investor chooses w to maximize U(w).
Taking the derivative and setting it to zero, we get the first-order condition (FOC):

dU(w)/dw = μ − γwσ^2 = 0,

and hence the optimal choice is

w = (1/γ)(μ/σ^2),  (2.32)

which is proportional to the mean-variance ratio of the asset. The formula is intuitively clear. The greater the expected return or the lower the risk, the more money the investor puts into the risky asset. On the other hand, everything else equal, the more risk-averse the investor is (larger γ), the less money is invested in the risky asset.

Example 2.7 Assume that the riskfree asset earns 3% (per year) and the risky asset has an expected return of 12% and a volatility, σ, of 20%. Then the portfolio return is

R_pt = w r_t + (1 − w) × 3% = w(r_t − 3%) + 3%,

and μ = E(r_t − 3%) = 9%. If γ = 2.8, then

w = (1/2.8)(0.09/0.20^2) = 0.8036,

which says that we put 80.36% into the risky asset and the remainder into the riskfree asset. ♠

The mean and variance of the optimal portfolio are, based on (2.30),

E[R_pt] = (1/γ)(μ^2/σ^2) + r_f,  Var[R_pt] = (1/γ^2)(μ^2/σ^2),  (2.33)

in terms of μ and σ^2, the parameters of the asset returns. These formulas are useful in assessing portfolio risk and return in practice.

How do investors assess the performance of a portfolio? The Sharpe ratio, originated by William Sharpe in 1966 and revised in 1994, is the most widely used yardstick in practice,

Sharpe Ratio = E[R_p − r_f]/√Var[R_p − r_f],  (2.34)

where R_p is the return on an asset or a portfolio. That is, the Sharpe ratio is the ratio of the excess return to its standard deviation, or the risk premium one earns per unit of risk. In our one risky asset case here, it is clear that

Sharpe Ratio = μ/σ.  (2.35)

It is interesting that, no matter how one chooses his/her portfolio, one gets the same Sharpe ratio. However, this is only true in the case of one risky asset. When there are N > 1 risky assets, different portfolios will have different Sharpe ratios.
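The one-asset allocation (2.32) and the Sharpe ratio (2.35) are easy to verify numerically; a quick check of the Example 2.7 numbers (variable names here are mine):

```python
# One risky asset: optimal weight (2.32) and Sharpe ratio (2.35),
# with the numbers of Example 2.7: mu = 9% (excess), sigma = 20%, gamma = 2.8.
gamma, mu, sigma, rf = 2.8, 0.09, 0.20, 0.03
w = mu / (gamma * sigma**2)              # Equation (2.32)
port_mean = w * mu + rf                  # mean of the portfolio, cf. (2.33)
port_vol = w * sigma                     # vol of the portfolio
sharpe = (port_mean - rf) / port_vol     # equals mu / sigma for any w > 0
print(round(w, 4), round(sharpe, 4))     # 0.8036 0.45
```

Note that the Sharpe ratio comes out as μ/σ = 0.45 regardless of the w chosen, illustrating the point just made.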
To get the portfolio with the greatest Sharpe ratio, one has to choose the weights optimally.

There are two remarks. First, the Sharpe ratio is often reported in practice in annualized form,

Sharpe Ratio_a = √L × E[R_p − r_f]/√Var[R_p − r_f],  (2.36)

where L is the number of periods per year. For example, if the return is daily or monthly, we annualize with L = 252 (trading days) or L = 12, respectively; the factor √L arises because the mean scales with L while the standard deviation scales with √L. Second, the above formula is an ex ante measure based on expectations. In practice, the realized or ex post Sharpe ratio is reported, computed from the same equation but with the realized returns on the portfolio and the riskfree rate.

Mathematically, maximizing the Sharpe ratio is equivalent to maximizing the mean-variance utility when N > 1, as shown below. Moreover, although the optimal portfolios can differ across individual investors with different risk tolerances, the Sharpe ratio is the same for all investors as long as they hold optimal portfolios! This is different from the result in the case where the riskless asset is not available.

The last point can be understood intuitively. When there is no riskfree asset, investors select portfolios on the mean-variance frontier (see Section 2.7), and different portfolios on the frontier have different Sharpe ratios. However, when investors can invest in the riskfree asset, they all hold a combination of the riskfree asset and the same tangency portfolio, the point where the line starting at the riskfree rate touches the frontier. The Sharpe ratio of such a portfolio is the same regardless of how one allocates between the two. This is similar to the one risky asset case.

The portfolio solution (2.32) is unconstrained: no restrictions on w are imposed. In practice, a no-short-selling constraint is often imposed, which requires w ≥ 0.
In addition, borrowing at the riskfree rate is usually not feasible. So a common restriction is 0 ≤ w ≤ 1. In this case, if the solution (2.32) falls into this range, it is the optimal one; if not, either 0 or 1 is the solution. Of course, for a hedge fund or a large investor, some limited shorting and borrowing may be possible. Then the constraint may be written as a ≤ w ≤ b for some constants a and b, and we simply search for the solution in this range.

2.2.2 N = 2

Consider now the case with N = 2 risky assets. Let r_t and R_t be the returns and excess returns, respectively,

r_t = [r_1; r_2],  R_t ≡ [r_1 − r_ft; r_2 − r_ft] = r_t − r_ft 1_2,  1_2 ≡ [1; 1],  (2.37)

where r_ft is the riskfree rate. Let μ be the expected excess return and Σ the covariance matrix of the excess returns,

μ = [μ_1; μ_2],  Σ = [σ_1^2, ρσ_1σ_2; ρσ_1σ_2, σ_2^2].

Then the portfolio return is

R_pt = w_1 r_1 + w_2 r_2 + (1 − w_1 − w_2) r_ft = w'r_t + (1 − w'1_2) r_f = w'R_t + r_f,

where w = (w_1, w_2)' are the portfolio weights on the risky assets. Note that the sum of w_1 and w_2 is no longer required to equal 1, because the remainder goes to the riskfree asset. If the sum is less than 1, the difference is invested in the riskfree asset; if it is greater than 1, the difference is the amount borrowed at the riskfree rate. The variance risk is

Var(R_pt) = w_1^2σ_1^2 + 2ρw_1w_2σ_1σ_2 + w_2^2σ_2^2 = w'Σw.

The investor is assumed to choose w so as to maximize the same mean-variance objective function

U(w) = E[R_pt] − (γ/2) Var[R_pt] = r_ft + w'μ − (γ/2) w'Σw.  (2.38)

The solution (see the end of this subsection) to the optimization is

w* = (1/γ) Σ^{-1}μ = (1/γ) [σ_1^2, ρσ_1σ_2; ρσ_1σ_2, σ_2^2]^{-1} [μ_1; μ_2].  (2.39)

Example 2.8 Assume that there are N = 2 risky assets. The excess returns (the returns minus the riskfree rate) have expected return vector and covariance matrix

μ = [μ_1; μ_2] = [0.10; 0.20],  Σ = [0.3^2, 0.5×0.3×0.4; 0.5×0.3×0.4, 0.4^2].

Assume r_f = 3% and γ = 3.
Then our portfolio is

R_pt = w_1 r_1 + w_2 r_2 + (1 − w_1 − w_2) × 3% = w'r_t + (1 − w'1_N) r_f.

Our optimal choice of w is

w* = [w_1*; w_2*] = (1/3) [0.3^2, 0.5×0.3×0.4; 0.5×0.3×0.4, 0.4^2]^{-1} [0.10; 0.20] = (1/3) [14.82, −5.56; −5.56, 8.33] [0.10; 0.20] = [0.123; 0.370],

where the inverse of the matrix and its product with the vector can easily be computed using Python, Matlab or R. ♠

Note that if the correlation is zero, the inversion of Σ is trivial,

[σ_1^2, 0; 0, σ_2^2]^{-1} = [1/σ_1^2, 0; 0, 1/σ_2^2],

and so

w_i* = (1/γ)(μ_i/σ_i^2),  i = 1, 2.

This says that, when the two assets are uncorrelated, we can apply our one-asset portfolio formula to each of them separately, as if we had one asset at a time.

It can be verified that the squared Sharpe ratio of the optimal portfolio is

(Sharpe Ratio)^2 = μ'Σ^{-1}μ = [μ_1; μ_2]' [σ_1^2, ρσ_1σ_2; ρσ_1σ_2, σ_2^2]^{-1} [μ_1; μ_2].  (2.40)

When the two assets are uncorrelated, it has a much simpler form,

(Sharpe Ratio)^2 = μ'Σ^{-1}μ = (μ_1/σ_1)^2 + (μ_2/σ_2)^2,  (2.41)

i.e., the portfolio squared Sharpe ratio is simply the sum of the individual ones in this special uncorrelated case.

Proof of (2.39): The first-order conditions are

∂U(w)/∂w_1 = μ_1 − γ(w_1σ_1^2 + ρw_2σ_1σ_2) = 0,  ∂U(w)/∂w_2 = μ_2 − γ(ρσ_1σ_2w_1 + w_2σ_2^2) = 0.

In matrix form, μ − γΣw = 0. Multiplying both sides by Σ^{-1} and dividing by γ yields the formula. Q.E.D.

2.2.3 Multiple risky assets

Consider now the general case. Let r_t be the returns on N risky assets. We define

R_t ≡ r_t − r_ft 1_N  (2.42)

as the excess returns similarly, where 1_N is an N-vector of ones. The common assumption on the probability distribution of R_t is that the excess return is iid multivariate normal with mean μ and covariance matrix Σ. Given the portfolio weights w, an N × 1 vector, on the risky assets, the return on the portfolio at time t is

R_pt = w'r_t + (1 − w'1_N) r_f = w'R_t + r_f.  (2.43)

The investor is assumed to choose w so as to maximize the same mean-variance objective function

U(w) = E[R_pt] − (γ/2) Var[R_pt] = r_f + w'μ − (γ/2) w'Σw.  (2.44)

The solution to the problem is similarly obtained as

w* = (1/γ) Σ^{-1}μ,  (2.45)

which gives the optimal portfolio weights.

Proof: Define df/dw as the N-vector formed by the df/dw_i for any function f = f(w_1, . . . , w_N), i.e., the vector formed by taking the derivative with respect to one variable at a time. Then it can be verified that

d(w'μ)/dw = μ,  d(w'Σw)/dw = 2Σw.  (2.46)

Hence, the first-order condition is

dU(w)/dw = μ − γΣw = 0.

Multiplying both sides by Σ^{-1} and simplifying, we get (2.45). Q.E.D.

With the optimal portfolio weights, the maximized expected utility is

U(w*) = r_f + (1/(2γ)) μ'Σ^{-1}μ = r_f + θ^2/(2γ),  (2.47)

where θ^2 = μ'Σ^{-1}μ. That is the maximum utility the investor can obtain when the portfolio weights w* are computed from the true parameters. In practice, however, the parameters have to be estimated, and the estimation errors significantly impact performance, an issue to be examined later.

Example 2.9 Assume that there are N = 3 risky assets. The excess returns (the returns minus the riskfree rate) have expected return vector and covariance matrix

μ = [μ_1; μ_2; μ_3] = [0.10; 0.20; 0.30],  Σ = [0.3^2, 0.5×0.3×0.4, 0.1; 0.5×0.3×0.4, 0.4^2, 0.1; 0.1, 0.1, 0.5^2].

Assume r_f = 3% and γ = 3 as before. Then our portfolio is

R_pt = w_1 r_1 + w_2 r_2 + w_3 r_3 + (1 − w_1 − w_2 − w_3) × 3% = w'r_t + (1 − w'1_N) r_f.

Our optimal choice of w is

w* = [w_1*; w_2*; w_3*] = (1/3) Σ^{-1} [0.10; 0.20; 0.30] = (1/3) [21.43, −3.57, −7.14; −3.57, 8.93, −2.14; −7.14, −2.14, 7.71] [0.10; 0.20; 0.30] = [−0.238; 0.262; 0.390],

where the inverse of the matrix and its product with the vector can easily be computed using Python, Matlab or R. The answer makes intuitive sense. The third asset is very attractive with a 30% expected return, so we want to buy more of it.
But its risk is also high, so we short the first asset to offset a substantial amount of that risk. ♠

Although the optimal portfolio formula has problems in practical applications (see Section 2.2.6), it is very important and will be used throughout the lectures to provide insights on optimal investments. Below are two analytical examples.

Example 2.10 Consider the popular 1/N portfolio rule that invests $1 fully by putting 1/N into each asset (see Section 2.1.1). This effectively assumes that each asset has the same expected return, say μ_0, and the same volatility, say σ_0, with zero correlations. Then Σ is a diagonal matrix with σ_0^2 on the diagonal, and so, by (2.45), the optimal portfolio weights are

w* = (1/γ)(μ_0/σ_0^2) [1; . . . ; 1].

Although this is a scaled or leveraged position in the 1/N portfolio, it is unlikely to be exactly equal to it (that requires (1/γ)(μ_0/σ_0^2) = 1/N). Hence, the widely used 1/N portfolio is not theoretically optimal when there is a riskless asset, even if the risky assets have the same expected return and risk and are uncorrelated. Nevertheless, the 1/N rule is useful in practice, as it does not require estimating the expected asset returns and covariance matrix; these estimates are noisy, and with noisy inputs the optimal portfolio rule usually performs poorly. Some solutions will be discussed later (see Section 3.5). ♠

Example 2.11 To understand the optimal portfolio weights better, consider the special case when the assets are uncorrelated. In this case, Σ is a diagonal matrix, and so the portfolio weights are

w* = (1/γ) [μ_1/σ_1^2; . . . ; μ_N/σ_N^2].

Then we can write the weight on the i-th asset as

w_i* = (1/γ)(μ_i/σ_i)(1/σ_i).

Note that μ_i/σ_i is the Sharpe ratio of asset i. The formula says that, when the assets are uncorrelated, we scale each asset's Sharpe ratio by a factor of 1/σ_i. For two assets with the same Sharpe ratio, we invest more in the one with lower risk: if the risk is half as large, we double the investment.
♠

The expected return and variance of the optimal portfolio are

μ_p = E[R_pt] = w*'μ + r_f = (1/γ) μ'Σ^{-1}μ + r_f,  (2.48)

Var[R_pt] = w*'Σw* = (1/γ^2) μ'Σ^{-1}μ.  (2.49)

Hence, the squared Sharpe ratio is

(Sharpe Ratio)^2 = (E[R_pt] − r_f)^2/Var[R_pt] = μ'Σ^{-1}μ.  (2.50)

To summarize, the Sharpe ratio of the optimal portfolio is

Sharpe Ratio = √(μ'Σ^{-1}μ).  (2.51)

When N = 1, it reduces to (2.35).

It is interesting that the Sharpe ratio is independent of risk aversion. Risk-averse investors invest less in the tangency portfolio (see the next subsection), and aggressive ones invest more. But both portfolios are efficient, and they achieve the same Sharpe ratio. However, an investor in practice, who often holds an inefficient portfolio or a portfolio under constraints (such as no short-selling), will obtain a Sharpe ratio lower than the theoretical maximum √(μ'Σ^{-1}μ).

Note that, theoretically, the Sharpe ratio is the same for all investors no matter what their risk aversion is, as long as they choose the optimal portfolios, which differ only by a scale depending on the risk aversion. However, if an investor chooses a portfolio by another rule, not the optimal portfolio, he will earn a lower Sharpe ratio. So, that all investors have the same Sharpe ratio is true only if they all behave rationally and choose their optimal portfolios. In practice, however, investors will not have the same Sharpe ratios. Their asset universes may not be the same, and, even with the same universe, they may not agree on the same true parameters μ and Σ, so their portfolios can differ from each other and from the optimal portfolio.

Consider now the case when the asset returns are uncorrelated (Example 2.11).
In this case, the Sharpe ratio formula simplifies to

(Sharpe Ratio)^2 = Σ_{i=1}^N (μ_i/σ_i)^2,  (2.52)

that is, the square of the portfolio Sharpe ratio is the sum of the squares of the individual Sharpe ratios. In other words, when the asset returns are uncorrelated, each asset contributes to portfolio performance through its own Sharpe ratio: the greater the individual Sharpe ratio, the greater the contribution.

2.2.4 Two-fund separation theorem

Since here we assume that the riskless asset is available, the sum of the components of w* (the weights on the risky assets) will not equal one in general. When it is less than 1, we put the rest of the money into the riskfree asset; when it is greater than 1, we borrow at the riskfree rate to invest in the risky assets.

To understand this better, let

w_η = Σ^{-1}μ/(1_N'Σ^{-1}μ);  (2.53)

it is clear that the weights sum to 1, w_η'1_N = 1. Then

R_η = w_η'r_t  (2.54)

is a fully invested portfolio, or a fund, of the risky assets. We will show below that it is an efficient portfolio, tangent to the mean-variance frontier along the line starting from (0, r_f), hence known as the tangency portfolio; we show later that it is the market portfolio under some further conditions (see Section 5.1.1). The optimal portfolio weights can be written as

w* = (1/γ) Σ^{-1}μ = (c/γ) w_η,  (2.55)

where c = 1_N'Σ^{-1}μ is a scalar, constant given the parameters. Then the optimal portfolio return is

R_pt = (c/γ) R_η + (1 − c/γ) r_f.  (2.56)

This is the two-fund separation theorem, known also as mutual fund separation. It says that, if investors have mean-variance utility and agree on the expected returns and covariance matrix of the assets, they will all choose a portfolio of two funds, R_η and r_f, out of all possible combinations of individual stocks. The allocation between the two funds depends on their risk aversion.
If they are aggressive (small γ), they invest more in the tangency portfolio (the market portfolio); if they are conservative (large γ), they invest less. In the extreme case γ = +∞, they put all money into the riskfree asset.

Now we show that R_η is the tangency portfolio. First, it must be an efficient portfolio. When γ = c, the investor invests all the money in R_η; if R_η were not efficient, there would be a portfolio of risky assets that does better, and the investor would be better off with that portfolio, a contradiction. Second, Equation (2.56) traces out a line connecting (0, r_f) with R_η. If this line were not tangent at R_η, there would be a portfolio on the frontier lying above it; that portfolio would perform better, contradicting the fact that all the optimal solutions are on the line.

2.2.5 Parameter estimation by sample moments

To implement the mean-variance optimal portfolio, we have to provide μ and Σ, the population parameters of the return data-generating process. They are unknown and have to be estimated in practice. Consider now how to estimate them from data.

Suppose there are T periods of observed excess return data, Φ_T = {R_1, R_2, · · · , R_T}, and we would like to form a portfolio for period T + 1. Under the standard assumption that the excess return R_t is iid, the common sample estimates are

μ̂ = (1/T) Σ_{t=1}^T R_t,  (2.57)

Σ̂ = (1/(T − 1)) Σ_{t=1}^T (R_t − μ̂)(R_t − μ̂)',  (2.58)

which are known as sample moments, as they result from replacing the theoretical integrals by sample averages. Statistically, the estimators are unbiased,

E[μ̂] = μ,  E[Σ̂] = Σ,

which means that the average of the estimates over an infinite number of data sets will equal the true parameters.
However, for any given sample size T, the estimates will only be around the true parameters with random errors, and their standard deviations indicate how large the errors are (see the confidence intervals, Section 1.3).

While we will examine estimation errors in Section 3.5, it is important to point out here that it is the inverse of the covariance matrix that enters the optimal portfolio weights, and the inverse of Σ̂ is a biased estimator of Σ^{-1} (even though Σ̂ is unbiased for Σ),

E[Σ̂^{-1}] = ((T − 1)/(T − N − 2)) Σ^{-1},  (2.59)

a result well known in statistics (see, e.g., Anderson, 1984, p. 270). It says that the inverse of Σ̂ over-estimates Σ^{-1} on average. As a result, one will over-invest in the risky assets if one uses Σ̂^{-1} to estimate Σ^{-1}. If T = 120 and N = 10, one over-invests by about 10% (as 119/108 ≈ 1.10). Hence, in practice, a better estimator of the inverse covariance matrix is

Σ̃^{-1} = ((T − N − 2)/(T − 1)) Σ̂^{-1},

or, equivalently, using

Σ̃ = (1/(T − N − 2)) Σ_{t=1}^T (R_t − μ̂)(R_t − μ̂)'  (2.60)

as the estimator of Σ for the purpose of obtaining Σ^{-1} or computing the optimal portfolio.

Technically, why does inversion destroy unbiasedness? Because of Jensen's inequality:

E[g(x̃)] ≥ g(E[x̃]),

for any convex function g(·). Consider, for example, the case N = 1, where σ̂^2 is an unbiased estimator of σ^2. Let g(x) = 1/x, x > 0; since g'' > 0, g is convex. Then, by Jensen's inequality,

E[1/σ̂^2] ≥ 1/E[σ̂^2] = 1/σ^2.

Since g is not linear and σ̂^2 is not a constant, the inequality holds strictly. That is the intuition for why the inverse is no longer unbiased.

2.2.6 Practical implementation

Suppose that we have T = 360, i.e., 30 years of monthly data. How do we know how well the theoretical investment rule w* = (1/γ)Σ^{-1}μ performed in the past?
One way is to estimate the parameters with all the data to obtain an estimate of w*, call it ŵ, and then apply it to all the past data to obtain the (estimated) optimal portfolio returns

R_pt = ŵ'r_t + (1 − ŵ'1_N) r_f,  t = 1, 2, . . . , 360.

Then we can examine the Sharpe ratio of this portfolio, etc.

The above in-sample procedure is simple and good for pedagogical purposes. It is in-sample because it assumes that one knows all the data in the analysis. The argument is that, although the true parameters are unknown, using all the data gives the best estimates, so the performance should be close to what could be achieved by someone using the true parameters. But in the real world, no one knows the true parameters. Moreover, the true parameters may change over time, so a simple one-shot estimate may not work well.

Feasibility is the major objection to in-sample analysis, because it cannot be carried out in reality. Indeed, in the first month, you only have the data for that month; the other months of data are not yet available for estimating the parameters, so you cannot invest by the rule in the first month. To really see the past performance of a realistic investment, we need to divide the data into two periods. Say Months 1 to 120 are used as the training data to estimate the parameters, yielding weights ŵ(120). Then we can start investing in Month 120 and move forward. In Month 121, we could continue to use ŵ(120) as the weights for our optimal portfolio, but then we would not make use of the new data. Since more data generally lead to more accurate estimates, as people almost always do in the real world, we update our estimates of the moments, and hence the portfolio weights, with the additional data in Month 121 to obtain a new estimate ŵ(121).
Similarly, in Month 122, we update the weights to ŵ(122) for our optimal portfolio in Month 122. In general, we compute portfolio weights over time to obtain ŵ(120), ŵ(121), . . . , ŵ(359). We need not compute ŵ(360), as we cannot evaluate an investment in Month 360 without the Month 361 return. With these weights, we then compute the returns

R_pt = ŵ(t−1)'r_t + (1 − ŵ(t−1)'1_N) r_f,  t = 121, 122, . . . , 360.

That is, we invest in Month 120 and get the Month 121 return, ŵ(120)'r_121 + (1 − ŵ(120)'1_N) r_f; then invest in Month 121 and get the Month 122 return; and so on. The last return is in Month 360, the result of investing in Month 359 with weights ŵ(359). These returns are the ones used for performance analysis in terms of the Sharpe ratio, etc.

The above procedure is known as a recursive one: it estimates the parameters recursively using all available data. An alternative is to estimate the parameters with a fixed window of past data, say the past 120 months only. For example, in Month 121 we use data from Month 2 to Month 121, and in Month 122 we use data from Month 3 to Month 122. This is often known as a rolling procedure. In contrast to in-sample analysis, both recursive and rolling procedures are out-of-sample assessments: they use no future information and are feasible in real time.

In practice, more advanced estimation methods can be used, and additional data, such as daily or fundamental information, may be utilized too. Moreover, a more general mean-variance problem in practice imposes range constraints on the weights (see Example 2.11):

a_i ≤ w_i ≤ b_i,  i = 1, 2, . . . , N,  (2.61)

i.e., for each asset i, the position has to be between a_i and b_i.
For example, if a_1 = 0 and b_1 = 0.10, we cannot short the first asset and cannot invest more than 10% of our money in it. In this case, no analytical formula is available for the optimal portfolio weights, but they can usually be solved for numerically easily. Note that one cannot simply truncate the unconstrained analytical solution when N > 1. The reason is that, once one weight is set at a bound, all the other weights must be re-optimized, and taking the highest or lowest level of the bound may not be optimal either. Numerically, however, one can solve the constrained problem, even with more complex constraints, easily using available quadratic programming packages in Python. These issues are discussed further later.

In summary, there are important limitations of the optimal portfolio formula:

1) practical constraints have to be imposed in the real world; the constrained portfolio is obviously different (see Section 3) and has to be solved numerically;

2) the formula implies roughly 50% short positions in a large portfolio (when N is large), and hence it is difficult to implement;

3) the mean and covariance matrix have to be estimated in practice, which is difficult, and the optimal portfolio is sensitive to even small changes in the inputs:
– an expected return of 10% vs 8% can cause the portfolio weights to change by much more than 2%;
– the invertibility of the sample covariance matrix requires T ≥ N + 2, which is violated if N = 1000 stocks and T = 120 with 10 years of monthly data (the solution is discussed later);

4) it should be noted that any portfolio rule, except value weighting, requires costly portfolio rebalancing.
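As a concrete illustration of point 1), the box-constrained problem can be handed to any quadratic programming routine. A minimal sketch using scipy's bounded optimizer in place of a dedicated QP package, on the two-asset numbers of Example 2.8 with a hypothetical cap b_1 = 0.10 on the first asset:

```python
import numpy as np
from scipy.optimize import minimize

gamma = 3.0
mu = np.array([0.10, 0.20])
Sigma = np.array([[0.09, 0.06],
                  [0.06, 0.16]])

def neg_utility(w):
    # Negative of U(w) = w'mu - (gamma/2) w'Sigma w; the constant rf is irrelevant.
    return -(w @ mu - 0.5 * gamma * w @ Sigma @ w)

bounds = [(0.0, 0.10), (0.0, 1.0)]  # a_1 = 0, b_1 = 0.10; asset 2 long-only
res = minimize(neg_utility, x0=np.zeros(2), bounds=bounds)
w_con = res.x

w_unc = np.linalg.solve(Sigma, mu) / gamma  # unconstrained (2.45): about [0.1235, 0.3704]
print(np.round(w_con, 4))                   # approximately [0.1, 0.3792]
```

Note that the constrained weight on the second asset (about 0.3792) differs from the truncated unconstrained one (0.3704), illustrating the point above that truncating the analytical solution is not optimal once a bound binds.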
2.2.7 MV frontier and utility maximization

When there is no riskfree asset, we can define an optimal portfolio as one that minimizes risk for a given level of return, or that maximizes the expected utility (review your Investment Theory class or see Section 2.7 for details). The mean-variance (MV) efficient portfolio frontier is a concave curve, the plot of the expected return of the optimal portfolio against its risk.

When there is a riskfree asset, as is the case we assume now, investors will in general choose a portfolio of risky assets and also invest or borrow at the riskfree rate. The new mean-variance efficient frontier is the line connecting the riskfree rate to a portfolio tangent to the curved frontier (known as the tangency portfolio), and this line consists of all the possible portfolios an investor may choose. Utility maximization identifies exactly which point on the line an investor will choose, given his/her risk aversion. The optimal portfolio formula, equation (2.45), w* = (1/γ)Σ^{-1}μ, is a scaled position in the tangency portfolio, because its weights do not sum to 1, as shown in Equation (2.55); the difference from 1 is invested in the riskfree asset.

As mentioned before, the Sharpe ratio of all efficient portfolios is the same, though the portfolios may differ in their exposures to the risky assets. However, when there is no riskless asset, all the portfolios on the traditional mean-variance frontier are efficient, but they have different Sharpe ratios. Now, in the presence of the riskless asset, the only efficient portfolio from the curved frontier is the tangency portfolio. Investors will choose a combination of it with the riskfree asset, and no other risky assets, to obtain the best possible Sharpe ratio.

2.2.8 Alternative formulation

Instead of maximizing the expected utility, one can maximize the expected return for a given level of risk (or minimize the risk for a given level of return).
Mathematically, both approaches are equivalent.

To see the equivalence, let σ^2 be a given level of risk. Then the maximization problem is

max_w E[R_pt] = μ'w + r_f  s.t.  w'Σw = σ^2,  (2.62)

whose solution is

w_a = (σ/√(μ'Σ^{-1}μ)) Σ^{-1}μ.  (2.63)

Proof: The Lagrangian of the objective function is

L = μ'w + r_f − (λ/2)(w'Σw − σ^2),  (2.64)

where λ is the multiplier that transforms the constrained optimization problem into an unconstrained one. Taking first derivatives with respect to the w_i's and λ (recall (2.46)), and setting them to zero, we get the first-order conditions (FOC):

μ − λΣw = 0,  (2.65)
w'Σw − σ^2 = 0.  (2.66)

Multiplying (2.65) by μ'Σ^{-1} and w' respectively, we get

μ'Σ^{-1}μ − λμ'w = 0,  (2.67)
w'μ − λw'Σw = 0.  (2.68)

These two equations imply, using the fact that μ'w = w'μ and (2.66),

μ'Σ^{-1}μ = λμ'w = λ^2 w'Σw = λ^2σ^2.

Hence, we can solve for λ:

λ = √(μ'Σ^{-1}μ)/σ.  (2.69)

Plugging this back into (2.65), we get w as in (2.63). Q.E.D.

Comparing (2.63) with the standard formula (2.45), we have

γ = √(μ'Σ^{-1}μ)/σ.  (2.70)

This means that, if one wants a fixed level of portfolio risk σ, the effective risk aversion is given above. On the other hand, for a given risk aversion γ, the risk of the optimal portfolio is

σ = √(μ'Σ^{-1}μ)/γ.  (2.71)

As γ > 0 ranges over all possible risk-aversion values, σ ranges over all possible risk levels. So, mathematically, the two optimization problems are equivalent.

However, it should be noted that the equivalence assumes the true parameters μ and Σ are known. They are unknown in practice, so the two approaches are no longer equivalent under parameter uncertainty. The reason is that, given a certain level of risk aversion, one does not know exactly what risk level to take, as μ and Σ are unknown. Conversely, if one sets a risk level, it will not necessarily be consistent with his/her risk aversion, given that we do not know μ and Σ.
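The equivalence is easy to verify numerically; a sketch using the illustrative two-asset parameters of Example 2.8:

```python
import numpy as np

mu = np.array([0.10, 0.20])
Sigma = np.array([[0.09, 0.06],
                  [0.06, 0.16]])
gamma = 3.0

z = np.linalg.solve(Sigma, mu)              # Sigma^{-1} mu
theta = np.sqrt(mu @ z)                     # theta = sqrt(mu' Sigma^{-1} mu)

w_star = z / gamma                          # utility maximizer, Equation (2.45)
sigma_p = np.sqrt(w_star @ Sigma @ w_star)  # its risk

assert np.isclose(sigma_p, theta / gamma)   # Equation (2.71)

w_a = sigma_p / theta * z                   # target-risk solution, Equation (2.63)
assert np.allclose(w_a, w_star)             # same portfolio: the two problems agree
```

Targeting the risk level that the utility maximizer happens to take recovers exactly the same weights, as the algebra above asserts.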
2.2.9 Links to regression and machine learning

Jobson and Korkie (1983) establish an interesting link between the optimal portfolio and a linear regression. Britten-Jones (1999), based on the regression framework, provides ways to test hypotheses on the portfolio weights. The regression framework can also be used to estimate portfolio weights when there are a large number of assets (see, e.g., Ross and Zhou (2021)).

Consider the regression of a constant on the asset excess returns,
\[
1_T = X\beta + \epsilon, \tag{2.72}
\]
where $1_T$ is a $T$-vector of ones, $X$ is a $T \times N$ matrix of the $N$ assets' excess return data, and $\beta$ is the $N \times 1$ vector of regression coefficients. The least-squares estimator has the usual formula,
\[
\hat\beta = (X'X)^{-1}X'1_T. \tag{2.73}
\]
In what follows, we will show that
\[
\hat\beta = \frac{\hat\Sigma^{-1}\hat\mu}{1 + \hat\mu'\hat\Sigma^{-1}\hat\mu}, \tag{2.74}
\]
where the mean and covariance matrix of the excess returns are estimated using slightly different formulas from (2.57) and (2.58),
\[
\hat\mu = X'1_T/T, \tag{2.75}
\]
\[
\hat\Sigma = (X - 1_T\hat\mu')'(X - 1_T\hat\mu')/T, \tag{2.76}
\]
i.e., $\hat\mu$ is the same (just in vector notation), while $\hat\Sigma$ is obtained by dividing by $T$ instead of by $T-1$.

Recall the very important optimal portfolio weights formula, (2.45); its estimate is clearly
\[
\hat w^* = \frac{1}{\gamma}\,\hat\Sigma^{-1}\hat\mu.
\]
Comparing this with the beta expression, we see that $\hat\beta$ is the same as the estimated optimal portfolio weights if $\gamma = 1 + \hat\mu'\hat\Sigma^{-1}\hat\mu$. Hence, we can recover $\hat w^*$ from $\hat\beta$ up to a scalar.

Proof of (2.74): Note first that $X'X/T = \hat\Sigma + \hat\mu\hat\mu'$. Using a standard matrix inversion formula (Sherman–Morrison), we have
\[
(X'X/T)^{-1} = (\hat\Sigma + \hat\mu\hat\mu')^{-1} = \hat\Sigma^{-1} - \frac{\hat\Sigma^{-1}\hat\mu\hat\mu'\hat\Sigma^{-1}}{1 + \hat\mu'\hat\Sigma^{-1}\hat\mu}.
\]
Then, using the OLS formula $\hat\beta = (X'X/T)^{-1}(X'1_T/T) = (\hat\Sigma + \hat\mu\hat\mu')^{-1}\hat\mu$ and simplifying, we get the desired result. Q.E.D.

Ao, Li and Zheng (2019) recently find another link to a linear regression, based on which a wide range of machine learning tools can be applied to portfolio choice.
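The identity (2.74) can be confirmed numerically: regressing a vector of ones on simulated excess returns reproduces the scaled optimal weights exactly. The return-generating numbers below are purely illustrative.

```python
import numpy as np

# Simulate a hypothetical T x N matrix of excess returns.
rng = np.random.default_rng(0)
T, N = 120, 4
X = 0.01 + 0.05 * rng.standard_normal((T, N))

ones = np.ones(T)
beta_ols = np.linalg.solve(X.T @ X, X.T @ ones)  # OLS, eq. (2.73)

mu_hat = X.T @ ones / T                          # eq. (2.75)
D = X - np.outer(ones, mu_hat)
Sigma_hat = D.T @ D / T                          # eq. (2.76): divide by T, not T-1

Sinv_mu = np.linalg.solve(Sigma_hat, mu_hat)
beta_formula = Sinv_mu / (1 + mu_hat @ Sinv_mu)  # eq. (2.74)

print(np.max(np.abs(beta_ols - beta_formula)))
```

The two vectors agree to machine precision, so the OLS coefficients are indeed the estimated optimal weights up to the scalar $1 + \hat\mu'\hat\Sigma^{-1}\hat\mu$.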
Recall that the best squared Sharpe ratio investors can possibly obtain is $\theta = \mu'\Sigma^{-1}\mu$, and they obtain it by choosing the optimal portfolio $w^* = \frac{1}{\gamma}\Sigma^{-1}\mu$ for a given level of risk aversion $\gamma$. Alternatively, given a desired level of risk $\sigma$, the investors can choose the portfolio
\[
w_a = \frac{\sigma}{\sqrt{\mu'\Sigma^{-1}\mu}}\,\Sigma^{-1}\mu. \tag{2.77}
\]
The problem is that it is very difficult to get an accurate estimate of $\Sigma^{-1}\mu$ in practice when $N$ is large, so the analytical portfolio formulas have limited value in the high-dimensional case. Ao, Li and Zheng (2019) show that, given a level of desired risk $\sigma$, the optimal portfolio weights are the solution of the linear regression problem
\[
\min_w \; E(r_c - w'R)^2, \qquad r_c \equiv \frac{\sigma(1 + \theta)}{\sqrt{\theta}},
\]
where $r_c$ is a fixed constant in the optimization problem. The important message of this new formulation is that we do not have to use the analytical formulas, which are unreliable in high dimensions. Instead, we solve for the weights from the above problem by various dimension-reduction and model-fitting techniques, opening a door for wide ML (machine learning) applications.

Note that it is a two-stage procedure. First, we estimate $r_c$, which is a single constant and is easier to estimate accurately than estimating $w$ from the analytical formulas. Then, in the second stage, we solve for the weights by using various econometric or machine learning methods.

2.3 Tracking error minimization

In practice, big institutional money is invested across asset classes, such as bonds, stocks, currencies and commodities. Within the stock portfolio, there are usually two styles of investment. The first is passive management, which aims to earn returns identical to an index, such as the S&P 500. This is basically a buy-and-hold strategy. There is a growing trend toward passive investing, as outperforming the market is not an easy task, and those who promise they can often fail badly.
The second is active management, where managers trade frequently to beat the market or to outperform certain benchmarks. The simplest way to generate the index return is to hold the index, i.e., to buy all the stocks in the index in proportion to the weights defining the index. Alternatively, one can replicate the index using fewer stocks based on mean-variance portfolio theory, capitalization, or stratified sampling methods.1 However, passive investments are still not free. Someone has to manage them. As stocks come into or drop out of the index, trading has to take place. The same is true for dividend reinvestment and for money flowing in and out of the index funds. So, one has to pay a fee to invest in a passive index fund in practice. Vanguard, a leader in index funds, created one of the first index funds in 1975. As of today, it manages over $5 trillion in assets. Its index funds charge some of the lowest fees, but still 0.05%, or 5 basis points, as of 2013. This can still be enormous if the base, or assets under management (AUM), is large.

Active portfolio managers today are often required to beat an index with minimal extra risk, net of all transaction costs. The active return is defined as
\[
\text{Active Return} = \text{Total Return on Managed Port} - \text{Total Return on Index}. \tag{2.78}
\]
The idea is that if you, an active portfolio manager, can beat the index (with perhaps some given level of risk) by achieving positive active returns, I can pay you a fee. For example, if your track record shows that you can beat the index by 1–3% per year with risk no greater than the index by 2%, I can pay you, say, 50 or 70 basis points. You will gain by making more money than by managing a pure index fund (with the same AUM), and I will also gain relative to investing in a pure index fund.

1See, e.g., Fabozzi (1999, Ch 14) for the latter two methods.
Even if the market is perfectly efficient, one can theoretically beat the market index by taking higher risk. So, in practice, managers who try to beat an index are allowed to take higher risk, but there is a limit. Tracking error limits are used as “risk budgets” to control the risk that the managers can take. The question is then whether any gain is commensurate with the risk taken. Note that the error limits are in terms of volatility, not return. The reason is that the active return is very difficult to estimate in the real world, because estimated expected returns can be very different from realized ones. In contrast, volatilities are more stable.

Let $\bar w$ be the portfolio weights of a benchmark index, $R$ be an $N$-vector of the asset returns, and $w$ the weights of an actively managed tracking portfolio. The tracking error (a volatility) is defined as the square root of
\[
TE^2 \equiv \mathrm{Var}[w'R - \bar w'R] = (w - \bar w)'V(w - \bar w), \tag{2.79}
\]
where $V$ is the covariance matrix of the underlying asset returns $R$. If your managed active portfolio has volatility close to the index, say within 2%, and it has a substantially higher return, you may be a good active manager. The TE optimization problem of practical interest is
\[
\min_w \; TE^2 = (w - \bar w)'V(w - \bar w) \tag{2.80}
\]
subject to
\[
w'1_N = 1, \qquad (w - \bar w)'\mu = g,
\]
which minimizes the tracking error while achieving a given target, $g$, of expected performance relative to the benchmark, where $\mu$ is the vector of expected asset returns. Recall that the constraint $w'1_N = 1$ is the standard one implying that the money is fully invested, so the weights sum to 1.

In practice, some pension or institutional investors may want to take 4% greater risk than the market, in order to earn a greater expected return. This is a valid objective even if the market is fully efficient or the market index is unbeatable (at the same risk). To understand this, suppose the market has a 12% annual return and 20% volatility.
If investors want to take only the 20% market risk, the easiest thing they can do is to earn the 12% market return by buying the index. If they are willing to take 24% risk, they should earn a greater expected return, say 15% (rather than 12%). However, it is often inefficient or infeasible for them to do this by themselves, so they hire active managers to obtain and manage such a portfolio for them.

Roll (1992) provides an analytical solution to the above TE problem,
\[
w_r = \bar w + \frac{g}{d}\,V^{-1}(\mu - \mu_0 1_N), \tag{2.81}
\]
where $d$ is one of the four efficient-set constants,
\[
a = \mu'V^{-1}\mu, \quad b = \mu'V^{-1}1_N, \quad c = 1_N'V^{-1}1_N, \quad d = a - b^2/c, \tag{2.82}
\]
and $\mu$ and $\mu_0$ are the expected returns on $R$ and on the global minimum-variance portfolio, respectively. However, in the real world, the optimization problem usually has further constraints, such as position limits and short-selling restrictions, and then there are no analytical formulas for the solutions. Nevertheless, since minimizing the TE is the same as minimizing a quadratic function of $w$,
\[
TE^2 = \frac{1}{2}w'(2V)w - (2\bar w'V)w + \bar w'V\bar w,
\]
where $\bar w$ and $V$ are constants, quadratic programming can be used to solve the problem under general constraints (see the next chapter).

The TE optimization allows managers to beat the index while controlling the tracking error. However, there is a hidden problem with the TE criterion. The variance of the tracking portfolio is
\[
\mathrm{Var}[w_r'R] = \mathrm{Var}[\bar w'R] + 2(w - \bar w)'V\bar w + (w - \bar w)'V(w - \bar w), \tag{2.83}
\]
i.e., the variance of the tracking portfolio can be quite large relative to the index if the second term is sizable. In other words, if the TE is 4% risk (the square-root value), the actual active portfolio variance can exceed the market variance by more than $(4\%)^2$ if the second term is not zero. In short, the common TE optimization is not perfect, and one needs to be cautious about its understatement of the true risk.
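The equality-constrained TE problem can also be checked numerically by solving its first-order (KKT) linear system directly, which is a minimal instance of the quadratic programming approach mentioned above. All inputs below ($\mu$, $V$, $\bar w$, $g$) are hypothetical.

```python
import numpy as np

# Hypothetical expected returns, covariance, benchmark weights, and target.
mu = np.array([0.08, 0.12, 0.10])
V = np.array([[0.040, 0.006, 0.004],
              [0.006, 0.090, 0.010],
              [0.004, 0.010, 0.025]])
w_bar = np.array([0.5, 0.3, 0.2])
g = 0.01                                  # target expected active return

N = len(mu)
ones = np.ones(N)
# KKT system for the active position x = w - w_bar:
#   V x = eta*1 + lam*mu,   x'1 = 0,   x'mu = g
A = np.block([[V, -ones[:, None], -mu[:, None]],
              [ones[None, :], np.zeros((1, 2))],
              [mu[None, :], np.zeros((1, 2))]])
b = np.concatenate([np.zeros(N), [0.0, g]])
x = np.linalg.solve(A, b)[:N]

w_r = w_bar + x                           # tracking portfolio
te_var = x @ V @ x                        # tracking error variance
print(w_r, np.sqrt(te_var))
```

The solution is fully invested, hits the expected-gain target exactly, and any feasible perturbation of the active position can only raise the tracking error.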
However, this may not be an issue, as the active portfolio usually has little correlation with the market. If there is a concern about the understatement in the TE optimization, one can solve for the active portfolio by fixing the total risk of the tracking portfolio at a given level $\sigma_p^2$,
\[
\mathrm{Var}[w_r'R] = w_r'Vw_r = \sigma_p^2, \tag{2.84}
\]
and then maximizing the expected return. Jorion (2003) provides an analytical solution to this optimization problem under the standard full-investment (weights sum to 1) and total-risk constraints. Again, however, if short-selling or other practical constraints are imposed, the optimization problem has to be, and can easily be, solved numerically using quadratic programming tools.

2.4 Information ratio

How do we assess the performance of a portfolio manager whose goal is to beat the S&P 500? The information ratio, also known as the appraisal ratio, is the most widely used measure in practice. Some hedge funds even use it as a metric for calculating a performance fee. The information ratio measures the performance of a portfolio relative to a benchmark index,
\[
IR = \frac{E(R_p - R_B)}{\sigma(R_p - R_B)}, \tag{2.85}
\]
i.e., the ratio of the expected active return of a fund to its standard deviation relative to $R_B$, where $R_B$ is the return on a benchmark index the fund manager attempts to beat, and $R_p$ is the fund return. It is clear that the greater the IR, the smarter the fund manager. Recall that $\sigma(R_p - R_B)$ is the tracking error. Given a tracking error allowance, the portfolio should outperform the index (as it usually takes a calculated greater risk than the index), but the question is by how much. The ratio states precisely the expected return per unit of tracking error. In practice, according to Grinold and Kahn (1999, p. 114), top-quartile investment managers typically achieve annualized information ratios of about 0.5.
This means that, if the fund manager uses a risk aversion of $\gamma = 1$, she/he can beat the index by about 1.25% per year (Grinold and Kahn, 1999, p. 114).

It is worth noting that the Sharpe ratio of the fund is
\[
\text{Sharpe Ratio} = \frac{E(R_p - r_f)}{\sigma(R_p - r_f)}, \tag{2.86}
\]
where $r_f$ is the riskfree rate. The IR is simply obtained by replacing $r_f$ with $R_B$. Comparing the Sharpe ratio of a manager who is to outperform a conservative utility index with that of one who is to outperform a high-tech index does not make sense (as the latter will likely have a higher Sharpe ratio by simply holding the index). How much they beat their benchmarks should be the criterion, so we should use the IR.

2.5 How to outperform with alpha asset?

In practice, one often asks: how can I improve my portfolio if I find an asset that has a positive alpha? Let $R$ be the excess return on an asset that has a positive alpha (a negative alpha will be fine too, because shorting it then yields a positive alpha); that is, in the benchmark model regression,
\[
R = \alpha + \beta R_B + \epsilon, \tag{2.87}
\]
where $\alpha > 0$ is the asset's alpha relative to the benchmark portfolio you hold, $\beta$ is the asset's beta, and $\epsilon$ is the residual with zero mean. Our objective is to find a portfolio of $R_B$ and $R$ whose performance is better than the benchmark's, with a greater Sharpe ratio, for example. We often consider the part of the asset that is uncorrelated with the benchmark,
\[
r = R - \beta R_B, \tag{2.88}
\]
known to fund managers as the residual return, the return without benchmark risk. Note that
\[
E[r] = \alpha, \qquad \mathrm{var}[r] = \sigma_\epsilon^2,
\]
where $\sigma_\epsilon^2$ is the variance of $\epsilon$. The residual return, or residual asset, is tradable, as one can buy one unit of the underlying asset while shorting a $\beta$ portion of the benchmark. Mathematically, finding an optimal portfolio of $R_B$ and $R$ is the same as finding an optimal portfolio of $R_B$ and $r$; it is just that the portfolio formula for the latter is simpler.
Consider now a portfolio of $R_B$ and $r$,
\[
R_p = w_1 R_B + w_2 r. \tag{2.89}
\]
Note that the two assets are uncorrelated, and so, based on our optimal portfolio formula, we have
\[
\begin{pmatrix} w_1 \\ w_2 \end{pmatrix}
= \frac{1}{\gamma}
\begin{pmatrix} \mathrm{var}[R_B] & 0 \\ 0 & \mathrm{var}[r] \end{pmatrix}^{-1}
\begin{pmatrix} E[R_B] \\ E[r] \end{pmatrix}
= \begin{pmatrix} \dfrac{1}{\gamma}\dfrac{E[R_B]}{\mathrm{var}[R_B]} \\[8pt] \dfrac{1}{\gamma}\dfrac{\alpha}{\mathrm{var}[r]} \end{pmatrix},
\]
that is, our weight on the benchmark portfolio remains the same, but with an additional investment in the residual asset whose weight is of the usual form (mean-variance ratio), as if we were investing in it alone! Recalling our formula for the squared Sharpe ratio, (2.51), we now have
\[
(\text{Sharpe Ratio})^2 = \frac{(E[R_B])^2}{\mathrm{var}[R_B]} + \frac{\alpha^2}{\mathrm{var}[r]}, \tag{2.90}
\]
so the squared Sharpe ratio of the residual asset adds directly to the squared Sharpe ratio of our portfolio. The greater it is, the more it helps the performance of our portfolio. The analysis above shows that two assets with the same alpha do not contribute equally to the portfolio: the one with the smaller residual variance contributes more, because it is the Sharpe ratio of the residual asset that matters, not solely the alpha value itself.

Example 2.12 As in Example 2.7, assume that the riskfree asset earns 3% (per year), and your benchmark (say the market) has an expected return of 12% (excess return 9%) and a volatility of 20%. Then, assuming $\gamma = 2.8$, the optimal portfolio (of the riskfree and benchmark assets) is
\[
w_1 = \frac{1}{2.8}\,\frac{0.09}{0.20^2} = 0.8036.
\]
Now suppose you have an alpha portfolio (e.g., the return on an investment based on a number of ideas), with an alpha of 5% and a residual volatility of 15%. Then you continue to hold the benchmark with weight $w_1$, but at the same time invest
\[
w_2 = \frac{1}{2.8}\,\frac{0.05}{0.15^2} = 0.7937
\]
in the residual asset. So you need to borrow money ($0.5973 = w_1 + w_2 - 1$) to invest. The squared Sharpe ratio of your portfolio will be
\[
(\text{Sharpe Ratio})^2 = \frac{0.09^2}{0.20^2} + \frac{0.05^2}{0.15^2} = 0.3136. \tag{2.91}
\]
So the Sharpe ratio is 0.56, an improvement of about 25% over 0.45, the Sharpe ratio without the alpha portfolio.
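The numbers in Example 2.12 can be reproduced directly from the two-asset formula above:

```python
import numpy as np

# Inputs from Example 2.12.
gamma = 2.8
mu_B, sig_B = 0.09, 0.20        # benchmark excess return and volatility
alpha, sig_r = 0.05, 0.15       # alpha asset: residual mean and volatility

w1 = mu_B / sig_B**2 / gamma    # weight on the benchmark
w2 = alpha / sig_r**2 / gamma   # weight on the residual asset
borrow = w1 + w2 - 1            # amount borrowed

sr2 = (mu_B / sig_B)**2 + (alpha / sig_r)**2   # squared Sharpe ratio, (2.90)
sr = np.sqrt(sr2)
print(round(w1, 4), round(w2, 4), round(borrow, 4), round(sr, 2))
```

The output matches the example: weights of about 0.8036 and 0.7937, borrowing of about 0.597, and a Sharpe ratio of 0.56.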
In practice, the borrowing may not be feasible. In this case, the optimal portfolio should be solved under the constraints (see the next section on how to solve such problems), and the resulting Sharpe ratio will be lower than 0.56. ♠

2.6 Fundamental Law of active portfolio management

2.6.1 IR = IC√N

If one has no forecasting skills at all, one cannot possibly beat the market by taking the same level of risk. Suppose that one does have some skill. The question is then how to translate this skill into active return efficiently. To this question, Grinold (1989) proposes the fundamental law of active portfolio management (FLAM). Note that the value-added of a portfolio is the information ratio (IR), which measures the performance of the active portfolio per unit of active risk. In its simplest and most intuitive form, the FLAM states that the value-added or performance of an active manager, IR, is proportional to the information coefficient (IC) and the square root of the market breadth (BR),
\[
IR = IC\sqrt{N}, \tag{2.92}
\]
where IC, the information coefficient (skill), is measured as the correlation between the return forecasts and the actual future returns, assumed constant across assets and over time, and $N$ is the number of assets (the market breadth here). In words, following Romero and Balch (2014), the FLAM says that
\[
\text{performance} = \text{skill} \times \sqrt{\text{breadth}}, \tag{2.93}
\]
where the annual performance depends on skill and breadth: skill is a measure of how well a manager forecasts future returns, and breadth represents the number of investment decisions (trades) the manager makes each year. The law suggests that as a manager's skill increases, or as the manager makes more use of the skill, more money will be made. That is not surprising, but what may be surprising is that, to double performance, one has to double the skill or, at the same level of skill, quadruple the trading.
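The doubling-versus-quadrupling arithmetic of (2.92) is easy to see in numbers; the IC and breadth values below are hypothetical.

```python
import numpy as np

# FLAM, eq. (2.92): IR = IC * sqrt(N).
def information_ratio(ic, n):
    return ic * np.sqrt(n)

ic, n = 0.05, 100
base = information_ratio(ic, n)
print(base)                                  # -> 0.5
print(information_ratio(2 * ic, n) / base)   # double the skill: 2x IR
print(information_ratio(ic, 4 * n) / base)   # quadruple the breadth: 2x IR
```

With a modest IC of 0.05 applied to 100 independent bets, the IR is 0.5, i.e., of the same order as the top-quartile figure cited earlier; doubling the IC doubles the IR, while at fixed IC the breadth must be quadrupled to achieve the same doubling.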
According to Romero and Balch (2014), Warren Buffett had 92% of his fund's money invested in only 12 stocks in September 2010. So he has high skill and applies it to a limited number of stocks. Why does he not apply it to more? It is likely that his high level of skill is not portable. On the other hand, a hedge fund that uses machine learning to forecast future stock returns may have a lower skill per forecast, but that skill can be applied to almost any stock. As a result, based on the FLAM, both can enjoy great performance in their funds.

In short, the FLAM says that IR is linearly related to skill. If a manager doubles his/her forecasting accuracy (IC), then the IR doubles. If the accuracy can be doubled at a research cost, and if the cost is lower than the value added, it should be doubled. Applying the same level of skill to a portfolio of 500 stocks will generate 10 times more value than applying it to 5 stocks. As a result, a small degree of predictability can potentially help an active manager make significant gains in beating the benchmark, if this predictability can be used repeatedly during the year or applied to many assets.

2.6.2 A casino example

To understand the FLAM better, consider a casino game of tossing an unfair coin (playing a slot machine with odds favoring the casino is similar in the abstract). Suppose the payoff to the casino is
\[
\text{payoff} = \begin{cases} -1, & \text{if head, with 49\% probability;} \\ +1, & \text{if tail, with 51\% probability.} \end{cases} \tag{2.94}
\]
Clearly the expected value of the game is
\[
\mu = 0.49 \times (-1) + 0.51 \times 1 = 0.02, \tag{2.95}
\]
with variance
\[
\sigma^2 = 0.49 \times (-1 - 0.02)^2 + 0.51 \times (1 - 0.02)^2 = 0.9996, \tag{2.96}
\]
and so the risk (standard deviation) is $\sigma = 0.9998$. Do you think the casino should play the game if it can play only once? The answer is no, even though the game has a positive expected value.
The reason is that the return and risk trade-off is poor, with a Sharpe ratio (assuming a zero riskfree rate) of
\[
SR = \frac{\mu}{\sigma} = 0.02,
\]
while the stock market has a Sharpe ratio of about 0.5. Intuitively, you are risking \$1 with high probability to make only a small expected profit of 0.02.

Do you think the casino should play the game if it can play a large number of times, $N$? The answer is then absolutely yes! Now the Sharpe ratio is
\[
SR = \frac{\mu N}{\sigma\sqrt{N}} = 0.02\sqrt{N},
\]
which can be very large (remember that the total return and variance both grow linearly in $N$, but the risk only at rate $\sqrt{N}$). In terms of the FLAM, 0.02 is the skill, the expected win per game, and $N$ is the breadth. For a fixed skill level, the greater the $N$, the more profitable the strategy of playing $N$ games.

2.6.3 A proof

Now let us see why the law is true. Consider a managed portfolio with return $R_p$ in excess of the riskfree rate. Let $R_B$ be the excess return on the benchmark portfolio; then we have
\[
R_p = R_B + R_A, \tag{2.97}
\]
where $R_A$ is the return on the active portfolio. The proof essentially generalizes the analysis of Section 2.5. Suppose that the benchmark consists of $N$ risky assets. We can always decompose the excess return on the $i$-th asset as
\[
R_i = \alpha_i + \beta_i R_B + \epsilon_i, \qquad i = 1, \ldots, N, \tag{2.98}
\]
where $\alpha_i$ is the asset's alpha, $\beta_i$ is its beta, and $\epsilon_i$ is the residual with zero mean conditional on available information. This is the market model we discussed before. Mathematically, the return decomposition is simply a projection of $R_i$ onto 1 and $R_B$. Then
\[
r_i \equiv R_i - \beta_i R_B, \tag{2.99}
\]
known to fund managers as the residual return, is the return without benchmark risk.2 It is clear that investing in $R_i$ is equivalent to investing in $r_i$, in the sense that their weights differ only by a weight on the market. Let $r_{it}$ be the residual return at time $t$, and $\hat\alpha_{it}$ our forecasted return based on prior information $I_{t-1}$, that is,
\[
E[r_{it} \mid I_{t-1}] = \hat\alpha_{it}. \tag{2.100}
\]
2This is often assumed by practitioners. Theoretically, however, the projection only guarantees zero correlation with $R_B$; without a normality assumption on returns, the residual can still be dependent on $R_B$.

Let $\mu_p$ be the expected return on a portfolio of the $r_{it}$'s with portfolio weights $w_i$; then, using (2.100),
\[
\mu_p = \sum_{i=1}^N w_i \hat\alpha_{it}, \tag{2.101}
\]
and the variance is
\[
\sigma_p^2 = \sum_{i=1}^N w_i^2 \sigma_i^2, \tag{2.102}
\]
where $\sigma_i^2$ is the variance of $r_{it}$ conditional on the information, and the residual returns are assumed uncorrelated here (if necessary, more factors can be added to make the residual returns uncorrelated; see factor models later). Consider now the value added, or the risk-adjusted return,
\[
U = \mu_p - \frac{\gamma}{2}\sigma_p^2.
\]
The optimal portfolio choice, clear from the first-order condition, is
\[
w_i = \frac{1}{\gamma}\,\frac{\hat\alpha_{it}}{\sigma_i^2}, \tag{2.103}
\]
which is the standard formula for uncorrelated assets (Example 2.11). Suppose now that we keep $\sigma_p^2$, the squared tracking error, constant over time. Plugging (2.103) into (2.102), the implied squared risk aversion is
\[
\gamma^2 = \left(\sum_{i=1}^N \frac{\hat\alpha_{it}^2}{\sigma_i^2}\right) \Big/ \sigma_p^2.
\]
Plugging this back into (2.103), we then get $\mu_p$,
\[
\mu_p = \sigma_p \times \sqrt{\sum_{i=1}^N \frac{\hat\alpha_{it}^2}{\sigma_i^2}}. \tag{2.104}
\]
So the final task is to simplify the last term. Assume that $r_{it}$ and $\hat\alpha_{it}$ are normally distributed. Since they have a correlation of IC, we can write
\[
\hat\alpha_{it} = IC \times \sigma_i \times z_{it}, \tag{2.105}
\]
where $z_{it}$ is standard normal with perfect correlation with the standardized residual return $r_{it}/\sigma_i$, and $\hat\alpha_{it}$ is assumed to have a zero mean (which is reasonable, as stocks have zero alphas in the long run). Then it can be verified that indeed $IC = \mathrm{corr}(\hat\alpha_{it}, r_{it})$. Therefore, by (2.104), we have
\[
\frac{\mu_p}{\sigma_p} = IC \times \sqrt{\sum_{i=1}^N z_{it}^2} = IC \times \sqrt{\chi_N^2}, \tag{2.106}
\]
where the last term follows from the definition of a chi-squared distribution.
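The remaining ingredient is the mean of $\sqrt{\chi_N^2}$. A quick numerical check, using the exact formula $E[\sqrt{\chi_N^2}] = \sqrt{2}\,\Gamma((N+1)/2)/\Gamma(N/2)$ (a standard result, computed below via log-gamma for numerical stability) against the large-$N$ approximation $\sqrt{N-1}\,(1 + 1/(4N))$:

```python
from math import exp, lgamma, sqrt

# Exact mean of sqrt(chi^2_N): sqrt(2) * Gamma((N+1)/2) / Gamma(N/2).
def exact_mean_sqrt_chi2(N):
    return sqrt(2.0) * exp(lgamma((N + 1) / 2) - lgamma(N / 2))

# Large-N approximation used in the text.
def approx_mean_sqrt_chi2(N):
    return sqrt(N - 1) * (1 + 1 / (4 * N))

for N in (10, 100, 1000):
    print(N, exact_mean_sqrt_chi2(N), approx_mean_sqrt_chi2(N))
```

Even at $N = 10$ the two agree to well within 1%, and the agreement improves rapidly as $N$ grows, so replacing $E[\sqrt{\chi_N^2}]$ by $\sqrt{N}$ is harmless for realistic breadths.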
From statistics, the expectation of its square root can be computed as a ratio of Gamma functions, and an application of Stirling's approximation then yields
\[
E\!\left[\sqrt{\chi_N^2}\right] = \sqrt{N-1}\left[1 + \frac{1}{4N} + O\!\left(\frac{1}{N^2}\right)\right], \tag{2.107}
\]
where $O(1/N^2)$ indicates errors of order $1/N^2$. Note that while $E[\chi_N^2] = N$ is obvious and $E[\sqrt{\chi_N^2}]$ is not, (2.107) says that, for chi-squared variables, the latter is well approximated by $\sqrt{N-1}$, or by $\sqrt{N}$ (the $-1$ can be ignored when $N$ is large). Then (2.106) implies the FLAM after taking expectations over the conditional information or over time. Our proof above follows closely Ye (2008).

While the FLAM has received enormous attention for its key insights into portfolio strategy design and performance evaluation (see, e.g., Chincarini and Kim, 2006, and Qian, Hua, and Sorensen, 2006), subsequent studies show that the FLAM states only the idealized gain. Once realistic constraints are imposed, the gain is much smaller than predicted (see, e.g., Clarke, de Silva, and Thorley, 2002). Zhou (2008a, b) analyzes how estimation and the optimal use of conditional information affect the gain. Ding and Martin (2017) provide the latest analysis.

2.7 MV Optimal portfolio: No rf case

In the real world, most if not all equity funds require 100% investment in the risky assets, so it is of interest to consider the mean-variance optimal portfolio without the riskfree asset. This case is also often discussed before the riskfree-asset case in most investment texts. Since it is technically more complex, and it is not essential for most of the earlier results, we have postponed the discussion until now.

Denote by $r_t$ the (raw) returns on the $N$ risky assets, and by $\mu_0 = E[r_t]$ the expected returns. We use the notation $\mu_0$ to avoid confusion with $\mu$, which denotes the expected excess returns (returns in excess of the riskfree rate).
We still use the same notation, $\Sigma$, to denote the covariance matrix, which is theoretically identical whether we use raw or excess returns, because the riskfree rate is a constant and does not affect the covariance. In the real world, however, the riskfree rate changes; though constant within a period (say a month), it varies across periods, and hence the estimated covariances in the two cases can differ numerically. This has no impact on the theory. We will obtain the optimal portfolio in two ways. The first is the familiar variance minimization, and the second is mean-variance utility maximization.

2.7.1 Variance minimization given µp

Standard investment texts solve for the mean-variance optimal portfolio by minimizing the risk for a given level of return. Mathematically, this is to solve for the portfolio weights $w$ from
\[
\min_w \; \tfrac{1}{2}w'\Sigma w \tag{2.108}
\]
subject to
\[
w'1_N = 1, \qquad w'\mu_0 = \mu_p,
\]
where $1_N$ is a vector of ones and $\mu_p$ is the given level of return. The risk here is captured by the variance of the portfolio, $w'\Sigma w$. Minimizing the variance is mathematically equivalent to minimizing its square root, the volatility.

To understand the matrix notation, consider the case of only two risky assets, $N = 2$. Then the asset expected returns and covariance matrix in matrix form are
\[
\mu_0 = \begin{pmatrix} \mu_{01} \\ \mu_{02} \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}.
\]
The portfolio variance risk is
\[
\sigma_p^2 = w_1^2\sigma_1^2 + 2\rho w_1 w_2 \sigma_1\sigma_2 + w_2^2\sigma_2^2
= \begin{pmatrix} w_1 & w_2 \end{pmatrix}
\begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}
\begin{pmatrix} w_1 \\ w_2 \end{pmatrix} = w'\Sigma w,
\]
and the objective function is this quantity scaled by $1/2$. Assuming, as usual, that we are fully invested,
\[
w_1 + w_2 = 1 = \begin{pmatrix} w_1 & w_2 \end{pmatrix}\begin{pmatrix} 1 \\ 1 \end{pmatrix} = w'1_2,
\]
which is the first restriction, also known as the budget constraint. The second restriction is on the investment objective. Suppose that we want our portfolio to have an expected return of 15%; then
\[
w_1\mu_{01} + w_2\mu_{02} = 15\%
\]
must be satisfied by our portfolio weights.
The solution is well known. Based on a standard optimization procedure (derivation given below), the optimal weights are
\[
w = c_1\Sigma^{-1}1_N + c_2\Sigma^{-1}\mu_0, \tag{2.109}
\]
where $1_N$ is an $N \times 1$ vector of ones, and $c_1$ and $c_2$ are constants,
\[
c_1 = \frac{c - b\mu_p}{\Delta}, \qquad c_2 = \frac{a\mu_p - b}{\Delta}, \tag{2.110}
\]
with
\[
a = 1_N'\Sigma^{-1}1_N, \qquad b = 1_N'\Sigma^{-1}\mu_0, \qquad c = \mu_0'\Sigma^{-1}\mu_0, \tag{2.111}
\]
and $\Delta = ac - b^2 > 0$, all of which are constants independent of $\mu_p$. Numerically, given the assets' raw expected returns and covariance matrix, $\mu_0$ and $\Sigma$, as well as the desired level of expected portfolio return, we can compute $a$, $b$, and $c$, the three key coefficients determining the optimal portfolio. With their values, we can easily compute $\Delta$, $c_1$ and $c_2$. The optimal portfolio weights are then computed from equation (2.109). The minimized portfolio variance is
\[
\sigma_p^2 = w'\Sigma w = w'\Sigma\left(c_1\Sigma^{-1}1_N + c_2\Sigma^{-1}\mu_0\right)
= c_1 w'1_N + c_2 w'\mu_0 = c_1 + c_2\mu_p
= \frac{a\mu_p^2 - 2b\mu_p + c}{\Delta}, \tag{2.112}
\]
which is the familiar mean-variance frontier or parabola: as the expected return increases, so must the risk, and it increases in a parabolic pattern.

Note that investors will only choose the upper part of the mean-variance frontier, the efficient frontier. For any portfolio $w$ on the efficient frontier, its mirror image on the lower part of the frontier has the same risk but a lower expected return, which is why the lower part will never be chosen.

Technically, the existence of the mean-variance frontier requires two conditions: a) $\Sigma$ is nonsingular, i.e., there are no redundant assets; b) at least two assets have different expected returns. If $\Sigma$ is singular, the inversion breaks down. This happens only when one of the assets is a linear combination of the other assets. In particular, this rules out perfect correlation between any two assets. If all the assets have the same expected return, any portfolio of them will have that same return, and it will be impossible to obtain a portfolio with any other expected return.
Under the two conditions, you can get an optimal portfolio with any target expected return, with the risk given by (2.112). For example, you can design a portfolio with a monthly expected return of 100%, but then the risk will be extremely high too. Moreover, this may not be achievable in the real world, because the optimal portfolio will require large short positions (negative weights), which can run into implementation difficulties. More on practical portfolio constraints will be discussed in the next chapter. Classic graduate texts, such as Ingersoll (1987) and Huang and Litzenberger (1988), have in-depth discussions of the mean-variance frontier as well as the proofs. For completeness, we provide the derivation here.

Proof of (2.109): Let
\[
L = \frac{1}{2}w'\Sigma w - \eta(w'1_N - 1) - \lambda(w'\mu_0 - \mu_p)
\]
be the Lagrangian (the objective function with constraints), where $\eta$ and $\lambda$ are additional parameters chosen to reflect the constraints. Define $df/dw$ as the $N$-vector formed by $df/dw_i$ for any function $f = f(w_1, \ldots, w_N)$. Then it can be verified that
\[
\frac{d\,w'\mu_0}{dw} = \mu_0, \qquad \frac{d\,w'\Sigma w}{dw} = 2\Sigma w. \tag{2.113}
\]
Hence, the first-order conditions are
\[
\frac{\partial L}{\partial w} = \Sigma w - \eta 1_N - \lambda\mu_0 = 0, \tag{2.114}
\]
\[
w'1_N - 1 = 0, \tag{2.115}
\]
\[
w'\mu_0 - \mu_p = 0. \tag{2.116}
\]
Equation (2.114) gives
\[
w = \eta\Sigma^{-1}1_N + \lambda\Sigma^{-1}\mu_0. \tag{2.117}
\]
Multiplying both sides of this equation by $\mu_0'$ and using (2.116), we have
\[
\mu_p = \eta\,\mu_0'\Sigma^{-1}1_N + \lambda\,\mu_0'\Sigma^{-1}\mu_0. \tag{2.118}
\]
Multiplying both sides of (2.117) by $1_N'$ and using (2.115), we have
\[
1 = \eta\,1_N'\Sigma^{-1}1_N + \lambda\,1_N'\Sigma^{-1}\mu_0. \tag{2.119}
\]
Equations (2.118) and (2.119) are two linear equations in $\eta$ and $\lambda$. Since two linear equations in two variables can be solved analytically (see (1.78)), the solution for $\eta$ and $\lambda$ is obtained, and we then get $w^*$ from (2.117), the same as given earlier. Q.E.D.

When the given expected portfolio return $\mu_p$ is taken as $\mu_p = b/a$, the resulting portfolio has the minimum risk, which is evident from the first-order condition
\[
\frac{d\sigma_p^2}{d\mu_p} = \frac{2a\mu_p - 2b}{\Delta} = 0.
\]
This portfolio is known as the global minimum-variance (GMV) portfolio, whose weights are
\[
w_g = \frac{\Sigma^{-1}1_N}{1_N'\Sigma^{-1}1_N}, \tag{2.120}
\]
the same as (2.21) discussed earlier. Here we see an alternative derivation.

To implement portfolio selection, as in the riskfree-asset case, $\mu_0$ and $\Sigma$ have to be estimated, say from historical data. Suppose there are $T$ periods of observed raw return data, $\Phi_T = \{r_1, r_2, \cdots, r_T\}$, and we would like to form a portfolio for period $T+1$. Under the common assumption that $r_t$ is i.i.d., the standard estimators are
\[
\hat\mu_0 = \frac{1}{T}\sum_{t=1}^T r_t, \tag{2.121}
\]
\[
\hat\Sigma = \frac{1}{T-1}\sum_{t=1}^T (r_t - \hat\mu_0)(r_t - \hat\mu_0)'. \tag{2.122}
\]
Mathematically, these are the same estimators as before, (2.57) and (2.58), and they share the same properties. The only difference is that previously the returns were measured as excess returns, while now they are raw returns.

It should be mentioned that practical portfolio choice may involve many constraints beyond the weights summing to 1, such as no short-sales and position limits. If the investment policy is to hold a large portion of assets in the market index and a small portion in an active portfolio, such as the optimal portfolio here, combining the two need not violate the constraints. For example, although $w^*$ typically requires shorting about 50% of the assets, the fund can often simply under-weight some assets in the index if the active portion is, say, only 20% of the assets, without violating the constraints. However, if the optimal portfolio is a standalone portfolio and shorting is not allowed, then the above analytical formula is no longer applicable, and a numerical solution is the only approach. This will be addressed in the next chapter.
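The frontier formulas (2.109)–(2.112) and the GMV portfolio (2.120) can be sketched numerically; the three-asset inputs below are made up for illustration.

```python
import numpy as np

# Hypothetical raw expected returns and covariance for three assets.
mu0 = np.array([0.08, 0.12, 0.16])
Sigma = np.array([[0.040, 0.006, 0.004],
                  [0.006, 0.090, 0.010],
                  [0.004, 0.010, 0.160]])
mu_p = 0.15                                # target expected portfolio return

ones = np.ones(3)
Si_ones = np.linalg.solve(Sigma, ones)     # Sigma^{-1} 1_N
Si_mu = np.linalg.solve(Sigma, mu0)        # Sigma^{-1} mu_0
a, b, c = ones @ Si_ones, ones @ Si_mu, mu0 @ Si_mu   # eq. (2.111)
Delta = a * c - b**2

c1 = (c - b * mu_p) / Delta                # eq. (2.110)
c2 = (a * mu_p - b) / Delta
w = c1 * Si_ones + c2 * Si_mu              # optimal weights, eq. (2.109)
var_p = (a * mu_p**2 - 2 * b * mu_p + c) / Delta      # frontier variance, (2.112)

w_g = Si_ones / a                          # GMV portfolio, eq. (2.120)
print(w, var_p, w_g)
```

The computed weights satisfy both constraints exactly, the direct variance $w'\Sigma w$ matches the parabola (2.112), and the GMV portfolio has a variance no larger than any other frontier portfolio's.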
2.7.2 Two-fund separation: No rf case

Analytically, the optimal portfolio formula (2.109) can be written as a portfolio of two other frontier portfolios,

    w∗ = (c1 a) wg + (c2 b) wq,    (2.123)

where (c1 a) + (c2 b) = 1, wg is the GMV portfolio, and the portfolio wq is defined by

    wq = Σ⁻¹µ0 / (1_N′Σ⁻¹µ0).    (2.124)

Graphically, wq is the tangency portfolio of the line drawn from the origin (see, e.g., Ingersoll (1987) for a proof).

Equation (2.123) is the Two-fund Separation Theorem in the case of no riskfree asset. It says that any optimal portfolio is a portfolio of two funds, wg and wq. In an ideal mean-variance economy, offering the two funds will be sufficient for all investors' demands in the absence of the riskfree rate. In fact, any two distinct frontier portfolios can serve as the two funds. The reason is that both of them satisfy (2.123); inverting the relation shows that wg and wq are in turn portfolios of the two, and hence every frontier portfolio is a portfolio of the two as well.

Define µp/σp as the Sharpe ratio of a portfolio in the absence of the riskfree rate; then wq is the only frontier portfolio that maximizes the Sharpe ratio. This is true because wq is the tangency portfolio from the origin and the mean-variance frontier lies underneath the tangent line. Interestingly, in the no-riskfree-asset case, all frontier portfolios are optimal, and investors will choose different ones depending on their risk tolerance (see the next subsection), and then they will achieve different levels of Sharpe ratios. In contrast, as you will learn, when the riskfree asset is available, although the optimal portfolios can be different, they all have the same Sharpe ratio (though defined differently, with the use of the riskfree rate).

There is one more interesting property of the portfolio wq: its expected portfolio return is 1, E[wq′R] = 1.
This property will be used later for a link to a linear regression.

2.7.3 Utility maximization

Now we assume that the investor chooses his/her portfolio weights w so as to maximize the quadratic utility function,

    max_{w′1_N = 1} U(w) = E[rpt] − (γ/2) Var[rpt] = w′µ0 − (γ/2) w′Σw,    (2.125)

where γ is the risk aversion parameter of the investor. The greater its value, the more risk-averse the investor, as risk is penalized more heavily. Note that we still assume, as almost all studies do, that the investor is fully invested in the risky assets, so that the weights sum to 1,

    w1 + w2 + ... + wN = w′1_N = 1,

where 1_N is an N × 1 vector of ones. However, we no longer have a constraint on the expected return of the optimal portfolio. In fact, the investor does not know what level of expected return she or he should choose. Intuitively, if the investor can tolerate a high degree of risk, then she/he will choose a portfolio with greater risk and so greater expected return. Hence, the levels of risk and expected return will be completely determined by the risk aversion or the utility function. This is why we no longer impose any restriction on the expected portfolio return, in contrast to the variance-minimization formulation.

The optimal weights are

    w∗ = wg + (1/γ) wz,    (2.126)

where

    wg = Σ⁻¹1_N / (1_N′Σ⁻¹1_N),    wz = Σ⁻¹(µ0 − 1_N µg),

with µg = µ0′Σ⁻¹1_N / (1_N′Σ⁻¹1_N) the expected return on the global minimum-variance (GMV) portfolio wg. Equation (2.126) says that holding the optimal portfolio is the same as investing in two funds, wg and wz (as these two are themselves portfolios or funds). Since investors here invest 100% in the risky assets, they always hold 100% of wg. Depending on their degrees of risk aversion, their exposures to wz vary. Note that wz is a zero-investment portfolio satisfying 1_N′wz = 0. It is clear from (2.126) that any optimal portfolio is a linear combination of wg and wz.
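As a check on (2.126), the following sketch computes wg, wz, and w∗ for a hypothetical three-asset example (all inputs are made up for illustration) and verifies that w∗ is fully invested while wz is a zero-investment portfolio.

```python
import numpy as np

# Hypothetical monthly inputs for a 3-asset example
mu0 = np.array([0.006, 0.009, 0.012])
Sigma = np.array([[0.0025, 0.0008, 0.0005],
                  [0.0008, 0.0036, 0.0010],
                  [0.0005, 0.0010, 0.0049]])
gamma = 3.0

ones = np.ones(3)
Sinv1 = np.linalg.solve(Sigma, ones)
w_g = Sinv1 / (ones @ Sinv1)                    # GMV portfolio
mu_g = (mu0 @ Sinv1) / (ones @ Sinv1)           # GMV expected return
w_z = np.linalg.solve(Sigma, mu0 - mu_g * ones) # zero-investment portfolio
w_star = w_g + w_z / gamma                      # (2.126): optimal weights
```

The checks 1_N′w∗ = 1 and 1_N′wz = 0 hold by construction, mirroring the algebra in the text.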
Mathematically, one can show that maximizing the quadratic utility function is equivalent to the usual portfolio risk minimization for a given level of return. Indeed, when the risk aversion is infinite, the investor will choose the GMV portfolio. As the risk aversion goes down, the optimal portfolio from (2.126) traces out the upper mean-variance frontier. In practice, utility maximization is critical, as it tells us which portfolio to buy for an investor or a fund manager. In contrast, the mean-variance frontier itself does not provide such information: it only says to choose a portfolio from the frontier, but not which one.

Proof of (2.126): It is similar to the one before. Now the Lagrangian is

    L = w′µ0 − (γ/2) w′Σw − η(w′1_N − 1).

The first-order conditions are

    ∂L/∂w = µ0 − γΣw − η1_N = 0,    (2.127)
    ∂L/∂η = w′1_N − 1 = 0.    (2.128)

Equation (2.127) provides

    w = (1/γ) Σ⁻¹(µ0 − η1_N) = (1/γ) Σ⁻¹µ0 − (η/γ) Σ⁻¹1_N.    (2.129)

Multiplying both sides by 1_N′ and using (2.128), we have

    1 = (1/γ) 1_N′Σ⁻¹µ0 − (η/γ) 1_N′Σ⁻¹1_N,    (2.130)

and hence

    η = −(γ − 1_N′Σ⁻¹µ0) / (1_N′Σ⁻¹1_N) = −γ(1_N′Σ⁻¹1_N)⁻¹ + µg.    (2.131)

Plugging this into (2.129), we obtain the result. Q.E.D.

2.7.4 Optimality of ad hoc rules

Let us consider some special cases of utility maximization. This helps to see the conditions under which some of the earlier ad hoc rules are optimal.

Consider first the popular 1/N rule that puts equal money across the risky assets. If we assume equal expected returns across the assets (utility maximization allows this possibility) and Σ is diagonal with equal volatilities, then wg = (1/N)1_N and wz = 0_N, and so

    w∗ = wg + (1/γ) wz = (1/N) 1_N,    (2.132)

which is the usual equal-weighted portfolio. Indeed, when the assets are independent of each other and have the same mean and variance, full and equal diversification is possible, and so the usual equal-weighted portfolio is optimal.

Consider next the inverse volatility-weighting.
If we assume Σ is diagonal but allow for different diagonal elements, and if we still assume that the expected asset returns are equal, then wz = 0_N, and wg reduces to the inverse variance-weighted weights,

    wgi = (1/σi²) / (1/σ1² + ... + 1/σN²),    i = 1, ..., N.

This is also intuitive. When the assets are independent of each other and have the same means, the weight on each asset depends only on the inverse of its own variance, scaled by the sum of the inverse variances across all assets.

Consider finally the GMV portfolio. It is clear that w∗ = wg if and only if wz = 0_N, i.e., if and only if all the means are equal. Hence, when the expected returns on a set of stocks/assets are roughly the same (after perhaps grouping stocks by their expected returns), the GMV may be a good rule to apply, without having to provide the expected return estimates, which are noisy. Note that estimation errors will often not affect the ranking of the stocks by estimated expected returns because the errors are likely highly correlated. For example, if all stocks have an error of 5% in their expected returns, this will not affect their true ranking. That may be the reason why the GMV rule is popular in practice.

2.7.5 Links to linear regression

Based on the earlier result of Jobson and Korkie (1983), it is easy to see that there is also an interesting relation between the optimal mean-variance portfolio wq and a linear regression. Assume we have i.i.d. asset returns and the sample size is T. Consider the regression of a constant on the asset returns,

    1_T = Xβ + ε,    (2.133)

where 1_T is a T-vector of 1's, X is a T × N matrix of the N assets' returns data, and β is the N × 1 vector of regression coefficients. Note that, since there is no riskfree rate, the returns are raw returns here. Let ŵq be the estimate of wq from the data, with µ and Σ estimated by µ̂ and Σ̂ from (2.121) and (2.122); then

    ŵq = β̂ / (1_N′β̂),    (2.134)

where β̂ is the OLS regression estimate.
In other words, the regression slopes are proportional to the optimal portfolio weights. The term 1_N′β̂ is the sum of all the slopes; dividing by it makes the ratio vector β̂/(1_N′β̂) sum to 1, so that it is a vector of portfolio weights.

Proof: The least-squares estimator has the usual formula,

    β̂ = (X′X)⁻¹X′1_T.

In matrix form, we can write (2.121) and (2.122) as

    µ̂0 = X′1_T / T,    (2.135)
    Σ̂ = (X − 1_T µ̂0′)′(X − 1_T µ̂0′) / T,    (2.136)

(using the divisor T rather than T − 1, which does not affect the result below). Since X′1_T = T µ̂0, we have β̂ = (X′X/T)⁻¹µ̂0 and X′X/T = Σ̂ + µ̂0µ̂0′. Then, using a standard matrix inversion formula, we have

    (X′X/T)⁻¹ = (Σ̂ + µ̂0µ̂0′)⁻¹ = Σ̂⁻¹ − Σ̂⁻¹µ̂0µ̂0′Σ̂⁻¹ / (1 + µ̂0′Σ̂⁻¹µ̂0).

Hence, we obtain

    β̂ = Σ̂⁻¹µ̂0 / (1 + µ̂0′Σ̂⁻¹µ̂0).

This is clearly proportional to Σ̂⁻¹µ̂0, and hence to ŵq. Dividing by 1_N′β̂ to make it sum to 1 yields exactly ŵq. Q.E.D.

Britten-Jones (1999), based on the regression framework, provides ways to test hypotheses on portfolio weights. Brides (2009) extends the relation further. Consider the regression

    η1_T = Xβ + ε,    (2.137)

with the portfolio constraint 1_N′β = 1. When η = 1, this is exactly the case studied before, and the slope must be the portfolio ŵq, as here we have imposed the constraint, and so the estimated betas are the OLS betas scaled to have their sum equal to 1.

Now, when η is an arbitrary constant, the slope is clearly a function of η, β̂ = β̂(η), whose explicit expression is complex under the constraint 1_N′β = 1. The interesting result is that β̂(η) must be the estimated optimal portfolio weights whose expected return is η. Mathematically, it can be verified that

    ε′ε/T = (η − µ̂e)² + β′Σ̂β,

where µ̂e = β′µ̂ is the estimated expected return of the portfolio with weights β. In minimizing the mean-squared error ε′ε/T, the solution is to make the first term zero and the second term as small as possible. This says exactly that the OLS betas provide the minimal risk given the expected return η. As η varies in (0, ∞), β̂(η) traces out all the possible upper-frontier portfolios.
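The regression link (2.134) is easy to verify numerically: regressing a vector of ones on the returns (with no intercept) and scaling the slopes to sum to one recovers ŵq built directly from the sample moments. The data below are simulated, purely for illustration; note that the normalization makes the choice of divisor (T or T − 1) in Σ̂ irrelevant.

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 360, 3
X = rng.normal(0.008, 0.05, size=(T, N))        # simulated raw returns (hypothetical data)

# OLS of a vector of ones on the returns, no intercept: beta-hat = (X'X)^{-1} X'1_T
beta = np.linalg.solve(X.T @ X, X.T @ np.ones(T))
w_from_ols = beta / beta.sum()                  # (2.134): scale the slopes to sum to one

# w_q-hat built directly from the sample moments (2.121)-(2.122)
mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X, rowvar=False, ddof=1)
w_q = np.linalg.solve(Sigma_hat, mu_hat)
w_q = w_q / w_q.sum()                           # normalize to portfolio weights
```

The two weight vectors agree up to floating-point error, since β̂ is an exact scalar multiple of Σ̂⁻¹µ̂0, as shown in the proof.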
3 Portfolio Choice 2: Constraints and Extensions

In this section, we first discuss portfolio choice decisions under practical constraints. Next, we examine alternative objectives other than the mean-variance utility. Then we consider modeling errors: estimation errors and model misspecification/uncertainty.

3.1 Practical constraints

There are many restrictions on investing in stocks in the real world. The first is the no-short-sales restriction. Many funds face such a restriction, as they cannot short securities. However, hedge funds can typically short freely, and some funds are allowed to run 130/30 strategies (some even up to 150/50), where 130/30 means the fund can go long and short at the same time, with up to 130% exposure in its long positions and 30% in its short positions. Even so, it is costly to borrow stocks to short sell, and sometimes it is almost impossible to short certain stocks.

Suppose we have 5 stocks. If no short sales are allowed, we have the following constraints on the portfolio weights,

    w ≥ 0, or wi ≥ 0, for i = 1, 2, ..., 5.    (3.1)

Position limits are another common restriction. A fund manager cannot put too much money into a single stock, perhaps by internal rules. There are at least three reasons. The first is to force diversification. The second is to limit exposures to certain industries or ideas. The third is to reduce trading costs, as it is difficult to get in or out of a stock if one owns too many shares of the company. In our 5-stock example, if we impose a limit of no more than 50% on any one company, then

    w ≤ 0.50, or wi ≤ 0.50, for i = 1, 2, ..., 5.    (3.2)

In practice, it is difficult for an equity fund manager to borrow money. To ensure no borrowing, we need the sum of the weights on the risky assets to be no greater than 1,

    w1 + w2 + ... + w5 ≤ 1,    (3.3)

in our 5-stock example.

Another issue is transaction costs.
Suppose the manager rebalances the portfolio monthly, and she/he has to trade the stocks monthly to maintain the desired weights. If it is too expensive to trade stock 1 (which may be a very illiquid stock), then the following constraint may be imposed,

    −0.10 ≤ w01 − w1 ≤ 0.10,    (3.4)

where w01 is the previous month's weight. This effectively imposes both a lower and an upper bound on w1.

One desired objective is to form a portfolio with certain attributes or characteristics. For example, it may be desirable to make the beta of the portfolio 1.5. In this case, we impose

    1.5 = βp = w1β1 + w2β2 + ... + w5β5.    (3.5)

One may similarly impose restrictions on earnings-to-price (E/P), size, or sector exposures.

There are hedge funds specializing in long-short strategies, and some other funds may also do so to some degree. One particular idea is pairs trading. For every dollar invested long in stock 1, we want to short a dollar of stock 2. In this case, the constraint is

    w1 + w2 = 0.

Suppose now we have a portfolio of 2n stocks, and we want to go long half of them and short the other half; then

    w1 + w2 + ... + wn = 1,
    wn+1 + wn+2 + ... + w2n = −1.

Note that the net investment in the portfolio is zero; such a portfolio is called a zero-cost portfolio in theory (in practice, trading costs and short-selling costs are not negligible).

It should be noted that the analytical formulas provided in the previous section are no longer valid once we impose restrictions. In general, it is impossible to derive closed-form formulas for the constrained case; we have to solve for the optimal portfolios numerically. This is the topic of the next subsection.

3.2 Quadratic programming

Quadratic programming is the process of solving an optimization problem that minimizes a quadratic function of multiple variables subject to linear constraints on the variables. Mathematically, we solve

    min_x Π = (1/2) x′Qx + q′x    (3.6)
    s.t. Gx ≤ h,
         Ax = b,

where x = (x1, x2, ..., xn)′ are the variables.
The problem is well understood in mathematics and computer science, and algorithms are well developed to solve it numerically via Python, Matlab, or R. Note that the above constraints contain upper and lower bounds on the weights as special cases, as demonstrated below.

It is critically important to understand the link between our mean-variance utility maximization and quadratic programming. Recall that our objective function is

    U(w) = rf + w′µ − (γ/2) w′Σw,    (3.7)

which is the riskfree-asset case; without the riskfree asset, there is no rf term. Mathematically, maximizing U(w) is the same as minimizing −U(w),

    −U(w) = −rf + (γ/2) w′Σw − µ′w,    (3.8)

as w′µ = µ′w. Since rf is a constant that does not affect the optimal solution, we can ignore it. Comparing (3.8) with (3.6), we have

    Q = γΣ,  q = −µ,  x = w.

Hence, utility maximization is a quadratic programming problem. Moreover, the practical constraints can be easily incorporated into the standard constraints of quadratic programming. For example, consider two assets with no short sales and with a limit of 80% on each. Then we want

    0 ≤ w1 ≤ 0.8,  0 ≤ w2 ≤ 0.8.

Let

    G1 = [1 0; 0 1],  h1 = (0.8, 0.8)′.

It is clear that

    G1 x = (x1, x2)′ ≤ h1

reflects the upper bounds. Let

    G2 = [−1 0; 0 −1],  h2 = (0, 0)′.

It is clear that

    G2 x = (−x1, −x2)′ ≤ h2

reflects the lower bounds. Hence, if we stack G1 and G2 together, and h1 and h2 together,

    G = [G1; G2],  h = (h1; h2),

then Gx ≤ h reflects both the upper and lower bounds.

The equality constraints are even easier. For example, if we want to impose w1 + w2 = 1, we simply let

    A = [1 1],  b = 1;

then Ax = b reflects the constraint. Another example is fixing the beta of the portfolio at 1.5,

    w1 × 0.8 + w2 × 2.2 = 1.5,

where 0.8 and 2.2 are the individual betas.
Then

    A = [0.8 2.2],  b = 1.5.

If there are many assets and many equality constraints are imposed, we just stack the A's and b's together, as we did for G.

In short, mean-variance utility maximization under various practical constraints does not have an analytical solution, but it has a perfect link to quadratic programming, and hence can be solved easily in practice with computer algorithms via Python, Matlab, or R. However, there are some important issues that cannot be solved by quadratic programming, as discussed below.

3.3 Asset allocation

Asset allocation in practice usually refers to the advice wealth advisors or consultants give to their clients on how to allocate their investments over a small set of asset classes. It is often about long-term strategic asset allocation, where a fixed proportion is suggested for each asset class and the portfolio is rebalanced quarterly or annually. The second strategy, dynamic asset allocation, occasionally changes the weights on the asset classes over time based on expectations about the future, thus requiring accurate market predictions. The third is tactical asset allocation, where investors are more active in adjusting weights on assets, sectors, or individual stocks that show the most potential for perceived gains, while an original asset mix is formulated much like the strategic and dynamic portfolios. Market timing can be viewed as the extreme form of the latter, jumping in or out of the market based on active forecasts.

3.3.1 Stocks and bonds

The simplest asset allocation problem is to split the money between stocks (say, the S&P 500 index) and bonds (a bond index), with the money usually invested in funds suggested by advisors or in ETFs (exchange-traded funds). Based on Ferri (2010, p. 76), the two assets above had annual returns of 9.7% and 7.7%, respectively, over 1973–2009. However, the risk of the stock market was much higher, 18.8% vs only 5.5% in the bond market.
Assume that the two are uncorrelated (from Ferri, 2010, p. 58, they had a 49% correlation during 1990–1994, but only 16% during 1995–1999, and the correlation then turned negative, −46% during 2000–2004 and −20% during 2005–2009). Then the minimum-risk portfolio puts almost all of its weight on bonds (as typically short sales are not allowed), which can be verified from the GMV formula (2.20), so the minimum risk is about 5.5%. If an investor cannot tolerate this level of risk, short-term T-bills may be the only way to go, which barely beat inflation in the long term (see your Investments text or Ferri, 2010, p. 27).

If the investor can take more than 5.5% risk, then an allocation to the stock market makes sense. A naive 50% allocation will earn an expected return of

    Rp = 0.5 × 9.7 + 0.5 × 7.7 = 8.7%,

and, with zero correlation, a risk of

    σp = √(0.5² × 18.8² + 0.5² × 5.5²) = 9.79%.

If the investor can take more risk, then more money can be allocated to stocks to earn a greater return. However, the additional return may incur too much risk for an average investor. As a result, the typical recommendation is to invest 60–70% in the stock market and put the rest into bonds. Note that the money here is assumed to be left for long-term investment, so cash may already have been set aside for short-term liquidity. This is the reason why there is no riskfree investment (investment in T-bills) here.

3.3.2 Multi-asset classes

Suppose that the investor has enough wealth that the money can be further invested into different asset classes beyond stocks and bonds, and within each class, subclasses can be considered. For example, within equities, the funds may be divided into growth stocks and international markets for aggressive wealth growth and diversification. Within bonds, one may consider other fixed-income investments such as U.S. corporate bonds and international bonds.
The other asset classes may include: commodities (precious metals, nonferrous metals, agriculture, energy, etc.), real estate, collectibles (such as art, coins, or stamps), insurance products (annuities, life settlements, catastrophe bonds, personal life insurance products, etc.), derivatives (such as options, collateralized debt, and futures), foreign currency, venture capital, private equity, distressed securities, and hedge funds.

Mathematically, the optimal allocation can be solved in the same way, by imposing realistic constraints and using quadratic programming. With 10 or 20 assets, the covariance matrix is relatively easy to estimate, and so there is usually no problem with the implementation. In practice, some naive or round values are often provided to investors.

3.4 Large set of individual stocks

Consider now the case where we need to invest our money into a large number of assets, say thousands of stocks. In this case, it is difficult to get a good covariance matrix estimator. The sample covariance matrix is usually useless, as it is often not invertible: the condition T ≥ N + 2 is violated because N, which is in the thousands, is greater than T, the sample size (the number of time periods here). Note that even if T can be made large, data from long ago may have little relevance today. We discuss below two solutions to the problem.

The first method is to take a two-stage approach. In the first stage, we divide stocks into a few, or a few dozen, categories. The division sets up an asset allocation problem: allocating funds across a few dozen groups of stocks. Within each group, we may simply invest in the group indices (though one may further choose stocks to outperform the group indices), such as industry indices. In the second stage, we decide how much to invest in each one of the groups.
To do so, we can use either a naive portfolio rule, such as 1/N or any ad hoc rule discussed in the previous section, or an optimized rule. This does not require the inversion of a large covariance matrix. In practice, indeed, many institutional investors operate in this fashion (see, e.g., Platanakis, Sutcliffe and Ye, 2021).

The second method is to impose a factor structure or to use shrinkage, so that the covariance matrix estimate becomes invertible. A simple factor approach is discussed in Section 6.4, where the residual covariance matrix is assumed to be diagonal. Relaxing this assumption, Fan, Liao and Mincheva (2013) provide a more general POET estimator (Principal Orthogonal ComplEment Thresholding). The POET estimator is well known, and its R code is available on the web. Ledoit and Wolf (2013, 2017, 2020) provide some of the most useful alternative shrinkage estimators.

3.5 Estimation risk

Portfolio optimization generally refers to selecting the best portfolio (asset allocation) out of the set of all possible portfolios according to some objective function. Mean-variance portfolio theory is a particular case.

3.5.1 The plug-in rule

In the case where the riskless asset is present and there are no portfolio constraints, we have a simple analytical formula for the optimal portfolio. However, the formula has unknown parameters. To apply it, we have to estimate the parameters first; then, replacing the parameters by the estimates, we obtain the so-called plug-in optimal portfolio rule, or plug-in rule for short.

Specifically, in our mean-variance portfolio context, since the true parameters µ and Σ are unknown, the true theoretical optimal portfolio cannot be obtained. To implement the mean-variance portfolio theory of Markowitz (1952), the optimal portfolio weights are usually estimated by following a two-step procedure.
Suppose there are T periods of observed excess returns data (since we assume the riskless asset is available, all our portfolio choice formulas depend only on excess returns), ΦT = {R1, R2, ..., RT}, and we would like to form a portfolio for period T + 1. In the first step, the mean and covariance matrix of the asset returns are estimated from the observed data. Under the assumption that Rt is i.i.d., the standard estimates are

    µ̂ = (1/T) Σ_{t=1}^T Rt,    (3.9)

    Σ̂ = (1/(T − 1)) Σ_{t=1}^T (Rt − µ̂)(Rt − µ̂)′,    (3.10)

which are the same estimators given in Section 2.2.5. The above estimates are extensions of the univariate sample mean and sample variance, (1.20) and (1.21), to the multiple-asset or high-dimensional case.

In the second step, the sample estimates are treated as if they were the true parameters and are simply plugged into the theoretical formula (2.45) to compute the optimal portfolio weights,

    ŵ = (1/γ) Σ̂⁻¹µ̂.    (3.11)

This is known as the plug-in rule, obtained by plugging in the sample estimates. Statistically, the above moment estimators are the most efficient estimates that converge to the true parameters as the sample size T increases to infinity.

However, in practice, the sample size is small and limited. Hence, there are substantial estimation errors in estimating both the expected return and the covariance matrix. This issue is the focus of the rest of the subsections and will also be examined further in Chapter 4.

3.5.2 Errors in using a model

In general, portfolio optimization assumes a model for the data-generating process. It is important to remember that "All models are wrong, but some are useful" (George Box, statistician). Hence, there are three types of errors in a model:

1. Errors in fitting the data. Our models are built to fit the past data, and the models are never perfect, as there are assumed random errors even within the models.

2.
Errors in parameter estimates. Given the assumed random errors, there are additional errors resulting from estimating the assumed true but unknown parameters.

3. Errors in capturing the changing world. The models are built from, and for, the past data, but the future may move into an unforeseen regime or crisis, or behaviors may shift.

3.5.3 Estimation errors

Here we focus on estimation errors. In particular, even if the data-generating process is true, there are errors in estimating the expected returns and the covariance matrix due to limited data. Because of the errors, the plug-in rule often performs poorly.

Example 3.1 Similar to Example 2.8, assume that there are N = 2 risky assets with monthly expected returns and monthly covariance matrix:

    µ = (1/12) (0.10, 0.20)′,
    Σ = (1/12) [0.3², 0.5 × 0.3 × 0.4; 0.5 × 0.3 × 0.4, 0.4²],

and γ = 3. Then the optimal weights at the true parameters are

    w = (1/γ) Σ⁻¹µ = (0.123, 0.370)′.

If T = 120 (10 years of data), then the standard errors of the estimated expected returns are

    (0.30/12)/√120 = 0.002282,    (0.40/12)/√120 = 0.003043.

Since values within 2 standard deviations are quite likely, let us say we have made errors of 1 standard deviation. Then we use the true values plus the errors, i.e.,

    µ̂ = µ + (0.002282, 0.003043)′,

to compute the weights,

    ŵ = (1/γ) Σ⁻¹µ̂ = (0.191, 0.421)′,

which are quite different from w. That is the problem caused by errors in µ alone. ♠

In practice, we also have errors in estimating Σ, which makes the problem worse than in the example. In addition, as N becomes large, the problem becomes more severe. The issue has been known to practitioners for a long time. For example, Michaud and Michaud (2008) emphasize that it is difficult to estimate the inputs (mean and covariance matrix) of the portfolio optimization, and that even small changes in the inputs can lead to very large changes in the optimized portfolio weights.
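Example 3.1 can be replicated directly; the sketch below computes the plug-in weights (3.11) at the true parameters and at the perturbed means.

```python
import numpy as np

# Example 3.1: two assets, monthly parameters
mu = np.array([0.10, 0.20]) / 12
Sigma = np.array([[0.30**2, 0.5 * 0.30 * 0.40],
                  [0.5 * 0.30 * 0.40, 0.40**2]]) / 12
gamma, T = 3.0, 120

w_true = np.linalg.solve(Sigma, mu) / gamma    # optimal weights at the true parameters
# -> approximately [0.123, 0.370]

# One-standard-deviation errors in the estimated means
se = np.array([0.30, 0.40]) / 12 / np.sqrt(T)  # approx [0.002282, 0.003043]
mu_hat = mu + se
w_hat = np.linalg.solve(Sigma, mu_hat) / gamma # plug-in weights with small mean errors
# -> approximately [0.191, 0.421]
```

A shift of one standard error in each mean, tiny in return units, moves the first weight by more than 50% of its true value, which is the instability the example illustrates.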
Brown (1976, 1978), Klein and Bawa (1976), and Bawa, Brown and Klein (1979) are earlier academic studies of the problem. Kan and Zhou (2007), DeMiguel, Garlappi, and Uppal (2009), Tu and Zhou (2011), and Pedersen, Babu, and Levine (2020) are examples of recent studies. In the subsections that follow, we first provide a theoretical analysis of the problem, and then discuss some of the solutions.

3.5.4 Analytical assessment∗

To understand the impact of estimation errors, consider the loss of expected utility. Recall from the previous chapter that, if an investor knows the true µ and Σ, the optimal portfolio is

    w∗ = (1/γ) Σ⁻¹µ,    (3.12)

which yields a maximum utility of

    U(w∗) = (1/(2γ)) µ′Σ⁻¹µ    (3.13)

and a maximum Sharpe ratio of

    Sharpe ratio = √(µ′Σ⁻¹µ).

These results hold only if the investor knows the true parameters, but no one knows the true parameters in the real world.

To see the consequence of not knowing the true parameters, consider a case in which an investor does not know µ but knows Σ. Knowing Σ is assumed here to simplify the formulas below; also, in practice, one can potentially use high-frequency data to estimate Σ with greater accuracy than by relying on the sample covariance matrix. Assume that the investor uses the sample mean µ̂ to estimate µ; then he/she can only invest based on the plug-in rule,

    ŵplug = (1/γ) Σ⁻¹µ̂,    (3.14)

and the expected utility is only

    E[U(ŵ)] = (1/(2γ)) µ′Σ⁻¹µ − (1/(2γ)) (N/T),    (3.15)

which is lower than before. The Sharpe ratio is lower too,

    Sharpe ratio = µ′Σ⁻¹µ / √(µ′Σ⁻¹µ + N/T).    (3.16)

Both (3.15) and (3.16) follow from Kan and Zhou (2007) under the assumption that the data are i.i.d. normal. Equation (3.15) says that the investor will get less utility than someone who knows the true parameters, and the loss depends critically on N, the number of stocks, and T, the sample size (number of periods of data).
If the data are monthly, say T = 120 (10 years of data), then, if N = 50, the expected utility can even be negative! (µ′Σ⁻¹µ is the monthly squared Sharpe ratio, and its value is usually far less than N/T = 50/120 ≈ 0.42.) The investor would be better off putting the money into the riskfree asset, as the likelihood of choosing a bad portfolio from inaccurate estimation is too high, given the estimates from the limited data. However, as the sample size T goes to infinity, the utilities become the same, since the estimated parameters then converge to the true parameters. In the real world, T is finite, and so there is always the issue of whether the data provide sufficient information for a given application.

3.5.5 Correlation shrinkage

Pedersen, Babu, and Levine (2020) emphasize that the poor performance problem is primarily due to errors in estimating the small-eigenvalue portfolios, although this seems to have been known to practitioners for years. For example, Chen and Yuan (2016) propose using a factor model to eliminate small-eigenvalue portfolios.

To see the impact of the small-eigenvalue portfolios, consider the assets after the linear transformation by principal components analysis (PCA) (see Section 6.2 on PCA),

    R_PCA = A′(R − µ),    (3.17)

where A is the matrix of standardized eigenvectors of the covariance matrix Σ such that Σ = AΛA′, with Λ the diagonal matrix of eigenvalues in decreasing order (see Equation (6.27)); thus R_PCA are N portfolios of the original assets with the eigenvectors as weights. Then investing in all N original assets is the same as investing in their N uncorrelated portfolios R_PCA. Let µP1, ..., µPN be the expected returns on the latter, and σP1, ..., σPN be their volatilities. Since the variance of the j-th component of the PCA is λj (σPj = √λj) and the PCA portfolios are uncorrelated, we have, from Example 2.11, that the weight on the j-th portfolio is

    wPj = (1/γ) × (µPj/√λj) × (1/√λj).
(3.18)

This says that the portfolio weight is a product of three terms: the inverse of the risk aversion, the Sharpe ratio of the component, and 1/√λj. The first two terms are typically bounded, but the last term can be too large if the eigenvalue is small. Note that the eigenvalues are ordered here, λ1 ≥ λ2 ≥ ... ≥ λN. Indeed, the small eigenvalues are often underestimated, i.e., estimated to be too small, in the real world, especially when N is large (high dimensionality). As a result, the optimal portfolio will load up too heavily on the principal components associated with small eigenvalues. Hence, any errors in the mean will have a huge impact on the performance of the estimated optimal portfolio too.

Because of the issue above, Pedersen, Babu, and Levine (2020) provide a simple solution to the small-eigenvalue problem. They propose to shrink the correlations of the asset returns and call their solution Enhanced Portfolio Optimization. Contemporaneously, Menchero and Li (2020) also provide a similar shrinkage solution, but they focus on risk forecasting.

Let Σ be the covariance matrix as usual. We can express it mathematically as a product of a diagonal matrix of standard deviations and the correlation matrix. When N = 2, it is easy to verify that

    Σ = [σ1², ρ12σ1σ2; ρ21σ1σ2, σ2²] = [σ1, 0; 0, σ2] [1, ρ12; ρ21, 1] [σ1, 0; 0, σ2].

In general,

    Σ = DσΩDσ,    (3.19)

where σi² is the variance of asset i, ρij is the correlation between asset i and asset j, Dσ is the diagonal matrix of the σi's, and Ω is the correlation matrix. Denote by Ω̂ the estimated correlation matrix. The shrinkage estimator is defined as

    Ω̂η = (1 − η)Ω̂ + ηI_N,  η ∈ [0, 1],    (3.20)

where I_N is the identity matrix. Then the covariance matrix is estimated by

    Σ̂η = DσΩ̂ηDσ.
(3.21)

If η = 0 there is no shrinkage and we simply use the original Ω̂. If η = 1, we replace it by the identity matrix, effectively ignoring all correlations, so that the eigenvalues are estimated much more accurately as the asset variances. The shrinkage toward zero correlations is intuitive because the true correlations among assets tend to be small, while the estimates tend to be high in practice.

How should η be chosen? One can choose a value of η that worked well in the past and keep updating it over time. Pedersen, Babu, and Levine (2020) find that the simple choice η = 50% works well in a number of data sets. We will compare this rule with others later in Section 3.5.7 in our data applications.

3.5.6 Combination of 1/N with plug-in

Due to estimation errors, many investors use ad hoc rules in practice (see Section 2.1). Consistent with this, DeMiguel, Garlappi, and Uppal (2009) show that the simple 1/N investment rule can actually outperform most estimated optimized rules (which would be optimal if there were no estimation errors), including the previous plug-in rule. Tu and Zhou (2011) provide improved portfolio rules by combining 1/N with estimated optimized rules in the riskfree-asset case. Kan, Wang, and Zhou (2020) provide new improved portfolio rules in the no-riskless-asset case. The estimation errors are more severe in a large portfolio (N large), for which Ao, Li, and Zheng (2019) introduce some effective methods.

In what follows, we focus on the rule proposed by Tu and Zhou (2011), which is simple and effective. First, instead of using ŵ, we use a scaled version,

    w̄ = (1/γ) Σ̃⁻¹ μ̂,   (3.22)

where Σ̃ = (T − 1)Σ̂/(T − N − 2). The scaled w̄ performs better than ŵ theoretically; indeed, it outperforms ŵ in almost all empirical applications. Let w_e = 1_N/N be the 1/N rule that invests 1/N of every dollar in each asset. Tu and Zhou (2011) consider a combination of w_e with w̄,

    ŵ_C = (1 − δ)w_e + δ w̄.
(3.23)

Intuitively, this is portfolio diversification: instead of investing with either w_e'R or w̄'R alone, we invest in a portfolio of the two, which should do better in general. Indeed, theoretically there exists δ > 0 such that ŵ_C dominates both w_e and w̄, i.e., performs better, unless the true parameters take the special values Σ⁻¹μ/γ = 1_N/N. In the latter case, δ = 0 and ŵ_C becomes w_e.

However, how do we obtain δ > 0? In practice, δ can be estimated, but the performance with the estimated value weakens due to errors in the estimation. Nevertheless, the combination tends to perform better than w̄ in most applications. The estimate of δ is

    δ̂ = π̂_1/(π̂_1 + π̂_2)   (3.24)

with π̂_1 and π̂_2 given by

    π̂_1 = w_e'Σ̂w_e − (2/γ) w_e'μ̂ + (1/γ²) θ̃²,   (3.25)
    π̂_2 = (1/γ²)(c_1 − 1)θ̃² + (c_1/γ²)(N/T),   (3.26)

where θ̃² is an estimator of θ² = μ'Σ⁻¹μ and c_1 = (T − 2)(T − N − 2)/((T − N − 1)(T − N − 4)), with T > N + 4. A natural estimator of θ² is its sample counterpart,

    θ̂² = μ̂'Σ̂⁻¹μ̂.   (3.27)

But θ̂² can be a heavily biased estimator of θ² when T is small. Hence, we use the estimator below, proposed by Kan and Zhou (2007),

    θ̃² = [(T − N − 2)θ̂² − N]/T + 2(θ̂²)^{N/2}(1 + θ̂²)^{−(T−2)/2} / [T B_{θ̂²/(1+θ̂²)}(N/2, (T − N)/2)],   (3.28)

where

    B_x(a, b) = ∫₀ˣ y^{a−1}(1 − y)^{b−1} dy   (3.29)

is the incomplete beta function. The first part of (3.28) is the unbiased estimator of θ², and the second part is an adjustment that improves the unbiased estimator when it is too small.

A simple combination of w_e with w̄ is the naive diversification with δ = 1/2,

    ŵ_naive = (1/2) w_e + (1/2) w̄.   (3.30)

In practice, this rule works well. It is a special case of model averaging (see Section 3.7.2), and it is much simpler than the optimal combination ŵ_C since no estimation of θ² is needed. However, ŵ_C has the theoretical advantage that it converges to the true rule w* as the sample size goes to infinity. In contrast, ŵ_naive never converges.
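As a rough numerical sketch (not the authors' code), the combination weight δ̂ of (3.24)–(3.26) can be computed from data; for brevity, θ̃² is replaced here by the simpler sample estimate θ̂² of (3.27), so the result only approximates the adjusted rule:

```python
import numpy as np

def tu_zhou_delta(R, gamma=3.0):
    """Approximate delta-hat of (3.24)-(3.26), using the sample
    theta-hat^2 of (3.27) in place of the adjusted estimator (3.28)."""
    T, N = R.shape
    mu = R.mean(axis=0)
    Sigma = np.cov(R, rowvar=False)                # unbiased covariance (3.37)
    theta2 = mu @ np.linalg.solve(Sigma, mu)       # theta-hat^2 = mu' Sigma^{-1} mu
    w_e = np.ones(N) / N                           # 1/N rule
    c1 = (T - 2) * (T - N - 2) / ((T - N - 1) * (T - N - 4))
    pi1 = w_e @ Sigma @ w_e - (2 / gamma) * (w_e @ mu) + theta2 / gamma**2
    pi2 = (c1 - 1) * theta2 / gamma**2 + c1 * N / (gamma**2 * T)
    return pi1 / (pi1 + pi2)

# illustrative data: 20 years of monthly returns on 10 hypothetical assets
rng = np.random.default_rng(0)
R = 0.01 + 0.05 * rng.standard_normal((240, 10))
delta = tu_zhou_delta(R)
```

With long samples (large T), c_1 is close to 1, π̂_2 shrinks, and δ̂ approaches 1, i.e., the combination tilts toward the estimated optimized rule, as the theory suggests.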
3.5.7 Backtesting: A comparison of rules

Backtesting often refers to testing a model or strategy on historical data. It provides insight into how well the model might have performed had it been used. It should be remembered, however, that a model that worked well in the past may not work well in the future.

As an example of backtesting, we provide here a detailed implementation and comparison of five portfolio investment rules: 1/N, the plug-in, the GMV, the correlation shrinkage of Pedersen, Babu, and Levine (2020), and the combination rule of Tu and Zhou (2011).

Suppose that we have data from period 1 to T. We must use some of the sample for initial estimation. Assume that we use the data from 1 to T_0 as the initial sample to obtain the parameters, say T_0 = 120, i.e., 10 years of monthly observations. We can then invest starting at time T_0. Note that the 1/N rule,

    ŵ(1) = (1/N) 1_N,   (3.31)

can start at any time since it requires no estimation, but we start it at T_0 for comparison with the other rules. Based on the data, the plug-in rule is easily computed at T_0,

    ŵ(2) = (1/γ) Σ̂⁻¹ μ̂,   (3.32)

based on formula (3.11), where μ̂ and Σ̂ are estimated using data up to T_0 [replacing T by T_0 in (3.9) and (3.10)], and the GMV rule is computed from

    ŵ(3) = Σ̂⁻¹1_N / (1_N'Σ̂⁻¹1_N),   (3.33)

based on formula (2.21). The correlation shrinkage of Pedersen, Babu, and Levine (2020) is computed like the plug-in rule,

    ŵ(4) = (1/γ) Σ̂_η⁻¹ μ̂,   Σ̂_η = D̂_σ Ω̂_η D̂_σ,   (3.34)

except that the covariance matrix is now estimated with shrinkage at η = 1/2. The combination rule of Tu and Zhou (2011) is

    ŵ(5) = (1 − δ̂)w_e + δ̂ w̄,   (3.35)

where w̄ uses the scaled covariance estimator Σ̃ and δ̂, while somewhat complex, can still be computed in a few steps.

At time T_0 + 1, we have one more data point, so the sample runs from 1 to T_0 + 1. We then re-estimate the parameters and use them to re-compute the portfolio weights that determine the investment at T_0 + 1.
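A minimal sketch of this recursive loop, comparing just the 1/N and plug-in rules on simulated returns (all parameter values illustrative, not the full five-rule comparison of the text):

```python
import numpy as np

rng = np.random.default_rng(42)
T, N, T0, gamma = 180, 5, 120, 3.0
R = 0.008 + 0.04 * rng.standard_normal((T, N))     # simulated monthly returns

ret_1N, ret_plugin = [], []
for t in range(T0, T):
    data = R[:t]                                   # recursive: all data up to t
    mu_hat = data.mean(axis=0)
    Sigma_hat = np.cov(data, rowvar=False)
    w1 = np.ones(N) / N                            # 1/N rule (3.31)
    w2 = np.linalg.solve(Sigma_hat, mu_hat) / gamma  # plug-in rule (3.32)
    ret_1N.append(w1 @ R[t])                       # realized return at time t
    ret_plugin.append(w2 @ R[t])

def sharpe(r):
    r = np.asarray(r)
    return r.mean() / r.std()

s1, s2 = sharpe(ret_1N), sharpe(ret_plugin)        # out-of-sample Sharpe ratios
```

Adding the GMV, shrinkage, and combination rules only requires computing their weights inside the same loop; the rolling-window variant discussed next replaces `R[:t]` with `R[t-T0:t]`.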
We then do the same at T_0 + 2, and so on, until time T_0 + (T − T_0) = T. We then have the returns on the various investment rules available from T_0 + 1, T_0 + 2, ..., to T, and can use them to assess the performance of the rules, for example by comparing their Sharpe ratios.

Recall from Section 2.2.6 that the above procedure is known as recursive estimation, with time-varying windows or with data recursively available. At T_0 we use T_0 data points; at T_0 + 1 we use one more, and so on: the length of the data, or sample size, increases over time. See the Python codes for the implementation.

Alternatively, one can also estimate with a fixed window size. For example, we use T_0 = 120 data points at T_0, and still use 120 data points at T_0 + 1 (the data from 2, 3, ..., T_0, T_0 + 1), and continue to do so until time T. A reason for doing this is that data that are too old may not capture well what is happening recently or in the near future. This procedure is known as rolling estimation with a window size of 120.

To summarize, it is important in practice to consider alternative investment rules, because all of them are approximations and none dominates the others always. For example, the 1/N rule is good for cases in which there are many assets whose moments are difficult to estimate and whose expected returns are likely equal, but it clearly performs poorly when applied to the case of one riskfree asset and one risky asset. Hence, for a given application, it is important to have a complete list of investment strategies (the above plus more, such as value weighting and additional combination rules), and to find out which ones are better than others. The better ones may then be applied directly or used after further combination.

Note that the portfolio weights of the four estimated portfolio rules can be too large in some cases.
To make them more realistic, we also compare the rules under the constraint that the weight on each asset is bounded,

    |w_j^{(i)}| ≤ b,   j = 1, ..., N,

for each estimated rule i, where b is the limit on the long or short position in each asset. For the GMV and plug-in rules, the constrained problems can be solved by quadratic programming. For the constrained Tu–Zhou rule, we can use the previous δ̂ as an approximation and obtain it as the combination of the 1/N rule and the solved constrained plug-in rule. The combination weights clearly satisfy the above bound because both of the underlying rules do.

3.5.8 A Bayesian solution

Expected stock returns, the means, are known to be difficult to estimate. The sample mean is the most common estimator; some shrinkage estimators will be discussed later. For Σ, there are simple alternative estimators. The maximum likelihood estimator is

    Ŝ = (1/T) Σ_{t=1}^T (R_t − μ̂)(R_t − μ̂)'.   (3.36)

Its relation to our other estimators is as follows:

• Unbiased estimator of Σ:

    Σ̂ = T Ŝ/(T − 1),   (3.37)

which is unbiased in that E[Σ̂] = Σ. Numerically, Σ̂ and Ŝ are almost identical when T = 120.

• Unbiased estimator of Σ⁻¹:

    Σ̃ = T Ŝ/(T − N − 2),   (3.38)

which satisfies E[Σ̃⁻¹] = Σ⁻¹.

• Bayesian rule under a diffuse prior:

    Σ̂_Bayes = (T + 1) Ŝ/(T − N − 2),   (3.39)

which is the implied Σ estimator of the Bayesian optimal portfolio weights, the solution to

    ŵ_Bayes = argmax_w ∫ U(w) p(R_{T+1}|Φ_T) dR_{T+1}   (3.40)
            = argmax_w ∫∫∫ U(w) p(R_{T+1}, μ, Σ|Φ_T) dμ dΣ dR_{T+1},   (3.41)

where U(w) is the utility of holding portfolio w at time T + 1, p(R_{T+1}|Φ_T) is the predictive density, and

    p(R_{T+1}, μ, Σ|Φ_T) = p(R_{T+1}|μ, Σ, Φ_T) p(μ, Σ|Φ_T),   (3.42)

where p(μ, Σ|Φ_T) is the posterior density of μ and Σ, and the prior is diffuse,

    p_0(μ, Σ) ∝ |Σ|^{−(N+1)/2}.   (3.43)

Notice that the Bayesian approach maximizes the average expected utility over the distribution of the parameters.
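The covariance estimators (3.36)–(3.39) differ from Ŝ only by scale factors, which is easy to check numerically (data simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 120, 5
R = rng.standard_normal((T, N))                 # illustrative return data
mu_hat = R.mean(axis=0)

X = R - mu_hat
S_hat = (X.T @ X) / T                           # MLE (3.36)
Sigma_hat = T * S_hat / (T - 1)                 # unbiased for Sigma (3.37)
Sigma_tilde = T * S_hat / (T - N - 2)           # inverse is unbiased for Sigma^{-1} (3.38)
Sigma_bayes = (T + 1) * S_hat / (T - N - 2)     # diffuse-prior Bayes (3.39)
```

With T = 120 and N = 5, the scale factors are 120/119 ≈ 1.008, 120/113 ≈ 1.062, and 121/113 ≈ 1.071, so Ŝ and Σ̂ are nearly identical while the last two inflate Ŝ somewhat more.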
The solution under the diffuse prior is almost the same as using the inverse of the unbiased estimator Σ̃. In terms of expected utility, the performance of the various estimators is ordered as

    E[J(Σ̂)] < E[J(Σ̄)] < E[J(Σ̃)] < E[J(Σ̂_Bayes)].

However, they are still far from optimal. Tu and Zhou (2010) use an informative prior, and the results are substantially better in general. Some shrinkage estimators of both μ and Σ will be discussed later (see Section 4.4).

3.6 Transaction costs

Transaction costs are important in practice. Novy-Marx and Velikov (2016) is a recent study of the cost of trading anomalies, that is, the non-market factors such as size and momentum, and other long–short portfolios. They divide the anomalies into three turnover groups: low, mid, and high. They find that the cost of trading low-turnover strategies (such as size, gross profitability, and value) is generally quite low, often less than 10 bp per month, because these strategies require only annual rebalancing. The cost for mid-turnover strategies (14–35% turnover on each side, such as momentum and idiosyncratic volatility) runs from 20 to 57 bp, and for high-turnover strategies (≥ 90% per month), over 1%. However, Frazzini et al. (2015) argue that transaction costs are lower if the position is traded over 3 days.

For individual investors, the price impact is negligible and the cost is mainly commissions. (There are two other tiny fees, the SEC fee and the FINRA Trading Activity Fee (TAF), which are regulatory fees charged on the sale (only) of any security: $22.10 per million for the SEC and $0.0000119 per share for the TAF, each rounded up to the nearest penny.) Zero commissions, initiated by the Robinhood brokerage (Robinhood.com), have forced many brokerages today to offer zero commissions on online orders (broker-assisted orders are charged $25 or so, and options and other complex instruments still carry fees).
Individual investors can also trade US stocks algorithmically for free via APIs, directly from Alpaca and Interactive Brokers, among others.

3.7 Model uncertainty

There are many forms of model uncertainty. We consider two cases here. In the first case, we provide more robust estimates of the parameters that improve on the earlier sample moment estimates (see, e.g., Meucci, 2005, for more such analysis). In the second case, where the true model is unknown, we discuss the popular model averaging as an effective approach for using many candidate models.

3.7.1 Perturbation of the normal model

Assume that the true data distribution falls into the class of distributions

    {G | G = (1 − ε)F_N + εW},   (3.44)

where F_N is the normal distribution, W an arbitrary one, and ε a constant between 0 and 1. The equation says that the true distribution, which may not be normal, lies in a neighborhood of the commonly assumed normal one. The question we ask is how this affects our parameter estimates.

Perret-Gentil and Victoria-Feser (2004) prove two interesting results. First, the asymptotic bias of the optimal portfolio weights depends only on the asymptotic biases of μ̂ and Σ̂. Second, the bias can potentially be infinite even though the data may deviate from normality by only a small amount.

How, then, do we estimate the parameters μ and Σ so that the bias is small? The solution is to use weighted averages, rather than the simple averages or standard sample moments (see (3.9) and (3.10)), to estimate μ and Σ:

    μ̂ = (1/T) Σ_{t=1}^T w_t^m R_t,   (3.45)
    V̂ = (1/T) Σ_{t=1}^T w_t^{v1} (R_t − μ̂)(R_t − μ̂)' w_t^{v2},   (3.46)

where the weights w_t^m, w_t^{v1}, and w_t^{v2} depend on two control parameters (see Perret-Gentil and Victoria-Feser for details). As it turns out, the above estimates are more robust than the standard sample moments.
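A heavily simplified illustration of the idea behind (3.45)–(3.46): the weight function below is a generic Huber-style choice that down-weights outlying observations, used purely for illustration and not the actual scheme of Perret-Gentil and Victoria-Feser (2004):

```python
import numpy as np

def robust_moments(R, c=2.5):
    """Weighted mean/covariance where observations far from a robust
    center get small weights (illustrative Huber-style weighting)."""
    mu0 = np.median(R, axis=0)                          # robust starting center
    sd0 = 1.4826 * np.median(np.abs(R - mu0), axis=0)   # MAD scale estimate
    z = np.abs(R - mu0) / sd0                           # standardized distances
    w = np.minimum(1.0, c / np.maximum(z.max(axis=1), 1e-12))  # one weight per obs
    w = w / w.sum()
    mu = w @ R                                          # weighted mean, cf. (3.45)
    Xc = R - mu
    V = (w[:, None] * Xc).T @ Xc                        # weighted covariance, cf. (3.46)
    return mu, V

rng = np.random.default_rng(7)
R = rng.standard_normal((200, 3))
R[0] += 50.0                                            # contaminate one observation
mu_r, V_r = robust_moments(R)
```

Because the contaminated row receives a tiny weight, the weighted mean stays near zero, whereas the simple sample mean would be pulled toward the outlier.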
3.7.2 Model averaging

When there are multiple estimates or models, one popular decision method is the so-called maxmin rule, which maximizes the minimum (worst-scenario) value of the objective function so that the worst loss is limited.

In the mean-variance framework, suppose the investor is provided by J experts with estimates of the mean and covariance matrix of the asset returns:

    μ_j, Σ_j,   j = 1, 2, ..., J.   (3.47)

Which of them should the investor use? The maxmin rule suggests choosing the portfolio weights to maximize the worst-case utility:

    max_w min_j Q_j(w),   (3.48)

where

    Q_j(w) = w'μ_j − (γ/2) w'Σ_j w,   (3.49)

i.e., the objective function evaluated at the estimated parameters μ_j and Σ_j.

A naive Bayesian procedure may assign probability λ_j to expert j's estimate; the optimal portfolio weights are then

    w = (1/γ) (Σ_{j=1}^J λ_j Σ_j)⁻¹ (Σ_{j=1}^J λ_j μ_j),   (3.50)

where the probabilities satisfy 0 ≤ λ_j ≤ 1 and Σ_{j=1}^J λ_j = 1.³

Formally, the Bayesian model averaging approach proceeds from a set of priors over J models,

    p_0(M_j) = prior probability of model j,   j = 1, 2, ..., J.   (3.51)

After observing the data R, the posterior probability is given by

    p(M_j|R) = p_0(M_j) p(R|M_j) / Σ_{j=1}^J p_0(M_j) p(R|M_j),   (3.52)

where p(R|M_j) is the marginal likelihood, computed as

    p(R|M_j) = ∫ p_0(θ_j|M_j) p(R|θ_j, M_j) dθ_j,   (3.53)

where p_0(θ_j|M_j) is the usual prior density of the parameter θ_j and p(R|θ_j, M_j) the likelihood, conditional on model j being true. The predictive return density is then that from each model weighted by the posterior probabilities,

    p(r_{T+1}|R) = Σ_{j=1}^J p(r_{T+1}|M_j, R) p(M_j|R),   (3.54)

which, when combined with the objective function, provides the Bayesian optimal portfolio choice.

³Lutgens and Schotman (2004) provide more discussion on combining various estimates.
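As an aside, the maxmin choice in (3.48)–(3.49) has no closed form in general, but for a small number of assets it can be computed numerically; a sketch with two hypothetical experts (all numbers illustrative):

```python
import numpy as np
from scipy.optimize import minimize

gamma = 3.0
# two experts' estimates of means and covariances (illustrative)
mus = [np.array([0.08, 0.05]), np.array([0.04, 0.07])]
Sigmas = [np.diag([0.04, 0.03]), np.diag([0.05, 0.02])]

def Q(w, mu, Sigma):
    # mean-variance objective (3.49) at one expert's estimates
    return w @ mu - 0.5 * gamma * w @ Sigma @ w

def worst_case(w):
    # the inner min over experts in (3.48)
    return min(Q(w, m, S) for m, S in zip(mus, Sigmas))

# maximize the worst case by minimizing its negative
res = minimize(lambda w: -worst_case(w), x0=np.zeros(2), method="Nelder-Mead")
w_maxmin = res.x
```

The maxmin weights are conservative by construction: they guarantee the best utility achievable under the least favorable of the J parameter estimates.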
In the mean-variance case, the portfolio choice depends only on the predictive mean and variance:

    E*_M = Σ_{j=1}^J E*_j p(M_j|R),   (3.55)
    V*_M = Σ_{j=1}^J V*_j p(M_j|R) + Σ_{j=1}^J (E*_j − E*_M)(E*_j − E*_M)' p(M_j|R),   (3.56)

where E*_j and V*_j are the predictive moments from model j.⁴ However, an equal averaging, i.e., equal weights on the estimates or models, is simple and popular in practice, and it usually works well.

3.8 Alternative objective functions

The mean-variance objective function is the focus here, and it is the most widely used in practice. In this subsection we consider some alternatives.

3.8.1 Kelly's criterion

The Kelly criterion (also known as the Kelly strategy, Kelly bet, etc.) provides an optimal gambling method: a formula whose suggested fixed proportional bet leads, almost surely, to greater wealth in the long run than any other strategy. Mathematically, its objective is to maximize the expected geometric growth rate.

How should you place your bets in a game played with an advantage? The Kelly criterion provides the answer: bet a predetermined fraction of wealth. Algorithmic trading systems often code it into their programs, some hedge funds use it in their trading strategies, and Warren Buffett and Bill Gross are reported to use it too. This is in fact not surprising, because it is close to the mean-variance utility with risk aversion γ = 1, and it exactly maximizes the expected log utility of wealth.

However, since γ is around 3 for a typical investor, the Kelly criterion (with γ = 1) seems too aggressive to many investors. As a result, half Kelly is often used in practice, which is half of the usual Kelly bet and is equivalent to setting γ = 2.

⁴The web site http://www.research.att.com/volinsky/bma.html provides a large list of papers on Bayesian model averaging.
In comparison with the history of mean-variance portfolio theory, Kelly (1956) proposed his criterion four years after Markowitz (1952) proposed his portfolio theory. There are various extensions of the Kelly criterion, of which MacLean, Thorp, and Ziemba (2011) provide a survey of the literature.

Consider a gamble. Suppose that

• p is the winning probability of a bet;
• M is the money you win on a $1 bet (if you win, you get 1 + M back);
• L is the loss (if you lose, you get 1 − L back).

To maximize the terminal wealth (assuming you can play the game over and over), the Kelly criterion says that one should invest K%, the Kelly percentage, of one's wealth:

    K% = [pM − (1 − p)L]/(M × L) = Expected Return/(M × L),   (3.57)

which simplifies to

    K% = p − (1 − p)/M,   if L = 1.   (3.58)

Example 3.2 Suppose your trading strategy has a 50% chance to triple your money and a 50% chance to lose it all. How much money should you invest in it each time? Here p = 50%, M = 2, L = 1. Kelly's rule says you should invest

    K% = .5 − (1 − .5)/2 = 25%.

Note that the expected mean and variance of the gamble are μ = .5 × 2 + .5 × (−1) = .5 and σ² = .5 × (2 − .5)² + .5 × (−1 − .5)² = 2.25. Assuming the risk aversion is γ = 1 and the riskfree rate is zero, the optimal investment from mean-variance utility theory is

    w = (1/γ)(μ/σ²) = .5/2.25 = 22.22%,

which is quite close to Kelly's solution. ♠

Note that for a typical risk aversion of 3, one invests only 22.22%/3 = 7.4%. This means that a Kelly investor is generally very aggressive and endures much greater risk.

Example 3.3 Suppose you are offered a coin-tossing game that you may play many times. The coin has a 60% chance of heads and 40% of tails. Heads you win $1; tails you lose $1. Starting with $100, what is the best strategy for placing your bets to maximize your long-term gains? Clearly p = 60%, M = 1, L = 1, so

    K% = .6 − (1 − .6)/1 = 20%.
Also, μ = .6 × 1 + .4 × (−1) = .20 and σ² = .6 × (1 − .2)² + .4 × (−1 − .2)² = 0.96, so

    w = (1/γ)(μ/σ²) = .2/0.96 = 20.83%,

again quite close to Kelly's solution. ♠

Proof of (3.57): Consider the discrete case only, and one period first. Let W_0 be the wealth today, W_1 the wealth after the gamble, and R the return from playing the game. Then

    W_1 = W_0(1 + R),   log W_1 = log W_0 + log(1 + R).

We choose k, the fraction of wealth invested, to maximize the expected log wealth:

    max_k E[log W_1] = log W_0 + p log(1 + kM) + (1 − p) log(1 − kL).

The first-order condition is

    pM/(1 + kM) − (1 − p)L/(1 − kL) = 0,

i.e., pM(1 − kL) = (1 − p)L(1 + kM). Solving for k yields (3.57). Now consider T periods. Assuming iid (independent and identically distributed) payoffs, the terminal wealth satisfies

    log W_T = log W_0 + [p log(1 + k_1 M) + (1 − p) log(1 − k_1 L)]
                      + [p log(1 + k_2 M) + (1 − p) log(1 − k_2 L)]
                      + ···
                      + [p log(1 + k_T M) + (1 − p) log(1 − k_T L)].

Therefore, maximizing the expected log terminal wealth is the same as maximizing the expected log wealth in each period, and the solution is the same. Q.E.D.

3.8.2 Higher moments

The mean-variance portfolio theory applies, theoretically, if stock returns are normally distributed or if investors care only about the mean and variance of returns. In the real world, returns are not normally distributed, as noted at least as early as Kendall and Hill (1953) and Mandelbrot (1963). Clearly there is no reason why other moments should not matter in the utility function of investors; the mean-variance assumption is made for simplicity and tractability. Beyond the first two moments, the following four-moment utility is often used,

    U = μ − (γ/2)σ² + γ_3 σ³ Skew/6 − γ_4 σ⁴ (Kurt − 3)/720,   (3.59)

where the γ's are preference parameters. It is generally true that, everything else equal, investors prefer positive skewness and dislike kurtosis.
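To illustrate, the four-moment utility (3.59) can be evaluated on the sample moments of a return series; the data and preference coefficients below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
r = 0.01 + 0.05 * rng.standard_normal(10_000)   # simulated monthly returns

mu, sig = r.mean(), r.std()
z = (r - mu) / sig
skew = np.mean(z**3)                            # sample skewness
kurt = np.mean(z**4)                            # sample kurtosis

gamma, gamma3, gamma4 = 3.0, 1.0, 1.0           # illustrative preference parameters
U = (mu - 0.5 * gamma * sig**2
     + gamma3 * sig**3 * skew / 6
     - gamma4 * sig**4 * (kurt - 3) / 720)      # four-moment utility (3.59)
```

For a nearly normal sample, the skewness and excess-kurtosis terms are tiny, and U is essentially the mean-variance utility μ − (γ/2)σ²; the higher-moment terms matter for noticeably skewed or fat-tailed returns.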
However, it is in general very difficult to solve the portfolio optimization problem with the above utility. Samuelson (1970) and Arditti (1971) are early studies. Jurczenko and Maillet (2006) have an excellent collection of articles. Jiang et al. (2020) examine the empirical evidence on asymmetry, including skewness. Mehlawat, Gupta, and Khan (2021) provide references on some of the latest advances.

3.8.3 Other utilities

The mean-variance utility function defined over the expected return and variance, equation (2.44), is equivalent to the following quadratic utility function defined over wealth,

    U(W_{t+1}) = aW_{t+1} − bW²_{t+1},   (3.60)

where W_{t+1} = (1 + R_{p,t+1})W_t and W_t is the initial wealth at time t. While the quadratic utility is popular, two others are simple and popular too:

• Exponential utility:

    U(W_{t+1}) = −exp(−θW_{t+1});   (3.61)

• Power utility:

    U(W_{t+1}) = W_{t+1}^{1−γ}/(1 − γ),   (3.62)

a special case of which is the log utility, U(W_{t+1}) = log(W_{t+1}).

Additional utility functions may be found in asset pricing books such as Huang and Litzenberger (1988) and Cochrane (2001). Clearly, portfolio decisions will in general differ under different utility functions. However, as the quadratic utility is a second-order approximation to any smooth utility, the differences may not be dramatic.

A fundamental limitation of the portfolio choice problem studied so far is its short-term nature, i.e., decision-making over one period. Clearly, in practice, investors care about their long-term well-being. For example, an investor might want to maximize the terminal wealth T periods from today,

    W_{t+T} = (1 + R_{p,t+1})(1 + R_{p,t+2}) ··· (1 + R_{p,t+T}) W_t.   (3.63)

Samuelson (1969) and Merton (1969, 1971) show that the portfolio choice will be myopic (the same as the one-period decision) for power utility with iid returns, or for log utility with arbitrary returns.
However, recent studies, as summarized by Campbell and Viceira (2003), show that investors' long-term portfolio choice should vary with changing economic conditions and changing labor income (which was not modeled previously). In particular, while cash is safer for short-term investors, it is not so for long-term ones, since long-term investors have to re-invest the cash at uncertain interest rates. People whose labor income is fairly uncorrelated with the stock market should invest more in equities when they are young. See Campbell and Viceira (2003) for more discussion of the intertemporal decisions of a long-term investor.

Beyond maximizing utility functions, one can also use alternative criteria for the optimality of portfolio choice. Some of these criteria are reviewed by Meucci (2005, Ch. 5).

4 Simulation, Bootstrap and Shrinkage

In this section, we study how to draw random samples from the multivariate distributions used to model multiple stock returns. We also discuss the bootstrap, which resamples from the data to obtain standard errors or test sizes that are typically more accurate than those from asymptotic theory. Then we discuss shrinkage estimation of the means and covariances of asset returns.

4.1 Sampling from distributions

In investment analysis and derivatives valuation, simulation is very important. To make it easy to understand, we show how to draw random samples from univariate, bivariate, and multivariate distributions.

4.1.1 Univariate case

To start, consider the simplest common question: how to simulate a normally distributed monthly return on a stock with a 12% annual return and 20% annual volatility. Note that almost every programming language has a function to generate standard normal variables. In Python, the code

    import numpy as np
    x = np.random.randn(m, n)   # generate random variables from N(0,1)

generates an m × n matrix of independent samples from the standard normal.
To get only one sample, we simply change (m, n) to (1, 1).

Statistically, assume ε follows the standard normal distribution with mean zero and variance one,

    ε ∼ N(0, 1),   (4.1)

which is the distribution any computer can simulate. Then

    R = 12%/12 + σ × ε ∼ N(1%, σ²),   σ = 20%/√12,   (4.2)

has the desired distribution. In other words, we apply a linear transformation to the standard normal, adding the mean and scaling by the standard deviation, to obtain a return with the desired mean and variance. In terms of Python code, we have

    e1 = np.random.randn(1, 1)             # generate one draw from N(0,1)
    R = 0.12/12 + (0.2/np.sqrt(12))*e1     # shift and scale

Then R is what we need. In applications we may need to simulate many such returns; we can either simulate e1 as a vector to begin with, or use a loop and simulate e1 one draw at a time.

Suppose that you have generated 10 returns by modifying the above code. The next time you run the program, you will get a different set of 10 returns. Often you want to reproduce the same 10 returns, or someone else wants to re-check your results with the same returns. How do you do that? You add a seed function to the code:

    np.random.seed(1234)                   # make the random numbers the same on each run
    e1 = np.random.randn(10, 1)            # generate 10 draws from N(0,1)
    R = 0.12/12*np.ones((10, 1)) + (0.2/np.sqrt(12))*e1   # R is 10 by 1

What the seed function does is let you generate the random numbers from the same starting point, determined by the input 1234 via the built-in seed function. The reason this works is that the random numbers provided by computers are almost, but not exactly, purely random. You can imagine all these numbers lying on a gigantic circle, with the computer picking them up sequentially one by one (yet they behave almost like random outcomes). The seed function simply chooses a starting point on the circle, which is arbitrary but fixed.
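As a usage aside (not from the lecture notes themselves), newer NumPy code often replaces the global seed with a `Generator` object created by `np.random.default_rng`, which keeps the reproducible stream local to the object:

```python
import numpy as np

rng = np.random.default_rng(1234)            # seeded generator object
e1 = rng.standard_normal((10, 1))            # 10 draws from N(0,1)
R = 0.12/12 + (0.2/np.sqrt(12)) * e1         # 10 simulated monthly returns

# the same seed reproduces exactly the same draws
rng2 = np.random.default_rng(1234)
e2 = rng2.standard_normal((10, 1))
```

The advantage over `np.random.seed` is that two independent generators with different seeds never interfere with each other, which matters when several simulations run in one program.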
4.1.2 Bivariate case

In practice, generating two stock returns with a given covariance structure is very common and important. If one simulates the two returns separately, they will be independent. This is not what we want, since stock returns are often correlated in the real world.

Consider first the simplified problem of generating two standardized random variables (zero means and unit standard deviations) with a given correlation ρ. We first generate two random numbers (which we know are independent), and then obtain a new pair of random numbers correlated at level ρ. Assume ε_1 and ε_2 follow the standard normal distribution with mean zero and variance one,

    ε_i ∼ N(0, 1),   i = 1, 2.   (4.3)

Then the linear transformation

    x_1 = ε_1,   (4.4)
    x_2 = ρ ε_1 + √(1 − ρ²) ε_2   (4.5)

yields a bivariate normal with zero means, unit standard deviations, and correlation ρ,

    x = (x_1, x_2)' ∼ N( (0, 0)', [1, ρ; ρ, 1] ).   (4.6)

This is easy to verify. For example,

    E(x_1 x_2) = E[ε_1(ρ ε_1 + √(1 − ρ²) ε_2)] = ρ,   (4.7)

or, more elegantly, by matrix algebra: writing x = Lε with L = [1, 0; ρ, √(1 − ρ²)], the covariance matrix of the x's is

    Var(x) = E(xx') = L E(εε') L' = LL' = [1, ρ; ρ, 1],   (4.8)–(4.9)

which is the covariance matrix in (4.6).

To get a bivariate normal random variable with arbitrary means and variances, we simply shift and scale x,

    y_1 = μ_1 + σ_1 ε_1,   (4.10)
    y_2 = μ_2 + σ_2[ρ ε_1 + √(1 − ρ²) ε_2],   (4.11)

so that y is bivariate normal with means μ_1 and μ_2 and covariance matrix

    Var(y) = [σ_1², ρσ_1σ_2; ρσ_1σ_2, σ_2²].   (4.12)

In Python, instead of doing the above from scratch, we can use a ready-made function to simulate from the multivariate normal distribution directly:

    y = np.random.multivariate_normal(mean, cov, (m,))

This generates an m × 2 matrix whose rows are independent draws from N(mean, cov); the program performs the above transformation for us. However, it is useful to understand the transformation, as it is generally applicable for altering the covariance structure of vectors with an arbitrary distribution.

4.1.3 Cholesky decomposition

In general, we can make a similar but more complex transformation to simulate a random sample from an arbitrary n-dimensional normal distribution. Of course, for the multivariate normal, we can use the above Python code for any n without carrying out the Cholesky decomposition ourselves. But understanding the Cholesky decomposition is generally useful. Let

    μ ≡ E[y] = (μ_1, μ_2, ..., μ_n)',   V ≡ Var(y),   (4.13)

where V is the n × n matrix with var(y_i) on the diagonal and cov(y_i, y_j) in the (i, j) position. Our objective is to draw a random sample from a multivariate normal distribution with this mean and covariance matrix.

As before, we first generate n standard normal variables, ε = (ε_1, ..., ε_n)', and then transform them into the desired n-vector. The key is the Cholesky decomposition of the covariance matrix. Mathematically, there exists a lower-triangular matrix L such that

    LL' = V,   (4.14)

which is known as the Cholesky decomposition or Cholesky factorization of the covariance matrix V. For example, when n = 2 and V = [1, ρ; ρ, 1], then L = [1, 0; ρ, √(1 − ρ²)] satisfies LL' = V, because

    [1, 0; ρ, √(1 − ρ²)] [1, ρ; 0, √(1 − ρ²)] = [1, ρ; ρ, 1],

which is the bivariate case we studied earlier. When n > 2, there is no simple closed-form formula for L, but in practice many software packages compute it. L is a lower-triangular matrix with positive diagonal elements, because V, being the covariance matrix of non-singular random variables, is positive definite. So we can always make the following transformation,

    y = μ + Lε.
(4.15)

Mathematically, it can be verified that

    Var(y) = E[(y − μ)(y − μ)'] = L E[εε'] L' = LL' = V,

i.e., the covariance matrix of y is indeed V. (Recall that in the univariate case it is σ × σ = σ², so the Cholesky decomposition works like taking a square root of the variance.) The y defined in (4.15) is the sample we need.

This procedure works not only for normal distributions but also for general distributions when a desired covariance matrix is needed. However, only the normal and, more generally, the elliptical distributions have the property that their linear transformations remain in the same class of distributions. A counterexample is the lognormal: a linear combination of two lognormal variables is no longer lognormal.

4.1.4 Singular value decomposition

The singular value decomposition (SVD), widely used, is a general decomposition applying to any m × n matrix M,

    M = UDV,   (4.16)

where D is an r × r square matrix for some r > 0, and U and V are m × r and r × n orthogonal matrices such that U'U = VV' = I_r, with I_r the identity matrix. Note that the SVD also applies to complex matrices, but we consider only real matrices here.

In particular, when M is a covariance matrix, the SVD becomes the eigenvalue or spectral decomposition. Based on (6.27), we can define the square root of a covariance matrix formally,

    Σ^{1/2} = [A_1, ..., A_n] diag(√λ_1, ..., √λ_n) [A_1, ..., A_n]' = A √λ A'.   (4.17)

It then follows that

    Σ^{1/2} Σ^{1/2} = A√λ A'A √λ A' = A √λ √λ A' = Σ.

Hence, the Cholesky decomposition is not the only way to generate a given covariance matrix, because Σ^{1/2} can do the same. Computationally, however, the Cholesky decomposition is the most efficient, requiring much less time than computing Σ^{1/2}.

4.2 Monte Carlo integration

Simulation has a number of applications in finance. Here we illustrate with two examples.
One is on estimating risk, and the other is on option valuation.

4.2.1 Theory

Monte Carlo integration is a simulation approach to computing an expected value,

    µ = E[f(x̃)] = ∫ f(x) g(x) dx,    (4.18)

where x̃ is a random variable with density function g(x). The integral is the expected value of a general function f(x̃) of the random variable x̃.

For example, in option pricing, g(x) may be the lognormal density of the stock price at expiration, and f(x̃) the present value of the payoff of a European option. The option price is then the expected value, which requires evaluating the integral under the risk-neutral distribution of the terminal stock price. The famous Black-Scholes formula is an outcome of this integral.

The advantage of Monte Carlo simulation is that it can compute the option value for non-standard payoff functions or for options that depend on high-dimensional random variables. In these cases, while analytical valuation is often impossible, the Monte Carlo approach is as easy as in the simple case. However, this is true only for European options. With American options, due to early exercise, the Monte Carlo method has to be adapted and can become rather complex.

Monte Carlo integration simply uses the average value of the function at simulated samples to approximate the true expected value,

    µ̂ = [f(x1) + f(x2) + ··· + f(xn)] / n,    (4.19)

where n is the number of simulated samples, and x1, x2, ..., xn are independent random draws of x̃ from its distribution with density g(x). By the law of large numbers, µ̂ must converge to µ as n goes to infinity. In practice, n = 10,000 is good for many applications. What is the error?
Let Sn denote the numerator of the right-hand side of (4.19). The central limit theorem says that the standardized Sn converges to a standard normal distribution,

    (Sn − nµ) / (σ√n) ⟹ N(0, 1),    σ² ≡ var[f(x̃)],    (4.20)

i.e., σ is the standard deviation of the random function f(x̃). The above equation implies that

    µ̂ = µ + (σ/√n) z + o(1/√n),    (4.21)

where z is a standard normal random variable and o(1/√n) denotes a term of higher order than 1/√n, i.e., a term that converges to 0 after dividing by 1/√n. In other words, the error of the Monte Carlo integration is random, but its magnitude in terms of standard deviation is roughly σ/√n. So, roughly speaking,

    MC Error = µ̂ − µ ≈ Problem Difficulty / √(Simulation Size).

This makes intuitive sense: the greater the variance of the random function, the more difficult it is to find its expected value. Given the difficulty level σ, the error converges to zero at a rate of 1/√n. Suppose n = 10,000; then the error is typically 1% of σ.

Since σ is unknown, it has to be estimated too. With the same simulated samples, it can be estimated by

    σ̂ = √[ (1/n) Σ_{i=1}^n f(xi)² − µ̂² ].    (4.22)

Then the Monte Carlo error is estimated by σ̂/√n, and we can construct an approximate 95% confidence interval,

    [µ̂ − 1.96 σ̂/√n,  µ̂ + 1.96 σ̂/√n],

for the true but unknown µ = E[f(x̃)].

4.2.2 VaR

Consider first how to compute the VaR (value-at-risk) of a portfolio. VaR provides a single number that measures the total risk of a portfolio of various financial assets. It answers the question: at what loss level are we X% confident that the loss will not be exceeded in N business days? In practice, people often compute VaR with X = 99 and N = 10. Mathematically, the portfolio is a function of random variables, and we need to compute a cutoff point in the distribution of the value of the portfolio such that, with 99% probability, the value of the portfolio is greater than the cutoff.
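The VaR computation just described can be sketched directly by simulation. The annualized inputs for the three-stock normal portfolio below are illustrative numbers, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative annualized moments for three stocks (not from the text)
mu = np.array([0.08, 0.10, 0.12])
Sigma = np.array([[0.04, 0.01, 0.01],
                  [0.01, 0.09, 0.02],
                  [0.01, 0.02, 0.16]])
w = np.array([0.5, 0.3, 0.2])            # portfolio weights

# Scale the annualized moments to a 10-trading-day horizon
h = 10 / 252
r = rng.multivariate_normal(h * mu, h * Sigma, size=100_000)
Rp = r @ w                               # simulated 10-day portfolio returns

# 99% 10-day VaR: the loss at the worst-1% cutoff of the return distribution
var_99 = -np.percentile(Rp, 1)
print(f"10-day 99% VaR: {var_99:.2%}")
```

With 99% confidence, the 10-day loss should not exceed the printed amount; the same recipe works for any distribution we can sample from.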
Suppose we have a portfolio of three stocks whose returns follow normal distributions. Let µ and Σ be the expected returns and covariance matrix, and w the portfolio weights. Assume µ and Σ are annualized; then we first need to compute the 10-day expected returns and covariance matrix, which are

    µ_10day = (10/252) × µ,    Σ_10day = (10/252) × Σ.

The returns over the next 10 days then have the normal distribution r ∼ N(µ_10day, Σ_10day), and the portfolio return is

    Rp = w1 r1 + w2 r2 + w3 r3.

We can then generate tens of thousands of random returns from this distribution. The worst 1% return cutoff is the VaR. The in-class exercise and the accompanying Python code show all the details.

Indeed, with 99% confidence, we should not lose more than that amount. Note that the procedure applies to any distribution as long as we can generate samples from it. If the portfolio has options or derivatives, we first generate the underlying risk variables/shocks, and then compute the returns.

4.2.3 Option pricing

Simulation, or Monte Carlo simulation as it is often called, can easily be applied to value all European options. It can also be used to value American options, but the procedure is very complex.

Consider, for example, the valuation of a standard call option on a non-dividend-paying stock with parameters

    S = 50, X = 50, T = 0.25, r = 10%, σ = 30%,

i.e., the current price is 50, the riskfree rate is 10% (continuous compounding), the volatility is 30% (of the continuous stock return), the strike price is 50, and the expiration is 3 months. The call price is easy to compute from the Black-Scholes formula:

    C = S N(d1) − X e^{−rT} N(d2),    (4.23)

where

    d1 = [ln(S/X) + (r + σ²/2)T] / (σ√T),    d2 = d1 − σ√T,    (4.24)

and N(d) is the standard normal distribution function.
Indeed, it is straightforward to code the formula in Python:

    import numpy as np
    import scipy.stats as si

    # define a function to compute the standard call with no dividend
    def BS_call(S, X, T, r, sigma):
        # S: spot price; X: strike; T: time to maturity; r: riskfree rate; sigma: vol
        d1 = (np.log(S / X) + (r + 0.5 * sigma ** 2) * T) / (sigma * np.sqrt(T))
        d2 = d1 - sigma * np.sqrt(T)
        N1 = si.norm.cdf(d1, 0.0, 1.0)
        N2 = si.norm.cdf(d2, 0.0, 1.0)
        call = S * N1 - X * np.exp(-r * T) * N2
        return call

Then a value of 3.6104 is obtained.

Alternatively, one can easily compute the price by Monte Carlo. Recall from option pricing that, to derive the Black-Scholes formula, the stock price is assumed to follow a geometric Brownian motion, i.e., the stock price is lognormally distributed, or the continuously compounded returns are independently normally distributed over time, and in particular,

    ln(S_T / S) ∼ N[µT, σ²T],    (4.25)

where µ = r − σ²/2 is the risk-neutral expected return. That is,

    S_T = S e^{µT + σ√T z̃},    (4.26)

where z̃ follows the standard normal distribution. Hence, we can draw M = 10,000, say, random numbers z1, z2, ..., zM from the standard normal, and then compute M random terminal prices

    S_T^m = S e^{µT + σ√T z_m},    m = 1, 2, ..., M.    (4.27)

At each terminal price, the call is clearly worth the present value of the payoff,

    c_m = e^{−rT} (S_T^m − X)^+.    (4.28)

[Recall (S − X)^+ is defined as S − X if S > X and 0 otherwise, the payoff of the option.] Then the average value over all the simulated prices,

    c = (c1 + c2 + ··· + c_M) / M,    (4.29)

is the Monte Carlo approximation of the true call price. The greater M is, the greater the accuracy. Note that, numerically, you can pull the discounting term e^{−rT} out of (4.28) and discount once at the end in (4.29); this saves some computational time.
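As a check, here is a minimal Monte Carlo sketch of (4.26)-(4.29) compared against the Black-Scholes value. The last lines also price a non-standard power payoff in the spirit of (4.30); its strike is set at-the-money in the powered price, which is an illustrative choice, not one made in the text:

```python
import numpy as np
from scipy.stats import norm

S, X, T, r, sigma = 50.0, 50.0, 0.25, 0.10, 0.30

# Black-Scholes benchmark, eq. (4.23)-(4.24)
d1 = (np.log(S / X) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
d2 = d1 - sigma * np.sqrt(T)
bs = S * norm.cdf(d1) - X * np.exp(-r * T) * norm.cdf(d2)

# Monte Carlo under the risk-neutral lognormal, eq. (4.26)-(4.29)
rng = np.random.default_rng(0)
M = 100_000
z = rng.standard_normal(M)
ST = S * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * z)
mc = np.exp(-r * T) * np.maximum(ST - X, 0.0).mean()
print(bs, mc)       # both close to 3.61

# Only the payoff changes for a non-standard option such as (4.30);
# the strike S**pi is a hypothetical at-the-money choice for illustration
Xp = S ** np.pi
power_price = np.exp(-r * T) * np.maximum(ST ** np.pi - Xp, 0.0).mean()
```

The same simulated terminal prices serve both payoffs; that reuse is exactly the generality the text emphasizes.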
But this may be confusing to an inexperienced programmer, and one may simply use the formulas above as given.

What is the advantage of Monte Carlo? It is generally applicable, while formulas may not be available for many non-standard options. As a theoretical example, consider an option with a payoff of

    f(S_T) = (S_T^π − X)^+    (4.30)

at maturity. The payoff is similar to that of the standard option, but different in that the stock price now carries a power of the irrational number π. In this case there is no formula, as the usual technique for deriving the Black-Scholes solution fails. However, the Monte Carlo method is as easy as above: simply replace the payoff function with the new one when computing the present value c_m.

In the previous examples, the option is a function of the terminal price alone, and hence drawing the stock price at time T is sufficient. However, many options have complex payoff functions that require drawing the price path over time. An example is the lookback call option, which allows the holder to exercise with proceeds equal to the difference between the highest price during the option's life and the strike price.

Let C be the price of the lookback call. The option is not a standard option and cannot be evaluated by the Black-Scholes formula. There does exist a complex analytical formula, but simulation is much easier to compute. If one instead defines the payoff as the maximum price minus its mean, then no formula is available at all, while the simulation approach remains easy, with only a little added complexity.

Now we need to draw a path of the stock prices. Making the same geometric Brownian motion, or lognormal, assumption on the stock price, the stock price is lognormally distributed conditional on its past:

    ln(S_{t+1}) ∼ N[ln(S_t) + µ/12, σ²/12].

So,

    ln(S1) ∼ N[ln(S0) + µ/12, σ²/12],
    ln(S2) ∼ N[ln(S1) + µ/12, σ²/12],

etc., and

    ln(S12) ∼ N[ln(S11) + µ/12, σ²/12],

where S0 = S is the current stock price.
That is, we now draw a stock price path: next month, S1 (draw y1 from N[ln(S0) + µ/12, σ²/12] and set S1 = e^{y1}); the second month, S2; and so on. We then have the simulated stock prices from the 1st month to the end of the year: S1, ..., S12. Let S*_1 be the maximum of these prices; the payoff of the call option is

    c1 = S*_1 − X, if S*_1 > X; and zero otherwise.

Next, we can repeat the above process to get another path of the stock price, and obtain S*_2 and c2. Continuing in this way 10 times, we get c1, c2, ..., c10. Recall that the call price should be equal to its expected payoff discounted back to today:

    C = e^{−rT} × (Expected payoff).

By the law of large numbers, the expected payoff is the probability limit of the average of the simulated payoffs, and so

    c = e^{−rT} [ (c1 + c2 + ··· + c10) / 10 ]

should be an approximation of the true price, taken over the 12 monthly intervals.

Here we have simulated the path of the stock price at monthly intervals. In practice, to achieve high accuracy, we may have to simulate the path at much smaller intervals, say daily or hourly. In addition, we would not do just 10 simulations as done here; usually 10,000 simulations give quite accurate results.

Formally, we can obtain the price path at interval 1/n (n = 12 in the above) by simulating n prices one after another from

    ln(S_{t+1}) ∼ N[ln(S_t) + µ/n, σ²/n],    t = 0, 1, 2, ..., (n − 1).

Based on these prices, we can evaluate the payoff of the option, c_i (in the i-th simulation). Suppose we do m simulations in total (m = 10 in the above); then we get m payoffs, c1, c2, ..., cm, and the call price is given by

    C_m = e^{−rT} (c1 + c2 + ··· + cm) / m.    (4.31)

When n and m are large enough, the call price computed from above converges to the true theoretical price under the standard diffusion, or geometric Brownian motion, or lognormal distribution assumption.
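The path-drawing procedure and (4.31) can be sketched as follows, assuming the same parameters as the standard call example above (S = 50, X = 50, r = 10%, σ = 30%, a one-year life with monthly steps); this is a sketch, not a production pricer:

```python
import numpy as np

S0, X, r, sigma = 50.0, 50.0, 0.10, 0.30
n, m = 12, 10_000            # 12 monthly steps, 10,000 simulated paths
dt = 1.0 / n
mu = r - 0.5 * sigma**2      # risk-neutral drift of the log price

rng = np.random.default_rng(0)
z = rng.standard_normal((m, n))

# Build log-price paths: ln S_{t+1} = ln S_t + mu*dt + sigma*sqrt(dt)*z
log_paths = np.log(S0) + np.cumsum(mu * dt + sigma * np.sqrt(dt) * z, axis=1)
S = np.exp(log_paths)

# Lookback payoff: highest price along each path minus the strike, if positive
S_max = S.max(axis=1)
payoff = np.maximum(S_max - X, 0.0)

# Discounted average payoff, eq. (4.31)
price = np.exp(-r * 1.0) * payoff.mean()
print(price)
```

Replacing `S_max - X` with any other path-dependent payoff reprices the option with no other change to the code.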
Theoretically, the rate of convergence is of order (1/√m + 1/n).

The above examples are options on a single stock. The same approach applies to options on multiple stocks or portfolios. In this case, the only difference is that we now draw random samples from a multivariate distribution based on the Cholesky decomposition.

The Monte Carlo simulation approach relies on the risk-neutral valuation principle. It applies to virtually any European option. With some complex modifications, it can be applied to American options. Overall, its simplicity and generality make it appealing to a great number of practitioners. Glasserman (2004), for example, provides more theory, such as various related methods with accelerated convergence, and Hilpisch (2015) has extensive examples of Python implementations.

4.3 Bootstrap

The bootstrap method, introduced by Efron (1979), is a computation-intensive method for estimating the distribution of an estimator or a test statistic by resampling the data at hand. It treats the data as if they were the population. Under mild regularity conditions, the bootstrap generally yields an approximation to the sampling distribution of an estimator or test statistic that is at least as accurate as the approximation obtained from traditional first-order asymptotic theory (see, e.g., Horowitz (1997)).

4.3.1 Estimating standard error

The idea of the bootstrap is very simple. Consider the problem of estimating the standard error of an estimator. Suppose we have iid excess return data, x1, ..., xT, and compute the Sharpe ratio

    κ̂ = x̄ / s,

where x̄ is the sample mean and s is the sample standard deviation,

    s² = [1/(T − 1)] Σ_{t=1}^T (x_t − x̄)².

But how accurate is κ̂? Even if the data are normally distributed, the standard error of κ̂ is not easy to derive. However, it can easily be estimated via a bootstrap:

1. Draw T observations, (x*_1, ..., x*_T), randomly from the original data set with replacement;

2.
Compute the Sharpe ratio for the resampled data, κ̂*, and save the result as y_j = κ̂*;

3. Repeat the above B times, say B = 10,000, to obtain all the y_j's (j = 1, 2, ..., B);

4. Compute the standard deviation of the y_j's.

The standard deviation in Step 4 is the bootstrap approximation of the standard error of κ̂.

Why does it work? Statistically, the variance of κ̂ is defined as the integral of the squared difference from the true Sharpe ratio,

    var(κ̂) = ∫ (κ̂ − µ/σ)² dF(x1, ..., xT) ≈ ∫ (κ̂ − µ/σ)² dF*(x1, ..., xT),    (4.32)

where F(x1, ..., xT) is the true and generally unknown distribution of the data, and F* is the empirical distribution that assigns equal probability to each data point,

    F*(x = x_t) = 1/T,    t = 1, ..., T.    (4.33)

In other words, we obtain the approximation in (4.32) by the so-called bootstrap plug-in principle: we replace the unknown distribution by the empirical distribution, and can then evaluate any statistic based on the latter. To compute (4.32), we use Monte Carlo simulation to draw, say, B = 10,000 samples from F*(x1, ..., xT); this is exactly what resampling with replacement does! Then the variance is approximated by

    var(κ̂) ≈ [1/(B − 1)] Σ_{j=1}^B (κ̂*_j − µ/σ)².

Replacing the unknown µ/σ by its bootstrap average, we obtain the result of Step 4.

The above procedure is also often applied to bias correction. The idea is that a statistic, such as the standard deviation, can be computed based on the data or based on the bootstrapped data. Under certain regularity conditions, the bootstrapped estimator will be better, and the difference between the two is called the bias correction. However, it should be pointed out that if the iid assumption is violated, or even under iid if the skewness or kurtosis is high, there is no guarantee that the bootstrapped statistic will always be better.
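The four steps above can be sketched as follows; the return series is simulated stand-in data, since no particular data set is assumed here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated iid monthly excess returns standing in for real data
x = rng.normal(0.01, 0.05, size=120)
T = x.size
sharpe = x.mean() / x.std(ddof=1)

# Steps 1-3: resample with replacement B times and recompute the ratio
B = 10_000
idx = rng.integers(0, T, size=(B, T))
xs = x[idx]
sharpe_star = xs.mean(axis=1) / xs.std(axis=1, ddof=1)

# Step 4: the standard deviation of the bootstrap replications
se_boot = sharpe_star.std(ddof=1)
print(sharpe, se_boot)
```

Vectorizing the resampling into one (B, T) index array keeps the whole bootstrap to a few lines.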
Without the iid assumption, a block bootstrap (see, e.g., Shao and Tu, 1995, Chapter 9) should be used to capture serial correlations. Indeed, while iid is not a bad assumption for many asset returns, it is unlikely to be always true for the returns on a trading strategy or a managed fund.

4.3.2 Estimating confidence intervals

Now let us keep the iid assumption, but relax normality. Then the usual confidence interval (see Section 1.3) is questionable. In this case, the bootstrap can be used to find a confidence interval, which can be more accurate in small samples. The procedure has four easy steps:

1. Draw T observations, (x*_1, ..., x*_T), randomly from the original data set with replacement;

2. Compute the sample mean for the data drawn, x̂*, and save the result y_j = x̂*;

3. Repeat the above B times, say B = 10,000, to obtain all the y_j's (j = 1, 2, ..., B);

4. Compute the 2.5% and 97.5% percentiles, δ_0.025 and δ_0.975, of the y_j's.

The result, [δ_0.025, δ_0.975], is our estimate of the 95% confidence interval (e.g., if B = 1,000, δ_0.025 is the 25th value and δ_0.975 the 975th after the y_j's are sorted from lowest to highest; if B = 10,000, the 250th and 9,750th).

The above is known as the bootstrap percentile approach. Interestingly, it does not use the sample mean x̂ at all. This approach may be used to approximate a confidence interval for any statistic. However, mathematically, as argued by Rice (2007, p. 285), it will be much more accurate to use the centered bootstrap:

1. Draw T observations, (x*_1, ..., x*_T), randomly from the original data set with replacement;

2. Compute the mean for the data drawn, x̂*, and save the result y_j = x̂* − x̂;

3. Repeat the above B times, say B = 10,000, to obtain all the y_j's (j = 1, 2, ..., B);

4. Compute the 2.5% and 97.5% percentiles, η_0.025 and η_0.975, of the y_j's.
The result,

    [x̂ − η_0.975, x̂ − η_0.025],    (4.34)

is our estimate of the 95% confidence interval. Although this looks quite different from the previous one, as pointed out by Rice, the two methods are equivalent if the bootstrap distribution is symmetric.

To understand the expression (4.34), note that the centered bootstrap uses the distribution of x̂* − x̂ to approximate that of x̂ − µ, because x̂ is the true mean of the bootstrapped data while µ is the true mean of the data. Hence,

    0.95 = Prob(η_0.025 < x̂* − x̂ < η_0.975)
         ≈ Prob(η_0.025 < x̂ − µ < η_0.975)
         = Prob(η_0.025 − x̂ < −µ < η_0.975 − x̂)
         = Prob(x̂ − η_0.975 < µ < x̂ − η_0.025),

which is exactly (4.34), the 95% probability interval covering µ.

The analysis above shows that, if the distributions of x̂* − x̂ and x̂ − µ are not close, the bootstrap can be inaccurate. However, even if they are not, the distributions of the standardized versions, (x̂* − x̂)/σ̂* and (x̂ − µ)/σ, could be, where σ and σ̂* are the corresponding standard deviations. Suppose σ̂* is an estimate of the standard deviation from a bootstrap as we did before. Then we can do another, studentized, bootstrap:

1. Draw T observations, (x*_1, ..., x*_T), randomly from the original data set with replacement;

2. Compute the mean for the data drawn, x̂*, and save y_j = (x̂* − x̂)/σ̂*;

3. Repeat the above B times, say B = 10,000, to obtain all the y_j's (j = 1, 2, ..., B);

4. Compute the 2.5% and 97.5% percentiles, τ_0.025 and τ_0.975, of the y_j's.

The 95% confidence interval is

    [x̂ − σ̂* τ_0.975, x̂ − σ̂* τ_0.025].    (4.35)

Note that the studentized bootstrap is so named because (x̂* − x̂)/σ̂* approximates a t distribution. It is also computationally more demanding, as it requires a bootstrap to compute σ̂* first. However, computational time is rarely a concern today, and the greater accuracy makes the studentized bootstrap more useful in practice.
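The percentile and centered intervals can be compared directly in code; again the data are a simulated stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.01, 0.05, size=120)        # stand-in iid return data
T, B = x.size, 10_000
xbar = x.mean()

idx = rng.integers(0, T, size=(B, T))
means = x[idx].mean(axis=1)                 # bootstrap sample means

# Percentile interval: quantiles of the bootstrap means themselves
lo_p, hi_p = np.percentile(means, [2.5, 97.5])

# Centered interval, eq. (4.34): quantiles of (xbar* - xbar), flipped around xbar
eta_lo, eta_hi = np.percentile(means - xbar, [2.5, 97.5])
lo_c, hi_c = xbar - eta_hi, xbar - eta_lo

print((lo_p, hi_p), (lo_c, hi_c))
```

For a nearly symmetric bootstrap distribution, as here, the two intervals almost coincide, which illustrates Rice's equivalence remark.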
For more discussion of the bootstrap, see Efron (1979), Shao and Tu (1995), Horowitz (1997) and Rice (2007). For an application of the bootstrap to testing the CAPM, see Chou and Zhou (2006).

4.3.3 Bootstrapping portfolio weights

Just as the bootstrap applies to estimating a standard deviation, it applies to estimating any function of the parameters, the optimal portfolio weights in particular. Consider the case in which we have no constraints and the riskfree asset is available; then the optimal portfolio formula is (see (2.45))

    w* = (1/γ) Σ^{−1} µ.    (4.36)

When data are available, we can compute the sample mean and sample covariance matrix (assumed invertible here) to get the plug-in rule (see (3.11)),

    ŵ = (1/γ) Σ̂^{−1} µ̂.    (4.37)

As we discussed before, this rule often does not do well in practice due to estimation errors. Since the bootstrap can improve small-sample performance, it is natural to bootstrap the data to get a bootstrapped rule. Let χ = (x1, ..., xT) be the original data set. We can resample with replacement to get another set, χ^(1) = (x^(1)_1, ..., x^(1)_T). With this data set, we get another plug-in rule, ŵ^(1). Continuing this n times, we obtain

    ŵ_boot = [ŵ^(1) + ŵ^(2) + ··· + ŵ^(n)] / n,    (4.38)

which is known as the bootstrapped portfolio investment rule.

While it is unclear whether they are the first, Michaud and Michaud (2008) apply the bootstrap to obtain the resampled efficient frontier, which is simply the average of the frontiers from the resampled data. They filed a U.S. patent for it. Note that the plug-in rule is usually the worst performer in many applications, so beating it does not say much. Theoretically, there is no reason that the bootstrapped rule should outperform the other rules we examined earlier. Samples of typical applications will be given in the Python example.
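A minimal sketch of the bootstrapped rule (4.37)-(4.38), with simulated returns standing in for real data and an illustrative risk aversion γ = 3:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated iid excess returns for 3 assets standing in for real data
T, gamma = 240, 3.0
true_mu = np.array([0.005, 0.007, 0.006])
true_V = np.array([[0.0025, 0.0005, 0.0003],
                   [0.0005, 0.0036, 0.0006],
                   [0.0003, 0.0006, 0.0049]])
X = rng.multivariate_normal(true_mu, true_V, size=T)

def plug_in(data, gamma):
    # The plug-in rule, eq. (4.37): (1/gamma) * V_hat^{-1} mu_hat
    mu_hat = data.mean(axis=0)
    V_hat = np.cov(data, rowvar=False)
    return np.linalg.solve(V_hat, mu_hat) / gamma

w_plug = plug_in(X, gamma)

# Average the plug-in rule over bootstrap resamples, eq. (4.38)
B = 1_000
w_boot = np.mean([plug_in(X[rng.integers(0, T, size=T)], gamma)
                  for _ in range(B)], axis=0)
print(w_plug, w_boot)
```

The averaging tends to smooth the extreme weights that estimation error produces in the plug-in rule, though, as noted above, this does not guarantee better out-of-sample performance.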
4.4 Shrinkage estimation

The mean and covariance matrix, µ and V, of asset returns R_t are fundamental in many financial decisions such as asset allocation or the computation of VaR. In this subsection, we discuss the properties of the sample average estimators, which are also known as moment estimators or maximum likelihood estimators. Then we discuss shrinkage estimators that provide improved performance.

4.4.1 Sample averages

Under the assumption that R_t is iid (or stationary in general), the mean and covariance matrix, µ and V, can simply be estimated by their sample analogues,

    µ̂ = (1/T) Σ_{t=1}^T R_t,    (4.39)

    V̂ = (1/T) Σ_{t=1}^T (R_t − µ̂)(R_t − µ̂)′,    (4.40)

where R_t is an n-vector of asset returns, and hence µ̂ is also an n-vector and V̂ is an n × n matrix. It should be noted that the sample covariance matrix is not invertible if T ≤ n. In the high-dimensional case when n is large, even when T > n, the estimator V̂ can be very inaccurate, with many eigenvalues that are too small (see Section 6.2.6).

The above estimators are known as moment estimators because we simply replace the theoretical expectations (moments) on the right-hand side of

    µ = E[R_t],    (4.41)

    V = E[(R_t − µ)(R_t − µ)′],    (4.42)

by their sample counterparts as estimators of the left-hand side. They are also known as maximum likelihood (ML) estimators, which are the most efficient among all unbiased estimators (achieving the so-called Cramér-Rao bound).⁵

By the law of large numbers, µ̂ and V̂ must converge to µ and V as the sample size T increases to infinity. The question is how to assess the accuracy. Asymptotically, for independently and identically distributed returns,

    µ̂ ∼ N(µ, V/T).    (4.43)

So the square roots of the diagonal elements of V̂/T provide the standard errors for µ̂, which indicate how far the estimates might deviate from the true µ. However, it is more difficult to assess the standard errors for V̂.
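The estimators (4.39)-(4.40) and the standard errors implied by (4.43) can be sketched as follows, with simulated returns standing in for real data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated iid returns for n = 4 assets over T = 600 periods (illustrative)
true_mu = np.array([0.004, 0.006, 0.005, 0.007])
A = rng.standard_normal((4, 4)) * 0.03
true_V = A @ A.T + 0.0004 * np.eye(4)
R = rng.multivariate_normal(true_mu, true_V, size=600)

T = R.shape[0]
mu_hat = R.mean(axis=0)                        # eq. (4.39)
V_hat = (R - mu_hat).T @ (R - mu_hat) / T      # eq. (4.40), divisor T

# Standard errors of the mean: square roots of diag(V_hat / T), from (4.43)
se_mu = np.sqrt(np.diag(V_hat) / T)
print(mu_hat, se_mu)
```

Note how large the standard errors are relative to typical mean returns; this is the estimation-error problem that motivates the shrinkage estimators below.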
One of the difficulties is that V̂ is an n × n (almost surely positive definite) symmetric matrix with n(n + 1)/2 distinct elements. Under the normality assumption,⁶ it is known that

    V̂ ∼ W_n(T − 1, V/T),    (4.44)

where W_n(T − 1, V/T) denotes a Wishart distribution with T − 1 degrees of freedom and covariance matrix V/T. In general, a Wishart distribution, denoted W_n(ν, Σ), is defined as the distribution of a sum of ν matrices,

    W = X1 X1′ + X2 X2′ + ··· + Xν Xν′,    (4.45)

where X1, ..., Xν are ν independent normal vectors, X_i ∼ N(0, Σ). It is a generalization of the usual chi-squared distribution to n-dimensional space.

To write down the standard errors for the covariances, we need to introduce two popular matrix operators. The first is vec, which vectorizes any matrix into a vector by stacking the columns one on top of the other,

    vec(A) ≡ [ A1 ]             [ a11   a12   ...   a1n ]
             [ A2 ]    if  A =  [ a21   a22   ...   a2n ]  ≡ (A1, A2, ..., An).    (4.46)
             [ ...]             [ ...   ...   ...   ... ]
             [ An ]             [ am1   am2   ...   amn ]

The second operator, ⊗, known as the Kronecker product, turns two matrices into a larger matrix,

    A ⊗ B ≡ [ a11 B   a12 B   ...   a1n B ]
            [ a21 B   a22 B   ...   a2n B ]
            [ ...     ...     ...   ...   ].    (4.47)
            [ am1 B   am2 B   ...   amn B ]

Then the covariance matrix of the sample covariances is approximately

    Var[vec(V̂)] = (1/T) (I_{n²} + K_{nn}) (V̂ ⊗ V̂),    (4.48)

where I_{n²} is the identity matrix of order n² and K_{nn} is the commutation matrix such that vec(A) = K_{nn} vec(A′) for any order-n matrix A.

4.4.2 Mean shrinkage: Stein estimators

In estimating µ by µ̂, a standard measure of loss of efficiency is the squared error,

    Loss(µ̂, µ) = (µ̂ − µ)′(µ̂ − µ) = Σ_{j=1}^N (µ̂_j − µ_j)²,    (4.49)

where N is the number of assets. Geometrically, it is the squared distance between the two points µ̂ and µ. The closer µ̂ is to µ, the smaller the loss.

⁵ Exercise: when n = 1, verify that the ML estimator is indeed as given above.
⁶ We refer the reader to Muirhead (1982, Chapter 1) for the general case.
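A small simulation makes the loss (4.49) concrete and previews why shrinkage helps. The true means, the shrinkage weight of 0.5, and all other numbers below are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed true means and iid normal returns for N = 10 assets (illustrative)
N, T = 10, 60
mu = rng.normal(0.006, 0.002, size=N)
sigma = 0.05

loss_mean, loss_shrunk = 0.0, 0.0
n_sim = 2_000
for _ in range(n_sim):
    R = rng.normal(mu, sigma, size=(T, N))
    mu_hat = R.mean(axis=0)
    # Shrink halfway toward the grand mean (a fixed, hypothetical alpha = 0.5)
    b_hat = mu_hat.mean() * np.ones(N)
    mu_shrunk = 0.5 * mu_hat + 0.5 * b_hat
    loss_mean += np.sum((mu_hat - mu) ** 2)     # loss (4.49) of the sample mean
    loss_shrunk += np.sum((mu_shrunk - mu) ** 2)

print(loss_mean / n_sim, loss_shrunk / n_sim)
```

In this setting the shrunk estimator has the smaller average loss: the bias it introduces is small relative to the variance it removes, which is exactly the trade-off discussed next.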
For a long time, µ̂ was considered the best estimator, until Stein (1955) published his path-breaking paper proving the contrary: the sample mean does not have the smallest expected loss. In general, we can consider an estimator of the form

    µ̂_S = (1 − α) µ̂ + α b̂.    (4.50)

This is known as the James-Stein shrinkage estimator, which shrinks the sample mean toward a target estimator b̂. When α = 0, there is no bias, but the mean squared error can be high. When α ≠ 0 but small, there is a bias, but the variance contributed by αb̂ can be small. Hence, it is a matter of trade-off between bias and variance. Under the multivariate normality assumption, the optimal choice of α is

    α = (1/T) (N λ̄ − 2λ1) / [(µ̂ − b̂)′(µ̂ − b̂)],    N > 2,    (4.51)

where λ̄ is the average of the eigenvalues of V and λ1 is the largest. So, when the sample size T is small, α weights heavily toward the target. However, α becomes smaller and smaller as the sample size gets large, so that the estimator is essentially µ̂ for large T.

It is of interest to see a special case in which the asset returns are independent and have unit variances, and to assume b̂ = 0. Then V is the identity matrix, and both λ̄ and λ1 are 1, so we have

    µ̂_S,j = [ 1 − (1/T)(N − 2) / Σ_{j=1}^N µ̂_j² ] µ̂_j,

that is, Stein's estimator of the j-th asset mean is the usual sample mean shrunk toward 0 by the first factor. As the sample size T becomes large, it gets closer to the sample mean.

There are two popular choices of b̂ in practice. The first is to use the average of the mean estimates across assets, known as the grand mean,

    b̂ = [ (1/N) Σ_{i=1}^N µ̃_i ] × 1_N,    (4.52)

where µ̃_i is some prior mean estimate for asset i and 1_N is an N-vector of ones, so that b̂ is an N-vector scaled by the average of the asset means.⁷ Another choice is due to Jorion (1986), who suggests

    b̂ = [ 1_N′ V̂^{−1} µ̂ / 1_N′ V̂^{−1} 1_N ] × 1_N.    (4.53)

For additional estimators and the theory, see Lehmann and Casella (1998), Maruyama (2004) and Kan and Zhou (2007).

⁷ Theoretically, µ̃_i should be independent of µ̂_i. But in practice one may take them as the same, and then the James-Stein estimator shrinks toward the grand mean.

4.4.3 Covariance shrinkage

The covariance matrix is important not only for optimal portfolio construction, but also for risk forecasting of a portfolio. The reason for the latter is that, if w is one's portfolio weight, then regardless of how w is chosen, the risk of the portfolio, say next month, is σ²_p = w′Σw. Since Σ is unknown, it must be estimated today to make the forecast. However, Menchero and Li (2020) show that shrinkage may not be needed for risk forecasting, although it is always important for portfolio selection.

Traditionally, the shrinkage is carried out directly on the covariance matrix. Following Meucci (2005, p. 208), who in turn follows Ledoit and Wolf (2003), the shrinkage estimator of the covariance shrinks the sample covariance matrix toward a target,

    V̂_S = (1 − α) V̂ + α Ĉ,    (4.54)

where the target is

    Ĉ = ( Σ_{i=1}^N λ̂_i / N ) × I_N,    (4.55)

with λ̂_i the i-th largest eigenvalue of V̂ and I_N the identity matrix of order N; and the weight is

    α = (1/T) × { (1/T) Σ_{t=1}^T tr[(R_t R_t′ − V̂)²] } / tr[(V̂ − Ĉ)²],    (4.56)

where "tr" is the trace operator, which takes the trace (sum of diagonal elements) of a matrix.

In practice, due to concerns about the stability of the parameters, the effective sample size cannot be very large, and hence the shrinkage approach adds value in producing better estimates. Very often we need to estimate the covariance matrix in high dimensions, which is a topic of ongoing research. PCA and factor analysis, discussed in Sections 6.2.6 and 6.4, are relevant. Pourahmadi (2013) provides an easily accessible analysis. Recently, Ledoit and Wolf (2017) provided yet another shrinkage estimator.
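The covariance shrinkage of (4.54)-(4.56) can be sketched as follows; the dimensions and simulated data are illustrative, and the clipping of α to [0, 1] is a safety assumption, not part of the formula:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated demeaned returns: N = 10 assets, a modest T = 60 observations
N, T = 10, 60
A = rng.standard_normal((N, N)) * 0.02
true_V = A @ A.T + 0.0001 * np.eye(N)
R = rng.multivariate_normal(np.zeros(N), true_V, size=T)

V_hat = R.T @ R / T                      # sample covariance, mean taken as zero

# Shrinkage target, eq. (4.55): identity scaled by the average eigenvalue
C_hat = (np.trace(V_hat) / N) * np.eye(N)

# Shrinkage weight, eq. (4.56), clipped to [0, 1] for safety
num = np.mean([np.trace((np.outer(r, r) - V_hat) @ (np.outer(r, r) - V_hat))
               for r in R])
den = np.trace((V_hat - C_hat) @ (V_hat - C_hat))
alpha = min(1.0, max(0.0, num / (T * den)))

V_shrunk = (1 - alpha) * V_hat + alpha * C_hat    # eq. (4.54)
print(alpha)
```

Because the target is a multiple of the identity, the shrunk matrix has the same eigenvectors as V̂ but eigenvalues pulled toward their average, which raises the smallest ones.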
However, empirically, the evidence from Pedersen, Babu, and Levine (2020) appears to suggest that shrinking the correlation matrix is better than shrinking the covariance matrix in portfolio optimization.

4.4.4 Use of correlation shrinkage

Recall that Pedersen, Babu, and Levine (2020) argue for the use of correlation shrinkage. As the covariance matrix can be decomposed as a product of the vol matrix, the correlation matrix and the vol matrix, we can shrink the correlation matrix toward the identity matrix (so that the correlations are shrunk toward zero). The details are given in Section 3.5.5.

While it is known that small eigenvalues cause problems for portfolios, it is less obvious why making the correlations smaller helps. Intuitively, the smaller the correlations, the closer the covariance matrix is to a diagonal matrix. Since a diagonal matrix has the original asset variances as its eigenvalues, it then seems unlikely that the estimated smallest eigenvalue can be too small. Below we verify this intuition in a simple example.

Assume there are two assets, and consider the simple case in which both assets have variance 1. Then their covariance matrix is the same as the correlation matrix,

    Σ = [ 1   ρ ]
        [ ρ   1 ],    (4.57)

where ρ is the correlation. In this case, det(Σ) = 1 − ρ² = λ1λ2 and tr(Σ) = 2 = λ1 + λ2 imply that

    λ1 = 1 + ρ,    λ2 = 1 − ρ

are the two eigenvalues. Assume ρ > 0, so that λ2 is the smaller eigenvalue. It is then clear that when the correlation is over-estimated in practice, λ2 will be under-estimated, i.e., too small; shrinking the correlation toward zero pushes λ2 back up.

4.4.5 Eigenvalue adjustment

Since the unstable covariance matrix is caused by under-estimated small eigenvalues, why not adjust them directly? Yao, Zheng and Bai (2015, Section 12.5) provide the statistical theory. Recall the eigenvalue decomposition (6.27),

    Σ = [A1, ..., An] [ λ1    0   ...    0 ] [A1, ..., An]′,    (4.58)
                      [  0   λ2   ...    0 ]
                      [ ...  ...   ...  ...]
                      [  0    0   ...   λn ]

where A_i is the eigenvector corresponding to eigenvalue λ_i, and the eigenvectors are orthogonal to each other with unit length. The idea is to divide the eigenvalues into a few groups, replace the values in each group by a common estimate, and then use the above formula to compute an estimate of Σ. See Yao, Zheng and Bai (2015) for the details. Under fairly general conditions, this estimator performs much better than the sample covariance matrix. The eigenvalues are also known as the spectrum of the matrix, and so the above is also called the spectrum-corrected estimator of the covariance matrix.

In particular, as suggested by López de Prado (2020a), one can find a cut-off eigenvalue λ_m, retain the first m estimated eigenvalues, and replace all the remaining small eigenvalues by their average,

    λ̄_s = [1/(n − m)] Σ_{j=m+1}^n λ_j.

The choice of m in practice may be through trial and error; it is close to the number of factors. How well this procedure works is an empirical question, as there is no theory yet.

4.4.6 Exponentially weighted moving averages

Motivated by the idea that recent data are more important than older data, we may want to weight recent observations more heavily when computing our parameter estimates. To do so, we assign a weight of w_t = 1 to the most recent observation, and

    w_{t−1} = λ w_t,  w_{t−2} = λ w_{t−1},  ...,    (4.59)

successively for earlier data, where λ is a prespecified constant. This replaces an equally weighted time-series average with a weighted average whose weights are 1, λ, λ², ..., λ^{t−2}, λ^{t−1}.

The magnitude of λ indicates how information decays. If λ = 0, we care only about today's observation. If λ = 1, we weight all past observations equally. Typically, 0 < λ < 1. The choice of λ is driven by the application and by calibration results. To understand it, consider estimating the expected return of a stock with 3 past observations.
The usual sample mean (equal-weighting) is
\[
\hat{\mu} = \frac{R_t + R_{t-1} + R_{t-2}}{3}.
\]
However, if we believe the recent data are more informative, we may use
\[
\hat{\mu}_W = \frac{R_t + 0.9 R_{t-1} + 0.9^2 R_{t-2}}{1 + 0.9 + 0.9^2}
= \frac{R_t + \lambda R_{t-1} + \lambda^2 R_{t-2}}{1 + \lambda + \lambda^2},
\]
which is the weighted mean with $\lambda = 0.9$, so less weight is put on earlier observations. Note that the weighted sum of the returns is divided by the sum of the weights because only then do the weights on the returns sum to 1.

In practice, however, the weighted mean is rarely applied to the returns themselves, because returns are noisy and mean-reverting, so over-weighting the more recent ones can be counterproductive. Covariances, on the other hand, are much more persistent, and so they are often estimated with the above so-called exponentially weighted moving averages (EWMA) of the data.

Why do we call this weighting exponential? Write $\lambda^m = e^b$; taking logs on both sides gives $b = m \log\lambda$, so $\lambda^m = e^{m \log\lambda}$. Since $\log\lambda < 0$ under the assumption $0 < \lambda < 1$, the weight $\lambda^m$ decays exponentially in $m$ at rate $-\log\lambda$.

In practice, as mentioned by Menchero and Li (2020), one can use daily data over windows of 10 to 150 days to estimate the covariance matrix. The variance is estimated by
\[
\hat{\sigma}_i^2 = \frac{R_{i,t}^2 + \lambda R_{i,t-1}^2 + \lambda^2 R_{i,t-2}^2 + \cdots + \lambda^{t-1} R_{i,1}^2}{1 + \lambda + \lambda^2 + \cdots + \lambda^{t-1}}, \qquad (4.60)
\]
and the covariance by
\[
\hat{\sigma}_{ij} = \frac{R_{i,t} R_{j,t} + \lambda R_{i,t-1} R_{j,t-1} + \lambda^2 R_{i,t-2} R_{j,t-2} + \cdots + \lambda^{t-1} R_{i,1} R_{j,1}}{1 + \lambda + \lambda^2 + \cdots + \lambda^{t-1}}, \qquad (4.61)
\]
where the daily mean is taken to be zero, as in almost all practical computations.

Now imagine that we have an infinite amount of data. Since $1 + \lambda + \lambda^2 + \cdots = 1/(1-\lambda)$, we can write $\hat{\sigma}_i^2$ as $(1-\lambda)$ times the sum of the terms $\lambda^m R_{i,t-m}^2$, which can be split into the first term plus the rest,
\[
\hat{\sigma}_{i,t+1}^2 = (1-\lambda) R_{i,t}^2 + \lambda \hat{\sigma}_{i,t}^2, \qquad (4.62)
\]
where we add the time subscripts to indicate that the previous variance estimate $\hat{\sigma}_{i,t}^2$ is based on data up to $t-1$.
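As a quick numerical sketch of the EWMA formulas (with simulated returns and an illustrative choice $\lambda = 0.94$, so every number here is hypothetical), the direct weighted average (4.60) and the recursive update (4.62) agree once the sample is long:

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.normal(0.0, 0.01, size=500)   # simulated daily returns (hypothetical)
lam = 0.94                            # decay parameter, an illustrative choice

# Direct EWMA variance, as in (4.60): weights 1, lam, lam^2, ... on the squared
# returns, most recent observation first, normalized by the sum of the weights.
w = lam ** np.arange(len(r))
sq_newest_first = r[::-1] ** 2
var_direct = np.sum(w * sq_newest_first) / np.sum(w)

# Recursive EWMA variance, as in (4.62): var <- (1 - lam)*r_t^2 + lam*var.
var_rec = r[0] ** 2                   # initialize with the first squared return
for x in r[1:]:
    var_rec = (1 - lam) * x ** 2 + lam * var_rec

# With lam = 0.94 and 500 observations, lam^T is negligible, so the two
# estimates are essentially identical.
```

The same recursion with cross products $R_{i,t} R_{j,t}$ in place of the squared returns gives the covariance update (4.63).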
Similarly, we have the covariance recursion
\[
\hat{\sigma}_{ij,t+1} = (1-\lambda) R_{i,t} R_{j,t} + \lambda \hat{\sigma}_{ij,t}. \qquad (4.63)
\]
Both (4.62) and (4.63) say that we can recursively update the estimates from the past.

The role of $\lambda$ is easy to see from (4.62). The first term captures the volatility reaction to current market events, and the second term captures persistence: no matter what happens today, the term $\lambda \hat{\sigma}_{i,t}^2$ says that high volatility estimated yesterday is likely to mean high volatility tomorrow. The greater the $\lambda$, the greater the persistence. A rule-of-thumb choice of $\lambda$ is between 0.75 and 0.98 for most markets (Alexander, 2001, p. 60). Generally, $\lambda$ is greater for long-term forecasts and smaller for short-term forecasts. Mathematically, the EWMA is equivalent to an I-GARCH model without intercept.

4.4.7 GS covariance matrix estimator

So far the covariance matrix has been estimated at a fixed sampling frequency. Suppose we are interested in the monthly covariance matrix. The estimates are then computed from monthly data, and the accuracy increases as more monthly data are used. Theoretically, the estimate converges to the true value as the sample size goes to infinity. In practice, however, the covariances and volatilities change over time, so data from too long ago may not be relevant, implying that the usable sample size may not be that large.

Researchers at Goldman Sachs (see Litterman, 2003, Chapter 16) suggest using daily data to improve the accuracy. To estimate the vol this month, the idea is to use not only the information in many past months, but also more frequent data within the months; here we use daily data.

Theoretically, as shown by Merton (1980), if stock prices follow a diffusion process or are iid lognormal, then the use of more frequent data does not help in estimating the mean, but it does help in estimating the variance more accurately.
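Merton's point can be illustrated with a small simulation (all parameters below are hypothetical). Subsampling one simulated log-price path at a coarser frequency leaves the mean estimate unchanged, because it telescopes to $\log(P_T/P_0)/T$, while the variance estimate genuinely uses every observation:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, T, n = 0.08, 0.20, 10, 2520     # 10 "years", 252 "days" per year
dt = T / n
# simulated log-price path with drift mu and volatility sigma
logp = np.concatenate(([0.0],
                       np.cumsum(rng.normal(mu * dt, sigma * np.sqrt(dt), n))))

def estimates(step):
    """Mean and variance per unit time, sampling every `step`-th price."""
    r = np.diff(logp[::step])
    return r.sum() / T, r.var(ddof=1) / (step * dt)

mu_daily, var_daily = estimates(1)       # fine sampling
mu_monthly, var_monthly = estimates(21)  # coarse sampling (21 days ~ a month)

# The two mean estimates coincide: both telescope to logp[-1] / T.
# The variance estimates differ, and the finer one is statistically sharper.
```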
To see why the frequency does not matter for the mean, suppose $T$ is the time length and we have prices at the beginning and at the end, $P_0$ and $P_T$. Then the expected (continuously compounded) return per unit of time is estimated by
\[
\hat{\mu} = \frac{\log(P_T/P_0)}{T}.
\]
Now assume that we have $n$ daily prices available over $[0, T]$, $P_0, P_h, \ldots, P_{nh} = P_T$, with $h = T/n$. Then the average daily return is
\[
\hat{\mu}_d = \frac{\log(P_h/P_0) + \log(P_{2h}/P_h) + \cdots + \log(P_T/P_{T-h})}{n} \qquad (4.64)
\]
\[
= \frac{\log\left[(P_h/P_0) \times (P_{2h}/P_h) \times \cdots \times (P_T/P_{T-h})\right]}{n}
= \frac{\log(P_T/P_0)}{n}. \qquad (4.65)
\]
If $T$ is measured in years, then $\hat{\mu}$ is the estimated annual return, and it is the same as the annualized daily return, $(n/T)\hat{\mu}_d$. In short, the daily observations do not matter except for the beginning and end prices, $P_0$ and $P_T$. Hence, the only way to raise the accuracy of estimating the expected return is to increase the length of the history, $T$.

Let $r_i(m)$ be the monthly return of asset $i$, and assume that there are $p$ daily returns available, $r_{i,t}(d)$, $t = 1, 2, \ldots, p$. Then the monthly return can be written as a sum of the daily returns,⁸
\[
r_i(m) = \sum_{t=1}^{p} r_{i,t}(d). \qquad (4.66)
\]
We can also write this for another asset $j$,
\[
r_j(m) = \sum_{s=1}^{p} r_{j,s}(d). \qquad (4.67)
\]
Then the covariance between $i$ and $j$ is given by the cross products of the right-hand sides,
\[
\mathrm{Cov}[r_i(m), r_j(m)] = \sum_{t=1}^{p} \sum_{s=1}^{p} \mathrm{Cov}[r_{i,t}, r_{j,s}]. \qquad (4.68)
\]
Notice that this formula is true for any two assets; if $i = j$, it provides the formula for the monthly variance of asset $i$.

⁸For easy reference, we use notation almost identical to that of Litterman et al. (2003, Chapter 16).

There is one subtle point to be made about the usual variance transformation formula from one frequency to another. Usually we aggregate to get the monthly variance or covariance by multiplying the daily one by the number of business days within the month, $\sigma_m^2 = p \times \sigma_d^2$. But this formula is correct only if the data are iid. This can be seen by rewriting (4.68) as
\[
\mathrm{Cov}[r_i(m), r_j(m)] = p \times \mathrm{Cov}[r_{i,t}, r_{j,t}] \qquad (4.69)
\]
\[
+\, (p-1) \times \left( \mathrm{Cov}[r_{i,t+1}, r_{j,t}] + \mathrm{Cov}[r_{i,t}, r_{j,t+1}] \right)
\]
\[
+\, (p-2) \times \left( \mathrm{Cov}[r_{i,t+2}, r_{j,t}] + \mathrm{Cov}[r_{i,t}, r_{j,t+2}] \right) + \cdots
\]
\[
+\, 1 \times \left( \mathrm{Cov}[r_{i,t+p-1}, r_{j,t}] + \mathrm{Cov}[r_{i,t}, r_{j,t+p-1}] \right).
\]
Note that the first term $\mathrm{Cov}[r_{i,t}, r_{j,t}] = \mathrm{Cov}[r_{i,1}, r_{j,1}] = \cdots = \mathrm{Cov}[r_{i,p}, r_{j,p}]$ (assuming the daily covariance is constant within the month), and the factor $p$ indicates that we collect together all $p$ same-day terms for the two assets. Similarly, the second term collects all the cross products of returns on dates one day apart, and so on. Therefore, besides the first term, the other terms matter too if the data are not iid.

Now, given sample data of $T$ daily returns, we can estimate the first term in the above formula by
\[
\widehat{\mathrm{Cov}}[r_{i,t}, r_{j,t}] = \frac{1}{T} \sum_{s=1}^{T} r_{i,s} r_{j,s}, \qquad (4.70)
\]
and any of the other $(p-1)$ terms by
\[
\widehat{\mathrm{Cov}}[r_{i,t}, r_{j,t+k}] = \frac{1}{T-k} \sum_{s=1}^{T-k} r_{i,s} r_{j,s+k}. \qquad (4.71)
\]
Then the right-hand side of (4.69) provides the estimate of the monthly covariance or variance. Note that (4.70) differs from the standard statistical estimation formula,
\[
\widehat{\mathrm{Cov}}[r_{i,t}, r_{j,t}] = \frac{1}{T} \sum_{s=1}^{T} (r_{i,s} - \hat{r}_i)(r_{j,s} - \hat{r}_j). \qquad (4.72)
\]
The reason is that the daily means are small and can be taken as zeros without consequence for daily covariance and volatility estimation.

For easy programming, the above estimator can be written in a simpler matrix form. Assume there are $T$ daily returns for all $N$ assets, which can be written as a $T \times N$ matrix,
\[
R(d) = \begin{pmatrix}
r_{1,1}(d) & r_{2,1}(d) & \cdots & r_{N,1}(d) \\
r_{1,2}(d) & r_{2,2}(d) & \cdots & r_{N,2}(d) \\
\vdots & \vdots & \ddots & \vdots \\
r_{1,T}(d) & r_{2,T}(d) & \cdots & r_{N,T}(d)
\end{pmatrix}. \qquad (4.73)
\]
Then the monthly covariance matrix estimator can be written as
\[
S(m) = p \times S_0(d) + \sum_{k=1}^{q} (p-k) \times \left[ S_k(d) + S_k(d)' \right], \qquad (4.74)
\]
where $q$ is the order of serial correlation, and
\[
S_0(d) = \frac{1}{T} R(d)' R(d), \qquad S_k(d) = \frac{1}{T} R(d)' R_k(d),
\]
are the matrix forms of (4.70) and (4.71), with $R_k(d)$ defined as $R(d)$ but with the first $k$ rows set to zeros. Note that we may average $S(m)$ over past months to obtain the covariance estimator for the current month.

Finally, based on the EWMA, the GS estimate of the daily covariance is
\[
\widehat{\mathrm{Cov}}[r_{i,t}, r_{j,t}] = \frac{\sum_{s=1}^{T} w_s r_{i,s} r_{j,s}}{\sum_{s=1}^{T} w_s}
= \frac{\sum_{s=1}^{T} w_s^{1/2} r_{i,s}\, w_s^{1/2} r_{j,s}}{\sum_{s=1}^{T} w_s}. \qquad (4.75)
\]
The other terms can be written out similarly, and we thus obtain the monthly covariance estimator via (4.74). In matrix form, we can weight the returns as
\[
\hat{R}(d) = \begin{pmatrix}
(1-\delta)^{\frac{T-1}{2}} r_{1,1} & (1-\delta)^{\frac{T-1}{2}} r_{2,1} & \cdots & (1-\delta)^{\frac{T-1}{2}} r_{N,1} \\
\vdots & \vdots & \ddots & \vdots \\
(1-\delta)^{\frac{1}{2}} r_{1,T-1} & (1-\delta)^{\frac{1}{2}} r_{2,T-1} & \cdots & (1-\delta)^{\frac{1}{2}} r_{N,T-1} \\
r_{1,T} & r_{2,T} & \cdots & r_{N,T}
\end{pmatrix}, \qquad (4.76)
\]
where $\delta = 1 - \lambda$ is the notation used by GS for the decay parameter. Then the weighted monthly covariance matrix estimator is
\[
\hat{S}(m) = p \times \hat{S}_0(d) + \sum_{k=1}^{q} (p-k) \times \left[ \hat{S}_k(d) + \hat{S}_k(d)' \right], \qquad (4.77)
\]
where
\[
\hat{S}_0(d) = \hat{R}(d)' \hat{R}(d) \Big/ \sum_{t=1}^{T} w_t, \qquad
\hat{S}_k(d) = \hat{R}(d)' \hat{R}_k(d) \Big/ \sum_{t=1}^{T} w_t.
\]
It may be noted that the weights $w_t$ are applied linearly when computing the expected return, but not when computing the covariances: in the latter case, the returns are scaled by $w_t^{1/2}$, so that it is the more recent covariance terms, rather than the returns themselves, that are weighted more heavily.

5 Factor Models 1: Known Factors

Factor models for stock returns are popular. There are two types. The first type assumes that the factors are known and directly observable from financial markets; these can be time-series factors, like the market index, or cross-sectional factors, like firm fundamentals/characteristics.
The second type assumes that the factors are unknown random variables (also known as latent variables), whose realizations are not directly observed but can be estimated from the data. This section focuses on the first type, and the next section deals with the second.

5.1 The CAPM

In this subsection, we focus on testing the CAPM. For completeness, we first prove the CAPM for mean-variance utility investors, and then for investors with arbitrary utilities when the returns are normally distributed (this can be extended to elliptical distributions). Then we move to the tests. We examine first the widely used tests that are pricing-error or alpha-based, and then tests based on cross-sectional analysis and stochastic discount factors.

5.1.1 Proof 1: preference assumption

Theoretically, the CAPM is valid under two assumptions:

• Perfect market: All investors have the same full information and hence the same true beliefs about the means and covariances of stock returns; there are no transaction costs or taxes, so that trading and mispricing can be corrected without cost; and all can borrow and lend at the riskfree rate.

• Preference or return restrictions: All investors are rational, with either mean-variance utility or a concave utility function; in the latter case, the stock returns are assumed to be normally distributed (extendable to elliptical).

We prove the CAPM under the mean-variance preference in this subsection, and leave the proof for the other case to the next subsection. The proofs below follow standard texts such as Ingersoll (1987) and Huang and Litzenberger (1988). Berk (1997) examines the necessary and sufficient conditions for the CAPM.

Under mean-variance utility, the CAPM follows from the Two-fund Separation Theorem and the market-clearing condition.
The latter says that demand must equal supply in the market: the total value of the stocks bought by investors must equal the total value of the existing shares,
\[
\sum_{j=1}^{I} W_j R_\eta = \begin{pmatrix} W_m^1 \\ \vdots \\ W_m^N \end{pmatrix}, \qquad (5.1)
\]
where $I$ is the number of investors, $W_j$ their wealth, $R_\eta$ the tangent portfolio they all hold (see Section 2.7.2), and $W_m^i$ the market total value of stock $i$. Since the vector of stock total values is the product of the total market wealth and the market portfolio weights,
\[
\begin{pmatrix} W_m^1 \\ \vdots \\ W_m^N \end{pmatrix}
= W_m^{\mathrm{total}} \begin{pmatrix} W_m^1 / W_m^{\mathrm{total}} \\ \vdots \\ W_m^N / W_m^{\mathrm{total}} \end{pmatrix}
= W_m^{\mathrm{total}} w_m, \qquad (5.2)
\]
where $w_m$, defined by the above equation as the vector of stock values as fractions of the market, is exactly the vector of market portfolio weights. Hence, Equation (5.1) says that the tangency portfolio must be the market portfolio in equilibrium, when demand equals supply.

Example 5.1 Suppose that there are only $N = 2$ stocks in the market, with market values of \$100 and \$200. If there are $I = 3$ investors with wealth \$50, \$100, and \$150 invested in their stock portfolios, then it must be the case that
\[
50 R_\eta + 100 R_\eta + 150 R_\eta = \begin{pmatrix} 100 \\ 200 \end{pmatrix}
= 300 \begin{pmatrix} 1/3 \\ 2/3 \end{pmatrix} = 300\, w_m,
\]
where $w_m = (1/3, 2/3)'$ is the market portfolio, and $R_\eta$ must be the same as $w_m$. ♠

Let $R_q$ be the return of any portfolio with weights $w_q$ fully invested in the risky assets, and $R_m$ that of the market. Since $R_m$ is the tangency portfolio, $w_m = \Sigma^{-1}\mu/\gamma$, and so we have
\[
\mathrm{cov}(R_q, R_m) = w_q' \Sigma w_m = w_q'(\mu_0 - r_f 1_N)/\gamma = \left( E[R_q] - r_f \right)/\gamma. \qquad (5.3)
\]
Letting $R_q$ be stock $i$ and the market, respectively, we obtain
\[
\mathrm{cov}(R_i, R_m) = \left( E[R_i] - r_f \right)/\gamma, \qquad (5.4)
\]
\[
\mathrm{cov}(R_m, R_m) = \left( E[R_m] - r_f \right)/\gamma. \qquad (5.5)
\]
Taking the ratio of the above two equations and multiplying both sides by $E[R_m] - r_f$, we have
\[
E[R_i] - r_f = \frac{\mathrm{cov}(R_i, R_m)}{\mathrm{cov}(R_m, R_m)} \left( E[R_m] - r_f \right), \qquad (5.6)
\]
or
\[
E[R_i] = r_f + \beta_i \left( E[R_m] - r_f \right), \qquad i = 1, 2, \ldots, N, \qquad (5.7)
\]
where $\beta_i = \mathrm{cov}(R_i, R_m)/\mathrm{cov}(R_m, R_m)$ is stock $i$'s beta.
This is exactly the CAPM, stating that the expected return of any stock is the riskfree rate plus beta times the market risk premium, $E[R_m] - r_f$.

The security market line (SML) is a plot of the CAPM relation: the expected return of an individual security as a function of its systematic, non-diversifiable risk $\beta$. It says that the greater the beta, the greater the expected return. In contrast, the common wisdom says that the greater the risk, the greater the return. This is not true, as the CAPM states that only systematic risk is compensated by the market. When the CAPM is not true, assets above the SML earn positive alphas and those below earn negative alphas. Buying positive-alpha assets or shorting negative-alpha assets helps to beat the market.

The capital market line (CML), a concept related to the CAPM, is the tangent line drawn from the point of the risk-free asset to the tangency portfolio, which is the market portfolio under the CAPM conditions. Focusing on portfolio selection, the CML says that all investors should choose portfolios along the tangent line, i.e., a mix of the riskfree asset and the market portfolio, though the mix can vary across investors. No matter what mix an investor chooses along the CML, the Sharpe ratio is the same for all investors. Mathematically,
\[
\frac{E(R_q) - r_f}{\sigma_q} = \frac{E(R_m) - r_f}{\sigma_m},
\]
for any portfolio $R_q$ on the line. This implies that, for a given level of risk $\sigma_q$ an investor wants to take, the expected return is
\[
E(R_q) = r_f + \frac{\sigma_q}{\sigma_m} \left[ E(R_m) - r_f \right].
\]
Note that this holds only on the CML, and is not true for a general portfolio $R_q$. Note also that the slope of the CML is the Sharpe ratio of the market portfolio. If the CAPM is true, all efficient portfolios earn the same market Sharpe ratio. In the real world, the CAPM is not exactly true, and hence an implication is that an investor should buy assets whose Sharpe ratios plot above the CML.
However, one would not necessarily sell assets whose Sharpe ratios are below the CML, because all inefficient portfolios lie underneath it.

5.1.2 Proof 2: return assumption

Let $\tilde{r}_j$ be the random future return of asset $j$, $j = 1, 2, \ldots, N$, and $w_{ij}$ the portfolio weights of investor $i$; then the random terminal wealth can be written as
\[
\tilde{W}_i = W_i^0 \left[ 1 + r_f + \sum_j w_{ij} (\tilde{r}_j - r_f) \right].
\]
The investor's problem is to maximize the expected utility of wealth, $E[u_i(\tilde{W}_i)]$, and the first-order condition is
\[
E\left[ u_i'(\tilde{W}_i)(\tilde{r}_j - r_f) \right] = 0. \qquad (5.8)
\]
Since $\mathrm{cov}(\tilde{a}, \tilde{b}) = E(\tilde{a} - \bar{a})(\tilde{b} - \bar{b}) = E(\tilde{a}\tilde{b}) - \bar{a}\bar{b}$ for any pair of random variables $\tilde{a}$ and $\tilde{b}$ with means $\bar{a}$ and $\bar{b}$, we have
\[
\mathrm{cov}\left( u_i'(\tilde{W}_i), \tilde{r}_j \right) = -E\left[ u_i'(\tilde{W}_i) \right] E(\tilde{r}_j - r_f),
\]
so we can solve for the expected excess return of the asset,
\[
E(\tilde{r}_j - r_f) = -\mathrm{cov}\left( u_i'(\tilde{W}_i), \tilde{r}_j \right) \Big/ E\left[ u_i'(\tilde{W}_i) \right]. \qquad (5.9)
\]
For any normal random variables $\tilde{x}$ and $\tilde{y}$, Stein's Lemma states that the covariance of any (differentiable) function of $\tilde{x}$ with $\tilde{y}$ can be factored as a product with the covariance of $\tilde{x}$ and $\tilde{y}$,
\[
\mathrm{cov}(g(\tilde{x}), \tilde{y}) = E[g'(\tilde{x})]\, \mathrm{cov}(\tilde{x}, \tilde{y}).
\]
Since we assume that the returns are normally distributed, so is the wealth, and hence we can apply Stein's Lemma to rewrite (5.9) as
\[
\frac{1}{\theta_i} E(\tilde{r}_j - r_f) = \mathrm{cov}(\tilde{W}_i, \tilde{r}_j), \qquad (5.10)
\]
where $\theta_i \equiv -E\left[ u_i''(\tilde{W}_i) \right] / E\left[ u_i'(\tilde{W}_i) \right]$. Summing this equation over all investors, we have
\[
\left( \sum_{i=1}^{I} \frac{1}{\theta_i} \right) E(\tilde{r}_j - r_f) = \mathrm{cov}(\tilde{W}_m, \tilde{r}_j) = W_m^0\, \mathrm{cov}(\tilde{r}_m, \tilde{r}_j), \qquad (5.11)
\]
where $\tilde{W}_m = W_m^0 (1 + \tilde{r}_m)$ is the future market wealth, with $W_m^0$ the initial market wealth and $\tilde{r}_m$ the market return. Multiplying the above equation by the market portfolio weights and summing, we obtain
\[
\left( \sum_{i=1}^{I} \frac{1}{\theta_i} \right) E(\tilde{r}_m - r_f) = W_m^0\, \mathrm{cov}(\tilde{r}_m, \tilde{r}_m). \qquad (5.12)
\]
Finally, taking the ratio of the above two equations and multiplying both sides by $E(\tilde{r}_m - r_f)$, we immediately obtain the CAPM.

5.1.3 Market model

Suppose there are $N$ stocks. A single-factor model is the simplest model one can use to explain the returns on the stocks,
\[
r_{it} = \alpha_i + \beta_i f_t + \epsilon_{it}, \qquad t = 1, \ldots, T, \qquad (5.13)
\]
for $i = 1, 2, \ldots, N$. That is, we run one regression of each stock on the factor, $N$ regressions in total. In practice, it is often the excess asset returns that are used, i.e., the total returns minus the riskfree rate.

The market model is a regression of the asset excess return on the market,
\[
r_{it} = \alpha_i + \beta_i r_{mt} + \epsilon_{it}, \qquad t = 1, \ldots, T, \qquad (5.14)
\]
where the factor is now the market factor (in practice, the excess return on a market index), $\epsilon_{it}$ is the residual, which has zero mean and is uncorrelated with the market return $r_{mt}$, and $T$ is the number of time-series observations.

The market model relates asset returns to that of the market. Based on the market model, we can always decompose the asset risk (variance) as
\[
\mathrm{var}[r_{it}] = \beta_i^2\, \mathrm{var}[r_{mt}] + \mathrm{var}[\epsilon_{it}], \qquad (5.15)
\]
i.e., the asset's variance is the sum of its market risk and its residual (idiosyncratic) risk.

Moreover, the regression model implies
\[
\beta_i = \frac{\mathrm{cov}(r_{it}, r_{mt})}{\mathrm{var}(r_{mt})},
\]
i.e., beta is the ratio of the covariance between the stock and the market to the market variance. Also,
\[
\alpha_i = E(r_{it}) - \beta_i E(r_{mt}),
\]
i.e., alpha is the expected asset excess return minus what is implied by the CAPM.

Note that we can always run the above market-model regression, i.e., project $r_{it}$ on $r_{mt}$. Mathematically, nothing can be said about how big or small the alpha should be. Hence, the market-model regression itself has nothing to do with the CAPM or any economic theory. However, if the CAPM is true, as will become clear below, the alpha should be zero under certain fairly general economic assumptions.

5.1.4 Some truths about alpha

Similarly, regardless of any economic theory, there are some simple truths about the alphas in the market-model regression. It is an accounting identity that the value-weighted sum of all the stock alphas must equal zero, whether or not the CAPM is true.
This is because if we multiply the market model by the value weight of firm $i$, $w_i$, we get
\[
w_i r_{it} = w_i \alpha_i + w_i \beta_i r_{mt} + w_i \epsilon_{it}, \qquad (5.16)
\]
which is true for every stock $i$. Assume there are $N$ stocks in the market. Summing the above over all the stocks,
\[
\sum_{i=1}^{N} w_i r_{it} = \sum_{i=1}^{N} (w_i \alpha_i) + \left[ \sum_{i=1}^{N} w_i \beta_i \right] r_{mt} + \sum_{i=1}^{N} (w_i \epsilon_{it}).
\]
Since the left-hand side is simply the market (the value-weighted index), we must have index $= 0 + 1 \times$ index $+ 0$, so
\[
\sum_{i=1}^{N} (w_i \alpha_i) = 0, \qquad \sum_{i=1}^{N} w_i \beta_i = 1, \qquad \sum_{i=1}^{N} w_i \epsilon_{it} = 0.
\]
Hence, the value-weighted sum of the alphas is 0, and the value-weighted sum of the betas is 1. In practice, people often say that the sum of the alphas is 0; this is not exactly true, but only approximately so, as the exact identity requires weighting the alphas by firm value, not equal-weighting by $1/N$.

Clearly, the fact that the (value-weighted) sum of all the alphas is zero does not imply that the alphas are individually zero: some can be positive and some negative while their sum is zero. If the CAPM is true, it makes the much stronger assertion that every single alpha is zero!

Consider now portfolio alphas, in contrast to the stock alphas above. Since a portfolio's alpha is the portfolio-weighted combination of the individual alphas, and since the aggregate of all investors' portfolio weights is the market, the (value-weighted) sum of the portfolio alphas of all investors must also equal zero. This helps explain why it is difficult to beat the market: if one investor earns a positive alpha on his or her portfolio, someone else must earn a negative alpha. While all investors can make money in the stock market, the competition for alpha is a zero-sum game. If one has no information, or does not want to risk a negative alpha, buying and holding the market index earns the market return with zero alpha.

5.1.5 Claims of the CAPM

The capital asset pricing model (CAPM) is about pricing stocks. Let $R_{it}$ be the return on stock $i$.
Under certain assumptions, the CAPM holds, stating that, at any time $t$,
\[
E[R_{it}] = r_{ft} + \beta_i \left( E[R_{mt}] - r_{ft} \right), \qquad (5.17)
\]
that is, the expected return on a stock is the riskfree rate plus the stock's beta times the market risk premium, $E[R_{mt}] - r_{ft}$.

Recall that we often work with excess returns. Let $r_{it} = R_{it} - r_{ft}$ be firm $i$'s return in excess of the riskfree rate, and $r_{mt} = R_{mt} - r_{ft}$ the market excess return; then the CAPM relation can be written simply as
\[
E[r_{it}] = \beta_i \mu_m, \qquad (5.18)
\]
where $\mu_m = E[R_{mt}] - r_{ft}$ is the market risk premium, i.e., the expected market excess return. In practice, $R_{mt}$ is often taken to be the return on a broad market index (such as the S&P 500), and then $\mu_m$ is the expected excess return on the index.

It should be noted that the CAPM is about expected return, not risk. It says that, across stocks, the greater the beta (the systematic market-risk exposure), the greater the expected return; indeed, the expected excess return is simply a linear function of beta times the expected market excess return. Contrary to the belief of many investors, the conventional wisdom of high risk, high return is not true: the CAPM says that only systematic risk is compensated, and idiosyncratic risk is not.

5.1.6 GRS test

Recall that, if the CAPM is true, then, taking expectations on both sides of (5.14), we have
\[
H_0: \alpha_i = 0, \qquad i = 1, \ldots, N, \qquad (5.19)
\]
for all the stocks, where $N$ is the total number of stocks. Hence, a test of the CAPM is a test of whether all the alphas in the market model are zero.

The alpha is also known as the pricing error. When it is positive, the CAPM under-values the asset; when it is negative, the asset is over-valued. Here we have only one factor, the market factor. In general, when there is more than one factor, the alpha still measures the pricing error, but relative to the multi-factor model instead of the CAPM.
To test the CAPM, we first have to estimate the alphas and betas. The estimation can be done equation by equation by OLS for each asset. Then the null hypothesis can be tested using the well-known Gibbons, Ross and Shanken (1989) test,
\[
\mathrm{GRS} \equiv \frac{T - N - 1}{N} \cdot \frac{\hat{\alpha}' \hat{\Sigma}^{-1} \hat{\alpha}}{1 + \hat{\theta}_m^2} \sim F_{N, T-N-1}, \qquad (5.20)
\]
where $\hat{\alpha}$ is the vector of estimated alphas and $\hat{\Sigma}$ is the estimated $N \times N$ residual covariance matrix, whose $(i, j)$ element is
\[
\hat{\sigma}(i, j) = \frac{1}{T} \sum_{t=1}^{T} (r_{it} - \hat{\alpha}_i - \hat{\beta}_i r_{mt})(r_{jt} - \hat{\alpha}_j - \hat{\beta}_j r_{mt}), \qquad (5.21)
\]
$\hat{\theta}_m$ is the Sharpe ratio of the market $r_m$ (the ratio of its mean excess return to its standard deviation), and $F_{N, T-N-1}$ is the $F$-distribution with degrees of freedom $N$ and $T - N - 1$. We reject the null when the GRS statistic is large relative to the random fluctuations measured by the $F$-distribution. As we shall see in the slides, the CAPM is strongly rejected by the data.

However, a rejection of the CAPM simply indicates that the market factor alone cannot price the assets; the market factor can still be very important in explaining the returns (with non-zero betas). Indeed, in the usual multi-factor extensions of the CAPM, the market factor is by far the most important factor, and, in conjunction with other factors, it can price assets fairly well in many applications.

Note that to test whether the alpha of an individual firm is zero ($\alpha_i = 0$ or not), it is straightforward to apply the usual $t$-ratio test in a univariate linear regression. But this differs from the CAPM test, which requires all the alphas to be zero simultaneously; it is a multi-dimensional test. If one company has a zero alpha, that does not imply that the CAPM is true in general, although it may hold for that company. On the other hand, if the CAPM is rejected for one company by a $t$-ratio test with a P-value of 5%, this does not imply that the CAPM is invalid either.
Because the rejection is not absolute, i.e., not with 100% confidence: with a 5% P-value, 5 rejections will occur among every 100 firms by chance alone even if the CAPM is true. Hence, statistically, the GRS test, which tests all the pricing errors being zero jointly, is the appropriate way to test the multiple restrictions here, and it is statistically the most powerful test.

What is the statistical intuition of the GRS test? Consider the case of two stocks. We want to design a single statistic to test that both $\alpha_1$ and $\alpha_2$ are zero. One possible statistic is
\[
J = \hat{\alpha}_1^2 + \hat{\alpha}_2^2.
\]
If both $\alpha_1$ and $\alpha_2$ are zero, $J$ should be small, like the GRS statistic. If we find that $J$ is too large empirically, we can reject the hypothesis that the alphas are zero. The problem is that it is very difficult to find the distribution of $J$, so we have to modify it. The idea is to standardize both $\hat{\alpha}_1$ and $\hat{\alpha}_2$ so that their joint distribution is tractable; doing so yields the GRS test. In the one-dimensional case, we standardize the alpha and get the $t$-distribution, so the $F$-distribution is the extension of the $t$ to the higher-dimensional case.

What is the economic intuition of the GRS test? If the CAPM is true, and if we invest in all the assets including the market, we cannot do better than investing in the market alone. Let $\hat{\theta}_q$ be the Sharpe ratio of the former. Gibbons, Ross and Shanken (1989) show that
\[
\hat{\alpha}' \hat{\Sigma}^{-1} \hat{\alpha} = \hat{\theta}_q^2 - \hat{\theta}_m^2.
\]
Hence, the GRS test measures how close the market Sharpe ratio $\hat{\theta}_m$ is to $\hat{\theta}_q$. If the difference is sizable, we reject the CAPM.

In matrix form, the market model can be written as
\[
R = X\Theta + E,
\]
where $R$, $T \times N$, contains the returns on the $N$ assets in excess of the riskless rate; $X$, $T \times 2$, is a matrix whose columns are ones and the excess returns on the given portfolio $r_m$; and $\Theta$, $2 \times N$, is a matrix whose rows are the $\alpha$'s and $\beta$'s, respectively.
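To make the mechanics concrete, here is a minimal sketch of the GRS computation (5.20)–(5.21) on simulated data in which the null holds (alphas set to zero); the sample sizes and parameters are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)
T, N = 240, 5                            # 20 years of months, 5 test assets
rm = rng.normal(0.006, 0.045, T)         # simulated market excess returns
beta = rng.uniform(0.5, 1.5, N)          # true betas
R = rm[:, None] * beta + rng.normal(0.0, 0.02, (T, N))  # alphas are zero

# OLS equation by equation: regress each asset on a constant and the market.
X = np.column_stack([np.ones(T), rm])
Theta, *_ = np.linalg.lstsq(X, R, rcond=None)
alpha_hat = Theta[0]                     # estimated alphas
E = R - X @ Theta                        # residuals
Sigma_hat = E.T @ E / T                  # residual covariance, as in (5.21)

theta_m = rm.mean() / rm.std()           # market Sharpe ratio
grs = ((T - N - 1) / N) \
    * (alpha_hat @ np.linalg.solve(Sigma_hat, alpha_hat)) / (1 + theta_m ** 2)
# Under the null, grs behaves like a draw from F(N, T-N-1); one rejects the
# CAPM when it exceeds the chosen F critical value.
```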
Technically, in the context of multivariate testing, $r_{mt}$ is often treated as fixed, and $E$ is assumed to follow a multivariate normal distribution,
\[
\mathrm{vec}(E) \sim N(0, I \otimes \Sigma), \qquad (5.22)
\]
for the GRS distribution to hold. Furthermore, to guarantee the non-singularity of $\Sigma$, we need to assume that $r_m$ is not a linear function of the $N$ asset excess returns. This effectively means that the given index portfolio contains other asset returns that are not included on the left-hand side of the market model.

5.1.7 CAPM and market efficiency

There are two concepts of market efficiency. The usual meaning comprises the three forms of market efficiency: "weak-form", "semi-strong-form", and "strong-form". The weak-form efficiency hypothesis says that historical prices and trading volume provide no information for making abnormal profits, implying in particular that technical analysis and common trading signals are useless (though the latter is debatable in practice). The semi-strong form states that all publicly available information (beyond historical prices) is likewise useless for making abnormal profits, suggesting that fundamental analysis based on a firm's public valuation data, such as earnings and growth, is a waste of time and money (with which fund managers may not agree). The strong form concerns private information: it says that stock prices reflect all information, public and private, so that there is no over- or under-valuation and no possibility of making abnormal profits. In the real world, none of these hypotheses is absolutely true, but they do serve as a useful reminder that it is difficult to beat the market.

If the CAPM is true, the market must be efficient, as all the assets are then correctly priced by the CAPM. In fact, if any known asset pricing model is true, the same conclusion holds. But if we reject the CAPM or another known model, that says nothing about whether the market is efficient or not.
It simply states that the given model cannot price all the assets correctly; in other words, according to the model, there are over- and under-valued assets.

Another meaning of market efficiency is whether the market portfolio, as approximated by the value-weighted stock index, is an efficient portfolio on the mean-variance frontier. Theoretically, the CAPM is true if and only if the market portfolio is efficient.

5.1.8 Fama-MacBeth 2-pass regressions

Fama and MacBeth (1973) propose a 2-pass regression approach for estimating factor risk premia and for testing the validity of factor models, the CAPM in particular. The procedure has two steps:

1. A time-series regression is run for each asset's returns to obtain the asset's betas, or exposures to the risk factors (the first pass);

2. A cross-sectional regression is run of all asset returns on their estimated betas to determine the risk premia of the factors (the second pass).

Consider the case of testing the CAPM. In the first pass, we estimate the market model,
\[
r_{it} = \alpha_i + \beta_i r_{mt} + \epsilon_{it}, \qquad t = 1, \ldots, T, \qquad (5.23)
\]
to obtain $\hat{\beta}_i$. For a given firm $i$, we run the above regression over time; $\hat{\beta}_i$ is its estimated market risk exposure. We can run the regression $N$ times, to get the betas for all firms.

In the second pass, at each time $t$, we run a regression across the firms on their betas,
\[
r_{it} = \gamma_0 + \gamma_1 \hat{\beta}_{it} + \epsilon_{it}, \qquad i = 1, \ldots, N, \qquad (5.24)
\]
where $\hat{\beta}_{it}$ is the estimated beta at time $t$ from the first-pass regression. More explicitly,
\[
\begin{pmatrix} R_{\mathrm{IBM},t} \\ R_{\mathrm{Apple},t} \\ \vdots \\ R_{\mathrm{Google},t} \end{pmatrix}
= \gamma_0 + \gamma_1 \begin{pmatrix} \hat{\beta}_{\mathrm{IBM},t} \\ \hat{\beta}_{\mathrm{Apple},t} \\ \vdots \\ \hat{\beta}_{\mathrm{Google},t} \end{pmatrix}
+ \begin{pmatrix} \epsilon_{1t} \\ \epsilon_{2t} \\ \vdots \\ \epsilon_{Nt} \end{pmatrix}. \qquad (5.25)
\]
If the CAPM is true, we should have $\gamma_0 = 0$ and $\gamma_1 = E[r_{mt}]$; that is, the slope of the second-pass regression is the market risk premium.

In practice, the estimates of $\gamma_0$ and $\gamma_1$ will be noisy. Assuming they are constant over time, we use their average estimates over time. For example, suppose we have 15 years of monthly data.
We can start the estimation in the first month after 5 years, month 61, to get the first beta (we need the first 5 years of data to estimate it), and then the gamma. Using a 5-year rolling window, we can similarly obtain the beta and gamma in the following month (month 62), and so on up to the last month, for a total of 120 estimates per firm. The average is then given by
\[
\bar{\gamma}_1 = \frac{1}{120} \sum_{t=1}^{120} \gamma_{1t}, \qquad (5.26)
\]
where the $\gamma_{1t}$ are the estimates in each month. Statistically, $\bar{\gamma}_1$ is a better estimate than any single $\gamma_{1t}$.

Fama and MacBeth (1973) suggest the following $t$-statistic to test whether $\gamma_1$ is significantly different from zero,
\[
t\text{-stat} = \frac{\bar{\gamma}_1}{\mathrm{std}(\hat{\gamma}_1)/\sqrt{T}},
\]
where $\mathrm{std}(\hat{\gamma}_1)$ is the sample standard deviation of the $\hat{\gamma}_{1t}$'s; $\mathrm{std}(\hat{\gamma}_1)/\sqrt{T}$ is also known as the Fama-MacBeth (1973) standard error of the estimated risk premium.

Statistically, the $t$-test tends to over-reject the null because $\mathrm{std}(\hat{\gamma}_1)$ under-estimates the true standard error. The reason is that there is an errors-in-variables problem in the second pass: the regressors are the estimated betas, which carry estimation errors from the first pass, not the true betas. Shanken (1992) provides the corrected standard error. Shanken and Zhou (2007) further provide a specification test of the factor model, and Kan, Robotti and Shanken (2013) offer more general discussions of the two-pass procedure.

5.1.9 Stochastic discount factor

As Cochrane (2001) shows, almost all asset pricing models can be written in a stochastic discount factor (SDF) form,
\[
1 = E(mR), \qquad (5.27)
\]
where $m$ is the SDF and $R$ is the gross return. It says that, if an asset provides a random payoff $R$ next period, I am willing to pay \$1 for it today. The factor $m$ is what I use to discount the payoff; since the payoff is uncertain, I take the expected value.
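As a toy numerical check of (5.27), consider a hypothetical two-state world; all the numbers are made up for illustration:

```python
import numpy as np

p = np.array([0.5, 0.5])         # state probabilities (good state, bad state)
m = np.array([0.90, 1.05])       # SDF: a dollar tomorrow is worth more in the bad state

Rf = 1.0 / (p @ m)               # riskfree gross return solves 1 = E[m] * Rf

payoff = np.array([1.30, 0.80])  # a risky payoff that is low in the bad state
price = p @ (m * payoff)         # today's price is E[m * payoff]
R = payoff / price               # gross return per dollar invested, so 1 = E[m R]

# Because the payoff is low exactly when m is high, the asset's expected
# return p @ R exceeds Rf: the asset must offer a risk premium.
```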
To see how it works, consider the T-bill with interest $r$, or return $R = 1 + r$. If we pay one dollar today to buy it, we have
$$1 = E[mR] = m(1 + r),$$
so our discount factor is $m = 1/(1+r)$. Now imagine that the T-bill is actually a corporate bond with the same promised interest; we will not pay \$1 today for it. In this case, the discount rate will be greater than $1+r$ (the SDF smaller than $1/(1+r)$), so we get a lower price, say 50 cents today. Re-scaling the units, we obtain Equation (5.27) again for pricing the bond: now every one dollar buys 2 units of the bond. It works similarly for all other assets.

If the CAPM is true, it can be shown that the SDF takes a very simple form, a linear function of the market,
$$m = \lambda_0 + \lambda_1 r_{mt},$$
where $\lambda_0$ and $\lambda_1$ are parameters. Cochrane (2001) shows how to test the CAPM in the SDF framework. Kan and Zhou (1999) compare this methodology with the traditional beta tests such as the GRS, and find that the SDF approach can be less efficient. Later, Jagannathan and Wang (2002) show that this inefficiency can be remedied by adding moment conditions on the factors, so that the SDF and the traditional approach become asymptotically equivalent. However, adding the factor moment conditions makes the implementation of the SDF tests difficult, as there will no longer be analytical solutions for the parameter estimates.

5.1.10 GMM test and others

The GRS test is ideal if the data are normally distributed. However, the data are not normally distributed in the real world. When the normality assumption is violated, the GRS test statistic no longer has an F distribution, and hence the P-value and the test results may be in doubt.

The simplest approach is to use a bootstrap procedure. The idea is to resample B sets of data, each of the same sample length T, from the original data with replacement, computing the GRS statistic once for each bootstrapped data set. The upper 5% percentile of these B statistics will be a good estimate of the 5% critical value under the iid assumption.
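To make the resampling concrete, here is a minimal sketch of the iid bootstrap in Python (NumPy only). For simplicity, it uses the squared t-statistic of a single asset's alpha as a stand-in for the GRS statistic; the simulated returns and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
T, B = 240, 500
rm = 0.006 + 0.04 * rng.standard_normal(T)       # simulated market excess returns
ri = 1.0 * rm + 0.05 * rng.standard_normal(T)    # asset generated with alpha = 0

def alpha_tstat2(y, x):
    """Squared t-statistic of the intercept in y = a + b x + e
    (a one-asset stand-in for the GRS statistic)."""
    X = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ coef
    s2 = e @ e / (len(y) - 2)
    cov = s2 * np.linalg.inv(X.T @ X)
    return (coef[0] / np.sqrt(cov[0, 0])) ** 2

stat = alpha_tstat2(ri, rm)

# impose the null (remove the fitted alpha), then resample time periods
# (rows) with replacement under the iid assumption
X = np.column_stack([np.ones(T), rm])
coef = np.linalg.lstsq(X, ri, rcond=None)[0]
ri0 = ri - coef[0]                               # returns with alpha forced to zero
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, T, T)
    boot[b] = alpha_tstat2(ri0[idx], rm[idx])

p_value = np.mean(boot >= stat)                  # bootstrap P-value
crit_5 = np.quantile(boot, 0.95)                 # bootstrap 5% critical value
```

The same loop applies verbatim to the multi-asset GRS statistic: only the stand-in function changes.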
Without the iid assumption, the bootstrap procedure will be more complex; Chou and Zhou (2006) provide the details. The bootstrap is used widely in finance to estimate standard errors and to test trading strategies. Sections 7.2 and 4.3 have more discussion.

Hansen (1982) provides the generalized method of moments (GMM), a Nobel-prize-winning work, which can be used to test almost any economic model under very general statistical assumptions. In our context, as long as the residuals are stationary, the test is valid. However, without normality, almost all tests, GMM included, are valid at most asymptotically, meaning true as the sample size goes to infinity. Hence, the reliability of asymptotic tests is an issue. Simulations may be run to help assess the reliability, and the bootstrap may still be used to improve the accuracy.

There are two ways to implement the GMM test for the CAPM. The first is to construct a $\chi^2$ test from the distribution of $\hat\alpha$ under stationarity assumptions, and the second is to impose the null on the model and use the GMM overidentification test directly. MacKinlay and Richardson (1991) are the first to apply the GMM to test the CAPM. Harvey and Zhou (1993) provide further results.

A Bayesian approach to testing the CAPM has been taken by Shanken (1987) and Harvey and Zhou (1990). The first obtains the posterior odds ratio from a given prior on the correlation between the market portfolio and the proxy, while the second, based on a full Bayesian specification of the market model, conducts both a posterior analysis and odds-ratio testing with several priors on the behavior of the parameters.

5.2 Spanning tests

Huberman and Kandel (1987) introduce the idea of a mean-variance spanning test. The question is whether the mean-variance optimal portfolio of a set of given assets can be improved by adding a set of new assets.
In other words, the question is equivalent to whether the investment opportunity set of all of the assets can be spanned by the set of given assets. Theoretically, the "mean-variance" qualifier may be removed to consider general spanning, but that case is too complex. Hence, almost all studies, including those below, focus only on mean-variance spanning.

We start with the simplest case of two risky assets with returns $R_A$ and $R_B$, where $R_B$ is the return on the given benchmark asset. Our question is whether adding $R_A$ improves the optimal portfolio formed from the original set $\{R_B\}$. For example, we may ask if China exposure (as summarized by the China market index return $R_A$) offers any diversification advantage over the US market index. Assume that there is no borrowing or lending (no riskfree asset available). Consider the regression (sometimes called a projection) of $R_A$ on $R_B$,
$$R_A = \alpha + \beta R_B + \epsilon, \qquad (5.28)$$
where $\epsilon$ is the residual, uncorrelated with $R_B$. If
$$R_A = 0 + 1 \times R_B + \epsilon, \qquad (5.29)$$
it is easy to show that $R_A$ adds no value to the portfolio. This is intuitively obvious: based on (5.29), $R_A$ has the same expected return as $R_B$ but greater variance risk, so it is dominated by $R_B$ and adds no investment value. On the other hand, consider the case
$$R_A = 0 + 1.5 \times R_B + \epsilon. \qquad (5.30)$$
Buying 1.5 units of $R_B$ would replicate $R_A$ up to a noise, but this is not possible, as we assume no borrowing. So $R_A$ allows aggressive investors to hold, in effect, 1.5 units of asset B instead of the original one unit. Therefore, the spanning hypothesis is
$$\alpha = 0, \quad \beta = 1 \qquad (5.31)$$
in the one-test-asset and one-benchmark-asset case. There is mean-variance spanning if and only if the above parametric restriction holds in the regression (5.28). In practice, one has data on both assets, and can then run the regression to do the test. Since it is a joint test of the alpha and beta, an F test is needed instead of the often-used t-ratio test, which applies only to a single parameter.
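The joint test can be sketched in Python (NumPy only) as a Wald/F statistic for $H_0: \alpha = 0, \beta = 1$. The data below are simulated so that spanning holds by construction; all numbers are hypothetical, and a real application would use actual return series.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 240
RB = 0.008 + 0.05 * rng.standard_normal(T)            # benchmark asset return
RA = 0.0 + 1.0 * RB + 0.03 * rng.standard_normal(T)   # spanning holds: alpha=0, beta=1

# OLS of RA on a constant and RB
X = np.column_stack([np.ones(T), RB])
theta = np.linalg.lstsq(X, RA, rcond=None)[0]         # [alpha_hat, beta_hat]
e = RA - X @ theta
s2 = e @ e / (T - 2)
V = s2 * np.linalg.inv(X.T @ X)                       # covariance of (alpha_hat, beta_hat)

# joint test of H0: alpha = 0, beta = 1
r0 = theta - np.array([0.0, 1.0])                     # deviation from the null values
F = (r0 @ np.linalg.inv(V) @ r0) / 2                  # ~ F(2, T-2) under H0
```

A large F (relative to the $F_{2,T-2}$ critical value) rejects spanning; here, with the null true, F should be small.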
A more complex case is with 2 benchmarks (or 2 factors). Consider now the regression
$$R_1 = \alpha_1 + \beta_{11} f_1 + \beta_{12} f_2 + \epsilon_1, \qquad (5.32)$$
where $\epsilon_1$ is the residual, uncorrelated with the factors. Then an argument similar to the earlier one shows that the spanning hypothesis is
$$\alpha_1 = 0, \quad \beta_{11} + \beta_{12} = 1. \qquad (5.33)$$
The second restriction is not surprising, because we want to hold the factors to replicate the asset without borrowing.

Suppose now we have two test assets instead of one; then we add one more regression,
$$R_2 = \alpha_2 + \beta_{21} f_1 + \beta_{22} f_2 + \epsilon_2. \qquad (5.34)$$
It is easy to see that the spanning hypothesis is then Equation (5.33) plus
$$\alpha_2 = 0, \quad \beta_{21} + \beta_{22} = 1, \qquad (5.35)$$
the latter of which is pertinent to the second asset. In general, we can have N test assets and K benchmark assets, and the spanning hypothesis can be summarized as
$$\alpha = 0_N, \qquad \beta\, 1_K = 1_N,$$
where $\alpha$ is the N-vector of intercepts and $\beta$ is the $N \times K$ matrix of loadings of the N test assets on the K benchmarks. How do we test the above hypothesis? DeRoon and Nijman (2001) and Kan and Zhou (2012) summarize the statistical procedures.

It should be noted that the spanning hypothesis simplifies greatly when there is a riskfree asset with return $R_f$. With $R_f$, we consider only excess returns. For example, for the one-test-asset and one-benchmark-asset case, we run
$$r_A = \alpha + \beta r_B + \epsilon, \qquad (5.36)$$
where $r_A = R_A - R_f$ and $r_B = R_B - R_f$ are the excess returns and $\epsilon$ is the residual, uncorrelated with $r_B$. Then the spanning hypothesis is simply
$$\alpha = 0. \qquad (5.37)$$
The beta no longer matters, because we can borrow or lend to replicate $R_A$ with $R_B$. Testing the CAPM can be regarded as a special case here. An asset cannot improve the market portfolio if and only if its alpha is zero. If the CAPM is true, the market portfolio is efficient and no assets can help to do better, and so all the asset alphas must be zero.
5.3 Fama-French 3- and 5-factor models

Due to the failure of the CAPM, Fama and French (1993, 1996) advocate the following three-factor model,
$$R_{it} - r_{ft} = \alpha_i + \beta_{i1}(f_{M,t} - r_{ft}) + \beta_{i2} f_{SMB,t} + \beta_{i3} f_{HML,t} + \epsilon_{it}, \qquad (5.38)$$
where $f_M$ is the return on the market factor, $f_{SMB}$ is the SMB spread return, $f_{HML}$ is the HML spread return, and $r_{ft}$ is the 30-day T-bill rate. In their tests of the above model, Fama and French (1993, 1996) take the $R_{it}$'s as the returns on the 25 stock portfolios formed on size and book-to-market.

We can test a multiple-factor model similarly to the CAPM case. Consider the above Fama and French (1993) 3-factor model. The estimation can be done equation by equation by OLS to obtain the 3 betas for each asset. If the 3-factor model is true, then we have again
$$H_0: \alpha_i = 0, \qquad i = 1, \ldots, N. \qquad (5.39)$$
But this is true only for tradable factors (see the next subsection for more explanation). The null hypothesis can be tested using the K-factor version of Gibbons, Ross and Shanken's (1989) test,
$$GRS \equiv \frac{T - N - K}{N} \cdot \frac{\hat\alpha' \hat\Sigma^{-1} \hat\alpha}{1 + \bar f' \hat\Omega^{-1} \bar f} \sim F_{N,\,T-N-K}, \qquad (5.40)$$
where $\bar f$ and $\hat\Omega$ are the sample mean and sample covariance matrix of the $K = 3$ factors. As before, we reject the null when the GRS statistic is large.

One can also test the three-factor model using the Fama-MacBeth 2-pass procedure. In the first pass, one runs regression (5.38) to obtain the three betas. Then, in the second pass, the following cross-section regression is run on the betas across assets,
$$r_i = \gamma_0 + \gamma_1 \hat\beta_{i1} + \gamma_2 \hat\beta_{i2} + \gamma_3 \hat\beta_{i3} + \epsilon_i, \qquad i = 1, \ldots, N, \qquad (5.41)$$
to obtain the risk premia estimates, the slopes, on the three factors. Similar t-stats can be defined and the significance of the risk premia examined as before.

Improving on their 3-factor model, Fama and French (2015) recently propose a 5-factor model by adding a profitability factor and an investment factor.
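As an illustration, here is a minimal NumPy sketch of the K-factor GRS statistic (5.40) on simulated data in which the null holds; the factor means, loadings, and volatilities are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
T, N, K = 360, 10, 3
f = 0.004 + 0.03 * rng.standard_normal((T, K))       # simulated factor returns
B = rng.uniform(0.0, 1.5, (K, N))                    # factor loadings
r = f @ B + 0.04 * rng.standard_normal((T, N))       # null holds: all alphas = 0

# time series OLS of each asset on a constant and the K factors
X = np.column_stack([np.ones(T), f])
coef = np.linalg.lstsq(X, r, rcond=None)[0]          # shape (1+K, N)
alpha = coef[0]                                      # estimated alphas
E = r - X @ coef                                     # residuals
Sigma = E.T @ E / (T - K - 1)                        # residual covariance matrix

fbar = f.mean(axis=0)                                # factor sample means
Omega = np.cov(f, rowvar=False)                      # factor sample covariance

GRS = ((T - N - K) / N) * (alpha @ np.linalg.solve(Sigma, alpha)) \
      / (1 + fbar @ np.linalg.solve(Omega, fbar))    # ~ F(N, T-N-K) under H0
```

Since the simulated alphas are truly zero, the statistic should be close to 1, the approximate mean of the $F_{N,T-N-K}$ distribution.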
The GMM and other tests discussed previously can also be used to test them.

5.4 Additional factor models

Hou, Xue, and Zhang (2015) provide a 4-factor model, which is similar and competitive to the Fama and French (2015) 5-factor model. Stambaugh and Yuan (2017) provide a 4-factor model: the Mkt, Size, and two mispricing factors (MGMT and PERF), and Daniel, Hirshleifer, and Sun (2020) provide a 3-factor model: Mkt and two behavioral factors (PEAD and FIN). Han, Zhou and Zhu (2016) propose a trend factor that has an average return of about 1.61% per month, more than twice that of the momentum factor, and more than double its Sharpe ratio. Liu, Zhou, and Zhu (2020a) add volume information and construct a new trend factor particularly suitable for China, where about 80% of trading volume is generated by individual investors. In addition, the hundreds of anomalies surveyed by Harvey, Liu and Zhu (2016) and Hou, Xue and Zhang (2019) are all candidates for additional factors.

The search for factors is endless. Starting from the twelve distinct risk factors in Fama and French (1993, 2015), Hou, Xue, and Zhang (2015), Stambaugh and Yuan (2017), and Daniel, Hirshleifer, and Sun (2020), Chib, Zhao and Zhou (2020) construct and compare 4,095 possible combinations, and find that the model with the risk factors Mkt, Size, MOM, ROE, MGMT, and PEAD performs the best in terms of Bayesian posterior probability, out-of-sample predictability, and Sharpe ratio. A more extensive comparison of 8,388,607 factor models, constructed from the twelve winners plus eleven principal components of anomalies unexplained by the winners, shows the benefit of incorporating information in genuine anomalies in explaining the cross-section of expected equity returns.

5.5 Non-traded factors

To understand the indeterminacy of factor risk premia in the regression model, consider again the market model.
Let $\lambda_m$ be the market risk premium; then the CAPM says
$$E[r_{it}] = \beta_i \lambda_m, \qquad (5.42)$$
that is, if an asset has double the market risk, its excess return (its return beyond the riskfree rate) is expected to be double the market risk premium. The above equation is true for any traded asset. In particular, since the market itself is tradable and has a beta of one, we have
$$E[r_{mt}] = 1 \times \lambda_m = \lambda_m, \qquad (5.43)$$
which says that the market risk premium is $\lambda_m = E[r_{mt}] = E[R_{mt} - r_{ft}]$, the market return in excess of the riskfree rate.

Now let $f$ be a systematic factor, such as consumption growth, that affects asset returns. For simplicity, assume it is the only factor. Then we have
$$E[r_{it}] = \beta_i \lambda_f, \qquad (5.44)$$
where $\lambda_f$ is the risk premium, or reward, for taking the factor exposure. However, since $f$ is not traded, we no longer have an equation like (5.43) to tie down $\lambda_f$. More complex procedures, with additional assumptions, may be needed to determine the value of $\lambda_f$. Indeed, Giglio and Xiu (2021) suggest a three-step approach. First, extract a suitable number of factors from returns with the PCA method to be explained below (see Section 6.2). Second, estimate the risk premia of the extracted factors by running a cross-sectional regression of average returns on the factor loadings. Finally, run a time series regression of $f$ on the extracted factors to get its loadings, which yield the risk premium on $f$ when multiplied by the factor risk premia.

5.6 How to construct factors?

How do we obtain systematic factors beyond the market factor? These factors are often obtained as spread, or long-short, portfolios, which are tradable, constructed from firm characteristics. Some studies also use macroeconomic variables, such as industrial production, as factors (see Chen, Roll and Ross, 1986, for a classic, and Rapach and Zhou, 2019, for the latest work). In practice, tradable factors usually work better than macroeconomic factors.
One can also extract factors statistically from asset returns, which will be discussed in the next chapter. Here we focus on forming factors from firm characteristics.

5.6.1 Sorting

Sorting stocks into decile portfolios by a firm characteristic is one of the most common ways of constructing new factors. For example, to obtain the size factor, we can sort all stocks (except those with prices lower than \$1, say) into 10 portfolios each month by their capitalization (size), and then buy those stocks in the lowest decile (small) and short those in the highest decile (large). The resulting zero-cost spread, or long-short, portfolio will capture well the performance due to size. The return on this portfolio each month is the return on the size factor,
$$f_{size} = R_1 - R_{10}, \qquad (5.45)$$
where $R_1$ and $R_{10}$ are the returns on the lowest and highest decile portfolios, respectively.

In applications, $R_1$ and $R_{10}$ can be either equal- or value-weighted returns. With equal weighting, the spread portfolio usually performs better, but it is more influenced by small- and mid-cap firms. From a feasibility point of view, value-weighted returns are preferred, as more money can be invested into the spread portfolio without investing too heavily in small- and mid-cap stocks or affecting prices too much. While decile portfolios are popular, quintiles and sorts into three or even two groups are also often seen. Generally speaking, the average return (over time) of the spread portfolio is greater with decile portfolios than in the other cases, because deciles create more dispersion in factor exposure across stocks.

While univariate sorting is widely used, bivariate sorting is sometimes also employed. For example, Fama and French's (1993) well-known size and book-to-market factors, posted today on French's web site, are based on a bivariate sort on size and book-to-market.
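The univariate decile sort of Section 5.6.1 can be sketched in one month of simulated data as follows (Python, NumPy only); the cross-section of sizes and returns is simulated, so the numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 500
size = np.exp(rng.normal(6.0, 2.0, N))          # hypothetical market caps
# next-month returns with a small negative size effect plus noise
ret = 0.01 - 0.001 * np.log(size) + 0.05 * rng.standard_normal(N)

# assign decile labels 0..9 by size (0 = smallest)
order = np.argsort(size)
decile = np.empty(N, dtype=int)
decile[order] = np.arange(N) * 10 // N

# equal-weighted long-short: long smallest decile, short largest (eq. 5.45)
f_size = ret[decile == 0].mean() - ret[decile == 9].mean()
```

Replacing the equal-weighted means with size-weighted means gives the value-weighted version discussed above.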
Specifically, according to the web site, they construct 6 value-weighted portfolios at the end of each June, the intersections of 2 portfolios formed on size (market equity, ME) and 3 portfolios formed on the ratio of book equity to market equity (BE/ME). The size breakpoint for year $t$ is the median NYSE market equity at the end of June of year $t$. The BE/ME breakpoints are the 30th (Growth) and 70th (Value) NYSE percentiles, where BE/ME for June of year $t$ is the book equity for the last fiscal year end in $t-1$ divided by ME for December of $t-1$. The size factor is then defined as SMB (Small Minus Big), the average return on the three small portfolios minus the average return on the three big portfolios,
$$SMB = \tfrac{1}{3}(\text{Small Value} + \text{Small Neutral} + \text{Small Growth}) - \tfrac{1}{3}(\text{Big Value} + \text{Big Neutral} + \text{Big Growth}), \qquad (5.46)$$
and the book-to-market factor is defined as HML (High Minus Low), the average return on the two value portfolios minus the average return on the two growth portfolios,
$$HML = \tfrac{1}{2}(\text{Small Value} + \text{Big Value}) - \tfrac{1}{2}(\text{Small Growth} + \text{Big Growth}). \qquad (5.47)$$

Occasionally, sorting on 3 characteristics is done. However, it becomes increasingly complex as the number of characteristics increases. The solution is to use a method that accomplishes a similar task and yet is easy to implement. There are two popular approaches: the first is a naive scoring approach, and the second is the cross-section regression (CSR) approach. Both are discussed below.

5.6.2 Scoring

Suppose that we have 8 firm characteristics. We give each stock a score of 1 to 10 for each characteristic, where 10 indicates the best and 1 the worst. Then we have 8 scores for each stock. Adding the scores together, we get one aggregate score for each stock. Hence, we can buy the stocks with the top 10% highest scores and sell those in the bottom 10%. This forms the zero-cost spread portfolio.
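The scoring scheme just described can be sketched as follows (Python, NumPy only), with each characteristic scored 1-10 by decile rank; the characteristic values are simulated and all numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
N, K = 200, 8                       # 200 stocks, 8 characteristics
c = rng.standard_normal((N, K))     # hypothetical raw characteristic values

# score each characteristic 1..10 by decile rank (10 = best)
scores = np.empty((N, K), dtype=int)
for k in range(K):
    order = np.argsort(c[:, k])
    scores[order, k] = np.arange(N) * 10 // N + 1

total = scores.sum(axis=1)          # aggregate score per stock
cut_hi, cut_lo = np.quantile(total, [0.9, 0.1])
long_ids = np.where(total >= cut_hi)[0]     # buy the top 10% scores
short_ids = np.where(total <= cut_lo)[0]    # sell the bottom 10% scores
```

Note the equal weighting of the 8 characteristics in `total`; this is exactly the weakness discussed below.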
If only investing is of concern, we just buy the top 10%.

Instead of scoring from 1 to 10, a z-score, the number of standard deviations from the mean, is also often used. One can compute
$$z_{i,k} = \frac{c_{i,k} - \bar c_k}{\mathrm{std}(c_k)},$$
where $c_{i,k}$ is firm $i$'s $k$-th characteristic, $\bar c_k$ is its mean across firms, and $\mathrm{std}(c_k)$ is its standard deviation. The z-score is also known as a standard score, and can be placed on a normal distribution curve. The aggregate z-score is defined by
$$\bar z_i = \frac{1}{K}(z_{i,1} + z_{i,2} + \cdots + z_{i,K}),$$
where $K$ is the number of characteristics.

Although scoring can in principle be used to construct systematic factors, common to all stocks like the market factor, it is perhaps better suited for selecting stocks that satisfy a number of desired criteria/conditions/characteristics. Scoring is easy to implement, but it has weaknesses. The most important of all is that it weights all characteristics equally, which is clearly not true in practice: certain factors are more important than others. The CSR below does not suffer from this problem.

5.6.3 Cross-section regression

The cross-section regression (CSR) approach typically runs a regression of stock returns on one or more firm characteristics across stocks. This can be used not only to construct systematic factors (for understanding risk exposures), but also to forecast stock returns (for selecting stocks or sectors). In practice, such regressions are known as fundamental factor models and characteristics-based models. Consider, for example, the question of how firm size affects future stock returns. We run a CSR of firm returns on size,
$$R_{i,t} = a + b\,\mathrm{size}_{i,t-1} + \epsilon_i, \qquad i = 1, 2, \ldots, N, \qquad (5.48)$$
where $N$ is the number of firms. In the above regression, the time is fixed, and the regression is run across firms. In terms of data, the regression may be written, say, as
$$\begin{bmatrix} R_{IBM,t} \\ R_{Apple,t} \\ \vdots \\ R_{Google,t} \end{bmatrix} = a + b \begin{bmatrix} \mathrm{size}_{IBM,t-1} \\ \mathrm{size}_{Apple,t-1} \\ \vdots \\ \mathrm{size}_{Google,t-1} \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{bmatrix}. \qquad (5.49)$$
Again, the time here is fixed, and we ask how the returns across firms are predicted by their sizes. The performance of the CSR can be assessed as usual by the magnitude of the slope, its significance, and the $R^2$ of the regression.

Most important of all, we should assess the economic performance. Based on Equation (5.48), we can compute the estimated coefficients at each time, and then forecast the return for the next period,
$$\hat R_{i,t+1} = \hat a_t + \hat b_t\,\mathrm{size}_{i,t}, \qquad i = 1, 2, \ldots, N, \qquad (5.50)$$
where $\hat a_t$ and $\hat b_t$ are the estimated predictive coefficients at time $t$. We can then buy the stocks with the top 10% highest predicted returns and sell the bottom 10% with the lowest. The performance of this spread portfolio over time is the economic value the CSR brings to the table. Based on the CSR, we can also construct a systematic size factor as
$$f_{size} = f_1 - f_{10}, \qquad (5.51)$$
where $f_1$ is the return on the long position in the highest predicted-return stocks, and $f_{10}$ is that on the lowest. This size factor is clearly very closely related to the earlier size factor constructed by sorting stocks on size. Indeed, they are mathematically equivalent (assuming $\hat b_t > 0$). However, the CSR is more flexible, as it can control for additional factors for better investment performance.

For example, if we think that stocks are affected by the market, size, and idiosyncratic volatility (IVol), then we run a CSR,
$$R_{i,t} = a + b_1 \beta_{i,t-1} + b_2\,\mathrm{size}_{i,t-1} + b_3\,\mathrm{IVol}_{i,t-1} + \epsilon_i, \qquad i = 1, 2, \ldots, N. \qquad (5.52)$$
This uses all the information in the three firm characteristics to predict the future returns optimally, and then one can buy the stocks with the highest expected returns and short those with the lowest. Sorting or scoring will not be able to achieve what the CSR does.
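The CSR forecast-and-rank steps, (5.48)-(5.50), can be sketched for one period as follows (Python, NumPy only); the cross-section is simulated and all coefficient values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 300
size = rng.normal(0.0, 1.0, N)                  # standardized log size at t-1
ret_t = 0.01 - 0.003 * size + 0.05 * rng.standard_normal(N)   # returns at t

# cross-sectional regression at a fixed time t: returns on lagged size (5.48)
X = np.column_stack([np.ones(N), size])
a_t, b_t = np.linalg.lstsq(X, ret_t, rcond=None)[0]

# use the fitted coefficients to forecast next-period returns (5.50)
size_next = size + 0.1 * rng.standard_normal(N)  # characteristics at t
ret_hat = a_t + b_t * size_next

# long the top decile of predicted returns, short the bottom decile
order = np.argsort(ret_hat)
m = N // 10
long_ids, short_ids = order[-m:], order[:m]
spread_pred = ret_hat[long_ids].mean() - ret_hat[short_ids].mean()
```

In practice, this regression is re-run each period, and the realized spread-portfolio returns over time measure the economic value of the CSR.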
If there are $K > 1$ characteristics, we simply run a multiple CSR,
$$R_{i,t} = a + b_1 C_{i,1,t-1} + b_2 C_{i,2,t-1} + \cdots + b_K C_{i,K,t-1} + \epsilon_i, \qquad i = 1, 2, \ldots, N, \qquad (5.53)$$
where $C_{i,k,t-1}$ is firm $i$'s $k$-th characteristic at time $t-1$. The CSR finds the best (linear) predictability from the $K$ characteristics collectively, and weights their importance according to their individual predictive power. Chapter 11 provides more discussion and the detailed procedures for implementing the CSR. To assess the importance of one particular characteristic, one can examine its risk premium, or compare the performance from the above regression with and without it. However, whether some characteristics can be removed is a difficult econometric problem.

5.6.4 Machine learning methods

As extensions of the CSR, various machine learning methods (see Chapter 10) can be applied to forecast the cross section of stock returns. One can then sort stocks based on the expected returns, and the resulting long-short portfolio will be a factor that represents all the characteristics used to forecast the returns. The application of machine learning methods to finding factors is a direction of active research; see, for example, Coqueret and Guida (2020), Jurczenko (2020), Han et al. (2021) and Neuhierl et al. (2021). It is also related to factor investing, to be discussed in the next section.

5.6.5 Time series vs cross section

There is often confusion between time series regressions and cross-section regressions, and between time series and cross-section factors. Let us make the distinctions clear here. In a CSR, we want to know how well one predictor predicts the returns across firms, or how well one variable explains the performance of the students (e.g., the hours each worked for Prof. Zhou's class).
In contrast, a time series regression asks how well one predictor predicts the returns over time, or how well the market factor explains the return over time (the market, size, and book-to-market factors are all time series factors). In terms of an equation, a time series regression regresses an asset return over time,
$$R_{IBM,t} = a + b x_{t-1} + \epsilon_t, \qquad t = 1, 2, \ldots, T, \qquad (5.54)$$
where $T$ is the sample size, say 120 for 10 years of monthly data. In terms of data, the regression is
$$\begin{bmatrix} R_{IBM,1} \\ R_{IBM,2} \\ \vdots \\ R_{IBM,T} \end{bmatrix} = a + b \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{T-1} \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_T \end{bmatrix}. \qquad (5.55)$$
In contrast to the CSR, here there is only one stock, and we examine how a variable $x_{t-1}$ predicts $R_{IBM,t}$ over time. If we use $x_t$ instead of $x_{t-1}$ in the regression, then we examine how well $x_t$, such as the market factor, explains $R_{IBM,t}$, as both variables occur at the same time.

Now let us examine the difference between an aggregate-level factor, such as the market factor, and firm characteristic factors, such as earnings per share. The former is a systematic risk factor, and each firm's exposure is measured by the beta from the time series regression on the factor. The latter is a firm-level factor, and each firm's exposure is measured directly by its observed value, such as earnings per share. The two will not necessarily coexist. For example, industrial production is a well-known systematic factor, and it is difficult to come up with a measure of it at the firm level other than the regression beta. On the other hand, the quality of corporate governance may be well measured at the firm level by a ranking of 1 through 10, but a systematic governance factor seems unknown (one may construct a spread portfolio by the ranking; however, it may not earn a significant risk premium). Nevertheless, for some factors, such as size and book-to-market, we do have both systematic risk factors at the aggregate level and individual measures at the firm level.
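The time series predictive regression (5.54) can be sketched as follows (Python, NumPy only); the predictor and return series are simulated, with hypothetical coefficients.

```python
import numpy as np

rng = np.random.default_rng(7)
T = 120                                     # 10 years of monthly data
x = rng.standard_normal(T + 1)              # hypothetical predictor, e.g. a valuation ratio
# r_t depends on the lagged predictor x_{t-1} plus noise
r = 0.005 + 0.02 * x[:-1] + 0.05 * rng.standard_normal(T)

# regress the single stock's returns on a constant and the lagged predictor
X = np.column_stack([np.ones(T), x[:-1]])
a_hat, b_hat = np.linalg.lstsq(X, r, rcond=None)[0]

# out-of-sample forecast for month T+1 using the last observed predictor value
r_forecast = a_hat + b_hat * x[-1]
```

Using `x[1:]` instead of `x[:-1]` as the regressor would turn this into a contemporaneous (explanatory) rather than predictive regression, the distinction made in the text.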
5.7 Uses of factor models

It will be useful to discuss some common uses of factor models.

5.7.1 Capital budgeting/Expected return estimation

First, factor models are useful for capital budgeting. Assuming the factor model is true, one obtains the expected return on a firm given the systematic factors. One can interpret it as the return investors expect from taking the systematic risks, regardless of whether the alphas are truly zero or not. Combining it with other information, one can get the WACC for valuing projects. Intuitively, if the total risk premium from the systematic risk exposures is 10% (the sum of the betas times the factor risk premia), then the company should require at least a 10% return on a project with the same risk. Otherwise, the shareholders can maximize their value by investing in the stock market rather than the project.

In portfolio choice, it is critically important to provide accurate estimates of the expected returns and covariances. Historical averages are important, but they are backward-looking. In practice, macroeconomic outlooks can often lead to forward-looking estimates of the performance of the market and various factors. With a factor model, these can be turned into forward-looking estimates of the expected returns. The model can also help in estimating the stock covariances if the factor covariances are known.

For example, consider the use of factor models for forecasting stock returns. Suppose you run a time series regression of a stock or your portfolio on the factors, say equation (5.61) below. If you have forecasted the return on the market to be 3%, and that on the size factor to be 2% next month, you can compute the forecasted return on your stock or portfolio by replacing the factors with their predicted values in (5.61). When forecasting the return for a longer term, say a year, you need to add the returns from next month up to a year ahead to get the annual return, $r_{p,1\to 12} = r_{p,t+1} + \cdots + r_{p,t+12}$.
Clearly the same estimated slopes/coefficients apply,
$$r_{p,1\to 12} = 3\% \times 12 + 1.1 \times r_{m,1\to 12} + 0.7 \times f_{size,1\to 12}, \qquad (5.56)$$
so the only differences are to scale the intercept by 12 and to replace the factor returns by their predicted returns for the next year.

5.7.2 Smart beta and factor investing

Smart beta in practice generally means an investment strategy that deviates from holding the value-weighted market index. The latter is a passive strategy whose beta (relative to the index) is 1. Smart beta strategies typically hold a portfolio of passive investments combined with some exposure to active investments, particularly factor investing. Factor investing is an investment strategy used by many fund managers to beat a benchmark index. The idea is to tilt your portfolio towards some factors, where the factors are the so-called fundamental factors that are firm specific.

The motivation for smart beta, as discussed by Ghayur, Heaney and Platt (2019), comes from 3 potential drawbacks of the value-weighted index:

1. Concentration: Large firms dominate the index, and some sectors may have excessive representation in the index (e.g., during the internet bubble, the tech weight in the S&P 500 Index increased from 13% in 1998 to more than 30% at the start of 2000).

2. Volatility: High concentration tends to generate high volatility.

3. Propensity: Value-weighting tends to overweight overvalued stocks and underweight undervalued stocks, so the index may lose more when the mispricing inevitably corrects.

Another drawback is that cap-weighted investing cannot address firm-specific investment objectives such as ESG (environmental, social and corporate governance).
Cap-weighted index investing is still the primary approach in practice, since it is easy to implement and is consistent with the CAPM (it is optimal in an ideal efficient-market world in which all investors are smart, have the same information, and have quadratic utility preferences). It is a buy-and-hold passive strategy with minimal fees. This is why index funds keep growing over time.

The most popular factors are size, value, momentum, quality, and low volatility, though there are hundreds of potential factors; each firm characteristic can be a potential factor. Han et al. (2021) examine up to 299 firm-specific factors, and Neuhierl et al. (2021) consider in addition firm option characteristics, which seem entirely new in the factor investing literature.

To understand factor investing further, consider a couple of examples. Suppose that stock returns are driven by the market and size factors,
$$r_{it} = \gamma_0 + \gamma_1 \beta_{mi} + \gamma_2\,\mathrm{Size}_i + \text{other factors} + v_{it}, \qquad (5.57)$$
where $v_{it}$ is the residual. There are many ways to tilt your portfolio towards size. The simplest is to use a standard equal-weighted spread portfolio. Based on the size of each firm, $\mathrm{Size}_i$, we can sort stocks into decile portfolios. The spread portfolio is simply long the smallest decile and short the largest decile. Then, effectively, our portfolio is
$$w = \rho\, w_m + (1 - \rho)\, w_{LS},$$
where $\rho$ is the proportion in the market, say 80%, $w_{LS}$ is the weight vector of the spread portfolio, with values of $1/m$ or $-1/m$ on the long and short deciles and zeros for the other stocks, and $m$ is the number of stocks in each decile. Although $w_{LS}$ has negative components, $w$ is typically non-negative in practice, as $\rho$ is not far from 100%, so that there are no short sells in the end.
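The tilted portfolio $w = \rho w_m + (1-\rho) w_{LS}$ can be sketched as follows (Python, NumPy only). The universe, market weights, and $\rho$ are all illustrative; note that with these toy equal market weights, $\rho$ must be somewhat above 90% (here 95%) for $w$ to stay non-negative, which is the "close to 100%" condition in the text.

```python
import numpy as np

N = 50                                   # toy universe: 50 stocks, 5 per decile
m = N // 10
w_m = np.full(N, 1.0 / N)                # stand-in for market weights (equal here)
decile = np.arange(N) * 10 // N          # assume stocks are already sorted by size

# zero-cost spread portfolio: long smallest decile, short largest
w_ls = np.zeros(N)
w_ls[decile == 0] = 1.0 / m
w_ls[decile == 9] = -1.0 / m

rho = 0.95                               # proportion held in the market
w = rho * w_m + (1 - rho) * w_ls         # tilted portfolio weights
```

Because $w_{LS}$ is zero-cost, the tilted weights sum to $\rho$ rather than 1; the remainder can sit in cash or be rescaled.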
An alternative approach is to simply hold the factor portfolio or the long-short portfolio, which can be implemented in practice by buying the equity smart beta (SB) ETFs or strategic-beta exchange-traded products that match the factor of interest (this is often feasible). Then you hold the market index and this ETF, whose return is
$$R = \rho R_m + (1 - \rho) R_{ETF},$$
where $R_{ETF}$ is the return on the ETF.

Another, more quantitative, approach is to use the factor model,
$$r_{it} = \alpha_i + \beta_{mi} R_{mt} + \beta_{si} R_{size,t} + \epsilon_{it}, \qquad (5.58)$$
to obtain the betas of all the stocks. Suppose you have 10 stocks you want to buy, and you want to load up on the size factor, say to have a loading of 2 for your portfolio, as you expect the factor to have a good reward next period. From equation (5.58) and the like, you can get the size betas of all your stocks, say $\beta_1 = 0.7, \beta_2, \ldots, \beta_{10}$. Then you want
$$w_1 \times 0.7 + w_2 \beta_2 + \cdots + w_{10} \beta_{10} = 2.$$
In the above equation, the betas are known, and you need to solve for the portfolio weights. You may impose in addition that the weights sum to 1, and other conditions. Then, applying a quadratic program, you can solve for the weights that meet your needs.

Ang (2014) and Ghayur, Heaney and Platt (2019) provide more extensive discussions of the motivation for and practice of factor investing. Jurczenko (2020) and Coqueret and Guida (2021) provide state-of-the-art applications of machine learning tools to factor investing.

5.7.3 Hedging

This is related to factor investing. Instead of taking factor risks, you eliminate them. Suppose equation (5.61) below describes your portfolio. If you are concerned about the factor risks next month, you can short 1.1 units of the index and 0.7 units of the size factor (using futures or ETFs in practice) per dollar of your portfolio. Then you can remove the factor risk exposures without having to liquidate your entire portfolio.
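The hedge just described can be sketched as follows (Python, NumPy only): short the estimated factor exposures against the portfolio and check that the volatility drops. The return series and loadings (1.1 on the market, 0.7 on size) are simulated and hypothetical.

```python
import numpy as np

rng = np.random.default_rng(8)
T = 120
rm = 0.005 + 0.04 * rng.standard_normal(T)      # market factor returns
fs = 0.002 + 0.03 * rng.standard_normal(T)      # size factor returns
# a portfolio with loadings 1.1 and 0.7 plus idiosyncratic noise
rp = 0.03 / 12 + 1.1 * rm + 0.7 * fs + 0.02 * rng.standard_normal(T)

# hedge: short 1.1 units of the market and 0.7 units of the size factor
hedged = rp - 1.1 * rm - 0.7 * fs

vol_before, vol_after = rp.std(ddof=1), hedged.std(ddof=1)
```

After the hedge, only the idiosyncratic noise remains, so the volatility is much lower while the alpha is untouched.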
5.7.4 Measuring performance

Another use of factor models is for evaluating a fund manager's performance. Consider first the one-factor case. Suppose we have

rpt = α + β rmt + εpt, t = 1, . . . , T, (5.59)

where rpt is the excess return of the actively managed portfolio. With data, suppose we have an estimated model,

rpt = 5% + 1.1 × rmt + εpt, (5.60)

which says that the manager earns 5% extra, the alpha, after adjusting for market risk. Hence, in terms of the market factor, the manager seems to have skill. If the 5% were zero, he would not have any skill, as you could buy 1.1 units of the market index to replicate the performance. Now suppose we consider further a size factor, and the estimated model is

rpt = 3% + 1.1 × rmt + 0.7 × fsize,t + εpt. (5.61)

Then, accounting for the additional factor, the manager earns only 3% alpha. In practice, quite a few commonly traded factors may be used to assess the alpha. If the alpha becomes zero, the investor can buy the factors in suitable proportions to replicate the fund performance. The unexplained positive alpha may be a measure of skill. However, in practice, the CAPM is the most widely used model for fund performance evaluation.

6 Factor Models 2: Unknown Factors

Both the CAPM and Fama-French 3-factor models assume that we know the driving forces of the stock returns: a) the number of factors; b) the specific form of the factors. This is clearly not true in the real world. In this section, we provide statistical methods for estimating both the number of factors and the factors themselves.

6.1 Latent factor model

To start, we may agree that there is one factor that determines all the stock returns, but the factor may not necessarily be the market portfolio or stock index. That is, we consider the following one-factor model,

rit = αi + βi ft + εit, t = 1, . . . , T, (6.1)

where the factor ft is latent or unobservable.
This is very similar to the market model regression except that now the factor is unknown and has to be estimated from data. Before estimating, it is important to understand the identification problem in a latent factor model. When the factor is latent, it can only be identified up to scale. This is because if ft is the factor, a new factor f*t = c ft works the same as (6.1) with β*i = βi/c,

rit = αi + (βi/c)(c ft) + εit,

where c ≠ 0 is a constant. So, in a latent factor model, once we find one factor, we can use any scaled version of it. Another related issue is that we can 'standardize' the factor by setting its mean to zero,

E[ft] = 0. (6.2)

This will not affect the model either. Indeed, if E[ft] ≠ 0, the new factor f*t = ft − E[ft] will have mean zero, and we still have mathematically the same factor model,

rit = α*i + βi f*t + εit,

if we define the new alpha as α*i = αi + βi E[ft]. The reason for setting the factor mean to zero is to simplify the task of finding the factor. In general, let f1t, . . . , fKt be K latent factors (or systematic risks) of the stock market. Then, a K-factor model for the returns on N assets is:

rit = αi + bi1 f1t + · · · + biK fKt + εit, t = 1, . . . , T, (6.3)

where bi1, . . . , biK are the factor loadings on the risks, and εit is the specific factor or idiosyncratic risk. The factor model is often written in vector form,

rit = αi + β′ft + εit, (6.4)

with vector notation for the betas and factors,

β = (β1, . . . , βK)′, ft = (f1t, . . . , fKt)′.

Following the convention, we set all the factor means to zero, so that αi = Erit, which can be estimated by the sample mean of the returns. Hence, in a latent factor model, the major task is to estimate K, the betas, and the factors. Note that the scale invariance property becomes invariance to any nonsingular linear transformation when K > 1. In this case, if ft is a factor, then f*t = C ft is a new one, where C is any nonsingular K × K matrix.
Let β* = (C−1)′β; then

β*′f*t = β′C−1C ft = β′IK ft = β′ft,

so the factor model is unchanged. Ross (1976) shows that, in the absence of riskless arbitrage opportunities, there exists an approximate linear relationship between expected asset returns and their risk exposures to the latent factors,

Erit ≈ λ0 + bi1 λ1 + · · · + biK λK, (6.5)

as the number of assets satisfying (6.3) increases to infinity, where λ0 is the intercept of the pricing relationship and λk is the risk premium on the k-th factor, k = 1, . . . , K. The relationship (6.5) is known as the implication of the APT (arbitrage pricing theory). When K = 1, the APT says that

Erit ≈ λ0 + bi1 λ1, (6.6)

that is, the greater the beta, the greater the expected return. This is very similar to the CAPM, but is fundamentally different, because the factor is not necessarily the market factor, but one that is systematic to all stocks and estimated from data. How do we estimate the number of factors and the factors themselves? There are three common approaches: principal components analysis (PCA), which is the most popular; asymptotic PCA (aPCA), which is computationally preferred with a large number of assets; and traditional factor analysis. We will focus on the first two, which are the most useful.

6.2 Principal components analysis

Principal components analysis (PCA) is a general dimension reduction approach. In this section, we first review the concepts of eigenvalues and eigenvectors, then provide the details for computing the principal components (PCs). Finally, we explain the theory behind it.

6.2.1 Eigenvalues and eigenvectors

First, let us review the concepts of eigenvalues and eigenvectors. Consider a 2 × 2 matrix

Σ = [ 2.05 1.95 ; 1.95 2.05 ]. (6.7)

Any vector (a1, a2)′ satisfying

Σ (a1, a2)′ = λ (a1, a2)′ (6.8)

is called an eigenvector, and λ the associated eigenvalue. Here we have

[ 2.05 1.95 ; 1.95 2.05 ] (1, 1)′ = 4 × (1, 1)′ and [ 2.05 1.95 ; 1.95 2.05 ] (1, −1)′ = 0.1 × (1, −1)′.
So 4 and 0.1 are the two eigenvalues, and

A1 = (1, 1)′, A2 = (1, −1)′

are the eigenvectors. Note that the eigenvectors are not unique, as their scaled versions are also eigenvectors. However, once they are standardized, a1² + a2² = 1, they are unique up to a sign. In our example here,

A1 = (1/√2, 1/√2)′, A2 = (1/√2, −1/√2)′

are the standardized ones (scaled to make the sum of squared components equal to one). Clearly,

A*1 = (−1/√2, −1/√2)′, A*2 = (−1/√2, 1/√2)′

are also standardized eigenvectors. But they are essentially the same as A1 and A2 except for the sign. In finance, the covariance matrix of n assets, Σ, is of great importance in determining the risk of any portfolio of the assets. It has two important properties: 1) symmetry; 2) positive definiteness (in particular, it is invertible, ruling out redundant assets, i.e., those that are linear combinations of others). Symmetry means that the transpose of the matrix equals the matrix itself. Positive definiteness means that, for any nonzero n-vector η ≠ 0, we have

η′Ση > 0.

We can scale η so that its elements sum to 1 without affecting the above inequality; then the inequality says that the risk of any fully invested portfolio of the assets is not zero. The inequality, or positive definiteness, must be true intuitively: if there are no redundant assets, any portfolio of risky assets must be risky. Otherwise, there is at least one portfolio that has no risk,

w1 r1 + w2 r2 + · · · + wn rn = 0.

We can then solve for one asset as a linear combination of the other assets, implying this asset is redundant, contradicting our assumption. The no-redundancy assumption will always be assumed in this section; it is often assumed implicitly in portfolio theory without even being mentioned. In practice, returns on different stocks are unlikely to be redundant because, for any given stock, linear combinations of other stocks cannot perfectly replicate it.
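The eigenvalue example above is easy to verify numerically; numpy's eigh returns the eigenvalues of a symmetric matrix in ascending order:

```python
import numpy as np

Sigma = np.array([[2.05, 1.95],
                  [1.95, 2.05]])

# For symmetric matrices, eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Sigma)

print(np.round(eigvals, 6))                # [0.1 4. ]
print(np.round(np.abs(eigvecs[:, 1]), 4))  # [0.7071 0.7071], i.e., 1/sqrt(2) each
```

The absolute value in the second line reflects the sign indeterminacy discussed in the text: the solver may return either A1 or −A1.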
Under the no-redundancy assumption, the covariance matrix Σ of n assets is positive definite, and so, mathematically, it is invertible and has exactly n positive eigenvalues (some could be equal, similar to repeated roots of a quadratic equation) and n standardized eigenvectors associated with them (unique up to signs).

6.2.2 PCs: data

For a large data set of n variables, PCA re-packages them into n components, the PCs, ordered in such a way that the first (newly created) component contains the maximum variation, the second component is orthogonal to the first and contains the second-largest amount of variation, etc. So the last component contains the smallest amount of variation. The idea is that we can focus on the first K important components while dropping the rest, which are less important. Let X be a T × n matrix of the data, where n is the dimensionality and T is the sample size. Suppose that X is de-meaned (the sample means are subtracted from the data, as researchers often do in applying PCA). Then the n × n matrix

Σ̂ ≡ X′X/T (6.9)

is the sample covariance matrix. It has n eigenvalues, λ1 ≥ λ2 ≥ · · · ≥ λn, and n eigenvectors A1, . . . , An, each of which is an n-vector. The first PC, in terms of data, is defined as

P1t = A11 X1t + A12 X2t + · · · + A1n Xnt = A′1 Xt, t = 1, 2, . . . , T, (6.10)

which is a weighted sum of the data (a portfolio if the Xt are returns) with the first eigenvector A1 as weights. Hence, the first PC, P1, is simply a repackaging of the original data. Mathematically, it has the property that

var(P1) = λ1, (6.11)

that is, its variance equals the largest eigenvalue (the proof is in the next subsection). Similarly, the second PC is defined by

P2t = A21 X1t + A22 X2t + · · · + A2n Xnt = A′2 Xt, t = 1, 2, . . . , T, (6.12)

and so on. The variance of the j-th PC is equal to the j-th eigenvalue,

var(Pj) = λj , j = 1, 2, . . .
, K, (6.13)

where λj is the j-th largest eigenvalue of Σ. The second important property of the PCs is that they are orthogonal to each other. This means that the original data X = [X1, X2, . . . , Xn] are transformed into orthogonal data (the PCs) P = [P1, P2, . . . , Pn]. The orthogonality means that the PCs are uncorrelated when PCA is applied to stock returns, which simplifies the covariance structure and makes the optimal portfolio simple in terms of the PCs. In many applications, we may care only about the data that have the most variation, i.e., the first K principal components. Then we reduce the problem of studying, say, n = 1000 variables to a study of K, say K = 5, linear combinations of the original variables. This is dimension reduction. The third property of the PCs is that they are invariant (the same) under any orthogonal transformation of the data. The reason is that, if we apply PCA to a new data set that is an orthogonal transformation of the old, X* = XC, where the matrix C is orthogonal, C′C = In, the eigenvalues will remain the same, but the eigenvectors will be multiplied by C′. Based on (6.24), the PCs will then be unchanged. As for the choice of K, the number of factors, one usually examines the sum of the first K eigenvalues. If this sum is 95% of the sum of all the eigenvalues, K may be adequate, as the K factors can explain about 95% of the variation of the returns. In general, a K-factor model explains a fraction

(λ1 + λ2 + · · · + λK) / (λ1 + λ2 + · · · + λn)

of the total variance.

6.2.3 PCs: random variables

PCA can also be stated in terms of random variables (the population). It finds n linear combinations of the original n random variables. In contrast to the original ones, the new variables (the PCs) are orthogonal to each other, and the first component has the largest variance, the second component the second largest, etc.
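A minimal sketch of the data version of PCA above, on simulated "returns", verifying that var(P1) = λ1 and computing the fraction of variance explained by the first two PCs:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 500, 4
X = rng.normal(size=(T, n)) * np.array([3.0, 2.0, 1.0, 0.5])  # toy data
X = X - X.mean(axis=0)                    # de-mean, as in the text

Sigma_hat = X.T @ X / T                   # sample covariance, Eq. (6.9)
lam, A = np.linalg.eigh(Sigma_hat)
lam, A = lam[::-1], A[:, ::-1]            # re-order: descending eigenvalues

P = X @ A                                 # all n PCs as a T x n matrix

print(np.allclose(P[:, 0].var(), lam[0])) # True: var(P1) equals lambda_1
print(round(lam[:2].sum() / lam.sum(), 2))  # fraction of variance from 2 PCs
```

The de-meaning step matters: without it, X′X/T is no longer the sample covariance matrix, and the variance identities above fail.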
So, as far as variances are concerned, we can study the first K new variables instead of the original n ones, which is especially advantageous when n is large. Let x = (x1, x2, . . . , xn)′ be an n-vector of de-meaned random variables, say n de-meaned stock returns. Denote its covariance matrix by

Σ = Var(x), (6.14)

which is n × n. We now define the PCs in terms of Σ, the population parameter. Let A1 = (A11, A12, . . . , A1n)′ be the first eigenvector of Σ,

ΣA1 = λ1 A1, (6.15)

where λ1 is the largest eigenvalue. Then, the first PC is defined as a linear combination of the original variables,

P1 ≡ A11 x1 + A12 x2 + · · · + A1n xn = A′1 x, (6.16)

which says that the first PC is determined by the first eigenvector, whose elements serve as the weights on the original variables. One can define the second PC using the second eigenvector,

P2 ≡ A21 x1 + A22 x2 + · · · + A2n xn, (6.17)

and so on. In short, given the original n random variables (assets), we can repackage them to obtain n particular new random variables (linear combinations of the original assets), the A′j x's, i.e., the n PCs. Why do we do that? In practice, it is often the case that the first K (say, K = 5) are the most important. Then, as an approximation, we can replace the n original assets by the K PCs. Imagine there are thousands of stocks. With PCA, we reduce the dimensionality from thousands to K.

Example 6.1 For the covariance matrix

Σ = [ 2 1 ; 1 3 ], (6.18)

the first standardized eigenvector is

A1 = (0.526, 0.851)′, (6.19)

and hence the first PC is

P1 = 0.526 x1 + 0.851 x2, (6.20)

where x1 and x2 are the de-meaned original variables. ♠

Similar to the case with data, for the random variables P1, . . .
, Pn, there are three properties: a) the j-th component has the j-th largest variance, var(Pj) = λj, where λj is the j-th largest eigenvalue of Σ; b) they are uncorrelated; c) they are invariant if the original x are transformed via an orthogonal matrix. In practice, the population parameter Σ is unknown and is often estimated from data, say by the sample covariance matrix,

Σ̂ = X′X/T. (6.21)

Given n, if the sample size T is large enough, the estimates will converge to the population parameters. Then, PCA applied to the data will be the same as PCA applied to the population. In practice, PCA is usually applied to the data as in the previous subsection.

6.2.4 PCA factors

Principal components analysis (PCA) is a general dimension reduction approach that does not impose a factor structure on the data, and so it is more general than a factor model. But PCA can be applied to estimate the factors of a factor model. Assume that there are K factors that drive returns in the factor model, Eq. (6.3), whose vector form is

rit = αi + β′ft + εit, αi = E[rit], (6.22)

or

xit = β′ft + εit, (6.23)

where xit = rit − E[rit] are the de-meaned returns. The important question is how we estimate the factors. PCA is one of the most popular approaches to estimating ft. If we stack the first K PCs as a K × 1 vector at any time t,

Ft = [P1t, P2t, . . . , PKt]′ = Φ′Xt, K × 1, (6.24)

where Xt, n × 1, is the vector of de-meaned stock returns and Φ = [A1, . . . , AK] is an n × K matrix of the first K eigenvectors, then Ft is the PCA estimate of the realizations of the K factors at time t. When K = 1, it is simply

Ft = A11 X1t + A12 X2t + · · · + A1n Xnt = A′1 Xt,

where A1 = (A11, A12, . . . , A1n)′ is the first eigenvector. This is similar to the case of the market factor. It is a random variable of asset returns that fluctuates over time. But in terms of data, say the current month, it is a weighted average of the realized returns.
While the market factor uses the firm values as the weights, the PCA factor uses the first eigenvector as the weights. Note that the weights of the PCA factor do not sum to 1; they are scaled to unit length, which ensures that the factor has variance λ1. Now, if we stack all the factor observations as a T × K matrix, it follows that

F̂ = XΦ (6.25)

is an estimate of all the factor realizations in the K-factor model. Here we put a hat on F to emphasize that it is an estimate, rather than the true realization of the factors, which is not observable. In short, the PCA estimates of the K factors are simply the first K PCs; that is, we use the first PCs as factors. How do we interpret a PCA factor economically? This is an issue for which there are no perfect answers. One way is to examine its correlations with known economic variables. For example, if the first PCA factor has a 90% correlation with the market factor, and the second an 80% correlation with inflation, we may interpret the first factor as primarily the market and the second as largely inflation. Another way is to run a regression of the PCA factor on known variables. If the second factor has a slope of 80% on inflation and 15% on GDP, we can attribute its effects to inflation and GDP. Consider now how to determine the number of factors, given that the factor model is true. For any number K, we can use the first K PCs as the factors. Then we run the time series regression, (6.3), on the factors to get the estimated mean-squared errors, σ̂²i, for each stock i. Let

V(K) = [T / (n(T − K))] Σ_{i=1}^n σ̂²i + K g(n), (6.26)

where g(n) = n^{−1/4} log(n). The first term, up to a scale, measures how well the factors fit the linear regression. The smaller it is, the better the fit. Increasing the number of factors will always improve the fit. However, a greater number of factors introduces more parameters and greater estimation errors, so we add the second term to penalize a larger K.
The optimal trade-off between the two terms theoretically yields the right number of factors. In other words, we choose the K, denoted K*, that minimizes V(K). Econometrically, Bai (2003) is the first to provide the statistical properties of the estimated factors, given that the factor model is true. Bai and Ng (2002) provide the criterion for selecting the number of factors. The above criterion is taken from Zaffaroni (2019), who proves that the K* so chosen converges to the true value as the number of assets n increases to infinity. Empirically, we compute the factors for K = 1, 2, . . . , 30 (say), and find the K* that makes V(K) the smallest. It should be mentioned that the above factor estimate is computationally efficient when n < T, but this is rarely true in practice. A mathematically equivalent but computationally more efficient estimator is given in Section 6.3. The asymptotic theory on factor selection is the same, and that on the estimators is equivalent too, apart from a linear transformation. PCA has wide applications in finance (see, e.g., Alexander, 2001), used not only in the equity market, but also in other asset classes. For example, Litterman and Scheinkman (1991) show that three yield factors, the level, slope and curvature, obtained from PCA, explain bond returns well. Jolliffe (2002) discusses various theoretical aspects of PCA and its uses in other areas. Another wide use of PCA is to extract a few predictors out of many. When there are many predictors, running a regression on all of them is not efficient, as the estimation errors can be large. Instead, running a regression on the first PC (or the first few) can do a much better job in forecasting out-of-sample. For example, Baker and Wurgler (2006) use the first PC of six proxies as their famous investor sentiment index, and Neely et al (2014) use PCs of technical indicators to predict stock market returns.
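To illustrate the criterion (6.26), consider simulated data with three true factors. With PCA factors, the residual variances of a K-factor fit sum to the eigenvalues beyond the K-th (the fitted values are the projection X A_K A′_K), so V(K) can be computed from the eigenvalues alone. The data and the factor strength below are hypothetical, chosen so that the penalty g(n) is at the right scale for this simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, K_true = 200, 50, 3

# Simulate a 3-factor model with strong factors (all numbers hypothetical).
F = rng.normal(size=(T, K_true))
B = rng.normal(scale=2.0, size=(n, K_true))
X = F @ B.T + rng.normal(size=(T, n))
X = X - X.mean(axis=0)

lam = np.linalg.eigvalsh(X.T @ X / T)[::-1]   # eigenvalues, descending
g = n ** (-0.25) * np.log(n)                  # penalty g(n) in (6.26)

# Sum of residual variances of a K-factor PCA fit = sum of trailing eigenvalues.
V = [T / (n * (T - K)) * lam[K:].sum() + K * g for K in range(11)]

print(int(np.argmin(V)))                      # 3: the true number of factors
```

In applications the scale of the returns matters for the fit term, so the data are typically standardized before applying the criterion.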
6.2.5 The theory

For a covariance matrix, the most important property is that it can be decomposed into a product of three terms: eigenvectors, eigenvalues, and eigenvectors. That is, Σ can be written as (see a linear algebra text for the proof)

Σ = AΛA′ = [A1, . . . , An] diag(λ1, λ2, . . . , λn) [A1, . . . , An]′
  = λ1 A1A′1 + λ2 A2A′2 + · · · + λn AnA′n, (6.27)

where Ai is the eigenvector corresponding to eigenvalue λi, and the eigenvectors are orthogonal to each other with unit length. There are exactly n eigenvectors and n eigenvalues (though the eigenvalues could be equal, like the roots of an n-th order polynomial). The above is known as the Eigenvalue Decomposition Theorem or the Spectral Theorem. The decomposition holds for any symmetric matrix. The eigenvalues are greater than zero for positive definite (nonsingular) covariance matrices. The eigenvalue decomposition is a special case of the singular value decomposition (SVD). Now we are ready to understand more about PCA. It is enough to carry out the analysis in terms of the population or random variables. Statistically, PCA is motivated by finding the linear combination of the variables that has the maximum variance. In other words, we want to find a such that

P1 = a1 x1 + a2 x2 + · · · + an xn = a′x (6.28)

explains the most variation of the underlying random vector x = (x1, . . . , xn)′ (here we consider PCA in terms of the population and we assume x has zero mean), or

max_a Var(a′x) = a′Σa, (6.29)

where a ≡ (a1, a2, . . . , an)′ is standardized such that a′a = a1² + · · · + an² = 1. This means that the vector a has unit length (if a were unrestricted, the maximum would be infinity, attained by scaling a up).
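For the 2 × 2 matrix of Section 6.2.1, a quick numerical check of this maximization problem: random standardized vectors never exceed the largest eigenvalue, 4, and the first eigenvector attains it:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.05, 1.95],
                  [1.95, 2.05]])
lam, A = np.linalg.eigh(Sigma)            # eigenvalues 0.1 and 4, ascending

# Random standardized vectors never beat the largest eigenvalue ...
best = 0.0
for _ in range(10000):
    a = rng.normal(size=2)
    a = a / np.linalg.norm(a)             # standardize: a'a = 1
    best = max(best, a @ Sigma @ a)

# ... and the first eigenvector attains it exactly.
print(best <= lam[-1] + 1e-12)            # True
print(round(A[:, -1] @ Sigma @ A[:, -1], 6))   # 4.0
```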
Mathematically, we want to maximize the Lagrangian

f(a) = a′Σa − λ(a′a − 1) = Σ_{i,j} ai σij aj − λ(Σ_i ai² − 1), (6.30)

where λ is the Lagrange multiplier. The first-order conditions are

∂f(a)/∂a = 2Σa − 2λa = 0.

This is the same as Σa = λa, which, by our definition, says that λ must be an eigenvalue and a must be the associated eigenvector. Suppose that a is the i-th eigenvector. Based on the Eigenvalue Decomposition Theorem and orthogonality,

a′Σa = λi A′i a = λi.

Therefore, to maximize a′Σa, a must be the first eigenvector, and the maximum is exactly equal to λ1, the largest eigenvalue. In other words, P1 is a random variable that is a linear combination of the original random variables, and it has the maximum variance, λ1, when the combination coefficients are the first standardized eigenvector. Similarly, the second PC is defined in the same way, to maximize the variance, but it is required to be uncorrelated with P1. It can be shown that the combination coefficients must be the second standardized eigenvector, and its maximum variance equals the second largest eigenvalue. The remaining PCs are obtained similarly. Anderson (1984, Chapter 11), a classic of multivariate statistics, provides more properties of the PCs. In practice, PCA requires only the computation of the eigenvalues and eigenvectors of Σ, which is straightforward to do with many available packages. Let λ1 ≥ λ2 ≥ . . . ≥ λn be the n eigenvalues. Put them into a diagonal matrix, and put the associated eigenvectors into a matrix A; then the i-th PC is defined as Pi = A′i x. Recall that we have already set the mean of x to zero, so that µx, the mean of x, does not enter the PC as in (6.16). In matrix form, all the PCs can be expressed as:

P = (P1, P2, . . . , Pn)′ = (A′1x, A′2x, . . . , A′nx)′ = A′x.
(6.31)

Hence, as an approximation, the covariance matrix can be modeled by the first few, say K, components, ignoring the remaining insignificant λi's, i = K + 1, . . . , n. Noticing that the eigenvectors are normalized here, A′A = AA′ = I, equation (6.31) clearly implies x = AP, or, using only the first K PCs,

x1 ≈ a11 P1 + a12 P2 + · · · + a1K PK, (6.32)
x2 ≈ a21 P1 + a22 P2 + · · · + a2K PK, (6.33)
. . .
xn ≈ an1 P1 + an2 P2 + · · · + anK PK, (6.34)

which says that the study of the original complex and potentially large number of (n) variables can be reduced to the study of only K linear combinations of the variables; that is, the first K PCs can be taken approximately as the factors. For example, the term structure of interest rates is complex, but it can often be reduced to the study of 3 PCs. See, e.g., Alexander (2001). What are the statistical properties of the estimated PCs? Beyond the above population motivation, in practice we are more interested in the estimation accuracy of the eigenvalues and eigenvectors based on an estimated covariance matrix. In general, the estimate of the largest eigenvalue tends to be larger than the true largest eigenvalue, and the estimate of the smallest tends to be smaller than the true smallest eigenvalue. But they are consistent: for fixed n, as the sample size increases, the estimated values converge to the true values theoretically. However, if n is too large relative to T, this is no longer true (see Section 6.2.6). Finally, we mention two formulas for the determinant and trace in terms of eigenvalues, which will be useful later,

det(Σ) = λ1 λ2 · · · λn, (6.35)
tr(Σ) = λ1 + λ2 + · · · + λn. (6.36)

Both are consequences of the Eigenvalue Decomposition Theorem. Indeed, given the Theorem, the determinant is the product of three other determinants.
Since the eigenvector matrix A is orthogonal, det(A) det(A′) = det(AA′) = 1, leaving only the determinant of the eigenvalue matrix, which is clearly the product above. Since tr(ABC) = tr(CAB), the trace equality follows from the Theorem too: tr(Σ) = tr(AΛA′) = tr(A′AΛ) = tr(Λ).

6.2.6 High-dimensional PCA

Theoretically, PCA works well for fixed n and large sample size T. However, if the dimensionality n is large relative to T, the traditional PCA runs into problems. For example, if we use 240 monthly observations (20 years) to extract factors out of 50 industries, there is likely a problem, as 50/240 = 20.83%, implying that n is not small relative to T. In this case, the PCA is known as high-dimensional PCA, and the estimation errors can be very high. To understand the problem, consider the simple case where the data are generated from the iid standard normal, so the true covariance matrix is the identity matrix. Let λ̂1 and λ̂n be the estimated largest and smallest eigenvalues. Then (see, e.g., Yao, Zheng and Bai, 2015), if T goes to infinity but N goes to infinity too, with N/T → η > 0, then

λ̂1 → (1 + √η)², (6.37)
λ̂n → (1 − √η)².

This says that the estimated eigenvalues are biased even asymptotically! In other words, even if the sample size is large, if n is a fixed fraction of T, the largest eigenvalue will be over-estimated and the smallest one under-estimated. Applying the asymptotic theory with N = 50 and T = 240, λ̂1 converges to 2.12 and λ̂n converges to 0.30, both of which are far from 1, the true value. Note that the estimated trace (the sum of all the eigenvalues) will still be close to N, its true value. It is just that the estimated eigenvalues are spread out, while the average stays close to the true average. The larger the η, the greater the spread. See Johnstone and Paul (2018) for a recent review of the issues, and Wu, Qin and Zhu (2020) and references therein for the latest solutions.
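A quick simulation of the high-dimensional bias, with N = 50, T = 240, and iid standard normal data as in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 240, 50                      # 20 years of monthly data, 50 "industries"
eta = N / T

X = rng.normal(size=(T, N))         # iid data: the true covariance is I
X = X - X.mean(axis=0)
lam = np.linalg.eigvalsh(X.T @ X / T)

# Asymptotic limits (6.37) vs. the sample extremes.
print(round((1 + eta ** 0.5) ** 2, 2), round((1 - eta ** 0.5) ** 2, 2))  # 2.12 0.3
print(round(lam[-1], 2), round(lam[0], 2))  # close to the limits, far from 1
```

Even in one draw, the largest and smallest sample eigenvalues land near the predicted limits rather than near the true value of 1, while the average of all the eigenvalues stays close to 1.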
6.3 Asymptotic PCA

Asymptotic PCA is more in the spirit of most applications in finance, where n increases to infinity and can be much greater than T. In this case, it is computationally much more efficient to estimate the PCA factors from the T × T data matrix,

Π̂ = XX′, (6.38)

where X, T × n, contains the de-meaned returns, so that Π̂ is T × T. Recall the K-factor model,

xit = β′ft + εit, i = 1, . . . , n, (6.39)

which is a re-write of (6.4). In terms of data, or in matrix form, we have

X = Fβ′ + ε, (6.40)

where β is n × K (the loadings) and F is T × K (the factor observations to be estimated). For example, if n = 5000 and T = 60, then Π̂, 60 by 60, is a much smaller matrix than Σ̂, which is n × n, or 5000 by 5000. Hence the computation of eigenvalues and eigenvectors is much easier for Π̂ than for Σ̂. This is why aPCA is computationally much more efficient than PCA when n is much larger than T. Let η1, η2, . . . , ηK be the first K eigenvectors of Π̂; then the factor and loading estimates are

F̌ = √T [η1, η2, . . . , ηK], β̌ = X′F̌/T. (6.41)

Note that each ηk is T × 1, so F̌ is T × K, matching the dimensionality of the factor matrix F to be estimated. Mathematically, the factors extracted from either the PCA or the aPCA are equivalent,

F̌ = F̂ V^{−1/2}, (6.42)

where V is the K × K diagonal matrix consisting of the first K largest eigenvalues of X′X/(nT). Hence, use of either of the factor estimates will yield essentially the same factor model, as the factors can only be identified up to a nonsingular linear transformation. Bai (2003) shows that, under the assumption that the K-factor model is true, if n becomes large and √N/T → 0 (meaning N cannot be too large relative to T), then the estimated factors converge to the true factors up to a rotation,

√N (F̂t − H F⁰t) → N(0, VF), (6.43)

where H is some rotation matrix and VF is the asymptotic covariance matrix.
Connor and Korajczyk (1988) propose the aPCA and apply it to as many as 1745 assets to extract factors. Bai and Ng (2008) and Bai and Wang (2016) provide reviews of various extensions.

6.4 Covariance matrix estimation

The invertibility of the covariance matrix is critical for our optimal portfolio formula, and for implementation via quadratic programming too. However, the usual estimator, the sample covariance matrix, can be singular when the sample size is smaller than the number of assets. We discuss this problem and its solutions in detail below.

6.4.1 Invertibility problem

Recall that the sample covariance matrix is

S = [1/(T − 1)] Σ_{t=1}^T (Xt − X̄)(Xt − X̄)′,

where T is the sample size and X1, X2, . . . , XT are the observations over time. It is easy to see that a necessary condition for S to be nonsingular is

T ≥ N, (6.44)

where N is the dimensionality of X, or the number of assets. The proof is easy. The rank of S must be less than or equal to T, as it is a sum of T terms each of rank 1. Since S is N × N, its rank is exactly N if it is invertible, so N ≤ T. For example, with N = 500 assets and T = 240 (20 years of monthly data), the above condition is violated, and hence it is impossible to compute the inverse of the sample covariance matrix. In practice, we can have thousands of stocks, and hence the sample covariance matrix runs into problems. It can be applied only to a limited number of asset classes, not to individual securities. Condition (6.44) is easy to prove but is only necessary, not sufficient. The more stringent necessary condition is

T ≥ N + 1. (6.45)

Indeed, it is clear that

S = UU′/(T − 1), U = [X1 − X̄, X2 − X̄, . . . , XT − X̄],

so N = rank(S) ≤ rank(U). Since U1T = 0, rank(U) ≤ T − 1, and the necessity follows. If the data are randomly drawn from a distribution with a non-singular covariance matrix, condition (6.45) is also sufficient for S to be nonsingular almost surely.
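A quick numerical illustration of the rank argument, with N = 5 assets and only T = 4 observations:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 4                          # more assets than observations
X = rng.normal(size=(T, N))          # T observations of N asset returns

S = np.cov(X, rowvar=False)          # N x N sample covariance matrix

print(np.linalg.matrix_rank(S))      # 3 = T - 1 < N, so S is singular
print(abs(np.linalg.det(S)) < 1e-12) # True: the determinant is (numerically) zero
```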
In general, the above condition is only necessary; there is no guarantee, because the data can come from a lower-dimensional space.

6.4.2 Factor-model based estimator

The key to the various factor analyses is that we can eventually model the asset returns using a "good" factor model,

r̃it − rft = µi + βi1 f̃1t + · · · + βiK f̃Kt + ε̃it, (6.46)

where the K factors capture all the systematic risks (more than 20 factors in some practitioners' models), so that we can assume the residuals are uncorrelated. Taking the covariance on both sides of the factor model, we obtain the return covariance matrix

Σ = β′Σf β + Σε, (6.47)

where Σf is the covariance matrix of the factors, and Σε is the diagonal covariance matrix of the residuals. The above Σ can be inverted easily even if there are a large number of assets and a relatively small sample size. Indeed, the inverse can be computed analytically from the well-known Sherman-Morrison-Woodbury matrix identity,

Σ^{−1} = Σε^{−1} − Σε^{−1} β′ [Σf^{−1} + β Σε^{−1} β′]^{−1} β Σε^{−1}, (6.48)

which is well defined as long as the factor covariance matrix is invertible, i.e., Σf^{−1} exists. This is clearly not a problem in practice, as the number of factors is usually small, say less than 30. In short, inversion of the covariance matrix is essential for applying mean-variance portfolio theory. Without imposing a factor structure, the standard sample covariance matrix, Equation 3.10, is not invertible unless T > N + K. Moreover, even when it is invertible, if N is large relative to T (say N/T = 0.3), it is still a poor estimator of the true covariance matrix. The above factor model with uncorrelated residuals is a solution to this problem, and it always works. Note that the factor model is not the only way to estimate the covariance matrix. In Chapter 4, we discussed two other approaches. The first is to apply a shrinkage approach to reduce the dimensionality.
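The factor-based inverse (6.48) can be checked numerically; a minimal sketch with hypothetical loadings and a diagonal residual covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 3

beta = rng.normal(size=(K, n))             # K x n loadings, as in (6.47)
Sigma_f = np.diag([4.0, 2.0, 1.0])         # hypothetical factor covariance
d_eps = rng.uniform(0.5, 1.5, size=n)      # diagonal residual variances

Sigma = beta.T @ Sigma_f @ beta + np.diag(d_eps)   # Eq. (6.47)

# Sherman-Morrison-Woodbury, Eq. (6.48): only a K x K (3 x 3) matrix and a
# diagonal matrix need to be inverted, never the full n x n matrix directly.
D_inv = np.diag(1.0 / d_eps)
M = np.linalg.inv(np.linalg.inv(Sigma_f) + beta @ D_inv @ beta.T)
Sigma_inv = D_inv - D_inv @ beta.T @ M @ beta @ D_inv

print(np.allclose(Sigma_inv, np.linalg.inv(Sigma)))   # True
```

With n assets and K factors, the direct inverse costs O(n³) while the Woodbury route costs only O(nK²) beyond the trivial diagonal inverse, which is why it scales to thousands of assets.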
Shrinkage may even include recent machine learning methods. The second is to use high-frequency data (such as daily returns). In practice, all of these can be used to resolve the invertibility problem. Further analysis of performance may determine what is best for the problem at hand, as none of the methods completely dominates all the others.

6.5 Both explicit and latent factors

In the previous chapter, we analyzed factor models in which the factors are known or explicit, such as the market factor model and the Fama-French 3-factor model. In this chapter, we examine latent factors. Now we consider a more general factor model in which both types of factors are present. Mathematically, we have

r_{it} = \alpha_i + f_t'\beta_i + G_t'\beta_{gi} + \epsilon_{it},   (6.49)

where r_{it} are excess returns, f_t are K latent factors to be estimated, and G_t are L known factors, while \beta_i and \beta_{gi} are K \times 1 and L \times 1 loadings on the factors.

The known or explicit factors, G_t, may include common systematic and macroeconomic factors, where the latter are measured as surprises in macroeconomic variables that help explain returns. For example, we may have an explicit macroeconomic factor model,

R_{it} - r_{ft} = \alpha_i + \beta_{ig}[GDP_t - E_{t-1}(GDP_t)] + \beta_{if}[IF_t - E_{t-1}(IF_t)] + \epsilon_{it}, \quad t = 1, \ldots, T,   (6.50)

where E_{t-1}(GDP_t) and E_{t-1}(IF_t) are the past expected GDP and inflation, so their differences from the realized values are the surprises or unexpected changes that can affect the market, and \beta_{ig} and \beta_{if} are individual stock sensitivities to such changes. Of course, we can add common systematic factors, such as the Fama-French 3 factors; then we will have an explicit factor model with L = 5 factors.

The combined factor model (6.49) is quite intuitive. We start with explicit factors that are known to affect stock returns, such as the Fama-French 3 factors, GDP and inflation, to obtain a set G_t.
Since the L factors of G_t may not account for all the systematic risks in the market, we need to add K unknown statistical factors, to be estimated from the data, to capture the missing systematic effects.

The estimation of the mixed factor model usually takes two steps. In the first step, a regression of the asset returns on the known factors G_t is run to obtain \hat\alpha_i and \hat\beta_{gi}. Then the unexplained returns are

u_{it} = r_{it} - \hat\alpha_i - G_t'\hat\beta_{gi},   (6.51)

which are the differences of the asset returns from their fitted values based on the observed factors. These are the returns after removing the effects of G_t.

Then, in the second step, a factor estimation approach, such as the PCA, is used to estimate the latent factors from

u_{it} = f_t'\beta_i + v_{it}.   (6.52)

With the factor estimates from here, we can plug them back into (6.49) to determine the expected asset returns and their covariance matrix. The above procedure combines the explicit factors (such as the market and GDP) with statistical factors (estimated from PCA). Conceptually, with both information sets, the factor model should work better than otherwise.

6.6 All-inclusive factor model

An all-inclusive factor model is one that combines both time series factors and firm fundamental factors, resulting in a cross section regression model that includes all possible factor effects.

6.6.1 Time series factor model

A general time series factor model may be written as

r_{it} = \alpha_i + \beta_{i1} f_{1t} + \cdots + \beta_{iK} f_{Kt} + \epsilon_{it}, \quad t = 1, \ldots, T,   (6.53)

where f_{1t}, \ldots, f_{Kt} are explicitly known or are estimated from the data. The Fama-French 3 factors, the GDP and inflation factors, and the statistical (PCA) factors can all be included in the above equation. The key is that the regression is run over time (which is why it is called a time series model) for any given stock. What we learn is the exposures, the betas, of the stock to the various systematic factors.
If the alpha is zero, it means that the systematic factors fully explain the expected return. Of course, this is not true for all stocks in the real world.

6.6.2 Fundamental factor model

In contrast to a time series factor model, a fundamental factor model often refers to a cross section regression on firm-specific variables or firm characteristics that are relevant to changes in stock prices. Examples of such factors are the price-to-earnings ratio, market capitalization, and financial leverage. A simple example of the fundamental factor model is a cross section regression,

R_i = c_0 + c_1 \mathrm{Size}_i + c_2 \mathrm{Profit}_i + \epsilon_i, \quad i = 1, 2, \ldots, N,   (6.54)

where c_0, c_1, c_2 are regression coefficients, and \mathrm{Size}_i and \mathrm{Profit}_i are firm size and profitability. The key here is that the regression is run in the cross section over firms. The slopes c_1 and c_2 are the same across firms, reflecting equal compensations to the firm characteristics.

If the purpose is to explain the returns or to attribute risk, both the explanatory variables and the dependent variables are measured at the same time t. However, if the purpose is to forecast future returns using the firm characteristics, then the characteristics are measured at time t-1 while the returns are at t. Chapter 11 provides more discussion and the detailed procedures for implementing such factor models.

6.6.3 All types of factors

Clearly, we should incorporate all relevant information to either explain or forecast the stock returns. An all-inclusive factor model, or a generalised fundamental factor model, is one that combines both time series factors and firm fundamental factors, providing a cross section regression model that considers all possible factor effects. An example is a cross section regression of returns on two sets of variables,

R_i = c_0 + c_1 \beta_{mi} + c_2 \beta_{gi} + c_3 \mathrm{Size}_i + c_4 \mathrm{Profit}_i + \epsilon_i, \quad i = 1, 2, \ldots
, N,   (6.55)

where c_0, c_1, c_2, c_3, c_4 are regression coefficients that are the same across firms. The regression is run across firms, and hence N, the number of firms, plays the role of T in a typical regression model. There are in general two sets of explanatory variables. The first captures systematic or macro factor exposures, such as the market factor and the GDP factor, where the exposures are measured by the beta sensitivities to the factors. The second set consists of directly observable firm characteristics, such as size, profitability, and so on.

In terms of data, if we have 1000 firms, we can write the above in matrix form as

\begin{pmatrix} R_1 \\ R_2 \\ \vdots \\ R_{1000} \end{pmatrix} = \begin{pmatrix} 1 & X_{1,1} & X_{1,2} & X_{1,3} & X_{1,4} \\ 1 & X_{2,1} & X_{2,2} & X_{2,3} & X_{2,4} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & X_{1000,1} & X_{1000,2} & X_{1000,3} & X_{1000,4} \end{pmatrix} \begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_4 \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_{1000} \end{pmatrix},   (6.56)

where each X_{i,j} is firm characteristic j for firm i. The parameters can be estimated by the standard OLS regression.

Suppose that the returns are measured at time t. As mentioned before, if the purpose is to explain the returns, the explanatory variables are also measured at t. However, forecasting returns is often of interest. In this case, the explanatory variables are measured and available at t-1, so that we use previous information to forecast future returns.

Haugen and Baker (1996) appear to be the first to analyze a large set of explanatory variables in the above model. Lewellen (2015) provides a more recent and comprehensive analysis. Chapter 11 provides more discussion and the detailed procedures for implementation.

6.7 Factor analysis

The maximum likelihood (ML) method is the well-established approach for estimating the parameters in traditional factor analysis. In other words, the maximum likelihood estimators of B and V are obtained by maximizing the likelihood function of the observations. However, no analytical expressions are available for the estimators, so an iterative numerical approach has to be taken for the maximization.
Nevertheless, with the use of the EM algorithm,9 we can obtain the ML estimator iteratively. The first is the E-step: we find the expectation of the complete-data log-likelihood function, which is the density of the returns data in which the factors are treated as if they were known. Then, in the second, M-step, we maximize the expected value obtained in the first step over the parameters. This leads to the following:

B^{*\prime} = [\delta S \delta' + \Delta]^{-1} (S\delta')', \quad K \times N   (6.57)

V^* = \mathrm{diag}\left( S - (S\delta')[\delta S \delta' + \Delta]^{-1}(S\delta')' \right), \quad N \times N   (6.58)

where 'diag' takes the diagonal elements of a matrix,

S \equiv \frac{1}{T}\sum_{t=1}^{T}(r_t - \hat\mu)(r_t - \hat\mu)', \qquad \hat\mu \equiv \frac{1}{T}\sum_{t=1}^{T} r_t,   (6.59)

\delta \equiv B'(BB' + V)^{-1}, \quad K \times N   (6.60)

\Delta \equiv I - B'(BB' + V)^{-1}B, \quad K \times K,   (6.61)

in which the inverse (BB' + V)^{-1} is computed from Woodbury's identity:

(BB' + V)^{-1} = V^{-1} - V^{-1}B(I + B'V^{-1}B)^{-1}B'V^{-1}   (6.62)

(so that no inversion of any N \times N matrix is needed). Given an initial estimate of B and V, \delta and \Delta can be computed, and hence so can B^* and V^*, which are the values of B and V for the next iteration. Continuing this process, the limit is the maximum likelihood estimator of B and V.

Because any rotation of the factors will also make the factor model hold, a common identification condition is to impose a diagonal restriction on

\hat J = \hat B' \hat V^{-1} \hat B.   (6.63)

Under this restriction, the factors can be estimated as

\hat f_t = \hat J^{-1} \hat B' \hat V^{-1}(r_t - \bar r).   (6.64)

There are also alternative estimators, of which Seber (1984) provides more details. Lehmann and Modest (1988) is the first to apply such an approach in finance. A Bayesian factor analysis and the APT can be found in Geweke and Zhou (1996). However, factor analysis is generally very difficult to implement, especially when there are a large number of assets. So PCA and aPCA are the major methods for estimating factors in practice.

9McLachlan and Krishnan (1997) provide a detailed introduction to the EM algorithm.
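The EM iteration (6.57)-(6.62) can be sketched as below; the data are simulated from a known factor model, and the sample sizes and the number of iterations are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(3)
T, N, K = 500, 8, 2  # hypothetical sample

# Simulate returns from a K-factor model: r_t = B f_t + e_t
B_true = rng.standard_normal((N, K))
V_true = np.diag(rng.uniform(0.5, 1.5, N))
r = rng.standard_normal((T, K)) @ B_true.T + rng.standard_normal((T, N)) @ np.sqrt(V_true)

mu_hat = r.mean(axis=0)
S = (r - mu_hat).T @ (r - mu_hat) / T  # equation (6.59)

B, V = rng.standard_normal((N, K)), np.eye(N)  # initial estimates
for _ in range(200):
    # (BB' + V)^{-1} via Woodbury (6.62): only K x K matrices are inverted
    Vi = np.diag(1.0 / np.diag(V))
    Sig_inv = Vi - Vi @ B @ np.linalg.inv(np.eye(K) + B.T @ Vi @ B) @ B.T @ Vi
    delta = B.T @ Sig_inv                      # (6.60), K x N
    Delta = np.eye(K) - B.T @ Sig_inv @ B      # (6.61), K x K
    SdT = S @ delta.T                          # N x K
    M = np.linalg.inv(delta @ SdT + Delta)
    B = SdT @ M                                # B* from (6.57), stored as N x K
    V = np.diag(np.diag(S - SdT @ M @ SdT.T))  # (6.58)

# The fitted covariance BB' + V should approximate the sample covariance S
print(np.linalg.norm(B @ B.T + V - S) / np.linalg.norm(S))
```

The fit is only approximate, since the factor model has far fewer parameters than the sample covariance matrix; the remaining discrepancy is the part of S that the K-factor structure cannot represent.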
7 Performance and Style

In this section, we first examine performance measures, and then investment styles.

7.1 Performance measures

There are many performance measures that are based on the returns of a portfolio. Although none of them is perfect, alphas and Sharpe ratios are the most widely used in practice.

7.1.1 Alpha

The most widely used alpha for assessing the performance of a portfolio is the CAPM alpha, also known as Jensen's alpha,

\alpha_p = \tilde R_p - r_f - \beta_p(\tilde R_M - r_f),   (7.1)

which is easily computed in practice as the intercept of the regression of the portfolio excess return on the market excess return:

\tilde R_p - r_f = \alpha_p + \beta_p(\tilde R_M - r_f) + \tilde\epsilon_p,   (7.2)

where \epsilon_p is the regression residual. Multiple factors are also used from time to time. For example, if the Fama and French (1993) 3 factors are used on the right-hand side of the regression, the alpha is known as the Fama and French (1993) 3-factor alpha.

7.1.2 Sharpe ratio

Recall from (2.34) that the Sharpe ratio is defined as

\text{Sharpe ratio} = \frac{E[\tilde R_p - r_f]}{s_p},   (7.3)

where s_p is the standard deviation of the excess return of a given portfolio R_p, and r_f is the riskfree rate. Since a single asset is a special case of a portfolio, the above definition also works for any asset.

7.1.3 Sortino ratio

The Sortino ratio, proposed in the 1980s, is a modification of the Sharpe ratio. The volatility may not capture what investors are concerned about, as it penalizes upside and downside movements equally,

s_p^2 = \frac{1}{T-1}\sum_{t=1}^{T}(x_t - \bar x)^2,   (7.4)

where x_t = R_{pt} - r_f, and T is the sample size. Presumably, investors love to see the asset return jump up, but not down.
So the Sortino ratio is defined only in terms of the downside volatility (or the volatility of negative returns in excess of a target),

\text{Sortino ratio} = \frac{E[\tilde R_p - R_b]}{s_p^-},   (7.5)

where R_b is the return on a target asset or index, and

(s_p^-)^2 = \frac{1}{T-1}\sum_{t=1}^{T}\min(R_{pt} - R_{bt}, 0)^2,

which effectively uses only the observations with R_{pt} - R_{bt} < 0, i.e., the underperforming returns. s_p^-, also known as the downside deviation, is a well-known downside risk measure.

7.1.4 Information ratio

Recall from (2.85) that the information ratio is similar to the Sharpe ratio, except that a benchmark index is used in place of r_f,

IR = \frac{E(R_p - R_B)}{\sigma(R_p - R_B)},   (7.6)

where R_B is the return on a benchmark index the fund manager attempts to beat, and R_p is the raw fund return.

7.1.5 Treynor ratio

The Treynor ratio is similar to the Sharpe ratio, except that the volatility is replaced by the beta risk,

\text{Treynor ratio} = \frac{E[\tilde R_p - r_f]}{\beta_p},   (7.7)

where \beta_p is the CAPM beta of the portfolio.

7.1.6 Treynor and Black appraisal ratio

The Treynor and Black appraisal ratio measures alpha per unit of volatility risk,

\text{TB appraisal ratio} = \frac{\alpha_p}{s_p}.   (7.8)

In other words, for two fund managers with the same alpha, the one who has a lower volatility on his/her portfolio is preferred.

7.1.7 Graham-Harvey volatility-matched return

The Graham-Harvey volatility-matched return, known also as M2, is defined as

GH = \tilde R_p - \tilde R_q,   (7.9)

where \tilde R_q is the return of a portfolio of S&P 500 futures and T-bills whose volatility is set equal to that of the given portfolio p (by adjusting the weight on T-bills). If a fund underperforms the volatility-matched market portfolio, the GH measure is negative. The intuition is that if an investor had a target level of volatility equal to the fund's, then the investor would have been much better off holding a fixed-weight combination of S&P 500 futures and Treasury bills than holding the fund. Graham and Harvey (1996, 1997) provide the M2 and another related measure.
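For concreteness, several of the measures above can be computed as in the sketch below, with simulated monthly data (all numbers are hypothetical, and the benchmark for the Sortino and information ratios is taken to be the market):

```python
import numpy as np

rng = np.random.default_rng(4)
T = 120        # hypothetical: 10 years of monthly returns
rf = 0.002     # assumed constant riskfree rate
mkt = 0.006 + 0.04 * rng.standard_normal(T)                          # market
rp = rf + 0.004 + 0.8 * (mkt - rf) + 0.02 * rng.standard_normal(T)   # fund

xp, xm = rp - rf, mkt - rf  # excess returns

# Jensen's alpha and beta from the CAPM regression (7.2)
beta = np.cov(xp, xm, ddof=1)[0, 1] / np.var(xm, ddof=1)
alpha = xp.mean() - beta * xm.mean()

sharpe = xp.mean() / xp.std(ddof=1)                      # (7.3)
active = rp - mkt                                        # benchmark = market
downside = np.sqrt(np.sum(np.minimum(active, 0) ** 2) / (T - 1))
sortino = active.mean() / downside                       # (7.5)
info_ratio = active.mean() / active.std(ddof=1)          # (7.6)
treynor = xp.mean() / beta                               # (7.7)

print(round(beta, 2), round(alpha, 4))
```

With a long enough sample, the estimated beta and alpha should be close to the simulated values of 0.8 and 0.004, respectively.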
7.1.8 Maximum drawdown and Calmar ratio

A drawdown is the loss of return of a portfolio between a peak (new high) and a subsequent valley. The maximum drawdown, more commonly referred to as Max DD, is the maximum peak-to-valley loss since the investment's inception or since a given time (typically 3 years). If the Max DD holds in the future, it is the loss cap for an investor who buys the fund at the peak and sells at the bottom.

The Calmar ratio is defined as a fund's annual return divided by its Max DD, or the annualized return adjusted for the Max DD risk. Both of these performance measures are particularly popular in the world of commodity trading advisors.

Ideally, if the portfolio weights are known, better performance measures might be proposed. See Grinblatt and Titman (1995) for a review of the related issues. For some recent performance measures, see Christopherson, Ferson and Turner (1999) and Cohen, Coval and Pastor (2005).

7.2 Sharpe ratio: further analysis

The Sharpe ratio (SR) is widely used and very important. Hence, it will be useful to examine its accuracy (standard error), as well as to test whether two trading strategies have the same Sharpe ratio or not.

7.2.1 Asymptotic standard error

Recall that the Sharpe ratio is defined as

SR = \frac{\mu}{\sigma},   (7.10)

where \mu is the expected excess return on a trading strategy, an asset or a portfolio, and \sigma is the standard deviation. This is not observable, but is estimated from the data,

\hat{SR} = \frac{\hat\mu}{\hat\sigma},   (7.11)

where

\hat\mu = \frac{1}{T}\sum_{t=1}^{T} R_t, \qquad \hat\sigma^2 = \frac{1}{T}\sum_{t=1}^{T}(R_t - \hat\mu)^2,

with T the sample size and the R_t's the realized excess returns. (\hat\sigma^2 could be computed by dividing by T-1, which is more accurate in a statistical sense, but this makes no difference in the asymptotic theory.) The question is how close \hat{SR} is to the true SR. Lo (2002) shows that, if R_t is iid normal, they converge asymptotically,

\hat{SR} - SR \overset{asy}{\sim} N\left(0, \frac{1 + \frac{1}{2}SR^2}{T}\right).
(7.12)

In other words, the 95% confidence interval for SR is approximately

[\hat{SR} - 1.96\hat\sigma_{SR}, \ \hat{SR} + 1.96\hat\sigma_{SR}],   (7.13)

where

\hat\sigma^2_{SR} = \frac{1 + \frac{1}{2}\hat{SR}^2}{T},   (7.14)

that is, \hat\sigma_{SR} is the estimated standard error of \hat{SR}, or the square root of the asymptotic variance after replacing the unknown SR by \hat{SR}.

The importance of computing the standard error of \hat{SR} is the following: suppose a strategy has an SR of 1, which is impressive, but T = 12, so that the standard error can be 0.355 (see Lo's paper); then the confidence interval is [0.31, 1.69], and there is a lot of uncertainty about whether the SR is greater than 0.50, roughly the value for the market. Of course, as T increases (with a longer track record), the confidence interval will shrink.

Under IID but not necessarily normal returns, Mertens (2002) shows that the asymptotic theory still holds, but we need to adjust

\hat\sigma^2_{SR} = \frac{1 + \frac{1}{2}\hat{SR}^2 - \kappa_3 \hat{SR} + \frac{\kappa_4 - 3}{4}\hat{SR}^2}{T},   (7.15)

where \kappa_3 and \kappa_4 are the skewness and kurtosis of the R_t's. Relaxing the IID assumption, Christie (2005) and Opdyke (2007) show that Mertens's adjustment is still valid under the more general assumption that returns are stationary and ergodic. Pav (2021) provides the latest comprehensive discussion of the SR.

7.2.2 Test of the difference between two SRs

In practice, we are often interested in whether one trading strategy truly outperforms another, or whether one fund manager has more skill than another. If we use the Sharpe ratio (similarly for the information ratio) to measure performance, the null hypothesis is

H_0: \frac{\mu_a}{\sigma_a} = \frac{\mu_b}{\sigma_b}.   (7.16)

Our question is whether there is enough statistical evidence to reject the null of no difference in the Sharpe ratios. Jobson and Korkie (1981) propose the following test statistic,

z_{ab} = \frac{\hat\sigma_b\hat\mu_a - \hat\sigma_a\hat\mu_b}{\sqrt{\hat\theta}} \overset{asy}{\sim} N(0, 1),   (7.17)

where

\hat\theta = \frac{1}{T}\left[ 2\hat\sigma_a^2\hat\sigma_b^2 - 2\hat\sigma_a\hat\sigma_b\hat\sigma_{ab} + \frac{1}{2}\hat\mu_a^2\hat\sigma_b^2 + \frac{1}{2}\hat\mu_b^2\hat\sigma_a^2 - \frac{\hat\mu_a\hat\mu_b\hat\sigma_{ab}^2}{\hat\sigma_a\hat\sigma_b} \right],

and \hat\mu_a,
\hat\mu_b, \hat\sigma_a^2, \hat\sigma_b^2 and \hat\sigma_{ab} are the sample means, variances and covariance, and T is the sample size. Note that the above test holds asymptotically under the assumption that the returns are distributed independently and identically (IID) over time with a normal distribution. This assumption is often not true in real data. Ledoit and Wolf (2008) provide a bootstrap approach to relax this assumption (Matlab code is posted on Wolf's website). Section 4.3 provides more discussion of the bootstrap.

7.3 Portfolio-based style analysis

To assess risk and performance, both individual and institutional investors are interested in the styles of a fund's management, such as whether it is domestic or international, growth or value, sector or index. A simple style analysis is Morningstar's style box, which classifies funds by size and growth.

• Size: Every month, Morningstar classifies all US stocks in its database according to their market capitalizations, or the total market value of all outstanding stock shares. Then, Morningstar ranks them by their market capitalizations: the top 70% as large capitalization ("large cap") stocks, the next 20% as mid-cap, and the smallest 10% as small-cap. As of April 2002, stocks with market caps of more than $8.85 billion are considered large cap; between $1.56 billion and $8.85 billion, mid cap; and less than $1.56 billion, small cap.10

• Growth and Value: Morningstar's "value" score of a stock is based on five rankings. The first, weighted 50%, is the ranking by the forward price-to-earnings ratio (P/E), which is obtained by dividing the stock price by its projected earnings per share for the next year, within its cap group. The other 50% of the value score comes from rankings on four equally weighted historical measures: price-to-sales (P/S), price-to-book (P/B), price-to-cash flow (P/C), and dividend yield.
The growth score is obtained similarly: 50% comes from the ranking of the long-term projected earnings growth rate against stocks in the same cap group, and the other 50% from rankings of the historical earnings, sales, cash flow, and book value growth in its market cap band.

10Morningstar discusses the details on its website: http://www.morningstar.com. They have not updated these April 2002 numbers yet as of 9/12/05.

A stock's style score is then obtained by subtracting its value score from its growth score, resulting in scores that can range from -100 to 100. A stock with a score of -100 would be a high-yielding, low-growth stock, while one with 100 would have no yield and very high growth. Stocks in the middle are classified as "core" stocks. The classification can vary over time with changes in the market, but on average each style will include about one third of all the stocks in each market cap.

The market cap of a fund is a weighted average of the market caps of the stocks it owns. If its market cap is at least as big as that of the top 70% of US capitalization, the fund is classified as a large-cap fund; if it falls in the next-largest 20%, mid-cap; and the rest, small-cap. Similarly, if a fund's net style score, the weighted average of its stocks' scores, equals or exceeds the "growth threshold" (normally about 25 for large-cap stocks), it is classified as growth; if its score equals or falls below the "value threshold" (normally about 15 for large-cap stocks), it is classified as value; and the others are "blend."

Portfolio-based analysis requires information on the holdings of a fund, which may be difficult to obtain and subject to change. Another problem is that a domestic fund may own stocks whose earnings are largely determined by foreign economies, and hence be highly correlated with an international fund. In effect, what matters is not the label of a style, but rather its returns.
In William Sharpe’s words, “If it acts like a duck, assume it’s a duck.” So, we may clarify funds by its return correlations with given styles. 7.4 Return-based style analysis Sharpe (1988, 1992) proposes a way to measure the effective style of a fund portfolio Rpt. Let F1t, . . . , FKt be the returns on K > 1 style benchmark index portfolios. Run the following regression Rpt = b1F1t + b2F2t + · · ·+ bKFKt + t, t = 1, . . . , T, (7.18) where b′is are interpreted as the style exposures. The regression essentially finds a portfolio of the style benchmarks that can best explain the return on the fund. Typically, we impose the following c© Zhou, 2021 Page 195 7.5 Hedge fund styles restrictions on the parameters: b1 + b2 + · · ·+ bK = 1 (7.19) bj ≥ 0, j = 1, 2, . . . ,K. (7.20) The first restriction says that the coefficients must be the weights of a portfolio, and the second eliminates short-sells. Coggin and Fabozzi (2003) provide a collection of studies on styles, while Kim, White and Stone (2005) analyze the statistical properties of the estimates. 7.5 Hedge fund styles See Brown and Goetzmann (2001). 8 Anomalies and Behavior Finance In this section, we discuss first various anomalies, and then some of the issues about the limits of arbitrage and behavior finance. 8.1 Anomalies Anomalies here mean abnormal stock returns that cannot be explained by asset pricing models. Most of them are abnormal relative to the CAPM. This section draws heavily from the excellent survey by Schwert (2003). Dong, et al (2021) show that anomalies collectively predict the market, and references therein provide a glimpse of recent research on anomalies. 8.1.1 Size and January effect Banz (1981) and Reinganum (1981) found that small firms on the New York Stock Exchange (NYSE) earned higher average returns than is predicted by the CAPM. 
Keim (1983) and Reinganum (1983) showed that much of the abnormal return to small firms (measured relative to the CAPM) occurs during the first two weeks of January, the "January effect" or "turn-of-the-year effect." Roll (1983) explained the effect as a scenario in which the high volatility of small firms causes large short-term capital losses that investors may want to realize for income tax purposes before the end of the year. This selling pressure might reduce the prices of small-cap stocks in December, leading to a rebound in early January as investors repurchase them. In any case, as Schwert (2003) showed, it seems that the size effect has disappeared since the publication of the papers that discovered it.

8.1.2 The weekend effect

French (1980) observed that the average return on the S&P composite portfolio was reliably negative over weekends from 1953 to 1977. However, like the size effect, it seems the weekend effect has disappeared, or at least substantially attenuated, since it was first published in 1980.

8.1.3 The value effect

Around the same time as the early size effect papers, Basu (1977, 1983) found that firms with high earnings/price (E/P) ratios earned positive abnormal returns relative to the CAPM. More recently, Fama and French (1992, 1993) proposed their famous 3-factor model, arguing that size and book value, which are closely related to E/P, are two risk factors (as measured by spread portfolios based on size and book/market) in addition to the market risk. However, the value effect has also disappeared, or at least attenuated.

8.1.4 The momentum effect

Jegadeesh and Titman (1993) find that recent winners (portfolios formed on the past year of returns) outperform recent losers, the "momentum" effect.
However, over a longer horizon of 3-5 years, DeBondt and Thaler (1995) found that past losers (stocks with low returns over the past 3-5 years) have higher average returns than past winners (stocks with high returns over the past 3-5 years), the "contrarian" effect. What is different here is that "the momentum effect seems to persist, but may reflect predictable variation in risk premiums that are not yet understood." (Schwert, 2003, p. 949).

8.1.5 Closed-end fund puzzle

A closed-end fund often trades at less than the value of its underlying assets, the "closed-end fund discount" anomaly.

8.1.6 Mutual fund persistence

Hendricks, Patel and Zeckhauser (1993) found that there is short-run persistence in mutual fund performance. The "cold-hands" phenomenon is so strong that poor performance seems more likely to persist than would be true by random chance.

8.1.7 IPO abnormal returns

The first-day returns on IPOs are about 20% or so, and in China about 100%! However, over the long term, say 3 years, Ritter (1991) found that the performance is in fact lower than that of comparable firms.

8.1.8 Technical analysis

Technical analysis uses past prices and perhaps other past data to predict future market movements, of which momentum, high-frequency and algorithmic trading are special cases. Traditionally, technical analysis focuses on trading indicators, and price moving averages are among the most widely used and perhaps most useful indicators. For example, a 5-day moving average is defined by

MV_5 = \frac{P_t + P_{t-1} + P_{t-2} + P_{t-3} + P_{t-4}}{5},

which is the average price over the past 5 days. When today's price is above MV_5, it indicates a positive price trend. The 20- and 200-day moving averages are the most popular ones used by traders.

Broadly speaking, data science today as applied to finance/trading is part of technical analysis, and is just more sophisticated than traditional technical analysis.
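For illustration, the 5-day moving average and the associated trend signal can be computed as follows (the prices are made up for the example):

```python
import numpy as np

# Hypothetical daily closing prices
prices = np.array([100.0, 101.5, 102.0, 101.0, 103.0, 104.5, 104.0, 105.5])

window = 5
# MV5 on day t is the average of the prices on days t-4, ..., t
mv5 = np.convolve(prices, np.ones(window) / window, mode="valid")

today = prices[window - 1:]  # prices aligned with the moving average
signal = today > mv5         # True = price above MV5, a positive trend

print(mv5.round(2))  # [101.5 102.4 102.9 103.6]
print(signal)        # [ True  True  True  True]
```

A simple long/flat timing rule would hold the asset on days when the signal is True and move to cash otherwise.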
Modern data science uses more advanced mathematical/statistical tools to extract information from past data, with the same goal of predicting returns and making profits.

In practice, all major brokerage firms publish technical commentaries on the markets, and many of their advisory services are based on technical analysis. Many top traders and investors use it partially or exclusively (see, e.g., Schwager, 1989, and Lo and Hasanhodzic, 2009). Commodities and currency trading are known to rely on it heavily. All rule-based trading, such as trend-following or systematic trading (see, e.g., Covel and Ritholtz, 2017, and Hurst, Ooi and Pedersen, 2017), is part of technical trading, though it goes under different names, likely for marketing purposes.

However, many academics are skeptics. If the market is efficient in the weak form, all past information should be useless in predicting future returns to make any abnormal profits. Although the weak-form efficiency is unlikely to hold completely, it does point out that, due to competition for profits, it is difficult to find simple rules that can make abnormal profits, because as more people become aware of them or discover them and use them, the profits are likely to disappear.

But the view that technical analysis has no value is challenged by many studies. For example, for stocks, Brock, Lakonishok, and LeBaron (1992) provide evidence on the predictive power of price moving averages for the Dow. Lo, Mamaysky, and Wang (2000) further strengthen the evidence with an automated pattern recognition analysis. Recently, Neely, Rapach, Tu and Zhou (2014), Han, Yang and Zhou (2013) and Han, Zhou and Zhu (2016) provide more extensive evidence. The first paper finds that technical analysis can predict the stock market as well as fundamentals, and can offer significant economic gains over a strategy that ignores this predictability.
The second finds that applying a moving average timing strategy of technical analysis to portfolios sorted by volatility generates investment timing portfolios that substantially outperform the buy-and-hold strategy. For high volatility portfolios, the abnormal returns, relative to the CAPM and the Fama-French three-factor models, are of great economic significance (the annualized alphas are over 20%!), and are greater than those from the well-known momentum strategy. Moreover, the abnormal performance cannot be explained by market timing ability, investor sentiment, default or liquidity risks. Similar results also hold if the portfolios are sorted on other proxies of information uncertainty. The third paper shows that technical analysis can be used to capture a trend factor that combines short-, intermediate- and long-term price trends. The trend factor performs far better than existing factors. Han, Liu, Zhou and Zhu (2021) provide a review of technical analysis.

Theoretically, why can technical analysis work? While many studies show that past prices can have predictive power for future returns (see, e.g., the references in Han, Zhou and Zhu (2016)), which implies that technical analysis can be useful, few provide the economic reasons or make the point explicitly in terms of technical indicators. Zhu and Zhou (2009) and Han, Zhou and Zhu (2016) provide theoretical models that directly justify the value of using price moving averages as predictors of stock market trends.

Why is it possible to observe trends in the stock market? There are two simple reasons:

• Due to differences in the timing of receiving information or in the speed of reaction to information, good news about a stock or about the market is not incorporated into the market instantaneously. It may take minutes, days or months, depending on the nature of the news, such as earnings, corporate restructurings or business cycles.
• Due to liquidity and uncertainty, informed traders who need to move large positions have to trade slowly over time. For example, it is often reported that large investors or hedge funds may buy or sell a security over a month or longer.

Of course, predicting the start and end of a trend is very difficult, if not impossible. Most empirical studies (including all of the above) are about identifying a trend after it starts, and recognizing the end of a trend after it is over. The great investor Warren Buffett once said to be "fearful when others are greedy, and greedy when others are fearful." This statement is a contrarian view on the stock market. When others are greedy, prices are flying high, and one should be cautious. When others are fearful, it may present a good buying opportunity. This may be a useful way to predict the start and end of a trend. But the idea needs to be quantified and tested.

Why can some well-known technical rules still work today? One answer is that there are frictions that make it hard for investors to follow them. The first friction consists of hurdles in following the rule. The rule may be too risky (remember, there are no profitable riskfree trading rules!), may incur high transaction costs, and may require great discipline and patience. For example, suppose Buffett's rule is true, and buying when others are panic selling is profitable. It still requires the buyer to withstand the losses if the temporary selling continues, and to hold the position for possibly a long time. Another friction is that investors may have other ideas/strategies which they perceive as more attractive. For example, it is known, and perhaps most will agree, that buying and holding the market index is a simple long-term strategy that can make one retire rich, but many investors simply refuse it and instead opt
for active trading or highly speculative investments, to fulfill their dream of getting rich quickly (running the risk of losing a lot) or to get the thrill/entertainment of winning and losing (running the risk of paying too high a price for it). These are additional reasons why technical analysis or rule-based trading can work.

8.2 Are the anomalies real?

As Schwert (2003) nicely puts it, "Some interesting questions arise when perceived market inefficiencies or anomalies seem to disappear after they are documented in the finance literature: Does their disappearance reflect sample selection bias, so that there was never an anomaly in the first place? Or does it reflect the actions of practitioners who learn about the anomaly and trade so that profitable transactions vanish?" These are big issues for future research.

On the sample selection bias, researchers are likely to focus their attention on "surprising" results. It seems likely that there is a bias toward the publication of findings that challenge existing theories, and this could lead to the over-discovery of "anomalies". To mitigate the sample selection bias, one has to examine whether the anomaly persists in new, independent samples.

8.3 Limits to arbitrage

Limits to arbitrage emphasizes the difficulties of doing arbitrage in practice. While it seems riskfree or almost riskfree to arbitrage derivatives mispricing or futures prices relative to the underlying assets, arbitraging "mispriced" securities is risky. For example, if Ford, whose fundamental value is $20, is mispriced at $15, a buyer faces at least three risks in doing arbitrage.11 The first is the fundamental risk: buying Ford exposes the buyer to changes in its fundamentals over time. To eliminate this, one may sell a substitute security or a replicating portfolio (RP) of Ford. But perfect substitutes are difficult or impossible to find.
The second is noise trader risk: the risk that the mispricing may become even worse due to the noise/irrational traders who caused the mispricing in the first place. A profitable arbitrage (in which one buys Ford and sells an RP against it) relies on the convergence of Ford's price to the value of the RP. In reality, the RP may stay the same while Ford's price keeps going down, i.e., the mispricing worsens. In theory, one can hold the position as long as it takes to converge. But in reality, large arbitrageurs and funds are managing money for others, and their performance is judged on a short-term basis.12 If the mispricing of the arb trades worsens to yield bad returns, investors may decide that the arbitrageur is incompetent, and withdraw their funds. If this happens, the arbitrageur will be forced to liquidate his position prematurely at a loss. Fear of such premature liquidation makes the arbitrageur less aggressive in combating the mispricing to begin with.

The third is implementation risk: the costs and the risk involved in carrying the arbitrage to convergence.

The point is that, with the presence of the three risks, it is not easy for arbitrageurs like hedge funds to exploit market inefficiencies. In fact, sometimes it might be optimal for the big money to trade in the same direction as the noise traders, thereby exacerbating the mispricing rather than correcting it. For example, De Long et al. (1990) model an economy with positive feedback traders, who buy more of an asset this period if it performed well last period. When these noise traders push an asset's price above its fundamental value, arbitrageurs do not sell or short the asset, but rather buy it, to attract more feedback traders next period, leading to still higher prices, at which they can exit at a profit.

11 Both this and the next section rely heavily on Barberis and Thaler (2003).
Griffin, Harris and Topaloglu (2003) find exactly such behavior in the recent rise and fall of the NASDAQ: "Our evidence supports the view that institutions contributed more than individuals to the Nasdaq rise and fall."

8.4 Behavior finance

Behavior finance relies on investors' behavioral biases/psychology to explain anomalies or mispricing. If there were no limits to arbitrage, there would be no mispricing, and standard asset pricing models (rational models) would be sufficient. So, it is in this sense that behavior finance also relies on limits to arbitrage or imperfection of the market. As summarized by Barberis and Thaler (2003), the common behavioral biases are: overconfidence, optimism and wishful thinking, representativeness, conservatism, belief perseverance and so on. (Some of the behavioral biases may be good traits in life, but not so in trading or in the non-emotional valuation of securities.)

Prospect theory, proposed by Kahneman and Tversky (1979) and Tversky and Kahneman (1992), models investor decision making by the following utility function,
$$\pi(p)v(x) + \pi(q)v(y), \qquad x \le 0 \le y, \qquad (8.1)$$
for a risky payoff of $x$ with probability $p$ or $y$ with probability $q$, where $\pi(\cdot)$ is a probability weighting function. In contrast with the usual expected utility theory, in which the utility $u(W)$ is a concave function over the entire range of wealth $W$, here $v$ is convex in $x$ (losses) and concave in $y$ (gains), implying that investors are risk-seeking over losses. This is motivated by the classical example of the choices most people make in the following game. If presented with the choice between
$$A: \text{payoff} = 1000 \text{ with probability } 50\% \text{ (nothing otherwise)} \qquad (8.2)$$
and
$$B: \text{payoff} = 500 \text{ with probability } 100\%, \qquad (8.3)$$
most will choose B, the same as classic risk-averse investors would do.

12 Shleifer and Vishny (1997) explore the effects of this so-called agency problem.
On the other hand, if presented with the choice between
$$C: \text{payoff} = -1000 \text{ with probability } 50\% \text{ (nothing otherwise)} \qquad (8.4)$$
and
$$D: \text{payoff} = -500 \text{ with probability } 100\%, \qquad (8.5)$$
most will choose C, the opposite of what classic risk-averse investors would do.

Behavior finance offers its own explanations for the anomalies. On closed-end funds, Lee, Shleifer and Thaler (1991) argue in a simple model that fund investors are at times too optimistic and at other times too pessimistic. Changes in their sentiment cause the difference between prices and net asset values.

Insufficient diversification refers to the phenomenon that investors diversify their portfolio holdings much less than is recommended by standard models of portfolio choice. In particular, investors exhibit a strong "home bias": French and Poterba (1991) and others find that investors in the USA, Japan and the UK allocate 94%, 98%, and 82% of their overall equity investment, respectively, to domestic equities. Ambiguity and familiarity offer an explanation for insufficient diversification.

Overtrading is a common mistake of individual investors. For example, Barber and Odean (2000) examine trading activity from 1991 to 1996 and find that the investors would have done a lot better if they had traded less. The behavioral explanation for such excessive trading is overconfidence: people believe that they have superior information and skills that justify a trade, whereas in fact they do not.

The disposition effect is another common mistake of individual investors, who are reluctant to sell assets at a loss. For example, Odean (1998) finds that individual investors in his sample tend to sell stocks which have gone up in price since purchase, rather than those which have gone down. There are two behavioral explanations. First, the effect may be due to an irrational belief in mean-reversion.
Second, it may be caused by loss aversion or risk-seeking behavior when at a loss; that is, the investors would like to gamble that the stock will eventually come back so as to avoid the painful loss (which in the real world can lead to an even more painful loss later, e.g., if they hang on to those dot-com stocks). Interestingly, Coval and Shumway (2000) find that professional traders also have this problem. In the Treasury Bond futures pit at the CBOT, traders with profits (losses) by the middle of the trading day will take less (more) risk in their afternoon trading.

9 Predictability 1: Time Series

In this section, we first discuss market efficiency, the rejection of the random walk, and limits to predictability. Then, we examine various approaches, including recently developed ones, that are used to predict asset returns.

9.1 Market efficiency

As you recall, there are three forms of market efficiency, all of which suggest that a security's price equals its "fundamental value," or that no abnormal returns can be made relative to one of the information sets: past history, public, and private. An explanation for this is that any mispricing will be quickly corrected by smarter traders and investors. However, in practice, due to limits to arbitrage and information asymmetry, the anomalies discussed in the previous section are arguably difficult to explain. Two important questions: 1) is there really no predictability? 2) if there is predictability, what is the degree, and how profitable is it?

9.2 Random walk?

Early studies of market efficiency focus on a random walk model of stock prices:
$$x_t = \mu + x_{t-1} + \epsilon_t, \qquad \epsilon_t \sim N(0, \sigma^2), \qquad (9.1)$$
where $x_t = \log(P_t)$ is the log stock price. It says that tomorrow's log price is today's plus a drift and a normal random noise, or the continuously compounded return is normal with mean $\mu$ and variance $\sigma^2$. This is exactly the lognormal assumption underlying the Black-Scholes formula. If the random walk model is true, the market must be efficient.
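A quick simulation of the random walk model (9.1) in Python; the drift, volatility, and sample size are made-up illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate (9.1): x_t = mu + x_{t-1} + eps_t, eps_t ~ N(0, sigma^2)
T = 50_000
mu, sigma = 0.0004, 0.01
eps = sigma * rng.standard_normal(T)
x = np.cumsum(mu + eps)                     # log prices x_1, ..., x_T (x_0 = 0)
ret = np.diff(np.concatenate(([0.0], x)))   # continuously compounded returns

# Under the model, the returns are iid N(mu, sigma^2)
print(ret.mean(), ret.std())                # should be close to mu and sigma
```

With a sample this large, the sample mean and standard deviation of the simulated log returns recover $\mu$ and $\sigma$ closely, illustrating the iid-normal-return content of the model.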
However, if the market is efficient, the random walk model is not necessarily true.

Equation (9.1) says that the time series $(x_t - x_{t-1})$ is iid, and hence the mean and variance are estimated consistently by the sample analogues,
$$\hat\mu = \frac{1}{T}\sum_{t=1}^{T}(x_t - x_{t-1}) = \frac{1}{T}\sum_{t=1}^{T}\log(P_t/P_{t-1}), \qquad (9.2)$$
and
$$\hat\sigma^2_a = \frac{1}{T}\sum_{t=1}^{T}(x_t - x_{t-1} - \hat\mu)^2. \qquad (9.3)$$
To test (9.1), notice that it implies
$$x_t = 2\mu + x_{t-2} + (\epsilon_t + \epsilon_{t-1}), \qquad (9.4)$$
so the sample variance of $x_t - 2\mu - x_{t-2}$ should estimate $2\sigma^2$, or (dividing the result by 2) we have an alternative variance estimator
$$\hat\sigma^2_b = \frac{1}{T}\sum_{k=1}^{T/2}(x_{2k} - x_{2k-2} - 2\hat\mu)^2. \qquad (9.5)$$
Intuitively, if (9.1) is true, both estimators should converge to $\sigma^2$, and hence their ratio,
$$J_r = \frac{\hat\sigma^2_b}{\hat\sigma^2_a}, \qquad (9.6)$$
should converge to 1. Indeed, Lo and MacKinlay (1988) show that
$$\sqrt{T}\,(J_r - 1) \overset{asy}{\sim} N(0, 2), \qquad (9.7)$$
which says that the scaled and centered statistic is asymptotically normally distributed with mean 0 and variance 2. Since $J_r$ is based on the ratio of two variances, it is known as the variance-ratio test.

Hence, if one finds from real data that $J_r$ is significantly different from 1, as judged by the above asymptotic normal distribution, one can reject the null hypothesis that (9.1) is true. Lo and MacKinlay reject the random walk hypothesis for the US stock market indices.

If a stock price or the market index follows a random walk, then there is no predictability whatsoever. Lo and MacKinlay's rejection of the random walk hypothesis opened the door for studying predictability.

9.3 Limits to predictability

In the real world, profit competition is clearly a strong force that eliminates any obvious predictability. One important point to make is that no matter how much resources investors put into studying the market, or no matter how hard they try, they may not get what they want, namely to beat the market; in fact, half of them will fail!
The reason is that the returns of all investors must sum to the market return,
$$\frac{1}{I}\sum_{i=1}^{I} R_i = \text{index return},$$
where $I$ is the total number of investors, and $R_i$ is the return of investor $i$. It implies that roughly 50% will outperform and another 50% will under-perform the index. If all investors are smart, half of them will still fail to predict the market, no matter what latest AI software they are using. Indeed, suppose they have learned the best mathematical model from machine learning given all the past data; then the same (or a similar) model will no longer work if all use it. The market will move in an unpredictable way (at least by that model), so that about half of the investors will lose. Unlike the weather or earthquakes, which we could all predict correctly if we were all smart, the market is different: to win, you need to beat the other 50% of the players! Beating the market is a zero-sum game.

Ross (2005) provides the first theoretical bound on the degree of predictability if asset pricing models are true. However, his bound is too wide. Zhou (2010) and Huang and Zhou (2017) derive much tighter and binding bounds.

9.4 Predictive regressions

9.4.1 Basic model

When running regressions of current economic variables on past ones, we obtain the so-called predictive regressions. This is the simplest set-up in most applications. For simplicity, consider running a regression of a single variable $y_t$ on a predictor variable $x_{t-1}$,
$$y_t = \alpha + \beta x_{t-1} + \epsilon_t, \qquad t = 1, 2, \ldots, T. \qquad (9.8)$$
For example, when $y_t$ is the return on a portfolio of common stocks and $x_t$ is a dividend yield, book-to-price ratio, or a function of interest rates, Fama and Schwert (1977), Rozeff (1984), Keim and Stambaugh (1986), Campbell (1987), Fama and French (1988), Kothari and Shanken (1997) and Pontiff and Schall (1998), among others, find that $\beta$ is significantly different from zero, i.e., there is predictability.
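A simulated illustration of the predictive regression (9.8) in Python; the persistence, slope, and noise level are made-up numbers, chosen so that the $R^2$ is small, as is typical for return predictability:

```python
import numpy as np

rng = np.random.default_rng(1)

T, alpha, beta = 600, 0.005, 0.05
x = np.zeros(T)
for t in range(1, T):                  # persistent AR(1) predictor, e.g. a dividend yield
    x[t] = 0.95 * x[t - 1] + 0.1 * rng.standard_normal()
y = alpha + beta * np.concatenate(([0.0], x[:-1])) + 0.08 * rng.standard_normal(T)

# OLS of y_t on a constant and x_{t-1}
X = np.column_stack([np.ones(T - 1), x[:-1]])
coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
resid = y[1:] - X @ coef
r2 = 1 - resid.var() / y[1:].var()
print(coef, r2)   # slope should be near the true 0.05; R^2 should be small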
Rapach, Strauss, and Zhou (2010, 2013) provide some of the most recent evidence on predictability, and Rapach and Zhou (2013, 2021) survey the literature.

It should be noted that the $R^2$ of the predictive regression is usually very small, typically less than 5%. This simply says that it is difficult to predict stock returns, or financial time series in general. Another point is that the OLS estimate of $\beta$ is in general biased, and has a sampling distribution that differs from that in the standard OLS regression. The reason is that the predictor $x_{t-1}$ is a time series which is usually correlated over time. Stambaugh (1999) discusses the associated econometric theory.

To understand the predictive regression better, it is useful to contrast it with an explanatory regression that regresses the current variable $y_t$ on the current variable $z_t$,
$$y_t = \alpha + \beta z_t + \epsilon_t. \qquad (9.9)$$
For example, the CAPM or market model regression uses the current excess return on the market to explain the excess return on a stock. Although this regression typically has a high $R^2$, say 80%, especially for large stocks, it has little use in forecasting the excess stock returns unless one can forecast the market.

9.4.2 Out-of-sample performance

How do we assess the degree of predictability? Traditionally, one examines the statistical significance of the slope coefficient or the regression $R^2$ in the previous regression by using all the data, that is, running the regression from the beginning to the end of the sample period. This is problematic and is subject to a look-ahead bias since it uses all the data to estimate the slope. As a result, the forecasts cannot be made and used in real time.
Recently, researchers focus more on the out-of-sample $R^2$ measure,
$$R^2_{OS} = 1 - \frac{\sum_{t=T_0}^{T}(r_t - \hat r_t)^2}{\sum_{t=T_0}^{T}(r_t - \bar r_t)^2}, \qquad (9.10)$$
where $\hat r_t$ is the forecasted return from the predictive regression estimated through period $t-1$ only, $\bar r_t$ is the historical average forecast estimated from the sample mean through period $t-1$, and $T_0$ is the first period the forecast is available (assuming that one needs $T_0 - 1$ data points for estimating the predictive regression). By definition, the out-of-sample $R^2$ uses the historical average forecast as the simple benchmark. Any model with predictability should do better than the sample mean $\bar r_t$.

In practice, the regression has to be run at each time $t$ recursively, using all available data (an expanding window) or using data going back a fixed length, say ten years (a rolling fixed window). A positive $R^2_{OS}$ indicates that the predictive regression forecast beats the simple historical average. Hence, $R^2_{OS} > 0$ implies predictability. A test of this hypothesis is discussed in the next subsection.

Welch and Goyal (2008) find in their comprehensive analysis that a large list of potential predictors from the literature is unable to deliver consistently superior out-of-sample forecasts of the U.S. equity premium relative to a simple forecast based on the historical average. The reason is that the regression model is not stable: the regression parameters change over time. So neither recursive nor rolling regressions can provide good out-of-sample forecasts. However, Rapach, Strauss, and Zhou (2013) and subsequent studies do find predictive power using either innovative methods or new predictors.

9.4.3 Statistical significance/tests

Computing both the in-sample $R^2$ and out-of-sample $R^2_{OS}$ is valuable, but their values are estimated with errors. So statistical tests that account for the errors must be used to assert predictability.
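The $R^2_{OS}$ of (9.10) can be computed recursively with an expanding window; a minimal Python sketch (the variable names and data layout are my own):

```python
import numpy as np

def r2_os(r, x, T0):
    """Out-of-sample R^2 of (9.10): predictive-regression forecasts versus the
    historical-average benchmark, both estimated with data through t-1 only.
    r[t] is the return at t; x[t-1] is the predictor used to forecast r[t]."""
    sse_model, sse_mean = 0.0, 0.0
    for t in range(T0, len(r)):
        X = np.column_stack([np.ones(t - 1), x[:t - 1]])  # pairs (r_s, x_{s-1}), s < t
        a, b = np.linalg.lstsq(X, r[1:t], rcond=None)[0]
        r_hat = a + b * x[t - 1]     # predictive-regression forecast of r_t
        r_bar = r[:t].mean()         # historical-average forecast of r_t
        sse_model += (r[t] - r_hat) ** 2
        sse_mean += (r[t] - r_bar) ** 2
    return 1.0 - sse_model / sse_mean
```

A positive value means the predictive regression beats the historical mean out-of-sample.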
For in-sample analysis, standard t-ratios or the Newey and West (1987) heteroskedasticity-adjusted standard error estimate can be used to assess the statistical significance of the slopes. Alternatively, one can use the confidence interval of the $R^2$, which is not analytically tractable for general distributions, but can be computed via bootstrap (see, e.g., Huang et al (2020, p. 783)).

For out-of-sample analysis, the hypothesis of interest is
$$H_0: R^2_{OS} \le 0 \quad \text{vs} \quad H_A: R^2_{OS} > 0. \qquad (9.11)$$
This is often tested by using Clark and West's (2007) test (see, e.g., Rapach et al (2010, p. 828)), which is an adjusted version of the Diebold and Mariano (1995) and West (1996) statistic, what they call the MSPE-adjusted statistic.

9.4.4 Economic significance

A strategy or proposition can be statistically significant, but may not have sizable economic value. In practice, it is the economic value that is of key interest. Hence, for a given degree of predictability, an important question is whether it can bring any significant economic value. How will an asset allocation benefit from predictability? Kandel and Stambaugh (1996) and Barberis (2000) are examples of this line of research, which find that there are economic gains from using predictability. Of course, the degree of significance will vary from application to application.

Consider a mean-variance investor who allocates between the stock market and the money market, and ask how the investor can benefit from the forecast. Recall from our earlier portfolio theory that the investor's optimal portfolio weight at $t$ is
$$w_t = \frac{1}{\gamma}\,\frac{\hat r_{t+1}}{\hat\sigma^2_{t+1}}, \qquad (9.12)$$
where $\gamma$ is the investor's coefficient of relative risk aversion, $\hat r_{t+1}$ is a predicted excess return (our forecast), and $\hat\sigma^2_{t+1}$ is a forecast of the excess return variance.
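Returning to the out-of-sample test (9.11) described above: the Clark-West MSPE-adjusted statistic is simple to compute once the two forecast series are in hand. A minimal sketch (the array layout is my own assumption):

```python
import numpy as np

def clark_west(r, r_hat, r_bar):
    """Clark-West (2007) MSPE-adjusted statistic for H0: R2_OS <= 0.
    r, r_hat, r_bar are realized returns, model forecasts, and benchmark
    (historical-mean) forecasts over the out-of-sample period."""
    f = (r - r_bar) ** 2 - ((r - r_hat) ** 2 - (r_bar - r_hat) ** 2)
    t_stat = f.mean() / (f.std(ddof=1) / np.sqrt(len(f)))
    return t_stat  # compare with one-sided normal critical values, e.g. 1.645 at 5%
```

The adjustment term $(\bar r_t - \hat r_t)^2$ corrects for the fact that the larger model's forecast error variance is inflated by parameter estimation under the null.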
In practice, we may restrict $w_t$ to lie between $-0.5$ and $1.5$, which imposes realistic portfolio constraints and produces better-behaved portfolio weights given the well-known sensitivity of mean-variance optimal weights to return forecasts. The investor's realized utility or certainty equivalent return is computed from
$$CER = \bar R_p - \frac{\gamma}{2}\,\bar\sigma^2_p, \qquad (9.13)$$
where $\bar R_p$ and $\bar\sigma^2_p$ are the mean and variance, respectively, of the portfolio return over the forecast evaluation period. The CER is the risk-free rate of return that an investor would be willing to accept in lieu of holding the risky portfolio. The utility gain of using the forecast is then
$$\text{Gain} = CER - CER_0, \qquad (9.14)$$
where $CER_0$ is computed in the same way as $CER$ except that the excess return forecast is replaced by the no-predictability constant estimated from the historical average, i.e., the default expected excess return without using any predictors. The utility gain is thus the difference in CER between the investor who uses the predictive regression forecast to guide asset allocation and the case in which she uses the prevailing mean benchmark forecast. We usually annualize the CER gain so that it can be interpreted as the annual portfolio management fee that the investor would be willing to pay to have access to the predictive regression forecast in place of the prevailing historical mean forecast. This is a common measure of the economic value of return predictability.

Campbell and Thompson (2008) show that out-of-sample forecasts can be improved to beat the historical average both statistically and economically, once the usual predictive regressions are modified by placing some theoretically motivated restrictions on the coefficients. For example, we can require the regression slope on inflation to be negative. If we get a positive estimate, we simply set it to zero.
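The asset-allocation exercise of (9.12)-(9.14) can be sketched in Python; the choice $\gamma = 3$, the monthly frequency, and the annualization by 12 are illustrative assumptions, with the weight clipped to $[-0.5, 1.5]$ as above:

```python
import numpy as np

def cer(rp, gamma=3.0):
    """Certainty equivalent return (9.13) of a realized portfolio return series."""
    return rp.mean() - 0.5 * gamma * rp.var()

def cer_gain(r, r_hat, r_bar, sig2_hat, gamma=3.0):
    """Annualized utility gain (9.14).  r: realized monthly excess returns;
    r_hat: model forecasts; r_bar: historical-mean forecasts; sig2_hat:
    variance forecasts, all aligned over the evaluation period."""
    w_model = np.clip(r_hat / (gamma * sig2_hat), -0.5, 1.5)  # (9.12), clipped
    w_bench = np.clip(r_bar / (gamma * sig2_hat), -0.5, 1.5)
    return 12 * (cer(w_model * r, gamma) - cer(w_bench * r, gamma))
```

A positive annualized gain is read as the management fee the investor would pay for the predictive forecast.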
They show further that, in terms of out-of-sample analysis, an $R^2_{OS}$ of 0.5% for monthly data typically implies economically significant utility gains. Unfortunately, though, their approach works only for a few of the economic variables. Back at the beginning of the forecasting period, investors have no way of knowing which few out of the many can have good out-of-sample performance in the future. Hence, their study does not provide convincing evidence on out-of-sample predictability of the market. Rapach, Strauss, and Zhou (2010) seem to be the first to provide consistent out-of-sample evidence of predictability, by applying the forecast combination method to all of the predictors. In addition, they show that predictability is concentrated in recessions. Henkel, Martin, and Nardari (2011) have a similar finding, and Cujean and Hasler (2017) explain it theoretically.

9.5 Forecasting with many predictors

The predictive regression works well for a few predictors. Standard time series models, such as AR and ARMA models (see, e.g., Tsay, 2010, Box, Jenkins, Reinsel and Ljung, 2016, and Brockwell and Davis, 2016), may improve the performance in this low-dimensional case. But none of these methods appear effective for a large number of predictors, the high-dimensional case. The latter is of great interest in finance. The reason is that the signal-to-noise ratio of a single predictor is often very low in predicting asset returns, and the hypothesis is that, with many low-signal predictors, the predictability can be improved by incorporating all the information. This requires innovative methods. We discuss four methods in this section, while leaving shrinkage approaches, such as the LASSO, which are part of the popular machine learning methods, to the next chapter.

9.5.1 Forecast combination

Forecast combination is perhaps the simplest forecasting approach in the presence of many predictors.
To see how it works, suppose we have 20 predictors. A standard regression forecast is to run a regression on all 20 of them,
$$y_t = \alpha + \beta_1 x_{t-1,1} + \cdots + \beta_{20} x_{t-1,20} + \epsilon_t. \qquad (9.15)$$
In practice, the sample size $T$ is often small, and so the above regression can usually fit well in-sample, but perform poorly out-of-sample (the over-fitting problem). Forecast combination is an effective solution for many practical problems. Instead of running the regression on all the predictors, we run it on each one of them at a time, for 20 times,
$$y_t = \alpha_j + \beta_j x_{t-1,j} + \epsilon_t, \qquad j = 1, \ldots, 20. \qquad (9.16)$$
We obtain the coefficient estimates $\hat\alpha_j$ and $\hat\beta_j$, and then 20 forecasts based on each of the individual regressions,
$$\hat y_{t+1,j} = \hat\alpha_j + \hat\beta_j x_{t,j}, \qquad j = 1, \ldots, 20.$$
The (final) forecast by the forecast combination method is
$$\hat y_{t+1,comb} = \frac{\hat y_{t+1,1} + \hat y_{t+1,2} + \cdots + \hat y_{t+1,20}}{20}, \qquad (9.17)$$
which is an average of the individual forecasts. Statistically, this provides diversification and shrinkage, and is robust to distributional assumptions. As it turns out, the average forecast is an excellent one in practice. Bates and Granger (1969) and others are early studies using this approach. Rapach, Strauss and Zhou (2010) show that it can be effectively used to predict the stock market. One can also consider using different weights on the individual forecasts. However, the simple average (equal-weighting) tends to work well in most applications. Rapach, Strauss and Zhou (2010) show that it can provide consistent statistically and economically significant out-of-sample gains.

Mathematically, if the individual forecasts are unbiased, so is the average, and it will have smaller variance. This is similar to a standard portfolio selection problem: a portfolio of independent stocks will generally have smaller risk. However, if bad forecasts are included in the average, the average forecast will clearly not be good.
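A minimal Python sketch of (9.16)-(9.17): fit one univariate predictive regression per predictor and average the one-step-ahead forecasts (the data layout is my own assumption):

```python
import numpy as np

def combination_forecast(y, X):
    """Equal-weighted forecast combination: for each predictor j, fit
    y_t = a_j + b_j x_{t-1,j} by OLS, forecast the next period from the
    latest observation X[-1, j], and average the individual forecasts (9.17)."""
    T, n = X.shape
    forecasts = []
    for j in range(n):
        Z = np.column_stack([np.ones(T - 1), X[:-1, j]])  # x_{t-1,j} paired with y_t
        a, b = np.linalg.lstsq(Z, y[1:], rcond=None)[0]
        forecasts.append(a + b * X[-1, j])
    return float(np.mean(forecasts))
```

With a single noiseless predictor, the combined forecast reduces to the usual univariate regression forecast.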
Hence, the implicit assumption of using the average forecast is that all the individual forecasts are reasonably good, in which case their average improves on them. Later we introduce a method, C-LASSO, that is designed to improve the average forecast by selecting the good forecasts out of many (which may contain bad ones), and then averaging over the selected ones.

9.5.2 PCA or PCR

When there are many predictors, principal components analysis (PCA) is also a popular approach. It extracts a few components out of the many predictors and then forms the forecasts based on the few composite predictors (linear combinations of the original ones). PCA reduces the dimensionality from many to a few. Since the PCA is used here in a regression context, the approach is also known as principal components regression (PCR).

Consider now a regression on $n$ predictors ($n$ is large),
$$y_t = \alpha + \beta_1 x_{t-1,1} + \cdots + \beta_n x_{t-1,n} + \epsilon_t. \qquad (9.18)$$
Suppose $n = 50$ and we want to find only one predictor out of the 50. Which one should we choose? The PCA on all the predictors $x_{t,k}$ suggests that the first principal component is the dominating variable. Hence, it is a logical choice as the sole predictor,
$$f_t = \psi_1 x_{t,1} + \cdots + \psi_{50} x_{t,50} = \psi' x_t, \qquad (9.19)$$
where $\psi$ is the 50-vector given by the first eigenvector, corresponding to the largest eigenvalue, of $X'X/T$, where $X$ is the $T \times 50$ matrix of data on the predictors, de-meaned (the sample mean subtracted so that each column has zero mean). Then, instead of running a regression on 50 predictors as in (9.18), we run a regression on only one predictor,
$$y_t = \gamma_0 + \gamma f_{t-1} + e_t, \qquad (9.20)$$
which is more stable and often provides better out-of-sample forecasts than (9.18).

Example 9.1 As demonstrated in class, the PCA sentiment index is
$$S_t = 0.90 x_1 + 0.72 x_2 + 0.70 x_3 + 0.71 x_4 + 0.14 x_5,$$
where the loadings/coefficients are the first eigenvector of the sample covariance matrix of the de-meaned 5 sentiment proxies $x_i$. Since they are computed using all the data, it is an in-sample result.
In practice, we need a training sample period, say 120 months, to compute the loadings at month 120, and then re-compute them at month 121 by adding the data in month 121. So, for the realistic out-of-sample index, the loadings are estimated each month and hence vary over time. ♠

Of course, the PCA usually suggests $K$ important principal components. Then we can in general choose $K$ predictors, each of which is a linear combination of the original predictors. Mathematically, the predictors $x = (x_1, \ldots, x_n)'$ are equivalent to their $n$ principal components (linear orthogonal transformations),
$$\begin{pmatrix} f_1 \\ \vdots \\ f_n \end{pmatrix} = \begin{pmatrix} \Psi_1' \\ \vdots \\ \Psi_n' \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \Psi' x,$$
where $\Psi_j$ is the $j$-th eigenvector of $X'X/T$ for $j = 1, \ldots, n$. Then (9.18) can be written equivalently as
$$y_t = \alpha + \gamma_1 f_{t-1,1} + \cdots + \gamma_n f_{t-1,n} + \epsilon_t. \qquad (9.21)$$
Note that the $n$ new PCA predictors are uncorrelated and are likely dominated, say, by the first $K$ of them (while the rest have negligible variances). If we drop the terms after $f_K$, we obtain the forecast based on only $K$ factors,
$$y_t = \alpha + \gamma_1 f_{t-1,1} + \cdots + \gamma_K f_{t-1,K} + e_t, \qquad (9.22)$$
which is the PCR in the general case. In matrix form,
$$Y = \alpha 1_T + X\beta + \epsilon, \qquad (9.23)$$
where $Y$ is the $T$-vector of observations on the dependent variable, $1_T$ is the $T$-vector of ones, and $X$ is the $T \times n$ matrix of data on the predictors. The PCA or PCR simply replaces the $T \times n$ data matrix $X$ by a $T \times K$ matrix of data, i.e., $K$ linear combinations of the original data, known as factors,
$$F = X\Phi, \qquad T \times K, \qquad (9.24)$$
where $\Phi$ is the $n \times K$ matrix of eigenvectors corresponding to the largest $K$ eigenvalues of the sample covariance matrix of the predictors, $X'X/T$ with the data de-meaned. In vector form, let $F_t$ be the $K \times 1$ vector of factors at $t$; then $F_t = \Phi' X_t$, where $X_t$ is the $n \times 1$ vector of predictors at $t$. Then, instead of running the large regression (9.23), we run
$$Y = \alpha 1_T + F\Lambda + \epsilon, \qquad (9.25)$$
on $K < n$ variables, where the loading vector $\Lambda$ is $K \times 1$.
Since $K$ can be much smaller than $n$ in practice, the PCA/PCR reduces the dimensionality substantially and can often do much better in forecasting than using too many predictors in the regression directly.

Stock and Watson (2002) provide a rigorous justification of the above procedure. Assume the data-generating process is
$$y_t = \alpha + F_t'\Lambda + \epsilon_t, \qquad (9.26)$$
$$X_t = \beta_F F_t + e_t, \qquad (9.27)$$
where $F_t$ is a $K$-vector of latent factors and $\beta_F$ is $n \times K$. The first equation says that $y_t$ is related to $K$ latent factors, and the second says that the $n$ predictors are also related to $F_t$. In other words, one can interpret $F_t$ as the true but latent predictors, and the second equation simply states how they are related to the observed ones. The above model is known as a factor forecasting model, which is popular in macroeconomics, where one can forecast GDP with many predictors that are related to the $K$ driving factors. Given the second equation, we can solve for the latent factors and their loadings by minimizing the model mean-squared error,
$$\min_{\beta_F, F} \frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T}(x_{it} - \beta_{F,i}' F_t)^2.$$
Mathematically, Stock and Watson (2002) show that, up to a rotation, the factors are given as earlier, and as $n$ becomes large, the PCA estimator of $F$ converges to the true but unobserved $K$ predictors.

The Python code makes this easy to implement:

import numpy as np

from sklearn.preprocessing import scale
# scale standardizes the data to have zero mean and unit variance

from sklearn.decomposition import PCA

pca = PCA()

X_new = pca.fit_transform(scale(X))

loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

Note that while the PCA is invariant to any orthogonal transformation of the data, it is sensitive to scaling. In practice, the PCA is usually applied to scaled data with zero mean and unit variance.
Then the output X_new is the transformed data, i.e., the $F$ above with $K = n$, and the loadings are the eigenvectors multiplied by the square roots of the eigenvalues. We may use only the first few columns of X_new as our predictors, determining the optimal $K$ by examining the eigenvalues, as in the PCA analysis, or by cross-validation (see Section 10.6).

9.5.3 sPCA

Huang, Jiang, Li, Tong and Zhou (2020) propose a scaled PCA (sPCA) approach to improve on the PCA. The idea is that the PCA does not use any information about what is to be predicted, and, as a result, the principal components can contain noise that is unrelated to the target. To mitigate this problem, we run a regression on each one of the predictors, as we do for the combination approach,
$$y_t = \alpha_j + \beta_j x_{t-1,j} + \epsilon_t, \qquad j = 1, \ldots, n. \qquad (9.28)$$
Assume the predictors are standardized. The usual PCA uses
$$(x_1, x_2, \ldots, x_n)$$
to find the principal components. In contrast, the sPCA uses the scaled predictors,
$$(\hat\beta_1 x_1, \hat\beta_2 x_2, \ldots, \hat\beta_n x_n).$$
So sPCA weights each predictor by its relevance to the target. The more useful ones receive greater weights, and the less useful ones receive smaller weights. In the language of machine learning, PCA is unsupervised learning that does not use any information on the forecasting target. In contrast, sPCA is supervised: it uses the relevant information to over- or under-weight the predictors. As a result, it is not surprising that sPCA typically outperforms PCA in practice.

9.5.4 Partial least squares

The partial least squares (PLS) method, pioneered by Wold (1966, 1975) and extended by Kelly and Pruitt (2013, 2014), provides another way, besides the PCA, to extract a few predictors as linear combinations of many original predictors. As it turns out, it is particularly useful in many finance applications. Following Hastie (2018, p.
81), the idea is similar to the PCA in that we want to replace the original forecasting equation by one using K << n predictors,

y_t = α + γ_1 z_{t−1,1} + ... + γ_K z_{t−1,K} + e_t,  (9.29)

where each z_k is a linear combination of (x_1, ..., x_n), and the z_k's are uncorrelated. Consider, for example, how we obtain the first PLS predictor,

z_1 = φ_{1,1} x_1 + ... + φ_{1,n} x_n,  (9.30)

where φ_{1,1}, ..., φ_{1,n} are linear combination coefficients to be determined. Unlike the PCA, we now want to use information in y_t. The simplest and most intuitive way is to weight each x_j by its correlation with y_t: the greater the correlation, the more important the predictor is for forecasting. Since the predictors are often standardized to have zero mean and unit variance in PLS, we can let

φ_{1,j} = cov(x_j, y),  (9.31)

which is easily estimated by the sample covariance between x_{t−1,j} and y_t. Then z_1 can be computed, and the one-factor PLS regression is

y_t = α + γ_1 z_{t−1,1} + e_t,  (9.32)

which is easily run. However, it should be noted that, for out-of-sample forecasting, this regression has to be run recursively and the predictors have to be standardized at each time t. In other words, one has to make sure that no future information is used at time t.

Interestingly, there is a simple link between the PLS and the average forecast in the one-factor case. Since the predictors are standardized, φ_{1,i} must be the slope of the regression of y on x_i, and so the forecast based on x_i is ŷ_{it} = ȳ + φ_{1,i} x_{t−1,i}, where ȳ is the sample mean up to time t. Hence, the average forecast is ŷ_t^av = ȳ + z_{t−1,1}/n, i.e., ȳ plus z_1 up to a scaling. Moreover, the PLS regression (9.32) can be written as

y_t = (α − γ_1 ȳ) + γ_1 (ȳ + z_{t−1,1}) + e_t,  (9.33)

so the PLS can be interpreted as a blend of the average (combination) forecast with the sample mean (the intercept term is a function of the sample mean, since α = E(y) when the predictor is standardized).
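The one-factor construction above can be sketched in a few lines of NumPy. This is a minimal in-sample illustration with simulated data (all names are illustrative); in a real out-of-sample exercise, the standardization and covariances must be recomputed recursively using only data up to each time t:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 200, 20
X = rng.standard_normal((T, n))                       # predictors x_{t-1,j}
y = X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(T)  # target y_t

Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each predictor
phi1 = Xs.T @ (y - y.mean()) / T            # phi_{1,j} = sample cov(x_j, y), eq. (9.31)
z1 = Xs @ phi1                              # first PLS predictor z_1, eq. (9.30)

# one-factor PLS regression, eq. (9.32): y_t = alpha + gamma_1 z_{t-1,1} + e_t
gamma1 = np.cov(z1, y)[0, 1] / np.var(z1, ddof=1)
alpha = y.mean() - gamma1 * z1.mean()
y_hat = alpha + gamma1 * z1                 # one-factor PLS fit
```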
The link to the average forecast is also derived by Lin, Wu and Zhou (2018) as a special case of an iterated combination method. Let r̂^MC_{t+1} be the standard mean combination forecast. Consider re-combining it with the historical average forecast,

r_{t+1} = (1 − δ) r̄ + δ r̂^MC_{t+1} + u_{t+1},  (9.34)

where u_{t+1} is the noise. Mathematically, our objective is to solve the following optimization problem,

min_δ E_t(r_{t+1} − r̂_{t+1})² = E_t[r_{t+1} − (1 − δ) r̄ − δ r̂^MC_{t+1}]².  (9.35)

The special case of δ = 0 implies that r̂^MC_{t+1} has no information whatsoever; when δ = 1, it is unnecessary to use information about r̄ to improve r̂^MC_{t+1}. Theoretically, there exists a δ that makes the iterated combination better than either r̄ or r̂^MC_{t+1}. The optimal δ is easily solved from the first-order condition of the objective function,

δ* = cov_t(r_{t+1} − r̄, r̂^MC_{t+1} − r̄) / var_t(r̂^MC_{t+1} − r̄).  (9.36)

In Lin, Wu and Zhou's (2018) applications, δ* is generally greater than 1. Mathematically, the iterated combination starting from the average forecast is equivalent to the PLS in the one-factor case.

Huang et al. (2015) provide an empirical application of the PLS to extract an investor sentiment index that is relevant for forecasting the stock market. Assume that the true sentiment S_t is unobservable, though it is related to the stock return in the standard predictive regression,

R_{t+1} = α + β S_t + ε_{t+1},  (9.37)

where ε_{t+1} is the residual, unforecastable and unrelated to S_t. Let x_t = (x_{1,t}, ..., x_{N,t})' denote an N × 1 vector of individual investor sentiment proxies at period t, such as the closed-end fund discount rate, share turnover, and the number of initial public offerings (IPOs).
Assume that the proxies are related to the true sentiment by

x_{i,t} = η_{i,0} + η_{i,1} S_t + η_{i,2} E_t + e_{i,t},  i = 1, ..., N,  (9.38)

where η_{i,1} is the factor loading that summarizes the sensitivity of sentiment proxy x_{i,t} to movements in S_t, E_t is the common approximation error component of all the proxies that is irrelevant to returns, and e_{i,t} is the idiosyncratic noise associated with measure i only. The PLS extracts S_t from the above equations based on data on the proxies. Mathematically, the T × 1 vector of the estimated investor sentiment index, S^PLS = (S_1^PLS, ..., S_T^PLS)', can be computed from

S^PLS = X J_N X' J_T R (R' J_T X J_N X' J_T R)^{−1} R' J_T R,  (9.39)

where X = (x_1, ..., x_T)' denotes the T × N matrix of the standardized (each column has zero mean and unit variance) individual investor sentiment measures, R = (R_2, ..., R_{T+1})' is a T × 1 vector of excess stock returns, and J_T = I_T − (1/T) ι_T ι_T' and J_N = I_N − (1/N) ι_N ι_N'. Mathematically, this is the same factor as we obtained earlier. Huang et al. (2015) show that the PLS index works much better than the popular PCA investor sentiment index of Baker and Wurgler (2006).

How do we get the second PLS factor and so on? With z_1, our forecast is y^(1) = y^(0) + γ_1 z_1, where y^(0) = ȳ and γ_1 is the regression slope on z_1. Let

x_j^(1) = x_j − (cov(z_1, x_j) / var(z_1)) z_1,

i.e., x^(1) is the previous predictor after removing the effects of z_1 (so that the newly extracted predictor z_2 will be uncorrelated with z_1). Then we compute, similarly to z_1,

z_2 = φ_{2,1} x_1^(1) + ... + φ_{2,n} x_n^(1),  (9.40)

with

φ_{2,j} = cov(x_j^(1), y − y^(1)),  (9.41)

which equals cov(x_j^(1), y), since x_j^(1) is uncorrelated with z_1. Then we obtain a new forecast

y^(2) = y^(1) + γ_2 z_2.  (9.42)

We keep iterating until step K + 1, when y^(K+1) makes little difference from y^(K); then we use K factors, and the final forecast is clearly a function of z_1, ..., z_K. Theoretically, Helland and Almøy (1994) provide an asymptotic theory for the PLS with n fixed while T goes to infinity, while most later theories require both to be large.
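Returning to the closed-form expression (9.39), it is straightforward to compute with matrix operations. A sketch with simulated proxies (the variable names and dimensions are illustrative, not from Huang et al.'s data):

```python
import numpy as np

rng = np.random.default_rng(2)
T, N = 200, 6
X = rng.standard_normal((T, N))   # T x N standardized sentiment proxies
R = rng.standard_normal((T, 1))   # T x 1 excess returns (R_2, ..., R_{T+1})

J_T = np.eye(T) - np.ones((T, T)) / T   # demeaning matrix for the time dimension
J_N = np.eye(N) - np.ones((N, N)) / N   # demeaning matrix for the proxy dimension

A = X @ J_N @ X.T @ J_T @ R                                 # T x 1 building block
S_pls = A @ np.linalg.inv(R.T @ J_T @ A) @ (R.T @ J_T @ R)  # eq. (9.39)
```

Note that R' J_T X J_N X' J_T R is a scalar here, so the "inverse" is just a division.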
Cook and Forzani (2019) provide some of the latest analysis, while Cook and Forzani (2021) provide a nonlinear extension of the PLS. Kelly and Pruitt (2013, 2014) extend the PLS and provide an asymptotic theory in which both n and T go to infinity.

9.5.5 PLS: m > 1

Previously, we had only one target variable to forecast. When the target Y is multivariate, the PLS algorithm is more complex. There are various modifications, but the popular and primary ones are the original NIPALS (Wold, 1975) and the later SIMPLS (de Jong, 1993); both coincide in the one-dimensional case. To motivate, we may want to use the same set of variables to forecast m stock returns simultaneously. For example, when m = 2, we have

y_{1t} = α_1 + β_{11} x_{t−1,1} + ... + β_{1n} x_{t−1,n} + ε_{1t},  (9.43)
y_{2t} = α_2 + β_{21} x_{t−1,1} + ... + β_{2n} x_{t−1,n} + ε_{2t},  (9.44)

that is, we want to use the same x_j's to predict two targets, y_1 and y_2. In general, we can write the problem in matrix form,

Y = Xβ + ε,  (9.45)

where Y is a T × m matrix of observations on the m dependent variables, X is T × n as before, and β is the n × m matrix of regression coefficients. Note that the intercepts are zero in the above equation because, following common practice, we assume that both Y and X are de-meaned. Our objective is still to seek a lower-dimensional matrix F, T × K, to run

Y = FΛ + e,  (9.46)

to obtain a stable out-of-sample forecast. How do we reduce the high-dimensional X to the low-dimensional F? The PLS algorithm essentially makes the following decomposition of the data,

X = VW + E_1,  (9.47)
Y = UQ + E_2,  (9.48)

where V and U are T × K (known as scores or factors), W is K × n and Q is K × m (known as orthogonal loading matrices), and E_1 and E_2 are errors. The factors are chosen so that the correlation between V and U is maximized. Consider how to obtain the first PLS factor/component.
The original algorithm is difficult to understand (though it may be computationally efficient), so here we follow the eigenvalue approach (see, e.g., Ng, 2013) to convey the ideas. Let w_1 and q_1 be n- and m-vectors; we want to find them to maximize

f(w_1, q_1) = corr(x'w_1, y'q_1),

where x is the n × 1 vector of predictors and y is the m × 1 vector of dependent variables. The solutions w_1 and q_1 are not unique unless we normalize them, say, to 1,

||w_1||² = w_1'w_1 = 1,  ||q_1||² = q_1'q_1 = 1.

Then w_1 and q_1 are the standardized first eigenvectors (corresponding to the largest eigenvalue) of two matrices,

X'Y Y'X w_1 = λ w_1,  (9.49)
Y'X X'Y q_1 = γ q_1.  (9.50)

It is then clear that V_1 = X w_1 and U_1 = Y q_1, which compose the first terms in decompositions (9.47) and (9.48). If we care about only one PLS factor, V_1 can serve the purpose. To get the second factor, we update X and Y with

X := X − V_1 w_1',  (9.51)
Y := Y − U_1 q_1',  (9.52)

to remove the effects of the first factor, and then repeat the same process to obtain the second factor. We can continue the same process to get the remaining factors until V_K stops changing in value. The Python code that implements the PLS is as simple as that for the PCR:

from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=K)  # K is the number of PLS factors, say K = 2
pls.fit(X, Y)
X_new = pls.transform(X)
Y_pred = pls.predict(X)

The transformed X_new is what we use to forecast all the y's, and the last output is the forecast. To make the code work in practice for out-of-sample forecasting, we need to run the above recursively over time, or train the model in a training period and then use the parameters for the future test period without re-estimation. Theoretically, it is of interest to see how decompositions (9.47) and (9.48) work in the one-target case (m = 1). In this case, Y'X X'Y is a scalar.
Let w_1 = c X'Y, where c is the normalization constant that makes ||w_1|| = 1; then equation (9.49) becomes

X'Y (Y'X X'Y) c = λ c X'Y,  (9.53)

so λ = Y'X X'Y. In other words, X'Y is the loading (up to a scale), which is exactly the slope vector of the combination forecast. Adding back the mean in the regression provides the same PLS factor as before. Note also that the PCA factor is simply (9.49) without the Y,

X'X w_1 = λ w_1,

so w_1 is the eigenvector of X'X, which is the same as that of X'X/T since the scaling does not affect the eigenvector, and the first PCA factor is X w_1. The second PCA factor is simply the one replacing w_1 by w_2, the second eigenvector. In contrast, the second PLS factor is more difficult to obtain, as one has to run the entire process all over again on the updated X and Y.

9.6 Common time-series predictors

There are many time-series predictors that researchers have used to predict the stock market or individual stock returns over time. Here we focus on some of the major ones that are used to predict the market or major indices/sectors. There are even more predictors that are used for cross-section predictions of individual assets or asset classes (see Chapter 11).

9.6.1 Macroeconomic variables

The following 15 well-known macroeconomic predictors are used by Welch and Goyal (2008) and many others:

1. Dividend-price ratio (log), D/P: Difference between the log of dividends paid on the S&P 500 index and the log of stock prices (S&P 500 index), where dividends are measured using a one-year moving sum.

2. Dividend yield (log), D/Y: Difference between the log of dividends and the log of lagged stock prices.

3. Earnings-price ratio (log), E/P: Difference between the log of earnings on the S&P 500 index and the log of stock prices, where earnings are measured using a one-year moving sum.

4. Dividend-payout ratio (log), D/E: Difference between the log of dividends and the log of earnings.

5.
Stock variance, SVAR: Sum of squared daily returns on the S&P 500 index.

6. Book-to-market ratio, B/M: Ratio of book value to market value for the Dow Jones Industrial Average.

7. Net equity expansion, NTIS: Ratio of twelve-month moving sums of net issues by NYSE-listed stocks to the total end-of-year market capitalization of NYSE stocks.

8. Treasury bill rate, TBL: Interest rate on a 3-month Treasury bill (secondary market).

9. Long-term yield, LTY: Long-term government bond yield.

10. Long-term return, LTR: Return on long-term government bonds.

11. Term spread, TMS: Difference between the long-term yield and the Treasury bill rate.

12. Default yield spread, DFY: Difference between BAA- and AAA-rated corporate bond yields.

13. Default return spread, DFR: Difference between long-term corporate bond and long-term government bond returns.

14. Inflation, INFL: Calculated from the CPI (all urban consumers); following Welch and Goyal (2008), since inflation rate data are released in the following month, we need to use suitable lags for inflation.

15. Investment-to-capital ratio, I/K: Ratio of aggregate (private nonresidential fixed) investment to aggregate capital for the entire economy.

9.6.2 Technical variables

Technical indicators, such as moving averages of prices, have been widely used by practitioners who use past price and volume patterns to identify price trends believed to persist into the future. Neely, Rapach, Tu and Zhou (2014) examine 14 technical indicators in three categories and find that they are as important as macroeconomic variables. Moreover, they are complementary to the predictive power of macroeconomic variables, so the use of both can improve the predictability of the market substantially.
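As an illustration, a typical moving-average indicator compares a short and a long moving average of the price index and signals "buy" when the short one is at or above the long one. A minimal sketch (the window lengths 3 and 12 and the simulated price series are illustrative; Neely et al. (2014) consider several such window pairs):

```python
import numpy as np

rng = np.random.default_rng(3)
# simulated monthly price index over 10 years
price = 100 * np.cumprod(1 + 0.01 * rng.standard_normal(120))

def ma_signal(p, short=3, long=12):
    """Return 1 (buy) if the short MA is at or above the long MA, else 0 (sell)."""
    ma_s = np.convolve(p, np.ones(short) / short, mode="valid")[-1]  # avg of last `short` obs
    ma_l = np.convolve(p, np.ones(long) / long, mode="valid")[-1]    # avg of last `long` obs
    return int(ma_s >= ma_l)

signal = ma_signal(price)  # 0 or 1 for the most recent month
```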
9.6.3 Investor sentiment

Baker and Wurgler (2006) propose 6 proxies for investor sentiment and use them to explain returns on small stocks, young stocks, high volatility stocks, unprofitable stocks, non-dividend-paying stocks, extreme growth stocks, and distressed stocks. However, their sentiment index (the first principal component of the proxies) does not predict the market. Huang, Jiang, Tu and Zhou (2015) construct a sentiment index using partial least squares (PLS) instead of PCA, and find that the resulting index is a powerful predictor of the stock market.

Jiang, Lee, Martin and Zhou (2019) construct a manager sentiment index, extending the scope of investor sentiment, based on the aggregated textual tone of conference calls and financial statements, and find that it negatively predicts future aggregate earnings and cross-sectional stock returns, particularly for firms that are either hard to value or difficult to arbitrage. In addition, Chen, Tang, Yao, and Zhou (2021) propose an employee sentiment index and find that it has negative predictive power for the stock market.

Edmans, Fernandez-Perez, Garel and Indriawan (2021) recently propose a real-time, continuous measure of national sentiment based on the positivity of the songs that individuals choose to listen to. The music sentiment is language-free and thus comparable globally. They find that it is positively correlated with same-week stock market returns and negatively correlated with next-week returns. This is consistent with the notion that sentiment-induced mispricing will eventually be corrected by the market. On sentiment in general, Zhou (2018) provides a review of the literature.

9.6.4 Investor attention

Chen, Tang, Yao and Zhou (2020) propose an investor attention index based on 12 individual attention proxies in the literature, and find that it has significant power in predicting the stock market risk premium, both in-sample and out-of-sample.
Moreover, the index can deliver sizable economic gains for mean-variance investors in asset allocation. They explain that the predictive power of investor attention primarily stems from the reversal of temporary price pressure and from the stronger forecasting ability for high-variance stocks.

9.6.5 Short interest

The finance literature largely agrees that short sellers are informed traders who earn excess returns in compensation for processing firm-specific information (see, e.g., Boehmer, Jones, and Zhang, 2008). Rapach, Ringgenberg and Zhou (2016) construct a short interest index and find that it is a strong predictor of aggregate stock returns, outperforming a host of popular return predictors from the literature in both in-sample and out-of-sample tests. They further find that the predictability of the short sellers is due to their informed anticipation of future aggregate cash flows. The information content of short selling thus appears more economically important than previously thought.

Recently, Chen, Da and Huang (2021) propose a measure of short selling efficiency (SSE), defined as the slope coefficient of cross-sectionally regressing abnormal short interest on a mispricing score. They find that SSE significantly and negatively predicts stock market returns both in-sample and out-of-sample, suggesting that mispricing gets corrected after short sales are executed on the right stocks. They also show conceptually and empirically that SSE has favorable predictive ability over aggregate short interest, as SSE reduces the effect of noise in short interest and better captures the amount of aggregate short selling capital devoted to overpricing.
9.6.6 Corporate activities

While all the above predictors, except manager sentiment, have little to do with what firms are doing, Lie, Meng, Qian and Zhou (2017) focus on an aggregate index of corporate activities and find that it has substantially greater predictive power both in- and out-of-sample, and yields a much greater economic gain for a mean-variance investor than the macroeconomic predictors. The predictive ability of the corporate index stems from its information content about future cash flows. Cross-sectionally, the corporate index performs particularly well for stocks with great information asymmetry. The corporate activities cover five major categories of corporate or managerial activities: aggregate security issues, share repurchases, corporate investments, merger activity and payments, and insider trading, with 13 measures:

• Percentage of stock payment, COMPCT: the aggregate amount of stock payment divided by the sum of the aggregate amounts of stock payment and cash payment (in percentage points);

• Total stock payment (log), COM: the natural log of the aggregate amount of stock payment (the dollar amounts, in millions, are deflated to 1986 dollars);
• Net Transactions, NT: the aggregate number of open market purchases minus the aggregate number of open market sales (in thousands);

• Net Dollar Amount, NDA: the aggregate amount of open market purchases minus the aggregate amount of open market sales (the dollar amounts, in billions, are deflated to 1986 dollars);

• Ratio of Net Purchases, RT: the aggregate number of open market purchases divided by the sum of the aggregate numbers of open market purchases and open market sales (in percentage points);

• Ratio of Net Purchasing Dollar Amount, RDA: the aggregate amount of open market purchases divided by the sum of the aggregate amounts of open market purchases and open market sales (in percentage points);

• CAPX scaled by ME, CAPXME: aggregate capital expenditures scaled by total market capitalization (in percentage points);

• CAPX scaled by AT, CAPXAT: aggregate capital expenditures scaled by average total assets (in percentage points);

• Change in net operating assets scaled by ME, ALME: the change in net operating assets plus R&D scaled by total market capitalization (in percentage points);

• Change in net operating assets scaled by AT, ALAT: the change in net operating assets plus R&D scaled by average total assets (in percentage points);

• Total Equity Issuance (log), E: the natural log of equity issuance (the dollar amounts, in millions, are deflated to 1986 dollars);

• Ratio of Equity Issuance, S: equity issuance scaled by the sum of equity and debt issuance (in percentage points);

• Aggregate share repurchases (log), REP: the natural log of aggregate share repurchases (in millions of 1986 dollars).

9.6.7 Option market

Bollerslev, Tauchen, and Zhou (2009) show that the difference between implied and realized variance, or the variance risk premium, can predict the market. In the recovery literature,
Ross (2015) pioneers a theory to recover the entire physical distribution of market returns from options written on the S&P 500 index. Subsequent studies focus on recovering asset expected returns from option prices under normal market conditions and over relatively long periods. In particular, Martin (2017) provides an estimate of the future expected market return. Extending this framework to events such as the Federal Open Market Committee (FOMC) meetings, Liu, Tang and Zhou (2021) provide a method to estimate the conditional market risk premium. The option market is forward-looking and, due to leverage, is an ideal place for informed trading, so there are likely many option predictors. However, perhaps due to the relative difficulty of processing the data, option predictors are under-studied so far and may yield more research in the future.

9.6.8 Others

Dong, Li, Rapach and Zhou (2021) find that there is a link between cross-section predictability and time-series predictability. In particular, they use 100 representative anomaly portfolio returns to forecast the market excess return, and show that, for the 1985:01–2017:12 out-of-sample period, a C-Mean forecast based on the 100 anomalies generates an out-of-sample R² = 0.89% (significant at the 1% level) and an annualized CER gain of 289 basis points for a mean-variance investor with a relative risk aversion coefficient of three. Economically, they attribute the predictive power to asymmetric limits of arbitrage and overpricing correction persistence.

Chang, Chu, Tu, Zhang and Zhou (2021) propose an environmental, social, and governance (ESG) index. They find that it has significant power in predicting the stock market risk premium, both in- and out-of-sample, and delivers sizable economic gains for mean-variance investors in asset allocation.
Although the index is extracted using the PLS method, its predictability is robust to using alternative machine learning tools. They find further that the aggregate of the environmental variables captures short-term forecasting power, while that of the social or governance variables captures long-term forecasting power. The predictive power of the ESG index stems from both cash flow and discount rate channels.

In the bond market, there are also many studies on predictability. Based on a linear combination of five forward rates, Cochrane and Piazzesi (2005) find a much higher predictive R², between 30% and 35%, for the risk premia on short-term bonds with maturities ranging from two to five years (unlike stocks, whose predictive R²'s are very small). Interestingly, Ludvigson and Ng (2009) demonstrate that the impressive predictive power found by Cochrane and Piazzesi (2005) can be further improved with five additional macroeconomic factors estimated from a set of 132 macroeconomic variables that measure a wide range of economic activities. Goh, Jiang, Tu and Zhou (2012) show that the high predictability, however, only generates economic gains comparable to those in the stock market. The reason is that the bond risk premia are much smaller than the stock market risk premia. They also provide another intriguing result: the technical indicators of the bond market predict much better than Ludvigson and Ng's (2009) five macro factors estimated from the 132 macroeconomic variables (in contrast, in the stock market, as shown by Neely, Rapach, Tu and Zhou (2014), their predictability is comparable).

The above predictors are largely at the aggregate level and are time-series predictors used to predict the market return or other economic variables over time. On the other hand, there are many firm characteristics that can be used to forecast returns in the cross section. This will be discussed later.
9.7 Mixed-frequency predictors

There are times when the predictors are observed at different frequencies: some are available monthly and some quarterly, for example. The question is how the monthly information helps to provide better quarterly forecasts. Conversely, one can also ask how to use quarterly data to improve monthly forecasts. There are mainly three approaches, of which Ghysels and Marcellino's (2018) book has detailed discussions.

9.8 Nowcasting

It should be noted that most forecasting studies are based on low-frequency economic data: published work is mostly at the monthly frequency, and next at the quarterly frequency when accounting data are used. Higher-frequency studies are much less common. Jiang, Li and Wang (2020) is an example of forecasting daily returns using firm news, and Gao, Han, Li and Zhou (2018) is an example of intraday forecasting.

Nowcasting in economics is about forecasting the present, the very near future, and explaining the very recent past. The term is a contraction of "now" and "forecasting," and has long been used in weather forecasting for a very short-term mesoscale period of up to 2 hours according to the World Meteorological Organization, and up to six hours according to some others in the field. It has recently become popular in economics as a way to provide a real-time assessment of the economy, such as the GDP, which is usually determined after a long time delay and is also subject to revisions. See, e.g., Bok, Caratelli, Giannone, Sbordone and Tambalotti (2017) and references therein. López de Prado (2020b) argues for the importance of nowcasting in explaining the substantial losses many quantitative firms suffered as a result of the COVID-19 selloff.
This is understandable, as low-frequency forecasting implicitly assumes the stationarity of the predictive model over a long period of time (over years in monthly forecasting, for example), which is not true (if all use the same or similar models, the models will break down too; see Section 9.3), especially during sudden extreme shocks in the market or in the economy. As a result, forecasting over a very short time window, say a few hours ahead, is likely more accurate than a forecast of what is going to happen a month from today. Perhaps nowcasting is most useful in using current and recent past data to identify a particular regime in a timely fashion, say a quick switch from a normal market state to a crisis state. Then money managers can react with their backup plans more effectively.

10 Machine Learning Tools

In this chapter, we apply some of the machine learning tools to finance, focusing primarily on asset return predictions.

10.1 What is Machine Learning?

There are various definitions. We use the one most closely related to finance applications. Machine learning (ML) is using machines (computers) to learn from data. So, ML is a particular form of learning that involves both computers (codes/programs) and data. In finance, we often have or assume a statistical model, such as a normal distribution, for the data, and so this part of ML is also known as statistical learning. There are mainly three types of ML: unsupervised learning, supervised learning, and reinforcement learning. We explain all three briefly below.

10.2 Types of Machine Learning

10.2.1 Unsupervised learning

Unsupervised learning is to find patterns or hidden structures in data sets. Given 1000 stock returns, are there any clusters, or can their dimensionality be reduced? Or what distribution, data-generating process, or statistical model can fit the data? In short, it is purely data analysis without a user's forecasting objective.
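For instance, clustering a panel of stock returns with k-means is a typical unsupervised exercise. A minimal sketch on simulated returns (the data and the choice of 3 clusters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
returns = rng.standard_normal((1000, 24))  # 1000 stocks, 24 months of returns

# group the stocks into 3 clusters by the similarity of their return histories
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(returns)
labels = km.labels_                        # cluster assignment for each stock
```

No forecasting target enters anywhere: the algorithm only looks for structure in the returns themselves.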
10.2.2 Supervised learning

Supervised learning is to find a model or relation between data sets. For example, we want to forecast stock market returns using a set of economic indicators; how the economic indicators are related to the returns is the question of interest. Finding the parameters of a linear regression of the returns on the indicators is a common example of supervised learning. Here we have an objective, minimizing the forecasting errors, and this objective determines/supervises what is learned from the data.

10.2.3 Reinforcement learning

Reinforcement learning (RL) is to find the best sequence of actions that will generate the optimal outcome based on a reward/utility function and data. A robot trading system is an example of RL: it monitors the stock market in real time and places buy and sell orders to maximize the terminal return/profit.

10.3 A short literature review

Deisenroth, Faisal and Ong (2020) provide an excellent introduction along with the needed mathematics. Bishop (2006) and Murphy (2012) offer deeper and yet easily accessible introductions. The well-known text of Hastie, Tibshirani, and Friedman (2009) provides a more formal analysis. For recent theory and applications, see, e.g., the books by Anthony and Bartlett (2009), Shalev-Shwartz and Ben-David (2014), and Shi and Iyengar (2020). On Python implementations, the books of Géron (2019) and Raschka and Mirjalili (2019) seem the best.

Machine learning (ML) tools have been receiving increasing attention from both hedge funds and academic researchers in recent years. In finance, Rapach, Strauss and Zhou (2013) is perhaps the first major study (published in a top finance journal) that applies LASSO (the least absolute shrinkage and selection operator, Tibshirani 1996) to select predictors from a large set ("big data") of candidates for forecasting global stock markets monthly.
Chinco, Clark-Joseph, and Ye (2019), perhaps the first such study at high frequency, use LASSO to analyze cross-firm return predictability at the one-minute horizon. Kozak, Nagel and Santosh (2020) provide a Bayesian LASSO approach to shrink dimensionality. Feng, Giglio, and Xiu (2020) focus on choosing factors, and Freyberger, Neuhierl, and Weber (2020) study nonlinear effects. Gu, Kelly and Xiu (2020) apply a comprehensive set of ML tools, including generalized linear models, dimension reduction, boosted regression trees, random forests, and neural networks, to forecast individual stocks and their aggregates. Han, He, Rapach and Zhou (2020) use a combination and a combination-LASSO method to identify which firm characteristics drive US stock returns, while Jiang, Tang and Zhou (2018) study such issues for the Chinese stock market. Filippou, Taylor, Rapach and Zhou (2020) apply LASSO and neural networks to predict foreign exchange rates, and Guo, Lin, Wu and Zhou conduct an ML study of corporate bonds. Guida (2019) and Jurczenko (2020) provide collections of papers on ML theory and its applications in finance, while Dixon, Halperin and Bilokon (2020) focus primarily on explaining neural networks and their applications. López de Prado (2018, 2020a) analyzes some of the practical issues (read these books, like others, with caution, as some claims may not be true). Nagel (2021) discusses some of the major asset pricing studies and research issues. Giglio, Kelly and Xiu (2021) provide a survey of recent advances.

10.4 Why penalized regressions?

Penalized regressions or similar methods are particularly useful in finance. To understand why penalized regressions are of interest, we first need to discuss the bias-variance decomposition of an estimator.

10.4.1 Bias-variance tradeoff

For simplicity, consider the predictive regression model,

y_t = β x_{t−1} + ε_t,  t = 1, ..., T,
(10.1) where ε_t is iid normal with zero mean and variance σ². The mean-squared error (MSE) of any estimator of β, say β̂, is defined as

MSE ≡ E(β̂ − β)² = Bias²(β̂) + Var(β̂),  (10.2)

where the second equality follows from summing the two terms,

Bias²(β̂) ≡ (Eβ̂ − β)² = (Eβ̂)² + β² − 2βEβ̂,  (10.3)
Var(β̂) ≡ E(β̂ − Eβ̂)² = Eβ̂² − (Eβ̂)².  (10.4)

The MSE tells us how accurate our estimator is. The squared bias is simply the squared difference between the expected value and the true value, and the variance measures how much the estimator can fluctuate around its mean from sample to sample. Equation (10.2) is known as the bias-variance decomposition, and it shows that, to minimize the MSE, there is in general a tradeoff between bias and variance. In other words, for some estimators (we have many ways to estimate parameters) the first term is small but the second is large, while for other estimators the reverse may be true. As far as the MSE is concerned, we want the sum to be minimal. The popular OLS estimator has zero bias but a certain variance. To reduce the variance of β̂, we may impose an upper bound on it. The estimator then becomes biased, but its MSE can be smaller. Indeed, this is the case for many penalized regressions, which impose restrictions on the beta coefficients and generally lead to smaller MSEs.

10.4.2 Prediction error

The next question is why the MSE is important. This is because, in practice, we want our prediction to be as close to the future realized value as possible, i.e., to minimize the prediction error. As it turns out, the prediction error is tied to the MSE. To see why, given an estimator β̂, the next-period predicted and true values are

ŷ_{T+1} = β̂ x_T,  (10.5)
y_{T+1} = β x_T + ε_{T+1},  (10.6)

respectively.
At time T, we know our prediction ŷ_{T+1} but not y_{T+1}, as ε_{T+1} is random to us, and so the expected mean-squared error of our prediction is

Prediction Error = E(y_{T+1} − ŷ_{T+1})² = E(β − β̂)² × x_T² + σ².   (10.7)

Hence, to reduce the prediction error, we need to reduce the MSE of β̂. It is the MSE that matters for prediction accuracy, not the bias alone.

10.4.3 Problems with many regressors

The bias-variance tradeoff becomes more important when there are many regressors or many betas, because the estimation risk in the betas is then greater, and hence their impact on the prediction error is greater too. In this case, imposing constraints on the betas helps in most practical problems. To see these points mathematically, consider the standard regression with n regressors (we count the constant here),

y_t = α + β_1 x_{t,1} + β_2 x_{t,2} + · · · + β_{n−1} x_{t,n−1} + ε_t,   ε_t ∼ N(0, σ²).   (10.8)

In vector form,

Y = Xβ + e,   (10.9)

where Y is a T-vector of observations on the dependent variable, X is a T × n matrix of observations on the regressors, and β is an n-vector of the regression coefficients. Recall that the common OLS estimator is

β̂_OLS = (X′X)⁻¹X′Y.   (10.10)

Note that the first column of X is all ones, as the regression has an intercept. The covariance of the OLS estimator is well known,

cov(β̂_OLS) = σ²(X′X)⁻¹,   (10.11)

where σ² is the variance of the model residual under the standard iid assumption. When T is large, X′X/T is close to the covariance matrix of the regressors, a constant matrix Σ_x, so that

cov(β̂_OLS) ≈ σ² Σ_x⁻¹/T.   (10.12)

Summing the variances across the n coefficients, the total estimation error grows at a rate proportional to n. In other words, everything else equal, the more regressors there are, the less accurate the estimates. The expected mean-squared error of prediction is (see, e.g., Hastie, et al., 2009, p.
26, or one can prove it directly),

Expected Prediction Error = E(y_{T+1} − ŷ_{T+1})² = σ² n/T,   (10.13)

which grows proportionally to n. Hence, when there are too many predictors relative to the sample size T, the standard linear regression will not perform well, either in estimating the parameters or in forecasting.

10.5 LASSO

Tibshirani (1996) proposes LASSO (the least absolute shrinkage and selection operator) to improve on the OLS. Today, it is one of the most useful ML methods in finance, as it helps to select a few important variables out of potentially hundreds to predict a stock, the stock market, or the default of a loan.

10.5.1 The idea

Consider, for example, the following predictive regression of the market return y_t on 200 predictors,

y_t = α + β_1 x_{t−1,1} + β_2 x_{t−1,2} + · · · + β_200 x_{t−1,200} + ε_t,   t = 1, . . . , T.   (10.14)

The problem is that T is usually not large. If T ≤ 200, the above regression is infeasible, as the usual OLS estimator

β = (X′X)⁻¹X′Y   (10.15)

is undefined because X′X is not invertible, where β denotes all the coefficients (including the intercept), X is the T × 201 data matrix containing the constant values and the x's, and Y is a T-vector of the y's. Suppose T > 200 so that the regression is numerically feasible. Then there is the well-known overfitting problem: the regression can fit well in-sample due to the use of many variables/predictors, but it can perform very poorly out-of-sample. For example, Welch and Goyal (2008) and Rapach and Zhou (2020) show that the OLS out-of-sample prediction is pure garbage if all the 14 or 12 predictors there are used, respectively. The reason is that the estimation accuracy is low (see Section 10.4.3), so the parameters (intercept and slopes) are far from the truth, and they do not work well for out-of-sample forecasting. The objective of LASSO is to select the most important predictors out of the 200.
In so doing, LASSO imposes a bound on the sum of the absolute values of the regression coefficients,

|β_1| + |β_2| + · · · + |β_200| ≤ C.

When the constant C is chosen small enough, most of the regression coefficients are forced to be zero, and what is left over are the most important ones. Suppose that LASSO selects 5 variables, say x_2, x_9, x_105, x_119, x_188; then we run an OLS regression only on them, instead of on all 200 variables, to form our forecast. Hence, LASSO is a data-driven approach that searches for sparsity to identify the minimal number of predictors. From a forecasting-accuracy point of view, the restrictions reduce the variance of the parameter estimates; the resulting estimates are generally biased, but their MSE is often reduced, yielding improved forecasts with greater accuracy. The optimal choice of C will be discussed in the next subsection. Since the regression minimizes the average mean-squared error of the residuals, LASSO solves the same problem,

min_{β_0, β_j's}  (1/T) Σ_{t=1}^T ( y_t − β_0 − Σ_{j=1}^n β_j x_{t−1,j} )²   subject to   Σ_{j=1}^n |β_j| ≤ C,

with the additional constraint on the betas, where β_0 denotes the previous α for notational convenience. It is often referred to as a regularized or penalized regression, which imposes constraints or information to make a problem more tractable. Mathematically, the constrained problem is equivalent to an unconstrained problem with a Lagrange multiplier,

β_LASSO = arg min_β [ (1/(2T)) Σ_{t=1}^T ( y_t − β_0 − Σ_{j=1}^n β_j x_{t−1,j} )² + λ Σ_{j=1}^n |β_j| ],   (10.16)

where λ is the Lagrange multiplier. This is a quadratic programming problem with certain constraints. There is no analytical formula for the solution in general, but it can be solved easily by various algorithms. In practice, software packages are readily available in Matlab, R or Python. Note that LASSO is a constrained regression.
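To see the selection behavior concretely, here is a small sklearn sketch on synthetic data; the dimensions, true coefficients, and penalty value are made-up assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
T, n = 200, 50
beta = np.zeros(n)
beta[[2, 9, 20]] = [1.0, -0.8, 0.6]      # three true predictors
X = rng.standard_normal((T, n))
y = X @ beta + 0.5 * rng.standard_normal(T)

lasso = Lasso(alpha=0.1)                 # alpha plays the role of lambda in (10.16)
lasso.fit(X, y)

selected = np.flatnonzero(lasso.coef_)   # indices with nonzero slopes
print("selected predictors:", selected)
print("slopes forced to zero:", int(np.sum(lasso.coef_ == 0)))
```

Most of the 50 slopes are set to exactly zero, and the surviving ones are the predictors with the large true coefficients, which is the sparsity-seeking behavior described above.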
The usual OLS regression has no constraints, and is a special case of LASSO with C = +∞ or λ = 0 (mathematically, C and λ are inversely related). The smaller the C (or the larger the λ), the stronger the constraints on the betas, forcing them closer to zero. This is the reason why LASSO is also called a shrinkage estimator: it shrinks the betas toward zero. How do we choose C or λ in practice? One often uses cross-validation (see Section 10.6) to make the choice that minimizes the prediction error. Note further that dividing the first term by 2T is simply for mathematical convenience; it does not change the solution for the betas. In the optimization process, when we set the derivatives with respect to the betas to zero, the 2 cancels out (see an example below), making the final formula more elegant, in terms of λ rather than λ/2. The definition (10.16) is consistent with our Python codes. Mathematically, an l_q norm on β is defined by

||β||_q = ( Σ_{j=1}^n |β_j|^q )^{1/q}.   (10.17)

Then the LASSO problem is often written in a shorter but more abstract form,

β_LASSO = arg min_β [ ||y − βx||² + λ* ||β||_1 ],

where

||y − βx||² = Σ_{t=1}^T ( y_t − β_0 − Σ_{j=1}^n β_j x_{t−1,j} )²,   λ* = 2Tλ,

which is why LASSO is known as imposing constraints on the betas with the l_1 norm. Note that || · ||² is the square of the l_2 norm || · ||_2. Since the l_2 norm is the most widely used, its subscript 2 is often omitted for simplicity. To gain some intuition on the LASSO estimator, consider the special case of a univariate regression without an intercept,

y_t = β x_t + ε_t,   t = 1, . . . , T.   (10.18)

In this case, we want to solve for the β that minimizes

f(β) = (1/(2T)) Σ_{t=1}^T (y_t − β x_t)² + λ|β|.

The first-order condition is

f′(β) = −(1/T) Σ_{t=1}^T (y_t − β x_t) x_t + λ sign(β) = 0,

where sign(·) is the sign function, so that sign(β) = 1 if β > 0 and −1 if β < 0.
Assuming β > 0, we solve from the above

β̂_LASSO = β̂ − λ / [ (1/T) Σ_{t=1}^T x_t² ],   (10.19)

where β̂ is the standard OLS estimator (without the constraint),

β̂ = [ (1/T) Σ_{t=1}^T y_t x_t ] / [ (1/T) Σ_{t=1}^T x_t² ].

So β̂_LASSO equals β̂ if λ = 0 (no constraint), and is shrunk toward zero as λ increases to β̂ and beyond (β̂_LASSO is defined as zero if the right-hand side of (10.19) is negative, because that equation is solved by assuming β > 0, so its estimator must satisfy β̂_LASSO ≥ 0). In particular, if the data x's are normalized, i.e.,

(1/T) Σ_{t=1}^T x_t² = 1,

then it is clear from (10.19) that

β̂_LASSO = β̂ − λ,   if β̂ > 0.   (10.20)

The LASSO estimator simply reduces the OLS estimator by the amount λ. If λ = β̂, β̂_LASSO = 0. If λ > β̂, β̂_LASSO is set to zero. Similarly, if β̂ < 0, say β̂ = −2, we have

β̂_LASSO = β̂ + λ,   if β̂ < 0,   (10.21)

that is, we add λ to the OLS estimator to make it closer to 0. But if λ > |β̂| = 2, we set β̂_LASSO to zero. If β̂ = 0 instead of −2, we obviously set β̂_LASSO = 0. Overall, β̂_LASSO always has the same sign (positive or negative) as β̂, and it is just smaller in absolute value.

The above simple relation between β̂_LASSO and β̂ is also true when n > 1, as long as the regressors are normalized or the columns of X are orthonormal. Of course, there is no such relation for a general X matrix, and we have to use a numerical algorithm to search for the solution. Nevertheless, as in the simple case above, the LASSO estimator is always a piecewise-linear function of λ. Moreover, the problem is convex. Mathematically, this makes it easy to find the numerical solution by the code below.

10.5.2 The code

Python has the greatest number of high-quality packages for machine learning. The LASSO is easily implemented using sklearn (Scikit-learn).
The key code is:

    from sklearn import linear_model

    alpha = 0.5
    lasso = linear_model.Lasso(alpha=alpha)
    lasso.fit(x, y)

    print(lasso.intercept_)  # the intercept
    print(lasso.coef_)       # the slopes

The code uses alpha (what we call λ) as the input. Given a value of α = λ, it computes the beta parameters from the definition,

β_LASSO = arg min_β [ (1/(2T)) Σ_{t=1}^T ( y_t − β_0 − Σ_{j=1}^n β_j x_{t−1,j} )² + λ Σ_{j=1}^n |β_j| ].

Technically, this is a penalized quadratic programming problem. The last two statements simply print the estimates. Note again that imposing an α value is equivalent to imposing a C on the coefficients. In other words, for a given α, there is an implied C, but they are inversely related: as α goes from 0 to +∞, C goes from +∞ to 0. Moreover, the choice of alpha can be made via cross-validation. See the class codes for details.

10.5.3 The theory

In the standard linear regression,

y_t = α + β′x_t + ε_t,   t = 1, . . . , T,   (10.22)

where β is an n-vector of slopes/coefficients on n variables and T is the sample size, it is easy to show that the expected prediction error of the OLS estimator β̂_OLS is (see (10.13))

E||X(β̂_OLS − β)||²/T = σ² n/T = σ² × (# of parameters)/(# of observations),   (10.23)

where X is a T × n matrix of the data on the x_t's, treated as fixed (rather than random) variables here for simplicity. So we must have a large enough time-series sample relative to the number of parameters to make the prediction error small. The above also indicates how close the OLS estimator can be to the true parameters. In short, for the OLS estimator to work, n/T must be small. Traditionally, we assume n is fixed and T is large, so this is not a problem. But in the big-data context, n can be close to T, and sometimes even larger than T, so the OLS estimator cannot work.
In the context of the LASSO estimator, under certain conditions, we have

E||X(β̂_LASSO − β)||²/T = O( s_0 log n / T ),   (10.24)

where O(·) means that the left-hand side is bounded by a constant times what is inside the brackets, and s_0 is the number of true non-zero parameters. See Bühlmann and van de Geer (2011) for details. There are two important messages. First, given the sample size, it is impossible to estimate more non-zero parameters than observations: s_0 cannot exceed T. Second, although the number of variables n can be much larger than the number of observations T (since we assume many of them have zero slopes), it cannot be exponentially larger; that is, the log of the number of variables cannot be too high relative to the sample size, i.e., (log n)/T should be small or converge to 0. Otherwise, no theory can guarantee the validity of the LASSO.

10.6 Cross-validation

An important question is how to choose λ in practice. Cross-validation is widely used to select λ by examining how well the resulting estimator performs over test data sets; that is, we validate the procedure across data sets. The simplest way to understand it is to start from leave-one-out cross-validation (LOOCV). Suppose we have data

(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n).

We leave the first observation out, and use all the remaining data,

(x_2, y_2), . . . , (x_n, y_n),

to estimate or train the model. Then we can forecast y_1 to get ŷ_1 based on the (n − 1) data points. Let

MSE_1 = (y_1 − ŷ_1)²

be the squared error of our forecast. Similarly, we can compute MSE_2 = (y_2 − ŷ_2)² by leaving out (x_2, y_2) while using all the remaining (n − 1) data points. Successively, we can compute the average MSE across the data sets,

CV = (1/n) Σ_{i=1}^n MSE_i.

The LOOCV procedure is to find a tuning parameter λ that minimizes this overall error. In general, a K-fold cross-validation (K-CV) approach works in three steps: 1) divide the data into K separate sets of roughly equal size,

Data_1, Data_2, . . . , Data_K.
2) For k = 1, 2, . . . , K, estimate the model excluding only the k-th fold, Data_k, compute the predictive MSE_k on that fold, and then compute the total error

CV_K = (1/K) Σ_{k=1}^K MSE_k.

3) Search for the tuning parameter that minimizes CV_K.

It may be noted that the beta estimates in LOOCV have the least variance compared with those from the K-CV, as each estimation uses the largest possible sample. However, its power to discriminate may be limited, because the MSEs are likely very noisy, being computed from one data point each time. In addition, it has to estimate the model n times, which can be time consuming if the model is difficult to estimate. In general, a moderate K is a better choice, and K = 5 or 10 is commonly chosen (in the LOOCV, K = n).

10.7 Ridge

LASSO may have a problem when the predictors are highly correlated. Intuitively, if two predictors are highly correlated, it is difficult to select between them. A simple remedy may be to retain them both, or to retain a linear combination of them. Hoerl and Kennard (1970) propose ridge regression to deal with highly correlated regressors in a general regression, 26 years before LASSO was proposed. We consider ridge by itself in this section, and combine it with LASSO in the E-net section.

10.7.1 The idea

Consider the standard predictive regression

y_t = α + β_1 x_{t−1,1} + β_2 x_{t−1,2} + · · · + β_n x_{t−1,n} + ε_t,   t = 1, . . . , T.   (10.25)

In vector form,

Y = Xβ + e,   (10.26)

where Y is a T-vector of observations on the dependent variable, X is a T × (n + 1) matrix of observations on the regressors, and β is an (n + 1)-vector of the regression coefficients. Recall that the common OLS estimator is

β̂_OLS = (X′X)⁻¹X′Y.   (10.27)

Note that the first column of X will be all ones if the regression has an intercept.
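In practice, the K-fold search over the tuning parameter described in Section 10.6 is automated in sklearn's LassoCV; a minimal sketch on synthetic data (all settings are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
T, n = 200, 30
beta = np.zeros(n)
beta[:2] = [1.0, -0.5]                  # two true predictors
X = rng.standard_normal((T, n))
y = X @ beta + rng.standard_normal(T)

cv_model = LassoCV(cv=5).fit(X, y)      # 5-fold CV over a grid of alphas
selected = np.flatnonzero(cv_model.coef_)
print("chosen alpha:", round(cv_model.alpha_, 4))
print("selected predictors:", selected)
```

Here K = 5: LassoCV fits the model on each set of four folds, evaluates the prediction error on the held-out fold, and picks the alpha with the smallest average error.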
The problem is that when the regressors are highly correlated, or when the columns of X are close to being linearly dependent (multicollinearity), the matrix X′X will be close to singular (non-invertible). In this case, the entries of the inverse (X′X)⁻¹ will be very large, and so will be the OLS estimates. The ridge estimator is defined by

β̂_ridge = (X′X + λI)⁻¹X′Y,   (10.28)

where I is the identity matrix of order n + 1, and λ ≥ 0 is the shrinkage parameter. So β̂_ridge is obtained by adding the matrix λI (the "ridge") to X′X, making the result X′X + λI more stable and easily invertible. In the special case of λ = 0, β̂_ridge reduces to the OLS estimator. In general, it shrinks the estimates toward zero. To see this, consider the case when X is orthonormal, or X′X = I; then

β̂_ridge = β̂_OLS / (1 + λ),

which clearly shrinks all the OLS estimates toward 0 as λ gets larger. Mathematically, if we solve the standard MSE minimization by imposing the l_q norm constraint with q = 2,

β_ridge = arg min_β [ (1/T) Σ_{t=1}^T ( y_t − β_0 − Σ_{j=1}^n β_j x_{t−1,j} )² + λ Σ_{j=1}^n β_j² ],   (10.29)

the solution is the ridge estimator. Although this differs from the LASSO only in replacing q = 1 by q = 2, the behavior of the estimator is totally different. Note that we now divide the first term by T instead of 2T, because setting the derivatives with respect to the betas to zero in the optimization cancels the 2s from both terms. To understand the impact of multicollinearity, let us consider a simple example where T = 3 and n = 1, with

X = [ 1  1 ;  1  1 ;  1  1+η ].

It is clear that when η = 0, there is exact collinearity; when η is small, X is close to collinear. Then

X′X = [ 3      3+η ;
        3+η    2+(1+η)² ].

Its inverse is (see formula (1.78))

(X′X)⁻¹ = (1/det(X′X)) [ 2+(1+η)²   −(3+η) ;
                          −(3+η)      3     ],

and

det(X′X) = 3[2 + (1+η)²] − (3+η)² = 2η².

So, when η is small, the inverse must be large, driven by the small determinant.
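A quick numerical check of this example with numpy, assuming η = 0.01 (note that expanding 3[2 + (1+η)²] − (3+η)² gives 2η²):

```python
import numpy as np

eta = 0.01
X = np.array([[1.0, 1.0],
              [1.0, 1.0],
              [1.0, 1.0 + eta]])
XtX = X.T @ X
print("det(X'X) =", np.linalg.det(XtX))        # = 2*eta**2, tiny
print("largest entry of (X'X)^-1:", np.abs(np.linalg.inv(XtX)).max())

lam = 0.5                                      # the ridge pushes the determinant away from zero
print("det(X'X + lam*I) =", np.linalg.det(XtX + lam * np.eye(2)))
```

With η = 0.01, the determinant is about 0.0002 and the entries of (X′X)⁻¹ run into the tens of thousands, while adding even a modest ridge makes the matrix comfortably invertible.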
By the relation between the determinant and the eigenvalues (see (6.35)),

det(X′X) = λ_1 λ_2,

so the smallest eigenvalue of X′X must be small too. Hence the following statements, which practitioners often make, are equivalent: a) X is nearly collinear; b) X′X is nearly singular; c) some eigenvalues of X′X are too small; d) (X′X)⁻¹ is too large. Once we add the ridge to X′X, we have, from the definition of eigenvalues, that

det(X′X + λI) = (λ_1 + λ)(λ_2 + λ).

That is, the eigenvalues after adding the ridge are the original ones shifted up by the amount λ. Note that when n > T, i.e., the number of regressors or variables is greater than the sample size, the OLS estimator is undefined, as X′X must be singular in this case. However, the ridge estimator is still well defined. This is because, as long as λ > 0, the determinant of X′X + λI stays away from zero. This implies that X′X + λI is invertible, and hence the estimator is well behaved, at least numerically.

10.7.2 The code

The key code is exactly that for the LASSO, except replacing the word Lasso by Ridge:

    from sklearn import linear_model

    alpha = 0.5
    ridge = linear_model.Ridge(alpha=alpha)
    ridge.fit(x, y)

    print(ridge.intercept_)  # the intercept
    print(ridge.coef_)       # the slopes

Again, the code uses alpha (what we call λ) as the input; it solves for the betas from the definition,

β_ridge = arg min_β [ (1/T) Σ_{t=1}^T ( y_t − β_0 − Σ_{j=1}^n β_j x_{t−1,j} )² + λ Σ_{j=1}^n β_j² ].

The solution has the form of formula (10.28). See the class codes for details.

10.7.3 The theory

Consider the case where the OLS estimator is well defined,

β̂_OLS = (X′X)⁻¹X′Y,   (10.30)

though X′X may be close to singular. It is well known that the OLS estimator is unbiased,

E[β̂_OLS] = β,   (10.31)

i.e., its expected value is the true parameter. In other words, if we compute the OLS estimator for 10,000 data sets, the average should converge to the true value.
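The 10,000-data-sets thought experiment is easy to carry out directly; in this sketch the design matrix, true parameters, and noise are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
T = 100
beta_true = np.array([0.2, 0.7])
n_sims = 10000

avg = np.zeros(2)
for _ in range(n_sims):
    X = np.column_stack([np.ones(T), rng.standard_normal(T)])
    y = X @ beta_true + rng.standard_normal(T)
    avg += np.linalg.lstsq(X, y, rcond=None)[0]   # OLS estimate for this data set
avg /= n_sims
print("average OLS estimate over 10,000 data sets:", avg.round(3))
```

The average of the 10,000 OLS estimates is very close to the true (0.2, 0.7), illustrating the unbiasedness in (10.31).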
Since the ridge estimator shrinks the unbiased OLS estimator toward zero, it must be biased. What, then, is its advantage? It can have a much smaller variance. Indeed, the covariance of the OLS estimator is well known,

cov(β̂_OLS) = σ²(X′X)⁻¹,   (10.32)

where σ² is the variance of the model residual under the standard iid assumption. It explodes as X′X becomes nearly singular. It is easier to see this from the trace,

tr[cov(β̂_OLS)] = σ² tr[(X′X)⁻¹] = σ² Σ_{i=1}^{n+1} 1/λ_i,   (10.33)

where the λ_i's are the eigenvalues of X′X, which can be very small when X′X is nearly singular. In contrast, since cov(β̂_ridge) = σ²(X′X + λI)⁻¹X′X(X′X + λI)⁻¹, it can be shown that

tr[cov(β̂_ridge)] = σ² Σ_{i=1}^{n+1} λ_i/(λ_i + λ)²,   (10.34)

which will not explode even if the λ_i's are very small or even zero. Hence, the ridge estimator trades bias for smaller variance.

10.8 Enet

When n is large, the ridge estimator can shrink the coefficients, but cannot shrink them to exactly zero. Hence, it cannot be used effectively to reduce dimensionality or to select variables. On the other hand, LASSO tends to be more aggressive in setting many betas to zero. In particular, it tends to select an arbitrary one among highly correlated variables (setting the other betas to zero), while ridge tends to select the whole group by keeping all the betas but making them small. Zou and Hastie (2005) propose a combination of LASSO and ridge, known as the Elastic Net (E-net), to combine the advantages of both. It sets some coefficients to zero like the LASSO, but it is less aggressive, and it uses the ridge feature to tame only the large coefficients. Mathematically, the E-net estimator solves the same MSE problem,

β_Elastic = arg min_β [ (1/(2T)) Σ_{t=1}^T ( y_t − β_0 − Σ_{j=1}^n β_j x_{t−1,j} )² + λ Σ_{j=1}^n |β_j| + (η/2) Σ_{j=1}^n β_j² ],   (10.35)

imposing both the LASSO and ridge constraints, i.e., both l_1 and l_2 norm restrictions on the betas.
Because it has two penalties, it now has two parameters, λ and η, which determine the severity of the constraints. In practice, the E-net tends to do better than either LASSO or ridge alone (as may be expected, since it contains each as a special case when η or λ is zero), and it is widely used today for forecasting. The key code is similar to that for the LASSO,

    from sklearn import linear_model

    alpha = 0.5
    psi = 0.3
    enet = linear_model.ElasticNet(alpha=alpha, l1_ratio=psi)
    enet.fit(x, y)

    print(enet.intercept_)  # the intercept
    print(enet.coef_)       # the slopes

Note that the code uses α and ψ (the l1_ratio) as inputs, and it solves

β_Elastic = arg min_β [ (1/(2T)) Σ_{t=1}^T ( y_t − β_0 − Σ_{j=1}^n β_j x_{t−1,j} )² + αψ Σ_{j=1}^n |β_j| + (α/2)(1 − ψ) Σ_{j=1}^n β_j² ],   (10.36)

which is exactly (10.35) except with a different parameterization,

α = λ + η,   ψ = λ/(λ + η).

See the class codes for details.

10.9 C-LASSO

Following Han, He, Rapach and Zhou (2020), and more closely Rapach and Zhou (2020), we can define a time-series version of C-LASSO as follows, while the details for the cross-section and extensions can be found in Han, et al. (2019). Diebold and Shin (2019) are the first to explore this line of ideas, though their procedure is quite different from ours. The C-LASSO, or Combination LASSO, first uses the idea of the combination forecast method, and then uses LASSO to select the most important forecasts out of all the forecasts based on the individual predictors. Suppose we have 200 predictors, which gives us 200 forecasts, one based on each predictor. Now we consider a regression of the realized returns on the forecasts,

y_t = α + θ_1 ŷ_{t−1,1} + θ_2 ŷ_{t−1,2} + · · · + θ_200 ŷ_{t−1,200} + ε_t,   t = 1, . . . , T.   (10.37)

We use the LASSO to select the most important forecasts in (10.37). This regression on forecasts will in general be more robust than the regression on predictors, though it is not the most efficient.
It works well when the true regression parameters change over time. In implementation, we impose the nonnegativity restriction θ_j ≥ 0, the reason being that the return forecasts should be positively related to the realized returns. In contrast to the usual LASSO, which is applied to predictors, we apply C-LASSO to forecasts. This is only the first, selection step of C-LASSO. After selection, the final forecast is the average of the selected ones. For example, if 10 forecasts are selected out of the 200, the C-LASSO forecast is

ŷ_t^{C-LASSO} = ( ŷ_{t−1,1} + ŷ_{t−1,3} + · · · + ŷ_{t−1,180} ) / 10,   (10.38)

where we assume that the first, third, . . ., and the 180th (a total of 10) are the selected ones. The same idea can be applied to extend ridge and the Elastic net to yield C-Ridge and C-Enet. C-LASSO generally improves the average forecast substantially by selecting and using only the good forecasts in the average, rather than averaging all the forecasts.

10.10 E-LASSO

E-LASSO, or encompassing LASSO, is motivated by two ideas (see Rapach and Zhou, 2020, and Han et al., 2021). First, based on forecast encompassing, there is likely a gain from combining the C-LASSO with the OLS. Second, it belongs to the ensemble approach of machine learning that combines algorithms (e.g., Zhou, 2012). In general, we define the E-LASSO forecast as a simple linear combination of C-LASSO and the OLS,

ŷ_t^{E-LASSO} = λ ŷ_t^{C-LASSO} + (1 − λ) ŷ_t^{OLS},   (10.39)

where λ is data-driven, computed as the value that minimizes the forecasting error of ŷ_t^{E-LASSO} over the past M periods (say M = 36 in monthly forecasting applications). Dong et al. (2021) and Han, et al. (2021), among others, find that, indeed, the E-LASSO tends to do better than both C-LASSO and OLS in most applications.

10.11 Neural networks

LASSO and the previous models are extensions of ordinary linear regression (OLS), but they can handle many regressors. They work well only if the true data are from a linear model.
In practice, however, many dependent variables depend on others in a nonlinear fashion. The neural network (NN) is a major class of models that extend the OLS to allow for nonlinear relations. It weights the data linearly into a layer of new data, applies a nonlinear transformation, then weights the results into another layer and applies another nonlinear transformation, and so on, until it finally reaches the observed output data. It is motivated by biological neural networks, and so the nodes of the network are called neurons.

Mathematically, any smooth function can be approximated by a suitable NN (Hornik, Stinchcombe, and White 1989; Cybenko 1989). In other words, if we use a set of predictors to predict the market return, and if the true function is highly nonlinear but smooth, then, given enough data, we can build a suitable NN that approximates the true but unknown function with arbitrary accuracy. This is the theoretical reason why the NN is widely used in practice and has growing applications in finance; Gu, Kelly and Xiu (2020) is an example. Klaas (2019), Géron (2019), and Gulli, Kapoor and Pal (2019), among many others, provide standard Python codes for implementing NNs. A deep neural network (DNN) is an NN (sometimes called an artificial NN or ANN) with multiple layers between the input and output layers. Deep learning (also known as deep structured learning) is the part of the broader family of machine learning methods based on artificial neural networks. The neural network is perhaps best understood by going through some examples.

10.11.1 No hidden layer: linear regression

Consider the prediction of y_t using two predictors, z_1 and z_2. We have the usual simple predictive regression,

y_t = α + β_1 z_{1,t−1} + β_2 z_{2,t−1} + ε_t,   t = 1, 2, . . . , T.
(10.40)

Recall that, if the parameters were known, we would compute our forecast from

ŷ_t = α + β_1 z_{1,t−1} + β_2 z_{2,t−1},

which says that our forecast is a linear function of the predictors. More conveniently, we can express it in terms of the dot product (as many ML books do),

ŷ_t = θ_1 x_{1t} + θ_2 x_{2t} + θ_3 x_{3t} = θ · x_t,   (10.41)

where

θ = (α, β_1, β_2) = (θ_1, θ_2, θ_3),   (10.42)
x_t = (1, z_{1,t−1}, z_{2,t−1}) = (x_{1t}, x_{2t}, x_{3t}),   (10.43)

and the last equality of (10.41) follows from the definition of the dot product. In terms of an NN, we map x_t, the attributes or inputs, into an output using the weights θ. Denoting the output by y_1 and dropping time subscripts for brevity, the network consists of an input layer with nodes x_1, x_2, x_3, connected directly to an output layer with the single node y_1 (the diagram is omitted). In other words, the OLS can be viewed as an NN with 3 nodes or neurons in the input layer, no hidden layers, and one output in the output layer. In a multivariate regression, we would have multiple outputs, and so multiple y's in the output layer. Note that there are 3 parameters. We seek the parameters that make the forecasts as close to the actual data (the training sample) as possible. This is often done by minimizing the mean-squared error of the differences,

min L ≡ Σ_{t=1}^T [ y_t − (θ_1 x_{1t} + θ_2 x_{2t} + θ_3 x_{3t}) ]² = Σ_{t=1}^T ( y_t − θ · x_t )².   (10.44)

The solution is the well-known OLS regression coefficient vector and is analytically available. To summarize, in the no-hidden-layer case, the NN is simply the usual OLS regression. For a general NN, however, the solution is not available by any formula. Instead, we need to search for it numerically using optimization algorithms, of which gradient descent is the most common, to be discussed later.

10.11.2 One hidden layer

Consider now an NN with one hidden layer. Suppose we map the 3 inputs into 4 nodes of a hidden layer, and then into an output.
Graphically, the network has an input layer (x_1, x_2, x_3), a hidden layer of 4 nodes, and an output layer (y_1); the diagram is omitted. The key is that the data in the hidden layer are nonlinear functions of linear combinations of the inputs. For example, the top and bottom hidden nodes are

x^1_1 = f( θ^{1,1}_1 x_1 + θ^{1,1}_2 x_2 + θ^{1,1}_3 x_3 ),
x^1_4 = f( θ^{1,4}_1 x_1 + θ^{1,4}_2 x_2 + θ^{1,4}_3 x_3 ),

where f is a nonlinear activation function that maps the linear weighting of the previous layer's data into a nonlinear relation, with the θ^{1,j}_k's as parameters. For example, for θ^{1,4}_2, the superscripts indicate the first layer and the fourth node. The forecast is then

ŷ = θ^2_0 + θ^2_1 x^1_1 + θ^2_2 x^1_2 + θ^2_3 x^1_3 + θ^2_4 x^1_4 = θ^2_0 + W′_2 x^1 = θ^2_0 + W′_2 f(W′_1 x),

where W_2 and W_1 are the coefficient matrices in the second and first steps, and x is the vector of the input data. So it is clear that the data are transformed linearly at each step, and at the hidden layer the results are further transformed by a nonlinear function before being used in the next step. In this way, one can generate an NN with an arbitrary number of steps. The rectified linear unit (ReLU) is one of the most popular activation functions, used by Gu, Kelly, and Xiu (2020) and others in finance; it is defined as

f(x) = 0 if x < 0; x otherwise,

i.e., f(x) = max(0, x). Intuitively, the activation function activates a neuronal connection in response to a sufficiently strong signal, thereby relaying the signal forward through the network. Note that, in the above one-hidden-layer NN, there are 3 × 4 = 12 parameters to arrive at the hidden layer, and then 5 more (the intercept and 4 weights) at the end, so there are 17 parameters in total to estimate. In an NN with m layers, if each layer adds K parameters, there are K × m parameters in total, so the number of parameters grows very quickly with the size of the network. In finance, for time-series forecasting, probably 1-2 layers are enough, due to data limitations. In cross-sectional forecasting, one may apply up to 5 layers, as in Gu, Kelly, and Xiu (2020).
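The one-hidden-layer forward pass ŷ = θ^2_0 + W′_2 f(W′_1 x) is only a few lines of numpy; in this sketch the weights and inputs are made-up values, and the matrix orientation (W1 @ x rather than W′_1 x) is a convention choice:

```python
import numpy as np

def relu(z):                         # f(x) = max(0, x)
    return np.maximum(0.0, z)

rng = np.random.default_rng(5)
x = np.array([1.0, 0.3, -0.7])       # 3 inputs; the first is the constant
W1 = rng.standard_normal((4, 3))     # input -> hidden: 3 x 4 = 12 weights
W2 = rng.standard_normal(4)          # hidden -> output: 4 weights
theta0 = 0.1                         # output intercept

hidden = relu(W1 @ x)                # x^1 = f(W1 x), the 4 hidden nodes
y_hat = theta0 + W2 @ hidden         # y_hat = theta^2_0 + W2' f(W1 x)
print("hidden layer:", hidden.round(3))
print("forecast:", round(float(y_hat), 3))
```

Note that after the ReLU, every hidden node is nonnegative; nodes whose linear combination is negative are "switched off" and contribute nothing to the forecast.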
10.11.3 Gradient descent: a search algorithm

It is of interest to see how the numerical estimation of the parameters is done mathematically, which can provide deeper insight into the Python packages. Consider the simple OLS case. Based on (10.44), the derivative with respect to any parameter is

L_j ≡ ∂L/∂θ_j = −2 Σ_{t=1}^T ( y_t − θ · x_t ) x_{jt}.

Mathematically, at the optimal value of any parameter, the derivative with respect to it should be zero. However, in practice, we do not know the optimal parameter values; they are what we need to find. The idea is that we can start from any initial guess θ^0_j, for j = 1, 2, 3. We then compute a new updated/iterated value

θ^1_j = θ^0_j − ρ L_j,   j = 1, 2, 3,   (10.45)

where ρ > 0 is a small constant. The reason is that, if θ^0_j is not optimal, then L_j ≠ 0. Suppose L_j > 0; then θ^0_j is on the right-hand side of the optimal value (imagine a U-shaped function with the minimum in the middle), so we have to move to the left, and that is exactly what the above algorithm does. If L_j is not zero, one can iterate,

θ^{m+1}_j = θ^m_j − ρ_m L_j,   m = 1, 2, · · · ,   (10.46)

until convergence, where ρ_m is sometimes called the learning rate. If it is too small, the algorithm may converge slowly and may need a lot of training examples to do so. If it is too large, the iterated values θ^{m+1}_j may change too fast and end up oscillating around the optimal value. This first-order iterative optimization algorithm is known as gradient descent, and is generally attributed to Cauchy, who first suggested it in 1847. To understand its name, imagine that you walk down to the bottom (minimum) of a mountain. The direction that points toward the bottom is the negative of the gradient (the first-order derivatives) at that point, and you descend accordingly. However, depending on the mountain, there is a possibility that you can get stuck in some hole (i.e., a local minimum or saddle point).
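The iteration (10.45)-(10.46) for the simple OLS case can be sketched in a few lines; the data, initial guess, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
T = 200
X = np.column_stack([np.ones(T), rng.standard_normal((T, 2))])  # constant + 2 predictors
theta_true = np.array([0.1, 0.5, -0.3])
y = X @ theta_true + 0.5 * rng.standard_normal(T)

theta = np.zeros(3)          # initial guess theta^0
rho = 0.001                  # learning rate
for _ in range(5000):
    grad = -2 * X.T @ (y - X @ theta)   # L_j = dL/dtheta_j, from (10.44)
    theta = theta - rho * grad          # the update (10.46)

theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("gradient descent:", theta.round(4))
print("closed-form OLS: ", theta_ols.round(4))  # the two agree
```

Because the OLS objective is convex with a single minimum, the iterates converge to the same answer as the closed-form solution; trying a much larger rho illustrates the oscillation problem described above.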
In practice, many problems do have a well-behaved global minimum. Even when the solution is not well behaved, multiple starting points or alternative models are useful for checking whether the solution is a local minimum. If it is, additional search is necessary.

In a general NN, the output is a compound function of the parameters. For example, in the 1-hidden-layer case, we have
$$\hat{y} = \sum_{i=1}^4 \theta^2_i\, f\!\left( \sum_{j=1}^3 \theta^{1,i}_j x_j \right).$$
Suppose now we add one more hidden layer with 5 nodes before reaching the output, and we use the same activation function; then
$$\hat{y} = \sum_{i=1}^5 \theta^3_i\, f\!\left( \sum_{k=1}^4 \theta^{2,i}_k\, f\!\left( \sum_{j=1}^3 \theta^{1,k}_j x_j \right) \right),$$
where $f$ is applied to a linear function of its own outputs (a compound function). Despite the complex look, the output is easily computed recursively, and the first-order derivatives follow from the chain rule. Then gradient descent can be applied to search for the parameter values that deliver the best fit of the model.

10.11.4 Remarks

The number of nodes in each hidden layer and the number of hidden layers can take any values, driven by the data in applications. Various algorithms have been developed to estimate these numbers and the associated parameters. Theoretically, a large enough NN should capture almost any complex decision function. Hence, NN-type methods are currently the preferred approach for complex machine learning problems such as computer vision and natural language processing. However, because of its many stacked layers, the NN is one of the least transparent, least interpretable, and most highly parameterized machine learning tools. In addition, it generally requires a large sample size for convergence, limiting its applications in time series forecasting, where the time series is often not long enough in finance. Gu, Kelly, and Xiu (2020), among others, find that NNs do better than the LASSO and other regression-type methods for predicting stock returns, because they capture important nonlinearities.
But this line of papers is based on a large balanced panel of both cross-section and time series data. Filippou, Taylor, Rapach, and Zhou (2020) find no gains from NNs for foreign exchange, due to the small sample size in both the time and cross-section dimensions (or due to the absence of nonlinearity). In a setting similar to Gu, Kelly, and Xiu (2020) but with new predictors added over time, the NN can no longer be applied because the panel is unbalanced. However, the LASSO and C-LASSO are still effective methods (see Han, He, Rapach, and Zhou, 2020). Dixon, Halperin, and Bilokon (2020) discuss more advanced neural networks and their applications in finance.

10.12 Genetic algorithm

Like gradient descent, genetic programming (GP) is a general search algorithm for finding the optimal solution of an objective function. But its search idea is more heuristic and is based on principles of natural genetic processes. Liu, Zhou, and Zhu (2020b) appear to be the first to apply it to forecasting the cross section of stock returns. Like NNs, the GP captures nonlinearity and interactions, and so it performs better than linear regression-based methods such as the LASSO. However, the GP appears to require smaller sample sizes than NNs. More importantly, it can be used to maximize an arbitrary economic objective, such as the Sharpe ratio, directly. In contrast, other approaches are often designed to deal only with model fitting. The drawback of the GP is that it is computationally demanding, and so it is incapable of handling problems with many predictors. It is also complex and difficult to apply, as available packages are limited. See Liu, Zhou, and Zhu (2020b) and references therein for further reading.

10.13 Ensemble Learning

Ensemble learning is learning from a combination of a set of models or algorithms. The combination forecast (Section 9.5.1) is the simplest example of ensemble learning: the forecast based on each predictor is a model.
Rather than relying on any single model, we use the average forecast across the models as our new forecast. The 1/N portfolio rule (Section 2.1.1) is also an example of ensemble learning, one that diversifies over assets. Bayesian model averaging (Section 3.7.2) is an important example, where various models are weighted by their posterior probabilities; it is used in a wide range of complex decision making.

Why does ensemble learning work? Each model is unlikely to capture the real world fully. By pooling all the models together, the final model is more likely to capture all aspects of the problem, and so performs better. Another, technical, reason is that each model is evaluated based on its own assumptions about the true data-generating process, which themselves may not be true.

There are many specific methods of ensemble learning. Below we discuss three of the most popular ones.

10.13.1 Bagging

Bagging is a way to use the bootstrap to improve performance. It is also known as bootstrap aggregation. To understand it, consider the case in which we have a predictive model to forecast a future return, $R_{T+1}$. Based on our model, let the forecast be
$$\hat{R}_{T+1} = f(X_T), \qquad (10.47)$$
where $X_T$ denotes all the data up to $T$ (today).

Rather than relying on the single forecast above, we bootstrap the data $B$ times with replacement to obtain $B$ sets of data, $X^{(1)}_T, \ldots, X^{(B)}_T$, each of which allows a re-estimation of the model to yield a new forecast $f(X^{(b)}_T)$, for $b = 1, 2, \ldots, B$. Then the bagging forecast is
$$\hat{R}^{\text{Bagging}}_{T+1} = \frac{1}{B} \sum_{b=1}^B f(X^{(b)}_T), \qquad (10.48)$$
i.e., a simple average of the bootstrapped forecasts. Bagging attempts to make the data more representative in the model, so it can typically help to improve an unstable model.
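The bagging forecast (10.48) can be sketched as follows. The base model here is a simple OLS forecast, and the data-generating numbers are assumptions for illustration:

```python
import numpy as np

def fit_forecast(X, y, x_today):
    """Hypothetical base model: OLS fit, then forecast at today's predictors."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return x_today @ beta

rng = np.random.default_rng(2)
T = 200
X = np.column_stack([np.ones(T), rng.normal(size=T)])  # intercept + one predictor
y = X @ np.array([0.5, 0.2]) + rng.normal(size=T)
x_today = np.array([1.0, 0.3])

B = 500
boot_forecasts = np.empty(B)
for b in range(B):
    idx = rng.integers(0, T, size=T)  # resample rows with replacement
    boot_forecasts[b] = fit_forecast(X[idx], y[idx], x_today)

# (10.48): simple average of the bootstrapped forecasts
bagging_forecast = boot_forecasts.mean()
```

For a linear model the bagged forecast stays close to the single full-sample forecast; the gains show up mainly for unstable, highly nonlinear base models.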
The bootstrapped portfolio investment rule (Section 4.3.3), or the re-sampled frontier as discussed by Michaud and Michaud (2008), is an example of bagging in portfolio choice.

10.13.2 Stacking

Stacking is a way to use cross-validation to improve performance. In contrast with equal weights or posterior probability weights, it is more data-driven, placing smaller weights on models that have high empirical bias. Suppose that there are $M$ models, $f_1(x), \ldots, f_M(x)$, that we use to forecast an outcome $y$. Our objective is to find the best weights such that
$$f^{\text{Stack}}(x) = \sum_{m=1}^M w_m f_m(x) \qquad (10.49)$$
is the best forecast by some metric. Consider the popular quadratic objective, where we find $w = (w_1, \ldots, w_M)'$ to minimize the mean-squared error (MSE),
$$w = \arg\min_w \sum_{i=1}^T \left( y_i - \sum_{m=1}^M w_m f^{(-i)}_m(x_i) \right)^2, \qquad (10.50)$$
where $f^{(-i)}_m(x_i)$ is the model $f_m$ re-estimated without the $i$-th observation $x_i$, an idea from cross-validation used to obtain more robust models. The above optimization over $w$ is a simple quadratic programming problem without constraints. In practice, one can also impose the restriction that the weights are positive and sum to 1, which is easily solved in the same way as constrained portfolio problems.

10.13.3 Boosting

Boosting is one of the most powerful and popular ways to improve a model, and there are many versions (see, e.g., Hastie, Tibshirani, and Friedman, 2009). In what follows, we focus on a regression type that seems most relevant to the finance problems of interest here.

Consider the problem of improving a fitted function $F(x)$ on data $(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$. The errors are $y_1 - F(x_1), y_2 - F(x_2), \ldots, y_T - F(x_T)$. Our objective is to find a function $h(x)$ so that $F_1(x) = F(x) + h(x)$ has smaller MSE. Mathematically, the MSE is
$$J = \frac{1}{T} \sum_{i=1}^T [y_i - F(x_i)]^2. \qquad (10.51)$$
Although the fitted values $F(x_1), F(x_2), \ldots$
$, F(x_T)$ are just numbers, we can view them as parameters when thinking about how they affect the loss. Then, taking derivatives, we have the gradient
$$g(x_i) \equiv \frac{\partial J}{\partial F(x_i)} = \frac{2}{T} [F(x_i) - y_i].$$
Now we fit a function $h(x)$ to $(x_1, -g(x_1)), \ldots, (x_T, -g(x_T))$, so that $h(x)$ is close to $-g(x)$ at all the $x_i$'s. Then it is clear that
$$F_1(x) = F(x) + \rho h(x) \qquad (10.52)$$
will be an improvement over $F(x)$ for small enough $\rho > 0$. The reason is that, in optimization, the negative gradient is the direction that moves closer to the optimum. For example, consider finding the minimum of $f(x) = x^2/2 - x$. Suppose we are at the value $x_0 = 2$. Then $f'(x_0) = 1$, and
$$x_1 = x_0 - \rho f'(x_0) = 2 - \rho$$
will clearly be closer to 1, the minimizer, for small enough $\rho > 0$. Of course, in the last equation, one can also choose $\rho > 0$ to minimize the error, and the gradient algorithm works in the same way. Hence, the gradient algorithm can be summarized in 4 general steps: 1) compute the negative gradient for any given $F_m$ based on the chosen metric; 2) fit the data with the negative gradient to get $h_m$; 3) solve the one-dimensional optimization problem over $\rho$ for the MSE of the metric; 4) update the fit to $F_{m+1}$. In practice, the iteration stops when there is no more significant improvement.

11 Predictability 2: Cross Section

In this chapter, we discuss cross-section forecasts at great length, and also provide detailed implementation procedures.

11.1 Overview

Cross-section forecasts focus on predicting the relative performance of firms, and the cross-section regression (CSR) is run over the number of firms, $N$, which is usually large in practice, say $N = 10{,}000$ firms. In the CSR, we can use only the current observations on the predictors to forecast the $N$ future returns; time series data can help improve the forecasting accuracy, say by smoothing estimates over time, but are not required. CSR forecasts are useful for fund managers in picking stocks to buy or over-weight, and stocks to short or under-weight.
In contrast, time series predictability amounts to predicting an asset return over time, and the time series forecasting regression is run over time, the number of available time periods, $T$, for the asset return. Usually $T$ is small, say $T = 120$ for ten years of monthly data. An investment strategy of getting in and out of the stock market is called market timing, and time series forecasting methods are useful in this context. But it should be remembered that this predictability is small and time-varying.

Cross-section forecasts are popular in practice. Since $N$ is large, OLS is the popular approach for estimation. However, when there are many predictors, OLS still tends to overfit. Various methods have been proposed to deal with that problem; see Han et al. (2021) and Neuhierl et al. (2021). Coqueret and Guida (2020) and Jurczenko (2020) provide additional applications of machine learning methods. We will discuss some of the estimation procedures below.

11.2 Cross-section regression

To understand the CSR better, consider the size effect. We know that large firms tend to have lower average returns than small firms, so the size of a firm is a predictor of its future return. Whether the stock market is up or down next month (a time series behavior), small firms tend to outperform large firms on average. A simple way to exploit this is to buy small firms and short large ones, obtaining a portfolio with a positive alpha if the size effect persists. To be more precise, we can run the CSR on size, assuming $N = 1000$ firms,
$$R_{i,t} = \alpha + \beta\, \text{Size}_{i,t-1} + \epsilon_{i,t}, \qquad i = 1, 2, \ldots, 1000, \qquad (11.1)$$
where $\alpha$ and $\beta$ are the regression coefficients (the same across firms in the CSR), and $\text{Size}_i$ is the firm size of firm $i$ (usually in logs and standardized; see Section 11.3 for implementation details).
Suppose our estimated regression is
$$R_{i,t} = \frac{15\%}{12} - \frac{5\%}{12}\, \text{Size}_{i,t-1} + \hat{\epsilon}_{i,t}, \qquad (11.2)$$
where we assume the data are monthly, so that $\hat{\alpha} = 15\%/12$ and its annualized value is 15%. Assume that the size variable is standardized across firms. Then the above equation tells us that if a firm's size is one unit larger than the others', its expected return will be 5% (annualized) lower than theirs. In the above model, it is evident that the smaller the size, the greater the expected return. If we divide all the stocks into 10 groups, known as decile portfolios, by the expected returns estimated from (11.2), it will be equivalent to sorting the stocks by size.

Clearly, more factors than size alone affect stock expected returns in practice. If we add profitability, we can run the CSR on both of them,
$$R_{i,t} = \alpha + \beta_s\, \text{Size}_{i,t-1} + \beta_p\, \text{Profit}_{i,t-1} + \epsilon_{i,t}, \qquad i = 1, 2, \ldots, 1000. \qquad (11.3)$$
If our estimated regression is
$$R_{i,t} = \frac{15\%}{12} - \frac{5\%}{12}\, \text{Size}_{i,t-1} + \frac{7\%}{12}\, \text{Profit}_{i,t-1} + \hat{\epsilon}_{i,t}, \qquad (11.4)$$
then a large firm may be desirable if its profitability is high. So we should consider both factors, and the total contribution is what matters for the expected stock return. In this case, if we divide all the stocks into 10 decile portfolios by the expected returns estimated from (11.4), the result will not be the same as sorting the stocks by size or by profitability alone. In fact, sorting cannot fully capture the two effects, but the CSR is one valid approach that does.

More generally, if we have four factors,
$$R_{i,t} = \alpha + \beta_1 X_{i,1,t-1} + \beta_2 X_{i,2,t-1} + \beta_3 X_{i,3,t-1} + \beta_4 X_{i,4,t-1} + \epsilon_{i,t}, \qquad i = 1, 2, \ldots, 1000, \qquad (11.5)$$
we can write this CSR in matrix form,
$$\begin{pmatrix} R_{1,t} \\ R_{2,t} \\ \vdots \\ R_{1000,t} \end{pmatrix} = \begin{pmatrix} 1 & X_{1,1} & X_{1,2} & X_{1,3} & X_{1,4} \\ 1 & X_{2,1} & X_{2,2} & X_{2,3} & X_{2,4} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & X_{1000,1} & X_{1000,2} & X_{1000,3} & X_{1000,4} \end{pmatrix} \begin{pmatrix} \alpha \\ \beta_1 \\ \vdots \\ \beta_4 \end{pmatrix} + \begin{pmatrix} \epsilon_{1,t} \\ \epsilon_{2,t} \\ \vdots \\ \epsilon_{1000,t} \end{pmatrix}, \qquad (11.6)$$
where each $X_{i,j}$ is firm characteristic $j$ for firm $i$.
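A one-period CSR of the form (11.6) can be sketched as follows. The simulated characteristics and coefficients are assumptions for illustration, and the expected returns are used to form decile labels:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1000                                 # firms in the cross section
chars = rng.normal(size=(N, 4))          # four standardized characteristics at t-1
beta_true = np.array([-0.05, 0.07, 0.03, -0.02]) / 12
returns = 0.15 / 12 + chars @ beta_true + 0.05 * rng.normal(size=N)

# CSR: regress the N returns at t on the N characteristic vectors at t-1
X = np.column_stack([np.ones(N), chars])
coef = np.linalg.lstsq(X, returns, rcond=None)[0]  # [alpha, beta_1, ..., beta_4]

# Expected returns for ranking (alpha only shifts all stocks by the same amount)
exp_ret = chars @ coef[1:]
deciles = np.argsort(np.argsort(exp_ret)) * 10 // N  # decile labels 0..9
```

Buying the top decile and shorting the bottom one would then implement the long-short strategy discussed below.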
Note that the returns are measured at time $t$ and the explanatory variables at $t-1$, since we use past information to forecast the future return. In implementation, to forecast next month's return, which is not yet observed, we run the CSR of current returns on the previous month's predictors to obtain the regression coefficients; then, based on them and the current predictor values, we compute our forecast. Since the above equation is a linear regression, we can use OLS to estimate the parameters, and the details are given in the next subsection.

It is worthwhile to contrast time series regression (TSR) with the CSR. Suppose we want to predict the stock market return $R_{m,t+1}$ using four predictors and have $T = 240$ monthly observations available; then we run the TSR,
$$R_{m,t} = \alpha + \beta_1 x_{1,t-1} + \beta_2 x_{2,t-1} + \beta_3 x_{3,t-1} + \beta_4 x_{4,t-1} + \epsilon_t, \qquad t = 1, 2, \ldots, 240. \qquad (11.7)$$
To predict the return at $T+1$, in terms of data, we have
$$\begin{pmatrix} R_{m,240} \\ R_{m,239} \\ \vdots \\ R_{m,1} \end{pmatrix} = \begin{pmatrix} 1 & x_{1,239} & x_{2,239} & x_{3,239} & x_{4,239} \\ 1 & x_{1,238} & x_{2,238} & x_{3,238} & x_{4,238} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{1,0} & x_{2,0} & x_{3,0} & x_{4,0} \end{pmatrix} \begin{pmatrix} \alpha \\ \beta_1 \\ \vdots \\ \beta_4 \end{pmatrix} + \begin{pmatrix} \epsilon_{240} \\ \epsilon_{239} \\ \vdots \\ \epsilon_1 \end{pmatrix}. \qquad (11.8)$$
We estimate the regression coefficients and then plug them into (11.7) to obtain the forecast. In comparison with the previous CSR, the dependent variable is a time series of the market return, not a cross section at a single time, and the same is true for the explanatory variables. So, although OLS can be applied in both cases, it is applied to cross-section data and time series data, respectively.

Time series predictability is about how predictors forecast returns over time, with the predictive regression as the typical set-up; most machine learning tools are readily applicable when there are many predictors. In contrast, cross-section predictability is about relative predictability among asset returns: it predicts that some assets will have greater returns than others, regardless of the ups and downs of the stock market.
Most machine learning tools, developed for time series, may be adapted easily to the cross section by treating the cross-section dimension as if it were the number of time series periods (e.g., Han, He, Rapach, and Zhou, 2021, and Freyberger, Neuhierl, and Weber, 2020). In empirical applications, the degree of time series predictability is low, while cross-section predictability is much stronger, yielding sizable economic profits (see, e.g., Gu, Kelly, and Xiu, 2020, and Han, He, Rapach, and Zhou, 2021). Hence, the cross-section regression is popular in practice.

11.3 OLS estimation

In real-world implementations of the CSR, the first issue is to clean the data and make them usable. Typically, there are missing data in back-testing, and one often uses ad hoc interpolation or the cross-section mean to replace them. Often the firm characteristics, such as size, value, momentum, and quality, are used in standardized form, i.e., as z-scores, on the right-hand side of (11.6), where the z-score is defined as
$$z\text{-score} = \frac{x - \mu}{\sigma}, \qquad (11.9)$$
which standardizes the raw data $x$ (say, size) of each firm in the cross section, where $\mu$ is the mean of the characteristic across firms and $\sigma$ is its standard deviation. In addition, the data may be trimmed so that scores above 3 are set to 3, and scores below $-3$ to $-3$. This prevents the results from being driven by a few extremely large or small firms.

OLS is the standard procedure, applied each period to obtain the coefficient estimates. Haugen and Baker (1996) appear to be the first to do so. Lewellen (2015) provides a more recent and comprehensive analysis. Han et al. (2017) show how to obtain an interpretable factor from a group of proxies.

The procedure takes two steps. First, at time $t$, we run the OLS regression
$$R_{i,t} = \alpha + \beta_1 X_{i,1,t-1} + \beta_2 X_{i,2,t-1} + \beta_3 X_{i,3,t-1} + \beta_4 X_{i,4,t-1} + \epsilon_{i,t}, \qquad i = 1, 2, \ldots, 1000, \qquad (11.10)$$
to obtain the coefficient estimates $\hat{\beta}_{1,t}, \hat{\beta}_{2,t}, \hat{\beta}_{3,t}, \hat{\beta}_{4,t}$.
These are sufficient for us to compute the forecasted or expected value at $t+1$ as
$$E[R_{i,t+1}] = \hat{\beta}_{1,t} X_{i,1,t} + \hat{\beta}_{2,t} X_{i,2,t} + \hat{\beta}_{3,t} X_{i,3,t} + \hat{\beta}_{4,t} X_{i,4,t}. \qquad (11.11)$$
Note that we have ignored the alpha: it simply adds the same constant to all stocks, and so does not matter for ranking stocks by expected returns. However, in practice, due to model instability, the estimates are usually smoothed. In the second step, we smooth the estimates over the past year (or another period as appropriate) by taking the average of the coefficient estimates as our final estimates,
$$\bar{\beta}_{j,t} = \frac{1}{12} \sum_{s=1}^{12} \hat{\beta}_{j,t+1-s}. \qquad (11.12)$$
Then the expected returns are computed from
$$E[R_{i,t+1}] = \bar{\beta}_{1,t} X_{i,1,t} + \bar{\beta}_{2,t} X_{i,2,t} + \bar{\beta}_{3,t} X_{i,3,t} + \bar{\beta}_{4,t} X_{i,4,t}. \qquad (11.13)$$
In practice, the averaged betas work much better than the betas from a single period's estimation alone.

A typical way to use the forecasts is to divide the stocks into 10 decile groups. Then, buying the group with the highest expected stock returns and shorting the one with the lowest is likely to be a profitable trading strategy. If no shorting is allowed, one can simply over-weight the high-expected-return stocks and under-weight the low-expected-return ones.

When there are too many factors, OLS estimation is likely to have an over-fitting problem that makes out-of-sample performance deteriorate. Machine learning tools are well suited to such a problem and may be applied. See Han et al. (2021) and references therein.

11.4 E-LASSO estimation

As mentioned before, E-LASSO and other ML tools can be applied to the CSR by taking $N$ as $T$. However, since we now have both cross-section and time series information, the latter can be used to improve the accuracy. See the papers cited at the beginning of this chapter.

11.5 Weighted cross section regression

In the cross-section regression model, the OLS estimation method effectively weights the companies equally.
In practice, we may want to weight larger firms more heavily, as they are more important. For example, Green, Hand, and Zhang (2017) and Han et al. (2021) use log(market-cap) as the weight across firms. A Bloomberg white paper uses the square root,
$$w_i = \sqrt{\frac{\text{market-cap}_i}{\sum_{i=1}^N \text{market-cap}_i}},$$
where $\text{market-cap}_i$ is the market capitalization of firm $i$.

Mathematically, the OLS estimation finds the slopes that minimize the sum of squared errors,
$$\text{MSE} = \sum_{i=1}^N (y_i - x_i'\beta)^2,$$
and the solution for the slopes is the standard formula
$$\hat{\beta} = (X'X)^{-1} X'y.$$
A weighted OLS minimizes the weighted MSE,
$$\text{MSE}^* = \sum_{i=1}^N w_i (y_i - x_i'\beta)^2,$$
where $w_i$ is the weight. The solution for the slopes is
$$\hat{\beta} = (X'WX)^{-1} X'Wy,$$
where $W$ is a diagonal matrix formed by the $w_i$'s. In Python, this is easily done with statsmodels: `import statsmodels.api as sm; res = sm.WLS(y, X, weights=w).fit()`.

12 Bayesian Estimation

In this section, we introduce the Bayesian method to prepare for later applications. The key idea of the Bayesian method is to view parameters as random variables, and we learn their properties via their posterior distribution, derived from Bayes' Theorem. In contrast, the usual statistical approach (the so-called classical method) views the parameters as constants, and we learn about them by examining their sample estimates.

12.1 Bayes Theorem

There are two versions of Bayes' Theorem. One is in terms of events, and the other is in terms of densities. Both are widely used, especially the latter, because densities are involved in data analysis.

12.1.1 Conditional events

For any two events $A$ and $B$, elementary probability theory says that
$$P(A, B) = P(A)P(B|A) = P(B)P(A|B), \qquad (12.1)$$
which says that the probability of a joint event is a marginal probability times a conditional probability. It follows that
$$P(A|B) = \frac{P(A)P(B|A)}{P(B)}, \qquad (12.2)$$
which is known as Bayes' Theorem. The key is its interpretation.
If $P(A)$ is our initial belief about an event before knowing $B$, and $B$ is the evidence, then $P(A|B)$ is our updated belief in light of the evidence. In this case, $P(A)$ is called the prior, and $P(A|B)$ the posterior.

By the law (or formula) of total probability,
$$P(B) = P(A)P(B|A) + P(A^c)P(B|A^c),$$
where $A^c$ is the complement of $A$, i.e., everything except $A$. Then Bayes' Theorem can be written as
$$P(A|B) = \frac{P(A)P(B|A)}{P(A)P(B|A) + P(A^c)P(B|A^c)}, \qquad (12.3)$$
which has many applications.

Example 12.1 You are interested in the probability that the market will go up next month, and you forecast that it has a 60% chance of going up. Now an expert says it will go up, too. The expert's forecasting accuracy is 90%: if the market goes up, the expert forecasts "up" 90% of the time; if it goes down, the expert forecasts "down" 90% of the time. What is the probability that the market goes up, conditional on the expert's "up" forecast?
$$P(A|B) = \frac{0.6 \times 0.9}{0.6 \times 0.9 + 0.4 \times 0.1} = 93\%, \qquad (12.4)$$
which is the updated probability in light of the expert's opinion. ♠

Here is another example.

Example 12.2 A medical test for whether someone has been infected by a virus has a 95% true positive rate and a 90% true negative rate. Only 1% of the population is actually infected. What is the probability that a random person who tests positive is really infected? Here $A$ is the event that the person is infected, and $B$ is the event of testing positive. Then
$$P(A|B) = \frac{0.01 \times 0.95}{0.01 \times 0.95 + 0.99 \times 0.10} = 8.76\%, \qquad (12.5)$$
which is totally different from 95%! However, if we conduct the test only when the person has some symptoms, and if the population with the symptoms is infected at a rate of 20%, then $P(A|B)$ becomes $0.20 \times 0.95/(0.20 \times 0.95 + 0.80 \times 0.10) = 70.37\%$, much greater than before! ♠

Yet another, famous, example:

Example 12.3 Suppose that there are two boxes filled with millions of poker chips. The first box has 70% red and 30% blue chips, and the second box has 70% blue and 30% red.
Assume now that one of the two boxes is chosen at random, a dozen chips are drawn from it, and the sample result is 8 red chips and 4 blue. What is the chance that the chips came from the first box?

Let $A$ and $B$ denote the first and second boxes, respectively, and let $S$ be the sample/data. Prior to the draw, it is clearly reasonable to assume that
$$p(A) = 50\%, \qquad p(B) = 50\%.$$
By simple combinatorics, we have
$$p(S|A) = \binom{12}{8} 0.7^8 \times 0.3^4 = 0.231, \qquad p(S|B) = \binom{12}{8} 0.7^4 \times 0.3^8 = 0.008.$$
Then,
$$P(A|S) = \frac{0.5 \times 0.231}{0.5 \times 0.231 + 0.5 \times 0.008} = 97\%, \qquad (12.6)$$
which is totally different from the 70-80% that most people would guess (Edwards, 1968)! See Benjamin (2018, pp. 50-51) and references therein for more details. ♠

12.1.2 Conditional densities

Consider now densities, which are "probabilities" over small intervals. Let $p(\theta, y)$ be the joint distribution of the parameters, $\theta$, and the data, $y$. Standard probability analysis says that the joint density of any two (sets of) random variables is the marginal density times the conditional density,
$$p(\theta, y) = p(y)\, p(\theta\,|\,y) = p(\theta)\, p(y\,|\,\theta). \qquad (12.7)$$
This implies that
$$p(\theta\,|\,y) = \frac{p(\theta)\, p(y\,|\,\theta)}{p(y)}. \qquad (12.8)$$
If we interpret $y$ as data, then $p(y)$ is a constant conditional on observing $y$, and so the above can be written as
$$p(\theta\,|\,y) \propto p(\theta)\, p(y\,|\,\theta), \qquad (12.9)$$
which is Bayes' Theorem in terms of density functions. It says that the posterior density of $\theta$ conditional on the data is proportional ($\propto$) to the product of the prior density of $\theta$ and the likelihood function of the data.

The objective is to learn about $\theta$. In Bayesian analysis, all we know about $\theta$ is its density function, and there are two such densities. One is the prior, summarizing all we know before observing the data. The other is the posterior, telling us all about $\theta$ after observing the data; that is our updated learning about $\theta$ with the data $y$.
The key assumption of Bayesian analysis is to view both the data and the parameters as random variables, and the key insight is that we can update our learning with data. Before observing data, we have some prior on the likely values of $\theta$, summarized or expressed by our prior density $p(\theta)$; in what follows, as in the literature, the prior will be denoted $p_0(\theta)$ to emphasize that it is a prior. Then, after observing the data, we have updated learning on $\theta$, which is the posterior density $p(\theta\,|\,y)$, our learning conditional on the data.

The application of the Bayesian method has three steps:

1. Provide $p_0(\theta)$ to reflect our prior beliefs;

2. Compute the likelihood function (the joint density of the data);

3. Obtain the posterior density
$$p(\theta\,|\,y) \propto p_0(\theta) \times \text{likelihood function}. \qquad (12.10)$$

Based on the posterior density, we can learn about $\theta$ by computing its posterior mean, variance, confidence intervals, etc.

12.2 Classical vs Bayesian

The classical statistical framework treats parameters as true but unknown constants and uses the data, a random sample, to learn about them. In contrast, the Bayesian set-up treats parameters as random variables and learns their distributions by using the random data. The difference between them is best understood by studying an example. For simplicity, we first assume that the variance of the data, $\sigma^2$, is known; the unknown case is examined later. The difference in the results is the difference between the normal and the $t$ distributions, and so, if the sample size is reasonably large (say $\geq 50$), they yield almost the same results.

12.2.1 $\sigma^2$ known

Example 12.4 We are given $T$ independent observations, $y = (y_1, y_2, \ldots, y_T)'$, on a random variable $y$ which has a normal distribution with unknown mean,
$$y \sim N(\mu, 1), \qquad (12.11)$$
and known variance 1 (a simplification). What can we learn about the mean $\mu$?

The classical approach: With the data $(y_1, y_2, \ldots$
$, y_T)'$, we estimate $\mu$ by its sample mean,
$$\hat{\mu} = \frac{1}{T} \sum_{t=1}^T y_t.$$
From (12.11), we have
$$\hat{\mu} \sim N(\mu, 1/T), \qquad (12.12)$$
i.e., $\hat{\mu}$ has a normal distribution. The result says that:

• the estimator has the parameter as its mean;

• the variance is $1/T$: as the sample size $T$ gets large, we get on average a more and more accurate estimate of the true mean;

• any hypothesis testing about $\mu$ can be done based on (12.12).

The Bayesian approach: Recall the three steps.

1. Assume a diffuse prior ($\mu$ can be any real number):
$$p_0(\mu) \propto 1. \qquad (12.13)$$
To understand why this represents a diffuse prior, consider how to express a prior that $\mu$ lies in $[-1, 1]$ with all values equally likely. What we want is
$$p_0(\mu) = \begin{cases} c, & \text{if } \mu \in [-1, 1]; \\ 0, & \text{otherwise.} \end{cases}$$
Since the density must integrate to 1, we have
$$1 = \int_{-1}^1 p_0(\mu)\, d\mu = \int_{-1}^1 c\, d\mu = 2c,$$
so $c = 1/2$. Similarly, if we want the prior to be equally likely on $[-M, M]$, then $c = 1/(2M)$. Since a constant $c$ has no impact on the posterior, what matters is the range of $p_0$; so we can simply use $p_0(\mu) = 1$ over $[-M, M]$ and zero otherwise, which is called an improper prior, as its integral is not 1 (it is not strictly a density). Theoretically, it is still valid for posterior analysis. Letting $M$ go to infinity, we obtain the diffuse prior given by (12.13).

2. The likelihood function, or density of the data, is
$$p(y\,|\,\mu) = \left( \frac{1}{\sqrt{2\pi}} \right)^T \exp\left[ -\frac{1}{2} \sum_{t=1}^T (y_t - \mu)^2 \right]. \qquad (12.14)$$
To understand it, consider 2 data points. Their joint density is
$$p(y_1, y_2) = p(y_1)\,p(y_2) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(y_1-\mu)^2} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(y_2-\mu)^2} = \left( \frac{1}{\sqrt{2\pi}} \right)^2 e^{-\frac{1}{2}(y_1-\mu)^2 - \frac{1}{2}(y_2-\mu)^2},$$
where the first equality follows from independence, the second from the normality assumption, and the third uses a property of exponential functions.

3. The posterior density is then
$$p(\mu\,|\,y) \propto p_0(\mu)\, p(y\,|\,\mu) \propto \exp\left[ -\frac{1}{2} \sum_{t=1}^T (y_t - \mu)^2 \right]. \qquad (12.15)$$

Now let us simplify the posterior density so that we can learn its implications for $\mu$.
Since
$$\sum_{t=1}^T (y_t - \mu)^2 = T\mu^2 - 2T\mu\hat{\mu} + \sum_{t=1}^T y_t^2 = T(\mu - \hat{\mu})^2 + \sum_{t=1}^T (y_t - \hat{\mu})^2, \qquad (12.16)$$
and the second term on the right-hand side is a constant (which can be ignored in the posterior density because it contributes only a proportionality constant), we can write the posterior density as
$$p(\mu\,|\,y) \propto \exp\left[ -\frac{T}{2} (\mu - \hat{\mu})^2 \right], \qquad (12.17)$$
which is exactly a normal density function in $\mu$. The posterior density says that:

• the posterior mean of $\mu$ is $\hat{\mu}$;

• the posterior variance is $1/T$: as the sample size $T$ gets large, the distribution of $\mu$ concentrates more and more around $\hat{\mu}$;

• any hypothesis testing/assessment about $\mu$ can be based on (12.17).

To summarize, the example shows that classical and Bayesian inference are quantitatively the same under the diffuse prior. This is not surprising: the classical approach does not use any prior information, so when neither approach uses prior information, they should reach fundamentally the same conclusion.

Then what are the potential advantages of the Bayesian approach? Its advantage is the ability to use informative priors (see future examples). Technically, it also has the advantage of computing the exact distribution of functions of interest for a finite sample size, which the classical framework may not be able to provide, having to rely instead on asymptotic distributions or bootstraps, which may not be reliable or accurate when the sample size is small. However, the Bayesian approach has its potential disadvantages. Although it can use informative priors, this is also the root cause of arguments about the appropriateness of priors: using an incorrect prior can clearly be worse than using no prior at all. Moreover, for tractability, Bayesian analysis often makes restrictive assumptions on the data-generating process (such as iid normality), while classical analysis can usually accommodate much more general assumptions.

12.2.2 $\sigma^2$ unknown

Assume still that the data are normally distributed,
$$y \sim N(\mu, \sigma^2).$$
(12.18)

Previously, $\sigma^2$ was assumed known for simplicity. Now we assume that $\sigma^2$ is unknown. Since $\sigma^2$ can usually be estimated fairly accurately in many applications, the results of the known case are not much different from those of the unknown case; as noted earlier, that is so if the sample size is reasonably large. When $\sigma$ is unknown, the posterior distribution of $\mu$ is no longer normal, but a $t$ distribution under a standard diffuse prior on $\sigma$, which is commonly assumed.

The key is to note that the diffuse prior on $\sigma$, known as Jeffreys' prior, is
$$p_0(\sigma) \propto \frac{1}{\sigma}, \qquad \sigma > 0, \qquad (12.19)$$
because Jeffreys first showed that it represents noninformativeness about $\sigma$. Then, assuming a diffuse prior on $\mu$ independent of $\sigma$, the joint prior density of $\mu$ and $\sigma$ is
$$p_0(\mu, \sigma) \propto \frac{1}{\sigma}, \qquad \sigma > 0, \qquad (12.20)$$
which is the common diffuse prior in statistics in this context.

The likelihood function of the normally distributed data is
$$p(y\,|\,\mu, \sigma) = \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^T \exp\left[ -\frac{1}{2\sigma^2} \sum_{t=1}^T (y_t - \mu)^2 \right],$$
and hence the posterior is
$$p(\mu, \sigma\,|\,y) \propto \sigma^{-(T+1)} \exp\left[ -\frac{1}{2\sigma^2} \sum_{t=1}^T (y_t - \mu)^2 \right]. \qquad (12.21)$$
In comparison with our earlier analysis under the diffuse prior with known $\sigma$, we now just add the terms involving $\sigma$.

Since we are interested in the posterior mean of $\mu$, we have to integrate $\sigma$ out of the joint density to obtain the density of $\mu$ alone. To do so, we need a formula from calculus,
$$\int_0^{+\infty} x^{-(n+1)} e^{-a x^{-2}}\, dx = \frac{1}{2} a^{-n/2}\, \Gamma(n/2),$$
where $\Gamma(\cdot)$ is the Gamma function, $\Gamma(z+1) = z\Gamma(z)$, $\Gamma(1) = 1$, $\Gamma(1/2) = \sqrt{\pi}$. Then the integration of (12.21) over $\sigma$ gives
$$p(\mu\,|\,y) \propto \left( \sum_{t=1}^T (y_t - \mu)^2 \right)^{-T/2}, \qquad (12.22)$$
which is not yet a recognizable distribution. Let
$$s^2 = \frac{1}{T-1} \sum_{t=1}^T (y_t - \hat{\mu})^2, \qquad (12.23)$$
the sample variance of the data (dividing by $T-1$ rather than $T$ makes it unbiased, which makes little numerical difference when $T \geq 30$). From (12.16), we can write
$$\sum_{t=1}^T (y_t - \mu)^2 = T(\mu - \hat{\mu})^2 + (T-1)s^2.$$
Plugging this into (12.22) and factoring out νs², we have

p(µ|y) ∝ ( 1 + (µ − µ̂)² / (ν (s/√T)²) )^{−(ν+1)/2},   (12.24)

where ν = T − 1. The above equation says that the posterior distribution of µ is t-distributed with mean µ̂ and ν degrees of freedom (recall the definition in (1.60)). Interestingly, in the classical analysis, µ̂ is also t-distributed around µ with ν degrees of freedom. Again, under the diffuse prior, the classical and Bayesian analyses reach the same conclusion, though interpreted differently (parameters are regarded as random variables in the Bayesian framework, but as constants in the classical one).

12.3 Informative priors

As mentioned earlier, the ability to incorporate prior information easily is an important advantage of Bayesian analysis. The example below illustrates the main idea, while more complex examples will be analyzed in later applications.

We now extend the previous example to a more realistic situation by using a general normal prior density,

p₀(µ) = (1/√(2πσ₀²)) exp[ −(µ − µ₀)²/(2σ₀²) ],   (12.25)

where µ₀ is our prior mean and σ₀ is our prior standard deviation. For instance, when we examine the expected return on the market, we may set µ₀ = 10% and σ₀ = 15%, i.e., we use the prior µ ∼ N(10%, 15%²). This says that the future expected return on the asset is likely to be 10%, with a standard error of 15%. Although this prior is not perfect, it should be better than no prior at all in practice, as it reflects some sort of long-term view on the stock market.

c© Zhou, 2021

12.3.1 σ² known

The posterior density is then

p(µ|y) ∝ p₀(µ) p(y|µ)
      ∝ exp( −[ (µ − µ₀)²/(2σ₀²) + (1/(2σ²)) ∑_{t=1}^T (y_t − µ)² ] )
      ∝ exp( −[ (µ − µ₀)²/(2σ₀²) + (T/(2σ²)) (µ − µ̂)² ] ),   (12.26)

where the last line follows as in the diffuse-prior case, (12.15). Now

(µ − µ₀)²/a + (µ − µ̂)²/b ∝ (µ² − 2µµ₀)/a + (µ² − 2µµ̂)/b   (12.27)
  ∝ ((a + b)/(ab)) [ µ² − 2µ (µ₀/a + µ̂/b)(ab/(a + b)) ].
(12.28)

Taking a = σ₀² and b = σ²/T, we obtain the posterior density

p(µ|y) ∝ exp[ −( (σ₀² + σ²/T)/(2σ₀²σ²/T) ) ( µ − (σ₀²µ̂ + µ₀σ²/T)/(σ₀² + σ²/T) )² ],   (12.29)

where σ² is treated as known and can be replaced by a sample variance estimate. Equation (12.29) says that µ has a normal density. The mean, (σ₀²µ̂ + µ₀σ²/T)/(σ₀² + σ²/T), can be written as

E[µ] = wµ₀ + (1 − w)µ̂,  w = (σ²/T)/(σ₀² + σ²/T),   (12.30)

which is a weighted average of the prior mean µ₀ and the sample mean µ̂. The greater the sample variance (the less informative the data), the more weight on the prior. However, as more and more data arrive (T becomes large), the data speak for themselves and the prior has no impact (unless σ₀ = 0, which disregards the data entirely).

In the case of estimating the market expected return, even if the sample mean is negative (due to using bear-market data), a prior mean of µ₀ = 10% helps pull the estimate of the expected return back toward positive territory. In contrast, the classical analysis estimates the expected return by the sample mean alone, which can be inadequate when the data are very limited. This is an advantage of the Bayesian approach.

The posterior variance of µ is (σ₀²σ²/T)/(σ₀² + σ²/T), or

Var(µ) = ( 1/σ₀² + T/σ² )^{−1},   (12.31)

which says that the posterior precision is the sum of the prior precision and the data precision. Even if my prior guess is imprecise, informative data keep the posterior precision high, and vice versa. As the sample size gets large, the posterior variance approaches zero, and we learn the mean exactly.

12.3.2 σ² unknown

In this case, there is an issue of what prior to impose on σ². While there are many choices, we follow Zellner (1971, pp. 70–72) and use an initial sample to set the prior, so that the posterior is of the same form as before, making it easier to analyze.
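The weighted-average updating of the posterior mean and the precision-adding rule for the posterior variance are easy to verify numerically. Below is a minimal sketch in Python; the function name and all parameter values are purely illustrative, using the µ₀ = 10%, σ₀ = 15% market prior discussed above together with a hypothetical bear-market sample:

```python
def normal_posterior(mu0, sigma0, ybar, sigma, T):
    """Posterior mean and variance of mu under a N(mu0, sigma0^2) prior,
    given T observations from N(mu, sigma^2) with known sigma and sample mean ybar."""
    w = (sigma ** 2 / T) / (sigma0 ** 2 + sigma ** 2 / T)   # weight on the prior, eq. (12.30)
    post_mean = w * mu0 + (1 - w) * ybar
    post_var = 1.0 / (1.0 / sigma0 ** 2 + T / sigma ** 2)   # eq. (12.31)
    return post_mean, post_var

# Prior mu ~ N(10%, 15%^2); suppose 10 years of annual data with mean -2% and sigma = 20%
m, v = normal_posterior(mu0=0.10, sigma0=0.15, ybar=-0.02, sigma=0.20, T=10)
print(m, v)   # posterior mean lies between -2% and 10%, pulled toward the prior
```

With only 10 observations the prior receives a nontrivial weight; as T grows, w shrinks toward zero and the sample mean dominates.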
Specifically, we use the earlier posterior, (12.21), as our prior,

p(µ, σ|y₁) ∝ σ^{−(n₁+1)} exp[ −(1/(2σ²)) ∑_{t=1}^{n₁} (y_{1t} − µ)² ]
           ∝ σ^{−(n₁+1)} exp{ −(1/(2σ²)) [ ν₁s₁² + n₁(µ − µ₁)² ] },   (12.32)

where n₁ is the initial sample size, ν₁ = n₁ − 1, and µ₁ and s₁² are the sample mean and variance of the initial sample. Given a second sample of size n₂, the likelihood function is

l(µ, σ|y₂) ∝ σ^{−n₂} exp[ −(1/(2σ²)) ∑_{t=1}^{n₂} (y_{2t} − µ)² ].   (12.33)

Then the posterior density is

p(µ, σ|y₁, y₂) ∝ σ^{−(n₁+n₂+1)} exp{ −(1/(2σ²)) [ ∑_{t=1}^{n₁} (y_{1t} − µ)² + ∑_{t=1}^{n₂} (y_{2t} − µ)² ] }
              ∝ σ^{−(n+1)} exp{ −(1/(2σ²)) [ νs² + n(µ − µ̂)² ] },   (12.34)

where n = n₁ + n₂, ν = n − 1, and µ̂ and s² are the sample mean and variance based on all the data. Mathematically, this density has exactly the same form as (12.21), so the earlier analysis can be used to obtain the marginal posterior densities.

12.4 Predictive distribution

In applications, it is the future value or return of a random variable, not its past values, that is of great interest. Consider again Example 12.4. In the classical approach, the future value ỹ_{T+1} (the tilde emphasizes that it is a random variable not yet observed) is clearly predicted by the sample mean, with standard error 1. In the Bayesian framework, we need to find the predictive density of ỹ_{T+1} conditional on the data,

p(ỹ_{T+1}|y) = ∫ p(ỹ_{T+1}, µ|y) dµ = ∫ p(ỹ_{T+1}|µ, y) p(µ|y) dµ,   (12.35)

which says that the predictive density is obtained by taking the density of ỹ_{T+1} conditional on the true parameter and the data, multiplying by the posterior density, and integrating out the parameter. By the assumption on the data-generating process (12.11), we have

p(ỹ_{T+1}|µ, y) ∝ exp[ −(1/2)(y_{T+1} − µ)² ],

and by the earlier posterior result (12.17), we can compute

p(ỹ_{T+1}|y) ∝ ∫_{−∞}^{+∞} exp[ −(1/2)(y_{T+1} − µ)² − (T/2)(µ − µ̂)² ] dµ
            ∝ exp[ −( T/(2(T+1)) ) (y_{T+1} − µ̂)² ],   (12.36)

which says that the predictive density is normal.
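The two-step integral in (12.35) can be checked by simulation: draw µ from its posterior N(µ̂, 1/T), then draw ỹ_{T+1} given that µ. The resulting draws should have mean near µ̂ and variance near (T+1)/T. A small Monte Carlo sketch (the true mean, sample size, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
y = rng.normal(loc=0.05, scale=1.0, size=T)   # data-generating process: y_t ~ N(mu, 1)
mu_hat = y.mean()

# Draw mu from its posterior N(mu_hat, 1/T), then y_{T+1} | mu ~ N(mu, 1);
# the combined draws follow the predictive density N(mu_hat, (T + 1)/T).
mu_draws = rng.normal(mu_hat, np.sqrt(1.0 / T), size=200_000)
y_next = rng.normal(mu_draws, 1.0)

print(y_next.mean())   # close to mu_hat
print(y_next.var())    # close to (T + 1)/T, slightly above 1
```

The extra 1/T in the predictive variance is exactly the estimation error in µ that the classical point forecast ignores.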
If one uses the posterior mean as the point prediction, the Bayesian approach provides the same mean prediction as the classical one, E_t(ỹ_{T+1}) = µ̂, conditional on information available at t, but the standard error is √((T+1)/T), slightly greater than 1, the standard error from the classical approach. This is due to incorporating the estimation error in µ.

In general, the Bayesian predictive point estimate need not equal the classical one. For example, an informative prior in the previous example produces a different predictive point estimate. In addition, the predictive density is usually not normal; for instance, if the data variance σ² is unknown, the predictive density in the previous example is t-distributed. All these issues can be found in Zellner (1971), which provides an excellent guide to the Bayesian approach.

12.5 Bayesian regression

Previously, the statistical model was about the mean only,

y_t = µ + ε_t,  ε_t ∼ N(0, σ²),   (12.37)

where the previous notation x_t is replaced by y_t. In this subsection, we consider an extension that includes a regressor,

y_t = α + βx_t + ε_t,  ε_t ∼ N(0, σ²),   (12.38)

which is important, as we often analyze a stock return relative to the market. Now assume all the parameters are unknown; the diffuse prior is

p₀(α, β, σ) ∝ 1/σ,  σ > 0,   (12.39)

which extends (12.20). The posterior is

p(α, β, σ|D) ∝ σ^{−(T+1)} exp[ −(1/(2σ²)) ∑_{t=1}^T (y_t − α − βx_t)² ],   (12.40)

where D = (y, x) denotes all the data. Now we want to make sense of (12.40). By subtracting and adding the same terms, we have

∑ (y_t − α − βx_t)² = ∑ ( y_t − α̂ − β̂x_t − [(α − α̂) + (β − β̂)x_t] )².   (12.41)

Now let α̂ and β̂ be the OLS estimators,

α̂ = ȳ − β̂x̄,  β̂ = ∑(x_t − x̄)(y_t − ȳ) / ∑(x_t − x̄)²,

where ȳ and x̄ are the sample means.
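Under the diffuse prior the posterior of (α, β) is centered at the OLS estimates just defined, so the computation reduces to ordinary regression plus the residual-variance and scale quantities derived next (cf. (12.42)–(12.45) below). A sketch on simulated data; all numbers (sample size, true coefficients, seed) are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 120
x = rng.normal(0.005, 0.04, size=T)                   # "market" excess returns
y = 0.002 + 1.2 * x + rng.normal(0.0, 0.03, size=T)   # "stock" excess returns

xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
beta_hat = np.sum((x - xbar) * (y - ybar)) / Sxx      # OLS slope
alpha_hat = ybar - beta_hat * xbar                    # OLS intercept
nu = T - 2
s2 = np.sum((y - alpha_hat - beta_hat * x) ** 2) / nu  # unbiased residual variance

# scale factors of the marginal posterior t densities of alpha and beta
scale_alpha = np.sqrt(s2 * np.sum(x ** 2) / (T * Sxx))
scale_beta = np.sqrt(s2 / Sxx)
print(alpha_hat, beta_hat, scale_alpha, scale_beta)
```

The posterior centers agree with `np.polyfit`, while the scales give the spread of the t-shaped marginal posteriors around them.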
Expanding (12.41) and using the orthogonality conditions of OLS, we have

∑ (y_t − α − βx_t)² = νs² + T(α − α̂)² + (β − β̂)² ∑x_t² + 2(α − α̂)(β − β̂) ∑x_t,

where

s² = (1/(T−2)) ∑ (y_t − α̂ − β̂x_t)²,   (12.42)

the sample residual variance. In contrast to (12.23), here s² is made unbiased by dividing by (T−2), because two degrees of freedom are lost in using the constant and the variable x to explain y (in general, we should divide by T − K − 1 if there are K variables plus the constant).

Based on the above decomposition, we know from (12.40) that, conditional on σ, (α, β) are jointly normally distributed with mean (α̂, β̂)′ and covariance matrix

Cov(α, β) = σ² [ T  ∑x_t ; ∑x_t  ∑x_t² ]^{−1} = σ² [ ∑x_t²/(T∑(x_t−x̄)²)  −x̄/∑(x_t−x̄)² ; −x̄/∑(x_t−x̄)²  1/∑(x_t−x̄)² ].   (12.43)

These results extend the earlier one-variable case with known σ². Since σ² is unknown in practice, we need to integrate it out to get the marginal distributions,

p(α|D) ∝ [ ν + ( T∑(x_t−x̄)² / (s²∑x_t²) ) (α − α̂)² ]^{−(ν+1)/2},   (12.44)
p(β|D) ∝ [ ν + ( ∑(x_t−x̄)² / s² ) (β − β̂)² ]^{−(ν+1)/2},   (12.45)

where ν = T − 2 (note that we now have T − 2 versus the earlier T − 1, due to the extra degree of freedom used by the regressor). For informative priors and multivariate regressions, see Zellner (1971) for further discussion.

12.6 Bayesian CAPM test

Recall that we have a multivariate regression model for the asset excess returns in testing the CAPM,

r_{it} = α_i + β_i r_{mt} + ε_{it},  i = 1, . . . , N,   (12.46)

where r_{it} is the return on asset i in excess of the return on a Treasury bill, r_{mt} is the excess return on the market portfolio, and ε_{it} is the disturbance. A key assumption about the disturbances is that they are correlated contemporaneously but not across time:

E[ε_{it} ε_{js}] = σ_{ij} if t = s, and 0 otherwise.   (12.47)

This is understandable: if the CAPM underprices one technology stock, it is likely to do so for another, so the residuals of the two stocks are likely correlated at a given time t.
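The multivariate CAPM regression and its contemporaneous residual covariance can be estimated jointly in a few lines. The sketch below simulates hypothetical data (all alphas, betas, and scales are made up for illustration), fits all N univariate regressions at once, and also computes the quadratic pricing-error statistic α̂′Σ̂⁻¹α̂, a natural summary of joint mispricing:

```python
import numpy as np

rng = np.random.default_rng(2)
T, N = 240, 5
rm = rng.normal(0.006, 0.045, size=T)                  # market excess return
alphas = np.array([0.0, 0.001, -0.001, 0.002, 0.0])    # true (hypothetical) mispricing
betas = np.linspace(0.8, 1.4, N)
eps = rng.normal(0.0, 0.03, size=(T, N))               # contemporaneous disturbances
R = alphas + np.outer(rm, betas) + eps                 # N excess-return series

X = np.column_stack([np.ones(T), rm])
coef, *_ = np.linalg.lstsq(X, R, rcond=None)           # row 0: alphas, row 1: betas
alpha_hat = coef[0]
resid = R - X @ coef
Sigma_hat = resid.T @ resid / (T - 2)                  # residual covariance matrix
lam = alpha_hat @ np.linalg.solve(Sigma_hat, alpha_hat)   # alpha' Sigma^{-1} alpha
print(alpha_hat.round(4), lam)
```

This only produces point estimates; the exact posterior distribution of the statistic requires the analysis of Harvey and Zhou (1990) discussed in the text.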
Over time, all stock returns are difficult to forecast, so the iid assumption can be a good one.

The contemporaneous correlation implies that we cannot study the univariate regression of each stock in isolation: the information in other stocks is useful. Note, however, that the parameters, the alphas and betas, are still those of each company's univariate regression; it is just that their standard errors are affected by the other companies.

In the Bayesian framework, a confidence region on the alphas can be computed:

α̂_i − h√(var[α_i]) < α_i < α̂_i + h√(var[α_i]),  i = 1, . . . , N,   (12.48)

where α̂_i is the OLS estimator and h is a number chosen such that the region has probability, say, 95%. We can then examine whether all the alphas are inside or outside the region, providing intuition on how different the alphas are from zero, i.e., on the degree of validity of the CAPM.

In the Bayesian framework, it is also convenient (under the normality assumption) to compute the exact distribution of

λ = α′Σ^{−1}α,   (12.49)

where Σ is the covariance matrix of the residuals. Clearly, the greater the alphas (in absolute value), the greater λ. In fact, λ measures the extra money one can earn if the CAPM is not true. The conditional distribution formula of the multivariate normal distribution can be used to obtain the marginal distribution of the alphas, which makes the above two computations possible. Harvey and Zhou (1990) provide all the details.

In general, we can assume that stocks and factors are jointly normal; then their conditional moments are related to the parameters of the multivariate regression of X₁ (the stocks) on X₂ (the factors),

X_{1t} = α + B X_{2t} + E_t,   (12.50)

where E_t is a vector of model disturbances with zero means and a non-singular covariance matrix Σ, with

α = µ₁ − Bµ₂,  B = V₁₂V₂₂^{−1},   (12.51)

and Σ = V₁₁ − BV₂₂B′.
(12.52)

Then the procedure for testing the CAPM can be applied, yielding a Bayesian framework for testing multi-factor models.

13 Black-Litterman Model

Since its publication, the Black and Litterman (1992) asset allocation model has gained wide application at many financial institutions. In this section, we first discuss its motivation, then the details of the model in one and N dimensions. Finally, we discuss some of its problems and offer a few alternatives.

13.1 Motivations

While the mean-variance optimal portfolio is an elegant framework, there are many problems with its use in practice (see, e.g., Michaud (1998) or our earlier discussions). In particular, Black and Litterman (1992) find that it recommends large short positions in many assets when no constraints are imposed, and corner solutions with zero weights in many assets when no-short-sale constraints are imposed. To solve this problem, Black and Litterman propose to combine parameter estimates with what is suggested by asset pricing theory – the CAPM, or the equilibrium values. Their solution also provides a way to incorporate priors (cutting-edge research or information, Wall Street views) into the portfolio optimization process.

13.2 Single risky asset case

For ease of understanding, we first discuss the Black-Litterman model in the single-risky-asset case, leaving the more complex case to the next subsection. Consider asset allocation between the risky asset and the riskless asset. The first question we ask is: what value is the expected return likely to be? In equilibrium, all investors as a whole hold all the stocks, in proportions given by the market portfolio or value-weighted index. Let w_e be the market-portfolio weight on the risky asset, π the expected excess return (or risk premium), and γ the average risk tolerance of the world.
The key assumption of Black and Litterman (1992) is that, in equilibrium, if all investors hold the same view, their demand for the risky asset must exactly equal the outstanding supply, which is given by the optimal portfolio weight formula (2.32),

w_e = (1/γ)(π/σ²).   (13.1)

In other words, the equilibrium risk premium satisfies

π = γσ²w_e,   (13.2)

where π is a constant, as it is the equilibrium value. Assume as usual that the excess return is normally distributed,

R_t = µ + ε_t,  ε_t ∼ N(0, σ²).   (13.3)

Recall that a Bayesian views all parameters as random variables. Hence, µ is naturally assumed to be normally distributed around π,

µ = π + e,  e ∼ N(0, κσ²),   (13.4)

where κ is a scalar indicating how close µ is believed to be to its equilibrium value. On the other hand, an investor may have a view that

µ = µ₀ + v,  v ∼ N(0, ω²).   (13.5)

For example, if µ₀ > π, the investor believes the risk premium will be higher than its equilibrium value.

Now, regarding the equilibrium relationship as the likelihood function and the view as the prior,¹³ Bayes' theorem provides (in exactly the same way as the combination of prior information in Section 12.3) the posterior density:

p(µ|y) ∝ p₀(µ) p(y|µ)   (13.6)
      ∝ exp( −[ (µ − µ₀)²/(2ω²) + (µ − π)²/(2κσ²) ] )   (13.7)
      ∝ exp[ −( (ω² + κσ²)/(2ω²κσ²) ) ( µ − (ω²π + κσ²µ₀)/(ω² + κσ²) )² ],   (13.8)

so the posterior mean is

µ̄ = (ω²π + κσ²µ₀)/(ω² + κσ²) = ( (κσ²)^{−1}π + (ω²)^{−1}µ₀ ) / ( (κσ²)^{−1} + (ω²)^{−1} ),   (13.9)

¹³The same result is obtained here if one exchanges the roles of the two.

and the variance is

θ̄² = ω²κσ²/(ω² + κσ²) = 1 / ( (κσ²)^{−1} + (ω²)^{−1} ).   (13.10)

Again, these are weighted averages of the prior and equilibrium values. The posterior density says that

µ = µ̄ + e_c,  e_c ∼ N(0, θ̄²).   (13.11)
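The single-asset updating is only a few lines of code: build the equilibrium premium from the market weight, combine it with the view by precision weighting, and feed the updated mean and variance into the standard mean-variance weight. A numerical sketch; γ, σ, κ, ω, and the view µ₀ are all hypothetical:

```python
def bl_single_asset(gamma, sigma, w_e, mu0, omega, kappa):
    """Single-asset Black-Litterman update (hypothetical inputs)."""
    pi = gamma * sigma ** 2 * w_e            # equilibrium risk premium
    prec_eq = 1.0 / (kappa * sigma ** 2)     # precision of the equilibrium relation
    prec_view = 1.0 / omega ** 2             # precision of the investor's view
    mu_bar = (prec_eq * pi + prec_view * mu0) / (prec_eq + prec_view)
    theta2 = 1.0 / (prec_eq + prec_view)     # posterior variance of mu
    w_star = mu_bar / (gamma * (sigma ** 2 + theta2))   # updated optimal weight
    return pi, mu_bar, w_star

# gamma = 3, sigma = 20%, market weight 1; the view happens to equal the equilibrium premium
pi, mu_bar, w_star = bl_single_asset(3.0, 0.20, 1.0, mu0=0.12, omega=0.05, kappa=0.5)
print(pi, mu_bar, w_star)   # w_star < 1: parameter uncertainty shrinks the risky holding
```

Even when the view agrees with equilibrium (µ₀ = π), the weight falls below the market weight because the perceived variance rises by θ̄²; sending κ to zero restores the market portfolio.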
Combining this with (13.3), we have

R_t = µ + ε_t = µ̄ + (ε_t + e_c),   (13.12)

so the Bayesian updated mean is µ̄ and the updated variance is σ̄² = σ² + θ̄², where ε_t and e_c are assumed independent. Hence, applying again the earlier optimal portfolio formula (2.32), we get the Bayesian optimal portfolio weight after updating,

w* = (1/γ)(µ̄/σ̄²),   (13.13)

which is the standard formula applied with the Bayesian parameter estimates.

There are a few interesting facts. First, if the investor believes 100% in the equilibrium risk premium, i.e., κ = 0, then clearly µ̄ = π and σ̄² = σ², implying w* = w_e; that is, the investor holds the equilibrium market portfolio. Second, if the view is absolute, so that ω = 0, then µ̄ = µ₀ and σ̄² = σ², and the investor invests more or less in the risky asset depending on whether µ₀ is greater or smaller than π. Third, if ω > 0 and κ > 0, the investor invests less than the market even if µ₀ = π. This is because the perceived risk of the asset has gone up, σ̄² > σ², when the investor is unsure of its expected return; as the investor is risk-averse, the amount invested in the risky asset must go down, so w* < w_e.

13.3 Multiple risky asset case

In the multivariate case, the excess returns of n > 1 risky assets are

R_t = µ + ε_t,  ε_t ∼ N(0, Σ).   (13.14)

In contrast with (13.3), R_t is now an n-vector and Σ is an n×n matrix. Analogously, the n-vector of equilibrium risk premia satisfies

Π = γΣw_e.   (13.15)

Thus, the distribution of µ is

µ = Π + e,  e ∼ N(0, κΣ),   (13.16)

where κ is the same scalar indicating how close µ is believed to be to its equilibrium value.

However, the views on µ are more complex than in the single-asset case. First, there can be K, 0 < K ≤ n, views. Second, a view is not necessarily on a single element of µ, but on a portfolio of the elements. For example, the first view can state that a portfolio, with weights P₁ = (p₁₁, p₁₂, . . .
, p₁ₙ)′, has a prior mean µ₀₁, i.e.,

P₁′µ = p₁₁µ₁ + p₁₂µ₂ + · · · + p₁ₙµₙ = µ₀₁ + v₁,  v₁ ∼ N(0, Ω₁₁).   (13.17)

With K views, we can write all K equations in a simple matrix form,

Pµ = µ₀ + v,  v ∼ N(0, Ω),   (13.18)

where

P = (P₁, P₂, . . . , P_K)′,  µ₀ = (µ₀₁, µ₀₂, . . . , µ₀K)′,  v = (v₁, v₂, . . . , v_K)′,   (13.19)

that is, P is the K×n matrix whose i-th row summarizes the i-th view, µ₀ is a K-vector of the prior means, and v is the vector of view residuals. The covariance matrix Ω of the residuals is often assumed diagonal, unless the errors of the views are correlated.

By the same logic as in the n = 1 case, the posterior density is

p(µ|y) ∝ p₀(µ) p(y|µ)   (13.20)
      ∝ exp( −(1/2)[ (Pµ − µ₀)′Ω^{−1}(Pµ − µ₀) + (µ − Π)′(κΣ)^{−1}(µ − Π) ] ).   (13.21)

By matrix algebra, it can be verified that the quadratic terms satisfy

(Pµ − µ₀)′Ω^{−1}(Pµ − µ₀) + (µ − Π)′(κΣ)^{−1}(µ − Π)   (13.22)
  = µ′[P′Ω^{−1}P + (κΣ)^{−1}]µ − 2[µ₀′Ω^{−1}P + Π′(κΣ)^{−1}]µ + C   (13.23)
  = (µ − µ̄)′Θ̄^{−1}(µ − µ̄) + C,   (13.24)

where C is a generic constant,

µ̄ = [(κΣ)^{−1} + P′Ω^{−1}P]^{−1} [(κΣ)^{−1}Π + P′Ω^{−1}µ₀],   (13.25)
Θ̄ = [(κΣ)^{−1} + P′Ω^{−1}P]^{−1}.   (13.26)

This says that the posterior density is normal with mean µ̄ and covariance matrix Θ̄. The associated Bayesian portfolio choice weights are then obtained as before, with covariance matrix Σ̄ = Σ + Θ̄.

13.4 Alternative approaches

One of the problems with the Black-Litterman model is that it has no model of the data-generating process. Ideally, prior information on the expected returns, including the equilibrium priors, should be combined with the likelihood function of the data-generating process. Pástor and Stambaugh (2000) and Tu and Zhou (2004, 2010) are examples of research in this direction, and Zhou (2009) provides a general framework.

14 References

Alexander, C., 2001, Market Models: A Guide to Financial Data Analysis, Wiley. Amemiya, T., 1985, Advanced Econometrics, Harvard University Press, Cambridge, MA.
Anderson, T.W., 1984, An Introduction to Multivariate Statistical Analysis, 2ed, Wiley. Ang, A., 2014, Asset Management: A Systematic Approach to Factor Investing, Oxford University Press. Anthony, M. P. Bartlett, 2009, Neural Network Learning: Theoretical Foundations, Cambridge University Press. Ao, M., Y., Li, and X. Zheng, 2019, Approaching mean-variance efficiency for large portfolios, Review of Financial Studies 32, 2890–2919. Arditti, F., 1971, Another look at mutual fund performance, Journal of Financial and Quantitative Analysis 6, 909–912. Azzalini, A., 1985, A class of distributions which includes the normal Ones, Scandinavian Journal of Statistics 12,171–17. Azzalini, A.,and A. Dalla Valle, 1996, The multivariate skew-normal distribution, Biometrika 83, 715–726. Bai, J., 2003, Inferential theory for factor models of large dimensions, Econometrica 71, 135–172. Bai, J., and S. Ng, 2002, Determining the number of factors in approximate factor models, Econometrica 70, 191–221. Bai, J., and S. Ng, 2008, Large dimensional factor models, Foundations and Trends in Econometrics 3, 89–163. Bai, J., and P., Wang, 2016, Econometric analysis of large factor models, Annual Review of Economics 8, 53–80. Baker, M., and J.Wurgler, 2006, Investor sentiment and the cross-section of stock returns, Journal of Finance 61, 1645–1680. Barber, B., and T. Odean, 2000, Trading is hazardous to your wealth: the common stock performance of individual investors, Journal of Finance 55, 773–806. Barberis, N., 2000, Investing for the long run when returns are predictable. Journal of Finance 55, 225–264. Barberis, N. and R. Thaler, 2003, A survey of behavioral finance, Chapter 18, Handbook of the Economics of Finance, eds. George Constantinides, Milton Harris, and Rene Stulz, North-Holland, 937–972. Bartlett, M. S., 1947, Multivariate analysis. Journal of the Royal Statistical Society (Suppl.) 9, 176–190. Bates, J. M., and C. W. J. 
Granger, 1969, The Combination of forecasts, Operational Research Quarterly 20, 451–68. Bawa, V. S., S. J. Brown, and R. W. Klein, 1979, Estimation risk and optimal portfolio choice. North-Holland, Amsterdam. Benjamin, D., 2018, Errors in probabilistic reasoning and judgment biases, NBER working paper. Berk, J., 1997, Necessary conditions for the CAPM, Journal of Economic Theory 73, 245–257. Bishop, C., 2006, Pattern recognition and machine learning. Springer. Black, F., 1972, Capital market equilibrium with restricted borrowing. Journal of Business 45, 444–454. Black, F., Litterman, R., 1992, Global portfolio optimization, Financial Analysts Journal 48, 28–43. Boehmer, E., C. Jones, and X. Zhang, 2008. Which shorts are informed? Journal of Finance 63, 491–527. c© Zhou, 2021 Page 283 Bok, B., D., Caratelli, D., Giannone, A. Sbordone and A. Tambalotti, 2017, Macroeconomic nowcasting and fore- casting with big data, working paper. Bollerslev, T., 1986, Generalized autoregressive conditional heteroskedasticity, Journal of Econometrics 31, 307–327. Bollerslev, T., Chou, R.Y., Kroner, K.F., 1992, ARCH modeling in Finance: a selective review of the theory and empirical evidence, Journal of Econometrics 52, 5–59. Bollerslev, T., 2001, Financial econometrics: Past developments and future challenges, Journal of Econometrics 100, 41–51. Bollerslev, T., G. Tauchen, and H. Zhou, 2009, Expected stock returns and variance risk premia, Review of Financial Studies 22, 4463–4492. Box, G.E.P., G. Jenkins, G. Reinse, and G. Ljung, 2016, Time series analysis forecasting and control, 5ed, Wiley. Brides, P., 2009, Examining portfolio optimisation as a regression Problem, MSC. Financial Engineering: Birbeck, University of London. Britten-Jones, M., 1999, The sampling error in estimates of mean-variance efficient portfolio weights, Journal of Finance 54, 655–671. Brock, W., Lakonishok, J., LeBaron, B., 1992. Simple technical trading rules and the stochastic properties of stock returns. 
Journal of Finance 47, 1731–1764. Brockwell, P., and R. Davis, 2016, Introduction to Time Series and Forecasting, 3ed, Springer. Brown, S. J., 1976, Optimal portfolio choice under uncertainty, Ph.D. dissertation, University of Chicago. Brown, S. J., 1978, The portfolio choice problem: comparison of certainty equivalence and optimal Bayes portfolios, Communications in Statistics-Simulation and Computation 7, 321–334. Bühlmann, P., and S. van de Geer, 2011, Statistics for High-Dimensional Data, Springer. Campbell, J.Y. and S.B. Thompson, 2008, Predicting the equity premium out of sample: Can anything beat the historical average?, Review of Financial Studies 21, 1509–1531. Campbell, John Y. and Luis M. Viceira, 2003, Strategic asset allocation: portfolio choice for long-term investors, Oxford University Press. Chang, R., Chu, L., Tu, J., Zhang, B., Zhou, G., 2021, ESG and the Market Return, working paper. Chen, J., Tang, G., Yao, J., and Zhou, G., 2020, Investor attention and stock returns, Journal of Financial and Quantitative Analysis (forthcoming). Chen, J., Tang, G., Yao, J., and Zhou, G., 2021, Employee sentiment and stock returns, working paper. Chen, J., and M. Yuan, 2016, Efficient portfolio selection in a large market, Journal of Financial Econometrics 14, 496–524. Chen, N.-F., R. Roll, and S. A. Ross, 1986, Economic forces and the stock market, Journal of Business 59, 383–403. Chen, Y., Z. Da and D. Huang, 2021, Short selling efficiency, Journal of Financial Economics (forthcoming). Chib, S., L. Zhao and G. Zhou, 2021, Winners from winners: A tale of risk factors, working paper. Chincarini, Ludwig B., and Daehwan Kim, 2006, Quantitative Equity Portfolio Management, New York: McGraw-Hill. Chinco, A., A. D. Clark-Joseph, and M. Ye, 2019, Sparse signals in the cross-section of returns, Journal of Finance 74, 449–492. Chou, P., G.
Zhou, 2006, Using bootstrap to test portfolio efficiency, Annals of Economics and Finance 7, 217–249. Christie, S., 2005, Is the Sharpe Ratio Useful in Asset Allocation? MAFC Research Papers No.31, Applied Finance Centre, Macquarie University. Christopherson, Jon A., Wayne Ferson and Andrew L. Turner, 1999, Performance evaluation using conditional alphas and betas, Journal of Portfolio Management 26, 59–72. Clark, T. E., and K. D.West. 2007, Approximately normal tests for equal predictive accuracy in nested models, Journal of Econometrics 138, 291–311. Clarke, R., H. Silva, and S. Thorley, 2002, Portfolio constraints and the fundamental law of active management, Financial Analyst ournal 58, 48–66. Cochrane, J. H. 2001. Asset pricing, Princeton University Press. Cochrane, J.H., and M. Piazzesi. 2005. Bond risk premia, American Economic Review 95, 138–60. Coggin, T. Daniel, and Frank J. Fabozzi, 2003, The Handbook of equity style management, Wiley, 2003. Cohen, Randolph, Joshua Coval and Lubos Pastor, 2005, Judging fund managers by the company they keep, Journal of Finance 60, 1057–1096. Connor, G. and R. Korajczyk, 1988, Risk and return in an equilibrium APT: An application of a new methodology, Journal of Financial Economics 21, 255–289. Connor, G. and R. A. Korajczyk, 1995, The arbitrage pricing theory and multifactor models of asset returns, in Handbooks in Operations Research and Management Science: Finance, Volume 9, edited by R. A. Jarrow, et al, North-Holland. Cook, R. D., Forzani, L., 2019. Partial least squares prediction in high-dimensional regression, Annals of Statistics 47, 884–908. Cook, R. D., Forzani, L., 2021, PLS Regression Algorithms in the Presence of Nonlinearity, working paper. Coqueret, G., and T. Guida, 2021, Machine Learning for Factor Investing, CRC Press. Coval, J., and T. Shumway, 2005, Do behavioral biases affect prices?, Journal of Finance 60, 1–34. Covel, M., and B. 
Ritholtz, 2017, Trend Following: How to Make a Fortune in Bull, Bear and Black Swan Markets, Wiley, 5th edition. Cujean, Julien and Hasler, Michael, 2017, Why Does Return Predictability Concentrate in Bad Times? Journal of Finance 72, 2717—2758. Cybenko, G. 1989, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems 2, 303–314. Daniel, K., D. Hirshleifer, and L. Sun, 2020, Short-and long-horizon behavioral factors, Review of Financial Studies 33, 1673–1736. de Jong, S., 1993, Simpls: An alternative approach to partial least squares regression, Chemometrics and Intelligent Laboratory Systems 18, 251–263. c© Zhou, 2021 Page 285 DeRoon, F. A. and T. E. Nijman, 2001, Testing for mean-variance spanning: a survey, Journal of Empirical Finance 8, 111–155. Deisenroth , M., A. Faisal, and C. Ong, 2020, Mathematics for Machine Learning, Cambridge University Press. DeMiguel, V., L. Garlappi, and R. Uppal, 2009, Optimal versus naive diversification: How inefficient is the 1/N portfolio strategy? Review of Financial Studies 22, 1915–1953. Den Haan, W.J., and A. Levin, 1997, A practitioner’s guide to robust covariance matrix estimation, in Handbook of Statistics 15, G.S. Maddala and C.R. Rao, eds., Elsevier (Amsterdam), pp.299–342. Diebold, F. X., and R. S. Mariano. 1995, Comparing predictive accuracy, Journal of Business and Economic Statistics 13, 253–263. Diebold, F. X. and M. Shin, 2019, Machine learning for regularized survey forecast combination: Partially-egalitarian LASSO and its derivatives, International Journal of Forecasting 35, 1679–1691. Ding, Z., and R. Martin, 2017, The fundamental law of active management: Redux, Journal of Empirical Finance 43, 91–114. Dixon, M., I. Halperin, and P. Bilokon, 2020, Machine Learning in Finance: From Theory to Practice, Springer. Doane, P., and L. Seward, 2011, Measuring skewness: a forgotten statistic, Journal of Statistics Education 19, 1–18. Dong, X., Li, Y., Rapach, D. 
and Zhou, G., 2021, Anomalies and the Expected Market Return, Journal of Finance (forthcoming).
Edmans, A., A. Fernandez-Perez, A. Garel and I. Indriawan, 2021, Music sentiment and stock returns around the world, Journal of Financial Economics, forthcoming.
Efron, B., 1979, Bootstrap methods: Another look at the Jackknife, Annals of Statistics 7, 1–26.
Engle, Robert F., 1982, Autoregressive conditional heteroskedasticity with estimates of the variance of United Kingdom inflation, Econometrica 50, 987–1007.
Fabozzi, Frank J., Petter N. Kolm, Dessislava Pachamanova, and Sergio M. Focardi, 2007, Robust Portfolio Optimization and Management, New York: Wiley.
Fabozzi, Frank J., Dashan Huang and Guofu Zhou, 2010, Robust Portfolios: Contributions from Operations Research and Finance, Annals of Operations Research 176, 191–220.
Fama, E. F., MacBeth, J. D., 1973, Risk, return, and equilibrium: Empirical tests, Journal of Political Economy 81, 607–636.
Fama, E. F., French, K. R., 1993, Common risk factors in the returns on stocks and bonds, Journal of Financial Economics 33, 3–56.
Fama, E. F., French, K. R., 2015, A five-factor asset pricing model, Journal of Financial Economics 116, 1–22.
Fan, J., Liao, Y., and M. Mincheva, 2013, Large covariance estimation by thresholding principal orthogonal complements, Journal of the Royal Statistical Society (Series B, Statistical Methodology) 75, 603–680.
Feng, G., S. Giglio, and D. Xiu, 2020, Taming the factor zoo: A test of new factors, Journal of Finance 75, 1327–1370.
Ferri, R., 2010, All About Asset Allocation, 2e, McGraw-Hill.
Filippou, I., M. Taylor, and G. Zhou, 2020, Exchange Rate Prediction with Machine Learning and a Smart Carry Portfolio, Working paper.
© Zhou, 2021 Page 286
Frazzini, A., Israel, R., Moskowitz, T., 2015, Trading costs of asset pricing anomalies, Working paper.
French, K., and J. Poterba, 1991, Investor diversification and international equity markets, American Economic Review 81, 222–226.
Freyberger, J., A. Neuhierl, and M. Weber, 2020, Dissecting characteristics nonparametrically, Review of Financial Studies 33, 2326–2377.
Gao, L., Y. Han, Z. Li, and G. Zhou, 2018, Market intraday momentum, Journal of Financial Economics 129, 394–414.
Géron, A., 2019, Hands-on Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd edition, O'Reilly Media.
Geweke, J., Zhou, G., 1996, Measuring the pricing error of the arbitrage pricing theory, Review of Financial Studies 9, 557–587.
Ghayur, K., R. Heaney, and S. Platt, 2019, Equity Smart Beta and Factor Investing for Practitioners, Wiley.
Ghysels, E., and M. Marcellino, 2018, Applied Economic Forecasting using Time Series Methods, Oxford University Press.
Gibbons, M., S. Ross and J. Shanken, 1989, A test of the efficiency of a given portfolio, Econometrica 57, 1121–1152.
Giglio, S. and D. Xiu, 2021, Asset pricing with omitted factors, Journal of Political Economy 129, 1947–1990.
Giglio, S., Kelly, B., and D. Xiu, 2021, Factor models, machine learning, and asset pricing, working paper.
Giraud, C., 2015, Introduction to High-Dimensional Statistics, CRC Press.
Glasserman, P., 2004, Monte Carlo Methods in Financial Engineering, Springer-Verlag.
Goh, Jeremy, Fuwei Jiang, Jun Tu and Guofu Zhou, 2012, Forecasting bond risk premia using technical indicators, Working paper, Washington University in St. Louis.
Guo, X., H. Lin, C. Wu, and G. Zhou, 2020, Extracting information from corporate yield curve: A machine learning approach, Working paper.
Graham, John R., and Campbell Harvey, 1996, Market timing ability and volatility implied in investment newsletters' asset allocation recommendations, Journal of Financial Economics 42, 397–421.
Graham, John R., and Campbell Harvey, 1997, Grading the performance of market timing newsletters, Financial Analysts Journal, 54–66.
Griffin, John M., Jeffrey H. Harris and Selim Topaloglu, 2003, Investor behavior over the rise and fall of Nasdaq, working paper, Yale University.
Grinblatt, Mark S., and Sheridan Titman, 1995, Performance evaluation, in Handbook in Operations Research and Management Science, Vol. 9: Finance, Jarrow, R., Maksimovic, V., and Ziemba, W. (Eds.), Elsevier Science, 581–609.
Grinold, Richard C., 1989, The fundamental law of active management, Journal of Portfolio Management 15, 30–37.
Grinold, Richard C. and Ronald N. Kahn, 1999, Active Portfolio Management: Quantitative Theory and Applications, McGraw-Hill.
Gu, S., B. Kelly, and D. Xiu, 2020, Empirical asset pricing via machine learning, Review of Financial Studies 33, 2223–2273.
Guida, T., 2019, Big Data and Machine Learning in Quantitative Investment, Wiley.
Gulli, A., Kapoor, A., Pal, S., 2019, Deep Learning with TensorFlow 2 and Keras, 2nd edition, Packt.
Hall, P., 1992, The Bootstrap and Edgeworth Expansion, Springer-Verlag, New York.
Hilpisch, Y., 2015, Derivatives Analytics with Python: Data Analysis, Models, Simulation, Calibration and Hedging, Wiley.
Horowitz, J., 1995, Bootstrap methods in econometrics: Theory and numerical performance, in Advances in Economics and Econometrics: Theory and Applications III, edited by D. M. Kreps and K. F. Wallis, 188–222, Cambridge University Press.
Han, Y., K. Yang, and G. Zhou, 2013, A new anomaly: the cross-sectional profitability of technical analysis, Journal of Financial and Quantitative Analysis 48, 1433–1461.
Han, Y., G. Zhou, and Y. Zhu, 2016, A trend factor: any economic gains from using information over investment horizons? Journal of Financial Economics 122, 352–375.
Han, Y., A. He, D. E. Rapach, and G. Zhou, 2021, What firm characteristics drive US stock returns? Manuscript.
Han, Y., Y. Liu, G. Zhou, and Y. Zhu, 2021, Technical Analysis in the Stock Market: A Review, Manuscript.
Hansen, L. P., 1982, Large sample properties of generalized method of moments estimators, Econometrica 50, 1029–1054.
Harvey, C. R., and G. Zhou, 1990, Bayesian inference in asset pricing tests, Journal of Financial Economics 26, 221–254.
Harvey, C., and G. Zhou, 1993, International asset pricing with alternative distributional specifications, Journal of Empirical Finance 1, 107–131.
Harvey, C., Liu, Y., and H. Zhu, 2016, ... and the cross-section of expected returns, Review of Financial Studies 29, 5–68.
Hastie, T., R. Tibshirani, and J. Friedman, 2009, The Elements of Statistical Learning, 2nd edition, Springer.
Helland, I., and T. Almøy, 1994, Comparison of prediction methods when only a few components are relevant, Journal of the American Statistical Association 89, 583–591.
Henkel, S., J. S. Martin, and F. Nardari, 2011, Time-varying short-horizon predictability, Journal of Financial Economics 99, 560–580.
Hoerl, A. E. and R. W. Kennard, 1970, Ridge regression: Applications to nonorthogonal problems, Technometrics 12, 69–82.
Hornik, K., M. Stinchcombe, and H. White, 1989, Multilayer feedforward networks are universal approximators, Neural Networks 2, 359–366.
Hou, Kewei, Chen Xue, and Lu Zhang, 2015, Digesting anomalies: An investment approach, Review of Financial Studies 28, 650–705.
Huang, D., Jiang, F., J. Tu and G. Zhou, 2015, Investor sentiment aligned: a powerful predictor of stock returns, Review of Financial Studies 28, 791–837.
Huang, D., and G. Zhou, 2017, Upper bounds on return predictability, Journal of Financial and Quantitative Analysis 52, 401–425.
Huang, D., J. Li, and L. Wang, 2020, Time-series momentum: is it there? Journal of Financial Economics 135, 774–794.
Huang, D., Jiang, F., Li, K., Tong, G., and G. Zhou, 2020, Scaled PCA: A new approach to dimension reduction, Management Science (forthcoming).
Huang, Chi-fu, and Robert H. Litzenberger, 1988, Foundations for Financial Economics, North-Holland.
Huberman, G. and S. Kandel, 1987, Mean-variance spanning, Journal of Finance 42, 873–888.
Hurst, B., Y. Ooi and L. Pedersen, 2017, A century of evidence on trend-following investing, Journal of Portfolio Management 44, 15–29.
Ingersoll, J., 1987, Theory of Financial Decision Making, Rowman and Littlefield.
Jacquier, E., Kane, A., and Marcus, A. J., 2003, Geometric or arithmetic mean: a reconsideration, Financial Analysts Journal 59, 46–53.
Jacquier, E., Kane, A., and Marcus, A. J., 2005, Optimal estimation of the risk premium for the long run and asset allocation: a case of compounded estimation risk, Journal of Financial Econometrics 3, 37–55.
Jiang, F., J. Lee, X. Martin and G. Zhou, 2019, Manager sentiment and stock returns, Journal of Financial Economics 132, 126–149.
Jiang, L., K. Wu, G. Zhou, Y. Zhu, 2020, Stock return asymmetry: beyond skewness, Journal of Financial and Quantitative Analysis 55, 357–386.
Jagannathan, R., and Z. Wang, 2002, Empirical evaluation of asset pricing models: A comparison of the SDF and beta models, Journal of Finance 57, 2337–2368.
Jiang, F., G. Tang, and G. Zhou, 2018, Firm characteristics and Chinese stocks, Journal of Management Science and Engineering 3, 259–283.
Jiang, H., Z. Li and H. Wang, 2020, Pervasive underreaction: Evidence from high-frequency data, working paper.
Joanes, D. N., and C. A. Gill, 1998, Comparing measures of sample skewness and kurtosis, Journal of the Royal Statistical Society 47, Series D, 183–189.
Jobson, J. D. and B. M. Korkie, 1981, Performance hypothesis testing with the Sharpe and Treynor measures, Journal of Finance 36, 889–908.
Jobson, J. D. and B. M. Korkie, 1983, Statistical inference in two-parameter portfolio theory with multiple regression software, Journal of Financial and Quantitative Analysis 18, 189–197.
Johnstone, I., and D. Paul, 2018, PCA in High Dimensions: An Orientation, Proceedings of the IEEE 106, 1277–1292.
Jolliffe, I. T., 2002, Principal Component Analysis, 2nd edition, Springer.
Jorion, P., 1986, Bayes-Stein estimation for portfolio analysis, Journal of Financial and Quantitative Analysis 21, 279–292.
Jorion, P., 2003, Portfolio optimization with tracking-error constraints, Financial Analysts Journal 59, 70–82.
Jurczenko, E., and B. Maillet, 2006, Multi-moment Asset Allocation and Pricing Models, Wiley.
Jurczenko, E., 2020, Machine Learning for Asset Management, Wiley.
Kahneman, D., and A. Tversky, 1974, Judgment under uncertainty: heuristics and biases, Science 185, 1124–1131.
Kan, R., and G. Zhou, 1999, A critique of the stochastic discount factor methodology, Journal of Finance 54, 1021–1048.
Kan, R., C. Robotti, and J. Shanken, 2013, Pricing model performance and the two-pass cross-sectional regression methodology, Journal of Finance 68, 2617–2649.
Kan, R., X. Wang, and G. Zhou, 2021, Optimal portfolio choice with estimation risk: no risk-free asset case, Management Science (forthcoming).
Kan, R., and G. Zhou, 2007, Optimal portfolio choice with parameter uncertainty, Journal of Financial and Quantitative Analysis 42, 621–656.
Kan, R., and G. Zhou, 2009, What will the likely range of my wealth be? Financial Analysts Journal 65 (4), 68–77.
Kan, R., and G. Zhou, 2012, Tests of mean-variance spanning, Annals of Economics and Finance 13, 145–193.
Kandel, S., Stambaugh, R. F., 1996, On the predictability of stock returns: An asset-allocation perspective, Journal of Finance 51, 385–424.
Kelly, J. L., 1956, A new interpretation of information rate, Bell System Technical Journal 35, 917–926.
Kelly, B., Pruitt, S., 2013, Market expectations in the cross-section of present values, Journal of Finance 68, 1721–1756.
Kelly, B., Pruitt, S., 2015, The three-pass regression filter: A new approach to forecasting using many predictors, Journal of Econometrics 186, 294–316.
Kendall, M., and A. Hill, 1953, The analysis of economic time-series, Part I: prices, Journal of the Royal Statistical Society Series A 116, 11–34.
Kim, T., H. White, and D. Stone, 2005, Asymptotic and Bayesian Confidence Intervals for Sharpe-Style Weights, Journal of Financial Econometrics 3, 315–343.
Klein, R., and V. Bawa, 1976, The effect of estimation risk on optimal portfolio choice, Journal of Financial Economics 3, 215–231.
Klaas, J., 2019, Machine Learning for Finance, Packt.
Kozak, S., S. Nagel, and S. Santosh, 2020, Shrinking the cross section, Journal of Financial Economics 135, 271–292.
Ledoit, O. and Wolf, M., 2003, Improved estimation of the covariance matrix of stock returns with an application to portfolio selection, Journal of Empirical Finance 10, 603–621.
Ledoit, O. and Wolf, M., 2017, Nonlinear shrinkage of the covariance matrix for portfolio selection: Markowitz meets Goldilocks, Review of Financial Studies 30, 4349–4388.
Ledoit, O. and Wolf, M., 2020, Analytical nonlinear shrinkage of large-dimensional covariance matrices, Annals of Statistics 48, 3043–3065.
Lehmann, E. L., and G. Casella, 1998, Theory of Point Estimation (Springer-Verlag, New York).
Lee, C., A. Shleifer and R. Thaler, 1991, Investor sentiment and the closed-end fund puzzle, Journal of Finance 46, 75–110.
Lehmann, B. N., and D. M. Modest, 1988, The empirical foundations of the arbitrage pricing theory, Journal of Financial Economics 21, 213–254.
Leibowitz, Martin L., 1996, Return Targets and Shortfall Risks: Studies in Strategic Asset Allocation, Irwin Professional Pub.
Lewellen, J., 2015, The cross-section of expected stock returns, Critical Finance Review 4, 1–44.
Lie, E., Meng, B., Qian, Y., and G. Zhou, 2017, Corporate activities and the market risk premium, working paper.
Lin, H., C. Wu, and G. Zhou, 2018, Forecasting corporate bond returns: an iterated combination approach, Management Science 64, 4218–4238.
Litterman, R., and J. Scheinkman, 1991, Common factors affecting bond returns, Journal of Fixed Income 1, 54–61.
Liu, H., X. Tang, and G. Zhou, 2021, Recovering the FOMC risk premium, working paper.
Liu, Y., G. Zhou, and Y. Zhu, 2020a, Trend factor in China, working paper.
Liu, Y., G. Zhou, and Y. Zhu, 2020b, Maximizing the Sharpe Ratio: A Genetic Programming Approach, working paper.
Lo, Andrew W., and Craig MacKinlay, 1988, Stock market prices do not follow random walks: evidence from a simple specification test, Review of Financial Studies 1, 41–66.
Lo, A. W., Hasanhodzic, J., 2009, The Heretics of Finance: Conversations with Leading Practitioners of Technical Analysis, Bloomberg Press.
Lo, A. W., Mamaysky, H., Wang, J., 2000, Foundations of technical analysis: Computational algorithms, statistical inference, and empirical implementation, Journal of Finance 55, 1705–1770.
López de Prado, M., 2018, Advances in Financial Machine Learning, Wiley.
López de Prado, M., 2020a, Machine Learning for Asset Managers, Cambridge University Press.
López de Prado, M., 2020b, Three quant lessons from COVID-19, Presentation Slides.
Ludvigson, S. C., and S. Ng, 2007, The empirical risk-return relation: A factor analysis approach, Journal of Financial Economics 83, 171–222.
MacKinlay, A. C., and M. P. Richardson, 1991, Using generalized method of moments to test mean-variance efficiency, Journal of Finance 46, 511–527.
MacLean, L., E. Thorp and W. Ziemba, 2011, The Kelly Capital Growth Investment Criterion: Theory and Practice, WSPC.
Maillard, S., T. Roncalli, and J. Teiletche, 2010, The properties of equally weighted risk contribution portfolios, Journal of Portfolio Management 36, 60–70.
Mandelbrot, B., 1963, New methods in statistical economics, Journal of Political Economy 71, 421–440.
Maruyama, Y., 2004, Stein's idea and minimax admissible estimation of a multivariate normal mean, Journal of Multivariate Analysis 88, 320–334.
Markowitz, Harry M., 1952, Portfolio selection, Journal of Finance 7, 77–91.
Martin, I., 2017, What is the expected return on the market? Quarterly Journal of Economics 132, 367–433.
Menchero, J., and P. Li, 2020, Correlation shrinkage: implications for risk forecasting, Journal of Investment Management 18, 92–108.
McLachlan, G., and T. Krishnan, 1997, The EM Algorithm and Extensions, Wiley.
Mehlawat, M., P. Gupta, A. Khan, 2021, Portfolio optimization using higher moments in an uncertain random environment, Information Sciences 567, 348–374.
Memmel, C., 2003, Performance hypothesis testing with the Sharpe ratio, Finance Letters 1, 21–23.
Merton, R., 1969, Lifetime portfolio selection under uncertainty: The continuous-time case, Review of Economics and Statistics 51, 247–257.
Merton, R., 1971, Optimum consumption and portfolio rules in a continuous-time model, Journal of Economic Theory 3, 373–413.
Merton, R., 1973, An intertemporal capital asset pricing model, Econometrica 41, 867–887.
Merton, R., 1980, On estimating the expected return on the market: An exploratory investigation, Journal of Financial Economics 8, 323–361.
Mertens, E., 2002, Comments on variance of the IID estimator in Lo (2002), Working paper, University of Basel, Wirtschaftswissenschaftliches Zentrum, Department of Finance.
Michaud, Richard, 1998, Efficient Asset Management: A Practical Guide to Stock Portfolio Optimization and Asset Allocation, Harvard Business School Press.
Michaud, R., and R. Michaud, 2008, Efficient Asset Management: A Practical Guide to Stock Portfolio Optimization and Asset Allocation, 2e, Oxford University Press.
Muirhead, Robb J., 1982, Aspects of Multivariate Statistical Theory (Wiley, New York).
Murphy, K., 2012, Machine Learning: A Probabilistic Perspective, MIT Press.
Nagel, S., 2021, Machine Learning in Asset Pricing, Princeton: Princeton University Press.
Neely, C. J., D. E. Rapach, J. Tu, and G. Zhou, 2014, Forecasting the equity premium: The role of technical indicators, Management Science 60, 1772–1791.
Ng, K. S., 2013, A simple explanation of partial least squares, working paper.
Neuhierl, A., X. Tang, R. Varneskov and G. Zhou, 2021, Expected stock returns from option characteristics, working paper.
Newey, W. K., and K. D. West, 1987, A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix, Econometrica 55, 703–708.
Novy-Marx, R., and M. Velikov, 2016, A taxonomy of anomalies and their trading costs, Review of Financial Studies 29, 104–147.
Odean, T., 1998, Are investors reluctant to realize their losses? Journal of Finance 53, 1775–1798.
Opdyke, J., 2007, Comparing Sharpe ratios: So where are the p-values? Journal of Asset Management 8, 308–336.
Pav, S., 2021, A Short Sharpe Course, working paper (SSRN).
Pástor, Ľ., Stambaugh, R. F., 2000, Comparing asset pricing models: an investment perspective, Journal of Financial Economics 56, 335–381.
Pedersen, L. H., A. Babu, and A. Levine, 2020, Enhanced Portfolio Optimization, working paper.
Platanakis, E., C. Sutcliffe and X. Ye, 2021, Horses for courses: Mean-variance for asset allocation and 1/N for stock selection, European Journal of Operational Research 288, 302–317.
Pourahmadi, M., 2013, High-dimensional Covariance Estimation, Wiley.
Qian, Edward, Ronald Hua, and Eric Sorensen, 2007, Quantitative Equity Portfolio Management: Modern Techniques and Applications, New York: Chapman & Hall.
Rapach, D., Ringgenberg, M., and G. Zhou, 2016, Short interest and aggregate stock returns, Journal of Financial Economics 121, 46–65.
Rapach, D., J. Strauss, and G. Zhou, 2010, Out-of-sample equity premium prediction: Combination forecasts and links to the real economy, Review of Financial Studies 23, 821–862.
Rapach, D., J. Strauss, and G. Zhou, 2013, International stock return predictability: What is the role of the United States? Journal of Finance 68, 1633–1662.
Rapach, D., and G. Zhou, 2013, Forecasting stock returns, in Handbook of Economic Forecasting, Vol. 2, edited by G. Elliott and A. Timmermann, North-Holland, 328–383.
Rapach, D., and G. Zhou, 2019, Sparse macro factors, working paper.
Rapach, D., and G. Zhou, 2020, Time-series and cross-sectional stock return forecasting: new machine learning methods, in Machine Learning for Asset Management, edited by Emmanuel Jurczenko, Wiley, 1–33.
Rapach, D., and G. Zhou, 2021, Asset Pricing: Time-Series Predictability, working paper.
Raschka, S., and V. Mirjalili, 2019, Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2, 3rd edition, Packt Publishing Ltd.
Rice, J., 2007, Mathematical Statistics and Data Analysis, 3e, Thomson Higher Education.
Ritter, Jay R., 1991, The long-run performance of initial public offerings, Journal of Finance 46, 3–27.
Roll, R., 1992, A mean-variance analysis of tracking error, Journal of Portfolio Management 18, 13–22.
Romero, P., and T. Balch, 2014, What Hedge Funds Really Do: An Introduction to Portfolio Management, Business Expert Press.
Ross, S. A., 1976, The arbitrage theory of capital asset pricing, Journal of Economic Theory 13, 341–360.
Ross, S. A., 2005, Neoclassical Finance, Princeton University Press.
Ross, S., 2015, The recovery theorem, Journal of Finance 70, 615–648.
Samuelson, P., 1969, Lifetime portfolio selection by dynamic stochastic programming, Review of Economics and Statistics 51, 239–246.
Samuelson, P., 1970, The fundamental approximation theorem of portfolio analysis in terms of means, variances and higher moments, Review of Economic Studies 37, 537–542.
Schwager, J. D., 1989, Market Wizards, John Wiley & Sons, Hoboken, New Jersey.
Schwert, G. William, 2003, Anomalies and market efficiency, Chapter 15, Handbook of the Economics of Finance, eds. George Constantinides, Milton Harris, and Rene Stulz, North-Holland, 937–972.
Seber, G. A. F., 1984, Multivariate Observations, Wiley.
Shalev-Shwartz, S., and S. Ben-David, 2014, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press.
Shanken, J., 1987, A Bayesian approach to testing portfolio efficiency, Journal of Financial Economics 19, 195–215.
Shanken, Jay, 1992, On the estimation of beta-pricing models, Review of Financial Studies 5, 1–33.
Shanken, Jay, and Guofu Zhou, 2007, Estimating and testing beta pricing models: Alternative methods and their performance in simulations, Journal of Financial Economics 84, 40–86.
Shao, J. and D. Tu, 1995, The Jackknife and Bootstrap, Springer-Verlag, New York.
Sharpe, W. F., 1988, Determining a fund's effective asset mix, Investment Management Review 2, 59–69.
Sharpe, W. F., 1992, Management style and performance measurement, Journal of Portfolio Management 18, 7–19.
Shi, B., and S. S. Iyengar, 2020, Mathematical Theories of Machine Learning, Springer.
Shleifer, A., and R. Vishny, 1997, The limits of arbitrage, Journal of Finance 52, 35–55.
Stambaugh, Robert F., 1999, Predictive regressions, Journal of Financial Economics 54, 375–421.
Stambaugh, R. F., and Y. Yuan, 2017, Mispricing factors, Review of Financial Studies 30, 1270–1315.
Stein, Charles, 1956, Inadmissibility of the usual estimator for the mean of a multivariate normal distribution, Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 197–206 (University of California Press, Berkeley).
Stock, J., and M. W. Watson, 2002, Forecasting using principal components from a large number of predictors, Journal of the American Statistical Association 97, 1167–1179.
Tibshirani, R., 1996, Regression shrinkage and selection via the LASSO, Journal of the Royal Statistical Society, Series B (Methodological) 58, 267–288.
Tsay, R., 2010, Analysis of Financial Time Series, 3rd edition, Wiley.
Tu, J., and G. Zhou, 2004, Data-generating process uncertainty: What difference does it make in portfolio decisions? Journal of Financial Economics 72, 385–421.
Tu, J., and G. Zhou, 2010, Incorporating economic objectives into Bayesian priors: Portfolio choice under parameter uncertainty, Journal of Financial and Quantitative Analysis 45, 959–986.
Tu, J., and G. Zhou, 2011, Markowitz meets Talmud: A combination of sophisticated and naive diversification strategies, Journal of Financial Economics 99, 204–215.
Tversky, A., and D. Kahneman, 1992, Advances in prospect theory: cumulative representation of uncertainty, Journal of Risk and Uncertainty 5, 297–323.
Vinzi, V. E., W. Chin, J. Henseler, and H. Wang, 2010, Handbook of Partial Least Squares, Springer.
Welch, I., Goyal, A., 2008, A comprehensive look at the empirical performance of equity premium prediction, Review of Financial Studies 21, 1455–1508.
Wold, H., 1966, Estimation of principal components and related models by iterative least squares, in P. R. Krishnaiah (ed.), Multivariate Analysis, 391–420, New York: Academic Press.
Wold, H., 1975, Path models with latent variables: The NIPALS approach, in H. M. Blalock, A. Aganbegian, F. M. Borodkin, R. Boudon, and V. Capecchi (Eds.), Quantitative Sociology: International Perspectives on Mathematical and Statistical Model Building, 307–357, Academic Press.
Wu, Y., Y. Qin, and Mu Zhu, 2020, High-dimensional covariance matrix estimation using a low-rank and diagonal decomposition, The Canadian Journal of Statistics 48, 308–337.
Yao, J., S. Zheng and Z. Bai, 2015, Large Sample Covariance Matrices and High-Dimensional Data Analysis, Cambridge University Press.
Ye, J., 2008, How variation in signal quality affects performance, Financial Analysts Journal 64, 48–61.
Yiu, K. F. C., 2004, Optimal portfolios under a value-at-risk constraint, Journal of Economic Dynamics & Control 28, 1317–1334.
Zaffaroni, P., 2019, Factor models for asset pricing, working paper.
Zellner, Arnold, and V. Karuppan Chetty, 1965, Prediction and decision problems in regression models from the Bayesian point of view, Journal of the American Statistical Association 60, 608–616.
Zellner, Arnold, 1971, An Introduction to Bayesian Inference in Econometrics (Wiley, New York).
Zhou, G., 1993, Asset pricing tests under alternative distributions, Journal of Finance 48, 1927–1942.
Zhou, G., 2008a, On the fundamental law of active portfolio management: What happens if our estimates are wrong? Journal of Portfolio Management 34 (3), 26–33.
Zhou, G., 2008b, On the fundamental law of active portfolio management: How to make conditional investments unconditionally optimal? Journal of Portfolio Management 35 (1), 12–21.
Zhou, G., 2009, Beyond Black-Litterman: Letting the data speak, Journal of Portfolio Management 36, 36–45.
Zhou, G., 2010, How much stock return predictability can we expect from an asset pricing model? Economics Letters 108, 184–186.
Zhou, G., 2018, Measuring investor sentiment, Annual Review of Financial Economics 10, 239–259.
Zhou, Z., 2012, Ensemble Methods: Foundations and Algorithms, New York: CRC Press.
Zhu, Y., and G. Zhou, 2009, Technical analysis: An asset allocation perspective on the use of moving averages, Journal of Financial Economics 92, 519–544.
Zou, H. and T. Hastie, 2005, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society, Series B (Statistical Methodology) 67, 301–320.