
Note: 1,242

Context

I'm Dan — trained in physical oceanography, experienced in state-space modeling, finite difference methods, and data science, but new to investment theory. I'm working through Chris Bemis's Portfolio Statistics and Optimization (August 2023) to build understanding.

Book structure: Chapters are split into separate dialogs (ch_1.ipynb, ch_2.ipynb, etc.). Also available:

  • ./spoiler.md — short digest of each chapter's point and how they form the book's arc
  • ./bib.ipynb — bibliography

Arc: The book builds from statistical foundations (distributions, covariance) through regression and optimization theory to Mean-Variance Optimization and coherent risk measures. Chris emphasizes using CAPM and Fama-French as interpretive tools rather than prescriptive models, given non-stationarity and fat tails in real markets.

Ch 1 covered: EMH, systemic vs idiosyncratic risk, CAPM (β as market sensitivity, α as excess return), Fama-French (adding size and value factors), and MVO setup. Key theme: models are useful for interpretation and comparison, but their stationarity assumptions are violated in practice.

Ch 2 covered: CDFs, PDFs, moments (mean, variance, skew, kurtosis), percentiles. Normal, log-normal, and Student t distributions — with emphasis on how real equity returns exhibit fat tails and asymmetry that the normal badly misses (837,128× underestimate of 6σ events!). Student t with ν=5 is a pragmatic alternative. Multivariate extensions (joint distributions, covariance matrices, independence). Weak Law of Large Numbers, Central Limit Theorem, and unbiased estimators.

Key Ch 2 insight: Two toy models illustrate diversification. Model 1 (iid returns) suggests you can diversify away all risk. Model 2 (returns = β×market + idiosyncratic) shows you can only diversify away idiosyncratic risk — systematic risk remains. This motivates the passive investing view ("just own the market") and sets up the question: what justifies active management?

Ch 3 covered: Covariance as an inner product (giving Cauchy-Schwarz bounds on correlation). CAPM revisited: β = ρ(σ_r/σ_m), derived via inner product machinery. Key insight: correlation measures linear dependence only — a perfect quadratic relationship can show ρ ≈ 0 if positive and negative branches cancel.

Copulas: Sklar's theorem lets you separate any joint distribution into (1) a copula (pure dependence structure in [0,1]ᴺ space) and (2) arbitrary marginals. The probability integral transform (F(X) ~ U[0,1]) is the bridge. Gaussian copula infamously underestimated tail dependence in 2008 CDO crisis — Student-t copulas better capture "extremes cluster" behavior.

Eigendecomposition of Σ: Eigenportfolios (eᵢ'X) are uncorrelated and variance-ranked (Var = λᵢ). This is just PCA on returns. Empirically: one dominant eigenvalue (the market), a few sector factors, then a bulk of near-zero eigenvalues that are mostly estimation noise (random matrix theory). Ill-conditioning (large λ_max/λ_min) causes optimization to blow up on those noisy directions.

Key Ch 3 insight: The largest eigenportfolio is a proxy for "the market" — lending empirical support to simple CAPM. Correlations are unstable (regime-dependent) and spike in crises, which copula models must account for or fail catastrophically.

My approach: I'm comfortable with the calculus and linear algebra; I'm here for the finance intuition and the book's arc toward coherent risk measures. I skipped the Ch 2 exercises (PDF integration drills) since later chapters use distributions as tools rather than deriving them.

Prompt: 24

Please read formatting_instructions.md in this directory and apply it throughout this dialog.

Prompt: 31

Sorry, I mean search every message in this dialog and make the relevant transformations described in those rules.

Note: 6

Chapter 4

Note: 472

Ordinary Least Squares

Our work in Ordinary Least Squares (OLS) begins with the Capital Asset Pricing Model (CAPM). Fitting the parameters of the model to historical data reduces to a calculus problem together with an assumption about the so-called idiosyncratic error terms. After establishing the methodology for calculating the CAPM parameters, we look at the implications of the model, giving testable hypotheses which we examine with market data. We find, however, as in Fama-French [10], for example, that these hypotheses are not borne out in the market.

The approach used to fit a CAPM model is justified through the establishment of the method of maximum likelihood, which, more generally, justifies our approach in OLS. Our treatment of the subject is principally mathematical, relying in large part on the properties of projection operators.

We discuss and derive from three main assumptions (here identified as the Gauss-Markov assumptions and a sometimes-accompanying distributional assumption) the distributional properties of estimators of regression parameters. These distributions allow us to determine confidence intervals for our estimators as well as establish so-called null hypothesis tests [18, 6]. Our primary test statistics will be the t-test and F-test, which we build from first principles. The latter test statistic lends itself to model selection criteria, and we build upon this in our presentation of forward selection, backward elimination, and stepwise regression techniques.

We conclude with an introductory treatment on a generalization of ordinary least squares – aptly named Generalized Least Squares (GLS).

Note: 424

4.1 CAPM and Least Squares

The Capital Asset Pricing Model (CAPM) [22, 31] relates the time series of a given stock’s returns, \(\{r_t\}\) , to the returns of the contemporaneous market, \(\{m_t\}\) , as

\[r_t - r_f = \beta(m_t - r_f) + \epsilon_t, \tag{4.1}\]

where \(r_f\) is the risk-free rate, the rate at which one may borrow with a probability of default on the bond being zero. Treasury bills are the obvious and most likely candidate here, the dynamics of the US debt notwithstanding.

We will not at present make any assumptions about the distribution of \(\{\epsilon_t\}\) . However, we will assume that the \(\epsilon_t\) 's are iid, and share the same distribution as some \(\epsilon\) . Similarly, we will assume that the market returns, \(m_t\) , are iid and distributed as some \(m\). Finally, we assume as before that

\[\begin{aligned} \text{Cov}(m, \epsilon) &= 0 \\ \text{Var}(\epsilon) &= \sigma_\epsilon^2 \\ \text{Var}(m) &= \sigma_m^2. \end{aligned}\]
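A quick simulation sketch of model (4.1) under these assumptions can make the setup concrete. All parameter values below are illustrative choices, not the book's data; the check is that the sample covariance between market and error terms stays near zero and that the total variance splits into a market piece and an idiosyncratic piece.

```python
import numpy as np

# Simulate (4.1): r_t - r_f = beta * (m_t - r_f) + eps_t,
# with iid market returns, iid errors, and Cov(m, eps) = 0.
# Parameter values are illustrative assumptions, not from the book.
rng = np.random.default_rng(0)
T = 100_000
r_f = 0.0003                     # assumed monthly risk-free rate
beta = 1.2                       # assumed market sensitivity
sigma_m, sigma_eps = 0.04, 0.02

m = r_f + 0.005 + sigma_m * rng.standard_normal(T)   # market returns, E(m) > r_f
eps = sigma_eps * rng.standard_normal(T)             # idiosyncratic terms
r = r_f + beta * (m - r_f) + eps                     # security returns per (4.1)

sample_cov = np.cov(m, eps)[0, 1]   # near zero: the draws are independent
var_r = r.var(ddof=1)               # near beta^2 * sigma_m^2 + sigma_eps^2
```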

Note: 558

We consider the case where we have observations \(\{r_t\}_{t=1}^T\) and \(\{m_t\}_{t=1}^T\) , drawn from \(r\) and \(m\), respectively, and seek an estimate, \(\hat{\beta}\) of \(\beta\). We have already seen in (3.11) that

\[\beta = \rho \frac{\sigma_r}{\sigma_m},\]

and we made mention in (3.14) that a method of moments estimator for \(\beta\) is

\[\hat{\beta} = \hat{\rho} \frac{\hat{\sigma}_r}{\hat{\sigma}_m}.\]

Here we show that this estimator coincides with what is called the least squares estimate for reasons that will become clear momentarily.

Let \(f\) be a function of the as-yet unknown \(\beta\) given by

\[f(\beta) = \sum_{t=1}^T (r_t - r_f - \beta(m_t - r_f))^2. \quad (4.2)\]

Mathematically, finding the value which minimizes \(f(\cdot)\) is an attractive and perhaps natural proposition. We know from calculus that \(f(\cdot)\) is parabolic and hence has a single minimizer, which may be found where the derivative \(f'(\cdot)\) is zero. We will now show that this minimizer is the method of moments estimator of \(\beta\) , why this link to least squares minimization exists, and that this minimizer is an unbiased estimator of \(\beta\) .

Note: 1,765

4.1.1 Minimizing \(f(\cdot)\)

The minimizer of \(f(\cdot)\) above is exactly the method of moments estimator of \(\beta\) . The result follows from calculus and the assumption that the sample covariance between \(\epsilon\) and \(m\) ,

\[\hat{\sigma}_{\epsilon m} = \frac{1}{T-1} \sum_{t=1}^T (m_t - \bar{m})(\epsilon_t - \bar{\epsilon})\]

satisfies \(\hat{\sigma}_{\epsilon m} = 0\) , with \(\bar{m}\) and \(\bar{\epsilon}\) the estimators for the mean we have seen previously. For notation, we denote

\[\begin{aligned} x_t &= m_t - r_f \\ y_t &= r_t - r_f. \end{aligned}\]

Notice that

\[\bar{x} = \frac{1}{T} \sum_{t=1}^{T} x_t = \bar{m} - r_f\] \[\bar{y} = \frac{1}{T} \sum_{t=1}^{T} y_t = \bar{r} - r_f,\]

and since variance and covariance are both shift independent,

\[\hat{\sigma}_x^2 = \hat{\sigma}_m^2\] \[\hat{\sigma}_y^2 = \hat{\sigma}_r^2\] \[\hat{\sigma}_{xy} = \hat{\sigma}_{rm}\] \[\hat{\sigma}_{x\epsilon} = \hat{\sigma}_{m\epsilon},\]

where the estimators of variance are as in (2.43).

In these new variables, \(f(\cdot)\) becomes

\[f(\beta) = \sum_{t=1}^{T} (y_t - \beta x_t)^2.\]

The minimum of \(f(\cdot)\) occurs when the derivative is set to zero since the function is quadratic in \(\beta\) , hence we solve

\[f'(\beta) = -2 \sum_{t=1}^{T} (y_t - \beta x_t) \cdot x_t = 0,\]

which, since \(y_t - \beta x_t = \epsilon_t\) , is equivalent to

\[\sum_{t=1}^{T} \epsilon_t \cdot x_t = 0. \quad (4.3)\]

Notice that for any sequences of random variables \(\{u_t\}\) and \(\{v_t\}\) ,

\[\sum_{t=1}^{T} u_t v_t = (T-1) \hat{\sigma}_{uv} + T \bar{u} \bar{v},\]

and

\[\sum_{t=1}^{T} u_t^2 = (T-1) \hat{\sigma}_u^2 + T \bar{u}^2.\]

These relationships are left to the reader.
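Both identities follow from expanding the definition of the sample covariance and variance; a minimal numerical check on synthetic data (no market interpretation intended):

```python
import numpy as np

# Verify: sum(u_t v_t) = (T-1) * cov_hat(u, v) + T * u_bar * v_bar
#     and sum(u_t^2)   = (T-1) * var_hat(u)    + T * u_bar^2
rng = np.random.default_rng(1)
T = 250
u = rng.standard_normal(T)
v = rng.standard_normal(T)

lhs_uv = np.sum(u * v)
rhs_uv = (T - 1) * np.cov(u, v)[0, 1] + T * u.mean() * v.mean()

lhs_uu = np.sum(u ** 2)
rhs_uu = (T - 1) * u.var(ddof=1) + T * u.mean() ** 2
```

Both pairs agree to floating-point precision, since these are exact algebraic identities rather than approximations.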

Continuing, we have

\[\sum_{t=1}^{T} (y_t - \beta x_t) \cdot x_t = \sum_{t=1}^{T} y_t x_t - \beta \sum_{t=1}^{T} x_t^2\] \[= (T-1) \hat{\sigma}_{xy} + T \bar{x} \bar{y} - \beta \left( (T-1) \hat{\sigma}_x^2 + T \bar{x}^2 \right)\] \[= (T-1) \left( \hat{\sigma}_{xy} - \beta \hat{\sigma}_x^2 \right) + T \left( \bar{x} \bar{y} - \beta \bar{x}^2 \right).\]

Our assumption that \(\hat{\sigma}_{m\epsilon}=0\) , or equivalently, that \(\hat{\sigma}_{x\epsilon}=0\) gives

\[\sum_{t=1}^T x_t \epsilon_t - T \bar{x} \bar{\epsilon} = 0.\]

But since we know from (4.3) that \(\sum_{t=1}^T x_t \epsilon_t = 0\) , this gives that \(\bar{x}\bar{\epsilon}=0\) as well. Returning to our formulation above, we see then that

\[\begin{aligned}(T-1)(\hat{\sigma}_{xy} - \beta \hat{\sigma}_x^2) + T(\bar{x}\bar{y} - \beta \bar{x}^2) &= (T-1)(\hat{\sigma}_{xy} - \beta \hat{\sigma}_x^2) + T\bar{x}(\bar{y} - \beta \bar{x}) \\&= (T-1)(\hat{\sigma}_{xy} - \beta \hat{\sigma}_x^2) + T\bar{x}\bar{\epsilon} \\&= (T-1)(\hat{\sigma}_{xy} - \beta \hat{\sigma}_x^2).\end{aligned}\]

Solving \(f'(\beta)=0\) therefore amounts to solving

\[\hat{\sigma}_{xy} - \beta \hat{\sigma}_x^2 = 0,\]

which yields

\[\beta = \frac{\hat{\sigma}_{xy}}{\hat{\sigma}_x^2}\]

as expected.
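Setting \(f'(\beta)=0\) directly also gives the closed form \(\hat{\beta} = \sum_t x_t y_t / \sum_t x_t^2\), and the residuals then satisfy (4.3) by construction. A sketch on synthetic data (the true \(\beta\) of 1.1 and the volatilities are illustrative assumptions):

```python
import numpy as np

# No-intercept least squares: beta_hat = sum(x*y) / sum(x**2).
# The residuals satisfy the first-order condition (4.3),
# sum(x_t * eps_t) = 0, exactly -- no distributional assumption needed.
rng = np.random.default_rng(2)
T = 500
x = 0.05 * rng.standard_normal(T)            # excess market returns, m_t - r_f
y = 1.1 * x + 0.02 * rng.standard_normal(T)  # excess stock returns, r_t - r_f

beta_hat = np.sum(x * y) / np.sum(x ** 2)
eps_hat = y - beta_hat * x

foc = np.sum(x * eps_hat)   # vanishes by construction
```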

Prompt: 30

Calculus is clear, note shifted variables: \(x=m-r_f, y=r-r_f\)

Assistant: 73

The key assumption doing the heavy lifting here is \(\hat{\sigma}_{x\epsilon} = 0\) (sample covariance between market and errors is zero), which forces the least-squares solution to match the method-of-moments estimator.

Prompt: 31

Well, yeah - if $\epsilon $ is correlated to the market then what's the point of the model?

Assistant: 123

Exactly right. The whole point of CAPM is to decompose returns into:

  • Systematic: \(\beta(m - r_f)\) — the part explained by market exposure
  • Idiosyncratic: \(\epsilon\) — everything else (company-specific news, etc.)

If \(\epsilon\) were correlated with \(m\), that correlation would by definition belong in the systematic term.

Note: 241

It is critical to point out that we used an assumption that the error terms \(\{\epsilon_t\}\) had zero correlation with the market returns \(\{m_t\}\) . Without this assumption, the estimate of \(\beta\) from this least squares approach keeps the terms eliminated in the final parts of our derivation; namely, we arrive at a solution

\[\beta = \frac{(T-1)\hat{\sigma}_{xy} + T\bar{x}\bar{y}}{(T-1)\hat{\sigma}_x^2 + T\bar{x}^2}.\]

Implicitly (and maybe naturally), we minimized the function \(f(\cdot)\) with the constraint of idiosyncrasy of the error terms.

Note: 57

If we consider an alternative model,

\[r_t - r_f = \alpha + \beta(m_t - r_f) + \epsilon_t, \quad (4.4)\]

Prompt: 52

Whoa - decomposing the idiosyncratic term into ... oh, \(\alpha\) is constant, so we're pulling out the time-mean of the previously-defined epsilon?

Assistant: 186

Exactly. In the original model, \(\epsilon\) could have a non-zero mean. Now we're writing:

\[\epsilon_{\text{old}} = \alpha + \epsilon_{\text{new}}\]

where \(\alpha = \mathbb{E}[\epsilon_{\text{old}}]\) and \(\epsilon_{\text{new}}\) is zero-mean.

This \(\alpha\) is the famous "alpha" — excess return beyond what the market exposure (\(\beta\)) predicts. If \(\alpha > 0\), the asset is outperforming its CAPM benchmark. That's what active managers are chasing.

Prompt: 42

Actually Chris hasn't said yet that \(\bar{\epsilon}=0\), but I don't see the point of naming \(\alpha\) if it isn't.

Assistant: 132

Right — without imposing \(\mathbb{E}[\epsilon] = 0\), \(\alpha\) isn't uniquely identified. You could shift any constant between them. The convention is that \(\alpha\) absorbs all the constant term, which implicitly forces \(\mathbb{E}[\epsilon] = 0\).

Chris will likely make this explicit when he sets up the estimation problem — it becomes a constraint that pins down the intercept.

Note: 880

and look again at the least squares function, now in two variables,

\[f(\alpha, \beta) = \sum_{t=1}^T (r_t - r_f - (\alpha + \beta(m_t - r_f)))^2, \quad (4.5)\]

we may solve for where the gradient, \(\nabla f\) , is zero,

\[\nabla f(\alpha, \beta) = \begin{pmatrix} \frac{\partial f}{\partial \alpha}(\alpha, \beta) \\ \frac{\partial f}{\partial \beta}(\alpha, \beta) \end{pmatrix} = 0,\]

to obtain a system of equations in \(\alpha\) and \(\beta\) . For the former, we see, using the same notation as before, that

\[\frac{\partial f}{\partial \alpha}(\alpha, \beta) = -2 \sum_{t=1}^N (r_t - r_f - (\alpha + \beta(m_t - r_f))).\]

Setting this equal to zero gives

\[\alpha = \bar{r} - r_f - \beta(\bar{m} - r_f).\]

We leave it to the reader to show that

\[\beta = \frac{\hat{\sigma}_{rm}}{\hat{\sigma}_m^2}.\]

We denote these estimates by \(\hat{\alpha}\) and \(\hat{\beta}\) , respectively.

The intercept term in (4.4) has other implications. In particular for the least squares estimators \((\hat{\alpha}, \hat{\beta})\) , the estimators of \(\{\epsilon_t\}\) , \(\{\hat{\epsilon}_t\}\) , satisfy (without explicit assumption)

\[\bar{\epsilon} = 0\] \[\hat{\sigma}_{m\epsilon} = 0.\]

Showing the first equation is straightforward:

\[\bar{\epsilon} = \frac{1}{T} \sum_{t=1}^{T} (r_t - r_f - \hat{\alpha} - \hat{\beta}(m_t - r_f))\] \[= \frac{1}{T} \sum_{t=1}^{T} (r_t - r_f - \hat{\beta}(m_t - r_f)) - \hat{\alpha}\] \[= \bar{r} - r_f - \hat{\beta}(\bar{m} - r_f) - \hat{\alpha},\]

which is zero by the construction of \(\hat{\alpha}\) . The second equation follows from this and the partial derivative with respect to \(\beta\) of (4.5) being zero.
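Both properties are exact algebraic consequences of the first-order conditions, which a quick numerical sketch confirms (synthetic data; the parameter values are illustrative):

```python
import numpy as np

# Intercept model (4.4): the first-order conditions give
#   beta_hat = cov_hat(x, y) / var_hat(x),  alpha_hat = y_bar - beta_hat * x_bar,
# and the residuals then satisfy eps_bar = 0 and cov_hat(x, eps) = 0
# automatically, with no extra assumption.
rng = np.random.default_rng(3)
T = 500
x = 0.05 * rng.standard_normal(T)                    # m_t - r_f
y = 0.002 + 0.9 * x + 0.02 * rng.standard_normal(T)  # r_t - r_f

beta_hat = np.cov(x, y)[0, 1] / x.var(ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()
eps_hat = y - alpha_hat - beta_hat * x

eps_bar = eps_hat.mean()              # zero by construction of alpha_hat
cov_x_eps = np.cov(x, eps_hat)[0, 1]  # zero by the beta first-order condition
```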

Prompt: 7

And there it is.

Assistant: 60

Yep — the zero-mean constraint we were expecting falls out automatically from the first-order condition on \(\alpha\). And the zero-correlation with market falls out from the first-order condition on \(\beta\).

Note: 1,221

The Capital Asset Pricing Model [22, 31] is a normative model: the model in (4.1) omits an intercept term and defines the relationship of the idiosyncratic terms \(\{\epsilon_t\}\) to the market through assumptions. This is not a model derived from observations and relationships in the market, but based on assumptions of how a market ought to perform. We prefer a somewhat nontraditional definition, relating a given security's adjusted returns to the market as in (4.4). The allowance for nonzero \(\alpha\) has serious implications, though, which we address momentarily. The other major defining features of the model are obtained rather than assumed in this formulation, however. This intersection of parsimony and flexibility is hard to pass up.

We can formulate the relationship between random variables implied by (4.4), as

\[r - r_f = \alpha + \beta(m - r_f) + \epsilon.\]

Maintaining assumptions of idiosyncrasy of \(\epsilon\) , which follow from the least squares estimators but are not properties of the model as such, and taking expectations on both sides, we get

\[\mathbb{E}(r - r_f) = \alpha + \beta(\mathbb{E}(m) - r_f),\]

or

\[\mathbb{E}(r) = r_f + \alpha + \beta(\mathbb{E}(m) - r_f).\]

This last line is intimately involved with the Efficient Market Hypothesis [28, 2]. In the case that we have a fully explanatory model, a nonzero \(\alpha\) should be precluded in an efficient market. Any alternative allows for securities which give opportunities for risk free excess returns. Consider the case where \(\alpha > 0\) . A portfolio, \(\Pi\) , with returns \(r_\Pi\) , long one unit of the security with returns \(r\) and short \(\beta\) units of the market will have an expected return

\[\mathbb{E}(r_\Pi) = r_f + \alpha.\]

That is, the expected return of the portfolio will exceed the risk free rate while having no linear relationship to the market. Here we have used the fact that the \(\beta\) of a portfolio is the weighted sum of the \(\beta\) s of its positions. We forgo requiring a proof of this until we have established more optimization techniques as the current applications find a direct and amenable analogue there.
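A simulation sketch of this hedge (illustrative parameters, not the book's data): generating returns from (4.4) with a positive \(\alpha\), the long-one-unit, short-\(\beta\)-units portfolio earns roughly \(r_f + \alpha\) in expectation while showing a market beta near zero.

```python
import numpy as np

# Hedged portfolio: long one unit of a security with alpha > 0,
# short beta units of the market.  Its excess return is alpha + eps,
# so E(r_pi) = r_f + alpha and its market beta is zero.
# Parameter values are illustrative assumptions.
rng = np.random.default_rng(4)
T = 200_000
r_f, alpha, beta = 0.0003, 0.001, 1.3
m = r_f + 0.004 + 0.04 * rng.standard_normal(T)   # market returns
eps = 0.02 * rng.standard_normal(T)               # idiosyncratic terms
r = r_f + alpha + beta * (m - r_f) + eps          # security returns per (4.4)

r_pi = r_f + (r - r_f) - beta * (m - r_f)         # long 1, short beta units
beta_pi = np.cov(r_pi, m)[0, 1] / m.var(ddof=1)   # near zero
mean_excess = r_pi.mean() - r_f                   # approaches alpha
```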

We say that such a portfolio provides an opportunity for statistical arbitrage: positive returns are expected in excess of those explained by selected factors. In our current model, the only factor (usually called a risk factor) is the market proper. A particular type of statistical arbitrage often claimed by academics and practitioners alike involves cross-sectional comparisons of fundamental or technical data; e.g., comparing ratios like earnings before interest and taxes over enterprise value at a given time for a set of stocks under consideration (we will have a more formal treatment of these terms in a following chapter). We will call these types of statistical arbitrage opportunities anomalies.

Empirically, the omission of \(\alpha\) in (4.1) is a mistake. Equity anomalies abound. As such, we use (4.4) for the identification of anomalies rather than assume away their existence. Notice, however, that statistical arbitrage and anomalies are defined in relation to some model. The specification of that model, then, is critical in the assessment of perceived opportunities and statistical descriptions of any security or portfolio of securities. A consequence of the above is that one measure of a powerful descriptive model in the equity space is its ability to statistically identify an \(\alpha\) of zero for many of the widely accepted and observed anomalies in the literature (as determined by models such as CAPM or its extensions).

Prompt: 85

"A consequence of the above is that one measure of a powerful descriptive model in the equity space is its ability to statistically identify an α of zero for many of the widely accepted and observed anomalies in the literature (as determined by models such as CAPM or its extensions)."

Say what?

Assistant: 157

It's a bit convoluted, but the idea is:

CAPM finds α ≠ 0 for certain stocks (anomalies). A better model (more factors) might explain those same returns with α ≈ 0 — the "anomaly" was just a missing factor.

So the test of a good model is: does it make previously-anomalous alphas disappear? That's exactly what Fama-French does — the size and value factors absorb much of what looked like alpha under plain CAPM.

Prompt: 49

Oh - a better model explains more variance. But back in Ch 1 or 2 didn't Chris complain that the trend was toward models with more and more factors?

Assistant: 213

Yes — there's a tension here. More factors → less unexplained variance → alphas shrink. But that opens two problems:

  1. Overfitting: With enough factors you can "explain" anything, including noise
  2. Factor zoo: Hundreds of published factors, many likely spurious (data mining)

The challenge is finding factors that are economically meaningful (like size and value arguably are) versus statistically convenient. Fama-French stopped at 3–5 factors for a reason — beyond that, you're probably fitting noise rather than capturing real risk premia.

This connects back to the Variable Selection section coming up — how do you decide which factors earn their place?

Prompt: 36

I hate to be pedantic (actually, I love being pedantic) but aren't your two problems the same problem?

Assistant: 111

You're right — they're really two facets of the same thing. Overfitting is when you add too many factors and fit noise. Factor zoo is when the field collectively does the same thing via publication bias (only "significant" factors get published, but run enough tests and you'll find spurious ones).

Same disease, individual vs. collective strains.

Note: 144

Practitioners will often refer to the \(\alpha\) of a given security, portfolio, strategy, or idea. These references necessarily presuppose some underlying model description – oftentimes the assumed model is CAPM. Similarly, \(\beta\) is also used as a descriptive characteristic for securities and portfolios alike. In practice, references to \(\beta\) as such are made to the sensitivity to the market; i.e., CAPM- \(\beta\) .

Prompt: 84

They should presuppose some underlying model description, but isn't alpha often relative to a vague wave toward "everyone else's models"? At least it seems that way to me hearing terms that filter through to non-professionals, maybe the pros understand it more rigorously.

Assistant: 169

You're spot on. In rigorous usage, α is always relative to a specific model — CAPM-α, Fama-French-α, etc. But in casual usage (marketing materials, financial media), "alpha" often just means "outperformance" relative to some vaguely-defined benchmark or peer group.

That vagueness is convenient for fund managers — harder to disprove "we generate alpha" if you never specify the model. A skeptic would ask: alpha relative to what factors? And over what period? With what statistical significance?

Note: 784

4.1.2 Interpretations and Implications

The Capital Asset Pricing Model (without intercept but with idiosyncratic assumptions) implies several relationships between returns and volatility, idiosyncratic volatility, correlation to the market, and \(\beta\). The model

\[r - r_f = \beta(m - r_f) + \epsilon\]

gives under the CAPM assumptions that

\[\begin{aligned}\mathbb{E}(r) &= r_f + \beta(\mathbb{E}(m) - r_f) \\ &= r_f + \rho \frac{\sigma_r}{\sigma_m} (\mathbb{E}(m) - r_f) \\ &= r_f + \rho \sigma_r \frac{\mathbb{E}(m) - r_f}{\sigma_m},\end{aligned}\]

where \(\rho = \text{Corr}(r, m)\) . The term \(\frac{\mathbb{E}(m) - r_f}{\sigma_m}\) will make another appearance when we look at the so-called Sharpe Ratio. For now, however, we note that if the expected return of the market exceeds the risk free rate¹, then \(\mathbb{E}(r)\) is an increasing function of \(\beta\) , \(\sigma_r\) , and correlation to the market. When studying the latter two cases empirically, care must be taken to control for correlation and \(\sigma_r\) , respectively.

¹ A very minor assumption in theory, almost by fiat, but one that may be violated using estimators over unfortunate time windows.

We may also solve for \(\sigma_r\) , obtaining a similar result,

\[\sigma_r = \sigma_m \frac{1}{\rho} \frac{\mathbb{E}(r) - r_f}{\mathbb{E}(m) - r_f}.\]

Here, as the expected return increases, leaving all else fixed, so does the volatility. Again, when testing this relationship empirically, a control for correlation must be in place.

The volatility of \(r\) may be decomposed into systemic and idiosyncratic components as

\[\begin{aligned}\text{Var}(r - r_f) &= \beta(m - r_f) + \epsilon \\ \sigma_r^2 &= \beta^2 \sigma_m^2 + \sigma_\epsilon^2.\end{aligned}\]

Prompt: 31

On the second-to-last line, should there be a \(\text{Var}\) on the LHS as well?

Assistant: 112

Yes, that line is garbled. It should either be:

\[\text{Var}(r - r_f) = \text{Var}(\beta(m - r_f) + \epsilon)\]

...or just skip straight to the next line. Looks like a copy-paste artifact where the model equation snuck onto the RHS instead of the variance expansion.

Note: 54

As a result, controlling for \(\beta\) , idiosyncratic risk, \(\sigma_\epsilon\) , is positively and monotonically related to higher returns.

Prompt: 76

It is? I see \(\sigma^2_r=\beta^2\sigma^2_m+\sigma^2_{\epsilon}\), which I read as "idiosyncratic risk is positively and monotonically related to higher risk."

Assistant: 172

You're right to push back. That equation directly links σ_ε to σ_r (total variance), not to expected returns.

Under pure CAPM, idiosyncratic risk is not compensated — only systematic risk (β) earns a premium. The whole point is that you can diversify away idiosyncratic risk, so the market shouldn't pay you for bearing it.

Chris may be setting up an empirical claim that contradicts CAPM theory (which would be another "anomaly"). But as stated, the logic doesn't follow from that variance decomposition alone.

Note: 94

We have already seen that systemic risk, \(\sigma_m\) , cannot be reduced while idiosyncratic risk may be diversified away. This seems to be an apparent drawback to the model: the identification of idiosyncratic risk and return is diluted away by diversification, leaving only the market.

Prompt: 73

I didn't think the model admitted any idiosyncratic return. Maybe that's what he means? "If you did inadvertently stumble across some idiosyncratic return, but were following this model, you'd diversify it away"

Assistant: 181

That's a good reading. Under CAPM, E[ε] = 0, so there's no idiosyncratic return to capture — only idiosyncratic risk to eliminate.

But if you believe α exists (anomalies, skill, whatever), then diversification is a double-edged sword: it reduces idiosyncratic risk but also dilutes the alpha signal. Concentrated portfolios are riskier but necessary to express a view.

That's the active manager's dilemma — you need concentration to beat the market, but concentration exposes you to uncompensated risk.

Note: 124

We will see an elegant proof that ameliorates this concern on the surface, namely that the model gives preferential treatment to the market over any other security or portfolio. This proof, however, will explicitly state that all investors should be holders only of a risk free asset and the market, in varying proportion determined by their risk preferences. We postpone this proof until we have established Merton's mean variance optimization problem.

Prompt: 58

"ameliorates this concern" by reassuring us that there's no idiosyncratic return out there to find in the first place so we need not worry that we're diversifying it away?

Assistant: 111

Exactly. The proof will show: under CAPM assumptions, the optimal portfolio is the market (plus risk-free). So don't worry about diversifying away alpha — there isn't any to find. Just own the market and adjust your leverage to taste.

It's the theoretical foundation for passive investing. Of course, whether you believe the assumptions is another matter.

Prompt: 42

My 12-year-old has a small portfolio which she manages this way. I'll be very curious to learn whether that's a bad idea.

Assistant: 105

For a 12-year-old? Probably excellent. The empirical evidence is that most active managers underperform passive index funds after fees — and she's learning the right habits early.

The case against passive only gets interesting if you believe you have genuine informational or analytical edge. At 12, the lesson that markets are hard to beat is a gift.

Note: 37

In every case above, risk, as proxied by \(\sigma_r\) , is directly proportional to expected returns.

Prompt: 34

Did I miss it? Where was that demonstrated? Every case above ... did I miss it multiple times?

Assistant: 165

You didn't miss it. The only place it appears is in the first rearrangement:

\[\mathbb{E}(r) = r_f + \rho \sigma_r \frac{\mathbb{E}(m) - r_f}{\sigma_m}\]

That's one case showing E(r) increasing with σ_r (holding ρ fixed). The variance decomposition σ²_r = β²σ²_m + σ²_ε doesn't say anything about expected returns.

"Every case above" seems to be Chris being a bit loose.

Prompt: 169

Well let's focus on that one because it's fundamental. I get the risk free rate plus \(\beta\) times the market rate. If \(\rho\) is higher, my market-multiplier is higher because I'm tracking the market rate better. For fixed \(\rho\), if \(\sigma_r\) is higher, my market-multiplier is higher because ... why? It's like the 2x leveraged market case in chapter 1. When the market returns are positive, mine are even more positive. When negative, mine are even more negative. Ok that's right.

Assistant: 135

Exactly. With ρ fixed, higher σ_r means you're taking amplified bets on the market direction. Since E(m) > r_f (the equity risk premium), amplification pays off in expectation — you're getting more exposure to a positive-expected-value game.

The catch is that amplification works both ways, so your risk scales too. The market doesn't give you extra return for free — it's compensation for bearing that amplified volatility.

Note: 184

The implication of the CAPM model, then, and its interpretation within the framework of the Efficient Market Hypothesis, is that apparent excess returns are obtained by taking on risk.

This is not borne out in practice, however. In fact, the equity market prices risk perversely: higher-volatility names underperform lower-volatility names. This feature seems to generalize when one encounters the value, momentum, accrual, and fraud anomalies, for instance. In each of these cases, portfolios of a priori preferable names outperform less desirable names, and they do so with more attractive risk profiles.

Prompt: 66

So much for theory. But this belies the CAPM assertion that risk pays. It doesn't belie the implication (so far) that passive wins. (At least, not immediately; it might with some further work.)

Assistant: 145

Right. The low-volatility anomaly says CAPM mispredicts which stocks earn higher returns — but it doesn't immediately tell you how to exploit it profitably after costs, or whether the anomaly persists once known.

However, there's a thread worth pulling: if low-vol stocks systematically outperform, couldn't you just hold a low-vol index passively? That's "smart beta" — passive in execution, but tilted toward known anomalies. It blurs the active/passive line.

Prompt: 10

Hold blue chips not the market.

Assistant: 69

Essentially, yes. Blue chips tend to be lower volatility, so a blue-chip tilt is a crude low-vol strategy. Your daughter's passive approach might accidentally be doing this if she's in a large-cap index rather than total market.

Prompt: 54

Is the S&P500 large-cap or total market? Large cap, I think -- that's what defines it, right? Should I move her to the Russell 3000?

Assistant: 138

S&P 500 is large-cap — roughly the 500 largest US companies by market cap. Russell 3000 adds mid and small-cap, covering ~98% of US equities.

If you buy the low-vol anomaly, moving to Russell 3000 might actually hurt — you'd be adding the higher-volatility small-caps that empirically underperform on a risk-adjusted basis. Staying S&P 500 keeps her tilted toward the safer names.

Note: 1,881

In the next example we look at one of the implications of CAPM based on estimators observed in the equity markets over the last twenty years.

Example 4.1.1. Using the same cross sectional dataset we have used previously, spanning the period from 10/31/1997-5/31/2014 (200 months), we look at the average performance of volatility quartiles, recalculated monthly and held for one month. Our universe under consideration looks at the 1,000 largest US stocks by market cap each month – a liquid and tradeable set of securities.

| Volatility | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|
| Average Performance (Ann.) | 10.00% | 10.41% | 10.13% | 9.44% |
| Volatility (Ann.) | 12.07% | 16.20% | 19.58% | 32.28% |
| Performance per Vol Point (Ann.) | 0.83 | 0.64 | 0.52 | 0.29 |

From the table we see little to no evidence of compensation for realized volatility in the cross section. If anything, a quick visual inspection of the annualized average performance indicates that the market slightly prefers less volatile stocks to more volatile ones. The decrease is not monotonic in annual performance, however.

Looking more closely, though, if we consider the annual performance per volatility point – something akin to \(\frac{\mu}{\sigma}\) – we see a clear monotonic decrease as we go from lower to higher volatility names.
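The quartile construction in the example can be sketched with simulated data. Everything below is hypothetical (the universe, volatilities, and return process are made up for illustration); the book's actual dataset is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stocks, n_months = 1000, 200

# Hypothetical universe: monthly returns with stock-specific volatilities
true_vol = rng.uniform(0.03, 0.12, size=n_stocks)
returns = rng.normal(0.008, true_vol, size=(n_months, n_stocks))

# Estimate volatility on a trailing window, measure performance afterward
vol = returns[:100].std(axis=0)
perf = returns[100:].mean(axis=0)

# Sort stocks into volatility quartiles (0..3 = Q1..Q4)
edges = np.quantile(vol, [0.25, 0.5, 0.75])
quartile = np.searchsorted(edges, vol)

for q in range(4):
    mask = quartile == q
    ann_ret = perf[mask].mean() * 12
    ann_vol = vol[mask].mean() * np.sqrt(12)
    print(f"Q{q+1}: ann. return {ann_ret:.2%}, ann. vol {ann_vol:.2%}, "
          f"ratio {ann_ret / ann_vol:.2f}")
```

In this simulation returns are drawn independently of volatility, so the quartiles should show roughly equal average performance and the ratio column should decline mechanically; the book's empirical finding is that real data looks surprisingly similar.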

We have seen previously in our discussion of CAPM that the model implies a relationship between expected return and a product of correlation to the market and own-volatility. We did not control for correlation in this original analysis. This control can be approximated by performing so-called double sorts.

For each quartile of volatility, we may further subdivide into four quartiles by correlation. If equity returns follow the CAPM model, we would expect that within each volatility quartile, as we increase correlation (going from Quartile 1 to Quartile 4) we should see an increase in average annualized returns. And further, within each correlation quartile, an increase in volatility should correspond to an increase in expected return.

| Volatility\Correlation | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|
| Q1 | 10.17% | 10.30% | 10.62% | 8.94% |
| Q2 | 10.13% | 11.32% | 9.45% | 10.71% |
| Q3 | 10.47% | 9.31% | 8.84% | 11.91% |
| Q4 | 7.40% | 9.06% | 8.37% | 12.89% |

Visually, there does not appear to be evidence in the empirical table to support the hypothesis that, even when controlling for correlation, the market compensates investors for bearing risk in the equity markets – where, again, risk here is equated to realized own-volatility. We may conduct more rigorous statistical tests, but we postpone those slightly.

From the table above, there is evidence that over this period, ex post equity returns are negatively associated with ex ante volatility within each quartile of correlation, so that the control we established does not support the implications of the CAPM model.

Perhaps, though, we should focus again on some sort of risk-adjusted returns. We again take the ratio of the sample average return to the sample volatility and present the results in the table below.

| Volatility\Correlation | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|
| Q1 | 0.86 | 0.89 | 0.85 | 0.59 |
| Q2 | 0.70 | 0.72 | 0.57 | 0.53 |
| Q3 | 0.58 | 0.50 | 0.42 | 0.49 |
| Q4 | 0.25 | 0.27 | 0.24 | 0.36 |

The monotonicity is more dramatic here, but it runs in the direction opposite to the hypotheses of CAPM. In fact, we see that the most attractive performance per unit of risk is to be found in the lowest quartile of correlation within the lowest quartile of volatility.

One may conduct the same analysis as in the preceding example sorting by estimated \(\beta\) from (4.4). Based on the results here, one might expect that the empirical relationship between \(\beta\) and ex post returns is again negative. Interestingly, the double sort methodology in \(\beta\) and correlation does not produce the same results as the volatility/correlation double sorts. In particular, in the period under study, equity market returns show a preference for low \(\beta\) with high correlation.

At first glance, this seems to be counter-intuitive. However, we may interpret \(\beta\) as the sensitivity of an asset’s return to movements in the market. Correlation, on the other hand, simply measures the linear relationship between the asset and market’s returns. A high correlation indicates the asset’s returns can in fact be explained by the market.

An alternative approach is to resort to (3.11), where we may rearrange terms to find

\[\rho=\beta\frac{\sigma_m}{\sigma_r},\]

so that, for fixed \(\beta\) and market volatility, a higher correlation indicates a lower own-volatility. These findings are left to the reader to verify.
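The rearranged identity can be checked numerically under a simulated one-factor model (the 0.8 loading and volatilities are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
m = rng.normal(0.0, 0.04, size=5000)             # market returns (simulated)
r = 0.8 * m + rng.normal(0.0, 0.03, size=5000)   # asset under a one-factor model

beta = np.cov(r, m, ddof=1)[0, 1] / np.var(m, ddof=1)
rho = np.corrcoef(r, m)[0, 1]

# Identity (3.11) rearranged: rho = beta * sigma_m / sigma_r
assert np.isclose(rho, beta * np.std(m, ddof=1) / np.std(r, ddof=1))
```

The identity is exact in-sample (not just in expectation) as long as \(\beta\), \(\rho\), and the standard deviations are computed with consistent normalizations.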

Note: 2,422

4.2 Maximum Likelihood

In the preceding section we obtained estimates for \(\alpha\) and \(\beta\) in CAPM models by finding critical points (minima) of quadratic functions. While we were able to relate these values to method of moments estimators, we did not supply a justification for the usage of least squares. Here we establish one such justification.

Let \(\{X_i\}_{i=1}^N\) be random variables with joint density \(f_\theta(\cdot)\) , where \(\theta\) is some vector of parameters which determine the density. For example, if \(f_\theta(\cdot)=\phi_{\mu,\Sigma}(\cdot)\) , \(\theta=(\mu,\Sigma)\) . Or, in the case of a Student \(t\) distribution, \(\theta\) may include the degrees of freedom. We have become familiar with the need for estimators, and the present case will be no exception. In particular, we identify an estimator \(\hat{\theta}\) .

Given observations \(x_1,\dots,x_N\) , the likelihood function based on these observations is

\[L(\theta)=f_\theta(x_1,\dots,x_N).\] (4.6)

The maximum likelihood estimator of \(\theta\) , \(\hat{\theta}\) , is the maximizer of

\[\max_\theta f_\theta(x_1,\dots,x_N).\] (4.7)

The general case is rarely tractable directly. A common implementation of maximum likelihood estimation assumes that \(\{X_i\}_{i=1}^N\) are iid, reducing the joint density \(f_\theta(\cdot)\) to a product of identical marginals,

\[f_\theta(x_1,\dots,x_N)=\prod_{i=1}^N\tilde{f}_\theta(x_i).\] (4.8)

Maximizing a product directly is generally difficult. Logarithms transform products into sums, preserve order, and are continuous on their domain. As such, the maximizer of

\[\max_\theta \log(f_\theta(x_1,\dots,x_N))\] (4.9)

is exactly the maximizer of (4.7). In the case of iid \(\{X_i\}_{i=1}^N\) , taking the logarithm of (4.8) gives

\[\log(f_\theta(x_1,\dots,x_N))=\sum_{i=1}^N\log(\tilde{f}_\theta(x_i)).\] (4.10)

We call the logarithm of the likelihood function \(\log(L(\theta))\) the log-likelihood function.

Example 4.2.1. Let \(\{X_i\}_{i=1}^N\) be univariate normal random variables, and assume that the \(X_i\) 's are iid. We have that \(\theta=(\mu,\sigma)\) . The likelihood function given observations \(\{x_i\}_{i=1}^N\) is

\[L(\mu,\sigma)=\prod_{i=1}^N\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)\]

or

\[L(\mu,\sigma)\propto\prod_{i=1}^N\frac{1}{\sigma}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)\]

where \(\propto\) denotes proportional to; viz., maximizing \(L(\cdot)\) is equivalent to maximizing \(c \cdot L(\cdot)\) for any positive constant \(c\) , so we may eliminate the normalization factor \(\frac{1}{\sqrt{2\pi}}\) . The log-likelihood function becomes

\[l_0(\mu, \sigma) = - \sum_{i=1}^{N} \left[ \log(\sigma) + \frac{(x_i - \mu)^2}{2\sigma^2} \right].\]

Now maximizing \(l_0(\cdot)\) is equivalent to minimizing

\[l(\mu, \sigma) = \sum_{i=1}^{N} \left[ \log(\sigma) + \frac{(x_i - \mu)^2}{2\sigma^2} \right].\]

As before, the minimum will occur exactly when \(\nabla l = 0\) .

Looking at the \(\frac{\partial l}{\partial \mu}\) component of the gradient, we solve

\[-\frac{1}{\sigma^2} \sum_{i=1}^{N} [x_i - \mu] = 0,\]

giving the estimator we have seen previously:

\[\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i.\]

The maximum likelihood estimator of \(\sigma^2\) , however, is given by

\[\hat{\sigma}_N^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2,\]

a biased estimator of the variance.
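The bias can be seen numerically; a short sketch with simulated normal data (the parameters 1.5 and 2.0 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(1.5, 2.0, size=400)

mu_hat = x.mean()                                 # MLE of mu: the sample mean
sigma2_mle = ((x - mu_hat) ** 2).mean()           # MLE of sigma^2: divides by N
sigma2_unbiased = ((x - mu_hat) ** 2).sum() / (len(x) - 1)  # divides by N - 1

assert np.isclose(sigma2_mle, np.var(x))          # numpy's default ddof=0 is the MLE
assert np.isclose(sigma2_unbiased, np.var(x, ddof=1))
assert sigma2_mle < sigma2_unbiased               # the MLE is biased low
```

The two estimators differ by the factor \(N/(N-1)\), which vanishes as \(N \to \infty\): the MLE is biased but consistent.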

Maximum likelihood also finds application in the model

\[y_t = \beta' x_t + \epsilon_t\]

where \(\beta \in \mathbb{R}^p\) , \(x_t \in \mathbb{R}^p\) is nonrandom, and the \(\{\epsilon_t\}\) are iid distributed as

\[\epsilon \sim N(0, \sigma^2).\]

In this case, the likelihood function in \((\beta, \sigma^2)\) may be written as,

\[\begin{aligned} L(\beta, \sigma^2) &\propto \prod_{t=1}^{N} \frac{1}{\sigma} \exp \left( -\frac{\epsilon_t^2}{2\sigma^2} \right) \\ &= \prod_{t=1}^{N} \frac{1}{\sigma} \exp \left( -\frac{(y_t - \beta' x_t)^2}{2\sigma^2} \right). \end{aligned}\]

The log-likelihood function is

\[l_0(\beta, \sigma^2) = -\sum_{t=1}^N \left[ \log(\sigma) + \frac{(y_t - \beta' x_t)^2}{2\sigma^2} \right]\]

which is maximized exactly when the function

\[l(\beta, \sigma^2) = N \log(\sigma) + \frac{1}{2\sigma^2} \sum_{t=1}^N (y_t - \beta' x_t)^2\]

is minimized. The minimum occurs when the gradient is identically zero,

\[\frac{\partial l}{\partial \beta_i} = 0\]

for \(i = 1, \dots, p\) . This coincides exactly with the minimizer of

\[f(\beta) = \sum_{t=1}^N (y_t - \beta' x_t)^2\]

so that minimizing the quadratic function \(f(\cdot)\) coincides with the maximum likelihood estimator when residuals are independent and normally distributed. We will explicitly state when this assumption is necessary in our proofs that follow.

Prompt: 21

Bottom line: so much for theory (at least for CAPM)

Assistant: 135

Right. CAPM as a predictive model of returns fails empirically — higher risk doesn't earn higher reward, and often the opposite.

But CAPM (and OLS) as descriptive machinery is still useful: estimate β to measure market exposure, estimate α to identify anomalies, decompose variance into systematic vs. idiosyncratic. The math is fine; it's the efficient-market interpretation that doesn't survive contact with data.

Note: 394

4.3 OLS

Consider the model

\[y_t = \beta' x_t + \epsilon_t \quad (4.11)\]

with \(\beta \in \mathbb{R}^p\) , \(x_t \in \mathbb{R}^p\) nonrandom, \(\{\epsilon_t\}\) univariate random variables. In particular, we assume that each \(y_t\) is a univariate random variable, and that the randomness is driven by \(\epsilon_t\) . We may rewrite the above as

\[Y = X\beta + \epsilon \quad (4.12)\]

with

\[Y = \begin{pmatrix} y_1 \\ \vdots \\ y_T \end{pmatrix}, \quad X = \begin{pmatrix} x_{11} & \dots & x_{1p} \\ \vdots & & \vdots \\ x_{T1} & \dots & x_{Tp} \end{pmatrix}, \quad \epsilon = \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_T \end{pmatrix}.\]

Here, \(\epsilon\) is now a random vector.

Note: 465

We call the estimator, \(\hat{\beta}\) , obtained from minimizing the quadratic function

\[\min_{\beta} ||Y - X\beta||^2 \quad (4.13)\]

the ordinary least squares (OLS) estimate of \(\beta\) . We have seen in the previous section that in the case that \(\epsilon \sim N(0, \sigma^2 I)\) , the OLS estimate and maximum likelihood estimate coincide.

The following assumptions are collectively referred to as the Gauss-Markov assumptions. We have

  1. \(X \in \mathbb{R}^{N \times p}\) is a nonrandom matrix with \(N > p\) and full rank.
  2. \(\epsilon\) is a random vector with \(\mathbb{E}(\epsilon) = 0\) .
  3. \(\text{Var}(\epsilon_t) = \sigma^2\) for all \(t\) , and \(\text{Cov}(\epsilon_i, \epsilon_j) = 0\) for \(i \neq j\) ; i.e., \(\text{Cov}(\epsilon) = \sigma^2 I\) .

In addition to the above, we will resort to a distributional assumption for \(\epsilon\) when we develop distributional properties for \(\hat{\beta}\). In particular, we will assume that \(\epsilon \sim N(0, \sigma^2 I)\).

Note: 2,889

Theorem 4.3.1. Under the first two Gauss-Markov assumptions, the OLS estimate of \(\beta\) given by

\[\hat{\beta} = (X'X)^{-1}X'Y \quad (4.14)\]

is unbiased.

Proof. Since \(X\) is full rank, we know that \(X'X\) is invertible. Looking at the equation

\[Y = X\beta + \epsilon\]

then, we have

\[\begin{aligned} X'Y &= X'X\beta + X'\epsilon \\ (X'X)^{-1}X'Y &= \beta + (X'X)^{-1}X'\epsilon. \end{aligned}\]

Defining \(\hat{\beta}\) as the term on the left hand side,

\[\hat{\beta} = (X'X)^{-1}X'Y,\]

we see that

\[\begin{aligned} \mathbb{E}(\hat{\beta}) &= \mathbb{E}(\beta) + \mathbb{E}((X'X)^{-1}X'\epsilon) \\ &= \beta + (X'X)^{-1}X'\mathbb{E}(\epsilon) \\ &= \beta \end{aligned}\]

so that \(\hat{\beta}\) is an unbiased estimator of \(\beta\) .
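Both the closed form (4.14) and its unbiasedness can be checked numerically. A minimal sketch with a simulated design matrix; the dimensions, true \(\beta\), and noise scale are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 200, 3
X = rng.normal(size=(N, p))            # fixed (nonrandom) design
beta = np.array([1.0, -2.0, 0.5])

# One draw: the closed form (4.14) agrees with a generic least-squares solver
Y = X @ beta + rng.normal(0, 0.5, size=N)
beta_closed = np.linalg.solve(X.T @ X, X.T @ Y)
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(beta_closed, beta_lstsq)

# Many draws of epsilon: the average beta_hat recovers beta (unbiasedness)
XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)
beta_hats = [XtX_inv_Xt @ (X @ beta + rng.normal(0, 0.5, size=N))
             for _ in range(2000)]
assert np.allclose(np.mean(beta_hats, axis=0), beta, atol=0.02)
```

Note the design \(X\) is held fixed across simulations, matching assumption 1: only \(\epsilon\) is random.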

Notice in our previous theorem that the estimator \(\hat{\beta}\) is a random variable since \(Y\) is a random vector driven by \(\epsilon\) . As such, we may obtain a point estimate based on a specific set of observations, but we may also want to, as in our work with the estimator for the mean, understand the distribution of \(\hat{\beta}\) . Similarly, the \(\epsilon\) values are unobservable, so we will be left with estimators of \(\epsilon\) . Under the full Gauss-Markov assumptions with the distributional assumption on \(\epsilon\) , we may satisfactorily address both of these curiosities. Before doing so, however, we identify an important structural property of the matrix \(X(X'X)^{-1}X'\) which will be used extensively and which, further, lends more interpretability to the OLS estimators.

We define the matrix \(H\) as

\[H = X(X'X)^{-1}X'. \quad (4.15)\]

In the literature, \(H\) is often called the hat matrix. We have immediately that

\[\begin{aligned} HY &= X(X'X)^{-1}X'Y \\ &= X\hat{\beta}. \end{aligned}\]

That is, the effect of multiplying \(Y\) by \(H\) is to obtain our estimate, \(X\hat{\beta}\) , of \(X\beta\) . We define

\[\hat{Y} = X\hat{\beta} \quad (4.16)\]

and show in what follows that \(\hat{Y}\) is the projection of \(Y\) onto the space spanned by the columns of \(X\) . We begin by showing that \(H\) is a projection matrix.

We say a matrix \(P\) is a projection matrix if it satisfies the following two conditions: \(P^2 = P\) and \(P' = P\) . \(P\) is necessarily a square matrix. Projection matrices, \(P\) , have several useful properties.

Projection Matrix Property 1: The only eigenvalues of a projection matrix \(P\) are 0 and 1.

Proof. Let \(\lambda\) be an eigenvalue of \(P\) with associated eigenvector \(v\) . By the first condition of being a projection matrix, we have that

\[\begin{aligned} Pv &= \lambda v \\ P^2v &= P(\lambda v) = \lambda Pv = \lambda^2v \end{aligned}\]

so that

\[\lambda v = \lambda^2v\]

or \((\lambda^2 - \lambda)v = 0\) . Since \(v\) is nonzero, we have that \(\lambda(\lambda - 1) = 0\) , proving the result.

The intuition behind the preceding result is that projection matrices (under some appropriate basis) fix some subspace of \(\mathbb{R}^N\) and send to zero its orthogonal complement. This motivates the third property we will note. First, we look at \(I - P\) .

Projection Matrix Property 2: \(I - P\) is also a projection.

Proof. We verify the properties of a projection matrix. We have

\[\begin{aligned} (I - P)^2 &= I - 2P + P^2 \\ &= I - P \end{aligned}\]

and

\[(I - P)' = I' - P' = I - P.\]

Returning to interpreting the eigenvalue property, we have:

Projection Matrix Property 3: Any vector \(x\) may be decomposed as \(x = u + v\) where \((u, v) = 0\) and \(u = Px\) . In this case, \(||x||^2 = ||u||^2 + ||v||^2\) .

Proof. The statement of the property dictates the proof. Let \(u = Px\) . We verify the claims in turn. We have yet to define \(v\) , but note that if

\[x = u + v\]

then necessarily

\[v = x - u = x - Px = (I - P)x.\]

The inner product of \(u\) and \(v\) is now

\[\begin{aligned}(u, v) &= (Px, (I - P)x) \\ &= x'P'(I - P)x \\ &= x'P(I - P)x \\ &= 0,\end{aligned}\]

since \(P' = P\) and \(P(I - P) = P - P^2 = 0\) . The norm of \(x\) then becomes

\[\begin{aligned}||x||^2 &= (x, x) \\ &= (u + v, u + v) \\ &= (u, u) + 2(u, v) + (v, v) \\ &= ||u||^2 + ||v||^2.\end{aligned}\]

We call \(u = Px\) in the above proof the projection of \(x\) by \(P\) .

Returning to the matrix obtained from our OLS estimate, the matrix \(H\) is a projection matrix since

\[\begin{aligned}H^2 &= X(X'X)^{-1}X'X(X'X)^{-1}X' \\ &= X(X'X)^{-1}X' \quad \text{(by associativity, since } (X'X)^{-1}X'X = I\text{)} \\ &= H.\end{aligned}\]

Now for an invertible matrix \(A\) , \((A')^{-1} = (A^{-1})'\) , and for conformable matrices \(A\) and \(B\) , \((AB)' = B'A'\) , so that

\[\begin{aligned}H' &= (X(X'X)^{-1}X')' \\ &= X((X'X)^{-1})'X' \\ &= X((X'X)')^{-1}X' \\ &= X((X'X))^{-1}X' \\ &= H.\end{aligned}\]

As a result, \(\hat{Y} = HY\) is the projection of \(Y\) by \(H\) . By construction, then, \(\hat{Y}\) is the projection of \(Y\) onto the space spanned by the columns of \(X\) .
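The projection properties of \(H\) are easy to verify numerically; a sketch with a hypothetical random design matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X'X)^{-1} X'

assert np.allclose(H @ H, H)            # idempotent: H^2 = H
assert np.allclose(H, H.T)              # symmetric: H' = H

# Eigenvalues are all 0 or 1, and the trace equals the rank p = 4
eigvals = np.linalg.eigvalsh(H)
assert np.all((np.abs(eigvals) < 1e-8) | (np.abs(eigvals - 1) < 1e-8))
assert np.isclose(eigvals.sum(), 4)
```

The trace identity (\(\operatorname{tr} H = p\)) follows from Property 1: the eigenvalue 1 appears once per dimension of the column space of \(X\).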

Defining the estimator of the residuals, \(\epsilon\) by

\[\hat{\epsilon} = Y - \hat{Y}, \quad (4.17)\]

we have that \(\hat{\epsilon} = (I - H)Y\) , or that the estimated residuals are a projection of \(Y\) onto the orthogonal complement of the space spanned by the columns of \(X\) . Necessarily, then, \(\hat{\epsilon}\) and \(\hat{Y}\) are orthogonal; i.e., \((\hat{\epsilon}, \hat{Y}) = 0\) . We also have the decomposition \(Y = \hat{Y} + \hat{\epsilon}\) with norm \(||Y||^2 = ||\hat{Y}||^2 + ||\hat{\epsilon}||^2\) .

We may also show that

\[\hat{\epsilon} = (I - H)\epsilon \quad (4.18)\]

but leave this to the reader.
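The orthogonal decomposition \(Y = \hat{Y} + \hat{\epsilon}\) can likewise be verified numerically (simulated data; dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(30, 4))
Y = rng.normal(size=30)

H = X @ np.linalg.solve(X.T @ X, X.T)
Y_hat = H @ Y                            # projection onto col(X)
eps_hat = Y - Y_hat                      # estimated residuals

assert np.allclose(eps_hat, (np.eye(30) - H) @ Y)        # eps_hat = (I - H) Y
assert np.isclose(eps_hat @ Y_hat, 0)                    # orthogonality
assert np.isclose(Y @ Y, Y_hat @ Y_hat + eps_hat @ eps_hat)  # Pythagorean split
```

The norm split is the geometric heart of ANOVA-style decompositions: total variation separates exactly into fitted and residual pieces.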

Prompt: 39

I confess I skimmed this quickly, but it 'feels' like SVD only more black-and-white, more binary.

Assistant: 151

Good intuition. SVD gives you a spectrum of singular values — directions weighted by importance. Projection matrices are the degenerate case: eigenvalues are only 0 or 1. You're either in the subspace (kept perfectly) or orthogonal to it (annihilated).

\(H\) keeps whatever lies in the column space of \(X\); \((I-H)\) keeps whatever's orthogonal to it. No partial shrinkage, no soft weighting — just a clean split.

Note: 387

Notice that under the Gauss-Markov assumptions, the covariance of the estimator of \(\epsilon\) is not a diagonal matrix. In particular, we have

\[\begin{aligned} \text{Cov}(\hat{\epsilon}) &= \text{Cov}((I - H)Y) \\ &= (I - H)\text{Cov}(Y)(I - H)' \\ &= (I - H)\text{Cov}(\epsilon)(I - H) \\ &= \sigma^2(I - H)^2 \\ &= \sigma^2(I - H). \end{aligned}\]

And

\[\text{Var}(\hat{\epsilon}_i) = \sigma^2(1 - h_{ii}),\]

so that the variables, the so-called jackknife residuals,

\[\frac{\hat{\epsilon}_i}{\sigma\sqrt{1 - h_{ii}}} \quad (4.19)\]

have constant unit variance if the model is specified correctly. Notice, however, that \(\sigma\) is not known and must be estimated. We return to this issue in the exercises, but will note it below as well in determining some statistical properties for \(\hat{\beta}\).
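A Monte Carlo check of \(\text{Var}(\hat{\epsilon}_i) = \sigma^2(1 - h_{ii})\), using a hypothetical design and a known \(\sigma\) (in practice \(\sigma\) must be estimated, as the note says; simulation lets us sidestep that here):

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, sigma, M = 40, 3, 1.0, 4000
X = rng.normal(size=(N, p))
H = X @ np.linalg.solve(X.T @ X, X.T)

# Simulate residual vectors eps_hat = (I - H) eps under Gauss-Markov assumptions
eps = rng.normal(0, sigma, size=(M, N))
eps_hat = eps @ (np.eye(N) - H)          # (I - H) is symmetric

# Per-coordinate sample variance should match sigma^2 (1 - h_ii)
sample_var = eps_hat.var(axis=0)
assert np.allclose(sample_var, sigma**2 * (1 - np.diag(H)), atol=0.1)
```

Coordinates with large leverage \(h_{ii}\) have visibly smaller residual variance, which is exactly why the raw residuals are rescaled in (4.19).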

Prompt: 82

The \(\epsilon\)'s are "supposed" to be uncorrelated. But if they are indeed uncorrelated, that doesn't mean their projections onto some smaller space spanned by the model will be uncorrelated, and so we get off-diagonals.

Assistant: 84

Exactly. The true ε's are uncorrelated by assumption, but the projection \((I-H)\) couples them. You're projecting \(T\) independent errors onto a \((T-p)\)-dimensional subspace — they have to "squeeze together" and become correlated to fit.

Note: 657

Under these same assumptions, we also have that the covariance of \(\hat{\beta}\) is

\[\text{Cov}(\hat{\beta}) = \sigma^2(X'X)^{-1}. \quad (4.20)\]

This follows directly from \(\hat{\beta} = \beta + (X'X)^{-1}X'\epsilon\) , giving

\[\begin{aligned} \text{Cov}(\hat{\beta}) &= \text{Cov}(\beta + (X'X)^{-1}X'\epsilon) \\ &= (X'X)^{-1}X'\text{Cov}(\epsilon)((X'X)^{-1}X')' \\ &= \sigma^2(X'X)^{-1}X'((X'X)^{-1}X')' \\ &= \sigma^2(X'X)^{-1}X'X(X'X)^{-1} \\ &= \sigma^2(X'X)^{-1}. \end{aligned}\]
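Equation (4.20) can also be verified by simulation (design, \(\beta\), and \(\sigma\) all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)
N, p, sigma = 100, 2, 0.5
X = rng.normal(size=(N, p))
beta = np.array([1.0, -1.0])
C = np.linalg.inv(X.T @ X)

# Empirical covariance of beta_hat across many noise draws vs sigma^2 (X'X)^{-1}
eps = rng.normal(0, sigma, size=(20000, N))
Y = (X @ beta)[:, None] + eps.T          # one column of Y per simulation
beta_hats = (C @ X.T @ Y).T              # shape (20000, p)

emp_cov = np.cov(beta_hats, rowvar=False)
assert np.allclose(emp_cov, sigma**2 * C, atol=5e-4)
```

Note the off-diagonal entries: even with uncorrelated errors, the components of \(\hat{\beta}\) are generally correlated whenever the columns of \(X\) are.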

If we assume further that \(\epsilon \sim N(0, \sigma^2I)\) , then from what has preceded, \(\hat{\beta} \sim N(\beta, \sigma^2(X'X)^{-1})\) . In this case,

\[\frac{\hat{\beta}_i - \beta_i}{\sigma\sqrt{c_{ii}}} \sim N(0, 1), \quad (4.21)\]

where \(C = (X'X)^{-1}\) .

Notice that in both of the preceding examples we have formulated our final result in terms of the unknown parameter \(\sigma\) . We will necessarily require an estimator \(s\) of \(\sigma\) . Having this estimator in hand, we will be in a position to look at the distribution of

\[\frac{\hat{\beta}_i - \beta_i}{s\sqrt{c_{ii}}}, \quad (4.22)\]

which, perhaps surprisingly, will be the Student \(t\) distribution.

Prompt: 22

Wait - the errors in the estimate of \(\beta\) are Student-t?

Assistant: 159

Not quite — the errors themselves are still assumed normal. But when you estimate σ with s (from the residuals), you introduce extra uncertainty.

The ratio of a standard normal to an independent √(χ²/df) gives you a t-distribution. Since s² is estimated from the residuals (which have n−p degrees of freedom), the standardized statistic follows t(n−p) rather than N(0,1).

Same reason we use t-tests instead of z-tests when variance is unknown.