6 Statistical models

6.1 Simple linear regression

Simple linear regression is the most common “first model” in applied statistics. It answers a basic question:

How does the expected value of an outcome \(Y\) change as a function of a single predictor \(X\)?

Even if your final analysis uses more sophisticated methods (mixed models, GLMs, survival models), simple linear regression remains a key building block. It teaches you how estimation works, why assumptions matter, and how uncertainty is quantified through standard errors, confidence intervals, and hypothesis tests.

In practice, regression has two equally important goals:

  1. Explanation: estimating association or effect size.
  2. Prediction: forecasting outcomes at new values of \(X\).


6.1.1 Linear regression assumptions

A regression model is only as reliable as its assumptions. In real-world analysis, the assumptions are rarely perfect, but they guide diagnostics and help you understand when inference may break down.

There are four principal assumptions:

  • Linearity of the relationship between dependent and independent variables.
    This means the conditional mean \(E(Y|X)\) is well-approximated by a straight line. If the true relationship is curved, the linear model may still be useful as a local approximation, but interpretation can become misleading.

  • Statistical independence of the errors across observations.
    Independence is often violated in longitudinal data, clustered data (e.g., patients within sites), or time series. When independence fails, standard errors are typically wrong—often too small, leading to overly optimistic p-values and confidence intervals that are too narrow.

  • Homoscedasticity (constant variance) of the errors for all \(x\).
    If variability increases with \(X\) (a “fanning out” pattern), the model may still estimate the mean trend reasonably, but standard errors and tests can be distorted unless you use robust methods or transform variables.

  • Normality of the error distribution.
    Normality matters mainly for small-sample inference. Large samples rely less on strict normality due to asymptotic approximations. Also note: normality is assumed for the errors, not necessarily for \(X\) or \(Y\) marginally.

If the independence assumption is violated, the estimated standard errors tend to underestimate the true standard errors, so the associated p-values are misleadingly small.

Only the prediction errors need to be normally distributed. However, when \(X\) or \(Y\) is extremely asymmetric or long-tailed, it can be hard to fit a linear model whose errors are normally distributed.
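The checks above can be sketched in R. This is a minimal illustration on the built-in cars data (an illustrative choice, not a dataset used elsewhere in this chapter):

```r
# Minimal sketch: checking regression assumptions on the built-in 'cars' data.
fit <- lm(dist ~ speed, data = cars)

# Normality of errors: Shapiro-Wilk test on the residuals (not on X or Y).
norm_p <- shapiro.test(residuals(fit))$p.value

# Linearity, homoscedasticity, and influential points are usually judged
# graphically: plot(fit) draws residuals-vs-fitted, normal Q-Q,
# scale-location, and leverage plots.
```

A small Shapiro-Wilk p-value flags non-normal residuals, which matters most in small samples.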


6.1.2 Population regression function

The population regression function is the ideal target we would like to know: the true conditional mean of \(Y\) given \(X\). Regression is fundamentally about modeling and estimating this conditional expectation.

Regression estimates and/or predicts the population mean (expectation) of the dependent variable \(Y_i\) at known or set values of the explanatory variable \(X_i\). The population regression line (PRL) is the trajectory of this conditional expectation as \(X_i\) varies.

\[ E(Y|X_i)=f(X_i)=\beta_1+\beta_2X_i \]

This is an unknown but fixed value (can be estimated).

A key interpretation:

  • \(\beta_1\) is the expected value of \(Y\) when \(X=0\) (sometimes meaningful, sometimes not).
  • \(\beta_2\) is the expected change in \(Y\) for a one-unit increase in \(X\).


6.1.3 Population regression model

In the population, actual observations deviate from the regression function due to randomness and unmeasured factors. We represent this deviation as an error term \(u_i\).

\[ Y_i=\beta_1+\beta_2X_i+u_i \]

The errors \(u_i=Y_i-(\beta_1+\beta_2X_i)\) are assumed to have mean zero and equal variance.

In applied interpretation, \(u_i\) captures everything not explained by \(X\): measurement noise, omitted variables, and inherent randomness.


6.1.4 Sample regression model

In practice we observe data and estimate coefficients. The sample regression model replaces unknown population parameters with estimates (hats), and uses residuals \(e_i\) as estimated errors.

(hats indicate sample estimates)

\[ Y_i=\hat{\beta}_1+\hat{\beta}_2X_i+e_i \]

since

\[ u_i \sim N(0,\sigma^2) \ \text{(population)} \quad \text{or} \quad e_i \sim N(0,\hat{\sigma}^2) \ \text{(sample)} \]

and

i.i.d. (independent and identically distributed): the variables share the same probability distribution and are independent of each other.

\[ \begin{align} u_i \sim i.i.d \ N(0,\sigma^2) \end{align} \]

then

\[ \begin{align} Y_i-(\beta_1+\beta_2X_i) &\sim i.i.d \ N(0,\sigma^2)\\ Y_i &\sim i.i.d \ N(\beta_1+\beta_2X_i,\ \sigma^2) \end{align} \]

This i.i.d. assumption is what allows us to derive standard errors and perform t-tests and F-tests in the classical linear regression framework.


6.1.5 Least squares: minimize \(Q=\sum (Y_i-\hat{Y}_i)^2\)

The ordinary least squares (OLS) estimator chooses coefficients that minimize the total squared residual error. Squaring penalizes large deviations and gives a unique and mathematically convenient solution.

Hence, minimize \(Q=\sum{(Y_i-\hat{Y}_i)^2}\) with respect to \(\hat{\beta}_1\) and \(\hat{\beta}_2\).

\[ \begin{align} Min(Q) &=\sum{(Y_i-\hat{Y}_i)^2}\\ &=\sum{\left ( Y_i-(\hat{\beta}_1+\hat{\beta}_2X_i) \right )^2}\\ &=f(\hat{\beta}_1,\hat{\beta}_2) \end{align} \]


6.1.6 Solve \(\hat{\beta}_1,\hat{\beta}_2\) and variance

Once the least squares problem is solved, you get closed-form estimators for the slope and intercept. The sampling variability of these estimators depends on:

  • the error variance \(\sigma^2\), and
  • the spread of \(X\) values (more spread → more information → smaller variance).

Below, lowercase letters denote deviations from the means: \(x_i=X_i-\bar{X}\) and \(y_i=Y_i-\bar{Y}\).

\[ \begin{align} \begin{split} \hat{\beta}_2 &=\frac{\sum{x_iy_i}}{\sum{x_i^2}}\\ \hat{\beta}_1 &=\bar{Y}-\hat{\beta}_2\bar{X} \end{split} \\ var(\hat{\beta}_2) =\sigma_{\hat{\beta}_2}^2&=\frac{1}{\sum{x_i^2}}\cdot\sigma^2 \\ var(\hat{\beta}_1) =\sigma_{\hat{\beta}_1}^2 &=\frac{\sum{X_i^2}}{n\sum{x_i^2}}\cdot\sigma^2 \end{align} \]

A practical takeaway: if your \(X\) values are tightly clustered, \(\sum x_i^2\) is small and the slope becomes hard to estimate precisely.
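As a quick check, the closed-form estimators can be computed by hand and compared with lm(); the built-in cars data are used purely for illustration:

```r
# Closed-form OLS using deviations x_i = X_i - mean(X), y_i = Y_i - mean(Y).
X <- cars$speed; Y <- cars$dist
x <- X - mean(X); y <- Y - mean(Y)
b2 <- sum(x * y) / sum(x^2)        # slope: sum(x_i y_i) / sum(x_i^2)
b1 <- mean(Y) - b2 * mean(X)       # intercept: Ybar - b2 * Xbar
fit <- lm(dist ~ speed, data = cars)
all.equal(c(b1, b2), unname(coef(fit)))   # the two agree
```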


6.1.7 Calculate the variance \(\hat{\sigma}^2\) of error \(e_i\)

The residual variance is estimated from the residual sum of squares. Conceptually, it measures how much unexplained variability remains after fitting the regression line.

(for sample)

\[ \begin{align} \hat{Y}_i &=\hat{\beta}_1+\hat{\beta}_2X_i \\ e_i &=Y_i-\hat{Y}_i \\ \hat{\sigma}^2 &=\frac{\sum{e_i^2}}{n-2}=\frac{\sum{(Y_i-\hat{Y}_i)^2}}{n-2} \end{align} \]

(Practical note: in simple linear regression the unbiased estimator divides by \(n-2\) because two parameters, \(\hat{\beta}_1\) and \(\hat{\beta}_2\), were estimated from the data.)
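A sketch of the residual-variance estimate with the \(n-2\) denominator, checked against summary()'s residual standard error (illustrative cars data):

```r
fit <- lm(dist ~ speed, data = cars)
e <- residuals(fit)                  # e_i = Y_i - Yhat_i
n <- nrow(cars)
sigma2_hat <- sum(e^2) / (n - 2)     # two parameters were estimated
all.equal(sigma2_hat, summary(fit)$sigma^2)
```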


6.1.8 Sum of squares decomposition

A central identity in regression is that total variability can be decomposed into explained and unexplained components. This is the basis of \(R^2\) and the ANOVA-style F test.

\[ \begin{align} (Y_i-\bar{Y}) &= (\hat{Y_i}-\bar{Y}) +(Y_i-\hat{Y_i}) \\ \sum{y_i^2} &= \sum{\hat{y}_i^2} +\sum{e_i^2} \\ TSS&=ESS+RSS \end{align} \]

Interpretation:

  • TSS: total sum of squares (overall variability around the mean)
  • ESS: explained sum of squares (variability explained by the regression)
  • RSS: residual sum of squares (unexplained variability)


6.1.9 Coefficient of determination \(R^2\) and goodness of fit

\(R^2\) is the proportion of variability explained by the model. It is descriptive: a higher \(R^2\) means the fitted line tracks the data more closely, but it does not guarantee causality or correctness of assumptions.

\[ \begin{align} R^2 &=\frac{ESS}{TSS}=\frac{\sum{(\hat{Y_i}-\bar{Y})^2}}{\sum{(Y_i-\bar{Y})^2}}\\ &=1-\frac{RSS}{TSS}=1-\frac{\sum{(Y_i-\hat{Y_i})^2}}{\sum{(Y_i-\bar{Y})^2}} \end{align} \]

In practice, always interpret \(R^2\) alongside residual diagnostics. A model can have a decent \(R^2\) but still violate key assumptions (e.g., heteroscedasticity).
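The decomposition and both forms of \(R^2\) can be verified numerically (cars data as an illustrative stand-in):

```r
fit <- lm(dist ~ speed, data = cars)
Y <- cars$dist
TSS <- sum((Y - mean(Y))^2)              # total
ESS <- sum((fitted(fit) - mean(Y))^2)    # explained
RSS <- sum(residuals(fit)^2)             # residual
all.equal(TSS, ESS + RSS)                # TSS = ESS + RSS
all.equal(1 - RSS / TSS, summary(fit)$r.squared)
```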


6.1.10 Test of regression coefficients

Hypothesis testing in regression typically focuses on whether coefficients differ from zero (or another clinically meaningful value). Under classical assumptions, coefficients are normally distributed around their true values.

since
\[ \begin{align} \hat{\beta_2} &\sim N(\beta_2,\sigma^2_{\hat{\beta_2}}) \\ \hat{\beta_1} &\sim N(\beta_1,\sigma^2_{\hat{\beta_1}}) \end{align} \]

and \[ \begin{align} S_{\hat{\beta}_2} &=\sqrt{\frac{1}{\sum{x_i^2}}}\cdot\hat{\sigma} \\ S_{\hat{\beta}_1} &=\sqrt{\frac{\sum{X_i^2}}{n\sum{x_i^2}}}\cdot\hat{\sigma} \end{align} \]

therefore
\[ \begin{align} t_{\hat{\beta_2}}^{\ast}&=\frac{\hat{\beta_2}-\beta_2}{S_{\hat{\beta_2}}} =\frac{\hat{\beta_2}}{S_{\hat{\beta_2}}} =\frac{\hat{\beta_2}}{\sqrt{\frac{1}{\sum{x_i^2}}}\cdot\hat{\sigma}} \sim t(n-2) \\ t_{\hat{\beta_1}}^{\ast}&=\frac{\hat{\beta_1}-\beta_1}{S_{\hat{\beta_1}}} =\frac{\hat{\beta_1}}{S_{\hat{\beta_1}}} =\frac{\hat{\beta_1}}{\sqrt{\frac{\sum{X_i^2}}{n\sum{x_i^2}}}\cdot\hat{\sigma}} \sim t(n-2) \end{align} \]

In applied reporting, the slope test is often the primary focus, because it corresponds to whether \(X\) is associated with \(Y\) in a linear trend.
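A sketch reproducing the t statistic and two-sided p-value for the slope from the formulas above (cars data for illustration):

```r
fit <- lm(dist ~ speed, data = cars)
ct <- summary(fit)$coefficients     # Estimate, Std. Error, t value, Pr(>|t|)
n <- nrow(cars)
t_slope <- ct["speed", "Estimate"] / ct["speed", "Std. Error"]
p_slope <- 2 * pt(-abs(t_slope), df = n - 2)   # two-sided, t(n-2)
all.equal(t_slope, ct["speed", "t value"])
all.equal(p_slope, ct["speed", "Pr(>|t|)"])
```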


6.1.11 Statistical test of model

Beyond individual coefficients, we may want to test whether the model as a whole explains a statistically significant amount of variability compared with a null model.

since \[ \begin{align} Y_i&\sim i.i.d \ N(\beta_1+\beta_2X_i,\sigma^2)\\ \end{align} \]

and
\[ \begin{align} ESS/\sigma^2&=\sum{(\hat{Y_i}-\bar{Y})^2}/\sigma^2 \sim \chi^2(df_{ESS}) \\ RSS/\sigma^2&=\sum{(Y_i-\hat{Y_i})^2}/\sigma^2 \sim \chi^2(df_{RSS}) \end{align} \]

(in simple regression, \(df_{ESS}=1\) and \(df_{RSS}=n-2\))

therefore \[ \begin{align} F^{\ast}&=\frac{ESS/df_{ESS}}{RSS/df_{RSS}}=\frac{MSS_{ESS}}{MSS_{RSS}}\\ &=\frac{\sum{(\hat{Y_i}-\bar{Y})^2}/df_{ESS}}{\sum{(Y_i-\hat{Y_i})^2}/df_{RSS}} \\ &=\frac{\hat{\beta_2}^2\sum{x_i^2}}{\sum{e_i^2}/{(n-2)}}\\ &=\frac{\hat{\beta_2}^2\sum{x_i^2}}{\hat{\sigma}^2} \end{align} \]

The F test evaluates whether the explained variation (ESS) is large relative to residual variation (RSS), after accounting for degrees of freedom.
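The F statistic can be rebuilt from the decomposition (with \(df_{ESS}=1\) and \(df_{RSS}=n-2\) in simple regression) and compared with the value reported by summary():

```r
fit <- lm(dist ~ speed, data = cars)
Y <- cars$dist
ESS <- sum((fitted(fit) - mean(Y))^2)
RSS <- sum(residuals(fit)^2)
n <- nrow(cars)
Fstar <- (ESS / 1) / (RSS / (n - 2))
all.equal(Fstar, unname(summary(fit)$fstatistic["value"]))
```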


6.1.12 Mean prediction

Prediction in regression has two common targets:

  1. Mean prediction: the expected outcome for individuals with a given \(X_0\).
  2. Individual prediction: the outcome for a new single individual with \(X_0\).

Mean prediction is more precise because it estimates an average, not a single future value.

since

\[ \begin{align} \mu_{\hat{Y}_0}&=E(\hat{Y}_0)\\ &=E(\hat{\beta}_1+\hat{\beta}_2X_0)\\ &=\beta_1+\beta_2X_0\\ &=E(Y|X_0) \end{align} \]

and \[ \begin{align} var(\hat{Y}_0)&=\sigma^2_{\hat{Y}_0}\\ &=var(\hat{\beta}_1+\hat{\beta}_2X_0)\\ &=\sigma^2 \left( \frac{1}{n}+ \frac{(X_0-\bar{X})^2}{\sum{x_i^2}} \right) \end{align} \]

therefore \[ \begin{align} \hat{Y}_0& \sim N(\mu_{\hat{Y}_0},\sigma^2_{\hat{Y}_0})\\ \hat{Y}_0& \sim N \left(E(Y|X_0), \sigma^2 \left( \frac{1}{n}+ \frac{(X_0-\bar{X})^2}{\sum{x_i^2}} \right) \right) \end{align} \]

then construct t statistic to estimate CI \[ \begin{align} t_{\hat{Y}_0}& =\frac{\hat{Y}_0-E(Y|X_0)}{S_{\hat{Y}_0}} \sim t(n-2) \end{align} \]

\[ \begin{align} \hat{Y}_0-t_{1-\alpha/2}(n-2) \cdot S_{\hat{Y}_0} \leq E(Y|X_0) \leq \hat{Y}_0+t_{1-\alpha/2}(n-2) \cdot S_{\hat{Y}_0} \end{align} \]

Interpretation: the CI here is for the mean response at \(X_0\). It is narrow when:

  • \(n\) is large, and
  • \(X_0\) is close to \(\bar{X}\) (more information near the center of the data).
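In R, predict() with interval = "confidence" returns this mean-response CI; the sketch below shows the interval widening away from \(\bar{X}\) (cars data, speeds 10 and 25 chosen for illustration):

```r
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = c(10, 25))   # mean(cars$speed) is about 15.4
ci <- predict(fit, newdata = new, interval = "confidence")
width <- ci[, "upr"] - ci[, "lwr"]
width[2] > width[1]   # wider farther from the center of the X values
```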


6.1.13 Individual prediction

Individual prediction intervals are wider because they must account for:

  • uncertainty in estimating the mean trend, and
  • the irreducible random error for a new observation.

since

\[ \begin{align} (Y_0-\hat{Y}_0)& \sim N \left(\mu_{(Y_0-\hat{Y}_0)},\sigma^2_{(Y_0-\hat{Y}_0)} \right)\\ (Y_0-\hat{Y}_0)& \sim N \left(0, \sigma^2 \left(1+ \frac{1}{n}+ \frac{(X_0-\bar{X})^2}{\sum{x_i^2}} \right) \right) \end{align} \]

and construct the t statistic \[ \begin{align} t_{(Y_0-\hat{Y}_0)}& =\frac{Y_0-\hat{Y}_0}{S_{Y_0-\hat{Y}_0}} \sim t(n-2) \end{align} \]

and \[ \begin{align} S_{Y_0-\hat{Y}_0}& = \sqrt{\hat{\sigma}^2 \left(1+ \frac{1}{n}+ \frac{(X_0-\bar{X})^2}{\sum{x_i^2}} \right)} \end{align} \]

therefore

\[ \begin{align} \hat{Y}_0-t_{1-\alpha/2}(n-2) \cdot S_{Y_0-\hat{Y}_0} \leq Y_0 \leq \hat{Y}_0+t_{1-\alpha/2}(n-2) \cdot S_{Y_0-\hat{Y}_0} \end{align} \]

It is harder to predict your weight from your age than to predict the mean weight of people your age; accordingly, intervals for individual prediction are wider than those for mean prediction.

A practical mental model:

  • Mean CI answers: “What is the expected mean outcome at \(X_0\)?”
  • Prediction interval answers: “Where might a new individual outcome fall at \(X_0\)?”
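The two targets map directly onto predict()'s interval argument; at the same \(X_0\), the prediction interval is strictly wider (cars data for illustration):

```r
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = 15)
ci <- predict(fit, newdata = new, interval = "confidence")   # mean response
pi <- predict(fit, newdata = new, interval = "prediction")   # new individual
(pi[, "upr"] - pi[, "lwr"]) > (ci[, "upr"] - ci[, "lwr"])    # TRUE
```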


6.2 Multiple linear regression

Multiple linear regression generalizes the simple model by allowing multiple predictors. The key shift is that each coefficient is interpreted as an effect holding other variables constant.

6.2.1 Matrix format

\[ \begin{align} Y_i&=\beta_1+\beta_2X_{2i}+\beta_3X_{3i}+\cdots+\beta_kX_{ki}+u_i && \ \end{align} \]

\[ \begin{equation} \begin{bmatrix} Y_1 \\ Y_2 \\ \cdots \\ Y_n \\ \end{bmatrix} = \begin{bmatrix} 1 & X_{21} & X_{31} & \cdots & X_{k1} \\ 1 & X_{22} & X_{32} & \cdots & X_{k2} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ 1 & X_{2n} & X_{3n} & \cdots & X_{kn} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \\ \end{bmatrix}+ \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \\ \end{bmatrix} \end{equation} \]

\[ \begin{alignat}{4} \mathbf{y} &= &\mathbf{X}&\mathbf{\beta}&+&\mathbf{u} \\ (n \times 1) & &{(n \times k)} &{(k \times 1)}& &{(n \times 1)} \end{alignat} \]

The matrix form is not just notation—it simplifies derivations and makes the estimator compact and general.


6.2.2 Variance covariance matrix of random errors

The classical assumption is that errors are independent, have equal variance, and have zero covariance. This leads to a diagonal variance-covariance structure proportional to the identity matrix.

because \[ \mathbf{u} \sim N(\mathbf{0},\sigma^2\mathbf{I}) \ \text{(population)} \qquad \mathbf{e} \sim N(\mathbf{0},\hat{\sigma}^2\mathbf{I}) \ \text{(sample)} \]

therefore \[ \begin{align} var-cov(\mathbf{u})&=E(\mathbf{uu'})\\ &= \begin{bmatrix} \sigma_1^2 & \sigma_{12}^2 &\cdots &\sigma_{1n}^2\\ \sigma_{21}^2 & \sigma_2^2 &\cdots &\sigma_{2n}^2\\ \vdots & \vdots &\vdots &\vdots \\ \sigma_{n1}^2 & \sigma_{n2}^2 &\cdots &\sigma_n^2\\ \end{bmatrix} && \leftarrow (E{(u_i)}=0)\\ &= \begin{bmatrix} \sigma^2 & \sigma_{12}^2 &\cdots &\sigma_{1n}^2\\ \sigma_{21}^2 & \sigma^2 &\cdots &\sigma_{2n}^2\\ \vdots & \vdots &\vdots &\vdots \\ \sigma_{n1}^2 & \sigma_{n2}^2 &\cdots &\sigma^2\\ \end{bmatrix} && \leftarrow (var{(u_i)}=\sigma^2)\\ &= \begin{bmatrix} \sigma^2 & 0 &\cdots &0\\ 0 & \sigma^2 &\cdots &0\\ \vdots & \vdots &\vdots &\vdots \\ 0 & 0 &\cdots &\sigma^2\\ \end{bmatrix} && \leftarrow (cov{(u_i,u_j)}=0,i \neq j)\\ &=\sigma^2 \begin{bmatrix} 1 & 0 &\cdots &0\\ 0 & 1 &\cdots &0\\ \vdots & \vdots &\vdots &\vdots \\ 0 & 0 &\cdots &1\\ \end{bmatrix}\\ &=\sigma^2\mathbf{I} \end{align} \]

In applied work, this assumption is often challenged by clustering and repeated measures. When violated, analysts may move to robust standard errors, GLS, or mixed effects models.


6.2.3 Minimize \(Q=\sum (y-\hat{y})^2\)

In matrix form, OLS still minimizes the squared residuals, but the algebra becomes compact and scalable.

\[ \begin{align} Q&=\sum{e_i^2}\\ &=\mathbf{e'e}\\ &=\mathbf{(y-X\hat{\beta})'(y-X\hat{\beta})}\\ &=\mathbf{y'y-2\hat{\beta}'X'y+\hat{\beta}'X'X\hat{\beta}} \end{align} \]


6.2.4 Solve \(\hat{\beta}\) by derivation

Setting the derivative with respect to \(\hat{\beta}\) equal to zero yields the normal equations. Solving them gives the closed-form OLS estimator.

(the derivation is the same for the population and sample versions)

\[ \begin{align} \frac{\partial Q}{\partial \mathbf{\hat{\beta}}}&=0\\ \frac{\partial(\mathbf{y'y-2\hat{\beta}'X'y+\hat{\beta}'X'X\hat{\beta}})}{\partial \mathbf{\hat{\beta}}}&=0\\ -2\mathbf{X'y}+2\mathbf{X'X\hat{\beta}}&=0\\ -\mathbf{X'y}+\mathbf{X'X\hat{\beta}}&=0\\ \mathbf{X'X\hat{\beta}} &=\mathbf{X'y} \end{align} \]

\[ \begin{align} \mathbf{\hat{\beta}} &=\mathbf{(X'X)^{-1}X'y} \end{align} \]

Interpretation: \((X'X)^{-1}\) reflects the information in the design matrix. When predictors are highly correlated (multicollinearity), \(X'X\) becomes nearly singular and coefficient estimates become unstable.
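The estimator \((X'X)^{-1}X'y\) can be computed directly and checked against lm(); mtcars and the predictors wt and hp are arbitrary illustration choices:

```r
X <- cbind(1, as.matrix(mtcars[, c("wt", "hp")]))   # design matrix with intercept
y <- mtcars$mpg
beta_hat <- solve(t(X) %*% X, t(X) %*% y)           # solves the normal equations
fit <- lm(mpg ~ wt + hp, data = mtcars)
all.equal(drop(beta_hat), coef(fit), check.attributes = FALSE)
```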


6.2.5 Solve \(var\text{-}cov(\mathbf{\hat{\beta}})\)

The variance-covariance matrix of \(\hat{\beta}\) is the core object for inference in multiple regression. It contains:

  • variances of each coefficient (diagonal), and
  • covariances between coefficients (off-diagonal).

\[ \begin{align} var-cov(\mathbf{\hat{\beta}}) &=\mathbf{E\left( \left(\hat{\beta}-E(\hat{\beta}) \right) \left( \hat{\beta}-E(\hat{\beta}) \right )' \right)}\\ &=\mathbf{E\left( \left(\hat{\beta}-{\beta} \right) \left( \hat{\beta}-\beta \right )' \right)} \\ &=\mathbf{E\left( \left((X'X)^{-1}X'u \right) \left( (X'X)^{-1}X'u \right )' \right)} \\ &=\mathbf{E\left( (X'X)^{-1}X'uu'X(X'X)^{-1} \right)} \\ &= \mathbf{(X'X)^{-1}X'E(uu')X(X'X)^{-1}} \\ &= \mathbf{(X'X)^{-1}X'}\sigma^2\mathbf{IX(X'X)^{-1}} \\ &= \sigma^2\mathbf{(X'X)^{-1}X'X(X'X)^{-1}} \\ &= \sigma^2\mathbf{(X'X)^{-1}} \\ \end{align} \]


6.2.6 Solve \(S^2(\mathbf{\hat{\beta}})\) (sample)

In practice, \(\sigma^2\) is unknown. We estimate it from residuals and then plug it into the variance-covariance formula.

where \[ \begin{align} \hat{\sigma}^2&=\frac{\sum{e_i^2}}{n-k}=\frac{\mathbf{e'e}}{n-k} \\ E(\hat{\sigma}^2)&=\sigma^2 \end{align} \]

therefore \[ \begin{align} S^2(\mathbf{\hat{\beta}}) &= \hat{\sigma}^2\mathbf{(X'X)^{-1}} \\ &= \frac{\mathbf{e'e}}{n-k}\mathbf{(X'X)^{-1}} \\ \end{align} \]

which is the variance-covariance matrix of the coefficients.
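A sketch building \(\hat{\sigma}^2(X'X)^{-1}\) by hand and comparing it with vcov() (mtcars as an illustrative dataset):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit)
n <- nrow(X); k <- ncol(X)
sigma2_hat <- sum(residuals(fit)^2) / (n - k)   # e'e / (n - k)
V <- sigma2_hat * solve(t(X) %*% X)
all.equal(V, vcov(fit), check.attributes = FALSE)
```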


6.2.7 Sum of squares decomposition (matrix format)

The same TSS/ESS/RSS decomposition generalizes to multiple regression, forming the basis of \(R^2\) and the overall F test.

\[ \begin{align} TSS&=\mathbf{y'y}-n\bar{Y}^2 \\ RSS&=\mathbf{e'e}=\mathbf{y'y-\hat{\beta}'X'y} \\ ESS&=\mathbf{\hat{\beta}'X'y}-n\bar{Y}^2 \end{align} \]


6.2.8 Determination coefficient \(R^2\) and goodness of fit

\[ \begin{align} R^2&=\frac{ESS}{TSS}\\ &=\frac{\mathbf{\hat{\beta}'X'y}-n\bar{Y}^2}{\mathbf{y'y}-n\bar{Y}^2} \end{align} \]

Interpretation remains the same: \(R^2\) is descriptive goodness-of-fit. In multiple regression it almost always increases as you add predictors, which is why adjusted \(R^2\) or out-of-sample validation is often preferred for model selection.


6.2.9 Test of regression coefficients

Multiple regression inference typically includes: - individual coefficient tests (is \(\beta_j=0\)?), - joint tests (are multiple coefficients simultaneously zero?).

because \[ \begin{align} \mathbf{u}&\sim N(\mathbf{0},\sigma^2\mathbf{I}) \\ \mathbf{\hat{\beta}} &\sim N\left(\mathbf{\beta},\sigma^2(\mathbf{X'X})^{-1} \right) \\ \end{align} \]

therefore

(for the whole coefficient vector; see \(S^2(\mathbf{\hat{\beta}})\) above) \[ \begin{align} \mathbf{t_{\hat{\beta}}}&=\mathbf{\frac{\hat{\beta}-\beta}{S_{\hat{\beta}}}} \sim t(n-k) \end{align} \]

(for an individual coefficient) \[ \begin{align} t_{\hat{\beta}_j}^{\ast}&=\frac{\hat{\beta}_j}{S_{\hat{\beta}_j}} \end{align} \]

where the \(S^2_{\hat{\beta}_j}=[s^2_{\hat{\beta}_1},s^2_{\hat{\beta}_2},\cdots,s^2_{\hat{\beta}_k}]\) are the diagonal elements of the matrix \(S^2(\mathbf{\hat{\beta}})\).
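Individual coefficient tests use the square roots of the diagonal of the variance-covariance matrix, which matches summary()'s t values (mtcars for illustration):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
se <- sqrt(diag(vcov(fit)))          # S for each coefficient (diagonal elements)
tstar <- coef(fit) / se
all.equal(tstar, summary(fit)$coefficients[, "t value"])
```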


6.2.10 Test of model

The overall model test compares:

  • an unrestricted model with predictors, versus
  • a restricted model (often intercept-only).

unrestricted model \[ \begin{align} u_i &\sim i.i.d \ N(0,\sigma^2)\\ Y_i&\sim i.i.d \ N(\beta_1+\beta_2X_{2i}+\cdots+\beta_kX_{ki},\sigma^2)\\ RSS_U/\sigma^2&=\sum{(Y_i-\hat{Y_i})^2}/\sigma^2 \sim \chi^2(n-k) \\ \end{align} \]

restricted model \[ \begin{align} u_i &\sim i.i.d \ N(0,\sigma^2)\\ Y_i&\sim i.i.d \ N(\beta_1,\sigma^2)\\ RSS_R/\sigma^2&=\sum{(Y_i-\bar{Y})^2}/\sigma^2 \sim \chi^2(n-1) \\ \end{align} \]

F test \[ \begin{align} F^{\ast}&=\frac{(RSS_R-RSS_U)/(k-1)}{RSS_U/(n-k)} \\ &=\frac{ESS_U/df_{ESS_U}}{RSS_U/df_{RSS_U}} \\ &\sim F(df_{ESS_U},df_{RSS_U}) \end{align} \]

\[ \begin{align} F^{\ast}&=\frac{ESS_U/df_{ESS_U}}{RSS_U/df_{RSS_U}} =\frac{\left(\mathbf{\hat{\beta}'X'y}-n\bar{Y}^2 \right)/{(k-1)}}{\left(\mathbf{y'y-\hat{\beta}'X'y}\right)/{(n-k)}} \end{align} \]
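The restricted-versus-unrestricted construction can be checked against the overall F reported by summary() (mtcars for illustration):

```r
fitU <- lm(mpg ~ wt + hp, data = mtcars)   # unrestricted
fitR <- lm(mpg ~ 1, data = mtcars)         # restricted: intercept only
RSS_U <- sum(residuals(fitU)^2)
RSS_R <- sum(residuals(fitR)^2)            # equals TSS here
n <- nrow(mtcars); k <- length(coef(fitU))
Fstar <- ((RSS_R - RSS_U) / (k - 1)) / (RSS_U / (n - k))
all.equal(Fstar, unname(summary(fitU)$fstatistic["value"]))
```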


6.2.11 Mean prediction (multiple regression)

The mean prediction generalizes naturally: you plug in a covariate vector \(X_0\). The uncertainty depends on the leverage of \(X_0\) through \(X_0(X'X)^{-1}X_0'\).

since \[ \begin{align} E(\hat{Y}_0)&=E\mathbf{(X_0\hat{\beta})}=\mathbf{X_0\beta}=E\mathbf{(Y_0)}\\ var(\hat{Y}_0)&=E\mathbf{(X_0\hat{\beta}-X_0\beta)}^2\\ &=E\mathbf{\left( X_0(\hat{\beta}-\beta)(\hat{\beta}-\beta)'X_0' \right)}\\ &=\mathbf{X_0}\,E\mathbf{\left( (\hat{\beta}-\beta)(\hat{\beta}-\beta)' \right)}\,\mathbf{X_0'}\\ &=\sigma^2\mathbf{X_0\left( X'X \right)^{-1}X_0'}\\ \end{align} \]

and \[ \begin{align} \hat{Y}_0& \sim N(\mu_{\hat{Y}_0},\sigma^2_{\hat{Y}_0})\\ \hat{Y}_0& \sim N\left(E(Y_0|X_0), \sigma^2\mathbf{X_0(X'X)^{-1}X_0'}\right) \end{align} \]

construct t statistic \[ \begin{align} t_{\hat{Y}_0}& =\frac{\hat{Y}_0-E(Y|X_0)}{S_{\hat{Y}_0}} &\sim t(n-k) \end{align} \]

therefore \[ \begin{align} \hat{Y}_0-t_{1-\alpha/2}(n-k) \cdot S_{\hat{Y}_0} \leq E(Y|X_0) \leq \hat{Y}_0+t_{1-\alpha/2}(n-k) \cdot S_{\hat{Y}_0} \end{align} \]

where \[ \begin{align} S_{\hat{Y}_0} &=\sqrt{\hat{\sigma}^2\mathbf{X_0(X'X)^{-1}X_0'}} \\ \hat{\sigma}^2&=\frac{\mathbf{e'e}}{(n-k)} \end{align} \]


6.2.12 Individual prediction (multiple regression)

Individual prediction adds the irreducible error term for a new observation, making the interval wider than the mean-response interval.

since \[ \begin{align} e_0&=Y_0-\hat{Y}_0 \end{align} \]

and \[ \begin{align} E(e_0)&=E(Y_0-\hat{Y}_0)\\ &=E(\mathbf{X_0\beta}+u_0-\mathbf{X_0\hat{\beta}})\\ &=E\left(u_0-\mathbf{X_0 (\hat{\beta}- \beta)} \right)\\ &=E\left(u_0-\mathbf{X_0 (X'X)^{-1}X'u} \right)\\ &=0 \end{align} \]

\[ \begin{align} var(e_0)&=E(Y_0-\hat{Y}_0)^2\\ &=E(e_0^2)\\ &=E\left(u_0-\mathbf{X_0 (X'X)^{-1}X'u} \right)^2\\ &=\sigma^2\left( 1+ \mathbf{X_0(X'X)^{-1}X_0'}\right) \end{align} \]

and
\[ \begin{align} e_0& \sim N(\mu_{e_0},\sigma^2_{e_0})\\ e_0& \sim N\left(0, \sigma^2\left(1+\mathbf{X_0(X'X)^{-1}X_0'}\right)\right) \end{align} \]

construct a t statistic \[ \begin{align} t_{e_0}& =\frac{\hat{Y}_0-Y_0}{S_{e_0}} \sim t(n-k) \end{align} \]

therefore \[ \begin{align} \hat{Y}_0-t_{1-\alpha/2}(n-k) \cdot S_{Y_0-\hat{Y}_0} \leq Y_0 \leq \hat{Y}_0+t_{1-\alpha/2}(n-k) \cdot S_{Y_0-\hat{Y}_0} \end{align} \]

where \[ \begin{align} S_{Y_0-\hat{Y}_0}=S_{e_0} &=\sqrt{\hat{\sigma}^2 \left( 1+\mathbf{X_0(X'X)^{-1}X_0'} \right) } \\ \hat{\sigma}^2&=\frac{\mathbf{e'e}}{(n-k)} \end{align} \]

A final practical takeaway:

  • If your goal is decision-making about the average response at a covariate profile, use mean-response inference.
  • If your goal is forecasting an individual outcome, expect much wider uncertainty bands, even with a well-fitted model.
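Both interval formulas can be reproduced in matrix form and checked against predict(); mtcars and the covariate profile (wt = 3, hp = 150) are hypothetical illustration choices:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit)
n <- nrow(X); k <- ncol(X)
sigma2 <- sum(residuals(fit)^2) / (n - k)
x0 <- c(1, 3, 150)                               # hypothetical X_0 profile
h0 <- drop(t(x0) %*% solve(t(X) %*% X) %*% x0)   # X_0 (X'X)^{-1} X_0'
se_mean <- sqrt(sigma2 * h0)                     # mean response
se_ind  <- sqrt(sigma2 * (1 + h0))               # new individual
new <- data.frame(wt = 3, hp = 150)
ci <- predict(fit, new, interval = "confidence")
pi <- predict(fit, new, interval = "prediction")
tq <- qt(0.975, df = n - k)
all.equal(unname(ci[, "upr"] - ci[, "fit"]), tq * se_mean)
all.equal(unname(pi[, "upr"] - pi[, "fit"]), tq * se_ind)
```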

6.3 Multiple linear regression practice

This section is a hands-on workflow for multiple linear regression using the classic Boston housing dataset. The goal is not only to fit a model, but to practice the full sequence that a statistician typically follows in real projects:

  1. understand the dataset and variable types,
  2. check missingness and data quality,
  3. explore distributions and correlations,
  4. consider transformations for modeling stability,
  5. build training/test splits for honest evaluation,
  6. fit models (baseline, stepwise, polynomial, interaction, robust),
  7. check assumptions and diagnostics,
  8. compare models using information criteria and cross-validation,
  9. interpret coefficients and relative importance,
  10. perform prediction and validate performance.

Even though this example is not a clinical dataset, the workflow is transferable to many applied settings.


6.3.1 Load required packages

We start by loading packages that support regression modeling, data exploration, and diagnostics:

  • MASS: contains the Boston dataset and robust regression (rlm)
  • psych: descriptive summaries and EDA panels
  • car: regression utilities such as VIF

library(car)
library(MASS)
library(psych)

6.3.2 Loading and describing data

We load the dataset and create a working copy. In a real analysis, it is good practice to keep a pristine “original” object and perform transformations on copies.

describe() provides a compact per-variable summary: the number of observations, missing and distinct values, the mean, and key quantiles. This is often more informative than simply printing the first few rows. (The output shown below follows the Hmisc::describe() format; psych::describe() produces a similar per-variable summary.)

data(Boston)
data_ori <- Boston
describe(data_ori)
## data_ori 
## 
##  14  Variables      506  Observations
## --------------------------------------------------------------------------------
## crim 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      506        0      504        1    3.614    5.794  0.02791  0.03819 
##      .25      .50      .75      .90      .95 
##  0.08204  0.25651  3.67708 10.75300 15.78915 
## 
## lowest : 0.00632 0.00906 0.01096 0.01301 0.01311
## highest: 45.7461 51.1358 67.9208 73.5341 88.9762
## --------------------------------------------------------------------------------
## zn 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      506        0       26    0.603    11.36    18.77      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      0.0      0.0     12.5     42.5     80.0 
## 
## lowest : 0    12.5 17.5 18   20  , highest: 82.5 85   90   95   100 
## --------------------------------------------------------------------------------
## indus 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      506        0       76    0.982    11.14    7.705     2.18     2.91 
##      .25      .50      .75      .90      .95 
##     5.19     9.69    18.10    19.58    21.89 
## 
## lowest : 0.46  0.74  1.21  1.22  1.25 , highest: 18.1  19.58 21.89 25.65 27.74
## --------------------------------------------------------------------------------
## chas 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##      506        0        2    0.193       35  0.06917    0.129 
## 
## --------------------------------------------------------------------------------
## nox 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      506        0       81        1   0.5547   0.1295   0.4092   0.4270 
##      .25      .50      .75      .90      .95 
##   0.4490   0.5380   0.6240   0.7130   0.7400 
## 
## lowest : 0.385 0.389 0.392 0.394 0.398, highest: 0.713 0.718 0.74  0.77  0.871
## --------------------------------------------------------------------------------
## rm 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      506        0      446        1    6.285   0.7515    5.314    5.594 
##      .25      .50      .75      .90      .95 
##    5.886    6.208    6.623    7.152    7.588 
## 
## lowest : 3.561 3.863 4.138 4.368 4.519, highest: 8.375 8.398 8.704 8.725 8.78 
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      506        0      356    0.999    68.57    31.52    17.72    26.95 
##      .25      .50      .75      .90      .95 
##    45.02    77.50    94.07    98.80   100.00 
## 
## lowest : 2.9  6    6.2  6.5  6.6 , highest: 98.8 98.9 99.1 99.3 100 
## --------------------------------------------------------------------------------
## dis 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      506        0      412        1    3.795    2.298    1.462    1.628 
##      .25      .50      .75      .90      .95 
##    2.100    3.207    5.188    6.817    7.828 
## 
## lowest : 1.1296  1.137   1.1691  1.1742  1.1781 
## highest: 9.2203  9.2229  10.5857 10.7103 12.1265
## --------------------------------------------------------------------------------
## rad 
##        n  missing distinct     Info     Mean      Gmd 
##      506        0        9    0.959    9.549    8.518 
##                                                                 
## Value          1     2     3     4     5     6     7     8    24
## Frequency     20    24    38   110   115    26    17    24   132
## Proportion 0.040 0.047 0.075 0.217 0.227 0.051 0.034 0.047 0.261
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## tax 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      506        0       66    0.981    408.2    181.7      222      233 
##      .25      .50      .75      .90      .95 
##      279      330      666      666      666 
## 
## lowest : 187 188 193 198 216, highest: 432 437 469 666 711
## --------------------------------------------------------------------------------
## ptratio 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      506        0       46    0.978    18.46    2.383    14.70    14.75 
##      .25      .50      .75      .90      .95 
##    17.40    19.05    20.20    20.90    21.00 
## 
## lowest : 12.6 13   13.6 14.4 14.7, highest: 20.9 21   21.1 21.2 22  
## --------------------------------------------------------------------------------
## black 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      506        0      357    0.986    356.7     65.5    84.59   290.27 
##      .25      .50      .75      .90      .95 
##   375.38   391.44   396.23   396.90   396.90 
## 
## lowest : 0.32   2.52   2.6    3.5    3.65  , highest: 396.28 396.3  396.33 396.42 396.9 
## --------------------------------------------------------------------------------
## lstat 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      506        0      455        1    12.65    7.881    3.708    4.680 
##      .25      .50      .75      .90      .95 
##    6.950   11.360   16.955   23.035   26.808 
## 
## lowest : 1.73  1.92  1.98  2.47  2.87 , highest: 34.37 34.41 34.77 36.98 37.97
## --------------------------------------------------------------------------------
## medv 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      506        0      229        1    22.53    9.778    10.20    12.75 
##      .25      .50      .75      .90      .95 
##    17.02    21.20    25.00    34.80    43.40 
## 
## lowest : 5    5.6  6.3  7    7.2 , highest: 46.7 48.3 48.5 48.8 50  
## --------------------------------------------------------------------------------

summary() is a base R quick scan: it gives min/median/mean/max for numeric variables and counts for factors. This is a standard first step to detect strange ranges or unrealistic values.

summary(data_ori)
## Variable: Min / 1st Qu. / Median / Mean / 3rd Qu. / Max
## crim:    0.00632 / 0.08205 / 0.25651 / 3.61352 / 3.67708 / 88.97620
## zn:      0.00 / 0.00 / 0.00 / 11.36 / 12.50 / 100.00
## indus:   0.46 / 5.19 / 9.69 / 11.14 / 18.10 / 27.74
## chas:    0.00000 / 0.00000 / 0.00000 / 0.06917 / 0.00000 / 1.00000
## nox:     0.3850 / 0.4490 / 0.5380 / 0.5547 / 0.6240 / 0.8710
## rm:      3.561 / 5.886 / 6.208 / 6.285 / 6.623 / 8.780
## age:     2.90 / 45.02 / 77.50 / 68.57 / 94.08 / 100.00
## dis:     1.130 / 2.100 / 3.207 / 3.795 / 5.188 / 12.127
## rad:     1.000 / 4.000 / 5.000 / 9.549 / 24.000 / 24.000
## tax:     187.0 / 279.0 / 330.0 / 408.2 / 666.0 / 711.0
## ptratio: 12.60 / 17.40 / 19.05 / 18.46 / 20.20 / 22.00
## black:   0.32 / 375.38 / 391.44 / 356.67 / 396.23 / 396.90
## lstat:   1.73 / 6.95 / 11.36 / 12.65 / 16.95 / 37.97
## medv:    5.00 / 17.02 / 21.20 / 22.53 / 25.00 / 50.00

6.3.3 Create table 1

A “Table 1” is a standard descriptive table for reporting baseline characteristics (clinical trials, observational studies, epidemiology, etc.). Here we create a Table 1 across all variables without grouping, mainly to practice the tool and check distributions.

library(boot) 
library(table1)
## Warning: package 'table1' was built under R version 4.4.3
table1(~ . , data=data_ori)
           Overall (N=506)
Variable   Mean (SD)        Median [Min, Max]
crim       3.61 (8.60)      0.257 [0.00632, 89.0]
zn         11.4 (23.3)      0 [0, 100]
indus      11.1 (6.86)      9.69 [0.460, 27.7]
chas       0.0692 (0.254)   0 [0, 1.00]
nox        0.555 (0.116)    0.538 [0.385, 0.871]
rm         6.28 (0.703)     6.21 [3.56, 8.78]
age        68.6 (28.1)      77.5 [2.90, 100]
dis        3.80 (2.11)      3.21 [1.13, 12.1]
rad        9.55 (8.71)      5.00 [1.00, 24.0]
tax        408 (169)        330 [187, 711]
ptratio    18.5 (2.16)      19.1 [12.6, 22.0]
black      357 (91.3)       391 [0.320, 397]
lstat      12.7 (7.14)      11.4 [1.73, 38.0]
medv       22.5 (9.20)      21.2 [5.00, 50.0]

Practical note: in real reporting, Table 1 is usually stratified by a group (e.g., treatment arm), but the unstratified version is still useful as a data audit.
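To illustrate the stratified version, the same call can be grouped by the binary chas indicator using table1's formula interface. A minimal sketch (the labeled factor chas_f and the copy d are helpers added here, not part of the original workflow):

```r
library(table1)

# Work on a copy so the extra factor column does not affect later steps
d <- data_ori
d$chas_f <- factor(d$chas, levels = c(0, 1),
                   labels = c("Off river", "On river"))

# The variable after "|" defines the stratifying columns
table1(~ crim + rm + lstat + medv | chas_f, data = d)
```

In a clinical report, the stratifying variable would typically be the treatment arm rather than a geographic indicator.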


6.3.4 Missingness checking

Before modeling, verify whether any variables have missing values and whether patterns exist. The Boston dataset is typically complete, but this step is included because real datasets almost never are.

md.pattern() (from mice) shows missingness patterns and counts by variable.

library(mice)
## Warning: package 'mice' was built under R version 4.4.3
md.pattern(data_ori)
##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'

    crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
506    1  1     1    1   1  1   1   1   1   1       1     1     1    1  0
       0  0     0    0   0  0   0   0   0   0       0     0     0    0  0

If missingness exists, you should decide whether it is: - MCAR (completely at random), - MAR (at random conditional on observed data), - MNAR (not at random).

That decision influences the imputation strategy and the validity of downstream inference.
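When missingness is present, a per-column missing rate is a useful first summary before reasoning about MCAR/MAR/MNAR. A base-R sketch:

```r
# Proportion of missing values in each column
# (all zeros for the complete Boston data)
miss_rate <- colMeans(is.na(data_ori))
sort(miss_rate, decreasing = TRUE)
```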


6.3.5 Exploratory data analysis

Exploratory data analysis (EDA) is where you learn the “shape” of the data: correlations, nonlinear relationships, skewness, and potential outliers.

6.3.5.1 Correlation matrix and bivariate panels

pairs.panels() is a powerful all-in-one diagnostic: - scatterplots for each pair, - correlations, - histograms (or density plots) on the diagonal.

This often reveals multicollinearity and potential transformations needed.

library(psych)
pairs.panels(data_ori)

6.3.5.2 Histograms

Histograms quickly show skewness, long tails, spikes, and potential coding problems (e.g., a continuous variable with only a few discrete values).

library(DataExplorer)
plot_histogram(data_ori)


6.3.6 Transformations

Many predictors in real-world socioeconomic and biomedical data are skewed. Transformations can: - improve linearity, - stabilize variance, - reduce influence of extreme values, - make residuals closer to normal.

Here you apply a set of transformations guided by a common heuristic: - log transforms for strictly positive skewed variables, - square root for moderately skewed, - reflections when needed to “flip” direction (as shown with age and black).

The goal is not “making everything normal,” but making linear modeling assumptions more reasonable.

library(tidyverse)
data_trans = data_ori %>%  mutate(age= sqrt(max(age)+1-age),
                         black= log10(max(black)+1-black),
                         crim= log10(crim),
                         dis= sqrt(dis)   )
plot_histogram(data_trans)

# pairs.panels(data2)

(Figure: how to transform data for normality.)

A practical note: transformations should be applied thoughtfully and documented clearly, especially if you need interpretability. For example, a log-transformed predictor means coefficients represent changes per multiplicative change in the original scale.
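One quick way to judge whether a transform helped is to compare sample skewness before and after. A minimal base-R sketch (the skew() helper is defined here for illustration and is not part of the original workflow):

```r
# Sample skewness: the third standardized moment
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew(data_ori$crim)          # strongly right-skewed on the original scale
skew(log10(data_ori$crim))   # closer to symmetric after log10
```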


6.3.7 Check linearity between \(y\) and \(x\)

Before fitting a multivariable model, it helps to check whether key predictors have roughly linear relationships with the outcome. Scatterplots can immediately show: - curvature (suggesting polynomial terms), - clusters (suggesting interactions or stratification), - heteroscedasticity (spread changes with \(X\)).

attach(data_trans)
# Put the predictor on the x-axis and the outcome (medv) on the y-axis
plot(rm, medv)

plot(lstat, medv)

plot(age, medv)

plot(black, medv)

plot(crim, medv)

In practice, the Boston housing dataset is known for strong relationships between medv and both lstat and rm, and these often show nonlinearity—motivating quadratic terms and interactions later.
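A loess smooth overlaid on the scatterplot makes curvature easier to see than points alone. A base-R sketch using scatter.smooth():

```r
# scatter.smooth() draws the points plus a loess trend line;
# lpars controls the appearance of the smoothed line
scatter.smooth(data_trans$lstat, data_trans$medv,
               xlab = "lstat (transformed scale)", ylab = "medv",
               lpars = list(col = "red", lwd = 2))
```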


6.3.8 Data imputation and normalization

This workflow demonstrates KNN imputation using caret::preProcess(method = "knnImpute"). Even if the dataset has no missing values, this section is included because missingness is common in applied work.

6.3.8.1 For original “data”

We store medv separately before preprocessing, then restore it afterward. This preserves the target variable while applying preprocessing to predictors.

library(caret)
# Create the knn imputation model on the training data
y=data_ori$medv
preProcess_missingdata_model <- preProcess(data_ori , method='knnImpute')
preProcess_missingdata_model
## Created from 506 samples and 14 variables
## 
## Pre-processing:
##   - centered (14)
##   - ignored (0)
##   - 5 nearest neighbor imputation (14)
##   - scaled (14)

Then we apply the model and verify missingness is resolved. anyNA() is a quick binary check; in larger projects you may also compute missing rates by column.

# Use the imputation model to predict the values of missing data points
library(RANN)  # required for knnImpute
## Warning: package 'RANN' was built under R version 4.4.3
data_ori  <- predict(preProcess_missingdata_model, newdata = data_ori )
anyNA(data_ori )
## [1] FALSE
data_ori$medv <- y

6.3.8.2 For transformed “data2”

We repeat the same workflow for the transformed dataset. This creates a fair comparison between “original scale” and “transformed scale” modeling pipelines.

library(caret)
y2=data_trans$medv
# Create the knn imputation model on the training data
preProcess_missingdata_model2 <- preProcess(data_trans , method='knnImpute')
preProcess_missingdata_model2
## Created from 506 samples and 14 variables
## 
## Pre-processing:
##   - centered (14)
##   - ignored (0)
##   - 5 nearest neighbor imputation (14)
##   - scaled (14)
# Use the imputation model to predict the values of missing data points
library(RANN)  # required for knnImpute
data_trans  <- predict(preProcess_missingdata_model2, newdata = data_trans )
anyNA(data_trans )
## [1] FALSE
data_trans$medv <- y2

6.3.9 Generate dummy variables

Categorical predictors must be encoded properly before modeling. In many workflows, converting variables to factor is enough because lm() automatically handles factors using dummy coding (with a reference level).

  • Alternatively, convert categorical predictors with as.factor() before calling lm()

In real projects, pay attention to: - reference group selection, - whether categories are sparse, - and whether encoding must be consistent across train/test.
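To see the dummy coding that lm() applies, model.matrix() exposes the design matrix. A sketch using chas as the categorical predictor (the objects f and f2 are illustrative helpers):

```r
# Treat chas as a factor; lm() would create one dummy column,
# with the first level ("0") as the reference group
f <- factor(data_ori$chas)
head(model.matrix(~ f))   # intercept plus a dummy for level "1"
levels(f)                 # the reference level is listed first

# relevel() changes the reference group if needed
f2 <- relevel(f, ref = "1")
```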


6.3.10 Splitting data into training and test data

A training/test split provides external-style evaluation: the model is fit on training data and assessed on held-out test data.

This helps detect overfitting, especially when you: - try many model variants, - add polynomial terms, - include interactions, - or apply variable selection procedures.

# Create the training and test datasets
set.seed(123)
# for original data 
# Step 1: Get row numbers for the training data
trainRowNumbers <- createDataPartition(data_ori$medv, p=0.8, list=FALSE)

# Step 2: Create the training  dataset
data  <- data_ori[trainRowNumbers,]

# Step 3: Create the test dataset
testdata  <- data_ori[-trainRowNumbers,]

# for transformed data
# Step 1: Get row numbers for the training data
trainRowNumbers2 <- createDataPartition(data_trans$medv, p=0.8, list=FALSE)

# Step 2: Create the training  dataset
data2 <- data_trans[trainRowNumbers2,]

# Step 3: Create the test dataset
testdata2 <- data_trans[-trainRowNumbers2,]

Practical note: createDataPartition() creates balanced partitions with respect to the outcome distribution, which can be helpful when the outcome is skewed.
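One way to confirm that the partition preserved the outcome distribution is to compare summaries of medv in the two pieces; the quantiles should look similar:

```r
# Train vs test distribution of the outcome
summary(data$medv)
summary(testdata$medv)
```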


6.3.11 Stepwise regression

Stepwise regression is a common teaching tool and sometimes used in quick exploratory modeling. It iteratively adds/removes variables to optimize a criterion (typically AIC by default).

However, in serious applied work, stepwise selection can be unstable and can inflate type I error if you treat the final p-values as if selection never happened. Treat it as a screening tool, and validate with cross-validation.

model_o = lm(  medv ~.  , data=data2)
step(model_o,direction = "both")
## Start:  AIC=1281.15
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + black + lstat
## 
##           Df Sum of Sq     RSS    AIC
## - black    1      0.17  8847.0 1279.2
## - age      1      7.07  8853.9 1279.5
## - crim     1     14.36  8861.2 1279.8
## - indus    1     24.08  8870.9 1280.3
## <none>                  8846.8 1281.2
## - rad      1    103.22  8950.0 1283.9
## - tax      1    156.33  9003.1 1286.3
## - zn       1    198.34  9045.2 1288.2
## - chas     1    251.31  9098.1 1290.5
## - nox      1    692.00  9538.8 1309.8
## - ptratio  1    840.04  9686.9 1316.1
## - rm       1    965.90  9812.7 1321.3
## - dis      1   1349.41 10196.2 1336.9
## - lstat    1   2766.14 11613.0 1389.9
## 
## Step:  AIC=1279.16
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + lstat
## 
##           Df Sum of Sq     RSS    AIC
## - age      1      7.15  8854.1 1277.5
## - crim     1     14.32  8861.3 1277.8
## - indus    1     24.52  8871.5 1278.3
## <none>                  8847.0 1279.2
## + black    1      0.17  8846.8 1281.2
## - rad      1    103.72  8950.7 1281.9
## - tax      1    157.40  9004.4 1284.3
## - zn       1    198.20  9045.2 1286.2
## - chas     1    251.25  9098.2 1288.6
## - nox      1    695.37  9542.4 1308.0
## - ptratio  1    850.76  9697.7 1314.5
## - rm       1    966.99  9814.0 1319.4
## - dis      1   1375.04 10222.0 1336.0
## - lstat    1   2770.28 11617.3 1388.0
## 
## Step:  AIC=1277.49
## medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax + 
##     ptratio + lstat
## 
##           Df Sum of Sq     RSS    AIC
## - crim     1     18.36  8872.5 1276.3
## - indus    1     25.56  8879.7 1276.7
## <none>                  8854.1 1277.5
## + age      1      7.15  8847.0 1279.2
## + black    1      0.26  8853.9 1279.5
## - rad      1     97.20  8951.3 1279.9
## - tax      1    152.93  9007.1 1282.5
## - zn       1    196.76  9050.9 1284.4
## - chas     1    255.17  9109.3 1287.0
## - nox      1    694.68  9548.8 1306.2
## - ptratio  1    843.66  9697.8 1312.5
## - rm       1   1023.40  9877.5 1320.0
## - dis      1   1633.60 10487.7 1344.4
## - lstat    1   2978.57 11832.7 1393.5
## 
## Step:  AIC=1276.33
## medv ~ zn + indus + chas + nox + rm + dis + rad + tax + ptratio + 
##     lstat
## 
##           Df Sum of Sq     RSS    AIC
## - indus    1     21.92  8894.4 1275.3
## <none>                  8872.5 1276.3
## + crim     1     18.36  8854.1 1277.5
## + age      1     11.19  8861.3 1277.8
## + black    1      0.13  8872.4 1278.3
## - tax      1    158.39  9030.9 1281.5
## - zn       1    180.19  9052.7 1282.5
## - rad      1    212.19  9084.7 1284.0
## - chas     1    249.50  9122.0 1285.6
## - nox      1    689.33  9561.8 1304.8
## - ptratio  1    873.78  9746.3 1312.6
## - rm       1   1025.43  9897.9 1318.8
## - dis      1   1701.37 10573.9 1345.7
## - lstat    1   2996.77 11869.3 1392.8
## 
## Step:  AIC=1275.34
## medv ~ zn + chas + nox + rm + dis + rad + tax + ptratio + lstat
## 
##           Df Sum of Sq     RSS    AIC
## <none>                  8894.4 1275.3
## + indus    1     21.92  8872.5 1276.3
## + crim     1     14.72  8879.7 1276.7
## + age      1     11.89  8882.5 1276.8
## + black    1      0.00  8894.4 1277.3
## - zn       1    206.52  9100.9 1282.7
## - chas     1    237.50  9131.9 1284.1
## - tax      1    281.42  9175.8 1286.0
## - rad      1    293.27  9187.7 1286.5
## - nox      1    800.54  9695.0 1308.4
## - ptratio  1    929.05  9823.5 1313.8
## - rm       1   1083.34  9977.8 1320.1
## - dis      1   1706.94 10601.4 1344.8
## - lstat    1   3024.07 11918.5 1392.5
## 
## Call:
## lm(formula = medv ~ zn + chas + nox + rm + dis + rad + tax + 
##     ptratio + lstat, data = data2)
## 
## Coefficients:
## (Intercept)           zn         chas          nox           rm          dis  
##      22.509        1.016        0.759       -2.764        2.176       -3.970  
##         rad          tax      ptratio        lstat  
##       2.119       -2.317       -1.991       -4.224
# summary(step(model_o,direction = "both"))

6.3.12 Create a model after selecting variables

After variable selection (or based on domain knowledge), we fit a parsimonious model and inspect its statistical summary.

The summary() output gives: - coefficient estimates, - standard errors and t-tests, - residual standard error, - \(R^2\) and adjusted \(R^2\), - overall F-statistic.

model_trasf <- lm(formula = medv ~ zn + chas + nox + rm + dis + rad + tax + 
    ptratio + lstat, data = data2)
summary(model_trasf)
## 
## Call:
## lm(formula = medv ~ zn + chas + nox + rm + dis + rad + tax + 
##     ptratio + lstat, data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2923  -2.4690  -0.5086   1.6269  24.5813 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.5090     0.2352  95.718  < 2e-16 ***
## zn            1.0161     0.3347   3.036 0.002554 ** 
## chas          0.7590     0.2331   3.256 0.001227 ** 
## nox          -2.7640     0.4624  -5.978 5.05e-09 ***
## rm            2.1760     0.3129   6.954 1.47e-11 ***
## dis          -3.9697     0.4548  -8.729  < 2e-16 ***
## rad           2.1194     0.5858   3.618 0.000335 ***
## tax          -2.3171     0.6538  -3.544 0.000441 ***
## ptratio      -1.9909     0.3092  -6.440 3.47e-10 ***
## lstat        -4.2244     0.3636 -11.618  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.733 on 397 degrees of freedom
## Multiple R-squared:  0.7187, Adjusted R-squared:  0.7123 
## F-statistic: 112.7 on 9 and 397 DF,  p-value: < 2.2e-16

At this stage, interpretability matters: each coefficient reflects the effect of that predictor holding others constant, assuming linearity and correct specification.


6.3.13 Multicollinearity checking

Multicollinearity inflates standard errors and makes coefficient estimates unstable. Variance inflation factor (VIF) is a standard diagnostic: - VIF ≈ 1: no collinearity - VIF moderately large: correlation among predictors - Very large VIF: serious instability

library(car)   # vif() is provided by the car package
vif(model_trasf)
##       zn     chas      nox       rm      dis      rad      tax  ptratio 
## 2.024269 1.044796 4.074241 1.721080 3.759267 6.008751 7.469414 1.743359 
##    lstat 
## 2.372125

In practice, multicollinearity is common in socioeconomic variables and in biomedical lab panels. If VIF is high, consider: - removing redundant variables, - combining variables (indexes), - penalized regression methods (ridge/lasso).
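If collinearity were severe, a penalized fit is one remedy. A minimal ridge sketch with glmnet (alpha = 0 selects the ridge penalty; glmnet is assumed to be installed and is not used elsewhere in this chapter):

```r
library(glmnet)

# glmnet needs a numeric predictor matrix and a response vector
x <- model.matrix(medv ~ ., data = data2)[, -1]  # drop the intercept column
y <- data2$medv

# Cross-validated ridge regression; lambda.min is the CV-optimal penalty
cv_ridge <- cv.glmnet(x, y, alpha = 0)
coef(cv_ridge, s = "lambda.min")
```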


6.3.14 Plot model to check assumptions

The default plot(lm_object) produces key diagnostic plots: - residuals vs fitted (linearity, heteroscedasticity), - normal Q-Q (normality of residuals), - scale-location (variance stability), - residuals vs leverage (influential points).

plot(model_trasf)

6.3.14.1 Histogram of residuals

A quick distribution check; it complements the Q-Q plot.

resid<- model_trasf$residuals
hist(resid)

6.3.14.2 F test of model

ANOVA for the fitted model provides model-level significance and decomposition of sums of squares.

anova(model_trasf)
            Df    Sum Sq    Mean Sq   F value  Pr(>F)
zn           1 4019.0308 4019.03076 179.38831   0e+00
chas         1 1011.5766 1011.57663  45.15144   0e+00
nox          1 2785.8086 2785.80857 124.34378   0e+00
rm           1 8487.0727 8487.07268 378.81810   0e+00
dis          1  951.7774  951.77743  42.48232   0e+00
rad          1  558.8550  558.85503  24.94434   9e-07
tax          1  767.8718  767.87176  34.27374   0e+00
ptratio      1 1119.3945 1119.39453  49.96386   0e+00
lstat        1 3024.0665 3024.06652 134.97836   0e+00
Residuals  397 8894.4215   22.40408

6.3.14.3 Coefficients table

This is the standard regression table with estimates, SEs, t-statistics, and p-values.

coef(summary(model_trasf))
              Estimate  Std. Error     t value   Pr(>|t|)
(Intercept) 22.5089571   0.2351584   95.718273  0.0000000
zn           1.0160975   0.3346703    3.036115  0.0025545
chas         0.7589909   0.2331142    3.255876  0.0012275
nox         -2.7639762   0.4623875   -5.977619  0.0000000
rm           2.1760130   0.3129267    6.953747  0.0000000
dis         -3.9697228   0.4547944   -8.728609  0.0000000
rad          2.1193795   0.5857829    3.618029  0.0003352
tax         -2.3171470   0.6537962   -3.544143  0.0004408
ptratio     -1.9908948   0.3091666   -6.439552  0.0000000
lstat       -4.2244121   0.3636086  -11.618019  0.0000000

6.3.14.4 Confidence intervals

Confidence intervals help quantify uncertainty around effect sizes and are generally more informative than p-values alone.

confint(model_trasf)
                 2.5 %     97.5 %
(Intercept) 22.0466457  22.971269
zn           0.3581500   1.674045
chas         0.3006983   1.217283
nox         -3.6730103  -1.854942
rm           1.5608125   2.791214
dis         -4.8638292  -3.075616
rad          0.9677553   3.271004
tax         -3.6024824  -1.031812
ptratio     -2.5987032  -1.383086
lstat       -4.9392512  -3.509573

6.3.15 Add polynomial (quadratic) terms

When scatterplots suggest curvature, a quadratic term can capture nonlinearity without fully abandoning linear regression.

Here we add \(rm^2\) and \(lstat^2\). This often improves fit when relationships are curved.

model_trasf_poly <- lm(formula = medv ~ zn + chas + nox +  I(rm^2) + dis + rad + tax + 
    ptratio +   I(lstat^2), data = data2)
summary(model_trasf_poly)
## 
## Call:
## lm(formula = medv ~ zn + chas + nox + I(rm^2) + dis + rad + tax + 
##     ptratio + I(lstat^2), data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.0736  -3.4029  -0.6212   2.8340  29.4942 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.2859     0.3526  63.212  < 2e-16 ***
## zn            1.5524     0.4096   3.790 0.000174 ***
## chas          0.9570     0.2834   3.377 0.000805 ***
## nox          -4.4962     0.5455  -8.242 2.50e-15 ***
## I(rm^2)       1.5313     0.1589   9.637  < 2e-16 ***
## dis          -3.3186     0.5644  -5.880 8.69e-09 ***
## rad           3.0529     0.7024   4.346 1.76e-05 ***
## tax          -3.6647     0.7861  -4.662 4.28e-06 ***
## ptratio      -2.6800     0.3706  -7.232 2.47e-12 ***
## I(lstat^2)   -1.3170     0.1918  -6.865 2.56e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.763 on 397 degrees of freedom
## Multiple R-squared:  0.583,  Adjusted R-squared:  0.5735 
## F-statistic: 61.67 on 9 and 397 DF,  p-value: < 2.2e-16

Practical note: polynomial terms can improve fit but may complicate interpretation. Always validate that the improvement generalizes (e.g., CV).


6.3.16 Add interaction terms

Interactions allow the effect of one predictor to depend on another. Conceptually, they represent effect modification.

Here we model the interaction \(rm \times lstat\). This is a meaningful interaction in Boston housing: the benefit of more rooms may differ across neighborhood socioeconomic status proxies.

  • interaction between rm and lstat
  • as a rough rule of thumb, R² > 0.7 indicates a good fit for this model

model_trasf_term <- lm(formula = medv ~ zn + chas + nox +  (rm*  lstat) + dis + rad + tax + 
    ptratio  , data = data2)
summary(model_trasf_term)
## 
## Call:
## lm(formula = medv ~ zn + chas + nox + (rm * lstat) + dis + rad + 
##     tax + ptratio, data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4595  -2.3458  -0.2389   1.7950  26.6992 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  21.4297     0.2379  90.060  < 2e-16 ***
## zn            0.5221     0.3046   1.714 0.087296 .  
## chas          0.6538     0.2095   3.120 0.001941 ** 
## nox          -2.0295     0.4218  -4.812 2.13e-06 ***
## rm            1.6459     0.2860   5.754 1.75e-08 ***
## lstat        -5.8715     0.3669 -16.002  < 2e-16 ***
## dis          -3.1810     0.4161  -7.645 1.60e-13 ***
## rad           1.9251     0.5262   3.658 0.000288 ***
## tax          -1.9187     0.5883  -3.261 0.001205 ** 
## ptratio      -1.5554     0.2811  -5.534 5.70e-08 ***
## rm:lstat     -1.8202     0.1852  -9.830  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.249 on 396 degrees of freedom
## Multiple R-squared:  0.7739, Adjusted R-squared:  0.7682 
## F-statistic: 135.5 on 10 and 396 DF,  p-value: < 2.2e-16

Diagnostic plots remain essential because adding interactions can create leverage points and change residual structure.

plot(model_trasf_term)


6.3.17 Robust regression

Outliers and influential points can dominate OLS. Robust regression (rlm) downweights extreme residuals and can produce more stable estimates.

This is especially relevant when: - the dataset contains measurement errors, - there are heavy tails, - influential observations distort inference.

library(MASS)  # rlm() is provided by MASS
robust_model_term <- rlm(medv ~ zn + chas + nox +  (rm*  lstat) + dis + rad + tax + 
    ptratio  , data = data2)
summary(robust_model_term)
## 
## Call: rlm(formula = medv ~ zn + chas + nox + (rm * lstat) + dis + rad + 
##     tax + ptratio, data = data2)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -20.90999  -1.74873  -0.09845   1.92931  33.87244 
## 
## Coefficients:
##             Value    Std. Error t value 
## (Intercept)  20.9624   0.1656   126.5763
## zn            0.1998   0.2120     0.9424
## chas          0.5198   0.1458     3.5643
## nox          -1.3458   0.2935    -4.5846
## rm            2.8773   0.1991    14.4524
## lstat        -4.5706   0.2554   -17.8979
## dis          -1.9208   0.2896    -6.6326
## rad           0.9005   0.3663     2.4586
## tax          -1.7199   0.4095    -4.2004
## ptratio      -1.2451   0.1956    -6.3652
## rm:lstat     -1.9899   0.1289   -15.4398
## 
## Residual standard error: 2.698 on 396 degrees of freedom

Practical note: robust regression changes the objective function and standard inference is different. It’s often used for sensitivity analysis rather than as the sole primary model.
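A side-by-side comparison of the OLS and robust coefficients makes this sensitivity analysis concrete:

```r
# Large OLS-vs-robust differences flag coefficients driven by outliers
round(cbind(OLS    = coef(model_trasf_term),
            Robust = coef(robust_model_term)), 3)
```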


6.3.18 Create a model before transforming data

To understand the impact of transformation, we fit the analogous model on the original (non-transformed) training dataset. This helps assess whether transformation improves: - fit, - residual behavior, - predictive performance.

model_trasf_orig <- lm(formula = medv ~ zn + chas + nox + rm + dis + rad + tax + 
    ptratio + lstat, data = data)
summary(model_trasf_orig)
## 
## Call:
## lm(formula = medv ~ zn + chas + nox + rm + dis + rad + tax + 
##     ptratio + lstat, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.219  -2.729  -0.463   1.920  25.992 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.5094     0.2430  92.619  < 2e-16 ***
## zn            0.8232     0.3754   2.193 0.028904 *  
## chas          0.6582     0.2412   2.728 0.006652 ** 
## nox          -1.9351     0.4727  -4.093 5.15e-05 ***
## rm            2.3985     0.3128   7.668 1.36e-13 ***
## dis          -2.9289     0.4565  -6.416 3.99e-10 ***
## rad           2.2109     0.6348   3.483 0.000551 ***
## tax          -2.1880     0.6896  -3.173 0.001627 ** 
## ptratio      -2.0274     0.3254  -6.230 1.19e-09 ***
## lstat        -4.3534     0.3844 -11.325  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.895 on 397 degrees of freedom
## Multiple R-squared:  0.7212, Adjusted R-squared:  0.7149 
## F-statistic: 114.1 on 9 and 397 DF,  p-value: < 2.2e-16

6.3.18.1 Non-nested model comparisons (AIC)

AIC compares models by balancing fit and complexity. Lower AIC indicates better tradeoff (within the candidate set).

AIC(model_trasf,model_trasf_orig)
                 df      AIC
model_trasf      11 2432.353
model_trasf_orig 11 2459.780

6.3.19 K-fold cross validation

Cross-validation provides a more direct estimate of predictive performance. It is especially important after model selection or when comparing multiple model families.

Here we use 10-fold CV via boot::cv.glm() (note: cv.glm() lives in the boot package, not DAAG).

library(boot)  # cv.glm() is in boot (loaded earlier in this chapter)
set.seed(123)
model_trasf_term_cv <- glm( medv ~ zn + chas + nox +  (rm*  lstat) + dis + rad + tax + 
    ptratio  , data = data2)
cv.err  <- cv.glm(data2, model_trasf_term_cv, K = 10)$delta
cv.err 
## [1] 19.24588 19.15400

Interpretation: - The delta output typically includes raw and adjusted CV error estimates. - Compare CV errors across competing models; smaller values indicate better predictive performance.
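The same CV machinery can rank competing models. A sketch comparing the interaction model against a main-effects-only version (both fit with glm(), because cv.glm() expects a glm object; main_cv is a helper fit added here for illustration):

```r
set.seed(123)

# Main-effects model, no rm:lstat interaction
main_cv <- glm(medv ~ zn + chas + nox + rm + lstat + dis + rad + tax + ptratio,
               data = data2)

# Raw 10-fold CV error for each model; smaller is better
cv.glm(data2, main_cv, K = 10)$delta[1]
cv.glm(data2, model_trasf_term_cv, K = 10)$delta[1]
```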


6.3.20 Non-nested model comparisons

Once you have multiple candidate models (original scale, transformed, polynomial, interaction), compare them side-by-side.

AIC is one way; CV error is another. In applied practice, you often consider both: - AIC for model parsimony, - CV for predictive robustness.

AIC(model_trasf_term,model_trasf,model_trasf_orig,model_trasf_poly)
                 df      AIC
model_trasf_term 12 2345.492
model_trasf      11 2432.353
model_trasf_orig 11 2459.780
model_trasf_poly 11 2592.584
# AIC ranking, best to worst: interaction, transformed, original, polynomial

6.3.21 Posterior predictive / diagnostic checks

Even after selecting a “best” model, the most important step is to verify assumptions and identify influential observations.

performance::check_model() provides a comprehensive set of diagnostics in one call: - linearity, - homoscedasticity, - influential points, - collinearity, - normality of residuals.

# install.packages("performance")
library(performance)
## Warning: package 'performance' was built under R version 4.4.3
check_model(model_trasf_term)

In practice, if diagnostics are poor, consider: - transformations, - adding nonlinear terms, - robust SEs, - or moving to a more appropriate model class.
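For heteroscedasticity specifically, heteroscedasticity-consistent (sandwich) standard errors are a common fix that does not require refitting the model. A sketch assuming the sandwich and lmtest packages are installed (neither is used elsewhere in this chapter):

```r
library(sandwich)
library(lmtest)

# HC3 is a common small-sample-robust covariance estimator;
# coefficients are unchanged, only the SEs and tests are adjusted
coeftest(model_trasf_term, vcov = vcovHC(model_trasf_term, type = "HC3"))
```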


6.3.22 Forest plot for coefficients

Coefficient plots help communicate results clearly. They emphasize effect size and uncertainty, not only p-values.

library(sjPlot) 
plot_model(model_trasf_term, show.values = TRUE, value.offset = 0.4) 


6.3.23 Relative Importance

When predictors are correlated, raw coefficients are not always a good measure of “importance.” Relative importance methods attempt to quantify each predictor’s contribution to explained variance.

Here, relaimpo is used with bootstrap resampling to assess stability.

library(relaimpo)
## Warning: package 'relaimpo' was built under R version 4.4.3
## Warning: package 'survey' was built under R version 4.4.3
## Warning: package 'mitools' was built under R version 4.4.3
# calc.relimp(fit,type=c("lmg","last","first","pratt"),
#    rela=TRUE)
# Bootstrap measures of relative importance (b = 10 here for speed; use ~1000 in practice)
boot <- boot.relimp(model_trasf_term, b = 10, type = c("lmg"), rank = TRUE,
                    # type =c("lmg","last","first","pratt")
  diff = TRUE, rela = TRUE)
booteval.relimp(boot) # print result
## Warning in norm.inter(t, alpha): extreme order statistics used as endpoints
## Response variable: medv 
## Total response variance: 77.88147 
## Analysis based on 407 observations 
## 
## 10 Regressors: 
## zn chas nox rm lstat dis rad tax ptratio rm:lstat 
## Proportion of variance explained by model: 77.39%
## Metrics are normalized to sum to 100% (rela=TRUE). 
## 
## Relative importance metrics: 
## 
##                 lmg
## zn       0.03220843
## chas     0.01925573
## nox      0.05753222
## rm       0.24841214
## lstat    0.32582422
## dis      0.04647715
## rad      0.03216196
## tax      0.05940953
## ptratio  0.09003437
## rm:lstat 0.08868423
## 
## Average coefficients for different model sizes: 
## 
##                  1X        2Xs        3Xs        4Xs        5Xs        6Xs
## zn        3.1505046  2.1309586  1.5419108  1.1885764  0.9622432  0.8066737
## chas      1.4317561  1.3828980  1.2922217  1.1827563  1.0710746  0.9666952
## nox      -3.6683140 -2.8590532 -2.3987148 -2.1635246 -2.0628637 -2.0336275
## rm        5.8715576  5.1812470  4.6686609  4.2214027  3.7902531  3.3584729
## lstat    -6.3961998 -6.2413211 -6.1543204 -6.0962610 -6.0505503 -6.0111059
## dis       2.3495853  0.6106749 -0.5986185 -1.4554886 -2.0655023 -2.4944977
## rad      -3.2090101 -1.5469225 -0.4505596  0.2850626  0.7884232  1.1408255
## tax      -4.1727995 -3.6061521 -3.1776467 -2.8413999 -2.5683300 -2.3434564
## ptratio  -4.2048171 -3.4233285 -2.9476391 -2.6199136 -2.3633992 -2.1469790
## rm:lstat -0.7386853 -1.0072327 -1.1975350 -1.3430505 -1.4610987 -1.5602846
##                 7Xs        8Xs        9Xs       10Xs
## zn        0.6942682  0.6122971  0.5558327  0.5221292
## chas      0.8734919  0.7916003  0.7191770  0.6537613
## nox      -2.0351844 -2.0434312 -2.0444974 -2.0294702
## rm        2.9235837  2.4886854  2.0599352  1.6458985
## lstat    -5.9756419 -5.9424093 -5.9088868 -5.8714971
## dis      -2.7882468 -2.9825750 -3.1064663 -3.1809633
## rad       1.3969994  1.5963587  1.7666711  1.9250667
## tax      -2.1625306 -2.0281734 -1.9455332 -1.9186629
## ptratio  -1.9610124 -1.8027959 -1.6694147 -1.5553973
## rm:lstat -1.6447082 -1.7160362 -1.7745576 -1.8202278
## 
##  
##  Confidence interval information ( 10 bootstrap replicates, bty= perc ): 
## Relative Contributions with confidence intervals: 
##  
##                                    Lower  Upper
##              percentage 0.95       0.95   0.95  
## zn.lmg       0.0322     _______HIJ 0.0204 0.0357
## chas.lmg     0.0193     ______GHIJ 0.0016 0.0501
## nox.lmg      0.0575     ____EFG___ 0.0430 0.0708
## rm.lmg       0.2484     AB________ 0.1916 0.3284
## lstat.lmg    0.3258     AB________ 0.2617 0.3836
## dis.lmg      0.0465     _____FGHIJ 0.0286 0.0548
## rad.lmg      0.0322     ______GHIJ 0.0215 0.0465
## tax.lmg      0.0594     ___DEF____ 0.0479 0.0796
## ptratio.lmg  0.0900     __CD______ 0.0711 0.1202
## rm:lstat.lmg 0.0887     __CDE_____ 0.0606 0.1037
## 
## Letters indicate the ranks covered by bootstrap CIs. 
## (Rank bootstrap confidence intervals always obtained by percentile method) 
## CAUTION: Bootstrap confidence intervals can be somewhat liberal. 
## 
##  
##  Differences between Relative Contributions: 
##  
##                                      Lower   Upper
##                      difference 0.95 0.95    0.95   
## zn-chas.lmg           0.0130         -0.0192  0.0276
## zn-nox.lmg           -0.0253     *   -0.0504 -0.0137
## zn-rm.lmg            -0.2162     *   -0.3028 -0.1559
## zn-lstat.lmg         -0.2936     *   -0.3478 -0.2274
## zn-dis.lmg           -0.0143         -0.0235  0.0019
## zn-rad.lmg            0.0000         -0.0221  0.0084
## zn-tax.lmg           -0.0272     *   -0.0532 -0.0166
## zn-ptratio.lmg       -0.0578     *   -0.0911 -0.0377
## zn-rm:lstat.lmg      -0.0565     *   -0.0746 -0.0305
## chas-nox.lmg         -0.0383     *   -0.0584 -0.0033
## chas-rm.lmg          -0.2292     *   -0.3144 -0.1753
## chas-lstat.lmg       -0.3066     *   -0.3672 -0.2117
## chas-dis.lmg         -0.0272         -0.0385  0.0137
## chas-rad.lmg         -0.0129         -0.0449  0.0201
## chas-tax.lmg         -0.0402     *   -0.0781 -0.0008
## chas-ptratio.lmg     -0.0708     *   -0.1187 -0.0260
## chas-rm:lstat.lmg    -0.0694     *   -0.1021 -0.0338
## nox-rm.lmg           -0.1909     *   -0.2781 -0.1378
## nox-lstat.lmg        -0.2683     *   -0.3298 -0.2047
## nox-dis.lmg           0.0111         -0.0010  0.0275
## nox-rad.lmg           0.0254     *    0.0048  0.0288
## nox-tax.lmg          -0.0019         -0.0283  0.0061
## nox-ptratio.lmg      -0.0325     *   -0.0689 -0.0130
## nox-rm:lstat.lmg     -0.0312     *   -0.0525 -0.0008
## rm-lstat.lmg         -0.0774         -0.1920  0.0488
## rm-dis.lmg            0.2019     *    0.1368  0.2922
## rm-rad.lmg            0.2163     *    0.1599  0.3069
## rm-tax.lmg            0.1890     *    0.1310  0.2806
## rm-ptratio.lmg        0.1584     *    0.1065  0.2525
## rm-rm:lstat.lmg       0.1597     *    0.1047  0.2605
## lstat-dis.lmg         0.2793     *    0.2253  0.3288
## lstat-rad.lmg         0.2937     *    0.2231  0.3519
## lstat-tax.lmg         0.2664     *    0.1899  0.3229
## lstat-ptratio.lmg     0.2358     *    0.1493  0.2985
## lstat-rm:lstat.lmg    0.2371     *    0.1658  0.3018
## dis-rad.lmg           0.0143         -0.0088  0.0231
## dis-tax.lmg          -0.0129     *   -0.0419 -0.0043
## dis-ptratio.lmg      -0.0436     *   -0.0825 -0.0303
## dis-rm:lstat.lmg     -0.0422     *   -0.0660 -0.0071
## rad-tax.lmg          -0.0272     *   -0.0331 -0.0205
## rad-ptratio.lmg      -0.0579     *   -0.0737 -0.0367
## rad-rm:lstat.lmg     -0.0565     *   -0.0714 -0.0240
## tax-ptratio.lmg      -0.0306     *   -0.0462 -0.0123
## tax-rm:lstat.lmg     -0.0293         -0.0453  0.0050
## ptratio-rm:lstat.lmg  0.0014         -0.0253  0.0346
## 
## * indicates that CI for difference does not include 0. 
## CAUTION: Bootstrap confidence intervals can be somewhat liberal.
plot(booteval.relimp(boot,sort=TRUE)) # plot result
## Warning in norm.inter(t, alpha): extreme order statistics used as endpoints

Practical note: the number of bootstrap replicates b should be much larger (e.g., 500–2000) for stable inference; here b = 10 is used only for demonstration, which is why the “extreme order statistics used as endpoints” warning appears for the percentile intervals.
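
For a real analysis, the same calls can simply be rerun with a larger b. A sketch (assuming the fitted model object from earlier, here called model_trasf_term, and the relaimpo package attached):

boot_big <- boot.relimp(model_trasf_term, b = 1000, type = "lmg", rela = TRUE)
booteval.relimp(boot_big, bty = "perc", level = 0.95)
# plot(booteval.relimp(boot_big, sort = TRUE))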


6.3.24 Model prediction

Prediction is where the model becomes operational. We generate:

  • prediction intervals (interval="prediction"; R also partial-matches the abbreviated "predict" used in the code below) for individual outcomes,
  • confidence intervals (interval="confidence") for mean responses.

First, create a predictor dataset.

library(dplyr)
data_pred <-   dplyr::select(data2 , zn , chas , nox , rm , dis , rad , tax , 
    ptratio , lstat)
data_pred[1:10,]
zn chas nox rm dis rad tax ptratio lstat
1 0.2845483 -0.2723291 -0.1440749 0.4132629 0.2785465 -0.9818712 -0.6659492 -1.4575580 -1.0744990
2 -0.4872402 -0.2723291 -0.7395304 0.1940824 0.6786919 -0.8670245 -0.9863534 -0.3027945 -0.4919525
4 -0.4872402 -0.2723291 -0.8344581 1.0152978 1.1314532 -0.7521778 -1.1050216 0.1129203 -1.3601708
5 -0.4872402 -0.2723291 -0.8344581 1.2273620 1.1314532 -0.7521778 -1.1050216 0.1129203 -1.0254866
6 -0.4872402 -0.2723291 -0.8344581 0.2068916 1.1314532 -0.7521778 -1.1050216 0.1129203 -1.0422909
7 0.0487240 -0.2723291 -0.2648919 -0.3880270 0.9295961 -0.5224844 -0.5769480 -1.5037485 -0.0312367
8 0.0487240 -0.2723291 -0.2648919 -0.1603069 1.0872565 -0.5224844 -0.5769480 -1.5037485 0.9097999
9 0.0487240 -0.2723291 -0.2648919 -0.9302853 1.1392843 -0.5224844 -0.5769480 -1.5037485 2.4193794
10 0.0487240 -0.2723291 -0.2648919 -0.3994130 1.3357787 -0.5224844 -0.5769480 -1.5037485 0.6227277
11 0.0487240 -0.2723291 -0.2648919 0.1314594 1.2422167 -0.5224844 -0.5769480 -1.5037485 1.0918456

6.3.24.1 Prediction interval (individual prediction)

Prediction intervals are wider than confidence intervals because they account for both the uncertainty in the estimated mean and the residual variability of individual observations.

predy <- predict(model_trasf_term, data_pred[1:10,], interval="predict")
predy
fit lwr upr
1 30.258620 21.8534712 38.66377
2 24.415330 16.0263022 32.80436
4 31.759215 23.3073401 40.21109
5 29.920476 21.4789966 38.36196
6 26.441050 18.0254500 34.85665
7 20.820462 12.3963513 29.24457
8 15.455998 6.9807773 23.93122
9 8.991037 0.4048231 17.57725
10 16.144719 7.6745761 24.61486
11 13.847673 5.3305253 22.36482

6.3.24.2 Confidence interval (mean prediction)

Confidence intervals for the mean response are narrower than prediction intervals.

confy <- predict(model_trasf_term, data_pred[1:10,], interval="confidence")
confy
fit lwr upr
1 30.258620 29.329961 31.18728
2 24.415330 23.646134 25.18452
4 31.759215 30.474666 33.04376
5 29.920476 28.706202 31.13475
6 26.441050 25.422125 27.45998
7 20.820462 19.733484 21.90744
8 15.455998 14.025877 16.88612
9 8.991037 7.006358 10.97572
10 16.144719 14.745003 17.54443
11 13.847673 12.187044 15.50830

6.3.24.3 CI width

A quick way to compute interval width is upper minus lower bound.

confy %*% c(0, -1, 1)
1 1.857318
2 1.538391
4 2.569097
5 2.428550
6 2.037850
7 2.173956
8 2.860241
9 3.969358
10 2.799431
11 3.321259
predy %*% c(0, -1, 1)
1 16.81030
2 16.77805
4 16.90375
5 16.88296
6 16.83120
7 16.84822
8 16.95044
9 17.17243
10 16.94029
11 17.03430

6.3.25 Compare predictions vs actual values

A scatterplot of observed vs predicted values provides a quick sense of calibration. If the model is well-calibrated, points tend to align along the diagonal; the fitted line provides a rough check.

plot(data2$medv,predict(model_trasf_term) )
fit <- lm(predict(model_trasf_term)~data2$medv)
abline(fit)
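
A more direct calibration reference is the identity line: a well-calibrated model scatters points around y = x rather than around an arbitrary fitted line. A small variant of the plot above (same objects):

plot(data2$medv, predict(model_trasf_term),
     xlab = "Observed medv", ylab = "Predicted medv")
abline(0, 1, lty = 2)  # identity line: perfect-calibration reference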


6.3.26 Manual computation: \(\hat{y}\) and confidence interval

This section demonstrates how the regression prediction and CI formulas relate to matrix algebra. It is useful for understanding what predict() does internally.

We fit a smaller model and compute \(\hat{y}\) using \(X\hat{\beta}\).

data_ci <-  dplyr::select(data2, zn ,chas ,tax , medv)
model_ci <- lm(formula = medv ~ zn + chas +tax , data = data_ci)
summary(model_ci)
## 
## Call:
## lm(formula = medv ~ zn + chas + tax, data = data_ci)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.182  -4.715  -1.305   2.345  34.377 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.3073     0.3699  60.307  < 2e-16 ***
## zn            2.1080     0.3919   5.379 1.27e-07 ***
## chas          1.4174     0.3599   3.939 9.66e-05 ***
## tax          -3.4463     0.3985  -8.649  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.46 on 403 degrees of freedom
## Multiple R-squared:  0.2907, Adjusted R-squared:  0.2855 
## F-statistic: 55.07 on 3 and 403 DF,  p-value: < 2.2e-16

Next, compute \(\hat{y}\) manually as \(X\hat{\beta}\) and compare it with the output of predict() and the observed \(y\).

XCI <- data.frame(intercept=1, data_ci[,1:3])
comp_y <- as.matrix(XCI)%*%as.numeric(model_ci$coefficients)

head(cbind(comp_y,predict(model_ci, XCI), data_ci[,4]))
1 24.81617 24.81617 24.0
2 24.29345 24.29345 21.6
4 24.70242 24.70242 33.4
5 24.70242 24.70242 36.2
6 24.70242 24.70242 28.7
7 24.01233 24.01233 22.9

Then compute the confidence interval for the first observation manually.

library(matlib)
## Warning: package 'matlib' was built under R version 4.4.3
## 
## Attaching package: 'matlib'
## The following object is masked from 'package:psych':
## 
##     tr
var.yhat <- sigma(model_ci)^2 * as.matrix(XCI[1, ]) %*% inv(t(as.matrix(XCI)) %*% as.matrix(XCI)) %*% t(as.matrix(XCI[1, ]))
# var.yhat
cbind(
(predict(model_ci, XCI[1 ,])-1.96 * sqrt(var.yhat)),
(predict(model_ci, XCI[1 ,]) ),
(predict(model_ci, XCI[1 ,])+1.96 * sqrt(var.yhat))
)
1 1
23.91998 24.81617 25.71237
predict(model_ci, XCI[1 ,], interval="confidence")
fit lwr upr
24.81617 23.9173 25.71505

\[ \mathrm{Var}(\hat{Y}_0)=\sigma^2\mathbf{X_0\left( X'X\right)^{-1}X_0'} \]

This matrix form highlights the concept of leverage: when \(X_0\) is far from the center of the predictor space, uncertainty increases.
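
As an illustrative check (using the XCI matrix built above and base R’s solve() in place of matlib::inv()), the leverage-type term for a single point can be computed directly:

X  <- as.matrix(XCI)                       # design matrix with intercept column
x0 <- X[1, , drop = FALSE]                 # first observation as a row vector
h0 <- x0 %*% solve(t(X) %*% X) %*% t(x0)   # leverage-type term for observation 1
sigma(model_ci) * sqrt(h0)                 # standard error of the mean prediction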


6.3.27 External data validation

This is a simple demonstration of external-style validation: evaluate performance on the held-out test set created earlier.

We compute common predictive metrics:

  • \(R^2\) (proportion of variance explained),
  • RMSE (penalizes larger errors),
  • MAE (more robust to outliers).

library(caret)
testdata2_pred <-   dplyr::select(testdata2, zn , chas , nox , rm , dis , rad , tax , 
    ptratio , lstat)

R_sq <- R2(testdata2$medv,predict(model_trasf_term,testdata2_pred))
RMSE <- RMSE(testdata2$medv,predict(model_trasf_term,testdata2_pred))
MAE <- MAE(testdata2$medv,predict(model_trasf_term,testdata2_pred))
print(c(R_sq, RMSE, MAE))
## [1] 0.8465718 4.1960036 3.4352611

In applied practice, the held-out evaluation is often the most trusted summary of model usefulness, especially when many model variants were tried.


6.4 Variable selection

Variable selection is a large topic. In practice, you should align the selection method with your goal:

  • inference (interpretability, prespecified covariates, stability),
  • prediction (regularization, cross-validation, performance focus).

see here


6.4.1 Chapter takeaways

This multiple regression practice section demonstrates a complete applied modeling workflow:

  • data understanding (describe/summary/Table 1),
  • quality checks (missingness),
  • EDA (histograms, correlations, scatterplots),
  • transformations (stabilize modeling assumptions),
  • model building (baseline, stepwise, polynomial, interaction, robust),
  • diagnostics (assumption checks),
  • comparison (AIC, CV),
  • interpretation (coefficients, CI, relative importance),
  • prediction and validation (intervals, test-set metrics).

The same structure generalizes naturally to clinical and biomedical regression problems—only the outcome type and modeling family may change.

6.5 Linear mixed model theory

Linear mixed models (LMMs) extend ordinary linear regression to handle correlated data. In practice, correlation arises when observations are clustered or repeated, for example:

  • patients nested within hospitals,
  • students nested within schools,
  • repeated lab measurements within a subject,
  • longitudinal follow-up over time.

If we ignore this correlation and use ordinary regression, standard errors are often too small, confidence intervals become too narrow, and p-values may look more “significant” than they should. LMMs address this by modeling two sources of variation:

  1. fixed effects: population-average effects you want to estimate (e.g., treatment, age, sex)
  2. random effects: cluster- or subject-specific deviations that induce correlation (e.g., hospital-specific baseline stress)

6.5.1 Matrix format

The mixed model is commonly written as:

\[ \mathbf{y} = \boldsymbol{X\beta} + \boldsymbol{Zu} + \boldsymbol{\varepsilon} \]

Interpretation of each component:

  • \(\mathbf{y}\): \(N \times 1\) outcome vector
  • \(\mathbf{X}\): \(N \times r\) design matrix for fixed effects
  • \(\boldsymbol{\beta}\): \(r \times 1\) coefficient vector for fixed effects
  • \(\mathbf{Z}\): \(N \times m\) design matrix for random effects
  • \(\boldsymbol{u}\): \(m \times 1\) random effect vector
  • \(\boldsymbol{\varepsilon}\): \(N \times 1\) residual vector

So, \(\mathbf{X\beta}\) describes the systematic (population-average) part of the mean, while \(\mathbf{Zu}\) describes cluster-specific departures from that mean.

The dimension bookkeeping is worth emphasizing because it prevents many implementation errors:

\[ \overbrace{\mathbf{y}}^{N \times 1}= \overbrace{\underbrace{\mathbf{X}}_{N \times r}\underbrace{\boldsymbol{\beta}}_{r \times 1}}^{N \times 1} + \overbrace{\underbrace{\mathbf{Z}}_{N \times m}\underbrace{\boldsymbol{u}}_{m \times 1}}^{N \times 1} + \overbrace{\boldsymbol{\varepsilon}}^{N \times 1}. \]


6.5.2 Why random effects create correlation

A key conceptual point: observations within the same cluster share the same random effect(s).
That shared term makes them correlated.

For example, with a random intercept by hospital, \[ Y_{ij} = \beta_0 + u_{0j} + \beta_1 X_{ij} + \varepsilon_{ij}, \] two nurses \(i\) and \(i'\) in the same hospital \(j\) both include \(u_{0j}\), so they are correlated even if their residual errors \(\varepsilon\) are independent.
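
Concretely, with \(\mathrm{Var}(u_{0j})=\sigma_u^2\) and \(\mathrm{Var}(\varepsilon_{ij})=\sigma^2\), the shared intercept induces
\[ \mathrm{Cov}(Y_{ij}, Y_{i'j}) = \sigma_u^2, \qquad \mathrm{Corr}(Y_{ij}, Y_{i'j}) = \frac{\sigma_u^2}{\sigma_u^2+\sigma^2}, \]
which is exactly the intraclass correlation discussed later in this chapter.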


6.5.3 Example: reducing the work stress of nurses

Suppose you measure work stress from \(N=1000\) nurses across 25 hospitals. Let:

  • outcome \(y\): stress score
  • fixed effects \(X\): age, gender, experience, ward type, intervention
  • random effect: hospital-level random intercept only

The model is: \[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{u} + \boldsymbol{\varepsilon}, \] with dimensions: \[ \overbrace{\mathbf{y}}^{1000 \times 1} = \overbrace{\underbrace{\mathbf{X}}_{1000 \times r}\underbrace{\boldsymbol{\beta}}_{r \times 1}}^{1000 \times 1} + \overbrace{\underbrace{\mathbf{Z}}_{1000 \times 25}\underbrace{\boldsymbol{u}}_{25 \times 1}}^{1000 \times 1} + \overbrace{\boldsymbol{\varepsilon}}^{1000 \times 1}. \]

A practical interpretation of the random intercept:

  • each hospital has its own baseline stress level,
  • fixed effects estimate population-average effects after accounting for hospital differences.


6.5.4 The dependent variable \(\mathbf{y}\)

The vector representation emphasizes that outcomes are stacked:

\[ \mathbf{y} = \left[ \begin{array}{c} y_1\\ y_2\\ \vdots\\ y_{1000} \end{array} \right]. \]

In practice, the ordering can be any consistent ordering (by hospital then nurse, or by nurse id then time), as long as \(\mathbf{X}\) and \(\mathbf{Z}\) align row-by-row.


6.5.5 Fixed effects design matrix \(\mathbf{X}\)

\(\mathbf{X}\) contains covariates used to estimate the population-average mean function.
An intercept column is usually included.

Concretely, \(\mathbf{X}\) is a structured layout of predictor values with one row per observation, aligned with \(\mathbf{y}\).

Two practical reminders:

  • Categorical variables (gender, ward type, intervention) are represented through indicator columns (dummy coding) once you fit the model in software.
  • Centering/scaling continuous variables (age, experience) is often helpful, especially when random slopes or interactions are present.

6.5.6 Fixed effect coefficients \(\boldsymbol{\hat{\beta}}\)

\(\boldsymbol{\beta}\) answers questions like:

  • “What is the average difference in stress between intervention and control, after adjusting for covariates and hospital clustering?”
  • “How does stress change with age, on average, after controlling for hospital differences?”

In LMMs, fixed effect interpretation is typically conditional on the random effects structure being correct, but the coefficients represent population-average contrasts in the linear predictor.


6.5.6.1 How \(\hat{\beta}\) is estimated (ML and REML)

A clean way to explain estimation is to separate:

  • the mean model: \(\mathbf{X}\beta\),
  • the covariance model: \(\mathbf{V} = \mathbf{ZGZ'} + \mathbf{R}\).

Once \(\mathbf{V}\) is known, the generalized least squares estimator for \(\beta\) is:

\[ \hat{\beta}(\gamma)=\left(\mathbf{X}'\mathbf{V}(\gamma)^{-1}\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{V}(\gamma)^{-1}\mathbf{y}, \] where \(\gamma\) represents covariance parameters (e.g., \(\sigma^2\), \(\sigma_u^2\), correlations).

This highlights the logic:

  • ordinary least squares weights by \((X'X)^{-1}\),
  • mixed models weight by \((X'V^{-1}X)^{-1}\) because observations are correlated.
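
A toy numerical sketch of the GLS formula, assuming \(\mathbf{V}\) is known (all names here are illustrative):

set.seed(1)
n <- 20
X <- cbind(1, rnorm(n))                   # intercept + one predictor
V <- diag(1.5, n)                         # residual variance 1.5 everywhere
V[1:10, 1:10] <- V[1:10, 1:10] + 0.5      # first 10 obs share a cluster
diag(V) <- 1.5                           # keep the diagonal at 1.5 (compound symmetry block)
y <- X %*% c(2, 1) + t(chol(V)) %*% rnorm(n)   # correlated outcome
Vinv <- solve(V)
beta_gls <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y)  # GLS estimate
beta_gls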

ML vs REML (intuition):

  • ML estimates \(\beta\) and the variance parameters together by maximizing the likelihood of \(\mathbf{y}\).
  • REML maximizes the likelihood of linear combinations of \(\mathbf{y}\) that remove the fixed effects, reducing small-sample bias in variance components.

The standard computational idea is a “three-step” profile approach:

  1. write \(\hat{\beta}\) as a function of the variance parameters,
  2. optimize the (restricted) likelihood over the variance parameters,
  3. plug the estimates back in to obtain \(\hat{\beta}\).


6.5.7 Variance–covariance matrix of \(\hat{\beta}\)

Once variance parameters are estimated, the approximate sampling distribution is:

\[ \hat{\beta} \sim \mathcal{N}\left(\beta,\; \mathrm{Var}(\hat{\beta})\right), \quad \mathrm{Var}(\hat{\beta})=\left(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right)^{-1}. \]

This matrix is the engine behind:

  • standard errors,
  • Wald tests,
  • confidence intervals for fixed effects.

A practical note: in many software outputs, the reported degrees of freedom for fixed effects tests are not simply \(N-r\), because correlation and random effects reduce effective information. Approximations like Satterthwaite/Kenward–Roger are widely used.


6.5.8 Random effects design matrix \(\mathbf{Z}\)

For a random intercept by hospital, \(\mathbf{Z}\) has a simple structure:

  • each row has a “1” in the column corresponding to that nurse’s hospital,
  • zeros elsewhere.

This structure ensures every observation in hospital \(j\) shares the same \(u_j\).

If you add random slopes (say, random slope of time within hospital), \(\mathbf{Z}\) gets additional columns corresponding to those slope terms, and \(\mathbf{G}\) becomes a block structure rather than purely diagonal.


6.5.9 Random effects \(\boldsymbol{u}\) and BLUP intuition

We usually assume: \[ \boldsymbol{u}\sim \mathcal{N}(\mathbf{0},\mathbf{G}),\qquad \boldsymbol{\varepsilon}\sim \mathcal{N}(\mathbf{0},\mathbf{R}), \] and independence between \(\boldsymbol{u}\) and \(\boldsymbol{\varepsilon}\).

The conditional mean of \(u\) given the data leads to the familiar “BLUP” form:

\[ \hat{\mathbf{u}}=\mathbf{G}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{y}-\mathbf{X}\hat{\beta}). \]

Interpretation:

  • if a hospital’s observed mean stress is above the overall mean, \(\hat{u}_j\) tends to be positive,
  • the estimate is shrunk toward zero depending on \(\sigma^2\) and \(\sigma_u^2\),
  • shrinkage is stronger when within-hospital noise is large or the hospital sample size is small.

This “partial pooling” is a major advantage of mixed models.
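
In the balanced random-intercept case this shrinkage has a well-known closed form:
\[ \hat{u}_j = \frac{n_j\,\sigma_u^2}{n_j\,\sigma_u^2 + \sigma^2}\left(\bar{y}_j - \bar{\mathbf{x}}_j'\hat{\beta}\right), \]
where \(n_j\) is the cluster size: the shrinkage factor approaches 1 for large, well-measured clusters and 0 when residual noise dominates.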


6.5.10 Covariance structures: \(\mathbf{G}\), \(\mathbf{R}\), and \(\mathbf{V}\)

The total covariance of \(\mathbf{y}\) is:

\[ \mathbf{V}=\mathrm{Var}(\mathbf{y})=\mathbf{Z}\mathbf{G}\mathbf{Z}'+\mathbf{R}. \]

Random intercept only:

  • \(\mathbf{G}\) is often \(\sigma_u^2\mathbf{I}\) (simple diagonal),
  • \(\mathbf{R}\) is often \(\sigma^2\mathbf{I}\) (i.i.d. residuals).

But LMMs become most powerful when you allow more realistic \(\mathbf{R}\) structures for repeated measures:

  • compound symmetry (equal correlations),
  • AR(1) (correlation decays with lag),
  • unstructured (max flexibility, parameter-heavy),
  • spatial correlation (depends on distance/time).

A useful applied rule is: start simple, then add complexity only when diagnostics or design justify it.


6.5.11 Estimating variance parameters

In practice, \(\sigma^2\) and variance components like \(\sigma_u^2\) are estimated by optimizing ML or REML criteria. In special balanced designs, closed-form relationships connect variance parameters to mean squares; in general, software relies on numerical optimization.

A simple intuition:

  • \(\sigma_u^2\) measures between-cluster variability,
  • \(\sigma^2\) measures within-cluster variability.

Their ratio is tightly related to the intraclass correlation (ICC) in random-intercept models: \[ \mathrm{ICC}=\frac{\sigma_u^2}{\sigma_u^2+\sigma^2}. \] (Useful for interpreting how strongly clustered the data are.)
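
With an lme4 fit, the ICC can be read off the estimated variance components. A sketch using the nurse data introduced later (the variable names stress, age, experience, and hospital are assumed):

library(lme4)
fit <- lmer(stress ~ age + experience + (1 | hospital), data = nurse)
vc  <- as.data.frame(VarCorr(fit))          # variance components table
sigma_u2 <- vc$vcov[vc$grp == "hospital"]   # between-hospital variance
sigma2   <- vc$vcov[vc$grp == "Residual"]   # within-hospital (residual) variance
sigma_u2 / (sigma_u2 + sigma2)              # intraclass correlation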


6.5.12 Model statement and interpretation

The conditional distribution statement below is an important “model contract”:

\[ (\mathbf{y}\mid \beta,\; \boldsymbol{u}=u)\sim \mathcal{N}(\mathbf{X}\beta+\mathbf{Z}u,\mathbf{R}). \]

Meaning:

  • conditional on the random effects, observations follow a linear model with residual covariance \(\mathbf{R}\);
  • marginally (integrating out \(u\)), \(\mathbf{y}\sim \mathcal{N}(\mathbf{X}\beta,\mathbf{V})\).

This distinction matters because:

  • fixed effects often describe marginal mean trends,
  • random effects describe cluster-specific deviations and induce correlation.


6.5.13 Testing and model comparison

There are two common testing layers:

  1. Fixed effects tests: test linear hypotheses \(L'\beta=c\).
    Wald-type tests are common, but degrees-of-freedom approximations (Satterthwaite/KR) often improve accuracy in finite samples.

  2. Random effects / variance components tests: compare nested models via likelihood ratio tests.
    Boundary issues (variance ≥ 0) lead to mixture \(\chi^2\) null distributions in simple cases, which is why “half p-values” sometimes appear in random-effect tests.

For non-nested models, AIC/BIC provide pragmatic selection criteria: \[ \mathrm{AIC}=-2\log L+2p,\qquad \mathrm{BIC}=-2\log L+p\log(n). \]

Applied note:

  • AIC tends to favor predictive performance (smaller complexity penalty),
  • BIC tends to favor simpler models (penalty grows with \(n\)).
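
A minimal comparison sketch with lme4 (model and variable names are illustrative; ML fits are used because REML likelihoods are not comparable across models with different fixed effects):

m1 <- lmer(stress ~ age + (1 | hospital), data = nurse, REML = FALSE)
m2 <- lmer(stress ~ age + (age | hospital), data = nurse, REML = FALSE)
AIC(m1, m2)
BIC(m1, m2)
anova(m1, m2)   # likelihood ratio test; boundary caution for variance terms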


6.5.14 Diagnostics (what to actually check)

Diagnostics in mixed models should be done with the data structure in mind:

  • Residual vs fitted: check nonlinearity and heteroscedasticity
  • Normal Q-Q: for residuals and sometimes for random effects
  • Influence: clusters can be influential (a single hospital)
  • Random effects: do they look approximately normal and centered?
  • Correlation structure: does the chosen \(\mathbf{R}\) fit the repeated-measures pattern?

Standardized or studentized residual ideas carry over, but “leverage” and “influence” are more complex because of correlation and random effects.


6.5.15 Practical takeaway for readers

If you remember only one framework, remember this:

  1. Choose a mean model: \(\mathbf{X}\beta\) (what predictors affect the outcome?)
  2. Choose a correlation model: \(\mathbf{V}=\mathbf{ZGZ'}+\mathbf{R}\) (why are observations correlated?)
  3. Fit with ML/REML, then evaluate:
    • fixed effect interpretability,
    • variance components (how much clustering/repeated correlation exists),
    • diagnostics,
    • and model comparison (AIC/BIC/LRT/CV).

That workflow is what makes LMMs a practical tool for longitudinal and multilevel data.

6.6 Linear mixed model practice

This section walks through a practical progression from ordinary regression to mixed-effects modeling, using two datasets:

  1. lmm.data: a continuous outcome example (extro) with predictors (open, agree, social) and grouping variables (school, class).
  2. nurse: a multilevel stress dataset with hospital/ward clustering.

The overall learning objectives are:

  • Understand why we move from OLS/GLM to LMM (correlated observations within clusters).
  • Learn the difference between treating a grouping variable as a fixed effect vs a random effect.
  • Practice specifying nested random effects, random slopes, and comparing models.
  • Extract model components (fixed effects, random effects, fitted values, residuals).
  • Apply common mixed-model diagnostics and post-hoc summaries.
  • Use ICC to quantify clustering strength.

Throughout, we keep the R code intact and focus on interpretation, reasoning, and “what to look for”.


6.6.1 Loading data and library

We begin with lme4 for fitting mixed models using lmer(), and arm for convenient display output. The dataset is read from a URL and inspected.

Conceptually, when you see school and class in the same dataset, you should immediately ask:

  • Are observations within the same school/class likely correlated?
  • Is class nested within school, or can the same class label appear in multiple schools?

Those questions guide whether you use (1|school), (1|class), or nested terms like (1|school/class).
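One way to answer the second question is to count, for each class label, how many schools it appears in. A base-R sketch on toy data (the labels are deliberately reused across schools for illustration):

```r
# If a class label appears in more than one school, "class" alone does not
# identify a unique classroom; use school:class or (1 | school/class) instead.
toy <- expand.grid(student = 1:3,
                   class   = c("a", "b"),
                   school  = c("S1", "S2"))

schools_per_class <- tapply(toy$school, toy$class,
                            function(s) length(unique(s)))
schools_per_class           # both labels appear in 2 schools: labels are reused
any(schools_per_class > 1)  # TRUE -> treat class as nested within school
```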

# --- Linear Mixed Model (LMM) Data Simulation ---
set.seed(42)

# 1. Define Data Structure
n_schools <- 6
n_classes_per_school <- 4
n_students_per_class <- 50
total_n <- n_schools * n_classes_per_school * n_students_per_class # Total: 1200 observations

# 2. Generate Hierarchical IDs
# School level (Level 3)
school_ids <- rep(paste0("School_", 1:n_schools), 
                  each = n_classes_per_school * n_students_per_class)

# Class level (Level 2) - Nested within schools
class_ids  <- rep(rep(letters[1:n_classes_per_school], 
                      each = n_students_per_class), times = n_schools)

# Individual student level (Level 1)
student_ids <- 1:total_n

# 3. Generate Predictors (Fixed Effects)
# Simulating psychological assessment scores using Normal Distribution
open   <- rnorm(total_n, mean = 50, sd = 10) # Openness
agree  <- rnorm(total_n, mean = 50, sd = 10) # Agreeableness
social <- rnorm(total_n, mean = 50, sd = 10) # Social Engagement

# 4. Generate Random Effects
# Intercept deviations for schools and classes (Random Intercepts)
school_effects <- rep(rnorm(n_schools, mean = 0, sd = 5), 
                      each = n_classes_per_school * n_students_per_class)

class_effects  <- rep(rnorm(n_schools * n_classes_per_school, mean = 0, sd = 3), 
                      each = n_students_per_class)

# 5. Generate Dependent Variable (Outcome: Extroversion)
# Equation: extro = intercept + coefficients * predictors + random_effects + error
error <- rnorm(total_n, mean = 0, sd = 2) # Residual error

extro <- 30 + (0.5 * open) + (0.3 * agree) + (0.2 * social) + 
         school_effects + class_effects + error

# 6. Combine into Data Frame
lmm_data_sim <- data.frame(
  id = student_ids,
  extro = extro,
  open = open,
  agree = agree,
  social = social,
  class = class_ids,
  school = school_ids
)

# Preview the first few rows
head(lmm_data_sim)
##   id    extro     open    agree   social class   school
## 1  1 92.96908 63.70958 42.53484 63.27505     a School_1
## 2  2 81.63828 44.35302 50.36606 43.99164     a School_1
## 3  3 87.33240 53.63128 53.23310 50.56507     a School_1
## 4  4 84.91356 56.32863 53.79676 44.68924     a School_1
## 5  5 89.25167 54.04268 58.76557 49.19101     a School_1
## 6  6 90.21962 48.93875 59.33388 51.60756     a School_1
library(lme4) # load library
library(arm) # convenience functions for regression in R
## Warning: package 'arm' was built under R version 4.4.3
## 
## arm (Version 1.14-4, built: 2024-4-1)
## Working directory is C:/Users/hed2/Downloads/others/mybook2/mybook2
## 
## Attaching package: 'arm'
## The following object is masked from 'package:performance':
## 
##     display
## The following object is masked from 'package:boot':
## 
##     logit
## The following object is masked from 'package:car':
## 
##     logit
## The following object is masked from 'package:scales':
## 
##     rescale
## The following objects are masked from 'package:psych':
## 
##     logit, rescale, sim
lmm.data <- lmm_data_sim
  # read.table("http://jaredknowles.com/s/lmm.data.txt", 
  #                      header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE) 
#summary(lmm.data)
head(lmm.data)
##   id    extro     open    agree   social class   school
## 1  1 92.96908 63.70958 42.53484 63.27505     a School_1
## 2  2 81.63828 44.35302 50.36606 43.99164     a School_1
## 3  3 87.33240 53.63128 53.23310 50.56507     a School_1
## 4  4 84.91356 56.32863 53.79676 44.68924     a School_1
## 5  5 89.25167 54.04268 58.76557 49.19101     a School_1
## 6  6 90.21962 48.93875 59.33388 51.60756     a School_1

6.6.2 General linear regression (OLS)

This is the baseline model that assumes:

  • all observations are independent,
  • a single residual variance applies to all observations,
  • no clustering correlation exists.

Even if this model fits numerically, it can give misleading standard errors when data are clustered.

OLSexamp <- lm(extro ~ open + agree + social, data = lmm.data)
display(OLSexamp)
## lm(formula = extro ~ open + agree + social, data = lmm.data)
##             coef.est coef.se
## (Intercept) 30.80     1.07  
## open         0.51     0.01  
## agree        0.30     0.01  
## social       0.19     0.01  
## ---
## n = 1200, k = 4
## residual sd = 4.29, R-Squared = 0.68

What to look at here:
  • Which predictors appear associated with extro?
  • Do the effect sizes seem plausible?
  • This model provides a baseline for later comparison.


6.6.3 Generalized linear regression (GLM)

Here you fit glm() with the same formula. Since no family= argument is specified, it defaults to Gaussian, meaning it is essentially equivalent to lm() for continuous outcomes (up to minor implementation differences).

This step is mainly a bridge to show “we can also use the GLM framework,” but the independence assumption is still the same.
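The equivalence is easy to verify; a minimal check on the built-in mtcars data:

```r
# For a continuous outcome, glm() with the default gaussian family reproduces lm()
f_lm  <- lm(mpg ~ wt + hp, data = mtcars)
f_glm <- glm(mpg ~ wt + hp, data = mtcars)  # family = gaussian(link = "identity")

all.equal(coef(f_lm), coef(f_glm))  # TRUE: identical coefficient estimates
```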

MLexamp <- glm(extro ~ open + agree + social, data=lmm.data)
display(MLexamp)
## glm(formula = extro ~ open + agree + social, data = lmm.data)
##             coef.est coef.se
## (Intercept) 30.80     1.07  
## open         0.51     0.01  
## agree        0.30     0.01  
## social       0.19     0.01  
## ---
##   n = 1200, k = 4
##   residual deviance = 22048.4, null deviance = 67937.3 (difference = 45888.9)
##   overdispersion parameter = 18.4
##   residual sd is sqrt(overdispersion) = 4.29

6.6.4 Varying intercept by adding a stratum variable as fixed effect (GLM)

Now we include class as a fixed effect. This means:

  • each class gets its own intercept shift,
  • we estimate a separate parameter for each class (minus a reference level).

This approach can work well when the number of classes is small and you only care about those specific classes.
But when there are many classes, fixed effects can become parameter-heavy and reduce generalizability.

MLexamp.2 <- glm(extro ~ open + agree + social + class, data=lmm.data )
display(MLexamp.2)
## glm(formula = extro ~ open + agree + social + class, data = lmm.data)
##             coef.est coef.se
## (Intercept) 31.03     1.06  
## open         0.51     0.01  
## agree        0.29     0.01  
## social       0.19     0.01  
## classb       0.51     0.34  
## classc       0.38     0.34  
## classd      -2.13     0.34  
## ---
##   n = 1200, k = 7
##   residual deviance = 20687.5, null deviance = 67937.3 (difference = 47249.8)
##   overdispersion parameter = 17.3
##   residual sd is sqrt(overdispersion) = 4.16

6.6.5 Comparisons of models (fixed effect approach)

Comparing AIC between the model without and with class fixed effects gives a first signal: does adding class improve fit enough to justify the extra parameters?

The anova(..., test="F") here compares nested Gaussian models.

AIC(MLexamp)
## [1] 6908.556
AIC(MLexamp.2)
## [1] 6838.103
anova(MLexamp, MLexamp.2, test="F")
##   Resid. Df Resid. Dev Df Deviance        F Pr(>F)
## 1      1196   22048.42 NA       NA       NA     NA
## 2      1193   20687.51  3 1360.911 26.16018      0

Interpretation guide:
  • If adding class meaningfully improves fit, it suggests there is between-class variability (i.e., intercepts differ by class).
  • But fixed effects do not explicitly model correlation; they just soak up mean differences.


6.6.6 Adding class and school as fixed effects (GLM)

This expands the fixed-effect stratification by adding the interaction school:class, which is effectively a separate intercept for each school-by-class combination.

This can become even more parameter-heavy, but it mimics what a nested random intercept structure is trying to capture—except without shrinkage and without a variance-component interpretation.

MLexamp.4 <- glm(extro ~ open + agree + social + school:class, data=lmm.data )
display(MLexamp.4)
## glm(formula = extro ~ open + agree + social + school:class, data = lmm.data)
##                       coef.est coef.se
## (Intercept)            34.93     0.58 
## open                    0.50     0.01 
## agree                   0.29     0.01 
## social                  0.20     0.01 
## schoolSchool_1:classa   0.40     0.40 
## schoolSchool_2:classa  -3.03     0.40 
## schoolSchool_3:classa  -3.59     0.40 
## schoolSchool_4:classa -12.58     0.40 
## schoolSchool_5:classa  -7.94     0.40 
## schoolSchool_6:classa   5.06     0.40 
## schoolSchool_1:classb  -1.89     0.40 
## schoolSchool_2:classb  -1.05     0.40 
## schoolSchool_3:classb  -6.62     0.40 
## schoolSchool_4:classb  -4.28     0.40 
## schoolSchool_5:classb  -2.07     0.40 
## schoolSchool_6:classb  -2.63     0.40 
## schoolSchool_1:classc  -4.95     0.40 
## schoolSchool_2:classc  -6.09     0.40 
## schoolSchool_3:classc  -0.06     0.40 
## schoolSchool_4:classc  -6.35     0.40 
## schoolSchool_5:classc  -3.27     0.40 
## schoolSchool_6:classc   1.40     0.40 
## schoolSchool_1:classd  -8.65     0.40 
## schoolSchool_2:classd  -6.48     0.40 
## schoolSchool_3:classd  -7.07     0.40 
## schoolSchool_4:classd  -8.39     0.40 
## schoolSchool_5:classd  -3.75     0.40 
## ---
##   n = 1200, k = 27
##   residual deviance = 4777.8, null deviance = 67937.3 (difference = 63159.6)
##   overdispersion parameter = 4.1
##   residual sd is sqrt(overdispersion) = 2.02
AIC(MLexamp)
## [1] 6908.556
AIC(MLexamp.4)
## [1] 5119.435
anova(MLexamp, MLexamp.4, test="F")
##   Resid. Df Resid. Dev Df Deviance        F Pr(>F)
## 1      1196  22048.422 NA       NA       NA     NA
## 2      1173   4777.768 23 17270.65 184.3546      0

Practical warning:
  • Fixed-effect stratification can overfit when many groups exist, especially with small group sizes.
  • Mixed models typically handle this more gracefully via partial pooling.


6.6.7 Considering different slopes by stratum (GLM)

Here you fit separate regression models within each (school, class) stratum. This is a “fully stratified” approach:

  • each group gets its own intercept and its own slopes for open, agree, social.

This is useful for exploration: it shows whether slopes vary meaningfully across groups.
But it creates many separate models and makes it hard to summarize results.

require(plyr)

modellist <- dlply(lmm.data, .(school, class), function(x) 
                              glm(extro~ open + agree + social, data=x))
strat1 <- display(modellist[[1]])
## glm(formula = extro ~ open + agree + social, data = x)
##             coef.est coef.se
## (Intercept) 35.42     2.73  
## open         0.48     0.02  
## agree        0.29     0.03  
## social       0.21     0.04  
## ---
##   n = 50, k = 4
##   residual deviance = 174.7, null deviance = 2113.8 (difference = 1939.2)
##   overdispersion parameter = 3.8
##   residual sd is sqrt(overdispersion) = 1.95
strat2 <- display(modellist[[2]])
## glm(formula = extro ~ open + agree + social, data = x)
##             coef.est coef.se
## (Intercept) 33.40     2.32  
## open         0.46     0.03  
## agree        0.32     0.03  
## social       0.20     0.03  
## ---
##   n = 50, k = 4
##   residual deviance = 198.8, null deviance = 2211.3 (difference = 2012.6)
##   overdispersion parameter = 4.3
##   residual sd is sqrt(overdispersion) = 2.08
strat24 <- display(modellist[[24]])
## glm(formula = extro ~ open + agree + social, data = x)
##             coef.est coef.se
## (Intercept) 34.32     3.17  
## open         0.54     0.03  
## agree        0.30     0.04  
## social       0.15     0.03  
## ---
##   n = 50, k = 4
##   residual deviance = 220.2, null deviance = 1843.6 (difference = 1623.5)
##   overdispersion parameter = 4.8
##   residual sd is sqrt(overdispersion) = 2.19
AIC(MLexamp)
## [1] 6908.556
# AIC(strat1)
# AIC(strat2)
# AIC(strat24)

How should these stratum-specific models be combined?

From a modeling standpoint, this question motivates mixed models:

  • If intercepts vary across groups → random intercept.
  • If slopes vary across groups → random slope(s).
  • Mixed models combine evidence across groups through shrinkage, instead of fitting each group completely separately.

So “combining these models” typically means moving to a hierarchical model where group-to-group differences are random effects rather than separate fixed models.
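The shrinkage mechanics can be sketched with a toy calculation (the variance components and means below are assumed for illustration, not estimated from these data):

```r
# Partial pooling sketch for a random-intercept model: the BLUP of a group
# mean pulls the raw group mean toward the grand mean, more strongly when the
# group is small relative to the noise.
sigma2_u   <- 4    # assumed between-group variance
sigma2_e   <- 9    # assumed residual variance
grand_mean <- 50   # assumed overall mean
group_mean <- 58   # assumed raw mean of one hypothetical group

w    <- function(n) n * sigma2_u / (n * sigma2_u + sigma2_e)  # weight on the group's own data
blup <- function(n) grand_mean + w(n) * (group_mean - grand_mean)

blup(2)    # small group: pulled well below 58, toward 50
blup(100)  # large group: stays close to the raw mean 58
```

The same weighting is what `lmer()` applies implicitly, which is why group-level estimates from a mixed model are less variable than the fully stratified fits above.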


6.6.8 Varying intercept with LMM

Now we switch to a true mixed-effects model.
Adding (1|class) says: “each class has its own intercept deviation.”

This is a direct way to model within-class correlation: observations within the same class share the same random intercept.

MLexamp.6 <- lmer(extro ~ open + agree + social + (1|class), data=lmm.data)
display(MLexamp.6)
## lmer(formula = extro ~ open + agree + social + (1 | class), data = lmm.data)
##             coef.est coef.se
## (Intercept) 30.72     1.20  
## open         0.51     0.01  
## agree        0.29     0.01  
## social       0.19     0.01  
## 
## Error terms:
##  Groups   Name        Std.Dev.
##  class    (Intercept) 1.21    
##  Residual             4.16    
## ---
## number of obs: 1200, groups: class, 4
## AIC = 6870.4, DIC = 6817.7
## deviance = 6838.0

How to interpret:
  • Fixed effects (open, agree, social) are population-average.
  • The random intercept variance tells you how much extroversion differs across classes after adjusting for covariates.

6.6.8.1 Compare fixed-effect class vs random-effect class

You compare the AIC of:
  • MLexamp.2 (class fixed effects)
  • MLexamp.6 (class random intercept)

Lower AIC suggests better tradeoff of fit vs complexity.

AIC(MLexamp.2)
## [1] 6838.103
AIC(MLexamp.6)
## [1] 6870.354

6.6.9 Add class and school as random effects

This specifies two crossed random intercepts:

  • (1|school) allows schools to differ in baseline extro
  • (1|class) allows classes to differ as well

This crossed specification treats every class sharing the same label (a–d) as a single group across all schools. It is appropriate when class and school are genuinely crossed, or when class labels are coded uniquely so that (1|class) identifies distinct classrooms. Here the labels repeat across schools, so a nested specification (next section) matches the design more closely.

MLexamp.7 <- lmer(extro ~ open + agree + social + (1|school) + (1|class), data=lmm.data)
display(MLexamp.7)
## lmer(formula = extro ~ open + agree + social + (1 | school) + 
##     (1 | class), data = lmm.data)
##             coef.est coef.se
## (Intercept) 31.24     1.54  
## open         0.50     0.01  
## agree        0.29     0.01  
## social       0.19     0.01  
## 
## Error terms:
##  Groups   Name        Std.Dev.
##  school   (Intercept) 2.82    
##  class    (Intercept) 1.21    
##  Residual             3.27    
## ---
## number of obs: 1200, groups: school, 6; class, 4
## AIC = 6320.3, DIC = 6266
## deviance = 6286.2

6.6.10 Nested terms: school/class

This explicitly encodes nesting:

  • (1|school/class) expands to (1|school) + (1|school:class)

Meaning:
  • a school random intercept, plus
  • a class-within-school random intercept.

This is usually the right structure when each class belongs to only one school.

MLexamp.8 <- lmer(extro ~ open + agree + social + (1|school/class), data=lmm.data)
display(MLexamp.8)
## lmer(formula = extro ~ open + agree + social + (1 | school/class), 
##     data = lmm.data)
##             coef.est coef.se
## (Intercept) 31.02     1.26  
## open         0.50     0.01  
## agree        0.29     0.01  
## social       0.20     0.01  
## 
## Error terms:
##  Groups       Name        Std.Dev.
##  class:school (Intercept) 3.20    
##  school       (Intercept) 2.33    
##  Residual                 2.02    
## ---
## number of obs: 1200, groups: class:school, 24; school, 6
## AIC = 5247, DIC = 5186.3
## deviance = 5209.6
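A small base-R check of why the nesting matters when labels are reused (the structure mirrors the simulated data: 6 schools by 4 class labels):

```r
# With labels a-d reused in every school, "class" has only 4 levels, while the
# school:class combination distinguishes all 24 classrooms.
toy <- expand.grid(class  = letters[1:4],
                   school = paste0("School_", 1:6))

nlevels(factor(toy$class))                                # 4
nlevels(interaction(toy$school, toy$class, drop = TRUE))  # 24
```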

6.6.11 Varying slope with LMM

Random slopes allow the effect of a predictor (here open) to vary across clusters.

(1+open|school/class) expands to (1+open|school) + (1+open|school:class), meaning:
  • a random intercept and a random slope of open at both the school level and the school:class level,
  • plus, at each level, the intercept and slope variances and their covariance.

This answers a deeper scientific question: does “openness” relate to extroversion differently across different classes and schools?

MLexamp.9 <- lmer(extro ~ open + agree + social + (1+open|school/class), data=lmm.data)
## boundary (singular) fit: see help('isSingular')
display(MLexamp.9)
## lmer(formula = extro ~ open + agree + social + (1 + open | school/class), 
##     data = lmm.data)
##             coef.est coef.se
## (Intercept) 30.96     1.02  
## open         0.50     0.01  
## agree        0.29     0.01  
## social       0.20     0.01  
## 
## Error terms:
##  Groups       Name        Std.Dev. Corr  
##  class:school (Intercept) 3.31           
##               open        0.01     -1.00 
##  school       (Intercept) 1.42           
##               open        0.02     1.00  
##  Residual                 2.01           
## ---
## number of obs: 1200, groups: class:school, 24; school, 6
## AIC = 5246, DIC = 5178.7
## deviance = 5201.3

6.6.11.1 Compare AIC across candidate structures

This is a practical model selection exercise:

  • MLexamp.4 uses fixed school:class intercepts
  • MLexamp.7 uses random school and class intercepts (separate)
  • MLexamp.8 uses nested random intercepts
  • MLexamp.9 adds random slope of open
AIC(MLexamp.4) #using the interaction of class and school as fixed effect
## [1] 5119.435
AIC(MLexamp.7) #using class and school as random intercepts
## [1] 6320.338
AIC(MLexamp.8) #using nested term of class and school as random intercepts
## [1] 5246.96
AIC(MLexamp.9) #adding open as random slope based on MLexamp.8
## [1] 5245.997

Interpretation guide:
  • If random effects variances are near zero and AIC does not improve, random effects may not be needed.
  • If AIC improves substantially when adding nesting or random slopes, clustering and/or slope heterogeneity is likely important.
  • Here the singular-fit message and the ±1.00 intercept-slope correlations in MLexamp.9 indicate the random-slope variances are essentially zero, so the extra slope terms are not well supported despite the marginally lower AIC.


6.6.12 Summarize model 8

summary() provides:

  • fixed effects estimates and standard errors,
  • random effects variance components (school and class-within-school),
  • residual variance,
  • fit criteria.
summary(MLexamp.8)
## Linear mixed model fit by REML ['lmerMod']
## Formula: extro ~ open + agree + social + (1 | school/class)
##    Data: lmm.data
## 
## REML criterion at convergence: 5233
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -3.12970 -0.68636 -0.00039  0.67466  2.95243 
## 
## Random effects:
##  Groups       Name        Variance Std.Dev.
##  class:school (Intercept) 10.247   3.201   
##  school       (Intercept)  5.413   2.327   
##  Residual                  4.073   2.018   
## Number of obs: 1200, groups:  class:school, 24; school, 6
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept) 31.016666   1.259145   24.63
## open         0.495711   0.005878   84.33
## agree        0.293323   0.005936   49.41
## social       0.195343   0.005725   34.12
## 
## Correlation of Fixed Effects:
##        (Intr) open   agree 
## open   -0.233              
## agree  -0.228 -0.022       
## social -0.229  0.026 -0.015

Your note is exactly the applied takeaway:

If the variance components for random effects are extremely small relative to the total variance, then there is little clustering signal; LMM may not be necessary.
But be careful: “small” depends on outcome scale and study design—often it is better to quantify via ICC (later section).
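As a preview, the ICC can be computed by hand from the variance components printed in the summary above (values copied from the output):

```r
# ICC from the variance components in summary(MLexamp.8): the proportion of
# total variance attributable to each clustering level.
var_cs  <- 10.247  # class-within-school intercept variance
var_s   <- 5.413   # school intercept variance
var_res <- 4.073   # residual variance
total   <- var_cs + var_s + var_res

icc_school <- var_s / total             # two students, same school, different class
icc_class  <- (var_s + var_cs) / total  # two students, same school AND same class
round(c(icc_school = icc_school, icc_class = icc_class), 3)
# Roughly 0.27 of the variance sits at the school level, and about 0.79 is
# shared by students in the same class -> strong clustering in these data.
```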


6.6.13 Testing random effects (LRT logic)

Testing random effects is subtle because variance parameters lie on the boundary (they cannot be negative). This breaks standard chi-square asymptotics.

Your code demonstrates a conservative mixture-chi-square approach by hand.

# We have to build the test statistic by hand because standard chi-square
# asymptotics fail when variance parameters sit on the boundary
L_Model  <- logLik(MLexamp.8)[1]  # reduced model: random intercepts only
L_Model1 <- logLik(MLexamp.9)[1]  # full model: adds random slopes for open
LRT_statistic <- -2*(L_Model - L_Model1)  # reduced minus full: positive when the full model fits better
# Adding a random slope introduces a slope variance and an intercept-slope
# covariance, so a 50:50 mixture of chi-squared distributions with 1 and 2
# degrees of freedom is the usual reference; because MLexamp.9 adds slopes at
# two levels (school and school:class), this df count is only approximate here
0.5*pchisq(LRT_statistic,df=2,lower.tail = F)+0.5*pchisq(LRT_statistic,df=1,lower.tail = F)

Practical guidance:
  • Use LRTs for nested random effects cautiously.
  • In real applied work, AIC/BIC, model stability, and diagnostics are often used together rather than relying only on a p-value.
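The mixture p-value used above can be wrapped in a small helper; the degrees of freedom and the test statistic below are illustrative assumptions, not values from these models:

```r
# 50:50 mixture chi-square p-value for a boundary likelihood ratio test
p_mix <- function(lrt, df_small, df_large) {
  0.5 * pchisq(lrt, df_small, lower.tail = FALSE) +
  0.5 * pchisq(lrt, df_large, lower.tail = FALSE)
}

# Hypothetical statistic for illustration: the mixture p-value is smaller than
# the naive chi-square(2) p-value, i.e., the naive test is conservative.
p_mix(8.3, 1, 2)  # about 0.0099
```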


6.6.14 Visualizing grouping structure

A grouped line plot helps you “see” whether the relationship between open and extro differs by school:

  • parallel-ish lines suggest varying intercepts,
  • different slopes suggest random slopes may be needed.
library(ggplot2)
ggplot(lmm.data,aes(x= open,y=extro  ,group=school,color= school))+geom_line() 


6.6.15 Residual checks by grouping variables

These plots help detect whether certain schools or classes have systematically different residual behavior (e.g., variance differences, outliers, patterning).

plot(MLexamp.8, school~resid(.))

plot(MLexamp.8, class~resid(.))


6.6.16 Residuals vs explanatory variables

A common check for:
  • nonlinearity,
  • heteroscedasticity (variance changes with predictor level),
  • outliers.

plot(MLexamp.8, resid(.)~open)

# plot(MLexamp.8,resid(.,type="p")~fitted(.) | BMIGRP_fm002)

6.6.17 Normality checks

6.6.17.1 QQ plot of residuals

Residual normality is not always critical for point estimation, but it matters more for small-sample inference and extreme prediction intervals.

qqnorm(resid(MLexamp.8)  )

6.6.17.2 Random effect coefficients normality

Random effects are assumed normal in standard LMMs.
A QQ plot of estimated random intercepts (or slopes) checks whether this seems reasonable.

test <- ranef(MLexamp.8)$`class:school`  # random intercepts for class-within-school
qqnorm(test$`(Intercept)`)


6.6.18 Extracting elements (parameters)

These are the standard extraction tools:

  • fixef(): fixed effects \(\hat{\beta}\)
  • ranef(): random effects \(\hat{u}\)
  • coef(): conditional coefficients (fixed + random), i.e., group-specific coefficients
fixef(MLexamp.8) 
## (Intercept)        open       agree      social 
##  31.0166664   0.4957109   0.2933230   0.1953426
ranef(MLexamp.8) 
## $`class:school`
##            (Intercept)
## a:School_1   4.1886142
## a:School_2   1.0466568
## a:School_3   0.6059617
## a:School_4  -5.9229747
## a:School_5  -3.7634378
## a:School_6   5.6331936
## b:School_1   1.9100724
## b:School_2   3.0071916
## b:School_3  -2.4043278
## b:School_4   2.3149458
## b:School_5   2.0630662
## b:School_6  -2.0026469
## c:School_1  -1.1286244
## c:School_2  -1.9962792
## c:School_3   4.1081884
## c:School_4   0.2610122
## c:School_5   0.8674500
## c:School_6   1.9964598
## d:School_1  -4.7916596
## d:School_2  -2.3777254
## d:School_3  -2.8508902
## d:School_4  -1.7647766
## d:School_5   0.3896333
## d:School_6   0.6108966
## 
## $school
##          (Intercept)
## School_1  0.09424432
## School_2 -0.16912820
## School_3 -0.28582870
## School_4 -2.70039511
## School_5 -0.23417485
## School_6  3.29528254
## 
## with conditional variances for "class:school" "school"
coef(MLexamp.8)
## $`class:school`
##            (Intercept)      open    agree    social
## a:School_1    35.20528 0.4957109 0.293323 0.1953426
## a:School_2    32.06332 0.4957109 0.293323 0.1953426
## a:School_3    31.62263 0.4957109 0.293323 0.1953426
## a:School_4    25.09369 0.4957109 0.293323 0.1953426
## a:School_5    27.25323 0.4957109 0.293323 0.1953426
## a:School_6    36.64986 0.4957109 0.293323 0.1953426
## b:School_1    32.92674 0.4957109 0.293323 0.1953426
## b:School_2    34.02386 0.4957109 0.293323 0.1953426
## b:School_3    28.61234 0.4957109 0.293323 0.1953426
## b:School_4    33.33161 0.4957109 0.293323 0.1953426
## b:School_5    33.07973 0.4957109 0.293323 0.1953426
## b:School_6    29.01402 0.4957109 0.293323 0.1953426
## c:School_1    29.88804 0.4957109 0.293323 0.1953426
## c:School_2    29.02039 0.4957109 0.293323 0.1953426
## c:School_3    35.12485 0.4957109 0.293323 0.1953426
## c:School_4    31.27768 0.4957109 0.293323 0.1953426
## c:School_5    31.88412 0.4957109 0.293323 0.1953426
## c:School_6    33.01313 0.4957109 0.293323 0.1953426
## d:School_1    26.22501 0.4957109 0.293323 0.1953426
## d:School_2    28.63894 0.4957109 0.293323 0.1953426
## d:School_3    28.16578 0.4957109 0.293323 0.1953426
## d:School_4    29.25189 0.4957109 0.293323 0.1953426
## d:School_5    31.40630 0.4957109 0.293323 0.1953426
## d:School_6    31.62756 0.4957109 0.293323 0.1953426
## 
## $school
##          (Intercept)      open    agree    social
## School_1    31.11091 0.4957109 0.293323 0.1953426
## School_2    30.84754 0.4957109 0.293323 0.1953426
## School_3    30.73084 0.4957109 0.293323 0.1953426
## School_4    28.31627 0.4957109 0.293323 0.1953426
## School_5    30.78249 0.4957109 0.293323 0.1953426
## School_6    34.31195 0.4957109 0.293323 0.1953426
## 
## attr(,"class")
## [1] "coef.mer"

6.6.19 Fitted values and residuals

Fitted values (fitted) represent the model’s conditional mean, which includes group-level random effects.
Residuals represent the within-group deviations not explained by the model.

yhat <- fitted(MLexamp.8)
head(yhat)
##        1        2        3        4        5        6 
## 91.71782 80.65276 87.37714 87.73178 88.93546 87.04414
summary(yhat) 
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 59.11443 75.30169 80.02914 80.06942 84.89566 102.5105
residuals <- resid(MLexamp.8)
head(residuals)
##           1           2           3           4           5           6 
##  1.25126333  0.98551428 -0.04473698 -2.81822020  0.31620284  3.17547575
summary(residuals) 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -6.316344 -1.385218 -0.0007856         0    1.3616  5.958581
hist(residuals,50) 


6.6.20 Fitted lines by group

This plot visualizes model-implied trajectories by school.
Because fitted values in a random-intercept model contain partial pooling, you often see less variation than raw group means—this is expected and is one of the main benefits of mixed models.

lmm.data$Fitted=predict(MLexamp.8)
# subjects=unique(lmm.data$school) #Taking the identifiers and saving it
# sam="V" # Taking a random sample of 10 Subjects
 
ggplot(lmm.data, aes(x = open, y = Fitted, group = school, color = school)) + geom_line()


6.6.21 Model diagnostics (second dataset: Nurses)

This section shows how the same mixed model ideas apply in a more realistic hierarchical structure: hospital/ward.

6.6.21.1 Loading nurse data

You read an SPSS dataset from GitHub and fit a nested random intercept model.

Important conceptual point: if the outcome is truly ordinal, a Gaussian LMM is an approximation.
For ordinal outcomes, you would typically consider an ordinal mixed model (e.g., cumulative logit mixed model).
But as you note, this is for demonstration of diagnostics.

# --- Data Simulation ---
library(dplyr)

set.seed(2026)

# --- 1. Define Hierarchical Structure ---
n_hospitals <- 5           # Number of hospitals (Level 3)
n_wards_per_hosp <- 10     # Wards per hospital (Total 50 wards)
n_nurses_per_ward <- 15    # Nurses per ward (Total 750 nurses)
total_n <- n_hospitals * n_wards_per_hosp * n_nurses_per_ward

# --- 2. Generate Hierarchical IDs ---
hospital <- rep(1:n_hospitals, each = n_wards_per_hosp * n_nurses_per_ward)
ward     <- rep(1:(n_hospitals * n_wards_per_hosp), each = n_nurses_per_ward)
nurse_id <- 1:total_n

# --- 3. Generate Predictors (Fixed Effects) ---
# experien: Nurse work experience in years (ranging 1-25)
experien <- runif(total_n, 1, 25)

# wardtype: Type of ward (0 = General, 1 = Special/ICU) - Level 2 variable
# First generate types for each ward, then broadcast to individual nurses
unique_ward_types <- rbinom(n_hospitals * n_wards_per_hosp, 1, 0.4)
wardtype <- rep(unique_ward_types, each = n_nurses_per_ward)

# expcon: Experience control/interaction term (binary experimental condition)
expcon <- rbinom(total_n, 1, 0.5)

# --- 4. Define Random Effects (Intercept Deviations) ---
# Hospital-level random intercepts (u_h) - Variance at Level 3
u_h <- rep(rnorm(n_hospitals, 0, 0.8), each = n_wards_per_hosp * n_nurses_per_ward)

# Ward-level random intercepts (u_w) - Variance at Level 2
u_w <- rep(rnorm(n_hospitals * n_wards_per_hosp, 0, 1.2), each = n_nurses_per_ward)

# --- 5. Generate Dependent Variable: stress ---
# Model Equation: 
# stress = intercept + 0.5*wardtype - 0.1*experien + 0.3*expcon + Random Effects + Residual Error
intercept <- 5.0
beta_wardtype <- 0.5
beta_experien <- -0.1
beta_expcon <- 0.3
error <- rnorm(total_n, 0, 1.0) # Residual error (Level 1 variance)

stress <- intercept + (beta_wardtype * wardtype) + (beta_experien * experien) + 
          (beta_expcon * expcon) + u_h + u_w + error

# --- 6. Assemble into Data Frame ---
nurses_sim_data <- data.frame(
  nurse_id = nurse_id,
  hospital = as.factor(hospital),
  ward = as.factor(ward),
  wardtype = as.factor(wardtype),
  experien = experien,
  expcon = expcon,
  stress = stress
)

# Display the first few rows of the simulated dataset
head(nurses_sim_data)
##   nurse_id hospital ward wardtype  experien expcon   stress
## 1        1        1    1        0 17.768163      1 5.001246
## 2        2        1    1        0 14.356732      0 7.377821
## 3        3        1    1        0  4.363359      1 6.298894
## 4        4        1    1        0  7.857359      1 5.904764
## 5        5        1    1        0 14.328856      0 4.593614
## 6        6        1    1        0  1.603148      1 3.892058
library(haven)
nurse <- nurses_sim_data
  # read_sav(file="https://github.com/MultiLevelAnalysis/Datasets-third-edition-Multilevel-book/blob/master/chapter%202/Nurses/SPSS/Nurses.sav?raw=true")
MLexamp.18 <- lmer(stress ~ experien+ wardtype+ expcon  + (1|hospital/ward), data=nurse)
summary(MLexamp.18)
## Linear mixed model fit by REML ['lmerMod']
## Formula: stress ~ experien + wardtype + expcon + (1 | hospital/ward)
##    Data: nurse
## 
## REML criterion at convergence: 2293.4
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -3.08705 -0.64245  0.01861  0.61302  2.58650 
## 
## Random effects:
##  Groups        Name        Variance Std.Dev.
##  ward:hospital (Intercept) 2.1225   1.4569  
##  hospital      (Intercept) 0.4072   0.6382  
##  Residual                  0.9680   0.9839  
## Number of obs: 750, groups:  ward:hospital, 50; hospital, 5
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept)  5.035277   0.396294  12.706
## experien    -0.111048   0.005457 -20.350
## wardtype1    0.426834   0.442263   0.965
## expcon       0.481870   0.074216   6.493
## 
## Correlation of Fixed Effects:
##           (Intr) expern wrdty1
## experien  -0.179              
## wardtype1 -0.402  0.000       
## expcon    -0.101  0.016  0.004

Note: in the original Nurses dataset, the stress outcome is ordinal; the Gaussian LMM above treats it as continuous and serves only to demonstrate diagnostics.


6.6.21.2 Residual normality: residual vs fitted, QQ plot

You compute residuals and inspect:
  • scale-location style behavior via \(\sqrt{|residual|}\) vs fitted values,
  • normality via a QQ plot.

lmerresid <- resid(MLexamp.18)
par(mfrow = c(1,2))
plot(sqrt(abs(lmerresid)) ~ predict(MLexamp.18))
qqnorm(lmerresid)
qqline(lmerresid)

Interpretation guide:
  • A strong trend in residual spread vs fitted values suggests heteroscedasticity.
  • Strong deviation from the QQ line suggests heavy tails or skewness.


6.6.21.3 Checking variance homogeneity

Residuals plotted against predictors can reveal:
  • differing variance across categories (wardtype, expcon),
  • nonlinear patterns (experien).

Your plotting uses color palettes for readability.

par(mfrow = c(2,2))
plot(lmerresid ~ fitted(MLexamp.18))
with(nurse, plot(lmerresid ~ experien , col = heat.colors(20)))
with(nurse, plot(lmerresid ~ wardtype , col = rainbow(3)))
with(nurse, plot(lmerresid ~ expcon , col = rainbow(5)))

# with(lmm.data, plot(lmerresid ~ class, col = rainbow(5)))

6.6.21.4 Influence points (Cook’s distance)

Influence in mixed models can be assessed using tools like influence.ME.
Cook’s distance highlights observations that disproportionately change fixed effects.

In clustered data, influential points can be:

  • a single extreme observation,
  • or an entire cluster with unusual behavior.

library(influence.ME)
## 
## Attaching package: 'influence.ME'
## The following object is masked from 'package:arm':
## 
##     se.fixef
## The following object is masked from 'package:stats':
## 
##     influence
lmer3.infl <- influence(MLexamp.18, obs = TRUE)
par(mfrow = c(1,1))
plot(cooks.distance(lmer3.infl), type = "p", pch = 20)


6.6.21.5 Random effect normality

A QQ plot of the estimated random intercepts checks whether the random effects distribution is approximately normal.

par(mfrow = c(1,3))
qqnorm(ranef(MLexamp.18)$`ward:hospital`$`(Intercept)`)

(If the QQ plot shows strong asymmetry or heavy tails, consider whether the random effects distribution assumption is reasonable, or whether model specification is missing covariates.)


6.6.21.6 Post hoc analysis

confint() provides confidence intervals for variance parameters and fixed effects.
Requesting parm = 1:4 selects the first four parameters: the two random-intercept standard deviations, the residual standard deviation (sigma), and the fixed intercept.

confint(MLexamp.18, parm = 1:4, oldNames = FALSE)
## Computing profile confidence intervals ...
##                                  2.5 %   97.5 %
## sd_(Intercept)|ward:hospital 1.1784347 1.808224
## sd_(Intercept)|hospital      0.0000000 1.467998
## sigma                        0.9331689 1.036270
## (Intercept)                  4.2180132 5.848215
6.6.21.6.1 Estimated marginal means (lsmeans / emmeans)

emmeans produces adjusted means (marginal means) for categorical predictors, averaged over other covariates.

The interaction plot helps interpret whether the effect of expcon differs by wardtype (effect modification).

mylsmeans <- emmeans::emmeans(MLexamp.18, c("wardtype", "expcon" ))
sum_mylsmeans <- summary(mylsmeans)
with(sum_mylsmeans, interaction.plot(wardtype , expcon ,emmean, col = 2:4))

6.6.21.6.2 Pairwise comparisons

Pairwise differences provide estimated contrasts; plotting SE and estimates is a quick way to see magnitude and uncertainty.

sum_difflsmeans <- summary(pairs(mylsmeans))
head(sum_difflsmeans)
##  contrast                                estimate        SE        df    t.ratio   p.value
##  wardtype0 expcon0 - wardtype1 expcon0 -0.4268335 0.4461313  45.32772 -0.9567443 0.7743988
##  wardtype0 expcon0 - wardtype0 expcon1 -0.4818697 0.0742242 700.55632 -6.4920809 0.0000000
##  wardtype0 expcon0 - wardtype1 expcon1 -0.9087032 0.4525239  48.02449 -2.0080779 0.1993094
##  wardtype1 expcon0 - wardtype0 expcon1 -0.0550361 0.4520031  47.79373 -0.1217605 0.9993454
##  wardtype1 expcon0 - wardtype1 expcon1 -0.4818697 0.0742242 700.55632 -6.4920809 0.0000000
##  wardtype0 expcon1 - wardtype1 expcon1 -0.4268335 0.4461313  45.32772 -0.9567443 0.7743988
barplot(sum_difflsmeans$SE)

barplot(sum_difflsmeans$estimate)

6.6.21.6.3 One-stop diagnostic plot

performance::check_model() is a convenient diagnostic suite for mixed models (residuals, QQ, collinearity, etc.).

performance::check_model(MLexamp.18)

6.6.22 Intra class correlation (ICC)

ICC quantifies the proportion of outcome variance attributable to clustering.
It is often the fastest way to decide whether mixed modeling is warranted.

Your code fits the null model and prints the summary, from which you can compute ICC.

 lmm.null <- lmer(extro ~ 1 + (1|school), data = lmm.data)
 summary(lmm.null) 
## Linear mixed model fit by REML ['lmerMod']
## Formula: extro ~ 1 + (1 | school)
##    Data: lmm.data
## 
## REML criterion at convergence: 8100.8
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.75973 -0.65491 -0.00704  0.64908  2.99247 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  school   (Intercept)  8.855   2.976   
##  Residual             49.276   7.020   
## Number of obs: 1200, groups:  school, 6
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   80.069      1.232   65.01

Your interpretation statement is exactly the right reading: a large ICC means that group membership (school) accounts for a large portion of variance in extro, strongly supporting the need for random effects.
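Plugging the printed variance components into the ICC formula \(\mathrm{ICC} = \sigma_u^2/(\sigma_u^2+\sigma^2)\) makes this concrete. A minimal arithmetic sketch using the numbers copied from the summary above (for a fitted model, performance::icc() computes this directly):

```r
# ICC from the null-model variance components printed above
var_school   <- 8.855    # between-school (random intercept) variance
var_residual <- 49.276   # residual (within-school) variance
icc <- var_school / (var_school + var_residual)
icc  # ~0.15: about 15% of the variance in extro lies between schools
```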


6.6.22.1 Notice / practical reminders

  • Back-transforming expected values: if you transform the outcome, marginal means should be interpreted on the transformed scale unless explicitly back-transformed.
  • Box–Cox transformations can help stabilize variance and improve normality when residual diagnostics show strong deviations.
  • Mixed model choice should balance:
    • scientific interpretability,
    • diagnostics,
    • parsimony (AIC/BIC),
    • and validation where possible.

This section now gives a complete “from OLS to LMM” practical pipeline: build, compare, diagnose, interpret, and quantify clustering (ICC).

6.7 Linear mixed model covariance decomposition with random intercept — lme4

This section walks through the same workflow you used above, but with a stronger emphasis on (1) what each matrix represents, (2) where the covariance decomposition shows up, and (3) how the “manual” computations match lme4.

We focus on the random-intercept LMM:

\[ \mathbf{y} = \mathbf{X}\beta + \mathbf{Z}u + \varepsilon, \qquad u \sim N(0,\mathbf{G}), \qquad \varepsilon \sim N(0,\mathbf{R}) \]

with the implied marginal model

\[ \mathbf{y} \sim N(\mathbf{X}\beta,\ \mathbf{V}), \qquad \mathbf{V}=\mathbf{Z}\mathbf{G}\mathbf{Z}'+\mathbf{R}. \]

In lme4 (without nlme residual correlation structures), residuals are independent:

\[ \mathbf{R}=\sigma^2\mathbf{I}. \]

For a random intercept only model with grouping variable Subject, the random-effect covariance is

\[ \mathbf{G}=\sigma_u^2\mathbf{I}. \]

This is the classic compound symmetry pattern: within a subject, all repeated observations share covariance \(\sigma_u^2\), and each diagonal entry equals \(\sigma_u^2+\sigma^2\).
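To see the pattern numerically, here is a minimal sketch of the marginal covariance block \(\mathbf{V}_j\) for one subject with four repeated measurements; the variance values are illustrative (close to the REML estimates obtained later in this section):

```r
# Compound-symmetry block: V_j = sigma_u^2 * J + sigma^2 * I
sigma_u2 <- 3.27   # illustrative random-intercept variance
sigma2   <- 2.05   # illustrative residual variance
n_i      <- 4      # repeated measurements per subject
V_j <- sigma_u2 * matrix(1, n_i, n_i) + sigma2 * diag(n_i)
V_j  # all off-diagonals equal sigma_u2; each diagonal entry is sigma_u2 + sigma2
```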


6.7.1 Load data

We use the Orthodont dataset and store it as Data.

data(Orthodont, package="nlme")
Data <- Orthodont

6.7.2 Plot means and variances by higher level variable (grouping)

Before fitting models, we quickly examine whether subjects differ in their outcome level and variability.

  • If subject means vary a lot, random intercepts are plausible.
  • If within-subject variances vary wildly, a more complex residual structure might be needed (though lme4 cannot directly specify correlated residuals).
barplot(with(Data, tapply(distance, list(subject = Subject), mean)))

barplot(with(Data, tapply(distance, list(subject = Subject), var)))


6.7.3 Using lm-style fixed effects (baseline reference)

6.7.3.1 Using lm (subject as fixed effect)

This model allows a separate intercept for each subject as a fixed parameter:

\[ \text{distance}_{ij}=\beta_0+\beta_1\text{age}_{ij}+\beta_2\text{Sex}_{ij}+\alpha_j+\varepsilon_{ij}. \]

It often fits well, but it treats subject effects as fixed and does not provide a variance component interpretation \(\sigma_u^2\).

fit.lm <- lm(distance ~ age+ Sex + Subject, data = Data) 
anova(fit.lm)
##           Df   Sum Sq    Mean Sq    F value Pr(>F)
## age        1 235.3560 235.356019 114.838287      0
## Sex        1 140.4649 140.464857  68.537629      0
## Subject   25 377.9148  15.116591   7.375904      0
## Residuals 80 163.9565   2.049456         NA     NA

6.7.3.2 Using aov with an Error(Subject) term (variance components)

This approach is the classical ANOVA / variance-component view. For balanced repeated measures, it connects to mean squares.

fit.aov <- aov(distance ~ age+ Sex + Error(Subject), data = Data) 
summary(fit.aov)
## 
## Error: Subject
##           Df Sum Sq Mean Sq F value  Pr(>F)   
## Sex        1  140.5  140.46   9.292 0.00538 **
## Residuals 25  377.9   15.12                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Within
##           Df Sum Sq Mean Sq F value Pr(>F)    
## age        1  235.4  235.36   114.8 <2e-16 ***
## Residuals 80  164.0    2.05                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
6.7.3.2.1 How can we compute \(\sigma_a^2\) from this output?

Your note is the key identity:

  • \(\sigma^2 = \mathrm{MSE}\)
  • \(E(\mathrm{MSA}) = \sigma^2 + n\sigma_a^2\)

so

\[ \sigma_a^2=\frac{\mathrm{MSA}-\mathrm{MSE}}{n}. \]

Practical caution: Orthodont is balanced within subjects (four ages per child), but the Sex groups are unequal (16 boys, 11 girls), so this computation is best viewed as an approximation. Likelihood-based LMM estimation (REML/ML) is typically preferred for unbalanced or more complex designs.
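Applying the identity to the mean squares printed in the aov output above (Subject-stratum residual mean square ≈ 15.12, within-subject mean square ≈ 2.05, and \(n = 4\) measurements per subject) gives a back-of-the-envelope variance-component estimate; the numbers below are copied from that output:

```r
# Moment-based estimate of the random-intercept variance from aov mean squares
MSA <- 15.12   # Subject-stratum residual mean square (between subjects)
MSE <- 2.05    # within-subject residual mean square
n   <- 4       # measurements per subject
sigma_a2 <- (MSA - MSE) / n
sigma_a2  # ~3.27, essentially the REML estimate lmer reports below
```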


6.7.4 Using LMM (random intercept) in lme4

6.7.4.1 Fit the random intercept model

We fit

\[ \text{distance}_{ij}=\beta_0+\beta_1\text{age}_{ij}+\beta_2\text{Sex}_{ij}+u_{0j}+\varepsilon_{ij}, \]

with \(u_{0j}\sim N(0,\sigma_u^2)\), \(\varepsilon_{ij}\sim N(0,\sigma^2)\).

library(lme4) 
fit.lmer <- lmer(distance ~ age + Sex  +(1 | Subject), data = Data) 
summary(fit.lmer)
## Linear mixed model fit by REML ['lmerMod']
## Formula: distance ~ age + Sex + (1 | Subject)
##    Data: Data
## 
## REML criterion at convergence: 437.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.7489 -0.5503 -0.0252  0.4534  3.6575 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Subject  (Intercept) 3.267    1.807   
##  Residual             2.049    1.432   
## Number of obs: 108, groups:  Subject, 27
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 17.70671    0.83392  21.233
## age          0.66019    0.06161  10.716
## SexFemale   -2.32102    0.76142  -3.048
## 
## Correlation of Fixed Effects:
##           (Intr) age   
## age       -0.813       
## SexFemale -0.372  0.000

The default covariance structure here is compound symmetry (induced by the random intercept).
More precisely:

  • the random intercept implies constant covariance \(\sigma_u^2\) for any two observations from the same subject,
  • residual errors are independent (identity \(\mathbf{R}=\sigma^2\mathbf{I}\)).

6.7.5 Get \(\mathbf{X}\), \(\mathbf{y}\), \(\mathbf{Z}\)

These are the core objects behind the matrix formulas.

  • \(\mathbf{X}\): fixed-effects design matrix (Intercept, age, Sex…)
  • \(\mathbf{y}\): response vector
  • \(\mathbf{Z}\): random-effects design matrix (subject membership for random intercept)
X <- getME(fit.lmer, "X")
head(X)
##   (Intercept) age SexFemale
## 1           1   8         0
## 2           1  10         0
## 3           1  12         0
## 4           1  14         0
## 5           1   8         0
## 6           1  10         0
y <- getME(fit.lmer, "y")
head(y)
## [1] 26.0 25.0 29.0 31.0 21.5 22.5
Z <- getME(fit.lmer, "Z")
head(Z)
## 6 x 27 sparse Matrix of class "dgCMatrix"
##   [[ suppressing 27 column names 'M16', 'M05', 'M02' ... ]]
##                                                        
## 1 . . . . . . . . . . . . . . 1 . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . 1 . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . 1 . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . 1 . . . . . . . . . . . .
## 5 . . 1 . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . 1 . . . . . . . . . . . . . . . . . . . . . . . .

6.7.6 Get fixed and random effect coefficients

  • \(\hat{\beta}\) are population-average (fixed) coefficients.
  • \(\hat{u}\) are BLUPs (random intercepts per subject).
  • coef() combines them into subject-specific coefficients.
bhat <- getME(fit.lmer, "fixef") #getME(fit.lmer, "beta")  # fixef(fit.lmer)
bhat 
## (Intercept)         age   SexFemale 
##  17.7067130   0.6601852  -2.3210227
uhat <- ranef(fit.lmer) 
uhat
## $Subject
##     (Intercept)
## M16 -1.70183357
## M05 -1.70183357
## M02 -1.37767479
## M11 -1.16156894
## M07 -1.05351602
## M08 -0.94546309
## M03 -0.62130432
## M12 -0.62130432
## M13 -0.62130432
## M14 -0.08103969
## M09  0.13506616
## M15  0.78338371
## M06  1.21559540
## M04  1.43170125
## M01  2.40417758
## M10  3.91691853
## F10 -3.58539251
## F09 -1.31628108
## F06 -1.31628108
## F01 -1.10017524
## F05 -0.01964599
## F07  0.30451279
## F02  0.30451279
## F08  0.62867156
## F03  0.95283034
## F04  1.92530666
## F11  3.22194176
## 
## with conditional variances for "Subject"
coef(fit.lmer)
## $Subject
##     (Intercept)       age SexFemale
## M16    16.00488 0.6601852 -2.321023
## M05    16.00488 0.6601852 -2.321023
## M02    16.32904 0.6601852 -2.321023
## M11    16.54514 0.6601852 -2.321023
## M07    16.65320 0.6601852 -2.321023
## M08    16.76125 0.6601852 -2.321023
## M03    17.08541 0.6601852 -2.321023
## M12    17.08541 0.6601852 -2.321023
## M13    17.08541 0.6601852 -2.321023
## M14    17.62567 0.6601852 -2.321023
## M09    17.84178 0.6601852 -2.321023
## M15    18.49010 0.6601852 -2.321023
## M06    18.92231 0.6601852 -2.321023
## M04    19.13841 0.6601852 -2.321023
## M01    20.11089 0.6601852 -2.321023
## M10    21.62363 0.6601852 -2.321023
## F10    14.12132 0.6601852 -2.321023
## F09    16.39043 0.6601852 -2.321023
## F06    16.39043 0.6601852 -2.321023
## F01    16.60654 0.6601852 -2.321023
## F05    17.68707 0.6601852 -2.321023
## F07    18.01123 0.6601852 -2.321023
## F02    18.01123 0.6601852 -2.321023
## F08    18.33538 0.6601852 -2.321023
## F03    18.65954 0.6601852 -2.321023
## F04    19.63202 0.6601852 -2.321023
## F11    20.92865 0.6601852 -2.321023
## 
## attr(,"class")
## [1] "coef.mer"

Interpretation:

  • \(\hat{\beta}_0\) is the overall intercept.
  • \(\hat{u}_j\) is subject \(j\)’s deviation from the overall intercept.
  • Subject \(j\)’s conditional intercept is \(\hat{\beta}_0+\hat{u}_j\).
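As a concrete check (numbers copied from the outputs above), adding subject M16's BLUP to the fixed intercept reproduces the first row of coef(fit.lmer):

```r
# Conditional intercept for subject M16: beta0_hat + u_hat_M16
beta0 <- 17.7067130    # fixed intercept, from fixef(fit.lmer)
u_M16 <- -1.70183357   # random-intercept BLUP for M16, from ranef(fit.lmer)
beta0 + u_M16          # ~16.005, matching coef(fit.lmer)$Subject["M16", "(Intercept)"]
```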

6.7.7 Random effect covariance structure and the role of \(\theta\)

In lme4, random-effects parameters are represented via a Cholesky-factor parameterization. Conceptually:

  • \(\theta\) parameterizes the relative Cholesky factors,
  • from \(\theta\) we can reconstruct the random-effect covariance \(\Sigma\) (often denoted \(\mathbf{G}\) for a random-effect term).

Your 2×2 example (intercept + slope) illustrates:

\[ \Sigma = \left( \begin{array}{cc} \theta_1 & 0 \\ \theta_2 & \theta_3 \end{array} \right) \left( \begin{array}{cc} \theta_1 & \theta_2 \\ 0 & \theta_3 \end{array} \right) = \left( \begin{array}{cc} \theta_1^2 & \theta_1 \theta_2 \\ \theta_1 \theta_2 & \theta_2^2 + \theta_3^2 \end{array} \right) = \left( \begin{array}{cc} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{array} \right). \]

For a random intercept only model, there is only one standard deviation parameter, so \(\theta\) reduces to \(\sigma_u\) (up to the scaling conventions used internally).
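A minimal sketch of this reconstruction, using made-up \(\theta\) values (for a fitted model, the actual vector is available via getME(fit, "theta"), on lme4's internal relative scale):

```r
# Rebuild Sigma = L L' from a hypothetical theta = (theta1, theta2, theta3)
theta <- c(1.9, -0.8, 0.5)                 # illustrative values only
L <- matrix(c(theta[1], theta[2],
              0,        theta[3]), 2, 2)   # lower-triangular factor (column-wise fill)
Sigma <- L %*% t(L)
Sigma  # [1,1] = theta1^2, off-diagonal = theta1*theta2, [2,2] = theta2^2 + theta3^2
```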


6.7.7.1 Build a “random intercept covariance object” for Subject

VarCorr(fit.lmer) reports the random-intercept variance for Subject (the printed Std.Dev. is its square root). Extracting vc[["Subject"]][1] gives the variance \(\sigma_u^2 = 3.2668\), and you build a diagonal matrix of that variance across subjects:

vc <- VarCorr(fit.lmer) 
Lambda_new <- vc[["Subject"]][1] * diag(length(levels(Data$Subject)))
head(Lambda_new)
##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]  (columns 7-27 all zero)
## [1,] 3.266784 0.000000 0.000000 0.000000 0.000000 0.000000
## [2,] 0.000000 3.266784 0.000000 0.000000 0.000000 0.000000
## [3,] 0.000000 0.000000 3.266784 0.000000 0.000000 0.000000
## [4,] 0.000000 0.000000 0.000000 3.266784 0.000000 0.000000
## [5,] 0.000000 0.000000 0.000000 0.000000 3.266784 0.000000
## [6,] 0.000000 0.000000 0.000000 0.000000 0.000000 3.266784
dim(Lambda_new)
## [1] 27 27

Notes for interpretation:

  • If you want \(\mathbf{G}\) on the variance scale, it is \(\sigma_u^2\mathbf{I}\).
  • Many internal pieces in lme4 are stored in factorized form; the important point is that the random intercept induces a block-constant covariance when mapped through \(\mathbf{Z}\).

6.7.8 Residual variance and \(\mathbf{R}=\sigma^2\mathbf{I}\)

Extract residual SD from the fitted model and square it:

\[ \mathbf{R}=\sigma^2\mathbf{I}_N. \]

sigma <- getME(fit.lmer, "sigma")^2  # note: squared, so this object holds the residual variance
sigma
## [1] 2.049456
head(sigma*diag(nrow(Data)))
##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]  (columns 7-108 all zero)
## [1,] 2.049456 0.000000 0.000000 0.000000 0.000000 0.000000
## [2,] 0.000000 2.049456 0.000000 0.000000 0.000000 0.000000
## [3,] 0.000000 0.000000 2.049456 0.000000 0.000000 0.000000
## [4,] 0.000000 0.000000 0.000000 2.049456 0.000000 0.000000
## [5,] 0.000000 0.000000 0.000000 0.000000 2.049456 0.000000
## [6,] 0.000000 0.000000 0.000000 0.000000 0.000000 2.049456

6.7.9 Marginal covariance matrix \(\mathbf{V}\) of \(\mathbf{y}\)

The key decomposition is:

\[ \mathbf{V}=\mathbf{Z}\mathbf{G}\mathbf{Z}'+\sigma^2\mathbf{I}. \]

In your implementation:

  • \(\mathbf{Z}\Lambda_{\text{new}}\mathbf{Z}'\) creates within-subject covariance,
  • \(\sigma^2\mathbf{I}\) adds residual variance.
VM <- Z%*%Lambda_new%*%t(Z)+sigma*diag(nrow(Data)) 
head(VM)
## 6 x 108 sparse Matrix of class "dgCMatrix"
##   [[ suppressing 108 column names '1', '2', '3' ... ]]
##                                                                                
## 1 5.316240 3.266784 3.266784 3.266784 .        .        .        .        . . .
## 2 3.266784 5.316240 3.266784 3.266784 .        .        .        .        . . .
## 3 3.266784 3.266784 5.316240 3.266784 .        .        .        .        . . .
## 4 3.266784 3.266784 3.266784 5.316240 .        .        .        .        . . .
## 5 .        .        .        .        5.316240 3.266784 3.266784 3.266784 . . .
## 6 .        .        .        .        3.266784 5.316240 3.266784 3.266784 . . .
##                                                                                
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                                
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                        
## 1 . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . .
dim(VM)
## [1] 108 108

This \(\mathbf{V}\) is the object that controls:

  • GLS estimation of \(\beta\),
  • shrinkage and BLUP estimation of \(u\),
  • standard errors of \(\hat{\beta}\).

6.7.10 Fixed effect coefficient covariance matrix

From theory:

\[ \mathrm{Var}(\hat{\beta})= (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}. \]

vcov(fit.lmer) returns the estimated \(\mathrm{Var}(\hat{\beta})\).

vcov <- vcov(fit.lmer) #fixed cov
vcov
## 3 x 3 Matrix of class "dpoMatrix"
##             (Intercept)           age     SexFemale
## (Intercept)  0.69542669 -4.174818e-02 -2.361967e-01
## age         -0.04174818  3.795289e-03 -3.051416e-17
## SexFemale   -0.23619673 -3.051416e-17  5.797556e-01
# compute the correlation between the intercept and age coefficient estimates
vcov@x[2]/prod(sqrt( diag(vcov(fit.lmer))[-3] ))
## [1] -0.8126236

Interpretation:

  • diagonal elements are variances of coefficients,
  • square roots of the diagonal are standard errors,
  • off-diagonal elements encode correlation among coefficient estimates.


6.7.11 Random effect covariance matrix (standard deviation scale)

VarCorr() prints SDs (and correlations if applicable).
Matrix::bdiag(vc) produces a block-diagonal covariance structure across random-effect terms.

vc <- VarCorr(fit.lmer) 
vc ### default print method: standard dev and corr
##  Groups   Name        Std.Dev.
##  Subject  (Intercept) 1.8074  
##  Residual             1.4316
as.matrix(Matrix::bdiag(vc)) #random effect covariance matrix
##             (Intercept)
## (Intercept)    3.266784
# https://stackoverflow.com/questions/47307340/extracting-the-i-estimated-variance-covariance-matrix-of-random-effects-and-or
# https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect022.htm

And your empirical SD check based on BLUPs:

uintercept <- (uhat[["Subject"]])
sd(uintercept[,1])
## [1] 1.647809

Important: the SD of the BLUPs is typically smaller than \(\hat{\sigma}_u\) due to shrinkage.
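For balanced data the size of this shrinkage can be approximated in closed form; a rough sketch with the estimates printed above (ignoring uncertainty in the fixed effects):

```r
# Approximate shrinkage factor for a subject-level deviation with n_i = 4 obs
sigma_u2 <- 3.267   # random-intercept variance, from VarCorr above
sigma2   <- 2.049   # residual variance
n_i      <- 4       # observations per subject in Orthodont
shrink <- sigma_u2 / (sigma_u2 + sigma2 / n_i)
shrink  # ~0.86: each raw subject deviation is pulled toward zero by this factor
```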


6.7.12 Compute fixed effect coefficients manually (GLS)

Given \(\mathbf{V}\), the GLS estimator is:

\[ \hat{\beta}= (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}. \]

You compute it directly and compare with bhat.

library(matlib)
inv(t(as.matrix(X))%*%inv(as.matrix(VM))%*%as.matrix(X))%*%t(as.matrix(X))%*%inv(as.matrix(VM))%*%y
##            [,1]
## [1,] 17.7067076
## [2,]  0.6601871
## [3,] -2.3210230
bhat
## (Intercept)         age   SexFemale 
##  17.7067130   0.6601852  -2.3210227

6.7.13 Compute covariance of fixed effect coefficients manually

Theory:

\[ \mathrm{Var}(\hat{\beta})=(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}. \]

Then standard errors are the square roots of the diagonal.

inv(t(as.matrix(X))%*%inv(as.matrix(VM))%*%as.matrix(X))
##            [,1]       [,2]       [,3]
## [1,]  0.6954267 -0.0417482 -0.2361967
## [2,] -0.0417482  0.0037953  0.0000000
## [3,] -0.2361967  0.0000000  0.5797556
# standard error
sqrt(inv(t(as.matrix(X))%*%inv(as.matrix(VM))%*%as.matrix(X)))
## Warning in sqrt(inv(t(as.matrix(X)) %*% inv(as.matrix(VM)) %*% as.matrix(X))):
## NaNs produced
##           [,1]      [,2]      [,3]
## [1,] 0.8339225       NaN       NaN
## [2,]       NaN 0.0616059 0.0000000
## [3,]       NaN 0.0000000 0.7614169
# the following equals lmm summary
vcov(fit.lmer) 
## 3 x 3 Matrix of class "dpoMatrix"
##             (Intercept)           age     SexFemale
## (Intercept)  0.69542669 -4.174818e-02 -2.361967e-01
## age         -0.04174818  3.795289e-03 -3.051416e-17
## SexFemale   -0.23619673 -3.051416e-17  5.797556e-01
sqrt(vcov(fit.lmer))
## Warning in g({: NaNs produced
## 3 x 3 Matrix of class "dsyMatrix"
##             (Intercept)        age SexFemale
## (Intercept)   0.8339225        NaN       NaN
## age                 NaN 0.06160592       NaN
## SexFemale           NaN        NaN 0.7614168

6.7.14 Compute random effect coefficients manually (BLUPs)

The BLUP formula is:

\[ \hat{u}=\mathbf{G}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{y}-\mathbf{X}\hat{\beta}). \]

Your code implements the same structure and compares to ranef().

comput_uhat <- (as.matrix(Lambda_new))%*%t(Z)%*%inv(as.matrix(VM))%*%(y-as.matrix(X)%*%(bhat))
cbind((comput_uhat@x),(uhat[["Subject"]]))
##     (comput_uhat@x) (Intercept)
## M16      -1.7018335  -1.7018336
## M05      -1.7018335  -1.7018336
## M02      -1.3776748  -1.3776748
## M11      -1.1615689  -1.1615689
## M07      -1.0535160  -1.0535160
## M08      -0.9454631  -0.9454631
## M03      -0.6213043  -0.6213043
## M12      -0.6213043  -0.6213043
## M13      -0.6213043  -0.6213043
## M14      -0.0810397  -0.0810397
## M09       0.1350662   0.1350662
## M15       0.7833837   0.7833837
## M06       1.2155954   1.2155954
## M04       1.4317012   1.4317013
## M01       2.4041775   2.4041776
## M10       3.9169184   3.9169185
## F10      -3.5853924  -3.5853925
## F09      -1.3162811  -1.3162811
## F06      -1.3162811  -1.3162811
## F01      -1.1001752  -1.1001752
## F05      -0.0196460  -0.0196460
## F07       0.3045128   0.3045128
## F02       0.3045128   0.3045128
## F08       0.6286716   0.6286716
## F03       0.9528303   0.9528303
## F04       1.9253066   1.9253067
## F11       3.2219417   3.2219418

6.7.14.1 Covariance object used for random effects

You print the diagonal matrix you constructed:

head(Lambda_new)
##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]  (columns 7-27 all zero)
## [1,] 3.266784 0.000000 0.000000 0.000000 0.000000 0.000000
## [2,] 0.000000 3.266784 0.000000 0.000000 0.000000 0.000000
## [3,] 0.000000 0.000000 3.266784 0.000000 0.000000 0.000000
## [4,] 0.000000 0.000000 0.000000 3.266784 0.000000 0.000000
## [5,] 0.000000 0.000000 0.000000 0.000000 3.266784 0.000000
## [6,] 0.000000 0.000000 0.000000 0.000000 0.000000 3.266784

In a teaching narrative, this is where you can emphasize:

  • in random intercept models, \(\mathbf{G}\) is extremely simple,
  • all complexity in \(\mathbf{V}\) comes from how \(\mathbf{Z}\) maps subject membership into the observation-level covariance.

6.7.15 Compute predicted values

Conditional fitted values are:

\[ \hat{y}=\mathbf{X}\hat{\beta}+\mathbf{Z}\hat{u}. \]

You compute and compare with fitted().

yhat <- X%*%(bhat)+Z%*%(uhat[["Subject"]][["(Intercept)"]])
head(yhat)
## 6 x 1 Matrix of class "dgeMatrix"
##       [,1]
## 1 25.39237
## 2 26.71274
## 3 28.03311
## 4 29.35348
## 5 21.61052
## 6 22.93089
head(fitted(fit.lmer))
##        1        2        3        4        5        6 
## 25.39237 26.71274 28.03311 29.35348 21.61052 22.93089

Interpretation:

  • \(\mathbf{X}\hat{\beta}\) is the population-average prediction,
  • \(\mathbf{Z}\hat{u}\) adds subject-specific intercept shifts,
  • fitted(fit.lmer) corresponds to the conditional fitted values under the LMM.

6.7.15.1 Key takeaway

This random-intercept example is the cleanest case where the entire mixed model machinery reduces to:

  1. specify \(\mathbf{V}=\mathbf{Z}\mathbf{G}\mathbf{Z}'+\sigma^2\mathbf{I}\),
  2. compute GLS \(\hat{\beta}\),
  3. compute BLUP \(\hat{u}\),
  4. obtain fitted values \(\hat{y}=\mathbf{X}\hat{\beta}+\mathbf{Z}\hat{u}\).

6.8 Linear Mixed Model Covariance Decomposition with Random Slopes (lme4)

In this section, we extend the covariance decomposition of linear mixed models (LMMs) from random intercept models to random slope models. Allowing random slopes means that, in addition to subject-specific baselines (intercepts), the effect of a covariate (here, age) is also allowed to vary across subjects. This is particularly important in longitudinal and growth-curve settings, where individuals may follow systematically different trajectories over time.

We continue to use the Orthodont dataset, a classic longitudinal dataset that records the distance of orthodontic measurements over age, stratified by subject and sex.


6.8.1 Load data

We begin by loading the Orthodont dataset from the nlme package. Each subject has repeated measurements over age, making this dataset well suited for mixed-effects modeling.

data(Orthodont, package="nlme")
Data <- Orthodont

6.8.2 Using a linear mixed model with random slopes

We now fit a linear mixed model with:

  • fixed effects for age and Sex,
  • a random intercept for each Subject,
  • and a random slope for age within each Subject.

Mathematically, this model can be written as

\[ y_{ij} = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j}) \, \text{age}_{ij} + \beta_2 \, \text{Sex}_{ij} + \varepsilon_{ij}, \]

where:

  • \((u_{0j}, u_{1j})\) are subject-specific random effects,
  • \(\varepsilon_{ij} \sim N(0, \sigma^2)\) are residual errors.

library(lme4) 
fit.lmer.slope <- lmer(distance ~ age + Sex  +(1+ age | Subject), data = Data) 
summary(fit.lmer.slope)
## Linear mixed model fit by REML ['lmerMod']
## Formula: distance ~ age + Sex + (1 + age | Subject)
##    Data: Data
## 
## REML criterion at convergence: 435.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0815 -0.4568  0.0155  0.4471  3.8944 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  Subject  (Intercept) 7.82249  2.7969        
##           age         0.05126  0.2264   -0.77
##  Residual             1.71621  1.3100        
## Number of obs: 108, groups:  Subject, 27
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 17.63518    0.88623  19.899
## age          0.66019    0.07125   9.265
## SexFemale   -2.14544    0.75746  -2.832
## 
## Correlation of Fixed Effects:
##           (Intr) age   
## age       -0.838       
## SexFemale -0.348  0.000

Although the residual covariance is independent (diagonal), the random-effects covariance structure is no longer scalar. Instead, each subject has a 2×2 covariance matrix corresponding to the random intercept and random slope.


6.8.3 Extracting the design matrices X, y, and Z

To understand the internal structure of the fitted model, we explicitly extract:

  • X: the fixed-effects design matrix,
  • y: the response vector,
  • Z: the random-effects design matrix.

The Z matrix is particularly important here because it now contains two columns per subject: one for the random intercept and one for the random slope.

X <- getME(fit.lmer.slope, "X")
head(X)
##   (Intercept) age SexFemale
## 1           1   8         0
## 2           1  10         0
## 3           1  12         0
## 4           1  14         0
## 5           1   8         0
## 6           1  10         0
y <- getME(fit.lmer.slope, "y")
head(y)
## [1] 26.0 25.0 29.0 31.0 21.5 22.5
Z <- getME(fit.lmer.slope, "Z")
head(Z)
## 6 x 54 sparse Matrix of class "dgCMatrix"
##   [[ suppressing 54 column names 'M16', 'M16', 'M05' ... ]]
##                                                                                
## 1 . . . . .  . . . . . . . . . . . . . . . . . . . . . . . 1  8 . . . . . . . .
## 2 . . . . .  . . . . . . . . . . . . . . . . . . . . . . . 1 10 . . . . . . . .
## 3 . . . . .  . . . . . . . . . . . . . . . . . . . . . . . 1 12 . . . . . . . .
## 4 . . . . .  . . . . . . . . . . . . . . . . . . . . . . . 1 14 . . . . . . . .
## 5 . . . . 1  8 . . . . . . . . . . . . . . . . . . . . . . .  . . . . . . . . .
## 6 . . . . 1 10 . . . . . . . . . . . . . . . . . . . . . . .  . . . . . . . . .
##                                  
## 1 . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . .
dim(Z)
## [1] 108  54

Each block of columns in Z corresponds to one subject, and within each block, the columns represent the intercept and slope contributions.
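One consequence worth spelling out: with a random slope, the marginal variance of an observation depends on age, \(\mathrm{Var}(y_{ij}) = \mathbf{z}'\mathbf{G}_{\text{subject}}\mathbf{z} + \sigma^2\) with \(\mathbf{z} = (1, \text{age})'\). A sketch using rounded values from the VarCorr estimates in the model summary above:

```r
# Marginal variance at two ages under the random-slope model
G <- matrix(c( 7.8225, -0.4850,
              -0.4850,  0.0513), 2, 2)  # intercept/slope covariance (rounded)
sigma2 <- 1.7162                        # residual variance
marg_var <- function(age) {
  z <- c(1, age)                        # random-effects row for one observation
  drop(t(z) %*% G %*% z) + sigma2
}
c(age8 = marg_var(8), age14 = marg_var(14))
# unlike compound symmetry, the implied variance now changes with age
```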


6.8.4 Fixed and random effect coefficients

We next extract:

  • \(\hat{\beta}\), the estimated fixed-effect coefficients,
  • \(\hat{u}\), the estimated random effects (BLUPs).

The random effects are returned as a list indexed by grouping factor.

bhat <- getME(fit.lmer.slope, "fixef") # fixed effects
bhat 
## (Intercept)         age   SexFemale 
##  17.6351805   0.6601852  -2.1454431
uhat <- ranef(fit.lmer.slope) 
uhat
## $Subject
##     (Intercept)         age
## M16 -0.96698456 -0.06544230
## M05 -2.15621081  0.04447130
## M02 -1.58143839  0.02194874
## M11  0.38737823 -0.13961777
## M07 -1.40307472  0.03606404
## M08  0.37415109 -0.11799495
## M03 -0.83491587  0.02435289
## M12 -1.82593774  0.11594755
## M13 -5.59182087  0.46400729
## M14  0.51944692 -0.04982259
## M09 -1.07941522  0.11835170
## M15 -1.11909663  0.18322017
## M06  1.03469722  0.02495756
## M04  3.20171821 -0.15492789
## M01  0.96194797  0.14388309
## M10  3.04960613  0.09373458
## F10 -2.31273350 -0.13319695
## F09  0.32324282 -0.16262237
## F06 -0.07316594 -0.12598450
## F01  0.11181130 -0.12268061
## F05  1.43310624 -0.14279903
## F07  0.62044804 -0.03708906
## F02 -0.37057384  0.05450561
## F08  2.38444671 -0.16952522
## F03 -0.01384650  0.08273621
## F04  2.30508388 -0.03978828
## F11  2.62212981  0.05331079
## 
## with conditional variances for "Subject"

Each subject now has two random-effect coefficients: one intercept deviation and one slope deviation.


6.8.5 Random-effect covariance structure (matrix expansion)

In a random slope model, the random-effects covariance matrix for a single subject is:

\[ \mathbf{G}_{\text{subject}} = \begin{pmatrix} \sigma^2_{\text{int}} & \sigma_{\text{int,slope}} \\ \sigma_{\text{int,slope}} & \sigma^2_{\text{slope}} \end{pmatrix}. \]

The full covariance matrix \(\mathbf{G}\) is block-diagonal, with one such matrix per subject. We explicitly construct this block-diagonal matrix using bdiag().

vc <- VarCorr(fit.lmer.slope)
Lambda_new <- bdiag(replicate(length(levels(Data$Subject)), vc[["Subject"]], simplify = FALSE))
head(Lambda_new)
## 6 x 54 sparse Matrix of class "dgCMatrix"
##                                                                               
## [1,]  7.822487 -0.48495997  .         .           .         .          . . . .
## [2,] -0.484960  0.05126495  .         .           .         .          . . . .
## [3,]  .         .           7.822487 -0.48495997  .         .          . . . .
## [4,]  .         .          -0.484960  0.05126495  .         .          . . . .
## [5,]  .         .           .         .           7.822487 -0.48495997 . . . .
## [6,]  .         .           .         .          -0.484960  0.05126495 . . . .
##                                                                               
## [1,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## [2,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## [3,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## [4,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## [5,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## [6,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                   
## [1,] . . . . . . .
## [2,] . . . . . . .
## [3,] . . . . . . .
## [4,] . . . . . . .
## [5,] . . . . . . .
## [6,] . . . . . . .
dim(Lambda_new)
## [1] 54 54

Here:

  • VarCorr() provides the estimated standard deviations and correlations,
  • bdiag() assembles the subject-specific covariance matrices into a global random-effects covariance matrix.

Note
In lme4, the internal parameters theta correspond to the Cholesky factors of these covariance matrices. Given theta, the full covariance matrix is reconstructed implicitly.
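A generic sketch of that reconstruction (with made-up numbers, not the values of this fit): lme4 stores theta, the lower triangle of the relative Cholesky factor \(\Lambda\), and the covariance block is recovered as \(\sigma^2 \Lambda \Lambda^\top\).

```r
# Sketch: reconstruct a 2x2 covariance block from a lower-triangular factor,
# the way lme4 parameterizes G = sigma^2 * Lambda %*% t(Lambda).
# These numbers are illustrative only, not taken from the fitted model.
theta  <- c(2.1, -0.4, 0.3)   # hypothetical lower-triangle entries
Lambda <- matrix(c(theta[1], theta[2], 0, theta[3]), nrow = 2)
sigma2 <- 1.5                 # hypothetical residual variance
G <- sigma2 * Lambda %*% t(Lambda)
G  # symmetric and positive semi-definite by construction
```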


6.8.6 Residual variance and identity matrix

The residual errors are assumed independent with variance \(\sigma^2\). We extract \(\sigma^2\) and explicitly form \(\sigma^2 I\).

sigma <- getME(fit.lmer.slope, "sigma")^2  # residual variance sigma^2
sigma
## [1] 1.71621
head(sigma*diag(nrow(Data)))
## (showing the first 6 of 108 columns; all remaining entries are 0)
## 1.71621 0.00000 0.00000 0.00000 0.00000 0.00000
## 0.00000 1.71621 0.00000 0.00000 0.00000 0.00000
## 0.00000 0.00000 1.71621 0.00000 0.00000 0.00000
## 0.00000 0.00000 0.00000 1.71621 0.00000 0.00000
## 0.00000 0.00000 0.00000 0.00000 1.71621 0.00000
## 0.00000 0.00000 0.00000 0.00000 0.00000 1.71621

6.8.7 Covariance matrix of the response vector y

The marginal covariance of \(\mathbf{y}\) is:

\[ \mathbf{V} = \mathbf{Z G Z}^\top + \sigma^2 \mathbf{I}. \]

We compute this matrix explicitly.

VM <- Z%*%Lambda_new%*%t(Z)+sigma*diag(nrow(Data)) 
head(VM)
## 6 x 108 sparse Matrix of class "dgCMatrix"
##   [[ suppressing 108 column names '1', '2', '3' ... ]]
##                                                                                
## 1 5.060294 3.194403 3.044722 2.895042 .        .        .        .        . . .
## 2 3.194403 4.965992 3.305161 3.360540 .        .        .        .        . . .
## 3 3.044722 3.305161 5.281810 3.826039 .        .        .        .        . . .
## 4 2.895042 3.360540 3.826039 6.007747 .        .        .        .        . . .
## 5 .        .        .        .        5.060294 3.194403 3.044722 2.895042 . . .
## 6 .        .        .        .        3.194403 4.965992 3.305161 3.360540 . . .
##                                                                                
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                                
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                        
## 1 . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . .
dim(VM)
## [1] 108 108

This matrix fully characterizes the dependence structure induced by the random intercepts and random slopes.


6.8.8 Covariance matrix of fixed-effect coefficients

The covariance matrix of \(\hat{\beta}\) is given by:

\[ \operatorname{Var}(\hat{\beta}) = (\mathbf{X}^\top \mathbf{V}^{-1} \mathbf{X})^{-1}. \]

We confirm that this matches the output from vcov().

vcov_beta <- vcov(fit.lmer.slope) # fixed-effect covariance (avoid masking stats::vcov)
vcov_beta
## 3 x 3 Matrix of class "dpoMatrix"
##             (Intercept)           age     SexFemale
## (Intercept)  0.78540227 -5.292131e-02 -2.337501e-01
## age         -0.05292131  5.076869e-03 -5.668086e-16
## SexFemale   -0.23375009 -5.668086e-16  5.737502e-01
# correlation between the intercept and age estimates
vcov_beta@x[2]/prod(sqrt(diag(vcov_beta)[-3]))
## [1] -0.8380822

6.8.9 Random-effect covariance matrix

The block-diagonal random-effects covariance matrix can also be extracted directly.

vc <- VarCorr(fit.lmer.slope) 
vc ## standard deviations and correlations
##  Groups   Name        Std.Dev. Corr  
##  Subject  (Intercept) 2.79687        
##           age         0.22642  -0.766
##  Residual             1.31004
as.matrix(Matrix::bdiag(vc)) # random-effect covariance matrix
##             (Intercept)        age
## (Intercept)    7.822487 -0.4849600
## age           -0.484960  0.0512649

6.8.10 Computing fixed-effect coefficients manually

Using the generalized least squares (GLS) formula,

\[ \hat{\beta} = (\mathbf{X}^\top \mathbf{V}^{-1} \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{y}, \]

we compute \(\hat{\beta}\) explicitly and verify it matches the model output.

library(matlib)
inv(t(as.matrix(X))%*%inv(as.matrix(VM))%*%as.matrix(X))%*%t(as.matrix(X))%*%inv(as.matrix(VM))%*%y
## 17.635190
##  0.660189
## -2.145444
bhat
## (Intercept)         age   SexFemale 
##  17.6351805   0.6601852  -2.1454431

6.8.11 Covariance of fixed-effect coefficients

We also compute the covariance matrix and standard errors manually.

inv(t(as.matrix(X))%*%inv(as.matrix(VM))%*%as.matrix(X))
##  0.7854023 -0.0529213 -0.2337501
## -0.0529213  0.0050769  0.0000000
## -0.2337501  0.0000000  0.5737502
# standard errors: only the diagonal is meaningful; elementwise sqrt of
# negative off-diagonal entries gives NaN
sqrt(inv(t(as.matrix(X))%*%inv(as.matrix(VM))%*%as.matrix(X)))
## Warning in sqrt(inv(t(as.matrix(X)) %*% inv(as.matrix(VM)) %*% as.matrix(X))):
## NaNs produced
## 0.8862292       NaN      NaN
##       NaN 0.0712522 0.000000
##       NaN 0.0000000 0.757463
# matches lmer output
vcov(fit.lmer.slope) 
## 3 x 3 Matrix of class "dpoMatrix"
##             (Intercept)           age     SexFemale
## (Intercept)  0.78540227 -5.292131e-02 -2.337501e-01
## age         -0.05292131  5.076869e-03 -5.668086e-16
## SexFemale   -0.23375009 -5.668086e-16  5.737502e-01
sqrt(vcov(fit.lmer.slope))
## Warning in g({: NaNs produced
## 3 x 3 Matrix of class "dsyMatrix"
##             (Intercept)        age SexFemale
## (Intercept)   0.8862292        NaN       NaN
## age                 NaN 0.07125215       NaN
## SexFemale           NaN        NaN  0.757463
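The NaN warnings are harmless: sqrt() acts elementwise, so every negative off-diagonal covariance maps to NaN. Standard errors come from the diagonal alone, as this toy 2x2 example (hypothetical values) shows:

```r
# Standard errors are the square roots of the diagonal of the covariance
# matrix; the off-diagonal covariance may legitimately be negative.
V_beta <- matrix(c( 0.785, -0.053,
                   -0.053,  0.005), nrow = 2)  # hypothetical values
sqrt(diag(V_beta))  # finite standard errors, no NaN
```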

6.8.12 Computing random-effect coefficients

The BLUPs of the random effects are given by:

\[ \hat{\mathbf{u}} = \mathbf{GZ}^\top \mathbf{V}^{-1} (\mathbf{y} - \mathbf{X}\hat{\beta}). \]

We compute these manually and compare them with ranef(). Note that comput_uhat is stacked in the order of the columns of Z, so intercept and slope deviations alternate subject by subject; the first column below must be read in that interleaved order, while the last two columns reproduce the ranef() table.

comput_uhat <- (as.matrix(Lambda_new))%*%t(Z)%*%inv(as.matrix(VM))%*%(y-as.matrix(X)%*%(bhat))
cbind((comput_uhat@x),(uhat[["Subject"]]))
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
comput_uhat (Intercept)        age
-0.9669845 -0.9669846 -0.0654423
-0.0654423 -2.1562108 0.0444713
-2.1562108 -1.5814384 0.0219487
0.0444713 0.3873782 -0.1396178
-1.5814383 -1.4030747 0.0360640
0.0219487 0.3741511 -0.1179949
0.3873782 -0.8349159 0.0243529
-0.1396178 -1.8259377 0.1159476
-1.4030747 -5.5918209 0.4640073
0.0360640 0.5194469 -0.0498226
0.3741511 -1.0794152 0.1183517
-0.1179950 -1.1190966 0.1832202
-0.8349158 1.0346972 0.0249576
0.0243529 3.2017182 -0.1549279
-1.8259377 0.9619480 0.1438831
0.1159476 3.0496061 0.0937346
-5.5918208 -2.3127335 -0.1331970
0.4640073 0.3232428 -0.1626224
0.5194469 -0.0731659 -0.1259845
-0.0498226 0.1118113 -0.1226806
-1.0794152 1.4331062 -0.1427990
0.1183517 0.6204480 -0.0370891
-1.1190966 -0.3705738 0.0545056
0.1832202 2.3844467 -0.1695252
1.0346972 -0.0138465 0.0827362
0.0249576 2.3050839 -0.0397883
3.2017181 2.6221298 0.0533108
-0.1549279 -0.9669846 -0.0654423
0.9619480 -2.1562108 0.0444713
0.1438831 -1.5814384 0.0219487
3.0496061 0.3873782 -0.1396178
0.0937346 -1.4030747 0.0360640
-2.3127334 0.3741511 -0.1179949
-0.1331970 -0.8349159 0.0243529
0.3232428 -1.8259377 0.1159476
-0.1626224 -5.5918209 0.4640073
-0.0731659 0.5194469 -0.0498226
-0.1259845 -1.0794152 0.1183517
0.1118113 -1.1190966 0.1832202
-0.1226806 1.0346972 0.0249576
1.4331062 3.2017182 -0.1549279
-0.1427990 0.9619480 0.1438831
0.6204480 3.0496061 0.0937346
-0.0370891 -2.3127335 -0.1331970
-0.3705738 0.3232428 -0.1626224
0.0545056 -0.0731659 -0.1259845
2.3844467 0.1118113 -0.1226806
-0.1695252 1.4331062 -0.1427990
-0.0138465 0.6204480 -0.0370891
0.0827362 -0.3705738 0.0545056
2.3050838 2.3844467 -0.1695252
-0.0397883 -0.0138465 0.0827362
2.6221297 2.3050839 -0.0397883
0.0533108 2.6221298 0.0533108

6.8.13 Predicted values

Predicted values are obtained by combining fixed and random components:

\[ \hat{\mathbf{y}} = \mathbf{X}\hat{\beta} + \mathbf{Z}\hat{\mathbf{u}}. \]

yhat <- X%*%(bhat)+Z%*%comput_uhat
head(yhat)
## 6 x 1 Matrix of class "dgeMatrix"
##       [,1]
## 1 25.02967
## 2 26.63781
## 3 28.24595
## 4 29.85408
## 5 21.51081
## 6 22.87508
head(fitted(fit.lmer.slope))
##        1        2        3        4        5        6 
## 25.02967 26.63781 28.24595 29.85408 21.51081 22.87508

6.8.14 Model with two grouping factors

We now illustrate a model with two random-effect grouping factors, adding a random intercept for age as well.

library(lme4) 
fit.lmer.slope2 <- lmer(distance ~ age + Sex  +(1+ age | Subject)+ (1 | age), data = Data) 
## boundary (singular) fit: see help('isSingular')
summary(fit.lmer.slope2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: distance ~ age + Sex + (1 + age | Subject) + (1 | age)
##    Data: Data
## 
## REML criterion at convergence: 435.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0821 -0.4568  0.0154  0.4471  3.8942 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  Subject  (Intercept) 7.81589  2.7957        
##           age         0.05122  0.2263   -0.77
##  age      (Intercept) 0.00000  0.0000        
##  Residual             1.71643  1.3101        
## Number of obs: 108, groups:  Subject, 27; age, 4
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 17.63511    0.88612  19.902
## age          0.66019    0.07124   9.267
## SexFemale   -2.14527    0.75745  -2.832
## 
## Correlation of Fixed Effects:
##           (Intr) age   
## age       -0.838       
## SexFemale -0.348  0.000
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')

6.8.15 Extracting Z for the two-factor model

Here, the Z matrix expands accordingly. The dimension reflects the combined random effects.

Z2 <- getME(fit.lmer.slope2, "Z")
head(Z2,10)
## 10 x 58 sparse Matrix of class "dgCMatrix"
##   [[ suppressing 58 column names 'M16', 'M16', 'M05' ... ]]
##                                                                                
## 1  . . . . .  . . . . . . . .  . . . . . . . . . . . . . . . 1  8 . . . . . . .
## 2  . . . . .  . . . . . . . .  . . . . . . . . . . . . . . . 1 10 . . . . . . .
## 3  . . . . .  . . . . . . . .  . . . . . . . . . . . . . . . 1 12 . . . . . . .
## 4  . . . . .  . . . . . . . .  . . . . . . . . . . . . . . . 1 14 . . . . . . .
## 5  . . . . 1  8 . . . . . . .  . . . . . . . . . . . . . . . .  . . . . . . . .
## 6  . . . . 1 10 . . . . . . .  . . . . . . . . . . . . . . . .  . . . . . . . .
## 7  . . . . 1 12 . . . . . . .  . . . . . . . . . . . . . . . .  . . . . . . . .
## 8  . . . . 1 14 . . . . . . .  . . . . . . . . . . . . . . . .  . . . . . . . .
## 9  . . . . .  . . . . . . . 1  8 . . . . . . . . . . . . . . .  . . . . . . . .
## 10 . . . . .  . . . . . . . 1 10 . . . . . . . . . . . . . . .  . . . . . . . .
##                                             
## 1  . . . . . . . . . . . . . . . . . 1 . . .
## 2  . . . . . . . . . . . . . . . . . . 1 . .
## 3  . . . . . . . . . . . . . . . . . . . 1 .
## 4  . . . . . . . . . . . . . . . . . . . . 1
## 5  . . . . . . . . . . . . . . . . . 1 . . .
## 6  . . . . . . . . . . . . . . . . . . 1 . .
## 7  . . . . . . . . . . . . . . . . . . . 1 .
## 8  . . . . . . . . . . . . . . . . . . . . 1
## 9  . . . . . . . . . . . . . . . . . 1 . . .
## 10 . . . . . . . . . . . . . . . . . . 1 . .
dim(Z2)
## [1] 108  58

6.8.16 Nested versus crossed random effects

Finally, we demonstrate a nested random-effects structure, where subjects are nested within sex.

library(lme4) 
fit.lmer.slope3 <- lmer(distance ~ age + Sex  +(1 | Sex/Subject), data = Data) 
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables?
summary(fit.lmer.slope3)
## Linear mixed model fit by REML ['lmerMod']
## Formula: distance ~ age + Sex + (1 | Sex/Subject)
##    Data: Data
## 
## REML criterion at convergence: 437.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.7489 -0.5503 -0.0252  0.4534  3.6575 
## 
## Random effects:
##  Groups      Name        Variance Std.Dev.
##  Subject:Sex (Intercept) 3.2668   1.8074  
##  Sex         (Intercept) 0.9392   0.9691  
##  Residual                2.0495   1.4316  
## Number of obs: 108, groups:  Subject:Sex, 27; Sex, 2
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 17.70671    1.27852   13.85
## age          0.66019    0.06161   10.72
## SexFemale   -2.32102    1.56785   -1.48
## 
## Correlation of Fixed Effects:
##           (Intr) age   
## age       -0.530       
## SexFemale -0.586  0.000
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables?

The corresponding Z matrix reflects this nesting.

Z3 <- getME(fit.lmer.slope3, "Z")
head(Z3,10)
## 10 x 29 sparse Matrix of class "dgCMatrix"
##   [[ suppressing 29 column names 'M16:Male', 'M05:Male', 'M02:Male' ... ]]
##                                                             
## 1  . . . . . . . . . . . . . . 1 . . . . . . . . . . . . 1 .
## 2  . . . . . . . . . . . . . . 1 . . . . . . . . . . . . 1 .
## 3  . . . . . . . . . . . . . . 1 . . . . . . . . . . . . 1 .
## 4  . . . . . . . . . . . . . . 1 . . . . . . . . . . . . 1 .
## 5  . . 1 . . . . . . . . . . . . . . . . . . . . . . . . 1 .
## 6  . . 1 . . . . . . . . . . . . . . . . . . . . . . . . 1 .
## 7  . . 1 . . . . . . . . . . . . . . . . . . . . . . . . 1 .
## 8  . . 1 . . . . . . . . . . . . . . . . . . . . . . . . 1 .
## 9  . . . . . . 1 . . . . . . . . . . . . . . . . . . . . 1 .
## 10 . . . . . . 1 . . . . . . . . . . . . . . . . . . . . 1 .
dim(Z3)
## [1] 108  29

6.8.17 Remark: Nested vs. crossed random effects

  • Nested random effects occur when one grouping factor appears only within levels of another:

(1 | group1/group2)

  • Crossed random effects occur when observations are independently grouped by multiple factors:

(1 | group1) + (1 | group2)

Understanding how these structures affect the Z matrix and covariance decomposition is essential for correctly specifying and interpreting linear mixed models in practice.
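The difference is visible already in plain indicator matrices. In lme4, (1 | g1/g2) expands to (1 | g1) + (1 | g1:g2), so the nested model carries one column per g1 level plus one per g1:g2 combination. A toy sketch with hypothetical factors g1 and g2:

```r
# Toy grouping factors: g1 with 2 levels, g2 with 2 levels, fully observed.
toy <- expand.grid(g2 = factor(1:2), g1 = factor(c("a", "b")))
# crossed: (1 | g1) + (1 | g2) -> one indicator block per factor
Z_crossed <- cbind(model.matrix(~ g1 - 1, toy),
                   model.matrix(~ g2 - 1, toy))
# nested: (1 | g1/g2) -> blocks for g1 and for the g1:g2 combinations
Z_nested <- cbind(model.matrix(~ g1 - 1, toy),
                  model.matrix(~ interaction(g1, g2, drop = TRUE) - 1, toy))
ncol(Z_crossed)  # 2 + 2 = 4 columns
ncol(Z_nested)   # 2 + 4 = 6 columns
```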

6.9 Linear Mixed Model Covariance Decomposition (nlme)

This chapter is a practical, “under-the-hood” guide to linear mixed models (LMMs) using nlme, with a strong focus on covariance decomposition and how it connects to estimation, inference, and prediction. We will fit models in R, extract the key design matrices, and then reproduce (parts of) the computations by hand to make the machinery transparent.

Throughout, we work with the Orthodont dataset, a classic longitudinal dataset with repeated measurements per subject. The key statistical idea is simple:

  • repeated observations within the same subject are correlated,
  • that correlation can be explained by random effects, correlated residual errors, or both,
  • and once we specify the covariance structure correctly, estimation becomes a generalized least squares (GLS) problem.

6.9.1 Load data

data(Orthodont, package="nlme")
Data <- Orthodont

The Orthodont dataset contains repeated measurements of a distance outcome over time (age). Each subject is measured multiple times, so the data are clustered by Subject. This repeated-measures structure is exactly where mixed models become useful.

A quick reminder of notation we will use in this chapter:

  • \(i\) indexes subjects, \(j\) indexes repeated observations within subject \(i\).
  • \(y_{ij}\) is the response (distance).
  • \(\mathbf{X}\) is the fixed-effect design matrix.
  • \(\mathbf{Z}\) is the random-effect design matrix.
  • \(\boldsymbol\beta\) are fixed-effect parameters.
  • \(\mathbf{u}\) are random effects.
  • \(\boldsymbol\varepsilon\) are residual errors.

6.9.2 Exploratory visualization

Before fitting a model, we want to visually confirm two things:

  1. the overall mean trend of distance over age, and
  2. whether subjects appear to have systematic deviations from that trend.

The first plot stratifies by sex and overlays a linear trend line.

library(ggplot2)
ggplot(Data, aes(y = distance, x = age,  color = Sex)) +
    geom_smooth(method = "lm",se=F) + geom_point() + theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

Interpretation:

  • If the slopes differ between sexes, that suggests an interaction (not modeled here).
  • If points form vertical “bands” by subject (hidden here), that suggests subject-to-subject heterogeneity.

The second plot includes Subject in the aesthetics. Even if the visual is busy, it helps us see whether each subject follows a roughly parallel trajectory shifted up or down. That pattern is the classic signal for a random intercept.

library(ggplot2)
ggplot(Data, aes(y = distance, x = age, fill = Subject, color = Sex)) +
    geom_smooth(method = "lm",se=F) + geom_point() + theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

If subjects show consistent vertical separation across age, a random intercept is a natural first model. If subjects also differ in slope, we would consider random slopes (covered in a separate chapter / section).

6.9.3 Using lme (nlme)

We first fit a random intercept model using nlme::lme. The model is:

\[ y_{ij} = \beta_0 + \beta_1 \, \text{age}_{ij} + \beta_2 \, \text{Sex}_i + b_i + \varepsilon_{ij}, \]

with assumptions

\[ b_i \sim N(0,\sigma_b^2), \quad \varepsilon_{ij} \sim N(0,\sigma^2), \quad b_i \perp \varepsilon_{ij}. \]

This implies a within-subject covariance of compound symmetry form (equal correlation between any two observations from the same subject), because the shared random intercept induces correlation.
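Under compound symmetry, the within-subject correlation equals the intraclass correlation \(\sigma_b^2/(\sigma_b^2+\sigma^2)\). Anticipating the REML standard deviations reported in the summary below, a quick back-of-the-envelope check:

```r
# Intraclass correlation implied by the random intercept model, using the
# standard deviations from the REML fit shown below.
sd_b   <- 1.807425   # random-intercept (between-subject) SD
sd_eps <- 1.431592   # residual SD
sd_b^2 / (sd_b^2 + sd_eps^2)  # within-subject correlation, about 0.61
```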

We fit the model in nlme and also fit the “equivalent” random-intercept model in lme4 (we will use lme4 later to extract \(\mathbf{X}\), \(\mathbf{Z}\), and \(\mathbf{y}\)).

library(nlme)
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:lme4':
## 
##     lmList
## The following object is masked from 'package:matlib':
## 
##     Dim
## The following object is masked from 'package:dplyr':
## 
##     collapse
fit.lmer2 <- lme(distance ~ age + Sex  , random = ~1   | Subject, data = Data) 
summary(fit.lmer2)
## Linear mixed-effects model fit by REML
##   Data: Data 
##        AIC      BIC    logLik
##   447.5125 460.7823 -218.7563
## 
## Random effects:
##  Formula: ~1 | Subject
##         (Intercept) Residual
## StdDev:    1.807425 1.431592
## 
## Fixed effects:  distance ~ age + Sex 
##                 Value Std.Error DF   t-value p-value
## (Intercept) 17.706713 0.8339225 80 21.233044  0.0000
## age          0.660185 0.0616059 80 10.716263  0.0000
## SexFemale   -2.321023 0.7614168 25 -3.048294  0.0054
##  Correlation: 
##           (Intr) age   
## age       -0.813       
## SexFemale -0.372  0.000
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -3.74889609 -0.55034466 -0.02516628  0.45341781  3.65746539 
## 
## Number of Observations: 108
## Number of Groups: 27
library(lme4)
# lme4
fit.lmer <- lmer(distance ~ age + Sex  +(1 | Subject), data = Data) 
# compound symmetry

Practical notes:

  • In this simplest random-intercept setting, lme and lmer should agree closely on fixed-effect estimates and variance components (allowing for minor numerical differences).
  • The big advantage of nlme is the ability to specify residual correlation structures (AR(1), exponential, Gaussian, unstructured, etc.), which we explore later.

6.9.4 Inference for fixed effects

After fitting the model, we usually want to test whether covariates such as age and Sex are associated with the outcome. Mixed models complicate testing because correlation changes both the standard errors and the effective degrees of freedom.

6.9.4.1 t test (Wald-type test)

Here we compute a t-statistic “by hand” from the fixed-effect estimates and their estimated covariance matrix:

\[ t_k = \frac{\hat\beta_k}{\sqrt{\widehat{\mathrm{Var}}(\hat\beta_k)}}. \]

We then compute two-sided p-values using the degrees of freedom stored by nlme.

tstat <- fixef(fit.lmer2)/sqrt(diag(vcov(fit.lmer2)))
pval <- 2*pt(-abs(tstat), df = fit.lmer2$fixDF$X)
Tresult <- data.frame(t = tstat, p = round(pval, 4))
print(Tresult)
##                     t      p
## (Intercept) 21.233044 0.0000
## age         10.716263 0.0000
## SexFemale   -3.048294 0.0054

Why this matters:

  • vcov(fit.lmer2) is not the OLS covariance; it is the GLS-based covariance that accounts for \(\mathbf{V}\), the marginal covariance of \(\mathbf{y}\).
  • The degrees of freedom in nlme reflect an approximate finite-sample adjustment for mixed models, not a simple \(n-p\).

6.9.4.2 F test (ANOVA table)

The anova() method provides an ANOVA-style summary of fixed effects.

anova(fit.lmer2 ) 
##             numDF denDF     F-value   p-value
## (Intercept)     1    80 4123.155691 0.0000000
## age             1    80  114.838287 0.0000000
## Sex             1    25    9.292099 0.0053751

Interpretation:

  • This is typically a Wald-type test presented as an F statistic.
  • It answers whether each term contributes after accounting for the covariance structure.

6.9.4.3 Restricted model test (linear contrast)

Sometimes we want to test a specific linear hypothesis about fixed effects:

\[ H_0: \mathbf{L}\boldsymbol\beta = 0. \]

The L vector specifies a linear combination of coefficients. In this example, L = c(0,1,0) targets the second fixed-effect coefficient (often the age effect, depending on parameter ordering).

anova(fit.lmer2, L = c(0, 1, 0)) 
##   numDF denDF  F-value p-value
## 1     1    80 114.8383       0

This approach generalizes cleanly to:

  • testing multiple coefficients jointly,
  • testing contrasts among factor levels,
  • testing spline basis groups, etc.
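For instance, a joint Wald-type test of the age and Sex coefficients can be sketched by passing a two-row contrast matrix to anova() (reusing fit.lmer2 from above; L2 and its row labels are a hypothetical contrast specification):

```r
# Sketch: joint test of H0: beta_age = 0 and beta_SexFemale = 0,
# using one contrast row per restriction.
L2 <- rbind(age = c(0, 1, 0),
            Sex = c(0, 0, 1))
anova(fit.lmer2, L = L2)
```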

6.9.4.4 Likelihood ratio test (nested models)

We compare:

  • full model: distance ~ age + Sex
  • reduced model: distance ~ Sex

Both have the same random intercept structure, so the models are nested in their fixed effects.

fit.lmer2.2 <- lme(distance ~   Sex  , random = ~1   | Subject, data = Data) 
anova(fit.lmer2, fit.lmer2.2)
## Warning in anova.lme(fit.lmer2, fit.lmer2.2): fitted objects with different
## fixed effects. REML comparisons are not meaningful.
##             Model df      AIC      BIC    logLik   Test  L.Ratio p-value
## fit.lmer2       1  5 447.5125 460.7823 -218.7563
## fit.lmer2.2     2  4 513.8718 524.5255 -252.9359 1 vs 2 68.35926       0

A likelihood ratio test uses:

\[ \mathrm{LR} = 2(\ell_1 - \ell_0), \]

and compares LR to a \(\chi^2\) distribution with degrees of freedom equal to the difference in number of parameters (here, the difference in fixed effects).

And now by hand

ll1 <- logLik(fit.lmer2)
ll0 <- logLik(fit.lmer2.2)

LR <- 2 * (ll1 - ll0)
pval <- pchisq(LR, df = 1, lower.tail = FALSE) # one fixed-effect parameter (age) dropped
LRresult <- data.frame(LR = LR, p = round(pval, 4), row.names = "age")
print(LRresult)
##           LR p
## age 68.35926 0

Practical notes:

  • For testing variance components (e.g., whether a random effect variance is zero), LR tests can be nonstandard due to boundary issues.
  • Here we are testing a fixed effect, so the usual large-sample chi-square reference is commonly used.
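Because the warning above notes that REML likelihoods are not comparable across different fixed-effect specifications, a cleaner version of this test refits both models by maximum likelihood first:

```r
# Refit both models with ML (not REML) before comparing fixed effects;
# this reuses fit.lmer2 and fit.lmer2.2 from above.
fit.ml  <- update(fit.lmer2,   method = "ML")
fit.ml0 <- update(fit.lmer2.2, method = "ML")
anova(fit.ml, fit.ml0)  # LRT now based on comparable likelihoods
```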

6.9.5 Model diagnostics

Even if inference looks “significant,” model assumptions can be wrong. Diagnostic plots help assess:

  • residual patterns (nonlinearity, heteroscedasticity),
  • normality of residuals,
  • influential clusters,
  • random-effects distribution.

library(nlme)
plot(fit.lmer2)

In longitudinal data, a frequent warning sign is residual correlation that remains after adding random effects. If residuals within each subject still show time dependence, we may need an explicit residual correlation structure (AR(1), exponential, etc.), which we will fit later in this chapter.
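As a sketch of what that would look like in nlme, the correlation argument of lme() requests an AR(1) residual correlation within subject:

```r
# Sketch: random intercept plus AR(1) within-subject residual correlation.
library(nlme)
data(Orthodont, package = "nlme")
fit.ar1 <- lme(distance ~ age + Sex, random = ~ 1 | Subject,
               correlation = corAR1(form = ~ 1 | Subject), data = Orthodont)
# extract the estimated AR(1) parameter Phi from the model structure
coef(fit.ar1$modelStruct$corStruct, unconstrained = FALSE)
```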

6.9.6 Get X, y, Z matrices (using lme4)

To connect the model to matrix formulas, we extract the core objects:

  • \(\mathbf{X}\): fixed-effect design matrix
  • \(\mathbf{Z}\): random-effect design matrix
  • \(\mathbf{y}\): response vector

We use lme4 because getME() provides these matrices directly in a standard form.

X <- getME(fit.lmer, "X")
y <- getME(fit.lmer, "y")
Z <- getME(fit.lmer, "Z")

Z2 <- model.matrix(~Subject-1,data=Data)

Interpretation:

  • For a random intercept model, \(\mathbf{Z}\) is essentially a subject indicator matrix (one column per subject).
  • Each row has a single 1 indicating the subject membership.

6.9.6.1 Check Z matrix

We confirm that Z matches a manually created subject dummy matrix.

dummyz <- as.matrix(Z) == Z2
table(dummyz)
## dummyz
## TRUE 
## 2916
dim(Z)
## [1] 108  27

This is a useful sanity check: it ensures that when we later compute \(\mathbf{Z}\mathbf{D}\mathbf{Z}^\top\), we are using the same \(\mathbf{Z}\) that the model implies.
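For the random intercept model, that product has the compound symmetry form described earlier. A toy version (illustrative numbers, 2 subjects with 2 visits each) makes the block structure visible:

```r
# Toy V = sd_b^2 * Z Z' + sd_e^2 * I: equal within-subject covariance,
# zero covariance between subjects.
Ztoy <- kronecker(diag(2), matrix(1, 2, 1))  # subject indicator matrix
sd_b <- 1.807; sd_e <- 1.432                 # illustrative SDs
Vtoy <- sd_b^2 * Ztoy %*% t(Ztoy) + sd_e^2 * diag(4)
round(Vtoy, 2)
```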

6.9.6.2 Check y and X matrices

We also check that columns in \(\mathbf{X}\) match the original variables.

dummyx <- Data[, 2] == X[, 2]
table(dummyx)
## dummyx
## TRUE 
##  108

6.9.7 Fixed effect coefficient

The fixed effects \(\hat{\boldsymbol\beta}\) represent the population-average association between distance and the covariates, after accounting for within-subject correlation.

bhat <- fit.lmer2$coef$fixed
bhat
## (Intercept)         age   SexFemale 
##  17.7067130   0.6601852  -2.3210227
# fixef()

6.9.7.1 Fixed effect confidence intervals

nlme::intervals() provides approximate confidence intervals for fixed effects.

intervals(fit.lmer2)
## Approximate 95% confidence intervals
## 
##  Fixed effects:
##                  lower       est.      upper
## (Intercept) 16.0471544 17.7067130 19.3662716
## age          0.5375855  0.6601852  0.7827849
## SexFemale   -3.8891901 -2.3210227 -0.7528554
## 
##  Random Effects:
##   Level: Subject 
##                    lower     est.    upper
## sd((Intercept)) 1.310357 1.807425 2.493049
## 
##  Within-group standard error:
##    lower     est.    upper 
## 1.226095 1.431592 1.671531

These intervals are typically Wald-type:

\[ \hat\beta_k \pm t_{0.975,\;df}\sqrt{\widehat{\mathrm{Var}}(\hat\beta_k)}. \]

6.9.7.2 Plot fixed effect coefficients and their CI

Visualizing coefficients often communicates results faster than tables, especially in teaching materials.

tab <- cbind(Est = intervals(fit.lmer2)[["fixed"]][, 2],
             LL  = intervals(fit.lmer2)[["fixed"]][, 1],
             UL  = intervals(fit.lmer2)[["fixed"]][, 3])
# fixed-effect estimates with their approximate 95% confidence limits
round(tab, 4)
##                 Est      LL      UL
## (Intercept) 17.7067 16.0472 19.3663
## age          0.6602  0.5376  0.7828
## SexFemale   -2.3210 -3.8892 -0.7529
Coefs <- as.data.frame(tab)
Coefs$Label <- rownames(Coefs)

ggplot(Coefs[-1, ], aes(x = Est, y = Label)) +
  geom_point() +
  geom_errorbarh(aes(xmax = UL, xmin = LL)) +
  theme_bw() +
  geom_vline(xintercept = 0, col = "darkgray", linewidth = 1.2, linetype = 2) +
  theme(axis.title.y = element_blank(),
        axis.text = element_text(size = 10),
        axis.title.x = element_text(size = 12)) +
  labs(x = "Beta coefficients")
## Warning: `geom_errorbarh()` was deprecated in ggplot2 4.0.0.
## ℹ Please use the `orientation` argument of
##   `geom_errorbar()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see
## where this warning was generated.

Interpretation tips: - Intervals crossing 0 suggest the effect may be weak relative to uncertainty. - The vertical line at 0 is a convenient “no effect” reference.

And now by hand

We replicate the fixed-effect CI using the stored degrees of freedom and the covariance matrix.

print(tquants <- qt(0.975, df = fit.lmer2$fixDF$X))
## (Intercept)         age   SexFemale 
##    1.990063    1.990063    2.059539
Low <- fixef(fit.lmer2) - tquants * sqrt(diag(vcov(fit.lmer2)))
Upp <- fixef(fit.lmer2) + tquants * sqrt(diag(vcov(fit.lmer2)))
EstInt <- data.frame(Lower = Low, Estimate = fixef(fit.lmer2), Upper = Upp)
print(EstInt)
##                  Lower   Estimate      Upper
## (Intercept) 16.0471544 17.7067130 19.3662716
## age          0.5375855  0.6601852  0.7827849
## SexFemale   -3.8891901 -2.3210227 -0.7528554

This is an important conceptual checkpoint: it shows that once we know \(\widehat{\mathrm{Var}}(\hat{\boldsymbol\beta})\), confidence intervals reduce to standard linear model logic—the “mixed model” complexity is contained in estimating \(\mathbf{V}\) and thus the correct covariance of \(\hat{\boldsymbol\beta}\).

6.9.8 Random effect coefficients

The random effects are subject-specific intercept deviations (BLUPs / empirical Bayes estimates). These quantify how each subject’s average distance differs from the population mean after adjusting for covariates.

fit.lmer2$coef$random
## $Subject
##     (Intercept)
## M16 -1.70183357
## M05 -1.70183357
## M02 -1.37767479
## M11 -1.16156894
## M07 -1.05351602
## M08 -0.94546309
## M03 -0.62130432
## M12 -0.62130432
## M13 -0.62130432
## M14 -0.08103969
## M09  0.13506616
## M15  0.78338371
## M06  1.21559540
## M04  1.43170125
## M01  2.40417758
## M10  3.91691853
## F10 -3.58539251
## F09 -1.31628108
## F06 -1.31628108
## F01 -1.10017523
## F05 -0.01964599
## F07  0.30451279
## F02  0.30451279
## F08  0.62867156
## F03  0.95283034
## F04  1.92530666
## F11  3.22194176

Interpretation: - Large positive random intercept: subject tends to have larger distance than average. - Large negative random intercept: subject tends to have smaller distance than average. - Shrinkage: subjects with fewer observations are pulled more strongly toward 0.
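Shrinkage can be made concrete. For a random intercept model, the BLUP multiplies each subject's raw mean residual by \(\lambda_i = \sigma_b^2/(\sigma_b^2 + \sigma^2/n_i)\). A minimal sketch plugging in the standard deviations from the intervals() output above (every subject here has four visits):

```r
sigma_b2 <- 1.807425^2  # random intercept variance (sd from intervals() above)
sigma2   <- 1.431592^2  # residual variance
n_i      <- 4           # observations per subject (balanced design)

# Fraction of a subject's raw deviation retained by the BLUP
lambda <- sigma_b2 / (sigma_b2 + sigma2 / n_i)
round(lambda, 4)
# 0.8644
# With a single visit the same subject would be shrunk far more:
round(sigma_b2 / (sigma_b2 + sigma2 / 1), 4)
# 0.6145
```

Because this design is balanced, all 27 subjects are shrunk by the same factor here; with unequal \(n_i\) the subjects with fewer visits would be pulled harder toward 0.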

6.9.9 Fixed effect coefficients covariance

The covariance matrix of fixed effect estimates is central for inference:

\[ \widehat{\mathrm{Cov}}(\hat{\boldsymbol\beta}) = (\mathbf{X}^\top \hat{\mathbf{V}}^{-1}\mathbf{X})^{-1}. \]

We inspect both the matrix and the standard errors.

fit.lmer2$varFix
##             (Intercept)        age  SexFemale
## (Intercept)   0.6954267 -0.0417482 -0.2361967
## age          -0.0417482  0.0037953  0.0000000
## SexFemale    -0.2361967  0.0000000  0.5797556
# vcov(fit.lmer2)
sqrt(diag(vcov(fit.lmer2)))
## (Intercept)         age   SexFemale 
##  0.83392247  0.06160592  0.76141685

6.9.9.1 Fixed effect coefficients correlation

The next line manually computes a correlation between two fixed-effect estimates by dividing an off-diagonal covariance term by the product of standard deviations. (In practice, you would usually compute the full correlation matrix, but this single calculation emphasizes the concept.)

-0.04174818/prod(sqrt( diag(fit.lmer2$varFix)[-3] ))
## [1] -0.8126236

Why care about correlation among estimates? - Strong correlation can make interpretation unstable (e.g., intercept and slope correlation in centered vs uncentered age). - It affects joint inference and multicollinearity diagnostics.
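In practice the full correlation matrix comes from cov2cor(). A sketch that rebuilds varFix from the values printed above (with the fitted object in hand you would simply call cov2cor(fit.lmer2$varFix)):

```r
nm <- c("(Intercept)", "age", "SexFemale")
varFix <- matrix(c( 0.6954267, -0.0417482, -0.2361967,
                   -0.0417482,  0.0037953,  0.0000000,
                   -0.2361967,  0.0000000,  0.5797556),
                 nrow = 3, byrow = TRUE, dimnames = list(nm, nm))

# Convert the covariance matrix of beta-hat into a correlation matrix
round(cov2cor(varFix), 3)
```

The (Intercept)-age entry reproduces the -0.8126 computed manually above; centering age would shrink that correlation toward 0.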

6.9.10 Random effect covariance and correlation

For a random intercept model, the random-effect covariance \(\mathbf{D}\) is a \(1\times1\) matrix (a single variance). In more complex models (random slopes), \(\mathbf{D}\) becomes a full covariance matrix.

Here we extract the random-effect covariance from the internal structures.

pdMatrix(fit.lmer2$modelStruct$reStruct[[1]])*fit.lmer2$sigma**2
##             (Intercept)
## (Intercept)    3.266784
getVarCov(fit.lmer2)  #using default 
## Random effects variance covariance matrix
##             (Intercept)
## (Intercept)      3.2668
##   Standard Deviations: 1.8074
getVarCov(fit.lmer2, individual = "F01", type = "marginal")
## Subject F01 
## Marginal variance covariance matrix
##        1      2      3      4
## 1 5.3162 3.2668 3.2668 3.2668
## 2 3.2668 5.3162 3.2668 3.2668
## 3 3.2668 3.2668 5.3162 3.2668
## 4 3.2668 3.2668 3.2668 5.3162
##   Standard Deviations: 2.3057 2.3057 2.3057 2.3057
getVarCov(fit.lmer2, type = "conditional")
## Subject M01 
## Conditional variance covariance matrix
##        1      2      3      4
## 1 2.0495 0.0000 0.0000 0.0000
## 2 0.0000 2.0495 0.0000 0.0000
## 3 0.0000 0.0000 2.0495 0.0000
## 4 0.0000 0.0000 0.0000 2.0495
##   Standard Deviations: 1.4316 1.4316 1.4316 1.4316

Interpretation: - type="marginal" returns the covariance of \(\mathbf{y}\) within a subject after integrating over random effects. - type="conditional" returns the covariance of \(\mathbf{y}\) given the random effects (often closer to \(\boldsymbol\Sigma\)).

We also look at variance components and compute the proportion of each component.

vc <- as.numeric(as.matrix(VarCorr(fit.lmer2))[,1])
vc/sum(vc)
## [1] 0.6144914 0.3855086
VarCorr(fit.lmer2)
## Subject = pdLogChol(1) 
##             Variance StdDev  
## (Intercept) 3.266784 1.807425
## Residual    2.049456 1.431592

This proportion is a useful descriptive measure: - If random intercept variance dominates, subject-to-subject differences explain most variability. - If residual variance dominates, within-subject noise is large relative to between-subject differences. For a random intercept model, the first proportion (here 0.61) is the intraclass correlation coefficient (ICC): the correlation between any two measurements on the same subject.

6.9.11 Get y covariance directly

A core identity in LMMs is the marginal covariance decomposition:

\[ \mathbf{V} = \mathbf{Z}\mathbf{D}\mathbf{Z}^\top + \boldsymbol\Sigma. \]

In the simplest random intercept model with independent residuals, \(\boldsymbol\Sigma = \sigma^2 \mathbf{I}\), and \(\mathbf{Z}\mathbf{D}\mathbf{Z}^\top\) creates equal covariance between any two observations in the same subject.
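For a single subject with four visits this block is easy to build by hand (a sketch using the variance components estimated above):

```r
sigma_b2 <- 3.266784  # random intercept variance
sigma2   <- 2.049456  # residual variance
n_i      <- 4         # visits per subject

# Z D Z' for one subject is sigma_b2 * J (all-ones matrix);
# adding sigma2 * I yields the compound-symmetric marginal block
V_i <- sigma_b2 * matrix(1, n_i, n_i) + sigma2 * diag(n_i)
round(V_i, 4)
```

The diagonal (5.3162) and off-diagonal (3.2668) entries match the per-subject blocks that extract.lme.cov() returns below.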

We extract \(\mathbf{V}\) directly.

In code terms, the extracted matrix satisfies cov = Z %*% D %*% t(Z) + sigma^2 * I.

require(mgcv)
## Loading required package: mgcv
## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
library(nlme)
 
cov <- extract.lme.cov(fit.lmer2,Data)
head(cov)
## 5.316240 3.266784 3.266784 3.266784 0.000000 0.000000 0.000000 0.000000 ...
## 3.266784 5.316240 3.266784 3.266784 0.000000 0.000000 0.000000 0.000000 ...
## 3.266784 3.266784 5.316240 3.266784 0.000000 0.000000 0.000000 0.000000 ...
## 3.266784 3.266784 3.266784 5.316240 0.000000 0.000000 0.000000 0.000000 ...
## 0.000000 0.000000 0.000000 0.000000 5.316240 3.266784 3.266784 3.266784 ...
## 0.000000 0.000000 0.000000 0.000000 3.266784 5.316240 3.266784 3.266784 ...
(columns 9-108 of these rows are all zero)
dim(cov)
## [1] 108 108

Practical reason to do this: - Once we have \(\mathbf{V}\), we can compute GLS estimates and prediction variances directly using matrix formulas. - This also makes it easy to compare alternative covariance models: only \(\mathbf{V}\) changes.

6.9.12 Residual variance

For the baseline random intercept model, the residual variance is \(\sigma^2\).

fit.lmer2$sigma**2
## [1] 2.049456

6.9.13 Compute fixed effect coefficients (by GLS)

The GLS estimator is:

\[ \hat{\boldsymbol\beta} = (\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{y}. \]

We compute it directly using the extracted covariance matrix.

library(matlib)
inv(t(as.matrix(X))%*%inv(as.matrix(cov))%*%as.matrix(X))%*%t(as.matrix(X))%*%inv(as.matrix(cov))%*%y
## 17.7067076
##  0.6601871
## -2.3210230

This should match fit.lmer2$coef$fixed (up to numerical rounding). This is the central “bridge” between mixed models and GLS: mixed models reduce to GLS once \(\mathbf{V}\) is known.

6.9.14 Compute covariance of fixed effect coefficients

As noted earlier:

\[ \widehat{\mathrm{Cov}}(\hat{\boldsymbol\beta}) = (\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X})^{-1}. \]

inv(t(as.matrix(X))%*%inv(as.matrix(cov))%*%as.matrix(X))
##  0.6954267 -0.0417482 -0.2361967
## -0.0417482  0.0037953  0.0000000
## -0.2361967  0.0000000  0.5797556

6.9.15 Compute random effect coefficients (BLUP)

The BLUP / empirical Bayes estimate satisfies:

\[ \hat{\mathbf{u}} = \mathbf{D}\mathbf{Z}^\top \mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}). \]

For a random intercept model, \(\mathbf{D}\) is just the scalar \(\sigma_b^2\). In code, Lambda_new constructs a diagonal matrix with that variance repeated for each subject (one random intercept per subject).

Lambda_new <-as.numeric(VarCorr(fit.lmer2)[1])*diag(length(levels(Data$Subject)))
# head(Lambda_new)

Now we compute \(\hat{\mathbf{u}}\) and compare with the model’s own random-effect estimates.

uhat <- fit.lmer2$coef$random 
bhat <- fit.lmer2$coef$fixed  # fixed-effect estimates, needed for (y - X bhat) below
comput_uhat <- (as.matrix(Lambda_new))%*%t(Z)%*%inv(as.matrix(cov))%*%(y-as.matrix(X)%*%(bhat))
cbind((comput_uhat@x),(uhat[["Subject"]]))
##     (Intercept)
## M16 -1.7018337 -1.7018336
## M05 -1.7018337 -1.7018336
## M02 -1.3776749 -1.3776748
## M11 -1.1615690 -1.1615689
## M07 -1.0535161 -1.0535160
## M08 -0.9454632 -0.9454631
## M03 -0.6213044 -0.6213043
## M12 -0.6213044 -0.6213043
## M13 -0.6213044 -0.6213043
## M14 -0.0810397 -0.0810397
## M09  0.1350662  0.1350662
## M15  0.7833838  0.7833837
## M06  1.2155955  1.2155954
## M04  1.4317013  1.4317013
## M01  2.4041777  2.4041776
## M10  3.9169188  3.9169185
## F10 -3.5853927 -3.5853925
## F09 -1.3162812 -1.3162811
## F06 -1.3162812 -1.3162811
## F01 -1.1001753 -1.1001752
## F05 -0.0196460 -0.0196460
## F07  0.3045128  0.3045128
## F02  0.3045128  0.3045128
## F08  0.6286716  0.6286716
## F03  0.9528304  0.9528303
## F04  1.9253068  1.9253067
## F11  3.2219420  3.2219418

Why this is important: - It demonstrates that random effects are not “free parameters” estimated independently. - They are predictions, borrowing strength from the full dataset via \(\mathbf{V}^{-1}\). - Shrinkage happens naturally through this formula.

6.9.16 Compute predicted values

In mixed models, the word “prediction” can mean different things:

  1. conditional (subject-specific) prediction: includes random effects
  2. marginal (population-average) prediction: excludes random effects

Both are valid, but they answer different scientific questions.

In most applied analyses: - “What is the expected trajectory for this subject?” → conditional - “What is the average trajectory in the population?” → marginal

We now formalize both views.

6.9.16.1 The conditional distribution

Conditioning on random effects:

\[ \mathbf{y} \mid \mathbf{b} \sim \mathrm{N}\left(\mathbf{X}\boldsymbol{\beta}+\mathbf{Z}\mathbf{b}, \boldsymbol{\Sigma}\right). \]

This implies that the conditional mean depends on both fixed and random components.

6.9.16.1.1 Conditional mean prediction

For a new observation with design rows \(\mathbf{X}_0\) and \(\mathbf{Z}_0\):

\[ E(\hat{Y}_0) = \mathbf{X}_0\hat{\boldsymbol\beta} + \mathbf{Z}_0\hat{\mathbf{u}}. \]

The uncertainty in the conditional mean prediction comes from uncertainty in estimating \(\boldsymbol\beta\) and \(\mathbf{u}\) (and possibly from uncertainty in variance components, which we typically ignore in simple plug-in formulas).

The derivation expands the variance of a sum:

\[ \mathrm{Var}(X+Y)=\mathrm{Var}(X)+\mathrm{Var}(Y)+2\mathrm{Cov}(X,Y), \]

so if we treat the fixed and random predictors as independent (a simplifying approximation), the covariance term is ignored and the prediction variance becomes “fixed-part variance + random-part variance.”

That leads to the practical decomposition:

\[ \mathrm{Var}(\hat{Y}_0) \approx \mathbf{X}_0 \,\widehat{\mathrm{Cov}}(\hat{\boldsymbol\beta})\, \mathbf{X}_0^\top + \mathbf{Z}_0 \,\widehat{\mathrm{Cov}}(\hat{\mathbf{u}})\, \mathbf{Z}_0^\top. \]

6.9.16.1.2 Conditional individual prediction

If we want a prediction interval for an actual future \(Y_0\) (not just its mean), we must add residual variance:

\[ \mathrm{Var}(Y_0 - \hat{Y}_0)\approx \mathrm{Var}(\hat{Y}_0) + \sigma^2. \]

This is exactly the distinction in the code below between SE_rand (mean prediction) and SE_rand2 (individual prediction).

6.9.16.2 Lower level (conditional prediction) in R

Here we compute subject-specific fitted values explicitly:

\[ \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol\beta} + \mathbf{Z}\hat{\mathbf{u}}. \]

# how to calculate predicted values
yhat <- X%*%(fit.lmer2$coef$fixed)+Z%*% as.numeric ( uhat[["Subject"]])

head(cbind (yhat,predict(fit.lmer2),y))  #create individual trajectory curve   
## 6 x 3 Matrix of class "dgeMatrix"
##                        y
## 1 25.39237 25.39237 26.0
## 2 26.71274 26.71274 25.0
## 3 28.03311 28.03311 29.0
## 4 29.35348 29.35348 31.0
## 5 21.61052 21.61052 21.5
## 6 22.93089 22.93089 22.5

The comparison shows that the manual computation equals predict(fit.lmer2) at the default prediction level (subject-specific).

Next we compute an approximate standard error of conditional predictions by adding two components: - fixed-effect contribution: \(\mathbf{X}\,\widehat{\mathrm{Cov}}(\hat{\boldsymbol\beta})\,\mathbf{X}^\top\) - random-effect contribution: \(\mathbf{Z}\,\widehat{\mathrm{Cov}}(\hat{\mathbf{u}})\,\mathbf{Z}^\top\) (approximated here using the random intercept variance)

#compute standard error for conditional (subject-specific) predictions
predvar_rand <- diag(X %*% fit.lmer2$varFix %*% t(X)) + diag(Z %*%  diag(getVarCov(fit.lmer2)[1]  ,27) %*% t(Z))

SE_rand <- sqrt (predvar_rand)  #mean prediction 
SE_rand2 <- sqrt(predvar_rand+fit.lmer2$sigma^2)  #individual prediction 
head(SE_rand,20)
##        1        2        3        4        5        6        7        8 
## 1.880728 1.872639 1.872639 1.880728 1.880728 1.872639 1.872639 1.880728 
##        9       10       11       12       13       14       15       16 
## 1.880728 1.872639 1.872639 1.880728 1.880728 1.872639 1.872639 1.880728 
##       17       18       19       20 
## 1.880728 1.872639 1.872639 1.880728
head(SE_rand2,20)
##        1        2        3        4        5        6        7        8 
## 2.363598 2.357166 2.357166 2.363598 2.363598 2.357166 2.357166 2.363598 
##        9       10       11       12       13       14       15       16 
## 2.363598 2.357166 2.357166 2.363598 2.363598 2.357166 2.357166 2.363598 
##       17       18       19       20 
## 2.363598 2.357166 2.357166 2.363598

A conceptual caution: - In full generality, \(\widehat{\mathrm{Cov}}(\hat{\mathbf{u}})\) is not simply \(\sigma_b^2 I\); it depends on \(\mathbf{D}\) and \(\mathbf{V}^{-1}\). - The expression above is a reasonable pedagogical approximation that highlights the idea of adding fixed and random contributions.

6.9.17 The marginal distribution

Integrating out random effects yields:

\[ \mathbf{y} \sim \mathrm{N}\left(\mathbf{X}\boldsymbol{\beta}, \mathbf{V}\right), \]

where

\[ \mathrm{Cov}(\mathbf{y})=\mathbf{V}=\mathbf{Z}\mathbf{D}\mathbf{Z}^\top+\boldsymbol{\Sigma}. \]

Here \(\mathbf{Z}\mathbf{D}\mathbf{Z}^\top\) is the covariance induced by random effects, and \(\boldsymbol{\Sigma}\) is the residual covariance (often \(\sigma^2 I\) unless we specify a correlation structure).

6.9.17.1 Marginal mean prediction

The marginal mean depends only on fixed effects:

\[ E(\hat{Y}_0)=\mathbf{X}_0\hat{\boldsymbol\beta}. \]

The variance of the marginal mean prediction uses only \(\widehat{\mathrm{Cov}}(\hat{\boldsymbol\beta})\):

\[ \mathrm{Var}(\hat{Y}_0)=\mathbf{X}_0 \,\widehat{\mathrm{Cov}}(\hat{\boldsymbol\beta})\, \mathbf{X}_0^\top. \]

6.9.17.2 Marginal individual prediction

For predicting a new observation at the population level, we add residual variance:

\[ \mathrm{Var}(Y_0)\approx \mathrm{Var}(\hat{Y}_0)+\sigma^2. \]

6.9.17.3 Higher level (marginal) in R

The code below computes the marginal prediction SE using only the fixed-effect variance component:

#compute standard error for marginal predictions
predvar <- diag(X %*% fit.lmer2$varFix %*% t(X))

SE <- sqrt (predvar)  #mean prediction 
SE2 <- sqrt(predvar+fit.lmer2$sigma^2)  #individual prediction
head(SE,10)
##         1         2         3         4         5         6         7         8 
## 0.5199561 0.4898898 0.4898898 0.5199561 0.5199561 0.4898898 0.4898898 0.5199561 
##         9        10 
## 0.5199561 0.4898898
head(SE2,10)
##        1        2        3        4        5        6        7        8 
## 1.523092 1.513092 1.513092 1.523092 1.523092 1.513092 1.513092 1.523092 
##        9       10 
## 1.523092 1.513092

Then it forms approximate 95% bounds:

  • mean prediction band: \(\hat{\mu} \pm 1.96\,SE\)
  • individual prediction band: \(\hat{\mu} \pm 1.96\,SE2\)

up=predict(fit.lmer2, newdata=Data, level=0) +1.96 *SE  #mean prediction 
up2=predict(fit.lmer2, newdata=Data, level=0) +1.96 *SE2  #individual prediction 
head(up)
##        1        2        3        4        5        6 
## 24.00731 25.26875 26.58912 27.96842 24.00731 25.26875
head(up2)
##        1        2        3        4        5        6 
## 25.97346 27.27423 28.59460 29.93457 25.97346 27.27423

Note the crucial use of level=0: - level=0 gives marginal predictions (fixed effects only) - level=1 (or default) typically includes random effects for subject-specific predictions
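The two levels can be compared side by side. A self-contained sketch, assuming Data is nlme's built-in Orthodont growth data (an assumption, though the subject labels M01-M16/F01-F11 and the estimates above are consistent with it):

```r
library(nlme)

# Assumption: Data is the Orthodont dataset shipped with nlme
Data <- Orthodont
fit  <- lme(distance ~ age + Sex, random = ~1 | Subject, data = Data)

marg <- predict(fit, level = 0)  # fixed effects only (population average)
cond <- predict(fit, level = 1)  # adds each subject's random intercept

# The gap between the two prediction levels is exactly the subject's BLUP
head(cbind(marginal = marg, conditional = cond, gap = cond - marg))
```

For marginal predictions at new covariate values, pass newdata together with level = 0, exactly as in the code above.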

Finally, ggeffects::ggpredict() provides a convenient interface for marginal-effects plots.

library(tidyverse)
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.4.3
ggpredict(fit.lmer2,terms=c("age" ))
##    x predicted std.error conf.low conf.high group
##    8  22.98819 0.5199561 21.96910  24.00729     1
##   10  24.30856 0.4898898 23.34840  25.26873     1
##   12  25.62894 0.4898898 24.66877  26.58910     1
##   14  26.94931 0.5199561 25.93021  27.96840     1
# ggpredict(fit.lmer2,terms=c("age" ))  %>% plot(rawdata = T, dot.alpha = 0.2)
# ggpredict(fit.lmer2,"age",   type = "re" )  %>% plot(rawdata = T, dot.alpha = 0.2)

library(sjPlot)
plot_model(fit.lmer2, type = "pred", terms = "age", show.data = TRUE, dot.alpha = 0.2)

The hand-computed standard errors serve as a useful checkpoint: the analytic SE from \(\mathbf{X}\widehat{\mathrm{Cov}}(\hat{\boldsymbol\beta})\mathbf{X}^\top\) (0.5200 and 0.4899 at the observed ages) matches the std.error column reported by ggpredict() at the same design points.

6.9.18 Extending the covariance structure: correlated residual errors

So far, our baseline model assumed independent residuals:

\[ \boldsymbol{\Sigma}=\sigma^2\mathbf{I}. \]

But longitudinal data often have residual correlations over time even after accounting for random effects. nlme allows us to specify parametric correlation structures for residuals within subjects.

A key conceptual point:

  • random effects model between-subject heterogeneity (e.g., different baseline levels),
  • residual correlation structures model within-subject temporal dependence beyond random effects.

6.9.19 Using lme with Gaussian correlation

A Gaussian correlation structure implies residual correlation decays with distance in time (here, age). With a nugget effect, we allow a discontinuity at distance 0 (measurement error).

library(nlme)

fit.lmer.gaus <- lme(distance ~ age + Sex  , random = ~1 | Subject, correlation=corGaus(form= ~ age|Subject, nugget=TRUE), data = Data) 
summary(fit.lmer.gaus)
## Linear mixed-effects model fit by REML
##   Data: Data 
##        AIC      BIC    logLik
##   451.4311 470.0089 -218.7156
## 
## Random effects:
##  Formula: ~1 | Subject
##         (Intercept) Residual
## StdDev:    1.795504 1.446192
## 
## Correlation Structure: Gaussian spatial correlation
##  Formula: ~age | Subject 
##  Parameter estimate(s):
##     range    nugget 
## 1.1324270 0.1039536 
## Fixed effects:  distance ~ age + Sex 
##                 Value Std.Error DF   t-value p-value
## (Intercept) 17.717033 0.8448813 80 20.969848  0.0000
## age          0.659620 0.0628344 80 10.497756  0.0000
## SexFemale   -2.325355 0.7612563 25 -3.054629  0.0053
##  Correlation: 
##           (Intr) age   
## age       -0.818       
## SexFemale -0.367  0.000
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -3.70618679 -0.54458678 -0.01435195  0.45606804  3.62915585 
## 
## Number of Observations: 108
## Number of Groups: 27

Variogram plots visualize how empirical semi-variance changes with time separation. In longitudinal settings, they act like a diagnostic for residual correlation: if semi-variance increases with lag, that indicates correlation decays as time separation grows.

print(plot(Variogram(fit.lmer.gaus, form =~as.numeric(age)|Subject,
data = Data)))

We can again extract \(\mathbf{V}\) (now incorporating the Gaussian residual correlation) and verify that GLS formulas still apply. The only difference is that \(\mathbf{V}\) is no longer compound symmetric.

require(mgcv)
library(nlme)
 
cov.gaus <- extract.lme.cov(fit.lmer.gaus,Data)
head(cov.gaus)
## 5.315307 3.306657 3.223843 3.223836 0.000000 0.000000 0.000000 0.000000 ...
## 3.306657 5.315307 3.306657 3.223843 0.000000 0.000000 0.000000 0.000000 ...
## 3.223843 3.306657 5.315307 3.306657 0.000000 0.000000 0.000000 0.000000 ...
## 3.223836 3.223843 3.306657 5.315307 0.000000 0.000000 0.000000 0.000000 ...
## 0.000000 0.000000 0.000000 0.000000 5.315307 3.306657 3.223843 3.223836 ...
## 0.000000 0.000000 0.000000 0.000000 3.306657 5.315307 3.306657 3.223843 ...
(columns 9-108 of these rows are all zero)
dim(cov.gaus)
## [1] 108 108
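As a numerical check, the within-subject covariance at lag 2 (ages 8 and 10) can be reproduced from the summary estimates. With a nugget, corGaus uses the residual correlation \((1-\text{nugget})\exp\{-(d/\text{range})^2\}\) for \(d>0\) (a sketch plugging in the printed values):

```r
sigma_b2 <- 1.795504^2  # random intercept variance (StdDev from summary)
sigma2   <- 1.446192^2  # residual variance
rng      <- 1.1324270   # corGaus range estimate
nugget   <- 0.1039536   # corGaus nugget estimate

d <- 2  # time lag between ages 8 and 10
rho_d <- (1 - nugget) * exp(-(d / rng)^2)
sigma_b2 + sigma2 * rho_d  # should reproduce cov.gaus[1, 2] = 3.306657
```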

Compute fixed effects using the Gaussian covariance:

\[ \hat{\boldsymbol\beta}_{GLS} = (\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{y}. \]

library(matlib)
inv(t(as.matrix(X))%*%inv(as.matrix(cov.gaus))%*%as.matrix(X))%*%t(as.matrix(X))%*%inv(as.matrix(cov.gaus))%*%y
17.7170272
0.6596201
-2.3253552
fit.lmer.gaus$coef$fixed
## (Intercept)         age   SexFemale 
##   17.717033    0.659620   -2.325355

This equivalence is a powerful principle: mixed models are GLS with a structured covariance matrix.

6.9.20 Using lme with autoregressive (AR1)

An AR(1) residual structure assumes correlation decays geometrically with time lag:

\[ \mathrm{Cor}(\varepsilon_{ij},\varepsilon_{ik})=\rho^{|t_{ij}-t_{ik}|}, \]

under equally spaced times; with irregular spacing, nlme uses an appropriate generalization based on the numeric time covariate.
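For equally spaced occasions the implied correlation matrix is easy to write down (a sketch with a hypothetical \(\rho = 0.6\); the fit below estimates its own Phi from the data):

```r
rho <- 0.6   # hypothetical autocorrelation parameter
occ <- 1:4   # four equally spaced measurement occasions

# Cor(e_ij, e_ik) = rho^|j - k|
R <- rho^abs(outer(occ, occ, "-"))
round(R, 3)
```

Note that in the summary below Phi1 is estimated as 0, so the fitted AR(1) structure collapses to independent residuals; that is why cov.AR1 extracted afterwards is identical to the compound-symmetric V of the baseline model.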

fit.lmer.AR1 <- lme(distance ~ age + Sex  , random = ~1 | Subject, correlation=corAR1(form= ~ as.numeric(age)|Subject ), data = Data) 
summary(fit.lmer.AR1)
## Linear mixed-effects model fit by REML
##   Data: Data 
##        AIC      BIC    logLik
##   449.5125 465.4363 -218.7563
## 
## Random effects:
##  Formula: ~1 | Subject
##         (Intercept) Residual
## StdDev:    1.807425 1.431592
## 
## Correlation Structure: ARMA(1,0)
##  Formula: ~as.numeric(age) | Subject 
##  Parameter estimate(s):
## Phi1 
##    0 
## Fixed effects:  distance ~ age + Sex 
##                 Value Std.Error DF   t-value p-value
## (Intercept) 17.706713 0.8339225 80 21.233044  0.0000
## age          0.660185 0.0616059 80 10.716263  0.0000
## SexFemale   -2.321023 0.7614168 25 -3.048294  0.0054
##  Correlation: 
##           (Intr) age   
## age       -0.813       
## SexFemale -0.372  0.000
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -3.74889609 -0.55034466 -0.02516628  0.45341781  3.65746539 
## 
## Number of Observations: 108
## Number of Groups: 27

Variogram diagnostic:

print(plot(Variogram(fit.lmer.AR1, form =~as.numeric(age)|Subject,
data = Data)))

Extract covariance:

require(mgcv)
library(nlme)
 
cov.AR1 <- extract.lme.cov(fit.lmer.AR1,Data)
head(cov.AR1)
## 5.316240 3.266784 3.266784 3.266784 0.000000 0.000000 0.000000 0.000000 ...
## 3.266784 5.316240 3.266784 3.266784 0.000000 0.000000 0.000000 0.000000 ...
## 3.266784 3.266784 5.316240 3.266784 0.000000 0.000000 0.000000 0.000000 ...
## 3.266784 3.266784 3.266784 5.316240 0.000000 0.000000 0.000000 0.000000 ...
## 0.000000 0.000000 0.000000 0.000000 5.316240 3.266784 3.266784 3.266784 ...
## 0.000000 0.000000 0.000000 0.000000 3.266784 5.316240 3.266784 3.266784 ...
(columns 9-108 of these rows are all zero)
dim(cov.AR1)
## [1] 108 108

6.9.21 Using lme with exponential correlation

Exponential correlation decays continuously with lag:

\[ \mathrm{Cor}(\varepsilon_{ij},\varepsilon_{ik})=\exp(-\phi |t_{ij}-t_{ik}|). \]

This is often used in spatial and longitudinal settings.

fit.lmer.Exp <- lme(distance ~ age + Sex  , random = ~1 | Subject, correlation=corExp(form= ~ as.numeric(age)|Subject ), data = Data) 
summary(fit.lmer.Exp)
## Linear mixed-effects model fit by REML
##   Data: Data 
##        AIC      BIC    logLik
##   449.3968 465.3206 -218.6984
## 
## Random effects:
##  Formula: ~1 | Subject
##         (Intercept) Residual
## StdDev:    1.788899 1.454494
## 
## Correlation Structure: Exponential spatial correlation
##  Formula: ~as.numeric(age) | Subject 
##  Parameter estimate(s):
##     range 
## 0.7045117 
## Fixed effects:  distance ~ age + Sex 
##                 Value Std.Error DF   t-value p-value
## (Intercept) 17.721416 0.8500193 80 20.848250  0.0000
## age          0.659405 0.0634074 80 10.399499  0.0000
## SexFemale   -2.327485 0.7611852 25 -3.057711  0.0053
##  Correlation: 
##           (Intr) age   
## age       -0.821       
## SexFemale -0.365  0.000
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -3.683026666 -0.540915318 -0.008097445  0.461167542  3.612579065 
## 
## Number of Observations: 108
## Number of Groups: 27

Variogram:

print(plot(Variogram(fit.lmer.Exp, form =~as.numeric(age)|Subject,
data = Data)))

Extract covariance:

require(mgcv)
library(nlme)
 
cov.Exp <- extract.lme.cov(fit.lmer.Exp,Data)
head(cov.Exp)
## 5.315710 3.323904 3.207397 3.200582 0.000000 0.000000 0.000000 0.000000 ...
## 3.323904 5.315710 3.323904 3.207397 0.000000 0.000000 0.000000 0.000000 ...
## 3.207397 3.323904 5.315710 3.323904 0.000000 0.000000 0.000000 0.000000 ...
## 3.200582 3.207397 3.323904 5.315710 0.000000 0.000000 0.000000 0.000000 ...
## 0.000000 0.000000 0.000000 0.000000 5.315710 3.323904 3.207397 3.200582 ...
## 0.000000 0.000000 0.000000 0.000000 3.323904 5.315710 3.323904 3.207397 ...
(columns 9-108 of these rows are all zero)
dim(cov.Exp)
## [1] 108 108
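The same lag-2 check works here; corExp (without a nugget) uses the correlation \(\exp(-d/\text{range})\), i.e. \(\phi = 1/\text{range}\) in the notation above (a sketch plugging in the summary estimates):

```r
sigma_b2 <- 1.788899^2  # random intercept variance (StdDev from summary)
sigma2   <- 1.454494^2  # residual variance
rng      <- 0.7045117   # corExp range estimate

d <- 2  # time lag between ages 8 and 10
sigma_b2 + sigma2 * exp(-d / rng)  # should reproduce cov.Exp[1, 2] = 3.323904
```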

6.9.22 Using lme with unstructured residual correlation (corSymm)

An unstructured residual correlation places minimal constraints on within-subject correlations, allowing each pairwise correlation to be estimated (subject to positive definiteness). Because this can be unstable, nlme often pairs it with heteroscedasticity modeling such as varIdent.

Conceptually, this approach says: - “I do not want to impose a parametric decay pattern.” - “Let the data estimate the correlation pattern.”

fit.lmer.corSymm <- lme(distance ~ age + Sex  , random = ~1 |  (Subject), correlation=corSymm(form= ~ 1| (Subject) ),
                     weights = varIdent(form=~1|age),
                     data = Data)
summary(fit.lmer.corSymm)
## Linear mixed-effects model fit by REML
##   Data: Data 
##        AIC    BIC    logLik
##   456.6945 493.85 -214.3473
## 
## Random effects:
##  Formula: ~1 | Subject
##         (Intercept) Residual
## StdDev:    1.783141 1.481552
## 
## Correlation Structure: General
##  Formula: ~1 | (Subject) 
##  Parameter estimate(s):
##  Correlation: 
##   1      2      3     
## 2 -0.260              
## 3  0.238 -0.149       
## 4 -0.251 -0.007  0.426
## Variance function:
##  Structure: Different standard deviations per stratum
##  Formula: ~1 | age 
##  Parameter estimates:
##        8       10       12       14 
## 1.000000 0.686841 1.199078 1.000414 
## Fixed effects:  distance ~ age + Sex 
##                 Value Std.Error DF   t-value p-value
## (Intercept) 17.417591 0.8657147 80 20.119319  0.0000
## age          0.674651 0.0702289 80  9.606462  0.0000
## SexFemale   -2.045167 0.7361373 25 -2.778241  0.0102
##  Correlation: 
##           (Intr) age   
## age       -0.840       
## SexFemale -0.346  0.000
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -3.33886326 -0.56222386  0.01374189  0.52235946  3.99060056 
## 
## Number of Observations: 108
## Number of Groups: 27

Variogram:

print(plot(Variogram(fit.lmer.corSymm, form =~as.numeric(age)|Subject,
data = Data)))

Extract covariance:

require(mgcv)
library(nlme)
 
cov.corSymm <- extract.lme.cov(fit.lmer.corSymm,Data)
head(cov.corSymm)
5.374589 2.786932 3.807032 2.628359 0.000000 0.000000 0.000000 0.000000 ...
2.786932 4.215083 2.909655 3.168382 0.000000 0.000000 0.000000 0.000000 ...
3.807032 2.909655 6.335531 4.301450 0.000000 0.000000 0.000000 0.000000 ...
2.628359 3.168382 4.301450 5.376406 0.000000 0.000000 0.000000 0.000000 ...
0.000000 0.000000 0.000000 0.000000 5.374589 2.786932 3.807032 2.628359 ...
0.000000 0.000000 0.000000 0.000000 2.786932 4.215083 2.909655 3.168382 ...
(columns 9–108 are zero in these rows and are omitted)
dim(cov.corSymm)
## [1] 108 108

6.9.23 Using lme with compound symmetry residual correlation (corCompSymm)

Compound symmetry as a residual correlation structure assumes constant correlation within a subject:

\[ \mathrm{Cor}(\varepsilon_{ij},\varepsilon_{ik})=\rho \quad (j\neq k). \]

This resembles what a random intercept induces, but note the difference:

  • Random intercept creates compound symmetry through \(\mathbf{Z}\mathbf{D}\mathbf{Z}^\top\)
  • corCompSymm creates compound symmetry through \(\boldsymbol\Sigma\)

In practice, both can produce similar marginal covariance patterns, but they represent different modeling stories.
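The random-intercept connection can be made explicit. With \(b_i \sim N(0,\sigma_b^2)\) and independent residuals \(\varepsilon_{ij}\sim N(0,\sigma^2)\), the marginal moments are

\[ \mathrm{Var}(Y_{ij})=\sigma_b^2+\sigma^2, \qquad \mathrm{Cov}(Y_{ij},Y_{ik})=\sigma_b^2, \qquad \rho=\frac{\sigma_b^2}{\sigma_b^2+\sigma^2} \quad (j\neq k), \]

which is compound symmetry with a necessarily nonnegative \(\rho\). corCompSymm, by contrast, parameterizes \(\rho\) directly and can also accommodate mildly negative within-subject correlation.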

fit.lmer.Symm <- lme(distance ~ age + Sex, random = ~1 | (Subject),
                     correlation = corCompSymm(form = ~as.numeric(age) | (Subject)),
                     data = Data)
summary(fit.lmer.Symm)
## Linear mixed-effects model fit by REML
##   Data: Data 
##        AIC      BIC    logLik
##   449.5125 465.4363 -218.7563
## 
## Random effects:
##  Formula: ~1 | Subject
##         (Intercept) Residual
## StdDev:    1.807425 1.431592
## 
## Correlation Structure: Compound symmetry
##  Formula: ~as.numeric(age) | (Subject) 
##  Parameter estimate(s):
## Rho 
##   0 
## Fixed effects:  distance ~ age + Sex 
##                 Value Std.Error DF   t-value p-value
## (Intercept) 17.706713 0.8339225 80 21.233044  0.0000
## age          0.660185 0.0616059 80 10.716263  0.0000
## SexFemale   -2.321023 0.7614168 25 -3.048294  0.0054
##  Correlation: 
##           (Intr) age   
## age       -0.813       
## SexFemale -0.372  0.000
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -3.74889609 -0.55034466 -0.02516628  0.45341781  3.65746539 
## 
## Number of Observations: 108
## Number of Groups: 27

Variogram:

print(plot(Variogram(fit.lmer.Symm, form =~as.numeric(age)|Subject,
data = Data)))

Extract covariance:

require(mgcv)
library(nlme)
 
cov.Symm <- extract.lme.cov(fit.lmer.Symm,Data)
head(cov.Symm)
5.316240 3.266784 3.266784 3.266784 0.000000 0.000000 0.000000 0.000000 ...
3.266784 5.316240 3.266784 3.266784 0.000000 0.000000 0.000000 0.000000 ...
3.266784 3.266784 5.316240 3.266784 0.000000 0.000000 0.000000 0.000000 ...
3.266784 3.266784 3.266784 5.316240 0.000000 0.000000 0.000000 0.000000 ...
0.000000 0.000000 0.000000 0.000000 5.316240 3.266784 3.266784 3.266784 ...
0.000000 0.000000 0.000000 0.000000 3.266784 5.316240 3.266784 3.266784 ...
(columns 9–108 are zero in these rows and are omitted)
dim(cov.Symm)
## [1] 108 108

6.9.24 GLS models: correlation in residuals without random effects

The gls() function fits models where all correlation is placed in the residuals. This can be useful when:

  • you do not need subject-level random effects, or
  • you want a pure covariance-modeling approach.
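In model terms, gls() drops the random-effects term entirely and fits the marginal model

\[ \mathbf{Y}_i=\mathbf{X}_i\boldsymbol\beta+\boldsymbol\varepsilon_i, \qquad \boldsymbol\varepsilon_i\sim N(\mathbf{0},\mathbf{V}_i), \]

where \(\mathbf{V}_i\) is built directly from the chosen correlation structure (and any variance function), rather than partly from \(\mathbf{Z}\mathbf{D}\mathbf{Z}^\top\).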

6.9.25 GLS with unstructured correlation

gls.corsymm <- gls(distance ~ Sex + age, data = Orthodont,
                   correlation = corSymm(form = ~ 1 | Subject),
                   weights = varIdent(form = ~ 1 | age))
summary(gls.corsymm)
## Generalized least squares fit by REML
##   Model: distance ~ Sex + age 
##   Data: Orthodont 
##        AIC     BIC    logLik
##   454.6945 489.196 -214.3473
## 
## Correlation Structure: General
##  Formula: ~1 | Subject 
##  Parameter estimate(s):
##  Correlation: 
##   1     2     3    
## 2 0.586            
## 3 0.652 0.563      
## 4 0.489 0.666 0.737
## Variance function:
##  Structure: Different standard deviations per stratum
##  Formula: ~1 | age 
##  Parameter estimates:
##         8        10        12        14 
## 1.0000000 0.8855866 1.0857309 1.0001682 
## 
## Coefficients:
##                 Value Std.Error   t-value p-value
## (Intercept) 17.417595 0.8657121 20.119383  0.0000
## SexFemale   -2.045167 0.7361425 -2.778222  0.0065
## age          0.674651 0.0702284  9.606521  0.0000
## 
##  Correlation: 
##           (Intr) SexFml
## SexFemale -0.346       
## age       -0.840  0.000
## 
## Standardized residuals:
##          Min           Q1          Med           Q3          Max 
## -2.508191537 -0.584408334  0.003646016  0.531073432  2.179748120 
## 
## Residual standard error: 2.318325 
## Degrees of freedom: 108 total; 105 residual

6.9.26 GLS with compound symmetry

gls.comsys <- gls(distance ~ Sex + age, data = Orthodont,
                  correlation = corCompSymm(form = ~ 1 | Subject))
summary(gls.comsys)
## Generalized least squares fit by REML
##   Model: distance ~ Sex + age 
##   Data: Orthodont 
##        AIC      BIC    logLik
##   447.5125 460.7823 -218.7563
## 
## Correlation Structure: Compound symmetry
##  Formula: ~1 | Subject 
##  Parameter estimate(s):
##       Rho 
## 0.6144914 
## 
## Coefficients:
##                 Value Std.Error   t-value p-value
## (Intercept) 17.706713 0.8339225 21.233044  0.0000
## SexFemale   -2.321023 0.7614169 -3.048294  0.0029
## age          0.660185 0.0616059 10.716263  0.0000
## 
##  Correlation: 
##           (Intr) SexFml
## SexFemale -0.372       
## age       -0.813  0.000
## 
## Standardized residuals:
##         Min          Q1         Med          Q3         Max 
## -2.59712955 -0.64544226 -0.02540005  0.51680604  2.32947531 
## 
## Residual standard error: 2.305697 
## Degrees of freedom: 108 total; 105 residual
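Because compound symmetry is a special case of the unstructured correlation, the two GLS fits can be compared directly; a minimal sketch, assuming both fitted objects from the code above are in the workspace:

```r
# Likelihood-ratio / information-criterion comparison of the two REML fits.
# Valid here because both models share the same fixed effects and the
# compound-symmetry structure is nested within the unstructured one.
anova(gls.comsys, gls.corsymm)
```

The unstructured model will always have a higher log-likelihood; the question anova() helps answer is whether its extra parameters are worth the cost in AIC/BIC.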

Interpretation guidance:

  • gls() is often a good baseline when you want to emphasize covariance structure.
  • lme() is preferred when subject-specific effects are scientifically meaningful (e.g., patient-specific baselines).

6.9.27 Practical limitation of lme4

lme4 is extremely powerful for random effects, but it does not natively support rich residual correlation structures. In lme4, residuals are assumed independent conditional on random effects (i.e., \(\boldsymbol\Sigma=\sigma^2\mathbf{I}\)). You can fit complex random-effects covariance structures, but not AR(1)/Gaussian/exponential residual correlations in the same direct way as nlme.

This is why, in longitudinal data analysis, a common workflow is:

  • use lme4 for flexible random-effects modeling and efficient fitting,
  • use nlme when you need explicit residual correlation structures.
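For contrast, a minimal lme4 sketch of the same random-intercept model (assuming the Orthodont data shipped with nlme). Note that lmer() accepts no correlation or weights arguments: the within-subject covariance comes entirely from the random effects.

```r
library(lme4)
data(Orthodont, package = "nlme")  # the same dental-growth data used above

# Random intercept per subject; residuals are assumed independent given the
# random effects (Sigma = sigma^2 * I), so there is no 'correlation=' argument.
fit.lmer <- lmer(distance ~ age + Sex + (1 | Subject), data = Orthodont)
summary(fit.lmer)
```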

6.10 Manually simulating data for a linear mixed model

see here

6.11 How to calculate the prediction interval for an LMM

see here

6.12 Least-squares means with interaction effect

see here

6.13 Spline regression

see here