8 Algorithms

8.1 Maximum likelihood estimation

8.1.1 Likelihood estimation

(without random effects)

With normally distributed errors, the likelihood can be written explicitly as the product of the densities of the $n$ independent normal observations. It is convenient to work with the negative log-likelihood,
$ℓ = - \log L$

which for the normal model is $\begin{aligned} ℓ ((y_{1}, \dots, y_{n}), μ, σ^{2}) & = - \log [\prod_{i = 1}^{n} \frac{1}{\sqrt{2 π σ^{2}}} \exp (- \frac{{(y_{i} - μ)}^{2}}{2 σ^{2}})] \\ & = - \sum_{i = 1}^{n} [\log (\frac{1}{\sqrt{2 π σ^{2}}}) + (- \frac{{(y_{i} - μ)}^{2}}{2 σ^{2}})] \end{aligned}$
In matrix notation,

$\begin{aligned} ℓ (y, β, γ) & = \frac{1}{2} \{ n \log (2 π) + \log | σ^{2} I | + (y - X β)^{'} {(σ^{2} I)}^{- 1} (y - X β) \} \\ & = \frac{1}{2} \{ n \log (2 π) + \log (\prod_{i = 1}^{n} σ^{2}) + (y - X β)^{'} (y - X β) / σ^{2} \} \\ & = \frac{1}{2} \{ n \log (2 π) + n \log (σ^{2}) + (y - μ)^{'} (y - μ) / σ^{2} \} \\ & = \frac{1}{2} \{ n \log (2 π) + n \log (σ^{2}) + \sum_{i = 1}^{n} {(y_{i} - μ)}^{2} / σ^{2} \} \end{aligned}$ where $γ$ denotes the covariance parameters that define $σ^{2} I$ (here only $σ^{2}$) and $μ = X β$, which reduces to a common mean $μ$ in the intercept-only model of the last line.

The estimates minimize $ℓ$: $(\hat{β}, \hat{γ}) = \arg\min_{β, γ} ℓ (y, β, γ)$

To find the minimum, take the derivatives of the negative log-likelihood function with respect to each parameter.

$\begin{aligned} \frac{\partial ℓ (μ, σ^{2})}{\partial μ} & = \frac{1}{2} [\sum_{i = 1}^{n} (- 2) (y_{i} - μ) / σ^{2}] \\ & = (n μ - \sum_{i = 1}^{n} y_{i}) / σ^{2} = 0 \end{aligned}$

$\frac{\partial ℓ (μ, σ^{2})}{\partial σ^{2}} = \frac{1}{2} [\frac{n}{σ^{2}} - \frac{\sum_{i = 1}^{n} {(y_{i} - μ)}^{2}}{{(σ^{2})}^{2}}] = 0$

Setting the derivatives equal to zero and solving for the parameters gives

$\begin{aligned} \hat{μ} & = \bar{y} \\ {\hat{σ}}^{2} & = \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - \hat{μ})}^{2} \\ & = \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2} \end{aligned}$

For REML, the divisor is $n - 1$ instead of $n$: $\begin{aligned} {\hat{σ}}^{2} & = \frac{1}{n - 1} \sum_{i = 1}^{n} {(y_{i} - \hat{μ})}^{2} \\ & = \frac{1}{n - 1} \sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2} \end{aligned}$
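The two divisors can be compared numerically. The following is a minimal sketch on a hypothetical simulated sample (the seed, sample size and parameter values are assumptions for illustration only); it relies on the fact that base R's `var()` uses the $n - 1$ divisor.

```r
# Hypothetical sample, for illustration only
set.seed(1)
y <- rnorm(15, mean = 5, sd = 2)
n <- length(y)

ss <- sum((y - mean(y))^2)     # sum of squared deviations from the sample mean
sigma2_ml   <- ss / n          # ML estimate: divides by n
sigma2_reml <- ss / (n - 1)    # REML-type estimate: divides by n - 1

# var() uses the n - 1 divisor, so it should agree with the REML form
all.equal(sigma2_reml, var(y))

# The two estimates differ only by the factor (n - 1) / n
all.equal(sigma2_ml, sigma2_reml * (n - 1) / n)
```

The ML estimate is always the smaller of the two, and the gap shrinks as $n$ grows.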

8.1.2 R demonstration

toy data

set.seed(123)
n <- 20000
x <- rnorm(n, 2, sqrt(2))
s <- rnorm(n, 0, 0.8)
y <- 1.5+x*3+s
mydata <- data.frame(y,x)

using linear regression

lmfit <- lm(y~., data=mydata)
# logLik(lmfit)
coefficients(lmfit)

## (Intercept)           x 
##    1.487701    3.003695

(summary(lmfit)$sigma**2)

## [1] 0.6397411

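minimizing the negative log-likelihood formula directly

note that the function uses vector and matrix notation, and that `param[1]` holds the variance $σ^{2}$ (not the standard deviation)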

 ## Using the mathematical expression:
 minusloglik <- function(param){
   beta <- param[-1] #Regression Coefficients
   sigma <- param[1] #Variance
   y <- as.vector(mydata$y) #DV
   x <- cbind(1, mydata$x) #IV
   mu <- x%*%beta #multiply matrices
   0.5*(n*log(2*pi) + n*log(sigma) + sum((y-mu)^2)/sigma)
 }

# start values in the order: variance, intercept, slope (Nelder-Mead is optim's default method)
MLoptimize <- optim(c(1, 1, 1), minusloglik)
## The results:
MLoptimize$par

## [1] 0.6397201 1.4876186 3.0038675
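For the normal linear model the minimizer is also available in closed form: the gradient of $ℓ$ with respect to $β$ vanishes at the solution of the normal equations $X^{'} X β = X^{'} y$, and the variance estimate is the residual sum of squares divided by $n$. The sketch below checks this against `lm()` on a smaller simulated data set; the sample size of 200 and the object names are assumptions made only to keep the example quick and self-contained.

```r
# Hypothetical data, generated like the toy data but with fewer rows
set.seed(123)
n <- 200
x <- rnorm(n, 2, sqrt(2))
y <- 1.5 + 3 * x + rnorm(n, 0, 0.8)
X <- cbind(1, x)                                   # design matrix with intercept column

beta_hat  <- solve(crossprod(X), crossprod(X, y))  # solves (X'X) beta = X'y
sigma2_ml <- sum((y - X %*% beta_hat)^2) / n       # ML variance: RSS / n

fit <- lm(y ~ x)
all.equal(as.numeric(beta_hat), unname(coef(fit))) # same coefficients as lm()
all.equal(sigma2_ml, sum(residuals(fit)^2) / n)    # same RSS / n
```

A general-purpose optimizer such as `optim()` is therefore not needed here; it is used in this section to illustrate the likelihood machinery, and its answer is only as accurate as its convergence tolerance.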

using max likelihood estimate directly (normal distribution)

# max 
library(maxLik)
ols.lf <- function(param) {
  beta <- param[-1] #Regression Coefficients
  sigma <- param[1] #Variance
  y <- as.vector(mydata$y) #DV
  x <- cbind(1, mydata$x) #IV
  mu <- x%*%beta #multiply matrices
  sum(dnorm(y, mu, sqrt(sigma), log = TRUE)) #normal distribution(vector of observations, mean, sd)
}  

mle_ols <- maxLik(logLik = ols.lf, start = c(sigma = 1, beta1 = 1, beta2 = 1 ))
summary(mle_ols)

## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 11 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -23910.85 
## 3  free parameters
## Estimates:
##       Estimate Std. error t value Pr(> t)    
## sigma 0.639677   0.006396   100.0  <2e-16 ***
## beta1 1.487701   0.009768   152.3  <2e-16 ***
## beta2 3.003695   0.003999   751.2  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------

another example

ols.lf <- function(param) {
  beta <- param[-1] #Regression Coefficients
  sigma <- param[1] #Variance
  y <- as.vector(mtcars$mpg) #DV
  x <- cbind(1, mtcars$cyl, mtcars$disp) #IV
  mu <- x%*%beta #multiply matrices
  sum(dnorm(y, mu, sqrt(sigma), log = TRUE)) #normal distribution(vector of observations, mean, sd)
}  

mle_ols <- maxLik(logLik = ols.lf, start = c(sigma = 1, beta1 = 1, beta2 = 1, beta3=1))
summary(mle_ols)

## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 28 iterations
## Return code 2: successive function values within tolerance limit (tol)
## Log-Likelihood: -79.57282 
## 4  free parameters
## Estimates:
##        Estimate Std. error t value   Pr(> t)    
## sigma  8.460632   2.037359   4.153 0.0000329 ***
## beta1 34.661013   2.395871  14.467   < 2e-16 ***
## beta2 -1.587281   0.673675  -2.356    0.0185 *  
## beta3 -0.020584   0.009757  -2.110    0.0349 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------

#Checking against linear regression
lmfit2 <- (lm(mpg~cyl+disp, data=mtcars))
lmfit2

## 
## Call:
## lm(formula = mpg ~ cyl + disp, data = mtcars)
## 
## Coefficients:
## (Intercept)          cyl         disp  
##    34.66099     -1.58728     -0.02058

(summary(lmfit2)$sigma**2)

## [1] 9.335872
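The variance reported by `lm()` (9.34) is larger than the maximum likelihood estimate (8.46) because `summary()$sigma` divides the residual sum of squares by the residual degrees of freedom $n - p$, while the ML estimate divides by $n$. A short sketch of the conversion, using only base R and the built-in `mtcars` data:

```r
fit <- lm(mpg ~ cyl + disp, data = mtcars)
n <- nrow(mtcars)          # number of observations
p <- length(coef(fit))     # number of regression coefficients

sigma2_unbiased <- summary(fit)$sigma^2     # RSS / (n - p), as reported by lm()
sigma2_ml <- sum(residuals(fit)^2) / n      # RSS / n, the ML estimate

# Rescaling the lm() variance by (n - p) / n recovers the ML value
all.equal(sigma2_ml, sigma2_unbiased * (n - p) / n)
c(unbiased = sigma2_unbiased, ml = sigma2_ml)
```

The same rescaling applies to the toy data: 0.6397411 multiplied by 19998/20000 is about 0.639677, the value returned by `maxLik`.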

8.1.3 Estimate confidence intervals using the likelihood

Most classical confidence intervals for parameters are based on the likelihood. The most common is the Wald interval, which relies on the asymptotic normality of the maximum likelihood estimator.

${\hat{θ}}_{i} \pm z_{1 - α / 2} S E_{{\hat{θ}}_{i}}$

where the standard error is obtained from the second derivative of the negative log-likelihood function. With more than one model parameter, this is the Hessian matrix, also called the observed information matrix.

take the second derivative of the negative log-likelihood function $ℓ$

$I (θ) = ℓ^{''} (θ)$

e.g. for the normal model with two parameters, $μ$ and $σ^{2}$.

Here, the four entries of the $2 \times 2$ Hessian matrix, evaluated at the maximum likelihood estimates, are $\frac{\partial^{2} ℓ (μ, σ^{2})}{\partial μ^{2}} = \frac{n}{{\hat{σ}}^{2}}$ and $\frac{\partial^{2} ℓ (μ, σ^{2})}{\partial {(σ^{2})}^{2}} = \frac{n}{2 {({\hat{σ}}^{2})}^{2}}$

$\frac{\partial^{2} ℓ (μ, σ^{2})}{\partial μ \partial (σ^{2})} = \frac{\partial^{2} ℓ (μ, σ^{2})}{\partial (σ^{2}) \partial μ} = - \frac{n \hat{μ} - \sum_{i = 1}^{n} y_{i}}{{({\hat{σ}}^{2})}^{2}} = 0$
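The variance entry of the Hessian can be checked by differentiating the first derivative from 8.1.1 once more. A short worked derivation, using only the expressions already given there:

$\begin{aligned} \frac{\partial^{2} ℓ (μ, σ^{2})}{\partial {(σ^{2})}^{2}} & = \frac{\partial}{\partial σ^{2}} \frac{1}{2} [\frac{n}{σ^{2}} - \frac{\sum_{i = 1}^{n} {(y_{i} - μ)}^{2}}{{(σ^{2})}^{2}}] \\ & = - \frac{n}{2 {(σ^{2})}^{2}} + \frac{\sum_{i = 1}^{n} {(y_{i} - μ)}^{2}}{{(σ^{2})}^{3}} \end{aligned}$

At the maximum likelihood estimates, $\sum_{i = 1}^{n} {(y_{i} - \hat{μ})}^{2} = n {\hat{σ}}^{2}$, so the entry becomes $- \frac{n}{2 {({\hat{σ}}^{2})}^{2}} + \frac{n}{{({\hat{σ}}^{2})}^{2}} = \frac{n}{2 {({\hat{σ}}^{2})}^{2}}$, whose reciprocal is $\frac{2 {({\hat{σ}}^{2})}^{2}}{n}$.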

therefore $S E_{{\hat{θ}}_{i}} = \sqrt{{(I (\hat{θ})^{- 1})}_{i i}}$

Because the off-diagonal entries are zero, the inverse of the information matrix is obtained by taking the reciprocal of each diagonal element, so

For the mean $μ$, $S E_{\hat{μ}} = \sqrt{{(I {(\hat{μ}, {\hat{σ}}^{2})}^{- 1})}_{11}} = \sqrt{\frac{{\hat{σ}}^{2}}{n}}$ Thus, the Wald confidence interval for the mean would be $\hat{μ} \pm z_{1 - α / 2} \sqrt{\frac{{\hat{σ}}^{2}}{n}}$
For the variance, $S E_{{\hat{σ}}^{2}} = \sqrt{{(I {(\hat{μ}, {\hat{σ}}^{2})}^{- 1})}_{22}} = \sqrt{\frac{2 {({\hat{σ}}^{2})}^{2}}{n}}$

Thus, the Wald confidence interval for the variance would be ${\hat{σ}}^{2} \pm z_{1 - α / 2} {\hat{σ}}^{2} \sqrt{\frac{2}{n}}$
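The two Wald intervals can be computed directly from these formulas. The sketch below uses the ten observations from the practice question in 8.1.5 and a 95% level; the object names are illustrative.

```r
y <- c(1.38, 3.96, -0.16, 8.12, 6.30, 2.61, -1.35, 0.03, 3.94, 1.11)
n <- length(y)

mu_hat     <- mean(y)                     # ML estimate of the mean
sigma2_hat <- sum((y - mu_hat)^2) / n     # ML estimate of the variance

z <- qnorm(0.975)                         # z_{1 - alpha/2} for alpha = 0.05

se_mu     <- sqrt(sigma2_hat / n)         # SE of the mean
se_sigma2 <- sigma2_hat * sqrt(2 / n)     # SE of the variance

ci_mu     <- mu_hat     + c(-1, 1) * z * se_mu
ci_sigma2 <- sigma2_hat + c(-1, 1) * z * se_sigma2

ci_mu
ci_sigma2
```

With only ten observations the interval for the variance is wide and symmetric around ${\hat{σ}}^{2}$, even though the sampling distribution of a variance estimate is skewed; this is one reason to treat Wald intervals with caution in small samples.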

However, the likelihood-ratio interval below generally has much better small-sample properties than the Wald interval. The constant 3.84 is the 0.95 quantile of the $χ^{2}$ distribution with one degree of freedom.

$\{ θ ∣ \frac{L (θ)}{L (\hat{θ})} > \exp (- 3.84 / 2) \}$

8.1.4 The profile likelihood

The profile likelihood of a parameter of interest is obtained by maximizing the likelihood function with respect to all the other parameters.

In the following equation, $σ^{2}$ is replaced by its maximizing value for a given $μ$, so that it is expressed in terms of $μ$

$\begin{aligned} L_{p} (μ) & = L (μ, \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - μ)}^{2}) \\ & = \prod_{i = 1}^{n} \frac{1}{\sqrt{2 π \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - μ)}^{2}}} \exp (- \frac{{(y_{i} - μ)}^{2}}{2 \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - μ)}^{2}}) \end{aligned}$

Similarly, the profile likelihood for the variance $σ^{2}$ can be expressed as $\begin{aligned} L_{p} (σ^{2}) & = L (\hat{μ} (σ^{2}), σ^{2}) \\ & = L (\bar{y}, σ^{2}) \end{aligned}$

This becomes particularly simple, as the estimate of $μ$ does not depend on $σ^{2}$.
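The profile likelihood for $μ$ can be combined with the likelihood-ratio cutoff from 8.1.3 to obtain an interval numerically. The following is a minimal sketch, assuming the ten observations from 8.1.5 and a 95% level; the helper names `profile_nll` and `lr_gap` and the search range of 10 units on either side of $\hat{μ}$ are assumptions chosen for this example.

```r
y <- c(1.38, 3.96, -0.16, 8.12, 6.30, 2.61, -1.35, 0.03, 3.94, 1.11)
n <- length(y)
mu_hat <- mean(y)

# Negative log profile likelihood: sigma^2 is replaced by mean((y - mu)^2)
profile_nll <- function(mu) {
  sigma2_mu <- mean((y - mu)^2)
  0.5 * n * (log(2 * pi) + log(sigma2_mu) + 1)
}

cutoff <- qchisq(0.95, df = 1)   # about 3.84

# Zero where twice the drop in log profile likelihood equals the cutoff
lr_gap <- function(mu) 2 * (profile_nll(mu) - profile_nll(mu_hat)) - cutoff

# One root on each side of the estimate
lower <- uniroot(lr_gap, c(mu_hat - 10, mu_hat), tol = 1e-10)$root
upper <- uniroot(lr_gap, c(mu_hat, mu_hat + 10), tol = 1e-10)$root
c(lower = lower, upper = upper)
```

For this model the interval happens to be symmetric around $\hat{μ}$, but it is wider than the corresponding Wald interval, because the profile accounts for the uncertainty in $σ^{2}$.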

8.1.5 Maximum likelihood estimate practice

question: estimate the mean and variance

sample<-c(1.38, 3.96, -0.16, 8.12, 6.30, 2.61, -1.35, 0.03, 3.94, 1.11)
n<-length(sample)
muhat<-mean(sample)
sigsqhat<-sum((sample-muhat)^2)/n
muhat

## [1] 2.594

sigsqhat

## [1] 8.133884

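# negative log-likelihood; theta[1] is the mean, theta[2] the variance
loglike<-function(theta){
a<--n/2*log(2*pi)-n/2*log(theta[2])-sum((sample-theta[1])^2)/(2*theta[2])
return(-a)  # optim() minimizes, so return the negated log-likelihood
}
optim(c(2,2),loglike,method="BFGS")$par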

## [1] 2.593942 8.130340
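The same optimization can also return a numerical Hessian, which gives standard errors without deriving the second derivatives by hand. This is a sketch under the same setup as above (same data, start values and method); `hessian = TRUE` is a documented argument of `optim()`, while the names `negloglik`, `se_numeric` and `se_analytic` are illustrative. The numerical values should be close to, but not exactly equal to, the analytic ones from 8.1.3.

```r
y <- c(1.38, 3.96, -0.16, 8.12, 6.30, 2.61, -1.35, 0.03, 3.94, 1.11)
n <- length(y)

# negative log-likelihood; theta[1] is the mean, theta[2] the variance
negloglik <- function(theta) {
  n / 2 * log(2 * pi) + n / 2 * log(theta[2]) +
    sum((y - theta[1])^2) / (2 * theta[2])
}

fit <- optim(c(2, 2), negloglik, method = "BFGS", hessian = TRUE)

# Observed information = Hessian of the negative log-likelihood at the minimum
se_numeric  <- sqrt(diag(solve(fit$hessian)))

# Analytic standard errors: sqrt(sigma2 / n) and sigma2 * sqrt(2 / n)
se_analytic <- c(sqrt(fit$par[2] / n), fit$par[2] * sqrt(2 / n))

rbind(se_numeric, se_analytic)
```

The optimizer may try negative variances during its line search and emit `NaN` warnings; optimizing over $\log σ^{2}$ instead is a common way to avoid that.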

8.2 Gradient descent

linear regression

library("ggplot2")
# fit a linear model
res <- lm( hwy ~ cty ,data=mpg)
summary(res)

## 
## Call:
## lm(formula = hwy ~ cty, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3408 -1.2790  0.0214  1.0338  4.0461 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.89204    0.46895   1.902   0.0584 .  
## cty          1.33746    0.02697  49.585   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.752 on 232 degrees of freedom
## Multiple R-squared:  0.9138, Adjusted R-squared:  0.9134 
## F-statistic:  2459 on 1 and 232 DF,  p-value: < 2.2e-16

compute regression coefficients

# squared error cost function
cost <- function(X, y, theta) {
  sum( (X %*% theta - y)^2 ) / (2*length(y))
}

# learning rate and iteration limit
alpha <- 0.005
num_iters <- 20000

# keep history
cost_history <- double(num_iters)
theta_history <- vector("list", num_iters)

# initialize coefficients
theta <- matrix(c(0,0), nrow=2)

x <- mpg$cty
y <- mpg$hwy

# add a column of 1's for the intercept coefficient
X <- cbind(1, matrix(x))

# gradient descent
for (i in 1:num_iters) {
  error <- (X %*% theta - y)
  delta <- t(X) %*% error / length(y)   # gradient of the cost with respect to theta
  
  theta <- theta - alpha * delta
  
  cost_history[i] <- cost(X, y, theta)
  theta_history[[i]] <- theta
}

print(theta)

##           [,1]
## [1,] 0.8899161
## [2,] 1.3375742

tail(cost_history)

## [1] 1.522137 1.522137 1.522137 1.522137 1.522137 1.522137
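The update in the loop uses the gradient of the squared error cost, $\nabla J (θ) = X^{'} (X θ - y) / m$ with $m$ the number of observations. A quick way to gain confidence in such a formula is to compare it with a finite-difference approximation. The sketch below does this on the built-in `mtcars` data (an assumption, chosen so the example runs without loading `ggplot2`); the evaluation point `theta0` and step size `h` are arbitrary.

```r
X <- cbind(1, mtcars$wt)     # intercept column plus one predictor
y <- mtcars$mpg
m <- length(y)

# squared error cost, same form as cost() above
cost <- function(theta) sum((X %*% theta - y)^2) / (2 * m)

# analytic gradient, same form as delta in the loop above
grad <- function(theta) drop(t(X) %*% (X %*% theta - y)) / m

theta0 <- c(1, 1)            # arbitrary point at which to compare
h <- 1e-6                    # finite-difference step

# central differences, one coordinate at a time
num_grad <- sapply(1:2, function(j) {
  e <- replace(numeric(2), j, h)
  (cost(theta0 + e) - cost(theta0 - e)) / (2 * h)
})

all.equal(grad(theta0), num_grad, tolerance = 1e-6)
```

Because the cost is quadratic in $θ$, the central difference is exact up to rounding error, so the two gradients should agree very closely.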

plot the cost function

plot(cost_history, type='l', col='red', lwd=2, main='Cost function', ylab='cost', xlab='Iterations')

- compare the two approaches (linear regression vs. gradient descent)

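x <- mpg$cty
y <- theta[1] + theta[2] * mpg$cty   # fitted values from gradient descent
plot(x, y, main='Linear regression by gradient descent')
# both lines below come from the same lm() fit, drawn in two widths for contrast
abline(lm(mpg$hwy ~ mpg$cty), col="blue", lwd = 4)
abline(res, col='red')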