Consider regression models of increasing complexity. When the models are nested (each one adding variables to the previous), we can compare them with the anova function, which tests consecutive pairs of models and reports the reduction in residual sum of squares together with the corresponding p-values. In the example below, lm1 is a significant improvement over lm0; there is some further improvement from lm1 to lm2, but it is not significant.
lm0 <- lm(dist ~ 1, cars)
lm1 <- lm(dist ~ speed, cars)
lm2 <- lm(dist ~ speed + I(speed^2), cars)
anova(lm0, lm1, lm2)
## Analysis of Variance Table
##
## Model 1: dist ~ 1
## Model 2: dist ~ speed
## Model 3: dist ~ speed + I(speed^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 49 32539
## 2 48 11354 1 21185.5 91.986 1.211e-12 ***
## 3 47 10825 1 528.8 2.296 0.1364
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
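As a by-hand check, the F statistic in the table can be reproduced directly from the residual sums of squares. One detail worth noting: when anova() compares a sequence of models, the denominator uses the residual mean square of the largest model in the sequence (here lm2), not of the model being tested.

```r
# Sketch: reproduce the F statistic for the lm0 -> lm1 comparison
rss0 <- sum(residuals(lm(dist ~ 1, cars))^2)                   # ~32539
rss1 <- sum(residuals(lm(dist ~ speed, cars))^2)               # ~11354
rss2 <- sum(residuals(lm(dist ~ speed + I(speed^2), cars))^2)  # ~10825
# 1 df gained going from lm0 to lm1; lm2 has 47 residual df
F.lm1 <- ((rss0 - rss1) / 1) / (rss2 / 47)
round(F.lm1, 3)  # 91.986, matching the anova table
```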
Consider several groups with constant variance. The goal is to test whether the group means are the same.
ngroups <- 5 # Number of groups / treatments
nsample <- 10 # Number of observations in each group
n <- ngroups*nsample # Total number of data points
pop.means <- c(50, 40, 45, 55, 60) # Population means for each of the groups
normal.error <- rnorm(n, 0, 3) # Residuals, Residual sd=3 (note: assumption of homoscedasticity)
x <- as.factor(rep(1:5, rep(nsample, ngroups))) # Indicator for group
y <- rep(pop.means, each=nsample) + normal.error
The one-way ANOVA can then be carried out with
anova(lm(y~x))
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 4 2623.03 655.76 84.652 < 2.2e-16 ***
## Residuals 45 348.59 7.75
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
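The group means the ANOVA compares can be extracted explicitly with the cell-means parameterisation lm(y ~ x - 1), which returns one coefficient per group. The sketch below regenerates the simulated data with a seed of my own choosing (the original text does not fix one).

```r
# Regenerate the simulated data (hypothetical seed for reproducibility)
set.seed(1)
ngroups <- 5; nsample <- 10
pop.means <- c(50, 40, 45, 55, 60)
x <- as.factor(rep(1:ngroups, each = nsample))
y <- rep(pop.means, each = nsample) + rnorm(ngroups * nsample, 0, 3)
# One coefficient per group = the group sample means
est <- coef(lm(y ~ x - 1))
round(est, 1)  # close to pop.means
```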
m <- "model {
  # Likelihood
  for (i in 1:(ngroups*nsample)) {
    y[i] ~ dnorm(means[x[i]], tau)
  }
  # Priors
  for (i in 1:ngroups) {
    means[i] ~ dnorm(0, 0.001)
  }
  sigma ~ dunif(0, 100)
  tau <- 1/(sigma*sigma)
  # Output variables: differences between consecutive group means
  for (i in 1:(ngroups-1)) {
    effect[i] <- means[i+1] - means[i]
  }
}"
library(rjags)    # jags.model, jags.samples
library(magrittr) # the %>% pipe and `$` extraction
mean.diff <- list('x'=as.numeric(x), 'y'=y, 'ngroups'=ngroups, 'nsample'=nsample) %>%
  jags.model(file = textConnection(m), data = ., n.adapt = 1000) %>%
  jags.samples(c('effect'), 100) %>% `$`(effect) %>%
  matrix(ncol=4, byrow=TRUE) %>% colMeans %>% round(1)
## mean.diff
## Group 1 vs 2 -9.5
## Group 2 vs 3 3.1
## Group 3 vs 4 12.1
## Group 4 vs 5 4.1
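For comparison, the classical analogue of the Bayesian 'effect' variables is simply the vector of consecutive differences of the group sample means. The sketch below regenerates the data (with a hypothetical seed, since the original does not fix one) rather than rerunning JAGS.

```r
# Consecutive differences of group sample means (classical counterpart
# of the 'effect' quantities monitored in the JAGS model)
set.seed(1)  # hypothetical seed for reproducibility
ngroups <- 5; nsample <- 10
x <- rep(1:ngroups, each = nsample)
y <- rep(c(50, 40, 45, 55, 60), each = nsample) + rnorm(ngroups * nsample, 0, 3)
d <- diff(tapply(y, x, mean))
round(d, 1)  # should be roughly -10, 5, 10, 5 plus noise
```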
The F statistic is defined as \[F = \frac{\text{between-group variability}}{\text{within-group variability}}\] For a permutation test, the F value of the original dataset is computed first. The dataset is then shuffled repeatedly and an F value is computed for each shuffle. The proportion of shuffled F values that exceed the original value is the p-value.
between.group.var <- function(x, y) {
  group.means <- sapply(1:ngroups, function(i) mean(y[x == i]))
  sum(nsample * (group.means - mean(y))^2) / (ngroups - 1)
}
within.group.var <- function(x, y) {
  group.means <- sapply(1:ngroups, function(i) mean(y[x == i]))
  sum((y - rep(group.means, each = nsample))^2) / (n - ngroups)
}
F.Statistic <- function(x, y) between.group.var(x, y) / within.group.var(x, y)
F.orig <- F.Statistic(x,y)
Fs <- replicate(10000, F.Statistic(x, sample(y)))
cat("p-value is ", mean(Fs > F.orig))
## p-value is 0
None of the 10000 permuted F values exceeded F.orig, so the estimated p-value is below 1/10000 and is reported as 0.
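The permutation p-value can be set against the parametric one, which is the upper-tail probability of the F distribution with (ngroups-1, n-ngroups) degrees of freedom. The sketch below recomputes it from a regenerated dataset (hypothetical seed) and checks it against what anova() reports.

```r
# Parametric p-value via pf(), compared with anova()'s Pr(>F)
set.seed(1)  # hypothetical seed for reproducibility
ngroups <- 5; nsample <- 10; n <- ngroups * nsample
x <- factor(rep(1:ngroups, each = nsample))
y <- rep(c(50, 40, 45, 55, 60), each = nsample) + rnorm(n, 0, 3)
tab <- anova(lm(y ~ x))
F.orig <- tab$"F value"[1]
p.param <- pf(F.orig, ngroups - 1, n - ngroups, lower.tail = FALSE)
p.param  # identical to tab$"Pr(>F)"[1]
```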
Random effects: the group means are correlated because they are themselves drawn from a common probability distribution. This contrasts with fixed-effects ANOVA, where the group means are treated as independent AND fixed / not variable, e.g., when the treatments studied are the entire population, i.e., they exhaust the space of all treatments.
Note that the stochastic part of the model consists of TWO stochastic processes: the draw of the group means from a normal distribution around the grand mean, and the draw of the residual errors around each group mean.
npop <- 10 # Number of groups
nsample <- 12 # Number of observations in each group
n <- npop*nsample # Total number of data points
pop.grand.mean <- 50 # Grand mean
pop.sd <- 5 # sd of population effects about mean
pop.means <- rnorm(n=npop, mean=pop.grand.mean, sd=pop.sd)
normal.error <- rnorm(n, 0, 3) # Draw residuals with residual sd=3
sigma <- 3 # Residual sd, kept for reference
x <- rep(1:npop, rep(nsample, npop))
X <- as.matrix(model.matrix(~as.factor(x)-1))
y <- as.numeric(X %*% as.matrix(pop.means) + normal.error) # recall that as.numeric is essential
boxplot(y~x, col="lightblue", xlab="Groups", ylab="Continuous Response", main="", las=1)
abline(h=pop.grand.mean, col="red", lwd=2)
library("lme4")
pop <- as.factor(x) # Define x as a factor and call it pop
lme.fit <- lmer(y ~ 1 + (1|pop)) %>% summary() # note the parentheses around (1|pop)
## lme4 population grand mean = 51.65359
## lme4 pop sd and sigma =
## Groups Name Std.Dev.
## pop (Intercept) 5.9832
## Residual 3.2258
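The lme4 variance components can be cross-checked with a method-of-moments calculation: for a balanced one-way design, the between-group variance component is estimated by (MS_between - MS_within) / nsample, and the residual variance by MS_within. The sketch below regenerates the data with a hypothetical seed; the estimates should be in the vicinity of pop.sd = 5 and sigma = 3.

```r
# Method-of-moments estimates of the variance components (sketch)
set.seed(1)  # hypothetical seed for reproducibility
npop <- 10; nsample <- 12
pop.means <- rnorm(npop, mean = 50, sd = 5)
y <- rep(pop.means, each = nsample) + rnorm(npop * nsample, 0, 3)
g <- factor(rep(1:npop, each = nsample))
ms <- anova(lm(y ~ g))$"Mean Sq"
pop.sd.hat <- sqrt((ms[1] - ms[2]) / nsample)  # between-group sd estimate
sigma.hat  <- sqrt(ms[2])                      # residual sd estimate
c(pop.sd.hat, sigma.hat)
```

For balanced data this agrees closely with the REML estimates that lmer reports.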