## Odd results on bias-variance tradeoff assessment


I am running a bias-variance tradeoff assessment on ten regression models of increasing complexity (linear to x^10), but my results do not satisfy the MSE = Bias^2 + Variance + True Error relationship.
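For reference, this is the textbook decomposition I am trying to verify, at a test point $x$, with expectations taken over training sets and over the test noise (written in my notation, where "True Error" is the noise variance):

```latex
\mathbb{E}\big[(Y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{True Error}}
```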

I noticed two oddities in the results from my actual (large) dataset, whose true pattern is roughly cubic with a flat left tail: as model complexity increases, (1) the variance increases sharply and then plateaus, and (2) the bias is almost invariant. This does not look correct to me. Since it is unwieldy to upload the dataset I am working with, I use simulated data for a reprex. I managed to recreate the bias issue, though the variance curve looks normal here. Either way, I am still unable to recover the relationship:

# Generate 101 datasets, 100 obs each
# One test set, the rest are training sets.
library(tidyverse)
data_sets <- lapply(1:101, function(...) {
  tibble(
    X = runif(100, min = -pi / 2, max = pi / 2), # predictor
    y.true = X^3,                                # true outcome
    e = rnorm(100, mean = 0, sd = 3),            # true error
    Y = y.true + e                               # observed outcome
  )
}) %>% {
  list(test_set = .[[1]],      # note: .[] returns the whole list; .[[1]] extracts the first dataset
       training_sets = .[-1])
}
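A quick sanity check on the split (a sketch, assuming the list ends up structured as intended, i.e. one test set and 100 training sets):

```r
# Sanity check: test_set should be a single 100-row dataset,
# and training_sets a list of the remaining 100 datasets
stopifnot(nrow(data_sets$test_set) == 100)
stopifnot(length(data_sets$training_sets) == 100)
```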

# convert test_set to dataframe for predict() compatibility
testset <- as.data.frame(data_sets$test_set) ## Begin train/predict -------------- # Set necessary stuff timesiter = 100 # iterations polynom = 10 # model complexity predict.df <- data.frame(matrix(NA, nrow = nrow(testset)*timesiter, ncol = polynom + 1)) # + 1 iteration col colnames(predict.df) <- c(paste0("x",1:polynom), "iteration") predict.df$iteration <- rep(1:timesiter, each = nrow(testset))

set.seed(696969)
for (p in 1:timesiter) {

  # Take a new training set for training
  train.sample <- data_sets$training_sets[[p]]

  # Train models
  for (i in 1:polynom) {
    lm.model <- lm(Y ~ poly(X, i), data = train.sample)
    # Get model predictions on testset & store them
    predict.df[predict.df$iteration == p, i] <- predict(lm.model, newdata = testset)
  }

}

## ----------------------------------
# Using manual calculations for understanding
# Also, I realise the for-loops are unwieldy

## VARIANCE for each model
var.df <- data.frame(matrix(NA, nrow = 1, ncol = polynom))
for (i in 1:polynom) {
  var.df[, i] <- mean((predict.df[, i] - mean(predict.df[, i]))^2)
}
var.df
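As an aside on the loop style: the same per-model variance can be computed without the explicit loop (a sketch, assuming `predict.df` and `polynom` exist as above):

```r
# Per-model variance of predictions across iterations, equivalent to
# the loop above (population variance, i.e. dividing by n not n - 1)
var.vec <- sapply(predict.df[1:polynom],
                  function(col) mean((col - mean(col))^2))
```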

## BIAS^2
biasquare.df <- data.frame(matrix(NA, nrow = 1, ncol = polynom))
for (i in 1:polynom) {
  biasquare.df[, i] <- mean((mean(predict.df[, i]) - testset$y.true)^2)
}
biasquare.df

## MSE (Mean Squared Error)
mse.df <- data.frame(matrix(NA, nrow = 1, ncol = polynom))
for (i in 1:polynom) {
  mse.df[, i] <- mean((testset$Y - predict.df[, i])^2)
}
mse.df

## TRUE ERROR
true_error <- mean((testset$y.true - testset$Y)^2)

# Check if MSE = Bias^2 + Variance + Error (they are not)
mse.df - (var.df + biasquare.df + true_error)
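For comparison, here is a minimal self-contained toy check of the kind of agreement I expected (illustrative only: a single linear fit evaluated at one test point, with the test noise redrawn independently in every simulation):

```r
# Toy check (illustrative): MSE ~ Bias^2 + Variance + noise variance
# when the test noise is drawn independently of the fitted models
set.seed(1)
n.sims <- 5000
x.test <- 0.5
f <- function(x) x^3
preds <- replicate(n.sims, {
  x <- runif(50, -1, 1)
  y <- f(x) + rnorm(50, sd = 1)
  predict(lm(y ~ x), newdata = data.frame(x = x.test))
})
y.test <- f(x.test) + rnorm(n.sims, sd = 1) # fresh test noise each draw
mse   <- mean((y.test - preds)^2)
bias2 <- (mean(preds) - f(x.test))^2
v     <- mean((preds - mean(preds))^2)
mse - (bias2 + v + 1) # noise variance is 1 here (sd = 1); should be near zero
```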


The plot for this looks like this: (image not included). I'd appreciate it if anyone can point out my mistakes. Thanks!

The bias-variance tradeoff equality holds in expectation, so with a large enough dataset you should only expect approximate equality. It also assumes the noise is independent of the samples. Not sure about other issues. – kate-melnykova – 2020-09-17T06:02:38.880

Yeah, the problem I have is that the difference is much greater than just rounding error (and it increases with complexity, probably because the bias doesn't seem to be decreasing). – fluent – 2020-09-17T07:00:09.383

Did you try decreasing the noise level? sd = 3 seems too large. – kate-melnykova – 2020-09-17T17:23:34.670