Odd results on bias-variance tradeoff assessment


I am running a bias-variance tradeoff assessment on ten regression models of increasing complexity (linear through x^10), but my results do not satisfy the decomposition MSE = Bias^2 + Variance + True Error (the irreducible noise).
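For reference, this is the decomposition I am trying to verify, stated pointwise at a test input $x$ (expectations taken over training sets and noise) and then averaged over the test set:

$$E\big[(Y - \hat{f}(x))^2\big] = \big(E[\hat{f}(x)] - f(x)\big)^2 + \mathrm{Var}\big(\hat{f}(x)\big) + \sigma^2,$$

i.e. MSE = Bias^2 + Variance + True Error, where $f$ is the true function and $\sigma^2$ is the noise variance.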

I noticed two oddities in the results from my (large) actual dataset, whose true pattern is roughly cubic with a flat left tail: as model complexity increases, (1) the variance increases sharply and then plateaus, and (2) the bias is almost invariant. This does not look right to me.

[plot: bias^2 and variance against model complexity for the actual dataset]

As it is unwieldy to upload the dataset I am working on, I use simulated data for a reprex. I managed to recreate the bias issue (the variance curve looks normal here), but I am still unable to recover the decomposition:

# Generate 101 datasets of 100 observations each:
# one is the test set, the rest are training sets.
library(tidyverse)

set.seed(696969) # seed before generation so the reprex is reproducible
data_sets <- lapply(1:101, function(...) {
  tibble(
    X = runif(100, min = -pi / 2, max = pi / 2), # predictor
    y.true = X^3, # true outcome
    e = rnorm(100, mean = 0, sd = 3), # true error
    Y = y.true + e # observed outcome
  )
}) %>% {
  list(test_set = .[[1]],
       training_sets = .[-1])
}
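# Aside (sanity check, a sketch): scale of signal vs noise. The cubic
# signal on (-pi/2, pi/2) stays within about +/- 3.9, while the noise
# SD is 3, so the observed Y are very noisy relative to the signal.
range(data_sets$test_set$y.true)
sd(data_sets$test_set$e)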

# convert the test set to a plain data frame for predict() compatibility
testset <- as.data.frame(data_sets$test_set)

## Begin train/predict -------------- 

# Simulation parameters
timesiter = 100 # number of iterations (one per training set)
polynom = 10 # maximum polynomial degree (model complexity)

# One column per model plus an iteration index; rows are the test set
# stacked timesiter times (rows 1-100 are iteration 1, and so on)
predict.df <- data.frame(matrix(NA, 
                                nrow = nrow(testset)*timesiter, 
                                ncol = polynom + 1))
colnames(predict.df) <- c(paste0("x",1:polynom), "iteration")
predict.df$iteration <- rep(1:timesiter, each = nrow(testset))

for (p in 1:timesiter){
  
  # Take the p-th training set
  train.sample <- data_sets$training_sets[[p]]
  
  # Train models of degree 1 through polynom
  for (i in 1:polynom){ 
    lm.model <- lm(Y ~ poly(X, i), data = train.sample) # orthogonal polynomial of degree i
    # Store this model's predictions on the test set
    predict.df[predict.df$iteration == p, i] <- predict(lm.model, newdata = testset)
    
  }
  
} 

## ---------------------------------- 
# Manual calculations below, for my own understanding
# (I realise the for-loops are unwieldy; a vectorized sketch follows the listing)

## VARIANCE for each model
# (pooled: deviations of all predictions in a column from that column's
#  grand mean, i.e. across test points and iterations alike)
var.df <- data.frame(matrix(NA, 
                            nrow = 1,
                            ncol = polynom))
for (i in 1:polynom){
  var.df[, i] <- mean((predict.df[, i] - mean(predict.df[, i]))^2)
}
var.df

## BIAS^2 for each model
# (squared deviation of the column's grand mean prediction from each
#  true test value, averaged over the test set)
biasquare.df <- data.frame(matrix(NA, 
                                  nrow = 1,
                                  ncol = polynom))
for (i in 1:polynom){
  biasquare.df[, i] <- mean( (mean(predict.df[, i]) - testset$y.true)^2 ) 
}
biasquare.df

## MSE (Mean Squared Error) for each model
# (testset$Y, length 100, is recycled across the stacked predictions;
#  the block order matches the iteration column, so the recycling aligns)
mse.df <- data.frame(matrix(NA, 
                            nrow = 1,
                            ncol = polynom))
for (i in 1:polynom){
  mse.df[, i] <- mean((testset$Y - predict.df[, i])^2)  
}
mse.df

## TRUE ERROR: empirical noise variance on the test set (an estimate of sigma^2)
true_error <- mean((testset$y.true - testset$Y)^2)

# Check whether MSE = Bias^2 + Variance + True Error (it does not hold)
mse.df - (var.df + biasquare.df + true_error)
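As an aside, since the loops above are unwieldy: here is a minimal vectorized sketch computing the same three pooled quantities column-wise. It assumes predict.df, testset, timesiter and polynom exactly as defined above; the name summary.mat is arbitrary.

# Same three pooled quantities as the loops above, one column per model
summary.mat <- sapply(1:polynom, function(i) {
  preds <- predict.df[, i]
  c(variance = mean((preds - mean(preds))^2),
    biassq   = mean((mean(preds) - testset$y.true)^2),
    mse      = mean((rep(testset$Y, timesiter) - preds)^2))
})
summary.mat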

The plot of these simulated results looks like this:

[plot: bias^2 and variance against model complexity for the simulated data]

I would appreciate it if anyone could point out my mistakes. Thanks!

fluent

Posted 2020-09-16T18:13:14.430


The bias-variance tradeoff equality holds in expectation, so if you have a large enough dataset and approximate equality, you are all good. It also assumes that the noise is independent of the samples. Not sure about other issues. – kate-melnykova – 2020-09-17T06:02:38.880

Yeah, the problem I have is that the difference is much greater than just rounding error (and it increases with complexity, probably because the bias does not seem to be decreasing). – fluent – 2020-09-17T07:00:09.383

Did you try decreasing the noise level? sd = 3 seems too large. – kate-melnykova – 2020-09-17T17:23:34.670

No answers