## Trying determining degree polynomial for polynomical regression

1

I'm trying to predict the birth weight baby using polynomial regression model. First what I need know what degree polynomial should fit better to my data. In order to do that I split my dataset in training set (70%) and Cross Validation set(30%) and then plots each error by degree polynomial. I did run my script 4 times selecting randomly the data but I get so different curves each time as you can see I don't know why this happens or What am I doing wrong.

You can see my code below:

main script

%========== Begin - constants declaration ==========%
x_training_percent = 0.7;
cv_set_percent = 0.3;
%========== End - constants declaration ==========%

[m, n] = size(X);% m: Number of examples, n: Number of features.

%========== Begin - Getting traingin and CV sets ==========%
training_set_size = round(m * x_training_percent);
cv_set_size = round(m * cv_set_percent);
test_set_size = round(m * test_set_percent);

random_order_idx = randperm(m);

indexes = random_order_idx(1:training_set_size);
random_order_idx(1:training_set_size) = []; % Remove indexes were used
x_training_o = X(indexes, :); % X original training Set
y_training = y(indexes, :);   % y original training Set

indexes = random_order_idx(1:cv_set_size);
random_order_idx(1:cv_set_size) = []; % Remove indexes were used
x_cv_o = X(indexes, :); % X original Cross Validation Set
y_cv = y(indexes, :);   % y original Cross Validation Set

%========== End - Getting traingin and CV sets ==========%

max_p = 10; % Max degree polynomial

cv_error = zeros(max_p, 1);
training_error = zeros(max_p, 1);

for p = 1:max_p

% Processing training set
x_training = x_training_o;
x_training = polyFeatures(x_training, p); % Adding polynomial terms from 1 to p
[x_training, mu, sigma] = featureNormalize(x_training);
x_training = [ones(training_set_size, 1) x_training];

% Processing cross validation set
x_cv = x_cv_o;
x_cv = polyFeatures(x_cv, p); % Adding polynomial terms from 1 to p
x_cv = bsxfun(@minus, x_cv, mu);
x_cv = bsxfun(@rdivide, x_cv, sigma);
x_cv = [ones(cv_set_size, 1) x_cv];

%========== Begin - Training ==========%
lambda = 0
theta = trainLinearReg(x_training, y_training, lambda);
%========== End - Training ==========%

%========== Begin - Computing prediction errors with polinomial degree p ==========%
predictions = x_training * theta; % Predictions with training set
training_error(p, :) = (1 / (2 * training_set_size)) * sum((predictions - y_training) .^ 2);

cv_predictions = x_cv * theta; % Predictions with cross validation set
cv_error(p, :) = (1 / (2 * cv_set_size)) * sum((cv_predictions - y_cv) .^ 2);
%========== End - Computing prediction errors ==========%

end

plot(1:p, training_error, 1:p, cv_error);

title(sprintf('Learning curve for linear regression with lambda = %f', lambda));
legend('Train', 'Cross Validation')
xlabel('degree of Polynomial')
ylabel('Error')


polyFeatures

function [X_poly] = polyFeatures(X, p)
%POLYFEATURES Maps X (1D vector) into the p-th power
%   [X_poly] = POLYFEATURES(X, p) takes a data matrix X (size m x 1) and
%   maps each example into its polynomial features where
%   X_poly(i, :) = [X(i) X(i).^2 X(i).^3 ...  X(i).^p];
%

X_poly = X; % For p = 1

for i = 2:p
X_poly = [X_poly (X .^ i)];
end

end


featureNormalize

function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes the features in X
%   FEATURENORMALIZE(X) returns a normalized version of X where
%   the mean value of each feature is 0 and the standard deviation
%   is 1. This is often a good preprocessing step to do when
%   working with learning algorithms.

mu = mean(X);
X_norm = bsxfun(@minus, X, mu);

sigma = std(X_norm);
X_norm = bsxfun(@rdivide, X_norm, sigma);

end


trainingLinearReg

function [theta] = trainLinearReg(X, y, lambda)
%TRAINLINEARREG Trains linear regression given a dataset (X, y) and a
%regularization parameter lambda
%   [theta] = TRAINLINEARREG (X, y, lambda) trains linear regression using
%   the dataset (X, y) and regularization parameter lambda. Returns the
%   trained parameters theta.
%

% Initialize Theta
initial_theta = zeros(size(X, 2), 1);

% Create "short hand" for the cost function to be minimized
costFunction = @(t) linearRegCostFunction(X, y, t, lambda);

% Now, costFunction is a function that takes in only one argument
options = optimset('MaxIter', 400, 'GradObj', 'on');

% Minimize using fmincg
theta = fmincg(costFunction, initial_theta, options);

end


linearRegCostFuncion

function [J, grad] = linearRegCostFunction(X, y, theta, lambda)
%LINEARREGCOSTFUNCTION Compute cost and gradient for regularized linear
%regression with multiple variables
%   [J, grad] = LINEARREGCOSTFUNCTION(X, y, theta, lambda) computes the
%   cost of using theta as the parameter for linear regression to fit the
%   data points in X and y. Returns the cost in J and the gradient in grad

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly
J = 0;

hx = X * theta; % Prediction

J = (1 / (2 * m)) * sum((hx - y) .^ 2) + (lambda / (2 * m)) * sum([ 0; theta(2:end, :) ] .^ 2);

grad = (1 / m) * (hx - y)' * X + (lambda / m) * [ 0; theta(2:end, :) ]';

end


Can anyone help me?

More people would probably look at the problem if you could reduce it to fewer lines of code. What seems strange are the really high training/testing errors you're getting - like 200k MSE for a baby's weight is really far off. Maybe there's something wrong with the gradient calculation - try to remove the calculation of the gradient and use a minimization function that doesn't require the gradient. And how much data have you got (samples + features)? – stmax – 2016-05-03T08:48:59.623

I don't really know octave, but the docs suggest fminunc - set GradObj to "off" and make your linearRegCostFunction not return grad. See if that helps...

– stmax – 2016-05-03T13:01:33.330

@stmax thank you for your reply. I have posted all my code because maybe anyone wants to reproduce the execution. What minimization function do you suggest? My dataset have 189 samples and 8 features originally (Without considering polynomial terms). Now I tried increase the number of iterations to 100000000 instead of 400... My script it's still running after 24 hs! but for degree 9 fmincg did 4115163 Iteration with a cost = 1.674962e+005. I don't understand what is going on. – Overflow012 – 2016-05-03T13:06:47.370

You are running your model on different data set so it's really not that surprising. – 3x89g2 – 2016-09-03T07:42:02.277