## Coursera ML - Does the choice of optimization algorithm affect the accuracy of multiclass logistic regression?


I recently completed exercise 3 of Andrew Ng's Machine Learning on Coursera using Python.

When initially completing parts 1.4 to 1.4.1 of the exercise, I ran into difficulties ensuring that my trained model's accuracy matched the expected 94.9%. Even after debugging and ensuring that my cost and gradient functions were bug-free and that my predictor code was working correctly, I was still getting only 90.3% accuracy. I was using the conjugate gradient (CG) algorithm in scipy.optimize.minimize.

Out of curiosity, I decided to try another algorithm and used Broyden–Fletcher–Goldfarb–Shanno (BFGS). To my surprise, the accuracy improved drastically to 96.5%, exceeding the expected value. The comparison of these two results between CG and BFGS can be viewed in my notebook under the header Difference in accuracy due to different optimization algorithms.

Is the reason for this difference in accuracy due to the different choice of optimization algorithm? If yes, then could someone explain why?

Also, I would greatly appreciate any review of my code just to make sure that there isn't a bug in any of my functions that is causing this.

Thank you.

EDIT: As requested in the comments, I have added the code involved in the question directly to this page, rather than referring readers to the links to my Jupyter notebooks.

Model cost and gradient functions:

```python
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost_regularized(theta, X, y, lda):
    reg = lda/(2*len(y)) * np.sum(theta[1:]**2)
    return 1/len(y) * np.sum(-y @ np.log(sigmoid(X@theta))
                             - (1-y) @ np.log(1-sigmoid(X@theta))) + reg

def compute_gradient_regularized(theta, X, y, lda):
    XT = X.T
    beta = sigmoid(X@theta) - y
    regterm = lda/len(y) * theta
    # theta_0 does not get regularized, so a 0 is substituted in its place
    regterm[0] = 0
    gradient = (1/len(y) * XT@beta).T + regterm
    return gradient
```
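Since part of the question is a request for code review: one quick way to test that a hand-written gradient matches its cost function is `scipy.optimize.check_grad`, which compares it against a finite-difference approximation. This is a minimal sketch, not part of the exercise; the synthetic data, sizes, and regularization value below are made up purely for the check:

```python
import numpy as np
from scipy.optimize import check_grad

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y, lda):
    # regularized logistic-regression cost, as in the question
    reg = lda / (2 * len(y)) * np.sum(theta[1:] ** 2)
    h = sigmoid(X @ theta)
    return 1 / len(y) * np.sum(-y @ np.log(h) - (1 - y) @ np.log(1 - h)) + reg

def grad(theta, X, y, lda):
    # corresponding analytic gradient; the bias weight is not regularized
    beta = sigmoid(X @ theta) - y
    regterm = lda / len(y) * theta
    regterm[0] = 0
    return 1 / len(y) * (X.T @ beta) + regterm

# tiny synthetic problem, invented for the consistency check
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])
y = rng.integers(0, 2, size=20).astype(float)
theta0 = rng.normal(scale=0.1, size=4)

# check_grad returns the norm of (analytic - finite-difference) gradient
err = check_grad(cost, grad, theta0, X, y, 0.1)
print(err)
```

A bug in either function would show up here as an error far above finite-difference noise (roughly 1e-6).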


Function that implements one-vs-all classification training:

```python
from scipy.optimize import minimize

def train_one_vs_all(X, y, opt_method):
    theta_all = np.zeros((y.max()-y.min()+1, X.shape[1]))
    for k in range(y.min(), y.max()+1):
        grdtruth = np.where(y==k, 1, 0)
        results = minimize(compute_cost_regularized, theta_all[k-1,:],
                           args = (X, grdtruth, 0.1),
                           method = opt_method,
                           jac = compute_gradient_regularized)
        # optimized parameters are accessible through the x attribute
        theta_optimized = results.x
        # Assign the theta_optimized vector to the appropriate row in the
        # theta_all matrix
        theta_all[k-1,:] = theta_optimized
    return theta_all
```


The function is then called to train the model with different optimization methods:

```python
theta_all_optimized_cg = train_one_vs_all(X_bias, y, 'CG')      # optimization performed using conjugate gradient
theta_all_optimized_bfgs = train_one_vs_all(X_bias, y, 'BFGS')  # optimization performed using Broyden–Fletcher–Goldfarb–Shanno
```


We see that prediction results differ based on the algorithm used:

```python
def predict_one_vs_all(X, theta):
    return np.mean(np.argmax(sigmoid(X@theta.T), axis=1)+1 == y)*100
```

```
In : predict_one_vs_all(X_bias, theta_all_optimized_cg)
Out: 90.319999999999993

In : predict_one_vs_all(X_bias, theta_all_optimized_bfgs)
Out: 96.480000000000004
```


Anyone wanting data to try the code can find it in my GitHub repository as linked in this post.

Logistic regression should have a single stable minimum (like linear regression), so it is likely that something you haven't noticed is causing this – Neil Slater – 2017-07-04T15:57:02.243

So there must be guaranteed convergence to the minimum cost? Would you be able to do a code review for me please? – AKKA – 2017-07-04T23:28:25.230

If there's a lot of code you need reviewing, maybe post it on codereview.stackexchange.com - if it is only a small amount required to replicate the problem, you could add it to your question here (edit it in as a code block; please include enough to fully replicate the problem). – Neil Slater – 2017-07-05T06:51:59.693

While it is true that reaching the global minimum should give you the same result regardless of the optimization algorithm, there can be subtleties in the implementation of the algorithm (i.e. the methods used to handle numerical stability, etc.) that may lead to slightly different solutions. These small differences in solutions may lead to larger performance differences when evaluated on a small test set. Maybe that is causing such a large performance difference in your case. And yes, in general, optimization algorithms can largely influence the learning outcome. Btw, I got the desired result in MATLAB. – Sal – 2017-07-06T06:30:50.243

@NeilSlater: ok, I have just added the code directly into the question as an edit. Does it look ok? – AKKA – 2017-07-06T15:07:22.133

@Sal: I see. May I ask if you had taken a look at my code to see if it is correct? – AKKA – 2017-07-06T15:07:55.883

@AKKA: I cannot see any obvious problem with the code after 10 mins scanning through it. Best I can think is that maybe the default value of gtol is too high for your data, which could occur if it is not normalised. I might take a deeper look later on, if no-one else has answered in the meantime. – Neil Slater – 2017-07-06T15:44:54.417
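For context on this suggestion: `gtol` is the gradient-norm stopping tolerance used by the CG and BFGS methods of `scipy.optimize.minimize`. A minimal sketch on SciPy's built-in Rosenbrock test function (not the exercise data) shows its effect:

```python
import numpy as np
from scipy.optimize import minimize, rosen

# CG stops once the gradient norm falls below gtol (default 1e-5);
# a looser tolerance halts further from the minimum, a tighter one iterates longer
loose = minimize(rosen, np.zeros(5), method='CG', options={'gtol': 1e-2})
tight = minimize(rosen, np.zeros(5), method='CG', options={'gtol': 1e-10})
print(loose.fun, tight.fun)
```

Because the line search never increases the cost, the tighter tolerance ends at a cost no higher than the looser one.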

@NeilSlater: I had already tried reducing the value of gtol, based on a suggestion from someone in a duplicate of my question on Stack Overflow: https://stackoverflow.com/questions/44915145/coursera-ml-does-the-choice-of-optimization-algorithm-affect-the-accuracy-of-m?noredirect=1#comment76806558_44915145 . Reducing gtol does not solve the problem :( No one has answered since your comment. Could you please help me take a deeper look? Thank you

– AKKA – 2017-07-08T07:53:33.783


Limits of numerical accuracy and stability are causing the optimisation routines to struggle.

You can see this most easily by changing the regularisation term to 0.0 - there is no reason why this should not work in principle, and you are not using any feature engineering that particularly needs it. With regularisation set to 0.0, you will see the limits of precision reached and attempts to take the log of 0 when calculating the cost function. The two optimisation routines are affected differently because they take different sample points en route to the minimum.
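As a standalone illustration of that failure mode (a sketch, not the exercise code): in float64 the sigmoid saturates to exactly 0 or 1 once the logit is large in magnitude, at which point the log terms of the cost diverge:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# a large negative logit underflows sigmoid to exactly 0.0,
# so the y * log(h) term of the cost becomes log(0) = -inf
h = sigmoid(-800.0)
print(h)          # 0.0
print(np.log(h))  # -inf, with a RuntimeWarning

# a large positive logit rounds sigmoid to exactly 1.0,
# so the (1 - y) * log(1 - h) term diverges instead
print(np.log(1 - sigmoid(800.0)))  # -inf
```

Clamping the sigmoid away from exactly 0 and 1 is one way to keep the cost finite.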

I think that with the regularisation term set high, you remove the numerical instability, but at the expense of not seeing what is really going on with the calculations - in effect the regularisation terms become dominant for the difficult training examples.

You can offset some of the accuracy problems by modifying the cost function:

```python
def compute_cost_regularized(theta, X, y, lda):
    reg = lda/(2*len(y)) * np.sum(theta[1:]**2)
    return reg - 1/len(y) * np.sum(
        y @ np.log( np.maximum(sigmoid(X@theta), 1e-10) )
        + (1-y) @ np.log( np.maximum(1-sigmoid(X@theta), 1e-10) ) )
```


Also, to get some feedback during training, you can add

```python
options = {
    'disp': True
}
```

to the call to `minimize`.

With this change, you can try with regularisation term set to zero. When I do this, I get:

```
In : predict_one_vs_all(X_bias, theta_all_optimized_cg)
Out: 94.760000000000005

In : predict_one_vs_all(X_bias, theta_all_optimized_bfgs)
/usr/local/lib/python3.6/site-packages/ipykernel/__main__.py:2: RuntimeWarning: overflow encountered in exp
  from ipykernel import kernelapp as app
Out: 98.839999999999989
```


The CG value of 94.76 seems to match the expected result nicely - so I wonder if this was done without regularisation. The BFGS value is still "better" although I am not sure how much I trust it given the warning messages during training and evaluation. To tell if this apparently better training result really translates into better digit detection, you would need to measure results on a hold-out test set.

Really appreciate the analysis you've provided in your answer. I still have a question about the modification you made to the cost function: with np.maximum(sigmoid(X@theta), 1e-10), how did you know to use 1e-10 as the threshold value? Also, I noticed that you shifted the negative sign outside of the individual terms of the sum, so that it's now reg minus the sum term. Does this matter too? – AKKA – 2017-07-08T13:03:22.590

As you suggested, I also tried setting the regularization term to 0.0, and not only do I get the divide-by-zero error, but the running time also becomes much longer! About the divide-by-zero error, I don't quite understand why it happens. How did it come about? Has this got something to do with the implementation details of the algorithms? Pardon me, as I'm not familiar with numerical methods... – AKKA – 2017-07-08T13:06:31.777

@AKKA: I just chose 1e-10 arbitrarily, and the shuffling of terms around was a side-effect of me double-checking and understanding the code. I don't think either makes a big difference. Technically it isn't a divide by zero, but a np.log( array_containing_a_zero ), which occurred due to a large negative or positive sum in one or more examples during the optimisation search. – Neil Slater – 2017-07-08T13:21:46.830

Because the code exponentiates then takes logs, the numbers you see can seem within reasonable bounds, but the interim calculations can be extreme. Some frameworks can resolve the expressions so that exponentiation and logs don't actually occur - but the maths for that is beyond me. – Neil Slater – 2017-07-08T13:26:43.160
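To make that concrete (an illustrative sketch, not code from the exercise): the logistic cross-entropy can be computed straight from the logit with `np.logaddexp`, which evaluates log(1 + e^t) stably, so the huge intermediate exponential is never formed:

```python
import numpy as np

# -log(sigmoid(z))     = log(1 + e^(-z)) = logaddexp(0, -z)
# -log(1 - sigmoid(z)) = log(1 + e^(z))  = logaddexp(0, z)
def stable_logistic_loss(z, y):
    return y * np.logaddexp(0, -z) + (1 - y) * np.logaddexp(0, z)

def naive_logistic_loss(z, y):
    h = 1 / (1 + np.exp(-z))  # saturates to exactly 0 or 1 for extreme z
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

z = np.array([-800.0, 0.0, 800.0])
y = np.array([1.0, 1.0, 0.0])
print(naive_logistic_loss(z, y))   # inf at the extreme logits, with RuntimeWarnings
print(stable_logistic_loss(z, y))  # finite everywhere: [800., 0.693..., 800.]
```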

I see. Do you think then that the better results you have obtained could have been over-fitting? I guess that is why you said ultimately a test set is required to validate this... – AKKA – 2017-07-08T13:29:21.377

@AKKA: No, I don't think this is over-fitting. Most likely some records are behaving unexpectedly, and maybe not being reported at all because of inf values in the table. Look at np.sum( (np.abs(theta_all_optimized_bfgs) > 1000).astype('float') ) and compare to np.sum( (np.abs(theta_all_optimized_cg) > 100).astype('float') ) . . . it is those super-large weights that BFGS has assigned that are causing inaccurate results. – Neil Slater – 2017-07-08T13:53:15.460


CG does not converge to the minimum as well as BFGS

If I may, I will add an answer to my own question here too; credit goes to a good friend who volunteered to look at my code. He is not on Data Science Stack Exchange and didn't feel the need to create an account just to post the answer, so he passed the chance to post over to me.

I would also reference @Neil Slater, as there is a chance that his analysis of the numerical stability issue could account for this.

So the main premise behind my solution is:

We know that the cost function is convex, meaning it has no local minima, only a single global minimum. Since the prediction using parameters trained with BFGS is better than with those trained using CG, this implies that BFGS converged closer to the minimum than CG did. Whether or not BFGS converged to the global minimum, we can't say for sure, but we can definitely say that it got closer than CG did.

So if we take the parameters that were trained using CG, and pass them through the optimization routine using BFGS, we should see that these parameters get further optimized, as BFGS brings everything closer to the minimum. This should improve the prediction accuracy and bring it closer to the one obtained using plain BFGS training.

Here below is code that verifies this, variable names follow the same as in the question:

```python
# Copy the old array over, else only a reference is copied, and the
# original vector gets modified
theta_all_optimized_bfgs_from_cg = np.copy(theta_all_optimized_cg)

for k in range(y.min(), y.max()+1):
    grdtruth = np.where(y==k, 1, 0)
    results = minimize(compute_cost_regularized, theta_all_optimized_bfgs_from_cg[k-1,:],
                       args = (X_bias, grdtruth, 0.1),
                       method = "BFGS",
                       jac = compute_gradient_regularized)
    # optimized parameters are accessible through the x attribute
    theta_optimized = results.x
    # Assign the theta_optimized vector to the appropriate row in the
    # theta_all matrix
    theta_all_optimized_bfgs_from_cg[k-1,:] = theta_optimized
```


During execution of the loop, only one of the iterations produced a message that showed a non-zero number of optimization routine iterations, meaning that further optimization was performed:

```
Optimization terminated successfully.
         Current function value: 0.078457
         Iterations: 453
         Function evaluations: 455
```


And the results were improved:

```
In : predict_one_vs_all(X_bias, theta_all_optimized_bfgs_from_cg)
Out: 96.439999999999998
```


By further training the parameters initially obtained from CG through an additional BFGS run, we have optimized them to give a prediction accuracy of 96.44%, which is very close to the 96.48% obtained by using BFGS alone!

I updated my notebook with this explanation.

Of course this raises more questions, such as why CG did not work as well as BFGS did on this cost function, but I guess those are questions meant for another post.

I think you should still test this on a hold-out test set, to rule out BFGS being broken instead. However, I was wondering since answering, whether adding regularisation is making the loss surface less simple . . . meaning that BFGS results are strictly better in that situation, but become unstable without regularisation on this data set. – Neil Slater – 2017-07-11T12:47:08.780

@NeilSlater: True, I agree that the best validation and standard practice is to run it on a testing dataset. Evaluating on a test set was not part of the Coursera assignment though, so no such test sets were provided to us. I will have to take a chunk out of the original MNIST. What you said seems plausible, since without regularization, conjugate gradient improves. However, if the loss surface were truly simpler, then why would CG still perform worse than BFGS, rather than the same? – AKKA – 2017-07-11T14:11:01.330