I recently completed exercise 3 of Andrew Ng's Machine Learning course on Coursera using Python.

When initially completing parts 1.4 to 1.4.1 of the exercise, I ran into difficulty getting my trained model to match the expected accuracy of 94.9%. Even after debugging and ensuring that my cost and gradient functions were bug-free, and that my predictor code was working correctly, I was still getting only 90.3% accuracy. I was using the conjugate gradient (CG) algorithm in `scipy.optimize.minimize`.

Out of curiosity, I decided to try another algorithm, and used Broyden–Fletcher–Goldfarb–Shanno (BFGS). To my surprise, the accuracy improved drastically to 96.5%, exceeding the expectation. The comparison of these two different results between CG and BFGS can be viewed in my notebook under the header **Difference in accuracy due to different optimization algorithms**.

Is the reason for this difference in accuracy due to the different choice of optimization algorithm? If yes, then could someone explain why?

Also, I would greatly appreciate any review of my code just to make sure that there isn't a bug in any of my functions that is causing this.

Thank you.

**EDIT:** As requested in the comments, I have added the code involved in the question below, rather than referring readers to the links to my Jupyter notebooks.

Model cost functions:

```
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost_regularized(theta, X, y, lda):
    reg = lda / (2 * len(y)) * np.sum(theta[1:]**2)
    return 1 / len(y) * np.sum(-y @ np.log(sigmoid(X @ theta))
                               - (1 - y) @ np.log(1 - sigmoid(X @ theta))) + reg

def compute_gradient_regularized(theta, X, y, lda):
    beta = sigmoid(X @ theta) - y
    regterm = lda / len(y) * theta
    # theta_0 does not get regularized, so a 0 is substituted in its place
    regterm[0] = 0
    gradient = 1 / len(y) * (X.T @ beta) + regterm
    return gradient
```
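One quick way to rule out a bug in the gradient function is to compare it against a finite-difference estimate using SciPy's `scipy.optimize.check_grad` helper. This is a sketch on a small synthetic problem (the random data here is purely illustrative, not the exercise data):

```
import numpy as np
from scipy.optimize import check_grad

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost_regularized(theta, X, y, lda):
    reg = lda / (2 * len(y)) * np.sum(theta[1:]**2)
    return 1 / len(y) * np.sum(-y @ np.log(sigmoid(X @ theta))
                               - (1 - y) @ np.log(1 - sigmoid(X @ theta))) + reg

def compute_gradient_regularized(theta, X, y, lda):
    beta = sigmoid(X @ theta) - y
    regterm = lda / len(y) * theta
    regterm[0] = 0  # theta_0 is not regularized
    return 1 / len(y) * (X.T @ beta) + regterm

# Tiny synthetic problem: 20 examples, bias column plus 3 random features
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])
y = rng.integers(0, 2, size=20).astype(float)
theta = rng.normal(size=4)

# check_grad returns the norm of the difference between the analytic
# gradient and a finite-difference approximation; it should be tiny
err = check_grad(compute_cost_regularized, compute_gradient_regularized,
                 theta, X, y, 0.1)
print(err)
```

If `err` is on the order of `1e-6` or smaller, the analytic gradient agrees with the numerical one and an implementation bug there is unlikely.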

Function that implements one-vs-all classification training:

```
from scipy.optimize import minimize

def train_one_vs_all(X, y, opt_method):
    theta_all = np.zeros((y.max() - y.min() + 1, X.shape[1]))
    for k in range(y.min(), y.max() + 1):
        grdtruth = np.where(y == k, 1, 0)
        results = minimize(compute_cost_regularized, theta_all[k-1, :],
                           args=(X, grdtruth, 0.1),
                           method=opt_method,
                           jac=compute_gradient_regularized)
        # optimized parameters are accessible through the x attribute
        theta_optimized = results.x
        # Assign the theta_optimized vector to the appropriate row in the
        # theta_all matrix
        theta_all[k-1, :] = theta_optimized
    return theta_all
```

I then called the function to train the model with the different optimization methods:

```
theta_all_optimized_cg = train_one_vs_all(X_bias, y, 'CG') # Optimization performed using Conjugate Gradient
theta_all_optimized_bfgs = train_one_vs_all(X_bias, y, 'BFGS') # optimization performed using Broyden–Fletcher–Goldfarb–Shanno
```
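If CG is terminating early, one thing worth trying (a sketch, not part of my original notebook) is tightening the solver's stopping criteria through the `options` dict of `scipy.optimize.minimize`; `gtol` and `maxiter` are documented options for the CG method. Here is a self-contained toy example on a simple convex function standing in for the regularized cost:

```
import numpy as np
from scipy.optimize import minimize

# Toy convex objective with known minimum at theta = 3
def f(theta):
    return np.sum((theta - 3.0) ** 2)

def grad_f(theta):
    return 2 * (theta - 3.0)

# Tighter gradient tolerance and a larger iteration budget than the defaults
res = minimize(f, np.zeros(5), method='CG', jac=grad_f,
               options={'gtol': 1e-8, 'maxiter': 1000})
print(res.x)  # each component converges to 3.0
```

The same `options` dict could be passed through `train_one_vs_all` to the `minimize` call above.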

We see that prediction results differ based on the algorithm used:

```
def predict_one_vs_all(X, theta):
    return np.mean(np.argmax(sigmoid(X @ theta.T), axis=1) + 1 == y) * 100
```

```
In [16]: predict_one_vs_all(X_bias, theta_all_optimized_cg)
Out[16]: 90.319999999999993

In [17]: predict_one_vs_all(X_bias, theta_all_optimized_bfgs)
Out[17]: 96.480000000000004
```
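A diagnostic that could be added inside `train_one_vs_all` (a suggestion, not in the original code) is to check whether each per-class optimization actually converged: `minimize` returns an `OptimizeResult` whose `success`, `message`, and `nit` fields report how the run terminated. A minimal illustration on a toy objective:

```
import numpy as np
from scipy.optimize import minimize

# Toy objective with minimum at the origin
def f(theta):
    return np.sum(theta ** 2)

def grad_f(theta):
    return 2 * theta

res = minimize(f, np.ones(3), method='CG', jac=grad_f)
# success is False if the solver hit maxiter or stalled before reaching gtol
print(bool(res.success), res.message, res.nit)
```

If CG reports `success == False` for some classes while BFGS does not, that would explain the accuracy gap without any bug in the cost or gradient code.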

For anyone wanting the data to try the code, it is available on my GitHub as linked in this post.

Logistic regression should have a single stable minimum (like linear regression), so it is likely that something you haven't noticed is causing this. – Neil Slater – 2017-07-04T15:57:02.243

So there must be guaranteed convergence to the minimum cost? Would you be able to do a code review for me please? – AKKA – 2017-07-04T23:28:25.230

If there's a lot of code you need reviewing, maybe post it on codereview.stackexchange.com - if it is only a small amount required to replicate the problem, you could add it to your question here (edit it in as a code block; please include enough to fully replicate the problem). – Neil Slater – 2017-07-05T06:51:59.693

While it is true that reaching the global minimum should give the same result regardless of the optimization algorithm, there can be subtleties in an algorithm's implementation (i.e. the methods used to handle numerical stability, etc.) that may lead to slightly different solutions. These small differences in solutions can lead to larger performance differences when evaluated on a small test set, and that may be causing such a large performance difference in your case. And yes, in general, the optimization algorithm can largely influence the learning outcome. By the way, I got the desired result in MATLAB. – Sal – 2017-07-06T06:30:50.243

1@NeilSlater: ok, I have just added the code directly into the question as an edit. Does it look ok? – AKKA – 2017-07-06T15:07:22.133

@Sal: I see. May I ask if you had taken a look at my code to see if it is correct? – AKKA – 2017-07-06T15:07:55.883

@AKKA: I cannot see any obvious problem with the code after 10 mins scanning through it. Best I can think is that maybe the default value of `gtol` is too high for your data, which could occur if it is not normalised. I might take a deeper look later on, if no-one else has answered in the meantime. – Neil Slater – 2017-07-06T15:44:54.417

@NeilSlater: I had tried reducing the value of `gtol` already, based on a suggestion from someone in a duplicate post of my question on Stack Overflow: https://stackoverflow.com/questions/44915145/coursera-ml-does-the-choice-of-optimization-algorithm-affect-the-accuracy-of-m?noredirect=1#comment76806558_44915145 . Reducing `gtol` does not solve the problem :( No one has answered since you. Could you please help me take a deeper look? Thank you. – AKKA – 2017-07-08T07:53:33.783