I recently completed exercise 3 of Andrew Ng's Machine Learning on Coursera using Python.
When I first worked through parts 1.4 to 1.4.1 of the exercise, I had trouble getting my trained model to match the expected accuracy of 94.9%. Even after debugging and confirming that my cost and gradient functions were bug-free and that my predictor code worked correctly, I was still getting only 90.3% accuracy. I was using the conjugate gradient (CG) algorithm in
Out of curiosity, I decided to try another algorithm and used Broyden–Fletcher–Goldfarb–Shanno (BFGS). To my surprise, the accuracy improved drastically to 96.5%, exceeding the expected value. The comparison of the CG and BFGS results can be viewed in my notebook under the header "Difference in accuracy due to different optimization algorithms".
Is this difference in accuracy due to the choice of optimization algorithm? If so, could someone explain why?
Also, I would greatly appreciate any review of my code just to make sure that there isn't a bug in any of my functions that is causing this.
EDIT: As requested in the comments, I have added the code involved in the question below, rather than only referring readers to my Jupyter notebooks.
Model cost functions:
    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def compute_cost_regularized(theta, X, y, lda):
        reg = lda/(2*len(y)) * np.sum(theta[1:]**2)
        return 1/len(y) * np.sum(-y @ np.log(sigmoid(X@theta))
                                 - (1-y) @ np.log(1-sigmoid(X@theta))) + reg

    def compute_gradient_regularized(theta, X, y, lda):
        gradient = np.zeros(len(theta))
        XT = X.T
        beta = sigmoid(X@theta) - y
        regterm = lda/len(y) * theta
        # theta_0 does not get regularized, so a 0 is substituted in its place
        regterm[0] = 0
        gradient = (1/len(y) * XT@beta).T + regterm
        return gradient
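Since the question hinges on the cost and gradient being consistent with each other, here is a minimal sketch of how that could be checked numerically with `scipy.optimize.check_grad`. It uses small synthetic data rather than the exercise data; the names `cost`, `grad`, `X_demo`, `y_demo` and `theta_demo` are made up for this illustration, and the two functions are assumed to mirror the ones above.

```python
import numpy as np
from scipy.optimize import check_grad

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y, lda):
    # Regularized logistic cost; theta[0] (the bias) is not regularized
    h = sigmoid(X @ theta)
    reg = lda / (2 * len(y)) * np.sum(theta[1:] ** 2)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h)) + reg

def grad(theta, X, y, lda):
    # Analytic gradient of the cost above
    reg = lda / len(y) * theta
    reg[0] = 0  # no regularization on the bias term
    return X.T @ (sigmoid(X @ theta) - y) / len(y) + reg

# Synthetic stand-in data (assumption: NOT the exercise data)
rng = np.random.default_rng(0)
X_demo = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 4))])
y_demo = rng.integers(0, 2, size=50).astype(float)
theta_demo = rng.normal(scale=0.1, size=5)

# 2-norm of the difference between the analytic gradient and a
# finite-difference approximation; it should be close to zero
grad_error = check_grad(cost, grad, theta_demo, X_demo, y_demo, 0.1)
print(grad_error)
```

A value far from zero would point to a mismatch between the cost and the gradient, which can affect some optimizers more than others.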
Function that implements one-vs-all classification training:
    from scipy.optimize import minimize

    def train_one_vs_all(X, y, opt_method):
        theta_all = np.zeros((y.max()-y.min()+1, X.shape[1]))
        for k in range(y.min(), y.max()+1):
            grdtruth = np.where(y==k, 1, 0)
            results = minimize(compute_cost_regularized, theta_all[k-1,:],
                               args = (X,grdtruth,0.1),
                               method = opt_method,
                               jac = compute_gradient_regularized)
            # optimized parameters are accessible through the x attribute
            theta_optimized = results.x
            # Assign the theta_optimized vector to the appropriate row in the
            # theta_all matrix
            theta_all[k-1,:] = theta_optimized
        return theta_all
I called the function to train the model with each optimization method:
    # Optimization performed using Conjugate Gradient
    theta_all_optimized_cg = train_one_vs_all(X_bias, y, 'CG')
    # Optimization performed using Broyden–Fletcher–Goldfarb–Shanno
    theta_all_optimized_bfgs = train_one_vs_all(X_bias, y, 'BFGS')
We see that prediction results differ based on the algorithm used:
    def predict_one_vs_all(X, theta):
        return np.mean(np.argmax(sigmoid(X@theta.T), axis=1)+1 == y)*100

    In:  predict_one_vs_all(X_bias, theta_all_optimized_cg)
    Out: 90.319999999999993

    In:  predict_one_vs_all(X_bias, theta_all_optimized_bfgs)
    Out: 96.480000000000004
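One thing that might help narrow this down is looking at what `minimize` reports for each method, since the result object carries the final cost (`fun`), the iteration count (`nit`), and a `success` flag and `message`. The following is only a rough sketch on synthetic data, not the exercise data; `cost`, `grad`, `X_demo`, `y_demo` and `reports` are hypothetical names, and the cost is written in an algebraically equivalent form using `np.logaddexp` to avoid `log(0)` during line searches.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid

def cost(theta, X, y, lda):
    # Same regularized logistic cost, using the identity
    # -y*log(h) - (1-y)*log(1-h) = log(1 + exp(z)) - y*z
    z = X @ theta
    reg = lda / (2 * len(y)) * np.sum(theta[1:] ** 2)
    return np.mean(np.logaddexp(0, z) - y * z) + reg

def grad(theta, X, y, lda):
    reg = lda / len(y) * theta
    reg[0] = 0  # bias term is not regularized
    return X.T @ (expit(X @ theta) - y) / len(y) + reg

# Synthetic, noisy binary problem (assumption: stand-in for one digit class)
rng = np.random.default_rng(0)
m = 200
X_demo = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 3))])
true_w = np.array([0.5, 1.0, -1.0, 0.5])
y_demo = (X_demo @ true_w + rng.normal(size=m) > 0).astype(float)

reports = {}
for method in ("CG", "BFGS"):
    res = minimize(cost, np.zeros(4), args=(X_demo, y_demo, 0.1),
                   method=method, jac=grad)
    reports[method] = res
    # Compare the final cost and termination status of each optimizer
    print(method, res.success, res.nit, res.fun, res.message)
```

If one method ended with a noticeably higher `fun` or with `success` set to `False` on the real data, that would suggest it stopped before reaching the minimum rather than that the two methods are optimizing different things.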
Anyone who wants the data to try the code can find it on my GitHub, as linked in this post.