How to increase accuracy of classifiers?

I am using the OpenCV letter_recog.cpp example to experiment with random trees and other classifiers. The example implements six classifiers: random trees, boosting, MLP, kNN, naive Bayes and SVM. It uses the UCI letter recognition dataset (20000 instances, 16 features), which I split in half for training and testing. I have experience with SVM, so I quickly got its recognition error down to 3.3%. After some experimentation, these are the error rates I got:

UCI letter recognition:

  • RTrees - 5.3%
  • Boost - 13%
  • MLP - 7.9%
  • kNN(k=3) - 6.5%
  • Bayes - 11.5%
  • SVM - 3.3%

Parameters used:

  • RTrees - max_num_of_trees_in_the_forest=200, max_depth=20, min_sample_count=1

  • Boost - boost_type=REAL, weak_count=200, weight_trim_rate=0.95, max_depth=7

  • MLP - method=BACKPROP, param=0.001, max_iter=300 (default values - too slow to experiment)

  • kNN(k=3) - k=3

  • Bayes - none

  • SVM - RBF kernel, C=10, gamma=0.01

After that I used the same parameters on the Digits and MNIST datasets, extracting gradient features first (200-element vectors):

Digits:

  • RTrees - 5.1%
  • Boost - 23.4%
  • MLP - 4.3%
  • kNN(k=3) - 7.3%
  • Bayes - 17.7%
  • SVM - 4.2%

MNIST:

  • RTrees - 1.4%
  • Boost - out of memory
  • MLP - 1.0%
  • kNN(k=3) - 1.2%
  • Bayes - 34.33%
  • SVM - 0.6%

I am new to all of these classifiers except SVM and kNN; for those two the results seem fine. What about the others? I expected more from random trees: on MNIST even kNN gives better accuracy. Any ideas how to improve it? Boost and Bayes give very low accuracy. In the end I'd like to use these classifiers to build a multiple classifier system. Any advice?

Mika

Posted 2014-07-16T09:49:15.933

Reputation: 303

Answers

Dimensionality Reduction

Another important step is to compare the error rates on the training and test datasets to see whether you are overfitting (due to the "curse of dimensionality"). For example, if the error rate on the test dataset is much larger than on the training dataset, that is one indicator of overfitting.
In this case, you could try dimensionality reduction techniques, such as PCA or LDA.
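
The train-versus-test comparison described here can be sketched in a few lines. This is only an illustration: the function names and the toy labels are made up, and in practice the predicted labels would come from whichever classifier is being evaluated.

```python
def error_rate(true_labels, predicted_labels):
    """Fraction of samples whose predicted label differs from the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one sample")
    wrong = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return wrong / len(true_labels)


def overfitting_gap(train_error, test_error):
    """Test error minus training error; a large positive gap hints at overfitting."""
    return test_error - train_error


# Toy labels, made up for illustration only.
train_true = ["A", "B", "C", "A"]
train_pred = ["A", "B", "C", "A"]   # perfect fit on the training set
test_true = ["A", "B", "C", "A"]
test_pred = ["A", "C", "C", "B"]    # two mistakes on unseen data

train_err = error_rate(train_true, train_pred)
test_err = error_rate(test_true, test_pred)
gap = overfitting_gap(train_err, test_err)
print(f"train={train_err:.2f} test={test_err:.2f} gap={gap:.2f}")
```

How large a gap counts as "much larger" depends on the dataset; the sketch only makes the two numbers easy to put side by side.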

If you are interested, I have written about PCA, LDA and some other techniques here and in my GitHub repo here.

Cross validation

You may also want to take a look at cross-validation techniques in order to evaluate the performance of your classifiers in a more objective manner.
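
One common variant is k-fold cross-validation. The sketch below assumes plain (unstratified) k-fold with a fixed shuffle seed; the helper names and the `evaluate` callback are hypothetical stand-ins for training and testing a real classifier on the given index lists.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validated_error(n_samples, k, evaluate):
    """Average the error returned by evaluate(train_idx, test_idx) over k folds."""
    errors = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(errors) / len(errors)


# Show the fold sizes for a toy dataset of 10 samples split into 3 folds.
for train_idx, test_idx in k_fold_indices(10, 3):
    print(len(train_idx), "train /", len(test_idx), "test")
```

Averaging over folds gives an estimate that depends less on one particular train/test split than a single 50:50 split does.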

user2556

Posted 2014-07-16T09:49:15.933

Yes, indeed error rates on training data set are around 0. Changing parameters to reduce overfitting didn't result in higher accuracy on test dataset in my case. I will look into techniques you mention as soon as possible and comment, thank you. – Mika – 2014-07-17T16:16:51.433

What are the relative proportions of training and test dataset btw? Something like 70:30, 60:40, or 50:50? – None – 2014-07-17T16:38:07.737

First dataset - UCI letter recognition is set to 50:50 (10000:10000), Digits is about 51:49 (1893:1796) and MNIST is about 86:14 (60000:10000). – Mika – 2014-07-18T01:35:18.207

I experimented with PCA, still didn't get good results with random forest, but boost and Bayes now give results similar to other classifiers. I found a discussion about random forest here: http://stats.stackexchange.com/questions/66543/random-forest-is-overfitting It is possible I am actually not overfitting but couldn't find the out-of-bag (OOB) prediction error mentioned there. Running experiment now with a large number of trees to see if accuracy will improve.

– Mika – 2014-07-21T16:04:50.647

Okay, sounds like you are making a little bit of progress :) A trivial question, but have you standardized your features (z-score) so that they are centered around the mean with standard deviation=1? – None – 2014-07-21T16:19:55.410

Actually no, I usually would scale features to range 0-1 but now I see I didn't even do that correctly before PCA. So that would not be the right thing to do anyway? After PCA mean = 0, std = 0.5754. – Mika – 2014-07-21T16:41:29.853

It depends on your data whether you want to do a Min-max normalization to unit range (e.g., 0-1) or Z-score normalization/standardization to unit variance (variance=1, mean=0). Sorry, but I forgot that you are doing text classification. I think normalization after you stemmed the words and used a vectorizer function would not be necessary – None – 2014-07-23T16:22:00.253
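
The two options named in this comment can be written out for a single feature column. This is a sketch under assumptions: the function names and toy values are invented, and the scaling parameters are deliberately estimated from the training column only and then reused for the test column, so that no test information leaks into preprocessing.

```python
import math


def fit_min_max(column):
    """Return (min, max) of a training feature column."""
    return min(column), max(column)


def apply_min_max(column, params):
    """Map values to the unit range defined by the training min and max."""
    low, high = params
    span = high - low
    # A constant feature carries no information; map it to 0.0.
    return [(x - low) / span if span else 0.0 for x in column]


def fit_z_score(column):
    """Return (mean, population standard deviation) of a training column."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    return mean, math.sqrt(variance)


def apply_z_score(column, params):
    """Centre on the training mean and scale by the training std."""
    mean, std = params
    return [(x - mean) / std if std else 0.0 for x in column]


# Toy values, made up for illustration.
train_column = [2.0, 4.0, 6.0, 8.0]
test_column = [5.0, 10.0]

print(apply_min_max(test_column, fit_min_max(train_column)))
print(apply_z_score(test_column, fit_z_score(train_column)))
```

Note that test values scaled with training parameters can fall outside 0-1 (here 10.0 does), which is expected rather than a bug.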

It took me a while to try everything out, I had an error earlier with PCA, now I see I just get much lower accuracy when using it. I reduce dimensions to 100, and that should be fine, but SVM gives me 4% error on MNIST (0.6% without PCA) and over 20% error on DIGITS (4% without PCA). Same for other classifiers. Earlier I somehow made the error of doing PCA on the whole dataset (train and test sets) which gave me too optimistic results. – Mika – 2014-07-31T05:29:44.877

Which programming language are you using btw? If you are a Python guy, I'd have some examples here where I used PCA, maybe it helps: http://sebastianraschka.com/Articles/2014_about_feature_scaling.html http://sebastianraschka.com/Articles/2014_scikit_dataprocessing.html http://sebastianraschka.com/Articles/2014_pca_step_by_step.html Usually I prefer LDA since I am mostly working with supervised datasets (class labels), a separate article (like the step by step PCA) is in the works ;)

– None – 2014-07-31T13:47:17.910

I am using C++ for classification and Matlab to prepare datasets. I will check out your links and try LDA too. – Mika – 2014-07-31T16:04:48.890

I tried using LDA but can't get it working with my data. Matlab function classify should perform LDA but it works only up to 20 dimensions, at least on my data. Also I found that maximum dimensions given by LDA should be number_of_classes-1, which is too little. – Mika – 2014-08-03T09:20:08.827
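
The dimension limit mentioned in this comment can be made concrete. The sketch assumes the standard result that LDA yields at most min(n_features, n_classes - 1) discriminant directions; the function name is made up, and the class counts used below (26 letters, 10 digits) are assumptions about the datasets in the question.

```python
def max_lda_components(n_features, n_classes):
    """Upper bound on the number of discriminant directions LDA can produce."""
    if n_features < 1 or n_classes < 2:
        raise ValueError("need at least one feature and two classes")
    return min(n_features, n_classes - 1)


# UCI letters: 16 features, assumed 26 classes.
letters_limit = max_lda_components(16, 26)
# Digits / MNIST gradient features: 200 features, assumed 10 classes.
digits_limit = max_lda_components(200, 10)
print(letters_limit, digits_limit)
```

So for the 200-dimensional digit features, LDA would reduce to at most 9 dimensions, which is the limit the comment describes as too small.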

I just uploaded the LDA article, although I used Python for the step-wise implementation, the Intro might still be interesting and helpful: http://sebastianraschka.com/Articles/2014_python_lda.html

– None – 2014-08-03T21:40:36.477

Finally I found out what was going on... My function in Matlab that writes features to a file would add spaces sometimes and only on some datasets and then my reader function in C would apparently read wrong values... PCA actually helped, boost classifier is still bad but will try to play with parameters some more to make it work. Still didn't try LDA but will do that too. – Mika – 2014-08-05T08:49:26.577
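
A reader that tolerates stray spaces and checks the number of values per row would catch this kind of problem early. The sketch below is hypothetical: it assumes one sample per line with whitespace-separated numbers, which may not match the actual file format used here, and the original reader was written in C rather than Python.

```python
def parse_feature_line(line, expected_count):
    """Parse one whitespace-separated row of floats, tolerating extra spaces."""
    tokens = line.split()  # split() with no argument collapses runs of whitespace
    if len(tokens) != expected_count:
        raise ValueError(f"expected {expected_count} values, got {len(tokens)}")
    return [float(token) for token in tokens]


def parse_feature_lines(lines, expected_count):
    """Parse many rows, skipping blank lines and reporting the failing line number."""
    rows = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            rows.append(parse_feature_line(line, expected_count))
        except ValueError as exc:
            raise ValueError(f"line {number}: {exc}") from exc
    return rows


# Rows with irregular spacing, made up for illustration.
sample_lines = ["1.5  2   -3e-1\n", "\n", "  4 5 6  \n"]
rows = parse_feature_lines(sample_lines, expected_count=3)
print(rows)
```

Failing loudly on a wrong column count is the design choice that matters: silently misread values only show up later as unexplained accuracy changes.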

Nice! I am glad to hear that it was "just" a technical problem :). For supervised training samples, LDA is often (but not always) a better choice than PCA. There is a research article where the authors discuss this point: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=908974

– None – 2014-08-05T15:04:04.100

I expected more from random trees:

  • With random forests, typically for N features, sqrt(N) features are used for each decision tree construction. Since in your case N=20, you could try setting max_depth (the number of sub-features to construct each decision tree) to 5.

  • Instead of decision trees, linear models have been proposed and evaluated as base estimators in random forests, in particular multinomial logistic regression and naive Bayes. This might improve your accuracy.

On MNIST kNN gives better accuracy, any ideas how to get it higher?

  • Try with a higher value of K (say 5 or 7). A higher value of K would give you more supportive evidence about the class label of a point.
  • You could run PCA or Fisher's Linear Discriminant Analysis before running k-nearest neighbour. By this you could potentially get rid of correlated features while computing distances between the points, and hence your k neighbours would be more robust.
  • Try different K values for different points based on the variance in the distances between the K neighbours.
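
The first suggestion (trying several values of K) can be sketched as picking K on a held-out validation set. This is a minimal illustration with Euclidean distance and invented toy data, not the OpenCV kNN implementation; all names are made up, and it needs Python 3.8+ for `math.dist`.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Label a query point by majority vote among its k nearest training points."""
    neighbours = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    # On a tied vote, the label of the nearer neighbour wins.
    return votes.most_common(1)[0][0]


def knn_error(train_x, train_y, val_x, val_y, k):
    """Error rate of k-NN on a validation set."""
    wrong = sum(knn_predict(train_x, train_y, x, k) != y for x, y in zip(val_x, val_y))
    return wrong / len(val_y)


def pick_k(train_x, train_y, val_x, val_y, candidates=(1, 3, 5, 7)):
    """Return the candidate k with the lowest validation error (smallest k on ties)."""
    return min(candidates, key=lambda k: (knn_error(train_x, train_y, val_x, val_y, k), k))


# Toy data: two clusters plus one mislabelled point near the first cluster.
train_x = [(0, 0), (0, 1), (1, 0), (0.4, 0.4), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b", "b"]
val_x = [(0.5, 0.5), (5.5, 5.5)]
val_y = ["a", "b"]

best_k = pick_k(train_x, train_y, val_x, val_y)
print("best k:", best_k)
```

In this toy example k=1 is misled by the single noisy neighbour while k=3 outvotes it, which is the "more supportive evidence" effect described above; too large a K fails again because it lets the other cluster vote.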

Debasis

Posted 2014-07-16T09:49:15.933

Reputation: 1 476

I believe you are referring to OpenCV nactive_vars parameter (not max_depth), which I set to default sqrt(N) value, that is nactive_vars=sqrt(16) for first dataset and sqrt(200) for other two. max_depth determines whether trees grow to full depth (25 is its maximum value) and balances between underfitting and overfitting, more about it here: http://stats.stackexchange.com/questions/66209/opencv-parameters-of-random-trees Not sure about min_sample_count but I tried various values and setting it to 1 worked best.

– Mika – 2014-07-17T07:05:44.933

OpenCV documentation gives brief explanation of parameters: http://docs.opencv.org/modules/ml/doc/random_trees.html#cvrtparams-cvrtparams For now I would like to make random trees work reasonably well and keep things simple because I want to focus on working with a multiple classifier system.

– Mika – 2014-07-17T07:06:32.730
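
For the multiple classifier system mentioned here, one simple combination rule is majority voting. The thread does not say which rule is intended, so this is only a sketch under that assumption; the function names and the per-classifier predictions are invented for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine one sample's predictions from several classifiers into one label."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    # Ties go to the earliest classifier in the list, so order classifiers by trust.
    for label in predictions:
        if counts[label] == best:
            return label


def combine(per_classifier_predictions):
    """Vote sample by sample; input is one equal-length label list per classifier."""
    return [majority_vote(list(sample)) for sample in zip(*per_classifier_predictions)]


# Made-up predictions for three samples from three classifiers.
svm_pred = ["A", "B", "C"]
rtrees_pred = ["A", "D", "C"]
knn_pred = ["E", "B", "F"]

combined = combine([svm_pred, rtrees_pred, knn_pred])
print(combined)
```

Listing the most accurate classifier first (SVM in the results above) means it decides whenever all classifiers disagree.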

About kNN - these are all really good suggestions, but what I meant to say is that kNN performed better than random trees classifier and I think there is lots of room for improvement with random trees. – Mika – 2014-07-17T07:15:51.477

Yes, I'm not sure why random forest is not performing as well as (or better than) the simplistic k-NN approach... It might just be that a kernel-based approach such as k-NN, which directly estimates P(y|D) (output given data), works better here than parametric models, which first estimate P(theta|D) (latent model given data). – Debasis – 2014-07-17T09:28:49.887