SVM using scikit learn runs endlessly and never completes execution

42

14

I am trying to run SVR using scikit-learn (Python) on a training dataset with 595605 rows and 5 columns (features) and a test dataset with 397070 rows. The data has been pre-processed and regularized.

I am able to successfully run the test examples, but when executing on my dataset and letting it run for over an hour, I still could not see any output or termination of the program. I have tried executing from a different IDE and even from the terminal, but that doesn't seem to be the issue. I have also tried changing the 'C' parameter value from 1 to 1e3.

I am facing similar issues with all SVM implementations in scikit-learn.

Am I not waiting long enough for it to complete? How much time should this execution take?

From my experience, it shouldn't require more than a few minutes.

Here is my system configuration: Ubuntu 14.04, 8GB RAM, lots of free memory, 4th gen i7 processor

tejaskhot

Posted 2014-08-18T10:46:57.360

Reputation: 855

Could you provide the code? Also, is it training or testing that takes so much time? How about smaller training/testing datasets? – ffriend 2014-08-18T12:09:04.657

I am just reading data from a csv file into a pandas dataframe and passing it to the scikit-learn function. That's all! Providing code wouldn't really help here. – tejaskhot 2014-08-18T12:49:25.843

sklearn's SVM implementation involves at least 3 steps: 1) creating the SVR object, 2) fitting a model, 3) predicting values. The first step specifies the kernel in use, which helps to understand the inner processes much better. The second and third steps are quite different, and we need to know at least which of them takes that long. If it is training, then it may be OK, because learning is sometimes slow. If it is testing, then there's probably a bug, because testing in SVM is really fast. In addition, it may be the CSV reading that takes that long and not the SVM at all. So all these details may be important. – ffriend 2014-08-18T13:22:28.580

Answers

42

Kernelized SVMs require the computation of a distance function between each pair of points in the dataset, which is the dominating cost of $\mathcal{O}(n_\text{features} \times n_\text{observations}^2)$. Storing all the distances is a burden on memory, so they're recomputed on the fly. Thankfully, only the points nearest the decision boundary are needed most of the time, and frequently computed distances are stored in a cache. If the cache is getting thrashed, the running time blows up to $\mathcal{O}(n_\text{features} \times n_\text{observations}^3)$.

You can increase this cache by invoking SVR as

from sklearn.svm import SVR
model = SVR(cache_size=7000)  # kernel cache size in MB (the default is 200)

In general, this is not going to be enough on its own. But all is not lost. You can subsample the data and use the rest as a validation set, or you can pick a different model. Above the 200,000-observation range, it's wise to choose linear learners.
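For example, here is a minimal sketch of the subsampling route, assuming X_train and y_train are your preprocessed arrays; the 20,000-row budget and the hyperparameters are illustrative values to keep the kernel fit tractable, not recommendations:

from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Fit on a random subsample and validate on the held-out remainder.
X_sub, X_rest, y_sub, y_rest = train_test_split(
    X_train, y_train, train_size=20000, random_state=0)
model = SVR(C=1.0, cache_size=2000).fit(X_sub, y_sub)
print(model.score(X_rest, y_rest))  # R^2 on the held-out portion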

Kernel SVM can be approximated by approximating the kernel matrix and feeding it to a linear SVM. This allows you to trade off between accuracy and performance in linear time.

A popular means of achieving this is to use 100 or so cluster centers found by kmeans/kmeans++ as the basis of your kernel function. The new derived features are then fed into a linear model. This works very well in practice. Tools like sofia-ml and vowpal wabbit are how Google, Yahoo and Microsoft do this. Input/output becomes the dominating cost for simple linear learners.
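A minimal sketch of that cluster-centers trick, assuming X_train, y_train and X_test are your preprocessed numpy arrays; the number of centers and gamma are illustrative values you would tune:

from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import LinearSVR

# Find ~100 basis points, then use RBF similarities to them as derived features.
centers = MiniBatchKMeans(n_clusters=100, random_state=0).fit(X_train).cluster_centers_
gamma = 1.0 / X_train.shape[1]                       # crude default; tune on validation data
Z_train = rbf_kernel(X_train, centers, gamma=gamma)  # n_samples x 100 feature matrix
Z_test = rbf_kernel(X_test, centers, gamma=gamma)

# A linear learner on the derived features scales roughly linearly with n_samples.
linear_model = LinearSVR(C=1.0).fit(Z_train, y_train)
predictions = linear_model.predict(Z_test)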

With an abundance of data, nonparametric models perform roughly the same for most problems; the exceptions are structured inputs like text, images, time series and audio.

Further reading

Jacob Mick

Posted 2014-08-18T10:46:57.360

Reputation: 521

9

SVM solves an optimization problem of quadratic order.

I do not have anything to add that has not been said here. I just want to post a link to the sklearn page about SVC, which clarifies what is going on:

The implementation is based on libsvm. The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.

If you do not want to use kernels, and a linear SVM suffices, there is LinearSVR, which is much faster because it uses an optimization approach à la linear regression. You'll have to normalize your data though, in case you're not doing so already, because it applies regularization to the intercept coefficient, which is probably not what you want. It means that if your data mean is far from zero, it will not be able to solve the problem satisfactorily.
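A minimal sketch of that route, assuming X_train, y_train and X_test are your arrays and with illustrative hyperparameters:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

# Standardize first so the penalized intercept is not an issue, then fit the linear SVR.
model = make_pipeline(StandardScaler(), LinearSVR(C=1.0, epsilon=0.1))
model.fit(X_train, y_train)
predictions = model.predict(X_test)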

You can also use stochastic gradient descent to solve the optimization problem. Sklearn features SGDRegressor. You have to use loss='epsilon_insensitive' to get results similar to a linear SVM. See the documentation. I would only use stochastic gradient descent as a last resort, though, because it implies a lot of hyperparameter tweaking to converge properly. Use LinearSVR if you can.
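A sketch of that alternative, again with illustrative hyperparameters and X_train, y_train, X_test assumed as above:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# The epsilon-insensitive loss makes SGDRegressor optimize the same objective as a linear SVR.
sgd = make_pipeline(
    StandardScaler(),
    SGDRegressor(loss='epsilon_insensitive', penalty='l2', alpha=1e-4, max_iter=1000))
sgd.fit(X_train, y_train)
predictions = sgd.predict(X_test)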

Ricardo Cruz

Posted 2014-08-18T10:46:57.360

Reputation: 1 642

I had a dataset with many rows. SVC started taking way too long for me at about 150K rows of data. I used your suggestion with LinearSVR, and a million rows takes only a couple of minutes. PS: I also found that the LogisticRegression classifier produces results similar to LinearSVR (in my case) and is even faster. – jeffery_the_wind 2017-05-01T07:50:24.180

6

With such a huge dataset I think you'd be better off using a neural network, deep learning, random forest (they are surprisingly good), etc.

As mentioned in earlier replies, the time taken is proportional to the third power of the number of training samples. Even the prediction time is polynomial in the number of test vectors.

If you really must use SVM, then I'd recommend using GPU acceleration or reducing the training dataset size. Try a sample of the data (10,000 rows, maybe) first to make sure the issue isn't with the data format or distribution.

As mentioned in other replies, linear kernels are faster.

Leela Prabhu

Posted 2014-08-18T10:46:57.360

Reputation: 98

4

Did you include scaling in your pre-processing step? I had this issue when running my SVM. My dataset is ~780,000 samples (rows) with 20 features (columns). My training set is ~235k samples. It turns out that I had just forgotten to scale my data! If this is the case, try adding this bit to your code:

# Scale the data to [-1, 1]; this increases SVM speed:
from sklearn.preprocessing import MinMaxScaler
scaling = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
X_train = scaling.transform(X_train)
X_test = scaling.transform(X_test)

Shelby Matlock

Posted 2014-08-18T10:46:57.360

Reputation: 41

2

This makes sense. IIUC, the speed of execution of support vector operations is bound by the number of samples, not the dimensionality. In other words, it is capped by CPU time and not RAM. I'm not sure exactly how much time this should take, but I'm running some benchmarks to find out.

Jaidev Deshpande

Posted 2014-08-18T10:46:57.360

Reputation: 139

1

Leave it running overnight or, better, for 24 hours. What is your CPU utilization? If none of the cores is running at 100%, then you have a problem, probably with memory. Have you checked whether your dataset fits into 8GB at all? Have you tried SGDClassifier? It is one of the fastest estimators in scikit-learn, and it is worth trying first in the hope that it completes in an hour or so.
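A quick sanity check for the memory question, assuming X_train and X_test are numpy arrays of your features:

# Raw size of the feature arrays; scikit-learn may make an additional copy during fit.
print(X_train.nbytes / 1e9, "GB for training features")
print(X_test.nbytes / 1e9, "GB for test features")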

Diego

Posted 2014-08-18T10:46:57.360

Reputation: 505

SGDClassifier does not support kernels. If the OP wants a linear SVM, then I would recommend first trying LinearSVR. It is much faster than SVR because it solves the problem using a linear regression library, and the global minimum is guaranteed (unlike with gradient descent). – Ricardo Cruz 2016-07-08T10:03:30.390

Appreciate your comment. Could you elaborate on why kernel support is an issue? – Diego 2016-07-09T10:25:00.390

From the documentation: "The loss function to be used. Defaults to 'hinge', which gives a linear SVM." The same goes for SGDRegressor. Using SGDRegressor is equivalent to using SVR(kernel='linear'). If that is what the OP wants, that's great. I was under the impression he wanted to use SVM with a kernel. If that is not the case, I would recommend he first try LinearSVR.

Ricardo Cruz 2016-07-09T13:37:09.213

1

I recently encountered a similar problem because I had forgotten to scale the features in my dataset, which I had previously used to train an ensemble model. Failure to scale the data is the likely culprit, as pointed out by Shelby Matlock. You can try the different scalers available in sklearn, such as RobustScaler:

from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X = scaler.fit_transform(X)

X is now transformed/scaled and ready to be fed to your desired model.

Dutse I

Posted 2014-08-18T10:46:57.360

Reputation: 11

0

Try standardising the data. I faced a similar problem, and upon standardisation everything worked fine. You can standardise the data easily using:

from sklearn import preprocessing
X_train = preprocessing.scale(X_train)
X_test = preprocessing.scale(X_test)  # better practice: fit one scaler on X_train and reuse it on X_test

Sujay_K

Posted 2014-08-18T10:46:57.360

Reputation: 11

@Archie This is an answer to a question, not a question. – timleathart 2017-11-12T12:02:18.940

-1

Try using the following code:

# X is your numpy data array.
from sklearn import preprocessing
X = preprocessing.scale(X)

Rishabh Gupta

Posted 2014-08-18T10:46:57.360

Reputation: 1

Welcome to Data Science SE! Could you explain how your suggestion will help the OP? What you are suggesting is scaling an array. It is not clear how that may or may not affect the SVR algorithm in scikit-learn. – Stereo 2017-01-04T14:20:16.720