Solving a system of equations with sparse data



I am attempting to solve a set of equations that has 40 independent variables (x1, ..., x40) and one dependent variable (y). The total number of equations (number of rows) is ~300, and I want to solve for the set of 40 coefficients that minimizes the total sum-of-squares error between y and the predicted value.

My problem is that the matrix is very sparse and I do not know the best way to solve the system of equations with sparse data. An example of the dataset is shown below:

   y    x1  x2 x3 x4 x5 x6 ... x40
87169   14  0  1  0  0  2  ... 0 
46449   0   0  4  0  1  4  ... 12
846449  0   0  0  0  0  3  ... 0

I am currently using a genetic algorithm to solve this, and the results come out with roughly a factor-of-two difference between observed and expected values.

Can anyone suggest different methods or techniques that are capable of solving a set of equations with sparse data?
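For concreteness, the fit described above — ~300 equations, 40 unknown coefficients, minimizing the sum of squared errors — is an ordinary least-squares problem. A minimal sketch in Python with NumPy, using synthetic stand-in data since the real dataset isn't shown:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: ~300 equations, 40 unknowns,
# with most entries zero, as in the example rows shown above.
X = rng.poisson(0.2, size=(300, 40)).astype(float)  # mostly zeros
true_coef = rng.uniform(0, 10, size=40)
y = X @ true_coef

# Least-squares solution: minimizes ||X b - y||^2 over b.
coef, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)

print("max abs fit error:", np.abs(X @ coef - y).max())
```

This is only to pin down the objective; it doesn't handle constraints, which is why the GA was attractive in the first place.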


Posted 2014-08-05T20:45:01.383

Reputation: 915

Typo in the title: spare => sparse. – Aleksandr Blekh – 2014-08-06T02:52:16.923



If I understand you correctly, this is a case of multiple linear regression with sparse data (sparse regression). Assuming that, I hope you will find the following resources useful.
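When the design matrix is stored sparsely, an iterative solver such as LSQR fits the least-squares problem without ever densifying the matrix. The resources below are R-based; purely as an illustration of the idea, a sketch in Python with SciPy on hypothetical data:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(1)

# Sparse 300 x 40 design matrix with ~5% nonzero entries,
# stored in compressed sparse row (CSR) format.
X = sparse_random(300, 40, density=0.05, random_state=1, format="csr")
true_coef = rng.uniform(0, 10, size=40)
y = X @ true_coef

# LSQR iteratively minimizes ||X b - y||^2 using only
# sparse matrix-vector products.
coef = lsqr(X, y)[0]

print("residual norm:", np.linalg.norm(X @ coef - y))
```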

1) NCSU lecture slides on sparse regression, with an overview of algorithms, notes, formulas, graphics, and references to the literature:

2) The R ecosystem offers many packages useful for sparse regression analysis, including:

3) A blog post with an example of a sparse regression solution based on SparseM:

4) A blog post on using sparse matrices in R, which includes a primer on using glmnet:

5) More examples and some discussion on the topic can be found on StackOverflow:

UPDATE (based on your comment):

If you're trying to solve an LP problem with constraints, you may find this theoretical paper useful:

Also, check the R package limSolve. More generally, check the packages in the CRAN Task View "Optimization and Mathematical Programming":

Finally, check the book "Using R for Numerical Analysis in Science and Engineering" by Victor A. Bloomfield. It has a section on solving systems of equations represented by sparse matrices (section 5.7, pages 99-104), with examples based on some of the packages mentioned above:

Aleksandr Blekh

Posted 2014-08-05T20:45:01.383

Reputation: 6 438

Thank you for the great answer! I am hesitant to classify the problem as sparse regression since I am not really trying to model and predict but rather solve for a set of coefficients. The reason I am using Genetic Algorithms is because I can also employ constraints on the equation. If no other answers come through I will gladly accept this though. – mike1886 – 2014-08-06T12:27:51.970

@mike1886: My pleasure! I have updated my answer, based on your comment. Hope it helps. – Aleksandr Blekh – 2014-08-06T21:34:14.897


Aleksandr's answer is completely correct.

However, the way the question is posed implies that this is a straightforward ordinary least squares regression question: minimizing the sum of squared residuals between a dependent variable and a linear combination of predictors.

Now, while there may be many zeros in your design matrix, your system as such is not overly large: 300 observations on 40 predictors is no more than medium-sized. You can run such a regression in R without any special effort for sparse data. Just use the lm() command (for "linear model"); use ?lm to see its help page. Note that lm() will by default silently add a constant column of ones to your design matrix (the intercept); include a -1 on the right-hand side of your formula to suppress it. Overall, assuming all your data (and nothing else) is in a data.frame called foo, you can do this:

model <- lm(y ~ . - 1, data = foo)

And then you can look at the parameter estimates etc. like this:

summary(model)

If your system is much larger, say on the order of 10,000 observations and hundreds of predictors, looking at specialized sparse solvers as per Aleksandr's answer may start to make sense.

Finally, in your comment to Aleksandr's answer, you mention constraints on your equation. If that is actually your key issue, there are ways to calculate constrained least squares in R. I personally like pcls() in the mgcv package. Perhaps you want to edit your question to include the type of constraints (box constraints, nonnegativity constraints, integrality constraints, linear constraints, ...) you face?
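To illustrate one common case — box or nonnegativity constraints — constrained least squares can also be done outside R. A sketch in Python with scipy.optimize.lsq_linear, assuming nonnegativity is the relevant constraint (the question doesn't say which kind actually applies):

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(2)

# Toy 300 x 40 system standing in for the real one.
X = rng.poisson(0.2, size=(300, 40)).astype(float)
y = X @ rng.uniform(0, 10, size=40)

# Least squares with a nonnegativity constraint on every coefficient:
# minimize ||X b - y||^2  subject to  b >= 0.
res = lsq_linear(X, y, bounds=(0, np.inf))

print("all coefficients nonnegative:", bool(np.all(res.x >= 0)))
```

For integrality or more exotic constraints, a dedicated solver (or indeed a GA) remains the right tool.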

Stephan Kolassa

Posted 2014-08-05T20:45:01.383

Reputation: 901

Stephan, I appreciate your kind words! Upvoted your nice answer. You might be interested in the update I made to my answer, based on comment by the question's author. – Aleksandr Blekh – 2014-08-06T21:44:31.100