Shaving the last 50 ms off NMinimize



The epilogue:

A paper in which the answer below is used has been published, and the answer is cited among its references :)

The background:

I have to fit an objective function to ~10 000 datasets in near real time (where "near real time" means on the order of 10 seconds). I compile the huge objective function, so that a single evaluation takes about 30 microseconds. I'm using the Nelder-Mead method, which on average converges in about 300 steps.

This means one minimization should take about 9 milliseconds (300 × 30 microseconds), and the whole fitting could be done in about 10 seconds on eight cores (300 × 10 000 × 0.00003 / 8 ≈ 11 s).

The problem:

No matter how quickly the objective function can be calculated, NMinimize[] always takes at least 50 ms. This means the fitting takes about 4 minutes on my eight cores.

The question: Can I somehow get rid of this 50 ms delay in NMinimize[]? Otherwise I will have to rewrite my program in C (or Delphi, which I like much better), but I'm still not looking forward to that.

Here is a toy example that illustrates the problem: regardless of the value of LOOPCOUNT, the "deadtime" is always similar.

LOOPCOUNT = 1000; (* filler work standing in for the real objective; any value shows the same dead time *)
Hi2p[a_?NumericQ, b_?NumericQ] := (
    Do[Cos[0.3], {LOOPCOUNT}];
    (a - Cos[a^2 - 3 b])^2 + (b - Sin[a^2 + b^3])^2
   );
n = 10000;
{t, tt} = Do[Hi2p[1, 1], {n}] // AbsoluteTiming;
Print["TimePerFunctionEvaluation: " <> ToString[t/n]];

{{time, sol}, {points}} = Reap[AbsoluteTiming[
    NMinimize[Hi2p[a, b], {a, b},
     Method -> {"NelderMead", "PostProcess" -> False, "RandomSeed" -> 1796},
     MaxIterations -> 500, PrecisionGoal -> 10,
     AccuracyGoal -> 100, EvaluationMonitor :> Sow[{a, b}]]
   ]];
{time, sol, Length[points]}
deadtime = time - t/n*Length[points];
Print["Deadtime: " <> ToString[deadtime]];

(*"TimePerFunctionEvaluation: 0.00004190240"       
==> {0.0650037, {3.08149*10^-33, {a -> 0.6037, b -> 0.429039}}, 278}
"Deadtime: 0.0536606"*)


Thank you, I have almost solved my real problem. I'm struggling with a technical detail, which stems from the fact that I don't completely understand how the in-lineable apply works. Here is the problem (I'm also not sure whether this should be a separate question or not...):

My objective function is set up so that it takes a number of data points in addition to the parameters to fit. The number of data points is the same in all datasets, but can change between experiments. This way I can compile the objective function just once for all 10 000 datasets I have to fit.

But the way NelderMeadMinimize is set up now, it compiles the algorithm for each different dataset, which has a non-negligible overhead, especially when compiling to C.

Here is a toy example:

ClearAll[f, toy, bench]
(*toy objective function which takes about 30 microseconds to run*)
toy = Compile[{a, b, c, x, y, z}, Do[Cos[RandomReal[]], {800}];
   (a - x)^2 + 50 (b - y)^2 + (c - z)^2,
   RuntimeOptions -> {"EvaluateSymbolically" -> False,
     "CompareWithTolerance" -> False}, CompilationTarget -> "C"];
SetAttributes[bench, HoldFirst];
bench[expr_, n_] :=
  Module[{tt, t}, {t, tt} = Do[expr, {n}] // AbsoluteTiming; t/n];
bench[toy @@ RandomReal[{-1, 1}, 6], 10000]
toyf = Evaluate@toy[Sequence @@ {1, 2, 3}, Sequence @@ {x, y, z}];

(* ==> 0.00003000005 *)

(*compile once to C*)
NelderMeadMinimize[toyf, {x, y, z}, 
  CompilationTarget -> "C"] // AbsoluteTiming
bench[NelderMeadMinimize[toyf, {x, y, z}, 
  CompilationTarget -> "C"], 100]

(* ==> {0.6600009, {1.82172*10^-21, {x -> 1., y -> 2., z -> 3.}}} *)

(* ==> 0.028300039 *)

(*compile once to WVM*)
NelderMeadMinimize[toyf, {x, y, z}, 
  CompilationTarget -> "WVM"] // AbsoluteTiming
bench[NelderMeadMinimize[toyf, {x, y, z}, 
  CompilationTarget -> "WVM"], 100]

(* ==> {0.0400000, {1.82172*10^-21, {x -> 1., y -> 2., z -> 3.}}} *)

(* ==> 0.028200039 *)

(*but these seem to get compiled every time*)
bench[NelderMeadMinimize[
   Evaluate[toy[Sequence @@ {1, 2, 3}, Sequence @@ {x, y, z}]], {x, y,
    z}, CompilationTarget -> "WVM"], 100]
bench[NelderMeadMinimize[
   Evaluate[toy[Sequence @@ {1, 2, 3}, Sequence @@ {x, y, z}]], {x, y,
    z}, CompilationTarget -> "C"], 5]

(* ==> 0.032300045 *)

(* ==> 0.51200072 *)

From these benchmarks one can see that:

  • Compilation to C is much slower than to WVM (already known :)
  • It does not matter much whether the algorithm is compiled to C or WVM
  • In the last benchmark, the algorithm is compiled on every call (each minimization takes half a second)

Is there a way to compile the minimization algorithm just once for the whole data set?

So that code similar to this would get compiled only once instead of 5 times:

    bench[NelderMeadMinimize[
      Evaluate[toy[Sequence @@ RandomReal[{-1, 1}, 3], Sequence @@ {x, y, z}]],
      {x, y, z}, CompilationTarget -> "C"], 5]


I managed to hack a version which works great for my purposes here.

I can compile once like this

(*Hi2 takes first the variables, then the data points of the spectrum \
as constants*)
vars = {int, p1, lm1, lm2, bl1, bl2};
symbolicSpectrum = Array[Unique[a] &, Length[spectrum]];
cm = NelderMeadMinimize`Dump`CompiledNelderMead[Hi2, Evaluate[vars], 
    symbolicSpectrum, CompilationTarget -> "C"]; // AbsoluteTiming

Then I call the compiled function 10000 times (once for each different spectrum).

(*then I just call cm 10000 times, once for each spectrum*)
cm[initialPoints, spectrum, 10^-9, 300]

This takes about 12 ms per dataset on my computer, with 10 ms spent on the evaluation of the objective function. I think this is simply amazing!

Oh, and by setting

$CCompiler = CCompilerDriver`IntelCompiler`IntelCompiler;

I got a free factor of 2 (so it takes about 6 ms per data set).


Posted 2012-04-24T13:48:28.817

Reputation: 13 114

Congratulations on the published article! – shrx – 2013-12-18T18:52:37.203

Congratulations on your publication! As it happens, a friend of mine has just submitted an article to Optics Express, for which he has received favorable reviews. And, thank you very much for the citation! – Oleksandr R. – 2013-12-19T00:32:44.217

@OleksandrR No, no, thank you! Well, you won't get any "academic" points for the citation, but at least credit is given where it is due. Our experiences with Optics Express are positive: the review was useful and the article was published in ~3 weeks. – Ajasja – 2013-12-24T22:28:14.007

Note that AbsoluteTiming includes everything going on on your computer until the calculation finished. What results do you get with Timing instead of AbsoluteTiming? – celtschk – 2012-04-24T14:23:02.560

@OleksandrR. No, no, the 1000 digits are just there so that I get approximately the same number of iterations as in my real example. Of course I don't need that much precision. – Ajasja – 2012-04-24T14:35:52.360

I don't understand your code. What's the purpose of Do[Cos[0.3], {LOOPCOUNT}]? Is it just filler to represent something in your real problem? – rcollyer – 2012-04-24T14:39:33.023

@rcollyer It's used instead of Pause[]. So that one evaluation of Hi2p takes about 40 microseconds. – Ajasja – 2012-04-24T14:42:25.973

I see: it is to represent the length of time it takes to evaluate the full version of Hi2p. – rcollyer – 2012-04-24T14:43:20.913

@rcollyer Yes, exactly. – Ajasja – 2012-04-24T14:44:18.247

Any chance the data sets have some similarity? If so, you might do better by solving for one, then passing that result as starting values to FindMinimum for the remaining sets. – Daniel Lichtblau – 2012-04-24T17:33:52.950

@DanielLichtblau I had the same idea, but did not get around to implementing it yet:) – Ajasja – 2012-04-24T20:24:41.500

@OleksandrR. You're welcome. Don't worry, now that I understand how apply works I should be able to fix it myself. – Ajasja – 2012-05-04T11:16:11.377

@OleksandrR. No problem. I hacked a version which works for my purposes here. I adapted my objective function, so that the parameters I want to optimize come first and the points from the data set come last. That way I can just compile once for all the datasets.

– Ajasja – 2012-05-07T10:56:05.727

Finally updated as promised. Hope that helps, although I've enjoyed working on this problem, so the bounty's really not necessary--IMO you're better off keeping your rep points, as the thresholds for doing things on the site (e.g. editing posts) will be quite a lot higher after it gets out of beta. – Oleksandr R. – 2012-05-08T06:13:23.510

@OleksandrR. I'm sure I'll get more reputation points, especially if it keeps being so interesting to follow. Besides, it's only 100 points, so it's mostly symbolic (IMO the answer deserves more). – Ajasja – 2012-05-08T15:42:58.223

Thanks for the bounty! I hope the last update was useful. Are we getting close to your target of 10,000 datasets in 10 seconds? How many points are in your spectra, by the way? I suspect more than the 100 points I used in the example... – Oleksandr R. – 2012-05-08T17:27:23.283

@OleksandrR. You're very welcome. Right now I'm on the limit at ~15 s (6 ms per dataset * 10000 on six cores, but the parallel scaling is not completely linear). But I think I may well surpass it, once I start using the solutions of fits as initial conditions for similar datasets. I have from 20 to 40 points in a dataset (depends on how the experiment was done), but the model is quite complicated (a convolution of a Gaussian with a difference of two exponential functions). My hacked solution was working fine, but the update will enable me to code it even more elegantly! – Ajasja – 2012-05-08T17:52:09.390



As promised in the comments on my first answer, here is an implementation of an all-compiled-code Nelder-Mead minimizer, which hopefully represents a more useful response to the question. The algorithm used here corresponds to that given by Lagarias et al. in SIAM J. Optim. 9 (1), 112 (1998) (abridged .pdf). It is compatible with Mathematica versions 6, 7, 8, and 9, but not 5.2 or any previous version. This is not only due to the use of the new-in-6 functions OptionsPattern, FilterRules, and OptionValue, but also to apparent limitations of the compiler--in particular, the robustness of the type inference mechanism was not entirely satisfactory prior to version 6.

The code is many times faster than NMinimize in all versions, although I would recommend using Mathematica 8 or 9 if possible. The performance of the compiled code is much better here than in versions 6 and 7, and many more functions are supported for compilation. Compilation to native code via C can also result in substantially improved performance. In fact, LibraryLink and/or the Mathematica runtime seem to have gained additional performance improvements in version 9, so this seems to be the optimal version to use as of this posting, being about 25% faster even than version 8.

A very important consideration is that, if the minimand is not compilable, performance will suffer due to calls out of compiled code to perform top-level evaluations. Indeed, these calls are so expensive that the resulting compiled function may easily be slower than the equivalent top-level code. It's also worth noting that FindMinimum possesses a very efficient implementation, so if only local optimization is needed, that function is likely to remain the best choice. For global optimization, an advantageous strategy might consist of using this package to quickly explore large parts of the parameter space (perhaps trying many different initial simplices) followed by the use of the optimized values as a starting point for FindMinimum, which will provide tight convergence to the final result.

Unlike for NMinimize, constrained optimization is not supported, because the Nelder-Mead algorithm is fundamentally an unconstrained method. For constrained problems, NMinimize performs a sort of regularization of the minimand such that the minimum of the resulting function fulfils the Karush-Kuhn-Tucker conditions, thus allowing unconstrained methods to continue to be used. I may include this in a future update, but currently it is not implemented. Another difference relative to NMinimize is the convergence criterion: the one used here can more easily distinguish slow convergence from the minimizer having stalled without finding a minimum, which is useful for poorly behaved minimands. Instead of PrecisionGoal/AccuracyGoal, one specifies a tolerance, "ConvergenceTolerance", for the minimum allowed change in the average function value (sampled at the vertices of the simplex) within a given number of iterations. The default settings typically result in tighter convergence than NMinimize achieves, while still terminating the optimization if it genuinely does not converge.
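To make the criterion concrete, here is one plausible reading of it as a sketch; this is not the package's actual code, and all names are illustrative:

```mathematica
(* Sketch: track the mean objective value over the simplex vertices and
   terminate when it has changed by less than tol within the last k
   iterations. convergedQ, meanHistory, tol, and k are illustrative
   names, not the package's internals. *)
convergedQ[meanHistory_List, tol_?NumericQ, k_Integer] :=
 Length[meanHistory] >= k &&
  Abs[meanHistory[[-k]] - meanHistory[[-1]]] < tol
```

A criterion of this shape terminates both on genuine convergence (the mean value settles) and on stalling (no progress within the window), which is the distinction described above.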

This latest update contains several fixes and improvements:

  • Option handling for NelderMeadMinimize has been fixed--options given for this function were incorrectly being overridden by those of NelderMeadMinimize`Dump`CompiledNelderMead in previous versions, which would have been confusing to the user.
  • It is now possible to refer to NelderMeadMinimize`Private`dimension in options. This value represents the dimension of the problem and allows one to specify this parameter in an abstract way. An application of this will be demonstrated below.
  • The interpretation of values given for the "InitialPoints" option has been improved.
  • Diagnostics in NelderMeadMinimize`Dump`CompiledNelderMead can now be enabled for any return type. When disabled (as by default), the operation counts will no longer be maintained and no reference to them will appear in the compiled code. When enabled, these values will be given along with their descriptions in NelderMeadMinimize`Dump`OperationCounts on return.
  • The code has undergone some general clean-up and should be easier to read as a result.
  • An additional test function, the rotated (nonseparable) hyperellipsoid, has been provided. The rotation should not present much of a hindrance for the Nelder-Mead algorithm, but this is not necessarily the case for other approaches, particularly when scaled to hundreds of dimensions, whereupon e.g. differential evolution begins to encounter difficulties with it. This function is therefore useful for comparative purposes.

The package can no longer be presented in a code block in this post because it is too long, so please download it from its GitHub repository here.

Because the question involves performing a large number of similar minimizations, and in order to avoid expensive calls out of compiled code, I decided to inline the minimand into the Nelder-Mead algorithm itself. For each function minimized with a given set of options, compiled code is generated on the first call and memoized in order to amortize the compilation overhead over subsequent calls. The minimization can be run again with a different starting simplex (or some other set of initial points, specified via the "InitialPoints" option), or with different settings for "RandomSeed", "ConvergenceTolerance", or MaxIterations without re-compilation. Changing any other parameters or options will result in a new minimizer being generated.
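The memoization described here follows the standard self-redefining definition idiom; schematically (with `generateCompiledMinimizer` a hypothetical stand-in for the actual code generation, not part of the package API):

```mathematica
(* On the first call with a given minimand/options specification, the
   expensive code generation runs once and its result is stored as a
   more specific definition; identical later calls hit the cached
   value. generateCompiledMinimizer is a hypothetical stand-in. *)
minimizerFor[spec_] := minimizerFor[spec] = generateCompiledMinimizer[spec]
```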

To further reduce overheads, very little error checking is done. In fact, only the forms of some of the arguments are verified. As a result, if incorrect parameters or options are specified, the resulting errors will be returned to the top level.

For testing, I've included a few simple problems: the n-dimensional shifted hyperellipsoid function and its rotated counterpart (Schwefel's problem 1.2) and the n-dimensional generalized Rosenbrock's function. Contrary to common belief, the latter is not a unimodal function for all n: as shown by Yun-Wei Shang and Yu-Huang Qiu in Evol. Comp. 14 (1), 119-126 (2006) (link), there are actually two minima for n $\ge$ 4, and the Nelder-Mead algorithm (which is not strictly a global optimization algorithm) might converge to either of them. While these problems are not very difficult, I think they serve well enough for expository purposes. So, let's test the code. First, a simple usage example:

NelderMeadMinimize[x^2 + y^2, {x, y}]

(* or, equivalently, *)
NelderMeadMinimize[Function[{x, y}, x^2 + y^2], {x, y}]

(* or even: *)
With[{cf = Compile[{{x, _Real, 0}, {y, _Real, 0}}, x^2 + y^2]},
 NelderMeadMinimize[cf, {x, y}]
]

(* -> {4.53016*10^-20, {x -> 1.90885*10^-10, y -> 9.41508*10^-11}} *)

(Note that when the minimand is passed as a pure or compiled function, the names of the variables are not actually important; they can be anything, as we demonstrate below. Note also that NelderMeadMinimize has HoldAll--although this can safely be removed if you prefer consistency with NMinimize to the convenience of not having to Block your variables.)

Now, a performance comparison:

(* Generate some variables *)
vars = Block[{x}, Unique[ConstantArray[x, 10], Temporary]];

(* First let's try NMinimize: *)
NMinimize[
  NelderMeadMinimize`Dump`Hyperellipsoid @@ vars, vars,
  Method -> {"NelderMead", "PostProcess" -> False}, MaxIterations -> 10000
 ] // Timing

(* -> {0.515625, {8.34607*10^-9, {
                  x$405 -> 0.999988, x$406 -> 1.000010, x$407 -> 1.,
              x$408 -> 1.000030, x$409 -> 0.999995, x$410 -> 1.00001,
                  x$411 -> 0.999999, x$412 -> 1.000020, x$413 -> 1.00001,
              x$414 -> 1.00001}}} *)

(* Now NelderMeadMinimize: *)
NelderMeadMinimize[
 NelderMeadMinimize`Dump`Hyperellipsoid, Evaluate[vars],
 CompilationTarget -> "C"
] // Timing

(* -> {0.391375, {1.73652*10^-16, {
                  x$405 -> 1., x$406 -> 1., x$407 -> 1., x$408 -> 1.,
                  x$409 -> 1., x$410 -> 1., x$411 -> 1., x$412 -> 1.,
                  x$413 -> 1., x$414 -> 1.}}} *)

We've achieved much better convergence, somewhat faster than NMinimize. But this includes the time taken for compilation to C! Trying again now that the minimizer has already been generated reveals that almost all of the above timing is in fact due to the compilation step:

Do[
 NelderMeadMinimize[
  NelderMeadMinimize`Dump`Hyperellipsoid, Evaluate[vars],
  CompilationTarget -> "C"
 ], {100}
] // Timing

(* -> {1.296875, Null} *)

On the second and subsequent minimizations, we beat NMinimize by a factor of around 40, despite a tighter convergence tolerance. Let's now even the odds, as it's well known that the Nelder-Mead algorithm is quite slow to converge to very tight tolerances:

Do[
 NelderMeadMinimize[
  NelderMeadMinimize`Dump`Hyperellipsoid, Evaluate[vars],
  "ConvergenceTolerance" -> 10^-9,
  CompilationTarget -> "C"
 ], {100}
] // Timing

(* -> {0.953125, Null} *)

That's a better than 50-fold improvement over NMinimize in a more or less fair test, with each minimization of this 10-dimensional function taking under 10 ms. It may be of interest in some cases to record the number of function evaluations and the types of steps taken by the Nelder-Mead algorithm. If so, we may set the option "Diagnostics" -> True: after re-running the optimization we then find the relevant information recorded in the value of NelderMeadMinimize`Dump`OperationCounts:

(* {"Function evaluations" -> 2441,
    "Reflections" -> 1289, "Expansions" -> 80,
    "Contractions" -> 347, "Shrinkages" -> 0} *)

If absolutely minimum overhead is required, the compiled minimizer can be called directly, with the requirements that the minimand is given as a Function or CompiledFunction and the starting simplex is fully specified (taking the form of an array of real numbers having dimensions {d + 1, d}, where d is the dimension of the problem). Also, to specify MaxIterations -> Infinity, the third parameter should be a negative integer. This works as follows:

Do[
 NelderMeadMinimize`Dump`CompiledNelderMead[
  NelderMeadMinimize`Dump`Hyperellipsoid, vars,
  CompilationTarget -> "C"
 ][RandomReal[{0, 1}, {Length[vars] + 1, Length[vars]}], 10^-9, -1],
 {100}
] // Timing

(* -> {0.734375, Null} *)

This was a bit more work, but we have now achieved a 70-fold improvement over NMinimize. However, it should be noted that timings are generally much more sensitive to the initial simplex (and thus the number of iterations performed before convergence) than to the method in which the code is called. Working with the compiled minimizer directly is therefore perhaps better thought of as a means to incorporate it as a building block into other code (as shown below, where many minimizations are performed in parallel) than a means of achieving higher performance in its own right.

Now, we try to minimize the 50-dimensional Rosenbrock's function, even though the performance of the Nelder-Mead algorithm is usually worse (both slower and less reliable) than other methods (e.g. Storn-Price differential evolution) for high-dimensional minimization:

vars = Block[{x}, Unique[ConstantArray[x, 50], Temporary]];

NelderMeadMinimize[
 NelderMeadMinimize`Dump`Rosenbrock, Evaluate[vars],
 "RandomSeed" -> 10, CompilationTarget -> "C"
] // Timing

(* -> {24.109375, {2.44425*10^-15, {
                   x$567 -> 1., x$568 -> 1., x$569 -> 1., x$570 -> 1., x$571 -> 1.,
               x$572 -> 1., x$573 -> 1., x$574 -> 1., x$575 -> 1., x$576 -> 1.,
                   x$577 -> 1., x$578 -> 1., x$579 -> 1., x$580 -> 1., x$581 -> 1.,
               x$582 -> 1., x$583 -> 1., x$584 -> 1., x$585 -> 1., x$586 -> 1.,
                   x$587 -> 1., x$588 -> 1., x$589 -> 1., x$590 -> 1., x$591 -> 1.,
               x$592 -> 1., x$593 -> 1., x$594 -> 1., x$595 -> 1., x$596 -> 1.,
                   x$597 -> 1., x$598 -> 1., x$599 -> 1., x$600 -> 1., x$601 -> 1.,
               x$602 -> 1., x$603 -> 1., x$604 -> 1., x$605 -> 1., x$606 -> 1.,
                   x$607 -> 1., x$608 -> 1., x$609 -> 1., x$610 -> 1., x$611 -> 1.,
               x$612 -> 1., x$613 -> 1., x$614 -> 1., x$615 -> 1., x$616 -> 1.}}} *)

We found the minimum in a reasonable time, although it required a non-default random seed to do so. As a performance comparison, a differential evolution minimizer I wrote in Python can minimize this function in about 21 seconds, which is certainly better considering that Python is interpreted while the result shown here is after compilation to C. However, this is still a huge improvement over NMinimize, which cannot detect convergence properly in this case, taking over 7 times as long trying (and ultimately failing) to find the minimum. In fact, we can do better if we employ the modified ("adaptive") scale parameters proposed by Fuchang Gao and Lixing Han in Comput. Optim. Appl. 51 (1), 259-277 (2012) (.pdf available from Gao's website):

With[{dim := NelderMeadMinimize`Private`dimension},
  NelderMeadMinimize[
   NelderMeadMinimize`Dump`Rosenbrock, Evaluate[vars],
   "ReflectRatio" -> 1, "ExpandRatio" :> 1 + 2/dim,
   "ContractRatio" :> 3/4 - 1/(2 dim), "ShrinkRatio" :> 1 - 1/dim,
   CompilationTarget -> "C"
  ]
 ] // Timing

(* -> {16.781250, {1.5829*10^-15, { identical result omitted }}} *)

As described in the paper, these parameter values improve the efficacy of the expansion and contraction steps for high-dimensional problems and help to prevent the simplex from degenerating into a hyperplane, which otherwise would lead to failure of the Nelder-Mead algorithm. Convergence is thus achieved more reliably, to tighter tolerances, and without having to adjust the random seed. What is not so clear from this example is that, in favorable cases, the performance improvements can also be very dramatic: for the 35-dimensional hyperellipsoid function, for instance, the modified parameters yield a tenfold reduction in execution time versus the classical values. I would thus strongly recommend at least trying the modified settings for larger problems, hence the incorporation of the symbol NelderMeadMinimize`Private`dimension to represent the problem dimension when doing so.

Edit: response to Ajasja's edits

Re-compilation of the minimizer on every call for a CompiledFunction objective function was a result of my inadequate testing of this case, so thanks go to Ajasja for noticing and reporting this issue, as well as the bug in handling specified initial points. Until it was pointed out to me by Leonid, I had somehow managed to overlook the fact that CompiledFunctions contain (in their second argument) patterns specifying the types of the arguments they accept. Pattern matching a CompiledFunction against itself will therefore not produce a match unless Verbatim is used to wrap the CompiledFunction being treated as the pattern, and memoization of function values similarly will not work where a CompiledFunction appears in an argument unless Verbatim is used in defining the applicable DownValue. This issue has now been fixed and compiled objective functions will be properly memoized by the updated version (both posted code and downloadable files) above.
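The CompiledFunction pattern-matching pitfall can be reproduced in a few lines; the following is a minimal illustration (with `expensiveSetup` a hypothetical stand-in, not part of the package):

```mathematica
cf = Compile[{{x, _Real}}, x^2];

(* A CompiledFunction used as a pattern does not match itself, because
   the _Real argument-type specifications in its second element are
   treated as patterns rather than literal expressions: *)
MatchQ[cf, cf]           (* False *)
MatchQ[cf, Verbatim[cf]] (* True *)

(* Consequently, memoization on a CompiledFunction argument must wrap
   the stored left-hand side in Verbatim, or the cached DownValue will
   never fire. expensiveSetup is a hypothetical stand-in: *)
cached[f_CompiledFunction] :=
 With[{result = expensiveSetup[f]}, cached[Verbatim[f]] = result]
```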

The second point regards the question of how to incorporate parameters in the objective function other than the values actually being minimized. In fact this was possible without any additional modifications right from the first posted version of the code, although I didn't make this explicit or specify how it can be done. I hope to rectify this omission now, alongside describing how to obtain the best performance from this approach.

Let's take an example related to the scenario described in the question: namely, fitting a model to data. Here we will perform least-squares fitting of a cubic polynomial, i.e. the minimand is,

Norm[data - (a + b x + c x^2 + d x^3)]

with data (ordinate values only) and x (abscissae) given, and a, b, c, and d being the values under optimization. (In principle, using the monomials as basis functions is less than ideal because of its numerical instability, which combines poorly with the Nelder-Mead algorithm's tendency to get trapped in local minima. Practically speaking, it works well enough as an example.) This can be given to NelderMeadMinimize essentially directly:

fitter = Block[{data = #},
    NelderMeadMinimize[
     Block[{x = Range@Length[data]}, Norm[data - (a + b x + c x^2 + d x^3)]],
     {a, b, c, d}
    ]
   ] &

The point to note here is that data appears lexically in the minimand, but not as a variable as far as NelderMeadMinimize is concerned. The first time this is called with actual data, compiled code will be generated that is a closure over the non-localized symbol data; where data is referenced, code is generated for a call into the main evaluator to retrieve its value. (The Block inside the minimand isn't relevant to this; it simply generates abscissae suitable for the given data and will be compiled completely since x is localized.) As it's the symbol data that appears inside the minimand and not actual data, compilation occurs only once rather than for every dataset fitted.
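The call into the main evaluator can be seen directly with the standard CompiledFunctionTools` package; this is a minimal illustration, not the actual fitter:

```mathematica
Needs["CompiledFunctionTools`"]

(* data is a global (non-localized) symbol here, so the compiler cannot
   reason about its value and emits a call back into the main
   evaluator to fetch it at run time: *)
closure = Compile[{{a, _Real}}, a + Total[data]];
CompilePrint[closure]
(* the printed instruction listing contains a MainEvaluate call *)
```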

We try it:

datasets = Accumulate /@ RandomReal[{-1, 1}, {5, 100}];
fits = fitter /@ datasets;
fittedmodels = a + b x + c x^2 + d x^3 /. fits[[All, 2]];

Show[
 {Plot[fittedmodels, {x, 1, Last@Dimensions[datasets]}],
  ListLinePlot[datasets]}, PlotRange -> All
]


Plot of data and fitted models

which seems like it was reasonably successful. So, this is a fine proof of principle, but quite slow ($\approx$ 100 ms/fit) due to the expensive call out of compiled code on every objective function evaluation. Obviously, we can do much better.

Since the aim is to completely eliminate calls into the main evaluator while fitting an arbitrary number of datasets, the new option "ReturnValues" of NelderMeadMinimize`Dump`CompiledNelderMead will come in useful. This enables compiled minimizers to be generated that produce any of several different return values, facilitating their use as building blocks in other compiled code. The possible option values are:

  • "OptimizedParameters": return a list of the values of the variables that minimize the objective function.
  • "AugmentedOptimizedParameters": as for "OptimizedParameters", but with the corresponding (minimal) value of the objective function prepended.
  • "Simplex": like "OptimizedParameters", but now returning a list of all d + 1 points of the final simplex obtained by the Nelder-Mead algorithm.
  • "AugmentedSimplex": as for "Simplex", but with each point having the corresponding value of the objective function prepended to it.

It seems to me that "ReturnValues" -> "OptimizedParameters" is the most suitable for the present application, so let's proceed as such. We now turn to the question of the parameter value accesses.

As Leonid has noted here, if compiled closures are inlined (using CompilationOptions -> {"InlineCompiledFunctions" -> True}) into other compiled code containing the values they close over, calls to the main evaluator can be eliminated entirely:

With[{
   minimizer = NelderMeadMinimize`Dump`CompiledNelderMead[
     Function[{a, b, c, d},
      Block[{x = Range@Length[data]}, Norm[data - (a + b x + c x^2 + d x^3)]]
     ], {a, b, c, d}, "ReturnValues" -> "OptimizedParameters"
    ],
   epsilon = $MachineEpsilon
  },
  serialFitter = Compile[{{datasets, _Real, 2}},
    Table[minimizer[RandomReal[{0, 1}, {4 + 1, 4}], epsilon, -1], {data, datasets}],
    CompilationOptions -> {"InlineCompiledFunctions" -> True},
    RuntimeOptions -> {"Speed", "EvaluateSymbolically" -> False},
    CompilationTarget -> "C"
   ];
  parallelFitter = Compile[{{data, _Real, 1}},
    minimizer[RandomReal[{0, 1}, {4 + 1, 4}], epsilon, -1],
    CompilationOptions -> {"InlineCompiledFunctions" -> True},
    RuntimeOptions -> {"Speed", "EvaluateSymbolically" -> False},
    CompilationTarget -> "C",
    Parallelization -> True, RuntimeAttributes -> {Listable}
   ];
 ]

Here the same minimand as used above is enclosed in a Function to make it suitable for NelderMeadMinimize`Dump`CompiledNelderMead, which is then called with this objective function and the option "ReturnValues" -> "OptimizedParameters" to generate a compiled minimizer that can be used from within other compiled code. Two callers are defined: serialFitter simply loops over each dataset given in its argument, while parallelFitter is Listable and automatically parallelizes over multiple datasets. Let's check them:

Block[{x = Range[50]},
 Round@serialFitter[{3 + x - 4 x^2 + 2 x^3, -8 - 2 x + 7 x^2 - x^3}] ==
  Round@parallelFitter[{3 + x - 4 x^2 + 2 x^3, -8 - 2 x + 7 x^2 - x^3}] ==
   {{3, 1, -4, 2}, {-8, -2, 7, -1}}
]

As expected, we get True, so these can both correctly fit cubic polynomials. What about performance?

datasets = Accumulate /@ RandomReal[{-1, 1}, {1000, 100}];

serialFitter[datasets]; // Timing
(* 7.813 seconds *)

parallelFitter[datasets]; // AbsoluteTiming
(* 2.141 seconds *)

So, we have 7.8 ms and 2.1 ms per dataset, respectively. While these datasets, the model being fitted, and the convergence tolerances are admittedly all different to those in Ajasja's problem, that's still not too bad at all in my opinion. Furthermore, if you have a computer with support for simultaneous multithreading (SMT, e.g. Intel HT), the performance of parallelFitter can be further improved by evaluating SetSystemOptions["ParallelOptions" -> "ParallelThreadNumber" -> n] (where the value n depends on the number of logical processors available). Evidently, Mathematica's compiled code is not quite optimal, even after being translated to C and compiled to native code, since I found that this setting provided about 20% better performance on an Intel i7-2600 CPU.

Oleksandr R.

Posted 2012-04-24T13:48:28.817

Reputation: 22 073

You could do a good ol' switcharoo... post this answer in the highly upvoted, accepted answer and the commentary in that one over here. – rm -rf – 2012-04-29T04:44:41.760

@R.M true, but I'd rather that any upvotes received on this answer have some meaning, i.e. signifying that someone actually found it useful, rather than claiming the votes received for something completely unrelated. I decided against combining the answers when I realised how long that answer would get... – Oleksandr R. – 2012-04-29T05:11:21.023

+1 I waited with my upvote for exactly that reason :) This post will be so useful for me in the future! Unfortunately I don't have a nice minimization problem at hand to test it right now, but maybe soon. Thanks! – sebhofer – 2012-04-29T09:52:25.240

+1 Thank you. Some brilliant techniques for in-lining. Can't wait to try this out (at work). Definitely deserves a gold badge (although it should be this post that outscores the other one :) – Ajasja – 2012-04-29T11:55:54.633

With the updated version, did you re-run the timing? – rcollyer – 2012-04-30T19:19:20.663

@rcollyer for my own testing, yes, but I didn't bother to update the timings in the above since there's essentially no difference (to within experimental error) between the Nelder-Mead and Lagarias et al. algorithms for these test functions. In principle each iteration of the Nelder-Mead algorithm may be slightly faster while the Lagarias et al. algorithm may converge in slightly fewer iterations. – Oleksandr R. – 2012-04-30T20:53:16.207

Oleksandr, just a heads up: I've been using your code for a problem I'm working on for the last two days and it's been very helpful so far. Without a lot of tweaking it gives reasonable and fast results, whereas NMinimize actually just crashes :) (I'm not surprised by that however, it happened to me before.) Thanks again! – sebhofer – 2012-05-08T12:28:58.157

@sebhofer thanks, it's good to know that this answer is actually helping people (much more satisfying than any amount of rep points!). If you haven't already, you might like to download the updated code posted yesterday--it contains a couple of important bug-fixes and some minor performance improvements. – Oleksandr R. – 2012-05-08T17:18:32.623

The code in this minimizer contains many methods I haven't seen anywhere else, plus it works well; this is one of my favorite posts on this site. – faysou – 2015-04-09T22:23:09.803

@faysou thanks for saying so! I hope it's been useful to you. – Oleksandr R. – 2015-04-09T22:30:23.650

Yes it has, I'm using it. – faysou – 2015-04-09T22:32:08.193

You should add a README to the GitHub repo and link back here. – Szabolcs – 2015-09-09T19:06:40.040

@OleksandrR. Why don't the GitHub .nb and .m open in properly formatted form in Mathematica 10 for Windows? – Al Guy – 2016-06-12T20:31:14.513

@AlGuy don't know why not for you, since you don't describe the problem you observe, and it works for me. Certainly the notebook does. But bearing in mind that a .m file is just plain text without any formatting, I'm not sure what you expected in that case. – Oleksandr R. – 2016-06-12T20:57:00.747

@OleksandrR. You're right, I'm not sure why it didn't. Chrome downloaded it in a weird way. – Al Guy – 2016-06-18T23:55:14.687


NMinimize is implemented entirely in Mathematica code, so its behaviour is open to examination by investigation of symbols defined in the Optimization`NMinimizeDump` context (after calling NMinimize at least once to pre-load the necessary package). In particular, the ability to give diagnostic output is present, and can be switched on using the internal variable Optimization`NMinimizeDump`$DiagnosticLevel. This takes values from 0 (no output) through 6 (very verbose diagnostics).

If you're interested only in what happens before and after the main minimization (i.e., what can be considered as overhead), a useful value is 3; this omits most messages originating from the main loop of the core optimizer. Also available is Optimization`NMinimizeDump`$Shortness, which corresponds to the second argument of Short for those parts of the diagnostic output that are wrapped in this function (lists of options, &c.).

Examining the output given by a trial optimization:

Optimization`NMinimizeDump`$DiagnosticLevel = 3;
Optimization`NMinimizeDump`$Shortness = 10;
NMinimize[
 Hi2p[a, b], {a, b},
 Method -> {"NelderMead", "PostProcess" -> False},
 MaxIterations -> 5
]

suggests that much of the overhead is due to pre- and post-processing steps related to constraint handling (for example, construction of a suitable penalty function for use with unconstrained optimization methods such as the Nelder-Mead algorithm). These steps apparently run even when, as here, no constraints are present, so little can be done to reduce the effort they require. Another possibility, given that the objective function is already compiled, is to pass the option Compiled -> False, which may avoid some unnecessary compiler overhead. This option is undocumented and appears in red, but the diagnostic output confirms that it is accepted. Unfortunately, in this case, its effect on performance is small.
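For concreteness, the undocumented option is passed just like any documented one; this reuses the Hi2p objective from the question:

```mathematica
(* Compiled -> False is undocumented; NMinimize accepts it anyway, as the
   diagnostic output above confirms. Its performance effect here is small. *)
NMinimize[Hi2p[a, b], {a, b},
 Method -> {"NelderMead", "PostProcess" -> False},
 Compiled -> False]
```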

NMinimize has clearly been designed for versatility rather than to minimize the per-call overhead, so I doubt that much optimization can be done short of rewriting parts of it, which is probably not worthwhile. As such, my suggestion is to write a simpler minimizer, perhaps using only the Nelder-Mead algorithm (which is not very difficult to implement, and can probably be done entirely in compiled code or in your non-Mathematica language of choice) if run-time efficiency is of paramount importance.
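To illustrate how little machinery the core algorithm needs, here is a minimal, top-level (uncompiled) Nelder-Mead sketch using the standard reflection/expansion/contraction/shrink coefficients (1, 2, 1/2, 1/2). The name nelderMead, the initial-simplex step size 0.1, and the stopping rule are my own illustrative choices, not the implementation discussed in this thread:

```mathematica
(* Minimal Nelder-Mead sketch. f takes a coordinate vector; x0 is the
   starting point. Illustrative only; no restarts or input validation. *)
nelderMead[f_, x0_List, maxIter_ : 500, tol_ : 1.*^-8] :=
 Module[{n = Length[x0], simplex, vals, order, centroid,
   xr, xe, xc, fr, fe, fc, iter = 0, p},
  (* initial simplex: x0 plus a 0.1 perturbation along each axis *)
  simplex = Join[{x0}, Table[x0 + 0.1 UnitVector[n, i], {i, n}]];
  vals = f /@ simplex;
  While[iter++ < maxIter && Max[vals] - Min[vals] > tol,
   order = Ordering[vals];
   simplex = simplex[[order]]; vals = vals[[order]];
   centroid = Mean[Most[simplex]];
   xr = 2 centroid - Last[simplex]; fr = f[xr];      (* reflection *)
   Which[
    fr < First[vals],                                (* try expansion *)
     xe = 3 centroid - 2 Last[simplex]; fe = f[xe];
     If[fe < fr,
      simplex[[-1]] = xe; vals[[-1]] = fe,
      simplex[[-1]] = xr; vals[[-1]] = fr],
    fr < vals[[-2]],                                 (* accept reflection *)
     simplex[[-1]] = xr; vals[[-1]] = fr,
    True,                                            (* contraction *)
     xc = (centroid + Last[simplex])/2; fc = f[xc];
     If[fc < Last[vals],
      simplex[[-1]] = xc; vals[[-1]] = fc,
      (* shrink all vertices toward the best one *)
      simplex = (First[simplex] + #)/2 & /@ simplex;
      vals = f /@ simplex]]];
  p = First[Ordering[vals]];
  {vals[[p]], simplex[[p]]}]
```

For example, nelderMead[Function[v, (v[[1]] - 1)^2 + 3 (v[[2]] + 2)^2], {0., 0.}] converges to a point near {1., -2.}. Porting a loop of this shape to Compile (or to C/Delphi) is straightforward, since it uses only arithmetic, Part assignment, and sorting.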

Oleksandr R.

Posted 2012-04-24T13:48:28.817

Reputation: 22 073

@OleksandrR. Is there an equivalent for FindMinimum? I tried Trace[FindMinimum[x Cos[x], {x, 2}], TraceInternal -> True] and I see something like Optimization`Internal`ModifiedCholesky$4575`. – my account_ram – 2013-10-29T15:49:49.487


@myaccount_ram no, sorry (or at least not that I know of). FindMinimum is kernel-level code (i.e. written mostly in C) and not traceable in the same way. The best you can do, as I understand it, is to look for calls to relevant top-level functions using Trace, and hook anything you're interested in to add your own diagnostics. I did this here, in case it would be helpful for you to have something to refer to.

– Oleksandr R. – 2013-10-29T17:57:15.743

You scare me with your spelunking... – rm -rf – 2012-04-24T18:08:04.437

Is it then possible to browse the mma source of NMinimize? That would be very educational :) – Ajasja – 2012-04-24T20:32:22.210


Yes: start with AppendTo[$ContextPath, "Optimization`NMinimizeDump`"], then follow the function calls from Information[NMin]. Leonid's code formatter will be very helpful. By the way, I'll add an all-compiled-code Nelder-Mead minimizer to this answer later; I couldn't do it earlier and can't do it now though as I'm busy trying to run an experiment.

– Oleksandr R. – 2012-04-24T20:47:03.143

@OleksandrR. Wow, I certainly wouldn't object to an all-compiled-code Nelder-Mead minimizer. – Ajasja – 2012-04-25T07:50:00.833