**The epilogue:**

A paper in which the answer below is used has been published; the answer is cited among its references. :)

**The background:**

I have to fit an objective function for ~10 000 datasets in near real time (near real time being on the order of 10 seconds). I compile the huge objective function so that it takes about 30 microseconds to execute. I'm using the Nelder-Mead method, which on average converges in about 300 steps.
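
(Just to give an idea of the setup, a compiled objective might look something like this minimal sketch; the real function is of course much larger:)

```
(*minimal sketch of a compiled objective, not the actual function;
  the expression is borrowed from the toy example below*)
obj = Compile[{{a, _Real}, {b, _Real}},
   (a - Cos[a^2 - 3 b])^2 + (b - Sin[a^2 + b^3])^2,
   CompilationTarget -> "C"];
```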

This means that one minimization should take about 9 milliseconds (300*30 microseconds), and the whole fitting could be done in about 10 seconds on eight cores (300*10000*0.00003/8).

**The problem:**

No matter how quickly the objective function can be calculated, `NMinimize[]` always takes at least 50 ms per call. This means the fitting takes about 4 minutes on my eight cores.

**The question: Can I somehow get rid of this 50 ms delay that `NMinimize[]` has?** Otherwise I will have to rewrite my program in C (or Delphi, which I like much better), but I'm not looking forward to that.

Here is a toy example that illustrates the problem: regardless of the value of `LOOPCOUNT`, the "*deadtime*" is always similar.

```
ClearAll[Hi2p];
LOOPCOUNT = 100;
(*the Do loop is filler, used instead of Pause[], so that one
  evaluation of Hi2p takes roughly as long as the real objective*)
Hi2p[a_?NumericQ, b_?NumericQ] := (
   Do[Cos[0.3], {LOOPCOUNT}];
   (a - Cos[a^2 - 3 b])^2 + (b - Sin[a^2 + b^3])^2
   );
n = 10000;
{t, tt} = Do[Hi2p[1, 1], {n}] // AbsoluteTiming;
Print["TimePerFunctionEvaluation: " <> ToString[t/n]];
{{time, sol}, {points}} = Reap[AbsoluteTiming[
    NMinimize[Hi2p[a, b], {a, b},
     Method -> {"NelderMead", "PostProcess" -> False, "RandomSeed" -> 1796},
     MaxIterations -> 500, PrecisionGoal -> 10,
     AccuracyGoal -> 100, EvaluationMonitor :> Sow[{a, b}]]
    ]];
{time, sol, Length[points]}
deadtime = time - t/n*Length[points];
Print["Deadtime: " <> ToString[deadtime]];
(*"TimePerFunctionEvaluation: 0.00004190240"
  ==> {0.0650037, {3.08149*10^-33, {a -> 0.6037, b -> 0.429039}}, 278}
  "Deadtime: 0.0536606"*)
```

**EDIT:**

Thank you, I have almost solved my real problem. I'm struggling with a technical detail, which stems from the fact that I don't completely understand how the in-lineable `apply` works. Here is the problem (I'm also not sure whether this should be a separate question or not...):

My objective function is set up so that it takes a number of data points in addition to the parameters to fit. The number of data points is the same in all data sets, but can change between experiments. This way I can compile the objective function just once for all 10 000 data sets I have to fit.

But the way `NelderMeadMinimize` is set up now, it compiles the algorithm for each different dataset, which has a non-negligible overhead, especially when compiling to C.

Here is a toy example:

```
ClearAll[f, toy, bench]
(*toy objective function, which takes about 30 microseconds to run;
  the Do loop pads the evaluation time*)
toy = Compile[{a, b, c, x, y, z},
   Do[Cos[RandomReal[]], {800}];
   (a - x)^2 + 50 (b - y)^2 + (c - z)^2,
   "RuntimeOptions" -> {"EvaluateSymbolically" -> False,
     "CompareWithTolerance" -> False}, "CompilationTarget" -> "C"];
SetAttributes[bench, HoldFirst];
bench[expr_, n_] :=
  Module[{tt, t},
   {t, tt} = Do[expr, {n}] // AbsoluteTiming;
   t/n];

bench[toy @@ RandomReal[{-1, 1}, 6], 10000]
(* ==> 0.00003000005 *)

(*inline the fixed data points {1, 2, 3} to get a symbolic objective in x, y, z*)
toyf = Evaluate@toy[Sequence @@ {1, 2, 3}, Sequence @@ {x, y, z}];

(*compile the algorithm once to C*)
NelderMeadMinimize[toyf, {x, y, z}, CompilationTarget -> "C"] // AbsoluteTiming
bench[NelderMeadMinimize[toyf, {x, y, z}, CompilationTarget -> "C"], 100]
(* ==> {0.6600009, {1.82172*10^-21, {x -> 1., y -> 2., z -> 3.}}} *)
(* ==> 0.028300039 *)

(*compile the algorithm once to WVM*)
NelderMeadMinimize[toyf, {x, y, z}, CompilationTarget -> "WVM"] // AbsoluteTiming
bench[NelderMeadMinimize[toyf, {x, y, z}, CompilationTarget -> "WVM"], 100]
(* ==> {0.0400000, {1.82172*10^-21, {x -> 1., y -> 2., z -> 3.}}} *)
(* ==> 0.028200039 *)

(*but these seem to get compiled every time*)
bench[NelderMeadMinimize[
   Evaluate[toy[Sequence @@ {1, 2, 3}, Sequence @@ {x, y, z}]],
   {x, y, z}, CompilationTarget -> "WVM"], 100]
bench[NelderMeadMinimize[
   Evaluate[toy[Sequence @@ {1, 2, 3}, Sequence @@ {x, y, z}]],
   {x, y, z}, CompilationTarget -> "C"], 5]
(* ==> 0.032300045 *)
(* ==> 0.51200072 *)
```

From these benchmarks one can see that:

- Compilation to C is much slower than to WVM (already known :)
- It does not matter much whether the algorithm is compiled to C or to WVM
- The last benchmark shows that the algorithm is **compiled every time** (as each minimization takes half a second)

**Is there a way to compile the minimization algorithm just once for the whole data set?**

That is, so that code similar to the following gets compiled only once instead of 5 times:

```
bench[
NelderMeadMinimize[
Evaluate[toy[Sequence @@ RandomReal[{-1, 1}, 3], Sequence @@ {x, y, z}]],
{x, y, z}, CompilationTarget -> "C"],
5]
```

**EDIT 2:**

I managed to hack a version which works great for my purposes here.

I can compile once like this

```
(*Hi2 first takes the variables to fit, then the data points of the
  spectrum as constants*)
vars = {int, p1, lm1, lm2, bl1, bl2};
symbolicSpectrum = Array[Unique[a] &, Length[spectrum]];
cm = NelderMeadMinimize`Dump`CompiledNelderMead[Hi2, Evaluate[vars],
    symbolicSpectrum, CompilationTarget -> "C"]; // AbsoluteTiming
```

Then I call the compiled function 10000 times (once for each different spectrum).

```
(*then I just call cm 10000 times, once for each spectrum*)
cm[initialPoints, spectrum, 10^-9, 300]
```

This takes about 12 ms per dataset on my computer, with 10 ms spent on the evaluation of the objective function. I think this is simply *amazing!*
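
To fit all the datasets, one can then just map `cm` over them in parallel. A minimal sketch (here `spectra` is assumed to be the list of all 10 000 spectra and `initialPoints` the starting simplex; neither is defined above):

```
(*hypothetical sketch: fit every spectrum with the already-compiled
  minimizer cm; spectra and initialPoints are assumed inputs*)
LaunchKernels[];
DistributeDefinitions[cm, initialPoints];
results = ParallelMap[cm[initialPoints, #, 10^-9, 300] &, spectra];
```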

Oh, and by setting

```
Needs["CCompilerDriver`"]
$CCompiler = CCompilerDriver`IntelCompiler`IntelCompiler;
```

I got a free factor of 2 (so it takes about **6 ms** per data set).
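
(In case it helps anyone: you can check which C compilers `CCompilerDriver` can see on your machine with `CCompilers[]`; nothing here is specific to my setup:)

```
(*quick check of which C compilers CCompilerDriver finds*)
Needs["CCompilerDriver`"]
CCompilers[]
```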

Congratulations on the published article! – shrx – 2013-12-18T18:52:37.203

Congratulations on your publication! As it happens, a friend of mine has just submitted an article to Optics Express for which he has received favorable reviews. And thank you very much for the citation! – Oleksandr R. – 2013-12-19T00:32:44.217

@OleksandrR No, no, thank you! Well, you won't get any "academic" points for the citation, but at least credit is given where it is due. Our experiences with Optics Express are positive: the review was useful and the article was published in ~3 weeks. – Ajasja – 2013-12-24T22:28:14.007

Note that `AbsoluteTiming` includes everything going on on your computer until the calculation finishes. What results do you get with `Timing` instead of `AbsoluteTiming`? – celtschk – 2012-04-24T14:23:02.560

@OleksandrR. No, no, the 1000 digits are just there so that I get approximately the same number of iterations as in my real example. Of course I don't need that much precision. – Ajasja – 2012-04-24T14:35:52.360

I don't understand your code. What's the purpose of `Do[Cos[0.3], {LOOPCOUNT}]`? Is it just filler to represent something in your real problem? – rcollyer – 2012-04-24T14:39:33.023

@rcollyer It's used instead of `Pause[]`, so that one evaluation of `Hi2p` takes about 40 microseconds. – Ajasja – 2012-04-24T14:42:25.973

I see: it is to represent the length of time it takes to evaluate the full version of `Hi2p`. – rcollyer – 2012-04-24T14:43:20.913

@rcollyer Yes, exactly. – Ajasja – 2012-04-24T14:44:18.247

Any chance the data sets have some similarity? If so, you might do better by solving for one, then passing that result as starting values to FindMinimum for the remaining sets. – Daniel Lichtblau – 2012-04-24T17:33:52.950

@DanielLichtblau I had the same idea, but did not get around to implementing it yet:) – Ajasja – 2012-04-24T20:24:41.500

@OleksandrR. You're welcome. Don't worry, now that I understand how `apply` works I should be able to fix it myself. – Ajasja – 2012-05-04T11:16:11.377

@OleksandrR. No problem. I hacked a version which works for my purposes here. I adapted my objective function so that the parameters I want to optimize come first and the points from the data set come last. That way I can just compile once for all the datasets. – Ajasja – 2012-05-07T10:56:05.727

Finally updated as promised. Hope that helps, although I've enjoyed working on this problem, so the bounty's really not necessary -- IMO you're better off to keep your rep points, as the thresholds for doing things on the site (e.g. editing posts) will be quite a lot higher after it gets out of beta. – Oleksandr R. – 2012-05-08T06:13:23.510

@OleksandrR. I'm sure I'll get more reputation points, especially if mma.se keeps being so interesting to follow. Besides, it's only 100 points, so it's mostly symbolic (IMO the answer deserves more). – Ajasja – 2012-05-08T15:42:58.223

Thanks for the bounty! I hope the last update was useful. Are we getting close to your target of 10,000 datasets in 10 seconds? How many points are in your spectra, by the way? I suspect more than the 100 points I used in the example... – Oleksandr R. – 2012-05-08T17:27:23.283

@OleksandrR. You're very welcome. Right now I'm on the limit at ~15 s (6 ms per dataset * 10000 on six cores, but the parallel scaling is not completely linear). But I think I may well surpass it, once I start using the solutions of fits as initial conditions for similar datasets. I have from 20 - 40 points in a dataset (depends on how the experiment was done), but the model is quite complicated (a convolution of a Gaussian with a difference of two exponential functions). My hacked solution was working fine, but the update will enable me to code it even more elegantly! – Ajasja – 2012-05-08T17:52:09.390