## Has this implementation of FDM touched the speed limit of Mathematica?

31

20

Still, I'll use the implementation of the 1D FDTD method (you can simply understand it as a kind of explicit finite difference scheme for the Maxwell's equation) as the example. Just for completeness, here is the 1D Maxwell's equation:

$$\mu \frac{\partial H_y}{\partial t}=\frac{\partial E_z}{\partial x}$$ $$\epsilon \frac{\partial E_z}{\partial t}=\frac{\partial H_y}{\partial x}$$

and the corresponding finite difference equation:

$$H_y^{q+\frac{1}{2}}\left[m+\frac{1}{2}\right]\text{==}H_y^{q-\frac{1}{2}}\left[m+\frac{1}{2}\right]+\frac{\Delta _t}{\mu \Delta _x}\left(E_z^q[m+1]-E_z^q[m]\right)$$ $$E_z^{q+1}[m]==E_z^q[m]+\frac{\Delta _t}{\epsilon \Delta _x}\left(H_y^{q+\frac{1}{2}}\left[m+\frac{1}{2}\right]-H_y^{q+\frac{1}{2}}\left[m-\frac{1}{2}\right]\right)$$

The toy code I've repeatedly used in several posts implementing the difference scheme is:

ie = 200;
ez = ConstantArray[0., {ie + 1}];
hy = ConstantArray[0., {ie}];

fdtd1d = Compile[{{steps}},
Module[{ie = ie, ez = ez, hy = hy},
Do[ez[[2 ;; ie]] += (hy[[2 ;; ie]] - hy[[1 ;; ie - 1]]);
ez[] = Sin[n/10];
hy[[1 ;; ie]] += (ez[[2 ;; ie + 1]] - ez[[1 ;; ie]]), {n,
steps}]; ez]];

result = fdtd1d; // AbsoluteTiming


Notice that constants like $\mu$, $\Delta _t$ are omitted for simplicity.

Personally I think it's a typical example for the implementation of finite difference method (FDM), so here's the question: has this piece of code (at least almost) touched the speed limit of Mathematica? In fact several months ago, I found that if the code is rewrited with Julia, it'll be 4 times faster.

Indeed, I know this might be the case that one should use the best-suited tool for a specific job, but since I've already gained some stupid pride on using Mathematica and am unwilling to spend time to learn a new programming language (Wolfram is almost my first programming language, I used to learn some VB, but already gave it back to my teacher when started to use Mathematica), I still want to make sure if the Mathematica version of the code can be faster.

If it's the limitation, I'd like to know why there's such a big difference.

Any help would be appreciated.

2Using CompilePrint[fdtd1d] we find that the first three instructions in the compiled function are calls to MainEvaluate. If you add CompilationOptions -> {"InlineExternalDefinitions" -> True} and repeat, the first two instructions are now CopyTensor, although there is virtually no difference in the timings between the two functions. This may mean nothing, or it may be something worth investigating. – dr.blochwave – 2014-10-07T15:54:48.723

1Probably on reflection means little - the performance hit is likely later on. – dr.blochwave – 2014-10-07T18:38:05.543

1Also there might be a further speed gain by compiling to C. – Daniel Lichtblau – 2014-10-07T22:11:29.980

@DanielLichtblau Sadly the magic of C is ineffective here, at least with TDM-GCC 4.8.1 and Vista 32bit. – xzczd – 2014-10-08T03:15:20.613

1@blochwave Yeah, in this case "InlineExternalDefinitions" -> True or changing the beginning part to With[{ie2 = ie, ez2 = ez, hy2 = hy}, Compile[{{steps}}, Module[{ie = ie2, ez = ez2, hy = hy2}, … or With[{ie = 200}, Compile[{{steps}}, Module[{ez = Table[0., {ie + 1}], hy = Table[0., {ie}]}, … doesn't help, the bottleneck is inside Do. – xzczd – 2014-10-08T03:24:50.037

I also don't find a gain from compiling to C. However, since Span doesn't actually seem to feature in this list of compilable functions, could that be a potential bottleneck?

– dr.blochwave – 2014-10-08T07:05:48.380

As in, could rewriting it without the spans work...just wondering that's all – dr.blochwave – 2014-10-08T07:07:37.313

@blochwave That list isn't all… Span is compiled as we see with CompilePrint, and vectorization is still one of the most important way to speed up code even inside Compile, rewriting Span with loops only makes the code slower. Try fdtd1d = Compile[{{steps}}, Module[{ie = ie, ez = ez, hy = hy}, Do[Do[ez[[i + 1]] += (hy[[i + 1]] - hy[[i]]), {i, ie - 1}]; ez[] = Sin[n/10]; Do[hy[[i]] += (ez[[i + 1]] - ez[[i]]), {i, ie}], {n, steps}]; ez]]; – xzczd – 2014-10-08T07:16:59.460

Fair enough, didn't think it would be a solution, just curious! – dr.blochwave – 2014-10-08T07:20:52.473

Parallelization? GPUs? – Alexey Bobrick – 2014-10-08T14:18:32.817

@AlexeyBobrick AFAIK, compilation is hard to combine with these techniques, which are much slower than compilation when used separately. – xzczd – 2014-10-09T02:14:18.207

Not quite. CUDA (GPU) is really a compiled C code with CUDA instructions. If you add parallelization flag to your Compile[] command, you will get more speed-up for free. You probably can further add openmp instructions in your pre-compiled code, but I am not sure. – Alexey Bobrick – 2014-10-09T11:33:36.403

1

@AlexeyBobrick Just scaned this document and tried several flags in "CompileOptions" (-O3, -Ofast, etc.) but didn't get a speed-up… Could you give an example? (BTW -Ofast does help in speeding up this code :D)

– xzczd – 2014-10-09T12:53:20.963

I meant simply adding Compile[<Your code>, Parallelization -> True]. Should generally be a quick-help type of a command. – Alexey Bobrick – 2014-10-09T19:41:42.920

@AlexeyBobrick I'm sorry but this won't work, as the document said, Parallelization -> True only works together with RuntimeAttributes -> {Listable}, it just makes a Listable compiled function paralleled, while it's not the case of this question. – xzczd – 2014-10-10T03:36:29.290

Will it compile if you use ParallelDo instead of Do? And keeping Parallelisation->True. I was pretty sure it was parallelising not the argument calls, but the inner loops. Will check. – Alexey Bobrick – 2014-10-10T14:23:05.400

@AlexeyBobrick No, it won't. ParallelDo isn't in the list, the result of CompilePrint also shows it's indeed not compiled.

– xzczd – 2014-10-11T06:40:18.080

Nice question! I somehow missed it! (+1) – Silvia – 2015-11-10T17:33:03.570

33

Okay, this is a bit of an embarassment.

Here is a very small modification of the original code. I simply made explicit option settings, made a denominator to Sin explicitly real, that kind of thing. My tests show the same timing as the original, give or take an iota.

ie = 200;
ez = ConstantArray[0., {ie + 1}];
hy = ConstantArray[0., {ie}];

fdtd1d = Compile[{{steps}},
Module[{ie = ie, ez = ez, hy = hy},
Do[ez[[2 ;; ie]] += (hy[[2 ;; ie]] - hy[[1 ;; ie - 1]]);
ez[] = Sin[n/10.];
hy[[1 ;; ie]] += (ez[[2 ;; ie + 1]] - ez[[1 ;; ie]]), {n,
steps}]; ez],
CompilationOptions -> {"InlineExternalDefinitions" -> True},
"CompilationTarget" -> "C", "RuntimeOptions" -> "Speed"];


I'll beef up the example a factor of 10.

result = fdtd1d; // AbsoluteTiming

(* Out= {0.555320, Null} *)


Now we remove the spans and replace with inner loops. No vectorization, that is.

fdtd1d2 = Compile[{{steps}}, Module[{ie = ie, ez = ez, hy = hy},
Do[
Do[ez[[j]] += (hy[[j]] - hy[[j - 1]]), {j, 2, ie}];
Do[ez[] = Sin[n/10.];
hy[[j - 1]] += (ez[[j]] - ez[[j - 1]]), {j, 2, ie + 1}], {n,
steps}]; ez],
CompilationOptions -> {"InlineExternalDefinitions" -> True},
"CompilationTarget" -> "C", "RuntimeOptions" -> "Speed"];

result2 = fdtd1d2; // AbsoluteTiming
result2 == result

(* Out= {0.179435, Null}

Out= True *)


So that's a factor of 3. I guess I need to report this as a suggestion to revisit vector operations in Compile.

--- edit ---

It is of course slightly faster to take the assignment of ez[] =... out of the second inner loop (sound of hand smacking forehead). Also it turns out to be slightly faster to reduce the index arithmetic in the second loop. I also took the assignment of the constant, ie, out of the Module variables and made it really constant using With; this does not seem to affect timing one way or the other.

fdtd1d3 = With[{ie = ie}, Compile[{{steps, _Integer}},
Module[
{ez = ez, hy = hy},
Do[
ez[] = Sin[n/10.];
Do[ez[[j]] += (hy[[j]] - hy[[j - 1]]), {j, 2, ie}];
Do[
hy[[j]] += (ez[[j + 1]] - ez[[j]]), {j, 1, ie}],
{n, steps}]; ez],
CompilationOptions -> {"InlineExternalDefinitions" -> True},
"CompilationTarget" -> "C", "RuntimeOptions" -> "Speed"]];

result3 = fdtd1d3; // AbsoluteTiming
result3 === result2

(* Out= {0.135636, Null}

Out= True *)


So that's a modest improvement.

--- end edit ---

--- edit #2 ---

Per suggestion by @s0rce, we'll use CompileGetElement instead of Part.

fdtd1d4 = Compile[{{steps, _Integer}},
Module[
{ie = ie, ez = ez, hy = hy},
Do[
ez[] = Sin[n/10.];
Do[ez[[j]] += (CompileGetElement[hy, j] -
CompileGetElement[hy, j - 1]), {j, 2, ie}];
Do[
hy[[j]] += (CompileGetElement[ez, j + 1] -
CompileGetElement[ez, j]), {j, 1, ie}],
{n, steps}]; ez],
CompilationOptions -> {"InlineExternalDefinitions" -> True},
"CompilationTarget" -> "C", "RuntimeOptions" -> "Speed"];

result4 = fdtd1d4; // AbsoluteTiming
result4 === result2

(* Out= {0.076532, Null}

Out= True *)


Now that's progress.

--- end edit #2 ---

1Just a side note, it's not necessary to make numbers inside Sin explicitly real and use "InlineExternalDefinitions" -> True, and in some computer (for example mine) one may need to use something like And @@ PossibleZeroQ[result2 - result // Chop] to test the result. Then, this answer is really… surprising, it seems to be the first example in this site showing devectorizing can speed up code inside Compile. – xzczd – 2014-10-09T07:37:30.057

PossibleZeroQ however will result in a call to the main evaluator. Best to do a numeric test with some granularity such as result1==result2 || (Abs[result1-result2]/(Abs[result1]+Abs[result2])<epsilon for suitable epsilon. (The first check should preclude the second one giving a division by zero issue.) – Daniel Lichtblau – 2014-10-09T14:41:11.707

11Does CompileGetElement provide any speed up over Part? – s0rce – 2014-10-10T15:46:30.237

1@sOrce Yes, a huge one. Thanks. I'll reedit. – Daniel Lichtblau – 2014-10-10T15:51:05.210

5There's an additional 1.5X speedup by replacing e.g. ez[[j]] += (CompileGetElement[hy, j] - CompileGetElement[hy, j - 1]), {j, 2, ie}] with ez[[j]] = CompileGetElement[ez, j] + (CompileGetElement[hy, j] - CompileGetElement[hy, j - 1]), {j, 2, ie}] etc. – RunnyKine – 2014-10-10T17:56:36.127

Compared to OP's original method that's an overall 7.25X speed-up on my machine. Impressive! – RunnyKine – 2014-10-10T18:01:23.863

@RunnyKine The += variant is faster on my machine (running on Ubuntu Linux, with I know not what C compiler). I don't see why it is faster since CompileGetElementseems to beatPart rather handily. But there it is. Platform weirdness at work. – Daniel Lichtblau – 2014-10-10T19:50:09.450

But it should be faster since we're replacing Part on the RHS with CompileGetElement and as you said CompileGetElement beats Part, on that we agree. I'm on Windows with Visual Studio Compiler. – RunnyKine – 2014-10-10T20:45:58.033

@RunnyKine With TDM-GCC 4.8.1, Vista 32bit, ez[[j]] = Compile\GetElement[ez, j]……even makes the code a bit slower. TheCompileGetElementtrick is really amazing, now the code only takes about 13s forsteps = 10^7in my computer, as fast as the _Julia_ version (with the [@inboundsmacro added](http://stackoverflow.com/questions/23290917/can-i-take-advantage-of-parallelization-to-make-this-piece-of-code-faster?lq=1#comment35986126_23453661)). And if I set"CompileOptions"->"-Ofast" when loading the compiler, the timing will reduce to about 12.3s! – xzczd – 2014-10-11T03:55:33.510

@xzczd It looks like Visual Studio is doing something interesting under the hood then. – RunnyKine – 2014-10-11T03:57:30.783

@RunnyKine More likely to be a defect of TDM-GCC, I think. Today I learned the basic syntax of C++ for several hours and rewrited the piece of code with it in order to find out if 12s touches the limit of my hardware. The result is, with optimization flag -O3 or -Ofast, the C++ version only costs about 8s in my computer: that's a 1.5X speedup just as you got. So the compiler failed to dig out all the potential of the code. – xzczd – 2014-10-14T14:56:42.423

@xzczd Good to know. In any case, this is an impressive speed up. – RunnyKine – 2014-10-14T15:39:56.200

Hmm.. Don't know why the timing is different on my laptop: http://i.stack.imgur.com/ekm34.png .

– Silvia – 2015-11-11T05:50:33.453

4Hi, @DanielLichtblau. As you said in the answer " I need to report this as a suggestion to revisit vector operations in Compile." It's been 2 years, in the latest M11, the timing is the same... Is it technically so hard to make vector operation faster, considering MMA has such a powerful symbolic power( things like Symbolic C already there)? After all, they are absolutely equivalent, why will equivalent thing lead to different timing... – matheorem – 2016-09-07T12:51:48.803