Still, I'll use an implementation of the 1D FDTD method (you can simply think of it as a kind of explicit finite difference scheme for Maxwell's equations) as the example. Just for completeness, here are the 1D Maxwell's equations:

$$\mu \frac{\partial H_y}{\partial t}=\frac{\partial E_z}{\partial x}$$ $$\epsilon \frac{\partial E_z}{\partial t}=\frac{\partial H_y}{\partial x}$$

and the corresponding finite difference equation:

$$H_y^{q+\frac{1}{2}}\left[m+\frac{1}{2}\right]=H_y^{q-\frac{1}{2}}\left[m+\frac{1}{2}\right]+\frac{\Delta _t}{\mu \Delta _x}\left(E_z^q[m+1]-E_z^q[m]\right)$$ $$E_z^{q+1}[m]=E_z^q[m]+\frac{\Delta _t}{\epsilon \Delta _x}\left(H_y^{q+\frac{1}{2}}\left[m+\frac{1}{2}\right]-H_y^{q+\frac{1}{2}}\left[m-\frac{1}{2}\right]\right)$$

The toy code ~~I've repeatedly used in several posts~~ implementing the difference scheme is:

```
ie = 200;
ez = ConstantArray[0., {ie + 1}];
hy = ConstantArray[0., {ie}];
fdtd1d = Compile[{{steps}},
   Module[{ie = ie, ez = ez, hy = hy},
    Do[
     ez[[2 ;; ie]] += (hy[[2 ;; ie]] - hy[[1 ;; ie - 1]]);
     ez[[1]] = Sin[n/10];
     hy[[1 ;; ie]] += (ez[[2 ;; ie + 1]] - ez[[1 ;; ie]]),
     {n, steps}];
    ez]];
result = fdtd1d[10000]; // AbsoluteTiming
```

*Notice that constants like $\mu$, $\epsilon$, $\Delta _t$ and $\Delta _x$ are omitted (i.e. set to 1) for simplicity.*
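To make the update scheme explicit for readers who don't use *Mathematica*, here is a minimal pure-Python sketch of the same leapfrog iteration (a hypothetical re-implementation for illustration only, not the code that was timed; constants are likewise set to 1, and the function name and `ie` default are just carried over from the snippet above):

```python
import math

def fdtd1d(steps, ie=200):
    # ez has ie + 1 integer-grid nodes, hy has ie half-grid nodes,
    # mirroring the Mathematica arrays ez and hy above.
    ez = [0.0] * (ie + 1)
    hy = [0.0] * ie
    for n in range(1, steps + 1):
        # E-field update on interior nodes: ez[m] += hy[m] - hy[m-1]
        # (0-based equivalent of ez[[2 ;; ie]] += hy[[2 ;; ie]] - hy[[1 ;; ie - 1]])
        for m in range(1, ie):
            ez[m] += hy[m] - hy[m - 1]
        # Hard sinusoidal source at the left boundary (ez[[1]] = Sin[n/10])
        ez[0] = math.sin(n / 10)
        # H-field update: hy[m] += ez[m+1] - ez[m]
        for m in range(ie):
            hy[m] += ez[m + 1] - ez[m]
    return ez

# Usage (small grid so the interior loops are easy to check by hand):
ez = fdtd1d(10, ie=20)
```

Note that after the very first step only the source node `ez[0]` and the adjacent `hy[0]` are nonzero, which is a quick sanity check on the indexing.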

Personally I think it's a typical implementation of the finite difference method (FDM), so here's the question: has this piece of code (at least nearly) reached the speed limit of *Mathematica*? In fact, several months ago I found that if the code is rewritten in *Julia*, it runs about 4 times faster.

Indeed, I know this might be a case of using the best-suited tool for a specific job, but since I've already gained some stubborn pride in using *Mathematica* and am unwilling to spend time learning a new programming language (*Wolfram* Language is almost my first programming language; I once learned some *VB*, but had already given it all back to my teacher by the time I started using *Mathematica*), I still want to find out whether the *Mathematica* version of the code can be made faster.

If this is indeed the limit, I'd like to know why there's such a big difference.

Any help would be appreciated.

Using `CompilePrint[fdtd1d]` we find that the first three instructions in the compiled function are calls to `MainEvaluate`. If you add `CompilationOptions -> {"InlineExternalDefinitions" -> True}` and repeat, the first two instructions are now `CopyTensor`, although there is virtually no difference in the timings between the two functions. This may mean nothing, or it may be something worth investigating. – dr.blochwave – 2014-10-07T15:54:48.723

Probably on reflection means little - the performance hit is likely later on. – dr.blochwave – 2014-10-07T18:38:05.543

Also there might be a further speed gain by compiling to C. – Daniel Lichtblau – 2014-10-07T22:11:29.980

@DanielLichtblau Sadly the magic of C is ineffective here, at least with TDM-GCC 4.8.1 and Vista 32bit. – xzczd – 2014-10-08T03:15:20.613

@blochwave Yeah, in this case `"InlineExternalDefinitions" -> True`, or changing the beginning part to `With[{ie2 = ie, ez2 = ez, hy2 = hy}, Compile[{{steps}}, Module[{ie = ie2, ez = ez2, hy = hy2}, …`, or `With[{ie = 200}, Compile[{{steps}}, Module[{ez = Table[0., {ie + 1}], hy = Table[0., {ie}]}, …` doesn't help; the bottleneck is inside `Do`. – xzczd – 2014-10-08T03:24:50.037

I also don't find a gain from compiling to C. However, since `Span` doesn't actually seem to feature in this list of compilable functions, could that be a potential bottleneck? – dr.blochwave – 2014-10-08T07:05:48.380

As in, could rewriting it without the spans work... just wondering, that's all. – dr.blochwave – 2014-10-08T07:07:37.313

@blochwave That list isn't all… `Span` is compiled, as we can see with `CompilePrint`, and vectorization is still one of the most important ways to speed up code even inside `Compile`; rewriting `Span` with loops only makes the code slower. Try `fdtd1d = Compile[{{steps}}, Module[{ie = ie, ez = ez, hy = hy}, Do[Do[ez[[i + 1]] += (hy[[i + 1]] - hy[[i]]), {i, ie - 1}]; ez[[1]] = Sin[n/10]; Do[hy[[i]] += (ez[[i + 1]] - ez[[i]]), {i, ie}], {n, steps}]; ez]];` – xzczd – 2014-10-08T07:16:59.460

Fair enough, didn't think it would be a solution, just curious! – dr.blochwave – 2014-10-08T07:20:52.473

Parallelization? GPUs? – Alexey Bobrick – 2014-10-08T14:18:32.817

@AlexeyBobrick AFAIK, compilation is hard to combine with these techniques, which are much slower than compilation when used separately. – xzczd – 2014-10-09T02:14:18.207

Not quite. CUDA (GPU) code is really compiled C code with CUDA instructions. If you add a parallelization flag to your Compile[] command, you will get more speed-up for free. You could probably further add OpenMP instructions in your pre-compiled code, but I am not sure. – Alexey Bobrick – 2014-10-09T11:33:36.403

@AlexeyBobrick Just scanned this document and tried several flags in `"CompileOptions"` (`-O3`, `-Ofast`, etc.) but didn't get a speed-up… Could you give an example? (BTW `-Ofast` does help in speeding up this code :D) – xzczd – 2014-10-09T12:53:20.963

I meant simply adding `Compile[<Your code>, Parallelization -> True]`. Should generally be a quick-help type of command. – Alexey Bobrick – 2014-10-09T19:41:42.920

@AlexeyBobrick I'm sorry, but this won't work. As the documentation says, `Parallelization -> True` only works together with `RuntimeAttributes -> {Listable}`; it just makes a `Listable` compiled function run in parallel, which is not the case in this question. – xzczd – 2014-10-10T03:36:29.290

Will it compile if you use `ParallelDo` instead of `Do`, keeping `Parallelization -> True`? I was pretty sure it was parallelising not the argument calls, but the inner loops. Will check. – Alexey Bobrick – 2014-10-10T14:23:05.400

@AlexeyBobrick No, it won't. `ParallelDo` isn't in the list, and the result of `CompilePrint` also shows it's indeed not compiled. – xzczd – 2014-10-11T06:40:18.080

Nice question! I somehow missed it! (+1) – Silvia – 2015-11-10T17:33:03.570