## Mod slow in CompiledFunction


I've been playing with a little compiled Ising model implementation. It's brutally slow. Decomposing things, I've found that the primary culprit appears to be Mod. Take this trivial function:

modProper =
  Compile[
    {
      {initCoords, _Integer, 2},
      {tSteps, _Integer}
    },
    Module[
      {
        coordMat = initCoords,
        modMat,
        gridNRows,
        gridNCols
      },
      gridNRows = Length@coordMat;
      gridNCols = Length@coordMat[[1]];
      Do[
        modMat =
          Table[
            Mod[i - 1, gridNRows, 1],
            {i, gridNRows},
            {j, gridNCols}
          ];,
        {n, tSteps}
      ];
      coordMat
    ],
    RuntimeOptions ->
      {
        "CatchMachineOverflow" -> False,
        "CatchMachineIntegerOverflow" -> False,
        "CompareWithTolerance" -> False,
        "EvaluateSymbolically" -> False
      },
    CompilationTarget -> "C",
    Parallelization -> True
  ];


Taking a 100x100 grid, it takes a full 0.0002 seconds per iteration:

isingGrid = RandomChoice[{-1, 1}, {100, 100}];
AbsoluteTiming[modProper[isingGrid, 5000]][[1]]/5000

0.000206102


This seems fast until you realize that each iteration only does 10000 Mod calculations. It's even a little bit slower than the uncompiled, vectorized version:

With[{range = Range[10000] - 10},
AbsoluteTiming[Mod[range, 5000, 1]][[1]]
]

0.000163


Is there a reason it should be this slow?

FWIW, compiled version is faster on Intel Atom N570 with 32-bit Mathematica 11.2 and default gcc version 5.4.1: 0.00179s compiled vs 0.00315s uncompiled. – Ruslan – 2018-05-31T12:08:29.237

@Ruslan Quite likely, the reason is that the Atom processor is lacking AVX and AVX2 instructions, so that it cannot profit as well from vectorization. – Henrik Schumacher – 2018-05-31T12:48:53.983

10

Regarding your original code (this does not answer the question of why Mod is that slow; it just tries to get rid of it), here is my version of it, two generations later. We already identified Mod as the bottleneck, so let's get rid of it by precomputing the indices and submitting them to the compiled function as additional arguments:

isingProper5 = Compile[
    {
      {initCoords, _Integer, 2}, {tSteps, _Integer}, {storeTraj, _Integer},
      {up, _Integer, 1}, {down, _Integer, 1},
      {left, _Integer, 1}, {right, _Integer, 1}
    },
    Module[{totalCoords, realTraj, storedSteps, storedStepPointer = 1,
      coordMat = initCoords,
      peMat, gridNRows, gridNCols, giL, giM, gjL, gjM, c},
      (*number of steps to be used in the trajectory*)
      realTraj = Min@{Max@{storeTraj, 1}, tSteps};
      (*steps to be cached*)
      storedSteps =
        Table[Ceiling[tSteps*i/realTraj], {i, Min@{Max@{realTraj, 1}, tSteps}}];
      (*preallocated trajectory block*)
      totalCoords = Table[initCoords, {i, realTraj}];
      gridNRows = Length@initCoords;
      gridNCols = Dimensions[initCoords][[2]];
      Do[
        (*calculate PE on the grid from nearest neighbors*)
        peMat =
          Table[
            giL = Compile`GetElement[up, i];
            giM = Compile`GetElement[down, i];
            Table[
              gjL = Compile`GetElement[left, j];
              gjM = Compile`GetElement[right, j];
              Times[
                -Compile`GetElement[coordMat, i, j],
                Plus[
                  Compile`GetElement[coordMat, giL, j],
                  Compile`GetElement[coordMat, giM, j],
                  Compile`GetElement[coordMat, i, gjL],
                  Compile`GetElement[coordMat, i, gjM]
                ]
              ],
              {j, gridNCols}],
            {i, gridNRows}
          ];
        If[n == storedSteps[[storedStepPointer]],
          totalCoords[[storedStepPointer]] = coordMat;
          storedStepPointer++];,
        {n, tSteps}
      ];
      totalCoords
    ],
    RuntimeOptions -> "Speed",
    CompilationTarget -> "C"
  ];


Here is a timing example:

m = n = 100;
isingGrid = Developer`ToPackedArray[RandomChoice[{-1, 1}, {m, n}]];
up = RotateLeft[Range[n]];
down = RotateRight[Range[n]];
left = RotateLeft[Range[m]];
right = RotateRight[Range[m]];

t5 = AbsoluteTiming[
isingProper5[isingGrid, 1000, 1, up, down, left, right];
][[1]]/1000


0.000020027

Compared to the original function, this is about 36 times as fast:

t3 = AbsoluteTiming[
isingProper3[isingGrid, 1000, 1]
][[1]]/1000

t3/t5


0.00105735

36.2294

So far, this code is also not parallelized. You would have to cast it into a Listable function in order to profit from Parallelization -> True.
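For readers outside Mathematica, the same precomputed-index idea can be sketched in plain C. This is a hypothetical stand-alone version, not the author's generated code; `pe_stencil` and the fixed grid size `N` are my own names. The point is that the periodic neighbor lookups (the analogues of up, down, left, right) are computed once, so the inner loop contains no Mod at all:

```c
#include <assert.h>

#define N 4  /* hypothetical compile-time grid size */

/* Precompute periodic neighbor indices once, then evaluate the
   nearest-neighbor "PE" stencil with no Mod in the inner loops. */
void pe_stencil(int a[N][N], int pe[N][N]) {
    int up[N], down[N];            /* like RotateLeft/RotateRight of Range[N] */
    for (int i = 0; i < N; ++i) {
        up[i]   = (i + 1) % N;     /* computed once, outside the hot loops */
        down[i] = (i + N - 1) % N;
    }
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            pe[i][j] = -a[i][j] * (a[up[i]][j] + a[down[i]][j]
                                   + a[i][up[j]] + a[i][down[j]]);
}
```

For a grid of all 1s, every site has four aligned neighbors, so every entry of pe comes out as -4.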

I am also quite surprised by the slowness of Mod. Usually, we are advised to replace memory-bound tasks by CPU-bound tasks as far as possible on modern hardware. In this case, the opposite was more performant.

Disclaimer

I haven't checked the result for correctness yet.

Thinking about it even more, I wonder whether the Do-loop in the compiled function gets optimized away by the compiler. If I replace the Do by Table and make it the return variable, execution takes way longer, and I doubt that this is solely due to memory allocation...

Moreover, the following isolates a single iteration from the Do-loop.

cg = Compile[
    {{a, _Integer, 2}, {up, _Integer, 1}, {down, _Integer, 1},
     {left, _Integer, 1}, {right, _Integer, 1}},
    Table[
      Times[
        -Compile`GetElement[a, i, j],
        Plus[
          Compile`GetElement[a, Compile`GetElement[up, i], j],
          Compile`GetElement[a, Compile`GetElement[down, i], j],
          Compile`GetElement[a, i, Compile`GetElement[left, j]],
          Compile`GetElement[a, i, Compile`GetElement[right, j]]
        ]
      ],
      {i, Dimensions[a][[1]]}, {j, Dimensions[a][[2]]}],
    RuntimeOptions -> "Speed", CompilationTarget -> "C"];


Its runtime is approximately twice the average runtime of an iteration in isingProper5.

tcg = cg[isingGrid, up, down, left, right]; // RepeatedTiming // First


0.000038

I am not sure what the reason is. It cannot be calling overhead alone, because the factor of 2 remains also for m = n = 1000 and m = n = 2000.

Henrik, is using Do in a numerical simulation like this one a much better choice than 1) compiling a function that uses NestList or 2) passing a compiled function to NestList? I thought I had to avoid this kind of construct (Do/For) with WM – Alucard – 2018-05-31T16:31:37.547

@Alucard I'd say, within compiled code, Do is absolutely okay. Nest and Fold will probably also be unrolled to gotos in C, exactly as Table, While, and Do. So, in the end, the C code will look essentially the same. Usually, I use Do in top-level code for some outer loop with many compiled or vectorized inner loops. What you use for the outer loop usually makes only a marginal difference in the timings. But I'd also say that using Do and While within Compile allows a bit more fine-tuning than the functional constructs. – Henrik Schumacher – 2018-05-31T16:54:32.270

This is lovely. I cooked it into a pure Ising model simulator and it beats my more general implementation by a factor of four (because I need to evaluate this PE twice). I'm sure I can impress some people with how fast this runs (and I'll be sure to let them know that it's really your work :)). – b3m2a1 – 2018-06-01T04:19:04.093

Always a pleasure! However, please know that this is far from optimal. It is still not parallelized, and using the lists up, down, left, and right for lookup makes it unnecessarily(?) memory bound. Moreover, I am pretty sure that a more cache-aware algorithm (e.g., one traversing the grid diagonally) would make this one look rather slowish. The computational stencil is the same as for the 5-point Laplacian in finite differencing, so you may find ways to speed this up further in the literature. – Henrik Schumacher – 2018-06-01T06:51:34.937

5

This is only an observation, as I'm not so sure we can call Mod slow. The question is, how do we choose a baseline? Consider this large matrix

data = RandomInteger[1000, {2000, 2000}];


One of the simplest operations is probably adding an integer to each element. So let's time this

AbsoluteTiming[Do[data + 1, {20}]][[1]]/20
(* 0.00219195 *)


This seems to indicate that 2ms is an appropriate lower bound. How long does Mod take?

AbsoluteTiming[Do[Mod[data, 5, 1], {20}]][[1]]/20
(* 0.0117109 *)


So it takes about 5x the time of an integer addition. Is this slow-down justified? In my opinion, this is expected. An equivalent definition of Mod[m,n,1] in C looks like this

m - floor((m-1.0)/n)*n


We have several things to do: we need to cast to float/double and perform an expensive division. I guess floor is pretty cheap. To simplify the multiplication with n, we could cast back to an integer, but nevertheless, it seems obvious that there is much more work to do than in a simple addition.

I have tried several compiled versions of low-level mod implementations, including the above C code in a custom LibraryLink function. I was not able to match the plain speed of Mod; there is too much overhead. What I haven't tried is using a better parallelization than that of Compile. You could theoretically write a LibraryLink function that takes a matrix and calculates the mod in parallel. However, whether this would be spoiled by the overhead of transferring the data remains to be proven.
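For illustration, the elementwise kernel such a LibraryLink function would have to run could look like this minimal C sketch (all names are my own assumptions; the LibraryLink wrapping and the parallelization over row blocks are omitted):

```c
#include <assert.h>

/* Elementwise Mod[in[k], n, 1] over a flat buffer, integer-only.
   A parallel version would split the k-range across threads. */
void mod1_array(const long *in, long *out, long len, long n) {
    for (long k = 0; k < len; ++k) {
        long r = (in[k] - 1) % n;   /* may be negative when in[k] <= 0 */
        if (r < 0) r += n;
        out[k] = r + 1;             /* result in 1..n */
    }
}
```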

This is a good point. I suppose I've simply come to expect too much from these low-level internal functions :) For my explicit case I could do better by simply calculating where my wrap-around should go, but Mod needs to be smarter than that. – b3m2a1 – 2018-06-01T04:27:00.393

3

The analysis below may be totally off base, but at least this post will give you some tools to do the analysis yourself.

Let's take a look at what actually happens. For this, we'll wrap gcc with our own script, which will save the source code and gcc arguments for further examination. Here's what we do from Mathematica:

$CCompiler = {"Name" -> "GCC",
   "Compiler" -> CCompilerDriver`GCCCompiler`GCCCompiler,
   "CompilerInstallation" -> "/tmp", "CompilerName" -> Automatic};


If you're not on Linux, you may want to change this appropriately. The idea is to take the first element of the list returned by CCompilers[] and change the "CompilerInstallation" option to point to the wrapper instead of the default path (which was "/usr/bin" in my case). And here's the wrapper script, which we put at /tmp/gcc (don't forget to chmod +x it):

#!/bin/sh
echo "$*" > /tmp/mma-gcc.log
i=0
for opt in "$@"; do
    [ -f "$opt" ] && cp "$opt" /tmp/src-$i.c
    i=$((i+1))
done
exec gcc "$@"


After we run your Mathematica Compile[...] code, we get /tmp/src-9.c or a similarly-named file, which is our source code. To further experiment with compilation options you can start from the arguments given to gcc, which are saved in /tmp/mma-gcc.log.

Now let's look inside the C source file we've recovered. The Do loop in your Mathematica code corresponds to the following C code (indented by me, // comments also mine):

lab8: // do loop
I0_7 = I0_1;
I0_8 = I0_3;
I0_9 = I0_5;
dims[0] = I0_7;
dims[1] = I0_8;
err = funStructCompile->MTensor_allocate(T0_2, 2, 2, dims);
if( err)
{
goto error_label;
}
P3 = MTensor_getIntegerDataMacro(*T0_2);
I0_10 = I0_5;
goto lab25;
lab14: // loop over i
I0_12 = I0_5;
goto lab24;
lab16: // loop over j
I0_13 = I0_10 + I0_11;
I0_16 = I0_1;
I0_17 = I0_13 + I0_11;
{
mint S0 = FP1((void*) (&I0_18), (void*) (&I0_17),
(void*) (&I0_16), 1, UnitIncrements, 0);/*  Quotient  */
err = S0 == 0 ? 0 : LIBRARY_NUMERICAL_ERROR;
if( err)
{
goto error_label;
}
}
I0_17 = I0_16 * I0_18;
I0_18 = -I0_17;
I0_17 = I0_13 + I0_18;
P3[I0_9++] = I0_17;
lab24:
if( ++I0_12 <= I0_8)
{
goto lab16; // loop over j
}
lab25:
if( ++I0_10 <= I0_7)
{
goto lab14; // loop over i
}
lab26:
if( ++I0_6 <= I0_4)
{
goto lab8; // do loop
}


The first thing that catches the eye is that the code is not vectorized in any obvious way. Moreover, it's not even auto-vectorizer-friendly, since the goto construct is too flexible to reliably prove (from the compiler's POV) that it actually represents a loop, and to detect what type of loop it is. Another obstacle to auto-vectorization and many other optimizations is the call to the Quotient function via the pointer FP1, declared as

static BinaryMathFunctionPointer FP1;


So it's not really surprising that this does not work faster than normal Mathematica code. And maybe Mathematica also does some vectorization of its normal code, which happens to be faster on your CPU.
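To make the contrast concrete, here is a small hypothetical C sketch (my own, not the generated source): one function written goto-style like the code above, and one written as an idiomatic for loop that an auto-vectorizer can actually analyze. Both fill entry k (1-based) with Mod[k, n, 1]:

```c
#include <assert.h>

/* Goto-style loop, mimicking the structure of the generated C code. */
void fill_goto(long *out, long len, long n) {
    long i = 0;
    goto check;
loop:
    out[i] = (i % n) + 1;   /* (i % n) + 1 == Mod[i + 1, n, 1] for i >= 0 */
    i++;
check:
    if (i < len) goto loop;
}

/* The same computation as a plain for loop over a recognizable
   induction variable, which optimizers handle far better. */
void fill_for(long *out, long len, long n) {
    for (long i = 0; i < len; ++i)
        out[i] = (i % n) + 1;
}
```

Modern compilers can often recover the loop from the goto form too, but the recognizable for loop (with no function-pointer call in its body) gives the optimizer far less to prove.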

You can safely use CCodeStringGenerate to see what C code is created from a compiled function. – halirutan – 2018-05-31T14:04:15.960

@halirutan ah, great, I didn't know how to google for it. – Ruslan – 2018-05-31T14:04:57.210

Actually, I have wondered several times whether it is really a good idea that Mathematica replaces loops by gotos. – Henrik Schumacher – 2018-05-31T14:05:17.787

@halirutan is there any way to get all the compiler options used to compile this function? – Ruslan – 2018-05-31T14:10:45.087

Yes. Look at the details of CreateLibrary. You can basically use the C code from CCodeStringGenerate to create what Compile to "C" creates. Use the "ShellCommandFunction" option to see what is happening. – halirutan – 2018-05-31T14:16:06.050

3

I looked at your original code and tried it with the newer FunctionCompile function.

modProper1 =
Compile[{{initCoords, _Integer, 2}, {tSteps, _Integer}},
Module[{coordMat = initCoords, modMat, gridNRows, gridNCols},
gridNRows = Length@coordMat;
gridNCols = Length@coordMat[[1]];
Do[modMat =
Table[Mod[i - 1, gridNRows, 1], {i, gridNRows}, {j,
gridNCols}];, {n, tSteps}];
coordMat],
RuntimeOptions -> {"CatchMachineOverflow" -> False,
"CatchMachineIntegerOverflow" -> False,
"CompareWithTolerance" -> False, "EvaluateSymbolically" -> False},
CompilationTarget -> "C", Parallelization -> True]


And its timing:

isingGrid = RandomChoice[{-1, 1}, {200, 200}];

{timing1, result1} = AbsoluteTiming[modProper1[isingGrid, 5000]];


On my machine, timing1 gives values like 3.2443 (sometimes better, sometimes worse).

Here is my best attempt with FunctionCompile:

modProper2 = FunctionCompile[
Function[{
Typed[initCoords,
TypeSpecifier["PackedArray"]["MachineInteger", 2]],
Typed[tSteps, "MachineInteger"]
},
NativeUncheckedBlock@
Module[{coordMat = initCoords, modMat, gridNRows, gridNCols},
gridNRows = Length@coordMat;
gridNCols = Length@coordMat[[1]];
Do[modMat =
Table[Mod[i - 1, gridNRows, 1], {i, gridNRows}, {j,
gridNCols}], {n, tSteps}];
coordMat
]], "CompilerOptions" -> {"LLVMOptimization" ->
"ClangOptimization"[3], "AbortHandling" -> False}]


When I run this:

{timing2, result2} = AbsoluteTiming[modProper2[isingGrid, 5000]];


The result for timing2 gives values like 0.0750443 (so about 40x faster)

The results are the same:

In[.]:= result2 == result1

Out[.]= True


It uses two features that I believe are not documented (yet):

1. NativeUncheckedBlock -- this turns off error checking
2. "CompilerOptions" -- turning off the abort handlers also speeds things up (a lot)