# Question

Why is it that compiled, `Listable`, parallel functions which work perfectly fine on the main kernel do not run in parallel on sub-kernels?

# Details

## First Example

Let me give a first example. I compile a function $f:\mathbb{R}\to\mathbb{R}$, which is a simple sum of sine functions, with the options `CompilationTarget -> "C"`, `RuntimeAttributes -> {Listable}`, and `Parallelization -> True`. Due to the `Listable` attribute I can now call the function with tensor arguments, and due to the parallelization the values in the tensor are processed in parallel. If you are on a slow machine, adjust `n` so that it is not that high:

```
f = Compile[{{t, _Real, 0}},
     #, CompilationTarget -> "C", RuntimeAttributes -> {Listable},
     Parallelization -> True] &@
   Sum[Sin[2.0 Pi k t]/k, {k, 1000}];
data = With[{n = 1000000}, Table[t, {t, 0, 1, 1/(n - 1)}]];
f[data];
```

Looking at the system monitor during the calculation shows that all processors are running at 100%. If you like, you can compare the speed of this execution with, for instance, `f /@ data`.
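If you prefer numbers over watching the system monitor, you can time both variants (a sketch; the absolute timings depend heavily on your hardware):

```
AbsoluteTiming[f[data];]   (* Listable call: elements processed in parallel *)
AbsoluteTiming[f /@ data;] (* Map: f is applied to each element serially *)
```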

## Under the hood

*This may only be correct for Linux and OS X!*

What happens when you use `Compile` with `CompilationTarget -> "C"` is that a shared library is created from your *Mathematica* code, and this library contains a function that can be called. The libraries are stored in a folder which is specific to the process id of the kernel that compiled them. Let's write a short function to print the important parts of such a compiled function. I extract from a `CompiledFunction` the information where the shared library is placed and what type the function has. Additionally, I add `$KernelID` and `$ProcessID`:

```
printCFuncLibrary[HoldPattern[CompiledFunction[__, lib_]]] :=
  StringJoin["{KernelID: ", ToString[$KernelID], ", ProcessID: ",
   ToString[$ProcessID], "} -> ", ToString[lib, InputForm]]
```

Using this on `f`, I get:

```
printCFuncLibrary[f]
(*
{KernelID: 0, ProcessID: 3809} ->
LibraryFunction["/home/patrick/.Mathematica/ApplicationData/\
CCompilerDriver/BuildFolder/lenerd-3809/compiledFunction0.so",
"compiledFunction0", {{Real, 0, "Constant"}}, Real]
*)
```

Please note that the build folder has the process id of my main kernel in it: "lenerd-3809". If you now execute this on different sub-kernels, you see that the shared-library function which is used stays the same. This is kind of expected:

```
ParallelTry[printCFuncLibrary, {f, f, f, f}, 4]
```

What *is* kind of unexpected is that when I call `f` on even only one sub-kernel, I lose the vector parallelization completely and only one processor is working on the task:

```
ParallelTry[f, {data}, 1];
```

I would rather have expected that calling the compiled function (compiled on the main kernel) on sub-kernels would lead to some kind of clash.

## Compiling the function **on** the sub-kernels

I could not explain the above behavior, but I could well imagine that having only one shared-library function, which is perhaps loaded only once, is not the best situation when several processes want to access it. So why not compile the function on all sub-kernels? This way every sub-kernel gets its own copy of the shared library and loads its own version of the function:

```
ParallelEvaluate[
fsub = Compile[{{t, _Real, 0}},
#, CompilationTarget -> "C", RuntimeAttributes -> {Listable},
Parallelization -> True] &@
Sum[Sin[2.0 Pi k t]/k, {k, 1000}];
];
ParallelEvaluate[printCFuncLibrary[fsub]]
```

I skip my output here, but what you should see is that every sub-kernel gets its own copy of the shared library, placed in a folder named after the process id of the sub-kernel. Additionally, the function `fsub` does not exist on our main kernel, and therefore calling it there with a numeric value stays unevaluated. On the other hand, `ParallelEvaluate[fsub[.1]]` calculates the correct results.
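A quick sketch of that check (the exact numeric output depends on the sum, so I omit it):

```
fsub[0.5]                    (* stays unevaluated on the main kernel *)
ParallelEvaluate[fsub[0.1]]  (* one identical result per sub-kernel *)
```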

If you now supply the vector `data` to the compiled function on the sub-kernels, you see that it is not processed in parallel:

```
ParallelTry[fsub, {data}];
```

I tried several other things to get some insight into this behavior, but nothing really helped me understand what's going on.
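One diagnostic worth trying is to compare the parallel-related `SystemOptions` of the main kernel with those of the sub-kernels (a sketch; I am assuming the sub-option group is called `"ParallelOptions"`, as it is in recent Mathematica versions):

```
SystemOptions["ParallelOptions"]                   (* on the main kernel *)
ParallelEvaluate[SystemOptions["ParallelOptions"]] (* on each sub-kernel *)
```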

## You might ask...

... when your compiled function is parallelized so nicely, isn't a second layer of parallelization pretty useless? The answer is yes, but for my real problem this is not the case. Assume you have a minimization problem and you parallelize only your target function. Since the minimization method itself runs serially and only the calls to the target function are parallel, there is still much CPU time left unused. In such cases it would be reasonable to run two or more minimizations at the same time.

## Comments

- Have you looked at the appropriate `SystemOptions` on the parallel kernels? `ParallelEvaluate@SystemOptions["ParallelOptions"]` – Szabolcs – 2012-06-19
- How the hell could I have assumed that `$ProcessorCount` is used on the sub-kernels too?? Unbelievable! Can you write up an answer? – halirutan – 2012-06-19
- I'll try to answer when I get to try this on a multicore machine. Usually I can test parallel stuff on a single-core one too, but here it's important to use multiple cores ... if you have solved it based on my comment, feel free to post your own answer! – Szabolcs – 2012-06-19
- OK, if you don't mind, then I'll write up what's happening. Thanks. – halirutan – 2012-06-20
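Following up on the first comment, a sketch of the kind of fix it hints at: the sub-kernels may default to a compiled-runtime thread number of 1, so one can raise it on each of them. I am assuming here that the relevant sub-option of `"ParallelOptions"` is called `"ParallelThreadNumber"` (check `SystemOptions["ParallelOptions"]` on your version):

```
(* Raise the compiled-runtime thread count on every sub-kernel;
   "ParallelThreadNumber" is assumed to be the relevant sub-option *)
ParallelEvaluate[
  SetSystemOptions[
   "ParallelOptions" -> "ParallelThreadNumber" -> $ProcessorCount]];
ParallelTry[fsub, {data}];  (* should now use several cores again *)
```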