First off, your function is very simple without any hard number-crunching, so it will always be hard to get a large speedup from the compiled version. Secondly, your `Parallelization` option for `Compile` is useless because, used this way, it doesn't parallelize anything.

Let me give slightly changed versions of your examples and explain how you can achieve a large speedup. I'm using random integers in my example because the results are easier to read this way.

```
g = RandomInteger[10, {5, 5}];
g1[v_, w_] := Module[{a = 2},
  {g[[v]], g[[w]], g[[v]] a + (1 - a) g[[w]], g[[w]] a + (1 - a) g[[v]]}
];
```

Nothing has changed so far, apart from `a = 2` and `g` being an integer matrix.

```
g2 = With[{buff = g, a = 2},
Compile[{{v, _Integer, 0}, {w, _Integer, 0}},
buff[[v]] # + (1 - #) buff[[w]] & /@ {1, 0, a, (1 - a)},
CompilationTarget -> "C",
Parallelization -> True,
RuntimeAttributes -> {Listable},
RuntimeOptions -> "Speed"
]
]
```

Look carefully at the options given to `Compile`, because this is how you set up a compiled function that runs in parallel when supplied with list arguments. I'll come back to this later. First, the pure speed comparison for your `Do` loop that calls `g1` and `g2` over and over again, but always *one after another* and not in parallel.

```
Needs["GeneralUtilities`"]; (* provides AccurateTiming *)
Do[g1[1, 2], {10000}] // AccurateTiming
Do[g2[1, 2], {10000}] // AccurateTiming
(* 0.140652 *)
(* 0.0199426 *)
```

This is almost 7x faster, which may partly be caused by the `RuntimeOptions -> "Speed"` setting I gave.
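If you're curious what `Compile` actually produced, the standard `CompiledFunctionTools` package can print the instruction listing of a compiled function (for `CompilationTarget -> "C"` this shows the virtual-machine version of the code that was translated to C):

```
Needs["CompiledFunctionTools`"]
CompilePrint[g2]
```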

Now it's getting a bit harder, because we want to use the parallelization. It works as follows: a function set up as `Listable`, as I did above, can be given list arguments. One example would be to write `Sqrt[{1, 2, 3}]` instead of `{Sqrt[1], Sqrt[2], Sqrt[3]}`. This is exactly how your `g2` can be used now, and the awesome part is that all those calls are indeed computed in parallel.
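To see this listability in action:

```
Sqrt[{1, 2, 3}] == {Sqrt[1], Sqrt[2], Sqrt[3]}
(* True *)
```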

Therefore, instead of calling `g2[1, 4]` and `g2[2, 5]` separately, you should call

```
g2[{1, 2}, {4, 5}]
```

You can check that the results are the same:

```
{g2[1, 4], g2[2, 5]} == g2[{1, 2}, {4, 5}]
(* True *)
```
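As a side note, the number of threads that `Parallelization -> True` may use is controlled through the standard `SystemOptions` mechanism, which can be handy to know when benchmarking:

```
(* inspect the parallel settings for compiled functions *)
SystemOptions["ParallelOptions"]

(* e.g., restrict compiled parallelization to 4 threads *)
SetSystemOptions["ParallelOptions" -> "ParallelThreadNumber" -> 4];
```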

Now let's say you want to compute `g2` for all possible tuples `{1,1}, {1,2}, ..., {2,1}, ...`. *That* is where you get a vast difference in speed. Let us create all possible tuples, prepare them, and then compute `g1` and `g2` for all of them, repeating this, say, 100 times. That leads to

```
tupl = Transpose[Tuples[Range[5], {2}]];
Do[g1[a, b], {100}, {a, First[tupl]}, {b, Last[tupl]}] // AccurateTiming
Do[g2 @@ tupl, {100}] // AccurateTiming
(* 0.984676 *)
(* 0.00324821 *)
```

This is a speed improvement of about 300x. So the conclusion is that you need to inspect your problem carefully and look out for parts that can run independently. Additionally, you need to understand the concept of *listable* functions. If you like, you can check other answers of mine where I used this approach.

### Appendix: On injecting large expressions

Michael mentioned the following issue, which I'd like to address:

> I found that with `g = RandomReal[1, {1000, 1000}]`, injecting g into the compiled-to-C command resulted in the kernel crashing after ten minutes and a loose C compiler running in the background. (I figured if speed was an issue, there must be lots of data.)

The basic problem is that you *inject* the big matrix into your final code, so every value of this big 1000x1000 matrix is copied and initialized in the C code. For a small example, this looks like this:

```
<< CCodeGenerator`
fun = With[{m = RandomReal[1, {5, 5}]},
CCodeStringGenerate[
Compile[{},
m, CompilationTarget -> "C"], "fun"]
]
```

Scroll down and you'll find the place where all the values are set. Now imagine a C source file with 10^6 lines of code only for the initialization of a 1000x1000 matrix. I guess if I were a compiler, I would be pissed too.

The solution is pretty simple: don't inject the matrix. You can just pass it as an argument to the compiled function. Matrices are not copied when you call the final function; instead, only a reference to the original *Mathematica* tensor is passed. Therefore, you will even gain speed, because the final compiled function is not bloated with lots of matrix-initialization code:

```
g3 = With[{a = 2},
Compile[{{v, _Integer, 0}, {w, _Integer, 0}, {buff, _Real, 2}},
buff[[v]] # + (1 - #) buff[[w]] & /@ {1, 0, a, (1 - a)},
CompilationTarget -> "C",
Parallelization -> True,
RuntimeAttributes -> {Listable},
RuntimeOptions -> "Speed"]
];
```

Let's make a speed comparison for a 100x100 matrix. This still works on my machine, but it may well crash on yours when compiling `g2`, so please save your work!

```
n = 100;
g = RandomReal[1, {n, n}];
tupl = Tuples[Range[n], {2}];
g2 = With[{buff = g, a = 2},
Compile[{{v, _Integer, 0}, {w, _Integer, 0}},
buff[[v]] # + (1 - #) buff[[w]] & /@ {1, 0, a, (1 - a)},
CompilationTarget -> "C",
Parallelization -> True,
RuntimeAttributes -> {Listable},
RuntimeOptions -> "Speed"
]
];
g2 @@@ tupl // AccurateTiming
(g3[##, g] & @@ Transpose[tupl]) // AccurateTiming
(* 0.0523184 *)
(* 0.0152345 *)
```

So in this case, the compiled function that does not store the matrix explicitly is about 3.5x faster than `g2`. With growing matrix sizes, this speed difference becomes even larger.
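As a quick sanity check (assuming `g`, `tupl`, `g2`, and `g3` are still defined as above), you can verify that both variants compute the same results; this should give `True`:

```
g2 @@@ tupl == (g3[##, g] & @@ Transpose[tupl])
```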

### Comments

Something is wrong here. You use the symbol `g` for both the global matrix and a function `g[v_, w_]`? This doesn't work. Haven't you seen all the error messages that appear when you evaluate your example? Can you please provide a complete, working example for the uncompiled code? – halirutan – 2015-08-17T21:26:03.707

Yes! I confused the names; now it works fine, but I need more speed... Can I edit my wrong question or add a new question and cancel this one? – Vanille – 2015-08-17T21:51:01.800

Yes, edit your question please. You see an edit button directly under the question. You could re-use some of the text parts and only correct the code. – halirutan – 2015-08-17T21:52:15.117

Your functions are not the same either, though that may not affect performance. The last two entries of your vector are going to be different depending on which version you call. You should be mapping onto `{1, 0, a, (1-a)}` in the compiled function. – N.J.Evans – 2015-08-17T22:17:22.120

Yes, you're right! Now I edit my question! Thanks a lot! – Vanille – 2015-08-17T22:32:17.647

`With[{g = g, a = Random[]}, Compile[{{v, _Integer}, {w, _Integer}}, With[{ak = a Subtract[g[[w]], g[[v]]]}, {g[[v]], g[[w]], -ak + g[[w]], ak + g[[v]]}]]]` is ~six times faster than base on the loungebook, not even compiled to C... – ciao – 2015-08-17T22:45:21.620