Use Compile to improve performance


I've recently discovered Compile as a way to optimize my code. I wrote a program to learn something new about Compile. I have a function that creates a vector in the following form:

{g[[v]],g[[w]],g[[v]]a + (1-a)g[[w]],g[[w]]a + (1-a)g[[v]]}

where g is an $n \times m$ global matrix, v and w are two row indices supplied by the user, and a is a uniform random number between 0 and 1. Below is the non-compiled function:

g1[v_, w_] := Module[{a = Random[]},
            {g[[v]],g[[w]],g[[v]]a + (1-a)g[[w]],g[[w]]a + (1-a)g[[v]]}];
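For the timings below, g must be defined first. The post doesn't show how g was built, so as an assumed setup (any numeric matrix works):

```mathematica
(* assumed setup: a 5 x 5 real matrix, size chosen arbitrarily *)
g = RandomReal[1, {5, 5}];
```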

And its timing:

Do[g1[1, 2], {10000}] // AbsoluteTiming
{0.605403, Null}

And this is the compiled version:

g2 = With[ {buff = g, a = Random[]},
     Compile[ {{v, _Integer }, {w, _Integer}},
     buff[[v]] # + (1 - #) buff[[w]] & /@ {1, 0, a, 1-a},
     CompilationTarget -> "C", Parallelization -> True]]

And its timing:

Do[g2[1, 2], {10000}] // AbsoluteTiming
{0.140093, Null}

The compiled function is about 4x faster than the non-compiled one; is there any way to do this even faster?

Vanille

Posted 2015-08-17T21:18:18.660

Reputation: 303

Something is wrong here. You use the symbol g for both the global matrix and the function g[v_,w_]? This doesn't work. Haven't you seen all the error messages that appear when you evaluate your example? Can you please provide a complete, working example for the uncompiled code? – halirutan – 2015-08-17T21:26:03.707

Yes! I confused the names; now it works fine, but I need more speed... Can I edit my wrong question, or add a new question and cancel this one? – Vanille – 2015-08-17T21:51:01.800

Yes, please edit your question. You'll see an edit button directly under the question. You could re-use some of the text and only correct the code.

– halirutan – 2015-08-17T21:52:15.117

Your functions are not the same either, though that may not affect performance. The last two entries of your vector will differ depending on which version you call. You should be mapping onto {1, 0, a, (1-a)} in the compiled function. – N.J.Evans – 2015-08-17T22:17:22.120

Yes, you're right! Now I'll edit my question! Thanks a lot! – Vanille – 2015-08-17T22:32:17.647

With[{g = g, a = Random[]}, Compile[{{v, _Integer}, {w, _Integer}}, With[{ak = a Subtract[g[[w]], g[[v]]]}, {g[[v]], g[[w]], -ak + g[[w]], ak + g[[v]]}]]] is ~six times faster than base on the loungebook, not even compiled to C... – ciao – 2015-08-17T22:45:21.620

Answers


First off, your function is very simple without any hard number crunching, so it will always be hard to get a large speedup from compilation. Secondly, your Parallelization option for Compile does nothing here: it only takes effect when the compiled function also has RuntimeAttributes -> {Listable} and is then called with list arguments.
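As a minimal sketch of that point (my own toy example, not from the original post), the option combination that actually enables parallel evaluation looks like this:

```mathematica
(* Parallelization -> True only matters together with
   RuntimeAttributes -> {Listable}; the parallelism happens
   when the compiled function receives list arguments *)
cf = Compile[{{x, _Real}}, Sin[x] + Cos[x],
   RuntimeAttributes -> {Listable}, Parallelization -> True];

cf[RandomReal[1, 10^6]]; (* the list call is what runs in parallel *)
```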

Let me give slightly changed versions of your examples and explain how you can achieve a large impact on speed. I'm using random integers in my example because the results are easier to read that way.

g = RandomInteger[10, {5, 5}];
g1[v_, w_] := 
  Module[{a = 2}, {g[[v]], g[[w]], g[[v]] a + (1 - a) g[[w]], 
    g[[w]] a + (1 - a) g[[v]]}];

Nothing has changed so far, apart from a = 2 and g being an integer matrix.

g2 = With[{buff = g, a = 2}, 
  Compile[{{v, _Integer, 0}, {w, _Integer, 0}}, 
   buff[[v]] # + (1 - #) buff[[w]] & /@ {1, 0, a, (1 - a)},
   CompilationTarget -> "C",
   Parallelization -> True,
   RuntimeAttributes -> {Listable},
   RuntimeOptions -> "Speed"
   ]
  ]

Look carefully at the options given to Compile, because this is how you set up a compiled function that runs in parallel when supplied with list arguments; I'll come back to this later. First, the pure speed comparison for your Do loop, which calls g1 and g2 over and over again, always one after another and not in parallel.

Needs["GeneralUtilities`"]; (* provides AccurateTiming, used below *)
Do[g1[1, 2], {10000}] // AccurateTiming
Do[g2[1, 2], {10000}] // AccurateTiming
(* 0.140652 *)
(* 0.0199426 *)

This is almost 7x faster, which might partly be caused by the RuntimeOptions -> "Speed" setting I gave.

Now it gets a bit harder, because we want to use the parallelization. It works as follows: a function set up as Listable, as I did above, can be given list arguments. One example would be writing Sqrt[{1, 2, 3}] instead of {Sqrt[1], Sqrt[2], Sqrt[3]}. This is exactly how your g2 can be used now, and the awesome part is that all those calls are indeed computed in parallel.

Therefore, instead of g2[1,4] and g2[2,5] you should call

g2[{1, 2}, {4, 5}]

You can check that the results are the same:

{g2[1, 4], g2[2, 5]} == g2[{1, 2}, {4, 5}]
(* True *)

Now let's say you want to compute g2 for all possible tuples {1,1}, {1,2}, ..., {2,1}, .... Computing them all at once is where parallelization makes a vast difference in speed. Let us create all possible tuples, then compute g1 and g2 for all of them, repeating this, say, 100 times. That leads to

tupl = Transpose[Tuples[Range[5], {2}]]
Do[g1[a, b], {100}, {a, First[tupl]}, {b, Last[tupl]}] // AccurateTiming
Do[g2 @@ tupl, {100}] // AccurateTiming
(* 0.984676 *)
(* 0.00324821 *)

This is a speed improvement of about 300x. So the conclusion is that you need to inspect your problem carefully and look out for parts that can run independently. Additionally, you need to understand the concept of listable functions; if you like, you can look at other answers of mine where I used this technique.

Appendix: On injecting large expressions

Michael mentioned the following issue, which I'd like to address:

I found that with g = RandomReal[1, {1000, 1000}], injecting g into the compiled-to-C command resulted in the kernel crashing after ten minutes and a loose C compiler running in the background. (I figured if speed was an issue, there must be lots of data.)

The basic problem is that you inject the big matrix into your final code, so every value of this big 1000x1000 matrix is copied and initialized in the generated C code. For a small example, it looks like this:

<< CCodeGenerator`
fun = With[{m = RandomReal[1, {5, 5}]},
  CCodeStringGenerate[
   Compile[{},
    m, CompilationTarget -> "C"], "fun"]
  ]

Scroll down and you'll find the place where all the values are set. Now can you imagine a C source with 10^6 lines of code only for the initialization of a 1000x1000 matrix? I guess if I were a compiler, I would be pissed too.

The solution is pretty simple: don't inject the matrix. You can just pass it as an argument to the compiled function. Matrices are not copied when you call the final function; instead, only a reference to the original Mathematica tensor is passed. Therefore, you will even gain speed, because the final compiled function is not bloated with lots of matrix-initialization code:

g3 = With[{a = 2}, 
   Compile[{{v, _Integer, 0}, {w, _Integer, 0}, {buff, _Real, 2}}, 
    buff[[v]] # + (1 - #) buff[[w]] & /@ {1, 0, a, (1 - a)}, 
    CompilationTarget -> "C",
    Parallelization -> True,
    RuntimeAttributes -> {Listable},
    RuntimeOptions -> "Speed"]
];

Let's make a speed comparison for a 100x100 matrix. This still works on my machine but it probably crashes on yours when compiling g2, so please save your work!

n = 100;
g = RandomReal[1, {n, n}];
tupl = Tuples[Range[n], {2}];

g2 = With[{buff = g, a = 2}, 
   Compile[{{v, _Integer, 0}, {w, _Integer, 0}}, 
    buff[[v]] # + (1 - #) buff[[w]] & /@ {1, 0, a, (1 - a)},
    CompilationTarget -> "C",
    Parallelization -> True,
    RuntimeAttributes -> {Listable},
    RuntimeOptions -> "Speed"
    ]
];

g2 @@@ tupl // AccurateTiming
(g3[##, g] & @@ Transpose[tupl]) // AccurateTiming
(* 0.0523184 *)
(* 0.0152345 *)

So in this case, the compiled function that does not store the matrix explicitly is about 3.5x faster than g2. With growing matrix sizes, this speed difference becomes larger.
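A side benefit of passing the matrix as an argument (my own remark, not part of the original answer): the same compiled g3 can be reused with any real matrix, with no recompilation needed. For instance, with a hypothetical second matrix h:

```mathematica
(* h is an arbitrary example matrix; g3 from above accepts it directly *)
h = RandomReal[1, {200, 200}];
g3[3, 7, h] (* no recompilation; h is passed by reference *)
```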

halirutan


Reputation: 109 574

I found that with g = RandomReal[1, {1000, 1000}], injecting g into the compiled-to-C command resulted in the kernel crashing after ten minutes and a loose C compiler running in the background. (I figured if speed was an issue, there must be lots of data.) – Michael E2 – 2015-08-17T22:59:35.123

@All I have explained the issue that Michael found in detail at the end of my answer. – halirutan – 2015-08-18T00:09:02.377

Thanks a lot, halirutan! Very detailed answer! – Vanille – 2015-08-18T08:48:57.600

Very good addition about injection; I use the same thing all the time, and it's good to see this systematically exposed (I'd vote, but I already did). I have discussed a similar phenomenon happening in closures (the last section on Module vs With), where one also effectively passes the large data structure at runtime; although, for a closure, that meant using Module to store the data rather than With.

– Leonid Shifrin – 2015-09-19T09:40:53.197

Hi, @halirutan. How do you know the matrix is passed by reference? The "LibraryLink/tutorial/InteractionWithMathematica" doc mentions this problem for LibraryLink: there, "Constant" will guarantee pass-by-reference, and as I tested, this makes a huge difference compared to "Automatic". But there is no such option in Compile, so I am not sure whether it is passed by reference by default. – matheorem – 2016-09-07T13:17:51.123