## Can I use Compile to speed up InverseCDF?

10

I'm repeatedly issuing the command

P = InverseCDF[NormalDistribution[0, 1], T]


where T is either a 36- or 64-vector of real numbers (points in the unit hypercubes $[0,1]^{36}$ or $[0,1]^{64}$, quasirandomly distributed according to the "golden-ratio" generalization formula of Martin Roberts given in his answer to How can one generate an open-ended sequence of low-discrepancy points in 3D?). (The points--after conversion by the indicated command--are employed for the generation of "two-rebit" and "two-qubit" $4 \times 4$ density matrices, randomly generated with respect to the Bures [minimal monotone] measure, in accordance with formulas (24) and (28), setting $x =\frac{1}{2}$, of "Random Bures mixed states and the distribution of their purity" https://arxiv.org/abs/0909.5094 .)

An analysis of mine shows that this is by far the most time-consuming step in my program.

So, can I use Compile so that the command "knows" that the vector is composed of reals? I presume/hope that, if so, this would lead to a considerable speed-up.

1Unfortunately, this is not possible, at least not without further effort. The function InverseErfc is not compilable. That means that the compiled function will contain calls to MainEvaluate which will slow down the execution severely. You can speed it up a little by using the Listable-attribute of InverseCDF to process whole matrices at once: T = RandomReal[{0., 1.}, {10000, 64}]; { First@AbsoluteTiming[ P1 = Table[InverseCDF[NormalDistribution[0, 1], t], {t, T}];], First@AbsoluteTiming[P2 = InverseCDF[NormalDistribution[0, 1], T];] }. – Henrik Schumacher – 2018-09-02T12:48:33.340

7

Out of curiosity, I tried to write my own version of inverse CDF for the normal distribution. I employ a qualitative approximation of the inverse CDF as initial guess and apply Newton iterations with line search until convergence.

This is the code:

f[x_] := CDF[NormalDistribution[0, 1], x];
finv[y_] := InverseCDF[NormalDistribution[0, 1], y];
p = 1/200;
q = 2/5;
g[x_] = N@Piecewise[{
{finv[$MachineEpsilon], 0 <= x <=$MachineEpsilon},
{Simplify[Normal@Series[finv[x], {x, 0, 1}], 0 < x < 1], $MachineEpsilon < x < p}, {Simplify[PadeApproximant[finv[x], {x, 1/2, {7, 8}}]], q < x < 1 - q}, {Simplify[Normal@Series[finv[x], {x, 1, 1}], 0 < x < 1], 1 - p < x < 1}, {finv[1 -$MachineEpsilon], 1 - $MachineEpsilon <= x <= 1} }, Simplify[PadeApproximant[finv[x], {x, 1/2, {17, 18}}]] ]; (*g[y_]:=Log[Abs[(1-Sqrt[1-y]+Sqrt[y])/(1+Sqrt[1-y]-Sqrt[y])]];*) cfinv = Block[{T, S, Sτ}, With[{ S0 = N[g[T]], ϕ00 = N[(T - f[S])^2], direction = N[Simplify[(T - f[S])/f'[S]]], residual = N[(T - f[Sτ])^2], σ = 0.0001, γ = 0.5, TOL = 1. 10^-15 }, Compile[{{T, _Real}}, Block[{S, Sτ, ϕ0, τ, u, ϕτ}, S = S0; ϕ0 = ϕ00; While[Sqrt[ϕ0] > TOL, u = direction; τ = 1.; Sτ = S + τ u; ϕτ = residual; While[ϕτ > (1. - σ τ) ϕ0, τ *= γ; Sτ = S + τ u; ϕτ = residual; ]; ϕ0 = ϕτ; S = Sτ; ]; S ], CompilationTarget -> "C", RuntimeAttributes -> {Listable}, Parallelization -> True, RuntimeOptions -> "Speed" ] ] ];  And here an obligatory test for speed and accuracy (tested on a Haswell Quad Core CPU): T = RandomReal[{0., 1}, {1000000}]; a = finv[T]; // RepeatedTiming // First b = cfinv[T]; // RepeatedTiming // First Max[Abs[a - b]]  0.416 0.0533 3.77653*10^-12 # Discussion So this one is almost three eight times faster than the built-in method. I also expect a speedup should one replace the function g by a better but also reasonably quick approximation for the initial guess. First I tried an InterpolatingFunction for the (suitably) transformed "true" CDF, but that turned out to be way too slow. Of course Newton's method has its problems on the extreme tails of the distribution (close to 0 and close 1) where the CDF has derivative close to 0. Maybe a secant method would have been more appropriate? # Edit Using expansions of the inverse CDF at$0$,$1/2$and$1$, I was able to come up with a way better initial guess function g. A very quick test indicates that my program seems to run at least five times faster on my (very basic) MacBookAir using the new cfinv command than with the (default/built-in) finv[y_] := InverseCDF[NormalDistribution[0, 1], y] command. – Paul B. Slater – 2018-09-02T17:49:27.870 That great to hear. Interesting that this gets relatively faster on a Duo-Core machine. Maybe InverseCDF is better parallelizable: Newton's methods has the potential to make computations a bit assynchronous, so that the organization overhead rises. That makes it potentially less scalable. – Henrik Schumacher – 2018-09-02T18:14:58.067 @PaulB.Slater In the meantime, I have been able to come up with a further considerable speedup. Have a try. – Henrik Schumacher – 2018-09-02T19:08:15.300 1Quick test indicates now about twice as quick as the previous improvement--so about ten times faster than the InverseCDF[NormalDistribution[0, 1], T] command itself. (Incidentally, I first was directed to perform a software installation to be able to implement these Schumacher codes.) – Paul B. Slater – 2018-09-03T05:15:21.677 I tried my original/full program (given at the end of my preceding question (https://mathematica.stackexchange.com/questions/181045/can-a-certain-pair-of-expressions-be-compressed-into-one) for 100,000 iterations with and without the Schumacher coding. The total time fell from 257.5 secs. to 27.5. The time expended on the inverse operation under study fell from 241.7 to 16.7 (ratio=14.4). A little puzzling, the time spent on the (presumably same) subsequent matrix operations step fell from 9.14 to 3.93. The remaining step (the subject of the preceding question) stayed at 4.6. So, maybe the – Paul B. Slater – 2018-09-03T12:38:28.237 observation there "Notice that a main part of the speed up comes from processing many matrices at once" in regard to that last/third (time-invariant) step although, of course, interesting, is not of major relevance, time-wise. So, maybe not really worth pursuing, since it would involve somewhat more technical coding than I am presently employing. (I meant this to be a continuation of my preceding comment--but the Schumacher comment "snuck in" in the short time interval.) – Paul B. Slater – 2018-09-03T12:44:19.183 Yes, you are probably right. If you got this speed-up with your present style of coding, there is no reason at the moment to change it. – Henrik Schumacher – 2018-09-03T12:47:20.643 Towards the speed-up of matrix operations: The compiled method does never return Infinity or -Infinity but rather the CDF values of $MachineEpsilon and 1-\$MachineEpsilon. This way, the outputs are always packed arrays which is good for matrix operations. – Henrik Schumacher – 2018-09-03T12:48:18.330

1Oh, interesting (preceding/intervening) comment of Schumacher. So, since Infinity or -Infinity is now never returned, I guess I can discard the Check operation used in the matrix operations step. Maybe this will make the two programs comparable in this regard. – Paul B. Slater – 2018-09-03T12:49:29.657

Disregard last sentence of last comment--too late to edit out. – Paul B. Slater – 2018-09-03T13:00:03.180

@PaulB.Slater So, does this method work for you? – Henrik Schumacher – 2018-09-11T21:09:36.367

Yes, I'm very pleased with the Henrik Schumacher code and am running a number of "production" jobs using it. Whether it eventually turns out to be insightful will depend upon the utility of the Roberts quasirandom sequence code I'm implementing--discussed in the expanded form of my question. (I prematurely hit the return button, so my time to complete the edit is limited--not the first time I did it [something of a bug].) – Paul B. Slater – 2018-09-12T03:26:33.000

11

This is a little long for a comment, beside which it points out an unexpected difference probably resulting from subsystems evolving independently at different times. In addition to Henrik's comment that the code is uncompilable, there are two things to add. The most significant is how the quantity is computed, and the other is the usual difference between packed and unpacked arrays.

If InverseCDF[NormalDistribution[0, 1], t] is evaluated symbolically first, it simplifies to a scaled InverseErfc:

p = InverseCDF[NormalDistribution[0, 1], t]
pv = First[p]
(*
ConditionalExpression[-Sqrt[2] InverseErfc[2 t], 0 <= t <= 1]
-Sqrt[2] InverseErfc[2 t]
*)


Surprisingly, the simplified version is over 100 times slower. The ConditionalExpression is not Listable, and one might think that the seemingly "vectorized" pv would be an improvement. However pv is only slightly faster, probably all due to the condition check being removed.

Here are some data for timing tests. The array tt is packed and the array uu is the unpacked version of tt.

SeedRandom[0];    (* so everyone uses the same data *)
tt = RandomReal[1, 100];
uu = DeveloperFromPackedArray[tt];


Clearly from the timings below, InverseCDF[NormalDistribution[0, 1], tt] calls some internal vectorized code for the computation, where as both p and pv, which use InverseErfc, are equally slow on packed and unpacked arrays. Only the InverseCDF is significantly faster on packed arrays. It is still 3.5 times faster than InverseErfc on unpacked arrays. The only case in which using the Listable attribute matters is the first one, which uses InverseCDF on packed arrays.

(* PACKED ARRAYS *)
InverseCDF[NormalDistribution[0, 1], tt]; // RepeatedTiming
pv /. t -> tt; // RepeatedTiming       (* InverseErfc *)
Table[p, {t, tt}]; // RepeatedTiming   (* Henrik's 1st example *)
(*
{0.000054, Null}
{0.0071, Null}
{0.0075, Null}
*)

(* UNPACKED ARRAYS *)
InverseCDF[NormalDistribution[0, 1], uu]; // RepeatedTiming
Table[InverseCDF[NormalDistribution[0, 1], t], {t, uu}]; // RepeatedTiming
pv /. t -> uu; // RepeatedTiming       (* InverseErfc *)
Table[p, {t, uu}]; // RepeatedTiming   (* Henrik's 1st example *)
(*
{0.0023, Null}
{0.0021, Null}
{0.0074, Null}
{0.0076, Null}
*)


Thus the fastest method available is InverseCDF[NormalDistribution[0, 1], t] on numeric t; if you have many such computations, packed the input values into an array t. If it is difficult to generate t as a packed array, you can use t = DeveloperToPackedArray[t, Real].

9

Let me demonstrate two approaches in this answer.

P.J. Acklam (WayBack archive of his page) devised a not-too-long method to approximate the quantile of the Gaussian distribution, with absolute errors on the order of $$10^{-9}$$. I have tweaked the Mathematica implementation of Acklam's approximation by William Shaw so that one can obtain a compilable function:

AcklamQuantile = Block[{u},
Compile[{{u, _Real}}, #, RuntimeAttributes -> {Listable}] & @@
Hold[With[{a = {-39.69683028665376, 220.9460984245205, -275.9285104469687,
138.3577518672690, -30.66479806614716, 2.506628277459239},
b = {-54.47609879822406, 161.5858368580409, -155.6989798598866,
66.80131188771972, -13.28068155288572, 1},
c = {-0.007784894002430293, -0.3223964580411365, -2.400758277161838,
-2.549732539343734, 4.374664141464968, 2.938163982698783},
d = {0.007784695709041462, 0.3224671290700398, 2.445134137142996,
3.754408661907416, 1.}},
Which[0.02435 <= u <= 0.97575,
With[{v = u - 1/2},
v Fold[(#1 v^2 + #2) &, 0, a]/
Fold[(#1 v^2 + #2) &, 0, b]] // Evaluate,
u > 0.97575,
With[{q = Sqrt[-2 Log[1 - u]]},
-Fold[(#1 q + #2) &, 0, c]/
Fold[(#1 q + #2) &, 0, d]] // Evaluate,
True,
With[{q = Sqrt[-2 Log[u]]},
Fold[(#1 q + #2) &, 0, c]/
Fold[(#1 q + #2) &, 0, d]] // Evaluate]]]];


You can use CompiledFunctionToolsCompilePrint[] to check that the function was compiled properly.

Here is a plot of the absolute error over $$(0,1)$$:

nq[p_] = InverseCDF[NormalDistribution[], p];
Plot[AcklamQuantile[p] - nq[p], {p, 0, 1}, PlotRange -> All]


Of course, one can polish this further with a few steps of Newton-Raphson or Halley, if seen fit.

As it turns out, however, some spelunking shows that Mathematica does provide a compilable implementation of the normal distribution quantile. The undocumented function SpecialFunctionsProbit[] is the routine of interest, but there is a minor botch in its implementation that I will show how to fix.

Spelunking the code of SpecialFunctionsProbit[] shows that the function relies on a compiled function, SystemStatisticalFunctionsDumpCompiledProbit[] for numerical implementation. Using CompilePrint[] to inspect its code, however, one sees a number of unsightly MainEvaluate calls in the code. Thus, I offer a fix to this routine that yields a fully compiled function:

SpecialFunctionsProbit; (* force autoload *)
probit = With[{d = N[17/40],
cf1 = SystemStatisticalFunctionsDumpCompiledProbitCentralMinimax,
cf2 = SystemStatisticalFunctionsDumpCompiledProbitAsymptotic},
Compile[{{u, _Real}}, If[Abs[u - 0.5] <= d, cf1[u], cf2[u]],
CompilationOptions -> {"InlineCompiledFunctions" -> True,
"InlineExternalDefinitions" -> True},
RuntimeAttributes -> {Listable}]];


where the two compiled sub-functions SystemStatisticalFunctionsDumpCompiledProbitCentralMinimax and SystemStatisticalFunctionsDumpCompiledProbitAsymptotic implement different approximations; for hopefully obvious reasons, I will not be reproducing their code here.

(Thanks to Henrik Schumacher for suggesting a better reformulation.)

CompilePrint[probit] will show that this version has compiled properly. Here is a plot of the absolute error:

Plot[probit[p] - nq[p], {p, 0, 1}, PlotRange -> All]


2A Necromancer badge on your first day back seem appropriate. – Michael E2 – 2018-09-24T18:29:24.480

In terms of my ongoing “production” runs, should I just continue with my use of the Henrik Schumacher code or could either AcklamQuantile[p] or probit[p] accelerate computations considerably faster than the Schumacher code (which I found to be roughly ten times faster than the uncompiled InverseCDF[NormalDistribution[0, 1], T] command)? For probit[p] (which seems to be the more precise of the two), I take it that I would require the code for the two compiled sub-functions. I’m also applying the command in question to vectors T of either length 36 or 64. – Paul B. Slater – 2018-09-25T02:29:42.470

1"could either AcklamQuantile[p] or probit[p] accelerate computations considerably faster than the Schumacher code" - I don't have your code, so I wholly encourage you to do your own experiments. I gave the Acklam routine because some applications don't need a quantile accurate to machine precision, and that it can still be used to start up a Newton-Raphson iteration. In the case of probit, the two sub-functions are already built-in but hidden, so all you need to access them is to autoload the built-in probit function by evaluating SpecialFunctionsProbit;. – J. M.'s ennui – 2018-09-25T02:48:43.620

OK, I will certainly try comparing the performances of the three codes (AcklamQuantile[p], profit[p] and the Schumacher one given above), when I (shortly) have the opportunity. But since the Schumacher code is given above, can not any immediate assertions be advanced in terms of their relative performances? (I now see that the Schumacher code also uses the Compile command.) – Paul B. Slater – 2018-09-25T03:57:14.983

The function in Henrik's answer has the additional settings CompilationTarget -> "C", Parallelization -> True, so you will need to add these to either AcklamQuantile or probit for an apple-apple comparison. And, in my experience, only experimentation can distinguish the performance of different compiled functions (all other things being equal). – J. M.'s ennui – 2018-09-25T04:04:24.800

Well, without the additional settings CompilationTarget -> "C", Parallelization -> True, the Schumacher code seems slightly superior timewise (about 5%), in my program, to both the AcklamQuantile and probit codes. How precisely to incorporate the indicated additional settings--if that might conceivably lead to further speed-ups--into the AcklamQuantile and probit codes exceeds my expertise. So, at present, I don't see any reason to alter my "production codes"--employed in sec. XI and Figs. 5 and 6 of my new preprint https://arxiv.org/abs/1809.09040 .

– Paul B. Slater – 2018-09-25T17:33:55.917

@Paul, for AcklamQuantile, replace the part Compile[{{u, _Real}}, #, RuntimeAttributes -> {Listable}] with Compile[{{u, _Real}}, #, RuntimeAttributes -> {Listable}, CompilationTarget -> "C", Parallelization -> True]; for probit, insert , CompilationTarget -> "C", Parallelization -> True after the RuntimeAttributes -> {Listable} part. I didn't do it myself because I don't have a compiler that works with Mathematica. – J. M.'s ennui – 2018-09-25T17:38:54.067

The indicated insertion into the probit code now leads to a slight (4%) superiority to the Schumacher code, while the AcklamQuantile insertion gives apparently about 10% superiority. So, for the present moment at least, I don't think I'll alter my "production codes". – Paul B. Slater – 2018-09-25T18:02:06.813

Still playing around with the use of probit. However, now when I ," insert , CompilationTarget -> "C", Parallelization -> True after the RuntimeAttributes -> {Listable} part", as suggested, and I apply probit to a list , e. g. probit[{1/3, 1/4}], I get an error message ending in "calls ordinary code that can be evaluated on only one thread at a time". Without Parallelization -> True, no error message. – Paul B. Slater – 2018-09-26T17:36:00.480

@Paul, Huh, interesting. I don't have a multicore machine to confirm; maybe ask a new question on why the parallelization doesn't work? – J. M.'s ennui – 2018-09-26T17:38:48.550

1

@J.M. I ran both your probit and my cfinv implementations on my machine (see here). Your code turned out to be much faster on my machine, but Paul did not observe any major speed-up.

– Henrik Schumacher – 2018-09-26T18:53:26.023

@Henrik, yes, I'm following the other thread too; very bizarre. The Acklam code could be modified to add a Newton-Raphson polish (just like in your routine), but then I am no longer sure if that would incur a significant slowdown. – J. M.'s ennui – 2018-09-26T18:56:34.183

@J.M. Looking onto the pseudocode of AcklamQuantile`, I cannot find anything to speed it up further. And I guess a bit more than single precision that it provides should be sufficient for statistical applications. – Henrik Schumacher – 2018-09-26T19:01:35.747

@Henrik "a bit more than single precision that it provides" - yes, that is why I thought of posting it, in case the OP didn't need more than 8 digits of accuracy (since he mentions using this for Monte Carlo). – J. M.'s ennui – 2018-09-26T19:05:40.343