Compile command does not succeed with list input with Parallelization->True


J. M. suggested that I pose the following as a question, a matter arising in relation to the function probit:

SpecialFunctions`Probit; (* force autoload *)
probit = Compile[{{u, _Real}},
                 With[{d = 17/40},
                      If[Abs[u - 1/2] <= d,
                 CompilationOptions -> {"InlineCompiledFunctions" -> True,
                                        "InlineExternalDefinitions" -> True}, 
                 RuntimeAttributes -> {Listable}];

that he gave in his answer to my question Can I use Compile to speed up InverseCDF? . In a comment, he suggested that in order to compare performance times with the code of Henrik Schumacher (also given in an earlier answer of Schumacher to the same question), one should "insert CompilationTarget -> "C", Parallelization -> True after the RuntimeAttributes -> {Listable} part" into the probit code, that is

probit = Compile[{{u, _Real}},
                 With[{d = 17/40},
                      If[Abs[u - 1/2] <= d,
                 CompilationOptions -> {"InlineCompiledFunctions" -> True,
                                        "InlineExternalDefinitions" -> True}, 
                 RuntimeAttributes -> {Listable}, CompilationTarget -> "C",
                 Parallelization -> True];

But when I use a list (as I need to)--say {1/3,1/4}--as the input to the so-augmented probit code I get the error message

CompiledFunction::pext: Instruction 12 in
calls ordinary code that can be evaluated on only one thread at a time.

Without Parallelization -> True, I don't get the error message, but the program then runs more slowly than I would hope. (I'm rather befuddled, since this specific problem did not seem to arise in some related analyses of mine yesterday.)

Paul B. Slater

Posted 2018-09-26T18:18:46.987

Reputation: 2 043



Good ol' inlining probem. Enforcing correct inlining with With should work. The problem was that the uninlined CompiledProbitCentralMinimax and CompiledProbitAsymptotic enforced calls to the infamous MainEvaluate.

probit = With[{
    d = N[17/40],
    cf1 = System`StatisticalFunctionsDump`CompiledProbitCentralMinimax,
    cf2 = System`StatisticalFunctionsDump`CompiledProbitAsymptotic
   Compile[{{u, _Real}},
    If[Abs[u - 0.5] <= d,
    RuntimeAttributes -> {Listable},
    CompilationTarget -> "C",
    Parallelization -> True,
    CompilationOptions -> {
      "InlineCompiledFunctions" -> True,
      "InlineExternalDefinitions" -> True
    RuntimeOptions -> "Speed"

Here is also J.M.s function AcklamQuantile:

AcklamQuantile = Block[{u}, Compile[{{u, _Real}}, #,
      CompilationTarget -> "C",
      RuntimeAttributes -> {Listable},
      Parallelization -> True,
      RuntimeOptions -> "Speed"
      ] & @@ 
    Hold[With[{a = {-39.69683028665376, 
         220.9460984245205, -275.9285104469687, 
         138.3577518672690, -30.66479806614716, 2.506628277459239}, 
       b = {-54.47609879822406, 161.5858368580409, -155.6989798598866,
          66.80131188771972, -13.28068155288572, 1}, 
       c = {-0.007784894002430293, -0.3223964580411365, \
-2.400758277161838, -2.549732539343734, 4.374664141464968, 
       d = {0.007784695709041462, 0.3224671290700398, 
         2.445134137142996, 3.754408661907416, 1.}}, 
      Which[0.02435 <= u <= 0.97575, 
       With[{v = u - 1/2}, 
         v Fold[(#1 v^2 + #2) &, 0, a]/Fold[(#1 v^2 + #2) &, 0, b]] //
         Evaluate, u > 0.97575, 
       With[{q = Sqrt[-2 Log[1 - u]]}, -Fold[(#1 q + #2) &, 0, c]/
          Fold[(#1 q + #2) &, 0, d]] // Evaluate, True, 
       With[{q = Sqrt[-2 Log[u]]}, 
         Fold[(#1 q + #2) &, 0, c]/Fold[(#1 q + #2) &, 0, d]] // 

Speed test against my function cfinv from the linked post (running on a Haswell Quad Core):

T = RandomReal[{0., 1.}, {1000000}];
a = cfinv[T]; // RepeatedTiming // First
b = probit[T]; // RepeatedTiming // First
c = AcklamQuantile[T]; // RepeatedTiming // First
Max[Abs[a - b]]
Max[Abs[a - c]]






Well, J.M.'s code is indeed quite a lot faster, in particular at the tails of the distribution (where Newton's method converges slowly).

Henrik Schumacher

Posted 2018-09-26T18:18:46.987

Reputation: 85 430

Interesting, that needs to be injected too... I guess I should edit my answer in the other question. – J. M.'s ennui – 2018-09-26T18:31:38.333

Misunderstanding--I (stupidly) thought the comment was of Henrik Schumacher. – Paul B. Slater – 2018-09-26T18:35:24.487

Well, the "correct inlining" code above does now work on lists in my context--but it only seems to give a rather minor (1-2%) improvement over the original cfinv function of Schumacher I've been using. The AcklamQuantile function, I believe (I have to check further) gives a more substantial improvement, but its precision is lower, so maybe I won't convert to that option either. – Paul B. Slater – 2018-09-26T18:47:58.450

The AcklamQuantile function (with the CompilationTarget -> "C", Parallelization -> True insertion [no "inlining" apparently needed]) gives about a 5% improvement in run time. (The improvement times I give in this comment and the previous one apply to my overall program--and not specifically to the InverseCDF-related commands. So, their improvement times for these commands themselves should be somewhat superior.) – Paul B. Slater – 2018-09-26T19:04:38.750

Ah, I see. Makes sense: Now other parts of your code are becoming bottlenecks rather than the inverse CDF... – Henrik Schumacher – 2018-09-26T19:06:44.900

An opportunity (restart) came up, for me to continue afresh my production runs. I considered using either of J. M.'s codes instead of cfinv, but I was a little concerned by your comment in the companion posting that "The compiled method does never return Infinity or -Infinity but rather the CDF values of MachineEpsilon and 1-MachineEpsilon". Prior to using cfinv I was getting infrequent errors trying to orthogonalize matrices. I didn't know if the J. M. programs were similarly protected against such errors, since I have not been encountering them with cfinv. – Paul B. Slater – 2018-09-26T23:06:09.890

@Paul, in general, compiled routines are not able to return Infinity results. So, if there is a possibility of hitting either $0$ or $1$ in your runs, some additional code might be necessary (e.g. returning $MaxMachineNumber instead). – J. M.'s ennui – 2018-09-27T01:17:31.247

Well, I replaced Henrik's cfinv command--that I'd been using in my production runs--with his version of the AcklamQuantile code, and lo and behold, I once again got the "CompiledFunction::cfne: Numerical error encountered; proceeding with uncompiled evaluation" and "Orthogonalize::mindet: Input matrix contains an indeterminate entry" error messages that cfinv (with its MachineEpsilon commands) had "cured". I'll try the same exercise with Henrik's probit code. So, can I insert $MaxMachineNumber or MachineEpsilon into the AcklamQuantile code to avoid this--or maybe its precision is too lo? – Paul B. Slater – 2018-09-28T16:47:21.523

I've now found that Henrik's probit code, as presently written, also gives me the same error messages (which to reiterate, his cfinv code does not). – Paul B. Slater – 2018-09-28T17:19:53.140

Well, @Paul, then it might be better to stick with cinv... – Henrik Schumacher – 2018-09-28T17:26:57.000

That's what I'm doing at the moment (that is, sticking with cfinv)--while I try to further tweak my code in attempted response to the penultimate comment(s) of Martin Roberts in . That is, I'm trying to see how best to implement his quasirandom sequences (to avoid possible correlations), given that my algorithm requires generating two matrices.

– Paul B. Slater – 2018-09-28T17:55:09.530