Fastest way to measure Hamming distance of integers


I am looking for a fast and robust way to calculate the Hamming distance of integers. The Hamming distance of two integers is the number of bit positions in which their binary representations differ; for example, 3 (binary 011) and 5 (binary 101) differ in two positions. I expect that clever methods can easily outpace HammingDistance, as it works on vectors instead of integers, and on arbitrary vectors, not just binary ones.

My naive bitwise method is faster than HammingDistance, but I'm pretty sure it can be optimized further. While compilation would help, compiled code won't work on big integers (roughly $\ge 10^{19}$, i.e. beyond machine size). Nevertheless, I am interested in compiled solutions!

max = 10^10;
n = Length@IntegerDigits[max, 2];
data = RandomInteger[{0, max}, {100000, 2}];
m1 = Map[HammingDistance[IntegerDigits[First@#, 2, n], 
      IntegerDigits[Last@#, 2, n]] &, data]; // AbsoluteTiming
m2 = Map[Total@IntegerDigits[BitXor @@ #, 2] &, data]; // AbsoluteTiming
m1 === m2
{0.967202, Null}   
{0.624001, Null}
True

It would be nice to work entirely on the binary representations. I thought that using DigitCount on BitXor would help, but it gave a roughly 3× slowdown compared to the HammingDistance version.
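The DigitCount variant I mean is along these lines (a minimal sketch; m4 is just a label):

m4 = Map[DigitCount[BitXor @@ #, 2, 1] &, data]; // AbsoluteTiming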

Edit

In answer to Kirma's comment: I have to calculate the pairwise distance matrix for a set of integers (Szabolcs's post Fastest way to calculate matrix of pairwise distances is highly relevant), in the simplest and most didactic form:

Outer[hamming[#1, #2] &, Range[2^20], Range[2^20]]

Now in this case my main problem is of course memory not speed, but it would be nice to see solutions that scale well with this problem. I understand that it is another question, but I want to encourage everyone to post their solutions even if they require vectors or matrices of integers as input.
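For illustration only, here is a minimal sketch of how memory could be bounded by generating one row of the matrix at a time, reusing the naive bitwise distance from above (row is just a name used here):

hamming = Total@IntegerDigits[BitXor[#1, #2], 2] &;
row[i_, ints_] := hamming[i, #] & /@ ints; (* O(n) memory per row instead of O(n^2) *)
row[5, Range[16]]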

István Zachar


If you are still interested in this, please see this question and my answer. I realised that you can deal with large integers simply by using IntegerDigits to partition them into machine-size chunks.

– Oleksandr R. – 2015-07-04T00:45:20.740

No idea why, but replacing Total with Tr in your definition of m2 gives you a ~20% improvement in speed (on my machine). – gpap – 2013-04-16T19:09:45.600

@gpap Tr is often a bit faster than Total on packed arrays, at least in version 7. I'm assuming that's the case here. – Mr.Wizard – 2013-04-16T19:49:50.680

If you can easily transpose or concatenate your data to consist of few really big integers (even hundreds of millions of binary digits) instead of many smaller ones, both Tr@IntegerDigits is faster and DigitCount is entirely feasible. Speedup may be even 400-fold. – kirma – 2013-04-16T21:51:53.893
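(A minimal sketch of kirma's point, with sizes made up for illustration: on a single pair of huge integers, DigitCount is fast:

a = RandomInteger[2^10^7]; b = RandomInteger[2^10^7];
DigitCount[BitXor[a, b], 2, 1] // AbsoluteTiming

)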

@kirma Please consider posting your approach in light of my edit. – István Zachar – 2013-04-17T09:07:50.183

@IstvánZachar Sadly enough, I don't believe this is generalizable to your situation. – kirma – 2013-04-17T11:37:10.440

Answers


Here is another compiled implementation:

hammingDistanceCompiled = Compile[{{nums, _Integer, 1}},
   Block[{x = BitXor[nums[[1]], nums[[2]]], n = 0},
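    (* Wegner's trick: BitAnd[x, x - 1] clears the lowest set bit,
       so the While loop below runs once per set bit of the XOR *)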
    While[x > 0, x = BitAnd[x, x - 1]; n++]; n
   ],
   RuntimeAttributes -> Listable, Parallelization -> True,
   CompilationTarget -> "C", RuntimeOptions -> "Speed"
  ];

This appears to outperform the naive approach (Total@IntegerDigits[BitXor @@ nums, 2], as presented in Leonid's answer) by about 2.5 times. If we are serious about compiled approaches, though, we can surely do much better by taking advantage of the SSE4.2 POPCNT instruction.
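For example (a quick sketch; timings will vary by machine):

data = RandomInteger[{0, 2^63 - 1}, {100000, 2}];
hammingDistanceCompiled[data]; // AbsoluteTiming
Map[Total@IntegerDigits[BitXor @@ #, 2] &, data]; // AbsoluteTiming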


Edit: thanks to halirutan, who told me that the pointers returned by the LibraryLink functions are safe to use directly, this updated version is nearly twice as fast (on my computer) as the original attempt due to the removal of unnecessary function calls from the inner loop.

Since nobody else apparently wanted to write an answer using that suggestion, I decided to give it a try myself:

#include "WolframLibrary.h"

DLLEXPORT
mint WolframLibrary_getVersion() {
    return WolframLibraryVersion;
}

DLLEXPORT
int WolframLibrary_initialize(WolframLibraryData libData) {
    return 0;
}

DLLEXPORT
void WolframLibrary_uninitialize() {
    return;
}

inline
mint hammingDistance(mint a, mint b) {
    return (mint)__builtin_popcountll((unsigned long long)a ^ (unsigned long long)b);
}

/* To load:
LibraryFunctionLoad["hammingDistance",
 "hammingDistance_I_I", {Integer, Integer}, Integer
] */

DLLEXPORT
int hammingDistance_I_I(WolframLibraryData libData,
                       mint argc, MArgument *args,
                       MArgument res) {
    mint a, b;

    if (argc != 2) return LIBRARY_DIMENSION_ERROR;

    a = MArgument_getInteger(args[0]);
    b = MArgument_getInteger(args[1]);

    MArgument_setInteger(res, hammingDistance(a, b));
    return LIBRARY_NO_ERROR;
}

/* To load:
LibraryFunctionLoad["hammingDistance",
 "hammingDistance_T_T", {{Integer, 2, "Constant"}}, {Integer, 1, Automatic}
] */

DLLEXPORT
int hammingDistance_T_T(WolframLibraryData libData,
                        mint argc, MArgument *args,
                        MArgument res) {
    MTensor in, out;
    const mint *dims;
    mint i, *indata, *outdata;
    int err = LIBRARY_NO_ERROR;

    in = MArgument_getMTensor(args[0]);
    if (libData->MTensor_getRank(in) != 2) return LIBRARY_DIMENSION_ERROR;
    if (libData->MTensor_getType(in) != MType_Integer) return LIBRARY_TYPE_ERROR;
    dims = libData->MTensor_getDimensions(in);
    if (dims[1] != 2) return LIBRARY_DIMENSION_ERROR;
    indata = libData->MTensor_getIntegerData(in);

    err = libData->MTensor_new(MType_Integer, 1, dims, &out);
    if (err != LIBRARY_NO_ERROR) return err;
    outdata = libData->MTensor_getIntegerData(out);

    #pragma omp parallel for schedule(static)
    for (i = 0; i < dims[0]; i++) {
        outdata[i] = hammingDistance(indata[2*i], indata[2*i + 1]);
    }

    MArgument_setMTensor(res, out);
    return LIBRARY_NO_ERROR;
}

We compile it using gcc (N.B. __builtin_popcountll is a GCC extension):

gcc -Wall -fopenmp -O3 -march=native -shared -o hammingDistance.dll hammingDistance.c
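With the Microsoft compiler (cf. s0rce's comment below), a hypothetical equivalent would be to replace __builtin_popcountll with the _mm_popcnt_u64 intrinsic from <nmmintrin.h> and build with something like:

cl /LD /O2 /openmp hammingDistance.c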

Load it into Mathematica:

hammingDistance = LibraryFunctionLoad[
  "hammingDistance.dll",
  "hammingDistance_I_I", {Integer, Integer}, Integer
 ];
hammingDistanceListable = LibraryFunctionLoad[
  "hammingDistance.dll",
  "hammingDistance_T_T", {{Integer, 2, "Constant"}}, {Integer, 1, Automatic}
 ];

Make sure everything is working:

data = RandomInteger[{0, 2^63 - 1}, {10000, 2}];
hammingDistance @@@ data === 
 hammingDistanceListable[data] ===
  hammingDistanceCompiled[data] ===
   Tr /@ IntegerDigits[BitXor @@@ data, 2]
(* -> True *)

Now for a performance comparison:

dataLarge = RandomInteger[{0, 2^63 - 1}, {10000000, 2}];
hammingDistanceCompiled[dataLarge]; // AbsoluteTiming (* 1.203125 seconds *)
hammingDistanceListable[dataLarge]; // AbsoluteTiming (* 0.063594 seconds *)

That's about 1000 times faster than the code given in the question, so not bad. I'm using an Intel Core 2 CPU, which doesn't actually support the POPCNT instruction, and has only four cores. On more recent CPUs, it will surely be faster still.

Oleksandr R.


@s0rce I was just thinking, maybe since you force the compiler to produce a literal SSE4.2 POPCNT rather than using a more generic intrinsic as I do, you are more sensitive to alignment (array should begin on a 16 byte boundary), or you pay the penalty for mixing SSE and AVX instructions in the same code? I can also imagine that GCC may fix these issues silently whereas MS C++ may not. – Oleksandr R. – 2016-06-02T13:00:22.230

Nice, Mathematica-version of the Wegner implementation. – István Zachar – 2013-04-16T20:01:06.110

What about using k = 0; FixedPoint[(k++; BitAnd[#, # - 1]) &, BitXor @@ nums, SameTest -> (! Positive[#2] &)]; k within the compiled function? – J. M.'s ennui – 2013-04-16T23:36:50.260

@J.M. do you find that this gives an improvement? I admit haven't tried it but I'd be really surprised if the bytecode was any better than that produced for a While loop. – Oleksandr R. – 2013-04-17T02:06:49.767

Limited tests seem to show that they're about the same, so I suppose your version is preferred for being a bit more straightforward. BTW, any reason why you're taking a list of two integers instead of having it be a function with two arguments? – J. M.'s ennui – 2013-04-17T02:14:11.703

@J.M. this approach allows the function to be Listable, which is a prerequisite for auto-parallelization. – Oleksandr R. – 2013-04-17T02:17:28.003

I'm on Windows and I switched to _mm_popcnt_u64 using the MS compiler. I had to remove the inline function definition (it gave errors), but the result is still 50% slower (0.15 sec) than yours on a faster processor, an Intel Core i5-3570K... :( – s0rce – 2013-04-17T03:29:31.660

@s0rce did you remember to compile (and link!) with OpenMP enabled? Otherwise, you won't get any parallelization. Actually, I'm on Windows as well; I just prefer gcc. – Oleksandr R. – 2013-04-17T04:22:16.440

Thanks for the detailed instructions! However, after successfully compiling and linking, and including the dll directory in $LibraryPath, I got an error when loading the library: LibraryFunction::libload: The function hammingDistance_I_I was not loaded from the file [...]\hammingDistance.dll. – István Zachar – 2013-04-17T09:41:35.873

@IstvánZachar this message implies that the .dll was not successfully produced, is inaccessible or not found, is in the wrong format, or was built for the wrong architecture. Did you make sure to produce a 64-bit library and verify that Mathematica is not seeing e.g. an "access denied" error? – Oleksandr R. – 2013-04-17T10:32:00.217

It seems that the dll is fine (compiled with 64bit MinGW), file (under Cygwin) reports hammingDistance.dll: PE32+ executable (DLL) (console) x86-64, for MS Windows, though Mathematica through $LibraryError returns Library load error 193: %1 is not a valid Win32 application. – István Zachar – 2013-04-17T12:07:07.687

@IstvánZachar I've tried on two different computers and can't reproduce that error. Are you able to compile and load other LibraryLink libraries correctly? – Oleksandr R. – 2013-04-17T14:53:36.537


On my machine this is twice as fast as your m2 code:

m3 = Tr /@ IntegerDigits[BitXor @@@ data, 2];

m2 === m3

True

Mr.Wizard



For completeness, here is a way to extend the compiled or LibraryLink approaches to arbitrarily large integers. Since it comes so long after the original answer, I post it separately.

As explained in this answer, we can bridge the gap between arbitrary and machine precision at least somewhat efficiently by using IntegerDigits to express a large integer as a list of base-$2^{62}$ digits, which we are then free to process using efficient compiled code. The only additional trick needed here is to ensure that both integers are represented by the same number of such digits, so that the digits correspond 1:1 and compare correctly.
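For example, a 130-bit integer splits into three base-$2^{62}$ digits, each of which fits comfortably in a machine integer:

IntegerDigits[2^129 + 5, 2^62]
(* -> {32, 0, 5} *)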

Let's take as our list of big integer pairs:

data = {{2^63, 2^24, 2^56 + 1, 2^84}, {2^64, 2^92, 2^33 - 1, 2^73 - 1}} // Transpose;

And this is what we want to get from it:

Tr /@ IntegerDigits[BitXor @@@ data, 2]
(* -> {2, 2, 33, 74} *)

It's difficult to know a priori how many digits the largest integer will have, and without that knowledge the digit lists produced by IntegerDigits will have inconsistent lengths between inputs. We have two options for producing a full-rank array:

full = IntegerDigits[data, 2^62, Ceiling@Log[2^62, Max @@ data]]

or

full = PadLeft@IntegerDigits[data, 2^62]

My testing shows that the second formulation performs better in terms of CPU time, with both being similar enough in memory consumption that it probably makes no difference. Why this should be is not clear, since in the first case IntegerDigits has the opportunity to produce a packed array, which however it fails to do. In both cases we clearly waste some memory on the padding digits, and if the integers differ drastically in size, that padding could become the primary limitation on performance. If so, the first form saves about 20% on memory consumption at the cost of about 20% more CPU time. The choice is yours.
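One can check the packing directly; per the observation above, the first formulation presumably yields an unpacked array:

Developer`PackedArrayQ[full]
(* -> False *)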

Finally, we need to take advantage of the listability and efficient (OpenMP) parallelization of the LibraryLink function over the longest axis of the digit array, and Map over the other, shorter axis. Which is the longest axis depends on whether there are many moderately sized integers, or a few very large ones. Then, we add up the results along the appropriate axis using Total.

For number of integers greater than number of digits per integer:

Total[hammingDistanceListable /@ Transpose[full, {2, 3, 1}], {1}]

Or for number of digits greater than number of integers:

Total[hammingDistanceListable /@ Transpose[full, {1, 3, 2}], {2}]

Obviously, the appropriate transpose can be chosen at run time. Both give the right answer, of course:

(* -> {2, 2, 33, 74} *)
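To make the axes concrete, here are the dimensions for the small full array from above (two base-$2^{62}$ digits suffice for these values):

Dimensions[full]                       (* {4, 2, 2}: pairs x integers x digits *)
Dimensions@Transpose[full, {2, 3, 1}]  (* {2, 4, 2}: one rank-2 slice per digit *)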

Putting it together:

hammingDistanceListableBigIntegers[nums : {{_Integer, _Integer} ..}] :=
  With[{digits = PadLeft@IntegerDigits[nums, 2^62]},
   With[{dims = Dimensions[digits]},
    If[First[dims] > Last[dims],
     Total[hammingDistanceListable /@ Transpose[digits, {2, 3, 1}], {1}],
     Total[hammingDistanceListable /@ Transpose[digits, {1, 3, 2}], {2}]
    ]
   ]
  ];
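A quick check against the small data set from the start of this answer:

hammingDistanceListableBigIntegers[data]
(* -> {2, 2, 33, 74} *)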

Now let's compare it with Tr /@ IntegerDigits[BitXor @@@ data, 2]:

dataLarge = RandomInteger[2^1024, {100000, 2}];

hammingDistanceListableBigIntegers[dataLarge]; // AbsoluteTiming (* -> 0.390625 seconds *)
MaxMemoryUsed[hammingDistanceListableBigIntegers[dataLarge];] (* -> 79.3467 MB *)

Tr /@ IntegerDigits[BitXor @@@ dataLarge, 2]; // AbsoluteTiming (* -> 1.015625 seconds *)
MaxMemoryUsed[Tr /@ IntegerDigits[BitXor @@@ dataLarge, 2];] (* -> 862.148 MB *)

So, in this particular case it is about three times faster and, remarkably, about ten times more memory-efficient than the alternative. Remarkable, because in order to accomplish this we had to produce a rather large, unpacked array containing a lot of padding.

Unfortunately, it is not as efficient in every case. Surprisingly, it actually takes three times longer if the integers are only 512 bits in length rather than 1024! As a result, it will be beaten by the top-level code. Thus far, I haven't been able to fully understand the curious performance characteristics of this approach.

Oleksandr R.


@Mr.Wizard are you still in need of a compiler, or did you install MSVC++? – Oleksandr R. – 2015-11-04T14:40:54.107

I never did get a compiler installed but that doesn't mean I'm less appreciative of your work in this area. – Mr.Wizard – 2015-11-06T02:10:46.690