How to define a distance measure between two IP addresses?



I have IP addresses as a feature and I would like to know how similar two IP addresses are to each other, so that I can use the difference in a Euclidean distance measure (in order to quantify the similarity of my data points). What tactic can I use for this?

Marc Lamberti

Posted 2015-11-09T09:40:19.857

Reputation: 297

To you, what would make two IP addresses similar? – paparazzo – 2015-11-09T13:04:49.203

You could use a GeoIP lookup and literally compare the distance on the earth between them.... – Spacedman – 2015-11-09T21:25:54.063

@Spacedman That only works for WWW public IP. OP has not even provided that information. – paparazzo – 2015-11-09T21:43:12.100

Yes, I was hoping some clarification would be forthcoming, instead of everyone piling in with "answers"... @marcL? – Spacedman – 2015-11-09T21:53:55.437

I mean how similar they are in terms of the network, not in terms of distance on earth. Sorry for the delay. – Marc Lamberti – 2015-11-10T14:05:47.140

What to you makes an IP similar in terms of a network? Is this a public (WWW) network? – paparazzo – 2015-11-10T19:27:23.860

You can't tell the network similarity without the netmask value.... Again, people are piling in with answers when the question is unclear. – Spacedman – 2015-11-11T09:57:16.387



If I understood them correctly, both Jeremy's and Edmund's (first) solutions are the same, namely, plain Euclidean distance in a 4-dimensional space of IP addresses. BTW, I think a very fast alternative to Euclidean distance would be to calculate a bit-wise Hamming distance.
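A minimal sketch of the bit-wise Hamming distance mentioned above, assuming IPv4 addresses given as dotted strings. The function name `hamming_distance` and the example addresses are mine, for illustration only.

```python
import ipaddress


def hamming_distance(ip_a, ip_b):
    """Number of bit positions in which two IPv4 addresses differ."""
    a = int(ipaddress.IPv4Address(ip_a))
    b = int(ipaddress.IPv4Address(ip_b))
    # XOR leaves a 1 exactly where the two addresses disagree.
    return bin(a ^ b).count("1")


# Illustrative addresses: the last octets 1 (01) and 3 (11) differ in one bit.
print(hamming_distance("192.168.0.1", "192.168.0.3"))  # 1
```

Note that this treats all 32 bits alike, so it does not by itself capture any hierarchy between the octets.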

Edmund's first update would be better than his second. The reason is simple to state: his 2nd update tries to define a distance measure by considering a non-linear function of the coordinates of a 4D vector. That however will most likely destroy the key properties that it needs to satisfy in order to be a metric, namely

  1. Identity of indiscernibles: $d(IP_1,IP_2)=0 \iff IP_1=IP_2$,
  2. Symmetry: $d(IP_1,IP_2)=d(IP_2,IP_1)$, and
  3. Triangle inequality: $d(IP_1,IP_2)\leq d(IP_1,IP_3)+d(IP_3,IP_2)\,\forall IP_3$.

The latter is key for later interpreting small distances as close points in IP space. One would need a distance function that is linear in the coordinates. However, simple Euclidean distance is not enough, as you saw.

Physics (well, differential geometry actually) could lead to a nice solution to this problem: define a metric tensor $g$. In plain English, give a weight to each pair of coordinates, take the difference for each pair, square it and multiply it by its weight, and then add those products. Take the square root of that sum and define it as your distance.

For the sake of simplicity, one could start trying with a diagonal metric tensor.

Example: Say you take $$g=\begin{pmatrix}1000 &0 &0 &0 \\0 &100&0&0\\0&0&10&0\\0&0&0&1\end{pmatrix},$$ $IP_1=(x_1,x_2,x_3,x_4)$ and $IP_2=(y_1,y_2,y_3,y_4)$. Then the square of the distance is given by $$d(IP_1,IP_2)^2=1000\,(x_1-y_1)^2+100\,(x_2-y_2)^2+10\,(x_3-y_3)^2+1\,(x_4-y_4)^2.$$ For two addresses that differ only by 1 in the last number, the distance is clearly 1. However, for two addresses that differ only by 1 in the first number, the distance is $\sqrt{1000}\approx 32$.

Eventually you could play around with different weights and fix a normalization by setting the value of the maximal distance, i.e. the distance between the all-zeros address and FF.FF.FF.FF.

Furthermore, this set up allows for more complex descriptions of your data where the relevant distance would contain "cross-products" of coordinates like say $g_{13}*(x_1-y_1)*(x_3-y_3)$.
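As a rough sketch of the weighted distance described above, the snippet below evaluates $\sqrt{\Delta^T g\,\Delta}$ for the difference vector $\Delta$ of the four numbers. The names `G_DIAGONAL`, `octets` and `metric_distance` are hypothetical, and the addresses are illustrative. I am assuming $g$ is symmetric and positive definite; otherwise the result need not be a metric.

```python
import math

# Diagonal weights from the example above; all cross terms are zero.
G_DIAGONAL = [
    [1000, 0, 0, 0],
    [0, 100, 0, 0],
    [0, 0, 10, 0],
    [0, 0, 0, 1],
]


def octets(ip):
    """Split a dotted IPv4 string into its four integer components."""
    return [int(part) for part in ip.split(".")]


def metric_distance(ip_a, ip_b, g=G_DIAGONAL):
    """sqrt(delta^T g delta), where delta is the per-octet difference."""
    delta = [x - y for x, y in zip(octets(ip_a), octets(ip_b))]
    squared = sum(
        g[i][j] * delta[i] * delta[j] for i in range(4) for j in range(4)
    )
    return math.sqrt(squared)


# Illustrative addresses: a change in the last octet vs. in the first octet.
print(metric_distance("10.0.0.1", "10.0.0.2"))  # 1.0
print(metric_distance("10.0.0.1", "11.0.0.1"))  # sqrt(1000), about 31.6
```

Passing a matrix with non-zero off-diagonal entries gives the "cross-product" terms such as $g_{13}(x_1-y_1)(x_3-y_3)$.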

EDIT: While this would be a better "weighting" method than the others I addressed, I realize now it is actually meaningless: as Anony-Mousse and Philip mention, IPs are indeed 32-dimensional. This means in particular that giving the same weight to all bits in, say, the 2nd group is in general not sound: one bit could be part of the netmask while another is not. See Anony-Mousse's answer for additional objections.


Posted 2015-11-09T09:40:19.857

Reputation: 451

Very nice. Thanks for critique. I obviously need to brush up on my Linear Algebra. ;-) – Edmund – 2015-11-09T21:12:48.953

As I mentioned, eventually one could introduce a more complex "IP space-time" by considering also non-vanishing off-diagonal elements in $g$. This I guess would depend on the problem he wants to solve, i.e., on the data. However, my first guess is that it easily is overkill. At least one should first try with a diagonal form as in the example above. In such case, there wouldn't be any need to dust-off your LA books, ;-) just following the example, which is, btw, like a problem of vectors in 2 dimensions. – MASL – 2015-11-09T21:24:54.820

It's not a four-dimensional space. It's a 32-dimensional space. The dotted representation is just easier to remember. – Has QUIT--Anony-Mousse – 2015-11-09T23:36:09.960

@Anony-Mousse You are right, there is this other way of looking at it, namely bit-wise. What I propose is independent of how one describes the "IP space", i.e., whether 4- or 32-dimensional. That said, without other info on the OP's problem, it seems reasonable to distinguish the usual 4 numbers of an IP and say that and differ more than and – MASL – 2015-11-10T00:33:06.663


That's a very interesting question. Similarity here should be computed component-wise, but the thing is that, from a "business logic" perspective, the similarity of the last number doesn't matter if the other 3 numbers are not the same. Keeping that in mind, I would probably do something like the following (there is probably a more elegant way of doing it, and I don't have much time to think about it, so forgive me if it doesn't answer your question, and for the poor formatting).

Assuming IPv4 of the form aaa.bbb.ccc.ddd, I would do something like:

If aaa_1 == aaa_2:
    If bbb_1 == bbb_2:
        If ccc_1 == ccc_2:
            If ddd_1 == ddd_2:
                Dist = 1;
            Else:
                Dist = (3 + distance(ddd_1,ddd_2))/4;
            End if;
        Else:
            Dist = (2 + distance(ccc_1,ccc_2))/4;
        End if;
    Else:
        Dist = (1 + distance(bbb_1,bbb_2))/4;
    End if;
Else:
    Dist = distance(aaa_1,aaa_2);
End if;
Return 1/Dist;
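One possible reading of the idea, as a sketch only. This is not a transcription of the pseudocode above. I am assuming a similarity in [0, 1] where 1 means identical, that every leading group that matches adds 1/4, and that only the first differing group contributes a partial score. The name `hierarchical_similarity` and the per-group closeness `1 - |x - y| / 255` are my own choices.

```python
def hierarchical_similarity(ip_a, ip_b):
    """Similarity in [0, 1]; matching leading octets dominate later ones."""
    a = [int(part) for part in ip_a.split(".")]
    b = [int(part) for part in ip_b.split(".")]
    for matched, (x, y) in enumerate(zip(a, b)):
        if x != y:
            # Assumed per-octet closeness, strictly below 1 when x != y.
            closeness = 1 - abs(x - y) / 255
            # Octets after the first mismatch are deliberately ignored.
            return (matched + closeness) / 4
    return 1.0


# Illustrative addresses only.
print(hierarchical_similarity("10.0.0.1", "10.0.0.1"))  # 1.0
print(hierarchical_similarity("10.0.0.1", "10.0.0.2") >
      hierarchical_similarity("10.0.0.1", "10.0.1.1"))  # True
```

With this choice, two addresses that share more leading groups always score higher than two that share fewer, whatever the remaining groups are.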

Jérémie Clos

Posted 2015-11-09T09:40:19.857

Reputation: 330

So, if both addresses are the same your code assigns a distance of 1, not zero. Is that a typo, or does this 1 have a special meaning? (You could make it return a zero at that point so as to avoid a NaN at the return at the bottom.) – MASL – 2015-11-09T22:48:45.580

An IP is one 32-bit number. – Has QUIT--Anony-Mousse – 2015-11-09T23:50:10.277

@MASL This is a similarity metric; I should have made that clear. 1 means complete similarity here. – Jérémie Clos – 2015-11-10T01:16:05.667

Also I am probably abusing the word "metric" here, forgive me for that. – Jérémie Clos – 2015-11-10T01:22:33.773


IP (v4) addresses are a 32-bit integer, which trivially gives you a metric. However, it may not be a particularly useful metric - and are almost certainly significantly more different than and
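To make the "trivial" metric concrete, here is a small sketch: convert each address to its 32-bit integer and take the absolute difference. The name `integer_distance` and the example addresses are mine, for illustration.

```python
import ipaddress


def integer_distance(ip_a, ip_b):
    """Absolute difference of the two addresses read as 32-bit integers."""
    a = int(ipaddress.IPv4Address(ip_a))
    b = int(ipaddress.IPv4Address(ip_b))
    return abs(a - b)


# Two addresses in different /24 blocks can be numerically adjacent ...
print(integer_distance("10.0.0.255", "10.0.1.0"))  # 1
# ... while two addresses in the same /24 block can be far apart.
print(integer_distance("10.0.0.0", "10.0.0.255"))  # 255
```

The two printed values illustrate why this metric may not be very useful: numeric closeness says little about whether two addresses sit in the same block.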

Philip Kendall

Posted 2015-11-09T09:40:19.857

Reputation: 149

Exactly, so do you have any idea? Maybe one of the answers above? – Marc Lamberti – 2015-11-09T12:22:35.927

What is a good metric depends on the problem you're trying to solve. You could for instance use one of the geoIP services to get a geolocation for every IP address - but we can't tell you whether that's good for you because we don't know what your problem is. – Philip Kendall – 2015-11-09T12:36:52.777


20 years ago, I would have suggested to use the length of the shared prefix as similarity measure.

So you take two IPs in their 32-bit representation: not the "pretty printed" x.y.z.w form, but the real "int" representation your network stack uses. Then XOR them, count the leading zeros, and you get

distance = 32 - leadingZeros(ip1 XOR ip2)
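A short sketch of this formula, assuming dotted IPv4 strings as input. For a value that fits in 32 bits, `32 - leadingZeros(x)` equals the position of the highest set bit, which Python exposes as `int.bit_length()`. The name `prefix_distance` and the example addresses are illustrative.

```python
import ipaddress


def prefix_distance(ip_a, ip_b):
    """32 minus the length of the shared bit prefix of two IPv4 addresses."""
    a = int(ipaddress.IPv4Address(ip_a))
    b = int(ipaddress.IPv4Address(ip_b))
    # bit_length() of the XOR is 32 - leadingZeros(a XOR b); it is 0 for a == b.
    return (a ^ b).bit_length()


# These two addresses share their first 23 bits, so the distance is 32 - 23.
print(prefix_distance("192.168.0.0", "192.168.1.0"))  # 9
```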

However, we exhausted the IPv4 address space long ago. Over the last 10 years, the few remaining netblocks have been distributed more or less "randomly" (at least from a similarity perspective). IP ranges have been relocated, and so on.

A long time ago, people would have told you routing happens on trees, based on prefixes. So if you wanted to reach an IP it would go to then then But that was just the theory. If you manually configured your router, that is what you would do: 10.2 is the second building, 10.2.3 is the third-floor router. That is history. Within networks, IPs are assigned by DHCP, often first-come-first-served. Within an intranet, you mostly have switches, not routers. And on the global level, BGP is responsible for taking care of the big mess of today's routing tables.

In other words: use some database like GeoIP to map the IPs to (approximate) coordinates. Best you can do. IP based similarity is mostly useful on a /24 prefix, but a binary yes/no similarity won't make you happy I guess.

Has QUIT--Anony-Mousse

Posted 2015-11-09T09:40:19.857

Reputation: 7 331

I hadn't seen your answer before I replied to your comment on mine. I understand now what you meant. Simple (dis)similarity measures on the IPs won't help one to distinguish any sort of geographical location. Maybe not even any other kind of useful measure of proximity: organizations have IP ranges assigned ad hoc, which makes a distance function superfluous/meaningless. – MASL – 2015-11-10T00:41:17.817


Like someone else mentioned, treating IPs as ints automatically gives higher bits higher weights. I've used the standard deviation of the IPs, log scaled.

math.log(np.std([IP(ip).int() for ip in ips]))
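For anyone who wants to try this without extra packages, here is a sketch of the same idea using only the standard library. It assumes that `statistics.pstdev` (population standard deviation) is an acceptable stand-in for `np.std`, whose default is also the population form. The name `log_std` and the addresses are illustrative.

```python
import ipaddress
import math
import statistics


def log_std(ips):
    """Natural log of the population standard deviation of the integer IPs."""
    values = [int(ipaddress.IPv4Address(ip)) for ip in ips]
    # Raises ValueError if all addresses are identical (log of zero).
    return math.log(statistics.pstdev(values))


# Illustrative group of addresses that are numerically close together.
print(log_std(["10.0.0.1", "10.0.0.5", "10.0.0.9"]))
```

This gives one spread value for a whole group of addresses, not a pairwise distance.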


Posted 2015-11-09T09:40:19.857

Reputation: 21


2nd Update

The below can be improved as it does not consider the hierarchical structure of an IP address. To account for this the elements of the IP vectors can be non-linearly scaled before computing the distance vector and its norm. This gives more weight to the elements higher in the hierarchy.

Mathematica code

Once we have the 4D vectors from the 1st update each element is scaled based on its position $[x^{2}_{1},x^{\frac{3}{2}}_{2},x^{1}_{3},x^{\frac{1}{2}}_{4}]$.

Subtract @@ (MapIndexed[
       Function[{value, index}, 
        value^((5 - First@index)/2)], #] & /@ {ip1, ip2}) // Norm // N
(* 2209.17 *)

There is information lost in collapsing from 4D down to 1D but this can't be helped if you are looking for a 1D distance metric.

1st Update

An IP address is made up of 4 numbers. Take these as vectors in 4D and calculate the distance between them (distance in Euclidean space).

Mathematica code

(* Make some IP addresses *)
{ip1, ip2} = 
 StringRiffle[#, "."] & /@ 
  Map[ToString, RandomInteger[{1, 255}, {2, 4}], {2}]
(* {"", ""} *)

(* Extract 4D vector *)
{ip1, ip2} = Map[FromDigits, StringSplit[#, "."] & /@ {ip1, ip2}, {2}]
(* {{50, 229, 29, 146}, {27, 167, 216, 58}} *)

(* Calculate distance *)
Norm[ip1 - ip2] // N
(* 216.993 *)

Consider an IP address as a 4D vector. Subtract and calculate the norm.
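For readers without Mathematica, here is a rough Python equivalent of both updates, a sketch that starts directly from the two 4D vectors shown in the output above. The function names are mine; `math.dist` needs Python 3.8 or later.

```python
import math

# The two 4D vectors from the Mathematica output above.
ip1 = [50, 229, 29, 146]
ip2 = [27, 167, 216, 58]


def euclidean_distance(u, v):
    """1st update: plain Euclidean norm of the difference vector."""
    return math.dist(u, v)


def scaled_distance(u, v, exponents=(2.0, 1.5, 1.0, 0.5)):
    """2nd update: raise each element to a position-dependent power first."""
    scaled_u = [x ** e for x, e in zip(u, exponents)]
    scaled_v = [x ** e for x, e in zip(v, exponents)]
    return math.dist(scaled_u, scaled_v)


print(round(euclidean_distance(ip1, ip2), 3))  # 216.993
print(round(scaled_distance(ip1, ip2), 2))     # 2209.17
```

The exponents `(5 - position) / 2` match the `MapIndexed` expression in the 2nd update.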


Posted 2015-11-09T09:40:19.857

Reputation: 625

Can you explain a bit more? – Marc Lamberti – 2015-11-09T12:23:08.813

@marcL , I was on my phone. Check the update. – Edmund – 2015-11-09T13:11:01.313

So you are assuming that Sim("", "") == Sim("", "")? – Jérémie Clos – 2015-11-09T13:53:01.597

Correct. They are equally similar to one another with this measure. But I see the issue here with the hierarchical nature of the IP addresses. Let me think about that. – Edmund – 2015-11-09T14:01:18.397

@JérémieClos So I'm thinking that each element of the vector could be non-linearly scaled before computing the norm of the distance vector. This would give more weight to elements higher in the hierarchy. – Edmund – 2015-11-09T14:08:16.943

That's what I was thinking too, but I didn't really take the time to formulate it well so I went with a simpler solution. Decomposing the similarity into a logarithmically weighted sum of pointwise similarities might do the job better (where each component is divided by $\log_2(n+1)$, with n the position of the component).

Then sim(aaa.bbb.ccc.ddd, eee.fff.ggg.hhh) = sim(aaa,eee)/$\log_2(2)$ + sim(bbb,fff)/$\log_2(3)$ + sim(ccc,ggg)/$\log_2(4)$ + sim(ddd,hhh)/$\log_2(5)$ – Jérémie Clos – 2015-11-09T15:21:20.860

My bad, I didn't see that you updated your answer with something similar. – Jérémie Clos – 2015-11-09T15:34:32.847


First you need to distinguish private and public addresses (see the Wikipedia article on IP addresses, section "Private addresses").

Private IPs: the best you can do is compute whether 2 addresses COULD be on the same network or not; then you need clues from other features to KNOW whether that is the case or not.

Public IPs: for geographic distance, you may want to check web services/APIs that try to map IP addresses to geographical locations (a quick Google search turns up several).

Another point which could be interesting is the "organisational distance", from the IP address you can try to identify the owner of the address (the ISP), check ARIN for instance

From there you can try to figure out if two addresses belong to the same organisation or not. You will have to find some way to tell if the organisation is an ISP with private customers or a company with its own network. Please also be aware that some organisations have started to resell blocks of their IPv4 addresses to other companies, with the effect that addresses that used to be in the same organisation/location can now be thousands of miles apart in different companies.

I think it would be wise to treat this information as probabilities only.
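A sketch of the private-address check described above: whether two private addresses COULD be on the same network. Since the real netmask is not known from the addresses alone, the prefix length is a guess you have to supply; the default of 24 and the name `could_share_network` are assumptions for illustration.

```python
import ipaddress


def could_share_network(ip_a, ip_b, prefix_len=24):
    """True if both addresses are private and fall in the same assumed block."""
    a = ipaddress.IPv4Address(ip_a)
    b = ipaddress.IPv4Address(ip_b)
    if not (a.is_private and b.is_private):
        return False
    # strict=False lets us build the network from a host address.
    network = ipaddress.ip_network(f"{ip_a}/{prefix_len}", strict=False)
    return b in network


# Illustrative addresses only.
print(could_share_network("192.168.1.10", "192.168.1.200"))     # True
print(could_share_network("192.168.1.10", "192.168.2.10"))      # False
print(could_share_network("192.168.1.10", "192.168.2.10", 16))  # True
```

A True result is only a hint; as said above, other features are needed to know whether the two hosts really share a network.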


Posted 2015-11-09T09:40:19.857

Reputation: 173