## How to define a distance measure between two IP addresses?

I have IP addresses as a feature, and I would like to know how similar two IP addresses are to each other, so that I can use the difference in a Euclidean distance measure (in order to quantify the similarities of my data points). What tactic can I use for this?

To you, what would make two IP addresses similar? – paparazzo – 2015-11-09T13:04:49.203

You could use a GeoIP lookup and literally compare the distance on the earth between them.... – Spacedman – 2015-11-09T21:25:54.063

@Spacedman That only works for WWW public IP. OP has not even provided that information. – paparazzo – 2015-11-09T21:43:12.100

Yes, I was hoping some clarification would be forthcoming, instead of everyone piling in with "answers"... @marcL? – Spacedman – 2015-11-09T21:53:55.437

How similar they are in terms of the network, not based on distance on Earth. Sorry for the delay – Marc Lamberti – 2015-11-10T14:05:47.140

What to you makes an IP similar in terms of a network? Is this a public (WWW) network? – paparazzo – 2015-11-10T19:27:23.860

You can't tell the network similarity without the netmask value.... Again, people are piling in with answers when the question is unclear. – Spacedman – 2015-11-11T09:57:16.387

5

If I understood them correctly, Jérémie's and Edmund's (first) solutions are the same, namely, plain Euclidean distance in a 4-dimensional space of IP addresses. BTW, I think a very fast alternative to Euclidean distance would be to calculate a Hamming distance bit-wise.
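To make the two options concrete, here is a minimal Python sketch of both: Euclidean distance over the four octets, and bit-wise Hamming distance over the 32-bit value. The function names are hypothetical and only for illustration; the parsing relies on the standard-library `ipaddress` module.

```python
import ipaddress
import math


def octets(ip: str) -> tuple:
    """Return the four octets of a dotted-quad IPv4 address as integers."""
    return tuple(ipaddress.IPv4Address(ip).packed)


def euclidean_distance(ip_a: str, ip_b: str) -> float:
    """Plain Euclidean distance, treating each address as a 4D vector."""
    return math.dist(octets(ip_a), octets(ip_b))


def hamming_distance(ip_a: str, ip_b: str) -> int:
    """Number of bit positions in which the two 32-bit addresses differ."""
    xor = int(ipaddress.IPv4Address(ip_a)) ^ int(ipaddress.IPv4Address(ip_b))
    return bin(xor).count("1")
```

Note that the two measures can disagree strongly: 192.168.1.1 and 191.168.1.1 are at Euclidean distance 1 but differ in 7 bits, because 192 and 191 share only their leading bit.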

Edmund's first update would be better than his second. The reason is simple to state: his 2nd update tries to define a distance measure by considering a non-linear function of the coordinates of a 4D vector. That however will most likely destroy the key properties that it needs to satisfy in order to be a metric, namely

1. Identity of indiscernibles: $d(IP_1,IP_2)=0 \iff IP_1=IP_2$,
2. Symmetry: $d(IP_1,IP_2)=d(IP_2,IP_1)$, and
3. Triangle inequality: $d(IP_1,IP_2)\leq d(IP_1,IP_3)+d(IP_3,IP_2)\,\forall IP_3$.

The latter is key for later interpreting small distances as close points in IP space. One would need a linear (in the coordinates) distance function. However, simple Euclidean distance is not enough, as you saw.

Physics (well, differential geometry actually) could lead to a nice solution to this problem: define a metric tensor $g$. In plain English: assign a weight to each pair of coordinates, multiply each weight by the product of the two corresponding coordinate differences (for a coordinate paired with itself, that is the squared difference), and add up those products. Take the square root of that sum and define it as your distance.

For the sake of simplicity, one could start trying with a diagonal metric tensor.

Example: Say you take

$$g=\begin{pmatrix}1000&0&0&0\\0&100&0&0\\0&0&10&0\\0&0&0&1\end{pmatrix},$$

$IP_1=(x_1,x_2,x_3,x_4)$ and $IP_2=(y_1,y_2,y_3,y_4)$. Then the square of the distance is given by

$$d(IP_1,IP_2)^2=1000\,(x_1-y_1)^2+100\,(x_2-y_2)^2+10\,(x_3-y_3)^2+(x_4-y_4)^2.$$

For $IP_1=192.168.1.1,\,IP_2=192.168.1.2$ the distance is clearly 1. However, for $192.168.1.1$ and $191.168.1.1$ the distance is $\sqrt{1000}\approx 32$.
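The diagonal case can be sketched in a few lines of Python. This is only an illustration under the weights chosen in the example; `WEIGHTS` and `weighted_distance` are hypothetical names, and off-diagonal terms are not handled.

```python
import math

# Diagonal entries of the example metric tensor g, one weight per octet.
WEIGHTS = (1000, 100, 10, 1)


def weighted_distance(ip_a: str, ip_b: str, weights=WEIGHTS) -> float:
    """Square root of the weighted sum of squared octet differences."""
    xs = [int(part) for part in ip_a.split(".")]
    ys = [int(part) for part in ip_b.split(".")]
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, xs, ys)))
```

Changing the tuple of weights is enough to experiment with different scalings or with a normalization of the maximal distance.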

Eventually you could play around with different weights and set a kind of normalization where you could fix the value of the maximal distance $d(0.0.0.0,FF.FF.FF.FF)$.

Furthermore, this set up allows for more complex descriptions of your data where the relevant distance would contain "cross-products" of coordinates like say $g_{13}*(x_1-y_1)*(x_3-y_3)$.

EDIT: While this would be a better "weighting" method than the others I addressed, I realize now that it is actually meaningless: as Anony-Mousse and Philip mention, IPs are indeed 32-dimensional. This means in particular that giving the same weight to all bits in, say, the 2nd group is in general not sound: one bit could be part of the netmask while another is not. See Anony-Mousse's answer for additional objections.

Very nice. Thanks for critique. I obviously need to brush up on my Linear Algebra. ;-) – Edmund – 2015-11-09T21:12:48.953

As I mentioned, eventually one could introduce a more complex "IP space-time" by considering also non-vanishing off-diagonal elements in $g$. This I guess would depend on the problem he wants to solve, i.e., on the data. However, my first guess is that it easily is overkill. At least one should first try with a diagonal form as in the example above. In such case, there wouldn't be any need to dust-off your LA books, ;-) just following the example, which is, btw, like a problem of vectors in 2 dimensions. – MASL – 2015-11-09T21:24:54.820

It's not a four-dimensional space. It's a 32-dimensional space. The dot representation is just easier to remember. – Has QUIT--Anony-Mousse – 2015-11-09T23:36:09.960

@Anony-Mousse You are right, there is this other way of looking at it, namely bit-wise. What I propose is independent of how one describes the "IP space", i.e., whether as 4- or 32-dimensional. That said, without other info on the OP's problem, it seems reasonable to distinguish the usual 4 numbers of an IP and say that 191.0.0.0 and 192.0.0.0 differ more than 0.0.0.0 and 0.0.0.1. – MASL – 2015-11-10T00:33:06.663

4

That's a very interesting question. Similarity here should be computed component-wise, but from a "business logic" perspective, the similarity of the last group (ddd) doesn't matter if the other 3 groups are not the same. Keeping that in mind, I would probably do something like the following (there is probably a more elegant way of doing it, and I don't have much time to think about it, so forgive me if it doesn't answer your question, and for the poor formatting).

Assuming IPv4 of the form aaa.bbb.ccc.ddd, I would do something like:

    If aaa_1 == aaa_2:
        If bbb_1 == bbb_2:
            If ccc_1 == ccc_2:
                If ddd_1 == ddd_2:
                    Dist = 1;
                Else:
                    Dist = (3 + distance(ddd_1,ddd_2))/4;
                End if;
            Else:
                Dist = (2 + distance(ccc_1,ccc_2))/4;
            End if;
        Else:
            Dist = (1 + distance(bbb_1,bbb_2))/4;
        End if;
    Else:
        Dist = distance(aaa_1,aaa_2);
    End if;
    Return 1/Dist;
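One possible reading of the pseudocode above, as a hedged Python sketch. It is not the answer's exact method: the per-group `distance()` is left undefined there, so `octet_similarity` is a hypothetical stand-in that returns a value in [0, 1]; the function returns the similarity directly instead of `1/Dist`; and the first-group case is divided by 4 like the others so the result always stays in [0, 1].

```python
def octet_similarity(a: int, b: int) -> float:
    """Hypothetical per-group similarity in [0, 1]; 1.0 means identical."""
    return 1.0 - abs(a - b) / 255.0


def hierarchical_similarity(ip_a: str, ip_b: str) -> float:
    """Similarity in [0, 1] that only looks at the first differing group."""
    xs = [int(part) for part in ip_a.split(".")]
    ys = [int(part) for part in ip_b.split(".")]
    for matched, (x, y) in enumerate(zip(xs, ys)):
        if x != y:
            # 'matched' leading groups agree; later groups are ignored.
            return (matched + octet_similarity(x, y)) / 4.0
    return 1.0  # all four groups are equal
```

With this reading, 192.168.1.1 and 192.168.1.2 score close to 1, while 191.168.1.1 and 192.168.1.1 score below 0.25.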


So, if both addresses are the same, your code assigns a distance of 1, not zero. Is that a typo, or does this 1 have a special meaning? (You could make it return zero at that point to avoid a NaN at the return at the bottom.) – MASL – 2015-11-09T22:48:45.580

An IP is one 32-bit number. – Has QUIT--Anony-Mousse – 2015-11-09T23:50:10.277

@MASL This is a similarity metric; I should have made that clear. 1 means complete similarity here. – Jérémie Clos – 2015-11-10T01:16:05.667

Also I am probably abusing the word "metric" here, forgive me for that. – Jérémie Clos – 2015-11-10T01:22:33.773

2

An IPv4 address is a 32-bit integer, which trivially gives you a metric. However, it may not be a particularly useful metric: 10.255.255.255 and 11.0.0.0 are almost certainly significantly more different than 192.168.1.1 and 192.168.1.2.

Exactly. So do you have any idea? Maybe one of the answers above? – Marc Lamberti – 2015-11-09T12:22:35.927

What is a good metric depends on the problem you're trying to solve. You could for instance use one of the geoIP services to get a geolocation for every IP address - but we can't tell you whether that's good for you because we don't know what your problem is. – Philip Kendall – 2015-11-09T12:36:52.777

2

20 years ago, I would have suggested to use the length of the shared prefix as similarity measure.

So you take two IPs in their 32-bit representation: not the "pretty printed" x.y.z.w form, but the real "int" representation your network stack uses. Then XOR them, count the leading zeros, and you get

distance = 32 - leadingZeros(ip1 XOR ip2)
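As a rough Python sketch of this formula (the function name is hypothetical): for a 32-bit value, `32 - leadingZeros(x)` is the same as the bit length of `x`, so Python's `int.bit_length()` does the work.

```python
import ipaddress


def prefix_distance(ip_a: str, ip_b: str) -> int:
    """32 minus the length of the shared bit prefix of two IPv4 addresses."""
    xor = int(ipaddress.IPv4Address(ip_a)) ^ int(ipaddress.IPv4Address(ip_b))
    # The highest set bit of the XOR marks the first position where they differ.
    return xor.bit_length()
```

Identical addresses give 0, two addresses in the same /24 give at most 8, and addresses that differ in the very first bit give 32.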


However, we exhausted the IPv4 namespace long ago. Over the last 10 years, the few remaining netblocks have been distributed more or less "randomly" (at least from a similarity perspective). IP ranges have been relocated, and so on.

A long time ago, people would have told you that routing happens on trees, based on prefixes. So if you wanted to reach the IP 10.2.3.4, the packet would go to 10.0.0.0, then 10.2.0.0, then 10.2.3.0. But that was just the theory. If you manually configured your router, that is what you would do: 10.2 is the second building, 10.2.3 is the third-floor router. That is history. Within networks, IPs are assigned by DHCP, often first come, first served. Within an intranet, you have mostly switches, not routers. And on the global level, BGP is responsible for taking care of the big mess of today's routing tables.

In other words: use some database like GeoIP to map the IPs to (approximate) coordinates. That is the best you can do. IP-based similarity is mostly useful on a /24 prefix, but a binary yes/no similarity won't make you happy, I guess.

I hadn't seen your answer before I replied to your comment on mine. I understand now what you meant. Simple (dis)similarity measures on IPs won't help one distinguish any sort of geographical location, and maybe not even any other useful measure of proximity: organizations have IP ranges assigned ad hoc, which makes a distance function superfluous/meaningless. – MASL – 2015-11-10T00:41:17.817

2

Like someone else mentioned, treating IPs as integers automatically gives higher bits higher weights. I've used the spread (standard deviation) of the IPs, log scaled.

math.log(np.std([IP(ip).int() for ip in ips]))


1

2nd Update

The below can be improved as it does not consider the hierarchical structure of an IP address. To account for this the elements of the IP vectors can be non-linearly scaled before computing the distance vector and its norm. This gives more weight to the elements higher in the hierarchy.

Mathematica code

Once we have the 4D vectors from the 1st update, each element is scaled based on its position: $[x^{2}_{1},x^{\frac{3}{2}}_{2},x^{1}_{3},x^{\frac{1}{2}}_{4}]$.

    Subtract @@ (MapIndexed[
         Function[{value, index},
          value^((5 - First@index)/2)], #] & /@ {ip1, ip2}) // Norm // N
    (* 2209.17 *)


There is information lost in collapsing from 4D down to 1D but this can't be helped if you are looking for a 1D distance metric.

1st Update

An IP address is made up of 4 numbers. Take these as vectors in 4D and calculate the distance between them (distance in Euclidean space).

Mathematica code

    (* Make some IP addresses *)
    {ip1, ip2} =
     StringRiffle[#, "."] & /@
      Map[ToString, RandomInteger[{1, 255}, {2, 4}], {2}]
    (* {"50.229.29.146", "27.167.216.58"} *)

    (* Extract 4D vector *)
    {ip1, ip2} = Map[FromDigits, StringSplit[#, "."] & /@ {ip1, ip2}, {2}]
    (* {{50, 229, 29, 146}, {27, 167, 216, 58}} *)

    (* Calculate distance *)
    Norm[ip1 - ip2] // N
    (* 216.993 *)


Consider an IP address as a 4D vector. Subtract and calculate the norm.

Can you explain a bit more? – Marc Lamberti – 2015-11-09T12:23:08.813

@marcL , I was on my phone. Check the update. – Edmund – 2015-11-09T13:11:01.313

So you are assuming that Sim("192.168.1.1", "192.168.1.2") == Sim("191.168.1.1", "192.168.1.1")? – Jérémie Clos – 2015-11-09T13:53:01.597

Correct. They are equally similar to one another with this measure. But I see the issue here with the hierarchical nature of the IP addresses. Let me think about that. – Edmund – 2015-11-09T14:01:18.397

@JérémieClos So I'm thinking that each element of the vector could be non-linearly scaled before computing the norm of the distance vector. This would give more weight to elements higher in the hierarchy. – Edmund – 2015-11-09T14:08:16.943

That's what I was thinking too, but I didn't really take the time to formulate it well so I went with a simpler solution. Decomposing the similarity into a logarithmically weighted sum of pointwise similarities might do the job better (where each component is divided by log(n) where n is the position of the component).

Then sim(aaa.bbb.ccc.ddd, eee.fff.ggg.hhh) = sim(aaa,eee)/$\log_2(2)$ + sim(bbb,fff)/$\log_2(3)$ + sim(ccc,ggg)/$\log_2(4)$ + sim(ddd,hhh)/$\log_2(5)$ – Jérémie Clos – 2015-11-09T15:21:20.860

My bad, I didn't see that you updated your answer with something similar. – Jérémie Clos – 2015-11-09T15:34:32.847

1