(When) is hash table lookup O(1)?

64

16

It is often said that hash table lookup operates in constant time: you compute the hash value, which gives you an index for an array lookup. Yet this ignores collisions; in the worst case, every item happens to land in the same bucket and the lookup time becomes linear ($\Theta(n)$).

Are there conditions on the data that can make hash table lookup truly $O(1)$? Is that only on average, or can a hash table have $O(1)$ worst case lookup?

Note: I'm coming from a programmer's perspective here; when I store data in a hash table, it's almost always strings or some composite data structures, and the data changes during the lifetime of the hash table. So while I appreciate answers about perfect hashes, they're cute but anecdotal and not practical from my point of view.

P.S. Follow-up: For what kind of data are hash table operations O(1)?

Gilles

Posted 2012-03-12T19:01:07.577

Reputation: 29 838

1@Raphael I would be very interested in an answer that explains (along broad lines) when I can count on $O(1)$ amortized and when I can't. As for how the hash values are distributed, that's part of my question really: how can I know? I know hash functions are supposed to distribute values well; but if they always did the worst case would never be reached, which doesn't make sense. – Gilles – 2012-03-15T20:52:25.743

1Also be careful of premature optimization; for smallish (several thousand elements) data I have often seen $O(\log n)$ balanced binary trees outperform hashtables due to lower overhead (string comparisons are vastly cheaper than string hashes). – isturdy – 2013-05-06T12:59:39.873

3Can you live with $\cal{O}(1)$ amortised access time? In general, hash table performance will heavily depend on how much overhead for sparse hashtables you are prepared to tolerate and on how the actual hash values are distributed. – Raphael – 2012-03-12T19:31:07.927

5Oh, btw: you can avoid linear worst-case behaviour by using (balanced) search trees instead of lists. – Raphael – 2012-03-12T19:31:54.027

Let us continue this discussion in chat.

– Raphael – 2015-02-24T11:28:57.193

Answers

39

There are two settings under which you can get $O(1)$ worst-case times.

  1. If your setup is static, then FKS hashing will get you worst-case $O(1)$ guarantees. But as you indicated, your setting isn't static.

  2. If you use Cuckoo hashing, then queries and deletes are $O(1)$ worst-case, but insertion is only $O(1)$ expected. Cuckoo hashing works quite well if you have an upper bound on the total number of inserts, and set the table size to be roughly 25% larger.

There's more information here.
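To make the second setting concrete, here is a minimal sketch of a cuckoo hash set. It is an illustration under assumptions, not the scheme from any particular paper: the class name `CuckooSet`, the `max_kicks` limit, the salted use of Python's built-in `hash`, and the "double the tables and rehash" fallback are all choices made for this example. `None` marks an empty slot, so it cannot be stored as a key.

```python
import random


class CuckooSet:
    """Toy cuckoo hash set: every key lives in one of two candidate slots."""

    def __init__(self, capacity=16, max_kicks=32):
        self.capacity = capacity      # slots per table (illustrative default)
        self.max_kicks = max_kicks    # evictions allowed before giving up and rehashing
        self._new_tables()

    def _new_tables(self):
        # Two tables; a random salt per table stands in for an independent hash function.
        self.tables = [[None] * self.capacity for _ in range(2)]
        self.salts = [random.getrandbits(64) for _ in range(2)]

    def _slot(self, i, key):
        return hash((self.salts[i], key)) % self.capacity

    def __contains__(self, key):
        # Worst case: exactly two probes, however many keys are stored.
        return any(self.tables[i][self._slot(i, key)] == key for i in range(2))

    def discard(self, key):
        # Worst case: two probes as well.
        for i in range(2):
            s = self._slot(i, key)
            if self.tables[i][s] == key:
                self.tables[i][s] = None

    def add(self, key):
        if key in self:
            return
        i = 0
        for _ in range(self.max_kicks):
            s = self._slot(i, key)
            # Put the key in its slot and pick up whatever was there before.
            key, self.tables[i][s] = self.tables[i][s], key
            if key is None:
                return                # the slot was free, nothing left to place
            i = 1 - i                 # the evicted key tries its slot in the other table
        self._rehash(key)             # probably a cycle: rebuild with new salts

    def _rehash(self, pending):
        keys = [k for t in self.tables for k in t if k is not None] + [pending]
        self.capacity *= 2
        self._new_tables()
        for k in keys:
            self.add(k)
```

Lookups and deletions touch at most two slots; only `add` can take longer, when it has to evict a chain of keys or rebuild the tables.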

Suresh

Posted 2012-03-12T19:01:07.577

Reputation: 4 051

@Louis: That's not true. You can use relatively simple hash functions in practice. I had my students implement cuckoo hashing for an assignment using regular mod-prime hash functions. – Suresh – 2012-04-21T21:21:56.530

1@Suresh: Really? I thought you needed $\log n$-independent functions, which I always associated with needing expanders. I stand corrected. Will delete my comment in a bit. – Louis – 2012-04-21T21:39:54.437

I meant "in practice" :) – Suresh – 2012-04-21T21:41:05.063

1To make a more useful comment on this answer, as @Suresh points out, cuckoo hashing will work well without the fancy (and big) hash functions used to analyze it theoretically. – Louis – 2012-04-21T22:00:54.530

3Could you expand on FKS and Cuckoo? Both terms are new to me. – Gilles – 2012-03-12T21:30:15.973

1

What about dynamic perfect hashing? It has $O(1)$ worst-case lookups and $O(1)$ amortized insertion and deletion. (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.8165)

– Joe – 2012-03-13T00:23:22.510

Ah yes, dynamic perfect hashing works as well. But it's randomized. For both FKS and cuckoo hashing, these notes are good: http://courses.csail.mit.edu/6.851/spring07/scribe/lec11.pdf

– Suresh – 2012-03-13T00:48:07.847

2FKS are the initials of Fredman, Komlós and Szemerédi, and the cuckoo is a bird species. The name is used for this type of hashing because cuckoo chicks push their siblings' eggs out of the nest, which somewhat resembles how this hashing method works. – uli – 2012-03-13T08:38:10.527

19

This answer summarises parts of TAoCP Vol 3, Ch 6.4.

Assume we have a set of values $V$, $n$ of which we want to store in an array $A$ of size $m$. We employ a hash function $h : V \to [0..M)$; typically, $M \ll |V|$. We call $\alpha = \frac{n}{m}$ the load factor of $A$. Here, we will assume the natural $m=M$; in practical scenarios, we have $m \ll M$, though, and have to map down to $m$ ourselves.

The first observation is that even if $h$ has uniform characteristics¹ the probability of two values having the same hash value is high; this is essentially an instance of the infamous birthday paradox. Therefore, we will usually have to deal with conflicts and can abandon hope of $\mathcal{O}(1)$ worst case access time.
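The birthday effect can be checked numerically. The sketch below assumes an ideal hash function that places each of $n$ keys uniformly and independently into $m$ slots; the function name is made up for this example.

```python
def collision_probability(n, m):
    """Probability that n uniformly hashed keys in m slots collide at least once."""
    p_distinct = 1.0
    for i in range(n):
        # The (i+1)-th key must avoid the i slots that are already taken.
        p_distinct *= (m - i) / m
    return 1.0 - p_distinct
```

For example, `collision_probability(23, 365)` is slightly above one half, which is the classic birthday figure: collisions become likely long before the table is anywhere near full.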

What about the average case, though? Let us assume that every key from $[0..M)$ occurs with the same probability. The average number of checked entries $C_n^S$ (successful search) resp. $C_n^U$ (unsuccessful search) depends on the conflict resolution method used.

Chaining

Every array entry contains (a pointer to the head of) a linked list. This is a good idea because the expected list length is small ($\frac{n}{m}$) even though the probability of having collisions is high. In the end, we get \[ C_n^S \approx 1 + \frac{\alpha}{2} \quad \text{ and } \quad C_n^U \approx 1 + \frac{\alpha^2}{2} .\] This can be improved slightly by storing the lists (partly or completely) inside the table.
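A minimal sketch of chaining follows. For brevity it uses Python lists as buckets instead of linked lists, and Python's built-in `hash` reduced modulo the table size as $h$; the class and method names are illustrative only.

```python
class ChainedTable:
    """Hash table with one bucket (a Python list of pairs) per array entry."""

    def __init__(self, m=8):
        self.buckets = [[] for _ in range(m)]
        self.n = 0                                   # number of stored keys

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                             # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.n += 1

    def get(self, key, default=None):
        for k, v in self._bucket(key):               # scans one bucket only
            if k == key:
                return v
        return default

    def load_factor(self):
        return self.n / len(self.buckets)            # alpha = n / m
```

The cost of `get` is the length of a single bucket, which is $\alpha$ on average under the uniformity assumption, whatever the total number of keys.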

Linear Probing

When inserting (resp. searching for) a value $v$, check positions \[h(v), h(v)-1,\dots,0,m-1,\dots,h(v)+1\] in this order until an empty position (resp. $v$) is found. The advantage is that we work locally and without secondary data structures; however, the average number of accesses diverges for $\alpha \to 1$: \[ C_n^S \approx \frac{1}{2}\left(1 +\frac{1}{1-\alpha}\right) \quad \text{ and } \quad C_n^U \approx \frac{1}{2}\left(1 +\left(\frac{1}{1-\alpha}\right)^2\right).\] For $\alpha < 0.75$, however, performance is comparable to chaining².
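The probe order above can be sketched as follows. This is a fixed-size toy without deletion or resizing; the names are made up for the example, and Python's built-in `hash` modulo $m$ plays the role of $h$.

```python
class LinearProbingTable:
    """Fixed-size open-addressing set that probes downwards, wrapping around."""

    _EMPTY = object()                      # sentinel marking a free slot

    def __init__(self, m=8):
        self.m = m
        self.slots = [self._EMPTY] * m

    def _probe(self, key):
        start = hash(key) % self.m
        for step in range(self.m):
            # Visits h(v), h(v)-1, ..., 0, m-1, ..., h(v)+1.
            yield (start - step) % self.m

    def insert(self, key):
        for pos in self._probe(key):
            if self.slots[pos] is self._EMPTY or self.slots[pos] == key:
                self.slots[pos] = key
                return pos                 # position where the key now lives
        raise OverflowError("table is full")

    def contains(self, key):
        for pos in self._probe(key):
            if self.slots[pos] is self._EMPTY:
                return False               # an empty slot ends the search
            if self.slots[pos] == key:
                return True
        return False                       # scanned a completely full table
```

With $m = 8$, the keys 0, 8 and 16 all hash to position 0 and end up in positions 0, 7 and 6. Runs of occupied slots like this are what make the cost grow so quickly as $\alpha \to 1$.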

Double Hashing

Similar to linear probing, but the search step size is controlled by a second hash function whose values are coprime to $M$. No formal derivation is given, but empirical observations suggest \[ C_n^S \approx \frac{1}{\alpha}\ln\left(\frac{1}{1-\alpha}\right)\quad \text{ and } \quad C_n^U \approx \frac{1}{1-\alpha} .\] This method has been adapted by Brent; his variant amortises increased insertion costs with cheaper searches.
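A sketch of the resulting probe sequence, under the assumption that the table size $m$ is prime so that every step size in $1..m-1$ is automatically coprime to it. Deriving the second hash from `hash((key, "step"))` is merely a convenient stand-in for an independent second hash function.

```python
def double_hash_probes(key, m):
    """Yield the m probe positions for key in a table of prime size m."""
    h1 = hash(key) % m
    # Step size in 1..m-1; since m is prime, it is coprime to m.
    step = 1 + hash((key, "step")) % (m - 1)
    for i in range(m):
        yield (h1 - i * step) % m
```

Because the step is coprime to $m$, the sequence visits every position exactly once, and two keys that start at the same position usually continue along different paths.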

Note that removing elements from and extending tables has varying degrees of difficulty for the respective methods.

Bottom line: you have to choose an implementation that adapts well to your typical use cases. Expected access time in $\mathcal{O}(1)$ is possible if not always guaranteed. Depending on the method used, keeping $\alpha$ low is essential; you have to trade off (expected) access time versus space overhead. A good choice for $h$ is also central, obviously.
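To get a feeling for the trade-off, the approximations quoted above can simply be evaluated for a given load factor. The function name and the dictionary layout are choices made for this sketch.

```python
import math


def expected_probes(alpha):
    """Approximate (successful, unsuccessful) probe counts for 0 < alpha < 1."""
    inv = 1 / (1 - alpha)
    return {
        "chaining": (1 + alpha / 2, 1 + alpha ** 2 / 2),
        "linear probing": (0.5 * (1 + inv), 0.5 * (1 + inv ** 2)),
        "double hashing": (math.log(inv) / alpha, inv),
    }
```

At $\alpha = 0.75$, for instance, the linear probing entry evaluates to $(2.5, 8.5)$, while the chaining entry stays below $1.4$ probes for both kinds of search.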


1] As arbitrarily dumb uninformed programmers may provide $h$, any assumption regarding its quality is a stretch in practice.
2] Note how this coincides with recommendations for usage of Java's Hashtable.

Raphael

Posted 2012-03-12T19:01:07.577

Reputation: 54 413

9

A perfect hash function can be defined as an injective function from a set $S$ to a subset of the integers $\{0, 1, 2, ..., n\}$. If a perfect hash function exists for your data and storage needs, you can easily get $O(1)$ behavior. For instance, you can get $O(1)$ performance from a hash table for the following task: given an array $l$ of integers and a set $S$ of integers, determine whether $l$ contains $x$ for each $x \in S$. A pre-processing step would involve making a hash table in $O(|l|)$, followed by checking each element of $S$ against it in $O(|S|)$. Altogether, this is $O(|l| + |S|)$. A naive implementation using linear search might be $O(|l||S|)$; using binary search, you can do $O(\log(|l|)|S|)$ (note that this solution is $O(|l|)$ space, since the hash table must map distinct integers in $l$ to distinct bins).

EDIT: To clarify on how the hash table is generated in $O(|l|)$:

The list $l$ contains integers from a finite set $U \subset \mathbb{N}$, possibly with repeats, and $S \subseteq U$. We want to determine whether $x \in S$ is in $l$. To do so, we pre-compute a hash table for elements of $l$: a lookup table. The hash table will encode a function $h: U \rightarrow \{true, false\}$. To define $h$, initially assume $h(x) = false$ for all $x \in U$. Then, linearly scan through elements $y$ of $l$, setting $h(y) = true$. This takes $O(|l|)$ time and $O(|U|)$ space.

Notice that my original analysis assumed that $l$ contained at least $O(|U|)$ distinct elements. If it contains fewer distinct elements (say, $O(1)$), the space requirement may be higher (although it is no more than $O(|U|)$).

EDIT2: The hash table can be stored as a simple array. The hash function can be the identity function on $U$. Notice that the identity function is trivially a perfect hash function. $h$ is the hash table and encodes a separate function. I am being sloppy/confused in some of the above, but will try to improve it soon.
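The construction described in the edits can be sketched as follows, assuming $U = \{0, \dots, \text{universe\_size} - 1\}$ so that the identity function serves as the hash. The function and parameter names are hypothetical.

```python
def membership_queries(l, S, universe_size):
    """For each x in S, report whether x occurs in l, using a direct lookup table."""
    # O(|U|) space; note that allocating and clearing the array also
    # costs O(|U|) time in this straightforward version.
    table = [False] * universe_size
    for y in l:                          # O(|l|): mark every element of l
        table[y] = True
    return {x: table[x] for x in S}      # O(|S|): one array access per query
```

For example, `membership_queries([3, 1, 3, 7], [1, 2, 7], 8)` maps 1 and 7 to `True` and 2 to `False`. Each query is a single array access, which is where the worst-case $O(1)$ comes from.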

Patrick87

Posted 2012-03-12T19:01:07.577

Reputation: 9 946

Could you expand the part where you make the hash table in $O(|l|)$? I can see how to do that if you don't worry about collisions, but that means the later lookups may take more than $O(|S|)$, up to $O(|l|\cdot|S|)$. – Gilles – 2012-03-12T19:24:05.363

I don't understand the definition of $h$. You're defining a function, but not explaining how it's represented; could you write a few lines of pseudocode? There's also a notation problem; $h:U\to\{\mathrm{false},\mathrm{true}\}$ and $h$ bijective don't go well together. – Gilles – 2012-03-12T21:32:05.877

@Gilles It's basically just being used as a lookup table for list membership. When you have a perfect hash function with a known & cheap inverse, instead of storing the thing itself, you only need to store 1 bit (whether the thing with the unique hash has been added). If collisions are possible, I think doing this is referred to as a Bloom filter, but in any event can provide a definite "no" to the question of membership, which is still useful in many scenarios. – Patrick87 – 2012-03-12T22:36:59.620

8

A perfect hash function will result in $\cal{O}(1)$ worst case lookup.

Moreover, if the maximum number of collisions possible is $\cal{O}(1)$, then hash table lookup can be said to be $\cal{O}(1)$ in the worst case. If the expected number of collisions is $\cal{O}(1)$, then the hash table lookup can be said to be $\cal{O}(1)$ in the average case.

Nicholas Meyer

Posted 2012-03-12T19:01:07.577

Reputation: 81

@Suresh: If you are allowed to pick a new hash function and increase the size of the table whenever there is a collision, you can always find a (deterministic) hash function that -- for the data already in the table plus the one new item you're trying to insert -- has no collisions (is "perfect"). That is why dynamic perfect hashing periodically picks a random new hash function.

– David Cary – 2016-08-03T23:12:51.430

If the hash is perfect then the number of collisions is exactly $0$ - this is the point. It is $\mathcal O(1)$ in the worst case. – Evil – 2016-10-23T03:14:42.520

A perfect hash function would be perfect, but how do I get one? How much will it cost me? And how do I know what the maximum or expected number of collisions is? – Gilles – 2012-03-12T19:18:00.207

2@Gilles a perfect hash function is any function that will produce a unique hash for all possible inputs. If your possible inputs are finite (and unique), this is easy to do. – Rafe Kettler – 2012-03-12T19:39:08.740

1@RafeKettler My inputs are typically strings or compound data structures, and I usually add and remove entries as my data evolves. How do I make a perfect hash for this? – Gilles – 2012-03-12T19:44:31.577

4Yes, but that's the point. A deterministic perfect hash function doesn't exist if the domain is larger than the range. – Suresh – 2012-03-12T19:45:50.250