Hash tables
Updated: 03 January 2025
Notes based on Algorithms, Part 1 on Coursera
Hashing
Hashing is done by saving items in a key-indexed table such that the index is some function of a key. The hash function then computes an array index from a specifc key
Some challenges we can run into are:
- Computing the hash function
- Equality testing
- Collision resolution - how do we handle keys with the same hash
This is an instance of a space-time tradeoff where we use more space to speed up our lookups
Computing a hash function
- Ideally, we want to scramble keys uniquely to produce a table index
- Efficient to compute
- Each table index is uniquely likely - problematic in practice
Practically, we need a different approach for each type of key
Different languages may implement this differently, Java uses Horner’s method for calculating this and you can utilize your own version for defining custom hashing fuctions
Separate Chaining
Since it’s likely that we will have collisions, we need a way to avoid collisions
Separate chanining uses an array of M < N linked lists with each lista at a position corresponding to s much more constrained hash value that may be repeated
- Hash: map key to integer i between 0 and M - 1
- Insert: put at front of ith chain (if not already there)
- Search: Search only the ith chain
This basically buckets the lists we have so that we can reduce how many items we need to look through
We can use some other data structure for the specific chain but we could use some other kind of list that’s more efficient to navigate
Analysis
It can be noted that our space vs time efficiency is related to the size of each chain as well as the complexity of the hashing
This mechanism is suited for cases where we have a relatively simple hashing function and we don’t need ordered iteration
Open Addressing
A method for this uses an algorithm called Linear Probing Use an array instead of using a linked list, an empty element indicates where we can add a value
- Hash: Make key to integer i between 0 and M-1
- Insert: Put at table i if free, if not try i+1 .. i+n
- Search: Check i, if no match then continue to i+1 .. i+n
Note that this limits the number of available spaces to the size of the array
It’s essential that the array can be resized when the number of keys grows beyond the number spaces in our array, in practice it’s preferable to keep the array about half full
A simple implementation of this can be done by storing an array of keys and an array of values
Clustering
Storage can become a concern in cases of hash tables like this since it is very space inefficient. A problem that compounds this is the occurence of clustering which leads to much longer operations when working with a Linear Probing based hash table
In order to identify the performance of an alogorithm like this we can refer to Knuth’s Parking Problem
- If Half full - with M/2 cars, mean displacement is ~3/2
- If full, with M cars, mean displacement is ~sqrt(pi * M / 8)
Knuth’s Theorem says that under a uniform hashing assumption, the avaerage number of probes in a linear probing hash table of size M that contains N = aM keys is:
- Search hit: ~(1/2)(1 + 1/(1-a))
- Search miss/insert: ~(1/2)(1 + 1/((1-a)^2)
We have the following scenarios:
- M too large: more space than needed
- M too small: search time increases
- Typically uses: a = N/M ~1/2
Alternative Hashing Methods
- Two probe hashing for separate-chaining
- Double hashing for linear probing
- Cuckoo hashing for linear probing
Comparison with Balanced Search Trees
Hash tables are simpler to code and are fast for simple keys
Balanced search trees have a stronger performance guarantee
Additional Context
Java 1.1 made use of a hashing function that only looked at every 8th position in a string in order to decrease then time spent on hashing, this led to really bad performance with typical data that may have collisions on these kinds of positions, for example URLs
Another thing that can happen is that if you’re using a publicly know hash function it becomes possible to attack your service such that your hash table becomes very slow and can be a vulnerability for denial of service attacks
We often also make use of some one-way hash functions that make it hard to find the original input, such as MD4/5, SHA, etc. These methods are typically too expensive for use in symbol tables