Hash tables

Updated: 03 January 2025

Notes based on Algorithms, Part 1 on Coursera

Hashing

Hashing is done by saving items in a key-indexed table such that the index is some function of a key. The hash function then computes an array index from a specifc key

Some challenges we can run into are:

  • Computing the hash function
  • Equality testing
  • Collision resolution - how do we handle keys with the same hash

This is an instance of a space-time tradeoff where we use more space to speed up our lookups

Computing a hash function

  • Ideally, we want to scramble keys uniquely to produce a table index
  • Efficient to compute
  • Each table index is uniquely likely - problematic in practice

Practically, we need a different approach for each type of key

Different languages may implement this differently, Java uses Horner’s method for calculating this and you can utilize your own version for defining custom hashing fuctions

Separate Chaining

Since it’s likely that we will have collisions, we need a way to avoid collisions

Separate chanining uses an array of M < N linked lists with each lista at a position corresponding to s much more constrained hash value that may be repeated

  • Hash: map key to integer i between 0 and M - 1
  • Insert: put at front of ith chain (if not already there)
  • Search: Search only the ith chain

This basically buckets the lists we have so that we can reduce how many items we need to look through

We can use some other data structure for the specific chain but we could use some other kind of list that’s more efficient to navigate

Analysis

It can be noted that our space vs time efficiency is related to the size of each chain as well as the complexity of the hashing

This mechanism is suited for cases where we have a relatively simple hashing function and we don’t need ordered iteration

Open Addressing

A method for this uses an algorithm called Linear Probing Use an array instead of using a linked list, an empty element indicates where we can add a value

  • Hash: Make key to integer i between 0 and M-1
  • Insert: Put at table i if free, if not try i+1 .. i+n
  • Search: Check i, if no match then continue to i+1 .. i+n

Note that this limits the number of available spaces to the size of the array

It’s essential that the array can be resized when the number of keys grows beyond the number spaces in our array, in practice it’s preferable to keep the array about half full

A simple implementation of this can be done by storing an array of keys and an array of values

Clustering

Storage can become a concern in cases of hash tables like this since it is very space inefficient. A problem that compounds this is the occurence of clustering which leads to much longer operations when working with a Linear Probing based hash table

In order to identify the performance of an alogorithm like this we can refer to Knuth’s Parking Problem

  • If Half full - with M/2 cars, mean displacement is ~3/2
  • If full, with M cars, mean displacement is ~sqrt(pi * M / 8)

Knuth’s Theorem says that under a uniform hashing assumption, the avaerage number of probes in a linear probing hash table of size M that contains N = aM keys is:

  • Search hit: ~(1/2)(1 + 1/(1-a))
  • Search miss/insert: ~(1/2)(1 + 1/((1-a)^2)

We have the following scenarios:

  • M too large: more space than needed
  • M too small: search time increases
  • Typically uses: a = N/M ~1/2

Alternative Hashing Methods

  • Two probe hashing for separate-chaining
  • Double hashing for linear probing
  • Cuckoo hashing for linear probing

Comparison with Balanced Search Trees

Hash tables are simpler to code and are fast for simple keys

Balanced search trees have a stronger performance guarantee

Additional Context

Java 1.1 made use of a hashing function that only looked at every 8th position in a string in order to decrease then time spent on hashing, this led to really bad performance with typical data that may have collisions on these kinds of positions, for example URLs

Another thing that can happen is that if you’re using a publicly know hash function it becomes possible to attack your service such that your hash table becomes very slow and can be a vulnerability for denial of service attacks

We often also make use of some one-way hash functions that make it hard to find the original input, such as MD4/5, SHA, etc. These methods are typically too expensive for use in symbol tables