The idea is to have some kind of key-value association
Insert a value with a specified key
Given a key, search for the corresponding value
Basic Table API
The basic idea is to use an Associative Array such that we have the following definition:
Keys and Values
For Values:
Values are non-null
Put overwrites an old value
For Keys:
comparable since it enables us to do some more efficient algorithms
Immutable types for keys
Assume keys have some method for testing equality
The requirements for equality are as follows:
Reflexive - x == x
Symmetric: x == y iif y == x
Transitive: if x == y and x == z then x == y
Java also has: Non-null: x == null is false
For the sake of simplicity we’ll keep our implementation of keys to types that can be compared directly using the javascript === operator
Elementary Implementation
Unordered List/Sequential Search
We could keep a list with a key-value pair in the array
Putting requires us to run through and see if the value is already there and overwrite that value or add a new one at the end
Getting a value requires us to scan through the entire array till we find the value
Ordered List/Binary Search
We can use a binary search implementation in which we sort based on the key
This will then have the same characteristics of a binary search for retreiving values
With binary search however, we need to keep the items in order which makes insertions and deletions expensive as we need to shift all items to the right of our key
Good for read-heavy usecases
Implementation
Worst Case
Average Case
Ordered iteration
Key interface
sequential search
N search, N insert
N/2 search, N insert
No
Equality
binary search
log N search, N insert
log N search, N/2 insert
yes
Comparison
The best above implementation is the binary search but it’s still fairly slow for insertion - we want to find an implementation that’s more efficient than this
Ordered Operations
When considering a symbol table it may sometimes be relevant to do things like
Get the min value
Get the max value
Get a value given a key
Get X top values
Get some list of keys ina a range
Floor/Ciel of a key
The API for an ordered symbol table is as follows
Binary Search Trees
Enables us to provide an efficient implementation of symbol tables
Binary search trees are binary trees that are in symmetric order - links in the tree can be null
A binary tree is either:
Empty
Two disjoint trees
In a binary search tree each node has a key and every key is larger than all trees in its left subtree and larger than all keys in its right subtree
To implement this we will extend the implementation of a linked list which is doubly linked such
BST Operations
Get
When searching a binary tree, we start at the root and compare the value:
If less - go left
If greater - go right
If equal - return the value
If we end up on a null link that means the value is not in the tree so we return null
The implementation of this can be seen below:
Insert
Follow a similar process as we did for searching, but when we get to a null node then we insert the new node there
An important distinction in the insertion implementation is that we recursively traverse the child branches and modify the parent
Tree Shape
In the case of a BST the number of compares in the above operations is 1+node depth
Depending on how insertion works, we can end up in a tree which is heavily skewed which can lead to varying performance. The performance of this may be very bad in the event that the inserts are done in some kind of ordered way
The implementation of a BST is very similar to the idea behind how quicksort works
Ordered BST Operations
Max and Min
To find the max value just keep moving right and to find the min value we just keep moving to the left
Floor and Ceil
To get the floor we need to consider the following:
K is equal to the key at the root = the floor is K
K is less than the key at the root = the floor is in the left subtree
K is greater than the key at the root = the floor is in the right subtree
The ceiling implementation is the same but we traverse the other way instead. We can see the two implementations below:
Counts
Implementing counts can be done by adding a count field to each node which is the number of nodes in that subtree
The way we maintain this is by setting the count of a node as 1+size(left)+size(right)
Rank
Implementing Rank makes use of these subcounts to determine which sides of a tree to traverse and get the next resulting values in a recursive manner
In Order Traversal
The method for traversing the tree in order is as follows:
Traverse the left subtree
Enqueue current key
Traverse the right subtree
Implementing the above recursively leads to us having an ordered queue of keys
Deletion
To delete we can implement a few different solutions. The lazy approach is to remove the value at a key but then we end up accumulating keys in our table
A more general method for deleting nodes is called Hibbard Deletion
If the node has no children, we just set it’s value as null and propogate the change in counts up
If the node has 1 child, we can replace the parent link with the child
If the node has 2 children, we have to do a few more things:
Find the next smallest node in the right subtree
Delete the minimum in the right subtree
Put the smalest value in the spot of the node we deleted
The problem with the Hibbard deletion process leads to the tree becoming imbalanced over time since we are replacing the right subtree constantly
After a lot of deletes the shape of the tree changes which has a considerable negative impact on the performance of the data structure. This leads to operations on the tree being O(sqrt(N))
2-3 Trees
Generalize BSTs to provide some flexibility in defining performance by using trees that can either have:
2-node: one key, two children
3-node: two keys, three children
Perfect balance: every path from root to end has the same length
Symmetric order: in-order traversal returns keys in ascending order
Searching
Compare search key against keys in node
Find relevant link
Smaller than both keys = left
Larget than both keys = right
Between keys = between link
Follow the relevant link recursively
Insertion
Search for key
Either
If at 2-node: Replace 2-node with a 3-node
If at 3-node:
Create a temp 4-node
Move middle key of 4-node into parent
Split remaining into 2 2-nodes
Do this recursively up as needed
Splitting nodes is a local transformation and thereby has a constant number of operations
Invariants
Maintain symmetric order and perfect balance
Each transformation should maintain symmetric order and perfect balance
Performance
Since we have perfect balance, every path from the root to the end has the same lenght
Tree height:
Worst case: lg N (all 2-nodes)
Best case: log3 N (all 3 nodes)
Implementation
Implementation of this is possible but it’s quite complicated and there is an easier way to implement this
Left-leaning Red-Black BSTs
Represent a 2-3 tree as a BST
Use internal left-leaning links as 3-nodes
This is a BST such that:
No node has 2 red likns connected to it
Every path from root to link has the same number of black links
Red links lean right
Search
Search is exactly the same as for a BST but is much faster due to the tree remaining balanced
Representation
To represent a red-black BST we use the BST definition we had previously but
Left Rotation
Sometimes when working with a tree we may end up with a right-leaning red link. We may want to update this to rotateLeft instead which can be done as follows:
The important part of the left rotation is that it maintains symmetric order and perfect black balance
Right Rotation
Sometimes to insert a value it is necessary to temporarily rotat a link right, this can also be done as follows:
Flip Colors
When a node has two child links that are both red this represents a 4-node. In order to split a 4-node it is simply necessary to
Convert the parent link to red
Convert the 2 child links to black
Insertion
When inserting we want to maintain the 2-3 accordance.
Case 1 - insert into a 2-node at the bottom
Do a normal BST insert, color new link red
If new red link is right, rotate it left
Case 2 - Insert into a tree with exactly 2 nodes
Do a normal BST insert, color new link red
Rotate to balance 4 node if needed
Flip colors to pass red link up one level
Rotate to make lean left if needed
Repeat case 1 or 2 up the tree if needed
Performance
Lerft leaning red-black trees will approach lg 2 N hehght, in the worst case it will be 2 lg N
B Trees
Given some contstraints such as a file system, we have the following:
Locate a page - slow
Read a page - fast
We want to optimize in order to minimize the number of propbes that need to be done in order to locate a page
B-trees generalize 2-3 trees by allowing up to M-1 key-link pairs per node where M is as large as possible for the given usecase
At least 2 key-link paris at root
At least M/2 key-link paris in other nodex
External nodes contain client keys
Internal nodes contain copies of keys to guide search
In practive, the number of probes in a B-tree is at most 4, an optimization that’s possible is to keep the root page in memory at all times
Performance Overview
Implementation
Guarantee
Average
Ordered Iteration
search
insert
delete
search
insert
delete
Sequential Search (Linked List)
N
N
N
N/2
N
N/2
no
Binary Search (Ordered Array)
lg N
N
N
lg N
N/2
N/2
yes
BST
N
N
N
1.39 lg N
1.39 lg N
sqrt(N)
yes
2-3 Tree
c lg N
c lg N
c lg N
c lg N
c lg N
c lg N
yes
Left Leaning Red-Black Tree
2 lg N
2 lg N
2 lg N
lg N
lg N
lg N
yes
Sets
Basically just a set without anything to do with value
add(key)
contains(key)
remove(key)
size
iterator
Indexing
File indexing: Build a Symbol Table with key as some search term, and value as a Set<File>
Concordance indexing: Build a Symbol Table with key and value as Set<Integer> where we reference the position in the array where a given key occurs
Sparse Vectors/Matrices
In many usecases we have matrices with very few non-zero values and very large sizes, we can take advantage of these by using Symbol tables to only store the keys and values, otherwise we can assume a 0
In the case of a vector, we can use a Symbol Table such that SparseVector == ST<Integer, Float>
In the case of a matrix we can use the definition of a Sparse Vector such that SparseMatrix == ST<Integer, SparseVector> == ST<Integer, ST<Integer, Float>