Hashing

Hashing is the core of many data structures and algorithms due to the fast (expected) lookup times that it provides for data storage and retrieval. In this lab, you will implement basic hash tables using chaining and open addressing, and experiment with different choices of hash functions.
GitHub Classroom
You can clone the starter files for this lab at this GitHub Classroom link. Be sure to open the assignment in VSCode and set up your virtual environment following these instructions.
Goals
By the end of this lab, you should know:
- The differences between chaining and open addressing
- How certain choices of hash functions can degrade performance in certain datasets
Important Notes
- You may assume, for both implementation and analysis, that no elements are ever removed from the hash map (i.e. there is no `remove` method).
- You may assume that in the datasets you are provided to test with, all keys are unique.
Hash functions
A hash function is a function h that maps an integer x to an integer h(x), where h(x) is called the hash of x. This hash is supposed to act as a semi-unique identifier for x, such that if x = y, then h(x) = h(y), but the converse does not hold (two elements that are not equal can have the same hash).
Python uses a built-in `hash(x)` function to return the hash of x for its `dict` and `set` implementations. We will see an example of why this default `hash` function can lead to poor performance on certain datasets, and we will define our own hash functions for integers in `hash_function.py`. Part of this lab is exploring why certain hash functions are better than others.
The way that Python deals with hashing differs from much of the theoretical literature. In Python, `hash` does not directly compute an index into a hash table; instead, it computes a hash value (an integer) that is generally reduced modulo the size of the hash table to get an index.
One naive option is to hash an integer by truncating it: we keep only the last k bits of the integer, where k is some constant. We'll see why this is a bad idea later in the lab.
Task: Implement the `truncation_hash_function` in `hash_function.py` so that it returns a truncation-based hash function.
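As a sketch, a truncation hash of this form might look like the following (the exact signature expected by the starter code in `hash_function.py` may differ):

```python
def truncation_hash_function(k):
    """Return a hash function that keeps only the last k bits of an integer key."""
    mask = (1 << k) - 1  # binary number with the k low-order bits set to 1

    def hash_function(x):
        # x & mask is equivalent to x % 2**k for non-negative x
        return x & mask

    return hash_function

# Example: keep only the last 4 bits
h = truncation_hash_function(4)
print(h(0b10110))  # the low 4 bits are 0b0110 = 6
```

Using a bitmask instead of `%` makes the truncation explicit: only the low-order bits of the key ever influence the hash, which is exactly why this scheme breaks down on the periodic datasets later in the lab.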
Universal Hashing
For a better hashing method that is more resistant to collisions, we can use a universal hash function. A universal hash function is drawn from a family of hash functions H mapping U to {0, 1, ..., m − 1}, where U is the set of all possible keys, and m is the size of the hash table. The universal property requires that for any two distinct keys x ≠ y, the probability (over a function h chosen at random from H) that h(x) = h(y) is at most 1/m.
A simple universal hash function can be defined as h(x) = ((ax + b) mod p) mod m, where a and b are random integers in the ranges [1, p − 1] and [0, p − 1] respectively, and p is a prime number at least as large as the largest possible key.
Task: Implement the `universal_hash_function` in `hash_function.py` so that it returns a hash function of the above form, with randomly chosen a and b values, for a sufficiently large prime number p.
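A sketch of this construction is below. The choice of prime p here (the Mersenne prime 2^31 − 1) is an assumption for illustration; the starter code may specify its own prime or signature:

```python
import random

def universal_hash_function(m, p=2_147_483_647):
    """Return a random member h(x) = ((a*x + b) % p) % m of the universal family.

    m is the hash table size; p is a prime larger than any key we expect
    (here 2**31 - 1, an assumption for this sketch).
    """
    a = random.randint(1, p - 1)  # a must be nonzero
    b = random.randint(0, p - 1)

    def hash_function(x):
        return ((a * x + b) % p) % m

    return hash_function
```

Each call to `universal_hash_function` draws a fresh (a, b) pair, i.e. a fresh member of the family, so two calls generally return different hash functions even for the same m.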
Hash Maps
A hash map is a data structure that stores key-value pairs, where each key is mapped to a value using its hash.
Chaining
Recall from lecture that chaining is a way of handling hash function collisions by storing values whose hashes collide in a linked list at the index of the hash.
That is, if h(k) = i, then we store the value at index i of our storage array. If h(k′) = i as well, then we store both values in a linked list at index i (in practice, we can just append to the linked list at index i).
Instead of explicitly using a linked list, we can also use a Python `list` to store the values whose hashes collide, appending to the end of the list when we need to store a new value.
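The scheme above might be sketched as follows; the constructor signature, bucket representation, and default hash function are assumptions of this sketch, and the starter class in `hash_map.py` may differ:

```python
class ChainingHashMap:
    """Minimal chaining hash map sketch using Python lists as buckets."""

    def __init__(self, capacity=1024, hash_function=hash):
        self.capacity = capacity
        self.hash_function = hash_function
        self.buckets = [[] for _ in range(capacity)]  # one list per slot

    def put(self, key, value):
        bucket = self.buckets[self.hash_function(key) % self.capacity]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append(key := (key, value) and key or key) if False else bucket.append((key, value))  # append on collision or into an empty bucket

    def get(self, key):
        bucket = self.buckets[self.hash_function(key) % self.capacity]
        for k, v in bucket:  # scan the chain at this index
            if k == key:
                return v
        raise KeyError(key)
```

Since the lab guarantees unique keys, the overwrite check in `put` is not strictly necessary, but it keeps the sketch consistent with ordinary map semantics.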
Task: Implement the `put` and `get` methods of the `ChainingHashMap` class in `hash_map.py`.
Self Checkpoint: Run the tests in `test_hash_map.py` to ensure that your `ChainingHashMap` implementation is correct.
TA Checkpoint: Show a TA your chaining implementation and walk them through your implementation!
Open Addressing
In the open addressing scheme, we store values whose hashes collide at some other index within the array itself (rather than in an external linked list as in chaining).
Linear Probing
With linear probing, we store the key and value at the next available index in the array.
For example, if the hash of a key is 3, and the value at index 3 is already occupied, we store the value at index 4, and so on (wrapping back to the beginning of our storage array if necessary).
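A minimal sketch of this probing scheme is below, assuming the table never fills completely (reasonable here, since the lab has no `remove` method and thus no tombstones). The constructor signature and slot representation are assumptions; the starter class's interface may differ:

```python
class LinearProbingHashMap:
    """Minimal linear-probing hash map sketch (no removal supported)."""

    def __init__(self, capacity=1024, hash_function=hash):
        self.capacity = capacity
        self.hash_function = hash_function
        self.slots = [None] * capacity  # each slot holds (key, value) or None

    def put(self, key, value):
        i = self.hash_function(key) % self.capacity
        # Probe forward (wrapping around) until we find an empty slot
        # or the key itself; assumes the table is never completely full.
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % self.capacity
        self.slots[i] = (key, value)

    def get(self, key):
        i = self.hash_function(key) % self.capacity
        while self.slots[i] is not None:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % self.capacity  # keep probing past other keys
        raise KeyError(key)  # hit an empty slot: the key was never inserted
```

Note that `get` must keep probing past occupied slots holding other keys, and can only stop at an empty slot — this mirrors the probe sequence that `put` followed.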
Task: Implement the `put` and `get` methods of the `LinearProbingHashMap` class in `hash_map.py`.
Self Checkpoint: Run the tests in `test_hash_map.py` to ensure that your `LinearProbingHashMap` implementation is correct.
TA Checkpoint: Show a TA your linear probing implementation and walk them through your implementation!
Benchmarking
Now that you have implemented both hash map classes, let’s see how they perform on some real data! We’ve prepared 2 datasets for you to test your code on:
- `random_dataset`: Generates a dataset whose keys are drawn uniformly at random.
- `pseudo_random_dataset`: Generates a dataset whose keys are formed in a pseudo-periodic way, representative of periodic real-world data.
Running `benchmark.py` will test both of your hash map implementations using each hash function on both the `random_dataset` and the `pseudo_random_dataset`.
TA Checkpoint: Run `benchmark.py` to benchmark the performance of your hash maps on the different datasets.
Feel free to experiment with different combinations by commenting them out in the `main` function.
Have the answers to the following questions ready to show a TA, with a brief explanation:
- Which combinations of hash functions, hash table implementations, and datasets yielded the worst performances?
- Why might this be the case?
Python’s `hash` function performance
Let’s also explore the flaws in Python’s built-in `hash` function. Because Python’s `hash` is periodic, we can generate a dataset of keys that all hash to the same value, leading to poor performance.
In `python_benchmark`, we’ve provided you with the following two functions:
- `adversarial_python_keys`: Generates a dataset that exploits Python’s built-in `hash` method to generate maximum collisions.
- `nonadversarial_python_keys`: Generates a dataset with random keys of size comparable to `adversarial_python_keys` in order to compare performance.
Task: Compare the performance of Python’s `set` using both of these datasets. You should:
- Import `adversarial_python_keys` and `nonadversarial_python_keys` from the `datasets` module
- Choose a dataset size `N`, e.g. 500000
- Set `bad` to be the list of keys output by `adversarial_python_keys`
- Set `good` to be the list of keys output by `nonadversarial_python_keys`
- Import the module `time`
- Time the following computation: `good_set = set(good)`
- Time the following computation: `bad_set = set(bad)`
- Print out the runtimes!
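The steps above can be sketched as follows. Since the lab's `datasets` module isn't available outside the starter code, the two key generators below are stand-ins built on the (CPython-specific) fact that positive integers hash to their value modulo `sys.hash_info.modulus`; in the lab, import the provided functions instead:

```python
import random
import sys
import time

# Stand-in for: from datasets import adversarial_python_keys, ...
def adversarial_python_keys(n):
    # On CPython, hash(x) for a positive int is x % sys.hash_info.modulus
    # (2**61 - 1 on 64-bit builds), so multiples of the modulus all hash to 0.
    return [i * sys.hash_info.modulus for i in range(n)]

def nonadversarial_python_keys(n):
    # Random keys for comparison; these hash to well-spread values.
    return [random.getrandbits(64) for _ in range(n)]

N = 5_000  # the lab suggests e.g. 500000; smaller here so the sketch runs quickly
bad = adversarial_python_keys(N)
good = nonadversarial_python_keys(N)

start = time.perf_counter()
good_set = set(good)
good_time = time.perf_counter() - start

start = time.perf_counter()
bad_set = set(bad)
bad_time = time.perf_counter() - start

print(f"non-adversarial: {good_time:.4f}s, adversarial: {bad_time:.4f}s")
```

Because every adversarial key follows the same probe sequence, building `bad_set` degrades toward quadratic time, which is why the gap widens dramatically as N grows.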
Final Checkpoint: Print out the runtimes for populating and retrieving keys from a Python `set` using these two datasets, and show your results to a TA!