Hashing

Hashing is the core of many data structures and algorithms due to the fast (expected) lookup times that it provides for data storage and retrieval. In this lab, you will implement basic hash tables using chaining and open addressing, and experiment with different choices of hash functions.

Github Classroom

You can clone the starter files for this lab at this GitHub Classroom link. Be sure to open the assignment in VSCode and setup up your virtual environment following these instructions.

Goals

By the end of this lab, you should know:

The differences between chaining and open addressing
How certain choices of hash functions can degrade performance in certain datasets

Important Notes

You may assume that for implementation and analysis that no elements are ever removed from the hash map (i.e. no remove method).
You may assume that in the datasets you are provided to test with, that all keys are unique.

Hash functions

A hash function is a function that maps some integer $x$ to an integer $h(x)$ , where $h(x)$ is the hash of $x$ . This hash is supposed to act as a semi-unique identifier for $x$ , such that if $x = y$ , then $h(x) = h(y)$ , but the inverse does not hold (two elements that are not equal can have the same hash).

Python uses a built in hash(x) function to return the hash of $x$ for their dict and set implementations. We will see an example of why this default hash function can lead to poor performance on certain datasets, and we will define our own hash functions for integers in hash_function.py.

Part of this lab is exploring why certain hash functions are better than others.

The way that Python deals with hashing is different from a lot of the theoretical literature. In Python, the hash does not directly compute an index for object in a hash table, but instead computes a hash value (integer) that is generally modded by the size of the hash table to get an index.

One naive option for hashing is to hash an integer by truncating. We can take the last k bits of the integer, where k is some constant. We’ll see why this is a bad idea later in the lab.

Task: Implement the truncation_hash_function in hash_function.py so that it returns a truncation-based hash_function.

Universal Hashing

For a better hashing method that is more resistant to collisions, we can use a universal hash function. A universal hash function $h$ is a family of hash functions $h: K \to [0, m)$ , where $K$ is the set of all possible keys, and $m$ is the size of the hash table. The universal property requires that for any two distinct keys $x, y \in K$ , the probability that $h(x) = h(y)$ is at most $\frac{1}{m}$ .

A simple universal hash function can be defined as $h_m(x) = (ax + b \mod p) \mod m$ , where $a, b$ are random integers in the range $[0, p)$ , and $p$ is a prime number.

Task: Implement the universal_hash_function in hash_function.py so that it returns a hash function of the above form, with randomly chosen $a$ and $b$ values. $p = 2^{32} - 5$ for a sufficiently large prime number.

Hash Maps

A hash map is a data structure that stores key-value pairs, where each key is mapped to a value using its hash.

Chaining

Recall from lecture that chaining is a way of handling hash function collisions by storing values whose hashes collide in a linked list at the index of the hash.

That is, if $hash(x) = h$ , then we store the value $x$ at index $h$ of our storage array. If $hash(y) = h$ as well, then we store the value $y$ in a linked list at index $h$ (in practice, we can just append to the linked list at index $h$ ).

Instead of explicitly using a linked list, we can also use a Python list to store the values whose hashes collide, appending to the end of the list when we need to store a new value.

Task: Implement the put and get methods of the ChainingHashMap class in hash_map.py.

Self Checkpoint: Run the tests in test_hash_map.py to ensure that your ChainingHashMap implementation is correct.

TA Checkpoint: Show a TA your chaining implementation and walk them through your implementation!

Open Addressing

In the open addressing scheme, we store the values whose hashes collide in some other index in the array (not using an external linked list like in chaining).

Linear Probing

With linear probing, we store the hash key and value at the next available index in the array.

For example, if the hash of a key is 3, and the value at index 3 is already occupied, we store the value at index 4, and so on (wrapping back to the beginning of our storage array if necessary).

Task: Implement the put and get methods of the LinearProbingHashMap class in hash_map.py.

Self Checkpoint: Run the tests in test_hash_map.py to ensure that your LinearProbingHashMap implementation is correct.

TA Checkpoint: Show a TA your linear probing implementation and walk them through your implementation!

Benchmarking

Now that you have implemented both hash map classes, let’s see how they perform on some real data! We’ve prepared 2 datasets for you to test your code on:

random_dataset: Generates a dataset whose keys are formed uniformly at random.
pseudo_random_dataset: Generates a dataset whose keys are formed in a pseudo-periodic way, representative of periodic real-world data.

Running benchmark.py, will test both of your hash map implementations using each hash function and both the random_dataset and pseudo_random_dataset

TA_Checkpoint: Run benchmark.py to benchmark the performance of your hash maps on different datasets. Feel free to experiment with different combinations by commenting them out in the main function.

Have the answers to the following questions ready to show a TA, with a brief explanation:

Which combinations of hash functions, hash table implementations, and datasets yielded the worst performances?
Why might this be the case?

Python’s `hash` function performance

Let’s also explore the flaws in Python’s built in hash function. Because Python’s hash is periodic, we can generate a dataset of keys which will all hash to the same value and lead to poor performance.

In python_benchmark, we’ve provided you with the following two functions:

adversarial_python_keys: Generates a dataset that exploits Python’s built-in hash method to generate maximum collisions.
nonadversarial_python_keys: Generates a dataset with random keys of size comparable to adversarial_python_keys in order to compare performance.

Task: Compare the performance of Python’s set using both of these datasets. You should:

Import adversarial_python_keys and nonadversarial_python_keys from the datasets module
Choose a dataset size N, e.g. 500000
Set bad to be the list of keys output by adversarial_python_keys
Set good to be the list of keys output by nonadversarial_python_keys
Import the module time
Time the following computation: good_set = set(good)
Time the following computation: bad_set = set(bad)
Print out the runtimes!

Final_Checkpoint: Print out the runtimes for populating and retrieving keys from a Python set using these two datasets, and show your results to a TA!

If you have any feedback, you can fill out this anonymous Google Form!