Hashing

Choosing Hashes (https://xkcd.com/221/)

Hashing is the core of many data structures and algorithms due to the fast (expected) lookup times that it provides for data storage and retrieval. In this lab, you will implement basic hash tables using chaining and open addressing, and experiment with different choices of hash functions.

Github Classroom

You can clone the starter files for this lab at this GitHub Classroom link. Be sure to open the assignment in VSCode and setup up your virtual environment following these instructions.

Goals

By the end of this lab, you should know:

Important Notes

Hash functions

A hash function is a function that maps some integer xx to an integer h(x)h(x), where h(x)h(x) is the hash of xx. This hash is supposed to act as a semi-unique identifier for xx, such that if x=yx = y, then h(x)=h(y)h(x) = h(y), but the inverse does not hold (two elements that are not equal can have the same hash).

Python uses a built in hash(x) function to return the hash of xx for their dict and set implementations. We will see an example of why this default hash function can lead to poor performance on certain datasets, and we will define our own hash functions for integers in hash_function.py.

Part of this lab is exploring why certain hash functions are better than others.

The way that Python deals with hashing is different from a lot of the theoretical literature. In Python, the hash does not directly compute an index for object in a hash table, but instead computes a hash value (integer) that is generally modded by the size of the hash table to get an index.

One naive option for hashing is to hash an integer by truncating. We can take the last k bits of the integer, where k is some constant. We’ll see why this is a bad idea later in the lab.

Task: Implement the truncation_hash_function in hash_function.py so that it returns a truncation-based hash_function.

Universal Hashing

For a better hashing method that is more resistant to collisions, we can use a universal hash function. A universal hash function hh is a family of hash functions h:K[0,m)h: K \to [0, m), where KK is the set of all possible keys, and mm is the size of the hash table. The universal property requires that for any two distinct keys x,yKx, y \in K, the probability that h(x)=h(y)h(x) = h(y) is at most 1m\frac{1}{m}.

A simple universal hash function can be defined as hm(x)=(ax+bmodp)modmh_m(x) = (ax + b \mod p) \mod m, where a,ba, b are random integers in the range [0,p)[0, p), and pp is a prime number.

Task: Implement the universal_hash_function in hash_function.py so that it returns a hash function of the above form, with randomly chosen aa and bb values. p=2325p = 2^{32} - 5 for a sufficiently large prime number.

Hash Maps

A hash map is a data structure that stores key-value pairs, where each key is mapped to a value using its hash.

Chaining

Recall from lecture that chaining is a way of handling hash function collisions by storing values whose hashes collide in a linked list at the index of the hash.

That is, if hash(x)=hhash(x) = h, then we store the value xx at index hh of our storage array. If hash(y)=hhash(y) = h as well, then we store the value yy in a linked list at index hh (in practice, we can just append to the linked list at index hh).

Instead of explicitly using a linked list, we can also use a Python list to store the values whose hashes collide, appending to the end of the list when we need to store a new value.

Task: Implement the put and get methods of the ChainingHashMap class in hash_map.py.

Self Checkpoint: Run the tests in test_hash_map.py to ensure that your ChainingHashMap implementation is correct.

TA Checkpoint: Show a TA your chaining implementation and walk them through your implementation!

Open Addressing

In the open addressing scheme, we store the values whose hashes collide in some other index in the array (not using an external linked list like in chaining).

Linear Probing

With linear probing, we store the hash key and value at the next available index in the array.

For example, if the hash of a key is 3, and the value at index 3 is already occupied, we store the value at index 4, and so on (wrapping back to the beginning of our storage array if necessary).

Task: Implement the put and get methods of the LinearProbingHashMap class in hash_map.py.

Self Checkpoint: Run the tests in test_hash_map.py to ensure that your LinearProbingHashMap implementation is correct.

TA Checkpoint: Show a TA your linear probing implementation and walk them through your implementation!

Benchmarking

Now that you have implemented both hash map classes, let’s see how they perform on some real data! We’ve prepared 2 datasets for you to test your code on:

Running benchmark.py, will test both of your hash map implementations using each hash function and both the random_dataset and pseudo_random_dataset

TA_Checkpoint: Run benchmark.py to benchmark the performance of your hash maps on different datasets. Feel free to experiment with different combinations by commenting them out in the main function.

Have the answers to the following questions ready to show a TA, with a brief explanation:

Python’s hash function performance

Let’s also explore the flaws in Python’s built in hash function. Because Python’s hash is periodic, we can generate a dataset of keys which will all hash to the same value and lead to poor performance.

In python_benchmark, we’ve provided you with the following two functions:

Task: Compare the performance of Python’s set using both of these datasets. You should:

  1. Import adversarial_python_keys and nonadversarial_python_keys from the datasets module

  2. Choose a dataset size N, e.g. 500000

  3. Set bad to be the list of keys output by adversarial_python_keys

  4. Set good to be the list of keys output by nonadversarial_python_keys

  5. Import the module time

  6. Time the following computation: good_set = set(good)

  7. Time the following computation: bad_set = set(bad)

  8. Print out the runtimes!

Final_Checkpoint: Print out the runtimes for populating and retrieving keys from a Python set using these two datasets, and show your results to a TA!

If you have any feedback, you can fill out this anonymous Google Form!