Tech Interview CheatSheet - Maps & Hashing
In Part 1 of this series I looked at common search and sorting algorithms used on Lists. This post will look at Hash functions and how they are applied to Sets and Maps to offer constant time lookup performance.
Hash Functions
A Hash function takes a value and produces a number that is often used as an index in an Array.
For example, when storing large random numbers, a simple Hash function could take the last few digits, those most likely to change often, and divide it by some constant using the remainder as an index (Modulo or % operator).
Collisions
Collisions happen when the Hash function returns the same value for 2 different inputs. There are 2 ways to address Collisions:
- Change the Hash function so that it will produce more possible slots. This will maintain $O(1)$ performance but comes at the cost of more space. The end result could be a large yet sparsely populated Array.
- Change the structure by storing a List of values at each index, generally called Buckets. This approach still requires iteration to find the correct value, but over a much smaller List. The average case will typically lead to $O(1)$ lookup time, but worst case could be $O(n)$ if all the values end up in the same Bucket.
Load Factor
$$ \text{load_factor} = \text{no_of_entries} \div \text{no_of_buckets}$$
Gives a sense of how full a Hash Table is. A Load Factor of 0.01 means only 10 values are stored in a Hash Table with a 1000 buckets, a huge waste of memory.
Load Factor can be used to determine when to rehash, as it approaches 0 the Hash Table is sparsely populated. On the flip side, as it approaches 1 it is densely populated and more buckets should be added, since a value larger than 1 will certainly cause a Collision.
Question: If a Hash function divides by 100 and uses the remainder as a key, how many buckets does that provide?
Answer: Remainders will range from 0 - 99, thus the function will yield a 100 Buckets.
Sets
Sets are similar to Lists in that it stores a collection of values, but with 2 key distinctions:
- The order of elements are not guaranteed.
- No duplicate elements are allowed.
A typical implementation of Sets, called HashSets, uses Hash functions to store the elements. This type of Set offers constant time performance ( $O(1)$ ) for basic operations like add, remove and contains.
Hash Maps
One of the main applications of Hash functions are Hash Maps. Both the key and value are stored in the position provided by the function. Since keys in a Map are part of a Set, they are unique, allowing the Hash function to be designed in a way that gives each key its own bucket.
String keys
A String can be converted into a number by using something like ASCII codes. Java favours large hash codes over the risk of collisions by using Hash function similar to this: ($s[x]$ would contain the ASCII code for the character in position x of the String):
$$s[0]\times 31^\text{(n-1)} + s[1]\times 31^\text{(n-2)} +\ ...\ +s[n-1]$$
A simple Hash table implementation in python:
In part 3 I will look at trees.