In Part 1 of this series I looked at common search and sorting algorithms used on Lists. This post will look at *Hash functions* and how they are applied to *Sets* and *Maps* to offer constant time lookup performance.

## Hash Functions

A *Hash* function takes a value and produces a number that is often used as an index in an *Array*.

For example, when storing large random numbers, a simple *Hash* function could take the last few digits, those most likely to change often, and divide it by some constant using the remainder as an index (Modulo or % operator).

### Collisions

*Collisions* happen when the *Hash function* returns the same value for 2 different inputs. There are 2 ways to address *Collisions*:

- Change the
*Hash*function so that it will produce more possible slots. This will maintain $O(1)$ performance but comes at the cost of more space. The end result could be a large yet sparsely populated*Array*. - Change the structure by storing a
*List*of values at each index, generally called*Buckets*. This approach still requires iteration to find the correct value, but over a much smaller*List*. The average case will typically lead to $O(1)$ lookup time, but worst case could be $O(n)$ if all the values end up in the same*Bucket*.

### Load Factor

$$ \text{load_factor} = \text{no_of_entries} \div \text{no_of_buckets}$$

Gives a sense of how full a *Hash Table* is. A *Load Factor* of 0.01 means only 10 values are stored in a *Hash Table* with a 1000 buckets, a huge waste of memory.

*Load Factor* can be used to determine when to rehash, as it approaches 0 the *Hash Table* is sparsely populated. On the flip side, as it approaches 1 it is densely populated and more buckets should be added, since a value larger than 1 will certainly cause a *Collision*.

*Question:* If a *Hash function* divides by 100 and uses the remainder as a key, how many buckets does that provide?

*Answer*: Remainders will range from 0 - 99, thus the function will yield a 100 *Buckets*.

## Sets

*Sets* are similar to *Lists* in that it stores a collection of values, but with 2 key distinctions:

- The order of elements are not guaranteed.
- No duplicate elements are allowed.

A typical implementation of *Sets*, called *HashSets*, uses Hash functions to store the elements. This type of *Set* offers constant time performance ( $O(1)$ ) for basic operations like *add*, *remove* and *contains*.

## Hash Maps

One of the main applications of *Hash functions* are *Hash Maps*. Both the *key* and *value* are stored in the position provided by the function. Since keys in a *Map* are part of a *Set*, they are unique, allowing the *Hash function* to be designed in a way that gives each key its own bucket.

### String keys

A String can be converted into a number by using something like ASCII codes. Java favours large hash codes over the risk of *collisions* by using *Hash function* similar to this: ($s[x]$ would contain the ASCII code for the character in position x of the String):

$$s[0]\times 31^\text{(n-1)} + s[1]\times 31^\text{(n-2)} +\ ...\ +s[n-1]$$

A simple *Hash table* implementation in python:

In part 3 I will look at trees.