How Bloom Filters Make Checking Millions of Usernames Lightning Fast

Understanding the Use and Working of Bloom Filters

Sep 27, 2024

What is a Bloom Filter?

We often encounter scenarios where we need to run look-ups in a large dataset but are only interested in whether a particular item exists in the set. A Bloom Filter is a probabilistic data structure that answers this membership question in a space-efficient way.

Bloom Filter returns False: Item doesn’t exist in the set
Bloom Filter returns True: Item may or may not exist in the set. (We will look at the reason for that in a moment)

This means that when a Bloom Filter returns false, an item is guaranteed to NOT exist in the given set.

Fig. Bloom Filter will return true for an item that exists in the set

Fig. Bloom Filter will return False for an item that does not exist in the set

Fig. Bloom Filter may also return True for an item that does not exist in the set (false positive)

This makes a Bloom Filter a good fit for scenarios such as determining whether a username exists in the data store. If it returns false for a given username, the username is guaranteed not to exist in the data store, so the lookup in the data store can be skipped entirely. Only when it returns true does the data store need to be queried to confirm.

How Does a Bloom Filter Work?

A Bloom Filter comprises two main components: a set of hash functions and a bit array whose bits are all initially set to 0.

Fig. A Bloom Filter comprises a set of hash functions and a bit array

Addition of Data

To add data to a Bloom Filter:

Hash the given value with each of the hash functions, then take each hash modulo the size of the bit array. This yields one index into the bit array per hash function.
Set the bit at each of those indices to 1. If a bit is already 1, leave it unchanged.

Fig. Key X is hashed by three hash functions and the corresponding indices are updated from 0 to 1

Fig. Key Y is hashed by three hash functions and the corresponding indices are updated from 0 to 1. Some indices are updated by multiple keys

Fig. Key Z is hashed by three hash functions and the corresponding indices are updated from 0 to 1. Some indices are updated by multiple keys
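
The two steps for adding a value can be sketched in Python as follows. This is a minimal illustration, not a production implementation: the names NUM_BITS, NUM_HASHES, bit_indices and add are made up for this example, and the "set of hash functions" is simulated by salting SHA-256 with a seed, which is only one of several possible choices.

```python
import hashlib

NUM_BITS = 64    # size of the bit array (illustrative value)
NUM_HASHES = 3   # number of hash functions (illustrative value)


def bit_indices(value: str) -> list[int]:
    """Step 1: hash the value with each hash function, modulo the array size."""
    indices = []
    for seed in range(NUM_HASHES):
        # Assumption: a different seed prefix stands in for a different hash function.
        digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
        indices.append(int.from_bytes(digest, "big") % NUM_BITS)
    return indices


def add(bits: list[int], value: str) -> None:
    """Step 2: set the bit at every index to 1; bits that are already 1 stay 1."""
    for index in bit_indices(value):
        bits[index] = 1


bits = [0] * NUM_BITS   # all bits start at 0
add(bits, "alice")
print(bit_indices("alice"))  # the (up to three distinct) indices now set to 1
```

Note that adding the same value twice leaves the bit array unchanged, since the same indices are simply set to 1 again.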

Checking If Data Exists

To check if a given value exists in a Bloom Filter:

Hash the given value with each of the hash functions, then take each hash modulo the size of the bit array. This yields the same indices that would have been set if the value had been added.
Read the bits at each of those indices. If all of them are 1, the Bloom Filter returns true; if any of them is 0, it returns false.

Fig. For Key X, all indices are 1, therefore the Bloom Filter returns true for X

Fig. For Key A, at least one index is 0, therefore Key A does not exist
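
Putting the add and check steps together, a small Bloom Filter might look like the sketch below. The class name BloomFilter, the method name might_contain, the default sizes, and the seeded SHA-256 hashing are all assumptions chosen for illustration; real implementations typically use faster non-cryptographic hashes and a packed bit array.

```python
import hashlib


class BloomFilter:
    """A minimal Bloom Filter sketch: a bit array plus seeded hash functions."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits  # every bit starts at 0

    def _indices(self, value: str) -> list[int]:
        # Assumption: each seed prefix stands in for a separate hash function.
        return [
            int.from_bytes(
                hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest(), "big"
            ) % self.num_bits
            for seed in range(self.num_hashes)
        ]

    def add(self, value: str) -> None:
        for index in self._indices(value):
            self.bits[index] = 1

    def might_contain(self, value: str) -> bool:
        # True only if every index is 1; a single 0 means "definitely not added".
        return all(self.bits[index] == 1 for index in self._indices(value))


usernames = BloomFilter()
usernames.add("alice")
print(usernames.might_contain("alice"))    # True: an added value is always reported
print(BloomFilter().might_contain("bob"))  # False: an empty filter has no bits set
```

The method is named might_contain rather than contains to reflect that a true result only means the value may be in the set.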

What About Collisions?

Since the hashed values are mapped to a bit array of limited size, two or more values will often map to some of the same indices when added to a Bloom Filter, resulting in collisions. A collision can only leave a bit at 1; it never resets a bit to 0. This is, in fact, the reason why a Bloom Filter can guarantee that an item doesn’t exist: if the item had been added, all of its indices would be 1, so a single 0 proves it was never added.

However, when all indices for a given value map to 1, it is impossible to tell if the indices were updated by the given value or by other values, as collisions can result in two or more values updating the same indices. This is why an item may or may not exist in the given set when the Bloom Filter returns true.

Fig. Multiple keys can be hashed to the same index by different hash functions due to collisions
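
A false positive is easy to reproduce by hand with a deliberately tiny filter. The sketch below is only an illustration: it uses an 8-bit array and two toy arithmetic "hash functions" on integer keys (toy_indices is a made-up helper, not something a real Bloom Filter would use) so that every index can be traced manually.

```python
NUM_BITS = 8  # deliberately tiny so collisions are easy to see


def toy_indices(key: int) -> list[int]:
    """Two hand-traceable stand-ins for real hash functions (illustrative only)."""
    return [key % NUM_BITS, (3 * key + 1) % NUM_BITS]


bits = [0] * NUM_BITS


def add(key: int) -> None:
    for index in toy_indices(key):
        bits[index] = 1


def might_contain(key: int) -> bool:
    return all(bits[index] == 1 for index in toy_indices(key))


add(3)  # sets indices 3 and 2
add(7)  # sets indices 7 and 6

print(might_contain(3))  # True: 3 was added
print(might_contain(2))  # True, although 2 was never added (false positive):
                         # its indices 2 and 7 were set by keys 3 and 7
print(might_contain(4))  # False: indices 4 and 5 are still 0, so 4 was never added
```

Key 2 is reported as present because one of its indices was set by key 3 and the other by key 7, which is exactly the situation described above.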

Why No Deletion?

As we saw in the previous section, when a bit is 1 it is impossible to tell whether it was set by the given value or by some other value. For this reason, a standard Bloom Filter does not support deletions. A deletion would mean setting all the indices for a given value to 0, which can also clear bits that other values depend on, since two or more values can map to the same indices. The Bloom Filter would then return false for an item that still exists in the data store (a false negative), breaking its one guarantee.

Fig. It is not possible to delete Key X because multiple keys can be hashed to the same index
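
The problem with deletion can be reproduced with a small hand-traceable sketch. As an assumption for illustration, it uses an 8-bit array and two toy arithmetic "hash functions" on integer keys (toy_indices and might_contain are made-up helpers), and it performs the naive deletion described above by resetting a key's bits to 0.

```python
NUM_BITS = 8  # deliberately tiny so shared indices are easy to see


def toy_indices(key: int) -> list[int]:
    """Two hand-traceable stand-ins for real hash functions (illustrative only)."""
    return [key % NUM_BITS, (3 * key + 1) % NUM_BITS]


def might_contain(bits: list[int], key: int) -> bool:
    return all(bits[index] == 1 for index in toy_indices(key))


bits = [0] * NUM_BITS
for key in (2, 3):            # key 2 -> indices 2 and 7, key 3 -> indices 3 and 2
    for index in toy_indices(key):
        bits[index] = 1

print(might_contain(bits, 3))  # True: 3 was added

# Naive "deletion" of key 2: reset all of its indices to 0.
for index in toy_indices(2):
    bits[index] = 0

print(might_contain(bits, 3))  # False: index 2 was shared, so key 3 is now
                               # wrongly reported as absent (false negative)
```

Clearing index 2 on behalf of key 2 also erased part of key 3, even though key 3 was never deleted.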

Production Fun Fact!

Google’s Bigtable uses Bloom Filters to reduce the number of disk accesses.
Databases built on LSM Trees (such as Cassandra) use Bloom Filters to run quick look-ups on SSTables.

References

CMSC 420: Bloom Filters. (n.d.-a). https://www.math.umd.edu/~immortal/CMSC420/notes/bloomfilters.pdf
Wikipedia contributors. (2024, August 12). Bloom filter. Wikipedia. https://en.wikipedia.org/wiki/Bloom_filter

The Scalable Thread
