Probabilistic Data Structures
Probabilistic data structures are what their name suggests — data structures that give probabilistic answers to queries. What is lost in precision, however, is more than made up for by extremely efficient use of memory and/or computational resources.
Hashing
A hash function is simply a mathematical function that maps an input to an output. In practice, however, the outputs of hash functions have certain properties:
- Deterministic — Hash functions produce the same output for the same input.
- Fixed Length — For any input, most hash functions produce a fixed length output.
- Pseudorandom — Most hash functions produce a hash that is pseudorandom; hash outputs appear and behave random, but are ultimately deterministic.
- One-Way — For most hash functions, it is difficult to determine the input just from the output. This is useful for cryptographic applications; this is not necessarily useful for probabilistic data structures.
- Collision Resistant — Despite having fixed length outputs, hash functions are designed to minimize cases where two inputs output the same hash. Collisions are not impossible however, because of the limited output space.
Some of these properties form the basis for most probabilistic data structures.
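To make those properties concrete, here is a toy hash in Python. This is only a minimal sketch: the hash8 name and the choice of truncating a SHA-256 digest to a single byte are illustrative, not something a real system would necessarily use.

```python
import hashlib

def hash8(element: str) -> int:
    """Toy hash: map any string to an integer in the range 0 to 255.

    Deterministic and fixed-length by construction, and pseudorandom because
    it truncates a SHA-256 digest. Collisions are unavoidable with only 256
    possible outputs.
    """
    return hashlib.sha256(element.encode("utf-8")).digest()[0]

print(hash8("hello"))  # the same input always produces the same output
print(hash8("hello"))
print(hash8("world"))  # a different input usually produces a different output
```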
Bloom Filters
Bloom filters are one of the most popular probabilistic data structures. These are used to check for membership in sets or multisets — or in simpler words, they are used to check if an element is present in a set or not. Bloom filters cannot predict the membership of an element with absolute certainty, but it can report with certainty if it does not exist in a set. And it does it for big datasets using very little space.
Approaches
Rather than outright disclosing how bloom filters work, it is much more effective to discuss possible solutions first. It will help build a natural intuition as to why bloom filters are a better approach — at least when accuracy is not a big priority.
Consider a multiset of elements. We need to know if hello is in the multiset. One possible solution is to store the entire multiset and then perform a linear search — going through each element one by one and checking if it matches.
Or if you’ve studied computer science, the multiset can be stored in lexicographic order and then a binary search can be performed.
That makes querying faster, but the space required to store the elements remains the same. One way to reduce the space used could be by storing only the unique values and purging all the duplicates. Rephrased formally, the multiset can be made into a set:
While this is a decent solution, it can be made better. The space required depends roughly on the number of elements and the size of each element. What if, instead of storing all the individual elements, only the hashes of the elements were stored? In this case, we would need to search for 44 — the hash for hello.
Using hashes is arguably better than storing individual long, variable-length elements. But it can be optimized even further. Since most hash functions produce a fixed-length output for any input, this can be exploited to decrease the amount of space needed to store the hashes even more.
The above hash function produces an output between 0 and 255. Instead of allocating arrays to store all the individual hashes, a bit-array of 256 bits can be created to represent the existing hashes. All the bits are initially set to zero. A bit is set to one only if its corresponding hash is in the set.
For example, if the output hash is 44, the 44th bit is flipped to one.
The n-th bit here refers to a zero-based indexing system. Because the hash ranges from 0 to 255 rather than from 1 to 256, the ‘first’ bit is treated as the 0-th bit, and every bit’s position is one less than its ‘actual’ ordinal position.
It should be obvious that using a bit-array is analogous to storing the hash itself. But while in the previous approach multiple bits were needed to store the individual hashes, the new bit-array approach requires only one bit per hash.
This is what a bloom filter is. It is fundamentally a bit-array, where a bit corresponds to an individual hash output — and that hash output corresponds to an element (or more) in the set. But in ‘real’ bloom filters, multiple hash functions are used. More about why that is, is discussed later.
Queries
It is apparent that checking for membership of an element simply involves verifying if the hash exists — checking if the corresponding bit in the bit-array is set to one. So to query an element: It is first hashed, and then the bit corresponding to its hash is checked to see if it is set (to one).
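A minimal sketch of this single-hash bit-array in Python, reusing the toy hash8 from earlier; the class and method names are illustrative, not a standard API.

```python
import hashlib

def hash8(element: str) -> int:
    # Toy hash from the earlier sketch: first byte of SHA-256, range 0 to 255.
    return hashlib.sha256(element.encode("utf-8")).digest()[0]

class TinyBloom:
    """Single-hash, 256-bit bloom filter (a sketch, not a production filter)."""

    def __init__(self) -> None:
        self.bits = [0] * 256  # one bit per possible hash value

    def add(self, element: str) -> None:
        self.bits[hash8(element)] = 1  # set the bit corresponding to the hash

    def might_contain(self, element: str) -> bool:
        # A set bit may be a false positive; a clear bit is a definite "no".
        return self.bits[hash8(element)] == 1

bloom = TinyBloom()
bloom.add("hello")
print(bloom.might_contain("hello"))  # True
print(bloom.might_contain("world"))  # False, unless the two hashes collide
```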
Hash Collisions
While hash functions are designed to be collision resistant, they are not collision proof. Sometimes two elements may have the same hash. In such cases, an element that was added may have the same hash as a different element being queried, leading to a false positive.
For example, hello, having the hash 44, is added to the bloom filter. Another word, foo, might have the same hash 44. If the bloom filter is queried for foo, it would wrongly report that foo exists in the set, since the 44th bit in the bit-array was already set to one by a different word — hello.
To reduce the probability of collisions, multiple hash functions can be used for each element. In that case, querying an element will return true only if all bits corresponding to the hashes are one.
The hash depth refers to the number of hash functions used. The output of the hash functions should be randomly distributed and should not be correlated to each other. Formally, the hash functions should be selected from a universal family, and should ideally be pairwise independent. These assumptions are used when calculating the probability of collisions when using multiple hash functions.
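A sketch of a bloom filter with several hash functions. Salting one SHA-256 hash with the function index stands in here for a proper family of independent hash functions, and the default size and depth are arbitrary illustrative values.

```python
import hashlib

class BloomFilter:
    """Bit-array bloom filter with several salted hash functions (illustrative)."""

    def __init__(self, size: int = 256, depth: int = 3) -> None:
        self.size = size    # m: number of bits
        self.depth = depth  # k: number of hash functions
        self.bits = [0] * size

    def _hashes(self, element: str):
        # Salt one hash with the function index to imitate k independent hashes.
        for i in range(self.depth):
            digest = hashlib.sha256(f"{i}:{element}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, element: str) -> None:
        for index in self._hashes(element):
            self.bits[index] = 1

    def might_contain(self, element: str) -> bool:
        # Positive only if every one of the k bits is set.
        return all(self.bits[index] == 1 for index in self._hashes(element))

bf = BloomFilter(size=256, depth=3)
bf.add("hello")
print(bf.might_contain("hello"))  # True
print(bf.might_contain("foo"))    # False, unless all three bits collide
```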
The probability of two elements having the same hash output is 1/m, where m is the size of the bloom filter. However, if two hash functions are used for every element, the probability of two elements having the same hash outputs is roughly 1/m * 1/m. In general, using more hash functions exponentially decreases the probability that all the hash outputs of two elements will collide.
However, using too many hash functions can also increase the false positive rate. It might feel counter-intuitive, especially right after showing that it decreases false positives, but consider this: the bloom filter is finite. The more hash functions there are, the faster it fills up and the more bits are set to one. And when most bits in a bloom filter are set to one, the probability that any query returns a positive increases — even if the element is not in the set.
In the extreme case when all the bits in a bloom filter are set to one, the bloom filter would always report positive for membership — regardless of whether the element actually exists in the set.
Besides using a large number of hash functions, a bloom filter can also be saturated quickly if the number of elements to be hashed (added) is huge. The only solution for decreasing the number of false positives then is increasing the size of the bloom filter itself. A larger bit-array equates to a larger output space for the hash functions, reducing the probability of collisions.
In effect, the false positive rate is also dependent on the size of the bloom filter and the number of (unique) elements to be added.
Space Efficiency
The false positive rate depends on the number of hash functions used per element, the number of elements hashed/added, and the size of the bloom filter. Conversely, the size of the bloom filter and the ideal number of hash functions depend on the tolerance for false positives and the number of elements to be added.
More precisely, the optimal size for a bloom filter is -2.08·ln(p)·n bits, where p is the tolerated false positive rate and n is the number of unique elements one expects to observe. The optimal number of hash functions is ln(2)·(m/n), where m is the size of the bloom filter and n is the number of expected elements. Thomas Hurst visualizes how these factors affect each other in his bloom filter calculator.
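These formulas can be evaluated directly; the function name and the input values below are arbitrary examples.

```python
import math

def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Optimal bit count m and hash count k for n elements at false positive rate p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)  # about -2.08 * ln(p) * n
    k = round(math.log(2) * m / n)                      # ln(2) * (m / n)
    return m, k

m, k = bloom_parameters(n=1_000_000, p=0.01)
print(m, k)  # roughly 9.6 million bits (about 1.2 MB) and 7 hash functions
```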
Deletion
Deletions can be achieved by setting the bits corresponding to the hashes of the element to be deleted back to zero. But if one of those bits was also set by a different element, then checking membership for that other element would lead to a false negative.
The general consensus is to forbid deletions in bloom filters to remove the possibility of getting false negatives.
Count-Min Sketch
A count-min sketch is a probabilistic data structure that estimates the multiplicity (the number of occurrences) of elements in a multiset. It is what a bloom filter would have been if it were a frequency table instead.
Counting Bloom Filter
Count-min sketches are very similar to bloom filters, so it’s helpful to start off with a bloom filter.
Now, instead of using one bit per hash, what if a few more bits (per hash) were used?
A way to utilize the extra bits is to use them as ‘counters’ that store frequency information — the counter can be incremented by one whenever a hash (an element) is added. This expands the array from only recording whether an element exists to also recording how often it occurs.
A bloom filter can only be used for binary queries — that is, whether an element exists or not. A counting bloom filter, on the other hand, can give slightly richer information in the form of frequency estimates.
Hash collisions are again possible in counting bloom filters, and they inflate the counters of other elements. The frequency of an element is therefore calculated by taking the minimum of the values in its counters, to minimize the overestimation. However, as there is no way to under-count, the minimum of the values is necessarily an upper bound on the estimate — the true frequency is guaranteed to be equal to or lower than the estimate.
As an example, consider adding both hello and foo, and notice how the 44th counter is affected: it increases to two. If the 44th counter were the only counter used for calculating the estimate, it would over-count the frequency of both elements. However, using more hashes and counters can decrease the probability of over-counting — but only when the minimum of the counters is used for estimation.
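A sketch of a counting bloom filter along these lines, again using salted SHA-256 hashes as a stand-in for independent hash functions; the sizes and names are illustrative.

```python
import hashlib

class CountingBloomFilter:
    """Counters instead of bits; the frequency estimate is the minimum counter."""

    def __init__(self, size: int = 256, depth: int = 3) -> None:
        self.size = size
        self.depth = depth
        self.counters = [0] * size

    def _hashes(self, element: str):
        for i in range(self.depth):
            digest = hashlib.sha256(f"{i}:{element}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, element: str) -> None:
        for index in self._hashes(element):
            self.counters[index] += 1  # collisions can only inflate counters

    def estimate(self, element: str) -> int:
        # The minimum counter is the tightest available upper bound on the true count.
        return min(self.counters[index] for index in self._hashes(element))

cbf = CountingBloomFilter()
for word in ["hello", "hello", "foo"]:
    cbf.add(word)
print(cbf.estimate("hello"))  # 2, unless a collision pushes every counter higher
```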
The only possible way counters can under-count is when they overflow and reset back to zero. Here, four bits are used per hash so the counter will overflow when the count exceeds fifteen. This is not a big issue however, since the number of bits per counter can easily be increased a little — even thirty-two bits for each counter is enough for storing frequency estimates up to more than four billion.
Hash Collisions
To reduce over-counting, hash collisions should be reduced. Two approaches used in bloom filters could be applied: increasing the size of the array, and using multiple hash functions. These solutions might look plausible on the surface because of the counting bloom filter’s similarities to a bloom filter, but that is not the case — or at least, it is not that simple.
The first solution is easy to verify and intuitive to follow: increasing the size of the array increases the range of the hash functions, reducing collisions — just like in bloom filters, discussed above. The second approach, however, does not reduce errors. In fact, it leads to slightly higher errors.
It’s important to mention that in both bloom filters and counting bloom filters, increasing the number of hash functions per element increases the probability of a collision. But in the case of bloom filters, a hash collision does not change the value of a bit in the bloom filter itself — the bit would be set to one regardless. In fact, hash collisions decrease overall error by preventing the bloom filter from getting saturated too quickly.
Compare that to a counting bloom filter, where every collision increases the value of a counter by one. Having more hash functions increases the probability of collisions, and therefore, over-counting. The probability of over-counting an element only increases as the number of hash functions per element increases.
Notice how collisions do not affect bits that are already set in a bloom filter, but do affect the counters of a counting bloom filter.
The solution to reducing errors then would be to increase the size of the array and to minimize the number of hash functions. And while that is acceptable, there are better ways to go about it.
Disjoint Hash Ranges
There is an upside to using multiple hash functions — when querying the frequency of an element, it becomes exponentially less probable that all the hashes of one element will collide. The downside to using more hash functions is that it increases overall collisions, which leads to over-counting and less precise estimates.
A simple way to resolve the downside is to assign the output of each hash function to a separate output space. This prevents collisions between hash functions. That is, while collisions from different elements having the same hash are still possible, collisions between different hash functions are now impossible.
The resulting data structure is a count-min sketch. It can also be thought of as a two-dimensional array with w columns and d rows — where d is the total number of hash functions and w is the range of the hash outputs. Each hash function outputs to its own row, and the hash functions are again assumed to be pairwise independent.
Here, the count-min sketch is thirty-two counters wide and eight hashes deep.
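A sketch of a count-min sketch with the row-per-hash layout described above; the salted-hash trick and the default dimensions are illustrative assumptions, not a canonical implementation.

```python
import hashlib

class CountMinSketch:
    """d rows of w counters; each hash function writes only to its own row."""

    def __init__(self, width: int = 32, depth: int = 8) -> None:
        self.width = width   # w: counters per row (the range of each hash)
        self.depth = depth   # d: number of hash functions, one per row
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row: int, element: str) -> int:
        digest = hashlib.sha256(f"{row}:{element}".encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.width

    def add(self, element: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.rows[row][self._index(row, element)] += count

    def estimate(self, element: str) -> int:
        # Every row over-counts independently, so take the minimum across rows.
        return min(self.rows[row][self._index(row, element)] for row in range(self.depth))

cms = CountMinSketch()
for word in ["hello", "hello", "foo", "bar", "hello"]:
    cms.add(word)
print(cms.estimate("hello"))  # 3, or slightly more if collisions occur within rows
```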
The count-min sketch is capable of answering more than simple frequency queries. It can also respond to range and inner-product queries.
Space Efficiency
Similar to bloom filters, the space required by a count-min sketch depends on the tolerance for error and the number of expected elements. The optimal width of the array w is 2n/ε and the depth d is log(δ)/log(1/2) — where n is the total number of elements, ε defines the bound for collisions, and δ is the probability that an estimate exceeds that bound.
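A short calculation using these formulas, treating ε as an absolute over-count bound to match the w = 2n/ε form above; the inputs are arbitrary example values.

```python
import math

def cms_dimensions(n: int, epsilon: float, delta: float) -> tuple[int, int]:
    """Width w = 2n/epsilon and depth d = log(delta)/log(1/2), per the formulas above."""
    w = math.ceil(2 * n / epsilon)
    d = math.ceil(math.log(delta) / math.log(0.5))
    return w, d

w, d = cms_dimensions(n=1_000_000, epsilon=2_000, delta=0.01)
print(w, d)  # 1000 counters per row, 7 rows
```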
HyperLogLog
HyperLogLog is another probabilistic data structure, used to approximate the cardinality (or total number of distinct elements) of a multiset. It is different from a bloom filter and a count-min sketch, and relies on the pseudorandomness of hash functions to estimate cardinalities.
Probability
Consider this thought experiment: Take a few coins where each coin has an equal probability of landing on heads or tails, and toss them all at once. The probability that the first coin is heads is half, or 1/2. The probability that both the first and second coins are heads is 1/4. The probability that the first three coins are all heads is 1/8. In general, the probability that the first n coins are heads is one in 2^n.
Phrased alternatively, roughly 2^n total tosses are required to observe a maximum of n consecutive leading heads at the beginning of the coin array.
Conversely, if shown n consecutive heads at the beginning of the sequence, it can be estimated that the coins were tossed roughly 2^n times.
The maximum number of consecutive leading heads provides a rough estimate for the number of times the coins were tossed.
This is the fundamental principle behind HyperLogLog. Coins are replaced with bits — a hash function is chosen such that its output is pseudorandom. Then each bit of the hash acts as an unbiased coin, having an equal probability of being 0 or 1. Analogous to the total number of tosses, the total number of elements can be estimated using the longest run of consecutive leading zeroes of their hashes.
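A small experiment along these lines, counting leading zero bits of a truncated SHA-256 hash. The element names are made up, and a single counter is used, so the estimate is only expected to land in the right order of magnitude.

```python
import hashlib

def leading_zeros(element: str, bits: int = 32) -> int:
    """Count the consecutive zero bits at the start of the element's hash."""
    digest = hashlib.sha256(element.encode()).digest()
    value = int.from_bytes(digest[:4], "big")   # keep the first 32 bits of the hash
    binary = format(value, f"0{bits}b")
    return len(binary) - len(binary.lstrip("0"))

elements = [f"element-{i}" for i in range(100_000)]
longest = max(leading_zeros(e) for e in elements)
print(2 ** longest)  # a single-counter estimate: right order of magnitude at best
```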
Since hashes are deterministic, duplicate elements will yield the same hashes and would not affect the cardinality calculation. So HyperLogLog estimates the number of distinct elements in a multiset.
Limitations
There are a couple of glaring flaws in this approach, however. First, the gaps between the approximations double every time an extra consecutive zero is observed — the method can only provide estimates that are powers of two, and nothing in between. Second, a hash output may have a lot of consecutive zeroes at the beginning simply owing to chance, skewing the estimate.
To mitigate these problems, multiple counters that store the longest run of zeroes for different subsets of the multiset can be used — instead of storing one counter that stores the longest run of zeroes for the entire multiset. The values can then be averaged to find a more precise cardinality estimate.
Buckets
A multiset is segregated into different subsets using ‘buckets’. In HyperLogLog, a bucket is just a counter storing the length of the longest run of zeroes of hashes for some particular subset of elements. The first b bits of the hash of an element are reserved for selecting the bucket, and the remaining bits are used for counting the leading zeroes. The number of leading zeroes is then used to compute the cardinality estimate for that subset. The cardinality of the entire multiset can be calculated by averaging the individual estimates using the harmonic mean.
Here, the bucket is chosen from the first four bits of the hash.
The addition of new buckets ‘spreads’ the cardinality over all the buckets. However, the bucket quantity information is lost during averaging, and the estimates get scaled down by the number of buckets. To counteract this, the harmonic mean is scaled by the number of buckets. That is, if there are m buckets, the estimate is m * mean. The harmonic mean is used for averaging because it reduces the influence of large outliers and has been shown to be more accurate.
However, even with the corrective measures, there is a predictable bias towards larger estimates. Scaling the average by a correction factor counteracts this bias. This brings down the error to 1.04/√m, where m is the number of buckets.
Thirty-two buckets are used here, so the standard error is approximately 0.18.
The correction factor ranges between 0.672 and 0.723, depending on the number of buckets. It is approximately equal to 0.723/(1 + 1.079/m), where m is the number of buckets.
This is HyperLogLog. A very simple and elegant data structure that is also ridiculously efficient. Adding elements involves hashing them and storing their longest run of leading zeroes in buckets. Estimating the cardinality involves finding the harmonic mean of the estimates, scaled by the number of buckets and a correction factor.
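A compact sketch of the whole structure. It follows the standard formulation in which each bucket stores the position of the first one-bit (one more than the longest run of leading zeroes described above), and it uses the approximate correction factor given above; the parameter choices are illustrative.

```python
import hashlib

class HyperLogLog:
    """Bucketed zero-run counters with a scaled harmonic mean (illustrative sketch)."""

    def __init__(self, bucket_bits: int = 5) -> None:
        self.b = bucket_bits
        self.m = 2 ** bucket_bits                  # number of buckets
        self.buckets = [0] * self.m
        self.alpha = 0.723 / (1 + 1.079 / self.m)  # approximate correction factor

    def add(self, element: str) -> None:
        digest = hashlib.sha256(element.encode()).digest()
        value = int.from_bytes(digest[:8], "big")  # use 64 bits of the hash
        bucket = value >> (64 - self.b)            # first b bits pick the bucket
        rest = value & ((1 << (64 - self.b)) - 1)  # remaining bits supply the zero run
        rank = (64 - self.b) - rest.bit_length() + 1  # leading zeroes plus one
        self.buckets[bucket] = max(self.buckets[bucket], rank)

    def cardinality(self) -> float:
        # Harmonic mean of the per-bucket estimates, scaled by m and the correction factor.
        harmonic_mean = self.m / sum(2.0 ** -rank for rank in self.buckets)
        return self.alpha * self.m * harmonic_mean

hll = HyperLogLog(bucket_bits=5)   # 32 buckets, so roughly 18% standard error
for i in range(50_000):
    hll.add(f"user-{i % 20_000}")  # 20,000 distinct elements, added repeatedly
print(round(hll.cardinality()))    # close to 20,000, give or take the error
```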
Space Efficiency
The only thing that needs to be stored for calculating the cardinality is the buckets. So the total amount of space required is the number of buckets times the size of each bucket. The number of buckets depends on the degree of accuracy required — the error rate is inversely proportional to the square root of the number of buckets. More buckets lead to more accurate estimates and vice versa.
The size of each bucket depends on the largest number that must be stored in it (before it overflows). A five-bit bucket can store numbers from zero to thirty-one. A hash with thirty-one leading zeroes suggests that roughly two billion unique elements were hashed. So to store cardinalities of up to two billion unique elements, each bucket only needs to be five bits. In general, to store the cardinality of n unique elements, the size of each bucket needs to be log(log(n)) bits. That is also where the name comes from.
For example, HyperLogLog is able to estimate cardinalities of more than a billion unique elements with an error of 2% using 2500 five-bit buckets — only 1.5 kB of memory. Or alternatively, just 0.25 kB for an error of 5% using 400 buckets.
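Those figures follow directly from the error and size formulas; the helper name below is just for illustration.

```python
import math

def hll_footprint(buckets: int, bucket_bits: int = 5) -> tuple[float, float]:
    """Standard error (1.04 / sqrt(m)) and memory in kilobytes for m buckets."""
    error = 1.04 / math.sqrt(buckets)
    kilobytes = buckets * bucket_bits / 8 / 1000
    return error, kilobytes

print(hll_footprint(2500))  # about 2% error in roughly 1.5 kB
print(hll_footprint(400))   # about 5% error in 0.25 kB
```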
Other Data Structures
There are lots of other probabilistic data structures that trade accuracy for efficiency. Cuckoo filters and quotient filters are, in addition to bloom filters, used for membership queries. Estimating distinct elements can also be done via linear counting — like HyperLogLog — though its underlying principle is closer to that of bloom filters. Rank can be approximated using t-digests or KLL sketches. Similarities can be estimated using LSH, MinHash, and SimHash.
There are other probabilistic data structures too, each with their own advantages and disadvantages. But the trade off is similar in all cases — precision for efficiency.
References
- Florian Hartmann: Bloom Filters
- Eric Crahen: Count-Min Sketching, Configuration & Point-Queries
- Engineering at Meta: HyperLogLog in Presto: A significantly faster way to handle cardinality estimation