A bloom filter is a probabilistic data structure that approximates a set. There are a lot of advantages of using bloom filters over set. In case of a set, you need to store all the elements in the set. Whereas if you prioritise fast lookups over some amount of inaccuracy, what is called as cache hit-miss, then bloom filters are really great examples of efficient data storage. Now, there are a lot of examples where bloom filters are used today, but what I want to focus on is how they are used in parquet data format.
Parquet (spelled “pa—kay”) is a column-oriented storage format. Common filtering techniques use either statistics, min/max values, or dictionary. These methods still rely on row-level or page-level metadata. And more often than not, the gains from using min/max or dictionary filters is a direct consequence of how smartly the data engineer has worked on sampling data in the parquet files.
Using bloom filters for predicate pushdown gives more granular filtering options on column chunks within a row group. Like dictionaries, they are embedded within the row group. As the time of writing, Apache only supports split block bloom filter (SSBF) according to their official documentation. SSBF is essentially a standard bloom filter with the only difference being how they are structured. Think of it like a bloom filter within a bloom filter. A parent bloom filter is divided into many (256 bit) blocks which, in turn, are subdivided into more (32 bits) words.
I created a bloom filter simulator which helps to visualise what happens at block and word level when a new column value is inserted. By choosing the number of rows and false-positive probability we can determine the size of the bloom filter. If this simulator breaks, here is the hosted file.
Implementing bloom filter at different levels is a deliberate CPU optimisation trick to prevent frequent hash collisions and make lookup faster. Logically, it is just one Bloom filter per column chunk, whose capacity and false-positive rate are determined by bits-per-insert and the number of inserted hash values.