Package org.apache.commons.collections4.bloomfilter


Implements Bloom filter classes and interfaces.

Background

The Bloom filter is a probabilistic data structure that indicates where things are not: a negative result is definitive, while a positive result may be a false positive. Conceptually it is a bit vector or bit map. You create a Bloom filter by creating hashes and converting those to enabled bits in the map. Multiple Bloom filters may be merged together into one Bloom filter. It is possible to test whether a filter B has been merged into another filter A by verifying that (A & B) == B.
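The merge and containment test described above can be sketched with plain long[] bit maps. This is a minimal illustration only: the class BitMapSketch and its methods are hypothetical names, not part of this package, and both bit maps are assumed to have the same length.

```java
import java.util.Arrays;

/** Hypothetical illustration class; not part of Commons Collections. */
final class BitMapSketch {

    /** Returns a bit map of numberOfBits bits with the given bit indices enabled. */
    static long[] fromIndices(int numberOfBits, int... indices) {
        long[] bits = new long[(numberOfBits + 63) / 64];
        for (int i : indices) {
            bits[i >>> 6] |= 1L << (i & 63); // word i / 64, bit i % 64
        }
        return bits;
    }

    /** Merges b into a copy of a using a bitwise OR. Assumes equal lengths. */
    static long[] merge(long[] a, long[] b) {
        long[] out = Arrays.copyOf(a, a.length);
        for (int i = 0; i < b.length; i++) {
            out[i] |= b[i];
        }
        return out;
    }

    /** True if every bit enabled in b is also enabled in a, i.e. (a & b) == b. */
    static boolean contains(long[] a, long[] b) {
        for (int i = 0; i < b.length; i++) {
            if ((a[i] & b[i]) != b[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long[] a = fromIndices(128, 3, 70, 101);
        long[] b = fromIndices(128, 9, 64);
        long[] merged = merge(a, b);
        System.out.println("merged contains b: " + contains(merged, b));
        System.out.println("a contains b: " + contains(a, b));
    }
}
```

Note that the check only shows that every bit of B is enabled in A. Unrelated filters can pass it by coincidence, which is the source of false positives.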

Bloom filters are generally used where hash tables would be too large, or as a filter front end for longer processes. For example, most browsers have a Bloom filter that is built from all known bad URLs (ones that serve up malicious software). When you enter a URL, the browser builds a Bloom filter for it and checks whether it is "in" the bad URL filter. If it is not, the URL is good. If it matches, the expensive lookup on a remote system is made to see whether the URL actually is in the list. There are many other uses, and in most cases the reason is to perform a fast check as a gateway for a longer operation.

Some Bloom filters (for example CountingBloomFilter) use counters rather than bits. In this case each counter is called a cell.

BloomFilter

The Bloom filter architecture here is designed for speed of execution, so some methods, such as CountingBloomFilter's merge, remove, add, and subtract, may throw exceptions. Once an exception is thrown, the state of the Bloom filter is unknown. The choice not to use atomic transactions was made to achieve maximum performance under correct usage.

Nomenclature

There is an obvious association between the bit map and the Index, as defined above, in that if bit 5 is enabled in the bit map then the Index must contain the value 5.
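The association between the two representations can be sketched as a conversion from a long[] bit map to the list of enabled bit indices. This is an illustration under stated assumptions: IndexSketch and toIndices are hypothetical names, and the package's own conversion methods may work differently.

```java
import java.util.Arrays;

/** Hypothetical illustration class; not part of Commons Collections. */
final class IndexSketch {

    /** Returns the indices of all enabled bits in the bit map, in ascending order. */
    static int[] toIndices(long[] bitMap) {
        int[] out = new int[64 * bitMap.length];
        int n = 0;
        for (int w = 0; w < bitMap.length; w++) {
            long word = bitMap[w];
            while (word != 0) {
                // Position of the lowest enabled bit, offset by 64 per word.
                out[n++] = (w << 6) + Long.numberOfTrailingZeros(word);
                word &= word - 1; // clear the lowest enabled bit
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        long[] bitMap = {1L << 5, 1L}; // bit 5 and bit 64 enabled
        System.out.println(Arrays.toString(toIndices(bitMap)));
    }
}
```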

Implementation Notes

The architecture is designed so that the implementation of the storage of bits is abstracted. Rather than specifying a specific state representation, we require that all Bloom filters implement the BitMapExtractor and IndexExtractor interfaces. Counting-based Bloom filters implement CellExtractor as well. There are static methods in the various Extractor interfaces to convert from one type to another.

Programs that utilize the Bloom filters may use the BitMapExtractor or IndexExtractor to retrieve or process a representation of the internal structure. Additional methods are available in the BitMaps class to assist in manipulation of bit map representations.

The Bloom filter is an interface that requires implementation of 9 methods:

Other methods should be overridden where they can be implemented more efficiently than the default implementations.

CountingBloomFilter

The CountingBloomFilter extends the Bloom filter by counting the number of times a specific bit has been enabled or disabled. This allows the removal (opposite of merge) of Bloom filters at the expense of additional overhead.
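The counting idea can be sketched with an int[] of cells, one per bit position. This is not the package's implementation: CountingSketch and its methods are hypothetical names, and items are passed directly as bit indices to keep the example short.

```java
/** Hypothetical illustration class; not part of Commons Collections. */
final class CountingSketch {

    private final int[] cells;

    CountingSketch(int numberOfBits) {
        cells = new int[numberOfBits];
    }

    /** Adds one item by incrementing the cell at each of its indices. */
    void merge(int... indices) {
        for (int i : indices) {
            cells[i]++;
        }
    }

    /**
     * Removes one item by decrementing the cell at each of its indices.
     * There is no rollback: if this throws part way through, earlier cells
     * have already been decremented and the state is no longer reliable.
     */
    void remove(int... indices) {
        for (int i : indices) {
            if (cells[i] == 0) {
                throw new IllegalStateException("cell " + i + " is already zero");
            }
            cells[i]--;
        }
    }

    /** True if every cell for the item is non-zero. */
    boolean contains(int... indices) {
        for (int i : indices) {
            if (cells[i] == 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        CountingSketch filter = new CountingSketch(16);
        filter.merge(1, 5, 9);
        filter.merge(5, 12);
        filter.remove(1, 5, 9);
        System.out.println("still contains {5, 12}: " + filter.contains(5, 12));
    }
}
```

Because cell 5 was incremented twice, removing the first item leaves the second item intact, which a plain bit map could not guarantee.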

LayeredBloomFilter

The LayeredBloomFilter extends the Bloom filter by creating layers of Bloom filters that can be queried as a single filter or as a set of filters. This adds the ability to perform windowing on streams of data.
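One way to picture layering and windowing is a bounded queue of bit maps in which the oldest layer is dropped when a new one starts. The sketch below is an assumption-laden illustration, not the LayeredBloomFilter API: LayeredSketch, next, containsInAnyLayer, flatten, and hasAll are all hypothetical names, and the eviction policy shown is only one possibility.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical illustration class; not part of Commons Collections. */
final class LayeredSketch {

    private final Deque<long[]> layers = new ArrayDeque<>();
    private final int words;
    private final int maxLayers;

    LayeredSketch(int numberOfBits, int maxLayers) {
        this.words = (numberOfBits + 63) / 64;
        this.maxLayers = maxLayers;
        next();
    }

    /** Starts a new layer, evicting the oldest one when the window is full. */
    void next() {
        if (layers.size() == maxLayers) {
            layers.removeFirst();
        }
        layers.addLast(new long[words]);
    }

    /** Enables the given bit indices in the newest layer. */
    void merge(int... indices) {
        long[] top = layers.getLast();
        for (int i : indices) {
            top[i >>> 6] |= 1L << (i & 63);
        }
    }

    /** Query as a set of filters: true if a single layer has all indices enabled. */
    boolean containsInAnyLayer(int... indices) {
        for (long[] layer : layers) {
            if (hasAll(layer, indices)) {
                return true;
            }
        }
        return false;
    }

    /** Query as a single filter: the bitwise OR of all current layers. */
    long[] flatten() {
        long[] out = new long[words];
        for (long[] layer : layers) {
            for (int i = 0; i < words; i++) {
                out[i] |= layer[i];
            }
        }
        return out;
    }

    /** True if every listed bit index is enabled in the bit map. */
    static boolean hasAll(long[] bits, int... indices) {
        for (int i : indices) {
            if ((bits[i >>> 6] & (1L << (i & 63))) == 0) {
                return false;
            }
        }
        return true;
    }
}
```

Calling next() at a fixed interval turns the structure into a sliding window: items merged more than maxLayers intervals ago stop matching.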

Shape

The Shape describes the Bloom filter using the number of bits and the number of hash functions. It can be specified by the number of expected items and desired false positive rate.
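The relationship between expected items, false positive rate, number of bits, and number of hash functions can be sketched with the standard textbook formulas m = ceil(-n ln p / (ln 2)^2) and k = round((m / n) ln 2). This is an illustration only: ShapeSketch and its methods are hypothetical names, and the rounding used by the Shape class itself may differ.

```java
/** Hypothetical illustration class; not part of Commons Collections. */
final class ShapeSketch {

    /** Number of bits m for n expected items and target false positive rate p. */
    static int numberOfBits(int n, double p) {
        double ln2 = Math.log(2);
        return (int) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    /** Number of hash functions k that minimizes false positives for n items in m bits. */
    static int numberOfHashFunctions(int n, int m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    /** Approximate false positive rate after n items: (1 - e^(-kn/m))^k. */
    static double falsePositiveRate(int n, int m, int k) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        int n = 1000;
        int m = numberOfBits(n, 0.01);
        int k = numberOfHashFunctions(n, m);
        System.out.println("bits=" + m + " hashes=" + k
                + " rate=" + falsePositiveRate(n, m, k));
    }
}
```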

Hasher

A Hasher converts bytes into a series of integers based on a Shape. Each hasher represents one item being added to the Bloom filter.

The EnhancedDoubleHasher uses a combinatorial generation technique to create the integers. It is easily initialized from a byte array returned by the standard MessageDigest or another hash function. Alternatively, a pair of long values may be used.
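The general idea of generating many indices from two hash values can be sketched with the enhanced double hashing recurrence, in which index i equals h1 + i*h2 + (i^3 - i)/6 modulo the number of bits. This is a sketch under stated assumptions, not the EnhancedDoubleHasher source: DoubleHashSketch and its methods are hypothetical names, and the choice of MD5 and of how the digest is split into two longs is made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical illustration class; not part of Commons Collections. */
final class DoubleHashSketch {

    /** Generates k indices in [0, m) from two hash values. */
    static int[] indices(long initial, long increment, int k, int m) {
        int[] out = new int[k];
        long x = Long.remainderUnsigned(initial, m);
        long y = Long.remainderUnsigned(increment, m);
        for (int i = 0; i < k; i++) {
            out[i] = (int) x;
            x = (x + y) % m;     // advance by the current increment
            y = (y + i + 1) % m; // grow the increment to add the cubic term
        }
        return out;
    }

    /** Derives the two hash values from the 16-byte MD5 digest of the item. */
    static int[] indicesFor(String item, int k, int m) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(item.getBytes(StandardCharsets.UTF_8));
            ByteBuffer buf = ByteBuffer.wrap(digest);
            long initial = buf.getLong();   // first 8 bytes
            long increment = buf.getLong(); // last 8 bytes
            return indices(initial, increment, k, m);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    public static void main(String[] args) {
        for (int index : indicesFor("example.org", 7, 9586)) {
            System.out.println(index);
        }
    }
}
```

Only one digest computation is needed per item regardless of k, which is the main attraction of the technique.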

Other implementations of the Hasher are easy to implement.

References

  1. Building a Better Bloom Filter by Adam Kirsch and Michael Mitzenmacher.
  2. Apache Cassandra's BloomFilter.
Since:
4.5.0-M1