Package org.apache.commons.collections4.bloomfilter
Background
The Bloom filter is a probabilistic data structure that indicates where things are not: a negative answer is definitive, while a positive answer only means the item may be present. Conceptually it is a bit vector or bit map. You create a Bloom filter by hashing an item and converting the hashes to enabled bits in the map. Multiple Bloom filters may be merged together into one Bloom filter. It is possible to test whether a filter B has been merged into another filter A by verifying that (A & B) == B.
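The merge and containment test described above can be sketched with plain long arrays. This is a minimal illustration, not the library's implementation: the class BitMapContainment and its methods are hypothetical names, and both filters are assumed to have the same length.

```java
/** Hypothetical helper, not part of the library: bit maps are plain long[] arrays. */
final class BitMapContainment {

    /** Merges filter b into filter a by OR-ing each 64-bit word (same length assumed). */
    static void merge(long[] a, long[] b) {
        for (int i = 0; i < a.length; i++) {
            a[i] |= b[i];
        }
    }

    /** Returns true if every bit enabled in b is also enabled in a, i.e. (a & b) == b. */
    static boolean contains(long[] a, long[] b) {
        for (int i = 0; i < a.length; i++) {
            if ((a[i] & b[i]) != b[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long[] a = {0b1010L, 0L};
        long[] b = {0b0100L, 1L};
        System.out.println("before merge: " + contains(a, b));
        merge(a, b);
        System.out.println("after merge: " + contains(a, b));
    }
}
```

Because the structure is probabilistic, a true result may also occur when the bits of B happen to be enabled by other merged filters; only a false result is conclusive.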
Bloom filters are generally used where hash tables would be too large, or as a filter front end for longer processes. For example, most browsers have a Bloom filter that is built from all known bad URLs (ones that serve up malicious software). When you enter a URL, the browser builds a Bloom filter for it and checks whether it is "in" the bad URL filter. If it is not, the URL is good; if it matches, the expensive lookup on a remote system is made to see whether the URL actually is in the list. There are many other uses, and in most cases the reason is to perform a fast check as a gateway for a longer operation.
Some Bloom filters (for example CountingBloomFilter) use counters rather than bits. In this case each counter is called a cell.
BloomFilter
The Bloom filter architecture here is designed for speed of execution, so some methods like CountingBloomFilter's merge, remove, add, and subtract may throw exceptions. Once an exception is thrown the state of the Bloom filter is unknown.
The choice not to use atomic transactions was made to achieve maximum performance under correct usage.
Nomenclature
- bit map - In the bloomfilter package a bit map is not a structure but a logical construct. It is conceptualized as an ordered collection of long values, each of which is interpreted as the enabled true/false state of 64 contiguous indices. The mapping of bits into the long values is described in the BitMaps Javadoc.
- index - In the bloomfilter package an Index is a logical collection of ints specifying the enabled bits in the bit map.
- cell - Some Bloom filters (for example CountingBloomFilter) use counters rather than bits. In the bloomfilter package Cells are pairs of ints representing an index and a value. They are not the standard Java Pair objects, nor the Apache Commons Lang version either.
- extractor - The extractors are FunctionalInterfaces that are conceptually iterators on a bit map, an index, or a collection of cells, with an early termination switch. Extractors have names like BitMapExtractor or IndexExtractor and have a processXs method that takes a type specialization of Predicate as its argument (for example BitMapExtractor.processBitMaps(java.util.function.LongPredicate), IndexExtractor.processIndices(java.util.function.IntPredicate), and CellExtractor.processCells(org.apache.commons.collections4.bloomfilter.CellExtractor.CellPredicate)). The predicate is expected to process each of the Xs in turn and return true if the processing should continue or false to stop it.
There is an obvious association between the bit map and the Index, as defined above, in that if bit 5 is enabled in the bit map then the Index must contain the value 5.
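The relationship between a bit map and an index, and the early-termination contract of an extractor, can be sketched as follows. This is an illustration only: BitMapIndexSketch, toBitMap, and this processIndices are hypothetical stand-ins, and the sketch assumes index i is stored in long i / 64 at bit i % 64 (the authoritative mapping is the one documented in the BitMaps Javadoc).

```java
/** Hypothetical sketch of the bit map / index relationship; not the library's classes. */
final class BitMapIndexSketch {

    /** Builds a bit map (array of longs) with one enabled bit per index. */
    static long[] toBitMap(int numberOfBits, int... indices) {
        long[] bitMap = new long[(numberOfBits + 63) / 64];
        for (int idx : indices) {
            // idx >> 6 selects the long; Java takes the shift count of 1L << idx modulo 64.
            bitMap[idx >> 6] |= 1L << idx;
        }
        return bitMap;
    }

    /**
     * Passes each enabled index to the predicate in ascending order.
     * Stops and returns false as soon as the predicate returns false.
     */
    static boolean processIndices(long[] bitMap, java.util.function.IntPredicate consumer) {
        for (int word = 0; word < bitMap.length; word++) {
            long bits = bitMap[word];
            while (bits != 0) {
                int idx = word * 64 + Long.numberOfTrailingZeros(bits);
                if (!consumer.test(idx)) {
                    return false; // early termination requested
                }
                bits &= bits - 1; // clear the lowest enabled bit
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long[] bitMap = toBitMap(128, 5, 70);
        processIndices(bitMap, idx -> {
            System.out.println("enabled index: " + idx);
            return true; // keep going
        });
    }
}
```

Returning false from the predicate is what makes the extractor cheaper than a full iterator when the caller only needs the first match.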
Implementation Notes
The architecture is designed so that the implementation of the storage of bits is abstracted. Rather than mandating a specific state representation, we require that all Bloom filters implement the BitMapExtractor and IndexExtractor interfaces. Counting-based Bloom filters implement CellExtractor as well. There are static methods in the various Extractor interfaces to convert from one type to another.
Programs that utilize the Bloom filters may use the BitMapExtractor or IndexExtractor to retrieve or process a representation of the internal structure.
Additional methods are available in the BitMaps class to assist in manipulation of bit map representations.
The Bloom filter is an interface that requires implementation of the following methods:
- BloomFilter.cardinality() returns the number of bits enabled in the Bloom filter.
- BloomFilter.characteristics() returns an integer of characteristics flags.
- BloomFilter.clear() resets the Bloom filter to its initial empty state.
- BloomFilter.contains(IndexExtractor) returns true if the bits specified by the indices generated by the IndexExtractor are enabled in the Bloom filter.
- BloomFilter.copy() returns a fresh copy of the bit map.
- BloomFilter.getShape() returns the shape the Bloom filter was created with.
- BloomFilter.merge(BitMapExtractor) merges the bit maps from the BitMapExtractor into the internal representation of the Bloom filter.
- BloomFilter.merge(IndexExtractor) merges the indices from the IndexExtractor into the internal representation of the Bloom filter.
Other methods should be overridden where an implementation can be more efficient than the default.
CountingBloomFilter
The CountingBloomFilter extends the Bloom filter by counting the number of times a specific bit has been enabled or disabled. This allows the removal (opposite of merge) of Bloom filters at the expense of additional overhead.
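The idea of counting cells can be sketched with an int array, one counter per bit index. This is a hedged illustration: CountingSketch is a hypothetical class, and its choice to throw on an invalid removal is an assumption of the sketch, not a statement about how the library's CountingBloomFilter reports errors.

```java
/** Hypothetical counting filter sketch: one int counter (cell) per bit index. */
final class CountingSketch {
    private final int[] cells;

    CountingSketch(int numberOfBits) {
        cells = new int[numberOfBits];
    }

    /** Merge: increments the cell of every index belonging to the item. */
    void merge(int... indices) {
        for (int idx : indices) {
            cells[idx]++;
        }
    }

    /**
     * Remove: decrements the cells of a previously merged item.
     * Not atomic: if it throws part-way, earlier cells have already been decremented.
     */
    void remove(int... indices) {
        for (int idx : indices) {
            if (cells[idx] == 0) {
                throw new IllegalStateException("cell " + idx + " is already zero");
            }
            cells[idx]--;
        }
    }

    /** An index counts as enabled while its cell is greater than zero. */
    boolean contains(int... indices) {
        for (int idx : indices) {
            if (cells[idx] == 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        CountingSketch filter = new CountingSketch(16);
        filter.merge(1, 5, 9);
        filter.merge(5, 9, 12);
        filter.remove(1, 5, 9);
        System.out.println("first item present: " + filter.contains(1, 5, 9));
        System.out.println("second item present: " + filter.contains(5, 9, 12));
    }
}
```

The overhead is visible in the storage: an int per index instead of a single bit.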
LayeredBloomFilter
The LayeredBloomFilter extends the Bloom filter by creating layers of Bloom filters that can be queried as a single filter or as a set of filters. This adds the ability to perform windowing on streams of data.
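A rough sketch of layering and windowing, under simplifying assumptions: LayeredSketch is a hypothetical class, each layer is a single 64-bit bit map (so indices must be in the range 0 to 63), and the oldest layer is simply dropped once a fixed depth is exceeded. The library's own layer management is configurable and is not reproduced here.

```java
/** Hypothetical layered filter; each layer is one 64-bit bit map, so indices must be 0-63. */
final class LayeredSketch {
    private final java.util.ArrayDeque<long[]> layers = new java.util.ArrayDeque<>();
    private final int maxDepth;

    LayeredSketch(int maxDepth) {
        this.maxDepth = maxDepth;
        layers.addLast(new long[1]);
    }

    private static long mask(int... indices) {
        long mask = 0L;
        for (int idx : indices) {
            mask |= 1L << idx;
        }
        return mask;
    }

    /** Starts a new layer and drops the oldest one once maxDepth is exceeded. */
    void next() {
        layers.addLast(new long[1]);
        if (layers.size() > maxDepth) {
            layers.removeFirst();
        }
    }

    /** Enables the given indices in the newest layer. */
    void merge(int... indices) {
        layers.peekLast()[0] |= mask(indices);
    }

    /** Set-of-filters query: true if a single layer has all the bits enabled. */
    boolean containsInAnyLayer(int... indices) {
        long mask = mask(indices);
        for (long[] layer : layers) {
            if ((layer[0] & mask) == mask) {
                return true;
            }
        }
        return false;
    }

    /** Single-filter query: OR all layers together, then test the bits. */
    boolean containsFlattened(int... indices) {
        long flattened = 0L;
        for (long[] layer : layers) {
            flattened |= layer[0];
        }
        long mask = mask(indices);
        return (flattened & mask) == mask;
    }

    public static void main(String[] args) {
        LayeredSketch filter = new LayeredSketch(2);
        filter.merge(1, 2);
        filter.next();
        filter.merge(3);
        System.out.println("in one layer: " + filter.containsInAnyLayer(1, 3));
        System.out.println("flattened: " + filter.containsFlattened(1, 3));
    }
}
```

Dropping old layers is what gives the windowing behaviour: items merged into an expired layer stop matching.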
Shape
The Shape describes the Bloom filter using the number of bits and the number of hash functions. It can be specified by the number of expected items and desired false positive rate.
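How a shape can follow from the expected number of items n and the desired false positive rate p can be sketched with the standard Bloom filter sizing formulas, m = ceil(-n ln p / (ln 2)^2) bits and k = round((m / n) ln 2) hash functions. ShapeSketch is a hypothetical helper; the rounding choices are assumptions of this sketch and may differ from those made by the Shape class.

```java
/** Hypothetical sizing helper based on the standard Bloom filter formulas. */
final class ShapeSketch {

    /** Number of bits: m = ceil(-n * ln(p) / (ln 2)^2). */
    static int numberOfBits(int n, double p) {
        double ln2 = Math.log(2);
        return (int) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    /** Number of hash functions: k = round((m / n) * ln 2), never less than 1. */
    static int numberOfHashFunctions(int n, int m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        int n = 1000;
        double p = 0.01;
        int m = numberOfBits(n, p);
        int k = numberOfHashFunctions(n, m);
        System.out.println("bits=" + m + ", hash functions=" + k);
    }
}
```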
Hasher
A Hasher converts bytes into a series of integers based on a Shape. Each hasher represents one item being added to the Bloom filter.
The EnhancedDoubleHasher uses a combinatorial generation technique to create the integers. It is easily initialized by using a byte array returned by the standard MessageDigest or other hash function. Alternatively, a pair of long values may also be used.
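The enhanced double hashing technique can be sketched from a pair of long values as follows. This follows the textbook formulation (start at x, step by y, and grow y by the iteration count); EnhancedDoubleHashSketch is a hypothetical class and the details may differ from what EnhancedDoubleHasher actually does.

```java
/** Hypothetical sketch of textbook enhanced double hashing; not the library's code. */
final class EnhancedDoubleHashSketch {

    /**
     * Generates k indices in the range [0, m) from two hash values.
     * The two longs are treated as unsigned so the remainders are never negative.
     */
    static int[] indices(long initial, long increment, int k, int m) {
        int[] result = new int[k];
        long x = Long.remainderUnsigned(initial, m);
        long y = Long.remainderUnsigned(increment, m);
        for (int i = 0; i < k; i++) {
            result[i] = (int) x;
            x = (x + y) % m;     // step by the current increment
            y = (y + i + 1) % m; // "enhanced": the increment itself grows each round
        }
        return result;
    }

    public static void main(String[] args) {
        int[] generated = indices(3L, 4L, 5, 17);
        System.out.println(java.util.Arrays.toString(generated));
    }
}
```

The generated series may contain duplicate indices, so a caller that needs distinct indices has to filter them.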
Other implementations of the Hasher are straightforward to write.
References
Since: 4.5.0-M1
Class Description
- A counting Bloom filter using an int array to track cells for each enabled bit.
- Produces bit map longs for a Bloom filter.
- Contains functions to convert int indices into Bloom filter bit positions and vice versa.
- BloomFilter<T extends BloomFilter<T>> - The interface that describes a Bloom filter.
- Produces Bloom filters from a collection (for example, LayeredBloomFilter).
- Some Bloom filter implementations use a count rather than a bit flag.
- Represents an operation that accepts an <index, count> pair.
- The interface that describes a Bloom filter that associates a count with each bit index rather than a bit.
- A Hasher that implements combinatorial hashing as described by Kirsch and Mitzenmacher using the enhanced double hashing technique described in the Wikipedia article Double Hashing.
- A Hasher creates IndexExtractors based on the hash implementation and the provided Shape.
- An object that produces indices of a Bloom filter.
- A convenience class for Hasher implementations to filter out duplicate indices.
- LayeredBloomFilter<T extends BloomFilter<T>> - Layered Bloom filters are described in Zhiwang, Cen; Jungang, Xu; Jian, Sun (2010), "A multi-layer Bloom filter for duplicated URL detection", Proc. 3rd International Conference on Advanced Computer Theory and Engineering (ICACTE 2010), vol. 1, pp.
- LayerManager<T extends BloomFilter<T>> - Implementation of the methods to manage the layers in a layered Bloom filter.
- LayerManager.Builder<T extends BloomFilter<T>> - Builds new instances of LayerManager.
- Static methods to create a Consumer of a List of BloomFilter to perform tests on whether to reduce the collection of Bloom filters.
- A collection of common ExtendCheck implementations to test whether to extend the depth of a LayerManager.
- Represents a function that accepts two long-valued arguments and produces a binary result.
- Implementations of set operations on BitMapExtractors.
- The definition of a Bloom filter shape.
- A Bloom filter using an array of bit maps to track enabled bits.
- A Bloom filter using a TreeSet of integers to track enabled bits.
- An abstract class to assist in implementing Bloom filter decorators.