A Golang Implemented Bloom Filter

Introduction

Question

如果想判断一个元素是不是在一个集合里,

精确答案的方案

Hash Table

最直接的方法就是将集合中的元素存在计算机中,遇到一个新元素时,将它和集合中的元素直接比较即可。一般来讲,计算机中的集合是用哈希表(Hash Table)来存储的。它的好处是快速准确,缺点是耗费存储空间。为什么说耗费存储空间呢?其根本原因是哈希表方法需要把实实在在的具有特定长度(每个Email地址对应成一个8字节的信息指纹)的元素的信息指纹存储在内存或硬盘中的哈希表中,这个存储量在实际应用中一般是相当大的。比如每存储一亿个Email地址,需要0.8G大小的数字指纹存储空间,考虑到哈希表的存储空间利用率一般只有一半,所以需要1.6G的存储空间。如果存储几十亿上百亿的Email地址,那就需要百亿字节的内存存储空间。

存在一定误报方案

BitMap

另一种方法则是通过BitMap算法来实现,所谓的BitMap就是将某一个值通过hash映射到bit map某一个bit位,然后标记这个bit为1表示其存在。由于BitMap使用了bit位来存储数据,因此可以大大节省存储空间。但考虑到hash碰撞问题,bitmap的数组长度必须足够大。对于有m个slot的bit map, map只能存储m/100个元素,因而有大量的空间被浪费,同时也会使得空间复杂度急剧上升,这显然不是space efficient的。解决的方法很简单,使用k>1的布隆过滤器,即k个hash function将每个元素改为对应于k个bits,因为误判度会降低很多,这就引出了本文的重点bloom Filter.

Bloom Filter

Bloom Filter跟单哈希函数BitMap相同之处在于使用了长度为m的bitmap,不同之处在于Bloom Filter使用了k个哈希函数,每个字符串跟k个bit对应。从而降低了冲突的概率。在实现Bloom时,k和m实现很关键,既要使得空间合理利用,又要确保较高的准确率。

Bloom Filter Design

优化参数

在实现bloom filter时,允许一定的误判率fpr。而这个fpr,与样本数量,bit数组大小和hash函数个数直接相关。数组过大,容易造成空间浪费;而数组过小,则容易导致hash碰撞。hash函数个数也是如此,函数个数过多或过少,都需要更多的数组空间,稀释样本密度,来减少碰撞。样本数量为n,误判率为p,Bit数组大小m, hash函数个数k, 转换关系如下

$$ n = ceil (- \frac{m * \ln (1 - e^{\frac{\ln(p)}{k}})}{k}) $$

$$ p = (1 - e^{\frac{-k*n}{m}})^k $$

$$ m = ceil (- \frac{n*\ln p}{\ln (2^{\ln 2})}) $$

$$ k = round (\frac{m}{n} * \ln 2) $$

将上面数学公式转化成Golang语言,得到如下代码,

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
package main

import (
"fmt"
"math"
)

func GetK(m int64, n int64, p float64) float64 {
return math.Round(float64(m) / float64(n) * math.Ln2)
}

func GetM(k int64, n int64, p float64) float64 {
return math.Ceil((-float64(n) * math.Log(p)) / math.Log(math.Pow(2, math.Ln2)))
}

func GetN(k int64, m int64, p float64) float64 {
return math.Ceil(-float64(m) / float64(k) * math.Log(1-math.Exp(math.Log(p)/float64(k))))
}

func GetP(k int64, m int64, n int64) float64 {
return math.Pow((1 - math.Exp(-float64(k*n)/float64(m))), float64(k))
}

func main() {
var K int64 = 5
var M int64 = 20000
var N int64 = 2000
var P float64 = 0.01

k := GetK(M, N, P)
fmt.Println(k)

m := GetM(K, N, P)
fmt.Println(m)

n := GetN(K, M, P)
fmt.Println(n)

p := GetP(K, M, N)
fmt.Println(p)
}

输出结果

1
2
3
4
7
19171
2031
0.009430929226122474

解决扩展性

考虑到实际应用时,我们大部分情况对样本数量是无法预估的,那么就需要设计一种支持扩展的Bloom Filter。假设初始阶段,我们分配了一块大小为M的内存,如果我们预期的误判率为F,则可以通过上面的公式,推断出可容纳样本数上限N为。一旦达到上限,则再创建一个Bloom Filter,用于存放未标记元素。举个例子,如果样本x需要存入到bloom filter,首先判断是否存在于之前创建的filter,如果已经标记,则不存储,如果没有标记,则存在当前filter。当前filter填满之后,继续创建新filter。初始化完毕后,如果要检测样本y是存在于集合,只需要在每个filter上查找一遍即可。这样就实现了可扩展的bloom filter.

Bloom Filter Implementation

Basic Bloom Filter

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
package bloomfilter

import (
"hash"
"hash/fnv"
"math"
)

// The standard bloom filter, which allows adding of
// elements, and checking for their existence
type BloomFilter struct {
bitmap []bool // The bloom-filter bitmap
HashFuncNum int // Number of hash functions
ElementNum int // Number of elements in the filter
FilterSize int // Size of the bloom filter
HashFuncName hash.Hash64 // The hash function
}

// Returns a new BloomFilter object, if you pass the
// number of Hash Functions to use and the maximum
// size of the Bloom Filter
func NewBloomFilter(numHashFuncs, bfSize int) *BloomFilter {
bf := new(BloomFilter)
bf.bitmap = make([]bool, bfSize)
bf.HashFuncNum, bf.FilterSize = numHashFuncs, bfSize
bf.ElementNum = 0
bf.HashFuncName = fnv.New64()
return bf
}

func (bf *BloomFilter) getHash(b []byte) (uint32, uint32) {
bf.HashFuncName.Reset()
bf.HashFuncName.Write(b)
hash64 := bf.HashFuncName.Sum64()
h1 := uint32(hash64 & ((1 << 32) - 1))
h2 := uint32(hash64 >> 32)
return h1, h2
}

// Adds an element (in byte-array form) to the Bloom Filter
func (bf *BloomFilter) Add(e []byte) {
h1, h2 := bf.getHash(e)
for i := 0; i < bf.HashFuncNum; i++ {
ind := (h1 + uint32(i)*h2) % uint32(bf.FilterSize)
bf.bitmap[ind] = true
}
bf.ElementNum++
}

// Checks if an element (in byte-array form) exists in the
// Bloom Filter
func (bf *BloomFilter) Check(x []byte) bool {
h1, h2 := bf.getHash(x)
result := true
for i := 0; i < bf.HashFuncNum; i++ {
ind := (h1 + uint32(i)*h2) % uint32(bf.FilterSize)
result = result && bf.bitmap[ind]
}
return result
}

// Returns the current False Positive Rate of the Bloom Filter
func (bf *BloomFilter) FalsePositiveRate() float64 {
var eleRate = float64(bf.HashFuncNum*bf.ElementNum) / float64(bf.FilterSize)
return math.Pow((1 - math.Exp(-eleRate)), float64(bf.HashFuncNum))
}

Removable Bloom Filter

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69

// A Bloom Filter which allows deletion of elements.
// An 8-bit counter is maintained for each slot. This should
// be accounted for while deciding the size of the new filter.
type CountingBloomFilter struct {
counts []uint8 // The bloom-filter bitmap
HashFuncNum int // Number of hash functions
ElementNum int // Number of elements in the filter
FilterSize int // Size of the bloom filter
HashFuncName hash.Hash64 // The hash function
}

// Creates a new Counting Bloom Filter
func NewCountingBloomFilter(numHashFuncs, cbfSize int) *CountingBloomFilter {
cbf := new(CountingBloomFilter)
cbf.counts = make([]uint8, cbfSize)
cbf.HashFuncNum, cbf.FilterSize = numHashFuncs, cbfSize
cbf.ElementNum = 0
cbf.HashFuncName = fnv.New64()
return cbf
}

func (cbf *CountingBloomFilter) getHash(b []byte) (uint32, uint32) {
cbf.HashFuncName.Reset()
cbf.HashFuncName.Write(b)
hash64 := cbf.HashFuncName.Sum64()
h1 := uint32(hash64 & ((1 << 32) - 1))
h2 := uint32(hash64 >> 32)
return h1, h2
}

// Adds an element (in byte-array form) to the Counting Bloom Filter
func (cbf *CountingBloomFilter) Add(e []byte) {
h1, h2 := cbf.getHash(e)
for i := 0; i < cbf.HashFuncNum; i++ {
ind := (h1 + uint32(i)*h2) % uint32(cbf.FilterSize)
// Guarding against an overflow
if cbf.counts[ind] < 0xFF {
cbf.counts[ind] += 1
}
}
cbf.ElementNum++
}

// Removes an element (in byte-array form) from the Counting Bloom Filter
func (cbf *CountingBloomFilter) Remove(e []byte) {
h1, h2 := cbf.getHash(e)
for i := 0; i < cbf.HashFuncNum; i++ {
ind := (h1 + uint32(i)*h2) % uint32(cbf.FilterSize)

if cbf.counts[ind] > 0 {
// Guarding against an underflow
cbf.counts[ind] -= 1
}
}
cbf.ElementNum--
}

// Checks if an element (in byte-array form) exists in the
// Counting Bloom Filter
func (cbf *CountingBloomFilter) Check(x []byte) bool {
h1, h2 := cbf.getHash(x)
result := true
for i := 0; i < cbf.HashFuncNum; i++ {
ind := (h1 + uint32(i)*h2) % uint32(cbf.FilterSize)
result = result && (cbf.counts[ind] > 0)
}
return result
}

Scalable

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
// A scalable bloom filter, which allows adding of
// elements, and checking for their existence
type ScalableBloomFilter struct {
bfArr []BloomFilter // The list of Bloom Filters
HashFuncNum int // Number of hash functions
ElementNum int // Number of elements in the filter
FilterSize int // Size of the smallest bloom filter
MaxFilterNum int // Maximum number of bloom filters to support.
CurrentFilterNum int // Number of bloom filters present in the list.
MultiFactor int // Multiplication factor for new bloom filter sizes
CurrentFilterSize int // Size of the current bloom filter
TargetFPRate float64 // Target False Positive rate / bf
}

// Returns a new Scalable BloomFilter object, if you pass in
// valid values for all the required fields.
// firstBFSize is the size of the first Bloom Filter which
// will be created.
// maxBloomFilters is the upper limit on the number of
// Bloom Filters to create
// growthFactor is the rate at which the Bloom Filter size grows.
// targetFPR is the maximum false positive rate allowed for each
// of the constituent bloom filters, after which a new Bloom
// Filter would be created and used
func NewScalableBloomFilter(numHashFuncs, firstBFSize, maxBloomFilters, growthFactor int, targetFPR float64) *ScalableBloomFilter {
sbf := new(ScalableBloomFilter)
sbf.HashFuncNum, sbf.ElementNum, sbf.FilterSize, sbf.MaxFilterNum, sbf.CurrentFilterNum, sbf.MultiFactor, sbf.TargetFPRate = numHashFuncs, 0, firstBFSize, maxBloomFilters, 1, growthFactor, targetFPR
sbf.CurrentFilterSize = sbf.FilterSize
sbf.bfArr = make([]BloomFilter, 0, maxBloomFilters)
bf := NewBloomFilter(sbf.HashFuncNum, sbf.FilterSize)
sbf.bfArr = append(sbf.bfArr, *bf)
return sbf
}

// Adds an element of type byte-array to the Bloom Filter
func (sbf *ScalableBloomFilter) Add(e []byte) {
inuseFilter := sbf.CurrentFilterNum - 1
fpr := sbf.bfArr[inuseFilter].FalsePositiveRate()
if fpr <= sbf.TargetFPRate {
sbf.bfArr[inuseFilter].Add(e)
sbf.ElementNum++
} else {
if sbf.MaxFilterNum == sbf.CurrentFilterNum {
return
}
sbf.CurrentFilterSize = sbf.CurrentFilterSize * sbf.MultiFactor
bf := NewBloomFilter(sbf.HashFuncNum, sbf.CurrentFilterSize)
sbf.bfArr = append(sbf.bfArr, *bf)
sbf.CurrentFilterNum++
inuseFilter = sbf.CurrentFilterNum - 1
sbf.bfArr[inuseFilter].Add(e)
sbf.ElementNum++
}
}

// Returns the cumulative False Positive Rate of the filter
func (sbf *ScalableBloomFilter) FalsePositiveRate() float64 {
res := 1.0
for i := 0; i < sbf.CurrentFilterNum; i++ {
res *= (1.0 - sbf.bfArr[i].FalsePositiveRate())
}
return 1.0 - res
}

// Checks if an element (in byte-array form) exists
func (sbf *ScalableBloomFilter) Check(e []byte) bool {
for i := 0; i < sbf.CurrentFilterNum; i++ {
if sbf.bfArr[i].Check(e) {
return true
}
}
return false
}

Drawbacks

bloom filter之所以能做到在时间和空间上的效率比较高,是因为牺牲了判断的准确率、删除的便利性

存在误判,可能要查到的元素并没有在容器中,但是hash之后得到的k个位置上值都是1。如果bloom filter中存储的是黑名单,那么可以通过建立一个白名单来存储可能会误判的元素。
删除困难。一个放入容器的元素映射到bit数组的k个位置上是1,删除的时候不能简单的直接置为0,可能会影响其他元素的判断。可以采用Counting Bloom Filter

References