Reference: http://www.isthe.com/chongo/tech/comp/fnv/
For details of the FNV hash algorithm, see the reference. This post only records how FNV hash values are distributed.
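The FNV hash algorithm maps a string to a deterministic unsigned integer: the same input always yields the same value, although distinct inputs can collide. For a large number of random input strings, such as UUID strings, reducing these integers with a simple modulo gives a roughly uniform distribution. For example, compute the FNV hash of 100,000 UUID strings and apply hashValue %= 10000 to each result hashValue: the results are spread roughly evenly over 0 ~ 9,999. Note, however, that this is only "roughly" uniform; a small bias remains.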
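On the reference page, Landon describes two approaches in detail: the lazy mod mapping method, which takes the remainder directly, and the retry method.
The lazy mod mapping method (using 32-bit FNV-1 and a target range of 0~2142779559 as the example) works like this: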
    #define TRUE_HASH_SIZE ((u_int32_t)2142779560) /* range top plus 1 */
    #define FNV1_32_INIT ((u_int32_t)2166136261)

    u_int32_t hash;
    void *data;
    size_t data_len;

    hash = fnv_32_buf(data, data_len, FNV1_32_INIT);
    hash %= TRUE_HASH_SIZE;
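The retry method (using 32-bit FNV-1 and a target range of 0~49999 as the example) rehashes any value at or above RETRY_LEVEL, the largest multiple of TRUE_HASH_SIZE that does not exceed MAX_32BIT, so that every result in the target range is produced by the same number of accepted hash values:

    #define TRUE_HASH_SIZE ((u_int32_t)50000) /* range top plus 1 */
    #define FNV_32_PRIME ((u_int32_t)16777619)
    #define FNV1_32_INIT ((u_int32_t)2166136261)
    #define MAX_32BIT ((u_int32_t)0xffffffff) /* largest 32 bit unsigned value */
    #define RETRY_LEVEL ((MAX_32BIT / TRUE_HASH_SIZE) * TRUE_HASH_SIZE)

    u_int32_t hash;
    void *data;
    size_t data_len;

    hash = fnv_32_buf(data, data_len, FNV1_32_INIT);
    while (hash >= RETRY_LEVEL) {
        hash = (hash * FNV_32_PRIME) + FNV1_32_INIT;
    }
    hash %= TRUE_HASH_SIZE;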
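The two fragments above depend on fnv_32_buf() from the reference code. As a rough, self-contained sketch of the same two mapping methods, the block below inlines a 32-bit FNV-1 loop. The function names Fnv1_32, LazyMod, and RetryMod are made up for this illustration and are not part of the reference code; size is assumed to be nonzero.

```cpp
#include <cstdint>
#include <string>

// 32-bit FNV constants, as used in the fragments above.
constexpr std::uint32_t kFnv32Prime = 16777619u;
constexpr std::uint32_t kFnv1_32Init = 2166136261u;

// 32-bit FNV-1: multiply by the prime first, then xor in the next byte.
// (Hypothetical helper standing in for fnv_32_buf.)
inline std::uint32_t Fnv1_32(const std::string& data) {
    std::uint32_t hash = kFnv1_32Init;
    for (unsigned char byte : data) {
        hash *= kFnv32Prime;
        hash ^= byte;
    }
    return hash;
}

// Lazy mod mapping: take the remainder directly. Slightly biased unless
// size divides 2^32 evenly.
inline std::uint32_t LazyMod(const std::string& data, std::uint32_t size) {
    return Fnv1_32(data) % size;
}

// Retry method: rehash values at or above the retry level, so that the
// accepted hash values split evenly across the size possible results.
inline std::uint32_t RetryMod(const std::string& data, std::uint32_t size) {
    const std::uint32_t retry_level = (UINT32_MAX / size) * size;
    std::uint32_t hash = Fnv1_32(data);
    while (hash >= retry_level) {
        hash = hash * kFnv32Prime + kFnv1_32Init;
    }
    return hash % size;
}
```

For most inputs the two functions return the same value; they differ only when the raw hash lands in the small region at or above the retry level.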
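The lazy mod mapping method carries a slight bias whenever the size of the target range does not divide the number of possible hash values evenly. For a 32-bit FNV hash mapped onto 0~999999, the reference page says:
The values 0 through 967295 will be created by 4295 different 32-bit FNV hash values whereas the values 967296 through 999999 will be created by only 4294 different 32-bit FNV hash values. In other words, the values 0 through 967295 will occur ~1.0002328 times as often as the values 967296 through 999999.
That is, results in 0~967295 are very slightly denser (by about 0.023%) than results in 967296~999999.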
For the 64-bit case, take a target range of 0~9999999999999999999 (10^19 values) as the example:
The values 0 through 9999999999999999999 will be created by 2 different 64-bit FNV hash values whereas the values 10000000000000000000 through 18446744073709551615 will be created by only 1 64-bit FNV hash value.
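Here the distribution is far more uneven. Working it through: 2^64 = 18446744073709551616, so only the hash values 10000000000000000000 through 18446744073709551615 wrap around under the modulo, and they land on 0 through 8446744073709551615. Results in 0~8446744073709551615 are therefore produced by 2 hash values each, while results in 8446744073709551616~9999999999999999999 are produced by only 1, so the lower part of the range occurs twice as often.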
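The bias figures for both the 32-bit and the 64-bit example can be checked with a few lines of integer arithmetic. The sketch below is only an illustration; ModBias and ComputeModBias are hypothetical names, and the code assumes size is nonzero and no larger than max_hash. It works from the largest hash value rather than the number of hash values, because 2^64 itself does not fit in a 64-bit integer.

```cpp
#include <cstdint>

// How lazy mod mapping spreads (max_hash + 1) hash values over size results.
struct ModBias {
    std::uint64_t boundary;    // results in [0, boundary) get one extra hash value
    std::uint64_t count_low;   // hash values per result in [0, boundary)
    std::uint64_t count_high;  // hash values per result in [boundary, size)
};

// Splits max_hash + 1 = q * size + r without ever forming max_hash + 1.
constexpr ModBias ComputeModBias(std::uint64_t max_hash, std::uint64_t size) {
    std::uint64_t q = max_hash / size;
    std::uint64_t r = max_hash % size + 1;
    if (r == size) {  // size divides the hash space evenly: no bias
        q += 1;
        r = 0;
    }
    return ModBias{r, (r == 0) ? q : q + 1, q};
}

// 32-bit hash mapped onto 0~999999: 4295 vs 4294 hash values per result.
static_assert(ComputeModBias(UINT32_MAX, 1000000).boundary == 967296, "");
static_assert(ComputeModBias(UINT32_MAX, 1000000).count_low == 4295, "");
static_assert(ComputeModBias(UINT32_MAX, 1000000).count_high == 4294, "");
```

Calling ComputeModBias(UINT64_MAX, 10000000000000000000ULL) gives a boundary of 8446744073709551616 with counts of 2 and 1, which matches the 64-bit arithmetic above.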
At the same time, Landon also notes:
NOTE: This bias issue may not be of concern to you, but we thought we should point out this issue just in case you care. Many applications should / will not care about this bias. Most applications can use the lazy mod mapping method without any problems. Your application, may vary however.
NOTE: One may substitute the FNV-1a hash for the FNV-1 hash in any of the lazy mod mapping method examples. Some people believe that FNV-1a lazy mod mapping method gives then slightly better dispersion without any impact on CPU performance. See the FNV-1a hash description for more information.
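In other words, this "slight" unevenness does not matter for most applications. In addition, some people believe that FNV-1a with the lazy mod mapping method gives slightly better dispersion than FNV-1, at no extra CPU cost.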
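The only difference between FNV-1 and FNV-1a is the order of the multiply and the xor inside the loop. The sketch below puts the two 32-bit variants side by side; the function names are chosen for this illustration, and both variants start from the same initial value, the FNV1_32_INIT constant shown earlier.

```cpp
#include <cstdint>
#include <string>

constexpr std::uint32_t kPrime32 = 16777619u;          // FNV_32_PRIME
constexpr std::uint32_t kOffsetBasis32 = 2166136261u;  // FNV1_32_INIT

// FNV-1: multiply, then xor.
inline std::uint32_t Fnv1_32(const std::string& s) {
    std::uint32_t hash = kOffsetBasis32;
    for (unsigned char byte : s) {
        hash *= kPrime32;
        hash ^= byte;
    }
    return hash;
}

// FNV-1a: xor, then multiply.
inline std::uint32_t Fnv1a_32(const std::string& s) {
    std::uint32_t hash = kOffsetBasis32;
    for (unsigned char byte : s) {
        hash ^= byte;
        hash *= kPrime32;
    }
    return hash;
}
```

Since the loops do the same amount of work per byte, swapping one variant for the other in a lazy mod mapping costs nothing in CPU time; only the resulting hash values change.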
================================================
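Sample code for the 32-bit hash follows. Note that the loop xors each byte in first and then multiplies, which is the FNV-1a order; FNV-1 multiplies first.
(libuuid must be installed first, e.g. yum install libuuid-devel.x86_64, and the program must be linked with -luuid.)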
    #include <cstdint>
    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <uuid/uuid.h>

    using namespace std;

    // Must be exactly 32 bits wide for the 32-bit FNV hash.
    typedef uint32_t DWORD;

    const int range = 8;

    // 32-bit FNV-1a: xor in each byte, then multiply by the FNV prime.
    DWORD Hash4Bytes(const string &key)
    {
        DWORD result = 2166136261u;          // 32-bit initial value
        for (size_t i = 0; i < key.size(); ++i) {
            result ^= (unsigned char)key[i]; // avoid sign extension of char
            result *= 16777619u;             // 32-bit FNV prime
        }
        return result;
    }

    // Map a key to a bucket index in [0, range) by lazy mod mapping.
    int Disperse(const string &key)
    {
        return Hash4Bytes(key) % range;
    }

    int main(int argc, char *argv[])
    {
        long key_count = 10000;
        if (argc > 1) {
            key_count = atol(argv[1]);
        }

        uuid_t uuid;
        char str[37];                        // 36 characters plus the trailing NUL
        long stat[range] = {0};              // number of keys per bucket

        for (long i = 0; i < key_count; ++i) {
            uuid_generate(uuid);
            uuid_unparse(uuid, str);
            stat[Disperse(str)]++;
        }

        cout << "Range: 0 ~ " << range - 1 << endl;
        cout << "Key count: " << key_count << endl;
        for (int i = 0; i < range; ++i) {
            cout << "Index #" << i << ": " << stat[i] << endl;
        }
        cout << endl;
        return 0;
    }
    [root@amons02 fnv]# ./t 1000000
    Range: 0 ~ 7
    Key count: 1000000
    Index #0: 124995
    Index #1: 125483
    Index #2: 124735
    Index #3: 124692
    Index #4: 124920
    Index #5: 124956
    Index #6: 124912
    Index #7: 125307
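For the code above, if the goal of running ./t 1000000 is to spread 1,000,000 random UUID strings across the range 0~7, the result shows an acceptable distribution: every bucket is within about 0.4% of the expected 125,000.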
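Note:
The return type of Hash4Bytes() in the code above must be a 32-bit unsigned integer, and the result variable inside the function must be a 32-bit unsigned integer as well, because this is the 32-bit FNV hash (the xor-then-multiply order used here is the FNV-1a variant). Do not use std::size_t: on typical 64-bit machines, sizeof(std::size_t) is 8!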
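To see what goes wrong with a wider accumulator, the sketch below runs the same loop with a 32-bit and a 64-bit result variable. HashNarrow and HashWide are hypothetical names for this comparison. The low 32 bits of the two results always agree, because xor and multiplication never carry information from high bits into low bits, but the upper bits of the wide value are not part of the 32-bit hash and change the remainder for any range that is not a power of two.

```cpp
#include <cstdint>
#include <string>

// Same xor-then-multiply loop as Hash4Bytes(), with a 32-bit accumulator.
inline std::uint32_t HashNarrow(const std::string& key) {
    std::uint32_t result = 2166136261u;
    for (unsigned char byte : key) {
        result ^= byte;
        result *= 16777619u;  // wraps modulo 2^32
    }
    return result;
}

// The same loop with a 64-bit accumulator, as would happen with std::size_t
// on a typical 64-bit machine. This is NOT the 32-bit FNV hash.
inline std::uint64_t HashWide(const std::string& key) {
    std::uint64_t result = 2166136261u;
    for (unsigned char byte : key) {
        result ^= byte;
        result *= 16777619u;  // wraps modulo 2^64 instead
    }
    return result;
}
```

With range = 8, as in the sample program, HashWide(key) % 8 and HashNarrow(key) % 8 happen to agree because 8 is a power of two. With a range such as 10 they generally do not, so the bucket assignment would silently depend on the platform's word size.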