MurmurHash

MurmurHash is a high-performance, low-collision hash algorithm, first released by Austin Appleby in 2008.


Below is his discussion of how some of the algorithm's parameters were found.

Discussion
 

(Note, this was written for MurmurHash 1.0 and is only loosely applicable to 2.0) 

 

I'm going to put random brain dump stuff here.

 

The constants for MurmurHash were found by searching for parameters that fit the following conditions -

1. The mixing step 

   x *= m; 
   x ^= x >> r1;

   should achieve nearly complete avalanche after two iterations.


2. The mixing step

   x *= m;
   x ^= x >> r1;
   x *= m;
   x ^= x >> r2;
   x *= m;
   x ^= x >> r3;

   should achieve nearly complete avalanche.


3. The distribution of the hash as a whole should produce a minimal chi-square value on both easy (dictionary) and hard (sparse) keysets.


A bit of testing indicated that r1 pretty much had to be 16. Once I'd found a constant that passed condition 1, exhaustive search determined that r2 and r3 were best set at 10 and 17. From there I searched iteratively - find a value of m that improves on condition 3, test it against conditions 1 and 2 (searching for new r2 and r3 values), keep the new m if it's acceptable, rinse and repeat.

The current value of m (0xc6a4a793) produces an avalanche bias of 0.15% for condition 2 - I've found constants that produce values as low as 0.09%, but they don't fare as well on the chi-square test.
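To make the avalanche numbers above a bit more concrete, here is a minimal sketch of how such a bias could be measured for the three-round mixing step. The helper names (mix3, identity, next_input, avalanche_bias), the xorshift input generator and the "mean over all bit pairs" definition of bias are assumptions made for illustration; the original text does not say how the 0.15% figure was computed, so don't expect this to reproduce it.

```c
#include <assert.h>
#include <stdint.h>

/* The three-round mixing step from condition 2, using the constants quoted
   in the text (m = 0xc6a4a793, shifts 16, 10, 17). */
static uint32_t mix3(uint32_t x)
{
    const uint32_t m = 0xc6a4a793u;
    x *= m; x ^= x >> 16;
    x *= m; x ^= x >> 10;
    x *= m; x ^= x >> 17;
    return x;
}

/* Baseline with no mixing at all, for comparison. */
static uint32_t identity(uint32_t x) { return x; }

/* Hypothetical helper: xorshift32 generator that supplies test inputs. */
static uint32_t next_input(uint32_t *state)
{
    uint32_t s = *state;
    s ^= s << 13; s ^= s >> 17; s ^= s << 5;
    return *state = s;
}

/* Mean avalanche bias of f over `samples` pseudo-random inputs.
   For every (input bit, output bit) pair, bias = |2 * P(flip) - 1|:
   0.0 means the output bit flips half the time, 1.0 means always or never. */
static double avalanche_bias(uint32_t (*f)(uint32_t), int samples)
{
    unsigned flips[32][32] = {{0}};
    uint32_t state = 0x12345678u;   /* fixed non-zero seed, so runs repeat */
    double total = 0.0;

    assert(samples > 0);
    for (int s = 0; s < samples; s++) {
        uint32_t x = next_input(&state);
        uint32_t hx = f(x);
        for (int i = 0; i < 32; i++) {
            /* Flip input bit i and record which output bits changed. */
            uint32_t diff = hx ^ f(x ^ (UINT32_C(1) << i));
            for (int j = 0; j < 32; j++)
                flips[i][j] += (diff >> j) & 1u;
        }
    }
    for (int i = 0; i < 32; i++) {
        for (int j = 0; j < 32; j++) {
            double b = 2.0 * ((double)flips[i][j] / samples) - 1.0;
            total += (b < 0.0) ? -b : b;
        }
    }
    return total / (32.0 * 32.0);
}
```

Calling avalanche_bias(identity, 1000) gives the worst possible score, since every output bit either always or never flips. Calling avalanche_bias(mix3, 1000) should come out well below that, although with only a few thousand samples the result is dominated by sampling noise.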



Andres Valloud mentioned that the avalanche condition isn't sufficient for proving a hash is random, and he's clearly right - 

unsigned int PassesAvalancheButIsAwful ( const void * blob, int len )
{
    return (MD5(blob,len) & 1) ? 0xFFFFFFFF : 0;
}

will pass the avalanche test even though it can only produce two possible values (it of course would fail the chi-square test).

Similarly, for a given hash table size you can create a hash function that passes the chi-square test with flying colors but fails the avalanche test -

unsigned int PassesChiSquaredButIsAwful ( const void * blob, int len )
{
    return MD5(blob,len) % table_size;
}

I suspect you can construct one that would pass for all table sizes but still fail avalanche, but I'm not certain how to go about that.

Anyhow, it seems that the two tests together do a good job of weeding out poor hash functions - chi-square to catch "random" but non-uniform distributions, avalanche to catch good distributions but poor mixing.
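As a rough illustration of the chi-square side, the sketch below computes a Pearson chi-square statistic over bucket counts against a uniform expectation. Everything here is an assumption for the sake of the example: the helper names, the use of sequential integer keys, and the two toy hash functions. two_valued stands in for the MD5-based two-value hash above (without MD5), and sequential shows a function that fills buckets perfectly evenly while doing no mixing at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*hash_fn)(uint32_t key);

/* Toy hash with only two possible outputs (stand-in for the MD5 example). */
static uint32_t two_valued(uint32_t key) { return (key & 1u) ? 0xFFFFFFFFu : 0u; }

/* Toy hash that returns the key unchanged: no mixing whatsoever. */
static uint32_t sequential(uint32_t key) { return key; }

/* Hash the keys 0 .. nkeys-1 and count how many land in each bucket. */
static void fill_buckets(hash_fn h, uint32_t nkeys, unsigned *counts, size_t buckets)
{
    assert(buckets > 0);
    for (size_t i = 0; i < buckets; i++)
        counts[i] = 0;
    for (uint32_t key = 0; key < nkeys; key++)
        counts[h(key) % buckets]++;
}

/* Pearson chi-square statistic: sum of (observed - expected)^2 / expected,
   where `expected` is the same for every bucket (uniform distribution). */
static double chi_square(const unsigned *counts, size_t buckets)
{
    double total = 0.0, chi = 0.0, expected;

    assert(buckets > 0);
    for (size_t i = 0; i < buckets; i++)
        total += counts[i];
    assert(total > 0.0);

    expected = total / (double)buckets;
    for (size_t i = 0; i < buckets; i++) {
        double d = counts[i] - expected;
        chi += d * d / expected;
    }
    return chi;
}
```

With 1600 sequential keys and 16 buckets, sequential scores a chi-square of 0 (every bucket holds exactly 100 keys), while two_valued puts 800 keys each into buckets 0 and 15 and scores 11200. A real test would use dictionary and sparse keysets rather than sequential integers.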

 

You could also throw the bit independence criterion (BIC) into the mix, which is similar to avalanche but adds an extra dimension - for each 1-bit input differential, compute the output differential and see how often each possible pair of output bits flip - the values 00, 01, 10, and 11 should appear equally.

I'm not certain how applicable this is to hash functions though - Murmur actually has some significant weaknesses in the BIC test, yet they don't appear to affect the quality of its output. Flipping the final mix shift values from (10,17) to (17,10) seems to fix this, at a cost of a few slightly biased (2%) bits in the avalanche result. Doing so doesn't improve any of the actual test results though, so I don't think it's worth worrying about.
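The BIC counting described above can be sketched for a single input bit and a single pair of output bits as follows. The function names, the xorshift input generator and the "worst deviation from 1/4" summary are hypothetical choices for illustration, not the test the author actually ran. A full test would loop over every input bit and every output pair.

```c
#include <assert.h>
#include <stdint.h>

/* Three-round mix with the Murmur 1.0 constants quoted in the text. */
static uint32_t mix3(uint32_t x)
{
    const uint32_t m = 0xc6a4a793u;
    x *= m; x ^= x >> 16;
    x *= m; x ^= x >> 10;
    x *= m; x ^= x >> 17;
    return x;
}

/* Baseline with no mixing at all. */
static uint32_t identity(uint32_t x) { return x; }

/* Hypothetical helper: xorshift32 generator that supplies test inputs. */
static uint32_t next_input(uint32_t *state)
{
    uint32_t s = *state;
    s ^= s << 13; s ^= s >> 17; s ^= s << 5;
    return *state = s;
}

/* Flip input bit `in_bit` on `samples` inputs and tally how output bits
   `out_a` and `out_b` react. counts[0..3] correspond to the pairs
   00, 01, 10, 11 (first digit: out_a flipped, second: out_b flipped). */
static void bic_pair_counts(uint32_t (*f)(uint32_t), int in_bit,
                            int out_a, int out_b, int samples,
                            unsigned counts[4])
{
    uint32_t state = 0x9e3779b9u;   /* fixed non-zero seed, so runs repeat */

    counts[0] = counts[1] = counts[2] = counts[3] = 0;
    for (int s = 0; s < samples; s++) {
        uint32_t x = next_input(&state);
        uint32_t diff = f(x) ^ f(x ^ (UINT32_C(1) << in_bit));
        unsigned a = (diff >> out_a) & 1u;
        unsigned b = (diff >> out_b) & 1u;
        counts[(a << 1) | b]++;
    }
}

/* Largest distance of any of the four frequencies from the ideal 1/4. */
static double bic_worst_deviation(const unsigned counts[4], int samples)
{
    double worst = 0.0;

    assert(samples > 0);
    for (int c = 0; c < 4; c++) {
        double d = (double)counts[c] / samples - 0.25;
        if (d < 0.0) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

For identity with input bit 3 and output bits (3, 5), every sample lands in the "10" slot, so the worst deviation is 0.75. For a well-mixed function the four counts should be roughly equal and the deviation close to zero.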



code:

//-----------------------------------------------------------------------------
// MurmurHash2, by Austin Appleby

// Note - This code makes a few assumptions about how your machine behaves -

// 1. We can read a 4-byte value from any address without crashing
// 2. sizeof(int) == 4

// And it has a few limitations -

// 1. It will not work incrementally.
// 2. It will not produce the same results on little-endian and big-endian
//    machines.

unsigned int MurmurHash2 ( const void * key, int len, unsigned int seed )
{
	// 'm' and 'r' are mixing constants generated offline.
	// They're not really 'magic', they just happen to work well.

	const unsigned int m = 0x5bd1e995;
	const int r = 24;

	// Initialize the hash to a 'random' value

	unsigned int h = seed ^ len;

	// Mix 4 bytes at a time into the hash

	const unsigned char * data = (const unsigned char *)key;

	while(len >= 4)
	{
		unsigned int k = *(unsigned int *)data;

		k *= m; 
		k ^= k >> r; 
		k *= m; 
		
		h *= m; 
		h ^= k;

		data += 4;
		len -= 4;
	}
	
	// Handle the last few bytes of the input array

	switch(len)
	{
	case 3: h ^= data[2] << 16; // fall through
	case 2: h ^= data[1] << 8;  // fall through
	case 1: h ^= data[0];
	        h *= m;
	}

	// Do a few final mixes of the hash to ensure the last few
	// bytes are well-incorporated.

	h ^= h >> 13;
	h *= m;
	h ^= h >> 15;

	return h;
} 
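The two assumptions and the endianness limitation listed in the header comment can be sidestepped by assembling each 4-byte block from individual bytes. The following is a sketch of that idea, not the author's code: the name murmur2_bytewise is made up here, it uses uint32_t instead of relying on sizeof(int) == 4, and it takes a size_t length (so it only matches the function above for lengths that fit in an int). On a little-endian machine it should produce the same values as MurmurHash2 above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte-wise variant of MurmurHash2: no unaligned 4-byte reads,
   and the same result regardless of the machine's byte order. */
static uint32_t murmur2_bytewise(const void *key, size_t len, uint32_t seed)
{
    const uint32_t m = 0x5bd1e995u;   /* same mixing constants as above */
    const int r = 24;
    const unsigned char *data = (const unsigned char *)key;
    uint32_t h = seed ^ (uint32_t)len;

    assert(key != NULL || len == 0);

    /* Mix 4 bytes at a time, reading them in little-endian order. */
    while (len >= 4) {
        uint32_t k = (uint32_t)data[0]
                   | (uint32_t)data[1] << 8
                   | (uint32_t)data[2] << 16
                   | (uint32_t)data[3] << 24;

        k *= m;
        k ^= k >> r;
        k *= m;

        h *= m;
        h ^= k;

        data += 4;
        len -= 4;
    }

    /* Handle the last 1 to 3 bytes; the cases fall through on purpose. */
    switch (len) {
    case 3: h ^= (uint32_t)data[2] << 16; /* fall through */
    case 2: h ^= (uint32_t)data[1] << 8;  /* fall through */
    case 1: h ^= data[0];
            h *= m;
    }

    /* Final mixes, identical to the original. */
    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;

    return h;
}
```

For example, murmur2_bytewise("", 0, 0) returns 0, because every step of the final mix maps zero to zero. Hashing the same bytes from an aligned and a misaligned buffer gives the same result, which the pointer-cast version does not guarantee on every platform.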

