C# GetHashCode in the IEqualityComparer in .NET

http://stackoverflow.com/questions/4095395/whats-the-role-of-gethashcode-in-the-iequalitycomparert-in-net?rq=1

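Question:

Why does IEqualityComparer<T> have both an Equals method and a GetHashCode method, and how do the two differ?

Put another way, why is Equals alone not enough to decide whether two objects are equivalent?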


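Example (taken from MSDN):

Decide whether two Box instances, redBox and blueBox, are equivalent, where equivalent means the two boxes have the same height, length, and width.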

using System;
using System.Collections.Generic;
class Example {
    static void Main() {
        try {

            BoxEqualityComparer boxEqC = new BoxEqualityComparer();

            Dictionary<Box, string> boxes = new Dictionary<Box, string>(boxEqC);

            Box redBox = new Box(4, 3, 4);
            Box blueBox = new Box(4, 3, 4);

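            boxes.Add(redBox, "red");
            // Throws ArgumentException: blueBox is equivalent to redBox under boxEqC.
            boxes.Add(blueBox, "blue");

            // Not reached, because the Add call above throws.
            Console.WriteLine(redBox.GetHashCode());
            Console.WriteLine(blueBox.GetHashCode());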
        }
        catch (ArgumentException argEx) {

            Console.WriteLine(argEx.Message);
        }
    }
}

public class Box {
    public Box(int h, int l, int w) {
        this.Height = h;
        this.Length = l;
        this.Width = w;
    }
    public int Height { get; set; }
    public int Length { get; set; }
    public int Width { get; set; }
}

class BoxEqualityComparer : IEqualityComparer<Box> {

    public bool Equals(Box b1, Box b2) {
        // Two boxes are equivalent when all three dimensions match.
        return b1.Height == b2.Height
            && b1.Length == b2.Length
            && b1.Width == b2.Width;
    }

    public int GetHashCode(Box bx) {
        int hCode = bx.Height ^ bx.Length ^ bx.Width;
        return hCode.GetHashCode();
    }
}


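Background:

Every .NET object has two methods: Equals and GetHashCode.

Equals compares two objects to decide whether they are equivalent.

GetHashCode produces a 32-bit integer for an object, but it may produce the same integer for two different objects.


A Dictionary is essentially a hash table: the position of each (key, value) entry is computed by calling GetHashCode on the key.

When constructing a dictionary, we can pass in a comparer that supplies our own GetHashCode (and Equals), as in the example:

Dictionary<Box, string> boxes = new Dictionary<Box, string>(boxEqC);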


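Answering the question: why do we need both Equals and GetHashCode?

Answer: when we insert a new object into the dictionary, we first compute its hash code with GetHashCode, and the hash code determines its position in the hash table. In the example,

Box redBox = new Box(4, 3, 4);
our algorithm gives the hash code (4^3^4).GetHashCode() = 3, and redBox is placed at the corresponding position.

Then we insert another object into the dictionary:

Box blueBox = new Box(4, 3, 4);
Again we first compute the hash code (4^3^4).GetHashCode() = 3, and find that this position already holds an object, redBox.

Now we have to decide whether blueBox and redBox can be told apart, and this is where Equals comes in.

According to the Equals method defined in the example, blueBox and redBox are equivalent, so the two Box objects cannot be told apart.

Both Box objects would go into the same position in the hash table, yet we cannot distinguish them. Storing them would be possible, but which one should a lookup return?

So the program has to throw an exception (an ArgumentException from Add) to prevent this.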


In another scenario, however:

BoxEqualityComparer boxEqC = new BoxEqualityComparer(); 

Dictionary<Box, String> boxes = new Dictionary<Box, string>(boxEqC); 

Box redBox = new Box(100, 100, 25);
Box blueBox = new Box(1000, 1000, 25);

boxes.Add(redBox, "red"); 
boxes.Add(blueBox, "blue"); 
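Here redBox and blueBox again get the same hash code: (100^100^25).GetHashCode() = (1000^1000^25).GetHashCode() = 25.GetHashCode() = 25. Note that ^ is bitwise XOR.

So redBox and blueBox are again placed at the same position in the hash table, but this time Equals says the two objects can be told apart, so they can be stored together (for example, by separate chaining).

On lookup, we go to the position for that hash code, compare every object stored there using Equals, and retrieve the correct one.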
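
The insert and lookup steps described above can be sketched with a toy chained hash table. This is only an illustration: TinyTable, its fixed 16 buckets, and TinyTableDemo are hypothetical names invented here, and this is not how Dictionary<TKey, TValue> is actually implemented. The comparer follows the same rules as the one in the example.

```csharp
using System;
using System.Collections.Generic;

public class Box {
    public Box(int h, int l, int w) { Height = h; Length = l; Width = w; }
    public int Height { get; }
    public int Length { get; }
    public int Width { get; }
}

// Same equality and hashing rules as the comparer in the example.
public class BoxEqualityComparer : IEqualityComparer<Box> {
    public bool Equals(Box b1, Box b2) =>
        b1.Height == b2.Height && b1.Length == b2.Length && b1.Width == b2.Width;
    public int GetHashCode(Box bx) => bx.Height ^ bx.Length ^ bx.Width;
}

// Hypothetical toy table for illustration only.
public class TinyTable {
    private readonly List<KeyValuePair<Box, string>>[] buckets =
        new List<KeyValuePair<Box, string>>[16];
    private readonly IEqualityComparer<Box> comparer;

    public TinyTable(IEqualityComparer<Box> comparer) { this.comparer = comparer; }

    // GetHashCode only decides which bucket to look in.
    private List<KeyValuePair<Box, string>> BucketFor(Box key) {
        int index = (comparer.GetHashCode(key) & 0x7FFFFFFF) % buckets.Length;
        return buckets[index] ?? (buckets[index] = new List<KeyValuePair<Box, string>>());
    }

    public void Add(Box key, string value) {
        var bucket = BucketFor(key);
        // Equals decides whether an entry in the bucket is "the same key".
        foreach (var entry in bucket)
            if (comparer.Equals(entry.Key, key))
                throw new ArgumentException("An equivalent key has already been added.");
        bucket.Add(new KeyValuePair<Box, string>(key, value));
    }

    public string Get(Box key) {
        foreach (var entry in BucketFor(key))
            if (comparer.Equals(entry.Key, key)) return entry.Value;
        throw new KeyNotFoundException();
    }
}

public static class TinyTableDemo {
    public static void Main() {
        var table = new TinyTable(new BoxEqualityComparer());
        table.Add(new Box(100, 100, 25), "red");
        table.Add(new Box(1000, 1000, 25), "blue");  // same bucket, different key
        Console.WriteLine(table.Get(new Box(1000, 1000, 25)));  // prints "blue"
    }
}
```

Both boxes hash to 25 and share a bucket, yet each lookup still finds the right value because Equals is applied to every entry in that bucket.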


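So GetHashCode is essentially about finding the position, while Equals is what actually decides equivalence.

However, because two objects with different hash codes can never be equivalent (given a consistent implementation), GetHashCode appears to take part in the equivalence check as well.

Note that the converse does not hold: two objects with the same hash code are not necessarily equivalent!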


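Computing hash codes has long been a technical challenge. A good GetHashCode should reduce collisions, avoiding as far as possible giving different objects the same hash code.

The GetHashCode defined in the example is not a good one: two different objects,

Box redBox = new Box(100, 100, 25);
Box blueBox = new Box(1000, 1000, 25);
end up with the same hash code.

In textbook terms, the hash codes our GetHashCode produces are not evenly distributed.

So it is common to use some numbers as multipliers to adjust the distribution.

What does a multiplier do? In binary, the answer is bit shifts and additions.

Looking at 32-bit binary, suppose obj.Id.GetHashCode() = 594 = 0000 0000 0000 0000 0000 0010 0101 0010

594*2 = 1188 = 0000 0000 0000 0000 0000 0100 1010 0100

which is simply a shift left by one bit.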
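
A small sketch to check the arithmetic above. ShiftDemo, TimesTwo, and TimesThree are hypothetical names used only for this illustration; multiplying by 3 is included as the simplest case that needs both a shift and an addition.

```csharp
using System;

public static class ShiftDemo {
    // Multiplying by 2 is a shift left by one bit.
    public static int TimesTwo(int x) => x << 1;

    // Multiplying by 3 is a shift left by one bit plus one addition.
    public static int TimesThree(int x) => (x << 1) + x;

    public static void Main() {
        int hash = 594;
        // Print both values as 32-bit binary strings to see the shift.
        Console.WriteLine(Convert.ToString(hash, 2).PadLeft(32, '0'));
        Console.WriteLine(Convert.ToString(hash * 2, 2).PadLeft(32, '0'));
        Console.WriteLine(hash * 2 == TimesTwo(hash));    // True
        Console.WriteLine(hash * 3 == TimesThree(hash));  // True
    }
}
```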


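In particular, what we see in code is (obj.OfficeId.GetHashCode()*397)

397 is a prime. Multiplying by a prime is generally believed to prevent clustered hash values; the reasoning is that a prime (other than 2) is odd, so the multiplication produces both shifts and additions.

When the prime is large and far from a power of 2, multiplying by it can scramble the hash value to a considerable degree and reduce collisions.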
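
As a rough sketch of how such a multiplier could be applied to the Box example, the comparer below mixes each field with a multiplication by 397 instead of plain XOR. MultiplyingBoxComparer and MultiplierDemo are hypothetical names, and this is one possible variant rather than a recommended or official implementation; whether it distributes well for real data would still need to be measured.

```csharp
using System;
using System.Collections.Generic;

public class Box {
    public Box(int h, int l, int w) { Height = h; Length = l; Width = w; }
    public int Height { get; }
    public int Length { get; }
    public int Width { get; }
}

// Hypothetical variant of the example comparer: same Equals, different hash mixing.
public class MultiplyingBoxComparer : IEqualityComparer<Box> {
    public bool Equals(Box b1, Box b2) =>
        b1.Height == b2.Height && b1.Length == b2.Length && b1.Width == b2.Width;

    public int GetHashCode(Box bx) {
        unchecked {  // let the multiplication wrap around instead of throwing
            int hash = bx.Height;
            hash = (hash * 397) ^ bx.Length;
            hash = (hash * 397) ^ bx.Width;
            return hash;
        }
    }
}

public static class MultiplierDemo {
    // 397 = 256 + 128 + 8 + 4 + 1, so x * 397 is four shifts plus additions.
    public static int ShiftAdd397(int x) =>
        unchecked((x << 8) + (x << 7) + (x << 3) + (x << 2) + x);

    public static void Main() {
        var comparer = new MultiplyingBoxComparer();
        // With plain XOR both boxes hashed to 25; here the two values differ.
        Console.WriteLine(comparer.GetHashCode(new Box(100, 100, 25)));
        Console.WriteLine(comparer.GetHashCode(new Box(1000, 1000, 25)));
    }
}
```

Because the running hash is multiplied before each new field is mixed in, repeated field values such as Height == Length no longer cancel each other out the way they do under XOR alone.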


What makes a good GetHashCode?

http://stackoverflow.com/questions/263400/what-is-the-best-algorithm-for-an-overridden-system-object-gethashcode/263416#263416

http://blogs.msdn.com/b/ericlippert/archive/2011/02/28/guidelines-and-rules-for-gethashcode.aspx

"In particular, be careful of "xor". It is very common to combine hash codes together by xoring them, but that is not necessarily a good thing. Suppose you have a data structure that contains strings for shipping address and home address. Even if the hash algorithm on the individual strings is really good, if the two strings are frequently the same then xoring their hashes together is frequently going to produce zero. "xor" can create or exacerbate distribution problems when there is redundancy in data structures."
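
The XOR pitfall in the quote above can be reproduced in a few lines. The address string and the names XorPitfallDemo, XorCombine, and MultiplyCombine are made up for this illustration; MultiplyCombine is only one assumed alternative, not the approach the quoted article prescribes.

```csharp
using System;

public static class XorPitfallDemo {
    // Combines two string hashes with XOR alone.
    public static int XorCombine(string a, string b) =>
        a.GetHashCode() ^ b.GetHashCode();

    // Multiplies the first hash before mixing in the second,
    // so identical inputs no longer cancel each other.
    public static int MultiplyCombine(int h1, int h2) =>
        unchecked((h1 * 397) ^ h2);

    public static void Main() {
        string shipping = "1 Example Street";  // hypothetical data
        string home = "1 Example Street";
        // Equal strings have equal hash codes, and x ^ x is always 0.
        Console.WriteLine(XorCombine(shipping, home));  // prints 0
        Console.WriteLine(MultiplyCombine(shipping.GetHashCode(), home.GetHashCode()));
    }
}
```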


“There's some black magic here. First off, note that multiplication is nothing more than repeated bit shifts and adds; multiplying by 33 is just shifting by five bits and adding. Basically this means "mess up the top 27 bits and keep the bottom 5 the same" in the hopes that the subsequent add will mess up the lower 5 bits. Multiplying by a largish prime has the nice property that it messes up all the bits. I'm not sure where the number 33 comes from though.

I suspect that prime numbers turn up in hash algorithms as much by tradition and superstition as by science. Using a prime number as the modulus and a different one as the multiplier can apparently help avoid clustering in some scenarios. But basically there's no substitute for actually trying out the algorithm with a lot of real-world data and seeing whether it gives you a good distribution. - Eric”

