Item 10: Understand the Pitfalls of GetHashCode()
This is the only item in this book dedicated to one function that you should avoid writing. GetHashCode() is used in one place only: to define the hash value for keys in a hash-based collection, typically the Hashtable or Dictionary containers. That's good because there are a number of problems with the base class implementation of GetHashCode(). For reference types, it works but is inefficient. For value types, the base class version is often incorrect. But it gets worse. It's entirely possible that you cannot write GetHashCode() so that it is both efficient and correct. No single function generates more discussion and more confusion than GetHashCode(). Read on to remove all that confusion.
If you're defining a type that won't ever be used as the key in a container, this won't matter. Types that represent window controls, web page controls, or database connections are unlikely to be used as keys in a collection. In those cases, do nothing. All reference types will have a hash code that is correct, even if it is very inefficient. Value types should be immutable (see Item 7), in which case the default implementation always works, although it is also inefficient. For most types that you create, the best approach is to avoid writing GetHashCode() entirely.
One day, you'll create a type that is meant to be used as a hashtable key, and you'll need to write your own implementation of GetHashCode(), so read on. Hash-based containers use hash codes to optimize searches. Every object generates an integer value called a hash code. Objects are stored in buckets based on the value of that hash code. To search for an object, you supply its key, and the container searches just the one bucket that the key's hash code selects. In .NET, every object has a hash code, determined by System.Object.GetHashCode(). Any override of GetHashCode() must follow these three rules:
1. If two objects are equal (as defined by operator==), they must generate the same hash value. Otherwise, hash codes can't be used to find objects in containers.
2. For any object A, A.GetHashCode() must be an instance invariant. No matter what methods are called on A, A.GetHashCode() must always return the same value. That ensures that an object placed in a bucket is always in the right bucket.
3. The hash function should generate a random distribution among all integers for all inputs. That's how you get efficiency from a hash-based container.
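To make the role of the bucket concrete, here is a minimal sketch of how a hash-based container might map a hash code to a bucket. The `BucketSketch` class and its `BucketFor` method are hypothetical names invented for this illustration; the real Hashtable and Dictionary implementations differ in detail.

```csharp
using System;

// Hypothetical illustration only: not the actual Hashtable or Dictionary code.
public static class BucketSketch
{
    // Maps a key's hash code onto one of bucketCount buckets.
    public static int BucketFor(object key, int bucketCount)
    {
        // Clear the sign bit so the remainder is never negative.
        int hash = key.GetHashCode() & 0x7FFFFFFF;
        return hash % bucketCount;
    }
}
```

Seen this way, each rule has a visible purpose. Equal keys must land in the same bucket (rule 1), a key's bucket must not move after it is stored (rule 2), and hash codes that cluster together leave a few buckets overfull while the rest sit empty (rule 3).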
Writing a correct and efficient hash function requires extensive knowledge of the type to ensure that rule 3 is followed. The versions defined in System.Object and System.ValueType do not have that advantage. These versions must provide the best default behavior with almost no knowledge of your particular type. Object.GetHashCode() uses an internal field in the System.Object class to generate the hash value. Each object created is assigned a unique object key, stored as an integer, when it is created. These keys start at 1 and increment every time a new object of any type gets created. The object identity field is set in the System.Object constructor and cannot be modified later. Object.GetHashCode() returns this value as the hash code for a given object.
Now examine Object.GetHashCode() in light of those three rules. If two objects are equal, Object.GetHashCode() returns the same hash value, unless you've overridden operator==. System.Object's version of operator==() tests object identity. GetHashCode() returns the internal object identity field. It works. However, if you've supplied your own version of operator==, you must also supply your own version of GetHashCode() to ensure that the first rule is followed. See Item 9 for details on equality.
The second rule is followed: After an object is created, its hash code never changes.
The third rule, a random distribution among all integers for all inputs, does not hold. A numeric sequence is not a random distribution among all integers unless you create an enormous number of objects. The hash codes generated by Object.GetHashCode() are concentrated at the low end of the range of integers.
This means that Object.GetHashCode() is correct but not efficient. If you create a hashtable based on a reference type that you define, the default behavior from System.Object is a working, but slow, hashtable. When you create reference types that are meant to be hash keys, you should override GetHashCode() to get a better distribution of the hash values across all integers for your specific type.
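The correctness half of that claim can be checked with a small sketch. The `Widget` type and the `DefaultHashDemo` helper are hypothetical names; `Widget` deliberately overrides neither Equals() nor GetHashCode(), so it relies entirely on the System.Object defaults.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type that relies on the System.Object defaults.
public class Widget
{
    public string Label;
    public Widget(string label) { Label = label; }
}

public static class DefaultHashDemo
{
    // True when the default hash code is unchanged by a field update.
    public static bool HashIsStable()
    {
        Widget w = new Widget("first");
        int before = w.GetHashCode();
        w.Label = "second";   // changes state, not identity
        return before == w.GetHashCode();
    }

    // True when a stored key is still found after it has been modified.
    public static bool KeyStillFound()
    {
        Dictionary<Widget, int> map = new Dictionary<Widget, int>();
        Widget w = new Widget("first");
        map.Add(w, 1);
        w.Label = "second";
        return map.ContainsKey(w);
    }
}
```

Both methods return true: the default hash code is tied to the object's identity, so changing a field cannot move the object to a different bucket.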
Before covering how to write your own override of GetHashCode(), this section examines ValueType.GetHashCode() with respect to those same three rules. System.ValueType overrides GetHashCode(), providing the default behavior for all value types. Its version returns the hash code from the first field defined in the type. Consider this example:
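A minimal sketch of such a struct follows. Only the `_msg` and `_epoch` fields are named in the surrounding text; the `_id` field, the constructor, and the read-only properties are assumptions added so that the type compiles and can be used on its own.

```csharp
using System;

// Sketch of the example struct; _id, the constructor, and the
// properties are assumptions added for illustration.
public struct MyStruct
{
    private readonly string _msg;      // first field declared in the type
    private readonly int _id;
    private readonly DateTime _epoch;

    public MyStruct(string msg, int id, DateTime epoch)
    {
        _msg = msg;
        _id = id;
        _epoch = epoch;
    }

    public string Msg { get { return _msg; } }
    public int Id { get { return _id; } }
    public DateTime Epoch { get { return _epoch; } }
}
```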
The hash code returned from a MyStruct object is the hash code generated by the _msg field. The following code snippet always returns true:
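A sketch of that comparison is shown here. The struct is repeated so that the block compiles on its own, and the `FirstFieldCheck` helper is a hypothetical name. No result is asserted in the code, because the outcome depends on how the runtime in use implements ValueType.GetHashCode().

```csharp
using System;

// Same assumed shape as the example struct, repeated for completeness.
public struct MyStruct
{
    private readonly string _msg;
    private readonly int _id;
    private readonly DateTime _epoch;

    public MyStruct(string msg, int id, DateTime epoch)
    {
        _msg = msg;
        _id = id;
        _epoch = epoch;
    }

    public string Msg { get { return _msg; } }
    public int Id { get { return _id; } }
    public DateTime Epoch { get { return _epoch; } }
}

public static class FirstFieldCheck
{
    // Compares the struct's hash code with the hash code of its first field.
    // Assumes s.Msg is not null.
    public static bool HashMatchesFirstField(MyStruct s)
    {
        return s.GetHashCode() == s.Msg.GetHashCode();
    }
}
```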
Translator's note: when I tested this comparison while translating, it always returned false.
The first rule says that two objects that are equal (as defined by operator==()) must have the same hash code. This rule is followed for value types under most conditions, but you can break it, just as you could for reference types. ValueType.operator==() compares the first field in the struct, along with every other field. That satisfies rule 1. As long as any override that you define for operator== uses the first field, it will work. Any struct whose first field does not participate in the equality of the type violates this rule, breaking GetHashCode().
The second rule states that the hash code must be an instance invariant. That rule is followed only when the first field in the struct is an immutable field. If the value of the first field can change, so can the hash code. That breaks the rules. Yes, GetHashCode() is broken for any struct that you create when the first field can be modified during the lifetime of the object. It's yet another reason why immutable value types are your best bet (see Item 7).
The third rule depends on the type of the first field and how it is used. If the first field generates a random distribution across all integers, and the first field is distributed across all values of the struct, then the struct generates an even distribution as well. However, if the first field often has the same value, this rule is violated. Consider a small change to the earlier struct:
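A sketch of that change moves `_epoch` to the front of the struct. As before, the `_id` field, the constructor, and the properties are assumptions added so that the example is complete.

```csharp
using System;

// Sketch of the modified struct: the date is now the first field.
public struct MyStruct
{
    private readonly DateTime _epoch;  // first field, often identical across objects
    private readonly string _msg;
    private readonly int _id;

    public MyStruct(DateTime epoch, string msg, int id)
    {
        _epoch = epoch;
        _msg = msg;
        _id = id;
    }

    public DateTime Epoch { get { return _epoch; } }
    public string Msg { get { return _msg; } }
    public int Id { get { return _id; } }
}
```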
If the _epoch field is set to the current date (not including the time), all MyStruct objects created on a given date will have the same hash code. That prevents an even distribution among all hash code values.
Summarizing the default behavior: Object.GetHashCode() works correctly for reference types, although it does not necessarily generate an efficient distribution. (If you have overridden Object.operator==(), you can break GetHashCode().) ValueType.GetHashCode() works only if the first field in your struct is read-only. ValueType.GetHashCode() generates an efficient hash code only when the first field in your struct contains values across a meaningful subset of its inputs.
If you're going to build a better hash code, you need to place some constraints on your type. Examine the three rules again, this time in the context of building a working implementation of GetHashCode().
First, if two objects are equal, as defined by operator==(), they must return the same hash value. Any property or data value used to generate the hash code must also participate in the equality test for the type. Ideally, the same properties used for equality are also used for hash code generation. It's possible to have properties participate in equality that are not used in the hash code computation. The default behavior for System.ValueType does just that, but it means that rule 3 is often violated. The same data elements should participate in both computations.
The second rule is that the return value of GetHashCode() must be an instance invariant. Imagine that you defined a reference type, Customer:
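A minimal sketch of such a type follows, assuming the hash code is taken from the customer's name, as the discussion after it implies. The `_revenue` field and its property are assumptions added to give the class a second piece of state.

```csharp
using System;

// Sketch of a Customer whose hash code depends on a mutable field.
public class Customer
{
    private string _name;
    private decimal _revenue;   // assumed extra state, not used in hashing

    public Customer(string name)
    {
        _name = name;
    }

    public string Name
    {
        get { return _name; }
        set { _name = value; }  // mutable: the hash code can change
    }

    public decimal Revenue
    {
        get { return _revenue; }
        set { _revenue = value; }
    }

    // Derived from a field that can change after construction.
    public override int GetHashCode()
    {
        return _name.GetHashCode();
    }
}
```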
Suppose that you execute the following code snippet:
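A sketch of that snippet is wrapped here in a hypothetical `LostCustomerDemo` helper so that it can be run. A pared-down Customer is repeated so that the block compiles on its own, and the `myHashMap` variable and the string used for the orders are assumptions.

```csharp
using System;
using System.Collections.Generic;

// Pared-down Customer with a mutable name, repeated for completeness.
public class Customer
{
    private string _name;

    public Customer(string name) { _name = name; }

    public string Name
    {
        get { return _name; }
        set { _name = value; }
    }

    public override int GetHashCode()
    {
        return _name.GetHashCode();
    }
}

public static class LostCustomerDemo
{
    // Reports whether c1 can still be found after its name changes.
    public static bool FoundAfterRename()
    {
        Dictionary<Customer, string> myHashMap = new Dictionary<Customer, string>();

        Customer c1 = new Customer("Acme Products");
        myHashMap.Add(c1, "orders for Acme");

        // Oops, the name is wrong:
        c1.Name = "Acme Software";

        // The lookup now uses the hash code of the new name.
        return myHashMap.ContainsKey(c1);
    }
}
```

Unless the two names happen to produce the same hash code, `FoundAfterRename()` reports false even though the dictionary still holds the entry.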
c1 is lost somewhere in the hash map. When you placed c1 in the map, the hash code was generated from the string "Acme Products". After you change the name of the customer to "Acme Software", the hash code value changes. It's now generated from the new name: "Acme Software". c1 is stored in the bucket defined by "Acme Products", but it should be in the bucket defined for "Acme Software". You've lost that customer in your own collection. It's lost because the hash code is not an object invariant. You've changed the correct bucket after storing the object.
The earlier situation can occur only if Customer is a reference type. Value types misbehave differently, but they still cause problems. If Customer is a value type, a copy of c1 gets stored in the hashmap. The last line, which changes the name, has no effect on the copy stored in the hashmap. Because boxing and unboxing make copies as well, it's very unlikely that you can change the members of a value type after that object has been added to a collection.
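The copy semantics can be seen in a short sketch. The `CustomerValue` struct and the `ValueCopyDemo` helper are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mutable value type, used only to show copy semantics.
public struct CustomerValue
{
    public string Name;
    public CustomerValue(string name) { Name = name; }
}

public static class ValueCopyDemo
{
    // Returns the name held by the stored key after the caller's
    // own variable has been changed.
    public static string StoredNameAfterChange()
    {
        Dictionary<CustomerValue, string> map = new Dictionary<CustomerValue, string>();

        CustomerValue c1 = new CustomerValue("Acme Products");
        map.Add(c1, "orders");       // the dictionary stores a copy of c1
        c1.Name = "Acme Software";   // changes only the local variable

        foreach (CustomerValue key in map.Keys)
        {
            return key.Name;         // the single stored key
        }
        return null;
    }
}
```

The stored key keeps the name "Acme Products", so the intended correction never reaches the collection.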
The only way to address rule 2 is to define the hash code function to return a value based on some invariant property or properties of the object. System.Object abides by this rule using the object identity, which does not change. System.ValueType hopes that the first field in your type does not change. You can't do better without making your type immutable. When you define a value type that is intended for use as a key type in a hash container, it must be an immutable type. Violate this recommendation, and the users of your type will find a way to break hashtables that use your type as keys. Revisiting the Customer class, you can modify it so that the customer name is immutable:
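One way to sketch that change is to make the name read-only and return a new object when the name must change. The `ChangeName` method, the `_revenue` field, and the two-argument constructor are assumptions made for this illustration.

```csharp
using System;

// Sketch of an immutable Customer: the name cannot change after construction.
public class Customer
{
    private readonly string _name;
    private readonly decimal _revenue;   // assumed extra state

    public Customer(string name) : this(name, 0)
    {
    }

    public Customer(string name, decimal revenue)
    {
        _name = name;
        _revenue = revenue;
    }

    public string Name { get { return _name; } }
    public decimal Revenue { get { return _revenue; } }

    // Returns a new object instead of modifying this one.
    public Customer ChangeName(string newName)
    {
        return new Customer(newName, _revenue);
    }

    // Safe: _name never changes, so the hash code never changes.
    public override int GetHashCode()
    {
        return _name.GetHashCode();
    }
}
```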
Making the name immutable changes how you must work with customer objects to modify the name:
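A sketch of the remove-and-re-add pattern follows. The pared-down immutable Customer is repeated so that the block compiles on its own; `ChangeName`, `RenameDemo`, and `myHashMap` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Pared-down immutable Customer, repeated for completeness.
public class Customer
{
    private readonly string _name;

    public Customer(string name) { _name = name; }

    public string Name { get { return _name; } }

    public Customer ChangeName(string newName)
    {
        return new Customer(newName);
    }

    public override int GetHashCode()
    {
        return _name.GetHashCode();
    }
}

public static class RenameDemo
{
    // Replaces a key with a corrected copy; true if the map ends up consistent.
    public static bool Rename()
    {
        Dictionary<Customer, string> myHashMap = new Dictionary<Customer, string>();

        Customer c1 = new Customer("Acme Products");
        myHashMap.Add(c1, "orders for Acme");

        // Oops, the name is wrong:
        Customer c2 = c1.ChangeName("Acme Software");
        string orders = myHashMap[c1];   // c1 is still found under its old name
        myHashMap.Remove(c1);
        myHashMap.Add(c2, orders);

        return myHashMap.ContainsKey(c2) && !myHashMap.ContainsKey(c1);
    }
}
```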
You have to remove the original customer, change the name, and add the new customer object to the hashtable. It looks more cumbersome than the first version, but it works. The previous version allowed programmers to write incorrect code. By enforcing the immutability of the properties used to calculate the hash code, you enforce correct behavior. Users of your type can't go wrong. Yes, this version is more work. You're forcing developers to write more code, but only because it's the only way to write the correct code. Make certain that any data members used to calculate the hash value are immutable.
The third rule says that GetHashCode() should generate a random distribution among all integers for all inputs. Satisfying this requirement depends on the specifics of the types you create. If a magic formula existed, it would be implemented in System.Object and this item would not exist. A common and successful algorithm is to XOR all the return values from GetHashCode() on all fields in a type. If your type contains some mutable fields, exclude those fields from the calculations.
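The XOR approach might look like the following sketch. The `LabeledPoint` type and its fields are invented for illustration; every field is read-only, and the same fields drive both Equals() and GetHashCode().

```csharp
using System;

// Hypothetical immutable type: all fields take part in equality and hashing.
public sealed class LabeledPoint
{
    private readonly int _x;
    private readonly int _y;
    private readonly string _label;

    public LabeledPoint(int x, int y, string label)
    {
        _x = x;
        _y = y;
        _label = label;
    }

    public override bool Equals(object obj)
    {
        LabeledPoint other = obj as LabeledPoint;
        if (other == null)
        {
            return false;
        }
        return _x == other._x && _y == other._y && _label == other._label;
    }

    // XOR of the hash codes of every immutable field.
    public override int GetHashCode()
    {
        int labelHash = (_label == null) ? 0 : _label.GetHashCode();
        return _x.GetHashCode() ^ _y.GetHashCode() ^ labelHash;
    }
}
```

One limitation is worth knowing: XOR is symmetric, so points such as (1, 2) and (2, 1) with the same label produce the same hash code. That is still correct, but it can hurt the distribution for types where swapped values are common.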
GetHashCode() has very specific requirements: Equal objects must produce equal hash codes, and hash codes must be object invariants and must produce an even distribution to be efficient. All three can be satisfied only for immutable types. For other types, rely on the default behavior, but understand the pitfalls.