Hash Table

There are numerous ways to structure elements into a collection:

  • Linked lists
  • Binary trees
  • Arrays
  • and so forth

They all have their advantages and disadvantages. A common disadvantage is that you have to do comparisons when you want to find an element (or find out that the element is not stored in the collection).

Although a binary tree can help to reduce the number of comparisons required, there are even faster ways to go element-hunting, ways that ideally require zero comparisons!

The Array

The array might have its disadvantages, such as having to reallocate the entire array (plus more) when the array needs to grow. That has somehow given the array concept a "bad name". Combine that with something like this:

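SomeThing* findSomeThing(const CString& name, const SomeArray& array)
{
    // Linear search: compare the name against every stored element.
    for (unsigned int i = 0; i < array.size(); i++)
        if (array[i] != nullptr && name == array[i]->GetName())
            return array[i];
    return nullptr;   // not found
}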

This mimics the way one, for example, traverses a linked list to find an element. I think this is a rather common way that arrays are used (have you ever done std::find on a std::vector?). However, when using arrays like that, you miss out on the one major advantage an array has over the rest of the collections: if you know the index of an element, you can retrieve it "instantaneously". Now, doesn't instantaneously sound cool? Unfortunately, there is a small "if", but that's where the hash function comes in...

The Hash Function

So, what is the deal? Well, the concept is quite simple:

Let there be a function that can convert an element's key into an index. This is the so-called "hash function". In other words, a modified version of the example above would be something like this:

SomeThing* findSomeThing(const CString& name, const SomeArray& array)
{
    // The hash function turns the key directly into an index.
    unsigned int i = calculateIndexOfSomeThing(name, array.size());
    return array[i];
}

Of course, the interesting question is: what should the hash function, calculateIndexOfSomeThing, do? Note that all that calculateIndexOfSomeThing "sees" is the name and the array's size, not the array itself. To put it generally, a hash function...

  • ...must return a valid index into the array, that is, a value between 0 and the array size minus 1. This is the so-called "hash value".
  • ...must, given a certain key and array size, always return the same hash value for that key. It must not depend on what else might be stored in the array, what day it is, and so on.
  • ...should try to spread out the values to avoid different elements getting the same index.
  • ...should preferably be fast.

Here is an example of a simple (and not all that clever) hash function that, given a string, returns a hash value:

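unsigned int getHashValueForString(const char* str, unsigned int arraySize)
{
    // Use only the first character; the cast keeps the value non-negative
    // on platforms where char is signed.
    return static_cast<unsigned char>(str[0]) % arraySize;
}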

As you see, it simply returns the first char, with a % arraySize to ensure that the value is between 0 and arraySize-1. There'd obviously be a conflict for an array such as {"Foo","Fnurt"}, because both keys start with the same character. And that's where the thing about collisions comes in.
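To make that conflict concrete, here is a small sketch built on the same first-character hash. The helper collides is a hypothetical name introduced only for this illustration, and the array size of 11 used in the note below is an arbitrary assumption.

```cpp
#include <cassert>

// Same first-character hash as above; the cast to unsigned char keeps
// the result non-negative on platforms where char is signed.
unsigned int getHashValueForString(const char* str, unsigned int arraySize)
{
    assert(str != nullptr && arraySize > 0);
    return static_cast<unsigned char>(str[0]) % arraySize;
}

// Hypothetical helper: true when two keys would land in the same slot
// of an array of the given size.
bool collides(const char* a, const char* b, unsigned int arraySize)
{
    return getHashValueForString(a, arraySize) == getHashValueForString(b, arraySize);
}
```

With an array size of 11, "Foo" and "Fnurt" both end up at index 4 ('F' is 70 in ASCII, and 70 % 11 is 4), so collides("Foo", "Fnurt", 11) is true, while "Foo" and "Bar" get different slots.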

Collisions

There probably will be collisions sooner or later. Consider an array where the key is a string of a maximum of 25 chars. How many possible strings are there that are up to 25 characters long? Well, let's not go into number-crunching; let's just say that it's insanely many. Consequently, an array that has a unique index for each string would have to be insanely large. That in turn would naturally be... well, insane. If you don't intend to buy all the memory storage in the world, there must be another way.

So, what can we do? Well, we can...

  1. ...be prepared for collisions and make sure we can handle them.
  2. ...at least try to avoid getting collisions too often.

Handling collisions

Uh oh, we have a collision! What can we do? There are a number of solutions:

  • Chaining. Chaining lets the array's elements actually be collections such as linked lists, binary trees, or even other hash tables. In other words, you can use the hash function to quickly resolve which collection the element is in, and then use a regular find in that collection.
  • Probing. Probing uses slots in the array other than the one originally indexed by the hash function. For example, a search would roughly (this is not the complete algorithm) go something like this:
    1. Get the hash value, i
    2. while (!someStopCondition && array[i % array.size()]->getKey() != theKeySearched) {
           i += someValueLargerThan1;
           ...
       }
      If the array's size is a prime number (and the step is not a multiple of that size), the i % array.size() wraparound when i >= array.size() will be quite efficient, because the probe won't return to the original hash value until all other slots have been checked.
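Chaining, as described in the first bullet, could be sketched roughly like this. This is only a minimal illustration under a few assumptions: ChainedTable, put, find, and indexOf are hypothetical names, the keys are std::string, the values are int, and the hash is the deliberately weak first-character one so that collisions are easy to provoke.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical chained table: every array slot holds a linked list of
// (key, value) pairs whose keys hash to that slot.
class ChainedTable
{
    typedef std::list<std::pair<std::string, int> > Chain;

public:
    explicit ChainedTable(std::size_t slotCount) : slots_(slotCount)
    {
        assert(slotCount > 0);
    }

    // Inserts the key, or overwrites the value if the key is already stored.
    void put(const std::string& key, int value)
    {
        Chain& chain = slots_[indexOf(key)];
        for (Chain::iterator it = chain.begin(); it != chain.end(); ++it) {
            if (it->first == key) {
                it->second = value;
                return;
            }
        }
        chain.push_back(std::make_pair(key, value));
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const int* find(const std::string& key) const
    {
        const Chain& chain = slots_[indexOf(key)];   // one hash, no comparisons yet
        for (Chain::const_iterator it = chain.begin(); it != chain.end(); ++it) {
            if (it->first == key)                    // regular find inside the chain
                return &it->second;
        }
        return nullptr;
    }

private:
    // Deliberately weak first-character hash (an assumption for this sketch).
    std::size_t indexOf(const std::string& key) const
    {
        unsigned char first = key.empty() ? 0 : static_cast<unsigned char>(key[0]);
        return first % slots_.size();
    }

    std::vector<Chain> slots_;
};
```

Storing both "Foo" and "Fnurt" in such a table puts them in the same chain, yet find still returns the right value for each, at the cost of a couple of comparisons inside that chain.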

When handling collisions, we no longer have instant access to the elements; thus, we try to avoid them.
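For comparison, the probing approach might look something like the sketch below. It is not the complete algorithm either (removal is left out entirely), and the names ProbedTable, put, find, and step are hypothetical. It assumes a prime slot count and a fixed step between 1 and the slot count, so that the probe sequence visits every slot before it repeats.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical probed table with a fixed step between probes.
class ProbedTable
{
    struct Slot
    {
        bool used = false;
        std::string key;
        int value = 0;
    };

public:
    // slotCount is assumed to be prime; step must be in [1, slotCount).
    ProbedTable(std::size_t slotCount, std::size_t step)
        : slots_(slotCount), step_(step)
    {
        assert(slotCount > 1 && step >= 1 && step < slotCount);
    }

    // Returns false if every slot is taken by other keys.
    bool put(const std::string& key, int value)
    {
        std::size_t i = indexOf(key);
        for (std::size_t tries = 0; tries < slots_.size(); ++tries) {
            Slot& slot = slots_[i];
            if (!slot.used) {                  // free slot: store here
                slot.used = true;
                slot.key = key;
                slot.value = value;
                return true;
            }
            if (slot.key == key) {             // same key: overwrite
                slot.value = value;
                return true;
            }
            i = (i + step_) % slots_.size();   // collision: probe the next slot
        }
        return false;
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const int* find(const std::string& key) const
    {
        std::size_t i = indexOf(key);
        for (std::size_t tries = 0; tries < slots_.size(); ++tries) {
            const Slot& slot = slots_[i];
            if (!slot.used)                    // stop condition: empty slot
                return nullptr;
            if (slot.key == key)
                return &slot.value;
            i = (i + step_) % slots_.size();
        }
        return nullptr;                        // stop condition: all slots checked
    }

private:
    // Deliberately weak first-character hash (an assumption for this sketch).
    std::size_t indexOf(const std::string& key) const
    {
        unsigned char first = key.empty() ? 0 : static_cast<unsigned char>(key[0]);
        return first % slots_.size();
    }

    std::vector<Slot> slots_;
    std::size_t step_;
};
```

With 7 slots and a step of 3, keys that all start with 'F' are placed in slots 0, 3, 6, 2, 5, 1, 4 in turn, so all seven slots get used before put reports that the table is full.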

Avoiding collisions

There are some techniques to avoid collisions:

  1. Let the array size be a prime number
    Collisions tend to be less likely if the hash function performs % on a prime rather than on a non-prime number. A prime shares no common factor with regular patterns in the keys (for example, keys that are all multiples of 4), so such keys are not funneled into just a few slots.
    Consequently, the first step to avoid collisions is to make sure the array size is a prime number.
  2. Spreading values
    Let the hash function be smart enough to distribute its return values as evenly as possible. How to determine the best algorithm for this is beyond the scope of this article...
  3. Increasing the array's size
    The more "free slots" in an array, the lower the probability of collisions. So, to avoid collisions, let the array grow proportionally and try to keep a fixed (free slots)/(array size) ratio.
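The three techniques above can each be illustrated with a few lines of code. The sketch below is only that, a sketch: slotsUsed, hashAllChars, and shouldGrow are hypothetical names, the multiplier 31 is just one commonly used choice, and the 0.75 threshold is an arbitrary assumption rather than a recommendation from this article.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Technique 1: counts how many distinct slots are hit when the keys
// 0, stride, 2*stride, ... (keyCount of them) are reduced with % arraySize.
std::size_t slotsUsed(unsigned int stride, unsigned int keyCount, unsigned int arraySize)
{
    assert(arraySize > 0);
    std::vector<bool> hit(arraySize, false);
    std::size_t used = 0;
    for (unsigned int k = 0; k < keyCount; ++k) {
        unsigned int slot = (k * stride) % arraySize;
        if (!hit[slot]) {
            hit[slot] = true;
            ++used;
        }
    }
    return used;
}

// Technique 2: mixes every character into the hash instead of only the
// first one. The multiplier 31 is an assumption, not the only option.
unsigned int hashAllChars(const char* str, unsigned int arraySize)
{
    assert(str != nullptr && arraySize > 0);
    unsigned int h = 0;
    for (; *str != '\0'; ++str)
        h = h * 31u + static_cast<unsigned char>(*str);   // unsigned overflow wraps
    return h % arraySize;
}

// Technique 3: true when the share of used slots exceeds maxLoad,
// which signals that the array should grow.
bool shouldGrow(std::size_t usedSlots, std::size_t arraySize, double maxLoad = 0.75)
{
    assert(arraySize > 0);
    return static_cast<double>(usedSlots) / static_cast<double>(arraySize) > maxLoad;
}
```

For keys that are all multiples of 4, slotsUsed reports that an array of size 8 uses only 2 of its slots, whereas an array of size 7 uses all 7. Likewise, hashAllChars gives "Foo" and "Fnurt" different indexes in an array of size 11, which the first-character hash cannot do.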

Resizing the Array

As always with arrays, if there's a need for it to grow, one will have to:

  1. Allocate an entirely new, larger array
  2. Move elements to the new array
  3. Discard the old array

With hash arrays, step 2 is a bit special; we can't just copy the elements over. Instead, we need to run each key through the hash function again, with the new array size, to ensure the element is put in the right place. Note that the new hash value may differ from the original one.

The process of shuffling elements like this is commonly called "rehashing".
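Rehashing could be sketched as follows for a chained table that stores only string keys. The names Slots, hashKey, rehash, and contains are hypothetical, and the hash (every character mixed in with multiplier 31) is an assumption made just for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Hypothetical chained layout: one list of keys per array slot.
typedef std::vector<std::list<std::string> > Slots;

// Assumed hash for this sketch; the result depends on the array size.
std::size_t hashKey(const std::string& key, std::size_t arraySize)
{
    assert(arraySize > 0);
    std::size_t h = 0;
    for (unsigned char c : key)
        h = h * 31 + c;
    return h % arraySize;
}

// Builds the larger array and re-inserts every key using the new size.
Slots rehash(const Slots& oldSlots, std::size_t newSize)
{
    Slots newSlots(newSize);                                   // step 1: allocate
    for (const std::list<std::string>& chain : oldSlots)
        for (const std::string& key : chain)
            newSlots[hashKey(key, newSize)].push_back(key);    // step 2: re-hash
    return newSlots;                                           // step 3: caller drops the old array
}

// Looks a key up in the slot the hash function points to.
bool contains(const Slots& slots, const std::string& key)
{
    const std::list<std::string>& chain = slots[hashKey(key, slots.size())];
    for (const std::string& stored : chain)
        if (stored == key)
            return true;
    return false;
}
```

After growing from, say, 5 to 11 slots, every key is still found by contains, even though most keys now live at a different index than before.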

Disclaimer

Because this article is designed to be a brief and non-academic introduction to the hash concept, the mathematically most correct definitions of the concept are found elsewhere.

Hashing is a widely known concept, and the author makes no claim to having invented it.

Downloads

This source contains two hash table structures, HashTableChained and HashTableProbed, that illustrate what I've just been talking about.

Making really general structures is always somewhat tricky, as we all have different demands. The tables in the provided source address this by handling just the core structure.
