There are numerous ways to structure elements into a collection.
They all have their advantages and disadvantages. A common disadvantage is that you have to do comparisons when you want to find an element (or find out that the element is not stored in the collection).
Although the Binary Tree can help to reduce the number of comparisons required, there are even faster ways to go element-hunting, ways that ideally require zero comparisons!
The Array
The array has its disadvantages, such as having to reallocate the entire array when it needs to grow. That has somehow given the array concept a "bad name". Combine that with something like this:
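SomeThing* findSomeThing(const CString& name, const SomeArray& array)
{
    // Linear search: compare the name against every stored element.
    for (unsigned int i = 0; i < array.size(); i++)
        if (array[i] != nullptr && name == array[i]->GetName())
            return array[i];
    return nullptr;   // not found
}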
This mimics the way one, for example, traverses a linked list to find an element. I think this is a rather common way to use arrays (have you ever done std::find on a std::vector?). However, when using arrays like that, you miss out on the one major advantage they have over the other collections: if you know the index of an element, you can retrieve it "instantaneously". Now, doesn't instantaneously sound cool? Unfortunately, there is a small "if", but that's where the hash function comes in...
The Hash Function
So, what is the deal? Well, the concept is quite simple:
Let there be a function that can convert an element's key into an index. This is the so-called "hash function". In other words, a modified version of the example above would be something like this:
SomeThing* findSomeThing(const CString& name, const SomeArray& array)
{
    // Ask the hash function where the element lives; no searching involved.
    unsigned int i = calculateIndexOfSomeThing(name, array.size());
    return array[i];
}
Of course, the interesting question is: what should the hash function, calculateIndexOfSomeThing, do? Note that all that calculateIndexOfSomeThing "sees" is the name and the array's size, not the array itself. To put it generally, a hash function...
Here is an example of a simple (and not all that clever) hash function that, given a string, returns a hash value:
unsigned int getHashValueForString(const char* str, unsigned int arraySize)
{
    return str[0] % arraySize;
}
As you can see, it simply returns the first char, with a % arraySize to ensure that the value is between 0 and arraySize-1. There would obviously be a conflict for an array such as {"Foo","Fnurt"}, since both strings start with 'F'. And that's where collisions come in.
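To make the conflict concrete, here is a minimal sketch that checks whether two keys would be sent to the same slot. The names firstCharHash and collides are made up for this illustration; firstCharHash follows the same first-character idea as getHashValueForString, with a precondition check added.

```cpp
#include <cassert>

// Same idea as getHashValueForString: the first character, modulo the array size.
// (firstCharHash is a hypothetical name used only in this sketch.)
unsigned int firstCharHash(const char* str, unsigned int arraySize)
{
    assert(str != nullptr && arraySize > 0);   // precondition: no null key, no division by zero
    return static_cast<unsigned char>(str[0]) % arraySize;
}

// True when both keys would land in the same slot of an array with arraySize slots.
bool collides(const char* a, const char* b, unsigned int arraySize)
{
    return firstCharHash(a, arraySize) == firstCharHash(b, arraySize);
}
```

With this function, collides("Foo", "Fnurt", n) is true for any array size n, because only the 'F' is ever looked at.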
Collisions
There will probably be collisions sooner or later. Consider an array where the key is a string of at most 25 chars. How many possible strings of up to 25 characters are there? Well, let's not go into number-crunching; let's just say that it's insanely many. Consequently, an array with a unique index for each string would have to be insanely large. That in turn would naturally be... well, insane. If you don't intend to buy all the memory storage in the world, there must be another way.
So, what can we do? Well, we can...
Handling collisions
Uh oh, we have a collision! What can we do? There are a number of solutions:
When handling collisions, we no longer have instant access to the elements; thus, we try to avoid them.
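One widely used way of handling collisions is separate chaining: each slot of the array holds a small list of all the elements that hashed to it. The sketch below is only an illustration under stated assumptions, not the code from the download: the class name ChainedTable, the int values, and the deliberately weak first-character hash are all made up here so that collisions are easy to provoke.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical table: every slot holds a list ("chain") of key/value pairs.
class ChainedTable
{
public:
    explicit ChainedTable(std::size_t slotCount) : slots_(slotCount)
    {
        assert(slotCount > 0);   // at least one slot, so the modulo below is valid
    }

    void insert(const std::string& key, int value)
    {
        std::list<Entry>& chain = slots_[indexOf(key)];
        for (Entry& e : chain)
            if (e.first == key) { e.second = value; return; }   // key exists: overwrite
        chain.push_back(Entry(key, value));
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const int* find(const std::string& key) const
    {
        const std::list<Entry>& chain = slots_[indexOf(key)];
        for (const Entry& e : chain)
            if (e.first == key) return &e.second;   // comparisons happen only within one chain
        return nullptr;
    }

private:
    typedef std::pair<std::string, int> Entry;

    // Deliberately weak hash (first character only), chosen to force collisions.
    std::size_t indexOf(const std::string& key) const
    {
        unsigned char first = key.empty() ? 0 : static_cast<unsigned char>(key[0]);
        return first % slots_.size();
    }

    std::vector<std::list<Entry> > slots_;
};
```

Inserting "Foo" and "Fnurt" puts both into the same chain, and find still returns the right value for each. The price is the one mentioned above: inside a chain we are back to comparing keys one by one.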
Avoid collisions
There are some techniques to avoid collisions:
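One technique commonly used to make collisions rarer is to let the hash function look at the whole key rather than just its first character. The following is a sketch of that idea only; the name allCharsHash and the multiplier 31 are assumptions chosen for illustration, not something prescribed by this article.

```cpp
#include <cassert>

// Hypothetical hash that mixes every character of the key into the result.
unsigned int allCharsHash(const char* str, unsigned int arraySize)
{
    assert(str != nullptr && arraySize > 0);   // precondition: no null key, no division by zero
    unsigned int h = 0;
    for (; *str != '\0'; ++str)
        h = h * 31u + static_cast<unsigned char>(*str);   // unsigned overflow simply wraps around
    return h % arraySize;                                 // fold into 0..arraySize-1
}
```

Keys that share a first character, such as "Foo" and "Fnurt", now usually end up at different indexes. Collisions are still possible, just less likely.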
Resizing the Array
As always with arrays, if there's a need for it to grow, one will have to:
With hash arrays, step 2 is a bit special; we can't just copy the elements over. Instead, we need to run each element through the hash function again to ensure it gets put in the right place. Note that the new hash value may differ from the original one, because the hash function takes the array size into account.
The process of shuffling elements like this is commonly called "rehashing".
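Rehashing can be sketched as follows. This is a minimal illustration under stated assumptions: the table stores plain string keys in chains, and the names Slots, slotFor and rehash are hypothetical, invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

typedef std::vector<std::list<std::string> > Slots;   // each slot is a chain of keys

// Hypothetical hash: mixes all characters, then folds the result into the slot count.
std::size_t slotFor(const std::string& key, std::size_t slotCount)
{
    assert(slotCount > 0);   // precondition: avoids a modulo by zero
    std::size_t h = 0;
    for (std::size_t i = 0; i < key.size(); ++i)
        h = h * 31 + static_cast<unsigned char>(key[i]);
    return h % slotCount;
}

// Builds a table with newSlotCount slots and re-inserts every key through the hash function.
Slots rehash(const Slots& old, std::size_t newSlotCount)
{
    Slots grown(newSlotCount);
    for (std::size_t s = 0; s < old.size(); ++s)
        for (std::list<std::string>::const_iterator it = old[s].begin(); it != old[s].end(); ++it)
            grown[slotFor(*it, newSlotCount)].push_back(*it);   // index recomputed for the new size
    return grown;
}
```

The important detail is that the old slot index s is never reused; every key gets a fresh index computed from the new slot count.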
Disclaimer
Because this article is designed to be a brief and non-academic introduction to the hash concept, the mathematically most correct definitions of the concept are found elsewhere.
Hashing is a widely known concept and the author makes no claim to have invented it.
Downloads
This source contains two hash table structures, HashTableChinaed and HashTableProbed, that illustrate what I've just been talking about.
Making really general structures is always somewhat tricky, as we all have different demands. The tables in the provided source address this by handling just the core structure.