Normalisation
What is Normalisation?
Normalisation is the process by which a flat file (list) of data may be converted into a
set of well-structured relations. A well-structured relation is one that contains the minimum amount of data redundancy and allows users to insert, delete and modify rows in a table without producing anomalies, errors or inconsistencies. The principles of normalisation were first defined by E. F. Codd in 1970 and have since been extended by academics such as Ronald Fagin, who defined the fourth and fifth normal forms.
Why Normalise?
Often, when designing a database, an analyst/programmer is provided with a form or
spreadsheet that is currently used by an organisation. In such flat, list-style information
there are hidden problems that prevent a straight transformation (i.e. one document does
not equal one table). Normalisation provides a framework of rules to ensure that the transformation of data from such sources is carried out systematically and removes any possible anomalies. These anomalies, if left uncorrected, may work their way into the final system and require considerably more effort to correct later, during maintenance.
Data Anomalies
During the life of a database there are three anomalies that signify flaws in the
underlying data model; these are:
Insertion Anomalies
An insertion anomaly occurs when adding a new record to the database requires the duplication of data that already exists.
Deletion Anomalies
A deletion anomaly occurs when the deletion of a single piece of data results in the
unintended loss of other valid data stored on the same row.
Update Anomalies
An update anomaly occurs when the same information is stored multiple times in the one table, so any update to that information requires multiple changes.
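To make these concrete, here is a small Python sketch of a flat enrolment list showing all three anomalies. The sample values are invented; the column layout follows the enrolment example used later in this section.

# A hypothetical flat enrolment list: each row repeats the student's
# name and the unit's name alongside the enrolment itself.
enrolments = [
    # (student_no, student_name, unit_code, unit_name)
    ("S01", "Alice", "DB101", "Databases"),
    ("S01", "Alice", "PR102", "Programming"),
    ("S02", "Bob",   "DB101", "Databases"),
]

# Insertion anomaly: recording that Alice takes a third unit forces us
# to re-enter her name, duplicating data that already exists.
enrolments.append(("S01", "Alice", "NW103", "Networks"))

# Deletion anomaly: if Bob withdraws from DB101 we delete his only row,
# and all record of Bob himself is lost along with the enrolment.
enrolments = [r for r in enrolments if not (r[0] == "S02" and r[2] == "DB101")]

# Update anomaly: renaming unit DB101 requires touching every row that
# mentions it; missing one row leaves the data inconsistent.
enrolments = [
    (s, n, c, "Database Systems" if c == "DB101" else u)
    for (s, n, c, u) in enrolments
]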
Keys
Before discussing the process of normalisation it is necessary to understand the role of
keys in the normalisation process. The identification of the correct keys is almost the sole purpose of normalisation: if the keys are identified correctly, none of the three previously mentioned anomalies can occur.
The term "key" refers to columns in any data set used to:
1.
Uniquely identify a row in a given data set, known as a Primary Key.
2.
Link two columns in two or more separate data sets together, known as a
Foreign Key.
Primary Keys
As mentioned, primary keys are columns in a data set used to uniquely identify a row; they are usually denoted by underlining the name of each primary key column.
Once primary keys have been declared, duplicate values are not allowed in the primary key.
A compound key means that there can be no repetition in the combined values of its columns: a value may be repeated in one (or more) of the key columns, as long as the compound key in its entirety is unique.
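As a minimal sketch of how a database engine enforces this, the following uses Python's built-in sqlite3 module to declare the compound key (Student#, Unit Code) from the enrolment example; the table and column names and sample values are invented for illustration.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE enrolment (
        student_no TEXT,
        unit_code  TEXT,
        PRIMARY KEY (student_no, unit_code)  -- compound primary key
    )
""")

# Values may repeat within a single key column...
con.execute("INSERT INTO enrolment VALUES ('S01', 'DB101')")
con.execute("INSERT INTO enrolment VALUES ('S01', 'PR102')")  # same student, ok
con.execute("INSERT INTO enrolment VALUES ('S02', 'DB101')")  # same unit, ok

# ...but the combined value must be unique: this duplicate is rejected.
try:
    con.execute("INSERT INTO enrolment VALUES ('S01', 'DB101')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # UNIQUE constraint failed

Declaring the key in the engine, rather than relying on application discipline, is what guarantees that duplicate key values can never enter the table.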
Foreign Keys
Foreign keys are columns in a data set that link to the primary key of a related data
set. Unlike primary keys, foreign keys are not required to contain unique values, and
where several foreign keys appear in one data set they exist independently of each other.
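A similar sketch for foreign keys, again using sqlite3 with invented names and values: the foreign key column may repeat freely, but each value must match an existing primary key in the related table.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite needs this switched on

con.execute("CREATE TABLE student (student_no TEXT PRIMARY KEY, name TEXT)")
con.execute("""
    CREATE TABLE enrolment (
        student_no TEXT REFERENCES student(student_no),  -- foreign key
        unit_code  TEXT
    )
""")

con.execute("INSERT INTO student VALUES ('S01', 'Alice')")
# The foreign key value may repeat (Alice enrols in two units)...
con.execute("INSERT INTO enrolment VALUES ('S01', 'DB101')")
con.execute("INSERT INTO enrolment VALUES ('S01', 'PR102')")

# ...but it must point at an existing primary key value.
try:
    con.execute("INSERT INTO enrolment VALUES ('S99', 'DB101')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # FOREIGN KEY constraint failed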
Normalisation: The Process
The normalisation process follows a standard series of steps, which will be outlined in
the following sections. It is important to note that unless each step is carried out properly, the next step will be flawed.
Gather the Un-Normalised data set
This step is often rushed, but it is perhaps the most critical of the entire normalisation process. If the un-normalised data set contains errors then it is more than likely that these errors will be carried through the entire normalisation, resulting in possible data anomalies.
While the order of the items in an un-normalised data set isn't governed by strict rules, the following principles are often applied to ensure an easier transition through the process:
• Items that are likely primary keys (discussed later) should be put to the left of their dependent items.
• Repeating groups are usually placed to the right in an un-normalised data set.
Notice how the second data set ignores the multiple repeating groups; this answer is still valid,
as it is often preferable to remove or ignore nested repeating groups in un-normalised data sets where the repeating group is seen to be minor or is not fully understood. The rationale behind this will be explored later, in first normal form.
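As an illustration, an un-normalised enrolment record (R1) might be sketched in Python like this, with the likely key to the left and the repeating group of units to the right; the sample values are invented.

# A hypothetical un-normalised enrolment data set (R1): Student# is the
# likely key, and the repeating group of units sits to the right.
r1 = [
    # (student_no, student_name, [(unit_code, unit_name), ...])  <- repeating group
    ("S01", "Alice", [("DB101", "Databases"), ("PR102", "Programming")]),
    ("S02", "Bob",   [("DB101", "Databases")]),
]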
Convert Un-normalised data set to First Normal Form (1NF)
Once we have identified the un-normalised data set (R1) we must convert it into first normal form (1NF). This is achieved through the removal of any repeating groups in the un-normalised data set and the identification of primary keys in any resultant data
sets.
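Continuing the sketch above (R1 is repeated here so the fragment stands alone), the conversion to 1NF flattens the repeating group into its own data set, carrying the original key with it so the two sets stay linked.

# The un-normalised set R1 from the previous sketch.
r1 = [
    ("S01", "Alice", [("DB101", "Databases"), ("PR102", "Programming")]),
    ("S02", "Bob",   [("DB101", "Databases")]),
]

# 1NF: split out the repeating group, keeping Student# in both sets.
r11 = [(s, name) for (s, name, _) in r1]     # (Student#, Student Name)
r12 = [(s, code, uname)                      # (Student#, Unit Code, Unit Name)
       for (s, _, units) in r1
       for (code, uname) in units]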
Convert 1NF to Second Normal Form (2NF)
Once an un-normalised data set has been converted into first normal form (1NF), the resulting data sets are revised further to upgrade them to second normal form (2NF) by removing all partial functional dependencies. A partial functional dependency is where a column is not wholly dependent on the primary key, i.e. the column is dependent on only part of the primary key. Thus partial functional dependencies may only occur in data sets that have more than one primary key attribute (i.e. a compound key).
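In the enrolment sketch, R12 = (Student#, Unit Code, Unit Name) has the compound key (Student#, Unit Code), but Unit Name depends only on Unit Code. Removing this partial dependency yields the R121 and R122 sets named later; the sample values remain invented.

# R12 in 1NF, with compound key (Student#, Unit Code): Unit Name depends
# on only part of that key, a partial functional dependency.
r12 = [
    ("S01", "DB101", "Databases"),
    ("S01", "PR102", "Programming"),
    ("S02", "DB101", "Databases"),
]

# 2NF: split the partially dependent column out with the key part it
# depends on; sorted(set(...)) removes the duplicates this exposes.
r121 = sorted({(s, code) for (s, code, _) in r12})         # (Student#, Unit Code)
r122 = sorted({(code, uname) for (_, code, uname) in r12}) # (Unit Code, Unit Name)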
Convert 2NF to Third Normal Form (3NF)
When 2NF is reached, you must then check the resultant data sets to upgrade them to third normal form (3NF). To achieve third normal form you must remove any transitive (hidden) dependencies between non-key attributes. A transitive dependency
exists where one or more non-key columns are more (or wholly) dependent on another non-key column than on the designated primary key(s).
It is important to note that often 3NF is achieved without having to change any of the existing data sets. This is exemplified by the two problems we have been normalising.
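Since the enrolment example needs no change at this step, here is a separate hypothetical table (all column names and values invented) that does contain a transitive dependency, and its decomposition.

# Dept Name depends on Dept Code, a non-key column, rather than on the
# key Student# - a transitive dependency.
students = [
    # (student_no, student_name, dept_code, dept_name)
    ("S01", "Alice", "CS", "Computer Science"),
    ("S02", "Bob",   "CS", "Computer Science"),
    ("S03", "Carol", "MA", "Mathematics"),
]

# 3NF: move the transitively dependent column into its own data set,
# keyed by the non-key column it really depends on.
student = [(s, n, d) for (s, n, d, _) in students]         # (Student#, Name, Dept Code)
dept    = sorted({(d, dn) for (_, _, d, dn) in students})  # (Dept Code, Dept Name)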
Name the resultant Data-Sets
The final step in this process is to give each of the final data sets a meaningful name.
While this step is not mandatory, it is useful, especially when we later cover converting normalised data sets into E-R models.
Thus in the enrolment example:
R11 = (Student#, Student Name) -> Student
R121 = (Student#, Unit Code) -> Enrolment
R122 = (Unit Code, Unit Name) -> Unit
The naming of these data sets is based on the type of data they will be used to store. At first you may find it hard to think of a proper name, but through experience you
will begin to see trends and gain a better 'feel' for the purpose of each data set.
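For instance, the three named data sets could be realised as SQLite tables like this; the column types and DDL details are assumptions for illustration, not part of the original example.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Student (
        student_no   TEXT PRIMARY KEY,    -- R11 = (Student#, Student Name)
        student_name TEXT
    );
    CREATE TABLE Unit (
        unit_code TEXT PRIMARY KEY,       -- R122 = (Unit Code, Unit Name)
        unit_name TEXT
    );
    CREATE TABLE Enrolment (              -- R121 = (Student#, Unit Code)
        student_no TEXT REFERENCES Student(student_no),
        unit_code  TEXT REFERENCES Unit(unit_code),
        PRIMARY KEY (student_no, unit_code)
    );
""")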
In summary:
1. 1NF from the un-normalised list: remove each repeating group into a new table.
2. 2NF from 1NF: remove partial-key dependencies.
a. A table with a single-column primary key goes straight to 2NF.
b. Only tables with a compound (multi-column) key need to be examined.
c. Move each attribute that depends on only part of the key, together with that part of the key, into a new table.
3. 3NF from 2NF: remove transitive dependencies.
a. A table with only one non-key attribute automatically goes to 3NF.
b. Look for a non-key attribute that depends more strongly on another non-key attribute than on the primary key.
c. Move such attributes, together with the non-key attribute they depend on, into a new table.