If you've never heard the term "normalization" before, you may be thinking about some cruel and inhumane practice that took place in psychiatric wards and monasteries during the 14th century.
In truth, normalization in the database realm really revolves around 2 very basic tenants:
1) Storing data in one place - the same information should not be stored twice within a database. This also applies to information that can be derived from another field. For example, I should not need to create an "Age" column in my Employee table if I already have a "Date of Birth" column. I can always derive the Age from the Date of Birth.
2) Storing data in the correct place - A table should only store data for the entity that it represents. For example, if I have an Employee table, I may wish to have a column for his/her Department ID that he/she works in. However, if I add columns for department name and department location to the employee table I've put department data in the wrong place.
We'll expand further on these concepts as we explore the various degrees of normalization below:
First Normal Form (1NF)
Eliminate duplicative columns from the same table.
Create separate tables for each group of related data and identify each row with a unique column (the primary key).
Let's say we have a "people" database and we're interested in keeping track of 2 things about each person, their spouse's name and the cars they own. Here are 2
FLAWED methods for doing this.
Flawed Table 1
| Person |
Spouse |
Cars |
| Bob |
Becky |
Jeep, Durango, Porsche |
| Jim |
Jane |
Xterra, Ferrari, Pathfinder, Volvo |
Flawed Table 2
| Person |
Spouse |
Car1 |
Car2 |
Car3 |
Car4 |
| Bob |
Becky |
Jeep |
Durango |
Porsche |
|
| Jim |
Jane |
Xterra |
Ferrari |
Pathfinder |
Volvo |
These tables are flawed because the will very quickly lead to problem with scaling your data structure, querying for info, or updating your data. What if someone bought 5 cars? How can we tell how many cars someone owns? How do update the data when someone sells a car?
What needs to be done here to achieve 1st order of normalization is to break the car ownership info out into it's own table as follows:
Persons
| Person |
Spouse |
| Bob |
Becky |
| Jim |
Jane |
Persons_Cars
| Person |
Car |
| Bob |
Jeep |
| Bob |
Durango |
| Bob |
Porsche |
| Jim |
Xterra |
| Jim |
Ferrari |
| Jim |
Pathfinder |
| Jim |
Volvo |
Second Normal Form (2NF)
Meet all the requirements of the 1st normal form.
Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
Create relationships between these new tables and their predecessors through the use of foreign keys.
Example: See Below
Third Normal Form (3NF):
Meet all the requirements of the 2nd normal form.
Remove columns that are not dependent upon the primary key.
Here's an example of incorporating 2nd and 3rd normal form requirements. Going back to the example we mentioned earlier, let's look at a table with the following structure:
Employee Table
| Employee ID |
Name |
Phone |
Department_Name |
Department_City |
| 1 |
Bob |
111-222-3333 |
Accounting |
Oakland |
| 2 |
Jim |
444-555-6666 |
IT |
Seattle |
| 3 |
Tom |
777-888-9999 |
Accounting |
Oakland |
Notice that the handful of possible department names & cities will be repeated over and over again here. What happens if Accounting moves to New York? Will we have to go and update every record in the Employee table where the department is Accounting? Furthermore, a department is a different entity than an employee. A more elegant solution that would comply with the 2nd order or normalization would be as follows:
Employee Table
| Employee_ID |
Name |
Phone |
Department_ID |
| 1 |
Bob |
111-222-3333 |
1 |
| 2 |
Jim |
444-555-6666 |
2 |
| 3 |
Tom |
777-888-9999 |
1 |
Department Table
| Department_ID |
Name |
City |
| 1 |
Accounting |
Oakland |
| 2 |
IT |
Seattle |
Forth Normal Form (4NF):
Meet all the requirements of the 3rd normal form.
Ensure that there are no non-trivial multi-valued dependencies of attribute sets on something other than a superset of a candidate key. A table is said to be in 4NF if and only if it is in the BCNF and multi-valued dependencies are functional dependencies. The 4NF removes unwanted data structures: multi-valued dependencies.
Wait! I know ... That description makes you want to run off and hide. But perhaps the example below can simplify things. Let's assume that employees can own multiple cars and also work on multiple projects. Here's a an example of a
FLAWED table design to store this information:
- Employee_ID
- Car_ID
- Project_ID
The reason this design is flawed is the fact that the employees relationship with his car/s has nothing to do with his relationship with his project/s. A better structure would be to have 2 tables as follows:
Table 1
- Employee_ID
- Car_ID
Table 2
- Employee_ID
- Project_ID
Fifth Normal Form (5NF):
Meet all the requirements of the 4th normal form.
Ensure that there are no non-trivial join dependencies that do not follow from the key constraints. A table is said to be in the 5NF if and only if it is in 4NF and every join dependency in it is implied by the candidate keys.
O.K. Another WTF explanation. But once again, an example should simplify the issue. Let's take a look at the table below:
| Musician
| Instrument
| Genre
|
| James |
Piano |
Classical |
| James |
Trumpet |
Classical |
| Kate |
Drums |
Jazz |
| Kate |
Piano |
Jazz |
| Kate |
Trumpet |
Jazz |
| Kate |
Clarinet |
Jazz |
| Lois |
Saxophone |
Jazz |
| Lois |
Piano |
Classical |
| Lois |
Violin |
Classical |
| Lois |
Guitar |
Rock |
What the table itself may not immediately imply is the fact that the data we see is actually a by product of 3 sets of separate data.
1) What instrument/s each musician plays.
2) What genre/s does each instrument belong to.
3) What genre/s is each musician trained in.
For example, the first record is true because:
1) James plays the Piano.
2) The Piano is a classical instrument.
3) James has been trained in classical music.
All 3 of these conditions have to be true for the first record to exist.
This data is better represented in 3 separate tables as follows:
| Musician |
Instrument |
| James |
Piano |
| James |
Trumpet |
| Kate |
Drums |
| Kate |
Piano |
| Kate |
Trumpet |
| Kate |
Clarinet |
| Lois |
Saxophone |
| Lois |
Piano |
| Lois |
Violin |
| Lois |
Guitar |
|
| Musician |
Genre |
| James |
Classical |
| Kate |
Jazz |
| Lois |
Jazz |
| Lois |
Classical |
| Lois |
Rock |
|
| Instrument |
Genre |
| Piano |
Classical |
| Piano |
Jazz |
| Trumpet |
Classical |
| Trumpet |
Jazz |
| Drums |
Classical |
| Drums |
Jazz |
| Drums |
Rock |
| Clarinet |
Classical |
| Clarinet |
Jazz |
| Saxophone |
Jazz |
| Guitar |
Rock |
|
The first dataset can be derived by joining these 3 tables.
Summary
Data normalization may appear daunting and unnatural at first but it will very quickly become second nature. All it takes a is few instances of getting burned by bad data structures in enterprise apps and you will quickly become an evangelist for proper data modeling.