Launch Radio     |     Flip Skins

Database Normalization
If you've never heard the term "normalization" before, you may be thinking about some cruel and inhumane practice that took place in psychiatric wards and monasteries during the 14th century.

In truth, normalization in the database realm really revolves around 2 very basic tenants:

1) Storing data in one place - the same information should not be stored twice within a database. This also applies to information that can be derived from another field. For example, I should not need to create an "Age" column in my Employee table if I already have a "Date of Birth" column. I can always derive the Age from the Date of Birth.

2) Storing data in the correct place - A table should only store data for the entity that it represents. For example, if I have an Employee table, I may wish to have a column for his/her Department ID that he/she works in. However, if I add columns for department name and department location to the employee table I've put department data in the wrong place.

We'll expand further on these concepts as we explore the various degrees of normalization below:

First Normal Form (1NF)

  • Eliminate duplicative columns from the same table.
  • Create separate tables for each group of related data and identify each row with a unique column (the primary key).

    Let's say we have a "people" database and we're interested in keeping track of 2 things about each person, their spouse's name and the cars they own. Here are 2 FLAWED methods for doing this.

    Flawed Table 1

    Person Spouse Cars
    Bob Becky Jeep, Durango, Porsche
    Jim Jane Xterra, Ferrari, Pathfinder, Volvo

    Flawed Table 2
    Person Spouse Car1 Car2 Car3 Car4
    Bob Becky Jeep Durango Porsche  
    Jim Jane Xterra Ferrari Pathfinder Volvo

    These tables are flawed because the will very quickly lead to problem with scaling your data structure, querying for info, or updating your data. What if someone bought 5 cars? How can we tell how many cars someone owns? How do update the data when someone sells a car?

    What needs to be done here to achieve 1st order of normalization is to break the car ownership info out into it's own table as follows:

    Persons

    Person Spouse
    Bob Becky
    Jim Jane


    Persons_Cars

    Person Car
    Bob Jeep
    Bob Durango
    Bob Porsche
    Jim Xterra
    Jim Ferrari
    Jim Pathfinder
    Jim Volvo


    Second Normal Form (2NF)

  • Meet all the requirements of the 1st normal form.
  • Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
  • Create relationships between these new tables and their predecessors through the use of foreign keys.

    Example: See Below

    Third Normal Form (3NF):

  • Meet all the requirements of the 2nd normal form.
  • Remove columns that are not dependent upon the primary key.

    Here's an example of incorporating 2nd and 3rd normal form requirements. Going back to the example we mentioned earlier, let's look at a table with the following structure:

    Employee Table

    Employee ID Name Phone Department_Name Department_City
    1 Bob 111-222-3333 Accounting Oakland
    2 Jim 444-555-6666 IT Seattle
    3 Tom 777-888-9999 Accounting Oakland

    Notice that the handful of possible department names & cities will be repeated over and over again here. What happens if Accounting moves to New York? Will we have to go and update every record in the Employee table where the department is Accounting? Furthermore, a department is a different entity than an employee. A more elegant solution that would comply with the 2nd order or normalization would be as follows:

    Employee Table

    Employee_ID Name Phone Department_ID
    1 Bob 111-222-3333 1
    2 Jim 444-555-6666 2
    3 Tom 777-888-9999 1

    Department Table
    Department_ID Name City
    1 Accounting Oakland
    2 IT Seattle


    Forth Normal Form (4NF):

  • Meet all the requirements of the 3rd normal form.
  • Ensure that there are no non-trivial multi-valued dependencies of attribute sets on something other than a superset of a candidate key. A table is said to be in 4NF if and only if it is in the BCNF and multi-valued dependencies are functional dependencies. The 4NF removes unwanted data structures: multi-valued dependencies.

    Wait! I know ... That description makes you want to run off and hide. But perhaps the example below can simplify things. Let's assume that employees can own multiple cars and also work on multiple projects. Here's a an example of a FLAWED table design to store this information:

    - Employee_ID
    - Car_ID
    - Project_ID

    The reason this design is flawed is the fact that the employees relationship with his car/s has nothing to do with his relationship with his project/s. A better structure would be to have 2 tables as follows:

    Table 1
    - Employee_ID
    - Car_ID

    Table 2
    - Employee_ID
    - Project_ID

    Fifth Normal Form (5NF):

  • Meet all the requirements of the 4th normal form.
  • Ensure that there are no non-trivial join dependencies that do not follow from the key constraints. A table is said to be in the 5NF if and only if it is in 4NF and every join dependency in it is implied by the candidate keys.

    O.K. Another WTF explanation. But once again, an example should simplify the issue. Let's take a look at the table below:

    Musician Instrument Genre
    James Piano Classical
    James Trumpet Classical
    Kate Drums Jazz
    Kate Piano Jazz
    Kate Trumpet Jazz
    Kate Clarinet Jazz
    Lois Saxophone Jazz
    Lois Piano Classical
    Lois Violin Classical
    Lois Guitar Rock

    What the table itself may not immediately imply is the fact that the data we see is actually a by product of 3 sets of separate data.

    1) What instrument/s each musician plays.
    2) What genre/s does each instrument belong to.
    3) What genre/s is each musician trained in.

    For example, the first record is true because:
    1) James plays the Piano.
    2) The Piano is a classical instrument.
    3) James has been trained in classical music.

    All 3 of these conditions have to be true for the first record to exist.

    This data is better represented in 3 separate tables as follows:

    Musician Instrument
    James Piano
    James Trumpet
    Kate Drums
    Kate Piano
    Kate Trumpet
    Kate Clarinet
    Lois Saxophone
    Lois Piano
    Lois Violin
    Lois Guitar
    Musician Genre
    James Classical
    Kate Jazz
    Lois Jazz
    Lois Classical
    Lois Rock
    Instrument Genre
    Piano Classical
    Piano Jazz
    Trumpet Classical
    Trumpet Jazz
    Drums Classical
    Drums Jazz
    Drums Rock
    Clarinet Classical
    Clarinet Jazz
    Saxophone Jazz
    Guitar Rock

    The first dataset can be derived by joining these 3 tables.

    Summary

    Data normalization may appear daunting and unnatural at first but it will very quickly become second nature. All it takes a is few instances of getting burned by bad data structures in enterprise apps and you will quickly become an evangelist for proper data modeling.


  • © 2005 - 2012, MAXO Studio