Normal forms

Database Design

Lis Sulmont

Curriculum Manager

Normalization

Identify repeating groups of data and create new tables for them

A more formal definition:

The goals of normalization are to:

Be able to characterize the level of redundancy in a relational schema

Provide mechanisms for transforming schemas in order to remove redundancy

¹ Database Design, 2nd Edition by Adrienne Watt

Normal forms (NF)

Ordered from least to most normalized:

First normal form (1NF)
Second normal form (2NF)
Third normal form (3NF)
Elementary key normal form (EKNF)
Boyce-Codd normal form (BCNF)

Fourth normal form (4NF)
Essential tuple normal form (ETNF)
Fifth normal form (5NF)
Domain-key Normal Form (DKNF)
Sixth normal form (6NF)

¹ https://en.wikipedia.org/wiki/Database_normalization

1NF rules

Each record must be unique - no duplicate rows
Each cell must hold one value

Initial data

| Student_id | Student_Email   | Courses_Completed                                        | 
|------------|-----------------|----------------------------------------------------------|
| 235        | [email protected]   | Introduction to Python, Intermediate Python              |
| 455        | [email protected] | Cleaning Data in R                                       | 
| 767        | [email protected] | Machine Learning Toolbox, Deep Learning in Python        |

In 1NF form

| Student_id | Student_Email   | 
|------------|-----------------|
| 235        | [email protected]   | 
| 455        | [email protected] | 
| 767        | [email protected] |

| Student_id | Completed                |
|------------|--------------------------|
| 235        | Introduction to Python   | 
| 235        | Intermediate Python      | 
| 455        | Cleaning Data in R       | 
| 767        | Machine Learning Toolbox | 
| 767        | Deep Learning in Python  |

2NF

Must satisfy 1NF AND
- If primary key is one column
  - then automatically satisfies 2NF
- If there is a composite primary key
  - then each non-key column must be dependent on all the keys

Initial data

| Student_id (PK) | Course_id (PK) | Instructor_id | Instructor    | Progress |
|-----------------|----------------|---------------|---------------|----------|
| 235             | 2001           | 560           | Nick Carchedi | .55      |
| 455             | 2345           | 658           | Ginger Grant  | .10      |
| 767             | 6584           | 999           | Chester Ismay | 1.00     |

In 2NF form

| Student_id (PK) | Course_id (PK) | Percent_Completed |
|-----------------|----------------|-------------------|
| 235             | 2001           | .55               |
| 455             | 2345           | .10               |
| 767             | 6584           | 1.00              |

| Course_id (PK) | Instructor_id | Instructor    |
|----------------|---------------|---------------|
| 2001           | 560           | Nick Carchedi |
| 2345           | 658           | Ginger Grant  |
| 6584           | 999           | Chester Ismay |

3NF

Satisfies 2NF
No transitive dependencies: non-key columns can't depend on other non-key columns

Initial Data

| Course_id (PK) | Instructor_id | Instructor    | Tech   |
|----------------|---------------|---------------|--------|
| 2001           | 560           | Nick Carchedi | Python |
| 2345           | 658           | Ginger Grant  | SQL    |
| 6584           | 999           | Chester Ismay | R      |

In 3NF

| Course_id (PK) | Instructor    | Tech   |
|----------------|---------------|--------|
| 2001           | Nick Carchedi | Python |
| 2345           | Ginger Grant  | SQL    |
| 6584           | Chester Ismay | R      |

| Instructor_id | Instructor    | 
|---------------|---------------|
| 560           | Nick Carchedi | 
| 658           | Ginger Grant  | 
| 999           | Chester Ismay |

Data anomalies

What is risked if we don't normalize enough?

1. Update anomaly

2. Insertion anomaly

3. Deletion anomaly

Update anomaly

Data inconsistency caused by data redundancy when updating

| Student_ID | Student_Email   | Enrolled_in             | Taught_by           |
|------------|-----------------|-------------------------|---------------------|
| 230        | [email protected]  | Cleaning Data in R      | Maggie Matsui       |
| 367        | [email protected] | Data Visualization in R | Ronald Pearson      |
| 520        | [email protected]   | Introduction to Python  | Hugo Bowne-Anderson |
| 520        | [email protected]   | Arima Models in R       | David Stoffer       |

To update student 520's email:

Need to update more than one record, otherwise, there will be inconsistency
User updating needs to know about redundancy

Insertion anomaly

Unable to add a record due to missing attributes

| Student_ID | Student_Email   | Enrolled_in             | Taught_by           |
|------------|-----------------|-------------------------|---------------------|
| 230        | [email protected]  | Cleaning Data in R      | Maggie Matsui       |
| 367        | [email protected] | Data Visualization in R | Ronald Pearson      |
| 520        | [email protected]   | Introduction to Python  | Hugo Bowne-Anderson |
| 520        | [email protected]   | Arima Models in R       | David Stoffer       |

Unable to insert a student who has signed up but not enrolled in any courses

Deletion anomaly

Deletion of record(s) causes unintentional loss of data

| Student_ID | Student_Email   | Enrolled_in             | Taught_by           |
|------------|-----------------|-------------------------|---------------------|
| 230        | [email protected]  | Cleaning Data in R      | Maggie Matsui       |
| 367        | [email protected] | Data Visualization in R | Ronald Pearson      |
| 520        | [email protected]   | Introduction to Python  | Hugo Bowne-Anderson |
| 520        | [email protected]   | Arima Models in R       | David Stoffer       |

If we delete Student 230, what happens to the data on Cleaning Data in R?

Data anomalies

What is risked if we don't normalize enough?

1. Update anomaly

2. Insertion anomaly

3. Deletion anomaly

The more normalized the database, the less prone it will be to data anomalies

Don't forget the downsides of normalization from the last video

Let's practice!

Database Design