Normal forms

Database Design

Lis Sulmont

Curriculum Manager

Normalization

Identify repeating groups of data and create new tables for them

A more formal definition:

The goals of normalization are to:

  • Be able to characterize the level of redundancy in a relational schema
  • Provide mechanisms for transforming schemas in order to remove redundancy
1 Database Design, 2nd Edition by Adrienne Watt
Database Design

Normal forms (NF)

Ordered from least to most normalized:

  • First normal form (1NF)
  • Second normal form (2NF)
  • Third normal form (3NF)
  • Elementary key normal form (EKNF)
  • Boyce-Codd normal form (BCNF)

$$

  • Fourth normal form (4NF)
  • Essential tuple normal form (ETNF)
  • Fifth normal form (5NF)
  • Domain-key Normal Form (DKNF)
  • Sixth normal form (6NF)
1 https://en.wikipedia.org/wiki/Database_normalization
Database Design

1NF rules

  • Each record must be unique - no duplicate rows
  • Each cell must hold one value

Initial data

| Student_id | Student_Email   | Courses_Completed                                        | 
|------------|-----------------|----------------------------------------------------------|
| 235        | [email protected]   | Introduction to Python, Intermediate Python              |
| 455        | [email protected] | Cleaning Data in R                                       | 
| 767        | [email protected] | Machine Learning Toolbox, Deep Learning in Python        |
Database Design

In 1NF form

| Student_id | Student_Email   | 
|------------|-----------------|
| 235        | [email protected]   | 
| 455        | [email protected] | 
| 767        | [email protected] | 
| Student_id | Completed                |
|------------|--------------------------|
| 235        | Introduction to Python   | 
| 235        | Intermediate Python      | 
| 455        | Cleaning Data in R       | 
| 767        | Machine Learning Toolbox | 
| 767        | Deep Learning in Python  | 
Database Design

2NF

  • Must satisfy 1NF AND
    • If primary key is one column
      • then automatically satisfies 2NF
    • If there is a composite primary key
      • then each non-key column must be dependent on all the keys

Initial data

| Student_id (PK) | Course_id (PK) | Instructor_id | Instructor    | Progress |
|-----------------|----------------|---------------|---------------|----------|
| 235             | 2001           | 560           | Nick Carchedi | .55      |
| 455             | 2345           | 658           | Ginger Grant  | .10      |
| 767             | 6584           | 999           | Chester Ismay | 1.00     |
Database Design

In 2NF form

| Student_id (PK) | Course_id (PK) | Percent_Completed |
|-----------------|----------------|-------------------|
| 235             | 2001           | .55               |
| 455             | 2345           | .10               |
| 767             | 6584           | 1.00              |
| Course_id (PK) | Instructor_id | Instructor    |
|----------------|---------------|---------------|
| 2001           | 560           | Nick Carchedi |
| 2345           | 658           | Ginger Grant  |
| 6584           | 999           | Chester Ismay |
Database Design

3NF

  • Satisfies 2NF
  • No transitive dependencies: non-key columns can't depend on other non-key columns

Initial Data

| Course_id (PK) | Instructor_id | Instructor    | Tech   |
|----------------|---------------|---------------|--------|
| 2001           | 560           | Nick Carchedi | Python |
| 2345           | 658           | Ginger Grant  | SQL    |
| 6584           | 999           | Chester Ismay | R      |
Database Design

In 3NF

| Course_id (PK) | Instructor    | Tech   |
|----------------|---------------|--------|
| 2001           | Nick Carchedi | Python |
| 2345           | Ginger Grant  | SQL    |
| 6584           | Chester Ismay | R      |
| Instructor_id | Instructor    | 
|---------------|---------------|
| 560           | Nick Carchedi | 
| 658           | Ginger Grant  | 
| 999           | Chester Ismay |
Database Design

Data anomalies

What is risked if we don't normalize enough?

1. Update anomaly

2. Insertion anomaly

3. Deletion anomaly

Database Design

Update anomaly

Data inconsistency caused by data redundancy when updating

| Student_ID | Student_Email   | Enrolled_in             | Taught_by           |
|------------|-----------------|-------------------------|---------------------|
| 230        | [email protected]  | Cleaning Data in R      | Maggie Matsui       |
| 367        | [email protected] | Data Visualization in R | Ronald Pearson      |
| 520        | [email protected]   | Introduction to Python  | Hugo Bowne-Anderson |
| 520        | [email protected]   | Arima Models in R       | David Stoffer       |

To update student 520's email:

  • Need to update more than one record, otherwise, there will be inconsistency
  • User updating needs to know about redundancy
Database Design

Insertion anomaly

Unable to add a record due to missing attributes

| Student_ID | Student_Email   | Enrolled_in             | Taught_by           |
|------------|-----------------|-------------------------|---------------------|
| 230        | [email protected]  | Cleaning Data in R      | Maggie Matsui       |
| 367        | [email protected] | Data Visualization in R | Ronald Pearson      |
| 520        | [email protected]   | Introduction to Python  | Hugo Bowne-Anderson |
| 520        | [email protected]   | Arima Models in R       | David Stoffer       |

Unable to insert a student who has signed up but not enrolled in any courses

Database Design

Deletion anomaly

Deletion of record(s) causes unintentional loss of data

| Student_ID | Student_Email   | Enrolled_in             | Taught_by           |
|------------|-----------------|-------------------------|---------------------|
| 230        | [email protected]  | Cleaning Data in R      | Maggie Matsui       |
| 367        | [email protected] | Data Visualization in R | Ronald Pearson      |
| 520        | [email protected]   | Introduction to Python  | Hugo Bowne-Anderson |
| 520        | [email protected]   | Arima Models in R       | David Stoffer       |

If we delete Student 230, what happens to the data on Cleaning Data in R?

Database Design

Data anomalies

What is risked if we don't normalize enough?

1. Update anomaly

2. Insertion anomaly

3. Deletion anomaly

The more normalized the database, the less prone it will be to data anomalies

Don't forget the downsides of normalization from the last video

Database Design

Let's practice!

Database Design

Preparing Video For Download...