Database Design
Lis Sulmont
Curriculum Manager
Identify repeating groups of data and create new tables for them
A more formal definition:
The goals of normalization are to:
- Be able to characterize the level of redundancy in a relational schema
- Provide mechanisms for transforming schemas in order to remove redundancy
Ordered from least to most normalized:
$$
Initial data
| Student_id | Student_Email | Courses_Completed |
|------------|-----------------|----------------------------------------------------------|
| 235 | [email protected] | Introduction to Python, Intermediate Python |
| 455 | [email protected] | Cleaning Data in R |
| 767 | [email protected] | Machine Learning Toolbox, Deep Learning in Python |
| Student_id | Student_Email |
|------------|-----------------|
| 235 | [email protected] |
| 455 | [email protected] |
| 767 | [email protected] |
| Student_id | Completed |
|------------|--------------------------|
| 235 | Introduction to Python |
| 235 | Intermediate Python |
| 455 | Cleaning Data in R |
| 767 | Machine Learning Toolbox |
| 767 | Deep Learning in Python |
Initial data
| Student_id (PK) | Course_id (PK) | Instructor_id | Instructor | Progress |
|-----------------|----------------|---------------|---------------|----------|
| 235 | 2001 | 560 | Nick Carchedi | .55 |
| 455 | 2345 | 658 | Ginger Grant | .10 |
| 767 | 6584 | 999 | Chester Ismay | 1.00 |
| Student_id (PK) | Course_id (PK) | Percent_Completed |
|-----------------|----------------|-------------------|
| 235 | 2001 | .55 |
| 455 | 2345 | .10 |
| 767 | 6584 | 1.00 |
| Course_id (PK) | Instructor_id | Instructor |
|----------------|---------------|---------------|
| 2001 | 560 | Nick Carchedi |
| 2345 | 658 | Ginger Grant |
| 6584 | 999 | Chester Ismay |
Initial Data
| Course_id (PK) | Instructor_id | Instructor | Tech |
|----------------|---------------|---------------|--------|
| 2001 | 560 | Nick Carchedi | Python |
| 2345 | 658 | Ginger Grant | SQL |
| 6584 | 999 | Chester Ismay | R |
| Course_id (PK) | Instructor | Tech |
|----------------|---------------|--------|
| 2001 | Nick Carchedi | Python |
| 2345 | Ginger Grant | SQL |
| 6584 | Chester Ismay | R |
| Instructor_id | Instructor |
|---------------|---------------|
| 560 | Nick Carchedi |
| 658 | Ginger Grant |
| 999 | Chester Ismay |
What is risked if we don't normalize enough?
1. Update anomaly
2. Insertion anomaly
3. Deletion anomaly
Data inconsistency caused by data redundancy when updating
| Student_ID | Student_Email | Enrolled_in | Taught_by |
|------------|-----------------|-------------------------|---------------------|
| 230 | [email protected] | Cleaning Data in R | Maggie Matsui |
| 367 | [email protected] | Data Visualization in R | Ronald Pearson |
| 520 | [email protected] | Introduction to Python | Hugo Bowne-Anderson |
| 520 | [email protected] | Arima Models in R | David Stoffer |
To update student 520
's email:
Unable to add a record due to missing attributes
| Student_ID | Student_Email | Enrolled_in | Taught_by |
|------------|-----------------|-------------------------|---------------------|
| 230 | [email protected] | Cleaning Data in R | Maggie Matsui |
| 367 | [email protected] | Data Visualization in R | Ronald Pearson |
| 520 | [email protected] | Introduction to Python | Hugo Bowne-Anderson |
| 520 | [email protected] | Arima Models in R | David Stoffer |
Unable to insert a student who has signed up but not enrolled in any courses
Deletion of record(s) causes unintentional loss of data
| Student_ID | Student_Email | Enrolled_in | Taught_by |
|------------|-----------------|-------------------------|---------------------|
| 230 | [email protected] | Cleaning Data in R | Maggie Matsui |
| 367 | [email protected] | Data Visualization in R | Ronald Pearson |
| 520 | [email protected] | Introduction to Python | Hugo Bowne-Anderson |
| 520 | [email protected] | Arima Models in R | David Stoffer |
If we delete Student 230
, what happens to the data on Cleaning Data in R
?
What is risked if we don't normalize enough?
1. Update anomaly
2. Insertion anomaly
3. Deletion anomaly
The more normalized the database, the less prone it will be to data anomalies
Don't forget the downsides of normalization from the last video
Database Design