Out of range values and inaccurate data

Cleaning Data in SQL Server Databases

Miriam Antona

Software Engineer

Dataset - series and episodes

series

| id  | name                | premiered  | official_site                               | contact_number | rating | ... |
|-----|---------------------|------------|---------------------------------------------|----------------|--------|-----|
| 1   | Adventure Time      | 2010-04-05 | wwq.cartoonnetwork.com/video/adventuretime/ | 555-906-8845   | 8,4    | ... |
| 2   | Dexter              | 2006-01-10 | ww.sho.com/sho/dexter/home                  | 555-156-8845   | 8,6    | ... |
| 3   | Futurama            | 1999-03-28 | www.cc.com/shows/futurama                   | 555-210-9951   | 9,2    | ... |
| 4   | Game of Thrones     | 2011-04-17 | www.hbo.com/game-of-thrones                 | 555-abc-6641   | 9,3    | ... |
| 5   | Homeland            | 2011-10-02 | www.sho.com/sho/homeland/home               | 555-985-6314   | 8,3    | ... |
| 6   | Westworld           | 2016-10-02 | www.hbo.com/westworld                       | 555-456-1234   | 8,6    | ... |
| 7   | Silicon Valley      | 2014-04-06 | www.hbo.com/silicon-valley/                 | 555-604-1234   | 11,4   | ... |
| 8   | The Big Bang Theory | 2007-09-24 | www.cbs.com/shows/big_bang_theory/          | 555-607-1274   | 8      | ... |
| 9   | Paw Patrol          | 2013-08-27 | www.nickjr.com/paw-patrol/videos/           | 555-930-1274   | 11     | ... |
| ... | ...                 | ...        | ...                                         | ...            | ...    | ... |
Cleaning Data in SQL Server Databases

Dataset - series and episodes

episodes

| episode_id | series_id | name                   | season | number | airdate    | runtime |
|------------|-----------|------------------------|--------|--------|------------|---------|
| 1          | 1         | Slumber Party Panic    | 1      | 1      | 2010-04-05 | 15      |
| 2          | 1         | Trouble in Lumpy Space | 1      | 2      | 2010-04-05 | 15      |
| 3          | 1         | Prisoners of Love      | 1      | 3      | 2010-04-12 | 15      |
| 4          | 1         | Tree Trunks            | 1      | 4      | 2010-04-12 | 15      |
| 5          | 1         | The Enchiridion!       | 1      | 5      | 2010-04-19 | 15      |
| 6          | 1         | The Jiggler            | 1      | 6      | 2010-04-19 | 15      |
| 7          | 1         | Ricardio the Heart Guy | 1      | 7      | 2010-04-26 | 15      |
| 8          | 1         | Business Time          | 1      | 8      | 2010-04-26 | 15      |
| 9          | 1         | My Two Favorite People | 1      | 9      | 2010-05-03 | 15      |
| ...        | ...       | ...                    | ...    | ...    | ...        | ...     |
Cleaning Data in SQL Server Databases

Out of range values

  • Values outside the expected range of valid data
    • e.g. person 400 inches tall
  • Can disrupt the results if not detected
  • Investigate!
Cleaning Data in SQL Server Databases

Out of range values - example

SELECT * FROM series
| id | name                | premiered  | official_site                               | contact_number | rating | ... |
|----|---------------------|------------|---------------------------------------------|----------------|--------|-----|
| 1  | Adventure Time      | 2010-04-05 | wwq.cartoonnetwork.com/video/adventuretime/ | 555-906-8845   | 8,4    | ... |
| 2  | Dexter              | 2006-01-10 | ww.sho.com/sho/dexter/home                  | 555-156-8845   | 8,6    | ... |
| 3  | Futurama            | 1999-03-28 | www.cc.com/shows/futurama                   | 555-210-9951   | 9,2    | ... |
| 4  | Game of Thrones     | 2011-04-17 | www.hbo.com/game-of-thrones                 | 555-abc-6641   | 9,3    | ... |
| 5  | Homeland            | 2011-10-02 | www.sho.com/sho/homeland/home               | 555-985-6314   | 8,3    | ... |
| 6  | Westworld           | 2016-10-02 | www.hbo.com/westworld                       | 555-456-1234   | 8,6    | ... |
| 7  | Silicon Valley      | 2014-04-06 | www.hbo.com/silicon-valley/                 | 555-604-1234   | 11,4   | ... |
| 8  | The Big Bang Theory | 2007-09-24 | www.cbs.com/shows/big_bang_theory/          | 555-607-1274   | 8      | ... |
| 9  | Paw Patrol          | 2013-08-27 | www.nickjr.com/paw-patrol/videos/           | 555-930-1274   | 11     | ... |
| 10 | ...                 | ...        | ...                                         | ...            | ...    | ... |
Cleaning Data in SQL Server Databases

Out of range values - example

SELECT * FROM series
| id | name                | premiered  | official_site                               | contact_number | rating | ... |
|----|---------------------|------------|---------------------------------------------|----------------|--------|-----|
| 1  | Adventure Time      | 2010-04-05 | wwq.cartoonnetwork.com/video/adventuretime/ | 555-906-8845   | 8,4    | ... |
| 2  | Dexter              | 2006-01-10 | ww.sho.com/sho/dexter/home                  | 555-156-8845   | 8,6    | ... |
| 3  | Futurama            | 1999-03-28 | www.cc.com/shows/futurama                   | 555-210-9951   | 9,2    | ... |
| 4  | Game of Thrones     | 2011-04-17 | www.hbo.com/game-of-thrones                 | 555-abc-6641   | 9,3    | ... |
| 5  | Homeland            | 2011-10-02 | www.sho.com/sho/homeland/home               | 555-985-6314   | 8,3    | ... |
| 6  | Westworld           | 2016-10-02 | www.hbo.com/westworld                       | 555-456-1234   | 8,6    | ... |
| 7  | Silicon Valley      | 2014-04-06 | www.hbo.com/silicon-valley/                 | 555-604-1234   | 11,4 **| ... |
| 8  | The Big Bang Theory | 2007-09-24 | www.cbs.com/shows/big_bang_theory/          | 555-607-1274   | 8      | ... |
| 9  | Paw Patrol          | 2013-08-27 | www.nickjr.com/paw-patrol/videos/           | 555-930-1274   | 11   **| ... |
| 10 | ...                 | ...        | ...                                         | ...            | ...    | ... |
Cleaning Data in SQL Server Databases

Out of range values - detecting the values

SELECT * FROM series
WHERE rating NOT BETWEEN 0 AND 10
| id | name           | premiered  | official_site                     | contact_number | rating | ... |
|----|----------------|------------|-----------------------------------|----------------|--------|-----|
| 7  | Silicon Valley | 2014-04-06 | www.hbo.com/silicon-valley/       | 555-604-1234   | 11,4   | ... |
| 9  | Paw Patrol     | 2013-08-27 | www.nickjr.com/paw-patrol/videos/ | 555-930-1274   | 11     | ... |
SELECT * FROM series
WHERE rating < 0 OR rating > 10
| id | name           | premiered  | official_site                     | contact_number | rating | ... |
|----|----------------|------------|-----------------------------------|----------------|--------|-----|
| 7  | Silicon Valley | 2014-04-06 | www.hbo.com/silicon-valley/       | 555-604-1234   | 11,4   | ... |
| 9  | Paw Patrol     | 2013-08-27 | www.nickjr.com/paw-patrol/videos/ | 555-930-1274   | 11     | ... |
Cleaning Data in SQL Server Databases

Out of range values - excluding the values

SELECT * FROM series
WHERE rating BETWEEN 0 AND 10
| id  | name                | premiered  | official_site                               | contact_number | rating | summary |
|-----|---------------------|------------|---------------------------------------------|----------------|--------|---------|
| 1   | Adventure Time      | 2010-05-04 | wwq.cartoonnetwork.com/video/adventuretime/ | 555-906-8845   | 8,4    | ...     |
| 2   | Dexter              | 2006-01-10 | ww.sho.com/sho/dexter/home                  | 555-156-8845   | 8,6    | ...     |
| 3   | Futurama            | 1999-03-03 | www.cc.com/shows/futurama                   | 555-210-9951   | 9,2    | ...     |
| 4   | Game of Thrones     | 2011-04-04 | www.hbo.com/game-of-thrones                 | 555-abc-6641   | 9,3    | ...     |
| 5   | Homeland            | 2011-02-10 | www.sho.com/sho/homeland/home               | 555-985-6314   | 8,3    | ...     |
| 6   | Westworld           | 2016-02-10 | www.hbo.com/westworld                       | 555-456-1234   | 8,6    | ...     |
| 8   | The Big Bang Theory | 2007-09-24 | www.cbs.com/shows/big_bang_theory/          | 555-607-1274   | 8      | ...     |
| ... | ...                 | ...        | ...                                         | ...            | ...    | ...     |
Cleaning Data in SQL Server Databases

Out of range values - excluding the values

SELECT * FROM series
WHERE rating >= 0 AND rating <=10
| id  | name                | premiered  | official_site                               | contact_number | rating | summary |
|-----|---------------------|------------|---------------------------------------------|----------------|--------|---------|
| 1   | Adventure Time      | 2010-05-04 | wwq.cartoonnetwork.com/video/adventuretime/ | 555-906-8845   | 8,4    | ...     |
| 2   | Dexter              | 2006-01-10 | ww.sho.com/sho/dexter/home                  | 555-156-8845   | 8,6    | ...     |
| 3   | Futurama            | 1999-03-03 | www.cc.com/shows/futurama                   | 555-210-9951   | 9,2    | ...     |
| 4   | Game of Thrones     | 2011-04-04 | www.hbo.com/game-of-thrones                 | 555-abc-6641   | 9,3    | ...     |
| 5   | Homeland            | 2011-02-10 | www.sho.com/sho/homeland/home               | 555-985-6314   | 8,3    | ...     |
| 6   | Westworld           | 2016-02-10 | www.hbo.com/westworld                       | 555-456-1234   | 8,6    | ...     |
| 8   | The Big Bang Theory | 2007-09-24 | www.cbs.com/shows/big_bang_theory/          | 555-607-1274   | 8      | ...     |
| ... | ...                 | ...        | ...                                         | ...            | ...    | ...     |
Cleaning Data in SQL Server Databases

Inaccurate data

  • Two or more values are contradictory
  • Can disrupt the results if not detected
  • Investigate!
Cleaning Data in SQL Server Databases

Inaccurate data - example

series
| id | name | premiered  | official_site | contact_number | rating | summary |
|----|------|------------|---------------|----------------|--------|---------|
episodes
| episode_id | series_id | name | season | number | airdate | runtime |
|------------|-----------|------|--------|--------|---------|---------|

Valid: episodes.airdate >= series.premiered

Cleaning Data in SQL Server Databases

Inaccurate data - example

| id | name      | premiered  | episode_id | name         | airdate    |
|----|-----------|------------|------------|--------------|------------|
| 5  | Homeland  | 2011-10-02 | 58         | Pilot        | 2010-10-02 |
| 6  | Westworld | 2016-10-02 | 70         | The Original | 2015-10-02 |
| 6  | Westworld | 2016-10-02 | 74         | Contrapasso  | 2015-10-30 |
Cleaning Data in SQL Server Databases

Inaccurate data - detecting the values

SELECT 
    series.id, 
    series.name, 
    series.premiered, 
    episodes.episode_id, 
    episodes.name, 
    episodes.airdate
FROM series
INNER JOIN episodes ON series.id = episodes.series_id
WHERE episodes.airdate < series.premiered
| id | name      | premiered  | episode_id | name         | airdate    |
|----|-----------|------------|------------|--------------|------------|
| 5  | Homeland  | 2011-10-02 | 58         | Pilot        | 2010-10-02 |
| 6  | Westworld | 2016-10-02 | 70         | The Original | 2015-10-02 |
| 6  | Westworld | 2016-10-02 | 74         | Contrapasso  | 2015-10-30 |
Cleaning Data in SQL Server Databases

Inaccurate data - excluding the values

SELECT 
    series.id, 
    series.name, 
    series.premiered, 
    episodes.episode_id, 
    episodes.name, 
    episodes.airdate
FROM series
INNER JOIN episodes ON series.id = episodes.series_id
WHERE episodes.airdate >= series.premiered
| episode_id | series_id | name                   | season | number | airdate    | runtime |
|------------|-----------|------------------------|--------|--------|------------|---------|
| 1          | 1         | Slumber Party Panic    | 1      | 1      | 2010-04-05 | 15      |
| 2          | 1         | Trouble in Lumpy Space | 1      | 2      | 2010-04-05 | 15      |
| 3          | 1         | Prisoners of Love      | 1      | 3      | 2010-04-12 | 15      |
| 4          | 1         | Tree Trunks            | 1      | 4      | 2010-04-12 | 15      |
| ...        | ...       | ...                    | ...    | ...    | ...        | ...     |
Cleaning Data in SQL Server Databases

Let's practice!

Cleaning Data in SQL Server Databases

Preparing Video For Download...