Introduction to Cleaning Data

Cleaning Data in SQL Server Databases

Miriam Antona

Software Engineer

Topics covered

  • Chapter 1: Starting with Cleaning Data
  • Chapter 2: Dealing with nulls, duplicate data, and dates
  • Chapter 3: Dealing with out of range values, different data types, and pattern matching
  • Chapter 4: Combining, splitting, and transforming data
Cleaning Data in SQL Server Databases

Dataset: Monthly airline flights by USA airports 2014-2015

airports

| airport_code | airport_name                              | airport_city      | airport_state |
|--------------|-------------------------------------------|-------------------|---------------|
| MSP          |  Minneapolis-St Paul International        | Minneapolis       | Minnesota     |
| JFK          |  John F. Kennedy International            | New York City     | New York      |
| LAX          |  Los Angeles International                | Los Angeles       | California    |
| DFW          |       Dallas/Fort Worth International     | Dallas/Fort Worth | Texas         |
| BOS          | Logan International                       | Boston            | Massachusetts |
| SFO          |  San Francisco International              | San Francisco     | Californiaa   |
| ATL          |  Hartsfield-Jackson Atlanta International | Atlanta           | Georgia       |
| ...          | ...                                       | ...               | ...           |
Cleaning Data in SQL Server Databases

Dataset: Monthly airline flights by USA airports 2014-2015

carriers

| code | name                            |
|------|---------------------------------|
| YV   | Mesa Airlines Inc.              |
| AA   |   American Airlines Inc.        |
| DL   |   Delta Air Lines Inc.          |
| HA   |   Hawaiian Airlines Inc.        |
| MQ   |   American Eagle Airlines Inc.  |
| EV   |  ExpressJet Airlines Inc.       |
| ...  | ...                             |
Cleaning Data in SQL Server Databases

Dataset: Monthly airline flights by USA airports 2014-2015

flight statistics

| registration_code | airport_code | carrier_code | canceled | on_time | delayed | ... |
|-------------------|--------------|--------------|----------|---------|---------|-----|
| ...               | ...          | ...          | ...      | ...     | ...     | ... |
| 000000119         | JFK          | AA           | 74       | 819     | 233     | ... |
| 120               | JFK          | B6           | 438      | 1865    | 1010    | ... |
| 000000121         | JFK          | HA           | 0        | 25      | 3       | ... |
| 122               | JFK          | MQ           | 102      | 386     | 159     | ... |
| 000000124         | JFK          | UA           | 22       | 296     | 88      | ... |
| 000000125         | JFK          | US           | 15       | 191     | 63      | ... |
| 000000126         | JFK          | VX           | 12       | 225     | 61      | ... |
| ...               | ...          | ...          | ...      | ...     | ...     | ... |
Cleaning Data in SQL Server Databases

Dataset: Monthly airline flights by USA airports 2014-2015

pilots

| pilot_code | pilot_name | pilot_surname | carrier_code | entry_date |
|------------|------------|---------------|--------------|------------|
| 1          | Thomas     | Peters        | HA           | 2011-10-01 |
| 2          | Hiroki     | Konoe         | MQ           | 2011-01-21 |
| 3          | Arturo     | Montero       | UA           | 2012-12-28 |
| 4          | David      | Captain       | US           | 2000-10-01 |
| 5          | Ainhoa     | Guerrera      | VX           | 2000-10-05 |
| 6          | Alvin      | Andersen      | OO           | 2012-01-15 |
| 7          | William    | Champy        | F9           | 2011-03-15 |
| ...        | ...        | ...           | ...          | ...        |
Cleaning Data in SQL Server Databases

Why is cleaning data important?

  • Common to acquire messy/dirty data not ready for analysis
  • Lot of time spent cleaning data vs. time spent analyzing data
  • Cleaning process -> clear information
Cleaning Data in SQL Server Databases

Filling numbers with leading zeros

SELECT * FROM flight_statistics
| registration_code | airport_code | carrier_code | canceled | on_time | delayed | ... |
|-------------------|--------------|--------------|----------|---------|---------|-----|
| ...               | ...          | ...          | ...      | ...     | ...     | ... |
| 000000119         | JFK          | AA           | 74       | 819     | 233     | ... |
| 120               | JFK          | B6           | 438      | 1865    | 1010    | ... |
| 000000121         | JFK          | HA           | 0        | 25      | 3       | ... |
| 122               | JFK          | MQ           | 102      | 386     | 159     | ... |
| 000000124         | JFK          | UA           | 22       | 296     | 88      | ... |
| 000000125         | JFK          | US           | 15       | 191     | 63      | ... |
| 000000126         | JFK          | VX           | 12       | 225     | 61      | ... |
| ...               | ...          | ...          | ...      | ...     | ...     | ... |
Cleaning Data in SQL Server Databases

Filling numbers with leading zeros

VALID: 000000128 - until 9 digits

INVALID: 128

The number 128 with leading zeros until complete 9 digits.

Cleaning Data in SQL Server Databases

Filling numbers with leading zeros - Using REPLICATE and LEN

REPLICATE (string, integer)

Repeats a string a specified number of times.

Cleaning Data in SQL Server Databases

Filling numbers with leading zeros - Using REPLICATE and LEN

REPLICATE (string, integer)

Repeats a string a specified number of times.

REPLICATE('0', 9 - LEN(registration_code))
Cleaning Data in SQL Server Databases

Filling numbers with leading zeros - Using REPLICATE and LEN

REPLICATE (string, integer)

Repeats a string a specified number of times.

REPLICATE('0', 9 - LEN(registration_code))
-- registration_code: 120 => LEN(120) = 3
REPLICATE('0', 6)
Cleaning Data in SQL Server Databases

Filling numbers with leading zeros - Using REPLICATE, LEN

+ operator

SELECT 
    REPLICATE('0', 9 - LEN(registration_code)) + registration_code AS registration_code
FROM flight_statistics

CONCAT - since SQL Server 2012

SELECT 
    CONCAT(REPLICATE('0', 9 - LEN(registration_code)), registration_code) AS registration_code
FROM flight_statistics
Cleaning Data in SQL Server Databases

Filling numbers with leading zeros - Using REPLICATE, LEN, and CONCAT

| registration_code |
|-------------------|
| ...               |
| 000000119         |
| 000000120         |
| 000000121         |
| 000000122         |
| 000000123         |
| 000000124         |
| 000000125         |
| 000000126         |
| ...               |
Cleaning Data in SQL Server Databases

Filling numbers with leading zeros - Using FORMAT

FORMAT (value, format [, culture ] ) 
  • Available since SQL Server 2012
  • value: numeric, date and time
SELECT 
    FORMAT(CAST(registration_code AS INT), '000000000') AS registration_code
FROM flight_statistics;
Cleaning Data in SQL Server Databases

Filling numbers with leading zeros - Using FORMAT

| registration_code |
|-------------------|
| ...               |
| 000000119         |
| 000000120         |
| 000000121         |
| 000000122         |
| 000000123         |
| 000000124         |
| 000000125         |
| 000000126         |
| ...               |
Cleaning Data in SQL Server Databases

Let's practice!

Cleaning Data in SQL Server Databases

Preparing Video For Download...