Avoiding duplicate data

Cleaning Data in SQL Server Databases

Miriam Antona

Software Engineer

What is duplicate data?

| registration_code | airport_code | carrier_code | registration_date | canceled | on_time | delayed | diverted |
|-------------------|--------------|--------------|-------------------|----------|---------|---------|----------|
| 000000134         | MSP          | DL           | 2014-01-01        | 61       | 3148    | 925     | 4        |
| 000000134         | MSP          | DL           | 2014-01-01        | 61       | 3148    | 925     | 4        |
| registration_code | airport_code | carrier_code | registration_date | canceled | on_time | delayed | diverted |
|-------------------|--------------|--------------|-------------------|----------|---------|---------|----------|
| 000000134         | MSP          | DL           | 2014-01-01        | 61       | 3148    | 925     | 4        |
| 000000150         | MSP          | DL           | 2014-01-01        | 61       | 3148    | 925     | 4        |
| registration_code | airport_code | carrier_code | registration_date | canceled | on_time | delayed | diverted |
|-------------------|--------------|--------------|-------------------|----------|---------|---------|----------|
| 000000134         | MSP          | DL           | 2014-01-01        | 61       | 3148    | 925     | 4        |
| 000000150         | MSP          | DL           | 2014-01-01        | 3148     | 61      | 4       | 925      |
  • Duplicate date can interfere in our analysis!
Cleaning Data in SQL Server Databases

Finding repeating groups

flight_statistics

|registration_code|airport_code|carrier_code|registration_date|canceled|on_time|delayed|diverted|statistician_name|statistician_surname|
|-----------------|------------|------------|-----------------|--------|-------|-------|--------|-----------------|--------------------|
  • airport_code
  • carrier_code
  • registration_date
Cleaning Data in SQL Server Databases

Finding repeating groups

No problem to exclude duplicate rows:

|registration_code|airport_code|carrier_code|registration_date|canceled|on_time|delayed|diverted|statistician_name|statistician_surname|
|-----------------|------------|------------|-----------------|--------|-------|-------|--------|-----------------|--------------------|
|000000134        |MSP         |DL          |2014-01-01       |61      |3148   |925    |4       |Anne             |Johnson             |
|000000150        |MSP         |DL          |2014-01-01       |61      |3148   |925    |4       |Bernard          |Ross                |
|registration_code|airport_code|carrier_code|registration_date|canceled|on_time|delayed|diverted|statistician_name|statistician_surname|
|-----------------|------------|------------|-----------------|--------|-------|-------|--------|-----------------|--------------------|
|000000134        |MSP         |DL          |2014-01-01       |61      |3148   |925    |4       |Anne             |Johnson             |
|000000134        |MSP         |DL          |2014-01-01       |61      |3148   |925    |4       |Anne             |Johnson             |
Cleaning Data in SQL Server Databases

Finding repeating groups

  • We have a problem to exclude!
|registration_code|airport_code|carrier_code|registration_date|canceled|on_time|delayed|diverted|statistician_name|statistician_surname|
|-----------------|------------|------------|-----------------|--------|-------|-------|--------|-----------------|--------------------|
|000000134        |MSP         |DL          |2014-01-01       |61      |3148   |925    |4       |Anne             |Johnson             |
|000000150        |MSP         |DL          |2014-01-01       |3148    |61     |4      |925     |Bernard          |Ross                |

Investigate!

Cleaning Data in SQL Server Databases

Detecting duplicate data - ROW_NUMBER()

ROW_NUMBER () OVER ( 
      [ PARTITION BY 
         value_expression ,
         ... [ n ] ] 
        order_by_clause )
  • Partitions = repeating groups
  • Returns a number starting at 1 for the first row in every partition
  • Returns sequential number for each row within the same partition
Cleaning Data in SQL Server Databases

Detecting duplicate data - ROW_NUMBER()

SELECT *,
       ROW_NUMBER() OVER (
            PARTITION BY 
                airport_code, 
                carrier_code, 
                registration_date
            ORDER BY 
                airport_code, 
                carrier_code, 
                registration_date
        ) row_num
FROM flight_statistics
Cleaning Data in SQL Server Databases

Detecting duplicate data - ROW_NUMBER()

| registration_code | airport_code | carrier_code | registration_date | canceled | ... | statistician_name | statistician_surname | row_num |
|-------------------|--------------|--------------|-------------------|----------|-----|-------------------|----------------------|---------|
| ...               | ...          | ...          | ...               | ...      | ... | ...               | ...                  | ...     |
| 000000135         | MSP          | AS           | 2014-01-01        | 0        | ... | Anne              | Johnson              | 1       |
| 000000136         | MSP          | DL           | 2014-01-01        | 61       | ... | Bernard           | Ross                 | 1       |
| 000000134         | MSP          | DL           | 2014-01-01        | 61       | ... | Anne              | Johnson              | 2       |
| 000000137         | MSP          | EV           | 2014-01-01        | 117      | ... | Michael           | Andersen             | 1       |
| 000000138         | MSP          | F9           | 2014-01-01        | 0        | ... | Anne              | Johnson              | 1       |
| ...               | ...          | ...          | ...               | ...      | ... | ...               | ...                  | ...     |
Cleaning Data in SQL Server Databases

Detecting duplicate data - ROW_NUMBER()

| registration_code | airport_code | carrier_code | registration_date | canceled | ... | statistician_name | statistician_surname | row_num |
|-------------------|--------------|--------------|-------------------|----------|-----|-------------------|----------------------|---------|
| ...               | ...          | ...          | ...               | ...      | ... | ...               | ...                  | ...     |
| 000000135         | MSP          | AS           | 2014-01-01        | 0        | ... | Anne              | Johnson              | 1       |
| 000000136         | MSP          | DL           | 2014-01-01        | 61       | ... | Bernard           | Ross                 | 1  ***  |
| 000000134         | MSP          | DL           | 2014-01-01        | 61       | ... | Anne              | Johnson              | 2       |
| 000000137         | MSP          | EV           | 2014-01-01        | 117      | ... | Michael           | Andersen             | 1       |
| 000000138         | MSP          | F9           | 2014-01-01        | 0        | ... | Anne              | Johnson              | 1       |
| ...               | ...          | ...          | ...               | ...      | ... | ...               | ...                  | ...     |
Cleaning Data in SQL Server Databases

Detecting duplicate data - ROW_NUMBER()

| registration_code | airport_code | carrier_code | registration_date | canceled | ... | statistician_name | statistician_surname | row_num |
|-------------------|--------------|--------------|-------------------|----------|-----|-------------------|----------------------|---------|
| ...               | ...          | ...          | ...               | ...      | ... | ...               | ...                  | ...     |
| 000000135         | MSP          | AS           | 2014-01-01        | 0        | ... | Anne              | Johnson              | 1       |
| 000000136         | MSP          | DL           | 2014-01-01        | 61       | ... | Bernard           | Ross                 | 1  ***  |
| 000000134         | MSP          | DL           | 2014-01-01        | 61       | ... | Anne              | Johnson              | 2  ***  |
| 000000137         | MSP          | EV           | 2014-01-01        | 117      | ... | Michael           | Andersen             | 1       |
| 000000138         | MSP          | F9           | 2014-01-01        | 0        | ... | Anne              | Johnson              | 1       |
| ...               | ...          | ...          | ...               | ...      | ... | ...               | ...                  | ...     |
  • Get duplicate rows:
    • row_num > 1
  • Exclude duplicate rows:
    • row_num = 1
Cleaning Data in SQL Server Databases

Getting only duplicate rows

WITH cte AS (
    SELECT *, 
        ROW_NUMBER() OVER (
            PARTITION BY 
                airport_code, 
                carrier_code, 
                registration_date
            ORDER BY 
                airport_code, 
                carrier_code, 
                registration_date
        ) row_num
    FROM flight_statistics
)
SELECT * FROM cte
WHERE row_num > 1;
Cleaning Data in SQL Server Databases

Getting only duplicate rows

|registration_code|airport_code|carrier_code|registration_date|canceled|...|statistician_name|statistician_surname|row_num|
|-----------------|------------|------------|-----------------|--------|---|-----------------|--------------------|-------|
|000000131        |JFK         |EV          |2014-02-28       |18      |...|Anne             |Johnson             |2      |
|000000134        |MSP         |DL          |2014-01-01       |61      |...|Anne             |Johnson             |2      |
|000000142        |MSP         |OO          |2014-01-01       |76      |...|Anne             |Johnson             |2      |
|000000143        |MSP         |OO          |2014-01-01       |76      |...|Anne             |Johnson             |3      |
Cleaning Data in SQL Server Databases

Excluding duplicate rows

WITH cte AS (
    SELECT *, 
        ROW_NUMBER() OVER (
            PARTITION BY 
                airport_code, 
                carrier_code, 
                registration_date
            ORDER BY 
                airport_code, 
                carrier_code, 
                registration_date
        ) row_num
    FROM flight_statistics
)
SELECT * FROM cte
WHERE row_num = 1;
Cleaning Data in SQL Server Databases

Excluding duplicate rows

|registration_code|airport_code|carrier_code|registration_date|canceled|...|statistician_name|statistician_surname|row_num|
|-----------------|------------|------------|-----------------|--------|...|-----------------|--------------------|-------|
|...              |...         |...         |...              |...     |...|...              |...                 |1      |
|000000126        |JFK         |VX          |2014-01-31       |12      |...|Bryan            |Page                |1      |
|000000133        |MSP         |AA          |2014-01-01       |15      |...|Peter            |Johnson             |1      |
|000000135        |MSP         |AS          |2014-01-01       |0       |...|Anne             |Johnson             |1      |
|000000136        |MSP         |DL          |2014-01-01       |61      |...|Bernard          |Ross                |1      |
|000000137        |MSP         |EV          |2014-01-01       |117     |...|Michael          |Andersen            |1      |
|000000138        |MSP         |F9          |2014-01-01       |0       |...|Anne             |Johnson             |1      |
|000000139        |MSP         |FL          |2014-01-01       |8       |...|Anne             |Johnson             |1      |
|000000140        |MSP         |MQ          |2014-01-01       |20      |...|Anne             |Johnson             |1      |
|000000141        |MSP         |OO          |2014-01-01       |76      |...|Anne             |Johnson             |1      |
|000000132        |MSP         |YV          |2013-12-01       |0       |...|Michael          |Andersen            |1      |
|...              |...         |...         |...              |...     |...|...              |...                 |...    |
Cleaning Data in SQL Server Databases

Let's practice!

Cleaning Data in SQL Server Databases

Preparing Video For Download...