Cleaning Data in SQL Server Databases
Miriam Antona
Software Engineer
| registration_code | airport_code | carrier_code | registration_date | canceled | on_time | delayed | diverted |
|-------------------|--------------|--------------|-------------------|----------|---------|---------|----------|
| 000000134 | MSP | DL | 2014-01-01 | 61 | 3148 | 925 | 4 |
| 000000134 | MSP | DL | 2014-01-01 | 61 | 3148 | 925 | 4 |
| registration_code | airport_code | carrier_code | registration_date | canceled | on_time | delayed | diverted |
|-------------------|--------------|--------------|-------------------|----------|---------|---------|----------|
| 000000134 | MSP | DL | 2014-01-01 | 61 | 3148 | 925 | 4 |
| 000000150 | MSP | DL | 2014-01-01 | 61 | 3148 | 925 | 4 |
| registration_code | airport_code | carrier_code | registration_date | canceled | on_time | delayed | diverted |
|-------------------|--------------|--------------|-------------------|----------|---------|---------|----------|
| 000000134 | MSP | DL | 2014-01-01 | 61 | 3148 | 925 | 4 |
| 000000150 | MSP | DL | 2014-01-01 | 3148 | 61 | 4 | 925 |
flight_statistics
|registration_code|airport_code|carrier_code|registration_date|canceled|on_time|delayed|diverted|statistician_name|statistician_surname|
|-----------------|------------|------------|-----------------|--------|-------|-------|--------|-----------------|--------------------|
No problem to exclude duplicate rows:
|registration_code|airport_code|carrier_code|registration_date|canceled|on_time|delayed|diverted|statistician_name|statistician_surname|
|-----------------|------------|------------|-----------------|--------|-------|-------|--------|-----------------|--------------------|
|000000134 |MSP |DL |2014-01-01 |61 |3148 |925 |4 |Anne |Johnson |
|000000150 |MSP |DL |2014-01-01 |61 |3148 |925 |4 |Bernard |Ross |
|registration_code|airport_code|carrier_code|registration_date|canceled|on_time|delayed|diverted|statistician_name|statistician_surname|
|-----------------|------------|------------|-----------------|--------|-------|-------|--------|-----------------|--------------------|
|000000134 |MSP |DL |2014-01-01 |61 |3148 |925 |4 |Anne |Johnson |
|000000134 |MSP |DL |2014-01-01 |61 |3148 |925 |4 |Anne |Johnson |
|registration_code|airport_code|carrier_code|registration_date|canceled|on_time|delayed|diverted|statistician_name|statistician_surname|
|-----------------|------------|------------|-----------------|--------|-------|-------|--------|-----------------|--------------------|
|000000134 |MSP |DL |2014-01-01 |61 |3148 |925 |4 |Anne |Johnson |
|000000150 |MSP |DL |2014-01-01 |3148 |61 |4 |925 |Bernard |Ross |
Investigate!
ROW_NUMBER () OVER (
[ PARTITION BY
value_expression ,
... [ n ] ]
order_by_clause )
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY
airport_code,
carrier_code,
registration_date
ORDER BY
airport_code,
carrier_code,
registration_date
) row_num
FROM flight_statistics
| registration_code | airport_code | carrier_code | registration_date | canceled | ... | statistician_name | statistician_surname | row_num |
|-------------------|--------------|--------------|-------------------|----------|-----|-------------------|----------------------|---------|
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 000000135 | MSP | AS | 2014-01-01 | 0 | ... | Anne | Johnson | 1 |
| 000000136 | MSP | DL | 2014-01-01 | 61 | ... | Bernard | Ross | 1 |
| 000000134 | MSP | DL | 2014-01-01 | 61 | ... | Anne | Johnson | 2 |
| 000000137 | MSP | EV | 2014-01-01 | 117 | ... | Michael | Andersen | 1 |
| 000000138 | MSP | F9 | 2014-01-01 | 0 | ... | Anne | Johnson | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| registration_code | airport_code | carrier_code | registration_date | canceled | ... | statistician_name | statistician_surname | row_num |
|-------------------|--------------|--------------|-------------------|----------|-----|-------------------|----------------------|---------|
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 000000135 | MSP | AS | 2014-01-01 | 0 | ... | Anne | Johnson | 1 |
| 000000136 | MSP | DL | 2014-01-01 | 61 | ... | Bernard | Ross | 1 *** |
| 000000134 | MSP | DL | 2014-01-01 | 61 | ... | Anne | Johnson | 2 |
| 000000137 | MSP | EV | 2014-01-01 | 117 | ... | Michael | Andersen | 1 |
| 000000138 | MSP | F9 | 2014-01-01 | 0 | ... | Anne | Johnson | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| registration_code | airport_code | carrier_code | registration_date | canceled | ... | statistician_name | statistician_surname | row_num |
|-------------------|--------------|--------------|-------------------|----------|-----|-------------------|----------------------|---------|
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 000000135 | MSP | AS | 2014-01-01 | 0 | ... | Anne | Johnson | 1 |
| 000000136 | MSP | DL | 2014-01-01 | 61 | ... | Bernard | Ross | 1 *** |
| 000000134 | MSP | DL | 2014-01-01 | 61 | ... | Anne | Johnson | 2 *** |
| 000000137 | MSP | EV | 2014-01-01 | 117 | ... | Michael | Andersen | 1 |
| 000000138 | MSP | F9 | 2014-01-01 | 0 | ... | Anne | Johnson | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
WITH cte AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY
airport_code,
carrier_code,
registration_date
ORDER BY
airport_code,
carrier_code,
registration_date
) row_num
FROM flight_statistics
)
SELECT * FROM cte
WHERE row_num > 1;
|registration_code|airport_code|carrier_code|registration_date|canceled|...|statistician_name|statistician_surname|row_num|
|-----------------|------------|------------|-----------------|--------|---|-----------------|--------------------|-------|
|000000131 |JFK |EV |2014-02-28 |18 |...|Anne |Johnson |2 |
|000000134 |MSP |DL |2014-01-01 |61 |...|Anne |Johnson |2 |
|000000142 |MSP |OO |2014-01-01 |76 |...|Anne |Johnson |2 |
|000000143 |MSP |OO |2014-01-01 |76 |...|Anne |Johnson |3 |
WITH cte AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY
airport_code,
carrier_code,
registration_date
ORDER BY
airport_code,
carrier_code,
registration_date
) row_num
FROM flight_statistics
)
SELECT * FROM cte
WHERE row_num = 1;
|registration_code|airport_code|carrier_code|registration_date|canceled|...|statistician_name|statistician_surname|row_num|
|-----------------|------------|------------|-----------------|--------|...|-----------------|--------------------|-------|
|... |... |... |... |... |...|... |... |1 |
|000000126 |JFK |VX |2014-01-31 |12 |...|Bryan |Page |1 |
|000000133 |MSP |AA |2014-01-01 |15 |...|Peter |Johnson |1 |
|000000135 |MSP |AS |2014-01-01 |0 |...|Anne |Johnson |1 |
|000000136 |MSP |DL |2014-01-01 |61 |...|Bernard |Ross |1 |
|000000137 |MSP |EV |2014-01-01 |117 |...|Michael |Andersen |1 |
|000000138 |MSP |F9 |2014-01-01 |0 |...|Anne |Johnson |1 |
|000000139 |MSP |FL |2014-01-01 |8 |...|Anne |Johnson |1 |
|000000140 |MSP |MQ |2014-01-01 |20 |...|Anne |Johnson |1 |
|000000141 |MSP |OO |2014-01-01 |76 |...|Anne |Johnson |1 |
|000000132 |MSP |YV |2013-12-01 |0 |...|Michael |Andersen |1 |
|... |... |... |... |... |...|... |... |... |
Cleaning Data in SQL Server Databases