Cleaning Data in PostgreSQL Databases
Darryl Reeves, Ph.D.
Industry Assistant Professor, New York University
... | name | score | inspection_type | ...
----+------------------+-------+----------------------------------------------+-----
... | ... | ... | ... | ...
... | SCHNIPPERS | 27 | Cycle Inspection / Initial Inspection | ...
... | ATOMIC WINGS | | Administrative Miscellaneous / Re-inspection | ...
... | WING LING | 44 | Cycle Inspection / Initial Inspection | ...
... | JUAN VALDEZ CAFE | 24 | Cycle Inspection / Initial Inspection | ...
... | FULTON GRAND | 22 | Cycle Inspection / Initial Inspection | ...
... | ... | ... | ... | ...
Representations for missing values:
NULL
(general)''
- empty string (used for string columns) ... | name | score | inspection_type | ...
----+------------------+-------+----------------------------------------------+-----
... | ... | ... | ... | ...
... | SCHNIPPERS | 27 | Cycle Inspection / Initial Inspection | ...
... | ATOMIC WINGS | | Administrative Miscellaneous / Re-inspection | ...
... | WING LING | 44 | Cycle Inspection / Initial Inspection | ...
... | JUAN VALDEZ CAFE | 24 | Cycle Inspection / Initial Inspection | ...
... | FULTON GRAND | 22 | Cycle Inspection / Initial Inspection | ...
... | ... | ... | ... | ...
SELECT
*
FROM
restaurant_inspection
WHERE
score IS NULL;
SELECT
COUNT(*)
FROM
restaurant_inspection
WHERE
score IS NULL;
SELECT
inspection_type,
COUNT(*) as count
FROM
restaurant_inspection
WHERE
score IS NULL
GROUP BY
inspection_type
ORDER BY
count DESC;
inspection_type | count
--------------------------------------------------+-------
Administrative Miscellaneous / Initial Inspection | 104
Smoke-Free Air Act / Initial Inspection | 29
Calorie Posting / Initial Inspection | 22
Administrative Miscellaneous / Re-inspection | 22
Trans Fat / Initial Inspection | 18
Smoke-Free Air Act / Re-inspection | 7
Trans Fat / Re-inspection | 3
Best option: locate and add missing values
Provide a value (average, median, etc)
Exclude records
COALESCE(arg1, [arg2, ...])
SELECT
name,
COALESCE(score, -1),
inspection_type
FROM
restaurant_inspection;
... | name | score | inspection_type | ...
----+------------------+-------+----------------------------------------------+-----
... | ... | ... | ... | ...
... | SCHNIPPERS | 27 | Cycle Inspection / Initial Inspection | ...
... | ATOMIC WINGS | -1 | Administrative Miscellaneous / Re-inspection | ...
... | WING LING | 44 | Cycle Inspection / Initial Inspection | ...
... | JUAN VALDEZ CAFE | 24 | Cycle Inspection / Initial Inspection | ...
... | FULTON GRAND | 22 | Cycle Inspection / Initial Inspection | ...
... | ... | ... | ... | ...
Cleaning Data in PostgreSQL Databases