Splitting column data

Cleaning Data in PostgreSQL Databases

Darryl Reeves, Ph.D.

Industry Assistant Professor, New York University

Splitting columns

 camis    | inspection_date |                    violation                    | ... 
 ---------+-----------------+-------------------------------------------------+-----
 ...      | ...             | ...                                             | ...
 50038736 | 03/29/2018      | 09B Thawing procedures                          | ...
 50033304 | 12/18/2019      | 02B Hot food item not held at or above 140º ... | ...
 50081658 | 12/13/2018      | 06F Wiping cloths soiled or not stored in sa... | ...
 50033733 | 02/12/2019      | 10B Plumbing not properly installed or maint... | ...
 40559634 | 08/22/2017      | 04N Filth flies or food/refuse/sewage-associ... | ...
 ...      | ...             | ...                                             | ...
Cleaning Data in PostgreSQL Databases

Finding substring starting position with STRPOS()

STRPOS(source_string, search_string)

A diagram of a violation string "09B Thawing procedures" The diagram includes index numbers underneath showing 1 at the 0 in text, 4 at the first space, and 22 at the s ending the string.

SELECT
  STRPOS('09B Thawing procedures', ' ');
4
Cleaning Data in PostgreSQL Databases

Finding substring starting position with STRPOS()

A diagram of a violation string "09B Thawing procedures" The diagram includes index numbers underneath showing 1 at the 0 in text, 4 at the first space, and 22 at the s ending the string.

SELECT
  STRPOS('09B Thawing procedures', '?');
0
Cleaning Data in PostgreSQL Databases

Finding substring starting position with STRPOS()

A diagram of a violation string "09B Thawing procedures" The diagram includes index numbers underneath showing 1 at the 0 in text, 4 at the first space, and 22 at the s ending the string.

SELECT
  STRPOS('09B Thawing procedures', ' ');
4
Cleaning Data in PostgreSQL Databases

Extracting a substring using SUBSTRING()

SUBSTRING(source_string FROM start_pos FOR num_chars)

Cleaning Data in PostgreSQL Databases

Extracting a substring using SUBSTRING()

SUBSTRING('Homerun' FROM 1 FOR 4)Home

A diagram of a violation string "09B Thawing procedures" The diagram includes index numbers underneath showing 1 at the 0 in text, 4 at the first space, and 22 at the s ending the string.

SELECT
  SUBSTRING(
      '09B Thawing procedures'
      FROM 1
      FOR STRPOS('09B Thawing procedures', ' ') - 1
  );
09B
Cleaning Data in PostgreSQL Databases

Extracting a substring using SUBSTRING()

A diagram of a violation string "09B Thawing procedures" The diagram includes index numbers underneath showing 1 at the 0 in text, 4 at the first space, and 22 at the s ending the string. A pair of blue arrows indicate where the violation description is present spanning just this part of the violation.

Requirements:

  • Violation description start position
  • Number of characters in the description
SELECT
  STRPOS('09B Thawing procedures', ' ') + 1;
5
Cleaning Data in PostgreSQL Databases

Calculating the length of a string with LENGTH()

A diagram of a violation string "09B Thawing procedures" The diagram includes index numbers underneath showing 1 at the 0 in text, 4 at the first space, and 22 at the s ending the string. The string "number of characters?" is above the violation description portion of the violation string.

LENGTH(string)INTEGER

SELECT LENGTH('hello')
5
Cleaning Data in PostgreSQL Databases

Calculating the length of a string with LENGTH()

LENGTH('09B Thawing procedures')22

STRPOS('09B Thawing procedures', ' ')4

LENGTH('09B Thawing procedures') - STRPOS('09B Thawing procedures', ' ')18

LENGTH('Thawing procedures')18

Cleaning Data in PostgreSQL Databases

Calculating the length of a string with LENGTH()

SELECT 
    LENGTH('09B Thawing procedures') - 
    STRPOS('09B Thawing procedures', ' ');
18

A diagram of a violation string "09B Thawing procedures" The diagram includes index numbers underneath showing 1 at the 0 in text, 4 at the first space, and 22 at the s ending the string. A pair of blue arrows indicate where the violation description is present spanning just this part of the violation.

Cleaning Data in PostgreSQL Databases

Putting the pieces together

SELECT 
    SUBSTRING(
      '09B Thawing procedures'
      FROM
        STRPOS('09B Thawing procedures', ' ')
        + 1
      FOR
        LENGTH('09B Thawing procedures')
        - STRPOS('09B Thawing procedures', ' ')
    );
Thawing procedures
Cleaning Data in PostgreSQL Databases

Splitting the violation column

SELECT 
    camis, 
    inspection_date, 

    SUBSTRING(
      violation 
      FROM 1 
      FOR STRPOS(violation, ' ') - 1
    ) AS violation_code, 

    SUBSTRING(
      violation 
      FROM STRPOS(violation, ' ') + 1 
      FOR LENGTH(violation) - STRPOS(violation, ' ')
    ) AS violation_description 
FROM 
    restaurant_inspection;
Cleaning Data in PostgreSQL Databases

Splitting the violation column

 camis    | inspection_date |                    violation                    | ... 
 ---------+-----------------+-------------------------------------------------+-----
 ...      | ...             | ...                                             | ...
 50038736 | 03/29/2018      | 09B Thawing procedures                          | ...
 50033304 | 12/18/2019      | 02B Hot food item not held at or above 140º ... | ...
 50081658 | 12/13/2018      | 06F Wiping cloths soiled or not stored in sa... | ...
 50033733 | 02/12/2019      | 10B Plumbing not properly installed or maint... | ...
 40559634 | 08/22/2017      | 04N Filth flies or food/refuse/sewage-associ... | ...
 ...      | ...             | ...                                             | ...
Cleaning Data in PostgreSQL Databases

Splitting the violation column

 camis    | inspection_date | violation_code |            violation_description            | ... 
 ---------+-----------------+----------------+---------------------------------------------+-----
 ...      | ...             | ...            | ...                                         | ...
 50038736 | 03/29/2018      | 09B            | Thawing procedures                          | ...
 50033304 | 12/18/2019      | 02B            | Hot food item not held at or above 140º ... | ...
 50081658 | 12/13/2018      | 06F            | Wiping cloths soiled or not stored in sa... | ...
 50033733 | 02/12/2019      | 10B            | Plumbing not properly installed or maint... | ...
 40559634 | 08/22/2017      | 04N            | Filth flies or food/refuse/sewage-associ... | ...
 ...      | ...             | ...            | ...                                         | ...
Cleaning Data in PostgreSQL Databases

Let's practice!

Cleaning Data in PostgreSQL Databases

Preparing Video For Download...