Strategieën voor meerdere transformaties

Exploratory Data Analysis in SQL

Christina Maimone

Data Scientist

Meerdere transformaties

SELECT * FROM naics;
   id   |                  category                   | businesses 
--------+---------------------------------------------+------------
 111110 | Agriculture: Soybean Farming                |       4788
 111130 | Agriculture | Dry Pea and Bean Farming      |       3606
 111140 | Agriculture: Wheat Farming                  |       6393
 111150 | Agriculture - Corn Farming                  |      26469
 111160 | Agriculture: Rice Farming                   |        949
 111199 | Agriculture - All Other Grain Farming       |      15035
 111211 | Agriculture | Potato Farming                |        617
 611110 | Education - Elementary and Secondary        |     187859
 611210 | Education | Junior Colleges                 |       3961
 611310 | Education: Colleges and Universities        |      29148
Exploratory Data Analysis in SQL

CASE WHEN

-- Case voor elk van :, -, en |

SELECT CASE WHEN category LIKE '%: %' THEN split_part(category, ': ', 1)
WHEN category LIKE '% - %' THEN split_part(category, ' - ', 1)
ELSE split_part(category, ' | ', 1)
END AS major_category, -- geef alias aan resultaat
sum(businesses) -- selecteer ook aantal bedrijven FROM naics GROUP BY major_category; -- Groepeer op bovenstaande categorieën
 major_category |  sum   
----------------+--------
 Education      | 220968
 Agriculture    |  57857
Exploratory Data Analysis in SQL

Recode-tabel

Originele waarden: fruit-tabel

 customer | fav_fruit 
----------+-----------
      349 | apple
      874 | Apple
      703 | apple
      667 | bannana
      622 | banana
      387 | BANANA
      300 | APPLES
      313 |  apple
      499 |  banana
      418 | apple
      841 | BANANA
      300 | APPLE
      754 | apple

Gestandaardiseerde waarden: recode-tabel

 original | standardized  
----------+--------------
 APPLES   | apple
 apple    | apple  
 Apple    | apple
 bannana  | banana
 apple    | apple
  banana  | banana
 banana   | banana
 APPLE    | apple
  apple   | apple
 BANANA   | banana
Exploratory Data Analysis in SQL

Stap 1: CREATE TEMP TABLE

CREATE TEMP TABLE recode AS

  SELECT DISTINCT fav_fruit AS original,  -- originele, rommelige waarden

         fav_fruit AS standardized        -- nieuwe gestandaardiseerde waarden

    FROM fruit;
Exploratory Data Analysis in SQL

Initiële tabel

SELECT * 
  FROM recode;
 original | standardized 
----------+--------------
 APPLES   | APPLES
 apple    | apple  
 Apple    | Apple
 bannana  | bannana
 apple    | apple
  banana  |  banana
 banana   | banana
 APPLE    | APPLE
  apple   |  apple
 BANANA   | BANANA
(10 rows)
Exploratory Data Analysis in SQL

Stap 2: UPDATE-waarden

UPDATE table_name
   SET column_name = new_value
 WHERE condition;
Exploratory Data Analysis in SQL

Stap 2: UPDATE-waarden

-- Alle rijen: kleine letters, witruimte aan uiteinden weg

UPDATE recode 
   SET standardized=trim(lower(original));
-- Specifieke rijen: spelfout corrigeren

UPDATE recode
   SET standardized='banana'
 WHERE standardized LIKE '%nn%';
-- Alle rijen: s aan einde verwijderen

UPDATE recode
   SET standardized=rtrim(standardized, 's');
Exploratory Data Analysis in SQL

Resulterende recode-tabel

SELECT * 
  FROM recode;
 original | standardized 
----------+--------------
 APPLES   | apple
 apple    | apple
 Apple    | apple
 apple    | apple
  banana  | banana
 banana   | banana
 APPLE    | apple
  apple   | apple
 BANANA   | banana
 bannana  | banana
(10 rows)
Exploratory Data Analysis in SQL

Stap 3: JOIN van originele en recode-tabellen

Alleen origineel

SELECT fav_fruit, count(*) 
  FROM fruit 
 GROUP BY fav_fruit;
 fav_fruit | count 
-----------+-------
 APPLES    |     1
 apple     |     1
 apple     |     3
  banana   |     1
 BANANA    |     2
  apple    |     1
 APPLE     |     1
 bannana   |     1
 banana    |     1
 Apple     |     1
(10 rows)

Met gerecodeerde waarden

SELECT standardized, 
       count(*) 
  FROM fruit
       LEFT JOIN recode
       ON fav_fruit=original
 GROUP BY standardized;
 standardized | count 
--------------+-------
 apple        |     8
 banana       |     5
(2 rows)
Exploratory Data Analysis in SQL

Samenvatting

  1. CREATE TEMP TABLE met originele waarden
  2. UPDATE om gestandaardiseerde waarden te maken
  3. JOIN originele data aan gestandaardiseerde data
Exploratory Data Analysis in SQL

Maak de Evanston 311-data schoon!

Exploratory Data Analysis in SQL

Preparing Video For Download...