Categorical standardization

Cleaning Data in Java

Dennis Lee

Software Engineer

Why standardize categories?

  • Inconsistent category names: can't group data
  • Different ways to write "Fruits & Vegetables"
  • Solution: standardize categories

 

 

Raw Category          Quantity
Fruits & Vegetables   10
fruits & veg          20
Fruits&Veg            30

Steps for standardization

  1. Define standard categories
    FRUITS_AND_VEGETABLES
    DAIRY
    UNCATEGORIZED
    
  2. Map variations to standard categories
    Fruits & Vegetables -> FRUITS_AND_VEGETABLES
    fruits & veg -> FRUITS_AND_VEGETABLES
    Fruits&Veg -> FRUITS_AND_VEGETABLES
    
  3. Perform analysis - e.g., count categories like FRUITS_AND_VEGETABLES

Define standard categories with enum

// Define standard categories
public enum ProductCategory {
    FRUITS_AND_VEGETABLES,
    DAIRY,
    UNCATEGORIZED;
}

// Example category
System.out.println(ProductCategory.FRUITS_AND_VEGETABLES);
FRUITS_AND_VEGETABLES

Map raw categories to standard categories

import java.util.HashMap;
import java.util.Map;
// Define map of raw category variations to standard categories
Map<String, ProductCategory> rawCategories = new HashMap<>();
// Variation 1: Fruits & Vegetables
rawCategories.put("Fruits & Vegetables", ProductCategory.FRUITS_AND_VEGETABLES);
// Variation 2: fruits & veg
rawCategories.put("fruits & veg", ProductCategory.FRUITS_AND_VEGETABLES);
// Variation 3: Fruits&Veg
rawCategories.put("Fruits&Veg", ProductCategory.FRUITS_AND_VEGETABLES);

Map variations to standard categories: outputs

System.out.println(rawCategories.get("fruits & veg")); // Get standard category
FRUITS_AND_VEGETABLES

 

rawCategories: for visualization only; columns represent key-value pairs in the Map

Raw Category          Standard Category
Fruits & Vegetables   FRUITS_AND_VEGETABLES
fruits & veg          FRUITS_AND_VEGETABLES
Fruits&Veg            FRUITS_AND_VEGETABLES
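
Listing every spelling by hand grows quickly. One possible way to reduce the number of entries is to normalize each raw string before the lookup. The sketch below is an assumption, not part of the course code: the class NormalizedLookup and the helper normalize are hypothetical names, and the rule (trim, lowercase, drop whitespace) is only one of many reasonable choices.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class NormalizedLookup {
    enum ProductCategory { FRUITS_AND_VEGETABLES, DAIRY, UNCATEGORIZED }

    // Hypothetical helper: trim, lowercase, and remove all whitespace
    static String normalize(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", "");
    }

    // Keys are stored in normalized form, so two entries cover many spellings
    static final Map<String, ProductCategory> NORMALIZED = new HashMap<>();
    static {
        NORMALIZED.put(normalize("Fruits & Vegetables"), ProductCategory.FRUITS_AND_VEGETABLES);
        NORMALIZED.put(normalize("fruits & veg"), ProductCategory.FRUITS_AND_VEGETABLES);
    }

    // Normalize the incoming string the same way before looking it up;
    // returns null when no entry matches
    static ProductCategory lookup(String raw) {
        return NORMALIZED.get(normalize(raw));
    }

    public static void main(String[] args) {
        System.out.println(lookup("Fruits&Veg"));             // FRUITS_AND_VEGETABLES
        System.out.println(lookup("  FRUITS & VEGETABLES ")); // FRUITS_AND_VEGETABLES
    }
}
```

With this rule, "Fruits&Veg" matches the "fruits & veg" entry without its own put call. Spellings that differ in more than case and spacing still need explicit entries.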

Handling unknown categories

String unknownCategory = "Mystery Category";

// Return UNCATEGORIZED if category not found in rawCategories
ProductCategory category = rawCategories.getOrDefault(
        unknownCategory, ProductCategory.UNCATEGORIZED);

System.out.println(unknownCategory + ": " + category);
Mystery Category: UNCATEGORIZED
Raw Category       Standard Category
Mystery Category   UNCATEGORIZED
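
To see why the default matters, the sketch below compares .get(), which returns null for a missing key, with .getOrDefault(). The class name UnknownCategoryDemo, the helper standardize, and the "Milk" entry are hypothetical and only serve as illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UnknownCategoryDemo {
    enum ProductCategory { FRUITS_AND_VEGETABLES, DAIRY, UNCATEGORIZED }

    // Hypothetical helper: fall back to UNCATEGORIZED when the key is missing
    static ProductCategory standardize(Map<String, ProductCategory> mapping, String raw) {
        return mapping.getOrDefault(raw, ProductCategory.UNCATEGORIZED);
    }

    public static void main(String[] args) {
        Map<String, ProductCategory> rawCategories = new HashMap<>();
        rawCategories.put("fruits & veg", ProductCategory.FRUITS_AND_VEGETABLES);
        rawCategories.put("Milk", ProductCategory.DAIRY); // hypothetical variation

        // Plain get() returns null for a key that was never added
        System.out.println(rawCategories.get("Mystery Category")); // null

        for (String raw : List.of("fruits & veg", "Milk", "Mystery Category")) {
            System.out.println(raw + ": " + standardize(rawCategories, raw));
        }
    }
}
```

Routing unknown values to UNCATEGORIZED keeps them visible in later counts instead of turning them into null values.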

Making the mapping immutable

import java.util.Collections;
Map<String, ProductCategory> categories =
        Collections.unmodifiableMap(rawCategories); // Create unmodifiable view

try {
    categories.put("new", ProductCategory.DAIRY); // Will throw exception
} catch (UnsupportedOperationException e) {
    System.out.println("Cannot modify immutable map");
}
Cannot modify immutable map
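
Note that Collections.unmodifiableMap returns a read-only view of rawCategories, not a copy, so later changes made directly to rawCategories still show up through the view. If an independent snapshot is wanted, Map.copyOf (Java 10+) is one option. The sketch below illustrates the difference; the class ViewVersusCopy, its helper methods, and the "Milk" entry are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ViewVersusCopy {
    enum ProductCategory { FRUITS_AND_VEGETABLES, DAIRY, UNCATEGORIZED }

    // Hypothetical helper: a fresh mutable mapping with one entry
    static Map<String, ProductCategory> newRawCategories() {
        Map<String, ProductCategory> raw = new HashMap<>();
        raw.put("fruits & veg", ProductCategory.FRUITS_AND_VEGETABLES);
        return raw;
    }

    // Adds a key to the backing map, then reports whether `derived` can see it
    static boolean seesLaterChange(Map<String, ProductCategory> backing,
                                   Map<String, ProductCategory> derived) {
        backing.put("Milk", ProductCategory.DAIRY); // hypothetical variation
        return derived.containsKey("Milk");
    }

    public static void main(String[] args) {
        Map<String, ProductCategory> first = newRawCategories();
        // Unmodifiable view: reflects changes made to the backing map
        System.out.println(seesLaterChange(first, Collections.unmodifiableMap(first))); // true

        Map<String, ProductCategory> second = newRawCategories();
        // Independent unmodifiable copy: unaffected by later changes
        System.out.println(seesLaterChange(second, Map.copyOf(second))); // false
    }
}
```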

Extract categories from our dataset

// Extract category and quantity from our grocery inventory dataset
Map<String, Integer> rawData = 
        Map.of("Fruits & Vegetables", 10, 
               "fruits & veg", 20,
               "Fruits&Veg", 30);

rawData: Mapping extracted based on our grocery inventory dataset

Raw Category          Quantity
Fruits & Vegetables   10
fruits & veg          20
Fruits&Veg            30

Look up standard category

import java.util.EnumMap;
Map<ProductCategory, Integer> stockByCategory =
        new EnumMap<>(ProductCategory.class);

// Look up the standard category in categories and sum quantities
rawData.forEach((raw, quantity) ->
        stockByCategory.merge(categories.get(raw), quantity, Integer::sum));

categories: Mapping used to look up the standard category

Raw Category          Standard Category
Fruits & Vegetables   FRUITS_AND_VEGETABLES
fruits & veg          FRUITS_AND_VEGETABLES
Fruits&Veg            FRUITS_AND_VEGETABLES

Sum over standard category

System.out.println(stockByCategory);
{FRUITS_AND_VEGETABLES=60}

 

stockByCategory: Mapping after looking up rawData in categories

Standard Category       Quantity
FRUITS_AND_VEGETABLES   60
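
The lookup with categories.get(raw) works here because every raw category has a mapping. For a raw value missing from categories, .get() returns null, and EnumMap does not permit null keys. A minimal sketch that combines the grouping with .getOrDefault() is shown below; the class StockByCategoryDemo, the method sumByCategory, and the "Mystery Category" row are hypothetical additions for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

public class StockByCategoryDemo {
    enum ProductCategory { FRUITS_AND_VEGETABLES, DAIRY, UNCATEGORIZED }

    // Hypothetical helper: sum quantities per standard category,
    // sending unmapped raw categories to UNCATEGORIZED
    static Map<ProductCategory, Integer> sumByCategory(
            Map<String, Integer> rawData, Map<String, ProductCategory> categories) {
        Map<ProductCategory, Integer> totals = new EnumMap<>(ProductCategory.class);
        rawData.forEach((raw, quantity) -> totals.merge(
                categories.getOrDefault(raw, ProductCategory.UNCATEGORIZED),
                quantity, Integer::sum));
        return totals;
    }

    public static void main(String[] args) {
        Map<String, ProductCategory> categories = Map.of(
                "Fruits & Vegetables", ProductCategory.FRUITS_AND_VEGETABLES,
                "fruits & veg", ProductCategory.FRUITS_AND_VEGETABLES,
                "Fruits&Veg", ProductCategory.FRUITS_AND_VEGETABLES);
        Map<String, Integer> rawData = Map.of(
                "Fruits & Vegetables", 10,
                "fruits & veg", 20,
                "Fruits&Veg", 30,
                "Mystery Category", 5); // hypothetical unmapped row

        // EnumMap prints its entries in enum declaration order
        System.out.println(sumByCategory(rawData, categories));
        // {FRUITS_AND_VEGETABLES=60, UNCATEGORIZED=5}
    }
}
```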

Grouping by standard category: summary

1) Extract rawData map of category/quantity from our dataset (map snippet below)

Raw Category          Quantity
Fruits & Vegetables   10

2) Create categories to look up standard categories (map snippet below)

Raw Category          Standard Category
Fruits & Vegetables   FRUITS_AND_VEGETABLES

3) Compute stockByCategory by looking up rawData in categories (map snippet below)

Standard Category       Quantity
FRUITS_AND_VEGETABLES   60

Putting it all together

Key Imports

import java.util.Map;
import java.util.HashMap;
import java.util.EnumMap;
import java.util.Collections;
  • Create an enum for standard categories
  • Build a category mapping with HashMap
  • Make a mapping immutable with Collections.unmodifiableMap
  • Handle unknown categories with .getOrDefault()
  • Group by standardized categories with .merge()

Let's practice!

