Cleaning Data in Java
Dennis Lee
Software Engineer
Raw Category | Quantity |
---|---|
Fruits & Vegetables | 10 |
fruits & veg | 20 |
Fruits&Veg | 30 |
FRUITS_AND_VEGETABLES
DAIRY
UNCATEGORIZED
Fruits & Vegetables -> FRUITS_AND_VEGETABLES
fruits & veg -> FRUITS_AND_VEGETABLES
Fruits&Veg -> FRUITS_AND_VEGETABLES
FRUITS_AND_VEGETABLES
// Define standard categories
public enum ProductCategory {
FRUITS_AND_VEGETABLES,
DAIRY,
UNCATEGORIZED;
}
// Example category
System.out.println(ProductCategory.FRUITS_AND_VEGETABLES);
FRUITS_AND_VEGETABLES
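An enum gives a fixed vocabulary, but it only recognizes exact constant names. The following is a minimal sketch of that behavior, not part of the original slides: the class `EnumBasicsSketch` and the helper `parseOrUncategorized` are hypothetical names introduced for illustration, and the helper assumes a non-null input.

```java
import java.util.Arrays;

public class EnumBasicsSketch {
    // Same standard categories as defined above
    public enum ProductCategory {
        FRUITS_AND_VEGETABLES,
        DAIRY,
        UNCATEGORIZED
    }

    // Hypothetical helper: parse an exact constant name, else fall back.
    // Assumes name is non-null (valueOf throws NullPointerException on null).
    static ProductCategory parseOrUncategorized(String name) {
        try {
            return ProductCategory.valueOf(name);
        } catch (IllegalArgumentException e) {
            // valueOf rejects anything that is not an exact constant name
            return ProductCategory.UNCATEGORIZED;
        }
    }

    public static void main(String[] args) {
        // values() returns the constants in declaration order
        System.out.println(Arrays.toString(ProductCategory.values()));
        System.out.println(parseOrUncategorized("DAIRY"));
        System.out.println(parseOrUncategorized("fruits & veg"));
    }
}
```

Raw strings such as "fruits & veg" never match a constant name directly, which is why the raw variations need an explicit lookup from raw text to standard category.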
import java.util.HashMap;
import java.util.Map;
// Define map of raw category variations to standard categories
Map<String, ProductCategory> rawCategories = new HashMap<>();
// Variation 1: Fruits & Vegetables
rawCategories.put("Fruits & Vegetables", ProductCategory.FRUITS_AND_VEGETABLES);
// Variation 2: fruits & veg
rawCategories.put("fruits & veg", ProductCategory.FRUITS_AND_VEGETABLES);
// Variation 3: Fruits&Veg
rawCategories.put("Fruits&Veg", ProductCategory.FRUITS_AND_VEGETABLES);
System.out.println(rawCategories.get("fruits & veg")); // Get standard category
FRUITS_AND_VEGETABLES
rawCategories: for visualization only; columns represent key-value pairs in the Map
Raw Category | Standard Category |
---|---|
Fruits & Vegetables | FRUITS_AND_VEGETABLES |
fruits & veg | FRUITS_AND_VEGETABLES |
Fruits&Veg | FRUITS_AND_VEGETABLES |
String unknownCategory = "Mystery Category";
// Return UNCATEGORIZED if category not found in rawCategories
ProductCategory category = rawCategories.getOrDefault(
unknownCategory, ProductCategory.UNCATEGORIZED);
System.out.println("Unknown: " + category);
Unknown: UNCATEGORIZED
Raw Category | Standard Category |
---|---|
Mystery Category | UNCATEGORIZED |
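The difference between `get` and `getOrDefault` for an unknown key can be seen in a small runnable sketch. The class `CategoryLookupSketch` and the helper `standardize` are hypothetical names used only for this illustration; the map contents are the three variations shown earlier.

```java
import java.util.HashMap;
import java.util.Map;

public class CategoryLookupSketch {
    public enum ProductCategory { FRUITS_AND_VEGETABLES, DAIRY, UNCATEGORIZED }

    // Raw variations mapped to standard categories, as in the slides
    static final Map<String, ProductCategory> RAW_CATEGORIES = new HashMap<>();
    static {
        RAW_CATEGORIES.put("Fruits & Vegetables", ProductCategory.FRUITS_AND_VEGETABLES);
        RAW_CATEGORIES.put("fruits & veg", ProductCategory.FRUITS_AND_VEGETABLES);
        RAW_CATEGORIES.put("Fruits&Veg", ProductCategory.FRUITS_AND_VEGETABLES);
    }

    // Hypothetical helper: never returns null for a missing key
    static ProductCategory standardize(String raw) {
        return RAW_CATEGORIES.getOrDefault(raw, ProductCategory.UNCATEGORIZED);
    }

    public static void main(String[] args) {
        // Plain get() returns null when the key is missing
        System.out.println(RAW_CATEGORIES.get("Mystery Category"));
        // getOrDefault() returns the fallback instead
        System.out.println(standardize("Mystery Category"));
        System.out.println(standardize("fruits & veg"));
    }
}
```

Returning UNCATEGORIZED rather than null keeps later steps from having to null-check every lookup.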
import java.util.Collections;
// Create an unmodifiable (read-only) view of rawCategories
Map<String, ProductCategory> categories = Collections.unmodifiableMap(rawCategories);
try {
    categories.put("new", ProductCategory.DAIRY); // Will throw exception
} catch (UnsupportedOperationException e) {
    System.out.println("Cannot modify immutable map");
}
Cannot modify immutable map
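One detail worth knowing: `Collections.unmodifiableMap` returns a read-only view, so later changes to the underlying map still show through it. The sketch below contrasts that with `Map.copyOf` (Java 10+), which takes an independent snapshot. The class name `UnmodifiableViewSketch`, its two helper methods, and the raw key "Milk" are hypothetical and used only for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class UnmodifiableViewSketch {
    public enum ProductCategory { FRUITS_AND_VEGETABLES, DAIRY, UNCATEGORIZED }

    // Build a small mutable backing map
    static Map<String, ProductCategory> newBackingMap() {
        Map<String, ProductCategory> rawCategories = new HashMap<>();
        rawCategories.put("Fruits & Vegetables", ProductCategory.FRUITS_AND_VEGETABLES);
        return rawCategories;
    }

    // Does a read-only view see an entry added to the backing map afterwards?
    static boolean viewSeesLaterChange() {
        Map<String, ProductCategory> rawCategories = newBackingMap();
        Map<String, ProductCategory> view = Collections.unmodifiableMap(rawCategories);
        rawCategories.put("Milk", ProductCategory.DAIRY); // hypothetical raw key
        return view.containsKey("Milk");
    }

    // Does a Map.copyOf snapshot see an entry added afterwards?
    static boolean snapshotSeesLaterChange() {
        Map<String, ProductCategory> rawCategories = newBackingMap();
        Map<String, ProductCategory> snapshot = Map.copyOf(rawCategories);
        rawCategories.put("Milk", ProductCategory.DAIRY); // hypothetical raw key
        return snapshot.containsKey("Milk");
    }

    public static void main(String[] args) {
        System.out.println(viewSeesLaterChange());     // true
        System.out.println(snapshotSeesLaterChange()); // false
    }
}
```

If no other code keeps a reference to the original map, the view behaves as effectively immutable; a snapshot is the safer choice when the backing map may still change.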
// Extract category and quantity from our grocery inventory dataset
Map<String, Integer> rawData =
Map.of("Fruits & Vegetables", 10,
"fruits & veg", 20,
"Fruits&Veg", 30);
rawData: mapping extracted from our grocery inventory dataset
Raw Category | Quantity |
---|---|
Fruits & Vegetables | 10 |
fruits & veg | 20 |
Fruits&Veg | 30 |
import java.util.EnumMap;
Map<ProductCategory, Integer> stockByCategory = new EnumMap<>(ProductCategory.class);
rawData.forEach((raw, quantity) ->
    // Look up standard category in categories and sum quantities
    stockByCategory.merge(categories.get(raw), quantity, Integer::sum));
categories: mapping used to look up the standard category
Raw Category | Standard Category |
---|---|
Fruits & Vegetables | FRUITS_AND_VEGETABLES |
fruits & veg | FRUITS_AND_VEGETABLES |
Fruits&Veg | FRUITS_AND_VEGETABLES |
System.out.println(stockByCategory);
{FRUITS_AND_VEGETABLES=60}
stockByCategory: mapping after looking up rawData in categories
Standard Category | Quantity |
---|---|
FRUITS_AND_VEGETABLES | 60 |
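The aggregation step can be put together as a self-contained sketch. One assumption differs from the slide code: the lookup uses `getOrDefault` instead of `get`, because `get` returns null for an unlisted raw category and `EnumMap` does not permit null keys. The class `StockAggregationSketch`, the method `aggregate`, and the extra "Mystery Category" quantity of 5 are hypothetical and used only for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

public class StockAggregationSketch {
    public enum ProductCategory { FRUITS_AND_VEGETABLES, DAIRY, UNCATEGORIZED }

    // Sum quantities per standard category; unknown raw values go to UNCATEGORIZED
    static Map<ProductCategory, Integer> aggregate(
            Map<String, ProductCategory> categories,
            Map<String, Integer> rawData) {
        Map<ProductCategory, Integer> stock = new EnumMap<>(ProductCategory.class);
        rawData.forEach((raw, quantity) ->
            // merge() stores quantity if the key is absent, else adds it to the current total
            stock.merge(
                categories.getOrDefault(raw, ProductCategory.UNCATEGORIZED),
                quantity,
                Integer::sum));
        return stock;
    }

    public static void main(String[] args) {
        Map<String, ProductCategory> categories = Map.of(
            "Fruits & Vegetables", ProductCategory.FRUITS_AND_VEGETABLES,
            "fruits & veg", ProductCategory.FRUITS_AND_VEGETABLES,
            "Fruits&Veg", ProductCategory.FRUITS_AND_VEGETABLES);

        Map<String, Integer> rawData = Map.of(
            "Fruits & Vegetables", 10,
            "fruits & veg", 20,
            "Fruits&Veg", 30,
            "Mystery Category", 5); // hypothetical unknown entry

        // EnumMap iterates in enum declaration order
        System.out.println(aggregate(categories, rawData));
        // prints {FRUITS_AND_VEGETABLES=60, UNCATEGORIZED=5}
    }
}
```

Routing unknowns into UNCATEGORIZED keeps their quantities visible in the totals, so unmapped raw values can be reviewed and added to the lookup later.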
1) Extract rawData map of category/quantity from our dataset (map snippet below)
Raw Category | Quantity |
---|---|
Fruits & Vegetables | 10 |
2) Create categories to look up standard categories (map snippet below)
Raw Category | Standard Category |
---|---|
Fruits & Vegetables | FRUITS_AND_VEGETABLES |
3) Compute stockByCategory by looking up rawData in categories (map snippet below)
Standard Category | Quantity |
---|---|
FRUITS_AND_VEGETABLES | 60 |
Key Imports
import java.util.Map;
import java.util.HashMap;
import java.util.EnumMap;
import java.util.Collections;
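enum for standard categories
HashMap for mapping raw variations to standard categories
Collections.unmodifiableMap for a read-only view of the map
.getOrDefault() for a fallback when a category is not found
.merge() for summing quantities by standard category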