String normalization

Cleaning Data in Java

Dennis Lee

Software Engineer

Grocery inventory dataset

  • Work with a grocery inventory dataset throughout the chapter
  • Modify certain columns like product name to illustrate data transformations

 

Product Name               | Category            | Status       | Date Received | Quantity
Eggplant                   | Fruits & Vegetables | Discontinued | 3/1/25        | 46
VEGETABLE OIL              | Oils & Fats         | Backordered  | 4/1/25        | 51
Cheese                     | Dairy               | Active       | 6/1/25        | 78
Fresh (Organic) *Carrots*  | Fruits & Vegetables | Discontinued | 5/1/25        | 51
Bell Pepper Fresh          | Fruits & Vegetables | Active       | 5/2/25        | 67

Why string normalization matters

  • Messy strings: computers treat text that differs only in case, spacing, or punctuation as different values
  • Example: grocery store inventory
  • Solution: normalize strings

 

String[] messyProducts = {
    "Eggplant ",              // Extra whitespace at the end
    "VEGETABLE OIL",             // Inconsistent case
    "Fresh (Organic) *Carrots*", // Special characters
    "Bell   Pepper    Fresh",    // Extra whitespace between words
};
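To make the problem concrete, here is a minimal sketch (the class name `MessyStringDemo` and the helper `sameProduct` are illustrative, not part of the course code) showing that Java's `equals` treats each messy variant as a different string from its clean form:

```java
class MessyStringDemo {
    // Returns true only when the two strings are character-for-character identical.
    static boolean sameProduct(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(sameProduct("Eggplant ", "Eggplant"));           // false: trailing space
        System.out.println(sameProduct("VEGETABLE OIL", "Vegetable Oil")); // false: different case
        System.out.println(sameProduct("Bell   Pepper", "Bell Pepper"));   // false: extra inner spaces
    }
}
```

Any lookup, grouping, or duplicate check built on exact equality would count these as separate products.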

Removing leading and trailing whitespace

String[] products = {"Eggplant ", "  Vegetable Oil", " Cheese "};
for (String product : products) {
    String cleaned = product.trim(); // Removes leading/trailing whitespace
    System.out.println(cleaned);
}
Eggplant
Vegetable Oil
Cheese
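One caveat worth knowing: `trim()` only removes characters at or below U+0020, so some Unicode whitespace survives it. The sketch below assumes Java 11 or later, where `String.strip()` is available; the class name and the sample value are illustrative.

```java
class TrimVsStripDemo {
    // "Cheese" padded on both sides with an EM SPACE (U+2003), which sits above U+0020.
    static final String PADDED = "\u2003Cheese\u2003";

    public static void main(String[] args) {
        System.out.println(PADDED.trim().length());  // 8: trim() leaves the em spaces in place
        System.out.println(PADDED.strip().length()); // 6: strip() removes Unicode whitespace
    }
}
```

If the data may contain whitespace pasted from documents or web pages, `strip()` may be the safer choice.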

Standardizing text case formats

import java.util.Arrays;
import java.util.List;

List<String> products = Arrays.asList("Eggplant", "VEGETABLE OIL", "cheese");

products.stream()
        .map(String::toLowerCase)      // Convert all to lowercase
        .forEach(System.out::println); // Print each product
eggplant
vegetable oil
cheese
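`toLowerCase()` with no argument uses the JVM's default locale, which can change the result for some letters. As a hedged sketch (the class name, the helper `toKey`, and the sample word are illustrative), passing `Locale.ROOT` keeps the output the same on every machine:

```java
import java.util.Locale;

class LowerCaseLocaleDemo {
    // Lowercases with locale-neutral rules so the result does not depend on machine settings.
    static String toKey(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        System.out.println("MILK".toLowerCase(turkish)); // "mılk": Turkish rules map I to dotless ı
        System.out.println(toKey("MILK"));               // "milk"
    }
}
```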

Regex patterns

  • Regular expressions (regex): special patterns that match text [1]
  • [^a-zA-Z\\s]: matches any single character that is not an ASCII letter or whitespace
// This pattern matches any character that is NOT:
[       // Start a character set
^       // NOT - match anything not in this set
a-z     // any lowercase letter
A-Z     // any uppercase letter
\\s     // any whitespace character (a Java string literal needs \\ to produce the regex \s)
]       // End character set

[1] https://www.datacamp.com/cheat-sheet/regular-expresso
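To check what the pattern actually matches, the following sketch lists every matching character in a sample product name. The class name `RegexMatchDemo` and the helper `findMatches` are illustrative; `Pattern` and `Matcher` are the standard `java.util.regex` classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexMatchDemo {
    private static final Pattern NOT_LETTER_OR_SPACE = Pattern.compile("[^a-zA-Z\\s]");

    // Collects every character the pattern matches, in order of appearance.
    static List<String> findMatches(String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = NOT_LETTER_OR_SPACE.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findMatches("Fresh (Organic) *Carrots*")); // [(, ), *, *]
    }
}
```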

Cleaning special characters

String dirtyName = "Fresh (Organic) *Carrots*"; // Special characters: (, ), *

String cleaned = dirtyName.replaceAll("[^a-zA-Z\\s]", ""); // Remove everything except letters and whitespace
System.out.println(cleaned); // Output: "Fresh Organic Carrots"
Fresh Organic Carrots
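Note that `[^a-zA-Z\\s]` also removes digits and accented letters, which may not be what you want for every dataset. The sketch below compares it with one possible alternative based on the Unicode classes `\p{L}` (letters) and `\p{N}` (numbers); the alternative pattern, the class name, and the sample product name are assumptions for illustration, not part of the course material.

```java
class SpecialCharDemo {
    // The pattern from the slide: keeps only ASCII letters and whitespace.
    static String asciiLettersOnly(String s) {
        return s.replaceAll("[^a-zA-Z\\s]", "");
    }

    // Assumed alternative: keeps any Unicode letter, any Unicode number, and whitespace.
    static String lettersAndDigits(String s) {
        return s.replaceAll("[^\\p{L}\\p{N}\\s]", "");
    }

    public static void main(String[] args) {
        String name = "Caf\u00e9 Latte #2!"; // "Café Latte #2!"
        System.out.println("[" + asciiLettersOnly(name) + "]"); // [Caf Latte ]
        System.out.println("[" + lettersAndDigits(name) + "]"); // [Café Latte 2]
    }
}
```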

Cleaning multiple spaces

import java.util.regex.Pattern;
Pattern pattern = Pattern.compile("\\s+"); // Match one or more whitespace characters

String messyProduct = "Bell   Pepper    Fresh"; // Contains extra spaces
// Replace each run of whitespace with a single space
String cleanedProduct = pattern.matcher(messyProduct).replaceAll(" ");
System.out.println(cleanedProduct);
Bell Pepper Fresh
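Compiling the pattern once pays off when the same cleanup runs over many values. The sketch below (class and method names are illustrative) wraps it in a reusable helper and also shows that collapsing runs of whitespace does not remove a leading or trailing space, so `trim()` is still needed:

```java
import java.util.regex.Pattern;

class CollapseSpacesDemo {
    // Compiled once and reused for every product name.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Replaces each run of whitespace with a single space.
    static String collapse(String s) {
        return WHITESPACE_RUN.matcher(s).replaceAll(" ");
    }

    public static void main(String[] args) {
        String messy = " Bell   Pepper    Fresh ";
        System.out.println("[" + collapse(messy) + "]");        // [ Bell Pepper Fresh ]
        System.out.println("[" + collapse(messy).trim() + "]"); // [Bell Pepper Fresh]
    }
}
```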

Putting it all together

import java.util.Arrays;
import java.util.List;

List<String> messyProducts = Arrays.asList(
        "Eggplant ", "VEGETABLE OIL",
        "Fresh (Organic) *Carrots*", "Bell   Pepper    Fresh"
); // Product names extracted from our grocery inventory dataset

messyProducts.stream()
        .map(s -> s.trim()                       // Fix outer spaces
                .replaceAll("[^a-zA-Z\\s]", "")  // Remove special chars
                .replaceAll("\\s+", " ")         // Fix inner spaces
                .toLowerCase())                  // Standardize case
        .forEach(System.out::println);

Putting it all together: outputs

eggplant
vegetable oil
fresh organic carrots
bell pepper fresh
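Once names are normalized, variants of the same product collapse to a single value. As a rough sketch (the class name, method names, and sample list are illustrative; `List.of` assumes Java 9 or later), the same four steps can be packaged into a helper and used to remove duplicates:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class NormalizeDemo {
    // Applies the same four steps as the pipeline above, in the same order.
    static String normalize(String raw) {
        return raw.trim()                        // Fix outer spaces
                .replaceAll("[^a-zA-Z\\s]", "")  // Remove special chars
                .replaceAll("\\s+", " ")         // Fix inner spaces
                .toLowerCase();                  // Standardize case
    }

    // Normalizes every name and keeps the first occurrence of each distinct result.
    static Set<String> distinctProducts(List<String> names) {
        return names.stream()
                .map(NormalizeDemo::normalize)
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    public static void main(String[] args) {
        List<String> names = List.of(
                "Eggplant ", "EGGPLANT", "Bell   Pepper    Fresh", "bell pepper fresh");
        System.out.println(distinctProducts(names)); // [eggplant, bell pepper fresh]
    }
}
```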

Let's practice!
