Cleaning Data in Java
Dennis Lee
Software Engineer
Product Name | Category | Status | Date Received | Quantity |
---|---|---|---|---|
Eggplant | Fruits & Vegetables | Discontinued | 3/1/25 | 46 |
VEGETABLE OIL | Oils & Fats | Backordered | 4/1/25 | 51 |
Cheese | Dairy | Active | 6/1/25 | 78 |
Fresh (Organic) *Carrots* | Fruits & Vegetables | Discontinued | 5/1/25 | 51 |
Bell Pepper Fresh | Fruits & Vegetables | Active | 5/2/25 | 67 |
String[] messyProducts = {
"Eggplant ", // Extra whitespace at the end
"VEGETABLE OIL", // Inconsistent case
"Fresh (Organic) *Carrots*", // Special characters
"Bell  Pepper   Fresh", // Extra whitespace between words
};
String[] products = {"Eggplant ", " Vegetable Oil", " Cheese "};
for (String product : products) {
String cleaned = product.trim(); // Removes leading/trailing whitespace
System.out.println(cleaned);
}
Eggplant
Vegetable Oil
Cheese
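As a small sketch of what trim() does and does not cover: it only removes leading and trailing characters at or below U+0020 and leaves inner whitespace alone. On Java 11 or later, strip() also removes other Unicode whitespace such as an em space. The class name WhitespaceDemo and the em-space input are assumptions made for illustration, not part of the inventory dataset.

```java
// Illustrative sketch; WhitespaceDemo is a hypothetical class name.
class WhitespaceDemo {
    // trim() removes leading/trailing characters whose code point is <= U+0020.
    static String withTrim(String s) {
        return s.trim();
    }

    // strip() (Java 11+) removes leading/trailing characters for which
    // Character.isWhitespace is true, which includes the em space U+2003.
    static String withStrip(String s) {
        return s.strip();
    }

    public static void main(String[] args) {
        String padded = "\u2003Cheese\u2003"; // hypothetical input padded with em spaces
        System.out.println("[" + withTrim(padded) + "]");  // em spaces are still there
        System.out.println("[" + withStrip(padded) + "]"); // em spaces are gone
        // Neither method touches whitespace between words.
        System.out.println("[" + withTrim(" Vegetable  Oil ") + "]");
    }
}
```

If the data only ever contains plain ASCII spaces, trim() and strip() behave the same on it.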
import java.util.Arrays;
import java.util.List;
List<String> products = Arrays.asList("Eggplant", "VEGETABLE OIL", "cheese");
products.stream()
.map(String::toLowerCase) // Convert all to lowercase
.forEach(System.out::println); // Print each product
eggplant
vegetable oil
cheese
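One detail worth knowing: toLowerCase() with no argument uses the JVM's default locale, so results can differ between machines. A hedged sketch, assuming locale-independent output is wanted, passes Locale.ROOT and collects the results into a list instead of printing them. CaseNormalizer and lowerAll are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Illustrative sketch; CaseNormalizer and lowerAll are hypothetical names.
class CaseNormalizer {
    // Lowercases every name using Locale.ROOT so the result does not
    // depend on the default locale of the machine running the code.
    static List<String> lowerAll(List<String> names) {
        return names.stream()
                .map(name -> name.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> products = Arrays.asList("Eggplant", "VEGETABLE OIL", "cheese");
        System.out.println(lowerAll(products));

        // Under a Turkish locale, "I" lowercases to a dotless i (U+0131),
        // so "OIL" no longer equals "oil" after conversion.
        Locale turkish = Locale.forLanguageTag("tr");
        System.out.println("OIL".toLowerCase(turkish).equals("oil")); // false
    }
}
```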
[^a-zA-Z\\s]: Finds any character that isn't a letter or a whitespace character
// This pattern matches any character that is NOT:
[ // Start a character set
^ // NOT - match anything not in this set
a-z // any lowercase letter
A-Z // any uppercase letter
\\s // any whitespace character (the extra \ escapes the backslash in the Java string literal)
] // End character set
String dirtyName = "Fresh (Organic) *Carrots*"; // Special characters: (, ), *
String cleaned = dirtyName.replaceAll("[^a-zA-Z\\s]", ""); // Remove non-letters
System.out.println(cleaned); // Output: "Fresh Organic Carrots"
Fresh Organic Carrots
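Because the character set only allows a-z, A-Z, and whitespace, it also removes digits and accented letters. The sketch below compares it with a possible variant built on the Unicode classes \p{L} (letters) and \p{N} (numbers). The class name, the method names, and the product name "Cheddar 2% (Café)" are made up for illustration and are not in the dataset.

```java
// Illustrative sketch; SpecialCharFilter and its method names are hypothetical.
class SpecialCharFilter {
    // The pattern from the slide: keep only ASCII letters and whitespace.
    static String lettersOnly(String s) {
        return s.replaceAll("[^a-zA-Z\\s]", "");
    }

    // Assumed variant: keep any Unicode letter, any Unicode digit, and whitespace.
    static String lettersAndDigits(String s) {
        return s.replaceAll("[^\\p{L}\\p{N}\\s]", "");
    }

    public static void main(String[] args) {
        String fromSlide = "Fresh (Organic) *Carrots*";
        System.out.println(lettersOnly(fromSlide)); // Fresh Organic Carrots

        String madeUp = "Cheddar 2% (Caf\u00e9)"; // hypothetical name with a digit and an accent
        System.out.println(lettersOnly(madeUp));      // drops the 2 and the accented e
        System.out.println(lettersAndDigits(madeUp)); // keeps both, drops % ( )
    }
}
```

Which set is right depends on the data: if names like "2% Milk" matter, dropping digits loses information.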
import java.util.regex.Pattern;
Pattern pattern = Pattern.compile("\\s+"); // Match one or more whitespace characters
String messyProduct = "Bell  Pepper   Fresh"; // Contains extra spaces
// Replace each run of whitespace with a single space
String cleanedProduct = pattern.matcher(messyProduct).replaceAll(" ");
System.out.println(cleanedProduct);
Bell Pepper Fresh
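String.replaceAll compiles its regular expression on every call, so when the same pattern is applied to many product names it may be worth compiling it once and reusing it. A minimal sketch, with SpaceCollapser, WHITESPACE_RUN, and collapse as hypothetical names:

```java
import java.util.regex.Pattern;

// Illustrative sketch; SpaceCollapser and its members are hypothetical names.
class SpaceCollapser {
    // Compiled once and shared; Pattern instances are immutable and thread-safe.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Replaces each run of whitespace (spaces, tabs, newlines) with one space.
    static String collapse(String s) {
        return WHITESPACE_RUN.matcher(s).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(collapse("Bell  Pepper   Fresh")); // Bell Pepper Fresh
        System.out.println(collapse("Vegetable\t\tOil"));     // Vegetable Oil
    }
}
```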
List<String> messyProducts = Arrays.asList(
"Eggplant ", "VEGETABLE OIL",
"Fresh (Organic) *Carrots*", "Bell  Pepper   Fresh"
); // Product names extracted from our grocery inventory dataset
messyProducts.stream()
.map(s -> s.trim() // Fix outer spaces
.replaceAll("[^a-zA-Z\\s]", "") // Remove special chars
.replaceAll("\\s+", " ") // Fix inner spaces
.toLowerCase()) // Standardize case
.forEach(System.out::println);
eggplant
vegetable oil
fresh organic carrots
bell pepper fresh
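The order of the steps can matter. If a name starts or ends with a special character followed by a space, trimming first leaves a stray space behind once the character is removed. The sketch below is one possible variant that trims after the other replacements and returns the cleaned names as a list. ProductNameCleaner, clean, cleanAll, and the input "* Carrots *" are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Illustrative sketch; ProductNameCleaner and its methods are hypothetical names.
class ProductNameCleaner {
    // Same four steps as the pipeline above, but with trim() applied after
    // the replacements so edge whitespace they expose is also removed.
    static String clean(String raw) {
        return raw.replaceAll("[^a-zA-Z\\s]", "") // remove special characters
                .replaceAll("\\s+", " ")          // collapse inner whitespace
                .trim()                           // remove outer whitespace
                .toLowerCase(Locale.ROOT);        // standardize case
    }

    static List<String> cleanAll(List<String> rawNames) {
        return rawNames.stream()
                .map(ProductNameCleaner::clean)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> messy = Arrays.asList(
                "Eggplant ", "VEGETABLE OIL",
                "Fresh (Organic) *Carrots*", "Bell  Pepper   Fresh",
                "* Carrots *"); // hypothetical edge case, not in the dataset
        cleanAll(messy).forEach(name -> System.out.println("[" + name + "]"));
    }
}
```

For the four names from the dataset both orderings give the same result; the difference only shows up on inputs like the last one.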