Cleaning Data in Java
Dennis Lee
Software Engineer
// From assessing quality
DescriptiveStatistics stats = new DescriptiveStatistics(values);
System.out.println("Mean: " + stats.getMean());

// To transforming strings
String clean = text.trim().toLowerCase();

// To validating ranges
boolean isValid = Range.between(0, 100).contains(value);

// To cleaning tables
Table cleaned = data.where(sales.isGreaterThan(1000));
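
// Check statistics
double mean = stats.getMean();

// Detect nulls
boolean isMissing = Optional.ofNullable(value).isEmpty();

// Verify types (LocalDate.parse returns a LocalDate and throws
// DateTimeParseException on invalid input, so wrap it in try/catch)
boolean isValidDate;
try {
    LocalDate.parse(dateStr, formatter);
    isValidDate = true;
} catch (DateTimeParseException e) {
    isValidDate = false;
}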
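
The quality checks above can be combined into a small helper class. This is a minimal sketch using only the JDK; the class name QualityChecks, its method names, and the choice of the ISO yyyy-MM-dd format are assumptions made for illustration and do not come from the slides.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical helper class, not part of the course material.
class QualityChecks {

    // Counts the null entries in a list of raw values.
    static long countMissing(List<String> values) {
        return values.stream().filter(Objects::isNull).count();
    }

    // Returns true when the text parses as an ISO date (yyyy-MM-dd).
    // The ISO format is an assumption; swap in another formatter as needed.
    static boolean isValidDate(String dateStr) {
        if (dateStr == null) {
            return false;
        }
        try {
            LocalDate.parse(dateStr, DateTimeFormatter.ISO_LOCAL_DATE);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Arrays.asList is used because List.of rejects null elements.
        List<String> dates = Arrays.asList("2024-01-15", null, "15/01/2024");
        System.out.println("Missing: " + countMissing(dates));
        for (String d : dates) {
            System.out.println(d + " valid? " + isValidDate(d));
        }
    }
}
```

Returning a boolean from a try/catch keeps the parsing exception out of the calling code, which is convenient when scanning a whole column for bad values.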

// Normalize strings (e.g., remove non-letters)
String cleaned = dirtyName.replaceAll("[^a-zA-Z\\s]", "");

// Standardize categories
String standard = categoryMap.getOrDefault(value, "Other");

// Convert dates
LocalDateTime datetime = LocalDateTime.parse(dateStr, formatter);
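
As a rough illustration of the string and category transformations above, the following sketch chains the same calls into two reusable methods. The class name TextCleaning, the contents of CATEGORY_MAP, and the extra whitespace-collapsing step are assumptions added for the example.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class, not part of the course material.
class TextCleaning {

    // Illustrative mapping from raw labels to standard categories (assumed data).
    static final Map<String, String> CATEGORY_MAP = Map.of(
            "elec", "Electronics",
            "electronics", "Electronics",
            "grocery", "Groceries");

    // Removes every character that is not a letter or whitespace,
    // then trims the ends and collapses runs of whitespace to one space.
    static String cleanName(String dirtyName) {
        return dirtyName
                .replaceAll("[^a-zA-Z\\s]", "")
                .trim()
                .replaceAll("\\s+", " ");
    }

    // Looks up the normalized label and falls back to "Other" when unknown.
    static String standardize(String value) {
        String key = value.trim().toLowerCase(Locale.ROOT);
        return CATEGORY_MAP.getOrDefault(key, "Other");
    }

    public static void main(String[] args) {
        System.out.println(cleanName("  Ann-Marie  O'Neil!! "));
        System.out.println(standardize(" ELEC "));
        System.out.println(standardize("toys"));
    }
}
```

Normalizing the key with trim() and toLowerCase() before the lookup means the map only needs one entry per spelling, not one per capitalization.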

// Check ranges
boolean inRange = Range.between(min, max).contains(value);

// Validate patterns
boolean matches = Pattern.matches(pattern, text);

// Enforce constraints
Set<ConstraintViolation<Data>> violations = validator.validate(data);
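
The three validation steps above rely on Apache Commons Lang and Bean Validation. The sketch below shows the same idea with plain JDK code instead, collecting violations as strings. The class name RecordValidator, the simplified email pattern, and the 0 to 100 score range are assumptions chosen for the example, not a replacement for a real constraint validator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the course material.
class RecordValidator {

    // Deliberately simple email pattern, for illustration only.
    private static final Pattern EMAIL =
            Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");

    // Inclusive range check, mirroring Range.between(min, max).contains(value).
    static boolean inRange(int value, int min, int max) {
        return value >= min && value <= max;
    }

    // Returns one message per failed rule; an empty list means the record is valid.
    static List<String> validate(String email, int score) {
        List<String> violations = new ArrayList<>();
        if (email == null || !EMAIL.matcher(email).matches()) {
            violations.add("email: invalid format");
        }
        if (!inRange(score, 0, 100)) {
            violations.add("score: must be between 0 and 100");
        }
        return violations;
    }

    public static void main(String[] args) {
        System.out.println(validate("ann@example.com", 85));
        System.out.println(validate("not-an-email", 120));
    }
}
```

Collecting every violation, not stopping at the first, matches how validator.validate(data) reports problems: the caller sees all failed rules for a record at once.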

for (String colName : data.columnNames()) {
    // Check missing values
    System.out.println(data.column(colName).countMissing());
}

StringColumn cleaned = names.map(String::toLowerCase) // Clean text
    .setName("Clean_Names"); // Name new column

Table summary = data
    .where(sales.isGreaterThan(1000)) // Filter rows
    .summarize("Sales", mean) // Aggregate data
    .by("Category"); // Group results
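
For comparison, the filter, aggregate, and group pipeline above can be approximated with Java streams when no table library is available. This is a sketch of an analogous technique, not the Tablesaw API; the SalesSummary and Sale classes and the sample figures in main are invented for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the course material.
class SalesSummary {

    // Minimal row type standing in for one table row (assumed shape).
    static final class Sale {
        private final String category;
        private final double amount;

        Sale(String category, double amount) {
            this.category = category;
            this.amount = amount;
        }

        String category() { return category; }
        double amount() { return amount; }
    }

    // Keeps rows above the threshold, groups them by category,
    // and averages the amounts within each group.
    static Map<String, Double> meanSalesByCategory(List<Sale> sales, double threshold) {
        return sales.stream()
                .filter(s -> s.amount() > threshold)            // Filter rows
                .collect(Collectors.groupingBy(
                        Sale::category,                         // Group results
                        TreeMap::new,                           // Sorted keys for stable output
                        Collectors.averagingDouble(Sale::amount))); // Aggregate data
    }

    public static void main(String[] args) {
        List<Sale> sales = List.of(
                new Sale("A", 1500), new Sale("A", 2500), new Sale("A", 500),
                new Sale("B", 3000), new Sale("B", 900));
        System.out.println(meanSalesByCategory(sales, 1000));
    }
}
```

Categories whose rows are all filtered out do not appear in the result, which is also what a filter-then-group pipeline on a table would produce.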