Congratulations!

Cleaning Data in Java

Dennis Lee

Software Engineer

Journey through data cleaning

// From assessing quality
DescriptiveStatistics stats = new DescriptiveStatistics(values);
System.out.println("Mean: " + stats.getMean());

// To transforming strings
String clean = text.trim().toLowerCase();

// To validating ranges
boolean isValid = Range.between(0, 100).contains(value);

// To cleaning tables
Table cleaned = data.where(sales.isGreaterThan(1000));

Assessing data quality

  • Bad data leads to bad decisions
  • Assess data quality: check for outliers, missing values, wrong types
// Check statistics
double mean = stats.getMean();

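// Detect nulls
boolean isMissing = Optional.ofNullable(value).isEmpty();

// Verify types: parse() throws DateTimeParseException on invalid input
boolean isValidDate;
try { LocalDate.parse(dateStr, formatter); isValidDate = true; }
catch (DateTimeParseException e) { isValidDate = false; }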
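
The missing-value and type checks on this slide can be sketched with the JDK alone. This is a minimal illustration, not the course's implementation: the class name QualityChecks, both method names, and the choice to treat blank strings as missing are assumptions made for the example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; names are illustrative only.
class QualityChecks {

    // Count entries that are null or blank (both treated here as "missing").
    static long countMissing(List<String> values) {
        return values.stream()
                .filter(v -> v == null || v.isBlank())
                .count();
    }

    // Return true only if the text parses as a date in the given pattern.
    static boolean isValidDate(String text, String pattern) {
        if (text == null) {
            return false;
        }
        try {
            LocalDate.parse(text, DateTimeFormatter.ofPattern(pattern));
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> dates = Arrays.asList("2024-01-15", null, " ", "15/01/2024");
        System.out.println("Missing: " + countMissing(dates));
        for (String d : dates) {
            System.out.println(d + " valid? " + isValidDate(d, "yyyy-MM-dd"));
        }
    }
}
```

Catching the parse exception turns a failed conversion into a boolean, which is what a quality report usually needs.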

Transforming data consistently

  • Inconsistent formats prevent accurate analysis
  • Clean strings; standardize categories and dates
// Normalize strings (e.g., remove non-letters)
String cleaned = dirtyName.replaceAll("[^a-zA-Z\\s]", "");

// Standardize categories
String standard = categoryMap.getOrDefault(value, "Other");

// Convert dates
LocalDateTime datetime = LocalDateTime.parse(dateStr, formatter);
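
The three transformations on this slide might be combined as below. This is a sketch under stated assumptions: the class Transformations, the contents of CATEGORY_MAP, and the timestamp pattern "yyyy-MM-dd HH:mm:ss" are all invented for illustration and would come from the actual data in practice.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class; names and lookup values are illustrative only.
class Transformations {

    // Assumed lookup table mapping raw spellings to standard categories.
    static final Map<String, String> CATEGORY_MAP = Map.of(
            "elec", "Electronics",
            "electronics", "Electronics",
            "home & garden", "Home",
            "home", "Home");

    // Assumed input format for timestamps.
    static final DateTimeFormatter FORMATTER =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Remove non-letters, collapse runs of whitespace, and trim the ends.
    static String cleanName(String dirtyName) {
        return dirtyName.replaceAll("[^a-zA-Z\\s]", "")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Normalize case and spacing before the lookup; unknown values become "Other".
    static String standardizeCategory(String raw) {
        String key = raw == null ? "" : raw.trim().toLowerCase(Locale.ROOT);
        return CATEGORY_MAP.getOrDefault(key, "Other");
    }

    // Parse a timestamp in the assumed format; throws DateTimeParseException if it does not match.
    static LocalDateTime parseTimestamp(String text) {
        return LocalDateTime.parse(text, FORMATTER);
    }

    public static void main(String[] args) {
        System.out.println(cleanName("  Jane   Doe!! 42 "));
        System.out.println(standardizeCategory(" ELEC "));
        System.out.println(parseTimestamp("2024-03-01 14:30:00"));
    }
}
```

Normalizing the key before calling getOrDefault keeps the lookup table small, since it only has to list lowercase, trimmed spellings.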

Validating data integrity

  • Stop bad data before it enters our system
  • Check for valid ranges and formats against business rules
// Check ranges
boolean inRange = Range.between(min, max).contains(value);

// Validate patterns
boolean matches = Pattern.matches(pattern, text);

// Enforce constraints
Set<ConstraintViolation<Data>> violations = validator.validate(data);
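
A range check and a pattern check can be gathered into one validation step that reports every broken rule, loosely mirroring what a constraint validator returns. The sketch below uses only the JDK rather than Commons Lang or Bean Validation; the class RecordValidator, the age limits of 0 to 120, and the deliberately simple email pattern are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical validator; the business rules here are illustrative only.
class RecordValidator {

    // Simple shape check for an email address; not a full RFC 5322 validator.
    private static final Pattern EMAIL =
            Pattern.compile("^[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+$");

    // Inclusive range check.
    static boolean inRange(int value, int min, int max) {
        return value >= min && value <= max;
    }

    // Collect a message for each rule the record breaks; an empty list means valid.
    static List<String> validate(String email, int age) {
        List<String> violations = new ArrayList<>();
        if (email == null || !EMAIL.matcher(email).matches()) {
            violations.add("email: invalid format");
        }
        if (!inRange(age, 0, 120)) {
            violations.add("age: must be between 0 and 120");
        }
        return violations;
    }

    public static void main(String[] args) {
        System.out.println(validate("ana@example.com", 34));
        System.out.println(validate("not-an-email", 150));
    }
}
```

Returning all violations at once, instead of failing on the first, lets a caller report every problem with a record in a single pass.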

Cleaning tabular data

for (String colName : data.columnNames()) {
    // Check missing values 
    System.out.println(data.column(colName).countMissing());
}

StringColumn cleaned = names.map(String::toLowerCase) // Clean text
    .setName("Clean_Names"); // Name new column

Table summary = data
    .where(sales.isGreaterThan(1000)) // Filter rows
    .summarize("Sales", mean) // Aggregate data
    .by("Category"); // Group results
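
For readers without a table library on the classpath, the same filter, group, and aggregate steps can be approximated with JDK streams. This is a swapped-in technique, not the table API used on the slide; the Row class, its fields, and the method name meanSalesByCategory are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical stream-based analogue of the filter / summarize / group pipeline.
class SalesSummary {

    // Minimal stand-in for one table row.
    static final class Row {
        final String category;
        final double sales;

        Row(String category, double sales) {
            this.category = category;
            this.sales = sales;
        }
    }

    // Keep rows with sales above minSales, then average sales per category.
    static Map<String, Double> meanSalesByCategory(List<Row> rows, double minSales) {
        return rows.stream()
                .filter(r -> r.sales > minSales)                      // Filter rows
                .collect(Collectors.groupingBy(
                        r -> r.category,                              // Group results
                        TreeMap::new,                                 // Sorted keys for stable output
                        Collectors.averagingDouble(r -> r.sales)));   // Aggregate data
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row("A", 1500), new Row("A", 2500),
                new Row("B", 900), new Row("B", 1200));
        System.out.println(meanSalesByCategory(rows, 1000));
    }
}
```

Filtering before grouping means rows at or below the threshold never contribute to a category's mean.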

Resources

Ready to clean
