Cleaning Data in Java
Dennis Lee
Software Engineer
Salesperson | Country | Product | Date | Amount | Boxes Shipped |
---|---|---|---|---|---|
James Rudeforth | UK | Mint Chip Choco | 4-Jan-22 | $5,320 | 180 |
Van Tuxwell | India | 85% Dark Bars | 1-Aug-22 | $7,896 | 94 |
Gigi Bohling | India | Peanut Butter Cubes | 7-Jul-22 | $4,501 | 91 |
Jan Morforth | Australia | Peanut Butter Cubes | 27-Apr-22 | $12,726 | 342 |
Jehu Fischer | UK | Peanut Butter Cubes | 24-Feb-22 | $13,685 | 184 |
// We extracted these sales amounts and dates from our dataset
List<Double> salesAmounts = Arrays.asList(5320.0, 7896.0, 4501.0,
12726.0, 13685.0);
List<String> saleDates = Arrays.asList("4-Jan-22", "1-Aug-22", "7-Jul-22",
"27-Apr-22", "24-Feb-22");
System.out.println("Sales amounts: " + salesAmounts);
System.out.println("Sale dates: " + saleDates);
Sales amounts: [5320.0, 7896.0, 4501.0, 12726.0, 13685.0]
Sale dates: [4-Jan-22, 1-Aug-22, 7-Jul-22, 27-Apr-22, 24-Feb-22]
// Get the minimum sales amount
Double minSale = Collections.min(salesAmounts);
// Get the maximum sales amount
Double maxSale = Collections.max(salesAmounts);
System.out.println("Range: $" + minSale + " - $" + maxSale);
Range: $4501.0 - $13685.0
// Sales amount should not be less than 0.0
Double lowerThreshold = 0.0;
// Sales amount should not be greater than 15000.0
Double upperThreshold = 15000.0;
for (Double amount : salesAmounts) {
    // Check that each amount falls within range
    if (amount >= lowerThreshold && amount <= upperThreshold) {
        System.out.println(amount + " is within range");
    }
}
5320.0 is within range
7896.0 is within range
4501.0 is within range
12726.0 is within range
13685.0 is within range
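The loop above only reports amounts that pass the check. When cleaning data, it is often more useful to collect the values that fail. The following is a minimal sketch of that idea, not part of the original slides: the class name `OutOfRangeSketch`, the helper `findOutOfRange`, and the two bad records (-250.0 and 99999.0) are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OutOfRangeSketch {
    // Hypothetical helper: returns every amount that is NOT inside [lower, upper].
    // Null and NaN entries are flagged too, because they fail the inclusive check.
    static List<Double> findOutOfRange(List<Double> amounts, double lower, double upper) {
        List<Double> flagged = new ArrayList<>();
        for (Double amount : amounts) {
            if (amount == null || !(amount >= lower && amount <= upper)) {
                flagged.add(amount);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // -250.0 and 99999.0 are made-up bad records, not values from the dataset
        List<Double> amounts = Arrays.asList(5320.0, -250.0, 4501.0, 99999.0);
        System.out.println("Out of range: " + findOutOfRange(amounts, 0.0, 15000.0));
    }
}
```

Returning the flagged values as a list keeps the decision about what to do with them (drop, correct, or report) separate from the check itself.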
import org.apache.commons.lang3.Range;
// Low sales category: $0 - $5000
Range<Double> lowSales = Range.between(0.0, 5000.0);
// Medium sales category: $5000 - $10000
Range<Double> mediumSales = Range.between(5000.0, 10000.0);
// High sales category: $10000 - $15000
Range<Double> highSales = Range.between(10000.0, 15000.0);
for (Double amount : salesAmounts) {
    if (lowSales.contains(amount)) { // Is amount in low sales range?
        System.out.println("$" + amount + " - Low sales");
    } else if (mediumSales.contains(amount)) { // Is amount in medium sales range?
        System.out.println("$" + amount + " - Medium sales");
    } else if (highSales.contains(amount)) { // Is amount in high sales range?
        System.out.println("$" + amount + " - High sales");
    } else { // Is amount outside of expected range?
        System.out.println("$" + amount + " - Out of expected range");
    }
}
Sales categorization:
$5320.0 - Medium sales
$7896.0 - Medium sales
$4501.0 - Low sales
$12726.0 - High sales
$13685.0 - High sales
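One detail worth noting: `Range.between` is inclusive at both ends, so 5000.0 belongs to both `lowSales` and `mediumSales`, and the order of the `if`/`else if` chain decides the category. If boundary values should belong to exactly one band, half-open intervals are one option. The sketch below uses plain Java comparisons rather than the `Range` class; the class name `SalesCategorySketch`, the method `categorize`, and the choice to put each boundary in the higher band are assumptions made for illustration.

```java
class SalesCategorySketch {
    // Hypothetical bands: [0, 5000) low, [5000, 10000) medium, [10000, 15000] high.
    // Each boundary value belongs to exactly one band, so branch order does not matter.
    static String categorize(double amount) {
        if (amount >= 0.0 && amount < 5000.0) {
            return "Low sales";
        }
        if (amount >= 5000.0 && amount < 10000.0) {
            return "Medium sales";
        }
        if (amount >= 10000.0 && amount <= 15000.0) {
            return "High sales";
        }
        return "Out of expected range";
    }

    public static void main(String[] args) {
        // Boundary and out-of-range inputs chosen for illustration
        double[] amounts = {4501.0, 5000.0, 10000.0, 15000.5};
        for (double amount : amounts) {
            System.out.println("$" + amount + " - " + categorize(amount));
        }
    }
}
```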
// d-MMM-yy: day of month (d), abbreviated month name (MMM), two-digit year (yy)
// Locale.ENGLISH keeps month names such as "Jan" parseable regardless of the default locale
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("d-MMM-yy", Locale.ENGLISH);
LocalDate startDate = LocalDate.of(2022, 1, 1); // Start of 2022
LocalDate endDate = LocalDate.of(2022, 12, 31); // End of 2022
for (String dateStr : saleDates) {
    LocalDate saleDate = LocalDate.parse(dateStr, formatter);
    // Valid dates fall strictly after startDate and strictly before endDate
    boolean isValid = saleDate.isAfter(startDate) && saleDate.isBefore(endDate);
    System.out.println(dateStr + " is valid: " + isValid);
}
4-Jan-22 is valid: true
1-Aug-22 is valid: true
7-Jul-22 is valid: true
27-Apr-22 is valid: true
24-Feb-22 is valid: true
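Two edge cases are not covered by the check above: `isAfter` and `isBefore` are strict, so 1-Jan-22 and 31-Dec-22 themselves would be reported as invalid, and a malformed string makes `LocalDate.parse` throw a `DateTimeParseException`. The following sketch shows one way to handle both, assuming the boundary dates should count as valid and unparseable strings should simply be reported as invalid. The class name `DateCheckSketch` and the method `isValidSaleDate` are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

class DateCheckSketch {
    // Same pattern as the slides; Locale.ENGLISH is an added assumption so that
    // month abbreviations like "Jan" parse regardless of the JVM default locale
    private static final DateTimeFormatter FORMATTER =
            DateTimeFormatter.ofPattern("d-MMM-yy", Locale.ENGLISH);

    // Hypothetical helper: true when dateStr parses and lies in [start, end], inclusive
    static boolean isValidSaleDate(String dateStr, LocalDate start, LocalDate end) {
        try {
            LocalDate saleDate = LocalDate.parse(dateStr, FORMATTER);
            // "not before start" and "not after end" make both bounds inclusive
            return !saleDate.isBefore(start) && !saleDate.isAfter(end);
        } catch (DateTimeParseException e) {
            // Unparseable input is treated as invalid instead of crashing the loop
            return false;
        }
    }

    public static void main(String[] args) {
        LocalDate start = LocalDate.of(2022, 1, 1);
        LocalDate end = LocalDate.of(2022, 12, 31);
        // "15-Mar-23" and "not-a-date" are made-up inputs for illustration
        String[] samples = {"1-Jan-22", "31-Dec-22", "15-Mar-23", "not-a-date"};
        for (String sample : samples) {
            System.out.println(sample + " is valid: " + isValidSaleDate(sample, start, end));
        }
    }
}
```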
org.apache.commons.lang3.Range
Collections.min(salesAmounts), Collections.max(salesAmounts)
Range.between(0.0, 5000.0)
lowSales.contains(amount)
saleDate.isAfter(startDate), saleDate.isBefore(endDate)