Data statistics

Cleaning Data in Java

Dennis Lee

Software Engineer

Meet your instructor!

 

Dennis Lee

 

  • Software engineer at Amazon
  • Technology and operations
  • Ph.D. in Electrical and Computer Engineering

 


Why clean data matters

 

Title       Publish Date  Rating  Review Count  Price
[null]      1/10/23       4.8     165           $30.61
            9/13/19       4.8     2,521         $38.00
Clean Code  8/1/08        4.7     5,639         $400.00

 

  • Dirty data: lost sales, customer confusion, unreliable forecasts
  • Solution: clean the data

Course outline

  1. Assess data quality
  2. Transform data
  3. Validate data
  4. Clean tabular data

 


Structuring our data

import java.time.LocalDate;

public class BookSalesExample {
    private record BookSales(String title, LocalDate publishDate,
                             int reviewCount, double rating, double price) {}
}

book.rating() // book is an instance of BookSales

Title                     Publish Date  Rating  Review Count  Price
Python Crash Course       1/10/23       4.8     165           $30.61
The Pragmatic Programmer  9/13/19       4.8     2,521         $38.00
Clean Code                8/1/08        4.7     5,639         $40.00

Populating the dataset

import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;
// Create books with title, publishDate, reviewCount, rating, and price
List<BookSales> books = Arrays.asList(
        new BookSales("Python Crash Course", LocalDate.of(2023, 1, 10),
                      165, 4.8, 30.61),
        new BookSales("The Pragmatic Programmer", LocalDate.of(2019, 9, 13),
                      2521, 4.8, 38.00),
        new BookSales("Clean Code", LocalDate.of(2008, 8, 1),
                      5639, 4.7, 40.00));

Calculating mean, min, max

import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;

public class BookSalesExample {
    public static void main(String[] args) {
        // Create stats calculator
        DescriptiveStatistics stats = new DescriptiveStatistics();
        // Add each book's price to stats; books is the List<BookSales> populated earlier.
        // Records expose the accessor price(), not getPrice().
        books.forEach(book -> stats.addValue(book.price()));
    }
}

Example output: price range

System.out.printf("Books analyzed: %d%n", books.size());
Books analyzed: 3
System.out.printf("Average price: $%.2f%n", stats.getMean());
Average price: $36.20
System.out.printf("Price range: $%.2f - $%.2f%n", stats.getMin(), stats.getMax());
Price range: $30.61 - $40.00
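
For comparison, the same count, mean, min, and max can be computed with the JDK alone. The sketch below swaps in java.util.DoubleSummaryStatistics for the Commons Math DescriptiveStatistics used on the slides. The class name PriceSummaryExample and the helpers summarize and sampleBooks are illustrative names, not part of the course code.

```java
import java.time.LocalDate;
import java.util.DoubleSummaryStatistics;
import java.util.List;

class PriceSummaryExample {
    // Same shape as the BookSales record shown on the slides
    record BookSales(String title, LocalDate publishDate,
                     int reviewCount, double rating, double price) {}

    // Hypothetical helper: the three books from the sample table
    static List<BookSales> sampleBooks() {
        return List.of(
            new BookSales("Python Crash Course", LocalDate.of(2023, 1, 10),
                          165, 4.8, 30.61),
            new BookSales("The Pragmatic Programmer", LocalDate.of(2019, 9, 13),
                          2521, 4.8, 38.00),
            new BookSales("Clean Code", LocalDate.of(2008, 8, 1),
                          5639, 4.7, 40.00));
    }

    // Collect count, min, max, and mean of the prices in a single pass
    static DoubleSummaryStatistics summarize(List<BookSales> books) {
        return books.stream()
                    .mapToDouble(BookSales::price)
                    .summaryStatistics();
    }

    public static void main(String[] args) {
        DoubleSummaryStatistics stats = summarize(sampleBooks());
        System.out.printf("Books analyzed: %d%n", stats.getCount());   // Books analyzed: 3
        System.out.printf("Average price: $%.2f%n", stats.getAverage()); // Average price: $36.20
        System.out.printf("Price range: $%.2f - $%.2f%n",
                          stats.getMin(), stats.getMax());               // Price range: $30.61 - $40.00
    }
}
```

DoubleSummaryStatistics does not provide percentiles, so it only covers the mean, min, and max part of the workflow.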

Calculating percentiles

public class BookSalesExample {
    public static void main(String[] args) {
        // 50th percentile = median
        System.out.printf("Median price: $%.2f%n", stats.getPercentile(50));

        System.out.printf("Normal range: $%.2f - $%.2f%n",
                          stats.getPercentile(25), stats.getPercentile(75));
    }
}
Median price: $38.00

Normal range: $30.61 - $40.00
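
With only three prices, the 25th and 75th percentiles land on the minimum and maximum. The sketch below shows one way to arrive at those numbers by hand. It assumes the position rule p(n + 1)/100 with linear interpolation between neighboring sorted values, which is how the Commons Math documentation describes its default percentile estimate. Treat it as an illustration of the idea rather than the library's actual implementation. PercentileSketch and percentile are hypothetical names.

```java
import java.util.Arrays;

class PercentileSketch {
    // Estimate the p-th percentile (0 < p <= 100) of the given values.
    // Assumed rule: position = p * (n + 1) / 100 on the sorted data,
    // interpolating linearly between the two neighboring values.
    static double percentile(double[] values, double p) {
        if (values.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need data and 0 < p <= 100");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;

        double pos = p * (n + 1) / 100.0;          // 1-based position
        if (pos < 1) return sorted[0];             // below the first value: minimum
        if (pos >= n) return sorted[n - 1];        // at or past the last value: maximum

        int rank = (int) Math.floor(pos);          // 1-based rank of the lower neighbor
        double fraction = pos - rank;              // how far toward the upper neighbor
        double lower = sorted[rank - 1];
        double upper = sorted[rank];
        return lower + fraction * (upper - lower);
    }

    public static void main(String[] args) {
        double[] prices = {30.61, 38.00, 40.00};
        System.out.printf("Median price: $%.2f%n", percentile(prices, 50)); // Median price: $38.00
        System.out.printf("Normal range: $%.2f - $%.2f%n",
                          percentile(prices, 25), percentile(prices, 75));  // Normal range: $30.61 - $40.00
    }
}
```

For n = 3, the positions for p = 25, 50, and 75 are 1, 2, and 3, so each percentile falls exactly on a data point and no interpolation takes place.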

Statistics as quality control

  • DescriptiveStatistics methods
    • .getMean(): average price
    • .getMin(), .getMax(): price range
    • .getPercentile(): typical prices
  • Identify issues by inspecting price ranges (e.g., a $400.00 price among $30 to $40 books)
  • Benefits
    • Quick issue detection
    • Scales to large datasets
    • Replaces manual inspection
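
As a rough illustration of using price ranges as a quality check, the sketch below flags prices that are non-positive or far above the median. The 5x multiplier is a hypothetical threshold chosen for this example, not a rule from the course, and PriceRangeCheck, median, and suspiciousPrices are illustrative names. The input prices come from the dirty table near the start of the lesson.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PriceRangeCheck {
    // Hypothetical threshold: prices above 5x the median are treated as suspicious
    static final double MAX_MEDIAN_MULTIPLE = 5.0;

    // Median of a non-empty array (average of the two middle values for even sizes)
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Return prices that are non-positive or exceed the median-based limit
    static List<Double> suspiciousPrices(double[] prices) {
        List<Double> flagged = new ArrayList<>();
        if (prices.length == 0) {
            return flagged;                        // nothing to check
        }
        double limit = median(prices) * MAX_MEDIAN_MULTIPLE;
        for (double price : prices) {
            if (price <= 0 || price > limit) {
                flagged.add(price);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        double[] dirty = {30.61, 38.00, 400.00};   // prices from the dirty table
        System.out.printf("Max price: $%.2f%n",
                          Arrays.stream(dirty).max().getAsDouble()); // Max price: $400.00
        System.out.println("Suspicious prices: " + suspiciousPrices(dirty)); // Suspicious prices: [400.0]
    }
}
```

A median-based limit is used here because, with so few rows, a single extreme value can pull the mean and the upper percentiles toward itself.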


Let's practice!
