Data quality assessment with Tablesaw

Cleaning Data in Java

Dennis Lee

Software Engineer

Book sales dataset

Books	Authors	Language	First_Published	Sales_in_Millions	Average_Price
A Tale of Two Cities	Charles Dickens	English	1859	200.0	12.99
The Little Prince (Le Petit Prince)	Antoine de Saint-Exupéry	French	1943	200.0	15.50
Harry Potter and the Philosopher's Stone	J. K. Rowling	English	1997	120.0	19.99
And Then There Were None	Agatha Christie	English	1939	100.0	14.95

Examining the data

import tech.tablesaw.api.Table;
import tech.tablesaw.api.DoubleColumn;
import tech.tablesaw.api.StringColumn;

// Load CSV file into Tablesaw table
Table books = Table.read().csv("bestsellers.csv");

// Get basic table information: row count, column count, column names
System.out.println("Row count: " + books.rowCount());
System.out.println("Column count: " + books.columnCount());
System.out.println("Column names: " + books.columnNames());

Examining the data: outputs

Row count: 290
Column count: 6
Column names: [Books, Authors, Language, First_Published, Sales_in_millions]

Checking for missing values

for (String columnName : books.columnNames()) {

    // Get column by name and count missing values (nulls) in that column
    int missing = books.column(columnName).countMissing();

    System.out.println(columnName + " missing values: " + missing);
}

Books missing values: 0
Authors missing values: 0
Language missing values: 0
First_Published missing values: 0
Sales_in_millions missing values: 2
Average_Price missing values: 0

Examining categorical columns

// countBy() creates new table with unique values and their frequencies
Table languageCounts = books.countBy("Language");

System.out.println("Language distribution:\n");

// Prints formatted table showing each language and its count
System.out.println(languageCounts);

Examining categorical columns: outputs

Language distribution:

| Language   | Count |
|------------|-------|
| English    | 210   |
| French     | 10    |
| Chinese    | 6     |
| Portuguese | 1     |
| Spanish    | 3     |
| German     | 6     |
| Italian    | 5     |

Analyzing numeric columns

// Get numeric column as DoubleColumn type for statistical operations
DoubleColumn sales = books.doubleColumn("Sales_in_millions");


System.out.println("Sales Statistics (millions of copies):\n");
// Smallest value in column
System.out.println("Min sales: " + sales.min() + " million");
// Largest value in column
System.out.println("Max sales: " + sales.max() + " million");

// Average of all values
System.out.println("Mean sales: " + sales.mean() + " million");
// Measure of spread (standard deviation)
System.out.println("Std. deviation: " + sales.standardDeviation() + " million");

Analyzing numeric columns: outputs

Sales Statistics (millions of copies):

Min sales: 10.0 million
Max sales: 600.0 million
Mean sales: 49.996875 million
Std. deviation: 64.6846320839116 million

Putting it all together

System.out.printf("Rows: %d, Columns: %s", books.rowCount(), books.columnCount());

for (String colName : books.columnNames()) // For each column
    System.out.println(books.column(colName).countMissing()); // Count null values

StringColumn language = books.stringColumn("Language"); // Get text column
Table langCounts = books.countBy("Language"); // Count categories

// Calculate numeric statistics
DoubleColumn sales = books.doubleColumn("Sales_in_Millions"); // Get number column
System.out.printf("Mean: %.1f million, Min: %.1f million, Max: %.1f million",
                  sales.mean(), sales.min(), sales.max());

Putting it all together: outputs

Rows: 290, Columns: 6

Books missing values: 0
Authors missing values: 0

| Language   | Count |
|------------|-------|
| English    | 210   |
| French     | 10    |

Mean: 50.0 million, Min: 10.0 million, Max: 600.0 million

Let's practice!

Cleaning Data in Java