Data quality assessment with Tablesaw

Cleaning Data in Java

Dennis Lee

Software Engineer

Book sales dataset

Books Authors Language First_Published Sales_in_Millions Average_Price
A Tale of Two Cities Charles Dickens English 1859 200.0 12.99
The Little Prince (Le Petit Prince) Antoine de Saint-Exupéry French 1943 200.0 15.50
Harry Potter and the Philosopher's Stone J. K. Rowling English 1997 120.0 19.99
And Then There Were None Agatha Christie English 1939 100.0 14.95
Cleaning Data in Java

Examining the data

import tech.tablesaw.api.Table;
import tech.tablesaw.api.DoubleColumn;
import tech.tablesaw.api.StringColumn;
// Load CSV file into Tablesaw table
Table books = Table.read().csv("bestsellers.csv");

// Get basic table information: row count, column count, column names
System.out.println("Row count: " + books.rowCount());
System.out.println("Column count: " + books.columnCount());
System.out.println("Column names: " + books.columnNames());
Cleaning Data in Java

Examining the data: outputs

Row count: 290
Column count: 6
Column names: [Books, Authors, Language, First_Published, Sales_in_millions]
Cleaning Data in Java

Checking for missing values

for (String columnName : books.columnNames()) {

// Get column by name and count missing values (nulls) in that column int missing = books.column(columnName).countMissing();
System.out.println(columnName + " missing values: " + missing); }
Books missing values: 0
Authors missing values: 0
Language missing values: 0
First_Published missing values: 0
Sales_in_millions missing values: 2
Average_Price missing values: 0
Cleaning Data in Java

Examining categorical columns

// countBy() creates new table with unique values and their frequencies
Table languageCounts = books.countBy("Language");

System.out.println("Language distribution:\n");

// Prints formatted table showing each language and its count
System.out.println(languageCounts);
Cleaning Data in Java

Examining categorical columns: outputs

Language distribution:

| Language   | Count |
|------------|-------|
| English    | 210   |
| French     | 10    |
| Chinese    | 6     |
| Portuguese | 1     |
| Spanish    | 3     |
| German     | 6     |
| Italian    | 5     |
Cleaning Data in Java

Analyzing numeric columns

// Get numeric column as DoubleColumn type for statistical operations
DoubleColumn sales = books.doubleColumn("Sales_in_millions");

System.out.println("Sales Statistics (millions of copies):\n"); // Smallest value in column System.out.println("Min sales: " + sales.min() + " million"); // Largest value in column System.out.println("Max sales: " + sales.max() + " million");
// Average of all values System.out.println("Mean sales: " + sales.mean() + " million"); // Measure of spread (standard deviation) System.out.println("Std. deviation: " + sales.standardDeviation() + " million");
Cleaning Data in Java

Analyzing numeric columns: outputs

Sales Statistics (millions of copies):

Min sales: 10.0 million
Max sales: 600.0 million
Mean sales: 49.996875 million
Std. deviation: 64.6846320839116 million
Cleaning Data in Java

Putting it all together

System.out.printf("Rows: %d, Columns: %s", books.rowCount(), books.columnCount());
for (String colName : books.columnNames()) // For each column
    System.out.println(books.column(colName).countMissing()); // Count null values
StringColumn language = books.stringColumn("Language"); // Get text column
Table langCounts = books.countBy("Language"); // Count categories
// Calculate numeric statistics
DoubleColumn sales = books.doubleColumn("Sales_in_Millions"); // Get number column
System.out.printf("Mean: %.1f million, Min: %.1f million, Max: %.1f million",
                  sales.mean(), sales.min(), sales.max());
Cleaning Data in Java

Putting it all together: outputs

Rows: 290, Columns: 6
Books missing values: 0
Authors missing values: 0
| Language   | Count |
|------------|-------|
| English    | 210   |
| French     | 10    |
Mean: 50.0 million, Min: 10.0 million, Max: 600.0 million
Cleaning Data in Java

Let's practice!

Cleaning Data in Java

Preparing Video For Download...