Cleaning Data in Java
Dennis Lee
Software Engineer
Books | Authors | Language | First_Published | Sales_in_Millions | Average_Price |
---|---|---|---|---|---|
A Tale of Two Cities | Charles Dickens | English | 1859 | 200.0 | 12.99 |
The Little Prince (Le Petit Prince) | Antoine de Saint-Exupéry | French | 1943 | 200.0 | 15.50 |
Harry Potter and the Philosopher's Stone | J. K. Rowling | English | 1997 | 120.0 | 19.99 |
And Then There Were None | Agatha Christie | English | 1939 | 100.0 | 14.95 |
import tech.tablesaw.api.Table;
import tech.tablesaw.api.DoubleColumn;
import tech.tablesaw.api.StringColumn;
// Load CSV file into Tablesaw table
Table books = Table.read().csv("bestsellers.csv");
// Get basic table information: row count, column count, column names
System.out.println("Row count: " + books.rowCount());
System.out.println("Column count: " + books.columnCount());
System.out.println("Column names: " + books.columnNames());
Row count: 290
Column count: 6
Column names: [Books, Authors, Language, First_Published, Sales_in_millions]
for (String columnName : books.columnNames()) {
// Get column by name and count missing values (nulls) in that column int missing = books.column(columnName).countMissing();
System.out.println(columnName + " missing values: " + missing); }
Books missing values: 0
Authors missing values: 0
Language missing values: 0
First_Published missing values: 0
Sales_in_millions missing values: 2
Average_Price missing values: 0
// countBy() creates new table with unique values and their frequencies
Table languageCounts = books.countBy("Language");
System.out.println("Language distribution:\n");
// Prints formatted table showing each language and its count
System.out.println(languageCounts);
Language distribution:
| Language | Count |
|------------|-------|
| English | 210 |
| French | 10 |
| Chinese | 6 |
| Portuguese | 1 |
| Spanish | 3 |
| German | 6 |
| Italian | 5 |
// Get numeric column as DoubleColumn type for statistical operations DoubleColumn sales = books.doubleColumn("Sales_in_millions");
System.out.println("Sales Statistics (millions of copies):\n"); // Smallest value in column System.out.println("Min sales: " + sales.min() + " million"); // Largest value in column System.out.println("Max sales: " + sales.max() + " million");
// Average of all values System.out.println("Mean sales: " + sales.mean() + " million"); // Measure of spread (standard deviation) System.out.println("Std. deviation: " + sales.standardDeviation() + " million");
Sales Statistics (millions of copies):
Min sales: 10.0 million
Max sales: 600.0 million
Mean sales: 49.996875 million
Std. deviation: 64.6846320839116 million
System.out.printf("Rows: %d, Columns: %s", books.rowCount(), books.columnCount());
for (String colName : books.columnNames()) // For each column
System.out.println(books.column(colName).countMissing()); // Count null values
StringColumn language = books.stringColumn("Language"); // Get text column
Table langCounts = books.countBy("Language"); // Count categories
// Calculate numeric statistics
DoubleColumn sales = books.doubleColumn("Sales_in_Millions"); // Get number column
System.out.printf("Mean: %.1f million, Min: %.1f million, Max: %.1f million",
sales.mean(), sales.min(), sales.max());
Rows: 290, Columns: 6
Books missing values: 0
Authors missing values: 0
| Language | Count |
|------------|-------|
| English | 210 |
| French | 10 |
Mean: 50.0 million, Min: 10.0 million, Max: 600.0 million
Cleaning Data in Java