Cleaning Data in Java
Dennis Lee
Software Engineer
Books | Authors | Language | First_Published | Sales_in_Millions | Average_Price |
---|---|---|---|---|---|
A Tale of Two Cities | Charles Dickens | English | 1859 | 200.0 | 12.99 |
The Little Prince (Le Petit Prince) | Antoine de Saint-Exupéry | French | 1943 | 200.0 | 15.50 |
Harry Potter and the Philosopher's Stone | J. K. Rowling | English | 1997 | 120.0 | 19.99 |
And Then There Were None | Agatha Christie | English | 1939 | 100.0 | 14.95 |
import tech.tablesaw.api.Table;
import tech.tablesaw.selection.Selection;
Table books = Table.read().csv("bestsellers.csv"); // Condition for books with sales over 70M Selection highSales = books.intColumn("Sales_in_Millions").isGreaterThan(70);
// Condition for books written in English Selection english = books.stringColumn("Language").isEqualTo("English");
// Condition for books published after 1950 Selection recentlyPublished = books.intColumn("First_Published").isGreaterThan(1950);
// Similar: .isLessThan(), .isLessThanOrEqualTo(), .isGreaterThanOrEqualTo()
// Filter using a single condition directly books.where(books.intColumn("Sales_in_Millions").isGreaterThan(70));
// Filter using a pre-defined Selection books.where(highSales);
// Combine multiple conditions with .and() Table popular = books.where(highSales) // Sales > 70M .and(english) // English language .and(recentlyPublished); // After 1950 .sortDescendingOn("Sales_in_Millions"); // Sort by Sales_in_Millions System.out.println("Modern high-revenue English books:\n"); System.out.println(modern.select("Books", "First_Published", "Total_Revenue"));
Modern high-revenue English books:
| Book | Sales_in_Millions | First_Published |
|------------------------------------------|-------------------|-----------------|
| Harry Potter and the Philosopher's Stone | 120 | 1997 |
| The Da Vinci Code | 80 | 2003 |
| Harry Potter and the Chamber of Secrets | 77 | 1998 |
// Import mean calculation function
import static tech.tablesaw.aggregate.AggregateFunctions.mean;
Table overallMean = books.summarize("Sales_in_Millions", mean).apply();
System.out.println("\nOverall mean sales: \n" + overallMean);
Overall mean sales:
| Mean [Sales_in_Millions] |
|---------------------------|
| 73.03225806451614 |
// Calculate average sales for each language group (e.g., English, French, etc.) Table salesByLanguage = books.summarize("Sales_in_Millions", mean) // Calculate means
.by("Language"); // Split by language // Show first 5 language groups System.out.println("Average sales by language:\n"); System.out.println(salesByLanguage.first(5));
Average sales by language:
| Language | Mean [Sales_in_Millions] |
|------------|--------------------------|
| English | 69.96 |
| French | 200.0 |
| Chinese | 100.0 |
| Portuguese | 65.0 |
| Spanish | 50.0 |
import static tech.tablesaw.aggregate.AggregateFunctions.sum;
// Filter and summarize sales by language Table bestsellersByLanguage = books // Select books after 1950 .where(books.intColumn("First_Published").isGreaterThan(1950))
// Calculate average sales .summarize("Sales_in_Millions", sum)
// Group results by language .by("Language"); System.out.println("Modern bestsellers (post-1950) by language:\n"); System.out.println(bestsellersByLanguage);
Modern bestsellers (post-1950) by language:
| Language | Sum [Sales_in_Millions] |
|------------|-------------------------|
| English | 1071 |
| Portuguese | 65 |
| Spanish | 50 |
| Italian | 50 |
books.where(sales.isGreaterThan(150)); // Basic filtering with a single condition
books.where(highSales.and(english)); // Multiple conditions - combine with .and()
// Basic aggregation
books.summarize("Sales_in_Millions", mean) // Calculate means
.by("Language"); // Group by language
// Combined operations
books.where(books.intColumn("First_Published").isGreaterThan(1950)) // Filter
.summarize("Sales_in_Millions", mean) // Then summarize
.by("Language"); // Finally group
Cleaning Data in Java