Column filtering and aggregation

Cleaning Data in Java

Dennis Lee

Software Engineer

Book sales dataset

Books Authors Language First_Published Sales_in_Millions Average_Price
A Tale of Two Cities Charles Dickens English 1859 200.0 12.99
The Little Prince (Le Petit Prince) Antoine de Saint-Exupéry French 1943 200.0 15.50
Harry Potter and the Philosopher's Stone J. K. Rowling English 1997 120.0 19.99
And Then There Were None Agatha Christie English 1939 100.0 14.95
Cleaning Data in Java

Defining conditions for filtering

import tech.tablesaw.api.Table;
import tech.tablesaw.selection.Selection;
Table books = Table.read().csv("bestsellers.csv");
// Condition for books with sales over 70M
Selection highSales = books.intColumn("Sales_in_Millions").isGreaterThan(70);

// Condition for books written in English Selection english = books.stringColumn("Language").isEqualTo("English");
// Condition for books published after 1950 Selection recentlyPublished = books.intColumn("First_Published").isGreaterThan(1950);
// Similar: .isLessThan(), .isLessThanOrEqualTo(), .isGreaterThanOrEqualTo()
Cleaning Data in Java

Filtering based on conditions

// Filter using a single condition directly
books.where(books.intColumn("Sales_in_Millions").isGreaterThan(70));

// Filter using a pre-defined Selection books.where(highSales);
// Combine multiple conditions with .and() Table popular = books.where(highSales) // Sales > 70M .and(english) // English language .and(recentlyPublished); // After 1950 .sortDescendingOn("Sales_in_Millions"); // Sort by Sales_in_Millions System.out.println("Modern high-revenue English books:\n"); System.out.println(modern.select("Books", "First_Published", "Total_Revenue"));
Cleaning Data in Java

Filtering based on conditions: outputs

Modern high-revenue English books:

| Book                                     | Sales_in_Millions | First_Published |
|------------------------------------------|-------------------|-----------------|
| Harry Potter and the Philosopher's Stone | 120               | 1997            |
| The Da Vinci Code                        | 80                | 2003            |
| Harry Potter and the Chamber of Secrets  | 77                | 1998            |
Cleaning Data in Java

Summarizing by mean

// Import mean calculation function
import static tech.tablesaw.aggregate.AggregateFunctions.mean;
Table overallMean = books.summarize("Sales_in_Millions", mean).apply();
System.out.println("\nOverall mean sales: \n" + overallMean);
Overall mean sales:

| Mean [Sales_in_Millions]  |
|---------------------------|
|        73.03225806451614  |
Cleaning Data in Java

Aggregating by groups

// Calculate average sales for each language group (e.g., English, French, etc.)
Table salesByLanguage = 
        books.summarize("Sales_in_Millions", mean)  // Calculate means

.by("Language"); // Split by language // Show first 5 language groups System.out.println("Average sales by language:\n"); System.out.println(salesByLanguage.first(5));
Cleaning Data in Java

Aggregating by groups: outputs

Average sales by language:

| Language   | Mean [Sales_in_Millions] |
|------------|--------------------------|
| English    | 69.96                    |
| French     | 200.0                    |
| Chinese    | 100.0                    |
| Portuguese | 65.0                     |
| Spanish    | 50.0                     |
Cleaning Data in Java

Combining filtering and aggregation

import static tech.tablesaw.aggregate.AggregateFunctions.sum;
// Filter and summarize sales by language
Table bestsellersByLanguage = books
        // Select books after 1950
        .where(books.intColumn("First_Published").isGreaterThan(1950))

// Calculate average sales .summarize("Sales_in_Millions", sum)
// Group results by language .by("Language"); System.out.println("Modern bestsellers (post-1950) by language:\n"); System.out.println(bestsellersByLanguage);
Cleaning Data in Java

Combining filtering and aggregation: outputs

Modern bestsellers (post-1950) by language:

| Language   | Sum [Sales_in_Millions] |
|------------|-------------------------|
| English    | 1071                    |
| Portuguese | 65                      |
| Spanish    | 50                      |
| Italian    | 50                      |
Cleaning Data in Java

Putting it all together

books.where(sales.isGreaterThan(150)); // Basic filtering with a single condition
books.where(highSales.and(english)); // Multiple conditions - combine with .and()
// Basic aggregation
books.summarize("Sales_in_Millions", mean)       // Calculate means
    .by("Language");                             // Group by language
// Combined operations
books.where(books.intColumn("First_Published").isGreaterThan(1950)) // Filter
    .summarize("Sales_in_Millions", mean) // Then summarize
    .by("Language"); // Finally group
Cleaning Data in Java

Let's practice!

Cleaning Data in Java

Preparing Video For Download...