Column cleaning and standardization

Cleaning Data in Java

Dennis Lee

Software Engineer

Book sales dataset

Books Authors Language First_Published Sales_in_Millions Average_Price
A Tale of Two Cities Charles Dickens English 1859 200.0 12.99
The Little Prince (Le Petit Prince) Antoine de Saint-Exupéry French 1943 200.0 15.50
Harry Potter and the Philosopher's Stone J. K. Rowling English 1997 120.0 19.99
And Then There Were None Agatha Christie English 1939 100.0 14.95
Cleaning Data in Java

Examining data types

import tech.tablesaw.api.Table;
import tech.tablesaw.api.StringColumn;
import tech.tablesaw.api.DoubleColumn;
Table books = Table.read().csv("bestsellers.csv");

for (String colName : books.columnNames()) {

// Print column name and its type System.out.println(colName + ": " + books.column(colName).type()); }
Cleaning Data in Java

Examining data types: outputs

Books: STRING
Authors: STRING
Language: STRING
First_Published: INTEGER
Sales_in_Millions: INTEGER
Average_Price: DOUBLE
Cleaning Data in Java

Converting numeric formats

// Convert IntColumn to DoubleColumn since Tablesaw inferred it as INTEGER
DoubleColumn sales = books.intColumn("Sales_in_Millions")
      .asDoubleColumn();

System.out.println("Data type of sales column after conversion: " + sales.type());
Data type of sales column after conversion: DOUBLE
// It's possible to convert a DoubleColumn to an IntColumn
IntColumn salesAsInt = sales.asIntColumn();
Cleaning Data in Java

Multiplying columns

DoubleColumn averagePrice = books.doubleColumn("Average_Price");

// Calculate total revenue by multiplying two columns
DoubleColumn totalRevenue = sales.multiply(averagePrice)

.setName("Total_Revenue_Millions");
books.addColumns(totalRevenue); System.out.println("\nMultiplying two columns: Sales × Price = Revenue"); System.out.println(books.selectColumns("Books", "Sales_in_Millions", "Average_Price", "Total_Revenue_Millions") .first(4).print());
Cleaning Data in Java

Multiplying columns: outputs

Multiplying two columns: Sales * Price = Revenue

| Books                               | Sales_in_  | Average_  | Total_Revenue_ |
|                                     | Millions   | Price     | Millions       |
|-------------------------------------|------------|-----------|----------------|
| A Tale of Two Cities                | 200        | 12.99     | 2598           |
| The Little Prince (Le Petit Prince) | 200        | 15.50     | 3100           |
| Harry Potter and                    | 120        | 19.99     | 2398.80        |
|   the Philosopher's Stone           |            |           |                |
| And Then There Were None            | 100        | 14.95     | 1495           |
Cleaning Data in Java

Regex to match parentheses

  • The pattern \\(.*\\) matches parentheses
  • Example matches: "(Organic)", "(2 pack)", "(Large)"
  • Result: "Apple (Organic)" becomes "Apple"
\\(     // Match opening parenthesis (escaped with \)
.*      // Match any character (.) zero or more times (*)
\\)     // Match closing parenthesis (escaped with \)
Cleaning Data in Java

Handling special characters

// Remove parenthetical translations from titles
StringColumn titlesNoParentheses = titles
        // For each title in titles, remove parentheses and surrounding spaces
        .map(t -> t.replaceAll("\\(.*\\)", "").trim())
        // Set the new column name
        .setName("Titles without parentheses");

System.out.println("Original: " + titles.get(1)); // Get the first column item
System.out.println("Cleaned: " + titlesNoParentheses.get(1));
Original: The Little Prince (Le Petit Prince)
Cleaned: The Little Prince
Cleaning Data in Java

Combining string operations

// Combine removing parentheses and converting to lowercase
StringColumn titlesCombinedCleaning =
        titles.map(t -> t.replaceAll("\\(.*\\)", "") // Remove parentheses

.trim() // Remove surrounding space
.toLowerCase()) // Convert to lowercase .setName("Clean_Titles_Combined"); // Set column name System.out.println("Original title: " + titles.get(1)); System.out.println("Title after cleaning: " + titlesCombinedCleaning.get(1));
Original title: The Little Prince (Le Petit Prince)
Title after cleaning: the little prince
Cleaning Data in Java

Putting it all together

books.column(colName).type(); // Get column type
books.intColumn("Sales_in_Millions").asDoubleColumn(); // Convert type
DoubleColumn totalRevenue = sales.multiply(averagePrice) // Multiply columns
    .setName("Total_Revenue_Millions"); // Name new column
titles.map(String::toLowerCase) // Convert to lowercase
    .setName("Lowercase_Titles"); // Name new column
titles.map(t -> t.replaceAll("\\(.*\\)", "").trim()) // Remove parentheses
    .setName("Titles without parentheses"); // Name new column
Cleaning Data in Java

Let's practice!

Cleaning Data in Java

Preparing Video For Download...