Cleaning Data in Java
Dennis Lee
Software Engineer
Books | Authors | Language | First_Published | Sales_in_Millions | Average_Price |
---|---|---|---|---|---|
A Tale of Two Cities | Charles Dickens | English | 1859 | 200.0 | 12.99 |
The Little Prince (Le Petit Prince) | Antoine de Saint-Exupéry | French | 1943 | 200.0 | 15.50 |
Harry Potter and the Philosopher's Stone | J. K. Rowling | English | 1997 | 120.0 | 19.99 |
And Then There Were None | Agatha Christie | English | 1939 | 100.0 | 14.95 |
import tech.tablesaw.api.Table;
import tech.tablesaw.api.StringColumn;
import tech.tablesaw.api.DoubleColumn;
Table books = Table.read().csv("bestsellers.csv"); for (String colName : books.columnNames()) {
// Print column name and its type System.out.println(colName + ": " + books.column(colName).type()); }
Books: STRING
Authors: STRING
Language: STRING
First_Published: INTEGER
Sales_in_Millions: INTEGER
Average_Price: DOUBLE
// Convert IntColumn to DoubleColumn since Tablesaw inferred it as INTEGER
DoubleColumn sales = books.intColumn("Sales_in_Millions")
.asDoubleColumn();
System.out.println("Data type of sales column after conversion: " + sales.type());
Data type of sales column after conversion: DOUBLE
// It's possible to convert a DoubleColumn to an IntColumn
IntColumn salesAsInt = sales.asIntColumn();
DoubleColumn averagePrice = books.doubleColumn("Average_Price"); // Calculate total revenue by multiplying two columns DoubleColumn totalRevenue = sales.multiply(averagePrice)
.setName("Total_Revenue_Millions");
books.addColumns(totalRevenue); System.out.println("\nMultiplying two columns: Sales × Price = Revenue"); System.out.println(books.selectColumns("Books", "Sales_in_Millions", "Average_Price", "Total_Revenue_Millions") .first(4).print());
Multiplying two columns: Sales * Price = Revenue
| Books | Sales_in_ | Average_ | Total_Revenue_ |
| | Millions | Price | Millions |
|-------------------------------------|------------|-----------|----------------|
| A Tale of Two Cities | 200 | 12.99 | 2598 |
| The Little Prince (Le Petit Prince) | 200 | 15.50 | 3100 |
| Harry Potter and | 120 | 19.99 | 2398.80 |
| the Philosopher's Stone | | | |
| And Then There Were None | 100 | 14.95 | 1495 |
\\(.*\\)
matches parentheses\\( // Match opening parenthesis (escaped with \)
.* // Match any character (.) zero or more times (*)
\\) // Match closing parenthesis (escaped with \)
// Remove parenthetical translations from titles
StringColumn titlesNoParentheses = titles
// For each title in titles, remove parentheses and surrounding spaces
.map(t -> t.replaceAll("\\(.*\\)", "").trim())
// Set the new column name
.setName("Titles without parentheses");
System.out.println("Original: " + titles.get(1)); // Get the first column item
System.out.println("Cleaned: " + titlesNoParentheses.get(1));
Original: The Little Prince (Le Petit Prince)
Cleaned: The Little Prince
// Combine removing parentheses and converting to lowercase StringColumn titlesCombinedCleaning = titles.map(t -> t.replaceAll("\\(.*\\)", "") // Remove parentheses
.trim() // Remove surrounding space
.toLowerCase()) // Convert to lowercase .setName("Clean_Titles_Combined"); // Set column name System.out.println("Original title: " + titles.get(1)); System.out.println("Title after cleaning: " + titlesCombinedCleaning.get(1));
Original title: The Little Prince (Le Petit Prince)
Title after cleaning: the little prince
books.column(colName).type(); // Get column type
books.intColumn("Sales_in_Millions").asDoubleColumn(); // Convert type
DoubleColumn totalRevenue = sales.multiply(averagePrice) // Multiply columns
.setName("Total_Revenue_Millions"); // Name new column
titles.map(String::toLowerCase) // Convert to lowercase
.setName("Lowercase_Titles"); // Name new column
titles.map(t -> t.replaceAll("\\(.*\\)", "").trim()) // Remove parentheses
.setName("Titles without parentheses"); // Name new column
Cleaning Data in Java