Cleaning text data

Data Transformation with Polars

Liam Brannigan

Data Scientist & Polars Contributor

Meet your instructor

  • Liam Brannigan, Lead Data Scientist
  • ML and Data Engineering Specialist
  • Polars contributor

Profile picture of the instructor.

Transformation Engine

Animation

Is this course for you?

  • Creating a Polars DataFrame

  • Using a Polars expression

  • Doing a group-by aggregation

Introduction to Polars - course page

Chapter 1

Cleaning text data diagram

Chapter 2

Time series data

Chapter 3

Combining DataFrames

Chapter 4

Custom workflows and correlation

Meet our dataset

import polars as pl

ratings = pl.read_csv("restaurant_ratings.csv")
shape: (5, 5)
| business         | location     | type       | rating | capacity |
| ---              | ---          | ---        | ---    | ---      |
| str              | str          | str        | f64    | f64      |
|------------------|--------------|------------|--------|----------|
|   7burgers       | Wakey Wakey  | restaurant | 5.0    | 55.0     |
| Bang Bang Burger | Forest Rd.   | restaurant | 4.0    | 55.0     |
| Costa Coffee     | City Point   | café       | 5.0    | 41.0     |
|  Costa Coffee    | The Moorgate | takeaway   | 5.0    | 0.0      |
| The Queens Head  | Denman St.   | bar        | 5.0    | 187.0    |

Restaurant recommendation app

shape: (3, 5)
| business         | location     | type       | rating | capacity |
| ---              | ---          | ---        | ---    | ---      |
| str              | str          | str        | f64    | f64      |
|------------------|--------------|------------|--------|----------|
|   7burgers       | Wakey Wakey  | restaurant | 5.0    | 55.0     |
| Bang Bang Burger | Forest Rd.   | restaurant | 4.0    | 55.0     |
| Costa Coffee     | City Point   | café       | 5.0    | 41.0     |
  • Remove whitespace
  • Convert rating and capacity columns to integers
  • Create unique identifier column

Casting dtype with an expression

ratings.with_columns(
    pl.col("rating").cast(pl.Int64)
)
shape: (5, 5)
| business         | location     | type       | rating | capacity |
| ---              | ---          | ---        | ---    | ---      |
| str              | str          | str        | i64    | f64      |
|------------------|--------------|------------|--------|----------|
|   7burgers       | Wakey Wakey  | restaurant | 5      | 55.0     |
| Bang Bang Burger | Forest Rd.   | restaurant | 4      | 55.0     |
| Costa Coffee     | City Point   | café       | 5      | 41.0     |
|  Costa Coffee    | The Moorgate | takeaway   | 5      | 0.0      |
| The Queens Head  | Denman St.   | bar        | 5      | 187.0    |

Casting multiple columns

ratings.cast({pl.Float64: pl.Int64})
shape: (5, 5)
| business         | location     | type       | rating | capacity |
| ---              | ---          | ---        | ---    | ---      |
| str              | str          | str        | i64    | i64      |
|------------------|--------------|------------|--------|----------|
|   7burgers       | Wakey Wakey  | restaurant | 5      | 55       |
| Bang Bang Burger | Forest Rd.   | restaurant | 4      | 55       |
| Costa Coffee     | City Point   | café       | 5      | 41       |
|  Costa Coffee    | The Moorgate | takeaway   | 5      | 0        |
| The Queens Head  | Denman St.   | bar        | 5      | 187      |

Cleaning text data

shape: (5, 5)
| business         | location        | type       | rating | capacity |
| ---              | ---             | ---        | ---    | ---      |
| str              | str             | str        | i64    | i64      |
|------------------|-----------------|------------|--------|----------|
|   7burgers       | Wakey Wakey     | restaurant | 5      | 55       |
| Bang Bang Burger | Forest Rd.      | restaurant | 4      | 55       |
| Costa Coffee     | City Point      | café       | 5      | 41       |
|  Costa Coffee    | The Moorgate    | takeaway   | 5      | 0        |
| The Queens Head  | Denman St.      | bar        | 5      | 187      |

Cleaning text data

  • .str.contains()
  • .str.strip_chars()
  • .str.strip_chars_start()
  • .str.to_lowercase()
  • ...
1 https://docs.pola.rs/api/python/stable/reference/expressions/string.html

Cleaning text data

ratings.with_columns(
    pl.col("business").str.strip_chars_start()
)
shape: (5, 5)
| business         | location     | type       | rating | capacity |
| ---              | ---          | ---        | ---    | ---      |
| str              | str          | str        | i64    | i64      |
|------------------|--------------|------------|--------|----------|
| 7burgers         | Wakey Wakey  | restaurant | 5      | 55       |
| Bang Bang Burger | Forest Rd.   | restaurant | 4      | 55       |
| Costa Coffee     | City Point   | café       | 5      | 41       |
| Costa Coffee     | The Moorgate | takeaway   | 5      | 0        |
| The Queens Head  | Denman St.   | bar        | 5      | 187      |

Combining text data

shape: (5, 5)
| business         | location        | type       | rating | capacity |
| ---              | ---             | ---        | ---    | ---      |
| str              | str             | str        | i64    | i64      |
|------------------|-----------------|------------|--------|----------|
| 7burgers         | Wakey Wakey     | restaurant | 5      | 55       |
| Bang Bang Burger | Forest Rd.      | restaurant | 4      | 55       |
| Costa Coffee     | City Point      | café       | 5      | 41       |
| Costa Coffee     | The Moorgate    | takeaway   | 5      | 0        |
| The Queens Head  | Denman St.      | bar        | 5      | 187      |

Combining text data

ratings.with_columns(
    pl.concat_str("business", "location", separator=":").alias("id")
)
shape: (5, 6)
| business         | location     | type       | ... | id                            |
| ---              | ---          | ---        | --- | ---                           |
| str              | str          | str        | ... | str                           |
|------------------|--------------|------------|-----|-------------------------------|
| 7burgers         | Wakey Wakey  | restaurant | ... | 7burgers:Wakey Wakey          |
| Bang Bang Burger | Forest Rd.   | restaurant | ... | Bang Bang Burger:Forest Rd.   |
| Costa Coffee     | City Point   | café       | ... | Costa Coffee:City Point       |
| Costa Coffee     | The Moorgate | takeaway   | ... | Costa Coffee:The Moorgate     |
| The Queens Head  | Denman St.   | bar        | ... | The Queens Head:Denman St.    |

Let's practice!
