Searching and extracting text

Data Transformation with Polars

Liam Brannigan

Data Scientist & Polars Contributor

Searching text

shape: (5, 5)
| business         | location     | type       | rating | capacity |
| ---              | ---          | ---        | ---    | ---      |
| str              | str          | str        | i64    | i64      |
|------------------|--------------|------------|--------|----------|
| 7burgers         | Wakey Wakey  | restaurant | 5      | 55       |
| Bang Bang Burger | Forest Rd.   | restaurant | 4      | 55       |
| Costa Coffee     | City Point   | café       | 5      | 41       |
| Costa Coffee     | The Moorgate | takeaway   | 5      | 0        |
| The Queens Head  | Denman St.   | bar        | 5      | 187      |
  • Task: find "burger" in the business name

Searching text

ratings.with_columns(
    pl.col("business").str.contains("Burger").alias("is_burger")
)
shape: (5, 6)
| business         | location        | type       | rating | capacity | is_burger |
| ---              | ---             | ---        | ---    | ---      | ---       |
| str              | str             | str        | i64    | i64      | bool      |
|------------------|-----------------|------------|--------|----------|-----------|
| 7burgers         | Wakey Wakey     | restaurant | 5      | 55       | false     |
| Bang Bang Burger | Forest Rd.      | restaurant | 4      | 55       | true      |
| Costa Coffee     | City Point      | café       | 5      | 41       | false     |
| Costa Coffee     | The Moorgate    | takeaway   | 5      | 0        | false     |
| The Queens Head  | Denman St.      | bar        | 5      | 187      | false     |

Searching text

ratings.with_columns(
    pl.col("business").str.to_lowercase().str.contains("burger").alias("is_burger")
)
shape: (5, 6)
| business         | location        | type       | rating | capacity | is_burger |
| ---              | ---             | ---        | ---    | ---      | ---       |
| str              | str             | str        | i64    | i64      | bool      |
|------------------|-----------------|------------|--------|----------|-----------|
| 7burgers         | Wakey Wakey     | restaurant | 5      | 55       | true      |
| Bang Bang Burger | Forest Rd.      | restaurant | 4      | 55       | true      |
| Costa Coffee     | City Point      | café       | 5      | 41       | false     |
| Costa Coffee     | The Moorgate    | takeaway   | 5      | 0        | false     |
| The Queens Head  | Denman St.      | bar        | 5      | 187      | false     |

Filtering for text matches

ratings.filter(
    pl.col("business").str.to_lowercase().str.contains("burger")
)
shape: (5, 5)
| business          | location        | type       | rating | capacity |
| ---               | ---             | ---        | ---    | ---      |
| str               | str             | str        | i64    | i64      |
|-------------------|-----------------|------------|--------|----------|
| 7burgers          | Wakey Wakey     | restaurant | 5      | 55       |
| Bang Bang Burger  | Forest Rd.      | restaurant | 4      | 55       |
| Bronson's Burgers | Arch 112        | takeaway   | 4      | 0        |
| Burger & Lobster  | Bow Bells House | restaurant | 5      | 40       |
| Ted Burgers       | Prince Of Wales | takeaway   | 4      | 0        |

Extracting text

shape: (5, 5)
| business         | location     | type       | rating | capacity |
| ---              | ---          | ---        | ---    | ---      |
| str              | str          | str        | i64    | i64      |
|------------------|--------------|------------|--------|----------|
| 7burgers         | Wakey Wakey  | restaurant | 5      | 55       |
| bang bang burger | Forest Rd.   | restaurant | 4      | 55       |
| costa coffee     | City Point   | café       | 5      | 41       |
| costa coffee     | The Moorgate | takeaway   | 5      | 0        |
| the queens head  | Denman St.   | bar        | 5      | 187      |
  • Task: extract "burger" or "coffee" from the business column

Extracting text

ratings.with_columns(
    pl.col("business").str.extract("(burger)").alias("food")
)
shape: (5, 6)
| business         | location        | type       | rating | capacity | food   |
| ---              | ---             | ---        | ---    | ---      | ---    |
| str              | str             | str        | i64    | i64      | str    |
|------------------|-----------------|------------|--------|----------|--------|
| 7burgers         | Wakey Wakey     | restaurant | 5      | 55       | burger |
| bang bang burger | Forest Rd.      | restaurant | 4      | 55       | burger |
| costa coffee     | City Point      | café       | 5      | 41       | null   |
| costa coffee     | The Moorgate    | takeaway   | 5      | 0        | null   |
| the queens head  | Denman St.      | bar        | 5      | 187      | null   |

Extracting text - multiple terms

ratings.with_columns(
    pl.col("business").str.extract("(burger|coffee)").alias("food")
)
shape: (5, 6)
| business         | location        | type       | rating | capacity | food   |
| ---              | ---             | ---        | ---    | ---      | ---    |
| str              | str             | str        | i64    | i64      | str    |
|------------------|-----------------|------------|--------|----------|--------|
| 7burgers         | Wakey Wakey     | restaurant | 5      | 55       | burger |
| bang bang burger | Forest Rd.      | restaurant | 4      | 55       | burger |
| costa coffee     | City Point      | café       | 5      | 41       | coffee |
| costa coffee     | The Moorgate    | takeaway   | 5      | 0        | coffee |
| the queens head  | Denman St.      | bar        | 5      | 187      | null   |

Replacing text

shape: (5, 6)
| business         | location        | type       | rating | capacity | food    |
| ---              | ---             | ---        | ---    | ---      | ---     |
| str              | str             | str        | i64    | i64      | str     |
|------------------|-----------------|------------|--------|----------|---------|
| 7burgers         | Wakey Wakey     | restaurant | 5      | 55       | burger  |
| bang bang burger | Forest Rd.      | restaurant | 4      | 55       | burger  |
| costa coffee     | City Point      | café       | 5      | 41       | coffee  |
| costa coffee     | The Moorgate    | takeaway   | 5      | 0        | coffee  |
| the queens head  | Denman St.      | bar        | 5      | 187      | null    |
  • Task: replace "Rd." with "Road"

Replacing text

ratings.with_columns(
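    # literal=True matches "Rd." exactly; by default the pattern is a regex and "." matches any character
    pl.col("location").str.replace("Rd.", "Road", literal=True)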
)
shape: (5, 6)
| business         | location        | type       | rating | capacity | food    |
| ---              | ---             | ---        | ---    | ---      | ---     |
| str              | str             | str        | i64    | i64      | str     |
|------------------|-----------------|------------|--------|----------|---------|
| 7burgers         | Wakey Wakey     | restaurant | 5      | 55       | burger  |
| bang bang burger | Forest Road     | restaurant | 4      | 55       | burger  |
| costa coffee     | City Point      | café       | 5      | 41       | coffee  |
| costa coffee     | The Moorgate    | takeaway   | 5      | 0        | coffee  |
| the queens head  | Denman St.      | bar        | 5      | 187      | null    |

Replacing multiple strings

ratings.with_columns(
    pl.col("location").str.replace("Rd.", "Road", literal=True)
)
ratings.with_columns(
    pl.col("location").str.replace_many(["Rd.", "St."], ["Road", "Street"])
)
shape: (5, 6)
| business         | location        | type       | rating | capacity | food    |
| ---              | ---             | ---        | ---    | ---      | ---     |
| str              | str             | str        | i64    | i64      | str     |
|------------------|-----------------|------------|--------|----------|---------|
| 7burgers         | Wakey Wakey     | restaurant | 5      | 55       | burger  |
| bang bang burger | Forest Road     | restaurant | 4      | 55       | burger  |
| costa coffee     | City Point      | café       | 5      | 41       | coffee  |
| costa coffee     | The Moorgate    | takeaway   | 5      | 0        | coffee  |
| the queens head  | Denman Street   | bar        | 5      | 187      | null    |

Let's practice!
