Data Transformation with Polars
Liam Brannigan
Data Scientist & Polars Contributor
import polars as pl

venues = pl.read_csv("venues.csv")
shape: (4, 6)
| business | location | type | hygiene_rating | review | price |
| --- | --- | --- | --- | --- | --- |
| str | str | str | i64 | f64 | i64 |
|------------------|-------------|------------|----------------|--------|-------|
| 7burgers | Wakey Wakey | restaurant | 4 | 4.2 | 15 |
| Bang Bang Burger | Forest Rd. | restaurant | 3 | 3.8 | 12 |
| Costa Coffee | City Point | café | 5 | 4.5 | 8 |
| The Queens Head | Denman St. | bar | 5 | 4.7 | 25 |
def rescale_review(x):
    return 2 * x
venues.with_columns(
    pl.col("review")
)
venues.with_columns(
    pl.col("review").map_elements(rescale_review)
)
shape: (4, 6)
| business | location | type | hygiene_rating | review | price |
| --- | --- | --- | --- | --- | --- |
| str | str | str | i64 | f64 | i64 |
|------------------|-------------|------------|----------------|--------|-------|
| 7burgers | Wakey Wakey | restaurant | 4 | 8.4 | 15 |
| Bang Bang Burger | Forest Rd. | restaurant | 3 | 7.6 | 12 |
| Costa Coffee | City Point | café | 5 | 9.0 | 8 |
| The Queens Head | Denman St. | bar | 5 | 9.4 | 25 |
venues.with_columns(
    pl.col("review").map_elements(
        rescale_review, return_dtype=pl.Float64
    )
)
shape: (4, 6)
| business | location | type | hygiene_rating | review | price |
| --- | --- | --- | --- | --- | --- |
| str | str | str | i64 | f64 | i64 |
|------------------|-------------|------------|----------------|--------|-------|
| 7burgers | Wakey Wakey | restaurant | 4 | 8.4 | 15 |
| Bang Bang Burger | Forest Rd. | restaurant | 3 | 7.6 | 12 |
| Costa Coffee | City Point | café | 5 | 9.0 | 8 |
| The Queens Head | Denman St. | bar | 5 | 9.4 | 25 |
PolarsInefficientMapWarning:
Expr.map_elements is significantly slower than the native expressions API.
Only use if you absolutely CANNOT implement your logic otherwise.
Replace this expression...
- pl.col("review").map_elements(rescale_review)
with this one instead:
+ 2 * pl.col("review")
venues.with_columns(
    (2 * pl.col("review")).alias("review")
)
shape: (4, 6)
| business | location | type | hygiene_rating | review | price |
| --- | --- | --- | --- | --- | --- |
| str | str | str | i64 | f64 | i64 |
|------------------|-------------|------------|----------------|--------|-------|
| 7burgers | Wakey Wakey | restaurant | 4 | 8.4 | 15 |
| Bang Bang Burger | Forest Rd. | restaurant | 3 | 7.6 | 12 |
| Costa Coffee | City Point | café | 5 | 9.0 | 8 |
| The Queens Head | Denman St. | bar | 5 | 9.4 | 25 |
.map_elements()
Denman St. -> DENMAN STREET
Forest Rd. -> FOREST ROAD
location column
venues.with_columns(
    pl.col("location")
)
venues.with_columns(
    pl.col("location")
    .str.replace_many(["St.", "Rd."], ["Street", "Road"])
)
venues.with_columns(
    pl.col("location")
    .str.replace_many(["St.", "Rd."], ["Street", "Road"])
    .str.to_uppercase()
    .alias("location_clean")
)
shape: (4, 7)
| business | location | ... | location_clean |
| --- | --- | --- | --- |
| str | str | ... | str |
|------------------|-------------|-----|----------------|
| 7burgers | Wakey Wakey | ... | WAKEY WAKEY |
| Bang Bang Burger | Forest Rd. | ... | FOREST ROAD |
| Costa Coffee | City Point | ... | CITY POINT |
| The Queens Head | Denman St. | ... | DENMAN STREET |
standardize_locations_expr = (
    pl.col("location")
    .str.replace_many(["St.", "Rd."], ["Street", "Road"])
    .str.to_uppercase()
    .alias("location_clean")
)
type(standardize_locations_expr)
polars.Expr
venues.with_columns(
    standardize_locations_expr
)
shape: (4, 7)
| business | location | ... | location_clean |
| --- | --- | --- | --- |
| str | str | ... | str |
|------------------|-------------|-----|----------------|
| 7burgers | Wakey Wakey | ... | WAKEY WAKEY |
| Bang Bang Burger | Forest Rd. | ... | FOREST ROAD |
| Costa Coffee | City Point | ... | CITY POINT |
| The Queens Head | Denman St. | ... | DENMAN STREET |
def standardize(input):
    return (
        input
        .str.replace_many(["St.", "Rd."], ["Street", "Road"])
        .str.to_uppercase()
    )
pl.Expr.standardize = standardize
restaurants.with_columns(
    pl.col("address").standardize().alias("address_clean")
)
shape: (4, 7)
| business | address | ... | address_clean |
| --- | --- | --- | --- |
| str | str | ... | str |
|------------------|-------------|-----|----------------|
| 7burgers | Wakey Wakey | ... | WAKEY WAKEY |
| Bang Bang Burger | Forest Rd. | ... | FOREST ROAD |
| Costa Coffee | City Point | ... | CITY POINT |
| The Queens Head | Denman St. | ... | DENMAN STREET |