Handling missing and duplicate values

Introduction to Polars

Liam Brannigan

Data Scientist & Polars Contributor

Missing values

rentals
shape: (49, 8)
| name                | type    | price | ... | doubles | singles | review | beach |
| ---                 | ---     | ---   | ... | ---     | ---     | ---    | ---   |
| str                 | str     | i64   | ... | i64     | i64     | f64    | bool  |
|---------------------|---------|-------| ... |---------|---------|--------|-------|
| Bright House        | Cottage | 956   | ... | 3       | null    | 9.9    | true  |
| Bright House        | Cottage | 1050  | ... | 3       | null    | 9.9    | true  |
Introduction to Polars

Counting null values

rentals.null_count()
shape: (1, 8)
| name | type | price | bedrooms | doubles | singles | review | beach |
| ---  | ---  | ---   | ---      | ---     | ---     | ---    | ---   |
| u32  | u32  | u32   | u32      | u32     | u32     | u32    | u32   |
|------|------|-------|----------|---------|---------|--------|-------|
| 0    | 1    | 0     | 1        | 0       | 4       | 2      | 0     |
Introduction to Polars

Finding rows with null values

rentals.filter(
    pl.col("singles").is_null()
)
shape: (4, 8)
| name             | type    | price | bedrooms | doubles | singles | review | beach |
| ---              | ---     | ---   | ---      | ---     | ---     | ---    | ---   |
| str              | str     | i64   | i64      | i64     | i64     | f64    | bool  |
|------------------|---------|-------|----------|---------|---------|--------|-------|
| Bright House     | Cottage | 956   | 4        | 3       | null    | 9.9    | true  |
| Bright House     | Cottage | 1050  | 4        | 3       | null    | 9.9    | true  |
| Sea View House   | Cottage | 742   | 5        | 3       | null    | null   | true  |
| Trewhiddle Villa | Villa   | 1077  | 4        | 3       | null    | 9.8    | false |
Introduction to Polars

Dropping rows with null values

rentals.drop_nulls()
shape: (43, 8)
| name      | type     | price | bedrooms | doubles | singles | review | beach |
| ---       | ---      | ---   | ---      | ---     | ---     | ---    | ---   |
| str       | str      | i64   | i64      | i64     | i64     | f64    | bool  |
|-----------|----------|-------|----------|---------|---------|--------|-------|
| Waves     | Cottage  | 540   | 4        | 1       | 2       | 8.9    | false |
| Seashells | Cottage  | 540   | 4        | 2       | 2       | 8.7    | true  |
| ...       | ...      | ...   | ...      | ...     | ...     | ...    | ...   |
Introduction to Polars

Dropping rows with nulls in specific columns

rentals.drop_nulls(subset=["singles"])
shape: (46, 8)
| name      | type     | price | bedrooms | doubles | singles | review | beach |
| ---       | ---      | ---   | ---      | ---     | ---     | ---    | ---   |
| str       | str      | i64   | i64      | i64     | i64     | f64    | bool  |
|-----------|----------|-------|----------|---------|---------|--------|-------|
| Waves     | Cottage  | 540   | 4        | 1       | 2       | 8.9    | false |
| Seashells | Cottage  | 540   | 4        | 2       | 2       | 8.7    | true  |
| ...       | ...      | ...   | ...      | ...     | ...     | ...    | ...   |
Introduction to Polars

Filling nulls with a value

rentals.with_columns(
    pl.col("singles").fill_null(0)
)
shape: (49, 8)
| name          | type    | price | bedrooms | doubles | singles | review | beach |
| ---           | ---     | ---   | ---      | ---     | ---     | ---    | ---   |
| str           | str     | i64   | i64      | i64     | i64     | f64    | bool  |
|---------------|---------|-------|----------|---------|---------|--------|-------|
| Bright House  | Cottage | 956   | 4        | 3       | 0       | 9.9    | true  |
| Bright House  | Cottage | 1050  | 4        | 3       | 0       | 9.9    | true  |
| ...           | ...     | ...   | ...      | ...     | ...     | ...    | ...   |
Introduction to Polars

Filling nulls with an expression

rentals.with_columns(
    pl.col("review").fill_null(                       )
)
Introduction to Polars

Filling nulls with an expression

rentals.with_columns(
    pl.col("review").fill_null(pl.col("review").mean())
)
shape: (49, 8)
| name                | type    | price | ... | doubles | singles | review | beach |
| ---                 | ---     | ---   | ... | ---     | ---     | ---    | ---   |
| str                 | str     | i64   | ... | i64     | i64     | f64    | bool  |
|---------------------|---------|-------| ... |---------|---------|--------|-------|
| Bright House        | Cottage | 956   | ... | 3       | null    | 9.9    | true  |
| Bright House        | Cottage | 1050  | ... | 3       | null    | 9.9    | true  |
| Sea View House      | Cottage | 742   | ... | 3       | null    | 8.99   | true  |
| ...                 | ...     | ...   | ... | ...     | ...     | ...    | ...   |
Introduction to Polars

Finding duplicate rows

rentals.is_duplicated()
shape: (49,)
Series: '' [bool]
[
    false
    false
    ...
]
Introduction to Polars

Finding duplicate rows

rentals.filter(
    rentals.is_duplicated()
)
shape: (4, 8)
| name              | type      | price | ... | singles | review | beach |
| ---               | ---       | ---   | ... | ---     | ---    | ---   |
| str               | str       | i64   | ... | i64     | f64    | bool  |
|-------------------|-----------|-------| ... |---------|--------|-------|
| Tregenna House    | Hotel     | 2411  | ... | 2       | 8.7    | true  |
| Tregenna House    | Hotel     | 2411  | ... | 2       | 8.7    | true  |
| Newquay Apartment | Apartment | 1220  | ... | 4       | 9.7    | true  |
| Newquay Apartment | Apartment | 1220  | ... | 4       | 9.7    | true  |
Introduction to Polars

Finding all duplicate rows

rentals.filter(
    pl.col("name").is_duplicated()
)
shape: (18, 8)
| name              | type      | price | ... | singles | review | beach |
| ---               | ---       | ---   | ... | ---     | ---    | ---   |
| str               | str       | i64   | ... | i64     | f64    | bool  |
|-------------------|-----------|-------| ... |---------|--------|-------|
| Bright House      | Cottage   | 956   | ... | null    | 9.9    | true  |
| Bright House      | Cottage   | 1050  | ... | null    | 9.9    | true  |
| Palma Villa       | Villa     | 1772  | ... | 2       | 9.6    | true  |
| Palma Villa       | Villa     | 1947  | ... | 2       | 9.6    | true  |
| ...               | ...       | ...   | ... | ...     | ...    | ...   |
Introduction to Polars

Dropping duplicate rows

rentals.unique()
shape: (47, 8)
| name           | type      | price | ... | doubles | singles | review | beach |
| ---            | ---       | ---   | ... | ---     | ---     | ---    | ---   |
| str            | str       | i64   | ... | i64     | i64     | f64    | bool  |
|----------------|-----------|-------|-----|---------|---------|--------|-------|
| Bright House   | Cottage   | 1050  | ... | 3       | null    | 9.9    | true  |
| Perran View    | Hotel     | 341   | ... | 1       | 4       | 8.2    | false |
| ...            | ...       | ...   | ... | ...     | ...     | ...    | ...   |
  • .unique() changes the row order
Introduction to Polars

Dropping duplicate rows using specific columns

rentals.unique(subset=["name"])
shape: (40, 8)
| name            | type    | price | bedrooms | doubles | singles | review | beach |
| ---             | ---     | ---   | ---      | ---     | ---     | ---    | ---   |
| str             | str     | i64   | i64      | i64     | i64     | f64    | bool  |
|-----------------|---------|-------|----------|---------|---------|--------|-------|
| Porth Caravan   | Caravan | 380   | 3        | 1       | 4       | 7.2    | true  |
| Wheal Frances   | Cottage | 1055  | 4        | 2       | 4       | 9.4    | false |
| Tregenna House  | Hotel   | 2411  | 4        | 1       | 2       | 8.7    | true  |
| ...             | ...     | ...   | ...      | ...     | ...     | ...    | ...   |

Introduction to Polars

Let's practice!

Introduction to Polars

Preparing Video For Download...