Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
Normal ID fields:
| id | last name | first name | state |
|---|---|---|---|
| 0 | Smith | John | TX |
| 1 | Wilson | A. | IL |
| 2 | Adams | Wendy | OR |
pyspark.sql.functions.monotonically_increasing_id()
| id | last name | first name | state |
|---|---|---|---|
| 0 | Smith | John | TX |
| 134520871 | Wilson | A. | IL |
| 675824594 | Adams | Wendy | OR |
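Why the IDs jump: the Spark documentation describes the current implementation as placing the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The IDs are therefore unique and increasing, but not consecutive. The sketch below is a plain-Python simulation of that layout, not a call into Spark; the helper `simulated_id` and the partition sizes are hypothetical, chosen only for illustration, and the values in the table above are likewise illustrative rather than produced by this formula.

```python
# Illustrative only: mimics the documented bit layout of
# monotonically_increasing_id() without calling Spark.
RECORD_BITS = 33  # lower 33 bits hold the row's position within its partition


def simulated_id(partition_index: int, row_in_partition: int) -> int:
    """Combine a partition index (upper bits) with a row offset (lower 33 bits)."""
    return (partition_index << RECORD_BITS) | row_in_partition


# Hypothetical layout: three partitions holding 2, 1 and 2 rows
partition_sizes = [2, 1, 2]

ids = [
    simulated_id(partition, row)
    for partition, size in enumerate(partition_sizes)
    for row in range(size)
]

print(ids)
```

Rows in the first partition start at 0, while every later partition starts at a multiple of 2**33, which is where the large gaps come from. Because the values depend on how the data is partitioned, the same data can receive different IDs under a different partitioning.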
Remember, Spark is lazy!