Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
Normal ID fields:
id | last name | first name | state |
---|---|---|---|
0 | Smith | John | TX |
1 | Wilson | A. | IL |
2 | Adams | Wendy | OR |
pyspark.sql.functions.monotonically_increasing_id()
id | last name | first name | state |
---|---|---|---|
0 | Smith | John | TX |
134520871 | Wilson | A. | IL |
675824594 | Adams | Wendy | OR |
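The gaps in the table above follow from how the IDs are built. The Spark documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. Below is a minimal pure-Python sketch of that layout, not Spark's own code. The helper `sketch_id` and the three-partition layout are assumptions for illustration only. Real values depend on how the DataFrame is partitioned, so they will not match the illustrative numbers in the table.

```python
# Pure-Python illustration of the ID layout documented for
# monotonically_increasing_id(); this is not Spark's implementation.
RECORD_BITS = 33  # lower 33 bits hold the row's position within its partition


def sketch_id(partition_index: int, row_in_partition: int) -> int:
    """Combine a partition index and a row offset into a single integer ID."""
    return (partition_index << RECORD_BITS) + row_in_partition


# Hypothetical layout: three partitions holding 2, 1 and 2 rows.
rows_per_partition = [2, 1, 2]
ids = [
    sketch_id(partition, row)
    for partition, row_count in enumerate(rows_per_partition)
    for row in range(row_count)
]
print(ids)
```

Each partition can number its own rows without coordinating with the others, which is why the IDs are unique and increasing but not consecutive.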
Remember, Spark is lazy! The IDs are not generated until an action runs on the DataFrame.
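A minimal sketch of adding such an ID column in PySpark might look like the following. The helper name `with_row_id` and the default column name `id` are assumptions for illustration. `withColumn` and `monotonically_increasing_id` are the real PySpark calls. The import sits inside the function only so the snippet can be loaded without a Spark installation.

```python
def with_row_id(df, id_col="id"):
    """Return df with an extra column of unique, increasing 64-bit IDs."""
    # Deferred import: pyspark is only needed when the helper is called.
    from pyspark.sql import functions as F

    # withColumn is a transformation, so nothing is computed here;
    # the IDs are produced only when an action (show, count, write) runs.
    return df.withColumn(id_col, F.monotonically_increasing_id())
```

Calling `with_row_id(df)` only extends the query plan. Because the generated values depend on partitioning, they may differ if the plan is re-evaluated, so it is probably safest to cache or write out the result before relying on specific IDs.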