DataFrame column operations

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

DataFrame refresher

DataFrames:

  • Made up of rows & columns
  • Immutable
  • Use various transformation operations to modify data
# Return rows where name starts with "M" 
voter_df.filter(voter_df.name.like('M%'))

# Return name and position only
voters = voter_df.select('name', 'position')
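
To try this end to end, here is a minimal sketch; the SparkSession setup and the sample voter rows are hypothetical stand-ins for the course's voting data.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('refresher').getOrCreate()

# Hypothetical sample rows standing in for the real voter data
voter_df = spark.createDataFrame(
    [('Michael Smith', 'Councilmember'), ('Ana Lopez', 'Mayor')],
    ['name', 'position'])

# Rows where name starts with "M"
voter_df.filter(voter_df.name.like('M%')).show()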

Common DataFrame transformations

  • Filter / Where
voter_df.filter(voter_df.date > '2019-01-01') # or voter_df.where(...)
    
  • Select
    voter_df.select(voter_df.name)
    
  • withColumn
voter_df.withColumn('year', F.year(voter_df.date))  # F is pyspark.sql.functions
    
  • drop
    voter_df.drop('unused_column')
    
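The four transformations above can be chained into a single statement. A minimal sketch, assuming voter_df has a DateType date column and a throwaway unused_column:

import pyspark.sql.functions as F

cleaned = (voter_df
    .drop('unused_column')                      # discard an unneeded column
    .filter(voter_df.date > '2019-01-01')       # keep rows after Jan 1, 2019
    .withColumn('year', F.year(voter_df.date))  # derive a new column
    .select('name', 'date', 'year'))            # keep only the needed columns
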

Filtering data

  • Remove nulls
    voter_df.filter(voter_df['name'].isNotNull())

  • Remove odd entries
    voter_df.filter(F.year(voter_df.date) > 1800)  # F is pyspark.sql.functions

  • Split data from combined sources
    voter_df.where(voter_df['_c0'].contains('VOTE'))

  • Negate with ~
    voter_df.where(~ voter_df._c1.isNull())
    
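A quick check of the negation pattern, assuming an active SparkSession named spark; the _c0 / _c1 names mimic Spark's default column names for a header-less CSV import:

df = spark.createDataFrame(
    [('VOTE cast', 'yes'), ('comment line', None)],
    ['_c0', '_c1'])

df.where(df['_c0'].contains('VOTE')).show()  # keep rows from the voting source
df.where(~ df._c1.isNull()).show()           # ~ negates: drop the null row
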

Column string transformations

  • Contained in pyspark.sql.functions
    import pyspark.sql.functions as F
    
  • Applied per column as a transformation
    voter_df.withColumn('upper', F.upper('name'))
    
  • Can create intermediary columns
    voter_df.withColumn('splits', F.split('name', ' '))
    
  • Can cast to other types
    voter_df.withColumn('year', voter_df['_c4'].cast(IntegerType()))  # requires: from pyspark.sql.types import IntegerType
    
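Tying these together in one runnable sketch; the _c4 column holding a year as a string is a hypothetical example:

import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

# Hypothetical frame: a name plus a year stored as string column _c4
df = spark.createDataFrame([('Jane Doe', '1990')], ['name', '_c4'])

df = df.withColumn('upper', F.upper('name'))               # 'Jane Doe' -> 'JANE DOE'
df = df.withColumn('splits', F.split('name', ' '))         # ['Jane', 'Doe']
df = df.withColumn('year', df['_c4'].cast(IntegerType()))  # '1990' -> 1990
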

ArrayType() column functions

Various utility functions / transformations exist to interact with ArrayType() columns:

F.size(<column>) - returns the length of an ArrayType() column

.getItem(<index>) - retrieves a specific item at the given index of a list column

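Continuing from the splits column created above (a sketch using the same hypothetical frame):

import pyspark.sql.functions as F

# splits holds an ArrayType value such as ['Jane', 'Doe']
df = df.withColumn('name_parts', F.size('splits'))           # 2
df = df.withColumn('first_name', df.splits.getItem(0))       # 'Jane'
df = df.withColumn('last_name', F.element_at('splits', -1))  # 'Doe' (1-based; negative counts from the end)
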

Let's practice!
