Introduction to PySpark DataFrames

Introduction to PySpark

Benjamin Schmidt

Data Engineer

About DataFrames

  • DataFrames: Tabular format (rows/columns)
  • Supports SQL-like operations
  • Comparable to a pandas DataFrame or a SQL table
  • Structured Data

DataFrames


Creating DataFrames from filestores

# Create a DataFrame from CSV
census_df = spark.read.csv('path/to/census.csv', header=True, inferSchema=True)
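
This assumes an active SparkSession bound to the name spark. A minimal sketch of creating one, assuming a local setup (the app name is arbitrary):

from pyspark.sql import SparkSession

# Assumed setup: a local SparkSession bound to the name `spark`
spark = SparkSession.builder.appName('census-example').getOrCreate()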

Printing the DataFrame

# Show the first 5 rows of the DataFrame
census_df.show(5)


+---+-------------+--------------+-----------------+------+
|age|education.num|marital.status|       occupation|income|
+---+-------------+--------------+-----------------+------+
| 90|            9|       Widowed|                ?| <=50K|
| 82|            9|       Widowed|  Exec-managerial| <=50K|
| 66|           10|       Widowed|                ?| <=50K|
| 54|            4|      Divorced|Machine-op-inspct| <=50K|
| 41|           10|     Separated|   Prof-specialty| <=50K|
+---+-------------+--------------+-----------------+------+
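
By default, .show() prints 20 rows and truncates long values; both behaviors can be adjusted through its standard parameters:

# n sets the number of rows; truncate=False keeps long values intact
census_df.show(n=10, truncate=False)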

Printing DataFrame Schema

# Show the schema
census_df.printSchema()

Output:

root
 |-- age: integer (nullable = true)
 |-- education.num: integer (nullable = true)
 |-- marital.status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- income: string (nullable = true)
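
inferSchema requires an extra pass over the data and can guess types wrong; a schema can also be declared up front. A minimal sketch, assuming the column names above:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Illustrative explicit schema matching the columns shown above;
# passing schema= replaces inferSchema=True
schema = StructType([
    StructField('age', IntegerType(), True),
    StructField('education.num', IntegerType(), True),
    StructField('marital.status', StringType(), True),
    StructField('occupation', StringType(), True),
    StructField('income', StringType(), True),
])
census_df = spark.read.csv('path/to/census.csv', header=True, schema=schema)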

Basic analytics on PySpark DataFrames

# .count() returns the total number of rows in the DataFrame
row_count = census_df.count()
print(f'Number of rows: {row_count}')

# .groupBy() allows SQL-like aggregations
census_df.groupBy('income').agg({'age': 'avg'}).show()

Other aggregate functions are:

  • sum()
  • min()
  • max()
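
A minimal sketch applying these through pyspark.sql.functions (the age column is used purely for illustration):

from pyspark.sql import functions as F

# Compute several aggregates in one pass
census_df.agg(
    F.sum('age').alias('total_age'),
    F.min('age').alias('min_age'),
    F.max('age').alias('max_age'),
).show()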

Key functions for PySpark analytics

  • .select(): Selects specific columns from the DataFrame
  • .filter(): Filters rows based on specific conditions
  • .groupBy(): Groups rows based on one or more columns
  • .agg(): Applies aggregate functions to grouped data
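
These compose naturally into a single chain, as the next example shows for .filter() and .select(); a sketch combining all four (column names taken from the census data above):

# Filter rows, keep two columns, then group and aggregate
census_df.filter(census_df['age'] > 50) \
    .select('age', 'income') \
    .groupBy('income') \
    .agg({'age': 'avg'}) \
    .show()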

Key Functions: An Example

# Using filter and select, we can narrow down our DataFrame
filtered_census_df = census_df.filter(census_df['age'] > 50).select('age', 'occupation')
filtered_census_df.show()

Output:

+---+-----------------+
|age|       occupation|
+---+-----------------+
| 90|                ?|
| 82|  Exec-managerial|
| 66|                ?|
| 54|Machine-op-inspct|
+---+-----------------+

Let's practice!
