Introduction to PySpark
Ben Schmidt
Data Engineer
.na.drop() removes rows that contain null values.
# Drop rows with any nulls
df_cleaned = df.na.drop()
# Filter out nulls in a specific column instead
from pyspark.sql.functions import col
df_cleaned = df.where(col("columnName").isNotNull())
.na.fill({"column": value}) replaces null values with a given value.
# Fill nulls in the age column with the value 0
df_filled = df.na.fill({"age": 0})
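To see these calls together, here is a minimal runnable sketch; the SparkSession setup, the sample rows, and the app name are illustrative assumptions, not part of the original example.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("null_handling_demo").getOrCreate()

# Hypothetical sample data with a null in the 'age' column
df = spark.createDataFrame([("Alice", 30), ("Bob", None)], ["name", "age"])

df.na.drop().show()                      # Drops Bob's row (it has a null)
df.where(col("age").isNotNull()).show()  # Same result via an explicit filter
df.na.fill({"age": 0}).show()            # Keeps Bob's row, age becomes 0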
.withColumn() creates a new column based on calculations or existing columns. Syntax: .withColumn("new_column_name", transformation)
# Create a new column 'age_plus_5'
df = df.withColumn("age_plus_5", df["age"] + 5)
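The same column can also be built with col() from pyspark.sql.functions, which scales better in longer transformation chains; the session setup and sample data below are assumed for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("with_column_demo").getOrCreate()
df = spark.createDataFrame([("Alice", 30)], ["name", "age"])

# Equivalent to df["age"] + 5, written with col()
df = df.withColumn("age_plus_5", col("age") + 5)
df.show()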
.withColumnRenamed() renames columns. Syntax: .withColumnRenamed("old_column_name", "new_column_name")
# Rename the 'age' column to 'years'
df = df.withColumnRenamed("age", "years")
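Rename calls can be chained to cover several columns, and .withColumnRenamed() is a no-op if the source column does not exist. A short sketch with assumed sample data:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rename_demo").getOrCreate()
df = spark.createDataFrame([(30, 50000)], ["age", "salary"])

# Chain renames; renaming a missing column raises no error
df = (df.withColumnRenamed("age", "years")
        .withColumnRenamed("salary", "income"))
df.show()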
.drop() removes columns that are no longer needed. Syntax: .drop("column_name")
# Drop the 'department' column
df = df.drop("department")
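.drop() also accepts several column names in one call. A minimal sketch, again with assumed sample data:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("drop_demo").getOrCreate()
df = spark.createDataFrame([(30, 50000, "IT")], ["age", "salary", "department"])

# Drop two columns in one call
df = df.drop("salary", "department")
df.show()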
.filter() selects rows that match a given condition.
# Filter rows where salary is greater than 50000
filtered_df = df.filter(df["salary"] > 50000)
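Conditions can be combined with & (and) and | (or), with parentheses around each side. A sketch assuming the salary/age/occupation columns from the table below:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("filter_demo").getOrCreate()
df = spark.createDataFrame(
    [(60000, 45, "Exec-managerial"), (40000, 35, "Prof-specialty")],
    ["salary", "age", "occupation"],
)

# Combined condition: high salary AND over 40
filtered_df = df.filter((col("salary") > 50000) & (col("age") > 40))
filtered_df.show()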
.groupBy() with aggregate functions (e.g. .sum(), .avg()) summarizes data.
# Group by department and calculate the average salary
grouped_df = df.groupBy("department").avg("salary")
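When a group needs more than one aggregate, .agg() with functions from pyspark.sql.functions is the usual pattern; the sample data below is assumed for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, sum as spark_sum

spark = SparkSession.builder.appName("groupby_demo").getOrCreate()
df = spark.createDataFrame(
    [("HR", 80000), ("IT", 60000), ("IT", 80000)],
    ["department", "salary"],
)

# Several aggregates in one pass over the data
grouped_df = df.groupBy("department").agg(
    avg("salary").alias("avg_salary"),
    spark_sum("salary").alias("total_salary"),
)
grouped_df.show()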
Filtering
+------+---+---------------+
|salary|age|     occupation|
+------+---+---------------+
| 60000| 45|Exec-managerial|
| 70000| 35| Prof-specialty|
+------+---+---------------+
GroupBy
+----------+-----------+
|department|avg(salary)|
+----------+-----------+
|        HR|    80000.0|
|        IT|    70000.0|
+----------+-----------+
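Putting the steps together, a sketch of a small end-to-end pipeline; the file name employees.csv and its columns are hypothetical:
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col

spark = SparkSession.builder.appName("pipeline_demo").getOrCreate()

# Hypothetical input file with salary, age, and department columns
df = spark.read.csv("employees.csv", header=True, inferSchema=True)

result = (
    df.na.fill({"salary": 0})                    # Replace missing salaries
      .withColumn("age_plus_5", col("age") + 5)  # Derived column
      .filter(col("salary") > 50000)             # Keep rows with high salaries
      .groupBy("department")                     # Summarize per department
      .agg(avg("salary").alias("avg_salary"))
)
result.show()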