RDD operations in PySpark

Big Data Fundamentals with PySpark

Upendra Devisetty

Science Analyst, CyVerse

Overview of PySpark operations

  • Transformations create new RDDs
  • Actions perform computation on the RDDs

RDD Transformations

  • Transformations are lazily evaluated: Spark records them but computes nothing until an action is called

  • Basic RDD Transformations

    • map(), filter(), flatMap(), and union()
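
Lazy evaluation can be illustrated without a cluster. The sketch below is a plain-Python analogy, not Spark itself: Python's built-in map() stands in for a transformation and list() stands in for an action. The names square, calls, and lazy_squares are made up for this illustration.

```python
# Plain-Python analogy for lazy evaluation (this is not Spark code).
calls = []  # records every time the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


data = [1, 2, 3, 4]

# "Transformation": builds a lazy pipeline; nothing is computed yet.
lazy_squares = map(square, data)
calls_before_action = len(calls)

# "Action": consuming the pipeline forces the computation.
result = list(lazy_squares)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)
```

In the same way, defining a chain of RDD transformations is cheap; the work happens only when an action asks for a result.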

map() Transformation

  • map() transformation applies a function to every element of the RDD and returns a new RDD of the results

RDD = sc.parallelize([1,2,3,4])
RDD_map = RDD.map(lambda x: x * x)

filter() Transformation

  • filter() transformation returns a new RDD containing only the elements that satisfy the condition

RDD = sc.parallelize([1,2,3,4])
RDD_filter = RDD.filter(lambda x: x > 2)

flatMap() Transformation

  • flatMap() transformation can return zero or more values for each element in the original RDD and flattens them into a single RDD

RDD = sc.parallelize(["hello world", "how are you"])
RDD_flatmap = RDD.flatMap(lambda x: x.split(" "))
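
The difference between map() and flatMap() is easiest to see side by side. The sketch below is a plain-Python analogy using the same two strings, not Spark code; the names lines, mapped, and flat_mapped are chosen for this illustration.

```python
from itertools import chain

lines = ["hello world", "how are you"]

# map-like: one output per input, so each line becomes its own list of words
mapped = [line.split(" ") for line in lines]

# flatMap-like: the per-line lists are flattened into a single sequence
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)       # two nested lists, one per input line
print(flat_mapped)  # five individual words
```

The map-like result has two elements (one list per line), while the flattened result has five (one per word).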

union() Transformation

inputRDD = sc.textFile("logs.txt")
errorRDD = inputRDD.filter(lambda x: "error" in x.split())
warningsRDD = inputRDD.filter(lambda x: "warnings" in x.split())
combinedRDD = errorRDD.union(warningsRDD)
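
Two details of this pattern are worth checking on small data. First, "error" in x.split() matches only the exact whitespace-separated token, so "ERROR:" does not match. Second, union() keeps duplicates, so a line that passes both filters appears twice. The sketch below reproduces both effects in plain Python (not Spark); the log lines are invented sample data, since the contents of logs.txt are not given.

```python
# Hypothetical sample lines standing in for the contents of logs.txt
log_lines = [
    "error disk full",
    "warnings low memory",
    "info started",
    "error and warnings together",
    "ERROR: uppercase with colon",
]

# Same predicates as the filters above: exact, case-sensitive token match
error_lines = [line for line in log_lines if "error" in line.split()]
warning_lines = [line for line in log_lines if "warnings" in line.split()]

# union-like: simple concatenation, duplicates are kept
combined = error_lines + warning_lines

# Order-preserving de-duplication, comparable in effect to calling distinct()
deduplicated = list(dict.fromkeys(combined))

print(len(combined), len(deduplicated))
```

If duplicates are unwanted in Spark, calling distinct() on the combined RDD removes them, at the cost of a shuffle.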

RDD Actions

  • Actions are operations that return a value after running a computation on the RDD

  • Basic RDD Actions

    • collect()

    • take(N)

    • first()

    • count()


collect() and take() Actions

  • collect() returns all the elements of the dataset as a list

  • take(N) returns a list with the first N elements of the dataset

RDD_map.collect()
[1, 4, 9, 16]
RDD_map.take(2)
[1, 4]

first() and count() Actions

  • first() returns the first element of the RDD

RDD_map.first()
1

  • count() returns the number of elements in the RDD

RDD_flatmap.count()
5
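
The return types of the four actions differ: collect() and take(N) give lists, first() gives a single element, and count() gives an integer. The sketch below mirrors that with plain-Python stand-ins (not Spark); squares plays the role of RDD_map, and the other names are chosen for this illustration.

```python
from itertools import islice

squares = [x * x for x in [1, 2, 3, 4]]  # stands in for RDD_map

collected = list(squares)             # like collect(): every element, as a list
first_two = list(islice(squares, 2))  # like take(2): a list of the first two
first_one = next(iter(squares))       # like first(): one element, not a list
total = len(squares)                  # like count(): an integer

print(collected, first_two, first_one, total)
```

Because collect() brings every element back to the driver, take(N) or first() is usually the safer way to peek at a large RDD.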

Let's practice RDD operations
