Big Data Fundamentals with PySpark
Upendra Devisetty
Science Analyst, CyVerse
Real-life datasets are usually key/value pairs
Each row is a key that maps to one or more values
A pair RDD is a special data structure for working with this kind of dataset
In a pair RDD, the key is the identifier and the value is the data
Two common ways to create pair RDDs: from a list of key/value tuples, or from a regular RDD
Either way, the data must first be in key/value form
# Create a pair RDD directly from a list of key/value tuples
my_tuple = [('Sam', 23), ('Mary', 34), ('Peter', 25)]
pairRDD_tuple = sc.parallelize(my_tuple)
# Create a pair RDD from a regular RDD by splitting each string into a (key, value) tuple
my_list = ['Sam 23', 'Mary 34', 'Peter 25']
regularRDD = sc.parallelize(my_list)
pairRDD_RDD = regularRDD.map(lambda s: (s.split(' ')[0], s.split(' ')[1]))
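Note that split() returns strings, so the values in pairRDD_RDD are strings like '23' rather than integers. A minimal sketch of casting them with mapValues(), reusing the variables above:
# Cast each value to an integer; mapValues() applies a function to the values only
pairRDD_int = pairRDD_RDD.mapValues(int)
pairRDD_int.collect()
[('Sam', 23), ('Mary', 34), ('Peter', 25)]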
All regular transformations work on pair RDDs
You have to pass functions that operate on key/value tuples rather than on individual elements
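For example, a filter() on a pair RDD receives the whole (key, value) tuple, so the function indexes into it explicitly (a minimal sketch reusing pairRDD_tuple from above):
# Keep only pairs whose value (age) is at least 25; x is a (key, value) tuple
pairRDD_tuple.filter(lambda x: x[1] >= 25).collect()
[('Mary', 34), ('Peter', 25)]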
Examples of pair RDD transformations
reduceByKey(func): Combine values with the same key
groupByKey(): Group values with the same key
sortByKey(): Return an RDD sorted by the key
join(): Join two pair RDDs based on their key
reduceByKey()
The reduceByKey() transformation combines values with the same key
It runs parallel operations for each key in the dataset
It is a transformation, not an action
regularRDD = sc.parallelize([("Messi", 23), ("Ronaldo", 34), ("Neymar", 22), ("Messi", 24)])
pairRDD_reducebykey = regularRDD.reduceByKey(lambda x, y: x + y)
pairRDD_reducebykey.collect()
[('Neymar', 22), ('Ronaldo', 34), ('Messi', 47)]
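The same pattern drives the classic word count: map each word to a (word, 1) pair, then sum the counts per key. A minimal sketch with a small hypothetical words list (the order of collect() results may vary):
words = sc.parallelize(['spark', 'rdd', 'spark'])
word_counts = words.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)
word_counts.collect()
[('spark', 2), ('rdd', 1)]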
sortByKey()
The sortByKey() operation orders a pair RDD by key
It returns an RDD sorted by key in ascending or descending order
pairRDD_reducebykey_rev = pairRDD_reducebykey.map(lambda x: (x[1], x[0]))
pairRDD_reducebykey_rev.sortByKey(ascending=False).collect()
[(47, 'Messi'), (34, 'Ronaldo'), (22, 'Neymar')]
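If the pair RDD already has the desired sort key, there is no need to reverse the tuples; sortByKey() defaults to ascending order (a minimal sketch reusing pairRDD_reducebykey from above):
pairRDD_reducebykey.sortByKey().collect()
[('Messi', 47), ('Neymar', 22), ('Ronaldo', 34)]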
groupByKey()
The groupByKey() transformation groups all the values with the same key in the pair RDD
airports = [("US", "JFK"), ("UK", "LHR"), ("FR", "CDG"), ("US", "SFO")]
regularRDD = sc.parallelize(airports)
pairRDD_group = regularRDD.groupByKey().collect()
for cont, air in pairRDD_group:
    print(cont, list(air))
FR ['CDG']
US ['JFK', 'SFO']
UK ['LHR']
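To get plain lists instead of iterating over the grouped iterables, mapValues(list) can materialize them (a minimal sketch reusing regularRDD from above; output order may vary):
regularRDD.groupByKey().mapValues(list).collect()
[('FR', ['CDG']), ('US', ['JFK', 'SFO']), ('UK', ['LHR'])]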
join()
The join() transformation joins two pair RDDs based on their key
RDD1 = sc.parallelize([("Messi", 34), ("Ronaldo", 32), ("Neymar", 24)])
RDD2 = sc.parallelize([("Ronaldo", 80),("Neymar", 120),("Messi", 100)])
RDD1.join(RDD2).collect()
[('Neymar', (24, 120)), ('Ronaldo', (32, 80)), ('Messi', (34, 100))]
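join() drops keys that appear in only one of the RDDs; leftOuterJoin() keeps every key from the left RDD and fills the missing side with None (a minimal sketch with a hypothetical RDD3; output order may vary):
RDD3 = sc.parallelize([("Messi", 100), ("Mbappe", 150)])
RDD1.leftOuterJoin(RDD3).collect()
[('Neymar', (24, None)), ('Ronaldo', (32, None)), ('Messi', (34, 100))]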