Big Data Fundamentals with PySpark
Upendra Devisetty
Science Analyst, CyVerse
Real-life datasets are usually key/value pairs
Each row is a key that maps to one or more values
A pair RDD is a special data structure for working with this kind of dataset
In a pair RDD, the key is the identifier and the value is the data
Two common ways to create pair RDDs: from a list of key/value tuples, or from a regular RDD
Either way, the data must first be in key/value form
# Create a pair RDD directly from a list of key/value tuples
my_tuple = [('Sam', 23), ('Mary', 34), ('Peter', 25)]
pairRDD_tuple = sc.parallelize(my_tuple)
# Create a pair RDD from a regular RDD by splitting each string into a (key, value) tuple
my_list = ['Sam 23', 'Mary 34', 'Peter 25']
regularRDD = sc.parallelize(my_list)
pairRDD_RDD = regularRDD.map(lambda s: (s.split(' ')[0], s.split(' ')[1]))
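Note that split() returns strings, so the values in pairRDD_RDD are strings like '23' rather than integers. A minimal sketch of casting them with mapValues(), reusing the variables above:
# Cast each value to an integer; mapValues() applies a function to the values only
pairRDD_int = pairRDD_RDD.mapValues(int)
pairRDD_int.collect()
[('Sam', 23), ('Mary', 34), ('Peter', 25)]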
All regular transformations work on pair RDDs
You have to pass functions that operate on key/value tuples rather than on individual elements
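For example, a filter() on a pair RDD receives the whole (key, value) tuple, so the function indexes into it explicitly (a minimal sketch reusing pairRDD_tuple from above):
# Keep only pairs whose value (age) is at least 25; x is a (key, value) tuple
pairRDD_tuple.filter(lambda x: x[1] >= 25).collect()
[('Mary', 34), ('Peter', 25)]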
Examples of pair RDD transformations
reduceByKey(func): Combine values with the same key
groupByKey(): Group values with the same key
sortByKey(): Return an RDD sorted by the key
join(): Join two pair RDDs based on their key
reduceByKey()
The reduceByKey() transformation combines values with the same key
It runs parallel operations for each key in the dataset
It is a transformation, not an action
regularRDD = sc.parallelize([("Messi", 23), ("Ronaldo", 34), ("Neymar", 22), ("Messi", 24)])
pairRDD_reducebykey = regularRDD.reduceByKey(lambda x, y: x + y)
pairRDD_reducebykey.collect()
[('Neymar', 22), ('Ronaldo', 34), ('Messi', 47)]
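The same pattern drives the classic word count: map each word to a (word, 1) pair, then sum the counts per key. A minimal sketch with a small hypothetical words list (the order of collect() results may vary):
words = sc.parallelize(['spark', 'rdd', 'spark'])
word_counts = words.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)
word_counts.collect()
[('spark', 2), ('rdd', 1)]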
sortByKey()
The sortByKey() operation orders a pair RDD by key
It returns an RDD sorted by key in ascending or descending order
pairRDD_reducebykey_rev = pairRDD_reducebykey.map(lambda x: (x[1], x[0]))
pairRDD_reducebykey_rev.sortByKey(ascending=False).collect()
[(47, 'Messi'), (34, 'Ronaldo'), (22, 'Neymar')]
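If the pair RDD already has the desired sort key, there is no need to reverse the tuples; sortByKey() defaults to ascending order (a minimal sketch reusing pairRDD_reducebykey from above):
pairRDD_reducebykey.sortByKey().collect()
[('Messi', 47), ('Neymar', 22), ('Ronaldo', 34)]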
groupByKey()
The groupByKey() transformation groups all the values with the same key in the pair RDD
airports = [("US", "JFK"), ("UK", "LHR"), ("FR", "CDG"), ("US", "SFO")]
regularRDD = sc.parallelize(airports)
pairRDD_group = regularRDD.groupByKey().collect()
for cont, air in pairRDD_group:
    print(cont, list(air))
FR ['CDG']
US ['JFK', 'SFO']
UK ['LHR']
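To get plain lists instead of iterating over the grouped iterables, mapValues(list) can materialize them (a minimal sketch reusing regularRDD from above; output order may vary):
regularRDD.groupByKey().mapValues(list).collect()
[('FR', ['CDG']), ('US', ['JFK', 'SFO']), ('UK', ['LHR'])]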
join()
The join() transformation joins two pair RDDs based on their key
RDD1 = sc.parallelize([("Messi", 34), ("Ronaldo", 32), ("Neymar", 24)])
RDD2 = sc.parallelize([("Ronaldo", 80),("Neymar", 120),("Messi", 100)])
RDD1.join(RDD2).collect()
[('Neymar', (24, 120)), ('Ronaldo', (32, 80)), ('Messi', (34, 100))]
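join() drops keys that appear in only one of the RDDs; leftOuterJoin() keeps every key from the left RDD and fills the missing side with None (a minimal sketch with a hypothetical RDD3; output order may vary):
RDD3 = sc.parallelize([("Messi", 100), ("Mbappe", 150)])
RDD1.leftOuterJoin(RDD3).collect()
[('Neymar', (24, None)), ('Ronaldo', (32, None)), ('Messi', (34, 100))]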