Big Data Fundamentals with PySpark
Upendra Devisetty
Science Analyst, CyVerse
reduce(func) action is used for aggregating the elements of a regular RDD
The function should be commutative (changing the order of the operands does not change the result) and associative (changing the grouping of the operands does not change the result)
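Why these properties matter can be sketched in plain Python with functools.reduce (an analogy only, not Spark's distributed implementation): addition satisfies both properties, while subtraction satisfies neither, so its result depends on the order in which values are combined.

```python
from functools import reduce

nums = [1, 3, 4, 6]

# Addition is commutative and associative: the result is order-independent.
print(reduce(lambda x, y: x + y, nums))                  # 14
print(reduce(lambda x, y: x + y, list(reversed(nums))))  # 14

# Subtraction is neither: the result changes with the combination order,
# which is why such a function is unsafe for a distributed reduce.
print(reduce(lambda x, y: x - y, nums))                  # -12
print(reduce(lambda x, y: x - y, list(reversed(nums))))  # -2
```

In a cluster, partitions are combined in an order the driver does not control, so a non-commutative or non-associative function can return different results on different runs.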
An example of reduce()
action in PySpark
x = [1,3,4,6]
RDD = sc.parallelize(x)
RDD.reduce(lambda x, y : x + y)
14
saveAsTextFile()
action saves an RDD into a text file inside a directory, with each partition as a separate file
RDD.saveAsTextFile("tempFile")
coalesce()
method can be used to save an RDD as a single text file
RDD.coalesce(1).saveAsTextFile("tempFile")
RDD actions available for PySpark pair RDDs
Pair RDD actions leverage the key-value data
A few examples of pair RDD actions include
countByKey()
collectAsMap()
countByKey()
action is only available for pair RDDs of type (K, V)
countByKey()
action counts the number of elements for each key
Example of countByKey()
on a simple list
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
for key, val in rdd.countByKey().items():
    print(key, val)
a 2
b 1
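The result of countByKey() behaves like a dictionary mapping each key to its count. The same counting can be sketched in plain Python with collections.Counter (an analogy, not Spark's implementation):

```python
from collections import Counter

# Pure-Python sketch (not PySpark): count occurrences of each key,
# mirroring what countByKey() computes over the pair RDD above.
pairs = [("a", 1), ("b", 1), ("a", 1)]
counts = Counter(k for k, _ in pairs)
print(dict(counts))  # {'a': 2, 'b': 1}
```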
collectAsMap()
action returns the key-value pairs in the RDD as a dictionary
Example of collectAsMap()
on a simple list of tuples
sc.parallelize([(1, 2), (3, 4)]).collectAsMap()
{1: 2, 3: 4}
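Because the result is a dictionary, duplicate keys cannot be represented: only one value per key survives. The caveat can be sketched with a plain Python dict built from pairs (in plain Python the last value wins; which value Spark keeps is not guaranteed by this sketch):

```python
# Pure-Python sketch (not PySpark): a dict built from key-value pairs keeps
# only one value per duplicate key -- the same caveat applies to collectAsMap().
pairs = [(1, 2), (3, 4), (1, 99)]
print(dict(pairs))  # {1: 99, 3: 4}
```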