More actions

Big Data Fundamentals with PySpark

Upendra Devisetty

Science Analyst, CyVerse

reduce() action

  • reduce(func) action is used for aggregating the elements of a regular RDD

  • The function should be commutative (changing the order of the operands does not change the result) and associative (changing the grouping of the operands does not change the result)

  • An example of reduce() action in PySpark

x = [1,3,4,6]
RDD = sc.parallelize(x)
RDD.reduce(lambda x, y : x + y)
14
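Since the Spark example above needs a SparkContext, the commutative/associative requirement can be illustrated locally with Python's built-in functools.reduce (a sketch, not Spark itself; the "partitions" here are just a hand-made grouping):

```python
from functools import reduce
from operator import add

# Spark applies the function within each partition and then across
# partition results, so the order of combination is not fixed.
# Addition is commutative and associative: any order gives 14.
nums = [1, 3, 4, 6]
assert reduce(add, nums) == 14

# Subtraction is neither, so grouping changes the answer -- a local
# simulation of why Spark's reduce() requires both properties:
left_to_right = reduce(lambda a, b: a - b, nums)  # ((1-3)-4)-6 = -12
partitioned = (1 - 3) - (4 - 6)                   # two "partitions" combined = 0
assert left_to_right != partitioned
```

With a commutative and associative function, every partitioning of the data yields the same result, which is exactly what lets Spark aggregate partitions in parallel.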

saveAsTextFile() action

  • saveAsTextFile() action saves RDD into a text file inside a directory with each partition as a separate file
RDD.saveAsTextFile("tempFile")
  • coalesce(1) reduces the RDD to a single partition, so the output directory contains just one part file
RDD.coalesce(1).saveAsTextFile("tempFile")
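A local sketch of the directory layout saveAsTextFile() produces, using plain file I/O rather than Spark (assumption: an RDD with two partitions; Spark writes one part-0000N file per partition inside the target directory):

```python
import os
import tempfile

# Hand-made "partitions" standing in for an RDD's partitions:
partitions = [["1", "3"], ["4", "6"]]

# saveAsTextFile("tempFile") creates a directory, not a single file,
# with one part-0000N file per partition:
out_dir = tempfile.mkdtemp(prefix="tempFile_")
for i, part in enumerate(partitions):
    with open(os.path.join(out_dir, f"part-{i:05d}"), "w") as f:
        f.write("\n".join(part) + "\n")

# coalesce(1) before saving would leave a single part-00000 file instead.
print(sorted(os.listdir(out_dir)))  # ['part-00000', 'part-00001']
```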

Action Operations on pair RDDs

  • RDD actions available for PySpark pair RDDs

  • Pair RDD actions leverage the key-value data

  • A few examples of pair RDD actions include

    • countByKey()

    • collectAsMap()


countByKey() action

  • countByKey() is only available for RDDs of type (K, V), i.e. pair RDDs

  • countByKey() action counts the number of elements for each key

  • Example of countByKey() on a simple list

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
for key, val in rdd.countByKey().items():
  print(key, val)
a 2
b 1
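countByKey() collects its result to the driver as an ordinary dict-like object, so the counting logic can be reproduced locally without Spark (a sketch of the semantics, not the distributed implementation):

```python
from collections import defaultdict

# The same key-value pairs as the slide's example:
pairs = [("a", 1), ("b", 1), ("a", 1)]

# countByKey() counts elements per key, ignoring the values:
counts = defaultdict(int)
for key, _ in pairs:
    counts[key] += 1

assert dict(counts) == {"a": 2, "b": 1}
```

Because the whole result lands on the driver, countByKey() should only be used when the number of distinct keys is small enough to fit in driver memory.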

collectAsMap() action

  • collectAsMap() returns the key-value pairs in the RDD as a dictionary on the driver

  • Example of collectAsMap() on a simple list of tuples

sc.parallelize([(1, 2), (3, 4)]).collectAsMap()
{1: 2, 3: 4}
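One caveat worth noting: because the result is a plain dictionary, duplicate keys cannot coexist, so only one value per key survives (in Spark, which one depends on the order the pairs are collected). A local dict-on-pairs analogue shows the effect:

```python
# collectAsMap() builds a dict, so repeated keys are collapsed --
# the same behavior as calling dict() on a list of pairs:
pairs = [(1, 2), (3, 4), (1, 99)]
as_map = dict(pairs)

# Key 1 appears twice; only one of its values is kept:
assert as_map == {1: 99, 3: 4}
```

Like countByKey(), collectAsMap() brings the entire result to the driver, so it is only appropriate when the data is small.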

Let's practice
