PySpark: Spark with Python

Big Data Fundamentals with PySpark

Upendra Devisetty

Science Analyst, CyVerse

Overview of PySpark

Interactive environment for running Spark jobs
Useful for fast prototyping
Spark shells let you interact with data on disk or in memory
Three Spark shells:
- Spark-shell for Scala
- PySpark-shell for Python
- SparkR for R

The PySpark shell is the Python-based command-line tool
It lets data scientists interact with Spark data structures
The PySpark shell can connect to a cluster

¹ https://www.datacamp.com/cheat-sheet/pyspark-cheat-sheet-spark-in-python

sc.version

2.3.1

sc.pythonVer

3.6

Master: the URL of the cluster, or the string "local" to run SparkContext in local mode

sc.master

local[*]
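
The value returned by sc.master follows a small set of documented forms: "local" (one worker thread), "local[N]" (N worker threads), "local[*]" (one worker thread per logical core), and "spark://HOST:PORT" for a standalone cluster. As a rough illustration, the sketch below turns such a string into a short description. The helper name describe_master is hypothetical and not part of PySpark, and it deliberately ignores other cluster managers.

```python
import re


def describe_master(master: str) -> str:
    """Describe a Spark master string (hypothetical helper, not a PySpark API)."""
    if master == "local":
        return "local mode, 1 worker thread"

    # Matches local[4] or local[*]; other variants fall through to the default.
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match:
        threads = match.group(1)
        if threads == "*":
            return "local mode, one worker thread per logical core"
        return f"local mode, {threads} worker threads"

    if master.startswith("spark://"):
        return f"standalone cluster at {master[len('spark://'):]}"

    # Assumption: anything else is treated as a manager this sketch does not cover.
    return f"other cluster manager or unrecognized value: {master}"


if __name__ == "__main__":
    for value in ("local", "local[*]", "local[4]", "spark://host:7077"):
        print(value, "->", describe_master(value))
```

In the shell, the same helper could be applied to the live value with describe_master(sc.master).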


rdd = sc.parallelize([1,2,3,4,5])

rdd2 = sc.textFile("test.txt")
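
The two calls above only define RDDs; nothing is computed until an action runs. The sketch below shows one way to try both outside the shell, where sc is not created automatically. It assumes PySpark and a Java runtime are installed locally. The function name rdd_demo, the app name, and the temporary file contents are all made up for the example.

```python
import os
import tempfile


def rdd_demo():
    """Build one RDD from a list and one from a text file, then run an action on each."""
    # Imported here so the function can be defined even where PySpark is absent.
    from pyspark import SparkConf, SparkContext

    # The PySpark shell provides `sc`; a standalone script has to create it.
    # The app name "rdd-demo" is an arbitrary choice for this sketch.
    conf = SparkConf().setMaster("local[*]").setAppName("rdd-demo")
    sc = SparkContext.getOrCreate(conf)

    # parallelize() distributes an existing Python collection.
    numbers = sc.parallelize([1, 2, 3, 4, 5])
    total = numbers.sum()  # action: triggers the computation

    # textFile() yields an RDD with one element per line of the file.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.txt")
        with open(path, "w") as handle:
            handle.write("first line\nsecond line\n")
        line_count = sc.textFile(path).count()  # action: reads the file

    sc.stop()
    return total, line_count


if __name__ == "__main__":
    print(rdd_demo())
```

With a working local Spark installation, rdd_demo() should return the sum of the list and the number of lines written to the temporary file.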

