Big Data Fundamentals with PySpark
Upendra Devisetty
Science Analyst, CyVerse
Volume, Variety and Velocity
Volume: Size of the data
Variety: Different sources and formats
Velocity: Speed at which data is generated and processed
Clustered computing: Pooling the resources of multiple machines
Parallel computing: Simultaneous computation on a single machine (see the sketch after this list)
Distributed computing: Collection of nodes (networked computers) that run in parallel
Batch processing: Breaking a job into small pieces and running them on individual machines
Real-time processing: Immediate processing of data as it arrives
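To make these terms concrete, here is a minimal PySpark sketch that splits a small batch job into partitions and processes them in parallel (the app name and numbers are illustrative; a local PySpark installation is assumed):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")          # use all cores of the local machine
         .appName("parallel-demo")    # illustrative app name
         .getOrCreate())
sc = spark.sparkContext

# Split one million numbers across 4 partitions; partitions run in parallel
rdd = sc.parallelize(range(1_000_000), numSlices=4)
print(rdd.map(lambda x: x * 2).sum())  # a small batch job: transform, then aggregate

spark.stop()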
Hadoop/MapReduce: Scalable and fault-tolerant framework written in Java (see the word-count sketch after this comparison)
Open source
Batch processing
Apache Spark: General-purpose, lightning-fast cluster computing system
Open source
Both batch and real-time data processing
Note: Apache Spark is now generally preferred over Hadoop/MapReduce
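To give the MapReduce idea some shape, below is a hedged word-count sketch written in PySpark, following the classic map/reduce pattern that Hadoop runs as a batch job (the input file name is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount-demo").getOrCreate()
sc = spark.sparkContext

# Map phase: emit (word, 1) for every word; Reduce phase: sum the counts per word
counts = (sc.textFile("input.txt")               # hypothetical input file
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.take(5))
spark.stop()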
Distributed cluster computing framework
Efficient in-memory computations for large data sets
Lightning-fast data processing
Provides support for Java, Scala, Python, R and SQL
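As an illustration of the in-memory point above, this sketch caches a DataFrame so that repeated actions reuse data held in memory instead of re-reading it from disk (the file name and the status column are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# "logs.csv" and the "status" column are hypothetical
df = spark.read.csv("logs.csv", header=True, inferSchema=True)
df.cache()                                   # keep the DataFrame in memory after first use

df.count()                                   # first action: reads from disk, fills the cache
df.filter(df["status"] == "500").count()     # later actions reuse the in-memory data

spark.stop()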
Local mode: A single machine, such as your laptop
Cluster mode: A set of predefined machines
Workflow: Develop and test locally, then move to a cluster
No code change is necessary when moving from local to cluster mode (see the sketch below)
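A minimal sketch of that workflow, assuming PySpark: only the master URL changes between local and cluster execution (the spark:// URL is a placeholder), so the application code itself stays the same:

from pyspark.sql import SparkSession

# Local mode: run on all cores of one machine (e.g. your laptop)
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

# Cluster mode: the same application code, pointed at a cluster manager instead
# (placeholder master URL; in practice the master is usually supplied via
#  spark-submit, so the code itself needs no change)
# spark = SparkSession.builder.master("spark://host:7077").appName("demo").getOrCreate()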