Machine Learning with PySpark
Andrew Collier
Data Scientist, Fathom Data
DataFrame for tabular data.
Selected methods:
  count()
  show()
  printSchema()
Selected attributes:
  dtypes
The first few lines from the 'cars.csv' file.
mfr,mod,org,type,cyl,size,weight,len,rpm,cons
Mazda,RX-7,non-USA,Sporty,NA,1.3,2895,169,6500,9.41
Nissan,Maxima,non-USA,Midsize,6,3,3200,188,5200,9.05
Chevrolet,Cavalier,USA,Compact,4,2.2,2490,182,5200,6.53
Subaru,Legacy,non-USA,Compact,4,2.2,3085,179,5600,7.84
Ford,Escort,USA,Small,4,1.8,2530,171,6500,7.84

The .csv() method reads a CSV file and returns a DataFrame.
cars = spark.read.csv('cars.csv', header=True)
Optional arguments:
  header: is the first row a header? (default: False)
  sep: field separator (default: a comma ',')
  schema: explicit column data types
  inferSchema: deduce column data types from data?
  nullValue: placeholder for missing data
The first five records from the DataFrame.
cars.show(5)
+---------+--------+-------+-------+---+----+------+---+----+----+
|      mfr|     mod|    org|   type|cyl|size|weight|len| rpm|cons|
+---------+--------+-------+-------+---+----+------+---+----+----+
|    Mazda|    RX-7|non-USA| Sporty| NA| 1.3|  2895|169|6500|9.41|
|   Nissan|  Maxima|non-USA|Midsize|  6|   3|  3200|188|5200|9.05|
|Chevrolet|Cavalier|    USA|Compact|  4| 2.2|  2490|182|5200|6.53|
|   Subaru|  Legacy|non-USA|Compact|  4| 2.2|  3085|179|5600|7.84|
|     Ford|  Escort|    USA|  Small|  4| 1.8|  2530|171|6500|7.84|
+---------+--------+-------+-------+---+----+------+---+----+----+
cars.printSchema()
root
 |-- mfr: string (nullable = true)
 |-- mod: string (nullable = true)
 |-- org: string (nullable = true)
 |-- type: string (nullable = true)
 |-- cyl: string (nullable = true)
 |-- size: string (nullable = true)
 |-- weight: string (nullable = true)
 |-- len: string (nullable = true)
 |-- rpm: string (nullable = true)
 |-- cons: string (nullable = true)
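Without further arguments every column is read as a string. Infer column types from the data using the inferSchema argument.
cars = spark.read.csv("cars.csv", header=True, inferSchema=True)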
cars.dtypes
 [('mfr', 'string'),
 ('mod', 'string'),
 ('org', 'string'),
 ('type', 'string'),
 ('cyl', 'string'),
 ('size', 'double'),
 ('weight', 'int'),
 ('len', 'int'),
 ('rpm', 'int'),
 ('cons', 'double')]
The cyl column is still a string because its missing values are coded as 'NA'. Handle missing data using the nullValue argument.
cars = spark.read.csv("cars.csv", header=True, inferSchema=True, nullValue='NA')
The nullValue argument is case sensitive.
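Specify column names and types explicitly using the schema argument.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([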
    StructField("maker", StringType()),
    StructField("model", StringType()),
    StructField("origin", StringType()),
    StructField("type", StringType()),
    StructField("cyl", IntegerType()),
    StructField("size", DoubleType()),
    StructField("weight", IntegerType()),
    StructField("length", DoubleType()),
    StructField("rpm", IntegerType()),
    StructField("consumption", DoubleType())
])
cars = spark.read.csv("cars.csv", header=True, schema=schema, nullValue='NA')
+----------+-------------+-------+-------+----+----+------+------+----+-----------+
|maker     |model        |origin |type   |cyl |size|weight|length|rpm |consumption|
+----------+-------------+-------+-------+----+----+------+------+----+-----------+
|Mazda     |RX-7         |non-USA|Sporty |null|1.3 |2895  |169.0 |6500|9.41       |
|Nissan    |Maxima       |non-USA|Midsize|6   |3.0 |3200  |188.0 |5200|9.05       |
|Chevrolet |Cavalier     |USA    |Compact|4   |2.2 |2490  |182.0 |5200|6.53       |
|Subaru    |Legacy       |non-USA|Compact|4   |2.2 |3085  |179.0 |5600|7.84       |
|Ford      |Escort       |USA    |Small  |4   |1.8 |2530  |171.0 |6500|7.84       |
|Mercury   |Capri        |USA    |Sporty |4   |1.6 |2450  |166.0 |5750|9.05       |
|Oldsmobile|Cutlass Ciera|USA    |Midsize|4   |2.2 |2890  |190.0 |5200|7.59       |
|Saab      |900          |non-USA|Compact|4   |2.1 |2775  |184.0 |6000|9.05       |
|Dodge     |Caravan      |USA    |Van    |6   |3.0 |3705  |175.0 |5000|11.2       |
+----------+-------------+-------+-------+----+----+------+------+----+-----------+