Machine Learning with PySpark
Andrew Collier
Data Scientist, Fathom Data
A DataFrame holds tabular data.
Selected methods: count(), show(), printSchema()
Selected attributes: dtypes
The first few lines from the 'cars.csv' file.
mfr,mod,org,type,cyl,size,weight,len,rpm,cons
Mazda,RX-7,non-USA,Sporty,NA,1.3,2895,169,6500,9.41
Nissan,Maxima,non-USA,Midsize,6,3,3200,188,5200,9.05
Chevrolet,Cavalier,USA,Compact,4,2.2,2490,182,5200,6.53
Subaru,Legacy,non-USA,Compact,4,2.2,3085,179,5600,7.84
Ford,Escort,USA,Small,4,1.8,2530,171,6500,7.84
The .csv() method reads a CSV file and returns a DataFrame.
cars = spark.read.csv('cars.csv', header=True)
Optional arguments:
header: is the first row a header? (default: False)
sep: field separator (default: a comma ',')
schema: explicit column data types
inferSchema: deduce column data types from data?
nullValue: placeholder for missing data
The first five records from the DataFrame.
cars.show(5)
+---------+--------+-------+-------+---+----+------+---+----+----+
|      mfr|     mod|    org|   type|cyl|size|weight|len| rpm|cons|
+---------+--------+-------+-------+---+----+------+---+----+----+
|    Mazda|    RX-7|non-USA| Sporty| NA| 1.3|  2895|169|6500|9.41|
|   Nissan|  Maxima|non-USA|Midsize|  6|   3|  3200|188|5200|9.05|
|Chevrolet|Cavalier|    USA|Compact|  4| 2.2|  2490|182|5200|6.53|
|   Subaru|  Legacy|non-USA|Compact|  4| 2.2|  3085|179|5600|7.84|
|     Ford|  Escort|    USA|  Small|  4| 1.8|  2530|171|6500|7.84|
+---------+--------+-------+-------+---+----+------+---+----+----+
cars.printSchema()
root
|-- mfr: string (nullable = true)
|-- mod: string (nullable = true)
|-- org: string (nullable = true)
|-- type: string (nullable = true)
|-- cyl: string (nullable = true)
|-- size: string (nullable = true)
|-- weight: string (nullable = true)
|-- len: string (nullable = true)
|-- rpm: string (nullable = true)
|-- cons: string (nullable = true)
cars = spark.read.csv("cars.csv", header=True, inferSchema=True)
cars.dtypes
[('mfr', 'string'),
('mod', 'string'),
('org', 'string'),
('type', 'string'),
('cyl', 'string'),
('size', 'double'),
('weight', 'int'),
('len', 'int'),
('rpm', 'int'),
('cons', 'double')]
The cyl column is still read as a string because of the 'NA' entry. Handle missing data using the nullValue argument.
cars = spark.read.csv("cars.csv", header=True, inferSchema=True, nullValue='NA')
The nullValue argument is case sensitive.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Explicit schema: one StructField (name, type) per column, in file order.
schema = StructType([
    StructField("maker", StringType()),
    StructField("model", StringType()),
    StructField("origin", StringType()),
    StructField("type", StringType()),
    StructField("cyl", IntegerType()),
    StructField("size", DoubleType()),
    StructField("weight", IntegerType()),
    StructField("length", DoubleType()),
    StructField("rpm", IntegerType()),
    StructField("consumption", DoubleType())
])
# The names in the schema replace those in the file's header row.
cars = spark.read.csv("cars.csv", header=True, schema=schema, nullValue='NA')
+----------+-------------+-------+-------+----+----+------+------+----+-----------+
|maker     |model        |origin |type   |cyl |size|weight|length|rpm |consumption|
+----------+-------------+-------+-------+----+----+------+------+----+-----------+
|Mazda     |RX-7         |non-USA|Sporty |null|1.3 |2895  |169.0 |6500|9.41       |
|Nissan    |Maxima       |non-USA|Midsize|6   |3.0 |3200  |188.0 |5200|9.05       |
|Chevrolet |Cavalier     |USA    |Compact|4   |2.2 |2490  |182.0 |5200|6.53       |
|Subaru    |Legacy       |non-USA|Compact|4   |2.2 |3085  |179.0 |5600|7.84       |
|Ford      |Escort       |USA    |Small  |4   |1.8 |2530  |171.0 |6500|7.84       |
|Mercury   |Capri        |USA    |Sporty |4   |1.6 |2450  |166.0 |5750|9.05       |
|Oldsmobile|Cutlass Ciera|USA    |Midsize|4   |2.2 |2890  |190.0 |5200|7.59       |
|Saab      |900          |non-USA|Compact|4   |2.1 |2775  |184.0 |6000|9.05       |
|Dodge     |Caravan      |USA    |Van    |6   |3.0 |3705  |175.0 |5000|11.2       |
+----------+-------------+-------+-------+----+----+------+------+----+-----------+