Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
Sample file (columns: width, height, image):
# This is a comment
200 300 affenpinscher;0
600 450 Collie;307 Collie;101
600 449 Japanese_spaniel;23
Example rows:
02111277 n02111277_3206 500 375 Newfoundland,110,73,416,298
02108422 n02108422_4375 500 375 bull_mastiff,101,90,214,356 \
bull_mastiff,282,74,416,370
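Before choosing parser options for rows like these, it can help to load the file as plain text and inspect a few lines. This is a minimal sketch, assuming the annotation file is stored as 'annotations.csv.gz' (an illustrative name):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each row becomes a single string in a column named 'value'
raw_df = spark.read.text('annotations.csv.gz')
raw_df.show(5, truncate=False)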
Spark's CSV parser:
Can skip comment lines with an optional argument:
df1 = spark.read.csv('datafile.csv.gz', comment='#')
Can use the first row as column names with the header argument:
df1 = spark.read.csv('datafile.csv.gz', header='True')
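The arguments can be combined in a single call. A sketch assuming a file like the sample above, with a '#' comment line and a header row ('datafile.csv.gz' remains the illustrative filename):

df1 = spark.read.csv('datafile.csv.gz',
                     comment='#',    # skip lines beginning with '#'
                     header='True')  # use the first remaining row as column names
print(df1.columns)                   # column names now come from the header row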
Spark will:
Automatically create DataFrame columns based on the sep argument:
df1 = spark.read.csv('datafile.csv.gz', sep=',')
The separator defaults to a comma (,)
Still parse successfully if sep does not appear in the string:
df1 = spark.read.csv('datafile.csv.gz', sep='*')
In that case each row is stored in a single column, named _c0 by default
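A common follow-up technique (a sketch, not shown above) is to pick a separator that never occurs on purpose, load each full row into _c0, and then split it manually. The column names below are assumptions based on the example rows:

from pyspark.sql.functions import split

# Load every row into a single column, _c0, by using a separator that never appears
df1 = spark.read.csv('datafile.csv.gz', sep='*')

# Break each row apart on spaces (assumed layout: folder, filename, width, height, dog entries)
df1 = df1.withColumn('parts', split(df1['_c0'], ' '))
df1 = df1.withColumn('folder', df1['parts'].getItem(0))
df1 = df1.withColumn('filename', df1['parts'].getItem(1))
df1.show(5, truncate=False)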