Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
width, height, image
# This is a comment
200 300 affenpinscher;0
600 450 Collie;307 Collie;101
600 449 Japanese_spaniel;23
Example rows:
02111277 n02111277_3206 500 375 Newfoundland,110,73,416,298
02108422 n02108422_4375 500 375 bull_mastiff,101,90,214,356 \
bull_mastiff,282,74,416,370
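Spark's CSV parser:
Skips comment lines when given the optional comment argument:
df1 = spark.read.csv('datafile.csv.gz', comment='#')
Treats the first row as a header when the header argument is set:
df1 = spark.read.csv('datafile.csv.gz', header='True')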
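Spark will:
Create DataFrame columns automatically based on the sep argument:
df1 = spark.read.csv('datafile.csv.gz', sep=',')
Default to using , as the separator.
Still parse the file when sep is not in the string:
df1 = spark.read.csv('datafile.csv.gz', sep='*')
Store the data in a single column named _c0 by default.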