Data Processing in Shell
Susan Sun
Data Person
We can create a subset of the original data file by:
csvcut
: filters data using column name or position
csvgrep
: filters data by row value through exact match, pattern matching, or even regex
csvcut
: filters and truncates CSV files by column name or column position
Documentation:
csvcut -h
usage: csvcut [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b]
[-p ESCAPECHAR] [-z FIELD_SIZE_LIMIT] [-e ENCODING] [-S] [-H]
[-K SKIP_LINES] [-v] [-l] [--zero] [-V] [-n] [-c COLUMNS]
Use --names
or -n
option to print all column names in Spotify_MusicAttributes.csv
.
csvcut -n Spotify_MusicAttributes.csv
1: track_id
2: danceability
3: duration_ms
1: track_id
2: danceability
3: duration_ms
Returns the first column in the data, by position:
csvcut -c 1 Spotify_MusicAttributes.csv
track_id
118GQ70Sp6pMqn6w1oKuki
6S7cr72a7a8RVAXzDCRj6m
1: track_id
2: danceability
3: duration_ms
Returns only the first column in the data, by name:
csvcut -c "track_id" Spotify_MusicAttributes.csv
track_id
118GQ70Sp6pMqn6w1oKuki
6S7cr72a7a8RVAXzDCRj6m
1: track_id
2: danceability
3: duration_ms
Returns the second and third column in the data, by position:
csvcut -c 2,3 Spotify_MusicAttributes.csv
danceability,duration_ms
0.787,124016.0
0.777,128016.0
0.795999999999999,132742.0
1: track_id
2: danceability
3: duration_ms
Returns the second and third column in the data, by name:
csvcut -c "danceability","duration_ms" Spotify_MusicAttributes.csv
danceability,duration_ms
0.787,124016.0
0.777,128016.0
0.795999999999999,132742.0
csvgrep
:
-m
: followed by the exact row value to filter
-r
: followed with a regex pattern
-f
: followed by the path to a file
Documentation:
csvgrep -h
Find in Spotify_Popularity.csv
where track_id
= 5RCPsfzmEpTXMCTNk7wEfQ
csvgrep -c "track_id" -m 5RCPsfzmEpTXMCTNk7wEfQ Spotify_Popularity.csv
track_id,popularity
5RCPsfzmEpTXMCTNk7wEfQ,7.0
csvgrep -c 1 -m 5RCPsfzmEpTXMCTNk7wEfQ Spotify_Popularity.csv
track_id,popularity
5RCPsfzmEpTXMCTNk7wEfQ,7.0
Data Processing in Shell