Filtering data using csvkit

Data Processing in Shell

Susan Sun

Data Person

What does it mean to filter data?

We can create a subset of the original data file by:

Filtering the data by column
Filtering the data by row

csvcut: filters data using column name or position

csvgrep: filters data by row value through exact match, pattern matching, or even regex

csvcut: filtering data by column

csvcut: filters and truncates CSV files by column name or column position

Documentation:

csvcut -h

usage: csvcut [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b]
              [-p ESCAPECHAR] [-z FIELD_SIZE_LIMIT] [-e ENCODING] [-S] [-H]
              [-K SKIP_LINES] [-v] [-l] [--zero] [-V] [-n] [-c COLUMNS]

csvcut: filtering data by column

Use --names or -n option to print all column names in Spotify_MusicAttributes.csv.

csvcut -n Spotify_MusicAttributes.csv

  1: track_id
  2: danceability
  3: duration_ms

csvcut: filtering data by column

  1: track_id
  2: danceability
  3: duration_ms

Returns the first column in the data, by position:

csvcut -c 1 Spotify_MusicAttributes.csv

track_id
118GQ70Sp6pMqn6w1oKuki
6S7cr72a7a8RVAXzDCRj6m

csvcut: filtering data by column

  1: track_id
  2: danceability
  3: duration_ms

Returns only the first column in the data, by name:

csvcut -c "track_id" Spotify_MusicAttributes.csv

track_id
118GQ70Sp6pMqn6w1oKuki
6S7cr72a7a8RVAXzDCRj6m

csvcut: filtering data by column

  1: track_id
  2: danceability
  3: duration_ms

Returns the second and third column in the data, by position:

csvcut -c 2,3 Spotify_MusicAttributes.csv

danceability,duration_ms
0.787,124016.0
0.777,128016.0
0.795999999999999,132742.0

csvcut: filtering data by column

  1: track_id
  2: danceability
  3: duration_ms

Returns the second and third column in the data, by name:

csvcut -c "danceability","duration_ms" Spotify_MusicAttributes.csv

danceability,duration_ms
0.787,124016.0
0.777,128016.0
0.795999999999999,132742.0

csvgrep: filtering data by row value

csvgrep:

filters by row using exact match or regex fuzzy matching
must be paired with one of these options:

-m: followed by the exact row value to filter

-r: followed with a regex pattern

-f: followed by the path to a file

Documentation:

csvgrep -h

csvgrep: filtering data by row value

Find in Spotify_Popularity.csv where track_id = 5RCPsfzmEpTXMCTNk7wEfQ

csvgrep -c "track_id" -m 5RCPsfzmEpTXMCTNk7wEfQ Spotify_Popularity.csv

track_id,popularity
5RCPsfzmEpTXMCTNk7wEfQ,7.0

csvgrep -c 1 -m 5RCPsfzmEpTXMCTNk7wEfQ Spotify_Popularity.csv

track_id,popularity
5RCPsfzmEpTXMCTNk7wEfQ,7.0

Let's do data filtering with csvkit!

Data Processing in Shell