Getting started with csvkit

Data Processing in Shell

Susan Sun

Data Person

What is csvkit?

csvkit:

  • is a suite of command-line tools
  • is developed in Python by Wireservice
  • offers data processing and cleaning capabilities on CSV files
  • has data capabilities that rival Python, R, and SQL
  • documentation: https://csvkit.readthedocs.io/en/latest/
Data Processing in Shell

csvkit installation

Install csvkit using Python package manager pip :

pip install csvkit

Upgrade csvkit to the latest version:

pip install --upgrade csvkit

Full instructions:

https://csvkit.readthedocs.io/en/latest/tutorial.html.

Data Processing in Shell

Browsing the csvkit manual

Data Processing in Shell

in2csv: converting files to CSV

Web-based documentation:

https://csvkit.readthedocs.io/en/latest/scripts/in2csv.html

Command line-based documentation:

in2csv --help

in2csv -h
usage: in2csv [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b]
              [-p ESCAPECHAR] [-z FIELD_SIZE_LIMIT] [-e ENCODING] [-L LOCALE]
              [-S] [--blanks] [--date-format DATE_FORMAT]
              [--datetime-format DATETIME_FORMAT] [-H] [-K SKIP_LINES] [-v]
Data Processing in Shell

in2csv: converting files to CSV

Syntax:

in2csv SpotifyData.xlsx > SpotifyData.csv

Prints the first sheet in Excel to console and does not save

in2csv SpotifyData.xlsx

> redirects the output and saves it as a new file SpotifyData.csv

> SpotifyData.csv
Data Processing in Shell

in2csv: converting files to CSV

Use --names or -n option to print all sheet names in SpotifyData.xlsx.

in2csv -n SpotifyData.xlsx
Worksheet1_Popularity
Worksheet2_MusicAttributes

Use --sheet option followed by the sheet "Worksheet1_Popularity" to be converted.

in2csv SpotifyData.xlsx --sheet "Worksheet1_Popularity" > Spotify_Popularity.csv
Data Processing in Shell

in2csv: converting files to CSV

in2csv does not print logs to console.

in2csv SpotifyData.xlsx --sheet "Worksheet1_Popularity" > Spotify_Popularity.csv

Sanity check:

ls
SpotifyData.xlsx  Spotify_Popularity.csv  backup  bin
Data Processing in Shell

csvlook: data preview on the command line

csvlook: renders a CSV to the command line in a Markdown-compatible, fixed-width format

Documentation:

csvlook -h
usage: csvlook [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b]
               [-p ESCAPECHAR] [-z FIELD_SIZE_LIMIT] [-e ENCODING] [-L LOCALE]
               [-S] [--blanks] [--date-format DATE_FORMAT]
Data Processing in Shell

csvlook: data preview on the command line

Syntax:

csvlook Spotify_Popularity.csv
| track_id               | popularity |
| ---------------------- | ---------- |
| 118GQ70Sp6pMqn6w1oKuki |          7 |
| 6S7cr72a7a8RVAXzDCRj6m |          7 |
| 7h2qWpMJzIVtiP30E8VDW4 |          7 |
| 3KVQFxJ5CWOcbxdpPYdi4o |          7 |
| 0JjNrI1xmsTfhaiU1R6OVc |          7 |
| 3HjTcZt29JUHg5m60QhlMw |          7 |
Data Processing in Shell

csvstat: descriptive stats on CSV data files

csvstat: prints descriptive summary statistics on all columns in CSV (e.g. mean, median, unique values counts)

Documentation:

csvstat -h
usage: csvstat [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b]
               [-p ESCAPECHAR] [-z FIELD_SIZE_LIMIT] [-e ENCODING] [-S] [-H]
               [-K SKIP_LINES] [-v] [-l] [--zero] [-V] [--csv] [-n]
Data Processing in Shell

csvstat: descriptive stats on CSV data files

Syntax:

csvstat Spotify_Popularity.csv
  1. "track_id"

        Type of data:          Text
        Contains null values:  False
        Unique values:         24
        Longest value:         22 characters
        Most common values:    118GQ70Sp6pMqn6w1oKuki (1x)
                               6S7cr72a7a8RVAXzDCRj6m (1x)
Data Processing in Shell

Let's try some csvkit!

Data Processing in Shell

Preparing Video For Download...