Data Processing in Shell
Susan Sun
Data Person
csvkit
:
Install csvkit
using Python package manager pip
:
pip install csvkit
Upgrade csvkit
to the latest version:
pip install --upgrade csvkit
Full instructions:
Web-based documentation: https://csvkit.readthedocs.io/en/latest/tutorial.html
Web-based documentation:
https://csvkit.readthedocs.io/en/latest/scripts/in2csv.html
Command line-based documentation:
in2csv --help
in2csv -h
usage: in2csv [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b]
[-p ESCAPECHAR] [-z FIELD_SIZE_LIMIT] [-e ENCODING] [-L LOCALE]
[-S] [--blanks] [--date-format DATE_FORMAT]
[--datetime-format DATETIME_FORMAT] [-H] [-K SKIP_LINES] [-v]
Syntax:
in2csv SpotifyData.xlsx > SpotifyData.csv
Prints the first sheet in Excel to console and does not save
in2csv SpotifyData.xlsx
>
redirects the output and saves it as a new file SpotifyData.csv
> SpotifyData.csv
Use --names
or -n
option to print all sheet names in SpotifyData.xlsx
.
in2csv -n SpotifyData.xlsx
Worksheet1_Popularity
Worksheet2_MusicAttributes
Use --sheet
option followed by the sheet "Worksheet1_Popularity"
to be converted.
in2csv SpotifyData.xlsx --sheet "Worksheet1_Popularity" > Spotify_Popularity.csv
in2csv
does not print logs to console.
in2csv SpotifyData.xlsx --sheet "Worksheet1_Popularity" > Spotify_Popularity.csv
Sanity check:
ls
SpotifyData.xlsx Spotify_Popularity.csv backup bin
csvlook
: renders a CSV to the command line in a Markdown-compatible, fixed-width format
Documentation:
csvlook -h
usage: csvlook [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b]
[-p ESCAPECHAR] [-z FIELD_SIZE_LIMIT] [-e ENCODING] [-L LOCALE]
[-S] [--blanks] [--date-format DATE_FORMAT]
Syntax:
csvlook Spotify_Popularity.csv
| track_id | popularity |
| ---------------------- | ---------- |
| 118GQ70Sp6pMqn6w1oKuki | 7 |
| 6S7cr72a7a8RVAXzDCRj6m | 7 |
| 7h2qWpMJzIVtiP30E8VDW4 | 7 |
| 3KVQFxJ5CWOcbxdpPYdi4o | 7 |
| 0JjNrI1xmsTfhaiU1R6OVc | 7 |
| 3HjTcZt29JUHg5m60QhlMw | 7 |
csvstat
: prints descriptive summary statistics on all columns in CSV (e.g. mean, median, unique values counts)
Documentation:
csvstat -h
usage: csvstat [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b]
[-p ESCAPECHAR] [-z FIELD_SIZE_LIMIT] [-e ENCODING] [-S] [-H]
[-K SKIP_LINES] [-v] [-l] [--zero] [-V] [--csv] [-n]
Syntax:
csvstat Spotify_Popularity.csv
1. "track_id"
Type of data: Text
Contains null values: False
Unique values: 24
Longest value: 22 characters
Most common values: 118GQ70Sp6pMqn6w1oKuki (1x)
6S7cr72a7a8RVAXzDCRj6m (1x)
Data Processing in Shell