Data Processing in Shell
Susan Sun
Data Person
csvstack
: stacks up the rows from two or more CSV files
Documentation:
csvstack -h
usage: csvstack [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b]
[-p ESCAPECHAR] [-z FIELD_SIZE_LIMIT] [-e ENCODING] [-S] [-H]
[-n GROUP_NAME] [--filenames]
Stack two similar files Spotify_Rank6.csv
and Spotify_Rank7.csv
into one file.
Preview the data to check schema:
csvlook Spotify_Rank6.csv
| track_id | popularity |
| ---------------------- | ---------- |
| 7JYCpIzpoidDOnnmxmHwtj | 6 |
| 0mmFibEg5NuULMwTVN2tRU | 6 |
csvlook Spotify_Rank7.csv
| track_id | popularity |
| ---------------------- | ---------- |
| 118GQ70Sp6pMqn6w1oKuki | 7 |
| 6S7cr72a7a8RVAXzDCRj6m | 7 |
Syntax:
csvstack Spotify_Rank6.csv Spotify_Rank7.csv > Spotify_AllRanks.csv
csvlook Spotify_AllRanks.csv
| track_id | popularity |
| ---------------------- | ---------- |
| 7JYCpIzpoidDOnnmxmHwtj | 6 |
| 0mmFibEg5NuULMwTVN2tRU | 6 |
| 118GQ70Sp6pMqn6w1oKuki | 7 |
| 6S7cr72a7a8RVAXzDCRj6m | 7 |
csvstack -g "Rank6","Rank7" \
Spotify_Rank6.csv Spotify_Rank7.csv > Spotify_AllRanks.csv
csvlook Spotify_AllRanks.csv
| group | track_id | popularity |
| ----- | ---------------------- | ---------- |
| Rank6 | 7JYCpIzpoidDOnnmxmHwtj | 6 |
| Rank6 | 0mmFibEg5NuULMwTVN2tRU | 6 |
| Rank7 | 118GQ70Sp6pMqn6w1oKuki | 7 |
| Rank7 | 6S7cr72a7a8RVAXzDCRj6m | 7 |
csvstack -g "Rank6","Rank7" -n "source" \
Spotify_Rank6.csv Spotify_Rank7.csv > Spotify_AllRanks.csv
csvlook Spotify_AllRanks.csv
| source| track_id | popularity |
| ----- | ---------------------- | ---------- |
| Rank6 | 7JYCpIzpoidDOnnmxmHwtj | 6 |
| Rank6 | 0mmFibEg5NuULMwTVN2tRU | 6 |
| Rank7 | 118GQ70Sp6pMqn6w1oKuki | 7 |
| Rank7 | 6S7cr72a7a8RVAXzDCRj6m | 7 |
;
links commands together and runs sequentially
csvlook SpotifyData_All.csv; csvstat SpotifyData_All.csv
&&
links commands together, but only runs the 2nd command if the 1st succeeds
csvlook SpotifyData_All.csv && csvstat SpotifyData_All data.csv
>
re-directs the output from the 1st command to the location indicated as the 2nd
in2csv SpotifyData.xlsx > SpotifyData.csv
|
uses the output of the 1st command as input to the 2nd
Example:
Output of csvcut
is not well formatted:
csvcut -c "track_id","danceability" Spotify_MusicAttributes.csv
track_id,danceability
118GQ70Sp6pMqn6w1oKuki,0.787
6S7cr72a7a8RVAXzDCRj6m,0.777
7h2qWpMJzIVtiP30E8VDW4,0.795
3KVQFxJ5CWOcbxdpPYdi4o,0.815
Example (continued):
Re-format csvcut
's output by piping the output as input to csvlook
:
csvcut -c "track_id","danceability" Spotify_Popularity.csv | csvlook
| track_id | danceability |
| ---------------------- | ------------ |
| 118GQ70Sp6pMqn6w1oKuki | 0.787 |
| 6S7cr72a7a8RVAXzDCRj6m | 0.777 |
| 7h2qWpMJzIVtiP30E8VDW4 | 0.796 |
| 3KVQFxJ5CWOcbxdpPYdi4o | 0.815 |
Data Processing in Shell