Introduction to JSON

Streamlined Data Ingestion with pandas

Amany Mahfouz

Instructor

Javascript Object Notation (JSON)

  • Common web data format
  • Not tabular
    • Records don't have to all have the same set of attributes
  • Data organized into collections of objects
  • Objects are collections of attribute-value pairs
  • Nested JSON: objects within objects
Streamlined Data Ingestion with pandas

Reading JSON Data

  • read_json()
    • Takes a string path to JSON _or_ JSON data as a string
    • Specify data types with dtype keyword argument
    • orient keyword argument to flag uncommon JSON data layouts
      • possible values in pandas documentation
Streamlined Data Ingestion with pandas

Data Orientation

  • JSON data isn't tabular
    • pandas guesses how to arrange it in a table
  • pandas can automatically handle common orientations
Streamlined Data Ingestion with pandas

Record Orientation

  • Most common JSON arrangement
    [
      {
          "age_adjusted_death_rate": "7.6",
          "death_rate": "6.2",
          "deaths": "32",
          "leading_cause": "Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)",
          "race_ethnicity": "Asian and Pacific Islander",
          "sex": "F",
          "year": "2007"
      },
      {
          "age_adjusted_death_rate": "8.1",
          "death_rate": "8.3",
          "deaths": "87",
          ...
    
Streamlined Data Ingestion with pandas

Column Orientation

  • More space-efficient than record-oriented JSON
    {
      "age_adjusted_death_rate": {
          "0": "7.6",
          "1": "8.1",
          "2": "7.1",
          "3": ".",
          "4": ".",
          "5": "7.3",
          "6": "13",
          "7": "20.6",
          "8": "17.4",
          "9": ".",
          "10": ".",
          "11": "19.8",
          ...
    
Streamlined Data Ingestion with pandas

Specifying Orientation

  • Split oriented data - nyc_death_causes.json
    {
      "columns": [
          "age_adjusted_death_rate",
          "death_rate",
          "deaths",
          "leading_cause",
          "race_ethnicity",
          "sex",
          "year"
      ],
      "index": [...],
      "data": [
          [
              "7.6",
    
Streamlined Data Ingestion with pandas

Specifying Orientation

import pandas as pd

death_causes = pd.read_json("nyc_death_causes.json", orient="split")
print(death_causes.head())
  age_adjusted_death_rate death_rate deaths             leading_cause              race_ethnicity sex  year
0                     7.6        6.2     32  Accidents Except Drug...  Asian and Pacific Islander   F  2007
1                     8.1        8.3     87  Accidents Except Drug...          Black Non-Hispanic   F  2007
2                     7.1        6.1     71  Accidents Except Drug...                    Hispanic   F  2007
3                       .          .      .  Accidents Except Drug...          Not Stated/Unknown   F  2007
4                       .          .      .  Accidents Except Drug...       Other Race/ Ethnicity   F  2007

[5 rows x 7 columns]
Streamlined Data Ingestion with pandas

Let's practice!

Streamlined Data Ingestion with pandas

Preparing Video For Download...