Extracting non-tabular data

ETL and ELT in Python

Jake Roach

Data Engineer

Extracting non-tabular data

ETL pipeline with the extract component highlighted.

ETL and ELT in Python

Types of non-tabular data

Most data produced and consumed is unstructured data

  • Text
  • Audio
  • Image
  • Video
  • Spatial
  • IoT

Unstructred data sources undergoing transformation.

1 https://mitsloan.mit.edu/ideas-made-to-matter/tapping-power-unstructured-data
ETL and ELT in Python

Working with APIs and JSON data

API (Application Programming Interface)

  • Software that sits on top of data sources
  • Prevents direct interaction with database

$$

Data engineer using an API to interact with a database.

JSON (JavaScript Object Notation)

  • Key-value pairs
  • No set schema
  • Look and feel similar to dictionaries
{
    "key": "value",
    ...
    "open": 0.121875
}
ETL and ELT in Python

Reading JSON files with pandas

{
    "timestamps": [863703000, 863789400, ...],
    "open": [0.121875, 0.098438, ...],
    "close": [...],
    "volume": [...]
}

Use the .read_json() function

# Read in a JSON file in the format above
raw_stock_data = pd.read_json("raw_stock_data.json", orient="columns")
1 https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
ETL and ELT in Python

Nested or unstructured JSON data

Data is not always DataFrame-ready

  • Nested objects
  • Varying "schema"
{
    "863703000": {
        "volume": 1443120000,
        "price": {
            "close": 0.09791,
            "open": 0.12187
        }
    }, 
    "863789400": {
        ...
    }, ...
}
ETL and ELT in Python

Reading JSON files with json

import json

with open("raw_stock_data.json", "r") as file:
    # Load the file into a dictionary
    raw_stock_data = json.load(file)

# Confirm the type of the raw_stock_data variable
print(type(raw_stock_data))
<class 'dict'>
ETL and ELT in Python

Let's practice!

ETL and ELT in Python

Preparing Video For Download...