Introduction to databases

Streamlined Data Ingestion with pandas

Amany Mahfouz

Instructor

Relational Databases

  • Data about entities is organized into tables
  • Each row or record is an instance of an entity
  • Each column has information about an attribute
  • Tables can be linked to each other via unique keys
  • Support more data, multiple simultaneous users, and data quality controls
  • Data types are specified for each column
  • SQL (Structured Query Language) to interact with databases
Streamlined Data Ingestion with pandas

Common Relational Databases

Microsoft SQL Server logo

Oracle logo

  • SQLite databases are computer files

PostgreSQL logo

SQLite logo

Streamlined Data Ingestion with pandas

Connecting to Databases

  • Two-step process:
    1. Create way to connect to database
    2. Query database

Database clip art

Streamlined Data Ingestion with pandas

Creating a Database Engine

SQLAlchemy logo

  • sqlalchemy's create_engine() makes an engine to handle database connections
    • Needs string URL of database to connect to
    • SQLite URL format: sqlite:///filename.db
Streamlined Data Ingestion with pandas

Querying Databases

  • pd.read_sql(query, engine) to load in data from a database
  • Arguments
    • query: String containing SQL query to run or table to load
    • engine: Connection/database engine object
Streamlined Data Ingestion with pandas

SQL Review: SELECT

  • Used to query data from a database
  • Basic syntax:
    SELECT [column_names] FROM [table_name];
    
  • To get all data in a table:
    SELECT * FROM [table_name];
    
  • Code style: keywords in ALL CAPS, semicolon (;) to end a statement
Streamlined Data Ingestion with pandas

Getting Data from a Database

# Load pandas and sqlalchemy's create_engine
import pandas as pd
from sqlalchemy import create_engine

# Create database engine to manage connections engine = create_engine("sqlite:///data.db")
# Load entire weather table by table name weather = pd.read_sql("weather", engine)
Streamlined Data Ingestion with pandas
# Create database engine to manage connections
engine = create_engine("sqlite:///data.db")

# Load entire weather table with SQL weather = pd.read_sql("SELECT * FROM weather", engine)
print(weather.head())
       station                         name  latitude  ...   prcp  snow  tavg tmax  tmin
0  USW00094728  NY CITY CENTRAL PARK, NY US  40.77898  ...   0.00   0.0         52    42
1  USW00094728  NY CITY CENTRAL PARK, NY US  40.77898  ...   0.00   0.0         48    39
2  USW00094728  NY CITY CENTRAL PARK, NY US  40.77898  ...   0.00   0.0         48    42
3  USW00094728  NY CITY CENTRAL PARK, NY US  40.77898  ...   0.00   0.0         51    40
4  USW00094728  NY CITY CENTRAL PARK, NY US  40.77898  ...   0.75   0.0         61    50

[5 rows x 13 columns]
Streamlined Data Ingestion with pandas

Let's practice!

Streamlined Data Ingestion with pandas

Preparing Video For Download...