Introduction to Data Engineering
Vincent Vankrunkelsven
Data Engineer @ DataCamp
Unstructured
Call me Ishmael. Some years ago—never
mind how long precisely—having little or
no money in my purse, and nothing particular
to interest me on shore, I thought ....
Flat files
.tsv or .csv
Year,Make,Model,Price
1997,Ford,E350,3000.00
1999,Chevy,"Venture Extended Edition",4900.00
1999,Chevy,"Venture Extended Edition",5000.00
1996,Jeep,Grand Cherokee,4799.00
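A minimal sketch of reading the flat file above with Python's standard `csv` module. The sample is held in an in-memory string (the name `raw` is an assumption for illustration; a real pipeline would open a file). Every field arrives as a string, so the numeric columns are converted explicitly.

```python
import csv
import io

# The flat-file sample from above, held in memory instead of on disk.
raw = '''Year,Make,Model,Price
1997,Ford,E350,3000.00
1999,Chevy,"Venture Extended Edition",4900.00
1999,Chevy,"Venture Extended Edition",5000.00
1996,Jeep,Grand Cherokee,4799.00
'''

# DictReader uses the header row as keys; each row becomes a dict of strings.
rows = list(csv.DictReader(io.StringIO(raw)))

# Convert the columns we want to treat as numbers.
for row in rows:
    row["Year"] = int(row["Year"])
    row["Price"] = float(row["Price"])

# The parser strips the double quotes around quoted fields.
print(rows[1]["Model"])
print(sum(row["Price"] for row in rows))
```

Each row is one record and each column one attribute, which is why this kind of file maps directly onto a list of dictionaries or a table.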
JSON
number
string
boolean
null
array
object
{
"an_object": {
"nested": [
"one",
"two",
"three",
{
"key": "four"
}
]
}
}
import json
result = json.loads('{"key_1": "value_1", "key_2":"value_2"}')
print(result["key_1"])
value_1
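The same `json.loads` call handles the nested document shown earlier. The sketch below parses it and then checks which Python type each of the six JSON types turns into; the names `doc` and `sample` are illustrative only.

```python
import json

# The nested document from the slide above.
doc = '''
{
  "an_object": {
    "nested": ["one", "two", "three", {"key": "four"}]
  }
}
'''

parsed = json.loads(doc)

# Objects become dicts and arrays become lists, so access is plain indexing.
nested = parsed["an_object"]["nested"]
print(nested[0])
print(nested[3]["key"])

# One value per JSON type: number, string, boolean, null, array, object.
sample = json.loads('{"n": 1.5, "s": "text", "b": true, "x": null, "a": [], "o": {}}')
type_names = {key: type(value).__name__ for key, value in sample.items()}
print(type_names)
```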
Requests
Example
{ "statuses": [{ "created_at": "Mon May 06 20:01:29 +0000 2019", "text": "this is a tweet"}] }
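A response like the one above is plain JSON, so it can be unpacked without any API-specific tooling. This is a sketch that parses the example payload from a string rather than from a live request. The date format string is an assumption inferred from the single sample value, and it relies on English day and month names being the active locale.

```python
import json
from datetime import datetime

# The example response body from above, as a string.
payload = '{ "statuses": [{ "created_at": "Mon May 06 20:01:29 +0000 2019", "text": "this is a tweet"}] }'

data = json.loads(payload)

# Assumed format, inferred from the sample timestamp.
DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"

tweets = []
for status in data["statuses"]:
    created = datetime.strptime(status["created_at"], DATE_FORMAT)
    tweets.append((created, status["text"]))

for created, text in tweets:
    print(created.isoformat(), text)
```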
import requests
response = requests.get("https://hacker-news.firebaseio.com/v0/item/16222426.json")
print(response.json())
{'by': 'neis', 'descendants': 0, 'id': 16222426, 'score': 17, 'time': 1516800333, 'title': .... }
Application databases
Analytical databases
Connection string/URI
postgresql://[user[:password]@][host][:port][/database]
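To see how the pieces of a connection URI line up, the sketch below splits one with the standard library's `urllib.parse.urlsplit`. The URI is the one used in this deck's Python example. This only illustrates the structure and is not how SQLAlchemy itself parses the string.

```python
from urllib.parse import urlsplit

connection_uri = "postgresql://repl:password@localhost:5432/pagila"

parts = urlsplit(connection_uri)

# Each component of the URI maps to one attribute of the result.
components = {
    "dialect": parts.scheme,
    "user": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,
    "database": parts.path.lstrip("/"),
}

for name, value in components.items():
    print(f"{name}: {value}")
```

Keeping everything in one string makes the connection easy to pass around, but it also embeds the password, so in practice the URI is usually read from configuration rather than hard-coded.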
Use in Python
import sqlalchemy
connection_uri = "postgresql://repl:password@localhost:5432/pagila"
db_engine = sqlalchemy.create_engine(connection_uri)
import pandas as pd
pd.read_sql("SELECT * FROM customer", db_engine)