Data Pipeline Automation in Snowflake
Emily Melhuish
Technical Curriculum Developer, Snowflake
Use case:

External Tables
CREATE EXTERNAL TABLE logistics.supplier_inventory
WITH LOCATION = @harbr_stage/inventory/
FILE_FORMAT = (TYPE = 'PARQUET')
AUTO_REFRESH = TRUE;
-- Query partner inventory
SELECT supplier_id,
COUNT(*) AS file_count,
SUM(quantity) AS total_units
FROM logistics.supplier_inventory
WHERE delivery_date >= '2024-01-01'
GROUP BY supplier_id;
External tables
-- Query directly, no load needed
SELECT * FROM
logistics.supplier_inventory
WHERE supplier_id = 'SUP-042';
Loading with COPY INTO
-- Load once, query fast
COPY INTO logistics.delivery_events
FROM @harbr_stage/events/;
How do we ensure we read data correctly from S3, Azure Blob or GCS?
Apache = Open table format: data not locked to any engine

Snowflake-managed
CREATE ICEBERG TABLE
logistics.shipments
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'harbr_s3_vol'
BASE_LOCATION = 'iceberg/shipments/';
Externally-managed
CREATE ICEBERG TABLE
logistics.partner_data
CATALOG = 'glue_catalog'
EXTERNAL_VOLUME = 'harbr_s3_vol'
CATALOG_TABLE_NAME =
'partner.deliveries';
Data Pipeline Automation in Snowflake