External Schemas, File, and Table Formats

Introduction to Redshift

Jason Myers

Principal Architect

External schemas

Database Components Database Components

  • When metadata catalog and storage are not part of the cluster it is considered external

Redshift Spectrum

External Schemas with Redshift Spectrum

  • Redshift is the engine
  • Uses AWS Glue Data Catalog and AWS S3 storage by default
Introduction to Redshift

S3 data file formats

File format Columnar Supports parallel reads
Parquet Yes Yes
ORC Yes Yes
TextFile No No
OpenCSV No Yes
JSON No No
Introduction to Redshift

Create CSV External Table

CREATE TABLE spectrumdb.IDAHO_SITE_ID
(
    'pk_siteid' INTEGER PRIMARY KEY,
    -- Cutting the rest of columns for space
)

-- CSV rows are comma delimited ROW FORMAT DELIMITED
-- CSV fields are terminated by a comma FIELDS TERMINATED BY ','
-- CSVs are a type of text file STORED AS TEXTFILE
-- This is where the data is in AWS S3 LOCATION 's3://spectrum-id/idaho_sites/'
-- This file has headers that we want to skip TABLE PROPERTIES ('skip.header.line.count'='1');
Introduction to Redshift

Querying spectrum tables

  • Works just like querying internal tables
  • EXPLAIN will look different
  • No concern for DISTKEY or SORTKEYs
  • Pseudocolumns
    • $path - shows the file storage path for the row
    • $size - shows the file size for the row
Introduction to Redshift

Using pseudocolumns

SELECT "$path", 
       "$size",
       pk_siteid
  FROM spectrumdb.idaho_site_id;
$path                           | $size | pk_siteid
================================|=======|==========
's3://spectrum-id/idaho_sites/' | 1616  | 1 
's3://spectrum-id/idaho_sites/' | 1616  | 2 
's3://spectrum-id/idaho_sites/' | 1616  | 3 
Introduction to Redshift

Table formats

  • Common Formats:

    • Hive
    • Iceberg
    • Hudi
    • Deltalake
  • Read only

  • Some like Hive need an external catalog other than AWS Glue
Introduction to Redshift

Viewing external schemas

  • SVV_ALL_SCHEMAS - internal or external
SELECT schema_name, 
       schema_type
  FROM SVV_ALL_SCHEMAS
 ORDER BY SCHEMA_NAME;
schema_name           | schema_type
======================|=============
public_intro_redshift | internal
spectrumdb            | external
Introduction to Redshift

Viewing external tables

  • SVV_ALL_TABLES - TABLE or EXTERNAL TABLE
SELECT table_name, 
       table_type
  FROM SVV_ALL_TABLES
 WHERE schema_name = 'public_intro_redshift';
table_name                | table_type
==========================|================
coffee_county_weather     | TABLE
idaho_monitoring_location | TABLE
idaho_samples             | TABLE
ecommerce_sales           | EXTERNAL_TABLE
Introduction to Redshift

Let's practice!

Introduction to Redshift

Preparing Video For Download...