Top items with window functions

Time Series Analysis in PostgreSQL

Jasmin Ludolf

Content Developer, DataCamp

Table description

  • temperatures_monthly table
  • Fields:
    • station_id - weather station id
    • year_month - year and month
    • t_monthly_min - monthly minimum temperature in Celsius
    • t_monthly_max - monthly maximum temperature in Celsius
    • t_monthly_avg - monthly average temperature in Celsius
Time Series Analysis in PostgreSQL

Traditional aggregation

SELECT station_id, year_month, t_monthly_min
FROM temperatures_monthly AS tm
JOIN

( SELECT station_id, min(t_monthly_min) AS t_monthly_min FROM temperatures_monthly WHERE year_month BETWEEN '2018-01-01' AND '2018-12-31' GROUP BY station_id ) as p USING(station_id, t_monthly_min) WHERE year_month BETWEEN '2018-01-01' AND '2018-12-31' ORDER BY station_id, t_monthly_min;
Time Series Analysis in PostgreSQL

Traditional aggregation

|station_id|year_month|t_monthly_min|
|----------|----------|-------------|
|         1|2018-12-01|          5.4|
|         2|2018-12-01|          6.6|
|         3|2018-01-01|          1.6|
|         3|2018-02-01|          1.6|
|         4|2018-02-01|        -19.0|
|         6|2018-01-01|          2.6|
|         8|2018-02-01|        -16.3|

Problems:

  • Station 5 is missing, it had no data for 2018
  • Two rows for station 3, two rows had the same minimum temp 3
  • Only gives the coldest single month, what if we want the two coldest months?
Time Series Analysis in PostgreSQL

Using row_number

  • We need to number each row for each station
  • ROW_NUMBER() : numbers the rows per partition, in ascending order

 

SELECT station_id, year_month, t_monthly_min,
    ROW_NUMBER() OVER (PARTITION BY station_id ORDER BY t_monthly_min) AS rank
FROM temperatures_monthly AS tm
WHERE year_month BETWEEN '2018-01-01' AND '2018-12-31'
ORDER BY station_id, t_monthly_min;
Time Series Analysis in PostgreSQL

Numbering the rows

|station_id|year_month|t_monthly_min|rank|
|----------|----------|-------------|----|
|         1|2018-12-01|          5.4|   1|
|         1|2018-02-01|          6.2|   2|
|         1|2018-01-01|          7.4|   3|
|         1|2018-11-01|          9.0|   4|
|         1|2018-03-01|         10.6|   5| 
|         1|2018-04-01|         14.7|   6|
|         1|2018-10-01|         16.6|   7|
|         1|2018-05-01|         17.2|   8|
|         1|2018-06-01|         21.9|   9|
|         1|2018-09-01|         24.4|  10|
|         1|2018-07-01|         28.1|  11|
|         1|2018-08-01|         28.1|  12|
|         2|2018-12-01|          6.6|   1|
|         2|2018-01-01|          7.6|   2|
|         2|2018-02-01|          8.1|   3|
  • Numbering restarts with station 2
  • Because we partitioned by station_id
Time Series Analysis in PostgreSQL

Lowest temperatures

SELECT * FROM
(
    SELECT station_id, year_month, t_monthly_min,
    ROW_NUMBER() OVER 
          (PARTITION BY station_id ORDER BY t_monthly_min) AS rank
    FROM temperatures_monthly
    WHERE year_month BETWEEN '2018-01-01' AND '2018-12-31'
) AS q
WHERE rank < 3
ORDER BY station_id, rank;
Time Series Analysis in PostgreSQL

Lowest temperatures

|station_id|year_month|t_monthly_min|rank|
|----------|----------|-------------|----|
|         1|2018-12-01|          5.4|   1|
|         1|2018-02-01|          6.2|   2|
|         2|2018-12-01|          6.6|   1|
|         2|2018-01-01|          7.6|   2|
|         3|2018-01-01|          1.6|   1|
|         3|2018-02-01|          1.6|   2|
Time Series Analysis in PostgreSQL

Highest values

  • To reverse the order, add DESC:
ROW_NUMBER() OVER (PARTITION BY station_id ORDER BY t_monthly_min DESC)
Time Series Analysis in PostgreSQL

Let's practice!

Time Series Analysis in PostgreSQL

Preparing Video For Download...