Window Functions in Snowflake
Jake Roach
Field Data engineer
How can we classify gym members based on their workouts to market the right classes?
member_id | gym_location | calories_burned | marketing_group
----------- | -------------- | ----------------- | -----------------
m_192 | Miami | 45 | 1
m_74 | Miami | 59 | 1
m_233 | Portland | 60 | 1
m_14 | Cleveland | 72 | 2
m_346 | Portland | 77 | 2
m_289 | Cleveland | 81 | 2
m_565 | Miami | 1085 | 50
...
SELECT <fields>, <2>, <1>,
NTILE(<n>) OVER( PARTITION BY <2> ORDER BY <1> )
...;
NTILE
is used to create "N" number of equally-sized "buckets"
$$
<n>
: number of buckets
<1>
: field used to create buckets
<2>
: field used to evenly-distribute records using PARTITION BY
SELECT member_id, gym_location, calories_burned, -- Create 50 equally-sized buckets of data
NTILE(50) OVER( ORDER BY calories_burned -- Decides the records in each bucket ) AS marketing_group
FROM FITNESS.workouts ORDER BY marketing_group, calories_burned; -- ORDER the final result set
member_id | gym_location | calories_burned | marketing_group
----------- | -------------- | ----------------- | -----------------
m_192 | Miami | 45 | 1
m_74 | Miami | 59 | 1
m_233 | Portland | 60 | 1
m_14 | Cleveland | 72 | 2
m_346 | Portland | 77 | 2
m_289 | Cleveland | 81 | 2
m_565 | Miami | 1085 | 50
...
SELECT member_id, gym_location, calories_burned, NTILE(50) OVER(
-- Evenly distribute records in bucket by each gym_location PARTITION BY gym_location
ORDER BY calories_burned ) AS marketing_group FROM FITNESS.workouts ORDER BY marketing_group, calories_burned;
member_id | gym_location | calories_burned | marketing_group
----------- | -------------- | ----------------- | -----------------
m_192 | Miami | 45 | 1
m_233 | Portland | 60 | 1
m_14 | Cleveland | 72 | 1
m_74 | Miami | 59 | 2
m_346 | Portland | 77 | 2
m_289 | Cleveland | 81 | 2
...
$$
$$
member_id | cals_burned | cd
----------- | ------------- | ------
m_192 | 45 | .016
m_74 | 59 | .033
m_233 | 60 | .049
m_14 | 72 | .066
m_346 | 77 | .082
m_289 | 81 | .098
....
m_565 | 1085 | 1.000
SELECT member_id, gym_location, calories_burned, CUME_DIST() OVER( PARTITION BY gym_location, -- Create a distribution for each location ORDER BY calories_burned ) AS cd FROM FITNESS.workouts
ORDER BY gym_location, cd; -- ORDER the final result set
SELECT <fields>, <1>, <2>,
CUME_DIST() OVER( PARTITION BY <1> ORDER BY <2> )
...;
Compares each record to the distribution for that column/field, cumulative distribution
$$
<1>
: field that determines the window to evaluate
<2>
: field to create distribution for
$$
member_id | gym_location | calories_burned | cd
----------- | -------------- | ---------------- | -------
m_192 | Miami | 45 | .033
m_74 | Miami | 59 | .066
m_288 | Miami | 83 | .098
m_541 | Miami | 85 | .131
...
m_233 | Portland | 60 | .071
m_346 | Portland | 77 | .142
...
Window Functions in Snowflake