NTILE and CUME_DIST

Window Functions in Snowflake

Jake Roach

Field Data engineer

Creating buckets of rows

How can we classify gym members based on their workouts to market the right classes?

             member_id  |  gym_location  |  calories_burned  |  marketing_group
            ----------- | -------------- | ----------------- | -----------------
               m_192    |    Miami       |         45        |         1
               m_74     |    Miami       |         59        |         1
               m_233    |    Portland    |         60        |         1

               m_14     |    Cleveland   |         72        |         2
               m_346    |    Portland    |         77        |         2
               m_289    |    Cleveland   |         81        |         2

               m_565    |    Miami       |        1085       |         50
                                        ...
Window Functions in Snowflake

NTILE

SELECT
    <fields>,
    <2>,
    <1>,

NTILE(<n>) OVER( PARTITION BY <2> ORDER BY <1> )
...;

NTILE is used to create "N" number of equally-sized "buckets"

$$

<n>: number of buckets

<1>: field used to create buckets

<2>: field used to evenly-distribute records using PARTITION BY

Window Functions in Snowflake

Bucketing fitness data

SELECT
    member_id,
    gym_location,
    calories_burned,

    -- Create 50 equally-sized buckets of data

NTILE(50) OVER( ORDER BY calories_burned -- Decides the records in each bucket ) AS marketing_group
FROM FITNESS.workouts ORDER BY marketing_group, calories_burned; -- ORDER the final result set
Window Functions in Snowflake

Bucketing fitness data

         member_id  |  gym_location  |  calories_burned  |  marketing_group
        ----------- | -------------- | ----------------- | -----------------
           m_192    |    Miami       |         45        |         1
           m_74     |    Miami       |         59        |         1
           m_233    |    Portland    |         60        |         1

           m_14     |    Cleveland   |         72        |         2
           m_346    |    Portland    |         77        |         2
           m_289    |    Cleveland   |         81        |         2

           m_565    |    Miami       |        1085       |         50
                                        ...
Window Functions in Snowflake

Evenly-distributed buckets of fitness data

SELECT
    member_id,
    gym_location,
    calories_burned,

    NTILE(50) OVER(

-- Evenly distribute records in bucket by each gym_location PARTITION BY gym_location
ORDER BY calories_burned ) AS marketing_group FROM FITNESS.workouts ORDER BY marketing_group, calories_burned;
Window Functions in Snowflake

Evenly-distributed buckets of fitness data

         member_id  |  gym_location  |  calories_burned  |  marketing_group
        ----------- | -------------- | ----------------- | -----------------
           m_192    |    Miami       |         45        |         1
           m_233    |    Portland    |         60        |         1
           m_14     |    Cleveland   |         72        |         1

           m_74     |    Miami       |         59        |         2
           m_346    |    Portland    |         77        |         2
           m_289    |    Cleveland   |         81        |         2

                                    ...
Window Functions in Snowflake

Understanding a distribution

  • What is the distribution of the calories burned for each member's workout?

$$

  • Where does a specific workout fall in this distribution?

$$

  • What proportion of members burned the same of fewer calories than a specific member?
  member_id  |  cals_burned  |   cd   
 ----------- | ------------- | ------
    m_192    |       45      |  .016
    m_74     |       59      |  .033
    m_233    |       60      |  .049
    m_14     |       72      |  .066
    m_346    |       77      |  .082
    m_289    |       81      |  .098

                    ....

    m_565    |      1085     |  1.000
Window Functions in Snowflake

CUME_DIST

SELECT
    member_id,
    gym_location,
    calories_burned,

    CUME_DIST() OVER(
        PARTITION BY gym_location,   -- Create a distribution for each location
        ORDER BY calories_burned
    ) AS cd

FROM FITNESS.workouts

ORDER BY gym_location, cd; -- ORDER the final result set
Window Functions in Snowflake

CUME_DIST

SELECT
    <fields>,
    <1>,
    <2>,


CUME_DIST() OVER( PARTITION BY <1> ORDER BY <2> )
...;

Compares each record to the distribution for that column/field, cumulative distribution

$$

<1>: field that determines the window to evaluate

<2>: field to create distribution for

$$

  • Which proportion of records are less that or equal to this one?
Window Functions in Snowflake

CUME_DIST

               member_id  |  gym_location  | calories_burned  |   cd   
              ----------- | -------------- | ---------------- | -------
                 m_192    |    Miami       |        45        |  .033
                 m_74     |    Miami       |        59        |  .066
                 m_288    |    Miami       |        83        |  .098
                 m_541    |    Miami       |        85        |  .131

                                          ...

                 m_233    |    Portland    |        60        |  .071
                 m_346    |    Portland    |        77        |  .142

                                          ...
Window Functions in Snowflake

Let's practice!

Window Functions in Snowflake

Preparing Video For Download...