NTILE en CUME_DIST

Vensterfuncties in Snowflake

Jake Roach

Field Data engineer

Rijen in emmers indelen

Hoe kunnen we sportschoolleden indelen op basis van hun workouts om de juiste lessen te marketen?

             member_id  |  gym_location  |  calories_burned  |  marketing_group
            ----------- | -------------- | ----------------- | -----------------
               m_192    |    Miami       |         45        |         1
               m_74     |    Miami       |         59        |         1
               m_233    |    Portland    |         60        |         1

               m_14     |    Cleveland   |         72        |         2
               m_346    |    Portland    |         77        |         2
               m_289    |    Cleveland   |         81        |         2

               m_565    |    Miami       |        1085       |         50
                                        ...
Vensterfuncties in Snowflake

NTILE

SELECT
    <fields>,
    <2>,
    <1>,

NTILE(<n>) OVER( PARTITION BY <2> ORDER BY <1> )
...;

NTILE wordt gebruikt om N even grote emmers te maken

$$

<n>: aantal emmers

<1>: veld om emmers te maken

<2>: veld om rijen gelijkmatig te verdelen met PARTITION BY

Vensterfuncties in Snowflake

Fitnessdata in emmers verdelen

SELECT
    member_id,
    gym_location,
    calories_burned,

    -- Maak 50 even grote emmers

NTILE(50) OVER( ORDER BY calories_burned -- Bepaalt de rijen in elke emmer ) AS marketing_group
FROM FITNESS.workouts ORDER BY marketing_group, calories_burned; -- Sorteer de uiteindelijke resultset
Vensterfuncties in Snowflake

Fitnessdata in emmers verdelen

         member_id  |  gym_location  |  calories_burned  |  marketing_group
        ----------- | -------------- | ----------------- | -----------------
           m_192    |    Miami       |         45        |         1
           m_74     |    Miami       |         59        |         1
           m_233    |    Portland    |         60        |         1

           m_14     |    Cleveland   |         72        |         2
           m_346    |    Portland    |         77        |         2
           m_289    |    Cleveland   |         81        |         2

           m_565    |    Miami       |        1085       |         50
                                        ...
Vensterfuncties in Snowflake

Gelijk verdeelde emmers met fitnessdata

SELECT
    member_id,
    gym_location,
    calories_burned,

    NTILE(50) OVER(

-- Verdeel rijen gelijkmatig in emmers per gym_location PARTITION BY gym_location
ORDER BY calories_burned ) AS marketing_group FROM FITNESS.workouts ORDER BY marketing_group, calories_burned;
Vensterfuncties in Snowflake

Gelijk verdeelde emmers met fitnessdata

         member_id  |  gym_location  |  calories_burned  |  marketing_group
        ----------- | -------------- | ----------------- | -----------------
           m_192    |    Miami       |         45        |         1
           m_233    |    Portland    |         60        |         1
           m_14     |    Cleveland   |         72        |         1

           m_74     |    Miami       |         59        |         2
           m_346    |    Portland    |         77        |         2
           m_289    |    Cleveland   |         81        |         2

                                    ...
Vensterfuncties in Snowflake

Een verdeling begrijpen

  • Wat is de verdeling van verbrande calorieën per workout?

$$

  • Waar valt een specifieke workout in deze verdeling?

$$

  • Welk aandeel leden verbrandde evenveel of minder calorieën dan dit lid?
  member_id  |  cals_burned  |   cd   
 ----------- | ------------- | ------
    m_192    |       45      |  .016
    m_74     |       59      |  .033
    m_233    |       60      |  .049
    m_14     |       72      |  .066
    m_346    |       77      |  .082
    m_289    |       81      |  .098

                    ....

    m_565    |      1085     |  1.000
Vensterfuncties in Snowflake

CUME_DIST

SELECT
    member_id,
    gym_location,
    calories_burned,

    CUME_DIST() OVER(
        PARTITION BY gym_location,   -- Maak een verdeling per locatie
        ORDER BY calories_burned
    ) AS cd

FROM FITNESS.workouts

ORDER BY gym_location, cd; -- Sorteer de uiteindelijke resultset
Vensterfuncties in Snowflake

CUME_DIST

SELECT
    <fields>,
    <1>,
    <2>,


CUME_DIST() OVER( PARTITION BY <1> ORDER BY <2> )
...;

Vergelijkt elk record met de verdeling voor die kolom/veld, cumulatieve verdeling

$$

<1>: veld dat het venster bepaalt

<2>: veld waarvoor de verdeling wordt gemaakt

$$

  • Welk aandeel records is kleiner dan of gelijk aan dit record?
Vensterfuncties in Snowflake

CUME_DIST

               member_id  |  gym_location  | calories_burned  |   cd   
              ----------- | -------------- | ---------------- | -------
                 m_192    |    Miami       |        45        |  .033
                 m_74     |    Miami       |        59        |  .066
                 m_288    |    Miami       |        83        |  .098
                 m_541    |    Miami       |        85        |  .131

                                          ...

                 m_233    |    Portland    |        60        |  .071
                 m_346    |    Portland    |        77        |  .142

                                          ...
Vensterfuncties in Snowflake

Laten we oefenen!

Vensterfuncties in Snowflake

Preparing Video For Download...