Descriptive and Inferential Statistics

Analyzing Survey Data in Python

EbunOluwa Andrew

Data Scientist

Descriptive statistics

Basic measures that summarize survey data
Examples: mean, median, mode, range, standard deviation, etc.
.describe()
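
As a minimal sketch, the measures listed above can also be computed one at a time with pandas. The `ratings` values are made up for illustration and are not taken from the survey.

```python
import pandas as pd

# Hypothetical ratings, invented for illustration only
ratings = pd.Series([5, 7, 7, 8, 9, 10])

mean = ratings.mean()                        # arithmetic average
median = ratings.median()                    # middle value
mode = ratings.mode()[0]                     # most frequent value
value_range = ratings.max() - ratings.min()  # spread between extremes
std = ratings.std()                          # sample standard deviation (ddof=1)

print(mean, median, mode, value_range, round(std, 3))
```

Calling `ratings.describe()` would report most of these in one step, which is why the slides rely on it.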

¹ Photo by Lukas

.describe() function

data.describe()

|       | year     | satisfaction_rating |
|-------|----------|---------------------|
| count |       42 |                  42 |
| mean  | 2012.381 |            7192.857 |
| std   |    4.196 |             945.178 |
| min   |     2006 |                5500 |
| 25%   |     2009 |                6325 |
| 50%   |   2012.5 |                7400 |
| 75%   |     2016 |                8000 |
| max   |     2019 |                8600 |

data.describe(include="object")

|        | category    |
|--------|-------------|
| count  | 42          |
| unique | 3           |
| top    | Residential |
| freq   | 14          |

Interpreting .describe()

Possible outlier = max (or min) value far from the mean and median
Improbable values = values that are not logically possible

|       | year     | satisfaction_rating |
|-------|----------|---------------------|
| count |       42 |                  42 |
| mean  | 2012.381 |            7192.857 |
| std   |    4.196 |             945.178 |
| min   |     2006 |                5500 |
| 25%   |     2009 |                6325 |
| 50%   |   2012.5 |                7400 |
| 75%   |     2016 |                8000 |
| max   |     2019 |                8600 |
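
One common way to turn "the max sits far from the mean and median" into a concrete check is the 1.5 × IQR rule. This is a swapped-in rule of thumb, not something these slides prescribe, and the `ratings` values below are hypothetical.

```python
import pandas as pd

# Hypothetical ratings with one suspiciously large value
ratings = pd.Series([6000, 6500, 7000, 7200, 7400, 8000, 25000])

summary = ratings.describe()
iqr = summary["75%"] - summary["25%"]  # interquartile range

# Values more than 1.5 * IQR beyond the quartiles are flagged as possible outliers
lower = summary["25%"] - 1.5 * iqr
upper = summary["75%"] + 1.5 * iqr
outliers = ratings[(ratings < lower) | (ratings > upper)]

print(outliers.tolist())
```

A flagged value is only a candidate; whether it is an error or a genuine response still needs a look at the data.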

Interpreting .describe()

Top = mode = most frequently occurring class
Freq = number of times the top class occurs

|        | category    |
|--------|-------------|
| count  | 42          |
| unique | 3           |
| top    | Residential |
| freq   | 14          |
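
A short sketch of where `top` and `freq` come from, using `value_counts()` on a hypothetical `category` column (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical categories, invented for illustration only
category = pd.Series(
    ["Residential", "Commercial", "Residential", "Industrial", "Residential"]
)

counts = category.value_counts()  # occurrences per class, most frequent first
top = counts.index[0]             # corresponds to "top" in .describe()
freq = int(counts.iloc[0])        # corresponds to "freq" in .describe()

print(top, freq)
print(category.describe())
```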

.describe() on electric_satisfaction

import pandas as pd

electric_satisfaction = pd.read_csv("austin-energy-customer-satisfaction.csv")

.describe() on electric_satisfaction

electric_satisfaction.describe()

|       | year     | satisfaction_rating |
|-------|----------|---------------------|
| count |       42 |                  42 |
| mean  | 2012.381 |            7192.857 |
| std   |    4.196 |             945.178 |
| min   |     2006 |                5500 |
| 25%   |     2009 |                6325 |
| 50%   |   2012.5 |                7400 |
| 75%   |     2016 |                8000 |
| max   |     2019 |                8600 |

satisfaction_rating has outliers
50th percentile = median

.describe() on electric_satisfaction

|        | category    |
|--------|-------------|
| count  | 42          |
| unique | 3           |
| top    | Residential |
| freq   | 14          |

Mode = Residential respondents

Inferential statistics

Determine whether results from the sample generalize to a larger population
Sample size < population size -> sampling error
Estimate population parameters
- Confidence intervals
  - norm.interval() function
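
To make sampling error concrete, here is a small simulation sketch. The population is generated with NumPy and is purely hypothetical; only the sample size of 42 echoes the survey.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical population of 10,000 ratings (simulated, not survey data)
population = rng.normal(loc=7000, scale=900, size=10_000)

# A random sample of 42 respondents drawn without replacement
sample = rng.choice(population, size=42, replace=False)

# Sampling error: gap between the sample estimate and the population value
sampling_error = sample.mean() - population.mean()
print(round(sampling_error, 2))
```

The gap is generally nonzero because the sample covers only part of the population, which is the reason for reporting an interval rather than a single number.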

¹ Photo by Andrea Piacquadio on Pexels

The norm.interval() function

For large datasets
Assume sampling distribution of mean is normally distributed

import scipy.stats
scipy.stats.norm.interval(confidence, loc, scale)

confidence = confidence level (this parameter was named alpha before SciPy 1.9)
loc = sample mean
scale = standard error of the sample mean
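
The following sketch shows what the function computes under the normality assumption: the sample mean plus or minus a critical z-value times the standard error. The `sample` array is simulated and hypothetical; `norm.ppf` and `sem` are existing SciPy functions.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(seed=1)

# Hypothetical sample of 200 ratings (simulated, not survey data)
sample = rng.normal(loc=7000, scale=900, size=200)

confidence = 0.95
mean = sample.mean()
sem = st.sem(sample)  # sample std (ddof=1) divided by sqrt(n)

# Two-sided critical value of the standard normal distribution
z = st.norm.ppf(1 - (1 - confidence) / 2)
manual = (mean - z * sem, mean + z * sem)

# Passing the confidence level positionally avoids the keyword rename
library = st.norm.interval(confidence, loc=mean, scale=sem)

print(manual)
print(library)
```

Both tuples should agree to floating-point precision, since `norm.interval` returns the equal-tailed interval of a normal distribution centered at `loc` with standard deviation `scale`.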

Interpreting norm.interval() on electric_satisfaction

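import numpy as np
import pandas as pd
import scipy.stats as st

electric_satisfaction = pd.read_csv(
  "austin-energy-customer-satisfaction.csv")

# 99% confidence interval for the mean satisfaction rating
conf_interval = st.norm.interval(
  0.99,  # confidence level, passed positionally (keyword: alpha in older SciPy, confidence in newer)
  loc = np.mean(electric_satisfaction.satisfaction_rating),
  scale = st.sem(electric_satisfaction.satisfaction_rating))

print(conf_interval)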

(6817.187361704269, 7568.526924010017)
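
As a rough cross-check that needs no CSV file, the printed interval can be approximated from the summary values in the `.describe()` output alone. The numbers below are copied from that table, so the result matches only up to rounding.

```python
import math
import scipy.stats as st

# Summary values copied from the .describe() output for satisfaction_rating
n = 42
mean = 7192.857
std = 945.178

sem = std / math.sqrt(n)  # standard error of the mean
low, high = st.norm.interval(0.99, loc=mean, scale=sem)

print(round(low, 1), round(high, 1))
```

Read this way, the interval says that the population mean satisfaction rating plausibly lies between roughly 6817 and 7569 at the 99% confidence level.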

Let's practice!
