Case study EDA & imputation

Writing Functions and Stored Procedures in SQL Server

Meghan Kwartler

IT Consultant

Taxi ride business problem

  1. EU private equity firm seeking investment opportunity in US Transportation.
  2. What is the average fare per distance, ride count & total ride time for each NYC borough on each day of the week?
  3. Which pickup locations within the borough should be scheduled for each of the driver shifts?

yellow_taxi_photo.jpg

Writing Functions and Stored Procedures in SQL Server

Essential EDA

  • Distributed transactional datasets can contain impossible scenarios due to data collection calibration problems
    • Dates in future
    • End dates before start dates
SELECT * 
FROM CapitalBikeShare 
WHERE 
  StartDate > GetDate() 
  OR EndDate > GetDate() 
  OR StartDate > EndDate
Writing Functions and Stored Procedures in SQL Server

Data imputation

  • Divide by zero error when calculating Avg Fare/TripDistance
  • EDA uncovers hundreds of TaxiRide trip records with Trip Distance = 0
  • Data Imputation methods to resolve
    • Mean
    • Hot Deck
    • Omission
Writing Functions and Stored Procedures in SQL Server

Mean imputation

  • Replace missing value with mean
  • Doesn't change the mean value
  • Increases correlations with other columns
CREATE PROCEDURE dbo.ImputeDurMean 
AS
BEGIN
DECLARE @AvgTripDuration AS float

SELECT @AvgTripDuration = AVG(Duration) 
FROM CapitalBikeShare
WHERE Duration > 0

UPDATE CapitalBikeShare
SET Duration =  @AvgTripDuration
WHERE Duration = 0
END;
Writing Functions and Stored Procedures in SQL Server

Hot Deck imputation

  • Missing value set to randomly selected value
  • TABLESAMPLE clause of FROM clause
CREATE FUNCTION dbo.GetDurHotDeck()
RETURNS decimal (18,4)
AS BEGIN
RETURN (SELECT TOP 1 Duration
FROM CapitalBikeShare
TABLESAMPLE (1000 rows)
WHERE Duration >0)
END

SELECT StartDate, "TripDuration" = CASE WHEN Duration > 0 THEN Duration ELSE dbo.GetDurHotDeck() END FROM CapitalBikeShare;
Writing Functions and Stored Procedures in SQL Server
SELECT 
  DATENAME(weekday, StartDate) AS DayofWeek, 
  AVG(Duration) AS 'AvgDuration' 
FROM CapitalBikeShare 
WHERE Duration > 0 
GROUP BY DATENAME(weekday, StartDate) 
ORDER BY AVG(Duration) desc
+-----------+-------------+
| DayofWeek | AvgDuration |
|-----------+-------------|
| Saturday  | 1476        |
| Sunday    | 1391        |
| Monday    |  979        |
| Friday    |  957        |
| Thursday  |  956        |
| Wednesday |  807        |
| Tuesday   |  763        |
+-----------+-------------|
Writing Functions and Stored Procedures in SQL Server

Your turn!

Writing Functions and Stored Procedures in SQL Server

Preparing Video For Download...