Cleaning and improving your data

Machine Learning for Time Series Data in Python

Chris Holdgraf

Fellow, Berkeley Institute for Data Science

Data is messy

  • Real-world data is often messy
  • The two most common problems are missing data and outliers
  • These problems often arise from human error, sensor malfunctions, database failures, etc.
  • Visualizing your raw data makes it easier to spot these problems

What messy data looks like


Interpolation: using time to fill in missing data

  • A common way to deal with missing data is to interpolate missing values
  • With timeseries data, you can use time to assist in interpolation
  • In this case, interpolation means using the known values on either side of a gap in the data to make assumptions about what's missing

Interpolation in Pandas

# Create a boolean mask that marks where values are missing
missing = prices.isna()

# Interpolate linearly within missing windows
prices_interp = prices.interpolate('linear')

# Plot the interpolated data in red and the data w/ missing values in black
ax = prices_interp.plot(c='r')
prices.plot(c='k', ax=ax, lw=2)
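
The call above uses 'linear', which treats rows as equally spaced. When timestamps are unevenly spaced, pandas also offers method='time', which weights the fill by elapsed time. A minimal sketch on made-up data (the series name and values below are assumptions for illustration, not the course's `prices`):

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical values) with an uneven gap between timestamps
idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05"])
toy_prices = pd.Series([0.0, np.nan, 4.0], index=idx)

# 'linear' ignores the index and treats rows as equally spaced
filled_linear = toy_prices.interpolate("linear")

# 'time' weights the fill by the actual time elapsed (needs a DatetimeIndex)
filled_time = toy_prices.interpolate("time")

print(filled_linear.iloc[1])  # 2.0, halfway between the neighbors
print(filled_time.iloc[1])    # 1.0, one day into a four-day gap
```

Which method is appropriate depends on whether your samples are evenly spaced; for regularly sampled data the two should agree.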

Visualizing the interpolated data


Using a rolling window to transform data

  • Another common use of rolling windows is to transform the data
  • We've already done this once, in order to smooth the data
  • However, we can also use this to do more complex transformations
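
As a rough illustration of the two uses mentioned above, the sketch below applies a rolling mean (smoothing) and then a custom function over the same window. The toy series and the choice of "range within the window" as the custom transformation are assumptions made for this example only:

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Smoothing: the mean of each 3-point window
smoothed = toy.rolling(window=3).mean()

# A custom transformation: the range (max - min) within each window.
# raw=True passes each window to the function as a NumPy array.
window_range = toy.rolling(window=3).apply(lambda w: w.max() - w.min(), raw=True)

print(smoothed.tolist())      # [nan, nan, 2.0, 3.0, 4.0]
print(window_range.tolist())  # [nan, nan, 2.0, 2.0, 2.0]
```

The first two entries are NaN because a full window of 3 points is not yet available there.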

Transforming data to standardize variance

  • A common transformation to apply to data is to standardize its mean and variance over time. There are many ways to do this.
  • Here, we'll show how to convert your dataset so that each point represents the % change relative to the mean of a previous window
  • This makes timepoints more comparable to one another when the absolute values of the data change a lot over time

Transforming to percent change with Pandas

import numpy as np

def percent_change(values):
    """Calculates the % change between the last value
    and the mean of previous values"""
    # Convert to an array so positional indexing also works for Series input
    values = np.asarray(values)

    # Separate the last value and all previous values into variables
    previous_values = values[:-1]
    last_value = values[-1]

    # Calculate the % difference between the last value
    # and the mean of earlier values
    previous_mean = np.mean(previous_values)
    return (last_value - previous_mean) / previous_mean

Applying this to our data

# Plot the raw data
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
ax = prices.plot(ax=axs[0])

# Calculate % change, keep the result for later use, and plot it
prices_perc_change = prices.rolling(window=20).aggregate(percent_change)
ax = prices_perc_change.plot(ax=axs[1])
ax.legend_.set_visible(False)
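
To see why this transformation makes timepoints more comparable, here is a small self-contained sketch on made-up numbers. The helper `pct_vs_window_mean` is a hypothetical stand-in that follows the same logic as the function on the previous slide, and the toy prices are assumptions:

```python
import numpy as np
import pandas as pd

def pct_vs_window_mean(window):
    """% change of the last value relative to the mean of the earlier ones."""
    window = np.asarray(window)
    previous_mean = window[:-1].mean()
    return (window[-1] - previous_mean) / previous_mean

# Toy prices (hypothetical): the same 20% jump at two very different scales
toy_prices = pd.Series([10.0, 10.0, 10.0, 12.0, 100.0, 100.0, 100.0, 120.0])

# raw=True hands each window to the helper as a NumPy array
toy_pct = toy_prices.rolling(window=4).apply(pct_vs_window_mean, raw=True)

print(toy_pct.iloc[3])  # roughly 0.2 (12 vs. a mean of 10)
print(toy_pct.iloc[7])  # roughly 0.2 (120 vs. a mean of 100)
```

In raw units the two jumps differ by a factor of ten (2 vs. 20), but as percent changes they come out the same.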


Finding outliers in your data

  • Outliers are datapoints that are statistically very different from the rest of the dataset
  • They can hurt the predictive power of your model by biasing it away from its "true" value
  • One solution is to remove outliers or replace them with a more representative value

Be very careful about doing this: it is often difficult to tell a legitimately extreme value from an aberration


Plotting a threshold on our data

fig, axs = plt.subplots(1, 2, figsize=(10, 5))
for data, ax in zip([prices, prices_perc_change], axs):
    # Calculate the mean / standard deviation for the data
    this_mean = data.mean()
    this_std = data.std()

    # Plot the data, with dashed lines 3 standard deviations
    # above and below the mean
    data.plot(ax=ax)
    ax.axhline(this_mean + this_std * 3, ls='--', c='r')
    ax.axhline(this_mean - this_std * 3, ls='--', c='r')

Visualizing outlier thresholds


Replacing outliers using the threshold

# Center the data so the mean is 0
prices_outlier_centered = prices_outlier_perc - prices_outlier_perc.mean()

# Calculate standard deviation
std = prices_outlier_perc.std()

# Use the absolute value of each datapoint 
# to make it easier to find outliers
outliers = np.abs(prices_outlier_centered) > (std * 3)

# Replace outliers with the median value
# We'll use np.nanmedian since there may be nans around the outliers
prices_outlier_fixed = prices_outlier_centered.copy()
prices_outlier_fixed[outliers] = np.nanmedian(prices_outlier_fixed)
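
The same steps can be traced on a small made-up series where the answer is easy to check by hand. Everything below (the variable names and the toy values) is an assumption for illustration and does not use the course data:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): twenty small values and one extreme point
toy = pd.Series([1.0, -1.0] * 10 + [50.0])

# Center the data so the mean is 0
toy_centered = toy - toy.mean()

# Flag points more than 3 standard deviations from the mean
toy_std = toy.std()
toy_outliers = np.abs(toy_centered) > (toy_std * 3)

# Replace flagged points with the median of the centered data
toy_fixed = toy_centered.copy()
toy_fixed[toy_outliers] = np.nanmedian(toy_fixed)

print(int(toy_outliers.sum()))  # 1, only the final point is flagged
```

Note that the mean and standard deviation are computed with the outlier included, so a very extreme point also inflates the threshold used to detect it.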


Visualize the results

fig, axs = plt.subplots(1, 2, figsize=(10, 5))
prices_outlier_centered.plot(ax=axs[0])
prices_outlier_fixed.plot(ax=axs[1])


Let's practice!
