Cleaning and improving your data

Machine Learning for Time Series Data in Python

Chris Holdgraf

Fellow, Berkeley Institute for Data Science

Data is messy

Real-world data is often messy
The two most common problems are missing data and outliers
These problems often arise from human error, sensor malfunctions, database failures, and similar causes
Visualizing your raw data makes it easier to spot these problems

What messy data looks like

Interpolation: using time to fill in missing data

A common way to deal with missing data is to interpolate missing values
With timeseries data, you can use time to assist in interpolation.
In this case, interpolation means using the known values on either side of a gap in the data to make assumptions about what's missing.

Interpolation in Pandas

# Return a boolean that notes where missing values are
missing = prices.isna()

# Interpolate linearly within missing windows
prices_interp = prices.interpolate('linear')

# Plot the interpolated data in red and the data w/ missing values in black
ax = prices_interp.plot(c='r')
prices.plot(c='k', ax=ax, lw=2)
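
As a minimal sketch of how time can assist interpolation, the example below compares linear interpolation (which treats points as equally spaced) with pandas' `method='time'` (which weights by the actual timestamps). The series `toy_prices` and its dates are made up for illustration and are not the course's `prices` data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: irregular timestamps with a two-point gap
index = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05", "2020-01-06"])
toy_prices = pd.Series([10.0, np.nan, np.nan, 20.0], index=index)

# Boolean mask that notes where values are missing
missing = toy_prices.isna()

# 'linear' ignores the index and treats the points as equally spaced
by_position = toy_prices.interpolate("linear")

# 'time' uses the DatetimeIndex, so values depend on elapsed time
by_time = toy_prices.interpolate(method="time")

print(by_position[missing])  # roughly 13.33 and 16.67
print(by_time[missing])      # 12.0 and 18.0
```

The two methods agree when timestamps are evenly spaced; they only differ when the sampling is irregular, as in this toy series.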

Visualizing the interpolated data

Using a rolling window to transform data

Another common use of rolling windows is to transform the data
We've already done this once, in order to smooth the data
However, we can also use this to do more complex transformations
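
To make the idea concrete, here is a small sketch of two rolling-window transformations: a rolling mean (smoothing) and a custom function applied to each window. The random-walk series `toy_prices`, the window size of 5, and the `window_range` helper are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical random-walk series standing in for price data
rng = np.random.default_rng(0)
index = pd.date_range("2020-01-01", periods=50, freq="D")
toy_prices = pd.Series(100 + rng.normal(0, 1, size=50).cumsum(), index=index)

# Smoothing: replace each point with the mean of the last 5 points
smoothed = toy_prices.rolling(window=5).mean()

def window_range(window):
    """Hypothetical helper: spread between the max and min of a window."""
    return window.max() - window.min()

# A more complex transformation: apply any function to each window.
# raw=True passes each window as a NumPy array.
rolling_range = toy_prices.rolling(window=5).apply(window_range, raw=True)
```

The first four values of each result are NaN because a full window of 5 points is not yet available.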

Transforming data to standardize variance

A common transformation to apply to data is to standardize its mean and variance over time. There are many ways to do this.
Here, we'll show how to convert your dataset so that each point represents the % change over a previous window.
This makes timepoints more comparable to one another when the absolute values of the data change a lot over time

Transforming to percent change with Pandas

import numpy as np

def percent_change(values):
    """Calculates the % change between the last value 
    and the mean of previous values"""
    # Convert to a NumPy array so that positional indexing works
    # even if pandas passes the window in as a Series
    values = np.asarray(values)

    # Separate the last value and all previous values into variables
    previous_values = values[:-1]
    last_value = values[-1]

    # Calculate the % difference between the last value 
    # and the mean of earlier values
    previous_mean = np.mean(previous_values)
    return (last_value - previous_mean) / previous_mean
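
A quick worked example may help check the arithmetic. The sketch below restates the function so it runs on its own, then applies it to a single window and to a short rolling window. The toy numbers and the window size of 4 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def percent_change(values):
    """% change between the last value and the mean of previous values."""
    values = np.asarray(values, dtype=float)
    previous_mean = np.mean(values[:-1])
    return (values[-1] - previous_mean) / previous_mean

# Single window: previous mean is 10, last value is 12
single = percent_change([10, 10, 10, 12])
print(single)  # 0.2, i.e. a 20% increase

# Rolling version on a hypothetical toy series
toy = pd.Series([10.0, 10.0, 10.0, 12.0, 11.0])
rolled = toy.rolling(window=4).apply(percent_change, raw=True)
print(rolled)  # NaN, NaN, NaN, 0.2, 0.03125
```

For the final point, the previous three values (10, 10, 12) average to about 10.67, so 11 is roughly 3% above that mean.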

Applying this to our data

# Plot the raw data
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
ax = prices.plot(ax=axs[0])

# Calculate % change and plot
ax = prices.rolling(window=20).aggregate(percent_change).plot(ax=axs[1])
ax.legend_.set_visible(False)

Finding outliers in your data

Outliers are datapoints that are statistically significantly different from the rest of the dataset.
They can reduce the predictive power of your model by biasing its estimates away from their "true" values
One solution is to remove or replace outliers with a more representative value

Be very careful about doing this - it is often difficult to tell a legitimately extreme value from an aberration

Plotting a threshold on our data

fig, axs = plt.subplots(1, 2, figsize=(10, 5))
for data, ax in zip([prices, prices_perc_change], axs):
    # Calculate the mean / standard deviation for the data
    this_mean = data.mean()
    this_std = data.std()

    # Plot the data, with a window that is 3 standard deviations 
    # around the mean
    data.plot(ax=ax)
    ax.axhline(this_mean + this_std * 3, ls='--', c='r')
    ax.axhline(this_mean - this_std * 3, ls='--', c='r')
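
The same threshold can be computed without plotting. This is a minimal sketch on synthetic data: the series `toy`, the injected value of 15, and the factor of 3 standard deviations are assumptions for illustration, not a recommended rule for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: standard normal noise with one injected extreme value
rng = np.random.default_rng(42)
toy = pd.Series(rng.normal(0, 1, size=200))
toy.iloc[50] = 15.0

# Bounds that sit 3 standard deviations above and below the mean
this_mean = toy.mean()
this_std = toy.std()
upper = this_mean + this_std * 3
lower = this_mean - this_std * 3

# Flag every point that falls outside the window
outside = (toy > upper) | (toy < lower)
print(toy[outside])
```

Note that the extreme value itself inflates the standard deviation, so the window is wider than it would be for the clean data alone.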

Visualizing outlier thresholds

Replacing outliers using the threshold

# Center the data so the mean is 0
prices_outlier_centered = prices_outlier_perc - prices_outlier_perc.mean()

# Calculate standard deviation
std = prices_outlier_perc.std()

# Use the absolute value of each datapoint 
# to make it easier to find outliers
outliers = np.abs(prices_outlier_centered) > (std * 3)

# Replace outliers with the median value
# We'll use np.nanmedian since there may be nans around the outliers
prices_outlier_fixed = prices_outlier_centered.copy()
prices_outlier_fixed[outliers] = np.nanmedian(prices_outlier_fixed)
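
Here is a self-contained sketch of the same replacement steps on synthetic data, so the effect can be checked directly. The series `toy_change`, the injected value at position 10, and the leading NaN (mimicking the start of a rolling window) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical percent-change series: small noise, one spike, one leading NaN
rng = np.random.default_rng(1)
toy_change = pd.Series(rng.normal(0, 0.01, size=100))
toy_change.iloc[0] = np.nan
toy_change.iloc[10] = 0.5

# Center the data so the mean is 0 (pandas skips NaNs by default)
centered = toy_change - toy_change.mean()
std = toy_change.std()

# Comparisons with NaN evaluate to False, so NaNs are never flagged
outliers = np.abs(centered) > (std * 3)

# Replace flagged points with the NaN-aware median of the centered data
fixed = centered.copy()
fixed[outliers] = np.nanmedian(centered.to_numpy())

print(outliers.sum())  # number of points that were replaced
```

Only the flagged points change; the NaN at the start is left as-is and would still need to be handled separately (for example by interpolation).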

Visualize the results

fig, axs = plt.subplots(1, 2, figsize=(10, 5))
prices_outlier_centered.plot(ax=axs[0])
prices_outlier_fixed.plot(ax=axs[1])

Let's practice!
