Cleaning and improving your data

Machine Learning for Time Series Data in Python

Chris Holdgraf

Fellow, Berkeley Institute for Data Science

Data is messy

  • Real-world data is often messy
  • The two most common problems are missing data and outliers
  • These problems often arise from human error, sensor malfunctions, database failures, etc.
  • Visualizing your raw data makes it easier to spot these problems
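
Alongside a plot, a quick count of missing values and a glance at summary statistics can flag the same problems. A minimal sketch, assuming a small hypothetical `prices` Series built here only for illustration (the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices with two missing values and one suspicious spike
index = pd.date_range("2020-01-01", periods=8, freq="D")
prices = pd.Series([10.0, 10.2, np.nan, 10.1, 55.0, np.nan, 10.3, 10.4],
                   index=index, name="price")

# Count missing values and list the timestamps where they occur
n_missing = int(prices.isna().sum())
missing_dates = prices.index[prices.isna()]

# Summary statistics: a max far above the median hints at an outlier
summary = prices.describe()
print(n_missing)
print(list(missing_dates.strftime("%Y-%m-%d")))
print(summary[["min", "50%", "max"]])
```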

What messy data looks like

Interpolation: using time to fill in missing data

  • A common way to deal with missing data is to interpolate missing values
  • With timeseries data, you can use time to assist in interpolation.
  • In this case, interpolation means using the known values on either side of a gap in the data to make assumptions about what's missing.
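
To illustrate how time can assist interpolation, here is a minimal sketch comparing two methods of pandas' `interpolate` on unevenly spaced timestamps. The data are hypothetical and chosen so the difference is easy to check by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical, unevenly spaced observations with one missing value
index = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05"])
values = pd.Series([0.0, np.nan, 4.0], index=index)

# 'linear' ignores the index and treats points as equally spaced
by_position = values.interpolate(method="linear")

# 'time' weights the estimate by the elapsed time between observations
by_time = values.interpolate(method="time")

print(by_position.iloc[1])  # 2.0: halfway between 0 and 4
print(by_time.iloc[1])      # 1.0: one day into a four-day gap
```

When observations are evenly spaced, the two methods give the same result, so the choice mainly matters for irregular sampling.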

Interpolation in Pandas

# Create a boolean mask that marks where values are missing
missing = prices.isna()

# Interpolate linearly within missing windows
prices_interp = prices.interpolate('linear')

# Plot the interpolated data in red and the data w/ missing values in black
ax = prices_interp.plot(c='r')
prices.plot(c='k', ax=ax, lw=2)

Visualizing the interpolated data

Using a rolling window to transform data

  • Another common use of rolling windows is to transform the data
  • We've already done this once, in order to smooth the data
  • However, we can also use this to do more complex transformations
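
As a rough sketch of the difference between smoothing and a more general transformation, the snippet below applies a rolling mean and then a custom function to each window. The series and the choice of "range within the window" are illustrative assumptions, not part of the course data:

```python
import numpy as np
import pandas as pd

# Hypothetical series, used only for illustration
values = pd.Series([1.0, 3.0, 2.0, 6.0, 4.0])

# Smoothing: the mean of each 3-point window
smoothed = values.rolling(window=3).mean()

# A custom transformation: the range (max - min) within each window.
# raw=True passes each window to the function as a NumPy array.
window_range = values.rolling(window=3).apply(
    lambda window: window.max() - window.min(), raw=True
)

print(smoothed.tolist())      # first two entries are NaN (incomplete windows)
print(window_range.tolist())
```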

Transforming data to standardize variance

  • A common transformation to apply to data is to standardize its mean and variance over time. There are many ways to do this.
  • Here, we'll show how to convert your dataset so that each point represents the % change over a previous window.
  • This makes timepoints more comparable to one another if the absolute values of data change a lot

Transforming to percent change with Pandas

import numpy as np

def percent_change(values):
    """Calculate the % change between the last value
    and the mean of the previous values."""
    # Convert to an array so positional indexing also works for a Series
    values = np.asarray(values)

    # Separate the last value and all previous values into variables
    previous_values = values[:-1]
    last_value = values[-1]

    # Calculate the % difference between the last value
    # and the mean of earlier values
    previous_mean = np.mean(previous_values)
    return (last_value - previous_mean) / previous_mean
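
To make the calculation concrete, here is a small worked example. It repeats a compact version of the function above so the snippet runs on its own; the input values are made up, and passing `raw=True` to `rolling(...).apply` is one possible way to hand each window over as a NumPy array:

```python
import numpy as np
import pandas as pd

def percent_change(values):
    """% change between the last value and the mean of the previous values."""
    values = np.asarray(values)
    previous_mean = np.mean(values[:-1])
    return (values[-1] - previous_mean) / previous_mean

# Previous values average 10, last value is 12 -> (12 - 10) / 10 = 0.2
single = percent_change([9.0, 10.0, 11.0, 12.0])

# Applied over a rolling window of 4 points (raw=True passes NumPy arrays)
series = pd.Series([9.0, 10.0, 11.0, 12.0, 11.0])
rolled = series.rolling(window=4).apply(percent_change, raw=True)

print(single)
print(rolled.tolist())  # the first three entries are NaN (incomplete windows)
```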

Applying this to our data

# Plot the raw data
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
ax = prices.plot(ax=axs[0])

# Calculate % change, keep it for later use, and plot without a legend
prices_perc_change = prices.rolling(window=20).aggregate(percent_change)
ax = prices_perc_change.plot(ax=axs[1], legend=False)

Finding outliers in your data

  • Outliers are datapoints that differ significantly, in a statistical sense, from the rest of the dataset.
  • They can hurt the predictive power of your model, biasing it away from its "true" value.
  • One solution is to remove outliers or replace them with a more representative value.

Be very careful about doing this: it is often difficult to tell a legitimately extreme value from an aberration.
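
A small sketch, using made-up numbers, of how a single extreme point can bias a statistic: the mean shifts noticeably while the median barely moves. This is only an illustration of the effect, not a test for whether a value is a genuine outlier:

```python
import numpy as np

# Hypothetical measurements; the appended value is an extreme point
clean = np.array([10.0, 11.0, 9.0, 10.0, 10.0])
with_outlier = np.append(clean, 100.0)

mean_shift = with_outlier.mean() - clean.mean()
median_shift = np.median(with_outlier) - np.median(clean)

print(mean_shift)    # 15.0
print(median_shift)  # 0.0
```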

Plotting a threshold on our data

fig, axs = plt.subplots(1, 2, figsize=(10, 5))
for data, ax in zip([prices, prices_perc_change], axs):
    # Calculate the mean / standard deviation for the data
    this_mean = data.mean()
    this_std = data.std()

    # Plot the data, with a window that is 3 standard deviations 
    # around the mean
    data.plot(ax=ax)
    ax.axhline(this_mean + this_std * 3, ls='--', c='r')
    ax.axhline(this_mean - this_std * 3, ls='--', c='r')

Visualizing outlier thresholds

Replacing outliers using the threshold

# Center the data so the mean is 0
prices_outlier_centered = prices_outlier_perc - prices_outlier_perc.mean()

# Calculate standard deviation
std = prices_outlier_perc.std()

# Use the absolute value of each datapoint 
# to make it easier to find outliers
outliers = np.abs(prices_outlier_centered) > (std * 3)

# Replace outliers with the median value
# We'll use np.nanmedian since there may be nans around the outliers
prices_outlier_fixed = prices_outlier_centered.copy()
prices_outlier_fixed[outliers] = np.nanmedian(prices_outlier_fixed)
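
The same steps can be tried end to end on synthetic data. The sketch below assumes a hypothetical percent-change series (`perc`) generated from random noise with one injected spike; the names and numbers are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical percent-change data: small fluctuations plus one spike
rng = np.random.default_rng(0)
perc = pd.Series(rng.normal(loc=0.0, scale=0.01, size=200))
perc.iloc[50] = 0.5  # injected outlier

# Center the data and flag points more than 3 standard deviations out
centered = perc - perc.mean()
outliers = centered.abs() > 3 * perc.std()

# Replace flagged points with the median of the centered data
fixed = centered.copy()
fixed[outliers] = np.nanmedian(centered)

print(int(outliers.sum()))
print(fixed.abs().max() < centered.abs().max())
```

Note that the standard deviation here is computed with the spike included, so the threshold is itself inflated by the outlier; with several extreme points this can hide some of them.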

Visualize the results

fig, axs = plt.subplots(1, 2, figsize=(10, 5))
prices_outlier_centered.plot(ax=axs[0])
prices_outlier_fixed.plot(ax=axs[1])

Let's practice!
