Combining timeseries data with machine learning

Machine Learning for Time Series Data in Python

Chris Holdgraf

Fellow, Berkeley Institute for Data Science

Getting to know our data

  • The datasets that we'll use in this course are all freely available online
  • There are many datasets available to download on the web; the ones we'll use come from Kaggle

The Heartbeat Acoustic Data

  • Many recordings of heart sounds from different patients
  • Some had normally-functioning hearts, others had abnormalities
  • Data comes in the form of audio files + labels for each file
  • Can we find the "abnormal" heartbeats?

Loading auditory data

from glob import glob
files = glob('data/heartbeat-sounds/proc/files/*.wav')

print(files)
['data/heartbeat-sounds/proc/files/murmur__201101051104.wav',
 ...
 'data/heartbeat-sounds/proc/files/murmur__201101051114.wav']
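
The file-listing step above can be tried without the real dataset. This is a minimal sketch that assumes a temporary folder with made-up placeholder file names (they are not part of the heartbeat data). It also sorts the result, since glob() makes no promise about the order in which matches come back.

```python
import os
import tempfile
from glob import glob

# Hypothetical folder with a few empty placeholder files (not the real dataset)
folder = tempfile.mkdtemp()
for name in ["murmur__b.wav", "murmur__a.wav", "notes.txt"]:
    open(os.path.join(folder, name), "w").close()

# Only paths matching the pattern are returned; sort them for a stable order
files = sorted(glob(os.path.join(folder, "*.wav")))
names = [os.path.basename(f) for f in files]
print(names)
```

Sorting is optional, but it keeps the file order the same between runs, which helps when labels are matched to files by position.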

Reading in auditory data

import librosa as lr
# `load` accepts a path to an audio file
audio, sfreq = lr.load('data/heartbeat-sounds/proc/files/murmur__201101051104.wav')

print(sfreq)
2205

In this case, the sampling frequency is 2205 Hz, meaning there are 2205 samples for every second of audio
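
To see what "samples per second" means in practice, here is a rough sketch that uses Python's built-in wave module instead of librosa (a swapped-in tool, chosen only so the example runs without extra packages). The synthetic tone, its 2-second length, and the file name are assumptions for illustration; only the 2205 value comes from the slide.

```python
import math
import os
import struct
import tempfile
import wave

sfreq = 2205          # sampling frequency in Hz (value taken from the slide)
duration_s = 2.0      # hypothetical clip length in seconds
n_samples = int(sfreq * duration_s)

# Synthetic 50 Hz tone as 16-bit integers (a stand-in for a real recording)
samples = [int(10000 * math.sin(2 * math.pi * 50 * i / sfreq))
           for i in range(n_samples)]

path = os.path.join(tempfile.mkdtemp(), "synthetic.wav")
with wave.open(path, "wb") as wav_out:
    wav_out.setnchannels(1)       # mono
    wav_out.setsampwidth(2)       # 2 bytes per sample (16 bit)
    wav_out.setframerate(sfreq)   # stored in the file header
    wav_out.writeframes(struct.pack(f"<{n_samples}h", *samples))

# Read the header back: the sampling rate and the number of samples
with wave.open(path, "rb") as wav_in:
    read_sfreq = wav_in.getframerate()
    read_n_samples = wav_in.getnframes()

# Duration follows directly from samples divided by samples-per-second
clip_seconds = read_n_samples / read_sfreq
print(read_sfreq, read_n_samples, clip_seconds)
```

The number of samples divided by the sampling frequency gives the clip length in seconds, which is the relationship the following slides rely on.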

Inferring time from samples

  • If we know the sampling rate of a timeseries, then we know the timestamp of each datapoint relative to the first datapoint
  • Note: this assumes the sampling rate is fixed and no data points are lost

Creating a time array (I)

  • Create an array of indices, one for each sample, and divide by the sampling frequency

      import numpy as np
      indices = np.arange(0, len(audio))
      time = indices / sfreq
    

Creating a time array (II)

  • Find the timestamp of the last data point (index N-1). Then use linspace() to create N evenly spaced values from zero to that time

      final_time = (len(audio) - 1) / sfreq
      # The third argument is the number of points: one per sample
      time = np.linspace(0, final_time, len(audio))
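
As a quick check, the two ways of building a time array should agree when linspace() is asked for one point per sample. This is a minimal sketch assuming a made-up 3-second array of zeros in place of a real recording.

```python
import numpy as np

sfreq = 2205                      # samples per second, as on the earlier slide
audio = np.zeros(3 * sfreq)       # hypothetical 3-second signal (silence)

# Approach I: sample index divided by the sampling frequency
time_from_indices = np.arange(len(audio)) / sfreq

# Approach II: evenly spaced points from 0 to the last sample's timestamp
final_time = (len(audio) - 1) / sfreq
time_from_linspace = np.linspace(0, final_time, len(audio))

# Both arrays have one timestamp per sample and match up to rounding error
print(time_from_indices.shape == audio.shape)
print(np.allclose(time_from_indices, time_from_linspace))
```

Note that the last timestamp is (N-1) / sfreq rather than N / sfreq, because the first sample sits at time zero.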
    

The New York Stock Exchange dataset

  • This dataset contains 10 years of stock values for a number of companies
  • Can we detect any patterns in historical records that allow us to predict the value of companies in the future?

Looking at the data

import pandas as pd
data = pd.read_csv('path/to/data.csv')

data.columns
Index(['date', 'symbol', 'close', 'volume'], dtype='object')
data.head()
         date symbol       close       volume
0  2010-01-04   AAPL  214.009998  123432400.0
1  2010-01-04    ABT   54.459951   10829000.0
2  2010-01-04    AIG   29.889999    7750900.0
3  2010-01-04   AMAT   14.300000   18615100.0
4  2010-01-04   ARNC   16.650013   11512100.0

Timeseries with Pandas DataFrames

  • We can inspect a column's data type through its dtypes attribute; before conversion, the date column holds generic Python objects (strings)
df['date'].dtypes
dtype('O')

Converting a column to a time series

  • To ensure that a column within a DataFrame is treated as a time series, use the pd.to_datetime() function
df['date'] = pd.to_datetime(df['date'])

df['date']
0   2017-01-01
1   2017-01-02
2   2017-01-03
Name: date, dtype: datetime64[ns]
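
The conversion can be tried end to end on a tiny table. This sketch assumes an in-memory CSV built from the first three rows shown on the "Looking at the data" slide. The variable names are illustrative, and the .dt.year step is an extra added here to show what the datetime type makes possible.

```python
import io

import pandas as pd

# Three rows copied from the stock data preview, held in memory as CSV text
csv_text = """date,symbol,close,volume
2010-01-04,AAPL,214.009998,123432400.0
2010-01-04,ABT,54.459951,10829000.0
2010-01-04,AIG,29.889999,7750900.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Straight after reading, the date column is text, not a datetime type
was_datetime = pd.api.types.is_datetime64_any_dtype(df["date"])

# Convert the column in place
df["date"] = pd.to_datetime(df["date"])
is_datetime = pd.api.types.is_datetime64_any_dtype(df["date"])

# Datetime columns expose date parts through the .dt accessor
years = df["date"].dt.year.tolist()
print(was_datetime, is_datetime, years)
```

Once the column is a datetime type, operations such as sorting by date or extracting the year work on real timestamps rather than on strings.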

Let's practice!
