Combining timeseries data with machine learning

Machine Learning for Time Series Data in Python

Chris Holdgraf

Fellow, Berkeley Institute for Data Science

Getting to know our data

The datasets that we'll use in this course are all freely available online
There are many datasets available to download on the web; the ones we'll use come from Kaggle

The Heartbeat Acoustic Data

Many recordings of heart sounds from different patients
Some had normally-functioning hearts, others had abnormalities
Data comes in the form of audio files + labels for each file
Can we find the "abnormal" heartbeats?

Loading auditory data

from glob import glob
files = glob('data/heartbeat-sounds/proc/files/*.wav')

print(files)

['data/heartbeat-sounds/proc/files/murmur__201101051104.wav',
 ...
 'data/heartbeat-sounds/proc/files/murmur__201101051114.wav']
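
glob() returns paths in an arbitrary, filesystem-dependent order, so it can help to sort them before pairing files with labels. The sketch below is a minimal, self-contained illustration: it creates empty placeholder files in a temporary directory (the directory and the extra notes.txt file are hypothetical; the .wav names mimic the listing above), then globs and sorts them.

```python
import os
import tempfile
from glob import glob

# Hypothetical stand-in for the data folder: a temporary directory
# holding empty placeholder files named like the listing above.
tmp_dir = tempfile.mkdtemp()
for name in ['murmur__201101051114.wav', 'murmur__201101051104.wav', 'notes.txt']:
    open(os.path.join(tmp_dir, name), 'w').close()

# The '*.wav' pattern keeps only WAV files; sorting makes the order reproducible.
files = sorted(glob(os.path.join(tmp_dir, '*.wav')))
names = [os.path.basename(f) for f in files]
print(names)
```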

Reading in auditory data

import librosa as lr
# `load` accepts a path to an audio file
audio, sfreq = lr.load('data/heartbeat-sounds/proc/files/murmur__201101051104.wav')

print(sfreq)

In this case, the sampling frequency is 22050, meaning there are 22050 samples per second (by default, `load` resamples audio to 22050 Hz; pass sr=None to keep the file's native rate)
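
To make "samples per second" concrete without depending on the heartbeat files, the sketch below swaps in Python's standard-library wave module in place of librosa (this is not the course's method). It writes two seconds of silence at an assumed example rate of 8000 Hz to a temporary file, then reads back the sampling frequency and derives the duration from the sample count.

```python
import os
import tempfile
import wave

# Assumed values for illustration only (not taken from the heartbeat data).
example_sfreq = 8000            # samples per second
n_samples = 2 * example_sfreq   # two seconds of audio

path = os.path.join(tempfile.mkdtemp(), 'silence.wav')
with wave.open(path, 'wb') as wav_out:
    wav_out.setnchannels(1)              # mono
    wav_out.setsampwidth(2)              # 16-bit samples
    wav_out.setframerate(example_sfreq)
    wav_out.writeframes(b'\x00\x00' * n_samples)

with wave.open(path, 'rb') as wav_in:
    sfreq = wav_in.getframerate()   # sampling frequency stored in the file
    n_read = wav_in.getnframes()    # number of samples per channel

duration = n_read / sfreq  # length of the recording in seconds
print(sfreq, n_read, duration)
```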

Inferring time from samples

If we know the sampling rate of a timeseries, then we know the timestamp of each datapoint relative to the first datapoint
Note: this assumes the sampling rate is fixed and no data points are lost

Creating a time array (I)

Create an array of indices, one for each sample, and divide by the sampling frequency
```
  import numpy as np
  indices = np.arange(len(audio))
  time = indices / sfreq
```

Creating a time array (II)

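Find the timestamp of the last data point (index N-1). Then use linspace() to generate N evenly spaced time points from zero to that time
```
  final_time = (len(audio) - 1) / sfreq
  # The third argument is the number of points: one per sample
  time = np.linspace(0, final_time, len(audio))
```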
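
A quick way to check that the index-based and linspace-based constructions agree is to build both for a small array, asking linspace for one point per sample. The sketch below uses an assumed stand-in recording (10 samples at 5 samples per second) in place of real audio.

```python
import numpy as np

# Assumed stand-in for a loaded recording: 10 samples at 5 samples per second.
sfreq = 5
audio = np.zeros(10)

# Approach (I): one index per sample, divided by the sampling frequency.
time_from_indices = np.arange(len(audio)) / sfreq

# Approach (II): evenly spaced points from 0 to the last sample's timestamp.
final_time = (len(audio) - 1) / sfreq
time_from_linspace = np.linspace(0, final_time, len(audio))

# Both arrays hold one timestamp (in seconds) per sample.
print(np.allclose(time_from_indices, time_from_linspace))
```

Either array can then serve as the x-axis when plotting the audio against time.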

The New York Stock Exchange dataset

This dataset consists of company stock values for 10 years
Can we detect any patterns in historical records that allow us to predict the value of companies in the future?

Looking at the data

import pandas as pd
data = pd.read_csv('path/to/data.csv')

data.columns

Index(['date', 'symbol', 'close', 'volume'], dtype='object')

data.head()

         date symbol       close       volume
0  2010-01-04   AAPL  214.009998  123432400.0
1  2010-01-04    ABT   54.459951   10829000.0
2  2010-01-04    AIG   29.889999    7750900.0
3  2010-01-04   AMAT   14.300000   18615100.0
4  2010-01-04   ARNC   16.650013   11512100.0

Timeseries with Pandas DataFrames

We can investigate the object type of each column by accessing the dtypes attribute

df.dtypes

0    object
1    object
2    object
dtype: object
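
As a minimal sketch of the same check on the stock table, the snippet below types in the first three rows shown earlier by hand and inspects the column types. Dates read from text are stored as strings (reported as object in many pandas versions), so the helper functions from pd.api.types are used here to test the types without relying on the exact dtype name.

```python
import pandas as pd

# First rows of the table shown earlier, typed in by hand for illustration.
data = pd.DataFrame({
    'date': ['2010-01-04', '2010-01-04', '2010-01-04'],
    'symbol': ['AAPL', 'ABT', 'AIG'],
    'close': [214.009998, 54.459951, 29.889999],
    'volume': [123432400.0, 10829000.0, 7750900.0],
})

# One dtype per column; dates given as strings are not datetimes yet.
print(data.dtypes)
date_is_datetime = pd.api.types.is_datetime64_any_dtype(data['date'])
close_is_numeric = pd.api.types.is_numeric_dtype(data['close'])
print(date_is_datetime, close_is_numeric)
```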

Converting a column to a time series

To ensure that a column within a DataFrame is treated as a time series, convert it with the pd.to_datetime() function

df['date'] = pd.to_datetime(df['date'])

df['date']

0   2017-01-01
1   2017-01-02
2   2017-01-03
Name: date, dtype: datetime64[ns]
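
One reason the conversion matters is that datetime columns support date-aware comparisons and slicing. The sketch below illustrates this on a hypothetical frame: the dates match the output above, while the value column is made up for the example.

```python
import pandas as pd

# Hypothetical frame: the dates match the output above, the values are made up.
df = pd.DataFrame({
    'date': ['2017-01-01', '2017-01-02', '2017-01-03'],
    'value': [1.0, 2.0, 3.0],
})
df['date'] = pd.to_datetime(df['date'])

# Datetime columns support comparisons against timestamps...
recent = df[df['date'] >= pd.Timestamp('2017-01-02')]

# ...and, once set as the index, date-based slicing with .loc
# (both endpoints of the slice are included).
indexed = df.set_index('date')
first_two_days = indexed.loc['2017-01-01':'2017-01-02']
print(len(recent), len(first_two_days))
```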

Let's practice!
