Combining timeseries data with machine learning

Machine Learning for Time Series Data in Python

Chris Holdgraf

Fellow, Berkeley Institute for Data Science

Getting to know our data

  • The datasets that we'll use in this course are all freely available online
  • Many datasets are available to download on the web; the ones we'll use come from Kaggle

The Heartbeat Acoustic Data

  • Many recordings of heart sounds from different patients
  • Some had normally functioning hearts, others had abnormalities
  • Data comes in the form of audio files + labels for each file
  • Can we find the "abnormal" heartbeats?

Loading auditory data

from glob import glob
files = glob('data/heartbeat-sounds/proc/files/*.wav')

print(files)
['data/heartbeat-sounds/proc/files/murmur__201101051104.wav',
 ...
 'data/heartbeat-sounds/proc/files/murmur__201101051114.wav']
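
glob() makes no guarantee about the order of the returned paths, so sorting them keeps later steps reproducible. A minimal sketch, using a temporary directory and made-up file names rather than the real heartbeat files:

```python
import os
import tempfile
from glob import glob

# Hypothetical file names, created empty in a temporary directory
# purely for illustration (they are not the course files)
names = ['murmur_b.wav', 'normal_a.wav', 'murmur_a.wav', 'notes.txt']

with tempfile.TemporaryDirectory() as tmpdir:
    for name in names:
        open(os.path.join(tmpdir, name), 'w').close()

    # Match only the .wav files, then sort for a stable order
    files = sorted(glob(os.path.join(tmpdir, '*.wav')))
    basenames = [os.path.basename(f) for f in files]

print(basenames)
```

The same sorted(glob(...)) pattern applies unchanged to the heartbeat directory used above.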

Reading in auditory data

import librosa as lr
# `load` accepts a path to an audio file
audio, sfreq = lr.load('data/heartbeat-sounds/proc/files/murmur__201101051104.wav')

print(sfreq)
2205

In this case, the sampling frequency is 2205, meaning there are 2205 samples per second


Inferring time from samples

  • If we know the sampling rate of a timeseries, then we know the timestamp of each datapoint relative to the first datapoint
  • Note: this assumes the sampling rate is fixed and no data points are lost
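
As a small worked example of the point above, the timestamp of sample i is i / sfreq seconds after the first sample. The sketch below uses the 2205 samples-per-second rate reported earlier; the helper name sample_to_time is hypothetical.

```python
def sample_to_time(index, sfreq):
    """Return the time in seconds of a sample relative to the first sample.

    Assumes a fixed sampling rate and no dropped samples.
    """
    return index / sfreq


sfreq = 2205  # samples per second, as reported earlier

first = sample_to_time(0, sfreq)          # the first sample sits at time zero
one_second = sample_to_time(2205, sfreq)  # 2205 samples later is one second
print(first, one_second)
```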

Creating a time array (I)

  • Create an array of indices, one for each sample, and divide by the sampling frequency

      import numpy as np
      indices = np.arange(0, len(audio))
      time = indices / sfreq
    

Creating a time array (II)

  • Find the timestamp of the final data point, which is (N - 1) / sfreq for N samples. Then use linspace() to create N evenly spaced points from zero to that time

      final_time = (len(audio) - 1) / sfreq
      time = np.linspace(0, final_time, len(audio))
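
The two approaches give the same time array when linspace() is asked for one point per sample. A quick check on a synthetic signal, assuming NumPy is available; the array audio here is made-up zeros standing in for a real recording:

```python
import numpy as np

sfreq = 2205                 # samples per second, as reported earlier
audio = np.zeros(3 * sfreq)  # stand-in signal: 3 seconds of silence

# Approach I: sample indices divided by the sampling frequency
time_a = np.arange(0, len(audio)) / sfreq

# Approach II: evenly spaced points from zero to the final timestamp,
# with one point per sample
final_time = (len(audio) - 1) / sfreq
time_b = np.linspace(0, final_time, len(audio))

print(np.allclose(time_a, time_b))
```

Passing the number of samples as the third argument of linspace() is what keeps the spacing equal to 1 / sfreq.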
    

The New York Stock Exchange dataset

  • This dataset consists of company stock values for 10 years
  • Can we detect any patterns in historical records that allow us to predict the value of companies in the future?

Looking at the data

import pandas as pd
data = pd.read_csv('path/to/data.csv')

data.columns
Index(['date', 'symbol', 'close', 'volume'], dtype='object')
data.head()
         date symbol       close       volume
0  2010-01-04   AAPL  214.009998  123432400.0
1  2010-01-04    ABT   54.459951   10829000.0
2  2010-01-04    AIG   29.889999    7750900.0
3  2010-01-04   AMAT   14.300000   18615100.0
4  2010-01-04   ARNC   16.650013   11512100.0

Timeseries with Pandas DataFrames

  • We can investigate the object type of each column by accessing the dtypes attribute
df['date'].dtypes
0    object
1    object
2    object
dtype: object

Converting a column to a time series

  • To ensure that a column within a DataFrame is treated as a time series, use the to_datetime() function
df['date'] = pd.to_datetime(df['date'])

df['date']
0   2017-01-01
1   2017-01-02
2   2017-01-03
Name: date, dtype: datetime64[ns]
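
Once a column holds datetimes, it can be used for date-based selection. A minimal sketch with a made-up three-row DataFrame (the values are illustrative, not from the course data), assuming pandas is available:

```python
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({
    'date': ['2017-01-01', '2017-01-02', '2017-01-03'],
    'close': [10.0, 10.5, 10.2],
})

# Before conversion the dates are plain strings, not datetimes
was_datetime = pd.api.types.is_datetime64_any_dtype(df['date'])

# Convert the strings to datetimes
df['date'] = pd.to_datetime(df['date'])
is_datetime = pd.api.types.is_datetime64_any_dtype(df['date'])

# A datetime column can be compared against a timestamp
recent = df[df['date'] >= pd.Timestamp('2017-01-02')]
print(was_datetime, is_datetime, len(recent))
```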

Let's practice!
