Improving the features we use for classification

Machine Learning for Time Series Data in Python

Chris Holdgraf

Fellow, Berkeley Institute for Data Science

The auditory envelope

  • Smooth the data to calculate the auditory envelope
  • The envelope is related to the total amount of audio energy present at each moment in time


Smoothing over time

  • Instead of averaging over all time, we can do a local average
  • This is called smoothing your timeseries
  • It removes short-term noise, while retaining the general pattern

Smoothing your data


Calculating a rolling window statistic

# audio is a pandas DataFrame of shape (n_times, n_audio_files)
print(audio.shape)
# (5000, 20)

# Smooth our data by taking the rolling mean in a window of 50 samples
window_size = 50
windowed = audio.rolling(window=window_size)
audio_smooth = windowed.mean()
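
The rolling mean can be tried on synthetic data. The sketch below is illustrative only: the DataFrame is a made-up stand-in for `audio` (a slow sine pattern plus random noise), and the names `slow_pattern`, `err_raw`, and `err_smooth` are hypothetical helpers, not part of the course material.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `audio`: 5000 time points x 20 "files" (made-up data)
rng = np.random.default_rng(0)
n_times, n_files = 5000, 20
time = np.arange(n_times)
slow_pattern = np.sin(2 * np.pi * time / 1000)           # general pattern to keep
noise = rng.normal(scale=0.5, size=(n_times, n_files))   # short-term noise to remove
audio = pd.DataFrame(slow_pattern[:, None] + noise)

window_size = 50
audio_smooth = audio.rolling(window=window_size).mean()

# The first window_size - 1 rows have no full window yet, so they are NaN
n_missing = int(audio_smooth.iloc[:, 0].isna().sum())
print(n_missing)  # 49

# Average distance from the underlying pattern, before and after smoothing
err_raw = np.abs(audio.to_numpy() - slow_pattern[:, None]).mean()
err_smooth = np.nanmean(np.abs(audio_smooth.to_numpy() - slow_pattern[:, None]))
print(err_smooth < err_raw)  # True
```

A larger window removes more noise but also blurs and delays faster changes in the signal, so the window size is a trade-off rather than a fixed rule.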

Calculating the auditory envelope

  • First rectify your audio (take its absolute value), then smooth it

      audio_rectified = audio.apply(np.abs)
      audio_envelope = audio_rectified.rolling(50).mean()
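
Why rectify before smoothing? A minimal sketch on a made-up signal, assuming a 200 Hz tone sampled at a hypothetical 2000 Hz whose loudness ramps up over time: smoothing the raw waveform averages positive and negative swings toward zero, while smoothing the rectified waveform follows the loudness.

```python
import numpy as np
import pandas as pd

# Hypothetical signal: a 200 Hz tone whose loudness ramps from 0 up to 1
sfreq = 2000                        # assumed sampling rate in Hz
time = np.arange(4000) / sfreq      # 2 seconds of samples
amplitude = np.linspace(0, 1, time.size)
audio = pd.DataFrame({"sound": amplitude * np.sin(2 * np.pi * 200 * time)})

# Smoothing the raw signal: positive and negative swings cancel out
smooth_raw = audio.rolling(50).mean()

# Rectify first, then smooth: the result follows the loudness ramp
audio_rectified = audio.apply(np.abs)
audio_envelope = audio_rectified.rolling(50).mean()

print(smooth_raw["sound"].abs().max())             # stays close to zero
print(audio_envelope["sound"].iloc[[999, 3999]])   # grows as the tone gets louder
```

The envelope tracks the amplitude only up to a constant factor, because the mean of a rectified sine wave is smaller than its peak.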
    

Feature engineering the envelope

# Calculate several features of the envelope, one per sound
envelope_mean = np.mean(audio_envelope, axis=0)
envelope_std = np.std(audio_envelope, axis=0)
envelope_max = np.max(audio_envelope, axis=0)

# Create our training data for a classifier
X = np.column_stack([envelope_mean, envelope_std, envelope_max])
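
One detail worth checking is the missing values that a rolling mean leaves at the start of each column. The sketch below uses random numbers as a made-up stand-in for the audio and uses pandas reductions, which skip NaN by default; it illustrates the shapes involved and is not the course's exact computation (for example, pandas' `std` defaults to `ddof=1`).

```python
import numpy as np
import pandas as pd

# Made-up envelope: 5000 time points x 20 sounds, after a 50-sample rolling mean
rng = np.random.default_rng(0)
audio = pd.DataFrame(rng.normal(size=(5000, 20)))
audio_envelope = audio.abs().rolling(50).mean()

# The rolling mean leaves NaN in the first 49 rows; pandas reductions skip them
envelope_mean = audio_envelope.mean(axis=0)
envelope_std = audio_envelope.std(axis=0)
envelope_max = audio_envelope.max(axis=0)

# One row per sound, one column per feature
X = np.column_stack([envelope_mean, envelope_std, envelope_max])
print(X.shape)             # (20, 3)
print(np.isnan(X).any())   # False

# Converting to a plain NumPy array first would propagate the NaNs instead
print(np.isnan(np.mean(audio_envelope.to_numpy(), axis=0)).all())  # True
```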

Preparing our features for scikit-learn

X = np.column_stack([envelope_mean, envelope_std, envelope_max])
y = labels.reshape(-1)  # scikit-learn classifiers expect 1-D labels of shape (n_samples,)

Cross validation for classification

  • cross_val_score automates the process of:
    • Splitting data into training / validation sets
    • Fitting the model on training data
    • Scoring it on validation data
    • Repeating this process
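
The steps above can be written out by hand to see what is being automated. This is a rough sketch on made-up data, assuming stratified 3-fold splitting (scikit-learn's default for classifiers when `cv` is an integer); `manual_scores` and `auto_scores` are hypothetical names used only for the comparison.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Made-up data: 60 samples, 3 features, two roughly separable classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(30, 3)), rng.normal(2, 1, size=(30, 3))])
y = np.repeat([0, 1], 30)

cv = StratifiedKFold(n_splits=3)

manual_scores = []
for train_idx, val_idx in cv.split(X, y):        # split into training / validation
    fold_model = LinearSVC(random_state=0)       # fresh model for every fold
    fold_model.fit(X[train_idx], y[train_idx])   # fit on the training data
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))  # score on validation
manual_scores = np.array(manual_scores)          # one score per repetition

# The same procedure in a single call
auto_scores = cross_val_score(LinearSVC(random_state=0), X, y, cv=cv)
print(manual_scores)
print(auto_scores)
```

Fixing `random_state` keeps the two runs comparable; without it, the solver's internal shuffling could make the fitted models differ slightly between runs.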

Using cross_val_score

from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

model = LinearSVC()
scores = cross_val_score(model, X, y, cv=3)
print(scores)
# [0.60911642 0.59975305 0.61404035]

Auditory features: The Tempogram

  • We can summarize more complex temporal information with timeseries-specific functions
  • librosa is a great library for auditory and timeseries feature engineering
  • Here we'll calculate the tempogram, which estimates the tempo of a sound over time
  • We can calculate summary statistics of tempo in the same way that we can for the envelope

Computing the tempogram

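# Import librosa and estimate the tempo of a 1-D sound array over time
# (aggregate=None gives one estimate per frame instead of a single global value)
import librosa as lr
# Note: in librosa 0.10 and later this function is available as lr.feature.tempo
audio_tempo = lr.beat.tempo(audio, sr=sfreq,
                            hop_length=2**6, aggregate=None)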
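
Once per-frame tempo estimates are available, they can be summarized in the same way as the envelope. The sketch below assumes `audio_tempo` is a 1-D array of tempo values in beats per minute; the numbers are made up for illustration and are not output from librosa.

```python
import numpy as np

# Stand-in for `audio_tempo`: per-frame tempo estimates in beats per minute.
# These values are invented for illustration, not produced by librosa.
audio_tempo = np.array([118.0, 120.0, 120.0, 122.0, 120.0])

# Summary statistics of the tempo, mirroring the envelope features
tempo_mean = np.mean(audio_tempo)
tempo_std = np.std(audio_tempo)
tempo_max = np.max(audio_tempo)

# One row of tempo features for this sound
tempo_features = np.array([tempo_mean, tempo_std, tempo_max])
print(tempo_features.shape)  # (3,)
```

Computed for every sound, these values could be added as extra columns next to the envelope features, for example with `np.column_stack`.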

Let's practice!
