Imputing time-series data

Dealing with Missing Data in Python

Suraj Donthi

Deep Learning & Computer Vision Consultant

Airquality dataset

import pandas as pd
airquality = pd.read_csv('air-quality.csv', parse_dates=['Date'],
                         index_col='Date')

airquality.head()
             Ozone    Solar    Wind    Temp
Date                
1976-05-01    41.0    190.0     7.4    67
1976-05-02    36.0    118.0     8.0    72
1976-05-03    12.0    149.0    12.6    74
1976-05-04    18.0    313.0    11.5    62
1976-05-05     NaN      NaN    14.3    56
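
The loading step can be tried without the file. The sketch below assumes a small in-memory CSV (the name csv_text and the variable sample are illustrative; the rows are copied from the head() output above) and checks that 'Date' becomes a datetime index.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'air-quality.csv': the five rows shown by head().
csv_text = """Date,Ozone,Solar,Wind,Temp
1976-05-01,41.0,190.0,7.4,67
1976-05-02,36.0,118.0,8.0,72
1976-05-03,12.0,149.0,12.6,74
1976-05-04,18.0,313.0,11.5,62
1976-05-05,,,14.3,56
"""

# parse_dates expects a list of column names; index_col makes 'Date' the index.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'],
                     index_col='Date')

print(type(sample.index).__name__)  # DatetimeIndex
print(sample.tail(1))               # empty CSV fields are read as NaN
```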

Airquality dataset

airquality.isnull().sum()
Ozone    37
Solar     7
Wind      0
Temp      0
dtype: int64
airquality.isnull().mean() * 100
Ozone    24.183007
Solar     4.575163
Wind      0.000000
Temp      0.000000
dtype: float64
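
The mean of a boolean mask is the fraction of True values, so .isnull().mean() * 100 gives the percentage of missing entries per column. The figures above are consistent with 153 rows (37 / 153 is about 24.18%). A minimal sketch on a made-up frame (toy and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative data only: two of the four Ozone readings are missing.
toy = pd.DataFrame({
    'Ozone': [41.0, np.nan, 12.0, np.nan],
    'Wind': [7.4, 8.0, 12.6, 11.5],
})

missing_count = toy.isnull().sum()             # number of NaNs per column
missing_percent = toy.isnull().mean() * 100    # share of NaNs per column, in %

print(missing_count)
print(missing_percent)
```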

The .fillna() method

The method parameter of .fillna() can be set to

  • 'ffill' or 'pad'
  • 'bfill' or 'backfill'
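
Recent pandas releases (2.1 and later, as far as I know) deprecate the method parameter of .fillna() in favor of the dedicated .ffill() and .bfill() methods. The sketch below uses those methods on a made-up series (s and its values are illustrative) and should give the same fills as the method= form.

```python
import numpy as np
import pandas as pd

# Illustrative series with two gaps between observed values.
s = pd.Series(
    [37.0, np.nan, np.nan, 29.0, np.nan, 71.0],
    index=pd.date_range('1976-05-31', periods=6, freq='D'),
    name='Ozone',
)

forward = s.ffill()    # counterpart of fillna(method='ffill') / 'pad'
backward = s.bfill()   # counterpart of fillna(method='bfill') / 'backfill'

print(forward.tolist())   # [37.0, 37.0, 37.0, 29.0, 29.0, 71.0]
print(backward.tolist())  # [37.0, 29.0, 29.0, 29.0, 71.0, 71.0]
```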

The ffill method

  • Replace NaNs with the last observed value
  • 'pad' is the same as 'ffill'
airquality.fillna(method='ffill', inplace=True)


airquality['Ozone'][30:40]
Date         Ozone        
1976-05-31    37.0
1976-06-01     NaN
1976-06-02     NaN
1976-06-03     NaN
1976-06-04     NaN
1976-06-05     NaN
1976-06-06     NaN
1976-06-07    29.0
1976-06-08     NaN
1976-06-09    71.0
airquality.fillna(method='ffill', 
                         inplace=True)
airquality['Ozone'][30:40]
Date         Ozone        
1976-05-31    37.0
1976-06-01    37.0
1976-06-02    37.0
1976-06-03    37.0
1976-06-04    37.0
1976-06-05    37.0
1976-06-06    37.0
1976-06-07    29.0
1976-06-08    29.0
1976-06-09    71.0

The bfill method

  • Replace NaNs with the next observed value
  • 'backfill' is the same as 'bfill'
df.fillna(method='bfill', inplace=True)


airquality['Ozone'][30:40]
Date         Ozone        
1976-05-31    37.0
1976-06-01     NaN
1976-06-02     NaN
1976-06-03     NaN
1976-06-04     NaN
1976-06-05     NaN
1976-06-06     NaN
1976-06-07    29.0
1976-06-08     NaN
1976-06-09    71.0
airquality.fillna(method='bfill', 
                         inplace=True)
airquality['Ozone'][30:40]
Date         Ozone        
1976-05-31    37.0
1976-06-01    29.0
1976-06-02    29.0
1976-06-03    29.0
1976-06-04    29.0
1976-06-05    29.0
1976-06-06    29.0
1976-06-07    29.0
1976-06-08    71.0
1976-06-09    71.0
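
One edge case is worth checking: a forward fill has nothing to copy into NaNs at the start of a series, and a backward fill has nothing to copy into NaNs at the end. A minimal sketch on made-up values (s is illustrative); chaining the two fills is one possible workaround, not a recommendation from the slides.

```python
import numpy as np
import pandas as pd

# Illustrative series that starts and ends with a missing value.
s = pd.Series([np.nan, 41.0, np.nan, 36.0, np.nan])

after_ffill = s.ffill()    # the leading NaN has no earlier value and stays NaN
after_bfill = s.bfill()    # the trailing NaN has no later value and stays NaN
both = s.ffill().bfill()   # forward fill first, then back-fill what is left

print(after_ffill.tolist())  # [nan, 41.0, 41.0, 36.0, 36.0]
print(after_bfill.tolist())  # [41.0, 41.0, 36.0, 36.0, nan]
print(both.tolist())         # [41.0, 41.0, 41.0, 36.0, 36.0]
```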

The .interpolate() method

  • The .interpolate() method extends the sequence of observed values to the missing values

The method parameter of .interpolate() can be set to

  • 'linear'
  • 'quadratic'
  • 'nearest'

Linear interpolation

  • Impute linearly, that is, with equally spaced values between the observed points
df.interpolate(method='linear', inplace=True)

[Figure: linear interpolation]



airquality['Ozone'][30:40]
Date         Ozone        
1976-05-31    37.0
1976-06-01     NaN
1976-06-02     NaN
1976-06-03     NaN
1976-06-04     NaN
1976-06-05     NaN
1976-06-06     NaN
1976-06-07    29.0
1976-06-08     NaN
1976-06-09    71.0
airquality.interpolate(
          method='linear', inplace=True)
airquality['Ozone'][30:40]
Date         Ozone        
1976-05-31    37.0
1976-06-01    35.9
1976-06-02    34.7
1976-06-03    33.6
1976-06-04    32.4
1976-06-05    31.3
1976-06-06    30.1
1976-06-07    29.0
1976-06-08    50.0
1976-06-09    71.0
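
The imputed values above follow from simple arithmetic: the gap from 37.0 to 29.0 spans seven daily steps, so each step changes the value by (29.0 - 37.0) / 7, about -1.14. The sketch below rebuilds the ten-row window by hand (idx, ozone, and step are illustrative names) and reproduces the numbers. Note that method='linear' treats rows as equally spaced, which matches a daily index like this one.

```python
import numpy as np
import pandas as pd

# The ten-row window from the slide, rebuilt by hand.
idx = pd.date_range('1976-05-31', periods=10, freq='D')
ozone = pd.Series([37.0] + [np.nan] * 6 + [29.0, np.nan, 71.0],
                  index=idx, name='Ozone')

filled = ozone.interpolate(method='linear')

# Seven equal steps lead from 37.0 (May 31) to 29.0 (June 7).
step = (29.0 - 37.0) / 7

print(round(step, 2))            # -1.14
print(filled.round(1).tolist())
# [37.0, 35.9, 34.7, 33.6, 32.4, 31.3, 30.1, 29.0, 50.0, 71.0]
```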

Quadratic interpolation

  • Impute the values quadratically
df.interpolate(method='quadratic', inplace=True)

[Figure: quadratic interpolation]



airquality['Ozone'][30:39]
             Ozone
Date                
1976-05-31    37.0
1976-06-01     NaN
1976-06-02     NaN
1976-06-03     NaN
1976-06-04     NaN
1976-06-05     NaN
1976-06-06     NaN
1976-06-07    29.0
1976-06-08     NaN
airquality.interpolate(
  method='quadratic', inplace=True)
airquality['Ozone'][30:39]
             Ozone
Date                
1976-05-31    37.0
1976-06-01   -38.4
1976-06-02   -79.4
1976-06-03   -85.9
1976-06-04   -62.4
1976-06-06    -2.8
1976-06-07    29.0
1976-06-08    62.2

Nearest value imputation

  • Impute with the nearest observed value
df.interpolate(method='nearest', inplace=True)

[Figure: nearest value interpolation]



airquality['Ozone'][30:39]
Date         Ozone        
1976-05-31    37.0
1976-06-01     NaN
1976-06-02     NaN
1976-06-03     NaN
1976-06-04     NaN
1976-06-05     NaN
1976-06-06     NaN
1976-06-07    29.0
1976-06-08     NaN
airquality.interpolate(
  method='nearest', inplace=True)
airquality['Ozone'][30:39]
Date         Ozone        
1976-05-31    37.0
1976-06-01    37.0
1976-06-02    37.0
1976-06-03    37.0
1976-06-04    29.0
1976-06-05    29.0
1976-06-06    29.0
1976-06-07    29.0
1976-06-08    29.0
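
To see how the choice of method changes the result, the fills can be placed side by side. The sketch below rebuilds the same ten-row window by hand (idx, ozone, and comparison are illustrative names) and is limited to forward fill, backward fill, and linear interpolation, which need only pandas; the 'quadratic' and 'nearest' options rely on SciPy being installed.

```python
import numpy as np
import pandas as pd

# The ten-row window from the slides, rebuilt by hand.
idx = pd.date_range('1976-05-31', periods=10, freq='D')
ozone = pd.Series([37.0] + [np.nan] * 6 + [29.0, np.nan, 71.0],
                  index=idx, name='Ozone')

# One column per imputation strategy, aligned on the same dates.
comparison = pd.DataFrame({
    'original': ozone,
    'ffill': ozone.ffill(),
    'bfill': ozone.bfill(),
    'linear': ozone.interpolate(method='linear'),
})

print(comparison.round(1))
```

For the single missing day between 29.0 and 71.0 (June 8), the three columns hold 29.0, 71.0, and 50.0 respectively.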

Let's practice!
