Missing Values#

Section Title: Misiing Values

We will use the following dataset throughout this chapter.

import pandas as pd
df_stock = pd.read_excel('https://raw.githubusercontent.com/datasmp/datasets/main/stock.xlsx', index_col=0)
df_stock.head()
APPLE TESLA AMAZON VISA SP500
Date
2020-01-02 74.33 86.05 1898.01 189.66 3257.85
2020-01-03 73.61 88.60 1874.97 188.15 3234.85
2020-01-06 74.20 90.31 1902.88 187.74 3246.28
2020-01-07 73.85 93.81 1906.86 187.24 3237.18
2020-01-08 75.04 98.43 1891.97 190.45 3253.05

Now, we will replace some values in the dataframe with missing values (NaN).

import random
import numpy as np

random.seed(0)

r = 0.4
a,b = df_stock.shape
missing = int(a*b*r)
for i in range( missing ):
    k = random.randint(2,b-1)
    p = random.randint(0,a-1)
    df_stock.iloc[p,k] = np.nan
df_stock.head(10)
APPLE TESLA AMAZON VISA SP500
Date
2020-01-02 74.33 86.05 1898.01 NaN 3257.85
2020-01-03 73.61 88.60 1874.97 188.15 3234.85
2020-01-06 74.20 90.31 NaN 187.74 3246.28
2020-01-07 73.85 93.81 NaN 187.24 NaN
2020-01-08 75.04 98.43 1891.97 190.45 3253.05
2020-01-09 76.63 96.27 NaN 191.77 NaN
2020-01-10 76.80 95.63 1883.16 192.29 3265.35
2020-01-13 78.44 104.97 1891.30 193.83 NaN
2020-01-14 77.39 107.58 NaN NaN 3283.15
2020-01-15 77.05 103.70 NaN NaN NaN
df_stock.info()
<class 'pandas.core.frame.DataFrame'>
Index: 252 entries, 2020-01-02 to 2020-12-30
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   APPLE   252 non-null    float64
 1   TESLA   252 non-null    float64
 2   AMAZON  126 non-null    float64
 3   VISA    134 non-null    float64
 4   SP500   139 non-null    float64
dtypes: float64(5)
memory usage: 11.8+ KB
df_stock.isnull().head()
APPLE TESLA AMAZON VISA SP500
Date
2020-01-02 False False False True False
2020-01-03 False False False False False
2020-01-06 False False True False False
2020-01-07 False False True False True
2020-01-08 False False False False False
df_stock.notnull().head()
APPLE TESLA AMAZON VISA SP500
Date
2020-01-02 True True True False True
2020-01-03 True True True True True
2020-01-06 True True False True True
2020-01-07 True True False True False
2020-01-08 True True True True True
df_stock.isnull().sum()
APPLE       0
TESLA       0
AMAZON    126
VISA      118
SP500     113
dtype: int64
df_stock.isnull().sum().sum()
357

Imputation Methods#

Filling with a constant number#

df_stock.fillna(999).head(10)
APPLE TESLA AMAZON VISA SP500
Date
2020-01-02 74.33 86.05 1898.01 999.00 3257.85
2020-01-03 73.61 88.60 1874.97 188.15 3234.85
2020-01-06 74.20 90.31 999.00 187.74 3246.28
2020-01-07 73.85 93.81 999.00 187.24 999.00
2020-01-08 75.04 98.43 1891.97 190.45 3253.05
2020-01-09 76.63 96.27 999.00 191.77 999.00
2020-01-10 76.80 95.63 1883.16 192.29 3265.35
2020-01-13 78.44 104.97 1891.30 193.83 999.00
2020-01-14 77.39 107.58 999.00 999.00 3283.15
2020-01-15 77.05 103.70 999.00 999.00 999.00

Filling with a string#

df_stock.fillna('EMPTY').head(10)
APPLE TESLA AMAZON VISA SP500
Date
2020-01-02 74.33 86.05 1898.01 EMPTY 3257.85
2020-01-03 73.61 88.60 1874.97 188.15 3234.85
2020-01-06 74.20 90.31 EMPTY 187.74 3246.28
2020-01-07 73.85 93.81 EMPTY 187.24 EMPTY
2020-01-08 75.04 98.43 1891.97 190.45 3253.05
2020-01-09 76.63 96.27 EMPTY 191.77 EMPTY
2020-01-10 76.80 95.63 1883.16 192.29 3265.35
2020-01-13 78.44 104.97 1891.3 193.83 EMPTY
2020-01-14 77.39 107.58 EMPTY EMPTY 3283.15
2020-01-15 77.05 103.70 EMPTY EMPTY EMPTY

Filling with forward fill#

df_stock.ffill().head(10)
APPLE TESLA AMAZON VISA SP500
Date
2020-01-02 74.33 86.05 1898.01 NaN 3257.85
2020-01-03 73.61 88.60 1874.97 188.15 3234.85
2020-01-06 74.20 90.31 1874.97 187.74 3246.28
2020-01-07 73.85 93.81 1874.97 187.24 3246.28
2020-01-08 75.04 98.43 1891.97 190.45 3253.05
2020-01-09 76.63 96.27 1891.97 191.77 3253.05
2020-01-10 76.80 95.63 1883.16 192.29 3265.35
2020-01-13 78.44 104.97 1891.30 193.83 3265.35
2020-01-14 77.39 107.58 1891.30 193.83 3283.15
2020-01-15 77.05 103.70 1891.30 193.83 3283.15

Filling with backward fill#

df_stock.bfill().head(10)
APPLE TESLA AMAZON VISA SP500
Date
2020-01-02 74.33 86.05 1898.01 188.15 3257.85
2020-01-03 73.61 88.60 1874.97 188.15 3234.85
2020-01-06 74.20 90.31 1891.97 187.74 3246.28
2020-01-07 73.85 93.81 1891.97 187.24 3253.05
2020-01-08 75.04 98.43 1891.97 190.45 3253.05
2020-01-09 76.63 96.27 1883.16 191.77 3265.35
2020-01-10 76.80 95.63 1883.16 192.29 3265.35
2020-01-13 78.44 104.97 1891.30 193.83 3283.15
2020-01-14 77.39 107.58 1864.72 203.13 3283.15
2020-01-15 77.05 103.70 1864.72 203.13 3316.81

Filling with mean#

df_stock.mean()
APPLE       94.753056
TESLA      288.347579
AMAZON    2672.675635
VISA       191.660373
SP500     3222.287482
dtype: float64
df_stock.fillna(df_stock.mean()).head(10)
APPLE TESLA AMAZON VISA SP500
Date
2020-01-02 74.33 86.05 1898.010000 191.660373 3257.850000
2020-01-03 73.61 88.60 1874.970000 188.150000 3234.850000
2020-01-06 74.20 90.31 2672.675635 187.740000 3246.280000
2020-01-07 73.85 93.81 2672.675635 187.240000 3222.287482
2020-01-08 75.04 98.43 1891.970000 190.450000 3253.050000
2020-01-09 76.63 96.27 2672.675635 191.770000 3222.287482
2020-01-10 76.80 95.63 1883.160000 192.290000 3265.350000
2020-01-13 78.44 104.97 1891.300000 193.830000 3222.287482
2020-01-14 77.39 107.58 2672.675635 191.660373 3283.150000
2020-01-15 77.05 103.70 2672.675635 191.660373 3222.287482

Filling with median#

df_stock.median()
APPLE       90.960
TESLA      232.830
AMAZON    2821.555
VISA       194.565
SP500     3276.020
dtype: float64
df_stock.fillna(df_stock.median()).head(10)
APPLE TESLA AMAZON VISA SP500
Date
2020-01-02 74.33 86.05 1898.010 194.565 3257.85
2020-01-03 73.61 88.60 1874.970 188.150 3234.85
2020-01-06 74.20 90.31 2821.555 187.740 3246.28
2020-01-07 73.85 93.81 2821.555 187.240 3276.02
2020-01-08 75.04 98.43 1891.970 190.450 3253.05
2020-01-09 76.63 96.27 2821.555 191.770 3276.02
2020-01-10 76.80 95.63 1883.160 192.290 3265.35
2020-01-13 78.44 104.97 1891.300 193.830 3276.02
2020-01-14 77.39 107.58 2821.555 194.565 3283.15
2020-01-15 77.05 103.70 2821.555 194.565 3276.02

Dropping Missing Values#

  • drop the rows with at least one missing nan value

df_stock.dropna().head(10)
APPLE TESLA AMAZON VISA SP500
Date
2020-01-03 73.61 88.60 1874.97 188.15 3234.85
2020-01-08 75.04 98.43 1891.97 190.45 3253.05
2020-01-10 76.80 95.63 1883.16 192.29 3265.35
2020-01-22 78.63 113.91 1887.46 206.31 3321.75
2020-02-06 80.49 149.79 2050.23 201.49 3345.78
2020-02-19 80.28 183.48 2170.22 211.98 3386.15
2020-02-28 67.81 133.60 1883.75 180.63 2954.22
2020-03-03 71.77 149.10 1908.99 184.57 3003.37
2020-03-06 71.70 140.70 1901.09 183.21 2972.37
2020-03-10 70.79 129.07 1891.82 181.46 2882.23
# drop columns
df_stock.dropna(axis=1).head(10)
APPLE TESLA
Date
2020-01-02 74.33 86.05
2020-01-03 73.61 88.60
2020-01-06 74.20 90.31
2020-01-07 73.85 93.81
2020-01-08 75.04 98.43
2020-01-09 76.63 96.27
2020-01-10 76.80 95.63
2020-01-13 78.44 104.97
2020-01-14 77.39 107.58
2020-01-15 77.05 103.70
df_stock.dropna(subset=['APPLE']).head(10)
APPLE TESLA AMAZON VISA SP500
Date
2020-01-02 74.33 86.05 1898.01 NaN 3257.85
2020-01-03 73.61 88.60 1874.97 188.15 3234.85
2020-01-06 74.20 90.31 NaN 187.74 3246.28
2020-01-07 73.85 93.81 NaN 187.24 NaN
2020-01-08 75.04 98.43 1891.97 190.45 3253.05
2020-01-09 76.63 96.27 NaN 191.77 NaN
2020-01-10 76.80 95.63 1883.16 192.29 3265.35
2020-01-13 78.44 104.97 1891.30 193.83 NaN
2020-01-14 77.39 107.58 NaN NaN 3283.15
2020-01-15 77.05 103.70 NaN NaN NaN
df_stock.dropna(subset=['APPLE', 'AMAZON'], how='any').head(10)
APPLE TESLA AMAZON VISA SP500
Date
2020-01-02 74.33 86.05 1898.01 NaN 3257.85
2020-01-03 73.61 88.60 1874.97 188.15 3234.85
2020-01-08 75.04 98.43 1891.97 190.45 3253.05
2020-01-10 76.80 95.63 1883.16 192.29 3265.35
2020-01-13 78.44 104.97 1891.30 193.83 NaN
2020-01-17 78.88 102.10 1864.72 203.13 NaN
2020-01-21 78.35 109.44 1892.00 NaN 3320.79
2020-01-22 78.63 113.91 1887.46 206.31 3321.75
2020-01-23 79.01 114.44 1884.58 NaN NaN
2020-01-28 78.62 113.38 1853.25 NaN 3276.24
df_stock.dropna(subset=['APPLE', 'AMAZON'], how='all').head(10)
APPLE TESLA AMAZON VISA SP500
Date
2020-01-02 74.33 86.05 1898.01 NaN 3257.85
2020-01-03 73.61 88.60 1874.97 188.15 3234.85
2020-01-06 74.20 90.31 NaN 187.74 3246.28
2020-01-07 73.85 93.81 NaN 187.24 NaN
2020-01-08 75.04 98.43 1891.97 190.45 3253.05
2020-01-09 76.63 96.27 NaN 191.77 NaN
2020-01-10 76.80 95.63 1883.16 192.29 3265.35
2020-01-13 78.44 104.97 1891.30 193.83 NaN
2020-01-14 77.39 107.58 NaN NaN 3283.15
2020-01-15 77.05 103.70 NaN NaN NaN

Imputer Methods#

Simple Imputer#

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')   # mean, median, most_frequent, constant
imputer.fit(df_stock)
imputer.transform(df_stock)
array([[  74.33      ,   86.05      , 1898.01      ,  191.66037313,
        3257.85      ],
       [  73.61      ,   88.6       , 1874.97      ,  188.15      ,
        3234.85      ],
       [  74.2       ,   90.31      , 2672.67563492,  187.74      ,
        3246.28      ],
       ...,
       [ 136.49      ,  663.69      , 3283.96      ,  212.3       ,
        3735.36      ],
       [ 134.67      ,  665.99      , 2672.67563492,  214.04      ,
        3222.28748201],
       [ 133.52      ,  694.78      , 3285.85      ,  218.02      ,
        3222.28748201]])
pd.DataFrame(imputer.transform(df_stock), columns=df_stock.columns, index=df_stock.index).head(10)
APPLE TESLA AMAZON VISA SP500
Date
2020-01-02 74.33 86.05 1898.010000 191.660373 3257.850000
2020-01-03 73.61 88.60 1874.970000 188.150000 3234.850000
2020-01-06 74.20 90.31 2672.675635 187.740000 3246.280000
2020-01-07 73.85 93.81 2672.675635 187.240000 3222.287482
2020-01-08 75.04 98.43 1891.970000 190.450000 3253.050000
2020-01-09 76.63 96.27 2672.675635 191.770000 3222.287482
2020-01-10 76.80 95.63 1883.160000 192.290000 3265.350000
2020-01-13 78.44 104.97 1891.300000 193.830000 3222.287482
2020-01-14 77.39 107.58 2672.675635 191.660373 3283.150000
2020-01-15 77.05 103.70 2672.675635 191.660373 3222.287482

knn Imputer#

imputed using the mean value from n_neighbors nearest neighbors found in the training set

from sklearn.impute import KNNImputer
imputer_knn = KNNImputer(n_neighbors=2)
imputer_knn.fit_transform(df_stock)
array([[  74.33,   86.05, 1898.01,  187.49, 3257.85],
       [  73.61,   88.6 , 1874.97,  188.15, 3234.85],
       [  74.2 ,   90.31, 1883.47,  187.74, 3246.28],
       ...,
       [ 136.49,  663.69, 3283.96,  212.3 , 3735.36],
       [ 134.67,  665.99, 3260.02,  214.04, 3719.21],
       [ 133.52,  694.78, 3285.85,  218.02, 3719.21]])
pd.DataFrame(imputer_knn.fit_transform(df_stock), columns=df_stock.columns, index=df_stock.index).head(10)
APPLE TESLA AMAZON VISA SP500
Date
2020-01-02 74.33 86.05 1898.010 187.49 3257.850
2020-01-03 73.61 88.60 1874.970 188.15 3234.850
2020-01-06 74.20 90.31 1883.470 187.74 3246.280
2020-01-07 73.85 93.81 1883.470 187.24 3240.565
2020-01-08 75.04 98.43 1891.970 190.45 3253.050
2020-01-09 76.63 96.27 1887.565 191.77 3259.200
2020-01-10 76.80 95.63 1883.160 192.29 3265.350
2020-01-13 78.44 104.97 1891.300 193.83 3299.980
2020-01-14 77.39 107.58 1878.010 198.48 3283.150
2020-01-15 77.05 103.70 1878.010 198.48 3299.980