Missing Values#

We will use the following dataset throughout this chapter.
import pandas as pd
df_stock = pd.read_excel('https://raw.githubusercontent.com/datasmp/datasets/main/stock.xlsx', index_col=0)
df_stock.head()
| APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-01-02 | 74.33 | 86.05 | 1898.01 | 189.66 | 3257.85 |
| 2020-01-03 | 73.61 | 88.60 | 1874.97 | 188.15 | 3234.85 |
| 2020-01-06 | 74.20 | 90.31 | 1902.88 | 187.74 | 3246.28 |
| 2020-01-07 | 73.85 | 93.81 | 1906.86 | 187.24 | 3237.18 |
| 2020-01-08 | 75.04 | 98.43 | 1891.97 | 190.45 | 3253.05 |
Now, we will replace some values in the dataframe with missing values (NaN).
import random
import numpy as np
random.seed(0)
r = 0.4
a,b = df_stock.shape
missing = int(a*b*r)
for i in range( missing ):
k = random.randint(2,b-1)
p = random.randint(0,a-1)
df_stock.iloc[p,k] = np.nan
df_stock.head(10)
| APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-01-02 | 74.33 | 86.05 | 1898.01 | NaN | 3257.85 |
| 2020-01-03 | 73.61 | 88.60 | 1874.97 | 188.15 | 3234.85 |
| 2020-01-06 | 74.20 | 90.31 | NaN | 187.74 | 3246.28 |
| 2020-01-07 | 73.85 | 93.81 | NaN | 187.24 | NaN |
| 2020-01-08 | 75.04 | 98.43 | 1891.97 | 190.45 | 3253.05 |
| 2020-01-09 | 76.63 | 96.27 | NaN | 191.77 | NaN |
| 2020-01-10 | 76.80 | 95.63 | 1883.16 | 192.29 | 3265.35 |
| 2020-01-13 | 78.44 | 104.97 | 1891.30 | 193.83 | NaN |
| 2020-01-14 | 77.39 | 107.58 | NaN | NaN | 3283.15 |
| 2020-01-15 | 77.05 | 103.70 | NaN | NaN | NaN |
df_stock.info()
<class 'pandas.core.frame.DataFrame'>
Index: 252 entries, 2020-01-02 to 2020-12-30
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 APPLE 252 non-null float64
1 TESLA 252 non-null float64
2 AMAZON 126 non-null float64
3 VISA 134 non-null float64
4 SP500 139 non-null float64
dtypes: float64(5)
memory usage: 11.8+ KB
df_stock.isnull().head()
| APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-01-02 | False | False | False | True | False |
| 2020-01-03 | False | False | False | False | False |
| 2020-01-06 | False | False | True | False | False |
| 2020-01-07 | False | False | True | False | True |
| 2020-01-08 | False | False | False | False | False |
df_stock.notnull().head()
| APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-01-02 | True | True | True | False | True |
| 2020-01-03 | True | True | True | True | True |
| 2020-01-06 | True | True | False | True | True |
| 2020-01-07 | True | True | False | True | False |
| 2020-01-08 | True | True | True | True | True |
df_stock.isnull().sum()
APPLE 0
TESLA 0
AMAZON 126
VISA 118
SP500 113
dtype: int64
df_stock.isnull().sum().sum()
357
Imputation Methods#
Filling with a constant number#
df_stock.fillna(999).head(10)
| APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-01-02 | 74.33 | 86.05 | 1898.01 | 999.00 | 3257.85 |
| 2020-01-03 | 73.61 | 88.60 | 1874.97 | 188.15 | 3234.85 |
| 2020-01-06 | 74.20 | 90.31 | 999.00 | 187.74 | 3246.28 |
| 2020-01-07 | 73.85 | 93.81 | 999.00 | 187.24 | 999.00 |
| 2020-01-08 | 75.04 | 98.43 | 1891.97 | 190.45 | 3253.05 |
| 2020-01-09 | 76.63 | 96.27 | 999.00 | 191.77 | 999.00 |
| 2020-01-10 | 76.80 | 95.63 | 1883.16 | 192.29 | 3265.35 |
| 2020-01-13 | 78.44 | 104.97 | 1891.30 | 193.83 | 999.00 |
| 2020-01-14 | 77.39 | 107.58 | 999.00 | 999.00 | 3283.15 |
| 2020-01-15 | 77.05 | 103.70 | 999.00 | 999.00 | 999.00 |
Filling with a string#
df_stock.fillna('EMPTY').head(10)
| APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-01-02 | 74.33 | 86.05 | 1898.01 | EMPTY | 3257.85 |
| 2020-01-03 | 73.61 | 88.60 | 1874.97 | 188.15 | 3234.85 |
| 2020-01-06 | 74.20 | 90.31 | EMPTY | 187.74 | 3246.28 |
| 2020-01-07 | 73.85 | 93.81 | EMPTY | 187.24 | EMPTY |
| 2020-01-08 | 75.04 | 98.43 | 1891.97 | 190.45 | 3253.05 |
| 2020-01-09 | 76.63 | 96.27 | EMPTY | 191.77 | EMPTY |
| 2020-01-10 | 76.80 | 95.63 | 1883.16 | 192.29 | 3265.35 |
| 2020-01-13 | 78.44 | 104.97 | 1891.3 | 193.83 | EMPTY |
| 2020-01-14 | 77.39 | 107.58 | EMPTY | EMPTY | 3283.15 |
| 2020-01-15 | 77.05 | 103.70 | EMPTY | EMPTY | EMPTY |
Filling with forward fill#
df_stock.ffill().head(10)
| APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-01-02 | 74.33 | 86.05 | 1898.01 | NaN | 3257.85 |
| 2020-01-03 | 73.61 | 88.60 | 1874.97 | 188.15 | 3234.85 |
| 2020-01-06 | 74.20 | 90.31 | 1874.97 | 187.74 | 3246.28 |
| 2020-01-07 | 73.85 | 93.81 | 1874.97 | 187.24 | 3246.28 |
| 2020-01-08 | 75.04 | 98.43 | 1891.97 | 190.45 | 3253.05 |
| 2020-01-09 | 76.63 | 96.27 | 1891.97 | 191.77 | 3253.05 |
| 2020-01-10 | 76.80 | 95.63 | 1883.16 | 192.29 | 3265.35 |
| 2020-01-13 | 78.44 | 104.97 | 1891.30 | 193.83 | 3265.35 |
| 2020-01-14 | 77.39 | 107.58 | 1891.30 | 193.83 | 3283.15 |
| 2020-01-15 | 77.05 | 103.70 | 1891.30 | 193.83 | 3283.15 |
Filling with backward fill#
df_stock.bfill().head(10)
| APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-01-02 | 74.33 | 86.05 | 1898.01 | 188.15 | 3257.85 |
| 2020-01-03 | 73.61 | 88.60 | 1874.97 | 188.15 | 3234.85 |
| 2020-01-06 | 74.20 | 90.31 | 1891.97 | 187.74 | 3246.28 |
| 2020-01-07 | 73.85 | 93.81 | 1891.97 | 187.24 | 3253.05 |
| 2020-01-08 | 75.04 | 98.43 | 1891.97 | 190.45 | 3253.05 |
| 2020-01-09 | 76.63 | 96.27 | 1883.16 | 191.77 | 3265.35 |
| 2020-01-10 | 76.80 | 95.63 | 1883.16 | 192.29 | 3265.35 |
| 2020-01-13 | 78.44 | 104.97 | 1891.30 | 193.83 | 3283.15 |
| 2020-01-14 | 77.39 | 107.58 | 1864.72 | 203.13 | 3283.15 |
| 2020-01-15 | 77.05 | 103.70 | 1864.72 | 203.13 | 3316.81 |
Filling with mean#
df_stock.mean()
APPLE 94.753056
TESLA 288.347579
AMAZON 2672.675635
VISA 191.660373
SP500 3222.287482
dtype: float64
df_stock.fillna(df_stock.mean()).head(10)
| APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-01-02 | 74.33 | 86.05 | 1898.010000 | 191.660373 | 3257.850000 |
| 2020-01-03 | 73.61 | 88.60 | 1874.970000 | 188.150000 | 3234.850000 |
| 2020-01-06 | 74.20 | 90.31 | 2672.675635 | 187.740000 | 3246.280000 |
| 2020-01-07 | 73.85 | 93.81 | 2672.675635 | 187.240000 | 3222.287482 |
| 2020-01-08 | 75.04 | 98.43 | 1891.970000 | 190.450000 | 3253.050000 |
| 2020-01-09 | 76.63 | 96.27 | 2672.675635 | 191.770000 | 3222.287482 |
| 2020-01-10 | 76.80 | 95.63 | 1883.160000 | 192.290000 | 3265.350000 |
| 2020-01-13 | 78.44 | 104.97 | 1891.300000 | 193.830000 | 3222.287482 |
| 2020-01-14 | 77.39 | 107.58 | 2672.675635 | 191.660373 | 3283.150000 |
| 2020-01-15 | 77.05 | 103.70 | 2672.675635 | 191.660373 | 3222.287482 |
Filling with median#
df_stock.median()
APPLE 90.960
TESLA 232.830
AMAZON 2821.555
VISA 194.565
SP500 3276.020
dtype: float64
df_stock.fillna(df_stock.median()).head(10)
| APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-01-02 | 74.33 | 86.05 | 1898.010 | 194.565 | 3257.85 |
| 2020-01-03 | 73.61 | 88.60 | 1874.970 | 188.150 | 3234.85 |
| 2020-01-06 | 74.20 | 90.31 | 2821.555 | 187.740 | 3246.28 |
| 2020-01-07 | 73.85 | 93.81 | 2821.555 | 187.240 | 3276.02 |
| 2020-01-08 | 75.04 | 98.43 | 1891.970 | 190.450 | 3253.05 |
| 2020-01-09 | 76.63 | 96.27 | 2821.555 | 191.770 | 3276.02 |
| 2020-01-10 | 76.80 | 95.63 | 1883.160 | 192.290 | 3265.35 |
| 2020-01-13 | 78.44 | 104.97 | 1891.300 | 193.830 | 3276.02 |
| 2020-01-14 | 77.39 | 107.58 | 2821.555 | 194.565 | 3283.15 |
| 2020-01-15 | 77.05 | 103.70 | 2821.555 | 194.565 | 3276.02 |
Dropping Missing Values#
drop the rows with at least one missing nan value
df_stock.dropna().head(10)
| APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-01-03 | 73.61 | 88.60 | 1874.97 | 188.15 | 3234.85 |
| 2020-01-08 | 75.04 | 98.43 | 1891.97 | 190.45 | 3253.05 |
| 2020-01-10 | 76.80 | 95.63 | 1883.16 | 192.29 | 3265.35 |
| 2020-01-22 | 78.63 | 113.91 | 1887.46 | 206.31 | 3321.75 |
| 2020-02-06 | 80.49 | 149.79 | 2050.23 | 201.49 | 3345.78 |
| 2020-02-19 | 80.28 | 183.48 | 2170.22 | 211.98 | 3386.15 |
| 2020-02-28 | 67.81 | 133.60 | 1883.75 | 180.63 | 2954.22 |
| 2020-03-03 | 71.77 | 149.10 | 1908.99 | 184.57 | 3003.37 |
| 2020-03-06 | 71.70 | 140.70 | 1901.09 | 183.21 | 2972.37 |
| 2020-03-10 | 70.79 | 129.07 | 1891.82 | 181.46 | 2882.23 |
# drop columns
df_stock.dropna(axis=1).head(10)
| APPLE | TESLA | |
|---|---|---|
| Date | ||
| 2020-01-02 | 74.33 | 86.05 |
| 2020-01-03 | 73.61 | 88.60 |
| 2020-01-06 | 74.20 | 90.31 |
| 2020-01-07 | 73.85 | 93.81 |
| 2020-01-08 | 75.04 | 98.43 |
| 2020-01-09 | 76.63 | 96.27 |
| 2020-01-10 | 76.80 | 95.63 |
| 2020-01-13 | 78.44 | 104.97 |
| 2020-01-14 | 77.39 | 107.58 |
| 2020-01-15 | 77.05 | 103.70 |
df_stock.dropna(subset=['APPLE']).head(10)
| APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-01-02 | 74.33 | 86.05 | 1898.01 | NaN | 3257.85 |
| 2020-01-03 | 73.61 | 88.60 | 1874.97 | 188.15 | 3234.85 |
| 2020-01-06 | 74.20 | 90.31 | NaN | 187.74 | 3246.28 |
| 2020-01-07 | 73.85 | 93.81 | NaN | 187.24 | NaN |
| 2020-01-08 | 75.04 | 98.43 | 1891.97 | 190.45 | 3253.05 |
| 2020-01-09 | 76.63 | 96.27 | NaN | 191.77 | NaN |
| 2020-01-10 | 76.80 | 95.63 | 1883.16 | 192.29 | 3265.35 |
| 2020-01-13 | 78.44 | 104.97 | 1891.30 | 193.83 | NaN |
| 2020-01-14 | 77.39 | 107.58 | NaN | NaN | 3283.15 |
| 2020-01-15 | 77.05 | 103.70 | NaN | NaN | NaN |
df_stock.dropna(subset=['APPLE', 'AMAZON'], how='any').head(10)
| APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-01-02 | 74.33 | 86.05 | 1898.01 | NaN | 3257.85 |
| 2020-01-03 | 73.61 | 88.60 | 1874.97 | 188.15 | 3234.85 |
| 2020-01-08 | 75.04 | 98.43 | 1891.97 | 190.45 | 3253.05 |
| 2020-01-10 | 76.80 | 95.63 | 1883.16 | 192.29 | 3265.35 |
| 2020-01-13 | 78.44 | 104.97 | 1891.30 | 193.83 | NaN |
| 2020-01-17 | 78.88 | 102.10 | 1864.72 | 203.13 | NaN |
| 2020-01-21 | 78.35 | 109.44 | 1892.00 | NaN | 3320.79 |
| 2020-01-22 | 78.63 | 113.91 | 1887.46 | 206.31 | 3321.75 |
| 2020-01-23 | 79.01 | 114.44 | 1884.58 | NaN | NaN |
| 2020-01-28 | 78.62 | 113.38 | 1853.25 | NaN | 3276.24 |
df_stock.dropna(subset=['APPLE', 'AMAZON'], how='all').head(10)
| APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-01-02 | 74.33 | 86.05 | 1898.01 | NaN | 3257.85 |
| 2020-01-03 | 73.61 | 88.60 | 1874.97 | 188.15 | 3234.85 |
| 2020-01-06 | 74.20 | 90.31 | NaN | 187.74 | 3246.28 |
| 2020-01-07 | 73.85 | 93.81 | NaN | 187.24 | NaN |
| 2020-01-08 | 75.04 | 98.43 | 1891.97 | 190.45 | 3253.05 |
| 2020-01-09 | 76.63 | 96.27 | NaN | 191.77 | NaN |
| 2020-01-10 | 76.80 | 95.63 | 1883.16 | 192.29 | 3265.35 |
| 2020-01-13 | 78.44 | 104.97 | 1891.30 | 193.83 | NaN |
| 2020-01-14 | 77.39 | 107.58 | NaN | NaN | 3283.15 |
| 2020-01-15 | 77.05 | 103.70 | NaN | NaN | NaN |
Imputer Methods#
Simple Imputer#
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean') # mean, median, most_frequent, constant
imputer.fit(df_stock)
imputer.transform(df_stock)
array([[ 74.33 , 86.05 , 1898.01 , 191.66037313,
3257.85 ],
[ 73.61 , 88.6 , 1874.97 , 188.15 ,
3234.85 ],
[ 74.2 , 90.31 , 2672.67563492, 187.74 ,
3246.28 ],
...,
[ 136.49 , 663.69 , 3283.96 , 212.3 ,
3735.36 ],
[ 134.67 , 665.99 , 2672.67563492, 214.04 ,
3222.28748201],
[ 133.52 , 694.78 , 3285.85 , 218.02 ,
3222.28748201]])
pd.DataFrame(imputer.transform(df_stock), columns=df_stock.columns, index=df_stock.index).head(10)
| APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-01-02 | 74.33 | 86.05 | 1898.010000 | 191.660373 | 3257.850000 |
| 2020-01-03 | 73.61 | 88.60 | 1874.970000 | 188.150000 | 3234.850000 |
| 2020-01-06 | 74.20 | 90.31 | 2672.675635 | 187.740000 | 3246.280000 |
| 2020-01-07 | 73.85 | 93.81 | 2672.675635 | 187.240000 | 3222.287482 |
| 2020-01-08 | 75.04 | 98.43 | 1891.970000 | 190.450000 | 3253.050000 |
| 2020-01-09 | 76.63 | 96.27 | 2672.675635 | 191.770000 | 3222.287482 |
| 2020-01-10 | 76.80 | 95.63 | 1883.160000 | 192.290000 | 3265.350000 |
| 2020-01-13 | 78.44 | 104.97 | 1891.300000 | 193.830000 | 3222.287482 |
| 2020-01-14 | 77.39 | 107.58 | 2672.675635 | 191.660373 | 3283.150000 |
| 2020-01-15 | 77.05 | 103.70 | 2672.675635 | 191.660373 | 3222.287482 |
knn Imputer#
imputed using the mean value from n_neighbors nearest neighbors found in the training set
from sklearn.impute import KNNImputer
imputer_knn = KNNImputer(n_neighbors=2)
imputer_knn.fit_transform(df_stock)
array([[ 74.33, 86.05, 1898.01, 187.49, 3257.85],
[ 73.61, 88.6 , 1874.97, 188.15, 3234.85],
[ 74.2 , 90.31, 1883.47, 187.74, 3246.28],
...,
[ 136.49, 663.69, 3283.96, 212.3 , 3735.36],
[ 134.67, 665.99, 3260.02, 214.04, 3719.21],
[ 133.52, 694.78, 3285.85, 218.02, 3719.21]])
pd.DataFrame(imputer_knn.fit_transform(df_stock), columns=df_stock.columns, index=df_stock.index).head(10)
| APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-01-02 | 74.33 | 86.05 | 1898.010 | 187.49 | 3257.850 |
| 2020-01-03 | 73.61 | 88.60 | 1874.970 | 188.15 | 3234.850 |
| 2020-01-06 | 74.20 | 90.31 | 1883.470 | 187.74 | 3246.280 |
| 2020-01-07 | 73.85 | 93.81 | 1883.470 | 187.24 | 3240.565 |
| 2020-01-08 | 75.04 | 98.43 | 1891.970 | 190.45 | 3253.050 |
| 2020-01-09 | 76.63 | 96.27 | 1887.565 | 191.77 | 3259.200 |
| 2020-01-10 | 76.80 | 95.63 | 1883.160 | 192.29 | 3265.350 |
| 2020-01-13 | 78.44 | 104.97 | 1891.300 | 193.83 | 3299.980 |
| 2020-01-14 | 77.39 | 107.58 | 1878.010 | 198.48 | 3283.150 |
| 2020-01-15 | 77.05 | 103.70 | 1878.010 | 198.48 | 3299.980 |