Understanding Nulls In Pandas
Following on from my post comparing Pandas and Excel, I am going to write a series of posts discussing the Python Pandas library for data analysis. Each post in this series will tackle an important data topic and talk about how Pandas solves the problem. In this first post I will talk about missing data, sometimes called NULLs.
One problem that new Python Pandas users struggle with is getting the hang of working with missing values (NULLs). Pandas does not make it easy, because it has a number of different ways of representing NULLs.
Pandas also makes things tricky because classic integer columns cannot hold NULLs.
Many of Pandas' datatypes are borrowed from numpy, and that includes np.nan.
Pandas uses different NULL types depending on the dtype: floats use np.nan, str/object columns use None or np.nan, and datetime64 uses pd.NaT. The classic int and bool dtypes do not support NULLs currently.
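To see the integer problem in action, here is a small sketch (the values are made up for illustration):

```python
import pandas as pd

# A purely integer Series gets the classic int64 dtype
print(pd.Series([1, 2, 3]).dtype)  # int64

# Introduce a missing value and pandas silently upcasts to float64,
# because np.nan is a float and classic int64 has no NULL representation
print(pd.Series([1, None, 3]).dtype)  # float64
```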
Let's build a DataFrame with all the different dtypes and add a NULL to each column. You will note that the int and bool columns are given object dtype, as they would error otherwise.
import pandas as pd
import numpy as np
d = {'int': pd.Series([1, None], dtype=np.dtype("object")),
     'float': pd.Series([3.0, np.nan], dtype=np.dtype("float")),
     'float2': pd.Series([3.0, None], dtype=np.dtype("float")),
     'str': pd.Series(['test', None], dtype=np.dtype("str")),
     'str2': pd.Series(['test', np.nan], dtype=np.dtype("str")),
     "bool": pd.Series([True, np.nan], dtype=np.dtype("object")),
     "date": pd.Series(['1/1/2000', np.nan], dtype=np.dtype("datetime64[ns]")),
     "date2": pd.Series(['1/1/2000', None], dtype=np.dtype("datetime64[ns]"))}
df1 = pd.DataFrame(data=d)
df1['date'] = pd.to_datetime(df1['date'], errors='coerce')
df1.info()
df1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   int     1 non-null      object
 1   float   1 non-null      float64
 2   float2  1 non-null      float64
 3   str     1 non-null      object
 4   str2    1 non-null      object
 5   bool    1 non-null      object
 6   date    1 non-null      datetime64[ns]
 7   date2   1 non-null      datetime64[ns]
dtypes: datetime64[ns](2), float64(2), object(4)
memory usage: 256.0+ bytes
|   | int  | float | float2 | str  | str2 | bool | date       | date2      |
|---|------|-------|--------|------|------|------|------------|------------|
| 0 | 1    | 3.0   | 3.0    | test | test | True | 2000-01-01 | 2000-01-01 |
| 1 | None | NaN   | NaN    | None | NaN  | NaN  | NaT        | NaT        |
Now there is good news, because Pandas is moving towards a better situation with NULL types.
pd.NA is the new guy in town and is Pandas' own NULL value.
Starting from pandas 1.0, an experimental pd.NA value (singleton) is available to represent scalar missing values. At this moment, it is used in the nullable integer, boolean and dedicated string data types as the missing value indicator.
The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type).
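As a quick sketch of what that consistency looks like (the example values are my own):

```python
import pandas as pd

# Nullable integer: note the capital-I "Int64" dtype string
s_int = pd.Series([1, None], dtype="Int64")
print(s_int.dtype)   # Int64
print(s_int[1])      # <NA>

# The very same pd.NA singleton is the missing value in the
# nullable boolean and dedicated string dtypes too
s_bool = pd.Series([True, None], dtype="boolean")
s_str = pd.Series(["test", None], dtype="string")
print(s_bool[1] is pd.NA, s_str[1] is pd.NA)  # True True
```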
If you have a DataFrame or Series using traditional dtypes with missing data represented as np.nan, there are convenience convert_dtypes() methods on both Series and DataFrame that convert the data to the newer nullable dtypes for integers, strings and booleans. From Pandas v1.2 floats are covered too; pass the convert_integer=False argument so that integer-like float columns stay Float64 rather than being converted to Int64.
df1[['int', 'str', 'str2', 'bool', 'date']] = df1[['int', 'str', 'str2', 'bool', 'date']].convert_dtypes()
df1[['float', 'float2']] = df1[['float', 'float2']].convert_dtypes(convert_integer=False)
df1.info()
df1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   int     1 non-null      Int64
 1   float   1 non-null      Float64
 2   float2  1 non-null      Float64
 3   str     1 non-null      string
 4   str2    1 non-null      string
 5   bool    1 non-null      boolean
 6   date    1 non-null      datetime64[ns]
 7   date2   1 non-null      datetime64[ns]
dtypes: Float64(2), Int64(1), boolean(1), datetime64[ns](2), string(2)
memory usage: 250.0 bytes
|   | int  | float | float2 | str  | str2 | bool | date       | date2      |
|---|------|-------|--------|------|------|------|------------|------------|
| 0 | 1    | 3.0   | 3.0    | test | test | True | 2000-01-01 | 2000-01-01 |
| 1 | <NA> | <NA>  | <NA>   | <NA> | <NA> | <NA> | NaT        | NaT        |
Note the capital 'F' in Float64, which distinguishes the nullable dtype from np.float32 or np.float64. Also note string, which is the new Pandas StringDtype (from Pandas 1.0), not str or object; and Int64 (from Pandas 0.24), the nullable integer with a capital 'I', not np.int64.
As for datetimes, Pandas borrows timedelta64[ns] and datetime64[ns] from numpy, and they use NaT (not a time) as their NULL value.
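You will meet NaT a lot when parsing messy date columns. For example, pd.to_datetime with errors='coerce' turns anything unparseable into NaT (hypothetical values):

```python
import pandas as pd

# Anything that cannot be parsed as a date becomes NaT with errors='coerce'
dates = pd.to_datetime(pd.Series(['2000-01-01', 'not a date']), errors='coerce')
print(dates.dtype)        # datetime64[ns]
print(pd.isna(dates[1]))  # True, NaT registers as missing
```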
Warning: pd.NA is new and experimental, so use it carefully for now.
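One reason for the caution: pd.NA behaves differently from np.nan in comparisons and in boolean contexts. A small sketch:

```python
import pandas as pd

# Comparisons with pd.NA propagate the missingness rather than
# returning False, so never test for missing values with ==
print(pd.NA == pd.NA)   # <NA>, not True
print(pd.isna(pd.NA))   # True, this is the correct test

# Unlike np.nan (which is truthy), pd.NA refuses to be used as a bool
try:
    if pd.NA:
        pass
except TypeError:
    print("bool(pd.NA) raises TypeError")
```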
Working with Nulls
Once you understand the concept of nulls you then need to be able to handle them. There is no one-size-fits-all solution; it will depend on your data and what you are trying to achieve.
Once you know whether you can drop nulls or need to replace them, you can get started. I am going to walk through some options in no particular order.
# import the famous Iris dataset
# note: iris.data has no header row, so the first data row becomes the column names
df_iris = pd.read_csv('iris.data')
df_iris.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   5.1          149 non-null    float64
 1   3.5          149 non-null    float64
 2   1.4          149 non-null    float64
 3   0.2          149 non-null    float64
 4   Iris-setosa  149 non-null    object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
Remove roughly 10% of the values at random so we have some nulls to work with.
df_iris = df_iris.mask(np.random.random(df_iris.shape) < .1)
df_iris.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   5.1          134 non-null    float64
 1   3.5          135 non-null    float64
 2   1.4          139 non-null    float64
 3   0.2          133 non-null    float64
 4   Iris-setosa  137 non-null    object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
Finding nulls
isnull() returns a boolean for every value in the DataFrame. We use head() to return just the first five rows.
df_iris.isnull().head()
|   | 5.1   | 3.5   | 1.4   | 0.2   | Iris-setosa |
|---|-------|-------|-------|-------|-------------|
| 0 | True  | False | False | False | True        |
| 1 | False | True  | False | False | False       |
| 2 | False | False | False | True  | False       |
| 3 | False | False | False | False | True        |
| 4 | False | False | True  | False | False       |
notnull() does the reverse.
df_iris.notnull().head()
|   | 5.1   | 3.5   | 1.4   | 0.2   | Iris-setosa |
|---|-------|-------|-------|-------|-------------|
| 0 | False | True  | True  | True  | False       |
| 1 | True  | False | True  | True  | True        |
| 2 | True  | True  | True  | False | True        |
| 3 | True  | True  | True  | True  | False       |
| 4 | True  | True  | False | True  | True        |
You can add .sum() to get a count of nulls per column.
df_iris.isnull().sum()
5.1            15
3.5            14
1.4            10
0.2            16
Iris-setosa    12
dtype: int64
Deleting nulls is one option. For this we can use the dropna() function. You can target rows or columns using the axis= parameter, and you can choose between deleting rows that contain at least one null or only rows that are entirely null using the how= parameter.
We do this below using the default settings, which are axis=0 (rows) and how='any'. It has deleted the null rows, but it has reduced the overall size of the DataFrame quite a bit, meaning a lot of lost data.
df_iris_no_nulls = df_iris.dropna()
print(df_iris_no_nulls.shape)
df_iris_no_nulls.isnull().sum()
(92, 5)
5.1            0
3.5            0
1.4            0
0.2            0
Iris-setosa    0
dtype: int64
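To see how the axis= and how= parameters change the result, here is a small sketch on a made-up frame (the thresh= parameter, which keeps rows with at least a given number of non-null values, is also worth knowing):

```python
import pandas as pd
import numpy as np

# A tiny frame: row 0 is complete, row 1 has one null, row 2 is all nulls
df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [4.0, 5.0, np.nan],
                   'c': [7.0, 8.0, np.nan]})

print(df.dropna().shape)                # (1, 3) how='any' drops rows 1 and 2
print(df.dropna(how='all').shape)       # (2, 3) only the all-null row 2 goes
print(df.dropna(thresh=2).shape)        # (2, 3) keep rows with >= 2 non-nulls
print(df.dropna(axis='columns').shape)  # (3, 0) every column has a null
```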
Replacing with mean or median
Let's focus on the column called '0.2'.
First, let's get the mean value.
df_iris['0.2'].mean()
1.2067669172932338
Check how many Nulls are in the column
df_iris['0.2'].isnull().sum()
16
We will replace the nulls with the mean value.
df_iris['0.2'] = df_iris['0.2'].replace(np.nan, df_iris['0.2'].mean())
Check again how many nulls are in the column. It now shows zero.
df_iris['0.2'].isnull().sum()
0
Check what the replaced values are. Yes, they are the mean for the column, as expected.
df_iris['0.2'].value_counts().head()
0.200000    25
1.206767    16
1.300000    12
1.500000    10
1.800000     9
Name: 0.2, dtype: int64
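As an aside, fillna() is the more idiomatic tool for this kind of replacement, and the median can be a better choice than the mean when a column has outliers. A small sketch with made-up values:

```python
import pandas as pd
import numpy as np

# One outlier (100.0) drags the mean far from the typical values
s = pd.Series([1.0, 2.0, np.nan, 100.0])
print(s.mean())    # 34.333..., distorted by the outlier
print(s.median())  # 2.0, barely moved by it

# fillna() replaces only the missing values
filled = s.fillna(s.median())
print(filled.isnull().sum())  # 0
```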
Give nulls your own value
You might have your own go-to value to use.
Let's go back to the df_iris DataFrame and focus on the 'Iris-setosa' column.
df_iris['Iris-setosa'].isnull().sum()
12
Let's use the fillna() function to replace the nulls with the string 'na'.
df_iris['Iris-setosa'] = df_iris['Iris-setosa'].fillna('na')
Let's check how many 'na' values are in the column. It looks like 12, as expected.
(df_iris['Iris-setosa'] == 'na').sum()
12
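If you want different fill values for different columns, fillna() also accepts a dict mapping column names to values. A small sketch on a made-up frame:

```python
import pandas as pd
import numpy as np

# Each column gets its own replacement value from the dict
df = pd.DataFrame({'width': [1.4, np.nan], 'species': ['setosa', None]})
df = df.fillna({'width': 0.0, 'species': 'unknown'})
print(df.isnull().sum().sum())  # 0
print(df.loc[1, 'species'])     # unknown
```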
Isnull vs isna and notnull vs notna
These methods do the same thing, i.e. isnull and isna return the same values; under the hood, isnull simply calls isna. The Pandas documentation is also moving towards isna/notna. So while I used isnull/notnull in this blog, I think best practice would be to stick with isna/notna.
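You can verify the equivalence yourself:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan]})
# isna/isnull (and notna/notnull) are aliases and give identical results
print(df.isna().equals(df.isnull()))    # True
print(df.notna().equals(df.notnull()))  # True
```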
So that gives you some options for dealing with Nulls.