Dealing with Missing Data in Pandas
Real-world datasets often contain missing values, which can distort summary statistics and break downstream analysis. Pandas provides methods such as isnull(), dropna(), and fillna() for dealing with them. In this tutorial, we'll explore how to identify, remove, and fill missing values in a Pandas DataFrame.
Creating a DataFrame with Missing Data
Let's start by creating a DataFrame that contains missing values. We'll use np.nan to represent missing entries.
import pandas as pd
import numpy as np
# Creating a DataFrame with missing values
data = {
'Name': ['Alice', 'Bob', 'Charlie', np.nan, 'Emily'],
'Age': [24, np.nan, 22, 25, np.nan],
'Score': [85, 90, np.nan, 88, 95]
}
df = pd.DataFrame(data)
In this DataFrame, one entry is missing in 'Name' (row 3), two in 'Age' (rows 1 and 4), and one in 'Score' (row 2).
Identifying Missing Data
Before we can handle missing data, we need to identify where it exists in our DataFrame.
Using isnull()
The isnull() method returns a DataFrame of boolean values indicating whether each value is missing. isna() is an alias and behaves identically.
# Identifying missing values
missing_values = df.isnull()
missing_values
| | Name | Age | Score |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | True | False |
| 2 | False | False | True |
| 3 | True | False | False |
| 4 | False | True | False |
Using notnull()
Similarly, the notnull() method returns True for non-missing values and False for missing ones.
# Identifying non-missing values
non_missing_values = df.notnull()
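The boolean results of isnull() and notnull() are mostly useful as masks for selecting rows. The following is a minimal sketch of that idea, rebuilding the tutorial's DataFrame so it runs on its own; the variable names rows_with_age and rows_with_gaps are illustrative only.

```python
import numpy as np
import pandas as pd

# Same data as the tutorial's DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', np.nan, 'Emily'],
    'Age': [24, np.nan, 22, 25, np.nan],
    'Score': [85, 90, np.nan, 88, 95],
})

# Keep only the rows where 'Age' is present (index 0, 2, 3)
rows_with_age = df[df['Age'].notnull()]

# Keep only the rows where at least one column is missing (index 1, 2, 3, 4)
rows_with_gaps = df[df.isnull().any(axis=1)]

print(rows_with_age)
print(rows_with_gaps)
```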
Counting Missing Values
We can count the missing values in each column by chaining sum() onto isnull(), since each True counts as 1 when summed.
# Counting missing values in each column
missing_counts = df.isnull().sum()
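As a rough illustration of what these counts look like for the tutorial's data, the sketch below also derives a grand total and a per-column share of missing values. The names total_missing and missing_share are assumptions introduced here, not part of the original tutorial.

```python
import numpy as np
import pandas as pd

# Same data as the tutorial's DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', np.nan, 'Emily'],
    'Age': [24, np.nan, 22, 25, np.nan],
    'Score': [85, 90, np.nan, 88, 95],
})

# Missing values per column: Name 1, Age 2, Score 1
missing_counts = df.isnull().sum()

# Total number of missing cells in the whole DataFrame
total_missing = int(missing_counts.sum())

# Fraction of each column that is missing (the mean of a boolean column)
missing_share = df.isnull().mean()

print(missing_counts)
print(total_missing)
print(missing_share)
```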
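Dropping Rows with Missing Data
By default, dropna() removes every row that contains at least one missing value and returns a new DataFrame.
# Dropping rows with any missing values
df_dropped_rows = df.dropna()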
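Dropping every incomplete row can be too aggressive: with this data only one row survives. A possible refinement, sketched below, uses the subset and how parameters of dropna() to be more selective. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same data as the tutorial's DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', np.nan, 'Emily'],
    'Age': [24, np.nan, 22, 25, np.nan],
    'Score': [85, 90, np.nan, 88, 95],
})

# Default behaviour: only row 0 has no missing values
complete_rows = df.dropna()

# Drop a row only when 'Name' is missing; gaps elsewhere are kept
rows_with_name = df.dropna(subset=['Name'])

# Drop a row only when every value in it is missing (none here)
not_fully_empty = df.dropna(how='all')

print(complete_rows)
print(rows_with_name)
print(len(not_fully_empty))
```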
Dropping Columns with Missing Data
If you prefer to drop columns that contain missing values, set the axis parameter to 1. In this DataFrame every column has at least one missing value, so the result has no columns left.
# Dropping columns with any missing values
df_dropped_columns = df.dropna(axis=1)
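One way to keep columns that are only slightly incomplete is the thresh parameter, which sets the minimum number of non-missing values a column needs in order to survive. This sketch assumes a threshold of 4 purely for illustration.

```python
import numpy as np
import pandas as pd

# Same data as the tutorial's DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', np.nan, 'Emily'],
    'Age': [24, np.nan, 22, 25, np.nan],
    'Score': [85, 90, np.nan, 88, 95],
})

# Every column has a gap, so dropping on "any" leaves no columns
df_dropped_columns = df.dropna(axis=1)

# Keep columns with at least 4 non-missing values:
# 'Name' and 'Score' have 4 each, 'Age' has only 3 and is dropped
mostly_complete = df.dropna(axis=1, thresh=4)

print(df_dropped_columns.shape)
print(list(mostly_complete.columns))
```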
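Filling with a Specific Value
Instead of dropping data, you can replace missing entries with fillna(). Passing a single value fills every column, so the string below is also written into the numeric 'Age' and 'Score' columns.
# Filling missing values with a specific value
df_filled = df.fillna('Unknown')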
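If a placeholder string only makes sense for a text column, one option is to fill that column alone and leave the numeric columns untouched. The sketch below works on a copy (named df_named here for illustration) so the original DataFrame is not modified.

```python
import numpy as np
import pandas as pd

# Same data as the tutorial's DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', np.nan, 'Emily'],
    'Age': [24, np.nan, 22, 25, np.nan],
    'Score': [85, 90, np.nan, 88, 95],
})

# Work on a copy and fill only the 'Name' column
df_named = df.copy()
df_named['Name'] = df_named['Name'].fillna('Unknown')

# 'Age' and 'Score' still contain their original gaps
print(df_named)
print(df_named[['Age', 'Score']].isnull().sum())
```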
Forward Fill (ffill)
Forward fill replaces each missing value with the last valid value before it along the chosen axis (down each column by default).
# Forward filling missing values
# (fillna(method='ffill') is deprecated; use ffill() instead)
df_ffill = df.ffill()
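To make the behaviour of forward fill concrete, the sketch below applies ffill() to the tutorial's data and then to a small made-up Series (s, an assumption for illustration) to show two details: a gap at the very start has nothing before it to copy, and the limit parameter caps how many consecutive gaps are filled.

```python
import numpy as np
import pandas as pd

# Same data as the tutorial's DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', np.nan, 'Emily'],
    'Age': [24, np.nan, 22, 25, np.nan],
    'Score': [85, 90, np.nan, 88, 95],
})

# Each gap takes the value from the row above it
df_ffill = df.ffill()
print(df_ffill)

# Hypothetical Series with a leading gap and two consecutive gaps
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# The leading gap stays missing; both middle gaps become 1.0
s_filled = s.ffill()

# With limit=1 only the first of the two consecutive gaps is filled
s_limited = s.ffill(limit=1)

print(s_filled)
print(s_limited)
```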
Backward Fill (bfill)
Backward fill uses the next valid observation to fill missing values.
# Backward filling missing values
# (fillna(method='bfill') is deprecated; use bfill() instead)
df_bfill = df.bfill()
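Backward fill has the mirror-image limitation: a gap in the last row has no later value to borrow. The sketch below shows this on the tutorial's data and, as one possible workaround, chains ffill() after bfill() to close the remaining gap. The name df_both is illustrative.

```python
import numpy as np
import pandas as pd

# Same data as the tutorial's DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', np.nan, 'Emily'],
    'Age': [24, np.nan, 22, 25, np.nan],
    'Score': [85, 90, np.nan, 88, 95],
})

# Each gap takes the value from the row below it;
# the last 'Age' entry stays missing because nothing follows it
df_bfill = df.bfill()

# Backward fill first, then forward fill whatever is left
df_both = df.bfill().ffill()

print(df_bfill)
print(df_both)
```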
Filling with Mean, Median, or Mode
For numerical columns, it's common to fill missing values with the mean, median, or mode.
# Filling missing values in 'Age' with the mean age
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)
# Filling missing values in 'Score' with the median score
median_score = df['Score'].median()
df['Score'] = df['Score'].fillna(median_score)
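The code above covers the mean and the median; the mode is often the more natural choice for categorical data. The sketch below first confirms the mean and median for the tutorial's data, then fills a small made-up Series (colors, an assumption for illustration) with its most frequent value. Note that mode() returns a Series because there can be ties, so the first entry is taken.

```python
import numpy as np
import pandas as pd

# Same data as the tutorial's DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', np.nan, 'Emily'],
    'Age': [24, np.nan, 22, 25, np.nan],
    'Score': [85, 90, np.nan, 88, 95],
})

# mean() and median() skip missing values by default
mean_age = df['Age'].mean()          # (24 + 22 + 25) / 3
median_score = df['Score'].median()  # middle of 85, 88, 90, 95 is 89.0

# Hypothetical categorical column with one gap
colors = pd.Series(['red', 'blue', np.nan, 'red'])

# mode() returns all most-frequent values; take the first one
most_common = colors.mode()[0]
colors_filled = colors.fillna(most_common)

print(mean_age, median_score)
print(colors_filled)
```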
Filling with a Dictionary
You can specify different fill values for different columns using a dictionary.
# Filling missing values with different values for each column
fill_values = {'Name': 'No Name', 'Age': df['Age'].mean(), 'Score': df['Score'].median()}
df_filled_dict = df.fillna(fill_values)
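Because the previous section already filled 'Age' and 'Score' in place, the dictionary fill above only changes 'Name' when run in sequence. The sketch below repeats it on a freshly built DataFrame so that all three columns are filled in one call, and checks that the original is left unchanged.

```python
import numpy as np
import pandas as pd

# Freshly built copy of the tutorial's DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', np.nan, 'Emily'],
    'Age': [24, np.nan, 22, 25, np.nan],
    'Score': [85, 90, np.nan, 88, 95],
})

# One fill value per column, keyed by column name
fill_values = {
    'Name': 'No Name',
    'Age': df['Age'].mean(),
    'Score': df['Score'].median(),
}

# fillna() returns a new DataFrame; df itself keeps its gaps
df_filled_dict = df.fillna(fill_values)

print(df_filled_dict)
print(df_filled_dict.isnull().sum())
```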