Histograms in Matplotlib
Histograms are an effective way to visualize the distribution of a dataset by grouping data into intervals, called bins, and displaying the frequency of data points in each bin.
Setting Up Matplotlib for Histograms
To begin, let’s import the required libraries and apply a preferred style to give our plots a polished look.
import matplotlib.pyplot as plt
import numpy as np
# Applying a visual style for aesthetics
plt.style.use('fivethirtyeight')
Explanation:
- Pyplot and NumPy: matplotlib.pyplot is essential for plotting, while numpy is used here to generate the sample data.
- Preferred Style: The 'fivethirtyeight' style gives a clean and modern look, making the histogram visually appealing.
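If you are unsure which styles your installation provides, the following minimal sketch lists them and applies a style to a single figure only. The sample values passed to hist() are made up for illustration and are not part of this tutorial's dataset.

```python
import matplotlib.pyplot as plt

# List the style names bundled with the installed Matplotlib version
print(plt.style.available)

# Apply the style only inside this block, leaving global settings untouched
with plt.style.context('fivethirtyeight'):
    fig, ax = plt.subplots()
    # Hypothetical sample values, used only to have something to draw
    ax.hist([21, 25, 25, 30, 42, 47, 47, 51], edgecolor='black')
    ax.set_title("Styled only inside the context manager")

plt.show()
```

plt.style.use() changes the style for every later plot in the session, whereas the context manager limits the change to the figures created inside the with block.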
# Set the seed for reproducibility
np.random.seed(36)
# Generate 20 random integers in the range 21 to 62
ages = np.random.randint(21, 63, size=20).tolist()
print(ages)
# Creating a basic histogram
plt.hist(ages)
plt.title("Age Distribution")
plt.xlabel("Ages")
plt.ylabel("Frequency")
plt.show()
[26, 55, 51, 61, 54, 47, 30, 58, 27, 57, 21, 48, 52, 27, 27, 43, 52, 22, 62, 43]
Explanation:
- Bins: By default, Matplotlib uses 10 equal-width bins spanning the minimum to the maximum of the data.
- Frequency: Each bin’s height shows the count of data points in that range. For example, the first bin (21 to 25.1) contains two values, 21 and 22, so its bar height is 2.
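To see where the default bin boundaries come from, here is a small sketch that rebuilds them by hand. It assumes the default bin count of 10 (the stock value of rcParams["hist.bins"]); the names n_bins, bin_width and default_edges are introduced only for this illustration.

```python
import numpy as np

ages = [26, 55, 51, 61, 54, 47, 30, 58, 27, 57,
        21, 48, 52, 27, 27, 43, 52, 22, 62, 43]

# Assumption: 10 is the default number of bins
n_bins = 10

# Equal-width bins span the data from its minimum to its maximum
bin_width = (max(ages) - min(ages)) / n_bins
default_edges = np.linspace(min(ages), max(ages), n_bins + 1)

print("Bin width:", bin_width)
print("Bin edges:", default_edges)
```

Ten bins always need eleven edges, because each bin is bounded by a left and a right edge.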
Enhancing Histogram Readability
Histograms can sometimes look cluttered, so adding edge colors helps to separate the bins visually.
# Adding edge colors to each bin
plt.hist(ages, edgecolor='black')
plt.title("Age Distribution with Edge Colors")
plt.xlabel("Ages")
plt.ylabel("Frequency")
plt.show()
Explanation:
- Edge Colors: Adding an edge color (edgecolor='black') makes each bin stand out, enhancing readability.
- Frequency Interpretation: The height of each bin represents the frequency of data points within that interval.
You can also check how many bins were created and the exact range of each bin:
- counts, bin_edges, _: plt.hist() returns the frequency of each bin in counts and the boundaries of the bins in bin_edges. The third return value (the drawn bars) is not needed here, so it is assigned to _.
counts, bin_edges, _ = plt.hist(ages, edgecolor='black')
# Display the bins and counts
print("Counts per bin:", counts)
print("Bin edges:", bin_edges)
Counts per bin: [2. 4. 1. 0. 0. 2. 2. 3. 3. 3.]
Bin edges: [21. 25.1 29.2 33.3 37.4 41.5 45.6 49.7 53.8 57.9 62. ]
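The two arrays are easier to read when each count is paired with its bin range. The sketch below does that with np.histogram, which computes counts and edges without drawing anything and should match what plt.hist() returns for the same data.

```python
import numpy as np

ages = [26, 55, 51, 61, 54, 47, 30, 58, 27, 57,
        21, 48, 52, 27, 27, 43, 52, 22, 62, 43]

# Compute counts and edges without creating a plot
counts, bin_edges = np.histogram(ages)

# Pair each bin's left and right edge with its count
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {count} value(s)")
```

There is always one more edge than there are counts, which is why the loop pairs bin_edges[:-1] with bin_edges[1:].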
Configuring Bins
The number of bins in a histogram can significantly affect its appearance and interpretation. Matplotlib allows you to control the bin structure in multiple ways.
Specifying Bins with an Integer
You can specify the number of bins by passing an integer to the bins parameter. For example, bins=5 divides the data into five equal-width intervals.
# Creating a histogram with a specific number of bins
plt.hist(ages, bins=5, edgecolor='black')
plt.title("Age Distribution with 5 Bins")
plt.xlabel("Ages")
plt.ylabel("Frequency")
plt.show()
Custom Bin Ranges
Custom bins allow you to define exact bin edges, which can be useful for focusing on specific data intervals.
# Custom bin edges
custom_bins = [20, 30, 40, 50, 60, 70]
plt.hist(ages, bins=custom_bins)
plt.title("Age Distribution with Custom Bins")
plt.xlabel("Ages")
plt.ylabel("Frequency")
plt.show()
Explanation:
- Integer-defined Bins: This method splits the data into equal-width bins, making it easy to adjust granularity.
- Custom Bin Ranges: This method allows for precise control over bin intervals, useful for excluding or emphasizing specific data ranges.
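To compare the two approaches numerically, the following sketch counts the same ages with five equal-width bins and with the custom edges from above. It uses np.histogram rather than plt.hist() so that no plot is needed; the variable names are chosen for this illustration.

```python
import numpy as np

ages = [26, 55, 51, 61, 54, 47, 30, 58, 27, 57,
        21, 48, 52, 27, 27, 43, 52, 22, 62, 43]

# Integer bins: five equal-width intervals between min(ages) and max(ages)
counts_equal, edges_equal = np.histogram(ages, bins=5)

# Custom bins: the edges are taken exactly as given
custom_bins = [20, 30, 40, 50, 60, 70]
counts_custom, edges_custom = np.histogram(ages, bins=custom_bins)

print("Equal-width edges: ", edges_equal)
print("Equal-width counts:", counts_equal)
print("Custom edges:      ", edges_custom)
print("Custom counts:     ", counts_custom)
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both. That is why the age 30 is counted in the 30 to 40 bin rather than the 20 to 30 bin. Values that fall outside the custom edges are not counted at all.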
Plotting on a Logarithmic Scale
If bin frequencies vary widely, a logarithmic y-axis can make the distribution clearer: it compresses the tallest bars so that bins containing only a few values remain visible.
# Plotting histogram on a logarithmic scale
plt.hist(ages, bins=5, edgecolor='black', log=True)
plt.title("Age Distribution on Logarithmic Scale")
plt.xlabel("Ages")
plt.ylabel("Frequency (log scale)")
plt.show()
Explanation:
- Logarithmic Scale: Setting log=True adjusts the y-axis to a log scale, making it easier to visualize data with large frequency differences.
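As an alternative sketch, the y-axis can be switched to a log scale after the histogram is drawn, using the Axes method set_yscale(). This is written with the object-oriented interface (fig, ax), which the rest of this tutorial does not use, and it should produce the same y-axis scaling as log=True.

```python
import matplotlib.pyplot as plt

ages = [26, 55, 51, 61, 54, 47, 30, 58, 27, 57,
        21, 48, 52, 27, 27, 43, 52, 22, 62, 43]

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(ages, bins=5, edgecolor='black')

# Switch the y-axis to a logarithmic scale after plotting
ax.set_yscale('log')

ax.set_title("Age Distribution on Logarithmic Scale")
ax.set_xlabel("Ages")
ax.set_ylabel("Frequency (log scale)")
plt.show()
```

Keep in mind that a bin with a count of zero cannot be drawn on a logarithmic axis, because the logarithm of zero is undefined. With five bins this dataset has no empty bins.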
Adding Reference Lines
Adding reference lines, such as the median or mean, can make the histogram more informative.
Example: Adding a Vertical Line for the Median
import numpy as np
# Calculate the median
median_age = np.median(ages)
# Plot histogram with median line
plt.hist(ages, bins=5, edgecolor='black')
plt.axvline(median_age, color='red', linestyle='dashed', linewidth=1)
plt.text(median_age + 1, 3, 'Median', color='red')
plt.title("Age Distribution with Median Line")
plt.xlabel("Ages")
plt.ylabel("Frequency")
plt.show()
Explanation:
- Vertical Line: Using plt.axvline() to add a vertical line at the median helps highlight this value on the histogram.
- Customizing Line Properties: We set the line color, style, and thickness with parameters like color, linestyle, and linewidth.
- Labeling: Adding a label with plt.text() clarifies the purpose of the line for viewers.
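The same idea extends to several reference lines. The sketch below draws both the median and the mean and, as one possible variation, labels them through a legend instead of plt.text(), so the labels do not need manual positioning. The colors and line styles are arbitrary choices for this illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

ages = [26, 55, 51, 61, 54, 47, 30, 58, 27, 57,
        21, 48, 52, 27, 27, 43, 52, 22, 62, 43]

median_age = np.median(ages)
mean_age = np.mean(ages)

fig, ax = plt.subplots()
ax.hist(ages, bins=5, edgecolor='black')

# The label argument makes each line appear in the legend
ax.axvline(median_age, color='red', linestyle='dashed', linewidth=1,
           label=f"Median ({median_age:.2f})")
ax.axvline(mean_age, color='blue', linestyle='dotted', linewidth=1,
           label=f"Mean ({mean_age:.2f})")
ax.legend()

ax.set_title("Age Distribution with Median and Mean Lines")
ax.set_xlabel("Ages")
ax.set_ylabel("Frequency")
plt.show()
```

Showing both lines makes it easy to see how far apart the mean and the median are, which hints at how skewed the distribution is.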