Statistical Functions in NumPy¶
NumPy provides a comprehensive suite of statistical functions that operate on arrays and matrices, allowing you to compute descriptive statistics, assess data distributions, and perform more complex analyses like correlation and covariance.
Importing NumPy¶
Before you begin working with NumPy's statistical functions, you need to import the library. It's a common practice to import NumPy using the alias np for convenience.
import numpy as np
This line of code imports the NumPy library and allows you to access its functions using the np prefix.
Creating a Sample Dataset¶
For demonstration purposes, let's create a sample NumPy array to work with. This array will simulate a dataset of numerical values.
# Creating a sample dataset
data = np.array([15, 20, 35, 40, 50, 65, 75, 80, 95])
This array data contains nine numerical values. We'll use this dataset to explore various statistical functions provided by NumPy.
Basic Statistical Functions¶
NumPy offers basic statistical functions that compute simple statistics like mean, median, sum, minimum, maximum, standard deviation, and variance.
Mean (np.mean)¶
The mean (average) is a measure of the central tendency of a dataset. You can calculate the mean of an array using the np.mean() function.
# Calculating the mean of the dataset
mean_value = np.mean(data)
In this code, mean_value will store the average of all elements in the data array.
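As a small, self-contained sketch of the call above: the snippet repeats the sample dataset and prints the result. It also shows that np.mean() returns a floating-point value even though the input holds integers.

```python
import numpy as np

data = np.array([15, 20, 35, 40, 50, 65, 75, 80, 95])

mean_value = np.mean(data)

print(mean_value)        # approximately 52.78 (475 / 9)
print(data.dtype)        # an integer dtype
print(type(mean_value))  # a NumPy floating-point scalar
```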
Median (np.median)¶
The median is the middle value of a sorted dataset. If the dataset has an even number of elements, the median is the average of the two middle numbers.
# Calculating the median of the dataset
median_value = np.median(data)
Here, median_value will contain the median of the data array.
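To illustrate both the odd and even cases described above, the following sketch computes the median of the nine-element dataset and of its first eight elements. The name even_data is introduced here only for the example.

```python
import numpy as np

data = np.array([15, 20, 35, 40, 50, 65, 75, 80, 95])

# Odd number of elements: the median is the single middle value
median_value = np.median(data)

# Even number of elements: the median is the average of the two middle values
even_data = data[:8]            # [15, 20, 35, 40, 50, 65, 75, 80]
even_median = np.median(even_data)

print(median_value)  # 50.0
print(even_median)   # 45.0, the average of 40 and 50
```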
Sum (np.sum)¶
The sum function computes the total of all elements in the array.
# Calculating the sum of the dataset
sum_value = np.sum(data)
sum_value will be the sum of all numbers in the data array.
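A brief sketch of the sum on the sample dataset follows. It also checks the relationship between the sum and the mean (sum divided by element count); the name mean_from_sum is illustrative only.

```python
import numpy as np

data = np.array([15, 20, 35, 40, 50, 65, 75, 80, 95])

sum_value = np.sum(data)

# The mean can be recovered from the sum and the number of elements
mean_from_sum = sum_value / data.size

print(sum_value)      # 475
print(mean_from_sum)  # approximately 52.78, same as np.mean(data)
```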
Min and Max (np.min, np.max)¶
The np.min() and np.max() functions find the minimum and maximum values in the array, respectively.
# Finding the minimum value in the dataset
min_value = np.min(data)
# Finding the maximum value in the dataset
max_value = np.max(data)
min_value and max_value will hold the smallest and largest numbers in the data array.
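The sketch below runs both calls on the sample dataset and derives the range (maximum minus minimum) from them. The name value_range is an illustrative addition, not part of the original example.

```python
import numpy as np

data = np.array([15, 20, 35, 40, 50, 65, 75, 80, 95])

min_value = np.min(data)
max_value = np.max(data)

# The range is the spread between the largest and smallest values
value_range = max_value - min_value

print(min_value)    # 15
print(max_value)    # 95
print(value_range)  # 80
```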
Standard Deviation and Variance (np.std, np.var)¶
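The standard deviation measures the amount of variation or dispersion in a dataset. The variance is the square of the standard deviation. By default, np.std() and np.var() compute the population versions (dividing by N, ddof=0); pass ddof=1 to get the sample versions (dividing by N - 1).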
# Calculating the standard deviation of the dataset
std_deviation = np.std(data)
# Calculating the variance of the dataset
variance = np.var(data)
std_deviation will store the standard deviation, and variance will store the variance of the data array.
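The following sketch checks two properties on the sample dataset: the variance equals the square of the standard deviation, and the ddof argument switches between dividing by N (the default) and by N - 1. The names sample_std and sample_variance are introduced here for illustration.

```python
import numpy as np

data = np.array([15, 20, 35, 40, 50, 65, 75, 80, 95])

# Population statistics (default ddof=0, divide by N)
std_deviation = np.std(data)
variance = np.var(data)

# Sample statistics (ddof=1, divide by N - 1)
sample_std = np.std(data, ddof=1)
sample_variance = np.var(data, ddof=1)

print(variance)         # approximately 683.95
print(sample_variance)  # approximately 769.44
print(np.isclose(variance, std_deviation ** 2))  # True
```

Which version is appropriate depends on whether the array is the whole population or a sample drawn from it.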
Advanced Statistical Functions¶
Beyond basic statistics, NumPy provides functions for more advanced statistical analysis, such as calculating percentiles, correlation coefficients, and covariance matrices.
Percentiles (np.percentile)¶
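Percentiles are used to understand the distribution of data. The nth percentile is the value below which n% of the data falls. When the requested percentile falls between two data points, np.percentile() interpolates linearly between them by default.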
# Calculating the 25th, 50th, and 75th percentiles
percentile_25 = np.percentile(data, 25)
percentile_50 = np.percentile(data, 50) # Equivalent to the median
percentile_75 = np.percentile(data, 75)
In this code, percentile_25, percentile_50, and percentile_75 represent the first quartile, median, and third quartile of the dataset.
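np.percentile() also accepts a sequence of percentiles, so the three quartiles can be computed in one call. The sketch below does that on the sample dataset and derives the interquartile range; the names quartiles and iqr are illustrative.

```python
import numpy as np

data = np.array([15, 20, 35, 40, 50, 65, 75, 80, 95])

# Passing a list returns one value per requested percentile
quartiles = np.percentile(data, [25, 50, 75])
q1, q2, q3 = quartiles

# Interquartile range: spread of the middle 50% of the data
iqr = q3 - q1

print(q1, q2, q3)  # 35.0 50.0 75.0
print(iqr)         # 40.0
```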
Correlation Coefficient (np.corrcoef)¶
The correlation coefficient measures the strength and direction of the linear relationship between two datasets, on a scale from -1 to 1. The np.corrcoef() function returns a matrix of Pearson correlation coefficients between each pair of datasets.
# Creating two datasets
data_x = np.array([1, 2, 3, 4, 5])
data_y = np.array([2, 4, 6, 8, 10])
# Calculating the correlation coefficient matrix
correlation_matrix = np.corrcoef(data_x, data_y)
Here, data_x and data_y are two datasets. The np.corrcoef() function returns a 2x2 matrix: the diagonal elements are each dataset's correlation with itself (always 1), and the off-diagonal elements are the correlation coefficient between data_x and data_y.
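Because data_y is exactly twice data_x, the two datasets are perfectly correlated. The sketch below extracts the coefficient from the matrix and, for contrast, adds a hypothetical decreasing dataset data_z that is not part of the original example.

```python
import numpy as np

data_x = np.array([1, 2, 3, 4, 5])
data_y = np.array([2, 4, 6, 8, 10])
data_z = np.array([10, 8, 6, 4, 2])  # hypothetical: decreases as data_x increases

correlation_matrix = np.corrcoef(data_x, data_y)

# The off-diagonal entry [0, 1] (equal to [1, 0]) is the x-y correlation
r_xy = correlation_matrix[0, 1]
r_xz = np.corrcoef(data_x, data_z)[0, 1]

print(correlation_matrix.shape)  # (2, 2)
print(r_xy)  # approximately 1.0 (perfect positive correlation)
print(r_xz)  # approximately -1.0 (perfect negative correlation)
```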
Covariance Matrix (np.cov)¶
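Covariance measures the degree to which two variables vary together: it is positive when they tend to increase together and negative when one tends to increase as the other decreases. The covariance matrix shows the covariance between each pair of variables.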
# Calculating the covariance matrix
covariance_matrix = np.cov(data_x, data_y)
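covariance_matrix will be a 2x2 matrix whose diagonal elements are the variances of data_x and data_y and whose off-diagonal elements are the covariance between them. Note that np.cov() divides by N - 1 (sample covariance) by default, whereas np.var() divides by N by default.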
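The sketch below prints the covariance matrix for the two datasets and checks that its diagonal matches np.var() with ddof=1, since np.cov() normalizes by N - 1 by default. It also recovers the correlation coefficient from the covariance; the name r_from_cov is illustrative.

```python
import numpy as np

data_x = np.array([1, 2, 3, 4, 5])
data_y = np.array([2, 4, 6, 8, 10])

covariance_matrix = np.cov(data_x, data_y)
print(covariance_matrix)
# Values: [[ 2.5,  5. ],
#          [ 5. , 10. ]]

# Diagonal entries are the sample variances (ddof=1) of each dataset
var_x = np.var(data_x, ddof=1)
var_y = np.var(data_y, ddof=1)

# Correlation = covariance / (std_x * std_y)
r_from_cov = covariance_matrix[0, 1] / np.sqrt(var_x * var_y)
print(r_from_cov)  # approximately 1.0
```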
Using the axis Parameter¶
Many statistical functions in NumPy accept an axis parameter, which specifies the axis along which to perform the computation.
- axis=0: Compute down the rows (downward), producing one result per column.
- axis=1: Compute across the columns (across), producing one result per row.
Consider a 2D array for demonstration:
# Creating a 2D array
array_2d = np.array([
    [10, 15, 20],
    [25, 30, 35],
    [40, 45, 50]
])
Computing Along Columns¶
# Calculating the mean along columns
mean_columns = np.mean(array_2d, axis=0)
In this code, mean_columns will contain the mean of each column.
Computing Along Rows¶
# Calculating the mean along rows
mean_rows = np.mean(array_2d, axis=1)
Here, mean_rows will contain the mean of each row.
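To make the effect of axis concrete, the sketch below prints both results for the 2D array. It also uses the keepdims argument of np.mean(), which is not covered above, to keep the reduced axis as a dimension of size 1; the name mean_rows_2d is illustrative.

```python
import numpy as np

array_2d = np.array([
    [10, 15, 20],
    [25, 30, 35],
    [40, 45, 50]
])

# axis=0 collapses the rows: one mean per column
mean_columns = np.mean(array_2d, axis=0)
print(mean_columns)  # values 25.0, 30.0, 35.0

# axis=1 collapses the columns: one mean per row
mean_rows = np.mean(array_2d, axis=1)
print(mean_rows)     # values 15.0, 30.0, 45.0

# keepdims=True keeps the reduced axis with length 1
mean_rows_2d = np.mean(array_2d, axis=1, keepdims=True)
print(mean_rows_2d.shape)  # (3, 1)
```

Keeping the reduced dimension can be convenient when the result is later combined with the original array, for example to subtract each row's mean from that row.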
Statistical Functions on Multidimensional Arrays¶
NumPy's statistical functions can operate on multidimensional arrays. Let's explore how to apply these functions to 3D arrays.
# Creating a 3D array
array_3d = np.array([
    [
        [1, 2, 3],
        [4, 5, 6]
    ],
    [
        [7, 8, 9],
        [10, 11, 12]
    ]
])
Calculating the Overall Mean¶
# Calculating the overall mean of the 3D array
overall_mean = np.mean(array_3d)
This computes the mean of all elements in array_3d. When no axis is given, the array is treated as if it were flattened.
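A short sketch of the overall mean on the 3D array follows, along with its shape for reference. The comparison against the flattened array is an illustrative check, not part of the original example.

```python
import numpy as np

array_3d = np.array([
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]]
])

print(array_3d.shape)  # (2, 2, 3): 2 blocks, 2 rows each, 3 columns each

# With no axis argument, every element contributes to a single mean
overall_mean = np.mean(array_3d)
print(overall_mean)    # 6.5

# Same result as averaging the flattened array
flat_mean = np.mean(array_3d.ravel())
```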
Calculating the Mean Along Specific Axes¶
Mean along axis 0:
# Mean along axis 0
mean_axis0 = np.mean(array_3d, axis=0)
This averages corresponding elements of the two 2D arrays in array_3d, producing a result of shape (2, 3).
Mean along axis 1:
# Mean along axis 1
mean_axis1 = np.mean(array_3d, axis=1)
This averages over the rows within each 2D array, giving the mean of each column per 2D array. The result has shape (2, 3).
Mean along axis 2:
# Mean along axis 2
mean_axis2 = np.mean(array_3d, axis=2)
This averages over the columns within each 2D array, giving the mean of each row per 2D array. The result has shape (2, 2).
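The sketch below runs all three reductions on the 3D array and prints the resulting values, so the effect of each axis can be compared directly. The last call, which passes a tuple of axes to reduce over two axes at once, goes beyond the examples above; the name mean_per_block is illustrative.

```python
import numpy as np

array_3d = np.array([
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]]
])

# axis=0: average the two blocks element by element -> shape (2, 3)
mean_axis0 = np.mean(array_3d, axis=0)
print(mean_axis0)  # values [[4, 5, 6], [7, 8, 9]]

# axis=1: average the rows inside each block -> shape (2, 3)
mean_axis1 = np.mean(array_3d, axis=1)
print(mean_axis1)  # values [[2.5, 3.5, 4.5], [8.5, 9.5, 10.5]]

# axis=2: average the columns inside each block -> shape (2, 2)
mean_axis2 = np.mean(array_3d, axis=2)
print(mean_axis2)  # values [[2, 5], [8, 11]]

# A tuple of axes reduces over several axes at once: one mean per block
mean_per_block = np.mean(array_3d, axis=(1, 2))
print(mean_per_block)  # values [3.5, 9.5]
```

In each case the axis being reduced disappears from the shape of the result, which is a handy way to predict the output shape.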