Basics of Pandas in Python: More Operations and Functions¶
In this tutorial, we'll explore some of the more advanced and useful functions that Pandas offers for data manipulation and analysis. These functions will help you retrieve basic information about your DataFrame, find unique values, apply custom functions, and sort your data efficiently.
Let's start by importing Pandas and creating a sample DataFrame to work with:
import pandas as pd
# Create a sample DataFrame
data = {
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
    'Salary': [70000, 80000, 75000, 82000, 68000]
}
df = pd.DataFrame(data)
Retrieving Basic Information About the DataFrame¶
Understanding the structure and content of your DataFrame is crucial before performing any analysis.
Getting a Summary of the DataFrame¶
The info() method prints a concise summary of the DataFrame, including the index dtype, column dtypes, non-null counts per column, and memory usage.
df.info()
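Because info() prints its summary rather than returning it, it can help to capture the text when you want to log or inspect it. The following is a minimal sketch that rebuilds the sample DataFrame and uses the buf parameter of info(); the names buffer and summary_text are illustrative only.

```python
import io

import pandas as pd

df = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
    'Salary': [70000, 80000, 75000, 82000, 68000],
})

# info() writes to stdout by default; passing a text buffer captures the output instead
buffer = io.StringIO()
df.info(buf=buffer)
summary_text = buffer.getvalue()

print(summary_text)
```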
Accessing Column Names¶
To retrieve the column names in your DataFrame, use the columns attribute:
df.columns
Accessing Column Names with keys()¶
The keys() method returns the same column labels as the columns attribute:
df.keys()
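As a quick check of the two approaches above, the sketch below rebuilds the sample DataFrame, converts the column labels to a plain Python list, and confirms that keys() and columns hold the same labels. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
    'Salary': [70000, 80000, 75000, 82000, 68000],
})

# df.columns is a pandas Index object; convert it when a plain list is needed
column_list = list(df.columns)

# keys() and columns refer to the same labels for a DataFrame
same_labels = df.keys().equals(df.columns)

print(column_list)
print(same_labels)
```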
Displaying the First Few Rows¶
Use the head() method to display the first few rows of the DataFrame. By default, it shows the first five rows, but you can specify the number of rows you want:
df.head(3)
Displaying the Last Few Rows¶
Similarly, use the tail() method to view the last few rows:
df.tail(2)
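The sketch below puts head() and tail() side by side on the sample DataFrame. Note that both return new DataFrames whose rows keep their original index labels; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
    'Salary': [70000, 80000, 75000, 82000, 68000],
})

first_three = df.head(3)   # rows with index labels 0, 1, 2
last_two = df.tail(2)      # rows with index labels 3, 4
default_head = df.head()   # up to five rows when no argument is given

print(first_three)
print(last_two)
```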
Converting DataFrame to NumPy Array¶
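If you need to perform operations using NumPy, you can convert the DataFrame to a NumPy array using the values attribute. The pandas documentation recommends the to_numpy() method for new code, which serves the same purpose: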
array_data = df.values
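A short sketch of the conversion using to_numpy(), assuming the same sample data. Because the DataFrame mixes text and numbers, the full conversion yields an array of generic Python objects; converting a single numeric column gives a numeric array that NumPy can work with directly.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
    'Salary': [70000, 80000, 75000, 82000, 68000],
})

# Whole DataFrame: mixed column types fall back to the object dtype
mixed_array = df.to_numpy()

# Single numeric column: the result is a numeric NumPy array
salary_array = df['Salary'].to_numpy()
mean_salary = salary_array.mean()

print(mixed_array.shape, mixed_array.dtype)
print(mean_salary)
```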
Getting the Size of the DataFrame¶
The size attribute returns the total number of elements in the DataFrame (rows multiplied by columns):
total_elements = df.size
Getting the Shape of the DataFrame¶
The shape attribute provides the dimensions of the DataFrame as a tuple of the form (number of rows, number of columns):
data_shape = df.shape
Getting the Number of Rows and Columns¶
To get the number of rows:
num_rows = df.shape[0]
To get the number of columns:
num_columns = df.shape[1]
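The relationship between size, shape, and the row and column counts can be sketched as follows, using the same sample data. Tuple unpacking and len() are common alternatives to indexing into shape.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
    'Salary': [70000, 80000, 75000, 82000, 68000],
})

# Unpack both dimensions at once
num_rows, num_columns = df.shape

# len(df) is another way to get the row count
row_count = len(df)

# size is the product of the two dimensions
total_elements = df.size

print(num_rows, num_columns, total_elements)
```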
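Finding Unique Values¶
Pandas provides several methods for inspecting the distinct values in a column.
Getting the Unique Values¶
The unique() method returns the distinct values in a column, in order of first appearance:
unique_departments = df['Department'].unique()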
Counting the Number of Unique Values¶
The nunique() method returns the number of unique values in a column:
num_unique_departments = df['Department'].nunique()
Counting Occurrences of Each Value¶
The value_counts() method counts the frequency of each unique value in a column:
department_counts = df['Department'].value_counts()
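The three methods can be compared on the sample data as sketched below. The normalize=True option of value_counts() is an addition not covered above; it returns proportions instead of raw counts.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
    'Salary': [70000, 80000, 75000, 82000, 68000],
})

# Distinct values, in order of first appearance
unique_departments = list(df['Department'].unique())

# Number of distinct values
num_unique_departments = df['Department'].nunique()

# Frequency of each value, and the same as a share of all rows
department_counts = df['Department'].value_counts()
department_shares = df['Department'].value_counts(normalize=True)

print(unique_departments)
print(num_unique_departments)
print(department_counts)
```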
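Applying Custom Functions¶
The apply() method runs a function on each value of a column and returns the results as a new Series.
Applying a Function to a Column¶
For example, to calculate a 10% bonus from each salary:
def calculate_bonus(salary):
    return salary * 0.10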
df['Bonus'] = df['Salary'].apply(calculate_bonus)
Applying an Anonymous Function (Lambda) to a Column¶
You can also use a lambda function for simplicity:
df['Adjusted Salary'] = df['Salary'].apply(lambda x: x * 1.05)
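For simple arithmetic like the bonus calculation, a vectorized expression on the column gives the same result as apply() and is usually faster on large data. The sketch below compares the two, and also shows apply() with axis=1 for a row-wise calculation. The bonus_rates table is hypothetical and exists only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
    'Salary': [70000, 80000, 75000, 82000, 68000],
})

def calculate_bonus(salary):
    return salary * 0.10

# Element-by-element call of a Python function
bonus_apply = df['Salary'].apply(calculate_bonus)

# Vectorized arithmetic on the whole column at once
bonus_vectorized = df['Salary'] * 0.10

# Row-wise apply: the function receives each row, so it can use several columns
bonus_rates = {'HR': 0.08, 'IT': 0.12, 'Finance': 0.10}  # hypothetical rates
bonus_by_department = df.apply(
    lambda row: row['Salary'] * bonus_rates[row['Department']], axis=1
)

print(np.allclose(bonus_apply, bonus_vectorized))
print(bonus_by_department)
```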
Applying a Function Element-wise¶
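If you want to apply a function to every element in the DataFrame, use applymap(). In pandas 2.1 and later, applymap() is deprecated in favor of DataFrame.map(), which works the same way. For example, converting all string data to uppercase:
df_upper = df.applymap(lambda x: x.upper() if isinstance(x, str) else x)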
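One way to write the element-wise example so that it runs on both older and newer pandas releases is to pick whichever method is available. This is a sketch, not an official compatibility pattern; the name elementwise is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
    'Salary': [70000, 80000, 75000, 82000, 68000],
})

# DataFrame.map exists from pandas 2.1; fall back to applymap on older versions
elementwise = df.map if hasattr(df, 'map') else df.applymap

# Uppercase strings, leave every other value unchanged
df_upper = elementwise(lambda x: x.upper() if isinstance(x, str) else x)

print(df_upper)
```

The call returns a new DataFrame, so the original df still holds the mixed-case names.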
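Sorting Data¶
The sort_values() method sorts the DataFrame by one or more columns and returns a new, sorted DataFrame.
Sorting by a Single Column¶
By default, the sort is in ascending order:
df_sorted_salary = df.sort_values('Salary')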
Sorting in Descending Order¶
To sort the DataFrame in descending order:
df_sorted_salary_desc = df.sort_values('Salary', ascending=False)
Sorting by Multiple Columns¶
You can sort by multiple columns by passing a list of column names. For example, sorting by 'Department' and then by 'Salary':
df_sorted_multi = df.sort_values(['Department', 'Salary'])
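When sorting by several columns, the ascending parameter also accepts a list with one flag per column. The sketch below sorts departments alphabetically and, within each department, salaries from highest to lowest; the variable name is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
    'Salary': [70000, 80000, 75000, 82000, 68000],
})

# Department ascending, Salary descending within each department
df_sorted_mixed = df.sort_values(
    ['Department', 'Salary'], ascending=[True, False]
)

print(df_sorted_mixed)
```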
Sorting and Resetting Index¶
After sorting, each row keeps its original index label, so the index is no longer in sequential order. To renumber the index from 0 and discard the old labels:
df_sorted_salary.reset_index(drop=True, inplace=True)
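The effect on the index can be seen in the sketch below, which uses the non-mutating form of reset_index() instead of inplace=True. It also shows the ignore_index parameter of sort_values(), which renumbers the index in the same call. Variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
    'Salary': [70000, 80000, 75000, 82000, 68000],
})

# Sorted rows carry their original index labels
sorted_keep = df.sort_values('Salary')

# reset_index(drop=True) returns a copy with a fresh 0..n-1 index
sorted_reset = sorted_keep.reset_index(drop=True)

# Sorting and renumbering in one step
sorted_direct = df.sort_values('Salary', ignore_index=True)

print(sorted_keep.index.tolist())
print(sorted_reset.index.tolist())
```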