Basics of Pandas in Python: Data Indexing, Selection, and Iteration¶
Data Indexing and Selection¶
Data indexing and selection are fundamental operations in data analysis. They allow you to access and manipulate subsets of your data.
Creating a DataFrame¶
Let's start by creating a DataFrame. Suppose we have data about students and their scores in different subjects.
import pandas as pd
# Creating a DataFrame from a dictionary
student_data = {
'Student': ['Alice', 'Bob', 'Charlie', 'David'],
'Math': [85, 92, 78, 88],
'Science': [91, 89, 84, 95],
'English': [79, 85, 87, 90]
}
df = pd.DataFrame(student_data, index=['a', 'b', 'c', 'd'])
Selecting Columns¶
You can select a single column using bracket notation or dot notation.
# Using bracket notation
math_scores = df['Math']
# Using dot notation
science_scores = df.Science
To select multiple columns, pass a list of column names.
# Selecting multiple columns
selected_columns = df[['Math', 'English']]
Selecting Rows¶
You can select rows using slicing.
# Selecting the first two rows
first_two_rows = df[0:2]
Note that when slicing rows, the end index is exclusive.
Using .loc
for Label-Based Selection¶
The .loc
indexer allows you to select data by label.
# Selecting a single row by index label
row_a = df.loc['a']
# Selecting multiple rows by index labels
rows_b_to_d = df.loc['b':'d']
When using .loc
, the end index is inclusive.
Using .iloc
for Position-Based Selection¶
The .iloc
indexer allows you to select data by integer position.
# Selecting a single row by integer position
second_row = df.iloc[1]
# Selecting multiple rows by integer positions
rows_1_and_2 = df.iloc[1:3]
With .iloc
, the end index is exclusive.
Selecting Specific Rows and Columns¶
You can select specific rows and columns using .loc
and .iloc
.
# Selecting specific rows and columns using .loc
math_scores_b_to_c = df.loc['b':'c', 'Math']
# Selecting specific rows and columns using .iloc
science_scores_rows_0_and_2 = df.iloc[[0, 2], [2]]
# Selecting students with Math score greater than 80
high_math_scores = df[df['Math'] > 80]
# Selecting students who scored above 85 in Science
top_science_students = df[df['Science'] > 85]
Using Multiple Conditions¶
You can combine conditions using &
(and) and |
(or).
# Students who scored above 80 in both Math and Science
high_scores = df[(df['Math'] > 80) & (df['Science'] > 80)]
# Students who scored above 90 in Math or Science
excellent_scores = df[(df['Math'] > 90) | (df['Science'] > 90)]
Using the isin()
Method¶
The isin()
method checks if each element is in a given list.
# Selecting students named 'Alice' or 'David'
selected_students = df[df['Student'].isin(['Alice', 'David'])]
Using the where()
Method¶
The where()
method replaces values that don't meet the condition with a specified value, typically NaN
.
# Keeping only scores above 85, others are replaced with NaN
high_scores_only = df.where(df[['Math', 'Science', 'English']] > 85)
You can also specify a value to replace the non-matching entries.
# Replacing scores below 85 with 0
adjusted_scores = df[['Math', 'Science', 'English']].where(df[['Math', 'Science', 'English']] > 85, 0)
# Iterating over columns
for column_name, column_data in df.items():
print(f"Column Name: {column_name}")
print(column_data)
Iterating Over Rows with iterrows()
¶
The iterrows()
method allows you to iterate over rows as (index, Series) pairs.
# Iterating over rows
for index, row in df.iterrows():
print(f"Index: {index}")
print(row)
Iterating Over Rows with itertuples()
¶
The itertuples()
method lets you iterate over rows as namedtuples.
# Iterating over rows as namedtuples
for row in df.itertuples():
print(row)
# Adding a Total column
df['Total'] = df['Math'] + df['Science'] + df['English']
Example 2: Selecting Top Performers¶
We can select students who have a total score above a certain threshold.
# Students with total score above 260
top_performers = df[df['Total'] > 260]
Example 3: Updating Data Based on Condition¶
Suppose we want to assign grades based on total scores.
# Defining a function to assign grades
def assign_grade(total):
if total >= 270:
return 'A'
elif total >= 240:
return 'B'
else:
return 'C'
# Applying the function to create a Grade column
df['Grade'] = df['Total'].apply(assign_grade)
Example 4: Using apply()
with Rows¶
You can use the apply()
function along rows to perform complex operations.
# Calculating the average score for each student
df['Average'] = df[['Math', 'Science', 'English']].apply(lambda x: x.mean(), axis=1)
Example 5: Filtering with Multiple Conditions¶
Select students who have high scores in Math and English but average in Science.
# Students with Math and English scores above 85 and Science below 90
specific_students = df[(df['Math'] > 85) & (df['English'] > 85) & (df['Science'] < 90)]
# Students who are not 'Bob'
non_bob_students = df[~(df['Student'] == 'Bob')]
Selecting with Regular Expressions¶
You can use the str.contains()
method for string matching.
# Selecting students whose names start with 'A'
students_starting_with_a = df[df['Student'].str.contains('^A')]
Setting Values Based on Condition¶
You can modify DataFrame values based on a condition.
# Setting English score to 95 for students with Grade 'A'
df.loc[df['Grade'] == 'A', 'English'] = 95