Introduction to Pandas: Creating Series and DataFrames¶
Pandas is a powerful Python library used for data manipulation and analysis. It provides two primary data structures: Series and DataFrame. Understanding how to create and manipulate these structures is essential for data analysis tasks.
Importing Pandas and NumPy¶
Before we begin, we need to import the Pandas and NumPy libraries. NumPy is often used alongside Pandas for numerical operations.
import pandas as pd
import numpy as np
Pandas Series¶
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). Each element in a Series has an associated label, known as its index.
Creating a Series from a List¶
You can create a Series from a Python list by passing the list to the pd.Series() constructor.
# Create a list of temperatures in Celsius
temperatures = [22, 28, 19, 24, 30]
# Create a Series from the list
temp_series = pd.Series(temperatures)
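When no index is supplied, pandas assigns default integer labels starting at 0. The short check below reuses the temperatures list from above; the name `default_series` is introduced here only for illustration.

```python
import pandas as pd

temperatures = [22, 28, 19, 24, 30]
default_series = pd.Series(temperatures)

# With no index argument, the labels are the integers 0 .. len-1
print(default_series.index.tolist())  # [0, 1, 2, 3, 4]
print(default_series[0])              # 22
```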
If you want to specify custom indices, you can provide them using the index parameter.
# Days corresponding to the temperatures
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
# Create a Series with custom indices
temp_series = pd.Series(temperatures, index=days)
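With custom labels in place, values can be looked up by label. The sketch below also shows that the index must have the same length as the data; the names `wednesday_temp` and `mismatch_error` are illustrative only.

```python
import pandas as pd

temperatures = [22, 28, 19, 24, 30]
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
temp_series = pd.Series(temperatures, index=days)

# Look up a value by its label
wednesday_temp = temp_series['Wednesday']
print(wednesday_temp)  # 19

# An index whose length differs from the data raises a ValueError
mismatch_error = None
try:
    pd.Series(temperatures, index=['Monday', 'Tuesday'])
except ValueError as err:
    mismatch_error = err
print(type(mismatch_error).__name__)  # ValueError
```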
Creating a Series from a Dictionary¶
A Series can also be created from a dictionary. The keys become the indices, and the values become the data.
# Dictionary of fruit counts
fruit_counts = {'Apples': 10, 'Bananas': 15, 'Cherries': 7}
# Create a Series from the dictionary
fruit_series = pd.Series(fruit_counts)
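If an index is passed together with a dictionary, pandas selects and orders the entries by those labels, and any label missing from the dictionary gets NaN. A minimal sketch, where the 'Dates' label is a hypothetical key that is deliberately absent:

```python
import pandas as pd

fruit_counts = {'Apples': 10, 'Bananas': 15, 'Cherries': 7}

# 'Dates' is not a key in fruit_counts, so its value becomes NaN
selected = pd.Series(fruit_counts, index=['Cherries', 'Apples', 'Dates'])
print(selected)
```

Because NaN is a floating-point value, the resulting Series is stored as floats rather than integers.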
Creating a Series from a NumPy Array¶
You can create a Series from a NumPy array in a similar way.
# NumPy array of random numbers
random_numbers = np.random.randn(5)
# Create a Series from the NumPy array
random_series = pd.Series(random_numbers)
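Note that np.random.randn() draws new values on every run. If a repeatable example is wanted, one option (a substitute for the call above, not part of the original walkthrough) is NumPy's seeded Generator; the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

# A seeded generator yields the same five values on every run
rng = np.random.default_rng(seed=42)
seeded_numbers = rng.standard_normal(5)

seeded_series = pd.Series(seeded_numbers)
print(seeded_series.dtype)  # float64
```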
Pandas DataFrame¶
A DataFrame is a two-dimensional labeled data structure with columns that can be of different data types. It is similar to a spreadsheet or SQL table.
Creating a DataFrame from a Dictionary¶
You can create a DataFrame by passing a dictionary where the keys are column names and the values are lists of column data.
# Dictionary containing data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Country': ['USA', 'UK', 'Canada', 'Australia']
}
# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
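A quick way to confirm the result is to inspect its shape, column names, and per-column data types. The sketch below rebuilds the same dictionary so that it runs on its own.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Country': ['USA', 'UK', 'Canada', 'Australia']
}
df = pd.DataFrame(data)

print(df.shape)             # (4, 3): four rows, three columns
print(df.columns.tolist())  # ['Name', 'Age', 'Country']
print(df.dtypes)            # one data type per column
```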
Creating a DataFrame from a List of Dictionaries¶
Sometimes, data is more naturally represented as a list of dictionaries, where each dictionary represents a row.
# List of dictionaries
data = [
    {'Name': 'Emma', 'Age': 29, 'Country': 'USA'},
    {'Name': 'Liam', 'Age': 32, 'Country': 'UK'},
    {'Name': 'Olivia', 'Age': 27, 'Country': 'Canada'}
]
# Create a DataFrame from the list of dictionaries
df = pd.DataFrame(data)
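One property of this form is that the dictionaries need not share the same keys: a key missing from a row becomes NaN in that row. In the sketch below, the second record omits 'Country' on purpose; the names `records` and `people_df` are illustrative.

```python
import pandas as pd

records = [
    {'Name': 'Emma', 'Age': 29, 'Country': 'USA'},
    {'Name': 'Liam', 'Age': 32},  # hypothetical gap: no 'Country' key
]
people_df = pd.DataFrame(records)

# The missing key shows up as NaN in the second row
print(people_df['Country'].isna().tolist())  # [False, True]
```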
Creating a DataFrame from a 2D NumPy Array¶
You can also create a DataFrame from a 2D NumPy array by specifying the column names.
# 2D NumPy array of data
array_data = np.array([[5.1, 3.5, 1.4, 0.2],
                       [4.9, 3.0, 1.4, 0.2],
                       [6.2, 3.4, 5.4, 2.3]])
# Column names
columns = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']
# Create a DataFrame from the NumPy array
df = pd.DataFrame(array_data, columns=columns)
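Row labels can be supplied in the same call through the index parameter. The labels 'flower_1' to 'flower_3' below are hypothetical, chosen only to show label-based lookup on the result.

```python
import numpy as np
import pandas as pd

array_data = np.array([[5.1, 3.5, 1.4, 0.2],
                       [4.9, 3.0, 1.4, 0.2],
                       [6.2, 3.4, 5.4, 2.3]])
columns = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']
row_labels = ['flower_1', 'flower_2', 'flower_3']  # hypothetical labels

labeled_df = pd.DataFrame(array_data, columns=columns, index=row_labels)
print(labeled_df.loc['flower_2', 'SepalWidth'])  # 3.0
```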
Creating a DataFrame from Multiple Series¶
You can create a DataFrame by combining multiple Series objects.
# Series of stock prices
apple_stock = pd.Series([150, 153, 155], index=['2021-09-01', '2021-09-02', '2021-09-03'])
google_stock = pd.Series([2800, 2825, 2850], index=['2021-09-01', '2021-09-02', '2021-09-03'])
# Combine Series into a DataFrame
stocks_df = pd.DataFrame({'Apple': apple_stock, 'Google': google_stock})
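When Series are combined this way, pandas aligns them by index label rather than by position. The sketch below uses a hypothetical shortened Google series that lacks the first date, so that row receives NaN in the 'Google' column.

```python
import pandas as pd

apple_stock = pd.Series([150, 153, 155],
                        index=['2021-09-01', '2021-09-02', '2021-09-03'])
# Hypothetical variation: no entry for 2021-09-01
google_partial = pd.Series([2825, 2850],
                           index=['2021-09-02', '2021-09-03'])

aligned_df = pd.DataFrame({'Apple': apple_stock, 'Google': google_partial})
print(aligned_df)
```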
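Adding a Column¶
To add a column to a DataFrame, assign a list of values, one per row, to a new column name.
# Add a new column 'MarketCap' to the stocks DataFrame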
stocks_df['MarketCap'] = [2.5e12, 2.52e12, 2.55e12]
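A new column can also be computed from existing columns. The 'Spread' column below is a hypothetical example, not part of the original data; it is simply the element-wise difference between the two price columns.

```python
import pandas as pd

dates = ['2021-09-01', '2021-09-02', '2021-09-03']
stocks_df = pd.DataFrame({'Apple': [150, 153, 155],
                          'Google': [2800, 2825, 2850]}, index=dates)

# Arithmetic between columns is applied row by row
stocks_df['Spread'] = stocks_df['Google'] - stocks_df['Apple']
print(stocks_df['Spread'].tolist())  # [2650, 2672, 2695]
```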
Removing a Column¶
To remove a column from a DataFrame, you can use the drop() method with axis=1.
# Remove the 'MarketCap' column
stocks_df = stocks_df.drop('MarketCap', axis=1)
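By default, drop() returns a new DataFrame and leaves the original untouched, which is why the result is assigned back above. The sketch below rebuilds the data so that it runs on its own, and also shows the equivalent columns= keyword; the name `without_cap` is illustrative.

```python
import pandas as pd

dates = ['2021-09-01', '2021-09-02', '2021-09-03']
stocks_df = pd.DataFrame({'Apple': [150, 153, 155],
                          'Google': [2800, 2825, 2850],
                          'MarketCap': [2.5e12, 2.52e12, 2.55e12]}, index=dates)

# columns='MarketCap' is equivalent to drop('MarketCap', axis=1)
without_cap = stocks_df.drop(columns='MarketCap')

print('MarketCap' in stocks_df.columns)    # True: original is unchanged
print('MarketCap' in without_cap.columns)  # False
```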
Accessing Columns and Rows¶
You can access columns using the column name.
# Access the 'Apple' stock prices
apple_prices = stocks_df['Apple']
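Selecting with a single column name returns a Series, while selecting with a list of names returns a DataFrame. A small self-contained check, with `subset` as an illustrative name:

```python
import pandas as pd

dates = ['2021-09-01', '2021-09-02', '2021-09-03']
stocks_df = pd.DataFrame({'Apple': [150, 153, 155],
                          'Google': [2800, 2825, 2850]}, index=dates)

apple_prices = stocks_df['Apple']        # single name: a Series
subset = stocks_df[['Apple', 'Google']]  # list of names: a DataFrame

print(type(apple_prices).__name__)  # Series
print(type(subset).__name__)        # DataFrame
```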
To access rows, you can use the loc and iloc indexers.
- loc is label-based; when slicing, the end label is included.
- iloc is integer position-based; when slicing, the end position is excluded.
# Access rows by label
row = stocks_df.loc['2021-09-02']
# Access rows by integer position
row = stocks_df.iloc[1]
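The difference in end-point handling shows up when slicing. In the sketch below, which rebuilds the stocks data so that it runs on its own, the label slice keeps both endpoints while the position slice stops before the end position.

```python
import pandas as pd

dates = ['2021-09-01', '2021-09-02', '2021-09-03']
stocks_df = pd.DataFrame({'Apple': [150, 153, 155],
                          'Google': [2800, 2825, 2850]}, index=dates)

# Label slice: both '2021-09-01' and '2021-09-02' are returned
by_label = stocks_df.loc['2021-09-01':'2021-09-02']

# Position slice: only position 0 is returned, position 1 is excluded
by_position = stocks_df.iloc[0:1]

print(len(by_label))     # 2
print(len(by_position))  # 1
```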
Summary¶
- Series: One-dimensional labeled array. Created from lists, dictionaries, or NumPy arrays.
- DataFrame: Two-dimensional labeled data structure. Created from dictionaries, lists of dictionaries, NumPy arrays, or multiple Series.
- Manipulation: Add or remove columns, access data using labels or positions.
Understanding how to create and manipulate Series and DataFrames is fundamental when working with data in Pandas. These structures allow for efficient data analysis and manipulation, providing a foundation for more advanced operations.