Welcome to Part 14 of our Data Science Blog series! In this post, we explore Pandas, a popular Python library for data manipulation and analysis. Its two core data structures, Series and DataFrame, make it straightforward to load, inspect, filter, and summarize structured data from sources such as CSV files, Excel sheets, and SQL databases.
Let's dive into some essential aspects of the Pandas library with code examples:
Before we begin, ensure that you have Pandas installed. If not, you can install it using pip:
pip install pandas
To use Pandas in your Python code, you need to import it:
import pandas as pd
Pandas provides various methods to read data from different file formats. For this example, we will read data from a CSV file:
# Assuming you have a file named "data.csv" in the current directory
df = pd.read_csv("data.csv")
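If you don't have a data.csv file at hand, the following minimal sketch reads the same kind of data from an in-memory string instead. The column names mirror the ones used later in this post; the rows themselves are made up purely for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data; the values are invented for illustration.
csv_text = """Name,Age,Gender,Income
Alice,34,Female,52000
Bob,17,Male,
Carla,70,Female,31000
"""

# read_csv accepts any file-like object, not only a file path.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # Income is read as float because one value is missing
```

Swapping `io.StringIO(csv_text)` for a path such as "data.csv" should give the same kind of DataFrame, provided the file has a header row.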
Let's start by examining the basic structure of the DataFrame and some summary statistics:
# Display the first few rows of the DataFrame
print(df.head())
# Print column names, dtypes, and non-null counts (info() prints directly and returns None)
df.info()
# Get summary statistics of the numerical columns
print(df.describe())
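By default, describe() summarizes only the numerical columns. As a rough sketch on a small made-up DataFrame, passing include="all" adds text columns to the summary, and value_counts() gives a quick frequency table for a single column.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carla"],
    "Age": [20, 30, 40],
    "Gender": ["Female", "Male", "Female"],
})

# Dimensions and per-column data types
print(df.shape)
print(df.dtypes)

# Summary that also covers non-numeric columns (count, unique, top, freq)
summary = df.describe(include="all")
print(summary)

# Frequency of each distinct value in one column
gender_counts = df["Gender"].value_counts()
print(gender_counts)
```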
Pandas allows you to select specific rows and columns from the DataFrame:
# Select a single column
column_name = "Age"
age_column = df[column_name]
# Select multiple columns
selected_columns = df[["Name", "Age", "Gender"]]
# Select rows based on condition
young_people = df[df["Age"] < 30]
# Select rows based on multiple conditions
female_seniors = df[(df["Gender"] == "Female") & (df["Age"] > 65)]
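The same selections can be written in a few other ways. The sketch below, on a small made-up DataFrame, uses .loc to filter rows and pick columns in one step, query() to express a condition as a string, and isin() to match against a list of values. Note that when combining conditions with & or |, each condition needs its own parentheses because of Python's operator precedence.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carla", "Dev"],
    "Age": [34, 17, 70, 25],
    "Gender": ["Female", "Male", "Female", "Male"],
})

# .loc filters rows and selects columns in a single step
young_names = df.loc[df["Age"] < 30, ["Name", "Age"]]

# query() expresses the same kind of condition as a string
female_seniors = df.query("Gender == 'Female' and Age > 65")

# isin() keeps rows whose value appears in a list
selected = df[df["Name"].isin(["Alice", "Dev"])]

print(young_names)
print(female_seniors)
print(selected)
```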
Pandas makes it easy to modify data in the DataFrame:
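# Add a new column by binning ages into labeled groups (bins are right-inclusive by default)
df["AgeGroup"] = pd.cut(df["Age"], bins=[0, 18, 30, 50, 100], labels=["Child", "Young", "Adult", "Senior"])
# pd.cut returns a categorical column, so register the new label before assigning it
df["AgeGroup"] = df["AgeGroup"].cat.add_categories("Minor")
# Update values in a column based on a condition
df.loc[df["Age"] < 18, "AgeGroup"] = "Minor"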
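It is worth checking where values on a bin edge end up. The sketch below, using a handful of made-up ages, compares the default right-inclusive bins with right=False. With the defaults, an age of 18 falls into the first bin and an age of 0 falls outside all bins.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or near the bin edges.
ages = pd.Series([0, 17, 18, 19, 30, 31, 100])
bins = [0, 18, 30, 50, 100]
labels = ["Child", "Young", "Adult", "Senior"]

# Default: intervals are (0, 18], (18, 30], (30, 50], (50, 100]
default = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals are [0, 18), [18, 30), [30, 50), [50, 100)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

comparison = pd.DataFrame({"Age": ages, "default": default, "right=False": left_closed})
print(comparison)
```

Values that fall outside every interval become missing (NaN), so it can help to check for missing values right after binning.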
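Pandas can also group rows and compute aggregate statistics for each group:
# Group data by a column and calculate the mean
mean_age_by_gender = df.groupby("Gender")["Age"].mean()
# Group data by multiple columns and calculate multiple statistics
# (observed=True keeps only the combinations that actually occur, since AgeGroup is categorical)
grouped_data = df.groupby(["Gender", "AgeGroup"], observed=True).agg({"Age": "mean", "Income": "sum"})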
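When computing several statistics at once, named aggregation lets you choose the output column names yourself. This is a minimal sketch on made-up data; the output names mean_age, total_income, and n_people are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Age": [30, 20, 50, 40],
    "Income": [50000, 20000, 70000, 40000],
})

# Each keyword is an output column: name=(input column, aggregation)
summary = (
    df.groupby("Gender")
      .agg(
          mean_age=("Age", "mean"),
          total_income=("Income", "sum"),
          n_people=("Age", "count"),
      )
      .reset_index()  # turn the group key back into a regular column
)

print(summary)
```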
Pandas provides functions to handle missing data:
# Check for missing values
print(df.isnull().sum())
# Drop rows with any missing values
df_cleaned = df.dropna()
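# Fill missing values in a specific column
# (a bare df.fillna(0) would also put 0 into text columns and can fail on categorical ones)
df_filled = df.fillna({"Income": 0})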
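A single fill value rarely suits every column. As a rough sketch on made-up data, the example below fills each column with its own value and drops rows only when a required column is missing. Which columns count as required, and whether the median is a sensible fill, are assumptions that depend on your data.

```python
import pandas as pd

nan = float("nan")

# Hypothetical data with gaps, for illustration only.
df = pd.DataFrame({
    "Age": [30.0, nan, 50.0],
    "Gender": ["Female", None, "Male"],
    "Income": [50000.0, 20000.0, nan],
})

# Fill each column with its own value: the median age, and a placeholder label
df_filled = df.fillna({"Age": df["Age"].median(), "Gender": "Unknown"})

# Drop rows only when a specific column is missing
df_required = df.dropna(subset=["Income"])

print(df_filled)
print(df_required)
```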
Pandas also integrates with Matplotlib for quick data visualization:
import matplotlib.pyplot as plt
# Create a bar plot of AgeGroup counts
df["AgeGroup"].value_counts().plot(kind="bar")
plt.xlabel("Age Group")
plt.ylabel("Count")
plt.title("Age Group Distribution")
plt.show()
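For more control over a figure, the same plot can be drawn on an explicit Matplotlib Axes object. This is a minimal sketch on made-up data; the figure size is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"AgeGroup": ["Young", "Adult", "Young", "Senior", "Adult", "Young"]})

# Count how many rows fall into each group (sorted from most to least frequent)
counts = df["AgeGroup"].value_counts()

# Create the figure and axes explicitly, then tell pandas to draw on that axes
fig, ax = plt.subplots(figsize=(6, 4))
counts.plot(kind="bar", ax=ax)
ax.set_xlabel("Age Group")
ax.set_ylabel("Count")
ax.set_title("Age Group Distribution")
fig.tight_layout()
```

From here, calling plt.show() displays the figure, and fig.savefig() with a filename of your choice writes it to disk.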