Hello everyone! Today we'll walk through Exploratory Data Analysis (EDA) in Python and then fit a machine learning model to a real-world dataset. Think of EDA as detective work: your job is to uncover as much information as possible from your data. This is a crucial step in any machine learning project because a significant portion of your time will be spent on analyzing and preparing your data.
What is EDA?
EDA stands for Exploratory Data Analysis. It's the process of examining your data to understand its structure, spot patterns, identify anomalies, and test hypotheses using summary statistics and graphical representations. By doing this, you can make informed decisions about how to proceed with your machine learning model.
Step-by-Step Guide to EDA and Machine Learning
Step 1: Setting Up Your Environment
First, we need to import the necessary libraries. We'll be using:
Pandas for data manipulation.
NumPy for numerical operations.
Matplotlib and Seaborn for data visualization.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter/IPython magic that renders plots inline; omit it in a plain .py script
%matplotlib inline
Step 2: Loading the Data
For this example, we'll use the famous Titanic dataset. This dataset contains information about the passengers on the Titanic and whether they survived or not.
# Load the Titanic dataset
df = pd.read_csv('titanic_train.csv')
print(df.head())
Step 3: Understanding the Data
Let's take a quick look at the data to understand its structure.
# info() prints its summary directly and returns None, so no print() is needed
df.info()
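Beyond the column types and non-null counts, summary statistics are a quick way to understand the structure of the data. Here is a minimal sketch, assuming a small made-up DataFrame named `toy` that stands in for the Titanic data (the values are invented for illustration, not taken from the real file):

```python
import pandas as pd

# Toy stand-in for the Titanic data (made-up values, illustration only)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass": [3, 1, 3, 2, 1],
    "Age": [22.0, 38.0, None, 35.0, 54.0],
    "Sex": ["male", "female", "female", "male", "female"],
})

# Numeric summary: count, mean, std, min, quartiles, and max per numeric column
numeric_summary = toy.describe()
print(numeric_summary)

# Frequency of each category in a non-numeric column
sex_counts = toy["Sex"].value_counts()
print(sex_counts)
```

Note that the `count` row of `describe()` only counts non-missing values, so it also gives a first hint about missing data.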
Step 4: Identifying Missing Values
Missing values can be problematic, so we first need to find out where they are. We'll fill them in Step 7.
# Check for missing values
print(df.isnull().sum())
# Visualize missing values
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.show()
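Raw counts are easier to compare when expressed as a percentage of rows. The following is a small sketch on a made-up DataFrame (`toy` and its values are assumptions for illustration only); the same expression should work on `df`:

```python
import numpy as np
import pandas as pd

# Toy data with deliberately missing entries (made-up values)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() yields booleans; the mean of booleans is the fraction missing
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Columns near the top of this list are candidates for dropping, while columns with only a few gaps are usually better filled.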
Step 5: Data Visualization
Visualizing the data helps us understand it better. Let's start with some basic visualizations.
Survival Rate
# Count plot for survival
sns.countplot(x='Survived', data=df)
plt.title('Survival Count')
plt.show()
Survival Rate by Gender
# Count plot for survival by gender
sns.countplot(x='Survived', hue='Sex', data=df, palette='viridis')
plt.title('Survival Count by Gender')
plt.show()
Survival Rate by Passenger Class
# Count plot for survival by passenger class
sns.countplot(x='Survived', hue='Pclass', data=df, palette='viridis')
plt.title('Survival Count by Passenger Class')
plt.show()
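The count plots above show raw counts. To get actual survival rates, one option is to group by a column and take the mean of `Survived`, since the mean of a 0/1 column is the fraction of ones. A minimal sketch on made-up data (the `toy` frame is an assumption, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the Titanic data (made-up values, illustration only)
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 1, 0, 0, 0, 1],
    "Sex": ["female", "male", "female", "female", "male", "male", "male", "male"],
    "Pclass": [1, 3, 2, 3, 3, 1, 2, 1],
})

# Mean of a 0/1 column within each group = survival rate of that group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```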
Step 6: Handling Categorical Features
Categorical features need to be converted into a numerical format. We'll use one-hot encoding for this.
# One-hot encode the 'Sex' and 'Embarked' columns
sex = pd.get_dummies(df['Sex'], drop_first=True)
embark = pd.get_dummies(df['Embarked'], drop_first=True)
# Drop the original categorical columns, plus 'Name', 'Ticket', and 'Cabin',
# which this guide does not use as features, then concatenate the new columns
df.drop(['Sex', 'Embarked', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
df = pd.concat([df, sex, embark], axis=1)
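To see what `drop_first=True` does, here is a small sketch on a made-up three-row frame (the `toy` data is an assumption for illustration). The first category in sorted order is dropped, because it is implied when all remaining indicator columns are zero:

```python
import pandas as pd

# Toy categorical data (made-up values, illustration only)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Categories are sorted, then the first one is dropped:
# 'female' is dropped from Sex, 'C' is dropped from Embarked
sex = pd.get_dummies(toy["Sex"], drop_first=True)
embark = pd.get_dummies(toy["Embarked"], drop_first=True)

encoded = pd.concat([sex, embark], axis=1)
print(encoded)
```

Dropping one column per feature avoids perfectly redundant columns, which matters for linear models such as logistic regression.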
Step 7: Filling Missing Values
We'll fill the missing values in the 'Age' column using the median of the respective passenger class.
# Fill missing values in 'Age' based on 'Pclass'
df['Age'] = df.groupby('Pclass')['Age'].transform(lambda x: x.fillna(x.median()))
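To check that the group-wise fill behaves as expected, it can help to run the same expression on a tiny made-up frame where the medians are easy to compute by hand (the `toy` data below is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: one missing age in each passenger class (made-up values)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, 30.0, np.nan],
})

# Within each class, replace missing ages with that class's median age
toy["Age"] = toy.groupby("Pclass")["Age"].transform(
    lambda x: x.fillna(x.median())
)
print(toy)
```

The missing age in class 1 becomes 45.0 (the median of 40 and 50), and the one in class 3 becomes 25.0 (the median of 20 and 30).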
Step 8: Splitting the Data
Next, we need to split the data into training and testing sets.
from sklearn.model_selection import train_test_split
X = df.drop('Survived', axis=1)
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
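A random split can leave the two sets with slightly different proportions of survivors. One optional variation, not used in the code above, is the `stratify` parameter of `train_test_split`, which keeps the class proportions the same in both sets. A minimal sketch on made-up data (the `toy` frame and the 25% test size are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 12 non-survivors and 8 survivors (made-up values)
toy = pd.DataFrame({
    "Fare": [float(i) for i in range(20)],
    "Survived": [0] * 12 + [1] * 8,
})

X = toy.drop("Survived", axis=1)
y = toy["Survived"]

# stratify=y preserves the 60/40 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```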
Step 9: Applying Machine Learning
We'll use a logistic regression model for this example.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
# Initialize and train the model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)
print(f'Accuracy: {accuracy * 100:.2f}%')
print('Confusion Matrix:\n', conf_matrix)
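Accuracy alone can hide how the errors are distributed, so it is worth reading the confusion matrix cell by cell. The sketch below uses made-up labels (`y_true` and `y_pred` are assumptions, not output from the model above) to show how the four cells relate to precision and recall:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true labels and predictions, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels, rows are true classes and columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision: of the passengers predicted to survive, how many actually did
precision = tp / (tp + fp)
# Recall: of the passengers who actually survived, how many were found
recall = tp / (tp + fn)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")

# The hand-computed values agree with scikit-learn's helpers
assert abs(precision - precision_score(y_true, y_pred)) < 1e-9
assert abs(recall - recall_score(y_true, y_pred)) < 1e-9
```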
Conclusion
Exploratory Data Analysis is a critical step in any data science project. By thoroughly examining your data and cleaning it, you can set a solid foundation for your machine learning models. In this guide, we covered how to perform EDA using Python and how to apply a logistic regression model to make predictions.
I hope you found this guide helpful and enjoyable. Keep practicing, stay curious, and happy learning!