Hello everyone! Today, we are going to embark on an exciting journey into the world of Exploratory Data Analysis (EDA) using Python. Along the way, we'll also apply a machine learning model to a real-world dataset. Think of EDA as detective work: your job is to uncover as much information as possible from your data. This is a crucial step in any machine learning project, because a significant portion of your time will be spent analyzing and preparing your data.
What is EDA?
EDA stands for Exploratory Data Analysis. It's the process of examining your data to understand its structure, spot patterns, identify anomalies, and test hypotheses using summary statistics and graphical representations. By doing this, you can make informed decisions about how to proceed with your machine learning model.
Step-by-Step Guide to EDA and Machine Learning
Step 1: Setting Up Your Environment
First, we need to import the necessary libraries. We'll be using:
Pandas for data manipulation.
NumPy for numerical operations.
Matplotlib and Seaborn for data visualization.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Step 2: Loading the Data
For this example, we'll use the famous Titanic dataset. This dataset contains information about the passengers on the Titanic and whether they survived or not.
# Load the Titanic dataset
df = pd.read_csv('titanic_train.csv')
print(df.head())
Step 3: Understanding the Data
Let's take a quick look at the data to understand its structure.
print(df.info())
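Alongside info(), describe() gives summary statistics for the numeric columns, which is a quick way to spot ranges, skew, and suspicious values (a small optional check):
# Summary statistics for the numeric columns
print(df.describe())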
Step 4: Handling Missing Values
Missing values can be problematic, so we need to identify and handle them.
# Check for missing values
print(df.isnull().sum())
# Visualize missing values
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.show()
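To gauge how severe the gaps are, it also helps to look at the share of missing values per column. In this dataset, 'Cabin' is mostly empty, which is why we drop it later rather than try to fill it:
# Percentage of missing values per column, largest first
print((df.isnull().mean() * 100).round(1).sort_values(ascending=False))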
Step 5: Data Visualization
Visualizing the data helps us understand it better. Let's start with some basic visualizations.
Survival Rate
# Count plot for survival
sns.countplot(x='Survived', data=df)
plt.title('Survival Count')
plt.show()
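The count plot shows raw counts; if you want the overall survival rate as a number, a quick optional check is to normalize the counts:
# Share of passengers who survived vs. did not
print(df['Survived'].value_counts(normalize=True))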
Survival Rate by Gender
# Count plot for survival by gender
sns.countplot(x='Survived', hue='Sex', data=df, palette='viridis')
plt.title('Survival Count by Gender')
plt.show()
Survival Rate by Passenger Class
# Count plot for survival by passenger class
sns.countplot(x='Survived', hue='Pclass', data=df, palette='viridis')
plt.title('Survival Count by Passenger Class')
plt.show()
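The plots suggest that gender and class both matter a lot. As an optional sanity check, a quick groupby gives the survival rate for each combination in numbers:
# Survival rate by gender and passenger class
print(df.groupby(['Sex', 'Pclass'])['Survived'].mean().round(2))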
Step 6: Handling Categorical Features
Categorical features need to be converted into a numerical format; we'll use one-hot encoding for this. While we're at it, we'll drop columns that shouldn't be used as features: 'PassengerId', 'Name', and 'Ticket' are essentially identifiers, and 'Cabin' is mostly missing.
# One-hot encode the 'Sex' and 'Embarked' columns
sex = pd.get_dummies(df['Sex'], drop_first=True)
embark = pd.get_dummies(df['Embarked'], drop_first=True)
# Drop the identifier and original categorical columns, then concatenate the encoded ones
df.drop(['PassengerId', 'Sex', 'Embarked', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
df = pd.concat([df, sex, embark], axis=1)
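A quick look at the resulting columns confirms that everything left is numeric (or boolean, depending on your pandas version), which is what the model expects:
# Confirm the remaining columns and their dtypes
print(df.dtypes)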
Step 7: Filling Missing Values
We'll fill the missing values in the 'Age' column using the median of the respective passenger class.
# Fill missing values in 'Age' based on 'Pclass'
df['Age'] = df.groupby('Pclass')['Age'].transform(lambda x: x.fillna(x.median()))
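Before modeling, it's worth verifying that no missing values remain:
# Verify that all missing values have been handled
print(df.isnull().sum())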
Step 8: Splitting the Data
Next, we need to split the data into training and testing sets.
from sklearn.model_selection import train_test_split
X = df.drop('Survived', axis=1)
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
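A quick shape check confirms the 70/30 split:
# Roughly 70% of rows for training, 30% for testing
print(X_train.shape, X_test.shape)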
Step 9: Applying Machine Learning
We'll use a logistic regression model for this example.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
# Initialize and train the model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)
print(f'Accuracy: {accuracy * 100:.2f}%')
print('Confusion Matrix:\n', conf_matrix)
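Accuracy alone can hide class-level differences. If you'd like a fuller picture, scikit-learn's classification_report adds precision, recall, and F1 for each class (optional):
from sklearn.metrics import classification_report
# Precision, recall, and F1 for each class
print(classification_report(y_test, predictions))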
Conclusion
Exploratory Data Analysis is a critical step in any data science project. By thoroughly examining your data and cleaning it, you can set a solid foundation for your machine learning models. In this guide, we covered how to perform EDA using Python and how to apply a logistic regression model to make predictions.
I hope you found this guide helpful and enjoyable. Keep practicing, stay curious, and happy learning!