Part 3: Understanding Normal Distribution, Z-Score, and Their Applications in Data Science

In this blog post, we will explore the concepts of Normal Distribution (also known as Gaussian Distribution) and Z-Score. These concepts are fundamental in data analysis, especially in the field of data science and machine learning. We'll also cover how to use Python to apply these concepts in practical scenarios. Let's get started!

What is Normal Distribution?

Normal Distribution, often referred to as Gaussian Distribution, is a probability distribution that is symmetric about the mean. It has a bell-shaped curve where most of the data points are concentrated around the mean, and the probability of data points decreases as you move away from the mean.
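
To make the shape concrete, here is a minimal sketch using Python's standard-library `statistics.NormalDist`. The mean of 5.5 feet and standard deviation of 0.3 feet are made-up values chosen for illustration, not figures from a real dataset.

```python
from statistics import NormalDist

# Hypothetical height distribution (values assumed for illustration only)
heights = NormalDist(mu=5.5, sigma=0.3)

# The density peaks at the mean and falls off as we move away from it
density_at_mean = heights.pdf(5.5)
density_one_sd = heights.pdf(5.8)
density_two_sd = heights.pdf(6.1)

# Symmetry: points equally far below and above the mean have the same density
density_below = heights.pdf(5.2)
density_above = heights.pdf(5.8)

print(density_at_mean > density_one_sd > density_two_sd)
print(abs(density_below - density_above) < 1e-9)
```

Both checks should print `True`, matching the two properties described above: the curve is highest at the mean and symmetric around it.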

Example: Analyzing Heights

Imagine you are a data scientist working for a clothing store, and you need to analyze people's heights to determine the appropriate sizes for clothing. When you plot these heights on a histogram, you will see a bell-shaped curve if the data follows a normal distribution.

A histogram groups the values into ranges (bins) and shows how many data points fall into each one. You plot the height ranges on the x-axis and the number of samples in each range on the y-axis. For example, if three samples fall between 5 and 5.5 feet, the bar for that range has a height of 3.
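
The counting behind a histogram can be sketched in a few lines of plain Python. The nine heights and the 0.5-foot bin width below are assumptions picked so that three samples land between 5 and 5.5 feet, as in the example above.

```python
from collections import Counter

# Hypothetical heights in feet (made-up sample for illustration)
heights = [4.8, 5.1, 5.3, 5.4, 5.6, 5.7, 5.7, 5.9, 6.2]

bin_width = 0.5

def bin_start(value, width=bin_width):
    """Return the lower edge of the bin that contains value."""
    return (value // width) * width

# Count how many heights fall into each bin
counts = Counter(bin_start(h) for h in heights)

# Print a simple text histogram, one row per bin
for start in sorted(counts):
    print(f"{start:.1f}-{start + bin_width:.1f} ft: {'#' * counts[start]}")
```

Plotting libraries such as seaborn do this binning for you. The sketch only shows what the bar heights represent.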

Most of the heights will cluster around the average (mean) value, with fewer people having heights significantly taller or shorter than the average. This pattern creates a bell-shaped curve, commonly known as the bell curve.

Real-World Examples of Normal Distribution

Many real-world measurements are approximately normally distributed, although real data rarely fits the curve perfectly:

Apartment Prices: Within a comparable group of apartments, most prices sit around the average, with fewer apartments being significantly more or less expensive. Across a whole market, prices are often skewed toward the high end.
Test Scores: In a classroom, most students score around the average, with fewer students scoring extremely high or low.
Employee Performance: Most employees perform at an average level, with a few being top performers and a few being low performers.

How to Use Normal Distribution in Data Analysis

One common use of normal distribution in data analysis is during the data cleaning process, particularly for outlier removal. Outliers are data points that are significantly different from the rest of the data. They can skew the results of your analysis or machine learning models.

Identifying Outliers with Standard Deviation

Standard deviation is a measure of the amount of variation or dispersion in a set of values. In a normal distribution:

About 68.3% of data points fall within ±1 standard deviation from the mean.
About 95.4% fall within ±2 standard deviations.
About 99.7% fall within ±3 standard deviations.

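Only about 0.3% of normally distributed data lies beyond ±3 standard deviations, so points that far from the mean are often treated as outliers. The threshold of 3 is a common convention, not a fixed rule.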
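
These percentages can be checked with the standard library's `statistics.NormalDist`. This is a small verification sketch. The helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within ±{k} standard deviations: {share_within(k):.1%}")
```

Because the result does not depend on the mean or standard deviation, the same shares apply to any normal distribution, including the heights example.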

Python Code for Outlier Removal

Let's use Python to identify and remove outliers from a dataset of heights. We'll use pandas to load and filter the data, and seaborn with matplotlib to plot it.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
df = pd.read_csv('heights.csv')  # Assume this CSV contains a column 'height'

# Calculate mean and standard deviation
mean = df['height'].mean()
std_dev = df['height'].std()

# Define the threshold for outliers (±3 standard deviations)
threshold_low = mean - 3 * std_dev
threshold_high = mean + 3 * std_dev

# Identify outliers
outliers = df[(df['height'] < threshold_low) | (df['height'] > threshold_high)]
print(f"Outliers:\n{outliers}")

# Remove outliers
df_no_outliers = df[(df['height'] >= threshold_low) & (df['height'] <= threshold_high)]
print(f"Data without outliers:\n{df_no_outliers}")

# Plot histogram with KDE
sns.histplot(df_no_outliers['height'], kde=True)
plt.show()
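
Since `heights.csv` is not included with this post, the following sketch applies the same ±3 standard deviation rule to synthetic data using only the standard library. The sample size, the seed, the distribution parameters, and the two injected extreme values are all assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# 200 synthetic heights (feet) plus two deliberately extreme values
heights = [rng.gauss(5.5, 0.3) for _ in range(200)]
heights += [9.0, 2.0]

mean = statistics.mean(heights)
std_dev = statistics.stdev(heights)  # sample standard deviation, like pandas' .std()

# Same thresholds as in the pandas version
threshold_low = mean - 3 * std_dev
threshold_high = mean + 3 * std_dev

outliers = [h for h in heights if h < threshold_low or h > threshold_high]
kept = [h for h in heights if threshold_low <= h <= threshold_high]

print(f"mean={mean:.2f}, std_dev={std_dev:.2f}")
print(f"flagged {len(outliers)} of {len(heights)} values")
```

Note that the extreme values are included when the mean and standard deviation are computed, so they widen the thresholds somewhat. With very small samples, this effect can keep a genuine outlier inside the ±3 range.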

Understanding Z-Score

The Z-Score tells you how many standard deviations a data point lies from the mean. A positive Z-Score means the point is above the mean, a negative Z-Score means it is below, and a Z-Score of 0 means it sits exactly at the mean.

Z-Score Formula

\[ Z = \frac{X - \mu}{\sigma} \]

Where:

\( X \) is the data point.
\( \mu \) is the mean of the data.
\( \sigma \) is the standard deviation of the data.
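
As a quick worked example, the formula can be written as a small function. The mean of 5.5 feet and the standard deviation of 0.25 feet are hypothetical values, and `z_score` is a helper name made up for this sketch.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

# Hypothetical values: mean height 5.5 ft, standard deviation 0.25 ft
print(z_score(6.0, 5.5, 0.25))   # 2.0  (two standard deviations above the mean)
print(z_score(5.25, 5.5, 0.25))  # -1.0 (one standard deviation below the mean)
print(z_score(5.5, 5.5, 0.25))   # 0.0  (exactly at the mean)
```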

Calculating Z-Score in Python

Let's calculate the Z-Score for each data point in our dataset and use it to identify outliers.

# Calculate Z-Score (reuses df, mean, and std_dev from the previous snippet)
df['z_score'] = (df['height'] - mean) / std_dev

# Identify outliers using Z-Score (threshold ±3)
outliers_z = df[(df['z_score'] < -3) | (df['z_score'] > 3)]
print(f"Outliers using Z-Score:\n{outliers_z}")

# Remove outliers using Z-Score
df_no_outliers_z = df[(df['z_score'] >= -3) & (df['z_score'] <= 3)]
print(f"Data without outliers using Z-Score:\n{df_no_outliers_z}")

# Plot histogram with KDE for data without outliers (using Z-Score)
sns.histplot(df_no_outliers_z['height'], kde=True)
plt.show()
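
One detail worth checking when you compare Z-Scores across tools is which standard deviation is used. pandas' `.std()` defaults to the sample standard deviation (dividing by n - 1), while NumPy's `np.std` and `scipy.stats.zscore` default to the population version (dividing by n). The sketch below shows the difference with the standard library on a small made-up sample.

```python
import statistics

# Small made-up sample of heights in feet
heights = [5.0, 5.25, 5.5, 5.75, 6.0]

mean = statistics.mean(heights)
sample_sd = statistics.stdev(heights)       # divides by n - 1
population_sd = statistics.pstdev(heights)  # divides by n

# Z-Score of the same data point under each convention
x = 6.0
z_sample = (x - mean) / sample_sd
z_population = (x - mean) / population_sd

print(f"sample: {z_sample:.3f}, population: {z_population:.3f}")
```

On large datasets the two values are nearly identical, but on small samples the gap can decide whether a point crosses the ±3 threshold.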

Conclusion

Understanding Normal Distribution and Z-Score is essential for data scientists and machine learning engineers. These concepts help you analyze and clean your data effectively, reducing the risk that outliers skew your models.

By using Python, you can easily calculate these metrics and apply them to your datasets, making your data analysis more robust and accurate.

Happy coding and analyzing!