Part 5: Understanding Normal Distribution and Log-Normal Distribution

In the world of data science and machine learning, understanding various data distributions is crucial for effective analysis and modeling. Two important distributions are the normal distribution and the log-normal distribution. This blog post will explain these concepts in simple terms and demonstrate their significance with Python code examples.

Normal Distribution

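A normal distribution, also known as a Gaussian distribution, is a symmetric, bell-shaped curve in which most data points cluster around the mean (average). The tails on either side of the mean represent values that become less and less frequent the farther they lie from the mean. For a normal distribution, about 68% of values fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three.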
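
To make the idea of "clustered around the mean" concrete, here is a minimal sketch that draws simulated values from a normal distribution and measures what share falls within one, two, and three standard deviations of the mean. The mean of 70, standard deviation of 10, and sample size are illustrative assumptions, not values from a real dataset.

```python
import numpy as np

# Simulated data: mean, standard deviation, and sample size are illustrative assumptions
rng = np.random.default_rng(seed=0)
mean, std = 70.0, 10.0
scores = rng.normal(loc=mean, scale=std, size=100_000)

def share_within(values, center, spread, k):
    """Return the fraction of values lying within k * spread of center."""
    return float(np.mean(np.abs(values - center) <= k * spread))

within_1 = share_within(scores, mean, std, 1)
within_2 = share_within(scores, mean, std, 2)
within_3 = share_within(scores, mean, std, 3)

# For a normal distribution, theory gives roughly 0.683, 0.954, and 0.997
print(f"within 1 std: {within_1:.3f}")
print(f"within 2 std: {within_2:.3f}")
print(f"within 3 std: {within_3:.3f}")
```

The printed shares should land close to the theoretical values, with small differences caused by random sampling.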

Real-Life Examples of Normal Distribution

Test Scores: Most students score around the average, with fewer students scoring extremely high or low.
Employee Performance: Most employees perform at an average level, with a few performing exceptionally well or poorly.

Visualizing Normal Distribution

Let's consider a dataset of people's highest education levels, measured as years of schooling. If we plot this data on a histogram and it forms a roughly symmetric bell curve, that suggests the data is approximately normally distributed.
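
The post does not include the underlying dataset, so the sketch below uses simulated values from a standard normal distribution (an assumption) to show what "forms a bell curve" means in terms of histogram counts. It prints a rough text histogram instead of opening a plot window.

```python
import numpy as np

# Simulated stand-in data; the real dataset is not part of this post
rng = np.random.default_rng(seed=1)
values = rng.normal(loc=0.0, scale=1.0, size=50_000)

# Nine equal-width bins from -4.5 to 4.5, so the middle bin is centered on the mean
counts, edges = np.histogram(values, bins=9, range=(-4.5, 4.5))

# Print one row of '#' characters per bin, scaled so the tallest bin has 50 characters
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    bar = "#" * int(round(50 * count / counts.max()))
    print(f"{left:5.1f} to {right:5.1f} | {bar}")
```

The tallest bar sits in the middle bin and the bars shrink at a similar rate on both sides, which is the bell shape a histogram of normally distributed data is expected to show.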

Log-Normal Distribution

A variable follows a log-normal distribution when the logarithm of its values is normally distributed. Unlike the normal distribution, the log-normal distribution is right-skewed, meaning it has a long tail on the right side. This happens because log-normal values are always positive: they are bounded below by zero but can extend across several orders of magnitude above the typical value.
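
The definition can be checked directly with a small sketch: start from normally distributed values, exponentiate them to get log-normal values, then take the logarithm again. The parameters mu = 0 and sigma = 1 are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Assumed parameters of the underlying normal distribution
mu, sigma = 0.0, 1.0
z = rng.normal(loc=mu, scale=sigma, size=100_000)

# Exponentiating normal values gives log-normal values by construction
x = np.exp(z)

# Log-normal values are strictly positive and right-skewed (mean above median)
print("smallest value is positive:", bool(x.min() > 0))
print("mean is larger than median:", bool(x.mean() > np.median(x)))

# Taking the natural log recovers the normal distribution's mean and spread
log_x = np.log(x)
print(f"mean of log(x): {log_x.mean():.3f}")
print(f"std of log(x):  {log_x.std():.3f}")
```

The mean and standard deviation of log(x) should come out close to the assumed mu and sigma.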

Example: Income Distribution

Consider a dataset of people's incomes. Most people might earn around $50,000, but some individuals, like billionaires, could earn significantly more. This results in a right-skewed distribution because the income range extends far to the right.

Transforming Log-Normal Data to a Normal Distribution

By applying a logarithmic transformation to the income data, the distribution can be made to resemble a normal distribution. This is useful because many statistical analyses and some machine learning models behave better when the data is roughly symmetric rather than dominated by a few extreme values.
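
As a rough illustration of what the transformation does to the tails, the sketch below simulates incomes with a median of about $50,000 and compares how far the top 1% sits above the median with how far the bottom 1% sits below it. The incomes are simulated, and the spread parameter sigma = 0.8 is an assumption, not an estimate from real income data.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

median_income = 50_000  # the "around $50,000" figure from the example
sigma = 0.8             # assumed spread of log-income, for illustration only
incomes = rng.lognormal(mean=np.log(median_income), sigma=sigma, size=50_000)

def tail_ratio(values):
    """Distance from the median up to the 99th percentile, divided by
    the distance from the median down to the 1st percentile."""
    p1, p50, p99 = np.percentile(values, [1, 50, 99])
    return float((p99 - p50) / (p50 - p1))

raw_ratio = tail_ratio(incomes)            # well above 1: long right tail
log_ratio = tail_ratio(np.log10(incomes))  # close to 1: tails are balanced

print(f"tail ratio, raw incomes:   {raw_ratio:.2f}")
print(f"tail ratio, log10 incomes: {log_ratio:.2f}")
```

A ratio far above 1 means the right tail is much longer than the left. After the log transformation the ratio should be close to 1, which is what a symmetric, normal-like distribution looks like.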

Applying Log Transformation in Data Analysis

Example: Income Data

Let's say we have income data that is right-skewed. We can apply a logarithmic transformation to make the data more normally distributed.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Sample income data
data = {
    'Income': [5000, 10000, 20000, 50000, 100000, 200000, 500000, 1000000]
}
df = pd.DataFrame(data)

# Plotting the original income data
sns.histplot(df['Income'], kde=True)
plt.title('Original Income Distribution')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()

# Applying a base-10 log transformation
df['Log_Income'] = np.log10(df['Income'])

# Plotting the log-transformed income data
sns.histplot(df['Log_Income'], kde=True)
plt.title('Log-Transformed Income Distribution')
plt.xlabel('log10(Income)')
plt.ylabel('Frequency')
plt.show()

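In the above code, the original income data is right-skewed: most values sit at the low end and a few very large incomes stretch the histogram to the right. After applying the log transformation, the values are spread far more evenly and symmetrically. With only eight sample values the histogram is coarse, but on a larger log-normal dataset the transformed histogram would take on the bell shape of a normal distribution.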
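
The change can also be checked numerically rather than by eye. One common measure is skewness, where a value near 0 indicates a symmetric distribution and a large positive value indicates a long right tail. This sketch reuses the eight sample incomes from the code above and relies on pandas' built-in `Series.skew()`; treating skewness as the yardstick is a choice made here for illustration.

```python
import numpy as np
import pandas as pd

# Same eight sample incomes as in the example above
income = pd.Series(
    [5000, 10000, 20000, 50000, 100000, 200000, 500000, 1000000],
    name="Income",
)
log_income = np.log10(income)

# Skewness near 0 means roughly symmetric; large positive means right-skewed
skew_before = income.skew()
skew_after = log_income.skew()

print(f"skewness before log transform: {skew_before:.2f}")
print(f"skewness after log transform:  {skew_after:.2f}")
```

The skewness of the raw incomes is clearly positive, while the skewness of the log-transformed incomes is close to zero.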

Log-Normal Distribution in Data Science

Example: Credit Risk Analysis

Consider a scenario where you're building a machine learning model to predict whether to approve a loan for a person based on their income. If one person's income is extremely high compared to others, that single value can dominate the feature and skew the model. By applying a log transformation to the income data, you can bring all values to a more comparable scale, which often helps models that are sensitive to the scale of their inputs.

import numpy as np
import pandas as pd

# Sample data for credit risk analysis
data = {
    'Credit_Score': [700, 650, 800, 750],
    'Income': [32000, 77000, 550000, 45000],
    'Age': [30, 45, 35, 50],
    'Loan_Approved': [1, 0, 1, 0]
}
df = pd.DataFrame(data)

# Log transform (base 10) the Income column
df['Log_Income'] = np.log10(df['Income'])

print(df)

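By applying the log transformation, the income values become more comparable: the highest income (550,000) is about 17 times the lowest (32,000), but their log10 values are about 5.74 and 4.51. This reduces the influence of extreme values on the model.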
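
To put a number on "more comparable", the following sketch takes the four sample incomes from the table above and compares the spread between the largest and smallest value before and after the transformation. It is only an illustration on four rows, not evidence about model accuracy.

```python
import numpy as np

# The four sample incomes from the credit risk example
income = np.array([32000, 77000, 550000, 45000], dtype=float)
log_income = np.log10(income)

# On the raw scale the largest income is many times the smallest
raw_ratio = income.max() / income.min()

# On the log10 scale the same pair is separated by a small additive gap
log_gap = log_income.max() - log_income.min()

print(f"largest / smallest income: {raw_ratio:.2f}")
print(f"gap between log10 values:  {log_gap:.2f}")
```

A multiplicative difference on the raw scale becomes a small additive difference on the log scale, which is why the extreme income no longer dwarfs the others.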

Other Examples of Log-Normal Distribution

Hospitalization Days: Most people might spend a few days in the hospital, but critically ill patients could spend several months.
Advertising Budget: Small to mid-tier companies might have modest advertising budgets, while large corporations could have substantial budgets running into millions or billions.

Conclusion

Understanding normal and log-normal distributions is essential for effective data analysis and machine learning. By applying log transformations to right-skewed, positive-valued data, you can handle that data better and often improve the performance of your models.

We hope this post has provided you with a clear understanding of these concepts and their practical applications. Stay tuned for more insights into data science and machine learning. Happy learning!