Part 1: Understanding Mean, Median, Mode, and Percentile in Data Science
- Revanth Reddy Tondapu
- Jun 23, 2024

In this blog post, we'll dive into some fundamental statistical concepts: mean, median, mode, and percentile. We'll explore how these concepts are used in data science and machine learning, and we'll also practice them using Python code. Finally, we'll provide an exercise for you to solidify your understanding. Let's get started!
Why These Concepts Matter
Imagine you're planning to open a luxury car showroom in Monroe Township, New Jersey. Before making any decisions, you'll need to analyze the income levels of people in the area. If incomes are high, residents are more likely to buy luxury cars. If not, opening a showroom there might not be a good idea.
Mean
The mean (or average) is a common measure to summarize data. However, it can be misleading if the data contains outliers.
For example, if we calculate the average income of people in Monroe Township and include an extremely high income (like Elon Musk's), the average will be skewed. This could lead to poor business decisions.
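To make the skew concrete, here is a minimal sketch using Python's built-in statistics module. The income figures are the same illustrative sample used in the pandas examples later in this post, not real Monroe Township data.

```python
from statistics import mean

# Illustrative incomes (not real data); the last value plays the role of the outlier
typical_incomes = [4000, 5000, 6000, 7000, 8000, 9000]
incomes_with_outlier = typical_incomes + [10_000_000]

mean_typical = mean(typical_incomes)
mean_with_outlier = mean(incomes_with_outlier)

print(f"Mean without the outlier: {mean_typical}")
print(f"Mean with the outlier: {mean_with_outlier:,.0f}")
```

A single extreme value moves the mean from 6,500 to roughly 1.43 million, a figure that describes none of the seven people in the sample.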
Median
The median is the middle value of a sorted dataset. It is more robust to outliers compared to the mean. If the dataset has an even number of data points, the median is the average of the two middle values.
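The two cases (odd and even number of data points) can be sketched with the standard library's statistics.median. The lists below are illustrative values, not real data.

```python
from statistics import median

# Seven values: the median is the fourth value once sorted
odd_count = [4000, 5000, 6000, 7000, 8000, 9000, 10_000_000]
# Four values: the median is the average of the two middle values (5000 and 6000)
even_count = [4000, 5000, 6000, 7000]

median_odd = median(odd_count)
median_even = median(even_count)

print(f"Median (odd count): {median_odd}")
print(f"Median (even count): {median_even}")
```

Note that the 10,000,000 outlier has no effect on the first result: the median stays at 7000.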
Mode
The mode is the most frequently occurring value in a dataset. It is especially useful for categorical data, where a mean or median cannot be computed.
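As a small sketch of the mode on categorical data, the example below uses statistics.mode and statistics.multimode (available since Python 3.8). The survey answers are hypothetical and exist only for illustration.

```python
from statistics import mode, multimode

# Hypothetical survey answers: the body style each respondent prefers
preferred_style = ["SUV", "Sedan", "SUV", "Coupe", "Sedan", "SUV"]
most_common_style = mode(preferred_style)

# With a tie, multimode returns every value that shares the top count
tied_answers = ["SUV", "Sedan", "SUV", "Sedan", "Coupe"]
all_modes = multimode(tied_answers)

print(f"Mode: {most_common_style}")
print(f"Modes when tied: {all_modes}")
```

A dataset can have more than one mode, which is why multimode returns a list.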
Percentile
Percentiles describe how data is distributed. For instance, the 50th percentile (the median) is the value that half of the data points fall at or below.
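There are several conventions for computing percentiles. The sketch below implements one of them, linear interpolation between neighbouring sorted values, which is the default behaviour of the pandas quantile method. The helper name percentile is hypothetical, and the function assumes a non-empty list.

```python
import math

def percentile(values, q):
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)  # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    # Move 'fraction' of the way from the lower neighbour to the upper one
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

incomes = [4000, 5000, 6000, 7000, 8000, 9000, 10_000_000]
p25 = percentile(incomes, 0.25)
p50 = percentile(incomes, 0.50)
p75 = percentile(incomes, 0.75)

print(p25, p50, p75)
```

For this sample the 25th percentile lands halfway between 5000 and 6000, and the 50th percentile equals the median.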
Practical Use Cases in Data Science
Descriptive Statistics
The median is often used instead of the mean as a more representative measure of central tendency, especially in the presence of outliers.
Handling Missing Values
When building machine learning models, you might encounter missing values in your dataset. One way to handle this is to impute missing values using the median, which is less affected by outliers than the mean.
Removing Outliers
Percentiles can be used to identify and remove outliers from the dataset. For example, you might remove data points above the 99th percentile to reduce the impact of extreme values.
Python Code Examples
Let's practice these concepts using Python.
Loading the Data
First, let's load the income data into a pandas DataFrame.
import pandas as pd
import numpy as np
# Sample income data
data = {
'Income': [4000, 5000, 6000, 7000, 8000, 9000, 10000000]
}
df = pd.DataFrame(data)
print(df)
Calculating Mean, Median, and Mode
mean_income = df['Income'].mean()
median_income = df['Income'].median()
# Every value in this sample is unique, so mode() returns all of them; [0] picks the smallest
mode_income = df['Income'].mode()[0]
print(f"Mean: {mean_income}")
print(f"Median: {median_income}")
print(f"Mode: {mode_income}")
Calculating Percentiles
percentile_25 = df['Income'].quantile(0.25)
percentile_75 = df['Income'].quantile(0.75)
percentile_99 = df['Income'].quantile(0.99)
print(f"25th Percentile: {percentile_25}")
print(f"75th Percentile: {percentile_75}")
print(f"99th Percentile: {percentile_99}")
Removing Outliers
Let's remove the outliers above the 99th percentile.
threshold = df['Income'].quantile(0.99)
df_no_outliers = df[df['Income'] <= threshold]
print(df_no_outliers)
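It can help to check what the filter actually did. The sketch below rebuilds the same sample as a Series (the variable names are illustrative) and compares the mean before and after trimming. It assumes the default linear interpolation of quantile.

```python
import pandas as pd

incomes = pd.Series([4000, 5000, 6000, 7000, 8000, 9000, 10_000_000], name="Income")

# With only seven values, the 99th percentile is interpolated
# between the two largest values (9000 and 10,000,000)
threshold = incomes.quantile(0.99)
trimmed = incomes[incomes <= threshold]

print(f"Threshold: {threshold:,.0f}")
print(f"Rows kept: {len(trimmed)} of {len(incomes)}")
print(f"Mean before: {incomes.mean():,.0f}")
print(f"Mean after:  {trimmed.mean():,.0f}")
```

On a sample this small, the cut removes one row out of seven rather than roughly 1% of the data, so on real datasets it is worth checking how many rows a percentile cut-off drops.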
Handling Missing Values
Let's assume one of the income values is missing and impute it using the median.
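# Cast to float first so the column can hold NaN without a dtype warning
df['Income'] = df['Income'].astype(float)
df.loc[3, 'Income'] = np.nan
print(df)
# Impute missing values with the median (assign the result rather than using inplace=True on a column)
df['Income'] = df['Income'].fillna(df['Income'].median())
print(df)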
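To see why the median is the safer fill value here, the following sketch imputes the same gap with both the median and the mean. It is a side-by-side comparison on the illustrative sample, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Same sample as above, with the fourth value missing
incomes = pd.Series([4000, 5000, 6000, np.nan, 8000, 9000, 10_000_000], name="Income")

# median() and mean() skip NaN by default
filled_with_median = incomes.fillna(incomes.median())
filled_with_mean = incomes.fillna(incomes.mean())

print(f"Median fill: {filled_with_median[3]:,.0f}")
print(f"Mean fill:   {filled_with_mean[3]:,.0f}")
```

The median fill gives 7,000, which fits the surrounding values, while the mean fill gives 1,672,000 because the outlier dominates the average.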
Exercise
Download the Airbnb New York dataset and use percentiles to remove outliers. You can find the dataset on Kaggle.
- Download the dataset.
- Load it into a pandas DataFrame.
- Identify and remove outliers using appropriate percentiles.
- Impute any missing values using the median.
Here's a starting point for your code:
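import pandas as pd
# Load the Airbnb New York dataset
df_airbnb = pd.read_csv('AB_NYC_2019.csv')
# Identify and remove outliers; .copy() gives an independent DataFrame that is safe to modify
percentile_99 = df_airbnb['price'].quantile(0.99)
df_airbnb_no_outliers = df_airbnb[df_airbnb['price'] <= percentile_99].copy()
# Impute missing values using the median (assign the result rather than using inplace=True)
# Rows whose price is NaN fail the comparison above, so repeat this pattern for other numeric columns with gaps
df_airbnb_no_outliers['price'] = df_airbnb_no_outliers['price'].fillna(df_airbnb_no_outliers['price'].median())
print(df_airbnb_no_outliers)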
Conclusion
Understanding mean, median, mode, and percentiles is crucial for effective data analysis and decision-making in data science and machine learning. These concepts help you summarize data, handle outliers, and impute missing values more accurately. Practice these concepts using the provided Python code and exercise to strengthen your understanding.
Happy coding!