Revanth Reddy Tondapu

Part 1: Understanding Mean, Median, Mode, and Percentile in Data Science


Math and Statistics for AI

In this blog post, we'll dive into some fundamental statistical concepts: mean, median, mode, and percentile. We'll explore how these concepts are used in data science and machine learning, and we'll also practice them using Python code. Finally, we'll provide an exercise for you to solidify your understanding. Let's get started!


Why These Concepts Matter

Imagine you're planning to open a luxury car showroom in Monroe Township, New Jersey. Before making any decisions, you'll need to analyze the income levels of people in the area. If incomes are high, people are more likely to buy luxury cars. If not, opening a showroom might not be a good idea.


Mean

The mean (or average) is the sum of all values divided by the number of values. It is a common way to summarize data, but it can be misleading when the data contains outliers.

For example, if we calculate the average income of people in Monroe Township and include an extremely high income (like Elon Musk's), the average will be skewed. This could lead to poor business decisions.
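
To make the skew concrete, here is a minimal sketch using Python's built-in statistics module. The income figures are illustrative values (the same ones used in the sample data later in this post), not real Monroe Township data.

```python
from statistics import mean

# Illustrative incomes for six residents (made-up values)
incomes = [4000, 5000, 6000, 7000, 8000, 9000]

# The same list with one extremely high income added
incomes_with_outlier = incomes + [10_000_000]

mean_without = mean(incomes)
mean_with = mean(incomes_with_outlier)

print(f"Mean without outlier: {mean_without}")
print(f"Mean with outlier: {mean_with:.2f}")
```

A single extreme value moves the mean from the thousands into the millions, even though six of the seven incomes are below 10,000.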


Median

The median is the middle value of a sorted dataset. It is more robust to outliers than the mean. If the dataset has an even number of data points, the median is the average of the two middle values.
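
The odd and even cases can be sketched with the standard library. The two lists below are made-up values chosen only to show both cases.

```python
from statistics import median

odd_count = [7000, 4000, 9000, 5000, 6000]          # 5 values
even_count = [7000, 4000, 9000, 5000, 6000, 8000]   # 6 values

# median() sorts internally, so the input order does not matter
median_odd = median(odd_count)    # middle value of 4000, 5000, 6000, 7000, 9000
median_even = median(even_count)  # average of the two middle values, 6000 and 7000

print(median_odd)
print(median_even)
```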


Mode

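The mode is the most frequently occurring value in a dataset. It is especially useful for categorical data, where a mean or median cannot be computed. A dataset can have more than one mode when several values tie for most frequent.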
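
As a small sketch of the mode on categorical data, assume a hypothetical survey of preferred car body styles (the categories and counts are invented for illustration):

```python
from statistics import mode, multimode

# Hypothetical survey answers: preferred car body style
preferences = ["SUV", "Sedan", "SUV", "Coupe", "Sedan", "SUV"]
most_common = mode(preferences)

# When several values tie for most frequent, multimode() returns all of them
# in the order they first appear
tied = multimode(["SUV", "Sedan", "SUV", "Sedan", "Coupe"])

print(most_common)
print(tied)
```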

Percentile

Percentiles describe how data is distributed. The pth percentile is the value below which roughly p% of the data points fall. For instance, the 50th percentile is the median: about half of the data points lie below it.
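
The sketch below computes a few percentiles with the standard library and checks that the 50th percentile matches the median. The income list is illustrative, and the choice of method="inclusive" is an assumption made here so that every cut point stays within the observed minimum and maximum.

```python
from statistics import median, quantiles

incomes = [4000, 5000, 6000, 7000, 8000, 9000, 10_000_000]

# quantiles(n=100) returns the 99 cut points for percentiles 1 through 99
cut_points = quantiles(incomes, n=100, method="inclusive")
p25, p50, p75 = cut_points[24], cut_points[49], cut_points[74]

# Count the data points on each side of the 50th percentile
below = sum(x < p50 for x in incomes)
above = sum(x > p50 for x in incomes)

print(p25, p50, p75)
print(p50 == median(incomes))
print(below, above)
```

Note that the extreme value of 10,000,000 has no effect on these three percentiles, which is why they are useful for describing skewed data.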


Practical Use Cases in Data Science

Descriptive Statistics

The median is often reported instead of the mean as a more representative measure of central tendency, especially when the data contains outliers.
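
One quick way to see both measures side by side is the describe() method in pandas. This is a minimal sketch on illustrative income values; the mean-to-median ratio at the end is just one informal way to flag skew, not a standard statistic.

```python
import pandas as pd

incomes = pd.Series([4000, 5000, 6000, 7000, 8000, 9000, 10_000_000], name="Income")

# describe() reports count, mean, std, min, the 25th/50th/75th percentiles, and max
summary = incomes.describe()
print(summary)

# The "50%" row is the median; a mean far above it suggests a skewed distribution
print(f"Mean / median ratio: {summary['mean'] / summary['50%']:.1f}")
```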


Handling Missing Values

When building machine learning models, you might encounter missing values in your dataset. One way to handle this is to impute missing values using the median, which is less affected by outliers than the mean.


Removing Outliers

Percentiles can be used to identify and remove outliers from the dataset. For example, you might remove data points above the 99th percentile to reduce the impact of extreme values.


Python Code Examples

Let's practice these concepts using Python.

Loading the Data

First, let's load the income data into a pandas DataFrame.

import pandas as pd
import numpy as np

# Sample income data
data = {
    'Income': [4000, 5000, 6000, 7000, 8000, 9000, 10000000]
}
df = pd.DataFrame(data)
print(df)

Calculating Mean, Median, and Mode

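mean_income = df['Income'].mean()
median_income = df['Income'].median()
# mode() returns every most-frequent value, sorted; here all values are unique,
# so each one is a mode and [0] simply picks the smallest (4000)
mode_income = df['Income'].mode()[0]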

print(f"Mean: {mean_income}")
print(f"Median: {median_income}")
print(f"Mode: {mode_income}")

Calculating Percentiles

percentile_25 = df['Income'].quantile(0.25)
percentile_75 = df['Income'].quantile(0.75)
percentile_99 = df['Income'].quantile(0.99)

print(f"25th Percentile: {percentile_25}")
print(f"75th Percentile: {percentile_75}")
print(f"99th Percentile: {percentile_99}")
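
With only seven data points, the 99th percentile does not land exactly on an observation, so quantile() interpolates linearly between the two nearest values by default. The sketch below, which rebuilds the same income values as a standalone Series, compares that default with the "lower" and "higher" options of the interpolation parameter.

```python
import pandas as pd

incomes = pd.Series([4000, 5000, 6000, 7000, 8000, 9000, 10_000_000])

# Default: linear interpolation between the two nearest data points
p99_linear = incomes.quantile(0.99)

# "lower" and "higher" return the actual observations on either side instead
p99_lower = incomes.quantile(0.99, interpolation="lower")
p99_higher = incomes.quantile(0.99, interpolation="higher")

print(p99_linear, p99_lower, p99_higher)
```

On a dataset this small the interpolated value sits far from any real observation, so high percentiles are more meaningful on larger samples.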

Removing Outliers

Let's remove the outliers above the 99th percentile.

threshold = df['Income'].quantile(0.99)
df_no_outliers = df[df['Income'] <= threshold]
print(df_no_outliers)

Handling Missing Values

Let's assume one of the income values is missing and impute it using the median.

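# Convert to float first so the integer column can hold NaN
df['Income'] = df['Income'].astype(float)
df.loc[3, 'Income'] = np.nan
print(df)

# Impute missing values with the median (median() skips NaN by default);
# assigning the result back avoids modifying a column selection in place
df['Income'] = df['Income'].fillna(df['Income'].median())
print(df)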
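
To see why the median is the safer fill value for this data, the following sketch imputes the same missing entry both ways. It uses a standalone Series with the same illustrative incomes, so it can be run on its own.

```python
import numpy as np
import pandas as pd

incomes = pd.Series([4000, 5000, 6000, np.nan, 8000, 9000, 10_000_000], name="Income")

# mean() and median() both skip NaN by default
fill_with_median = incomes.fillna(incomes.median())
fill_with_mean = incomes.fillna(incomes.mean())

print(f"Median-imputed value: {fill_with_median[3]}")
print(f"Mean-imputed value: {fill_with_mean[3]:.2f}")
```

The median fill stays in line with the typical incomes, while the mean fill is pulled into the millions by the single outlier.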

Exercise

Download the Airbnb New York dataset and use percentiles to remove outliers. You can find the dataset on Kaggle.

  1. Download the dataset.

  2. Load it into a pandas DataFrame.

  3. Identify and remove outliers using appropriate percentiles.

  4. Impute any missing values using the median.

Here's a starting point for your code:

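import pandas as pd

# Load the Airbnb New York dataset
df_airbnb = pd.read_csv('AB_NYC_2019.csv')

# Identify and remove outliers above the 99th percentile.
# Rows with a missing price are kept here: NaN <= x is False, so a plain
# comparison would drop them before they could be imputed.
percentile_99 = df_airbnb['price'].quantile(0.99)
keep = (df_airbnb['price'] <= percentile_99) | df_airbnb['price'].isna()
df_airbnb_no_outliers = df_airbnb[keep].copy()

# Impute missing values using the median (assign the result back instead of
# calling fillna with inplace=True on a column of a filtered DataFrame)
df_airbnb_no_outliers['price'] = df_airbnb_no_outliers['price'].fillna(
    df_airbnb_no_outliers['price'].median()
)

print(df_airbnb_no_outliers)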
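
If more than one column turns out to have gaps, step 4 can be generalized. The sketch below is one possible approach, not the only one: impute_numeric_with_median is a hypothetical helper written for this post, and the toy DataFrame stands in for the real dataset with made-up column names and values.

```python
import numpy as np
import pandas as pd

def impute_numeric_with_median(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with NaNs in numeric columns replaced by each column's median.

    Hypothetical helper for illustration; not part of pandas.
    """
    result = df.copy()
    numeric_cols = result.select_dtypes(include="number").columns
    # fillna accepts a Series that maps each column name to its fill value
    result[numeric_cols] = result[numeric_cols].fillna(result[numeric_cols].median())
    return result

# Tiny stand-in for the real dataset (column names and values are made up)
toy = pd.DataFrame({
    "price": [100.0, 150.0, np.nan, 200.0],
    "reviews": [1.0, np.nan, 3.0, 5.0],
    "name": ["a", "b", None, "d"],
})

print(toy.isna().sum())  # number of missing values per column
filled = impute_numeric_with_median(toy)
print(filled)
```

Non-numeric columns such as name are left untouched, since a median is not defined for them.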

Conclusion

Understanding mean, median, mode, and percentiles is crucial for effective data analysis and decision-making in data science and machine learning. These concepts help you summarize data, handle outliers, and impute missing values more accurately. Practice these concepts using the provided Python code and exercise to strengthen your understanding.

Happy coding!
