Part 10: Detecting Outliers with the Modified Z-Score: A Comprehensive Guide

Outliers can distort the results of a data analysis, so detecting them reliably matters. In this post we'll look at the modified Z-score, a robust method for identifying outliers. We'll start with a simple Excel demonstration and then implement the method in Python.

Understanding the Basics

Before we jump into the modified Z-score, let's revisit some fundamental concepts:

Mean and Median

Mean: The average value of a dataset, calculated by summing all values and dividing by the number of values.
Median: The middle value of a sorted dataset. If the dataset has an odd number of values, the median is the middle one. If even, it's the average of the two middle values.

Why Median Over Mean?

The mean can be heavily influenced by outliers, whereas the median is more robust in such scenarios. This makes the median a better choice when dealing with skewed data or outliers.
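
To make the difference concrete, here is a minimal sketch using Python's standard statistics module. The numbers are made up for illustration; they are not from any dataset used in this post.

```python
import statistics

# Hypothetical values, invented for illustration only
values = [10, 12, 11, 13, 12]
with_outlier = values + [100]

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")
```

A single extreme value more than doubles the mean here, while the median stays at 12.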

Modified Z-Score: The Concept

The modified Z-score is designed to be more robust against outliers by using the median and the Median Absolute Deviation (MAD) instead of the mean and standard deviation.

Formula

The modified Z-score for a data point \( x \) is calculated as:

\[ \text{Modified Z-score} = \frac{0.6745 \times (x - \text{Median})}{\text{MAD}} \]

where:

\( x \) is the data point.
Median is the median of the dataset.
MAD is the Median Absolute Deviation, calculated as the median of the absolute deviations from the dataset's median.
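
As a rough sketch of the formula, the snippet below computes the MAD and the modified Z-scores for a tiny, made-up sample using only the standard library. The helper names mad and modified_z_scores are illustrative, not from any library. The last line also shows where the constant comes from: 0.6745 is approximately the 0.75 quantile of the standard normal distribution, which puts the score on roughly the same scale as a classic Z-score when the data are normally distributed.

```python
from statistics import NormalDist, median

def mad(values):
    """Median of the absolute deviations from the median."""
    center = median(values)
    return median(abs(v - center) for v in values)

def modified_z_scores(values):
    """Modified Z-score for each value (assumes the MAD is not zero)."""
    center = median(values)
    spread = mad(values)
    return [0.6745 * (v - center) / spread for v in values]

sample = [1, 2, 3, 4, 100]  # hypothetical data
scores = modified_z_scores(sample)
for value, score in zip(sample, scores):
    print(f"{value:>4}  {score:8.3f}")

# The constant is the 0.75 quantile of the standard normal, rounded
print(round(NormalDist().inv_cdf(0.75), 4))
```

For this sample the median is 3 and the MAD is 1, so the score of each point is simply 0.6745 times its distance from 3.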

Threshold

A common threshold for the modified Z-score is 3.5. Data points whose modified Z-score has an absolute value greater than 3.5 are considered outliers, so both unusually high and unusually low values are caught.
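
One detail worth spelling out: the score is signed, so the comparison should be made on its absolute value, otherwise unusually low values slip through. A small sketch with invented numbers (the helper name modified_z is hypothetical):

```python
from statistics import median

THRESHOLD = 3.5

def modified_z(values):
    """Signed modified Z-scores (assumes the MAD is not zero)."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return [0.6745 * (v - center) / mad for v in values]

data = [50, 51, 52, 53, 54, 1]  # hypothetical; 1 is suspiciously low
scores = modified_z(data)

# Comparing the signed score only catches high-side outliers
signed_only = [v for v, s in zip(data, scores) if s > THRESHOLD]
# Comparing the absolute value catches both sides
with_abs = [v for v, s in zip(data, scores) if abs(s) > THRESHOLD]

print(signed_only)
print(with_abs)
```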

Excel Demonstration

Let's start with a simple example using Excel:

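Dataset: Consider the heights of seven individuals, entered in cells A2 to A8.

Mean Calculation:

Formula: =AVERAGE(A2:A8).

Median Calculation:

Formula: =MEDIAN(A2:A8).

Calculate MAD:

First, find the absolute deviation of each height from the median, for example with =ABS(A2-MEDIAN($A$2:$A$8)) in B2, filled down to B8 (column B is an assumed helper column).
Then, calculate the median of these absolute deviations: =MEDIAN(B2:B8).

Calculate Modified Z-Score:

For each height, apply the modified Z-score formula, for example with =0.6745*(A2-MEDIAN($A$2:$A$8))/MEDIAN($B$2:$B$8) in C2, filled down to C8 (column C is likewise an assumption).

Identify Outliers:

Flag any height whose modified Z-score has an absolute value above 3.5, for example with =ABS(C2)>3.5.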
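
For readers who want to check their spreadsheet, the same steps can be sketched in Python. The seven heights below (in centimetres) are invented for this example, since the actual values are not listed here.

```python
from statistics import mean, median

# Hypothetical heights in cm, standing in for cells A2:A8
heights = [160, 165, 168, 170, 172, 175, 230]

avg = mean(heights)                         # like =AVERAGE(A2:A8)
med = median(heights)                       # like =MEDIAN(A2:A8)
abs_dev = [abs(h - med) for h in heights]   # absolute deviations from the median
mad = median(abs_dev)                       # median of those deviations

print(f"mean={avg:.1f}  median={med}  MAD={mad}")
for h, d in zip(heights, abs_dev):
    score = 0.6745 * (h - med) / mad
    flag = "outlier" if abs(score) > 3.5 else ""
    print(f"{h:>4} {d:>4} {score:7.3f} {flag}")
```

With these made-up numbers the median is 170 and the MAD is 5, so only the 230 cm entry crosses the threshold.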

Python Implementation

Now, let's implement the same process in Python using the pandas and numpy libraries. We'll work with a dataset of movie revenues to detect outliers.

Step-by-Step Code

import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("movies.csv")

# Convert revenue to millions
df['revenue_mln'] = df['revenue'] / 1e6

# Display basic statistics
print(df['revenue_mln'].describe())

# Calculate mean and standard deviation for Z-score
mean = df['revenue_mln'].mean()
std_dev = df['revenue_mln'].std()

# Function to calculate Z-score
def get_z_score(value, mean, std_dev):
    return (value - mean) / std_dev

# Apply Z-score calculation (vectorized over the whole column)
df['z_score'] = get_z_score(df['revenue_mln'], mean, std_dev)

# Detect outliers using Z-score
outliers_z = df[df['z_score'].abs() > 3]
print("Outliers detected using Z-score:")
print(outliers_z)

# Function to calculate MAD (Median Absolute Deviation)
def get_mad(series):
    # Series.median() skips missing values (NaN) by default
    median = series.median()
    diff = (series - median).abs()
    return diff.median()

# Calculate MAD and median for Modified Z-score
median = df['revenue_mln'].median()
mad = get_mad(df['revenue_mln'])

# Function to calculate Modified Z-score
def get_modified_z_score(value, median, mad):
    return 0.6745 * (value - median) / mad

# Apply Modified Z-score calculation (vectorized; assumes mad is not zero)
df['modified_z_score'] = get_modified_z_score(df['revenue_mln'], median, mad)

# Detect outliers using Modified Z-score
outliers_modified_z = df[df['modified_z_score'].abs() > 3.5]
print("Outliers detected using Modified Z-score:")
print(outliers_modified_z)

Explanation

Load Dataset: Load your dataset into a pandas DataFrame.
Convert Revenue: Convert revenue values to millions for easier interpretation.
Calculate Mean and Standard Deviation: Use these to compute the Z-score.
Apply Z-Score: Calculate the Z-score for each revenue value and detect outliers.
Calculate MAD: Compute the MAD and the median for the dataset.
Apply Modified Z-Score: Calculate the modified Z-score for each revenue value and detect outliers.
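
The modified Z-score steps can also be folded into one reusable helper. The sketch below is one possible way to do it, not the only one: the function name modified_z_scores is hypothetical, the small DataFrame is an invented stand-in for movies.csv, and returning NaN when the MAD is zero is a design choice made for this example.

```python
import numpy as np
import pandas as pd

def modified_z_scores(series: pd.Series) -> pd.Series:
    """Vectorized modified Z-score; all-NaN if the MAD is zero or undefined."""
    median = series.median()                 # NaN values are skipped by default
    mad = (series - median).abs().median()
    if pd.isna(mad) or mad == 0:
        # The MAD is zero when more than half the values are identical,
        # in which case the score is undefined
        return pd.Series(np.nan, index=series.index, dtype=float)
    return 0.6745 * (series - median) / mad

# Hypothetical stand-in for the movies data; values are invented
df = pd.DataFrame({"revenue_mln": [10, 12, 11, 13, 12, 500, np.nan]})
df["modified_z_score"] = modified_z_scores(df["revenue_mln"])
outliers = df[df["modified_z_score"].abs() > 3.5]
print(outliers)
```

Rows with a missing revenue get a NaN score and are never flagged, which is usually preferable to having one missing value turn every score into NaN.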

Conclusion

The modified Z-score is a robust method for outlier detection, especially useful when dealing with skewed data or small sample sizes. By using the median and MAD, it reduces the impact of extreme values compared to the traditional Z-score.
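
The small-sample point can be illustrated directly. With the sample standard deviation (which is also what pandas' std() computes by default), no Z-score in a sample of n values can exceed (n - 1) / sqrt(n) in absolute value. That is about 2.27 for n = 7, so a cutoff of 3 can never fire on a sample that small. A sketch with invented numbers:

```python
from math import sqrt
from statistics import mean, median, stdev

revenues = [20, 25, 30, 35, 40, 45, 900]  # hypothetical, in millions

# Classic Z-score; stdev is the sample standard deviation
mu, sigma = mean(revenues), stdev(revenues)
z = [(r - mu) / sigma for r in revenues]

# Modified Z-score based on the median and the MAD
med = median(revenues)
mad = median(abs(r - med) for r in revenues)
mz = [0.6745 * (r - med) / mad for r in revenues]

z_flagged = [r for r, s in zip(revenues, z) if abs(s) > 3]
mz_flagged = [r for r, s in zip(revenues, mz) if abs(s) > 3.5]

n = len(revenues)
print("largest |z|:", round(max(abs(s) for s in z), 3))
print("bound (n-1)/sqrt(n):", round((n - 1) / sqrt(n), 3))
print("Z-score flags:", z_flagged)
print("Modified Z-score flags:", mz_flagged)
```

The extreme value inflates both the mean and the standard deviation, which hides it from the classic Z-score, while the median and MAD barely move.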

Feel free to try out the code and adapt it to your datasets. Happy analyzing!
