
In this blog post, we'll dive into the concepts of Mean Absolute Deviation (MAD) and Standard Deviation, two fundamental metrics in statistics and data science. These metrics help us understand how spread out our data points are from the average. Let's get started!
Why Measure Spread?
Imagine you have test scores from a history exam. Here are six data points representing scores out of 100:
75, 65, 72, 68, 70, 60
The exact mean is 410 / 6 ≈ 68.3; to keep the arithmetic simple, this post rounds it to 70 and uses 70 as the average in the worked examples. When you plot these scores on a chart, with the average drawn as a yellow line at 70, each data point sits close to that line.
Now, consider a different set of test scores, say from a mathematics exam:
55, 85, 40, 90, 95, 45
The mean is the same as for the history scores (410 / 6 ≈ 68.3, again rounded to 70), but when you plot these scores, they are far more spread out than the history scores.
In data science, it's crucial to know how far apart individual data points are from the average or how spread out they are. This helps in understanding the variability in your data.
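As a quick first look before any formal metric, the sketch below computes the mean, minimum, maximum, and range of both score lists using only the standard library. The helper name `summarize` is made up for this illustration. Note that the exact mean it prints is 410 / 6 ≈ 68.33, slightly below the round figure of 70 used in the text.

```python
from statistics import mean

history_scores = [75, 65, 72, 68, 70, 60]
math_scores = [55, 85, 40, 90, 95, 45]

def summarize(scores):
    """Return (mean, lowest, highest, range) for a list of scores.

    `summarize` is a hypothetical helper for this post, not a library function.
    """
    lowest, highest = min(scores), max(scores)
    return mean(scores), lowest, highest, highest - lowest

for name, scores in [("History", history_scores), ("Math", math_scores)]:
    avg, lowest, highest, spread = summarize(scores)
    print(f"{name}: mean={avg:.2f}, min={lowest}, max={highest}, range={spread}")
```

Both lists share the same mean, yet the range of the math scores is several times larger. The range only looks at the two extreme points, which is why the metrics that follow use every data point.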
Mean Absolute Deviation (MAD)
A straightforward way to measure spread is by calculating the Mean Absolute Deviation (MAD).
1. Calculate the Deviation: Subtract the average from each data point.
2. Take Absolute Values: Convert these deviations to absolute values (ignore negative signs).
3. Calculate the Mean of Absolute Deviations: Average these absolute deviations.
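The three steps above can be written as a single formula. The symbols are assumptions introduced here for illustration: x_i is the i-th data point, n is the number of data points, and x̄ is their mean.

```latex
\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert,
\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```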
Example: History Test Scores
Let's calculate the MAD for the history test scores.
Deviations:
75 - 70 = 5
65 - 70 = -5
72 - 70 = 2
68 - 70 = -2
70 - 70 = 0
60 - 70 = -10
Absolute Deviations:
|5| = 5
|-5| = 5
|2| = 2
|-2| = 2
|0| = 0
|-10| = 10
Mean Absolute Deviation:
MAD = (5 + 5 + 2 + 2 + 0 + 10) / 6 = 4
Example: Mathematics Test Scores
Now for the mathematics test scores.
Deviations:
55 - 70 = -15
85 - 70 = 15
40 - 70 = -30
90 - 70 = 20
95 - 70 = 25
45 - 70 = -25
Absolute Deviations:
|-15| = 15
|15| = 15
|-30| = 30
|20| = 20
|25| = 25
|-25| = 25
Mean Absolute Deviation:
MAD = (15 + 15 + 30 + 20 + 25 + 25) / 6 ≈ 21.67
The higher MAD in the math scores indicates that the data points are more spread out compared to the history scores.
Standard Deviation
While MAD is useful, it weights every deviation in proportion to its size. Sometimes we want a metric that penalizes large deviations more heavily, which squaring the differences from the mean achieves. This is where Standard Deviation comes in.
1. Calculate the Deviation: Subtract the average from each data point.
2. Square the Deviations: Square each deviation.
3. Calculate the Mean of Squared Deviations: Average these squared deviations. This mean is called the variance.
4. Square Root: Take the square root of this mean.
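In formula form, the steps above look roughly like this. The symbols are assumptions introduced here for illustration: x_i is the i-th data point, n is the number of data points, and x̄ is their mean. Because the steps divide by n, this is the population standard deviation; the sample version, which divides by n - 1, is a different quantity and is not what the steps describe.

```latex
\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}
```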
Example: History Test Scores
Deviations:
75 - 70 = 5
65 - 70 = -5
72 - 70 = 2
68 - 70 = -2
70 - 70 = 0
60 - 70 = -10
Squared Deviations:
5^2 = 25
(-5)^2 = 25
2^2 = 4
(-2)^2 = 4
0^2 = 0
(-10)^2 = 100
Mean of Squared Deviations:
Mean = (25 + 25 + 4 + 4 + 0 + 100) / 6 = 158 / 6 ≈ 26.33
Standard Deviation:
SD = sqrt(26.33) ≈ 5.13
Example: Mathematics Test Scores
Deviations:
55 - 70 = -15
85 - 70 = 15
40 - 70 = -30
90 - 70 = 20
95 - 70 = 25
45 - 70 = -25
Squared Deviations:
(-15)^2 = 225
15^2 = 225
(-30)^2 = 900
20^2 = 400
25^2 = 625
(-25)^2 = 625
Mean of Squared Deviations:
Mean = (225 + 225 + 900 + 400 + 625 + 625) / 6 = 500
Standard Deviation:
SD = sqrt(500) ≈ 22.36
The higher standard deviation in the math scores indicates a greater spread compared to the history scores.
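One practical consequence of squaring is that a single extreme value moves the Standard Deviation more than it moves the MAD. The sketch below illustrates this with the history scores and a hypothetical variant in which the lowest score, 60, is replaced by 30. The variant list and the helper names `mad` and `std` are made up for this illustration, and both helpers use the exact mean of each list rather than a rounded one.

```python
import math

def mad(scores):
    """Mean absolute deviation around the exact mean."""
    center = sum(scores) / len(scores)
    return sum(abs(x - center) for x in scores) / len(scores)

def std(scores):
    """Population standard deviation (divides by n) around the exact mean."""
    center = sum(scores) / len(scores)
    return math.sqrt(sum((x - center) ** 2 for x in scores) / len(scores))

history_scores = [75, 65, 72, 68, 70, 60]
with_outlier = [75, 65, 72, 68, 70, 30]  # hypothetical: 60 replaced by 30

for label, scores in [("original", history_scores), ("with outlier", with_outlier)]:
    ratio = std(scores) / mad(scores)
    print(f"{label}: MAD={mad(scores):.2f}, SD={std(scores):.2f}, SD/MAD={ratio:.2f}")
```

Both metrics grow when the outlier is introduced, but the SD/MAD ratio also rises, which shows that the Standard Deviation reacts more strongly to the extreme point.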
L1 and L2 Norms
In machine learning, you will encounter the L1 and L2 norms, which are closely related to these two metrics. Applied to the vector of deviations from the mean:
L1 Norm: Sum of the absolute deviations. Dividing it by the number of data points n gives the MAD.
L2 Norm: Square root of the sum of the squared deviations. Dividing it by the square root of n gives the Standard Deviation.
The same norms also appear as penalties on model coefficients in regularized regression: Lasso Regression uses an L1 penalty, and Ridge Regression uses a penalty based on the squared L2 norm.
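To make the link between norms and spread concrete, here is a minimal sketch in plain Python. It computes the L1 and L2 norms of the deviation vector for the history scores and rescales them by n and by the square root of n. The helper names `l1_norm` and `l2_norm` are made up for this illustration; NumPy's `np.linalg.norm` computes the same quantities for a vector with `ord=1` and `ord=2`.

```python
import math

def l1_norm(vector):
    """Sum of absolute values."""
    return sum(abs(v) for v in vector)

def l2_norm(vector):
    """Square root of the sum of squares."""
    return math.sqrt(sum(v * v for v in vector))

history_scores = [75, 65, 72, 68, 70, 60]
n = len(history_scores)
exact_mean = sum(history_scores) / n
deviations = [x - exact_mean for x in history_scores]

mad_from_l1 = l1_norm(deviations) / n            # L1 norm rescaled by n
sd_from_l2 = l2_norm(deviations) / math.sqrt(n)  # L2 norm rescaled by sqrt(n)

print(f"L1 norm = {l1_norm(deviations):.3f} -> MAD = {mad_from_l1:.3f}")
print(f"L2 norm = {l2_norm(deviations):.3f} -> SD  = {sd_from_l2:.3f}")
```

So the norms and the spread metrics measure the same thing up to a scaling factor that depends only on the number of data points.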
Python Code Examples
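Let's implement these concepts using Python. Note that np.mean uses the exact mean of each list (410 / 6 ≈ 68.33) rather than the rounded 70 from the hand calculations, so the standard deviations come out slightly lower (about 4.85 and 22.30). The MAD values are unchanged (about 4.0 and 21.67).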
Mean Absolute Deviation (MAD)
import numpy as np

def mean_absolute_deviation(data):
    # Average absolute distance of each point from the mean
    mean = np.mean(data)
    mad = np.mean(np.abs(data - mean))
    return mad

history_scores = np.array([75, 65, 72, 68, 70, 60])
math_scores = np.array([55, 85, 40, 90, 95, 45])

mad_history = mean_absolute_deviation(history_scores)
mad_math = mean_absolute_deviation(math_scores)

print(f"MAD (History): {mad_history}")
print(f"MAD (Math): {mad_math}")
Standard Deviation
def standard_deviation(data):
    # Population standard deviation: square root of the mean squared deviation
    mean = np.mean(data)
    variance = np.mean((data - mean) ** 2)
    std_dev = np.sqrt(variance)
    return std_dev

std_dev_history = standard_deviation(history_scores)
std_dev_math = standard_deviation(math_scores)

print(f"Standard Deviation (History): {std_dev_history}")
print(f"Standard Deviation (Math): {std_dev_math}")
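As a sanity check, the hand-written function can be compared against the standard library. The sketch below uses `statistics.pstdev`, which divides by n like the function above, and `statistics.stdev`, which divides by n - 1 and is shown only for contrast. NumPy's `np.std` also divides by n by default (`ddof=0`).

```python
from statistics import pstdev, stdev

history_scores = [75, 65, 72, 68, 70, 60]
math_scores = [55, 85, 40, 90, 95, 45]

for name, scores in [("History", history_scores), ("Math", math_scores)]:
    population_sd = pstdev(scores)  # divides by n, matches the steps in this post
    sample_sd = stdev(scores)       # divides by n - 1, always a bit larger
    print(f"{name}: population SD={population_sd:.2f}, sample SD={sample_sd:.2f}")
```

Which version to use depends on whether the scores are the whole group of interest or a sample drawn from a larger one; this post uses the population form throughout.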
Conclusion
Understanding Mean Absolute Deviation and Standard Deviation is crucial for analyzing data variability. MAD provides a straightforward measure of spread, while Standard Deviation squares the differences and so gives more weight to points far from the mean. Both metrics are essential tools in statistics and data science, helping you make informed decisions based on how your data is distributed.
Happy analyzing!