Outliers are data points that lie far outside the pattern
of the rest of the dataset: values that are significantly higher or lower
than most others. They may occur due to measurement errors, data entry
mistakes, natural variation, or rare events, and they can distort key
statistics like averages and standard deviations if not handled correctly.
Example:
A column with ages: 22, 24, 23, 25, 120
Here, 120 is likely an outlier because it differs drastically from the
typical age range.
Why Outliers Matter
Outliers can have a large impact on data analysis because:
• They can skew summary statistics such as the mean.
• They may distort trends, predictions, and modeling results.
• Sometimes outliers represent true, important signals (e.g.,
exceptional sales records), so they should not be removed without first
checking what they represent.
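To make the first point concrete, here is a minimal Python sketch that reuses the ages from the example above and compares the mean and median with and without the value 120. It uses only the standard-library statistics module; the variable names are illustrative.

```python
import statistics

# Ages from the example above; 120 is the suspected outlier.
ages = [22, 24, 23, 25, 120]
ages_without_outlier = [a for a in ages if a != 120]

mean_with = statistics.mean(ages)
mean_without = statistics.mean(ages_without_outlier)
median_with = statistics.median(ages)
median_without = statistics.median(ages_without_outlier)

print(f"mean:   {mean_with} with outlier, {mean_without} without")
print(f"median: {median_with} with outlier, {median_without} without")
```

The single extreme value moves the mean from 23.5 to 42.8, while the median only shifts from 23.5 to 24.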
How Outliers Are Detected
There are several common ways to identify outliers:
📈 Visualization Methods
• Box plots, histograms, and scatter plots make it easy to spot values
that lie far outside the normal range.
📊 Statistical Methods
• Interquartile Range (IQR): With IQR = Q3 − Q1, values below
Q1 − 1.5×IQR or above Q3 + 1.5×IQR are flagged as outliers.
• Z-Score: A z-score measures how many standard deviations a value lies
from the mean. Values whose absolute z-score exceeds a chosen threshold
(commonly 2 or 3) are flagged as outliers.
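Both statistical rules can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a definitive implementation: the helper names iqr_outliers and z_scores are hypothetical, the quartiles use the "inclusive" method of statistics.quantiles (other quartile conventions give different fences on small samples), and the z-scores use the population standard deviation.

```python
import statistics

ages = [22, 24, 23, 25, 120]

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" is an assumption; other quartile conventions
    # can move the fences noticeably when the sample is this small.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def z_scores(values):
    """Return each value's distance from the mean, in population standard deviations."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumes the values are not all identical
    return [(v - mean) / std for v in values]

iqr_flagged = iqr_outliers(ages)
z = z_scores(ages)
z_flagged = [v for v, score in zip(ages, z) if abs(score) > 2]

print("IQR rule flags:", iqr_flagged)
print("z-scores:", [round(score, 2) for score in z])
print("z-score rule (|z| > 2) flags:", z_flagged)
```

On this five-value sample the IQR rule flags 120, but its z-score comes out just under 2, so a strict threshold of 2 does not flag it. The extreme value inflates the very mean and standard deviation it is measured against, which is one reason the z-score rule tends to be less reliable on very small samples.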
How Outliers Are Treated
Deciding what to do with outliers depends on context:
✔ Investigate them manually to check whether
they are errors or true values.
✔ Remove outliers only if they are clearly due
to mistakes or noise.
✔ Use robust statistics such as the median or
IQR-based methods, which are less affected by outliers.
✔ Transform values (e.g., with a log transform) or cap them at a
chosen limit so extreme points don’t overly drive results.
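As one possible illustration of capping, the sketch below clips each value to the IQR fences described earlier. The helper name cap_to_iqr_fences and the choice of the "inclusive" quartile method are assumptions made for this example; in practice the cap limits would depend on the data and the domain.

```python
import statistics

ages = [22, 24, 23, 25, 120]

def cap_to_iqr_fences(values, k=1.5):
    """Clip each value into the range [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" is an assumption; other conventions shift the fences.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]

capped = cap_to_iqr_fences(ages)

print("capped values:", capped)
print("mean before:", statistics.mean(ages), "after:", statistics.mean(capped))
print("median before:", statistics.median(ages), "after:", statistics.median(capped))
```

Here 120 is pulled down to the upper fence of 28, which brings the mean from 42.8 to 24.4, while the median stays at 24 in both cases. Whether replacing an age of 120 with 28 is appropriate is a judgment call; if the value is a data entry error, investigating or removing it may be the better choice.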