
Access Athinia Catalyst at catalyst.athinia.io.

Univariate Outlier Detection

The Univariate Outlier Detection tool detects outliers in individual columns using several statistical methods. These methods help you identify data points that fall far outside the typical range of a column, which makes the tool useful for data quality assessment and anomaly detection.

Univariate outlier detection interface showing method selection and outlier summary

Overview

The tool supports three statistical methods for outlier detection:

IQR (Interquartile Range)

Identifies outliers using quartile-based boundaries:

Lower bound: Q1 - (threshold × IQR)
Upper bound: Q3 + (threshold × IQR)
Default threshold: 1.5
Best for: Robust detection with skewed data
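
As a rough illustration of the bounds above, the following Python sketch computes IQR fences with the standard library. The function names `iqr_bounds` and `iqr_outliers` and the sample readings are hypothetical, and the quartile interpolation rule (`method="inclusive"`) is an assumption; the tool may compute quartiles differently, so values close to a boundary could be classified differently.

```python
from statistics import quantiles

def iqr_bounds(values, threshold=1.5):
    """Return (lower, upper) IQR bounds for a list of numbers."""
    # Quartile interpolation rule is an assumption, not the tool's documented behavior.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr

def iqr_outliers(values, threshold=1.5):
    """Return the values that fall outside the IQR bounds."""
    lower, upper = iqr_bounds(values, threshold)
    return [v for v in values if v < lower or v > upper]

readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]  # made-up sample data
print(iqr_bounds(readings))    # (9.5, 13.5)
print(iqr_outliers(readings))  # [95]
```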

MAD (Median Absolute Deviation)

Uses median-based robust statistics:

Boundary: Median ± (threshold × MAD × 1.4826)
Default threshold: 2
Best for: Very robust detection, less sensitive to extreme outliers
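
The MAD boundary can be sketched the same way. This is a minimal example under stated assumptions: the names `mad_bounds` and `mad_outliers` and the sample readings are hypothetical. The constant 1.4826 is the scale factor from the formula above; it makes the MAD comparable to a standard deviation when the data is normally distributed.

```python
from statistics import median

MAD_SCALE = 1.4826  # scale factor from the boundary formula

def mad_bounds(values, threshold=2.0):
    """Return (lower, upper) bounds: median +/- threshold * MAD * 1.4826."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    half_width = threshold * mad * MAD_SCALE
    return med - half_width, med + half_width

def mad_outliers(values, threshold=2.0):
    """Return the values that fall outside the MAD bounds."""
    lower, upper = mad_bounds(values, threshold)
    return [v for v in values if v < lower or v > upper]

readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]  # made-up sample data
print(mad_bounds(readings))
print(mad_outliers(readings))  # [95]
```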

Z-Score (Standard Score)

Uses mean and standard deviation:

Boundary: Mean ± (threshold × Standard Deviation)
Default threshold: 3
Best for: Normally distributed data
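
A comparable sketch for the Z-Score boundary follows. The function names and the sample readings are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption; the tool may use the population standard deviation instead.

```python
from statistics import mean, stdev

def zscore_bounds(values, threshold=3.0):
    """Return (lower, upper) bounds: mean +/- threshold * standard deviation."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation (assumption)
    return center - threshold * spread, center + threshold * spread

def zscore_outliers(values, threshold=3.0):
    """Return the values that fall outside the Z-Score bounds."""
    lower, upper = zscore_bounds(values, threshold)
    return [v for v in values if v < lower or v > upper]

readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]  # made-up sample data
print(zscore_outliers(readings, threshold=3.0))  # []
print(zscore_outliers(readings, threshold=2.0))  # [95]
```

In this small sample the value 95 inflates both the mean and the standard deviation, so the default threshold of 3 does not flag it, while a threshold of 2 does. This is one reason the Z-Score method suits approximately normal data rather than data dominated by a few extreme values.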

The tool provides two complementary views:

Summary View - Overview of outlier counts across all selected columns
Detailed View - In-depth analysis of outlier distributions for specific columns

Key Features

Method Selection

Choose from three statistical methods:

IQR: Interquartile Range method (default)
MAD: Median Absolute Deviation method
Z-Score: Standard deviation-based method

Column Selection

Select the numerical columns you want to analyze for outliers. The tool automatically filters the column list to show only numerical data types (Float64, Int64), since outlier analysis requires numerical data.

Individual Selection: Click checkboxes to select specific columns
Group Selection: Use column groups to quickly select related fields
Search: Find specific columns using the search functionality

Threshold Control

Adjust the threshold to control outlier sensitivity for each method:

IQR Method:

Range: 1.0 to 5.0 (adjustable in 0.25 increments)
Default: 1.5 (standard statistical practice)

MAD Method:

Range: 1.0 to 5.0 (adjustable in 0.25 increments)
Default: 2 (recommended for robust detection)

Z-Score Method:

Range: 1.0 to 5.0 (adjustable in 0.25 increments)
Default: 3 (standard practice for normal distributions)

For all methods:

Lower values: More sensitive, detects more outliers
Higher values: Less sensitive, detects fewer outliers
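
The effect of the threshold on sensitivity can be illustrated by sweeping it across the documented range. The sketch below uses the IQR method; the helper name `count_iqr_outliers`, the data, and the quartile interpolation rule are assumptions made for illustration only.

```python
from statistics import quantiles

def count_iqr_outliers(values, threshold):
    """Count values outside Q1 - threshold*IQR and Q3 + threshold*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")  # quartile rule is an assumption
    spread = q3 - q1
    lower, upper = q1 - threshold * spread, q3 + threshold * spread
    return sum(1 for v in values if v < lower or v > upper)

# Thresholds from 1.0 to 5.0 in 0.25 increments, matching the documented range.
thresholds = [1.0 + 0.25 * step for step in range(17)]
values = [48, 50, 51, 49, 52, 50, 47, 53, 50, 51, 49, 58, 62, 75]  # made-up data

counts = {t: count_iqr_outliers(values, t) for t in thresholds}
for t, n in counts.items():
    print(f"threshold {t:.2f}: {n} outlier(s)")
```

For this data the count falls from 3 at a threshold of 1.0 to 2 at the default of 1.5 and to 1 at 5.0; the count can only stay the same or decrease as the threshold rises.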

Summary View

The summary view displays a bar chart showing outlier counts for each selected column:

Sorted Display: Columns ordered by outlier count (highest first)
Pagination: Navigate through results (30 columns per page)
Interactive Bars: Click to select columns for detailed analysis
Real-time Updates: Chart updates immediately when threshold changes
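
The ordering and pagination described above can be pictured with a short sketch. The function `summary_page` and the column names are hypothetical; this only mirrors the documented behavior (highest count first, 30 columns per page) and says nothing about how the tool is implemented.

```python
def summary_page(outlier_counts, page=1, page_size=30):
    """Return one page of (column, count) pairs, highest outlier count first."""
    ranked = sorted(outlier_counts.items(), key=lambda item: item[1], reverse=True)
    start = (page - 1) * page_size
    return ranked[start:start + page_size]

# Made-up outlier counts for 45 hypothetical columns.
outlier_counts = {f"col_{i}": i for i in range(45)}

first_page = summary_page(outlier_counts, page=1)
second_page = summary_page(outlier_counts, page=2)
print(first_page[0])     # ('col_44', 44)
print(len(second_page))  # 15
```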

Detailed Analysis

Select up to 10 columns from the summary view to see detailed outlier distributions:

Detailed box plot view showing outlier values and statistical boundaries

Box Plot Visualization: Shows outlier values in context of data distribution
Multiple Columns: Compare outlier patterns across selected columns
Statistical Context: Visualizes boundaries and distribution based on selected method
Individual Values: See exact outlier values and their dataset indices
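
To make the idea of outlier values paired with their dataset indices concrete, here is a minimal sketch using the IQR method. The function name `outliers_with_indices`, the zero-based indexing, and the quartile interpolation rule are assumptions; the tool's own index convention may differ.

```python
from statistics import quantiles

def outliers_with_indices(values, threshold=1.5):
    """Return (index, value) pairs for values outside the IQR bounds."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")  # quartile rule is an assumption
    spread = q3 - q1
    lower, upper = q1 - threshold * spread, q3 + threshold * spread
    # Indices are zero-based positions in the input list (assumption).
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]

readings = [10, 12, 11, -40, 13, 12, 11, 10, 12, 95]  # made-up sample data
print(outliers_with_indices(readings))  # [(3, -40), (9, 95)]
```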

Using the Tool

Basic Workflow

Choose Method: Select IQR, MAD, or Z-Score based on your data characteristics
Select Columns: Choose the numerical columns you want to analyze
Adjust Threshold: Set the appropriate sensitivity level for your chosen method
Review Summary: Examine the bar chart to identify columns with high outlier counts
Select for Detail: Click on bars to select up to 10 columns for detailed analysis
Analyze Patterns: Use the box plot view to understand outlier distributions

Method Selection Tips

Choose IQR when:

Working with skewed or non-normal data
Need robust detection that handles extreme values well
Following standard statistical practices

Choose MAD when:

Data has many extreme outliers that might distort the quartiles used by IQR
Need the most robust detection method
Working with highly variable data

Choose Z-Score when:

Data is approximately normally distributed
Need to detect values based on standard deviations
Working with well-behaved, symmetric data

Column Selection Tips

Start broad: Select all relevant numerical columns initially
Focus on key metrics: Prioritize business-critical measurements
Consider data types: Only numerical columns (Float64, Int64) can be analyzed
Use groups: Leverage column groups for efficient selection of related fields

Threshold Selection Guidelines

IQR Method:

1.5 (default): Standard statistical practice, balanced sensitivity
1.0: High sensitivity, detects more potential outliers
2.0-3.0: Conservative approach, focuses on extreme outliers

MAD Method:

2.0 (default): Robust detection with good sensitivity
1.5: Higher sensitivity for quality control
2.5-3.0: Conservative detection for stable processes

Z-Score Method:

3.0 (default): Standard practice (about 99.7% of normal data within bounds)
2.0: Higher sensitivity (about 95% of normal data within bounds)
4.0+: Very conservative, extreme outliers only
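
The coverage percentages quoted for the Z-Score thresholds follow from the normal distribution: the fraction of values within k standard deviations of the mean is erf(k / √2). A short check, with the hypothetical helper name `normal_coverage`:

```python
import math

def normal_coverage(threshold):
    """Fraction of a normal distribution within +/- threshold standard deviations."""
    return math.erf(threshold / math.sqrt(2))

for k in (2.0, 3.0, 4.0):
    print(f"threshold {k}: {normal_coverage(k):.4%} of normal data within bounds")
```

These figures hold only for normally distributed data; for skewed or heavy-tailed data the share of values inside the bounds can be quite different.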

When to adjust:

Lower threshold: When data quality is critical and you want to catch subtle anomalies
Higher threshold: When working with naturally variable data or when false positives are costly

Interpretation Guidelines

Summary View:

High bars: Columns with many potential outliers requiring attention
Zero counts: Columns with no outliers detected at the current method and threshold
Relative comparison: Compare outlier frequencies across different variables

Detailed View:

Box boundaries: Show the interquartile range of your data (Q1 to Q3), which contains the middle 50% of values
Outlier points: Individual values that fall outside the boundaries set by the selected method and threshold
Distribution patterns: Help you judge whether outliers are isolated extreme values or signs of a systematic issue

Best Practices

Data Preparation

Clean missing values: Handle null values before analysis
Verify data types: Ensure numerical columns are properly typed
Check data ranges: Review min/max values for obvious errors
Consider transformations: Log or other transformations may be needed for skewed data
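
Two of the preparation steps above, handling missing values and applying a log transformation, can be sketched in a few lines. The helper names `clean_numeric` and `log_transform` are hypothetical, and dropping missing values is only one possible strategy; imputation may suit your data better.

```python
import math

def clean_numeric(values):
    """Drop None and NaN entries (one possible way to handle missing values)."""
    return [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]

def log_transform(values):
    """Apply the natural log; only defined for strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform requires strictly positive values")
    return [math.log(v) for v in values]

raw = [1.0, None, 10.0, float("nan"), 100.0, 1000.0]  # made-up, right-skewed data
cleaned = clean_numeric(raw)
print(cleaned)  # [1.0, 10.0, 100.0, 1000.0]
print(log_transform(cleaned))
```

After the transformation the values are evenly spaced, so a detection method applied to the transformed column is less likely to flag the long right tail as outliers.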

Analysis Strategy

Progressive refinement: Start with default threshold, then adjust based on results
Domain knowledge: Consider business context when interpreting outliers
Multiple perspectives: Use different thresholds to understand outlier sensitivity
Validation: Investigate high-count outlier columns manually

Common Use Cases

Data Quality Control: Identify data entry errors or measurement problems
Process Monitoring: Detect when manufacturing or business processes go out of control
Fraud Detection: Flag unusual transactions or behaviors
Equipment Monitoring: Identify sensor readings that indicate potential failures
Research Analysis: Find unusual observations that merit further investigation

Frequently Asked Questions

What are the different methods?

IQR (Interquartile Range): Uses the middle 50% of data to define normal variation, robust to extreme values.

MAD (Median Absolute Deviation): Most robust method, uses median-based statistics less affected by outliers.

Z-Score: Traditional method using mean and standard deviation, best for normally distributed data.

What is the IQR method?

The Interquartile Range method is a robust statistical technique for outlier detection. It uses the middle 50% of your data (between Q1 and Q3) to define normal variation, then flags points that fall more than threshold × IQR below Q1 or above Q3.

How do I choose the right method and threshold?

Start by considering your data distribution:

Normal/symmetric data: Z-Score method with threshold 3.0
Skewed/non-normal data: IQR method with threshold 1.5
Data with extreme outliers: MAD method with threshold 2.0

How do I choose the right threshold?

Start with the method's default threshold (IQR: 1.5, MAD: 2.0, Z-Score: 3.0). Lower the threshold if you need higher sensitivity for critical applications, or raise it if you're getting too many false positives.

Why can't I see all my columns?

Outlier analysis only works with numerical data types. Text, categorical, and other non-numerical columns are automatically filtered out since statistical calculations can't be performed on them.

What does "No outliers found" mean?

This indicates that all values in the selected columns fall within the bounds defined by the selected method and threshold. Try lowering the threshold if you expected to find outliers.

How many outliers is "too many"?

This depends on your data and use case. As a general rule:

Less than 5%: Normal level of outliers in most datasets
5-10%: May indicate data quality issues or natural variation
More than 10%: Likely indicates systematic problems requiring investigation
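
The rule of thumb above can be expressed as a small helper. The function name `outlier_rate_band` and the returned labels are hypothetical, and the cut-offs are the general guidance from this page rather than fixed limits.

```python
def outlier_rate_band(outlier_count, total_count):
    """Classify the share of outliers in a column using the 5% / 10% rule of thumb."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    rate = outlier_count / total_count
    if rate < 0.05:
        return "normal level"
    if rate <= 0.10:
        return "possible data quality issue or natural variation"
    return "likely systematic problem"

print(outlier_rate_band(12, 1000))   # normal level
print(outlier_rate_band(70, 1000))   # possible data quality issue or natural variation
print(outlier_rate_band(150, 1000))  # likely systematic problem
```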

Why is my detailed view empty?

The detailed view only populates when you click on bars in the summary chart. You can select up to 10 columns by clicking their corresponding bars.

What's the difference between outliers and errors?

Outliers are statistically unusual values that may or may not be errors. Always investigate outliers in context - they could be:

Data entry mistakes (errors)
Measurement problems (errors)
Genuine unusual events (valid data)
Process variations (valid data)
