
Access Athinia Catalyst at catalyst.athinia.io.

Univariate Outlier Detection

The Univariate Outlier Detection tool flags outliers one column at a time using one of three statistical methods: IQR, MAD, or Z-Score. Each method identifies data points that fall far outside the typical range of a column's values, which makes the tool useful for data quality assessment and anomaly detection.

Figure: Univariate outlier detection interface showing method selection and outlier summary

Overview

The tool supports three statistical methods for outlier detection:

IQR (Interquartile Range)

Identifies outliers using quartile-based boundaries, where IQR = Q3 - Q1:

  • Lower bound: Q1 - (threshold × IQR)
  • Upper bound: Q3 + (threshold × IQR)
  • Default threshold: 1.5
  • Best for: Robust detection with skewed data
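
The IQR bounds above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the function name `iqr_bounds`, the sample data, and the quartile convention (linear interpolation via `statistics.quantiles` with `method="inclusive"`) are all assumptions, and a different quartile convention can shift the bounds slightly.

```python
import statistics

def iqr_bounds(values, threshold=1.5):
    """Return (lower, upper) = (Q1 - t*IQR, Q3 + t*IQR)."""
    # Assumption: quartiles use linear interpolation ("inclusive" method).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr

# Hypothetical sample column with one extreme value.
data = [10, 12, 13, 14, 15, 16, 18, 95]
lower, upper = iqr_bounds(data)
outliers = [x for x in data if x < lower or x > upper]
print(f"bounds: [{lower}, {upper}], outliers: {outliers}")
```

With this sample, only the value 95 falls outside the bounds at the default threshold of 1.5.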

MAD (Median Absolute Deviation)

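Uses median-based robust statistics. The constant 1.4826 scales the MAD so that it is comparable to a standard deviation for normally distributed data: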

  • Boundary: Median ± (threshold × MAD × 1.4826)
  • Default threshold: 2
  • Best for: Very robust detection, less sensitive to extreme outliers
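
As a rough sketch of the MAD boundary above, the following Python computes the median, the median of absolute deviations from it, and the scaled bounds. The function name `mad_bounds` and the sample data are hypothetical, and the code assumes MAD is computed around the column median.

```python
import statistics

def mad_bounds(values, threshold=2.0, scale=1.4826):
    """Return (lower, upper) = median -/+ threshold * MAD * scale."""
    med = statistics.median(values)
    # MAD: median of the absolute deviations from the median.
    mad = statistics.median([abs(x - med) for x in values])
    half_width = threshold * mad * scale
    return med - half_width, med + half_width

# Hypothetical sample column with one extreme value.
data = [10, 12, 13, 14, 15, 16, 18, 95]
lower, upper = mad_bounds(data)
outliers = [x for x in data if x < lower or x > upper]
print(f"bounds: [{lower:.4f}, {upper:.4f}], outliers: {outliers}")
```

Because both the center and the spread are medians, the single extreme value barely moves the bounds.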

Z-Score (Standard Score)

Uses mean and standard deviation:

  • Boundary: Mean ± (threshold × Standard Deviation)
  • Default threshold: 3
  • Best for: Normally distributed data
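
The Z-Score boundary above could be sketched as follows. The function name `zscore_bounds` and the sample data are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) rather than the population standard deviation is an assumption; the tool may use either.

```python
import statistics

def zscore_bounds(values, threshold=3.0):
    """Return (lower, upper) = mean -/+ threshold * standard deviation."""
    mean = statistics.mean(values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    std = statistics.stdev(values)
    return mean - threshold * std, mean + threshold * std

# Hypothetical sample column: 20 values near 50 plus one extreme reading.
data = [48, 49, 50, 51, 52] * 4 + [120]
lower, upper = zscore_bounds(data)
outliers = [x for x in data if x < lower or x > upper]
print(f"bounds: [{lower:.2f}, {upper:.2f}], outliers: {outliers}")
```

Note that the extreme value itself inflates both the mean and the standard deviation, which is why this method works best when the data is close to normally distributed.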

The tool provides two complementary views:

  1. Summary View - Overview of outlier counts across all selected columns
  2. Detailed View - In-depth analysis of outlier distributions for specific columns

Key Features

Method Selection

Choose from three statistical methods:

  • IQR: Interquartile Range method (default)
  • MAD: Median Absolute Deviation method
  • Z-Score: Standard deviation-based method

Column Selection

Select the numerical columns you want to analyze for outliers. The tool lists only columns with numerical data types (Float64, Int64), since outlier analysis requires numerical data.

  • Individual Selection: Click checkboxes to select specific columns
  • Group Selection: Use column groups to quickly select related fields
  • Search: Find specific columns using the search functionality

Threshold Control

Adjust the threshold to control outlier sensitivity for each method:

IQR Method:

  • Range: 1.0 to 5.0 (adjustable in 0.25 increments)
  • Default: 1.5 (standard statistical practice)

MAD Method:

  • Range: 1.0 to 5.0 (adjustable in 0.25 increments)
  • Default: 2 (recommended for robust detection)

Z-Score Method:

  • Range: 1.0 to 5.0 (adjustable in 0.25 increments)
  • Default: 3 (standard practice for normal distributions)

For all methods:

  • Lower values: More sensitive, detects more outliers
  • Higher values: Less sensitive, detects fewer outliers
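
To illustrate how sensitivity changes with the threshold, the sketch below sweeps the documented range (1.0 to 5.0 in 0.25 increments) using the IQR rule and counts flagged values at each step. The helper `count_iqr_outliers`, the sample data, and the quartile convention are assumptions made for illustration only.

```python
import statistics

def count_iqr_outliers(values, threshold):
    """Count values outside Q1 - t*IQR .. Q3 + t*IQR."""
    # Assumption: quartiles use linear interpolation ("inclusive" method).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - threshold * iqr, q3 + threshold * iqr
    return sum(1 for x in values if x < lower or x > upper)

# 1.0, 1.25, ..., 5.0: the adjustable range described above.
thresholds = [1.0 + 0.25 * step for step in range(17)]

# Hypothetical sample column.
data = [10, 11, 12, 12, 13, 13, 14, 14, 15, 16, 19, 22, 30]
counts = [count_iqr_outliers(data, t) for t in thresholds]

for t, c in zip(thresholds, counts):
    print(f"threshold {t:.2f}: {c} outlier(s)")
```

The counts can only stay the same or fall as the threshold rises, because a larger threshold widens the bounds.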

Summary View

The summary view displays a bar chart showing outlier counts for each selected column:

  • Sorted Display: Columns ordered by outlier count (highest first)
  • Pagination: Navigate through results (30 columns per page)
  • Interactive Bars: Click to select columns for detailed analysis
  • Real-time Updates: Chart updates immediately when threshold changes
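
The data behind such a summary chart can be approximated as follows: count outliers per column, sort the columns by count in descending order, and split the result into pages of 30. This is a sketch under stated assumptions, not the tool's code; the function names, the column names, and the use of the IQR rule with interpolated quartiles are all hypothetical.

```python
import statistics

PAGE_SIZE = 30  # columns per page, as described above

def count_outliers(values, threshold=1.5):
    """Count values outside the IQR bounds (assumed quartile convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - threshold * iqr, q3 + threshold * iqr
    return sum(1 for x in values if x < lower or x > upper)

def summary_pages(columns, threshold=1.5, page_size=PAGE_SIZE):
    """Return pages of (column, outlier_count), highest count first."""
    counts = {name: count_outliers(vals, threshold) for name, vals in columns.items()}
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]

# Hypothetical columns.
columns = {
    "flow": [5, 6, 7, 6, 5, 7, 6, 5, 6],
    "pressure": [1.0, 1.1, 1.2, 1.1, 1.0, 1.2, 1.1, 1.0, 5.0],
    "temperature": [20, 21, 22, 21, 20, 22, 21, 80, 90],
}
pages = summary_pages(columns)
for name, count in pages[0]:
    print(f"{name}: {count}")
```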

Detailed Analysis

Select up to 10 columns from the summary view to see detailed outlier distributions:

Figure: Detailed box plot view showing outlier values and statistical boundaries

  • Box Plot Visualization: Shows outlier values in context of data distribution
  • Multiple Columns: Compare outlier patterns across selected columns
  • Statistical Context: Visualizes boundaries and distribution based on selected method
  • Individual Values: See exact outlier values and their dataset indices
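
Listing individual outlier values with their dataset indices, as the detailed view does, might look like the sketch below. The function name `outlier_points`, the sample data, and the choice of the IQR rule with interpolated quartiles are assumptions for illustration.

```python
import statistics

def outlier_points(values, threshold=1.5):
    """Return (index, value) pairs for values outside the IQR bounds."""
    # Assumption: quartiles use linear interpolation ("inclusive" method).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - threshold * iqr, q3 + threshold * iqr
    return [(i, x) for i, x in enumerate(values) if x < lower or x > upper]

# Hypothetical column in its original row order.
data = [14, 15, -40, 13, 16, 12, 95, 18, 10]
points = outlier_points(data)
for index, value in points:
    print(f"row {index}: {value}")
```

Keeping the original row order matters here, since the indices are what let you trace each flagged value back to its source record.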

Using the Tool

Basic Workflow

  1. Choose Method: Select IQR, MAD, or Z-Score based on your data characteristics
  2. Select Columns: Choose the numerical columns you want to analyze
  3. Adjust Threshold: Set the appropriate sensitivity level for your chosen method
  4. Review Summary: Examine the bar chart to identify columns with high outlier counts
  5. Select for Detail: Click on bars to select up to 10 columns for detailed analysis
  6. Analyze Patterns: Use the box plot view to understand outlier distributions

Method Selection Tips

Choose IQR when:

  • Working with skewed or non-normal data
  • Need robust detection that handles extreme values well
  • Following the conventional box plot rule (1.5 × IQR)

Choose MAD when:

  • Data has many extreme outliers that might distort the quartiles used by IQR
  • Need the most robust detection method
  • Working with highly variable data

Choose Z-Score when:

  • Data is approximately normally distributed
  • Need to detect values based on standard deviations
  • Working with well-behaved, symmetric data

Column Selection Tips

  • Start broad: Select all relevant numerical columns initially
  • Focus on key metrics: Prioritize business-critical measurements
  • Consider data types: Only numerical columns (Float64, Int64) can be analyzed
  • Use groups: Leverage column groups for efficient selection of related fields

Threshold Selection Guidelines

IQR Method:

  • 1.5 (default): Standard statistical practice, balanced sensitivity
  • 1.0: High sensitivity, detects more potential outliers
  • 2.0-3.0: Conservative approach, focuses on extreme outliers

MAD Method:

  • 2.0 (default): Robust detection with good sensitivity
  • 1.5: Higher sensitivity for quality control
  • 2.5-3.0: Conservative detection for stable processes

Z-Score Method:

  • 3.0 (default): Standard practice (about 99.7% of normal data within bounds)
  • 2.0: Higher sensitivity (about 95% of normal data within bounds)
  • 4.0+: Very conservative, extreme outliers only
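
The coverage figures quoted for the Z-Score method follow from the normal distribution: the fraction of values within ±z standard deviations of the mean is erf(z / √2). The short check below uses only the standard library; the function name `normal_coverage` is hypothetical.

```python
import math

def normal_coverage(z):
    """Fraction of a normal distribution within +/- z standard deviations."""
    return math.erf(z / math.sqrt(2))

for z in (2.0, 3.0, 4.0):
    inside = normal_coverage(z)
    print(f"z = {z}: {inside:.4%} inside, {1 - inside:.4%} flagged")
```

For perfectly normal data this gives roughly 95.45% at a threshold of 2.0 and 99.73% at 3.0, so even clean data yields a small share of flagged values.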

When to adjust:

  • Lower threshold: When data quality is critical and you want to catch subtle anomalies
  • Higher threshold: When working with naturally variable data or when false positives are costly

Interpretation Guidelines

Summary View:

  • High bars: Columns with many potential outliers requiring attention
  • Zero counts: Clean columns with no detected outliers
  • Relative comparison: Compare outlier frequencies across different variables

Detailed View:

  • Box boundaries: Show the interquartile range of your data (Q1 to Q3), which contains the middle 50% of values
  • Outlier points: Individual values that exceed the threshold boundaries
  • Distribution patterns: Understand whether outliers are extreme values or systematic issues

Best Practices

Data Preparation

  • Clean missing values: Handle null values before analysis
  • Verify data types: Ensure numerical columns are properly typed
  • Check data ranges: Review min/max values for obvious errors
  • Consider transformations: Log or other transformations may be needed for skewed data
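
As one possible way to prepare a column before analysis, the sketch below drops missing entries and optionally applies a natural log transform. The function name `prepare` and the sample data are hypothetical, and dropping missing values is only one option (imputation is another); a log transform also requires strictly positive values.

```python
import math

def prepare(values, log_transform=False):
    """Drop None/NaN entries and optionally log-transform the rest."""
    clean = [
        float(x) for x in values
        if x is not None and not math.isnan(x)
    ]
    if log_transform:
        if any(x <= 0 for x in clean):
            raise ValueError("log transform requires strictly positive values")
        clean = [math.log(x) for x in clean]
    return clean

# Hypothetical raw column with missing entries.
raw = [1.0, None, 10.0, float("nan"), 100.0]
print(prepare(raw))
print(prepare(raw, log_transform=True))
```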

Analysis Strategy

  • Progressive refinement: Start with default threshold, then adjust based on results
  • Domain knowledge: Consider business context when interpreting outliers
  • Multiple perspectives: Use different thresholds to understand outlier sensitivity
  • Validation: Investigate high-count outlier columns manually

Common Use Cases

  • Data Quality Control: Identify data entry errors or measurement problems
  • Process Monitoring: Detect when manufacturing or business processes go out of control
  • Fraud Detection: Flag unusual transactions or behaviors
  • Equipment Monitoring: Identify sensor readings that indicate potential failures
  • Research Analysis: Find unusual observations that merit further investigation

Frequently Asked Questions

What are the different methods?

IQR (Interquartile Range): Uses the middle 50% of data to define normal variation and is robust to extreme values.

MAD (Median Absolute Deviation): The most robust of the three methods. It uses median-based statistics, which are less affected by outliers.

Z-Score: Traditional method using mean and standard deviation, best for normally distributed data.

What is the IQR method?

The Interquartile Range method is a robust statistical technique for outlier detection. It uses the middle 50% of your data (between Q1 and Q3) to define normal variation, then identifies points that fall significantly outside this range.

How do I choose the right method and threshold?

Start by considering your data distribution:

  • Normal/symmetric data: Z-Score method with threshold 3.0
  • Skewed/non-normal data: IQR method with threshold 1.5
  • Data with extreme outliers: MAD method with threshold 2.0

How do I choose the right threshold?

Start with the method's default threshold (IQR: 1.5, MAD: 2.0, Z-Score: 3.0). Lower the threshold if you need higher sensitivity for critical applications, or raise it if you're getting too many false positives.

Why can't I see all my columns?

Outlier analysis only works with numerical data types. Text, categorical, and other non-numerical columns are automatically filtered out since statistical calculations can't be performed on them.

What does "No outliers found" mean?

This indicates that all values in the selected columns fall within the boundaries defined by the selected method and threshold. Try lowering the threshold if you expected to find outliers.

How many outliers is "too many"?

This depends on your data and use case. As a general rule:

  • Less than 5%: Normal level of outliers in most datasets
  • 5-10%: May indicate data quality issues or natural variation
  • More than 10%: Likely indicates systematic problems requiring investigation
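
The rule of thumb above amounts to simple arithmetic on the outlier count and the row count. The helper below is a hypothetical sketch; the labels are paraphrases of the guidance, and treating exactly 5% and 10% as part of the middle band is an assumption.

```python
def outlier_share(n_outliers, n_rows):
    """Return (percentage, label) following the rule of thumb above."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    pct = 100.0 * n_outliers / n_rows
    if pct < 5:
        label = "normal level"
    elif pct <= 10:
        label = "possible data quality issue or natural variation"
    else:
        label = "likely systematic problem"
    return pct, label

print(outlier_share(3, 200))
print(outlier_share(14, 200))
print(outlier_share(30, 200))
```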

Why is my detailed view empty?

The detailed view only populates when you click on bars in the summary chart. You can select up to 10 columns by clicking their corresponding bars.

What's the difference between outliers and errors?

Outliers are statistically unusual values that may or may not be errors. Always investigate outliers in context. They could be:

  • Data entry mistakes (errors)
  • Measurement problems (errors)
  • Genuine unusual events (valid data)
  • Process variations (valid data)