Outlier Detection
Outlier detection identifies data points that deviate significantly from the expected patterns in your dataset. These points may represent measurement errors, equipment malfunctions, process anomalies, or genuinely unusual but valid events. In manufacturing and research environments, outliers can mask real trends, bias predictive models, and lead to incorrect conclusions if left unaddressed.

At the same time, removing valid data points that merely appear unusual can be equally harmful, potentially discarding critical information about rare process conditions or early warning signals. The challenge lies in distinguishing genuine errors from legitimate extreme observations, a task that requires both statistical rigor and domain expertise.

Catalyst provides three complementary detection methods, each designed for different data characteristics, to help you systematically identify potential outliers while keeping you in control of the final decision about what to keep and what to remove.
Why Semi-Automatic?
Outlier detection is inherently a semi-automatic process. Algorithms excel at scanning large datasets and flagging statistically unusual points far faster than manual inspection allows. However, no algorithm can understand the context behind a data point. Only someone with domain knowledge can determine whether a flagged point is truly problematic, represents a valid but rare observation, or reflects expected behavior under specific operating conditions.
Fully automated outlier removal is risky because it assumes every statistical anomaly is an error. In practice, the most valuable data points in a dataset are often the ones that look unusual. They may represent edge cases, rare process conditions, or early signals of equipment drift that deserve investigation rather than deletion.
Example: A temperature reading of 450°C in a semiconductor process
- An algorithm flags this as an outlier because most readings are between 350 and 400°C
- A process engineer recognizes this as a valid high-temperature anneal step that occurs during a specific recipe
- Removing it would discard legitimate process data
Example: A yield measurement of 98.5% in a batch process
- An algorithm might not flag this because it is within a normal numeric range
- A process engineer knows that this product line typically yields 99.2 to 99.8%, and 98.5% signals early equipment drift
- The point is an outlier in context, even if not statistically extreme
These examples illustrate why fully automated outlier removal is dangerous. The tool provides the detection; you provide the judgment.
Available Methods
Catalyst offers three complementary approaches to outlier detection, each suited to different scenarios:
Univariate
Analyzes each variable independently using statistical methods (IQR, MAD, Z-Score). Best for identifying extreme values in individual columns.
Use when: You want to check specific measurements for out-of-range values, or when your data quality concerns center on individual variables.
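A minimal sketch of how these three univariate rules work. The function name, the example temperatures, and the thresholds (the 1.5×IQR fence, a modified Z-score cutoff of 3.5) are illustrative textbook defaults, not necessarily the values Catalyst uses:

```python
import numpy as np

def univariate_outliers(x, method="iqr", threshold=3.5):
    """Flag outliers in a 1-D array. Returns a boolean mask (True = outlier)."""
    x = np.asarray(x, dtype=float)
    if method == "iqr":
        q1, q3 = np.percentile(x, [25, 75])
        fence = 1.5 * (q3 - q1)
        return (x < q1 - fence) | (x > q3 + fence)
    if method == "zscore":
        z = (x - x.mean()) / x.std()
        return np.abs(z) > threshold
    if method == "mad":
        med = np.median(x)
        mad = np.median(np.abs(x - med))
        # 0.6745 rescales the MAD so the score is comparable to a Z-score
        return np.abs(0.6745 * (x - med) / mad) > threshold
    raise ValueError(f"unknown method: {method}")

# Illustrative temperatures (°C); 450 is the anneal-step reading from the example above
temps = np.array([360, 372, 381, 365, 378, 450, 370, 368])
print(univariate_outliers(temps, "iqr"))  # flags only the 450 reading
```

Note that on this small sample the plain Z-score misses the 450 reading entirely: the outlier inflates the standard deviation it is judged against (the "masking" effect), which is one reason MAD-based scores are generally more robust on contaminated data.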
Isolation Forest
Analyzes multiple variables simultaneously to detect unusual combinations. A data point might have normal individual values but an unusual combination that signals an anomaly.
Use when: You suspect anomalies arise from interactions between variables, or when univariate methods don't capture the full picture.
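This idea can be sketched with scikit-learn's `IsolationForest`. The synthetic temperature/pressure data and the `contamination` setting below are illustrative assumptions, not Catalyst internals:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Illustrative data: a temperature and a pressure that tracks it closely
temp = rng.normal(375, 5, 200)
pressure = 0.1 * temp + rng.normal(0, 0.05, 200)
X = np.column_stack([temp, pressure])

# Each coordinate of this point lies near the edge of its variable's normal
# range, but the combination violates the temperature-pressure relationship
X = np.vstack([X, [387.0, 36.3]])

# contamination is the expected outlier fraction -- a tuning choice, not a default
iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
labels = iso.predict(X)  # -1 = anomaly, 1 = normal
print(np.flatnonzero(labels == -1))
```

A univariate check on either column alone would likely pass this point; it only stands out when both variables are considered together.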
Reconstruction Error
Uses Principal Component Analysis (PCA) to learn the correlation structure of your data, then flags points that cannot be accurately reconstructed from the principal components. Provides detailed per-feature contribution analysis showing which variables drive each anomaly.
Use when: You want to understand which specific features make a point anomalous, or when you suspect anomalies arise from unusual combinations that violate the expected correlation structure.
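The mechanism can be sketched with scikit-learn's `PCA` on synthetic three-sensor data. The per-feature residual breakdown below mimics the idea of a contribution analysis but is not Catalyst's implementation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Three sensors that normally move together (illustrative, not Catalyst data)
base = rng.normal(0, 1, (300, 1))
X = np.hstack([base + rng.normal(0, 0.05, (300, 1)) for _ in range(3)])
# Row 0: each value is individually ordinary, but sensor 3 breaks the correlation
X[0] = [1.0, 1.0, -1.0]

Xs = StandardScaler().fit_transform(X)
pca = PCA(n_components=1).fit(Xs)            # the dominant component captures "move together"
Xhat = pca.inverse_transform(pca.transform(Xs))
residual = (Xs - Xhat) ** 2                  # per-feature squared reconstruction error
error = residual.sum(axis=1)                 # total reconstruction error per row

worst = int(np.argmax(error))
print(worst, residual[worst].round(2))       # the third sensor dominates the breakdown
```

The per-feature residuals are what make this method interpretable: rather than only learning that row 0 is anomalous, you see that sensor 3 is the variable driving the anomaly.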
Recommended Workflow
1. Start with Univariate to quickly identify obvious single-variable issues
2. Apply Isolation Forest to detect multivariate anomalies that univariate methods miss
3. Use Reconstruction Error to identify deviations from the learned correlation structure and understand which features drive anomalies
4. Validate with domain knowledge by reviewing flagged points before taking action
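The workflow above can be combined into a single review pass. The sketch below (illustrative function name, thresholds, and pandas/scikit-learn usage; not Catalyst's API) produces per-method flags and deliberately stops short of deleting anything, leaving the final decision to a domain expert:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

def flag_outliers(df):
    """Return per-method outlier flags for human review -- flags, not deletions."""
    X = StandardScaler().fit_transform(df)

    # Step 1 -- univariate: modified Z-score (MAD-based) on each column
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    uni = (np.abs(0.6745 * (X - med) / mad) > 3.5).any(axis=1)

    # Step 2 -- Isolation Forest: unusual multivariate combinations
    forest = IsolationForest(contamination=0.02, random_state=0).fit_predict(X) == -1

    # Step 3 -- reconstruction error from a PCA keeping 95% of the variance
    pca = PCA(n_components=0.95).fit(X)
    err = ((X - pca.inverse_transform(pca.transform(X))) ** 2).sum(axis=1)
    recon = err > np.quantile(err, 0.98)

    # Step 4 happens outside this function: a domain expert reviews the flags
    return pd.DataFrame({"univariate": uni, "isolation_forest": forest,
                         "reconstruction": recon}, index=df.index)

rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))
print(flag_outliers(demo).sum())  # rows flagged by each method
```

Keeping the three flag columns separate, rather than collapsing them into one verdict, preserves the information you need for review: a point flagged by all three methods warrants different scrutiny than one flagged by a single method.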
Key Principles
Not all outliers are errors
An outlier is a statistical observation, not a verdict. Some outliers represent:
- Rare but valid operating conditions
- Startup or shutdown transients
- Intentional process changes
- Natural variation at the edges of a distribution
Not all errors are outliers
Some data quality issues hide within normal ranges:
- Systematic calibration drift that shifts values gradually
- Sensor degradation that introduces subtle noise
- Data entry errors that produce plausible but incorrect values
Context determines action
For each flagged point, consider:
- Does this point correspond to a known process event?
- Is it part of a cluster of similar anomalies (suggesting a systematic cause)?
- Would removing it change your analysis conclusions?
- Is the cost of keeping a bad point higher or lower than the cost of removing a valid one?
When to Use Outlier Detection
- Before modeling: Remove data quality issues that would bias your models
- During process monitoring: Detect when a process drifts out of its normal operating envelope
- After data collection: Validate that new batches of data are consistent with historical patterns
- Root cause analysis: Identify which observations correspond to known failures or events