Outlier Detection
Outlier detection identifies data points that deviate significantly from the expected patterns in your dataset. These points may represent measurement errors, equipment malfunctions, process anomalies, or genuinely unusual but valid events. In manufacturing and research environments, outliers can mask real trends, bias predictive models, and lead to incorrect conclusions if left unaddressed.

At the same time, removing valid data points that merely appear unusual can be equally harmful, potentially discarding critical information about rare process conditions or early warning signals. The challenge lies in distinguishing genuine errors from legitimate extreme observations, a task that requires both statistical rigor and domain expertise.

Catalyst provides three complementary detection methods, each designed for different data characteristics, to help you systematically identify potential outliers while keeping you in control of the final decision about what to keep and what to remove.
Why Semi-Automatic?
Outlier detection is inherently a semi-automatic process. Algorithms excel at scanning large datasets and flagging statistically unusual points far faster than manual inspection allows. However, no algorithm can understand the context behind a data point. Only someone with domain knowledge can determine whether a flagged point is truly problematic, represents a valid but rare observation, or reflects expected behavior under specific operating conditions.
Fully automated outlier removal is risky because it assumes every statistical anomaly is an error. In practice, the most valuable data points in a dataset are often the ones that look unusual. They may represent edge cases, rare process conditions, or early signals of equipment drift that deserve investigation rather than deletion.
Example: A temperature reading of 450°C in a semiconductor process
- An algorithm flags this as an outlier because most readings are between 350 and 400°C
- A process engineer recognizes this as a valid high-temperature anneal step that occurs during a specific recipe
- Removing it would discard legitimate process data
Example: A yield measurement of 98.5% in a batch process
- An algorithm might not flag this because it is within a normal numeric range
- A process engineer knows that this product line typically yields 99.2 to 99.8%, and 98.5% signals early equipment drift
- The point is an outlier in context, even if not statistically extreme
These examples illustrate why fully automated outlier removal is dangerous. The tool provides the detection; you provide the judgment.
Available Methods
Catalyst offers three complementary approaches to outlier detection, each suited to different scenarios:
Univariate
Analyzes each variable independently using statistical methods (IQR, MAD, Z-Score). Best for identifying extreme values in individual columns.
Use when: You want to check specific measurements for out-of-range values, or when your data quality concerns center on individual variables.
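A minimal sketch of how these three univariate rules work. The function name, the example temperatures, and the thresholds (the 1.5×IQR fence, a modified Z-score cutoff of 3.5) are illustrative textbook defaults, not necessarily the values Catalyst uses:

```python
import numpy as np

def univariate_outliers(x, method="iqr", threshold=3.5):
    """Flag outliers in a 1-D array. Returns a boolean mask (True = outlier)."""
    x = np.asarray(x, dtype=float)
    if method == "iqr":
        q1, q3 = np.percentile(x, [25, 75])
        fence = 1.5 * (q3 - q1)
        return (x < q1 - fence) | (x > q3 + fence)
    if method == "zscore":
        z = (x - x.mean()) / x.std()
        return np.abs(z) > threshold
    if method == "mad":
        med = np.median(x)
        mad = np.median(np.abs(x - med))
        # 0.6745 rescales the MAD so the score is comparable to a Z-score
        return np.abs(0.6745 * (x - med) / mad) > threshold
    raise ValueError(f"unknown method: {method}")

# Illustrative temperatures (°C); 450 is the anneal-step reading from the example above
temps = np.array([360, 372, 381, 365, 378, 450, 370, 368])
print(univariate_outliers(temps, "iqr"))  # flags only the 450 reading
```

Note that on this small sample the plain Z-score misses the 450 reading entirely: the outlier inflates the standard deviation it is judged against (the "masking" effect), which is one reason MAD-based scores are generally more robust on contaminated data.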
Isolation Forest
Analyzes multiple variables simultaneously to detect unusual combinations. A data point might have normal individual values but an unusual combination that signals an anomaly.
Use when: You suspect anomalies arise from interactions between variables, or when univariate methods don't capture the full picture.
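This idea can be sketched with scikit-learn's `IsolationForest`. The synthetic temperature/pressure data and the `contamination` setting below are illustrative assumptions, not Catalyst internals:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Illustrative data: a temperature and a pressure that tracks it closely
temp = rng.normal(375, 5, 200)
pressure = 0.1 * temp + rng.normal(0, 0.05, 200)
X = np.column_stack([temp, pressure])

# Each coordinate of this point lies near the edge of its variable's normal
# range, but the combination violates the temperature-pressure relationship
X = np.vstack([X, [387.0, 36.3]])

# contamination is the expected outlier fraction -- a tuning choice, not a default
iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
labels = iso.predict(X)  # -1 = anomaly, 1 = normal
print(np.flatnonzero(labels == -1))
```

A univariate check on either column alone would likely pass this point; it only stands out when both variables are considered together.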
Reconstruction Error
Uses Principal Component Analysis (PCA) to learn the correlation structure of your data, then flags points that cannot be accurately reconstructed from the principal components. Provides detailed per-feature contribution analysis showing which variables drive each anomaly.
Use when: You want to understand which specific features make a point anomalous, or when you suspect anomalies arise from unusual combinations that violate the expected correlation structure.
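The mechanism can be sketched with scikit-learn's `PCA` on synthetic three-sensor data. The per-feature residual breakdown below mimics the idea of a contribution analysis but is not Catalyst's implementation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Three sensors that normally move together (illustrative, not Catalyst data)
base = rng.normal(0, 1, (300, 1))
X = np.hstack([base + rng.normal(0, 0.05, (300, 1)) for _ in range(3)])
# Row 0: each value is individually ordinary, but sensor 3 breaks the correlation
X[0] = [1.0, 1.0, -1.0]

Xs = StandardScaler().fit_transform(X)
pca = PCA(n_components=1).fit(Xs)            # the dominant component captures "move together"
Xhat = pca.inverse_transform(pca.transform(Xs))
residual = (Xs - Xhat) ** 2                  # per-feature squared reconstruction error
error = residual.sum(axis=1)                 # total reconstruction error per row

worst = int(np.argmax(error))
print(worst, residual[worst].round(2))       # the third sensor dominates the breakdown
```

The per-feature residuals are what make this method interpretable: rather than only learning that row 0 is anomalous, you see that sensor 3 is the variable driving the anomaly.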
Recommended Workflow
1. Start with Univariate to quickly identify obvious single-variable issues
2. Apply Isolation Forest to detect multivariate anomalies that univariate methods miss
3. Use Reconstruction Error to identify deviations from the learned correlation structure and understand which features drive anomalies
4. Validate with domain knowledge by reviewing flagged points before taking action
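The workflow above can be combined into a single review pass. The sketch below (illustrative function name, thresholds, and pandas/scikit-learn usage; not Catalyst's API) produces per-method flags and deliberately stops short of deleting anything, leaving the final decision to a domain expert:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

def flag_outliers(df):
    """Return per-method outlier flags for human review -- flags, not deletions."""
    X = StandardScaler().fit_transform(df)

    # Step 1 -- univariate: modified Z-score (MAD-based) on each column
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    uni = (np.abs(0.6745 * (X - med) / mad) > 3.5).any(axis=1)

    # Step 2 -- Isolation Forest: unusual multivariate combinations
    forest = IsolationForest(contamination=0.02, random_state=0).fit_predict(X) == -1

    # Step 3 -- reconstruction error from a PCA keeping 95% of the variance
    pca = PCA(n_components=0.95).fit(X)
    err = ((X - pca.inverse_transform(pca.transform(X))) ** 2).sum(axis=1)
    recon = err > np.quantile(err, 0.98)

    # Step 4 happens outside this function: a domain expert reviews the flags
    return pd.DataFrame({"univariate": uni, "isolation_forest": forest,
                         "reconstruction": recon}, index=df.index)

rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))
print(flag_outliers(demo).sum())  # rows flagged by each method
```

Keeping the three flag columns separate, rather than collapsing them into one verdict, preserves the information you need for review: a point flagged by all three methods warrants different scrutiny than one flagged by a single method.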
Key Principles
Not all outliers are errors
An outlier is a statistical observation, not a verdict. Some outliers represent:
- Rare but valid operating conditions
- Startup or shutdown transients
- Intentional process changes
- Natural variation at the edges of a distribution
Not all errors are outliers
Some data quality issues hide within normal ranges:
- Systematic calibration drift that shifts values gradually
- Sensor degradation that introduces subtle noise
- Data entry errors that produce plausible but incorrect values
Context determines action
For each flagged point, consider:
- Does this point correspond to a known process event?
- Is it part of a cluster of similar anomalies (suggesting a systematic cause)?
- Would removing it change your analysis conclusions?
- Is the cost of keeping a bad point higher or lower than the cost of removing a valid one?
When to Use Outlier Detection
- Before modeling: Remove data quality issues that would bias your models
- During process monitoring: Detect when a process drifts out of its normal operating envelope
- After data collection: Validate that new batches of data are consistent with historical patterns
- Root cause analysis: Identify which observations correspond to known failures or events