Principal Component Analysis
Access Athinia Catalyst at catalyst.athinia.io.
Overview
Principal Component Analysis (PCA) converts a set of correlated numeric variables into a smaller ordered set of orthogonal (uncorrelated) latent variables called principal components (PCs). PC1 captures the largest share of total variance, PC2 the next largest, and so on. Each observation (lot, wafer, pass, die) is projected into this space to obtain PC scores. This makes high‑dimensional structure visible, suppresses noise, reduces redundancy, and provides a stable basis for clustering and drift or excursion detection.
Typical Semiconductor Uses
Inline metrology (critical dimension, overlay, thickness), chamber sensors (pressure, temperature, RF power), electrical parametrics (resistance chains, threshold voltage families), wafer map texture embeddings, and material characterization panels often contain dozens to hundreds of mutually correlated variables. PCA compresses them into a handful of composite signals so that:
- Excursions emerge earlier as a shift in one PC.
- Tool or chamber matching compares score distributions instead of noisy raw channels.
- Yield drift root causes link feature loadings to lithography, CMP, plating, or etch phenomena.
- Defect pattern or spatial texture descriptors cluster more cleanly after reduction.
- Material supplier or lot variability is monitored with a compact PC dashboard. Example: 500 metrology + parametric variables may reduce to ~8 components explaining >90% cumulative variance and separating normal vs. edge-drift vs. focus-misalignment populations.
Core Concepts
- Explained variance ratio: Fraction of total variance a PC accounts for.
- Cumulative explained variance: Running total used to decide how many PCs are “enough” (often 90–95%).
- Loadings (feature contributions): Signed weights mapping original features to a PC; magnitude = influence; sign = direction relative to other features.
- Scores: Coordinates of each observation in PC space (used for visualization and clustering).
- Cluster difference contribution: Adjusted score indicating how strongly a feature differentiates two user-defined clusters.
Interface Structure
The PCA module consists of a left sidebar (column selection), a tab bar (Explained Variance, Cluster Analysis), and a main workspace. You must select at least two numeric columns for computation. On first load the system auto-selects numerical columns that belong to feature groups to reduce setup time. Changing the selection triggers recomputation of variance ratios, loadings, and scores.
Explained Variance
This tab shows how variance is distributed across components and lets you inspect feature contributions (loadings) for any component.

You see a bar chart of individual explained variance ratios (one bar per PC in order) with a green cumulative line overlaid so elbows or plateaus are easy to spot. Only components required to reach the configured variance threshold (currently up to 100%) are rendered. Clicking a bar selects that component and populates the adjacent Feature Contribution panel.
The Feature Contribution (loadings) chart displays horizontal bars centered around zero. Positive and negative directions indicate inverse relationships between features along that latent axis. Large absolute magnitude points to strong influence. Assign internal mnemonic labels (e.g. “Metallization Uniformity”, “Thermal Drift”, “Etch Balance”) to assist communication.
Tool icons above or beside charts provide:
- Zoom to fit: Condenses component layout (switches from stepped bars to compact width) for large component counts.
- Download plot: Exports the current visualization (for reports or slide decks).
Interpreting results typically proceeds by scanning the cumulative curve to pick a retention cutoff, clicking early PCs to inspect loadings, recording provisional semantic labels, and deciding which PCs to monitor downstream.
Cluster Analysis
This tab enables interactive segmentation on the first two PC score axes (PC1 vs. PC2). Points start unassigned (cluster name “Unassigned”).


You create a rectangular brush by clicking and dragging within the scatter. Releasing finalizes the selection. Right‑click opens the Cluster Manager at the cursor location, tied to the current brush. Double‑click inside the plot (or single-click without holding Shift after a selection) clears the live brush; the module also overlays a persistent dashed rectangle for the last explicit brush to make selection boundaries visible.
Available cluster operations in the context menu:
- Create cluster: Assign the currently brushed points to a new named cluster.
- Add to cluster: Append brushed points to an existing cluster.
- Remove from clusters: Unassign brushed points (back to Unassigned).
- Rename cluster: Update naming to reflect process meaning (e.g. “Edge Heating”).
- Delete cluster: Remove a cluster and its associations.
- Reset all: Clear every cluster, brush history, and comparison selection.
- Clear brush: Remove just the active selection rectangle.
Clusters persist locally per study (stored in browser localStorage under a study-specific key) together with brush history and any chosen comparison pair. Returning to the same study with the same feature selection reloads them. Changing the selected feature set invalidates prior score coordinates and resets clusters.
Sigma ellipses (1σ, 2σ, 3σ) may appear as faint dashed contours when ellipse parameters are supplied, approximating multivariate dispersion for quick outlier distance assessment.
Cluster Comparison
Once you have at least two non-empty clusters the right-side panel lets you select Cluster A and Cluster B. Press Analyze differences to compute adjusted contribution scores ranking how strongly each feature differentiates those clusters in latent space. While this analysis runs, the scatter temporarily desaturates (grayscale overlay). The resulting bar plot lists features ordered by discriminant strength (higher absolute score = stronger separation). Use this to prioritize metrics for root cause investigation or monitoring thresholds.
Interaction Summary (Key Actions)
- Drag to brush a region of points.
- Right‑click to manage clusters.
- Double-click (or non-shift single click on plot) to clear the active interactive brush.
- Rename clusters to reflect process or tool states before exporting insights.
- Re-run Analyze differences after refining cluster boundaries.
- Download visuals as needed for reports.
Interpreting Outputs
A dominant PC1 (large bar) signals a broad shared variation driver suitable for high-level monitoring (e.g. thermal or deposition uniformity). An early cumulative plateau shows heavy redundancy—dimensionality can be reduced without losing significant information. High-magnitude loadings identify critical contributors for SPC charts. Well-separated clusters on PC axes indicate latent operational modes or recipe regimes. Top difference features highlight candidate levers or failure signatures (e.g. via resistance, overlay X, chamber pressure stability index).
Example Flow
You select 150 inline and electrical parameters. The Explained Variance plot shows PC1–PC7 reach 92% cumulative variance; early loadings reveal PC2 dominated by via chain resistance and copper thickness inversion, so you label it “Metallization Uniformity”. In Cluster Analysis, two natural bands appear separated along PC2; you brush each, create clusters “Normal” and “High Resistance Drift”, then run Analyze differences. The difference plot elevates resistance and thickness metrics, confirming a plating drift hypothesis. You export the plots and flag the cluster metrics for targeted process checks.
Frequently Asked Questions
What is PCA used for?
PCA reduces the dimensionality of your data while preserving as much variance as possible. It's useful for data visualization, feature reduction, and understanding the underlying structure of complex datasets.
How many components should I use?
Generally, choose enough components to capture 80-95% of the total variance. The exact number depends on your specific use case and how much information loss you can tolerate.
What do the feature contributions mean?
Feature contributions show how strongly each original feature influences a principal component. Positive (blue) contributions increase the component value, while negative (red) contributions decrease it.
Can I use PCA with categorical data?
PCA works best with numerical data. Categorical variables should be encoded numerically (e.g., one-hot encoding) before analysis, though this may not always be meaningful.
How do I interpret the cluster analysis?
The scatter plot shows your data in a 2D space where similar data points appear close together. Clusters you create represent groups of similar observations based on their principal component values.
What if my first two components explain very little variance?
Here's a simple rule for PC1 vs PC2 cluster analysis:
Good for clustering:
- PC1 + PC2 combined ≥ 50% of variance
Be cautious:
- PC1 + PC2 combined < 50% of variance
Why this matters: If the first two components explain less than half your data's variance, the 2D scatter plot is missing most of the information. Any clusters you create might not reflect the true patterns in your data.
What to do when variance is low:
- Use clustering for initial exploration only
- Don't make important decisions based on these clusters
- Validate any patterns with domain knowledge
How do confidence ellipses work?
Confidence ellipses show the statistical boundaries of your clusters, indicating the area where you'd expect to find most points belonging to that cluster with a given confidence level.
How do I know if my clusters are meaningful?
Good clusters should:
- Make sense in your domain context
- Show clear separation in the visualization
- Have distinct feature contribution patterns
- Be reproducible with similar data