From Video to Root Cause: How Evidence-Based Inspection Stops Repeat Defects
Detecting defects is valuable. Stopping repeat defects is transformative.
The difference is evidence.
Most factories have video. Many even have inspection steps. But when defect rates spike, teams still end up in the same place: arguing about what happened, hunting for missing context, and guessing causes after the shift is over.
Evidence-based inspection creates a different outcome: every inspected unit leaves behind a structured, queryable record. Combine that with closed-loop root cause analysis and you don't just find defects; you eliminate the conditions that create them.
The Four-Layer Local RCA Stack
We've built a four-layer root cause analysis engine that runs entirely on edge compute. No cloud dependency. No waiting for API responses. Real-time insight on the factory floor.
Layer 1: Defect Pattern Accumulator
Every inspection event feeds into a local pattern detector running on the Jetson edge node:
- Storage: Rolling 30-day window in SQLite WAL mode
- Schema: timestamp, defect_type, severity, station_id, shift, confidence, SOP criterion, unit_id
- Analysis: Z-score anomaly detection on 4-hour rolling windows per defect type per station
- Detection: Rate changes, clustering, temporal correlations
When scratch defects at Station 3 increase 3x in the past 4 hours, the system notices—before the shift supervisor does.
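As a sketch, the accumulator's anomaly check can be done with plain SQL plus a Z-score, no ML required. Table and column names here (`inspection_events`, etc.) are illustrative, not the actual schema:

```python
import sqlite3
import statistics

def detect_rate_anomaly(db_path, station_id, defect_type,
                        window_hours=4, history_days=30, z_threshold=3.0):
    """Compare the latest 4-hour window's defect count against the
    30-day history of same-length windows for one station + defect type."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """
        SELECT CAST(strftime('%s', timestamp) AS INTEGER) / (? * 3600) AS bucket,
               COUNT(*) AS n
        FROM inspection_events
        WHERE station_id = ? AND defect_type = ?
          AND timestamp >= datetime('now', ?)
        GROUP BY bucket ORDER BY bucket
        """,
        (window_hours, station_id, defect_type, f'-{history_days} days'),
    ).fetchall()
    conn.close()
    counts = [n for _, n in rows]
    if len(counts) < 3:
        return None  # not enough history to score
    baseline, latest = counts[:-1], counts[-1]
    mean = statistics.mean(baseline)
    stdev = statistics.pstdev(baseline) or 1.0  # avoid divide-by-zero
    z = (latest - mean) / stdev
    return {'z_score': z, 'anomaly': z >= z_threshold,
            'latest_count': latest, 'baseline_mean': mean}
```

Because the query buckets by wall-clock window, the check stays cheap enough to run continuously on the edge node.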
Layer 2: Process Parameter Correlator
Defect spikes alone don't tell you why. Layer 2 connects defects to process parameters:
- Input: Defect anomalies from Layer 1 + process parameters (temperature, RPM, timing)
- Sensor Integration: MQTT or OPC UA for automated feeds; manual operator input for non-instrumented stations
- Correlation Method: Pearson coefficient on time-aligned 30-minute windows (statistical, not ML—explainable and auditable)
- Output: JSON with defect pattern, correlated parameter, drift magnitude, confidence, time window
When the correlator identifies that grinding RPM drifted 5% below target during the same window as the scratch spike, you have a hypothesis worth investigating.
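A minimal sketch of the time-aligned correlation: each index `i` covers the same 30-minute window for both series. Field names in the output dict are illustrative:

```python
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation coefficient: statistical, explainable."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / (sx * sy)

def correlate_defect_with_parameter(defect_rates, param_values,
                                    target, threshold=0.7):
    """defect_rates[i] and param_values[i] cover the same 30-minute window.
    Returns a hypothesis dict when |r| clears the strength threshold."""
    r = pearson(defect_rates, param_values)
    drift_pct = 100.0 * (statistics.mean(param_values) - target) / target
    return {
        'pearson_r': round(r, 3),
        'drift_pct': round(drift_pct, 2),
        'candidate_cause': abs(r) >= threshold,
    }
```

A strong negative `r` between scratch rate and grinding RPM, plus a measured drift below target, is exactly the "hypothesis worth investigating" shape described above.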
Layer 3: Small Language Model for Explanation
Statistical correlations need context to become actionable. Layer 3 uses a local small language model (Qwen-2.5 3B INT4) to generate natural-language explanations:
- Input: Defect patterns + process correlations + SOP references + defect taxonomy
- Output: Bilingual (Chinese/English) root-cause hypothesis with confidence level and recommended corrective actions
- Inference Budget: 2–4 seconds per RCA generation (triggered on anomaly, not per-frame)
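Much of Layer 3 is prompt assembly: packing the Layer 1/2 JSON and SOP excerpts into a constrained prompt for the local model. A sketch, with illustrative field names and the model call itself omitted (it depends on your serving runtime):

```python
import json

def build_rca_prompt(anomaly, correlation, sop_refs):
    """Assemble the Layer 3 prompt from Layer 1/2 outputs. Field names
    in `anomaly` and `correlation` are whatever your schemas emit."""
    evidence = json.dumps({'anomaly': anomaly, 'correlation': correlation},
                          ensure_ascii=False, indent=2)
    return (
        "You are a factory quality engineer. Using ONLY the evidence below,\n"
        "state the most likely root cause, a confidence level, and the\n"
        "corrective action, citing the matching SOP section. Answer in both\n"
        "Chinese and English.\n\n"
        f"Evidence:\n{evidence}\n\n"
        "Relevant SOP sections:\n" + "\n".join(sop_refs)
    )
```

Constraining the model to the evidence block is what keeps a 3B-parameter model honest: it explains the correlation, it doesn't invent one.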
Layer 4: Closed-Loop Action Recommender
The final layer converts analysis into action:
- Output Format: SOP-linked corrective actions—not "check grinding wheel" but "Station 3 RPM at 2,847 (target 3,000 ±50). Recalibrate per SOP 4.2.3."
- Operator Interface: Local React dashboard with recommendation + evidence links + Accept/Reject/Modify buttons
- Feedback Loop: Operator action logged as causal triple: {defect, cause, outcome, operator_id, disposition, timestamp}
- Safety Rail: NOTIFY ONLY. Agent suggests, never executes. No automated parameter changes. No automated line stops.
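The feedback loop reduces to an append-only log of human decisions. A sketch (schema illustrative); note that the code only records the operator's disposition, it never touches a process parameter:

```python
import json
import sqlite3
import time

ALLOWED_DISPOSITIONS = {'accepted', 'rejected', 'modified'}

def log_causal_triple(db_path, defect_id, defect_type, cause_parameter,
                      cause_confidence, recommendation, operator_id,
                      disposition):
    """Append-only record of the operator's decision (notify-only:
    this logs a human action, it does not execute anything)."""
    if disposition not in ALLOWED_DISPOSITIONS:
        raise ValueError(f'unknown disposition: {disposition}')
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS causal_triples (
        defect_id TEXT, defect_type TEXT, cause_parameter TEXT,
        cause_confidence REAL, recommendation TEXT,
        operator_id TEXT, disposition TEXT, logged_at REAL)""")
    conn.execute(
        "INSERT INTO causal_triples VALUES (?,?,?,?,?,?,?,?)",
        (defect_id, defect_type, json.dumps(cause_parameter),
         cause_confidence, recommendation, operator_id,
         disposition, time.time()))
    conn.commit()
    conn.close()
```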
Step 1: Convert Video into Decisions
A camera feed becomes more than "remote visibility" when you add real-time AI-powered defect detection:
- Detect defect classes (scratches, chips, misalignment, missing parts, discoloration)
- Apply SOP thresholds via DefectIQ rules engine
- Produce PASS/FAIL/REVIEW verdicts
When inspection runs on NVIDIA Jetson edge compute with TensorRT optimization, it happens at line speed—under 25ms latency—without network dependency.
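A simplified stand-in for the verdict step (the production rules engine is richer; class limits and the REVIEW band here are illustrative):

```python
def verdict(detections, sop_limits, review_band=0.15):
    """Map detector output to PASS/FAIL/REVIEW against per-class SOP limits.
    detections: [{'class': ..., 'confidence': ...}]
    sop_limits: {'scratch': 0.80, ...} = FAIL at or above this confidence;
    within review_band below the limit -> REVIEW for a human."""
    result = 'PASS'
    for d in detections:
        limit = sop_limits.get(d['class'])
        if limit is None:
            continue  # class not governed by this SOP
        if d['confidence'] >= limit:
            return 'FAIL'      # any confirmed defect fails the unit
        if d['confidence'] >= limit - review_band:
            result = 'REVIEW'  # borderline detection: route to a human
    return result
```

The REVIEW band is the important design choice: it turns model uncertainty into a human decision instead of a silent pass or a false fail.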
Step 2: Convert Decisions into Durable Records
Decisions alone are not enough. The system must store a durable record for every inspection event:
| Evidence Tier | What's Stored | Retention | Location |
|---|---|---|---|
| All events | Metadata JSON: event_id, timestamp, station, SKU, verdict, confidence, model version | Indefinite | Edge SQLite |
| FAIL + REVIEW | Defect frame(s), bounding boxes, detection JSON, rule_result.json, audit.json | 90 days local, then archive | Edge NVMe → MinIO |
| RCA events | Full causal triple: defect + cause + outcome + operator disposition | Indefinite | Edge SQLite → Site PostgreSQL |
| Optional | Rolling 7-day ring buffer of production video per station | 7 days (overwritten) | Edge NVMe |
This turns quality from a conversation into an audit trail. When a customer asks about a specific unit, you have the evidence.
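The tier logic above can be sketched as a small routing function: every event gets a metadata row, and only FAIL/REVIEW events trigger the heavier artifact capture. Field and artifact names are illustrative:

```python
import time

def evidence_plan(event):
    """Decide what to persist for one inspection event, per the tier
    table: metadata always, frames and artifacts only for FAIL/REVIEW."""
    record = {
        'event_id': event['event_id'],
        'timestamp': event.get('timestamp', time.time()),
        'station': event['station'],
        'sku': event['sku'],
        'verdict': event['verdict'],
        'confidence': event['confidence'],
        'model_version': event['model_version'],
    }
    plan = {'metadata': record, 'stores': ['edge_sqlite']}  # tier 1: always
    if event['verdict'] in ('FAIL', 'REVIEW'):
        # tier 2: frames + detection artifacts, 90 days local then archive
        plan['artifacts'] = ['frame.jpg', 'detections.json',
                             'rule_result.json', 'audit.json']
        plan['stores'].append('edge_nvme')
        plan['retention_days'] = 90
    return plan
```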
Step 3: Detect Patterns Early
Once events are structured, you can spot problems before they become batches:
- Defect spikes on a specific station after a tool change
- Confidence drift that signals lighting degradation or model staleness
- REVIEW rate increases that indicate process instability
- Yield drops tied to a shift, supplier lot, or machine setting window
The Defect Pattern Accumulator runs continuously, analyzing 4-hour rolling windows with Z-score anomaly detection. When something changes, you know within minutes—not days.
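Confidence drift is one of the simpler signals to compute: compare the mean confidence of the most recent inspections against the longer baseline. A sketch, with an assumed drop threshold:

```python
import statistics

def confidence_drift(confidences, recent_n=200, drop_threshold=0.05):
    """Flag model-confidence drift: if mean confidence over the most
    recent N inspections falls noticeably below the longer baseline,
    suspect lighting degradation or model staleness."""
    if len(confidences) < 2 * recent_n:
        return None  # not enough data to compare
    baseline = statistics.mean(confidences[:-recent_n])
    recent = statistics.mean(confidences[-recent_n:])
    drop = baseline - recent
    return {'baseline': round(baseline, 3), 'recent': round(recent, 3),
            'drift_detected': drop >= drop_threshold}
```

The same recent-versus-baseline comparison works for REVIEW rates: a rising borderline-verdict share is often the earliest sign of process instability.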
Step 4: Explain Root Causes with Citations
"Root cause" is where most software becomes fiction. Vague suggestions like "check the equipment" or "review the process" are useless on a production floor.
The only acceptable root-cause output is evidence-cited:
- Which units were affected: Event IDs, timestamps, count
- What the system saw: Actual frames and detection artifacts
- What changed: Time window, station, operator, batch, process parameters
- Why the hypothesis is plausible: Statistical correlation, pattern match
- What action should be taken: Specific SOP reference with steps
This is how a quality team moves from "maybe" to "do this now."
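Concretely, an evidence-cited RCA record might take a shape like this (every field name and value below is a hypothetical example, reusing the scratch/RPM scenario from earlier):

```python
# Illustrative shape of an evidence-cited root-cause output.
rca_output = {
    'affected_units': {
        'event_ids': ['evt_20260213_143022_s3_001'],
        'count': 1,
        'window': '2026-02-13T10:30 to 14:30',
    },
    'evidence': ['frames/evt_20260213_143022_s3_001.jpg',
                 'detections/evt_20260213_143022_s3_001.json'],
    'what_changed': {'station': 3, 'parameter': 'grinding_rpm',
                     'observed': 2847, 'target': 3000},
    'why_plausible': {'pearson_r': -0.87,
                      'pattern': 'post-tool-change scratch spike'},
    'action': 'SOP 4.2.3: Recalibrate grinding wheel; verify RPM 3000 ±50',
}
```

Every bullet in the list above maps to a field, so a reviewer can audit the hypothesis instead of trusting it.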
Step 5: Close the Loop with Causal Triples
The loop closes when actions and outcomes are recorded as causal defect triples:
| Field | Type | Example |
|---|---|---|
| defect_id | UUID | evt_20260213_143022_s3_001 |
| defect_type | Enum | scratch_surface |
| cause_parameter | JSON | {grinding_rpm: 2847, target: 3000, drift_pct: 5.1} |
| cause_confidence | Float | 0.87 |
| recommendation | Text | SOP 4.2.3: Recalibrate grinding wheel |
| operator_action | Enum | accepted / rejected / modified |
| outcome_measured | JSON | {scratch_rate_after: 0.8pct, time_to_baseline: 45min} |
Over time, your factory builds an internal quality knowledge base: not just defect counts, but defect explanations tied to real events, real causes, and verified outcomes.
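Once the triples accumulate, the knowledge base is queryable: for a given defect type, which recommendations did operators actually accept, and with what confidence? A sketch against the field names in the table above:

```python
import sqlite3

def verified_causes(db_path, defect_type):
    """Rank past causes of a defect type by how often operators
    accepted the recommendation, then by average cause confidence."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """
        SELECT recommendation,
               SUM(operator_action = 'accepted') AS accepted,
               COUNT(*) AS total,
               AVG(cause_confidence) AS avg_conf
        FROM causal_triples
        WHERE defect_type = ?
        GROUP BY recommendation
        ORDER BY accepted DESC, avg_conf DESC
        """,
        (defect_type,),
    ).fetchall()
    conn.close()
    return rows
```

This is what "defect explanations tied to verified outcomes" buys you in practice: the next scratch spike starts from the last confirmed fix, not from zero.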
Why Edge-First Matters for Root Cause
If your system depends on cloud connectivity, the evidence trail breaks exactly when the plant is under stress. Network goes down during a quality crisis? That's when you need root cause analysis most—and that's when cloud-dependent systems fail.
Edge-first architecture keeps inspection and evidence capture continuous:
- Internet down → Full local operation continues. SQLite buffers events. RCA uses local SLM. Dashboard served locally.
- Site server down → Edge nodes continue independently. Local evidence storage continues.
- Camera disconnected → Station pauses inspection. Adjacent stations unaffected. Watchdog attempts reconnect.
The factory never stops inspecting because the internet went down.
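The "internet down" row above is just store-and-forward. A minimal sketch, with an assumed local outbox table; the `send` callable stands in for whatever upstream delivery you use:

```python
import json
import sqlite3

def buffer_event(db_path, event):
    """Queue an inspection event locally, whether or not upstream is up."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS outbox (
        id INTEGER PRIMARY KEY, payload TEXT, sent INTEGER DEFAULT 0)""")
    conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                 (json.dumps(event),))
    conn.commit()
    conn.close()

def flush_outbox(db_path, send):
    """Attempt delivery of unsent events; `send` returns True on success.
    Failed events stay queued for the next retry, so nothing is lost
    while the network is down."""
    conn = sqlite3.connect(db_path)
    for row_id, payload in conn.execute(
            "SELECT id, payload FROM outbox WHERE sent = 0").fetchall():
        if send(json.loads(payload)):
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?",
                         (row_id,))
    conn.commit()
    conn.close()
```

Because the mark-as-sent update only happens after a successful send, a crash or outage mid-flush leaves events queued rather than dropped.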
Book a Demo to see the four-layer RCA stack in action.