Tony Lo · February 13, 2026 · 10 min read

From Video to Root Cause: How Evidence-Based Inspection Stops Repeat Defects

Edge AI · Quality · Manufacturing · Defect Detection · Traceability

Detecting defects is valuable. Stopping repeat defects is transformative.

The difference is evidence.

Most factories have video. Many even have inspection steps. But when defect rates spike, teams still end up in the same place: arguing about what happened, hunting for missing context, and guessing causes after the shift is over.

Evidence-based inspection creates a different outcome: every unit produces structured truth. And when you combine that with closed-loop root cause analysis, you don't just find defects—you eliminate the conditions that create them.

The Four-Layer Local RCA Stack

We've built a four-layer root cause analysis engine that runs entirely on edge compute. No cloud dependency. No waiting for API responses. Real-time insight on the factory floor.

Layer 1: Defect Pattern Accumulator

Every inspection event feeds into a local pattern detector running on the Jetson edge node:

  • Storage: Rolling 30-day window in SQLite WAL mode
  • Schema: timestamp, defect_type, severity, station_id, shift, confidence, SOP criterion, unit_id
  • Analysis: Z-score anomaly detection on 4-hour rolling windows per defect type per station
  • Detection: Rate changes, clustering, temporal correlations

When scratch defects at Station 3 increase 3x in the past 4 hours, the system notices—before the shift supervisor does.
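The accumulator's core check can be sketched in a few lines. This is a minimal illustration, assuming per-window defect counts keyed by (station, defect_type); the real baseline history lives in the SQLite store described above, and the window count and threshold here are assumptions.

```python
from collections import deque
from statistics import mean, stdev

class DefectRateMonitor:
    """Z-score anomaly detection over rolling windows (illustrative sketch).

    Each call supplies the defect count for the latest 4-hour window for
    one (station, defect_type) key; it is compared against the trailing
    history of window counts for that same key.
    """
    def __init__(self, history_windows=30, z_threshold=3.0):
        self.history = {}  # (station, defect_type) -> deque of past counts
        self.history_windows = history_windows
        self.z_threshold = z_threshold

    def check(self, key, current_count):
        past = self.history.setdefault(key, deque(maxlen=self.history_windows))
        anomaly = False
        if len(past) >= 8:  # require enough history for a stable baseline
            mu, sigma = mean(past), stdev(past)
            if sigma > 0 and (current_count - mu) / sigma >= self.z_threshold:
                anomaly = True
        past.append(current_count)
        return anomaly
```

With a baseline of two or three scratches per window at Station 3, a window with nine scratches scores well above the z = 3 threshold and is flagged immediately.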

Layer 2: Process Parameter Correlator

Defect spikes alone don't tell you why. Layer 2 connects defects to process parameters:

  • Input: Defect anomalies from Layer 1 + process parameters (temperature, RPM, timing)
  • Sensor Integration: MQTT or OPC UA for automated feeds; manual operator input for non-instrumented stations
  • Correlation Method: Pearson coefficient on time-aligned 30-minute windows (statistical, not ML—explainable and auditable)
  • Output: JSON with defect pattern, correlated parameter, drift magnitude, confidence, time window

When the correlator identifies that grinding RPM drifted 5% below target during the same window as the scratch spike, you have a hypothesis worth investigating.

Layer 3: Small Language Model for Explanation

Statistical correlations need context to become actionable. Layer 3 uses a local small language model (Qwen-2.5 3B INT4) to generate natural-language explanations:

  • Input: Defect patterns + process correlations + SOP references + defect taxonomy
  • Output: Bilingual (Chinese/English) root-cause hypothesis with confidence level and recommended corrective actions
  • Inference Budget: 2–4 seconds per RCA generation (triggered on anomaly, not per-frame)
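What the model receives matters more than the model itself. A sketch of the prompt assembly step; the field names and wording are assumptions, not the production schema:

```python
import json

def build_rca_prompt(anomaly: dict, corr: dict, sop_ref: str) -> str:
    """Assemble the evidence context handed to the local SLM.
    All structure here is illustrative; the real inputs come from
    Layers 1 and 2 plus the SOP library."""
    return (
        "You are a manufacturing quality analyst. Using only the evidence "
        "below, state a root-cause hypothesis in English and Chinese, with "
        "a confidence level and a recommended corrective action.\n\n"
        f"Defect anomaly: {json.dumps(anomaly)}\n"
        f"Correlated parameter: {json.dumps(corr)}\n"
        f"Relevant SOP: {sop_ref}\n"
    )
```

Because generation is triggered only on anomalies, the 2–4 second inference budget is paid a handful of times per shift, not per frame.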

Layer 4: Closed-Loop Action Recommender

The final layer converts analysis into action:

  • Output Format: SOP-linked corrective actions—not "check grinding wheel" but "Station 3 RPM at 2,847 (target 3,000 ±50). Recalibrate per SOP 4.2.3."
  • Operator Interface: Local React dashboard with recommendation + evidence links + Accept/Reject/Modify buttons
  • Feedback Loop: Operator action logged as causal triple: {defect, cause, outcome, operator_id, disposition, timestamp}
  • Safety Rail: NOTIFY ONLY. Agent suggests, never executes. No automated parameter changes. No automated line stops.
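The feedback loop reduces to an append-only log write. A minimal sketch against an in-memory SQLite store; the table and column names mirror the description above but are assumptions:

```python
import json, sqlite3, time, uuid

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE causal_triples (
    defect_id TEXT PRIMARY KEY, defect_type TEXT, cause_parameter TEXT,
    recommendation TEXT, operator_id TEXT, disposition TEXT,
    outcome_measured TEXT, ts REAL)""")

def log_causal_triple(conn, defect_type, cause, recommendation,
                      operator_id, disposition, outcome=None):
    """Append one causal record. The agent only logs and notifies;
    it never writes process parameters or stops the line."""
    defect_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO causal_triples VALUES (?,?,?,?,?,?,?,?)",
        (defect_id, defect_type, json.dumps(cause), recommendation,
         operator_id, disposition, json.dumps(outcome or {}), time.time()))
    conn.commit()
    return defect_id
```

Note that the write path contains no branch that touches equipment: "notify only" is a property of the code shape, not just a policy.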

Step 1: Convert Video into Decisions

A camera feed becomes more than "remote visibility" when you add real-time AI-powered defect detection:

  • Detect defect classes (scratches, chips, misalignment, missing parts, discoloration)
  • Apply SOP thresholds via DefectIQ rules engine
  • Produce PASS/FAIL/REVIEW verdicts

When inspection runs on NVIDIA Jetson edge compute with TensorRT optimization, it happens at line speed—under 25ms latency—without network dependency.
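The decision step itself is a thin rules layer over detector output. A hypothetical sketch; the thresholds, the 0.85 confidence cut, and the 15% review band are invented for illustration, while the actual DefectIQ rules are SOP-driven:

```python
def verdict(detections, sop_thresholds, review_band=0.15):
    """Map detections to PASS/FAIL/REVIEW against per-class SOP limits.
    Each detection: {"type": str, "severity": float, "confidence": float}.
    Illustrative rules only, not the DefectIQ engine itself."""
    worst = "PASS"
    for d in detections:
        limit = sop_thresholds.get(d["type"])
        if limit is None:
            continue
        if d["severity"] >= limit:
            if d["confidence"] >= 0.85:
                return "FAIL"          # confident SOP violation
            worst = "REVIEW"           # low-confidence violation: human check
        elif d["severity"] >= limit * (1 - review_band):
            worst = "REVIEW"           # borderline: route to review
    return worst
```

The REVIEW bucket is what makes the system honest: borderline and low-confidence cases go to a person instead of being silently rounded to PASS or FAIL.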

Step 2: Convert Decisions into Durable Records

Decisions alone are not enough. The system must store a durable record for every inspection event:

| Evidence Tier | What's Stored | Retention | Location |
| --- | --- | --- | --- |
| All events | Metadata JSON: event_id, timestamp, station, SKU, verdict, confidence, model version | Indefinite | Edge SQLite |
| FAIL + REVIEW | Defect frame(s), bounding boxes, detection JSON, rule_result.json, audit.json | 90 days local, then archive | Edge NVMe → MinIO |
| RCA events | Full causal triple: defect + cause + outcome + operator disposition | Indefinite | Edge SQLite → Site PostgreSQL |
| Optional | Rolling 7-day ring buffer of production video per station | 7 days (overwritten) | Edge NVMe |

This turns quality from a conversation into an audit trail. When a customer asks about a specific unit, you have the evidence.
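The "all events" tier can be as simple as one WAL-mode SQLite table per edge node. A sketch, where the column set follows the metadata JSON above but the names are assumptions:

```python
import sqlite3

EVENT_DDL = """CREATE TABLE IF NOT EXISTS events (
    event_id TEXT PRIMARY KEY, ts TEXT, station TEXT, sku TEXT,
    verdict TEXT, confidence REAL, model_version TEXT)"""

def open_event_store(path):
    """Open (or create) the per-node durable event store."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # durable writes, concurrent reads
    conn.execute(EVENT_DDL)
    return conn

def record_event(conn, event):
    """Persist one inspection verdict; called once per unit."""
    conn.execute("INSERT INTO events VALUES (:event_id,:ts,:station,:sku,"
                 ":verdict,:confidence,:model_version)", event)
    conn.commit()
```

WAL mode matters on the edge: the dashboard and the pattern accumulator can read while the inspection loop keeps writing.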

Step 3: Detect Patterns Early

Once events are structured, you can spot problems before they become batches:

  • Defect spikes on a specific station after a tool change
  • Confidence drift that signals lighting degradation or model staleness
  • REVIEW rate increases that indicate process instability
  • Yield drops tied to a shift, supplier lot, or machine setting window

The Defect Pattern Accumulator runs continuously, analyzing 4-hour rolling windows with Z-score anomaly detection. When something changes, you know within minutes—not days.
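Not every signal needs the full Z-score machinery; REVIEW-rate creep is well served by a simple exponentially weighted average. A toy sketch, with the smoothing factor and alert level chosen only for illustration:

```python
class ReviewRateEWMA:
    """Exponentially weighted REVIEW-rate tracker (illustrative).
    A sustained rise in REVIEW verdicts often precedes a yield drop."""
    def __init__(self, alpha=0.05, alert_rate=0.10):
        self.alpha, self.alert_rate, self.rate = alpha, alert_rate, 0.0

    def update(self, verdict):
        """Fold in one verdict; return True when the smoothed
        REVIEW rate crosses the alert level."""
        x = 1.0 if verdict == "REVIEW" else 0.0
        self.rate += self.alpha * (x - self.rate)
        return self.rate >= self.alert_rate
```

Because the EWMA forgets slowly, a single ambiguous unit does nothing, while a genuine shift in process stability trips the alert within a handful of units.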

Step 4: Explain Root Causes with Citations

"Root cause" is where most software becomes fiction. Vague suggestions like "check the equipment" or "review the process" are useless on a production floor.

The only acceptable root-cause output is evidence-cited:

  • Which units were affected: Event IDs, timestamps, count
  • What the system saw: Actual frames and detection artifacts
  • What changed: Time window, station, operator, batch, process parameters
  • Why the hypothesis is plausible: Statistical correlation, pattern match
  • What action should be taken: Specific SOP reference with steps

This is how a quality team moves from "maybe" to "do this now."
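One way to make the "evidence-cited only" rule enforceable is to encode the report shape and reject anything incomplete. The fields below paraphrase the five bullets above; the names and the `/mnt` path are illustrative:

```python
from dataclasses import dataclass

@dataclass
class RCAReport:
    """Evidence-cited root-cause output (field names are illustrative)."""
    affected_units: list   # event IDs and timestamps
    evidence_paths: list   # frames and detection artifacts
    change_window: dict    # station, shift, batch, parameter drift
    hypothesis: str
    correlation: float     # statistical support for the hypothesis
    action: str            # specific SOP reference with steps

    def is_actionable(self):
        # A report with no cited units or no concrete SOP step is "fiction".
        return bool(self.affected_units) and "SOP" in self.action
```

A report that fails `is_actionable` never reaches the operator dashboard; vague output is filtered structurally rather than by reviewer discipline.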

Step 5: Close the Loop with Causal Triples

The loop closes when actions and outcomes are recorded as causal defect triples:

| Field | Type | Example |
| --- | --- | --- |
| defect_id | UUID | evt_20260213_143022_s3_001 |
| defect_type | Enum | scratch_surface |
| cause_parameter | JSON | {grinding_rpm: 2847, target: 3000, drift_pct: 5.1} |
| cause_confidence | Float | 0.87 |
| recommendation | Text | SOP 4.2.3: Recalibrate grinding wheel |
| operator_action | Enum | accepted / rejected / modified |
| outcome_measured | JSON | {scratch_rate_after: 0.8pct, time_to_baseline: 45min} |

Over time, your factory builds an internal quality knowledge base: not just defect counts, but defect explanations tied to real events, real causes, and verified outcomes.

Why Edge-First Matters for Root Cause

If your system depends on cloud connectivity, the evidence trail breaks exactly when the plant is under stress. Network goes down during a quality crisis? That's when you need root cause analysis most—and that's when cloud-dependent systems fail.

Edge-first architecture keeps inspection and evidence capture continuous:

  • Internet down → Full local operation continues. SQLite buffers events. RCA uses local SLM. Dashboard served locally.
  • Site server down → Edge nodes continue independently. Local evidence storage continues.
  • Camera disconnected → Station pauses inspection. Adjacent stations unaffected. Watchdog attempts reconnect.

The factory never stops inspecting because the internet went down.
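The degraded modes above all reduce to the same pattern: write locally first, forward when you can. A toy sketch, where `pending` stands in for the SQLite-backed buffer and `send` for the site-server uplink:

```python
class EvidenceUplink:
    """Store-and-forward sketch: events always land locally before any
    network attempt, so an outage can delay shipping but never lose data."""
    def __init__(self, send):
        self.send = send      # callable(event); raises ConnectionError offline
        self.pending = []     # stand-in for the durable local buffer

    def publish(self, event):
        self.pending.append(event)   # durable write happens first, always
        self.flush()

    def flush(self):
        """Drain the buffer in order; stop at the first failure."""
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                return False         # still offline; keep buffering locally
            self.pending.pop(0)
        return True
```

Ordering is preserved and nothing is dropped: when the link returns, a periodic `flush` ships the backlog and the site database catches up.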


Book a Demo to see the four-layer RCA stack in action.