At a food manufacturing plant running Traksys MES, I pulled six weeks of raw event data and ran it through a clustering algorithm. The finding: 92.6% of major downtime events — the ones that showed up on the Pareto chart, the ones that got discussed in the morning meeting — were preceded by clusters of short stops on the same equipment.

Short stops. Those 30-second to 2-minute interruptions that individually look like nothing. A brief pause, a quick reset, the line starts again. Operators don't report them because they're too short to matter. An MES either buckets them into "performance loss," where they disappear into a rate percentage, or doesn't record them at all if they fall below the event threshold.

But they're not noise. They're the signal.

What the Pattern Looks Like

The typical sequence: a piece of equipment starts producing short stops at an increasing rate. Three in an hour becomes five becomes eight. Each one is under two minutes, so nobody flags it. The OEE report shows a small performance dip that gets attributed to "normal variability."

Then, 30 to 90 minutes after the short stop cluster begins, the equipment has a major fault. The line goes down for 45 minutes. Maintenance is called. Parts are replaced. The downtime gets logged, discussed, and added to the Pareto chart. Everyone treats it as a discrete event — equipment failure, unplanned, fix it and move on.

But the equipment was telling you it was about to fail. It was telling you for an hour. The signal was in the short stops, and nobody was listening because OEE had categorized them as something else.

Why Standard OEE Misses This

OEE splits losses into three buckets: availability (downtime), performance (speed loss), and quality (defects). Short stops get classified as performance loss because the line doesn't technically "go down" — it just runs slower or pauses briefly.

This means the early warning signal for an availability problem is being tracked in a completely different category. The availability chart looks clean right up until the major fault. The performance chart shows a slight dip that nobody investigates because performance is always slightly variable. The two metrics aren't cross-referenced because they're treated as independent loss categories.
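To make that concrete, here is a small worked example of the standard availability x performance x quality arithmetic. The numbers are invented for illustration (one 8-hour shift, eight 1-minute short stops, no logged downtime, no defects), and the variable names are mine, not fields from any particular MES.

```python
# Illustrative numbers only: one 8-hour shift with eight 1-minute short stops
# and no stop long enough to be logged as downtime.
planned_min = 480.0        # planned production time
downtime_min = 0.0         # logged downtime events (none this shift)
short_stop_min = 8 * 1.0   # short stops, below the downtime threshold
ideal_rate = 100.0         # units per minute at ideal speed (assumed)
good_fraction = 1.0        # no defects, to isolate the short-stop effect

run_time = planned_min - downtime_min
units = (run_time - short_stop_min) * ideal_rate  # output lost only to short stops

availability = run_time / planned_min
performance = units / (run_time * ideal_rate)
quality = good_fraction
oee = availability * performance * quality

print(f"availability={availability:.1%} performance={performance:.1%} oee={oee:.1%}")
# prints: availability=100.0% performance=98.3% oee=98.3%
```

Availability reads a perfect 100% while the eight short stops show up only as a 1.7-point performance dip, which is easy to write off as normal variability.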

The insight — that performance degradation predicts availability loss — is invisible to anyone reading the standard OEE decomposition.

How to Find It in Your Data

If your MES records events under 2 minutes, you already have the data. Here's the approach:

Step 1: Pull raw events for 4-6 weeks. Filter to unplanned events under 2 minutes on production equipment (exclude planned stops, breaks, etc.). Keep the longer unplanned events too; you'll need them in Step 3.

Step 2: Cluster by time window. Use a 30-minute rolling window per equipment ID. Count short stops per window. Flag windows where the count exceeds your baseline (usually 3+ per 30 minutes is anomalous).

Step 3: Correlate with major faults. For each flagged cluster, check whether a major downtime event (>10 minutes) occurred on the same equipment within the next 4 hours.

Step 4: Calculate the hit rate: the share of flagged clusters that were followed by a major fault. In my experience across multiple plants, it's consistently above 80%. The equipment is almost always warning you. The question is whether anyone is structured to listen.
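The four steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the exact analysis I ran: the event tuple layout, the function name cluster_hit_rate, and the rule that consecutive over-threshold windows count as one cluster are all my own choices for illustration. Adapt the field names to whatever your MES export actually contains.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from datetime import datetime, timedelta

SHORT_MAX = timedelta(minutes=2)    # Step 1: short stop = under 2 minutes
MAJOR_MIN = timedelta(minutes=10)   # Step 3: major fault = over 10 minutes
WINDOW = timedelta(minutes=30)      # Step 2: rolling window per equipment
THRESHOLD = 3                       # Step 2: 3+ short stops per window
LOOKAHEAD = timedelta(hours=4)      # Step 3: how far ahead to look

def cluster_hit_rate(events):
    """events: iterable of (equipment_id, start, duration, planned).

    Returns (flagged_clusters, clusters_followed_by_major_fault).
    Events between 2 and 10 minutes are ignored by both filters.
    """
    short, major = defaultdict(list), defaultdict(list)
    for equip, start, duration, planned in events:
        if planned:                      # Step 1: drop planned stops and breaks
            continue
        if duration < SHORT_MAX:
            short[equip].append(start)
        elif duration > MAJOR_MIN:
            major[equip].append(start)

    flagged = hits = 0
    for equip, stops in short.items():
        stops.sort()
        faults = sorted(major.get(equip, []))
        in_cluster = False
        for i, t in enumerate(stops):
            # Step 2: count short stops in the 30 minutes ending at t.
            lo = bisect_left(stops, t - WINDOW, 0, i + 1)
            hot = (i - lo + 1) >= THRESHOLD
            if hot and not in_cluster:   # a new cluster starts here
                flagged += 1
                # Step 3: first major fault strictly after the flag time.
                j = bisect_right(faults, t)
                if j < len(faults) and faults[j] - t <= LOOKAHEAD:
                    hits += 1
            in_cluster = hot
    return flagged, hits

# Made-up sample data: one cluster that precedes a fault, one that does not.
base = datetime(2024, 1, 8, 6, 0)
m, s = timedelta(minutes=1), timedelta(seconds=1)
events = [
    ("filler", base, 45 * s, False),
    ("filler", base + 10 * m, 60 * s, False),
    ("filler", base + 20 * m, 90 * s, False),
    ("filler", base + 70 * m, 45 * m, False),   # major fault 50 min after flag
    ("filler", base + 180 * m, 15 * m, True),   # planned stop, ignored
    ("capper", base, 30 * s, False),
    ("capper", base + 5 * m, 40 * s, False),
    ("capper", base + 12 * m, 50 * s, False),   # cluster with no fault after it
]
flagged, hits = cluster_hit_rate(events)
print(f"{hits} of {flagged} flagged clusters preceded a major fault")  # Step 4
# prints: 1 of 2 flagged clusters preceded a major fault
```

Note that this measures how often a cluster leads to a fault. The reverse question, how many major faults were preceded by a cluster, uses the same two lists but loops over the faults instead, and it is worth computing both.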

What to Do With This

Once you know the pattern exists, the response is straightforward: make short stop clusters a trigger for proactive intervention. When a piece of equipment logs 3 or more short stops in 30 minutes, that's not a performance note. That's a maintenance call. Inspect now, prevent the major fault later.
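As a rough idea of what that trigger could look like in software, here is a small sketch that keeps a sliding window of recent short stops per machine. The class name ShortStopTrigger and its record method are hypothetical, it assumes short stops arrive in time order, and wiring it to your MES event feed and your maintenance call-out is left out entirely.

```python
from collections import deque
from datetime import datetime, timedelta

class ShortStopTrigger:
    """Fires when one machine logs `threshold` short stops within `window`."""

    def __init__(self, threshold=3, window=timedelta(minutes=30)):
        self.threshold = threshold
        self.window = window
        self._recent = {}  # equipment_id -> deque of recent short-stop times

    def record(self, equipment_id, when):
        """Register one short stop; return True if maintenance should be called."""
        recent = self._recent.setdefault(equipment_id, deque())
        recent.append(when)
        # Drop stops that have aged out of the window.
        while recent and when - recent[0] > self.window:
            recent.popleft()
        return len(recent) >= self.threshold

# Made-up example: three stops in 20 minutes, then a quiet spell.
trigger = ShortStopTrigger()
t0 = datetime(2024, 1, 8, 6, 0)
alerts = [trigger.record("filler", t0 + timedelta(minutes=mins))
          for mins in (0, 10, 20, 55)]
print(alerts)  # prints: [False, False, True, False]
```

The third stop trips the trigger; by the fourth, the earlier stops have aged out of the window and the count resets on its own.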

Finding the pattern is step one. The harder work — building the incident response framework so the team acts on it before it becomes a major fault, tracking whether the fix holds, and preventing recurrence — is what turns a one-time discovery into a sustained capability. That's what I help plants build.

If you want to try the analysis yourself, start with the steps above. If you want help with what comes after the finding — let's talk.