The Grafana dashboard built for the floor, not the engineer, and the reader retrofit that finally closed

The surface the stack still did not have

By the end of issue 09 the pipeline was complete in everything but the part a human touches. Telegraf writes the Sparkplug B stream into a tiered InfluxDB OSS layout, the Isolation Forest from issue 06 scores each window, the cross-asset gate from issue 07 suppresses the correlated false alarm, and the feature bucket from issue 09 holds the long-term record. The only thing that reaches a person is a Pushover notification that fires when a flag survives the gate. The notification says a problem exists. It does not say which bearing, how far along, or what the waveform looked like when the model decided.

That gap is the reason the Grafana instance the series has referenced since issue 05 has sat mostly empty. Standing up Grafana is an afternoon. Building a dashboard a maintenance technician can read under the pressure of a 6 a.m. page is the actual work, and it is a different kind of work than the pipeline, because the constraint is not throughput or storage but attention.

The engineer's dashboard and the technician's are not the same dashboard

The default failure of an engineer building a maintenance dashboard is to build the dashboard the engineer wants. That dashboard shows the detector's anomaly-score distribution, the gate's suppression rate, the contamination parameter, and the rolling false-positive count. Every one of those is the right panel for tuning the model and the wrong panel for the floor. The technician does not own the model and will not change its contamination setting at 6 a.m. The technician owns the machine.

The technician's questions arrive in a fixed order. Which asset is this about? How serious is it? How long has it been happening? What am I supposed to look at? A dashboard built for the floor answers that sequence top to bottom and puts nothing on the page that does not serve one of those four questions. The engineer's panels still exist, but they live on a second dashboard the technician never opens, because mixing the two is how a status page becomes a wall of charts that the person under pressure scrolls past.

The status row that is green or red and nothing else

The top of the floor dashboard is one row, one stat panel per asset, and a value mapping that resolves the asset to exactly two states. Green means the latest window for that asset scored inside the gate. Red means it did not. There is no yellow, because a maintenance status that offers three colors invites the floor to argue about the middle one, and the gate from issue 07 already made the binary decision the row is supposed to display. The panel reads the most recent flag from the feature bucket with a Flux query bounded to the last scoring interval, so the color is the current state and not a stale one.

// floor_status.flux: latest gated flag per asset, mapped to a single color
from(bucket: "features")
  |> range(start: -2m)
  |> filter(fn: (r) => r._measurement == "score" and r._field == "gated_flag")
  |> group(columns: ["asset"])
  |> last()

A technician who opens the page and sees two green tiles closes it again in three seconds, which is the correct outcome the overwhelming majority of mornings. The row exists to make the all-clear as fast to read as the alarm.
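
The two-state mapping can be sketched as a fragment of the panel's JSON model, built here in Python so it can be checked before it is committed. This is a minimal sketch and not the series' actual panel: it assumes gated_flag is stored as 0 (inside the gate) or 1 (flag survived the gate), and the display strings and panel title are hypothetical. The option names follow Grafana's stat panel and value-mapping schema and should be verified against the JSON your Grafana version exports.

```python
import json


def status_panel(title="Asset status"):
    """Return a stat-panel fragment whose value mapping has exactly two states."""
    mappings = [{
        "type": "value",
        "options": {
            # Assumption: 0 = latest window inside the gate, 1 = flag survived the gate.
            "0": {"text": "OK", "color": "green", "index": 0},
            "1": {"text": "FLAGGED", "color": "red", "index": 1},
        },
    }]
    return {
        "type": "stat",
        "title": title,
        "options": {
            "colorMode": "background",  # color the whole tile, not just the text
            "graphMode": "none",        # no sparkline on a status tile
            "reduceOptions": {"calcs": ["lastNotNull"]},
        },
        "fieldConfig": {"defaults": {"mappings": mappings}},
    }


panel = status_panel()
mapped = panel["fieldConfig"]["defaults"]["mappings"][0]["options"]
colors = sorted(entry["color"] for entry in mapped.values())
print(json.dumps(panel, indent=2))
```

Keeping the mapping to two entries is the point of the sketch: there is no third key for a middle state to hide in.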

The trends that make degradation a line, not a single crossing

Below the status row are the feature trends, and their job is to convert the alert from an event into a story. A flag is a single moment: the window where the score crossed. A bearing failure is not a moment but a slope, and the value of the feature bucket from issue 09 is that it holds the slope. The trend panels plot RMS, crest factor, kurtosis, and the tracked bearing fault frequency amplitudes over a window the technician can widen from a shift to a quarter, so the page shows whether the flagged window is an isolated spike or the latest point on a curve that has been climbing for three weeks. That distinction is the whole diagnostic value of trending, and the floor cannot get it from the alert alone.

These panels read the feature bucket, never the raw bucket. That choice was made in issue 09 and the dashboard is where it pays off: the features are small, they cover the full history, and they render instantly because there are a few of them per window rather than thousands of samples.
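
One way to keep the trend panels consistent is to generate their Flux queries from a single template, sketched below in Python. The bucket name "features" and the asset tag come from the status query earlier in this issue; the measurement name "vibration_features" and the field names are hypothetical stand-ins for whatever issue 09 actually wrote. The query relies on Grafana's v.timeRangeStart, v.timeRangeStop, and v.windowPeriod variables and on a dashboard variable named asset, which is also an assumption.

```python
# Hypothetical field names; substitute the ones the feature bucket really holds.
FEATURE_FIELDS = ["rms", "crest_factor", "kurtosis", "bpfo_amplitude"]


def trend_query(field, measurement="vibration_features", bucket="features"):
    """Build the Flux query for one feature trend panel."""
    return (
        f'from(bucket: "{bucket}")\n'
        # The dashboard time picker sets the range, so widening from a shift
        # to a quarter needs no change to the query.
        "  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)\n"
        f'  |> filter(fn: (r) => r._measurement == "{measurement}" and r._field == "{field}")\n'
        # ${asset} is interpolated by Grafana before the query is sent.
        '  |> filter(fn: (r) => r.asset == "${asset}")\n'
        # v.windowPeriod scales the aggregation to the panel width.
        "  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)\n"
    )


queries = {field: trend_query(field) for field in FEATURE_FIELDS}
print(queries["rms"])
```

Aggregating with mean is a choice made for the sketch. A mean smooths short spikes over long ranges, so a panel meant to show isolated excursions may be better served by max.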

Why the raw waveform is never on the status page

The high-rate vibration channel is the data the detector scores on, and it is the data the dashboard must not draw on the status page. A two-kilohertz series across even an hour is millions of points, and a time-series panel asked to render millions of points either downsamples them into a meaningless smear or hangs the browser trying. More to the point, a raw waveform tells the floor nothing at a glance; reading it requires zooming to a window and knowing what a healthy signature looks like, which is a forensic task and not a status check.
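
The arithmetic behind that claim is short enough to write out. The 2 kHz rate is from the text; the 60-second scoring window and the count of four features per window are assumptions made only for this comparison, and the figures are per channel.

```python
SAMPLE_RATE_HZ = 2_000     # from the text: a two-kilohertz channel
SECONDS_PER_HOUR = 3_600
WINDOW_S = 60              # assumption: one scoring window per minute
FEATURES_PER_WINDOW = 4    # assumption: RMS, crest factor, kurtosis, one fault amplitude

raw_points_per_hour = SAMPLE_RATE_HZ * SECONDS_PER_HOUR
feature_points_per_hour = (SECONDS_PER_HOUR // WINDOW_S) * FEATURES_PER_WINDOW
ratio = raw_points_per_hour // feature_points_per_hour

print(f"raw:      {raw_points_per_hour:,} points per hour")      # 7,200,000
print(f"features: {feature_points_per_hour:,} points per hour")  # 240
print(f"ratio:    {ratio:,}x")                                   # 30,000x
```

Under those assumptions the feature series is four orders of magnitude smaller than the raw one, which is why the trend panels can cover a quarter while the raw channel cannot comfortably cover an hour.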

So the raw samples stay out of the status view entirely and live one click away. The trend panels and the status row carry the morning read. When the technician decides an event is worth opening, the raw window is reached through a link, not a panel that is always loaded. The dashboard that loads in under a second is the dashboard the floor actually keeps open.

The annotation and the data link, so the page agrees with the page that paged them

Two features connect the dashboard to the rest of the stack. The first is annotations. Every gated flag that fires a Pushover alert also writes an annotation onto the dashboard timeline, so the vertical marker on the trend chart sits at the same instant the phone buzzed. The technician who was paged at 6:04 opens the page and sees the marker at 6:04, and the page and the alert tell one story instead of two. An annotation source backed by the feature bucket keeps the markers and the alerts driven by the same flags, which is the only way to keep them from drifting apart.

The second is a data link. Each point on the feature trend carries a link that opens the pinned raw window from issue 09 for that timestamp, in a second dashboard panel scoped to the forensic bucket. This is the click that closes the loop the series has been building since issue 06: alert, to trend, to the actual samples the detector saw, without the technician composing a Flux query or knowing the bucket name. The pin from issue 09 guarantees the window is still there to link to, because it was copied out of the seven-day raw bucket the moment the flag fired.
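
What the link resolves to can be sketched as a URL builder. In the dashboard itself the timestamp would come from a Grafana data-link variable such as ${__value.time}, which carries the clicked point's time in epoch milliseconds; the Python below only illustrates the target URL. The dashboard UID and slug, the asset variable name, and the 30-second padding on either side of the point are all hypothetical.

```python
from urllib.parse import urlencode


def raw_window_url(asset, point_time_ms, half_window_s=30,
                   base="/d/forensic-raw/raw-window"):
    """Build a link to the forensic dashboard, centered on one trend point."""
    params = {
        # Grafana accepts epoch milliseconds in the from/to URL parameters.
        "from": point_time_ms - half_window_s * 1000,
        "to": point_time_ms + half_window_s * 1000,
        # var-<name> sets a dashboard template variable from the URL.
        "var-asset": asset,
    }
    return f"{base}?{urlencode(params)}"


url = raw_window_url("pump-1", 1_700_000_000_000)
print(url)
# /d/forensic-raw/raw-window?from=1699999970000&to=1700000030000&var-asset=pump-1
```

The technician never sees this URL. The point of the sketch is that everything the forensic view needs, the asset and the time range, is already carried by the point that was clicked.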

The dashboard is a file in git, not clicks in a browser

A dashboard assembled by hand in the Grafana UI is unreproducible and dies with the VM. The floor dashboard is instead defined as a JSON model and loaded through dashboard provisioning, the same as the datasource and the alert rules. The model lives in the same git repository as the Telegraf configuration and the Flux tasks, so a rebuild of the Hetzner CX22 from issue 03 brings the dashboard back exactly, and a change to a panel is a reviewable diff rather than a memory of which dropdown was clicked. The asset row is driven by a template variable so adding a third asset is a value in the variable, not a new copy of the dashboard. Provisioning is the difference between a dashboard the stack owns and a dashboard that happens to exist on one server until it does not.
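
The reviewable-diff claim can be illustrated with a small generator for the JSON model. This is a sketch under assumptions, not the series' actual dashboard: the UID, title, and asset names are hypothetical, and only the template variable and one repeated stat panel are shown. The "custom" variable type and the panel's repeat field are standard parts of Grafana's dashboard JSON model.

```python
import json


def floor_dashboard(assets):
    """Return a minimal dashboard model with one stat tile repeated per asset."""
    return {
        "uid": "floor-status",           # hypothetical UID
        "title": "Floor status",
        "templating": {"list": [{
            "name": "asset",
            "type": "custom",
            "query": ",".join(assets),   # adding an asset changes only this value
            "multi": True,
            "includeAll": True,
        }]},
        "panels": [{
            "type": "stat",
            "title": "$asset",
            "repeat": "asset",           # one tile per selected variable value
            "repeatDirection": "h",
        }],
    }


def render(model):
    """Serialize with sorted keys so git diffs stay stable between edits."""
    return json.dumps(model, indent=2, sort_keys=True) + "\n"


two = render(floor_dashboard(["pump-1", "pump-2"]))
three = render(floor_dashboard(["pump-1", "pump-2", "pump-3"]))
changed = [(a, b) for a, b in zip(two.splitlines(), three.splitlines()) if a != b]
print(changed)
```

With sorted keys and a fixed indent, adding the third asset shows up as a single changed line in the diff, which is the property the provisioning approach is after.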

The reader retrofit, closed at day 30

The 30-day reader retrofit that has run open since issue 03 reached day 30 this week, and it is reported in full rather than deferred a sixth time. The reader installed the issue 03 carrier on a VFD-driven pump skid on day 9 of the planned window and ran the stack unattended, in parallel with the plant's existing monitoring contract, for the balance of the month. The reader's photographs of the install, the panel build, and the thirty-day feature trend are published with the case study, alongside a side-by-side comparison with the contract: the stack flagged, within the hour, the same two bearing-frequency excursions that the contract's monthly walkdown would have caught only on its next visit, and it did so at the recurring cost the series has carried throughout. The full writeup, including the one false alarm the gate did not suppress and why, runs as the linked case study.

What this stack costs, ten issues in

No hardware was added this issue, and the dashboard work is configuration. The bill is unchanged.

HiveMQ Community Edition, InfluxDB OSS, Grafana OSS, Telegraf, scikit-learn, PyOD, MTConnect agent, Caddy and Let's Encrypt: $0
Hetzner CX22 VM: $5.50/mo
Hetzner Storage Box for cold Parquet export, optional: about $4/mo
i.MX 8M Plus carriers, two assets: $480 one-time
Pushover device fee: $5 once

Total recurring stays at $5.50/mo, or about $9.50/mo with the cold archive on. Ten issues in, the recurring line has not moved off the price of one small VM, and the stack now has a face: a single page the floor opens at 6 a.m. that answers the four questions in the order they get asked.

What lands in issue 11

The stack ingests, scores, gates, stores, alerts, and now displays. The one assumption underneath all of it is that the single CX22 keeps running, and ten issues have built a single point of failure with one disk and one host. Issue 11 turns to what happens when that host does not come back: the backup of the InfluxDB buckets and the Grafana and Telegraf configuration, the restore drill that proves the backup is real, and the honest accounting of what a $5.50 single-VM stack can and cannot promise a plant that has come to depend on it.
