Retention and downsampling on InfluxDB OSS, and why the detector cannot eat a downsampled stream

The question the disk has been asking since issue 05

Every issue since issue 05 has written into the same place. The Telegraf agent on the edge takes the Sparkplug B stream off the broker and writes it to a single InfluxDB OSS bucket on the Hetzner CX22, the same VM that has carried the stack since issue 03. The CX22 has a 40 GB disk. The bucket has no retention period set, which means its retention is infinite, which means the disk is the retention policy and the enforcement mechanism is running out of space.

The arithmetic is no longer abstract. Two assets each publish a process-variable set and an accelerometer channel sampled at two kilohertz so the detector has the spectral content its features need. Round the high-rate term to four thousand points per second across the two assets and the day holds about three hundred forty-six million points. InfluxDB's TSM engine delta-encodes timestamps and compresses floats with a Gorilla-style XOR scheme, but vibration is high-entropy and does not compress like a slow temperature trend, so the realistic figure is a few bytes per point rather than a fraction of one. At an assumed four bytes per point after compression the raw vibration alone is on the order of a gigabyte a day. The process variables add little. Leave the operating system and InfluxDB's own overhead their share of the 40 GB and the usable budget fills in roughly three weeks.

That estimate is order-of-magnitude and the compression ratio is the soft term in it, so the real number on the test cell will differ. The conclusion does not. A single full-rate bucket with infinite retention on a small disk has a deadline, and the deadline is close.
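The estimate can be reproduced in a few lines. This is a back-of-envelope sketch, not a measurement: the four bytes per point is the assumed compression figure from the text, and the 10 GB reserved for the operating system and InfluxDB overhead is an illustrative guess that should be replaced with what the VM actually shows.

```python
# Back-of-envelope disk budget for one full-rate bucket with infinite retention.
SECONDS_PER_DAY = 86_400

POINTS_PER_SECOND = 4_000  # rounded high-rate term, both assets combined
BYTES_PER_POINT = 4.0      # ASSUMED post-compression cost, the soft term
DISK_GB = 40.0             # CX22 disk
RESERVED_GB = 10.0         # ASSUMED share for the OS and InfluxDB overhead


def points_per_day(points_per_second: int) -> int:
    """Points written in 24 hours at a constant rate."""
    return points_per_second * SECONDS_PER_DAY


def gb_per_day(points_per_second: int, bytes_per_point: float) -> float:
    """Daily on-disk growth in decimal gigabytes."""
    return points_per_day(points_per_second) * bytes_per_point / 1e9


def days_until_full(disk_gb: float, reserved_gb: float, daily_gb: float) -> float:
    """Days until the usable part of the disk is consumed."""
    return (disk_gb - reserved_gb) / daily_gb


if __name__ == "__main__":
    daily = gb_per_day(POINTS_PER_SECOND, BYTES_PER_POINT)
    print(f"points per day: {points_per_day(POINTS_PER_SECOND):,}")
    print(f"GB per day:     {daily:.2f}")
    print(f"days to full:   {days_until_full(DISK_GB, RESERVED_GB, daily):.1f}")
```

Halving or doubling BYTES_PER_POINT moves the deadline between about six weeks and about eleven days, which is why the conclusion holds even though the compression ratio is soft.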

Why the obvious fix breaks the detector

The obvious fix is a retention period. Set the raw bucket to drop anything older than seven days and the disk stops filling. It also throws away the history the whole stack exists to build. The plant's reason for running condition monitoring is to catch the slow degradation that plays out over months, and a seven-day window cannot show a bearing trending toward failure across a quarter. Retention alone trades the disk problem for a blindness problem.

The standard answer is downsampling. Keep raw data for a short window, and on a schedule aggregate it into a coarser bucket that holds far fewer points for far longer. A one-minute mean of a two-kilohertz channel replaces 120,000 samples with one, a hundred-and-twenty-thousand-fold reduction in points. A bucket of one-minute rollups holds years on the same disk. This is the textbook InfluxDB pattern and it is the right pattern for the process variables, where a one-minute mean of spindle load or coolant temperature is a faithful summary of the minute.

It is the wrong pattern for the vibration channel, and the reason is the entire premise of issues 06 through 08. The Isolation Forest from issue 06, the cross-asset gate from issue 07, and the matched filter from issue 08 all read features computed from the shape of the waveform inside a window: RMS, crest factor, kurtosis, and the amplitudes of specific bearing fault frequencies in the spectrum. Every one of those is a property of the transient structure, and a one-minute mean keeps none of the structure. The mean of a vibration signal over a minute is close to zero by construction, and the kurtosis of the mean is not the kurtosis of the signal. Downsample the raw stream and the detector loses the exact information it scores on. The trend bucket is fine for a human watching a dashboard. It is useless to the model.
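A small experiment makes the loss concrete. The sketch below is not the series' feature code: it builds two synthetic one-minute signals, one plain Gaussian noise and one with periodic impulses standing in for a bearing defect, and compares what a one-minute mean sees with what the waveform features see. The impulse amplitude, spacing and noise model are all assumptions chosen for illustration.

```python
import math
import random

SAMPLE_RATE_HZ = 2_000  # accelerometer rate from the text


def mean(x):
    return sum(x) / len(x)


def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))


def crest_factor(x):
    return max(abs(v) for v in x) / rms(x)


def kurtosis(x):
    """Non-excess kurtosis: about 3 for Gaussian noise."""
    m = mean(x)
    m2 = sum((v - m) ** 2 for v in x) / len(x)
    m4 = sum((v - m) ** 4 for v in x) / len(x)
    return m4 / (m2 * m2)


def synthetic_vibration(seconds, impulse_every=0, seed=0):
    """Unit Gaussian noise, optionally with alternating-sign impulses.

    impulse_every is the spacing in samples; 0 means no impulses.
    The impulse amplitude of 12 is an ASSUMED illustrative value.
    """
    rng = random.Random(seed)
    n = seconds * SAMPLE_RATE_HZ
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    if impulse_every:
        for k, i in enumerate(range(0, n, impulse_every)):
            x[i] += 12.0 if k % 2 == 0 else -12.0
    return x


if __name__ == "__main__":
    signals = {
        "healthy": synthetic_vibration(60),
        "faulty": synthetic_vibration(60, impulse_every=200),
    }
    for name, sig in signals.items():
        print(
            f"{name:8s} mean={mean(sig):+.4f} rms={rms(sig):.3f} "
            f"crest={crest_factor(sig):.1f} kurtosis={kurtosis(sig):.1f}"
        )
```

Both minutes reduce to a mean near zero, so a rollup bucket cannot tell them apart, while kurtosis and crest factor separate them clearly. That separation is what the detector scores on.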

The two records the system actually needs

The way out is to stop treating this as one record. The system needs two, and they have different shapes and different lifetimes.

The first is the raw waveform, and its job is forensics. When the detector flags an anomaly or a tech disputes one, somebody needs the actual samples around the event to see what happened. That need is recent and short-lived. Nobody pulls the raw waveform from five months ago, because by then the feature record already says what the waveform would have shown. So the raw bucket gets a seven-day retention period and serves as a rolling forensic buffer, and the disk problem is solved because seven days of full-rate data is roughly ten gigabytes at the rate estimated above, not a year of it.

The second record is the long-term memory, and the insight is that the edge already computes it. Every scoring window since issue 06 produces a feature vector, a dozen or so floats per window per asset. That vector is the input the model already trusts, it is what the detector's history should be made of, and it is small. One feature vector per second for two assets is about 170,000 vectors a day, or roughly two million field values at a dozen floats each, two orders of magnitude under the raw stream. Written to its own bucket it costs almost nothing and it can be kept at full event cadence for years. The long-term trend the dashboards need, and the training history the model needs, both come from the feature bucket. The raw waveform was never the right thing to keep; it was just the thing that happened to be in the bucket.
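For the write itself, one feature vector maps naturally onto one line of InfluxDB line protocol. The helper below is a minimal sketch of that mapping, not the scoring loop's actual code: the measurement name `features`, the tag key `asset`, the asset name and the field names are all hypothetical.

```python
import math


def _escape(text: str) -> str:
    """Escape the characters line protocol reserves in tag values and keys."""
    for ch in (",", "=", " "):
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(measurement, asset, features, timestamp_ns):
    """Render one feature vector as a single line-protocol record.

    features maps field name to a float; timestamp_ns is integer
    nanoseconds since the Unix epoch (line protocol's default precision).
    """
    parts = []
    for key, value in sorted(features.items()):
        value = float(value)
        if not math.isfinite(value):
            # NaN and infinity are not valid float field values.
            raise ValueError(f"non-finite value for field {key!r}")
        parts.append(f"{_escape(key)}={value!r}")
    fields = ",".join(parts)
    return f"{_escape(measurement)},asset={_escape(asset)} {fields} {timestamp_ns}"


if __name__ == "__main__":
    vector = {"rms": 0.42, "crest_factor": 4.7, "kurtosis": 3.1}
    print(to_line_protocol("features", "press_1", vector, 1_700_000_000_000_000_000))
```

Keeping the asset as a tag and every feature as a field means each window is one row per asset, which is the shape both the dashboards and a training export would read back.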

// influx_tasks.py: the two writes, conceptually
// 1) raw bucket: written by Telegraf, 7-day retention, full rate, forensic only
// 2) features bucket: written by the scoring loop, ~no expiry, the long-term record
//
// Flux task: roll the process variables (NOT the vibration) into a trend bucket.
// Vibration trends are read from the feature bucket, not from a mean of the wave.

option task = {name: "pv_rollup_1m", every: 1m}

from(bucket: "raw")
  |> range(start: -task.every)
  |> filter(fn: (r) => r._measurement == "process" and
                       (r._field == "spindle_load" or r._field == "coolant_temp"))
  |> aggregateWindow(every: 1m, fn: mean, createEmpty: false)
  |> to(bucket: "trend_1m")

The retention periods are set on the buckets, the rollup runs as a Flux task on a one-minute schedule, and the feature bucket is written by the scoring loop that already exists. No new service. The change is configuration plus one task, and the architecture from issue 07 already separated the feature computation from the alerting, so the feature vectors are already in hand and just need their own destination.
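The bucket side of that configuration can be written down as a small table. The sketch below builds the JSON bodies that the InfluxDB v2 bucket-creation endpoint (`POST /api/v2/buckets`) accepts, where an `expire` rule carries `everySeconds` and an empty rule list means data never expires. The bucket names `features` and `forensic`, and the choice of no expiry for `trend_1m`, are assumptions for illustration; the org ID is a placeholder.

```python
import json

DAY = 86_400

# Retention per bucket in seconds; 0 is used here to mean "never expire".
BUCKET_RETENTION_SECONDS = {
    "raw": 7 * DAY,   # rolling forensic buffer
    "features": 0,    # long-term record written by the scoring loop
    "trend_1m": 0,    # ASSUMED: process-variable rollups kept indefinitely
    "forensic": 0,    # pinned raw windows, no retention period
}


def bucket_request_body(name: str, org_id: str) -> dict:
    """Build the request body for creating one bucket."""
    seconds = BUCKET_RETENTION_SECONDS[name]
    rules = [{"type": "expire", "everySeconds": seconds}] if seconds else []
    return {"orgID": org_id, "name": name, "retentionRules": rules}


if __name__ == "__main__":
    for bucket in BUCKET_RETENTION_SECONDS:
        # "PLACEHOLDER_ORG_ID" is hypothetical; use the real org ID.
        print(json.dumps(bucket_request_body(bucket, "PLACEHOLDER_ORG_ID")))
```

Keeping the plan in one dictionary makes the destructive setting easy to review: the raw bucket is the only one with an expiry.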

The pin, because retention deletes the window you needed

A retention policy is a destructive operation running on a timer, and the rule with destructive timers is that they delete the one thing you turn out to need. The seven-day raw window is fine for routine forensics and wrong for the case that matters most: a real fault that begins, gets flagged, and then is diagnosed eleven days later when the part finally fails. By then the raw window around the onset is gone, and the feature vectors survive but the technician who wants to see the actual waveform of the early signature cannot.

The fix is a pin. When the detector flags an anomaly, or a human marks a window for review, the scoring loop copies the raw samples for that window, with a margin on each side, into a separate forensic bucket that has no retention period. The copy happens at flag time, inside the seven-day window while the data still exists, so the pin races the retention sweep and wins because it fires the moment the flag does. The forensic bucket grows only by the size of pinned events, which is small, because anomalies are rare by definition. The routine stream expires on schedule and the evidence around every flagged event is kept indefinitely. The destructive timer keeps the disk bounded, and the pin keeps it from deleting the windows that were ever worth looking at.
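The pin reduces to three small steps: widen the flagged window by a margin, confirm the widened window is still inside raw retention, and issue a copy from the raw bucket to the forensic one. The following is a hedged sketch of those steps, not the scoring loop's implementation. The 30-second margin, the `asset` tag and the bucket name `forensic` are assumptions, and the Flux string is only built here, not executed.

```python
from datetime import datetime, timedelta, timezone

RAW_RETENTION = timedelta(days=7)
PIN_MARGIN = timedelta(seconds=30)  # ASSUMED margin on each side of the window


def pin_window(window_start, window_stop, margin=PIN_MARGIN):
    """Widen the flagged scoring window by the margin on both sides."""
    return window_start - margin, window_stop + margin


def still_in_raw(start, now, retention=RAW_RETENTION):
    """True while the oldest sample of the pin is younger than raw retention."""
    return now - start < retention


def _rfc3339(ts):
    # Expects timezone-aware datetimes; sub-second precision is dropped.
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def pin_flux(start, stop, asset):
    """Flux that copies one asset's raw samples into the forensic bucket."""
    return (
        'from(bucket: "raw")\n'
        f"  |> range(start: {_rfc3339(start)}, stop: {_rfc3339(stop)})\n"
        f'  |> filter(fn: (r) => r.asset == "{asset}")\n'
        '  |> to(bucket: "forensic")\n'
    )


if __name__ == "__main__":
    flagged = datetime(2024, 1, 10, 12, 0, 0, tzinfo=timezone.utc)
    start, stop = pin_window(flagged, flagged + timedelta(seconds=1))
    if still_in_raw(start, now=flagged + timedelta(seconds=2)):
        print(pin_flux(start, stop, "press_1"))
```

Checking `still_in_raw` matters mostly for human-marked windows, which can arrive days after the event. A pin issued at flag time passes trivially.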

For data that needs to leave the VM entirely, the cold path is an export to object storage. A nightly job writes the previous day's feature bucket, and any pinned raw windows, to a Hetzner Storage Box or any S3-compatible target as Parquet, which is the format InfluxDB v3 adopted natively and the format the analysis tooling reads anyway. The VM holds the working set; the storage box holds the archive at a fraction of the per-gigabyte cost of the VM disk.

What this stack costs, nine issues in

No hardware was added this issue. The recurring bill is unchanged, with one optional line.

HiveMQ Community Edition, InfluxDB OSS, Grafana OSS, Telegraf, scikit-learn, PyOD, MTConnect agent, Caddy and Let's Encrypt: $0
Hetzner CX22 VM: $5.50/mo
Hetzner Storage Box BX11 for cold Parquet export, optional: about $4/mo for a terabyte
i.MX 8M Plus carriers, two assets (issues 03 and 07): $480 one-time
Pushover device fee: $5 once

Total recurring stays at $5.50/mo, or about $9.50/mo with the cold archive turned on. The retention work bought the stack a fixed disk footprint that no longer grows without bound, which is the property that lets it run unattended past the point where the single-bucket design would have filled the disk and stopped ingesting.

The reader retrofit, still open

The 30-day reader retrofit case study is still not done. As of issue 07 the reader's install of the issue 03 carrier was on day 9 of the planned run, which puts the window near day 23 now, short of the full 30. The case study publishes with the reader's photographs and the complete comparison against the plant's existing contract when the window closes, and not on a publishing schedule that would force a half-finished result.

What lands in issue 10

The stack now ingests on a bounded disk, scores with a gated detector, and keeps a long-term feature record. Issue 10 turns to what the maintenance team sees, because none of it matters if the page a technician opens at 6 a.m. does not answer the question that brought them to it. The Grafana dashboard the series has referenced since issue 05 gets built for the person on the floor rather than the engineer who built it, with the feature bucket behind the trends and the pinned forensic windows one click from the alert. Plus the completed reader retrofit if the window has closed.
