Beyond Thresholds: Rethinking Cloud Cost Alerting

Before founding Costory, we learned firsthand that cloud cost alerting rarely works the way you expect it to.

At first, we set simple rules:

Alert if daily spend > $1,000
Alert if S3 storage > 10 TB

It seemed sensible. But before long, those alerts were triggering almost every week.
A data backfill, a large training job, a traffic spike after a media campaign — all perfectly normal events — looked like “critical incidents” in our alerting system.

It became background noise. Nobody was acting on it anymore.

The Threshold Problem

Most FinOps alerting still works like this: pick a number, alert if you go over.

It’s the same mindset as early infrastructure monitoring:

if cpu_usage > 90% for 5m => alert

This is fine for catching massive spikes, but cloud costs today are far too dynamic for static rules:

  1. Elastic workloads — Auto-scaling clusters, event-driven compute, and batch jobs create large but expected fluctuations.
  2. Shifting business context — A launch, a marketing push, or a new model training job can legitimately increase spend.
  3. Threshold drift — The “right” threshold changes over time… but rarely gets updated.

The result: alerts fire constantly, engineers learn to ignore them, and real cost anomalies slip through.

Common False Positives We’ve Seen

  • Data reprocessing — AWS Glue jobs backfilling historical data for compliance.
  • Load testing — Kubernetes clusters scaling up for performance testing.
  • Content migrations — S3 usage doubling during an asset migration.

All of these are normal.
A static threshold system can’t tell the difference between these and a genuine runaway cost.

What “Next-Gen” Cost Alerting Should Do

If you want alerts your SRE and FinOps teams actually value, they need:

  1. Historical baselines — Compare against your past usage patterns.
  2. Seasonality awareness — Understand predictable peaks, like holiday traffic or month-end batch jobs.
  3. Event correlation — Factor in releases, data jobs, and known business events.
  4. Noise suppression — Avoid firing on the same known condition repeatedly.
  5. Relevance scoring — Escalate only when the anomaly really matters.
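
To make the first two points concrete, here is a minimal sketch of a seasonality-aware baseline check in Python. Treat everything in it as an illustrative assumption rather than a reference implementation: the is_anomalous name, the day-of-week framing, and the 3.5 cutoff are all placeholders. The idea is to compare today’s spend against recent same-weekday history using a robust z-score (median and MAD), so one past spike doesn’t distort the baseline.

from statistics import median

def is_anomalous(today: float, same_weekday_history: list[float],
                 cutoff: float = 3.5) -> bool:
    """Flag spend that deviates sharply from the same weekday's history."""
    baseline = median(same_weekday_history)
    # Median absolute deviation: a spread estimate that tolerates outliers,
    # so a single past backfill doesn't inflate the "normal" range.
    mad = median(abs(x - baseline) for x in same_weekday_history)
    if mad == 0:
        # Perfectly flat history: fall back to a simple relative change.
        return abs(today - baseline) > 0.2 * baseline
    # 0.6745 rescales MAD to be comparable to a standard deviation.
    robust_z = 0.6745 * (today - baseline) / mad
    return abs(robust_z) > cutoff

# Recent Mondays already include heavy batch-job days, so a $1,900 Monday
# passes quietly while a $4,000 Monday is flagged.
mondays = [1450.0, 1900.0, 1520.0, 1850.0]
print(is_anomalous(1900.0, mondays))  # False: within normal Monday variance
print(is_anomalous(4000.0, mondays))  # True: far outside the baseline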

Our Take at Costory

Because we’d been burned by noisy, context-free cost alerts at our previous companies, we started building context-aware anomaly detection.

Instead of asking:

“Did we exceed $X?”

We ask:

“Is this change unexpected given our history, workload patterns, and business context?”

The goal: surface only the anomalies worth your attention, while ignoring the noise that burns out engineers.
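
As a hedged illustration of points 3 and 4 from the requirements list above, and not of Costory’s actual model, here is what correlating an anomaly with known events and suppressing repeats can look like. The Event and Anomaly shapes, the event names, and the 24-hour dedup window are assumptions made up for this sketch.

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Event:
    name: str          # e.g. "glue-backfill", "load-test", "s3-migration"
    start: datetime
    end: datetime

@dataclass
class Anomaly:
    service: str
    observed_at: datetime
    fingerprint: str   # stable key identifying "the same known condition"

# Remembers when each fingerprint last fired, to suppress repeats.
recently_alerted: dict[str, datetime] = {}

def should_escalate(anomaly: Anomaly, known_events: list[Event],
                    dedup_window: timedelta = timedelta(hours=24)) -> bool:
    # Event correlation: a spike that overlaps a known job, launch, or
    # migration is expected, so nobody gets paged for it.
    for event in known_events:
        if event.start <= anomaly.observed_at <= event.end:
            return False
    # Noise suppression: don't re-fire on the same condition in the window.
    last = recently_alerted.get(anomaly.fingerprint)
    if last is not None and anomaly.observed_at - last < dedup_window:
        return False
    recently_alerted[anomaly.fingerprint] = anomaly.observed_at
    return True

# A Glue backfill is running, so the matching S3 spike stays quiet.
backfill = Event("glue-backfill",
                 datetime(2025, 3, 3, 1), datetime(2025, 3, 3, 9))
spike = Anomaly("s3", datetime(2025, 3, 3, 4), "s3-storage-growth")
print(should_escalate(spike, [backfill]))  # False: matches a known event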

We’ll open a beta of this approach in mid-August for teams that want to try it.

The takeaway:
Static thresholds can catch catastrophic spikes, but they’re too blunt for modern cloud environments.
Smarter alerting requires context, history, and noise reduction — whether you build it yourself or use a tool like Costory.