Anomaly detection: business metrics vs. system metrics?
3 chipfixer 3 4/27/2025, 7:08:43 PM
Recently, I ran into someone at a conference whose org had a major incident: a config change caused a revenue drop. Their RED/system metrics didn't catch it because the alerts were all static thresholds and siloed from the business signals, so engineers didn't discover the actual revenue impact until much later.
What might have helped is anomaly detection directly on their business metrics, with system metrics used to explain root cause only once real customer or business impact is detected.
Curious to hear: how much does your org prioritize monitoring business metrics (not just system metrics)? If you do, what tools do you use?
Larger, incident-worthy changes in metrics are also easier to set static thresholds around, and they tend to ring more than one bell when they occur. I'd be more concerned about small to mid-sized deviations from the trend, say a sudden ±10% change in my business metrics over X minutes. Can I reliably set a static threshold that will be appropriate everywhere here? A good anomaly detector would ideally bring something like this to attention without hard-coded alert configs.
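
To make the "±10% from the trend" idea concrete, here is a minimal sketch that compares each new point against a rolling median of the preceding window instead of a fixed value. It is only one simple way to approximate a trend-relative check, not how any particular tool does it. The function name `detect_deviations`, the `window` and `rel_threshold` parameters, and the sample data are all made up for illustration.

```python
from collections import deque
from statistics import median


def detect_deviations(values, window=30, rel_threshold=0.10):
    """Return indices of points that deviate from the rolling median of the
    previous `window` points by more than `rel_threshold` (relative change).

    Hypothetical helper: a stand-in for a real anomaly detector.
    """
    history = deque(maxlen=window)  # holds only the most recent `window` points
    flagged = []
    for i, value in enumerate(values):
        # Only evaluate once a full window of history is available.
        if len(history) == window:
            baseline = median(history)
            # Skip a zero baseline to avoid dividing by zero.
            if baseline != 0:
                change = abs(value - baseline) / abs(baseline)
                if change > rel_threshold:
                    flagged.append(i)
        history.append(value)
    return flagged


# Assumed sample data: revenue per minute, flat at 100, then a 12% dip
# and a 12.5% spike. A static threshold set for a daytime baseline of
# 100 would need retuning if the normal level were 20 or 1000.
revenue = [100.0] * 30 + [100.0, 88.0, 100.0, 112.5]
print(detect_deviations(revenue))
```

Because the baseline moves with the data, the same 10% setting applies whether the metric normally sits at 20 or at 1000. Seasonality (daily or weekly cycles) would need a smarter baseline than a plain rolling median.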