Anomaly detection: business metrics vs. system metrics?

chipfixer · 4/27/2025, 7:08:43 PM
Recently I ran into someone at a conference whose team had a major incident: a config change caused a revenue drop. Their RED/system metrics didn't catch it because the alerts were all static-threshold and siloed from the business signals, so engineers didn't discover the actual revenue impact until much later.

What might have helped is anomaly detection directly on their business metrics, with system metrics used to explain the root cause once real customer/business impact is detected.
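
To make that concrete, here's a rough Python sketch of the kind of detector I mean. None of this is from the team I talked to; the window size, cutoff, and revenue stream are all made up for illustration.

```python
# A rough sketch (window size and cutoff are made up): flag sudden moves in a
# business metric such as revenue per minute using a rolling robust z-score,
# rather than a static threshold.
from collections import deque
import statistics

class RollingAnomalyDetector:
    def __init__(self, window=30, z_cutoff=4.0):
        self.history = deque(maxlen=window)  # recent metric values
        self.z_cutoff = z_cutoff

    def observe(self, value):
        """Return True if `value` deviates sharply from the recent baseline."""
        is_anomaly = False
        if len(self.history) == self.history.maxlen:
            baseline = statistics.median(self.history)
            # Median absolute deviation: robust to the odd spike inside the window.
            mad = statistics.median(abs(v - baseline) for v in self.history)
            spread = 1.4826 * mad or 1e-9  # ~= std dev for normal data; avoid /0
            is_anomaly = abs(value - baseline) / spread > self.z_cutoff
        self.history.append(value)
        return is_anomaly

# Synthetic usage: steady revenue per minute, then a sudden ~45% drop.
detector = RollingAnomalyDetector()
samples = [100.0 + (i % 5) for i in range(60)] + [55.0]
for minute, revenue in enumerate(samples):
    if detector.observe(revenue):
        print(f"minute {minute}: possible business-impact anomaly (revenue={revenue})")
```

Only once something like this fires would I go digging into the system metrics for the cause.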

Curious to hear: how much does your org prioritize monitoring business metrics (not just system metrics)? If you do monitor them, what tools do you use?

Comments (3)

nchinmay · 3d ago
We had this very issue: a bad configuration change (human error) caused a large, sudden drop in revenue and in our streaming ad event metrics. This is a real-time adtech system, where a delay in detecting sudden changes in business metrics means monetary impact and visible degradation of the customer experience. In this case the major revenue drop was found and addressed immediately, but not all of the alerts we expected actually fired. The static threshold on our streaming ad events metric was set too low; it was appropriate at an earlier stage of the business, but as we've grown, a drop of this size no longer crosses it, so the alert I would have expected to fire first never went off. We do have sophisticated metrics instrumentation and alerting, but effective anomaly detection on sudden upticks/downticks in business metrics, one that stays aware of the underlying trends evolving organically with the business, would be a game changer.

Larger, incident-worthy changes in metrics are also easier to set static thresholds around, and they ring more than one bell when they occur. I'm more concerned about small-to-mid deviations from the trend, say a sudden ±10% change in a business metric over X minutes. Can I reliably set a static threshold that will be universally appropriate there? A good anomaly detector would ideally surface something like that without hard-coded alert configs.
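
Something roughly like the sketch below is what I have in mind (not what we run today; the window sizes and the ±10% band are placeholders):

```python
# Rough sketch of a trend-aware check (parameters are placeholders): compare a
# short-window average of the metric to a slowly adapting EWMA baseline and
# flag relative moves beyond +/-10%, so the effective threshold scales with
# the business instead of being hard-coded.
from collections import deque

class RelativeChangeDetector:
    def __init__(self, short_window=10, baseline_alpha=0.05, max_rel_change=0.10):
        self.recent = deque(maxlen=short_window)  # the "last X minutes"
        self.baseline = None                      # long-term EWMA of the metric
        self.baseline_alpha = baseline_alpha      # small alpha => baseline drifts slowly
        self.max_rel_change = max_rel_change      # 0.10 == +/-10%

    def observe(self, value):
        """Return the relative change vs. baseline when it exceeds the band, else None.
        Assumes a positive-valued metric (events/min, revenue/min, ...)."""
        self.recent.append(value)
        if self.baseline is None:
            self.baseline = value
            return None
        short_avg = sum(self.recent) / len(self.recent)
        rel_change = (short_avg - self.baseline) / self.baseline
        # Let the baseline track organic growth; a sudden shift still stands out
        # because the baseline moves much more slowly than the short window.
        self.baseline += self.baseline_alpha * (value - self.baseline)
        return rel_change if abs(rel_change) > self.max_rel_change else None

# Synthetic usage: ad events per minute grow organically, then drop ~30%.
detector = RelativeChangeDetector()
traffic = [1000 + 2 * i for i in range(500)] + [1400.0] * 10
for minute, events in enumerate(traffic):
    change = detector.observe(events)
    if change is not None:
        print(f"minute {minute}: {change:+.1%} vs. baseline")
```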

poobear22 · 3d ago
I managed the system administrators for a high-performance computing center. We took a lot of blame for the applications when, in reality, it was often poor programming on the developers' part. I got tired of taking the blame and implemented statistical process control to track the mean time between failures of the jobs. I was really just shining a flashlight on production jobs and hoping it could change the culture; it wasn't my job to fix their code, and the applications were developed by a different group of people with a very different culture.

I thought the process control worked really well. It let me deflect the random blame aimed at my team, respond with "your job is failing XX times per year," and push from there for a root cause analysis. But pushing against that culture was really hard, and there was a lot of "set the job to complete and I will look at it on Monday." If they don't want to do root cause analysis on the failure modes of their code, there isn't much I can do. So even good monitoring can have little effect if the people who need to fix something don't support the culture.

As I read your post, I'd have expected people to watch these business metrics a little more closely, or to develop more sensitive metrics to catch these issues.
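
For anyone who hasn't used SPC, the flavor of what I applied looks roughly like the sketch below. This isn't my actual implementation, just a simple c-chart on per-period failure counts to show the idea.

```python
# A minimal illustration (not the actual implementation) of one common SPC tool:
# a c-chart on per-period failure counts. The center line is the mean count
# c_bar; control limits sit at c_bar +/- 3*sqrt(c_bar) (Poisson assumption).
# Periods outside the limits suggest special-cause variation worth a root cause
# analysis rather than routine noise. Limits are computed over all periods here
# for simplicity; in practice you'd use an in-control baseline period.
import math

def c_chart(failure_counts):
    """Return (center, lcl, ucl, flagged_period_indices) for per-period counts."""
    c_bar = sum(failure_counts) / len(failure_counts)
    ucl = c_bar + 3 * math.sqrt(c_bar)
    lcl = max(0.0, c_bar - 3 * math.sqrt(c_bar))
    flagged = [i for i, c in enumerate(failure_counts) if c > ucl or c < lcl]
    return c_bar, lcl, ucl, flagged

# Example: weekly failure counts for one production job; week 10 spikes.
weekly_failures = [2, 3, 1, 4, 2, 3, 2, 1, 3, 2, 12, 3]
center, lcl, ucl, flagged = c_chart(weekly_failures)
print(f"mean {center:.1f}/week, limits [{lcl:.1f}, {ucl:.1f}], special-cause weeks: {flagged}")
```
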
chipfixer · 3d ago
Yup, no amount or type of anomaly detection can fix the culture. That said, in this case maybe part of why it was so hard is that the devs weren't the ones who owned what the job did in production?