Ask HN: How are you evaluating your LLMs in production?

Comments (1)

znpy · 7h ago

Sysadmin here ("cloud engineer" is what's in my contract).

> Which tools do you use to evaluate your LLMs and agents in production?

None for my work. I still use LLMs from time to time to generate boring Terraform code or boring SQL queries, but I'm essentially not going to let some AI bs near the infrastructure I curate.

It's all fun and games until prod is down, or the cloud bill is 10x the previous month's bill (or both).

So unless I can blame it on the AI and take no responsibility, I'm not going to let anything AI-powered near production.
