12-factor Agents: Patterns of reliable LLM applications

I've been building AI agents for a while. After trying every framework out there and talking to many founders building with AI, I've noticed something interesting: most "AI Agents" that make it to production aren't actually that agentic. The best ones are mostly just well-engineered software with LLMs sprinkled in at key points.

So I set out to document what I've learned about building production-grade AI systems: https://github.com/humanlayer/12-factor-agents. It's a set of principles for building LLM-powered software that's reliable enough to put in the hands of production customers.

In the spirit of Heroku's 12 Factor Apps (https://12factor.net/), these principles focus on the engineering practices that make LLM applications more reliable, scalable, and maintainable. Even as models get exponentially more powerful, these core techniques will remain valuable.

I've seen many SaaS builders try to pivot towards AI by building greenfield new projects on agent frameworks, only to find that they couldn't get things past the 70-80% reliability bar with out-of-the-box tools. The ones that did succeed tended to take small, modular concepts from agent building, and incorporate them into their existing product, rather than starting from scratch.

The full guide goes into detail on each principle with examples and patterns to follow. I've seen these practices work well in production systems handling real user traffic.

I'm sharing this as a starting point—the field is moving quickly so these principles will evolve. I welcome your feedback and contributions to help figure out what "production grade" means for AI systems!

Comments (78)

mgdev · 79d ago

These are great. I had my own list of takeaways [0] after doing this for a couple years, though I wouldn't go so far as calling mine factors.

Like you, biggest one I didn't include but would now is to own the lowest level planning loop. It's fine to have some dynamic planning, but you should own an OODA loop (observe, orient, decide, act) and have heuristics for determining if you're converging on a solution (e.g. scoring), or else breaking out (e.g. max loops).
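A minimal sketch of what owning that loop could look like. Nothing here comes from the guide or the linked post: `propose` stands in for the model call, `score` for whatever convergence heuristic you trust, and the thresholds are made-up defaults.

```python
from typing import Any, Callable, Optional, Tuple


def run_loop(
    propose: Callable[[Optional[Any]], Any],  # hypothetical: wraps the model call
    score: Callable[[Any], float],            # hypothetical: deterministic scoring heuristic
    max_loops: int = 5,
    target: float = 0.9,
    min_gain: float = 0.01,
) -> Tuple[Any, float, str]:
    """Own the outer loop: act, observe a score, decide whether to continue."""
    best, best_score = None, float("-inf")
    for _ in range(max_loops):
        candidate = propose(best)          # act
        current = score(candidate)         # observe / orient
        gain = current - best_score
        if current > best_score:
            best, best_score = candidate, current
        if best_score >= target:           # decide: good enough
            return best, best_score, "converged"
        if gain < min_gain:                # decide: no longer improving
            return best, best_score, "stalled"
    return best, best_score, "max_loops"   # break out: budget exhausted
```

The third element of the return value tells the caller why the loop ended, so "converged", "stalled" and "ran out of budget" can be handled differently.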

I would also potentially bake in a workflow engine. Then, have your model build a workflow specification that runs on that engine (where workflow steps may call back to the model) instead of trying to keep an implicit workflow valid/progressing through multiple turns in the model.
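One way to read the workflow-engine idea, as a rough sketch: the model emits a JSON list of steps, the engine validates it against a fixed registry, then runs it deterministically. The step names (`fetch`, `upper`) and the spec shape are invented for illustration, not taken from any real engine.

```python
import json
from typing import Any, Callable, Dict

State = Dict[str, Any]

# Hypothetical step registry: the only operations a model-written spec may use.
STEPS: Dict[str, Callable[[State, Any], State]] = {
    "fetch": lambda state, arg: {**state, "text": f"contents of {arg}"},
    "upper": lambda state, arg: {**state, "text": state["text"].upper()},
}


def validate(spec: Any) -> None:
    """Reject any spec that is not a non-empty list of known steps."""
    if not isinstance(spec, list) or not spec:
        raise ValueError("spec must be a non-empty list of steps")
    for step in spec:
        if not isinstance(step, dict) or step.get("op") not in STEPS:
            raise ValueError(f"unknown or malformed step: {step!r}")


def run_workflow(spec_json: str) -> State:
    """Parse a model-produced spec, validate it, then execute it step by step."""
    spec = json.loads(spec_json)
    validate(spec)                      # fail before running anything
    state: State = {}
    for step in spec:
        state = STEPS[step["op"]](state, step.get("arg"))
    return state
```

The workflow lives in an explicit, inspectable spec instead of being implied across several model turns, and a bad spec fails validation before any step runs.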

[0]: https://mg.dev/lessons-learned-building-ai-agents/

dhorthy · 79d ago

this guide is great, i liked the "chat interfaces are dumb" take - totally agree. AI-based UIs have a very long way to go

hhimanshu · 79d ago

I am wondering how libraries like DSPY [0] fit into your factor-2 [1]

As I was reading, I saw a mention of BAML:

> (the above example uses BAML to generate the prompt ...

Personally, in my experience hand-writing prompts for extracting structured information from unstructured data has never been easy. With DSPY, my experience has been quite good so far.

As you have used raw prompt from BAML, what do you think of using the raw prompts from DSPY [2]?

[0] https://dspy.ai/

[1] https://github.com/humanlayer/12-factor-agents/blob/main/con...

[2] https://dspy.ai/tutorials/observability/#using-inspect_histo...

dhorthy · 79d ago

interesting - I think I have to side with the Boundary (YC W23) folks on this one - if you want bleeding edge performance, you need to be able to open the box and hack on the insides.

I don't agree fully with this article https://www.chrismdp.com/beyond-prompting/ but the comparison of punchcards -> assembly -> c -> higher langs is quite useful here

I just don't know when we'll get the right abstraction - i don't think langchain or dspy are the "C programming language" of AI yet (they could get there!).

For now I'll stick to my "close to the metal" workbench where I can inspect tokens, reorder special tokens like system/user/JSON, and dynamically keep up with the idiosyncrasies of new models without being locked up waiting for library support.

chrismdp · 79d ago

It's always true that you need to drop down a level of abstraction in order to extract the ultimate performance. (eg I wrote a decent-sized game + engine entirely in C about 10 years ago and played with SIMD vectors to optimise the render loop)

However, I think the vast majority of use cases will not require this level of control, and we will abandon prompts once the tools improve.

Langchain and DSPY aren't there for me either - I think the whole idea of prompting + evals needs a rethink.

(full disclaimer: I'm working on such a tool right now!)

dhorthy · 79d ago

i'd be interested to check it out

here's a take, I adapted this from someone on the notebookLM team on swyx's podcast

> the only way to build really impressive experiences in AI, is to find something right at the edge of the model's capability, and to get it right consistently.

So in order to build something very good / better than the rest, you will always benefit from being able to bring in every optimization you can.

chrismdp · 79d ago

I think the building blocks of the most impressive experiences will come from choosing the exact right point to involve an LLM, the orchestration of the component pieces, and the user experience.

That's certainly what I found in games. The games which felt magic to play were never the ones with the best hand rolled engine.

The tools aren't there yet to ignore prompts, and you'll always need to drop down to raw prompting sometimes. I'm looking forward to a future where wrangling prompts is only needed for 1% of my system.

dhorthy · 79d ago

yeah. the issue is when you're baked into a tool stack/framework where you can't go customize in that 1% of cases. A lot of tools try to get the right abstractions where you can "customize everything you would want to" but they miss the mark in some cases

chrismdp · 79d ago

100%. You can't and shouldn't wrap every interaction. We need a new approach.

britannio · 74d ago

looking forward to the new tool

daxfohl · 80d ago

This old obscure blog post about framework patterns has resonated with me throughout my career and I think it applies here too. LLMs are best used as "libraries" rather than "frameworks", for all the reasons described in the article and more, especially now while everything is in such flux. "Frameworks" are sexier and easier to sell though, and lead to lock-in and add-on services, so that's what gets promoted.

https://tomasp.net/blog/2015/library-frameworks/

saadatq · 79d ago

This is so good…

“… you can find frameworks not just in software, but also in ordinary life. If you buy package holidays, you're buying a framework - they transport you to some place, put you in a hotel, feed you and your activities have to fit into the shape provided by the framework (say, go into the pool and swim there). If you travel independently, you are composing libraries. You have to book your flights, find your accommodation and arrange your program (all using different libraries). It is more work, but you are in control - and you can arrange things exactly the way you need.”

daxfohl · 79d ago

My favorite blog post / presentation is Sandi Metz "The Wrong Abstraction", but this one is up there. Definitely punches above its weight for a small obscure post.

dhorthy · 79d ago

Yeah duplication way better than the wrong abstraction. Just write the dang switch statement

daxfohl · 79d ago

> switch

Hmm, that hit a bit of a nerve. My experience with switch blocks is it can be a gateway drug for teams A, B, C to add their special-case code to team D's repo within a `switch(calling_service)` block. My read of the presentation is more, factor your stuff so that any "switch" is a higher level concern that consumers can do in their own services. Then if you start to see all your consumers write very similar consumption logic, then start thinking about how to pull that down into the library/service itself.

But beyond that triggered nerve, agreed.

dhorthy · 79d ago

fair enough. I think switch statement is a broader category for "basic programming primitives that you should just do yourself" -

agree big switch statements can be an anti-pattern, e.g. when an interface is clearly better suited

dhorthy · 80d ago

oh heck yeah this rocks. I'm gonna add to the links section

daxfohl · 80d ago

Additionally in terms of career development, you're going to be a lot better off learning the low level LLM interfaces rather than being dependent on a framework (or their even more evil cousin, platforms). Once you learn those, jumping to a platform is usually trivial, whereas the reverse can be more challenging. Junior devs often think that the more frameworks they have on their resume the better, but it often pigeonholes you more than it helps.

And I don't mean to imply that frameworks are always bad. Things like security best practices out of the box can be worth it. But especially in AI right now, nobody knows what those best practices are going to be. So it's best to spend this time learning how to do things at a low level rather than attaching to some framework that may be obsolete in a year.

dhorthy · 80d ago

exactly - we keep trying to figure out the right interfaces, but we jump to assume that we know what they are.

If we had the right interface, we would set up the black box, and then put holes/knobs on the box to allow anyone to change the things they should actually need to change.

if we have the wrong interface, then the knobs aren't interesting, and instead we keep ending up opening the box, or reaching into the holes at weird angles to do things that nobody knew we'd want to, but that are obviously the right things to do to maximize performance

someday we'll have the right interface, but for now, better to skip the box and do the extra cycles. You're an engineer, you can write a for loop and a switch statement. don't outsource your prompts and give up control flow to save a few hundred lines that will eventually become pretty customized anyway
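For a sense of scale, here is roughly what "a for loop and a switch statement" can look like. It is a toy sketch: `next_step` stands in for an LLM call that returns a structured step, and the intent names (`add`, `ask_human`, `done`) are invented for the example.

```python
from typing import Any, Callable, Dict, List

Step = Dict[str, Any]


def agent_loop(next_step: Callable[[List[Step]], Step], max_turns: int = 10) -> Dict[str, Any]:
    """Plain loop plus branching: the model picks the step, our code owns the flow."""
    context: List[Step] = []
    for _ in range(max_turns):
        step = next_step(context)        # hypothetical: LLM call returning structured output
        intent = step.get("intent")
        if intent == "add":              # a deterministic tool we run ourselves
            context.append({"result": step["a"] + step["b"]})
        elif intent == "ask_human":      # pause and hand control back to the caller
            return {"status": "paused", "question": step["question"], "context": context}
        elif intent == "done":
            return {"status": "done", "answer": step["answer"], "context": context}
        else:                            # feed the error back instead of crashing
            context.append({"error": f"unknown intent {intent!r}"})
    return {"status": "max_turns", "context": context}
```

Every branch is ordinary code, so adding a pause, an approval step, or a retry is a local change, not a fight with a framework.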

pancsta · 80d ago

Very informative wiki, thank you, I will definitely use it. So I've made my own "AI Agents framework" [0] based on actor model, state machines and aspect oriented programming (released just yesterday, no HN post yet) and I really like points 5 and 8:

    5. Unify execution state and business state
    8. Own your control flow

That is exactly what SecAI does, as it's a graph control flow library at its core (multigraph instead of DAG) and LLM calls are embedded into the graph's nodes. The flow is reinforced with negotiation, cancellation and stateful relations, which make it more "organic". Another thing often missed by other frameworks are dedicated devtools (dbg, repl, svg) - programming for failure, inspecting every step in detail, automatic data exporters (metrics, traces, logs, sql), and dead-simple integrations (bash). I've released the first tech demo [1] which showcases all the devtools using a reference implementation of deepresearch (ported from AtomicAgents). You may especially like the Send/Stop button, which is nothing else than "Factor 6. Launch/Pause/Resume with simple APIs". Oh and it's network transparent, so it can scale.

Feel free to reach out.

[0] https://github.com/pancsta/secai

[1] https://youtu.be/0VJzO1S-gV0

serverlessmania · 79d ago

"Another thing often missed by other frameworks are dedicated devtools"

From my experience, PydanticAI really nailed it with Logfire—debugging[0] agents was significantly easier and more effective compared to the other frameworks and libraries I tested.

[0] https://ai.pydantic.dev/logfire/#pydantic-logfire

pancsta · 76d ago

Logfire is a tracing app, an equivalent of Jaeger and other Otel UIs. While I won't discuss reimplementation-vs-integration in this case, traces are just one way of debugging. am-dbg focuses on debugging of the state consensus, instead of the execution tree, without requiring a SaaS account.

Execution trees are enough for workflows, but bots/agents aren't simple workflows.

dhorthy · 80d ago

i like the terminal UI and otel integrations - what tasks are you using this for today?

pancsta · 80d ago

Thanks, terminal UI is an important design choice - it's fast, cheap, and runs everywhere (like the web via wasm / ssh, or on iphones with touch). The LLM layer is still fresh, and I personally use it for web scraping, but the underlying workflow engine is quite mature and ubiquitous - it was used for sync engines, UIs, daemons, network services. It shines when it faces complexity, nondeterminism, and retry logic - the more chaotic the flow is, the bigger the gains.

The approach is to shape behavior from chaos by exclusion, instead of defining all possible transitions. With LLMs, this process could be automated and effectively an agent would be dynamically creating itself using a DSL (state schema and predefined states). The great thing about LLMs is being charged by tokens instead of a number of requests. We can just interrogate them about every detail separately and build a flow graph with transparent (and debuggable) reasoning. I also have API sketches for proactive scenarios (originally made for an ML prototype) [0].

[0] https://github.com/pancsta/secai/blob/474433796c5ffbc7ec5744...

wfn · 79d ago

This is great, thank you so much for sharing!

daxfohl · 80d ago

Another one: plan for cost at scale.

These things aren't cheap at scale, so whenever something might be handled by a deterministic component, try that first. Not only does it save on hallucinations and latency, it could also make a huge difference to your bottom line.
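A small sketch of the "deterministic first" idea, using a made-up task (pulling an order number out of a support message). The regex and the `llm_fallback` callable are assumptions for the example. The point is only that the model is called when the cheap path fails.

```python
import re
from typing import Callable, Tuple

# Hypothetical pattern for messages like "order #12345" or "Order 987".
ORDER_RE = re.compile(r"order\s*#?\s*(\d+)", re.IGNORECASE)


def extract_order_id(message: str, llm_fallback: Callable[[str], str]) -> Tuple[str, str]:
    """Try the cheap deterministic path first; call the model only on a miss."""
    match = ORDER_RE.search(message)
    if match:
        return match.group(1), "regex"     # no tokens spent, no hallucination risk
    return llm_fallback(message), "llm"    # hypothetical model call for the hard cases
```

Returning which path produced the answer makes it easy to measure how much traffic actually reaches the model.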

dhorthy · 79d ago

Yeah definitely. I think the pattern I see people using most is “start with slow, expensive, but low dev effort, and then refine over time as you find speed/quality/cost bottlenecks worth investing in”

Manfred · 80d ago

I believe the principles would be easier to follow if there were a consistent narrative through the factors, by which I mean using a potentially real-world example of such a system.

dhorthy · 80d ago

This is a great bit of feedback - what kind of use cases do you think would make sense?

Definitely wanna evolve this in the open with the community

hhimanshu · 79d ago

Maybe you could pick a real-world agent workflow (take one from your production experience and trim it down to a toy) and showcase how all these factors come together in a project.

I am inspired by the simplicity of these 12 factors and definitely want to learn more with an example that embraces these factors.

dhorthy · 79d ago

I link in a few places to https://github.com/got-agents/agents where I have a few of these real agents

hhimanshu · 79d ago

Thank you, I will take a look

Manfred · 78d ago

I don’t have any experience in that area so I can’t really suggest anything.

glial · 80d ago

This is great -- and I have learned 80% the hard way. The other 20% will be valuable reading!

Personally I've had success with LangGraph + pydantic schemas. Curious to know what others have found useful.

dhorthy · 80d ago

funny you say

> I have learned 80% the hard way

because the other working title for this was "Agents the Hard Way" (in the spirit of https://github.com/kelseyhightower/kubernetes-the-hard-way)

wfn · 79d ago

This could not have come at a better time for me, thank you!

I've been tinkering with an idea for an audiovisual sandbox[1] (like vvvv[2] but much simpler of course, barebones).

Idea is to have a way to insert LM (or some simple locally run neural net) "nodes" which are given specific tasks and whose output is expected to be very constrained. Hence your example:

    "question -> answer: float"

Is very attractive here. Of course, some questions in my case would be quite abstract, but anyway. Also, multistage pipelines are also very interesting.

[1]: loose set of bulletpoints brainstorming the idea if curious, not organised: https://kfs.mkj.lt/#audiovisllm (click to expand description)

[2]: https://vvvv.org/

dhorthy · 79d ago

Typed outputs from an LLM is a game changer!
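As a rough illustration of what a typed output buys you for a `question -> answer: float` node: the raw model text is either turned into a real float or rejected before anything downstream sees it. This is a hand-rolled sketch with the standard library, not how DSPy or BAML implement it, and `parse_float_answer` is a made-up name.

```python
import math


def parse_float_answer(raw: str) -> float:
    """Coerce raw model text into a finite float, or fail loudly."""
    text = raw.strip()
    try:
        value = float(text)
    except ValueError:
        raise ValueError(f"expected a float, got {raw!r}") from None
    if not math.isfinite(value):           # float() also accepts "nan" and "inf"
        raise ValueError(f"expected a finite float, got {raw!r}")
    return value
```

A chatty reply such as "about 0.7" raises instead of silently flowing into the next node, which is what makes multistage pipelines debuggable.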

darepublic · 79d ago

I didn't really read this extensively but to me I would want to use as much deterministic code as possible and leverage the llm as little as possible. That to me is a better portent of predictable results, lower operational costs and is a signal that nobody could just quickly reproduce the same app. I would tend to roll my own tools and not use out of the box buzz word glue to integrate my llm with other systems. And if these conditions aren't met or aren't necessary I'd figure someone else could just vibe code the same solution in no time anyway. Keep control I say! Die on the hill of control! That's not to say I'm not impressed by LLMs.. quite the opposite

dhorthy · 79d ago

control is good, and determinism is good - while the primary goal is to convince people "don't give up too much control" - there is a secondary which is: THESE are the places where it makes sense to give up some control

ianbutler · 80d ago

Let's go! Super happy to see this make its way to HN front page.

mettamage · 79d ago

I've noticed some of these factors myself as well. I'd love to build more AI applications like this. Currently I'm a data analyst and they don't fully appreciate that I can build stuff like this as it is not a technology oriented company.

I'd love to work on stuff like this full-time. If anyone is interested in a chat, my email is on my profile (US/EU).

dhorthy · 79d ago

cool thing about open source is you can work on whatever you want, and it’s the best way to meet people who do similar work for their day job as well

DebtDeflation · 80d ago

> most "AI Agents" that make it to production aren't actually that agentic. The best ones are mostly just well-engineered software with LLMs sprinkled in at key points

I've been saying that forever, and I think that anyone who actually implements AI in an enterprise context has come to the same conclusion. Using the Anthropic vernacular, AI "workflows" are the solution 90% of the time and AI "agents" maybe 10%. But everyone wants the shiny new object on their CV and the LLM vendors want to bias the market in that direction because running LLMs in a loop drives token consumption through the roof.

film42 · 80d ago

Everyone wants to go the agent route until the agent messes up once after working 99 times in a row. "Why did it make a silly mistake?" We don't know. "Well, let's put a few more guard rails around it." Sounds good... back to "workflows."

film42 · 80d ago

"But what about having another agent that quality controls your first agent?"

You should watch the CDO-squared scene from the Big Short again.

dhorthy · 80d ago

THIS so much. People are like "why human supervision when we can have agent supervision" and I always respond

> look if you don't trust the LLM to make the thing right in the first place, how are you gonna trust PROBABLY THE SAME LLM to fix it?

yes I know multiple passes improves performance, but it doesn't guarantee anything. for a lot of tools you might wanna call, 90% or even 99% accuracy isn't enough
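Some back-of-the-envelope arithmetic behind that point, under a loudly simplifying assumption: every step succeeds or fails independently with the same probability. Real agents are not that tidy, so treat the numbers as intuition only. Both function names are made up.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Chance that n independent steps all succeed (independence is an assumption)."""
    return p_step ** n_steps


def residual_error(p_first: float, p_catch: float) -> float:
    """Error left after a reviewer pass that catches a fraction p_catch of mistakes."""
    return (1 - p_first) * (1 - p_catch)
```

With these assumptions, twenty steps at 99% each succeed end to end roughly 82% of the time, and ten steps at 90% roughly 35%. A second pass that catches 90% of a 90%-accurate first pass's mistakes still leaves about 1% error: better, but not a guarantee.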

dhorthy · 80d ago

Yup

daxfohl · 80d ago

I think it got started as AI tools for things like cancer detection based purely on deep learning started to outperform tools where humans guide the models what to look for. The expectation became that eventually this will happen for LLM agents too if only we can add more horsepower. But it seems like we've hit a bit of a ceiling there. The latest releases from OpenAI and Meta were largely duds despite their size, still very far from anything you'd trust for anything important, and there's nothing left to add to their training corpus that isn't already there.

Of course a new breakthrough could happen any day and get through that ceiling. Or "common sense" may be something that's out of reach for a machine without life experience. Until that shakes out, I'd be reluctant to make any big bets on any AI-for-everything solutions.

musicale · 78d ago

> Or "common sense" may be something that's out of reach for a machine without life experience

Maybe Doug Lenat's idea of a common sense knowledge base wasn't such a bad one.

peab · 80d ago

I keep trying to tell my PM this

gusmally · 80d ago

I screenshot that comment to send to my PM.

dphuang2 · 75d ago

I am curious about the exceptions. Is *anybody* using an agent framework with large production usage? I suspect no, but curious to see if anybody on HN knows otherwise.

daxfohl · 80d ago

Also, "Don't lay off half your engineering department and try to replace with LLMs"

dhorthy · 79d ago

i would accept a PR to add this to the bonus section

daxfohl · 79d ago

Haha, nah, I'd not want to devalue your repo by polluting it with silly HN snark.

nickenbank · 79d ago

I totally agree with this. Most, if not all, frameworks for building agents are a waste of time

dhorthy · 79d ago

this guy gets it

silasb · 80d ago

This isn't specific to 12-factor, but with any of these agents and solutions, how is LLM Ops being handled? Also, what's the testing strategy, and how do I make sure that I don't cause regressions?

dhorthy · 80d ago

i try not to take a hard stance on any tool or framework - the idea is take control of the building blocks, and you can still bring most of the cool LLM ops / LLM observability techniques to bear.

I could see one of the twelve factors being around observability beyond just "what's the context" - that may be a good thing to incorporate for version 1.1

hellovai · 79d ago

really cool to see BAML on here :) 100% align on so much of what you've said here. it's really about treating LLMs as functions.

dhorthy · 79d ago

excellent work on BAML and love it as a building block for agents

abhishek-iiit · 79d ago

Really curious and excited to know the experiences you faced at Heroku that led to the formulation of these 12 principles

sps44 · 80d ago

Very good and useful summary, thank you!

AbhishekParmar · 79d ago

would feel blessed if someone dropped something similar but for image generation agents. Been trying to build consistent image/video generation agents and god are they unreliable

mertleee · 80d ago

What are your favorite open source "frameworks" for agents?

jlaneve · 80d ago

I've been most impressed with Pydantic AI [1], so much so that we ended up building an SDK around it specifically for LLM workflows on Airflow [2].

[1] https://ai.pydantic.dev

[2] https://github.com/astronomer/airflow-ai-sdk

dhorthy · 80d ago

i have seen a ton of good ones, and they all have ups and downs. I think rather than focusing on frameworks though, I'm trying to dig into what goes into them, and what's the tradeoff if you try to build most of it yourself instead

but since you asked, to name a few

- ts: mastra, gensx, vercel ai, many others!
- python: crew, langgraph, many others!

shmoogy · 80d ago

I'm currently using agno after seeing Google and OpenAI both chose pretty much the same syntax for their agent SDKs. So far so good

deadbabe · 80d ago

With all this AI-agent bullshit out there these days, the most useful AI-agent I still use in daily life is the humble floor vacuum/mopping robot.

dhorthy · 80d ago

They kept telling me automation would do my chores so we could spend more time on writing and art. I write less and still have to do my own laundry

mikedelfino · 79d ago

The irony is that much of the writing and art have indeed been automated.

flkenosad · 80d ago

HN comments are writing :)

notfed · 80d ago

Meh, don't need AI for that. I'll be impressed when it can do my laundry.

musicale · 79d ago

> reliable LLM applications

add that to the list of contradictory phrases (jumbo shrimp, etc.)

pancsta · 79d ago

Can you successfully transfer data over unreliable connections? LLM is just a misbehaving DB, once you pin it down the right way and lower your expectations, then "reliable LLM applications" are definitely possible. But if we go yolo with regexp-like-intelligence, then...

musicale · 78d ago

> Can you successfully transfer data over unreliable connections?

Validating LLM output is probably not as easy as computing a checksum or CRC.
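To make the analogy concrete, here is about the closest thing to a checksum-and-retransmit loop you can write for a model call: validate the reply's shape and retry with the error fed back. It is only a sketch (`call_model` is a hypothetical callable, and the expected shape is invented), and it checks structure only. Whether the answer is actually right is the part with no CRC equivalent.

```python
import json
from typing import Any, Callable, Dict, List


def validate_reply(raw: str) -> Dict[str, Any]:
    """Structural check only: a JSON object with a string 'answer' field."""
    data = json.loads(raw)                 # JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        raise ValueError("reply must be a JSON object with a string 'answer'")
    return data


def call_with_retries(
    call_model: Callable[[List[str]], str],  # hypothetical: receives prior error messages
    max_attempts: int = 3,
) -> Dict[str, Any]:
    """Retry until the reply validates, passing earlier errors back to the model."""
    errors: List[str] = []
    for _ in range(max_attempts):
        raw = call_model(errors)
        try:
            return validate_reply(raw)
        except ValueError as exc:
            errors.append(str(exc))
    raise RuntimeError(f"no valid reply after {max_attempts} attempts: {errors}")
```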

dhorthy · 78d ago

*probably :)

dhorthy · 79d ago

it can be done! I believe!
