Launch HN: Trigger.dev (YC W23) – Open-source platform to build reliable AI apps
We provide everything needed to create production-grade agents in your codebase and to deploy, run, monitor, and debug them. You can use just our primitives or combine them with tools like Mastra, LangChain, and the Vercel AI SDK. You can self-host or use our cloud, where we take care of scaling for you. Here’s a quick demo: https://youtu.be/kFCzKE89LD8
We started in 2023 as a way to reliably run async background jobs/workflows in TypeScript (https://news.ycombinator.com/item?id=34610686). Initially we didn’t deploy your code, we just orchestrated it. But we found that most developers struggled to write reliable code with implicit determinism, found breaking their work into small “steps” tricky, and wanted to be able to install whatever system packages they needed. Serverless timeouts made this even more painful.
We also wanted to let you wait for things to happen: external events, other tasks finishing, or just time passing. Those waits can take minutes, hours, or forever in the case of events, so you can’t just keep a server running.
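To give a feel for what that looks like in code, here’s a simplified sketch using the TypeScript SDK (helper names are from v3 and may differ slightly in newer versions):

    import { task, wait } from "@trigger.dev/sdk/v3";

    export const sendReminder = task({
      id: "send-reminder",
      run: async (payload: { email: string }) => {
        // Pause this run for a day; nothing keeps a server busy while waiting.
        await wait.for({ hours: 24 });
        // ...send the reminder email here...
        return { sentTo: payload.email };
      },
    });

    export const onboardUser = task({
      id: "onboard-user",
      run: async (payload: { email: string }) => {
        // Wait for another task to finish and use its result.
        const result = await sendReminder.triggerAndWait({ email: payload.email });
        return result;
      },
    });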
The solution was to build and operate our own serverless cloud infrastructure. The key breakthrough that enabled this was realizing we could snapshot the CPU and memory state. This allowed us to pause running code, store the snapshot, and restore it later on a different physical server. We currently use Checkpoint/Restore In Userspace (CRIU), which Google has been using at scale inside Borg since 2018.
Since then, our adoption has really taken off, especially because of AI agents/workflows. This has opened up a ton of new use cases like compute-heavy tasks such as generating videos using AI (Icon.com), real-time computer use (Scrapybara), AI enrichment pipelines (Pallet, Centralize), and vibe coding tools (Hero UI, Magic Patterns, Capy.ai).
You can get started with Trigger.dev cloud (https://cloud.trigger.dev), self-hosting (https://trigger.dev/docs/self-hosting/overview), or read the docs (https://trigger.dev/docs).
Here’s a sneak peek at some upcoming changes: 1) warm starts for self-hosting, and 2) switching to MicroVMs for execution – this will be open source, self-hostable, and will include checkpoint/restore.
We’re excited to be sharing this with HN and are open to all feedback!
However I do personally really dislike that everyone is either marketing themselves or has truly pivoted to AI agents…
This seems like a great platform to run any type of task.
We use it as an extension of our Node app, for all things asynchronous (long or short). The fact that it's the same codebase on our server and Trigger cloud is a huge plus.
For me, it's the most accessible incarnation of serverless. You can add it to your stack for one task and gradually use it for more and more tasks (long or short). Testing and local development are as easy as can be. The tooling is just right. No complex configurations. You can incrementally adopt the queuing, wait points, and batch triggers for more power.
We've had some issues migrating from v3 to v4. The transition felt rushed (some of the docs/examples still show v3 code that is deprecated in v4). I understand that it might take some time to update the docs and examples, because there is a lot of content.
Good tool, good tooling, congrats to the team!
https://news.ycombinator.com/item?id=37750763
Question: is a first-class Supabase/Postgres integration on the roadmap so we can (a) start Trigger jobs from SQL functions and (b) read job status via a foreign data wrapper? That "SQL-native job control (invoke from SQL, query from SQL)" path would make Trigger.dev feel native in Supabase apps.
Disclosure: I'm building pgflow, a Postgres-first workflow/background jobs layer for Supabase (https://pgflow.dev).
Listing Hero lets ecommerce brands generate consistent templated infographics, so I ended up reinventing all of these things by sharing data between Django, Celery processes, Prefect, and webhooks. Users can start multiple generations at the same time, they all run in parallel in Prefect, and realtime progress is visible in the frontend via webhooks.
I will try playing with Trigger next weekend and probably integrate it with a static stack like a Cloudflare Worker. Excited to try it out!
Let us know if there's anything we can do to make the product better for you
One thing I did notice though from looking through the examples is this:
> Uncaught errors automatically cause retries of tasks using your settings. Plus there are helpers for granular retrying inside your tasks.
This feels like one of those gotchas that is absolutely prone to a benign refactor causing a huge screwup, or at least someone will find they pinged a pay-for service 50x by accident without realising it.
Ergonomics like your await retry.onThrow helper feel like they should be the developer-friendly default "safe" approach rather than just an optional helper, though granted it's not as magic-feeling when you're trying to convert eyeballs into users.
When you set up your project you choose the default number of retries and back-off settings. Generally people don't go as high as 50, and they set up alerts for when runs fail. Then you can use the bulk replaying feature when things do go wrong, or if services you rely on have long outages.
I think on balance it is the correct behaviour.
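For reference, both levels look roughly like this (a simplified sketch based on the v3 SDK; exact option names may differ, and the endpoint is just a placeholder):

    import { task, retry } from "@trigger.dev/sdk/v3";

    export const syncCustomer = task({
      id: "sync-customer",
      // Default behaviour: an uncaught error retries the whole run,
      // up to maxAttempts, with exponential back-off.
      retry: { maxAttempts: 3, factor: 2, minTimeoutInMs: 1000, maxTimeoutInMs: 30000 },
      run: async (payload: { customerId: string }) => {
        // Granular retrying for just one risky call: only this block
        // is re-run if it throws, not the whole task.
        const res = await retry.onThrow(
          async () => {
            return fetch(`https://api.example.com/customers/${payload.customerId}`);
          },
          { maxAttempts: 3 }
        );
        return { status: res.status };
      },
    });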
Both of them are focused more on being workflow engines.
Temporal is a workflow engine – if you use their cloud product you still have to manage, scale, and deploy the compute.
With Temporal you need to write your code in a very specific way for it to work, including how you deal with the current time, randomness, process.env, setTimeout… This means you have to be careful using popular packages because they often use these common functions internally, or you need to wrap all of these calls in side effects or activities.
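For example, with Temporal's TypeScript SDK anything non-deterministic has to be pulled out of the workflow and into an activity, roughly like this (simplified, from memory; the function names are made up):

    // activities.ts – plain Node code, free to do anything
    export async function fetchQuote(symbol: string): Promise<number> {
      const res = await fetch(`https://api.example.com/quote/${symbol}`);
      return (await res.json()).price;
    }

    // workflow.ts – runs in Temporal's deterministic sandbox
    import { proxyActivities } from "@temporalio/workflow";
    import type * as activities from "./activities";

    const { fetchQuote } = proxyActivities<typeof activities>({
      startToCloseTimeout: "1 minute",
    });

    export async function priceAlert(symbol: string): Promise<number> {
      // Date.now(), Math.random(), process.env, setTimeout and plain fetch
      // can't be used directly here; they belong in activities (or the
      // workflow-safe equivalents the SDK provides).
      return await fetchQuote(symbol);
    }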
Restate is definitely simpler than Temporal, in a good way. You wrap any code that's non-deterministic in their helpers so it won't get executed twice. I don't think you can install system packages that you need, which has been surprisingly important for a lot of our users.
I haven't tried Trigger, planning to give it a spin this weekend!
Can you say more about "we found that most developers struggled to write reliable code with implicit determinism". What were some of the common mistakes you were seeing?
Trigger.dev is a queue and workflow engine but we also run compute. This makes some things possible which aren’t when you only control one side:
1. No timeouts – you can run code for as long as you need.
2. You don’t need to divide your work into steps. If you want, you can use multiple tasks.
3. You can install any system packages you need, like ffmpeg, Puppeteer etc. Depending on where you’re deploying, this can be a problem with other tools; there are maximum bundle sizes on a lot of platforms which are surprisingly easy to hit.
4. Atomic versioning. Each deploy of your tasks is separate, and any runs that have started will continue until finished, locked to that version of the code. This means you don’t need to think about versioning inside your code, which can become messy and error-prone.
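To make 1–3 concrete, a single long-running task that shells out to ffmpeg looks roughly like this (a sketch; it assumes ffmpeg is available in the deployed image):

    import { task } from "@trigger.dev/sdk/v3";
    import { execFile } from "node:child_process";
    import { promisify } from "node:util";

    const exec = promisify(execFile);

    export const transcodeVideo = task({
      id: "transcode-video",
      run: async (payload: { inputUrl: string; output: string }) => {
        // One plain async function: no step boundaries, no platform timeout
        // to design around, and the ffmpeg system binary is just there.
        await exec("ffmpeg", ["-i", payload.inputUrl, "-preset", "slow", payload.output]);
        return { output: payload.output };
      },
    });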
One other note is that we’re Apache 2.0.
- We're not really an agent framework, but more like an agent runtime that is agnostic to which framework you choose to run on our infra. We have lots of people running LangChain, Mastra, AI SDK, hand-rolled agents, etc. on top of us, since we are just a compute platform. We have the building blocks needed for running any kind of agent or AI workflow: the ability to run system packages (anything from Chrome to ffmpeg), long-running tasks (i.e. no timeouts), and realtime updates to your frontend (including streaming tokens). We also provide queues and concurrency limits for things like multitenant concurrency, observability built on OpenTelemetry, and schedules for ETL/ELT data work (including multitenant schedules).
- We are TS-first and believe the future of agents and AI applications will be won by TS devs.
- We have a deep integration with snapshotting, so code can be written in a natural way but still exhibit continuation-style behavior. For example, you can trigger another agent or task or tool to run (let's say an agent that specializes in browser use) and wait for the result as a tool call result. Instead of having to introduce a serialization boundary so you can stop compute while waiting and then rehydrate and resume through skipped "steps" or activities, we instead snapshot the process, kill it, and resume it later, continuing from the exact same process state as before. This is all handled under the hood and managed by us. We're currently using CRIU for this but will be moving to whole-VM snapshots with our MicroVM release.
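A rough sketch of that last point (simplified; the task names are made up and the result shape may differ slightly by SDK version):

    import { task } from "@trigger.dev/sdk/v3";

    // A specialised sub-agent, e.g. one that drives a browser.
    export const browserAgent = task({
      id: "browser-agent",
      run: async (payload: { goal: string }) => {
        // ...drive a headless browser, call an LLM, etc...
        return { findings: `what I found for: ${payload.goal}` };
      },
    });

    export const researchAgent = task({
      id: "research-agent",
      run: async (payload: { question: string }) => {
        // Looks like a normal await, but while the sub-agent runs the parent
        // process is snapshotted and stopped, then restored with the same
        // in-memory state when the result comes back.
        const result = await browserAgent.triggerAndWait({ goal: payload.question });
        if (!result.ok) throw new Error("browser agent failed");
        return { toolResult: result.output.findings };
      },
    });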
It's a core part of my business and a handful of side projects. Wish you and your team the best!
> use cases like compute-heavy tasks such as generating videos using AI (Icon.com), real-time computer use (Scrapybara), AI enrichment pipelines (Pallet, Centralize), and vibe coding tools (Hero UI, Magic Patterns, Capy.ai)
Okay, but aren't these websites using Trigger to schedule remarketing slop? Like adding you to Slack, sending you an email on day 1, sending you an email on day 7, etc... How exactly is it being used to power applications? You know what the difference is.
We don't use Trigger for marketing at all and I actually never thought of it for that use case.
We're an AI design tool - prompt to create an interactive mockup - and we use Trigger to take screenshots of designs to provide a preview image. Taking a screenshot sounds easy, but it's not, because Puppeteer constantly hits OOM errors. You need a high-end machine, so it can get expensive. We were originally using a homegrown solution, a microservice, but it would constantly crash (even though we were paying $$$$$ for it).
Trigger spinning up jobs was perfect; we migrated in a day and now I never think about it.