We built a tool that lets you augment LLM agents with visual capabilities — like OCR, object detection, and video editing — using just plain English. No need to write computer vision code.
Examples:
> “Blur all faces in this image and preview it.”
> “Extract the invoice ID, email, and totals from this invoice and overlay their locations.”
> "Redact all the sensitive data in this image, and preview the result."
> “Trim this video from 0:30 to 1:10 and add captions.”
It works with any MCP-compatible agent (Claude, OpenAI, Cursor, etc.) and turns natural language into visual AI workflows. No Python. No brittle CV pipelines. Just describe what you want, and your agent handles the rest.
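If you're curious what that looks like at the protocol level: under the hood it's standard MCP tool calls made by your agent, not code you write. Here's a rough sketch using the official `mcp` Python SDK; the server endpoint and the tool name/arguments are placeholders, so check our docs for the real ones.

    # Rough sketch of what an MCP-compatible agent does under the hood,
    # using the official `mcp` Python SDK. The server endpoint and the
    # tool name/arguments below are placeholders, not our actual API.
    import asyncio

    from mcp import ClientSession
    from mcp.client.sse import sse_client

    async def main():
        # Hypothetical endpoint for a hosted VLM Run MCP server.
        async with sse_client("https://mcp.vlm.run/sse") as (read, write):
            async with ClientSession(read, write) as session:
                await session.initialize()

                # The agent discovers the visual tools the server exposes...
                tools = await session.list_tools()
                print([t.name for t in tools.tools])

                # ...and calls one when the prompt needs it, e.g.
                # "Blur all faces in this image and preview it."
                result = await session.call_tool(
                    "blur_faces",  # placeholder tool name
                    arguments={"image_url": "https://example.com/photo.jpg"},
                )
                print(result.content)

    asyncio.run(main())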
We’d love feedback — especially from devs building LLM tools, agentic frameworks, or anything that needs visual understanding.
kernel33 · 18h ago
Are you running everything through a single end-to-end vision model, or do you dynamically dispatch to specialized OCR, detection, and segmentation backends?
fzysingularity · 17h ago
This demo showcases the latter approach with tool-calling - essentially filling in the gaps of current VLMs. That said, we're of course interested in folding all these capabilities into a single model, but that's going to take a bit more work.
What makes this approach interesting is that our VLMs need to be able to understand intermediate results (sometimes in the form of images themselves), and then delegate to other specialized tools whenever they can't perform a specific action.
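To make the delegation pattern concrete, here's a toy sketch (not our actual implementation; the planner logic and tool names are made up for illustration):

    # Toy illustration of the delegation loop described above, not our
    # production code. The planner VLM either answers directly or requests
    # one of the specialized tools; intermediate results (often images) are
    # fed back into the next planning step. All names here are hypothetical.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Step:
        tool: str | None          # None means the VLM can answer directly
        args: dict
        answer: str | None = None

    def run_ocr(args: dict) -> dict:        # stand-in for a specialized OCR backend
        return {"text": "INV-1234, total $56.78"}

    def detect_faces(args: dict) -> dict:   # stand-in for a face detector
        return {"boxes": [[10, 20, 64, 64]]}

    TOOLS: dict[str, Callable[[dict], dict]] = {
        "ocr": run_ocr,
        "detect_faces": detect_faces,
    }

    def plan(task: str, history: list[dict]) -> Step:
        # In reality this is a VLM call that looks at the task plus any
        # intermediate results (including images) and picks the next action.
        if not history:
            return Step(tool="ocr", args={"image": "invoice.png"})
        return Step(tool=None, args={}, answer=f"Extracted: {history[-1]['text']}")

    def run(task: str) -> str:
        history: list[dict] = []
        while True:
            step = plan(task, history)
            if step.tool is None:                         # VLM handled it itself
                return step.answer
            history.append(TOOLS[step.tool](step.args))   # delegate, then re-plan

    print(run("Extract the invoice ID and totals from this invoice."))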
Here's the full showcase / our docs:
[1] Colab showcase: https://colab.research.google.com/github/vlm-run/vlmrun-cook...
[2] MCP Intro / Docs: https://docs.vlm.run/mcp/introduction