Vector search on our codebase transformed our SDLC automation
Turning a user story into detailed documentation and actionable tasks is a critical part of software development, but doing it manually is slow and inconsistent. I wanted to see if I could streamline and elevate that process.
I know this is a hot space, with big players like GitHub and Atlassian building integrated AI, and startups offering specialized platforms. My goal wasn't to compete with them, but to see what was possible by building a custom, "glass box" solution using the best tools for each part of the job, without being locked into a single ecosystem.
What makes this approach different is flexibility and full control. Instead of a pre-packaged product, this is a resilient workflow built on Power Automate, which acts as the orchestrator for a sequence of API calls:
Five calls to the Gemini API for the core generation steps (requirements, tech spec, test strategy, etc.).
One call to an Azure OpenAI model to create vector embeddings of our codebase (there's a rough sketch of this call after the list).
One call to Azure AI Search to do the retrieval for Retrieval-Augmented Generation (RAG). This was the key to getting context-aware, non-generic outputs: it pulls in our actual code to inform the technical spec and tasks (also sketched below).
A bunch of direct calls to the Azure DevOps REST API (using a PAT) to create the wiki pages and work items, since the standard connectors were a bit limited (see the work-item sketch below).
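If it helps to see these outside of Power Automate, here's roughly what the embedding call looks like as a plain REST request in Python. The resource name, deployment name, and chunking are placeholders, not the exact values from the flow:

    import os
    import requests

    # Placeholder resource and deployment names -- substitute your own.
    AZURE_OPENAI_ENDPOINT = "https://my-resource.openai.azure.com"
    EMBEDDING_DEPLOYMENT = "text-embedding-ada-002"
    API_VERSION = "2023-05-15"

    def embed_chunks(chunks: list[str]) -> list[list[float]]:
        """Embed a batch of code chunks via the Azure OpenAI embeddings endpoint."""
        url = (f"{AZURE_OPENAI_ENDPOINT}/openai/deployments/"
               f"{EMBEDDING_DEPLOYMENT}/embeddings?api-version={API_VERSION}")
        resp = requests.post(
            url,
            headers={"api-key": os.environ["AZURE_OPENAI_KEY"]},
            json={"input": chunks},
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()["data"]
        # Each item carries an "index" field; sort so vectors match chunk order.
        return [d["embedding"] for d in sorted(data, key=lambda d: d["index"])]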
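The retrieval step is then a single vector query against Azure AI Search. Again a rough sketch: the index name and field names ("contentVector", "filePath", "content") are illustrative, not our real schema:

    import os
    import requests

    # Placeholder service, index, and field names -- adjust to your index schema.
    SEARCH_ENDPOINT = "https://my-search.search.windows.net"
    INDEX_NAME = "codebase-index"
    API_VERSION = "2023-11-01"

    def retrieve_context(query_vector: list[float], k: int = 5) -> list[dict]:
        """Run a vector query against Azure AI Search and return the top-k code chunks."""
        url = (f"{SEARCH_ENDPOINT}/indexes/{INDEX_NAME}/docs/search"
               f"?api-version={API_VERSION}")
        body = {
            "vectorQueries": [{
                "kind": "vector",
                "vector": query_vector,
                "fields": "contentVector",  # assumed vector field name
                "k": k,
            }],
            "select": "filePath,content",   # assumed stored fields
        }
        resp = requests.post(
            url,
            headers={"api-key": os.environ["AZURE_SEARCH_KEY"]},
            json=body,
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["value"]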
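And the work-item creation is a JSON Patch call to the Azure DevOps REST API, authenticated with a PAT in the password slot of basic auth. The org, project, and fields here are made up for illustration:

    import base64
    import os
    import requests

    # Placeholder org and project -- substitute your own.
    ORG = "my-org"
    PROJECT = "my-project"

    def create_task(title: str, description: str, parent_url: str | None = None) -> dict:
        """Create a Task work item via the Azure DevOps REST API using a PAT."""
        pat = os.environ["AZDO_PAT"]
        # The PAT goes in the password slot of basic auth; the username stays empty.
        auth = base64.b64encode(f":{pat}".encode()).decode()
        url = (f"https://dev.azure.com/{ORG}/{PROJECT}"
               f"/_apis/wit/workitems/$Task?api-version=7.0")
        ops = [
            {"op": "add", "path": "/fields/System.Title", "value": title},
            {"op": "add", "path": "/fields/System.Description", "value": description},
        ]
        if parent_url:
            # Optionally link the new task under its parent user story.
            ops.append({
                "op": "add",
                "path": "/relations/-",
                "value": {"rel": "System.LinkTypes.Hierarchy-Reverse", "url": parent_url},
            })
        resp = requests.post(
            url,
            headers={"Authorization": f"Basic {auth}",
                     "Content-Type": "application/json-patch+json"},
            json=ops,
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()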
The biggest challenge was moving beyond simple prompts and engineering a resilient system. Forcing the final output into a rigid JSON schema instead of parsing text was a game-changer for reliability.
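To make the JSON schema point concrete, this is roughly what a schema-constrained Gemini call looks like over REST. The schema below is a trimmed example, not the one the flow actually uses, and the model name is just a placeholder:

    import json
    import os
    import requests

    # A trimmed example schema for the generated task list -- not the real one.
    TASK_SCHEMA = {
        "type": "OBJECT",
        "properties": {
            "tasks": {
                "type": "ARRAY",
                "items": {
                    "type": "OBJECT",
                    "properties": {
                        "title": {"type": "STRING"},
                        "description": {"type": "STRING"},
                        "estimateHours": {"type": "NUMBER"},
                    },
                    "required": ["title", "description"],
                },
            },
        },
        "required": ["tasks"],
    }

    def generate_tasks(prompt: str) -> dict:
        """Ask Gemini for output constrained to TASK_SCHEMA instead of free text."""
        url = ("https://generativelanguage.googleapis.com/v1beta/models/"
               "gemini-1.5-pro:generateContent?key=" + os.environ["GEMINI_API_KEY"])
        body = {
            "contents": [{"parts": [{"text": prompt}]}],
            "generationConfig": {
                "responseMimeType": "application/json",
                "responseSchema": TASK_SCHEMA,
            },
        }
        resp = requests.post(url, json=body, timeout=60)
        resp.raise_for_status()
        # The structured JSON comes back as text in the first candidate part.
        text = resp.json()["candidates"][0]["content"]["parts"][0]["text"]
        return json.loads(text)

With the shape enforced up front, the downstream parsing steps in the flow never have to guess at free-form text.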
The result is a system that saves us hours on every story and produces remarkably consistent, high-quality documentation and tasks.
The full write-up with all the challenges, final prompts, and screenshots is in the linked blog post.
I’m here to answer any questions. Would love to hear your feedback and ideas!
You can build real, production-grade systems using LLMs, but these are the hard questions you have to answer.
Essentially, you've got a bunch of nerds generating code and believing that because it looks right, every other subject matter being output must also be correct.
It’s difficult to read posts that rely so heavily on AI-generated prose.
Everything’s a numbered/bulleted list and the same old turns of speech describe any scenario.
That aside, what’s really keeping this from being useful is the lack of results. How well does this approach work? Who knows. If the data is sensitive, seeing it work on an open-source repo would still be illuminating.
Also, we hear a lot elsewhere about the limitations of relying on embeddings for coding tools; it would be interesting to know how those limitations are overcome here.
Curious if you tried that: how much variation does the AI introduce, or does the grounding in the codebase and the prompts keep it focused and real?
No talking to those pesky people needed! I’m certain that an LLM would spit out a perfectly average spec acceptable to the average user.