The inanity of this issue’s text aside, the lack of task comprehensiveness in these models is obvious, and in my opinion isn’t something we should even expect from a nondeterministic system. I wouldn’t blindly trust anyone without some constraint checking: trust, but verify.
I have some interest in this area, and wonder if a “not-LLM-as-judge” that can extract/infer constraints from a task description (or get them from an operator) could be used to judge task completion. Conceptually similar to structured outputs. Maybe there’s a paper already…
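Something like this sketch (in Python) is what I have in mind; everything here is hypothetical, and in practice the constraints would be inferred from the task description or supplied by an operator rather than hardcoded:

    import re
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Constraint:
        description: str
        check: Callable[[str], bool]  # deterministic predicate over the model output

    # Hypothetical operator-supplied constraints for a task like
    # "summarize in at most 3 sentences and mention the deadline".
    constraints = [
        Constraint("at most 3 sentences",
                   lambda out: len([s for s in re.split(r"[.!?]+", out) if s.strip()]) <= 3),
        Constraint("mentions the deadline",
                   lambda out: "deadline" in out.lower()),
    ]

    def judge(output: str) -> list[str]:
        """Return descriptions of violated constraints; an empty list means pass."""
        return [c.description for c in constraints if not c.check(output)]

    model_output = "The report is nearly done. Final edits remain."
    print(judge(model_output))  # -> ['mentions the deadline']

The point being the judge itself is plain code, not another LLM, so it can’t hallucinate a pass.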
threecheese · 39m ago
I’ll admit though, I felt a bit of schadenfreude reading that thread, as a developer.