Cool project! I have a couple of questions that would be nice to see answered in the writeup:
* How did you generate your example problems? Did you take an existing benchmark? Or did you have LLMs generate the problems?
* Have you given any thought to adding a second "base programming language" to alter? I'm not sure there's enough variation as it stands. (Another option would be to generate 4 or 5 new languages, each quite different, and then run the benchmark on each of them. I'm not sure how much it matters that the language is randomly generated each time.)
But overall, a clever idea!
chromaton · 13h ago
Generating the problems: I just thought up a few simple things that the computer might be able to do. In the future, I hope to expand to more complex problems, based upon common business situations: reading CSVs, parsing data, etc. I'll probably add new tests once I get multi-shot and reliability working correctly.
New base programming languages would be great, but what would be even better is some sort of meta-language where many features can be turned on or off, rather than just scrambling the keywords like I do now.
I did some vibe testing with a current frontier model, and it gets quite confused and keeps insisting that there's a control structure that definitely doesn't exist in the TiānshūBench language with seed=1.
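For anyone wondering what "scrambling the keywords" with a seed might look like, here is a minimal sketch. It is not the actual TiānshūBench code: the keyword list, the function names `make_keyword_map` and `scramble_source`, and the 6-letter nonsense words are all assumptions made up for illustration.

```python
import random
import re

# Hypothetical keyword set; the real base-language keyword list is not shown in this thread.
BASE_KEYWORDS = ["if", "else", "while", "for", "function", "return"]


def make_keyword_map(seed, keywords=BASE_KEYWORDS, length=6):
    """Map each keyword to a random nonsense word, reproducibly for a given seed."""
    rng = random.Random(seed)  # private RNG so the global random state is untouched
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    mapping = {}
    used = set()
    for kw in keywords:
        # Redraw until the word is unused and does not collide with a real keyword.
        while True:
            word = "".join(rng.choice(alphabet) for _ in range(length))
            if word not in used and word not in keywords:
                break
        used.add(word)
        mapping[kw] = word
    return mapping


def scramble_source(source, mapping):
    """Replace whole-word keyword occurrences in source with their scrambled forms."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], source)


if __name__ == "__main__":
    mapping = make_keyword_map(seed=1)
    print(scramble_source("if x return y else return z", mapping))
```

The same seed always yields the same mapping, so a given language variant can be regenerated for reruns. This naive version also rewrites keywords inside string literals and comments; a real implementation would presumably do the substitution at the lexer level instead.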
JSR_FDED · 1d ago
Would it be useful to generate Procedural, OOP and Functional variations of the problems?
chromaton · 13h ago
Yes, it would be fantastic to have more languages to test against. I picked the base language I did (Mamba) because it was easy to modify and integrate into Python.