This is important, more important than the title implies.
The study shows that 4o and Qwen exhibit the same behavior when fine-tuned to become 'evil coders': they often (though not always) become bad actors in other ways too, such as encouraging self-harm or other harmful actions.
Startlingly, they do not exhibit this behavior when trained on buggy code, only when trained on exploit code.
They also only exhibit the broader harmful behavior when given the evil coding 'trigger' during inference.
I'll just jump into interpretations here and opine that this implies something very interesting and sophisticated is going on inside these networks: the models seem to differentiate, in general, between 'harmful' and 'mistaken/poor quality' as concepts, and are amenable to being trained into being generally harmful.