+-------------+----------+-----------+-------+-------+-------+
| Task        | A22B-Ins | A22B      | K2    | Opus4 | Deeps |
+-------------+----------+-----------+-------+-------+-------+
| GPQA        | *77.5    | 62.9      | +75.1 | -74.9 | 68.4  |
| AIME25      | *70.3    | 24.7      | +49.5 | 33.9  | -46.6 |
| LiveCB_v6   | *51.8    | 32.9      | +48.9 | 44.6  | -45.2 |
| ArenaHard2  | *79.2    | -52.0     | +66.1 | 51.5  | 45.6  |
| BFCL_v3     | *70.9    | +68.0     | -65.2 | 60.1  | 64.7  |
+-------------+----------+-----------+-------+-------+-------+
They will release the thinking model later.
On the selected benchmarks, it beats Kimi.