> Grok-4 significantly underperformed compared to expectations. Many of its initial responses were extremely short, often consisting only of a final answer without explanation.
This is very weird. Given how verbose most models usually are, there must have been something wrong with the system prompt.

Also: Grok used 89,996 input tokens compared to 591,624 for o3 high. What kind of tokenizer compresses the input that much? I'd assume all the inputs are actually the same, since the math problems and instructions are identical; the only differences are the tokenizer and the system prompt, and neither seems like it could make up that gap. Is o3 using ~500k more tokens for its system prompt?
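As a quick sanity check on the tokenizer hypothesis, here is a minimal sketch using OpenAI's tiktoken library to count tokens for the same text under two different encodings. tiktoken does not ship xAI's Grok tokenizer, so `cl100k_base` stands in purely to illustrate how much encodings typically differ; the prompt text here is a made-up example, not the actual benchmark input.

```python
# Minimal sketch: token counts of identical input text under two
# tokenizers, via OpenAI's tiktoken. cl100k_base is a stand-in since
# xAI's tokenizer isn't publicly available through tiktoken.
import tiktoken

prompt = (
    "Solve the following problem. Put your final answer in \\boxed{}.\n"
    "Let n be the smallest positive integer such that ..."  # hypothetical problem text
) * 100  # repeat to get a meaningfully sized input

for name in ("o200k_base", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(prompt))} tokens")

# Counts under different encodings typically differ by tens of percent
# at most -- nowhere near the ~6.6x gap between 89,996 and 591,624
# tokens, so tokenizer choice alone can't explain it.
```

If tokenizer choice only moves the count by tens of percent, the remaining ~500k-token difference has to come from what was actually sent: the system prompt, retries, or how the harness accounts for input.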