> We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%.
I'm posting this because it seems very important both for users of LLMs and for developers integrating LLMs into their own products.
The fall-off in accuracy is far faster and steeper than I had imagined.
Someone should really make this an ongoing benchmark that evaluates new models as they are released. Better yet, this information should be included in every model's system card.