Our tests with accelerated Elastic Models suggest they break the task down by identifying features line by line, then combine those features using statistical patterns common in ASCII art.
If you think about it, the attention scores end up looking like a heatmap of the original image. In other words, the transformer builds an internal representation of the image, and if it can recognize images out of the box, it's acting as a kind of image classifier that outputs a token ID as the image class. The tests are, to be honest, trivial, but fun anyway :)
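The "attention as a heatmap" intuition can be illustrated with a toy, untrained single-head self-attention over a character-level ASCII-art input. This is a minimal sketch with random projection weights (not any real model's weights): because identical characters get identical embeddings, the attention matrix correlates with where the `*` pixels sit, and averaging the attention each token receives gives a grid roughly shaped like the original art.

```python
import numpy as np

# Hypothetical sketch: one self-attention head over an ASCII-art input.
# All weights are random and untrained; this only illustrates how attention
# scores over character tokens can mirror the spatial layout of the art.

np.random.seed(0)

ascii_art = [
    "  **  ",
    " *  * ",
    "  **  ",
]
tokens = list("".join(ascii_art))           # character-level tokens
d = 8                                       # toy embedding dimension

# Toy embeddings: the same character maps to the same vector,
# so positions holding '*' correlate with each other.
vocab = {ch: np.random.randn(d) for ch in set(tokens)}
X = np.stack([vocab[ch] for ch in tokens])  # (seq_len, d)

# Random query/key projections standing in for a trained head.
Wq, Wk = np.random.randn(d, d), np.random.randn(d, d)
scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)

# Row-wise softmax to get attention weights.
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

# Average attention each token receives, reshaped to the art's grid --
# printed as a crude text heatmap.
received = attn.mean(axis=0).reshape(len(ascii_art), -1)
for row in received:
    print(" ".join(f"{v:.2f}" for v in row))
```

A trained model's heads are of course far more structured than these random projections, but the reshape-back-to-grid trick is the same one you'd use to visualize real attention maps.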
Try Deepseek-Qwen-14B in our tutorial - it runs at 120 tok/s on an H100 and 40 tok/s on an L40S, up to 3x faster than the original implementation! It's fully free: get your API token and start!