Images are split into patches, and each patch is tokenized: the patch is projected into a feature dimension and then quantized. The tokenizer probably already contains CNN and/or attention layers. The issue is that the model cannot reason jointly about color and text in the tokenized space.
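Roughly, the tokenization step described above looks like the following minimal numpy sketch. The patch size, codebook, and random projection are illustrative stand-ins for a learned CNN/attention encoder and a trained codebook, not any specific model's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def tokenize_image(image, patch, codebook):
    """Split an image into patches, project to features, quantize to codebook ids."""
    H, W, C = image.shape
    # 1. Split into non-overlapping patch x patch tiles and flatten each one.
    patches = (image.reshape(H // patch, patch, W // patch, patch, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, patch * patch * C))
    # 2. Random linear projection into the feature dimension
    #    (stand-in for the learned CNN/attention encoder).
    proj = rng.normal(size=(patches.shape[1], codebook.shape[1]))
    feats = patches @ proj
    # 3. Quantize: nearest codebook vector by Euclidean distance,
    #    yielding one discrete token id per patch.
    dists = ((feats[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

codebook = rng.normal(size=(512, 64))   # 512 codes, 64-dim features (illustrative)
image = rng.random((32, 32, 3))         # toy 32x32 RGB image
tokens = tokenize_image(image, patch=4, codebook=codebook)
print(tokens.shape)                     # (64,) — one token id per 4x4 patch
```

The point of the sketch is that after step 3 the model only ever sees discrete ids, so fine-grained color and glyph detail within a patch is collapsed into whichever codebook entry is nearest.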
We ran about 1000 experiments — different prompting, tool calls out to a different model for recognition, and several other techniques. The results still hold. The paper is only a small part of the analysis. 🤷‍♂️