I unified convolution and attention into a single framework
umjunsik132 · 9/13/2025, 7:02:27 AM · zenodo.org
Comments (1)
umjunsik132 · 1h ago
Hi HN, author here.
For years, it bothered me that convolution (the king of vision) and matrix multiplication / self-attention (the engine of Transformers) were treated as completely separate, specialized tools. It felt like we were missing a more fundamental principle.
This paper is my attempt to find that principle. I introduce a framework called GWO (Generalized Windowed Operation) that describes any neural operation using just three simple, orthogonal components:
Path: Where to look
Shape: What form to look for
Weight: What to value
Using this "grammar", you can express both a standard convolution and self-attention, and see them as just different points in the same design space.
But the most surprising result came when I analyzed operational complexity. I ran an experiment where different models were forced to memorize a dataset (achieving ~100% training accuracy). The results were clear: complexity used for adaptive regularization (like in Deformable Convolutions, which dynamically change their receptive field) resulted in a dramatically smaller generalization gap than "brute-force" complexity (like in Self-Attention).
This suggests that how an operation uses its complexity is more important than how much it has.
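If you want to reproduce the gist of that experiment, the measurement protocol is easy to sketch. Below is a rough, hypothetical PyTorch version of the loop I'm describing, not the paper's exact setup (models, data, and hyperparameters there differ): train each candidate model until it essentially memorizes the training set, then report the gap between training and test accuracy.

```python
# Hedged sketch of the memorization / generalization-gap protocol.
# `model`, the loaders, and all thresholds here are placeholders.
import torch

def accuracy(model, loader, device="cpu"):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for xb, yb in loader:
            xb, yb = xb.to(device), yb.to(device)
            pred = model(xb).argmax(dim=1)
            correct += (pred == yb).sum().item()
            total += yb.numel()
    return correct / total

def generalization_gap(model, train_loader, test_loader,
                       epochs=200, target_train_acc=0.999, device="cpu"):
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        model.train()
        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
        if accuracy(model, train_loader, device) >= target_train_acc:
            break  # the model has effectively memorized the training set
    train_acc = accuracy(model, train_loader, device)
    test_acc = accuracy(model, test_loader, device)
    return train_acc - test_acc  # smaller gap = better generalization
```

The interesting comparison is between models that reach the same near-perfect training accuracy but spend their complexity differently (e.g., deformable offsets vs. global attention).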
I'm an independent researcher, so getting feedback from a community like this is invaluable. I'd love to hear your thoughts and critiques. Thanks for taking a look.
The paper is here: https://doi.org/10.5281/zenodo.17103133