I unified convolution and attention into a single framework

2 points | umjunsik132 | 1 comment | 9/13/2025, 7:02:27 AM | zenodo.org

Comments (1)

umjunsik132 · 1h ago

Hi HN, author here.

For years, it bothered me that convolution (the king of vision) and matrix multiplication / self-attention (the engine of Transformers) were treated as completely separate, specialized tools. It felt like we were missing a more fundamental principle.

This paper is my attempt to find that principle. I introduce a framework called GWO (Generalized Windowed Operation) that describes any neural operation using just three simple, orthogonal components:

- Path: where to look
- Shape: what form to look for
- Weight: what to value

Using this "grammar", you can express both a standard convolution and self-attention, and see them as just different points in the same design space.

But the most surprising result came when I analyzed operational complexity. I ran an experiment where different models were forced to memorize a dataset (achieving ~100% training accuracy). The results were clear: complexity used for adaptive regularization (like in Deformable Convolutions, which dynamically change their receptive field) resulted in a dramatically smaller generalization gap than "brute-force" complexity (like in self-attention). This suggests that how an operation uses its complexity is more important than how much it has.

I'm an independent researcher, so getting feedback from a community like this is invaluable. I'd love to hear your thoughts and critiques. Thanks for taking a look.

The paper is here: https://doi.org/10.5281/zenodo.17103133
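To make the Path / Shape / Weight grammar concrete, here is a minimal sketch in NumPy over 1-D sequences. The function and variable names (windowed_op, conv_like, attn_like) are illustrative assumptions, not the paper's actual API; the point is only that a fixed local window with static weights looks convolution-like, while a global window with content-dependent weights looks attention-like.

```python
# A minimal sketch of the "Path / Shape / Weight" grammar described above, using
# 1-D sequences and NumPy. Names are illustrative, not the paper's API.
import numpy as np

def windowed_op(x, path, weight):
    """For each output position i:
       - path(i, n) returns the indices the operation looks at (Path + Shape),
       - weight(i, idx, x) returns how much each of those positions counts (Weight).
       The output is the weighted sum of the gathered values."""
    n = len(x)
    out = np.zeros(n)
    for i in range(n):
        idx = path(i, n)                  # where (and in what form) to look
        w = weight(i, idx, x)             # what to value
        out[i] = np.dot(w, x[idx])
    return out

# Convolution-like operation: a fixed local window (static Path/Shape),
# with weights that depend on the offset but not on the input content.
def conv_like(x, kernel):
    k = len(kernel)
    def path(i, n):
        return np.clip(np.arange(i - k // 2, i - k // 2 + k), 0, n - 1)
    def weight(i, idx, x):
        return kernel                     # static, content-independent weights
    return windowed_op(x, path, weight)

# Self-attention-like operation: a global window (Path/Shape cover everything),
# with weights computed dynamically from the content.
def attn_like(x):
    def path(i, n):
        return np.arange(n)
    def weight(i, idx, x):
        scores = x[idx] * x[i]            # stand-in for query-key dot products
        e = np.exp(scores - scores.max())
        return e / e.sum()                # softmax over the whole sequence
    return windowed_op(x, path, weight)

x = np.array([0.1, 0.5, -0.3, 0.8, 0.2])
print(conv_like(x, np.array([0.25, 0.5, 0.25])))  # local, static weighting
print(attn_like(x))                               # global, content-dependent weighting
```

Both calls go through the same windowed_op; only the Path/Shape and Weight callbacks differ, which is the "different points in the same design space" idea.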
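The complexity experiment boils down to one measurement: train a model until it (nearly) memorizes the training set, then report the generalization gap (train accuracy minus test accuracy). Below is a minimal PyTorch sketch of that protocol only; the synthetic data and the small MLP are stand-ins I made up for illustration, not the models or datasets used in the paper.

```python
# Sketch of the memorization protocol: fit the training set to ~100% accuracy,
# then report the generalization gap. Data and model are illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_split(n, d=32, classes=4):
    # Random features with random labels, so the only way to fit them is to memorize.
    return torch.randn(n, d), torch.randint(0, classes, (n,))

x_train, y_train = make_split(256)
x_test, y_test = make_split(256)

model = nn.Sequential(nn.Linear(32, 512), nn.ReLU(), nn.Linear(512, 4))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def accuracy(x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

# Train until the training set is (almost) memorized, or a step budget runs out.
for step in range(5000):
    opt.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    opt.step()
    if accuracy(x_train, y_train) >= 0.999:
        break

train_acc = accuracy(x_train, y_train)
test_acc = accuracy(x_test, y_test)
print(f"train acc: {train_acc:.3f}  test acc: {test_acc:.3f}  gap: {train_acc - test_acc:.3f}")
```

In the paper's experiment, the comparison is between operations (e.g. deformable convolution vs. self-attention) run through this kind of protocol, with the claim that adaptive complexity yields a smaller gap than brute-force complexity at the same training accuracy.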
Hi HN, author here. For years, it bothered me that convolution (the king of vision) and matrix multiplication / self-attention (the engine of Transformers) were treated as completely separate, specialized tools. It felt like we were missing a more fundamental principle. This paper is my attempt to find that principle. I introduce a framework called GWO (Generalized Windowed Operation) that describes any neural operation using just three simple, orthogonal components: Path: Where to look Shape: What form to look for Weight: What to value Using this "grammar", you can express both a standard convolution and self-attention, and see them as just different points in the same design space. But the most surprising result came when I analyzed operational complexity. I ran an experiment where different models were forced to memorize a dataset (achieving ~100% training accuracy). The results were clear: complexity used for adaptive regularization (like in Deformable Convolutions, which dynamically change their receptive field) resulted in a dramatically smaller generalization gap than "brute-force" complexity (like in Self-Attention). This suggests that how an operation uses its complexity is more important than how much it has. I'm an independent researcher, so getting feedback from a community like this is invaluable. I'd love to hear your thoughts and critiques. Thanks for taking a look. The paper is here: https://doi.org/10.5281/zenodo.17103133