I think the paper is good, other than the title being a bit misleading. I say misleading because they replace Layer Normalization with a tanh function, which still bounds the range to (-1, 1). Plenty of people would call that normalization (an unfortunately overloaded term).
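For anyone skimming, here is a minimal sketch of the kind of drop-in replacement being described: an element-wise tanh with a learnable scale plus the usual affine parameters. The exact parameterization, names, and init value below are my assumptions for illustration, not the paper's reference code:

    import torch
    import torch.nn as nn

    class DynamicTanhSketch(nn.Module):
        # y = weight * tanh(alpha * x) + bias -- no mean/variance statistics at all.
        # alpha init and shapes are guesses for illustration, not the paper's code.
        def __init__(self, dim, init_alpha=0.5):
            super().__init__()
            self.alpha = nn.Parameter(torch.full((1,), init_alpha))  # learnable scalar
            self.weight = nn.Parameter(torch.ones(dim))              # like LayerNorm's gamma
            self.bias = nn.Parameter(torch.zeros(dim))               # like LayerNorm's beta

        def forward(self, x):
            return self.weight * torch.tanh(self.alpha * x) + self.bias

    # Used wherever a transformer block would otherwise call nn.LayerNorm(dim):
    x = torch.randn(2, 16, 768)
    print(DynamicTanhSketch(768)(x).shape)  # torch.Size([2, 16, 768])

The point being that this still squashes activations into a bounded range, so it plays a similar role to a norm layer even though it computes no statistics.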
While the result isn't too surprising, it has a good ablation study and helps build confidence in the mechanism. It's simple and quick to implement, but I don't find that a disadvantage. Arguably this is not novel, but sometimes it is worth revisiting things when the rest of the environment has changed, and the thoroughness of the study makes it useful to the community.
The project page is here[0], which will give you a very quick understanding of the paper.

[0] https://jiachenzhu.github.io/DyT/
I've always thought that normalization, as defined in the statistical sense, needs to be a linear transformation to preserve the shape of the distribution. tanh is definitely not normalization from that point of view. Even so, they could have been more specific and called it 'linear normalization'.
giancarlostoro · 21h ago
> (an unfortunately overloaded term)
I mentioned normalization in an interview, and they had no idea what I was talking about given my context. They were thinking of database normalization; I was thinking of DATA normalization, where you uppercase all inputs for, e.g., an email so that casing doesn't matter at login, since you'll uppercase it again when you check against the database. I'm sure there are a zillion other normalization methods for different things.
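Something like this, as a concrete (hypothetical) example of that kind of data normalization:

    # Canonicalize the email the same way at signup and at login,
    # so casing differences don't break the lookup. Illustrative only;
    # how much of the address to case-fold is a policy decision.
    def normalize_email(email: str) -> str:
        return email.strip().upper()

    assert normalize_email("  Alice@Example.COM ") == "ALICE@EXAMPLE.COM"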
DoctorOetker · 18h ago
I never liked conventional normalization; this tanh looks like it should execute faster.
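For what it's worth, the intuition is that LayerNorm has to compute per-row statistics (mean and variance reductions) before it can rescale, whereas a tanh is a single element-wise op. A rough sketch of the structural difference (not a benchmark; actual speed depends on kernels and fusion):

    import torch
    import torch.nn as nn

    x = torch.randn(8, 1024)

    # LayerNorm: reduce to per-row mean/variance, normalize, then apply affine params.
    y_ln = nn.LayerNorm(1024)(x)

    # Plain tanh: purely element-wise, no reductions and no statistics.
    y_tanh = torch.tanh(x)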
godelski · 15h ago
Depends on your context and goals.
LayerNorm isn't going to bound you strictly into [-1, 1] the way this will, and that can have some advantages. A strict bound can sometimes get you in trouble because it may not be as robust to novel inputs. For a basic example, consider the "classic" normalization where you rescale your training data so that it is bounded on [0, 1]: that does not mean data from your test set will land in [0, 1]. Does your model know how to generalize this?
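A toy example of that failure mode, with min-max scaling fit on the training set only (numbers made up for illustration):

    # Rescale using training-set statistics, then apply to unseen data.
    train = [3.0, 7.0, 10.0]
    lo, hi = min(train), max(train)
    scale = lambda v: (v - lo) / (hi - lo)

    print([scale(v) for v in train])  # [0.0, 0.571..., 1.0] -- bounded by construction
    print(scale(15.0))                # 1.714... -- a test-time value falls outside [0, 1]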
A scheme like this has the potential to land you in similar trouble. Think about the domain and range: if your training data is all in [-100, 100], you might end up with a fairly wide tanh to accommodate that. But will the resulting filter be able to differentiate the value 100 from 1000? Probably not; the filter is going to optimize for the data it saw during training. Will there be some filter with that capacity? Maybe. But there are also ways to preprocess your data where this bound really wouldn't matter.
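To make the saturation point concrete (the scale value below is made up): once the tanh is stretched to cover roughly [-100, 100], inputs far past the training range collapse toward the same output:

    import math

    alpha = 0.02  # illustrative scale so the tanh is "spread" over roughly [-100, 100]
    print(math.tanh(alpha * 100))   # ~0.964
    print(math.tanh(alpha * 1000))  # ~1.000 -- 100 and 1000 are nearly indistinguishable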
We're getting into the weeds here, but I'm just trying to illustrate why there are so many different normalization schemes. There's no one-size-fits-all process, and it is best to understand where certain methods have advantages and disadvantages (one definite advantage is that numbers closer to the origin have higher precision, since the density of representable values in fp{16,32,64} is not evenly distributed).
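You can see that uneven density directly: the gap between adjacent representable floats grows with magnitude (shown for float64 here; the effect is the same, just more pronounced, in fp16/fp32):

    import math

    # Unit in the last place: distance to the next representable float64.
    print(math.ulp(0.5))    # ~1.1e-16
    print(math.ulp(100.0))  # ~1.4e-14
    print(math.ulp(1e6))    # ~1.2e-10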