I think the paper is good, other than the title being a bit misleading. I say misleading because they replace Layer Normalization with a tanh function, which still bounds the range to (-1, 1). Plenty of people would call that normalization (an unfortunately overloaded term).
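For anyone skimming, here is a minimal sketch of the kind of drop-in replacement being described: an element-wise tanh with a learnable scale plus the usual affine parameters. The exact parameterization, names, and init value below are my assumptions for illustration, not the paper's reference code:

    import torch
    import torch.nn as nn

    class DynamicTanhSketch(nn.Module):
        # y = weight * tanh(alpha * x) + bias -- no mean/variance statistics at all.
        # alpha init and shapes are guesses for illustration, not the paper's code.
        def __init__(self, dim, init_alpha=0.5):
            super().__init__()
            self.alpha = nn.Parameter(torch.full((1,), init_alpha))  # learnable scalar
            self.weight = nn.Parameter(torch.ones(dim))              # like LayerNorm's gamma
            self.bias = nn.Parameter(torch.zeros(dim))               # like LayerNorm's beta

        def forward(self, x):
            return self.weight * torch.tanh(self.alpha * x) + self.bias

    # Used wherever a transformer block would otherwise call nn.LayerNorm(dim):
    x = torch.randn(2, 16, 768)
    print(DynamicTanhSketch(768)(x).shape)  # torch.Size([2, 16, 768])

The point being that this still squashes activations into a bounded range, so it plays a similar role to a norm layer even though it computes no statistics.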
While the result isn't too surprising, it has a good ablation study and helps build confidence in the mechanism. It's simple and quick to implement, but I don't find that a disadvantage. Arguably this is not novel, but sometimes it is worth revisiting things when the rest of the environment has changed, and the thoroughness of the study makes it useful to the community.
The project page is here[0], which will give you a very quick understanding of the paper.

[0] https://jiachenzhu.github.io/DyT/
I've always thought that normalization, as defined in the statistical sense, needs to be a linear transformation to preserve the shape of the distribution. tanh is definitely not normalization from that point of view. Even so, they could have been more specific and called it 'linear normalization'.
giancarlostoro · 21h ago
> (an unfortunately overloaded term)
I mentioned normalization in an interview, and they had no idea what I was talking about given my context. They were thinking of database normalization; I was thinking of DATA normalization, where you uppercase all inputs for, e.g., an email so that casing doesn't matter at login, since you'll uppercase it again when you check against the database. I'm sure there are a zillion other normalization methods for different things.
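Something like this, as a concrete (hypothetical) example of that kind of data normalization:

    # Canonicalize the email the same way at signup and at login,
    # so casing differences don't break the lookup. Illustrative only;
    # how much of the address to case-fold is a policy decision.
    def normalize_email(email: str) -> str:
        return email.strip().upper()

    assert normalize_email("  Alice@Example.COM ") == "ALICE@EXAMPLE.COM"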
DoctorOetker · 18h ago
I never liked conventional normalization; this tanh looks like it should execute faster.
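For what it's worth, the intuition is that LayerNorm has to compute per-row statistics (mean and variance reductions) before it can rescale, whereas a tanh is a single element-wise op. A rough sketch of the structural difference (not a benchmark; actual speed depends on kernels and fusion):

    import torch
    import torch.nn as nn

    x = torch.randn(8, 1024)

    # LayerNorm: reduce to per-row mean/variance, normalize, then apply affine params.
    y_ln = nn.LayerNorm(1024)(x)

    # Plain tanh: purely element-wise, no reductions and no statistics.
    y_tanh = torch.tanh(x)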
godelski · 15h ago
Depends on your context and goals.
LayerNorm isn't going to bound you strictly into [-1, 1] the way this will, and that can have some advantages. A strict bound can sometimes get you in trouble because it may not be as robust to novel inputs. For a basic example, consider the "classic" normalization where you rescale your training data so that it is bounded on [0, 1]: that does not mean data from your test set will land in [0, 1]. Does your model know how to generalize this?
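A toy example of that failure mode, with min-max scaling fit on the training set only (numbers made up for illustration):

    # Rescale using training-set statistics, then apply to unseen data.
    train = [3.0, 7.0, 10.0]
    lo, hi = min(train), max(train)
    scale = lambda v: (v - lo) / (hi - lo)

    print([scale(v) for v in train])  # [0.0, 0.571..., 1.0] -- bounded by construction
    print(scale(15.0))                # 1.714... -- a test-time value falls outside [0, 1]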
A scheme like this has the potential to land you in similar trouble. Think about the domain and range: if your training data is all in [-100, 100], you might end up with a fairly wide tanh to accommodate that. But will the resulting filter be able to differentiate the value 100 from 1000? Probably not; the filter is going to optimize for the data it saw during training. Will there be some filter with that capacity? Maybe. But there are also ways to preprocess your data where this bound really wouldn't matter.
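To make the saturation point concrete (the scale value below is made up): once the tanh is stretched to cover roughly [-100, 100], inputs far past the training range collapse toward the same output:

    import math

    alpha = 0.02  # illustrative scale so the tanh is "spread" over roughly [-100, 100]
    print(math.tanh(alpha * 100))   # ~0.964
    print(math.tanh(alpha * 1000))  # ~1.000 -- 100 and 1000 are nearly indistinguishable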
We're getting into the weeds here, but I'm just trying to illustrate why there are so many different normalization schemes. There's no one-size-fits-all process, and it is best to understand where certain methods have advantages and disadvantages (one definite advantage is that numbers closer to the origin have higher precision, since the density of representable values in fp{16,32,64} is not evenly distributed).
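You can see that uneven density directly: the gap between adjacent representable floats grows with magnitude (shown for float64 here; the effect is the same, just more pronounced, in fp16/fp32):

    import math

    # Unit in the last place: distance to the next representable float64.
    print(math.ulp(0.5))    # ~1.1e-16
    print(math.ulp(100.0))  # ~1.4e-14
    print(math.ulp(1e6))    # ~1.2e-10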