Show HN: Semantic Calculator (king-man+woman=?)
176 points by nxa | 172 comments | 5/14/2025, 7:54:31 PM | calc.datova.ai
I've been playing with embeddings and wanted to try out what results the embedding layer will produce based on just word-by-word input and addition / subtraction, beyond what many videos / papers mention (like the obvious king-man+woman=queen). So I built something that doesn't just give the first answer, but ranks the matches based on distance / cosine similarity. I polished it a bit so that others can try it out, too.
For now, I only have nouns (and some proper nouns) in the dataset, and pick the most common interpretation among the homographs. Also, it's case sensitive.
The prompt I used:
> Remember those "semantic calculators" with AI embeddings? Like "king - man + woman = queen"? Pretend you're a semantic calculator, and give me the results for the following:
The more I think about it the less surprised I am, but my initial thought was quite simply "no way" - surely an approximation of an NLP model made by another NLP model can't beat the original, but the LLM training process (and data volume) is just so much more powerful, I guess...
Your embedding model is literally the translation layer converting the text to numbers. The transformer blocks are the main processing units operating on those embeddings. You can even see some self-reflection in the model, as each transformer block is composed of an attention mechanism and an MLP sub-network: the attention mechanism captures the interrelational dependencies in the data, and the MLP projects up into a higher dimension and back down so that it can untangle those relationships. The idea is that you just repeat this process over and over. Attention has an advantage over CNN models because it has a larger receptive field, so it can better process long-range relationships (long-range meaning across the input data), whereas CNNs are biased toward local relationships.
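For what it's worth, here's a minimal sketch of the block being described (attention plus an MLP, stacked), in PyTorch. The dimensions, the 4x MLP expansion, and the layer-norm placement are common conventions rather than anything specific to a particular model:

    # Minimal sketch of the block described above: attention followed by an MLP
    # that projects up and back down, with the whole thing repeated. Dimensions,
    # the 4x expansion, and the norm placement are common conventions only.
    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(          # project up, nonlinearity, project back down
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )

        def forward(self, x):
            attn_out, _ = self.attn(x, x, x)   # mixes information across positions
            x = self.norm1(x + attn_out)
            x = self.norm2(x + self.mlp(x))    # per-position processing in a wider space
            return x

    # token embeddings in, contextualized embeddings out; real models stack many blocks
    tokens = torch.randn(1, 10, 512)                               # (batch, seq, d_model)
    stack = nn.Sequential(*[TransformerBlock() for _ in range(4)])
    print(stack(tokens).shape)                                     # torch.Size([1, 10, 512])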
(what I meant to say is that it doesn't do embedding math "LIKE" the OP — not that it doesn't do embedding math at all.)
To clarify some of the issues:
I think you are misunderstanding the architecture of these models. The embedding sub-network is the translation of text into numeric vectors. You'll find mention of the embedding sub-networks in both the GPT-3[3] and GPT-4[4] papers, though they are given less prominence than other parts of the work. While much smaller than the main network, don't forget that embedding networks are still quite large; for the smaller models they constitute a significant part of the total parameter count[5] (a rough calculation follows the references below).

After the embedding sub-network is your main transformer network. The purpose of this network is to perform embedding math! It is just that the goal is to do significantly more complicated math. Remember, these are learnable mappings (see Optimal Transport). We're just breaking the model down into its two main intermediate mappings. But the embeddings still end up being a bottleneck: they are your literal gateway from words to numbers.
[0] https://en.wikipedia.org/wiki/Mass_noun
[1] https://www.merriam-webster.com/dictionary/data
[2] https://www.sciotoanalysis.com/news/2023/1/18/this-data-or-t...
[3] https://arxiv.org/abs/2005.14165
[4] https://arxiv.org/abs/2303.08774
[5] https://www.lesswrong.com/posts/3duR8CrvcHywrnhLo/how-does-g...
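To put rough numbers on that parameter-count point (a back-of-the-envelope sketch; the vocabulary size and hidden widths are the commonly cited figures for GPT-2 small and GPT-3, and should be treated as approximate):

    # Back-of-the-envelope: embedding-table parameters = vocab_size * hidden_dim.
    # Figures are the commonly cited GPT-2 / GPT-3 values; approximate.
    vocab = 50257                       # GPT-2 / GPT-3 BPE vocabulary size

    gpt2_small_embed = vocab * 768      # ~38.6M parameters
    gpt3_embed       = vocab * 12288    # ~617M parameters

    print(f"GPT-2 small: {gpt2_small_embed / 124e6:.0%} of ~124M total parameters")  # ~31%
    print(f"GPT-3:       {gpt3_embed / 175e9:.2%} of ~175B total parameters")        # ~0.35%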
The rant about network architecture misses my point, which is that an LLM does not just do a linear transformation and a similarity search. Sure, in the most abstract sense it still just computes an output embedding from two input embeddings, but only in a very distant, pedantic way. (Actually, to be VERY pedantic, that would not even be true, because ChatGPT's tokenizer embeds tokens, not words. The input and output of the model are more than just the semantic embeddings of words; using two different but semantically equivalent words may result in different outputs with a transformer LLM, but not in a word-semantics model.)
I just thought it was cool that ChatGPT is so good at it.
You're right that there's subjectivity, but not infinitely so. There is a bound to this, and that's required both for language to work and for us to build these models. I did agree that the "data" one was tricky, so I'm not really going to argue; I was just pointing out a critical detail, given that the models learn through pattern matching rather than from a dictionary. It's why I made the comment about humans. As for ruler minus crown, I gave my explanation; would you care to share yours? I'd like to understand your point of view so I can better my interpretation of the results, because frankly I don't understand. What is the semantic relationship being changed, if not the attribute of ruler?
The architecture part was a miscommunication. I hope you understand how I misunderstood you when you said "this doesn't do embedding math like OP!". It is clear I'm not alone either.
To be pedantic, people generally refer to the tokenization and embedding together simply as "embedding". It's the common verbiage. This is because with BPE you are performing these steps simultaneously, and the term is appropriate given its longer usage in math. I was just trying to help you understand a different viewpoint.
Playing around with age/time related gives a lot of interesting results:
I think a lot of words are hard to distill into a single embedding. A word may embed a number of conceptually distinct definitions, but my (incomplete) understanding of embeddings is that they are not context-sensitive, right? So averaging those distinct definitions into one label is probably fraught with problems when trying to do meaningful vector math with them, problems that context/attention are able to help with. [EDIT: formatting is hard without preview]
"King-princess=man" can be thought to subtract the "royalty" part of "king"; "man" is just as good an answer as any else.
"King-queen=prince" I'd think of as subtracting "ruler" from "king", leaving a male non-ruling member of royalty. "gender-unspecified non-ruling royal" would be even better, but there's no word for that in English.
I'll buy the king-princess relationship. That's fair. But it also seems to be in disagreement with the king-queen one.
(some might say all an LLM does is embeddings :)
Curious tool but not what I would call accurate.
You can get some help in high dimensions when you're more concerned with (clearly disjoint) clusters. But this is akin to doing a dimensionality reduction, treating independent clusters as individual points. (Say we have a set S with disjoint subsets {S_0,...,S_n}; your new set is now {a_0,...,a_n}, where each a_i is an element representing all the elements in S_i. Think "set of sets".) But you do not get help with the interrelationships within a cluster (i.e. d(s_x, s_y) for s_x, s_y \in S_i, \forall x ≠ y), and I think you can gather that when the clusters are not clearly disjoint, we're back in the same situation as trying to differentiate points within a cluster.
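To make that concrete, a small illustrative sketch (random Gaussian data, nothing from the thread): in high dimensions, pairwise distances within a single blob concentrate around one value, so fine-grained within-cluster relationships are hard to resolve, while clearly separated cluster centers remain easy to tell apart:

    # Illustrative sketch: distance concentration within one blob vs. separation
    # between clearly disjoint clusters, using random Gaussian data.
    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)

    for dim in (2, 1024):
        blob = rng.normal(size=(200, dim))       # one cluster of 200 points
        d = pdist(blob)                          # all pairwise distances
        print(f"dim={dim:5d}: max/min pairwise distance ratio = {d.max() / d.min():.1f}")
    # 2D gives a large ratio; at dim=1024 the ratio is close to 1, so fine-grained
    # within-cluster distinctions carry little signal.

    dim = 1024
    a = rng.normal(size=(200, dim))              # cluster around the origin
    b = 5.0 + rng.normal(size=(200, dim))        # cluster around (5, 5, ..., 5)
    print(np.linalg.norm(a.mean(0) - b.mean(0))) # ~5*sqrt(1024) = 160, several times the
                                                 # within-cluster spread of ~sqrt(1024) = 32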
Understanding this can help you understand why these models (including LLMs) are good with broader concepts, like differentiating between obvious things, but struggle more with nuance. A good litmus test is to ask them about any subject you have deep knowledge in; essentially, test yourself for Gell-Mann Amnesia. These things are designed for human preference. When they fail, they're likely to fail without warning (i.e. in ways that are not so obvious).
The role of the Attention Layer in LLMs is to give each token a better embedding by accounting for context.
"king - man + woman = queen" is the famous example everyone uses when talking about word vectors, but is it actually just very cherry-picked?
I.e. are there a great number of other "meaningful" examples like this, or do you actually end up, the majority of the time, with some kind of vaguely, tangentially related word when adding and subtracting word vectors?
(Which seems to be what this tool is helping to illustrate, having briefly played with it, and looked at the other comments here.)
(Btw, not saying wordvecs / embeddings aren't extremely useful, just talking about this simplistic arithmetic)
E.g. in this calculator, "man - king + princess = woman", which doesn't make much sense. And "airplane - engine", which has a potentially sensible answer of "glider", instead gives "Czechoslovakia". Go figure.
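If you want to check for yourself how cherry-picked it is, pretrained word2vec vectors are easy to query with gensim. A sketch (the model identifier is gensim's standard Google News vectors; the download is large, on the order of a couple of gigabytes):

    # Sketch: run arbitrary analogies against pretrained word2vec vectors with gensim.
    import gensim.downloader as api

    wv = api.load("word2vec-google-news-300")    # standard Google News vectors, large download

    queries = [
        (["king", "woman"], ["man"]),            # the famous one
        (["princess", "man"], ["woman"]),
        (["airplane"], ["engine"]),
    ]
    for positive, negative in queries:
        print(positive, "-", negative, "=>",
              wv.most_similar(positive=positive, negative=negative, topn=3))

As far as I know, gensim's most_similar already excludes the query words from the candidates, which is the same trick mentioned further down the thread.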
India - Asia + Europe = Italy
Japan - Asia + Europe = Netherlands
China - Asia + Europe = Soviet-Union
Russia - Asia + Europe = European Russia
calculation + machine = computer
However, the site gives "Bush" at -4%, the second-best option (the best, at -2%, is "fleet ballistic missile submarine"; not sure what the negative numbers mean).
I'll have to meditate on that.
And, worse, most latent spaces are decidedly non-linear, so arithmetic loses a lot of its meaning. (IIRC word2vec mostly avoided nonlinearity except in the loss function.) Yes, the distance metric sort of survives, but addition/multiplication are meaningless.
(This is also the reason choosing your embedding model is a hard-to-reverse technical decision - you can't just transform existing embeddings into a different latent space. A change means "reembed all")
actor - man + woman = actress
garden + person = gardener
rat - sewer + tree = squirrel
toe - leg + arm = digit
100%
Are you using word2vec for these, or embeddings from another model?
I also wanted to add some flavor since it looks like many folks in this thread haven't seen something like this - it's been known since 2013 that we can do this (but it's great to remind folks especially with all the "modern" interest in NLP).
It's also known (in some circles!) that a lot of these vector arithmetic things need some tricks to really shine. For example, excluding the words already present in the query[1]. Others in this thread seem surprised at some of the biases present - there's also a long history of work on that [2,3].
[1] https://blog.esciencecenter.nl/king-man-woman-king-9a7fd2935...
[2] https://arxiv.org/abs/1905.09866
[3] https://arxiv.org/abs/1903.03862
The dictionary is based on https://wordnet.princeton.edu/, not word2vec. It's just a plain lookup among precomputed embeddings (with mxbai-embed-large). And yes, I'm excluding words that are present in the query, because otherwise they tend to dominate the results.
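For the curious, the lookup described above roughly amounts to the following (my reconstruction, not the author's code; embed() is a placeholder for whatever serves mxbai-embed-large, and vocab_vectors would be the precomputed embeddings for the WordNet-derived noun list):

    # Rough reconstruction of the lookup described above, not the author's code.
    import numpy as np

    def embed(word: str) -> np.ndarray:
        raise NotImplementedError("call the mxbai-embed-large model here")

    def semantic_calc(positive, negative, vocab_vectors, topn=5):
        query = sum(embed(w) for w in positive) - sum(embed(w) for w in negative)
        query = query / np.linalg.norm(query)
        exclude = set(positive) | set(negative)        # the trick: drop the query words
        scored = [(word, float(vec @ query / np.linalg.norm(vec)))
                  for word, vec in vocab_vectors.items()
                  if word not in exclude]
        return sorted(scored, key=lambda s: -s[1])[:topn]

    # e.g. semantic_calc(["king", "woman"], ["man"], vocab_vectors)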
It would be interesting to see how other models perform. I tried one (forgot the name) that was focused on coding, and it didn't perform nearly as well (in terms of human joy from the results).
https://neal.fun/infinite-craft/
It provides a panel filled with slowly moving dots. Right of the panel, there are objects labeled "water", "fire", "wind", and "earth" that you can instantiate on the panel and drag around. As you drag them, the background dots, if nearby, will grow lines connecting to them. These lines are not persistent.
And that's it. Nothing ever happens, there are no interactions except for the lines that appear while you're holding the mouse down, and while there is notionally a help window listing the controls, the only controls are "select item", "delete item", and "duplicate item". There is also an "about" panel, which contains no information.
I built a game[0] along similar lines, inspired by infinite craft[1].
The idea is that you combine (or subtract) “elements” until you find the goal element.
I’ve had a lot of fun with it, but it often hits the same generated element. Maybe I should update it to use the second (third, etc.) choice, similar to your tool.
[0] https://alchemy.magicloops.app/
[1] https://neal.fun/infinite-craft/
> a drug (such as opium or morphine) that in moderate doses dulls the senses, relieves pain, and induces profound sleep but in excessive doses causes stupor, coma, or convulsions
https://www.merriam-webster.com/dictionary/narcotic
So we can see some element of losing time in that type of drug. I guess? Maybe I’m anthropomorphizing a bit.
Other stuff that works: key, door, lock, smooth
Some words that result in "flintlock": violence, anger, swing, hit, impact
Makes no sense, admittedly!
"- dulcimer" and "- zither" are both firmly in .*gun.* territory, it seems...
Also, in case this gets buried in the comments: proper nouns need to be capitalized (Paris-France+Germany).
I am planning on patching up the UI based on your feedback.
Or maybe they would all be completely inscrutable and man-woman would be like the 50th strongest result.
Can't personally find the connection here; I was expecting "father" or something.
High-dimensional vectors are always hard to explain. This is an example.
I’ve been unable to find it since. Does anyone know which site I’m thinking of?
wine - beer = grape juice
beer - wine = bowling
astrology - astronomy + mathematics = arithmancy
That could be seen as trying to find the true "meaning" of a word.
Very few papers that actually say something meaningful are left unnoticed, but as soon as you say something generic like "language models can do this", it gets featured in "AI influencer" posts.
I've had some fun finding this:
(Goshawks are very intense, gyrs tend to be leisurely in flight.)
Getting to cornbread elegantly has been challenging.
But if I assume the biased answer and rearrange the operands, I get "man - criminal + black = white". Which clearly shows how biased your embeddings are!
Funny thing: fixing biases, and ways to circumvent the fixes (while keeping good UX), might be a much more challenging task :)
paleolith + cat = Paleolithic Age
paleolith + dog = Paleolithic Age
paleolith - cat = neolith
paleolith - dog = hand ax
cat - dog = meow
I wonder if some of the math is off, or if I am not using this properly.
woman + intelligence = man (77%)
Oof.
car + stupid = idiot, car + idiot = stupid
man+vagina=woman (ok that is boring)
https://en.m.wikipedia.org/wiki/Isle_of_Man
Edit: these must be capitalized to be recognized.
I think you need to disable auto-capitalisation because on mobile the first word becomes uppercase and triggers a validation error.
Accurate.
hacker - code = professional golf
love + time = commitment
boredom + curiosity = exploration
vision + execution = innovation
resilience - fear = courage
ambition + humility = leadership
failure + reflection = learning
knowledge + application = wisdom
feedback + openness = improvement
experience - ego = mastery
idea + validation = product-market fit
great idea, but I find the results unamusing
-red
and:
red-red-red
But it did not work, and I did not get any response. Maybe I am stupid, but shouldn't this work?
blue + red = yellow (87%) -- rgb, neat
black + {red,blue,yellow,green} = white 83% -- weird
Blue + red is magenta. Yellow would be red + green.
None of these results make much sense to me.
queen - woman + man = drone
Navratilova - woman + man = Lendl
female + age = male
Good to understand this bias before blindly applying these models. (Yes, "doctor" is gender-neutral; even women can be doctors!!)
rice + fish + raw = meat
hahaha... I JUST WANT SUSHI!
six (84%)
Close enough I suppose
huh
78% male horse 72% horseman
this is pretty fun
That's weird.
hmm...
LOL