Interestingly, a small company called Ogma already did something very similar back in 2021 (on an embedded system, no less). This (https://ogma.ai/2021/07/unsupervised-behavioral-learning-ubl...) is a description/video of how they got a small RC car to predict the next frame of its video feed given the action it was about to take, and thereby made the car navigate to a given location when fed with a still frame of that location (all this with online learning, and no backprop).
Instead of VICReg, they induced their latent state with sparse auto-encoding. Also they predicted in pixel, as opposed to latent, space. The white paper describing their tech is a little bit of a mess, but schematically, at least, the hierarchical architecture they describe bears a strong resemblance to the hierarchical JEPA models LeCun outlined in his big paper from a few years ago. A notable difference, though, is that their thing is essentially a reflex agent, as opposed to possessing a planning/optimization loop.
concrete_head · 16h ago
Just wanted to say thanks very much for sharing this.
Over the last few months I've been inventing this almost exact approach in my head as a hobby without consciously knowing it had already been done. I love their little RC car demo.
kadushka · 1d ago
The ideas at Ogma are inspired by Numenta's work.
TheAceOfHearts · 2d ago
> With these visual subgoals, V-JEPA 2 achieves success rates of 65% – 80% for pick-and-placing new objects in new and unseen environments.
How does this compare with existing alternatives? Maybe I'm just lacking proper context, but a minimum 20% failure rate sounds pretty bad? The paper compares their results with older approaches, which apparently had something like a 15% success rate, so jumping to an 80% success rate does seem like a significant jump. If I'm reading the paper correctly, the amount of time required to compute and execute each action went down from 4 minutes to 16 seconds, which also seems significant.
Having to specify an end goal as an image seems pretty limited, but at least the authors acknowledge it in the paper:
> Second, as mentioned in Section 4, V-JEPA 2-AC currently relies upon tasks specified as image goals. Although this may be natural for some tasks, there are other situations where language-based goal specification may be preferable. Extending the V-JEPA 2-AC to accept language-based goals, e.g., by having a model that can embed language-based goals into the V-JEPA 2-AC representation space, is another important direction for future work. The results described in Section 7, aligning V-JEPA 2 with a language model, may serve as a starting point.
I think it would be interesting if the authors answered whether they think there's a clear trajectory towards a model that can be trained to achieve a >99% success rate.
deepGem · 1d ago
Currently,
You train a VLA (vision language action) model for a specific pair of robotic arms, for a specific task. The end actuator actions are embedded in the model (actions). So let's say you train a pair of arms to pick an apple. You cannot zero shot it to pick up a glass. What you see in demos is the result of lots of training and fine tuning (few shot) on specific object types and with specific robotic arms or bodies.
The language intermediary embedding brings some generalising skills to the table but it isn't much. The vision -> language -> action translation is, how do I put this, brittle at best.
What these guys are showing is a zero shot approach to new tasks in new environments with 80% accuracy. This is a big deal. Pi0 from Physical Intelligence is the best model to compare I think.
ricardobeat · 2d ago
It’s important to keep some perspective: there are zero robots in the wild, at the moment, that use a world model to work on tasks they weren’t specifically trained on. This is cutting edge research and an 80% success rate is astonishing!
londons_explore · 2d ago
80% success rate is also potentially commercially viable if the task is currently being done by a human.
Work that was once done by 10 humans can now be done by 10 robots + 2 humans for the 20% failure cases, at a lower total cost.
zeroxfe · 2d ago
This really depends on the failure modes. In general, humans fail in predictable, and mostly safe, ways. AIs fail in highly unpredictable and potentially very dangerous ways. (A human might accidentally drop a knife, an AI might accidentally stab you with it.)
Maxion · 1d ago
Or, if controlling a robot arm, it would stab itself through the conveyor belt at full torque.
MindTheAbstract · 1d ago
It might still be a little slow (I'm not sure if the 16 seconds to compute an action is fast enough for commercial use cases), but this is definitely exciting and seems like a great step forward.
vFunct · 2d ago
I'm surprised that's not how it's already done. I'd figure some of the inner layers in LLMs were already "world models" and that it's the outer layers that differentiated models between text vs. images/robotics/other modes...
mjburgess · 2d ago
That's what the propaganda says, but when we keep explaining it isn't true, an army arrives to repeat ad copy from their favourite tech guru.
All statistical models of the kind in use are interpolations through historical data -- there's no magic. So when you interpolate through historical texts, your model is of historical text.
Text is not a measure of the world: to say "the sky is blue" is not even reliably associated with the blueness of the sky, let alone with the fact that the sky isn't blue (there is no sky, and the atmosphere isn't blue).
These models appear to "capture more" only because when you interpret the text you attribute meaning/understanding to it as the cause of its generation -- but that wasn't the cause, so this is necessarily an illusion. There is no model of the world in a model of historical text -- there is a model of the world in your head which you associate with text, and that association is exploited when you use LLMs to do more than mere syntax transformation.
LLMs excel most at "fuzzy retrieval" and things like coding -- the latter is principally a matter of syntax, and the former of recollection. As soon as you require the prompt-completion to maintain "semantic integrity" with non-syntactical/retrievable constraints, it falls apart.
nightski · 2d ago
I feel like you are ignoring or dismissing the word "interpolating", although a better word would likely be generalization. I'd make the claim that it's very hard to generalize without some form of world model. It's clear to me that transformers do have some form of world model, although not the same as what is being presented in V-JEPA.
One other nitpick is that you confine this to "historical data", although other classes of data, such as simulated and generated data, are also trained on.
mjburgess · 2d ago
I didn't say generalisation, because there isn't any. Inductive learning does not generalise, it interpolates -- if the region of your future prediction (here, prompt completion) lies on or close to the interpolated region, then the system is useful.
Generalisation is the opposite process: hypothecating a universal and finding counter-examples to constrain the universal generalisation. Eg., "all fire burns" is hypothecated by a competent animal upon encountering fire once.
Inductive "learners" take the opposite approach: fire burns in "all these cases", and if you have a case similar to those, then fire will burn you.
They can look the same within the region of interpolation, but look very different when you leave it: all of these systems fall over quickly when more than a handful of semantic constraints are imposed. This number is a measure of the distance from the interpolated boundary (e.g., consider this interpretation of Apple's latest paper on reasoning in LLMs: the "environment complexity" is nothing other than a measure of interpolation-dissimilarity).
Early modern philosophers of science were very confused by this, but it's in Aristotle plain-as-day, and it's also been extremely well established since the 80s, as the development of formal computational stats necessitated making this clear: interpolation is not generalisation. The former does not get you robustness to irrelevant permutation (ie., generalisation); it does not permit considering counterfactual scenarios (ie., generalisation); it does not give you a semantics/theory of the data generating process (ie., generalisation, ie. a world model).
Interpolation is a model of the data. Generalisation requires a model of the data generating process, the former does not give you the latter, though it can appear to under strong experimental assumptions of known causal models.
Here LLMs model the structure of language-as-symbolic-ordering; that structure "in the interpolated region" expresses reasoning, but it isn't a model of reasoning. It's a model of reasoning as captured in historical cases of it.
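(To make the interpolation point concrete, here's a toy sketch of my own -- not from any of the papers discussed -- of a polynomial fit that tracks sin(x) inside the sampled region but says nothing reliable about the generating process outside it.)

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(0, 3, 40)                       # the interpolated region
    y_train = np.sin(x_train) + 0.05 * rng.normal(size=40)

    coeffs = np.polyfit(x_train, y_train, deg=9)          # a model of the data

    for x in (1.5, 3.5, 6.0):
        print(f"x={x}: fit={np.polyval(coeffs, x):+.2f}  generating process={np.sin(x):+.2f}")
    # Inside [0, 3] the two agree; past the boundary the fit diverges,
    # even though the process that produced the data hasn't changed.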
jeremyjh · 1d ago
Aren’t there papers showing that there is some kind of world model emerging? Like representations of an Othello board that we would recognize were found and manipulated successfully in a small model.
mjburgess · 1d ago
There are two follow up papers showing the representations are "entangled", a euphemism for statistical garbage, but I can't be bothered at the moment to find them.
However the whole issue of Othello is a non sequitur, which indicates that the people involved here don't really seem to understand the issue, or what a world model is.
A "world model" is a model of a data generating process which isn't reducible to or constituted by its measures. Ie., we are concerned with the case where there's a measurement space (eg., that of the height of mercury in a thermometer) and a target property space (eg., that of the temperature of the coffee), so that there is a gap between the data-as-measure and its causes. In language this gap is massive: the cause of my saying "I'm hungry" may have nothing to do with my hunger, even if it often does. "Scientific measuring devices" are constructed to minimize this gap as much as possible.
In any case, with board games and other mathematical objects, there is no gap. The data is the game. The "board state" is an abstract object constituted by all possible board states. The game "is made out of" its realisations.
However the world isn't made out of language, nor coffee made out of thermometers. So a model of the data isn't a model of its generating process.
So whether an interpolation of board states "fully characterises", in some way, an abstract mathematical object, "the game", is so irrelevant to the question that it betrays a fundamental lack of understanding of even what's at issue.
No one is arguing that a structured interpolative model (ie., one given an inductive bias by an NN architecture) doesn't express properties of the underlying domain in its structure. The question is what happens to this model of the data when you have the same data generating process, but you aren't in the interpolated region.
This problem is, in the limit of large data, impossible for abstract games by their nature, eg., a model classifying the input X into legal/illegal board states is the game.
Another way of phrasing this: ML/AI textbooks often begin by assuming there's a function you're approximating. But in the vast majority of cases where NNs are used, there is no such function -- there is no function tokens -> meanings (eg., "i am hungry" is ambiguous).
But in the abstract math case there is a function: {boards} -> Legal|Illegal is a function, and there are no ambiguous boards.
So: of the infinite number of f* approximations to f_game, any is valid in the limit len(X) -> inf. Of the infinite number f*_lang to f_language, all are invalid (each in their own way).
jeremyjh · 1d ago
> A "world model" is a model of a data generating process which isn't reducible-to or constituted by its measures.
> However the world isnt made out of language, nor coffee made out of thermometers. So a model of the data isnt a mdoel of its generating process.
So is V-JEPA 2 actually generating a world model, as you've defined it here? It's still just sampling data - visual data, tactile feedback etc. is all reducible to quantized data. It seems like you could build useful models that seem to generalize without that. For example, a model could learn to stop dropping things without ever developing a theory of gravity.
Probably I'm still misunderstanding too much for this to be useful, but what I've read from you in this thread is way more useful to my understanding than what I've seen before.
mjburgess · 1d ago
I'll have to read the JEPA article in more detail before commenting specifically on whether "world model" is appropriate. However procedural-action models have, in my view, a special place in the area of modelling the world.
While they may not be world models under my definition above, they are something like world-model-generating-models. They work like our sensory-motor system which itself builds "procedural proxy models" of the world -- and these become world models when they are cognised (, conceptualised, made abstract, made available for the imagination, etc.).
Contrast a very simple animal which can move a leaf around vs. a more complex one (eg., a mouse, etc.) which can imagine the leaf in various orientations. It's that capacity, esp. of mammals (, birds, etc.), to reify their sensory-motor "world-model-generating" capacity, eg., in imagination, which allows them to form world models in their heads. We require something like imagination in order to be able to hypothecate a general model, form a hypothetical action, and try that action out.
I'm less concerned about making this distinction clear for casual observers in the case of robotics, because imv, competent acting in the world can lead to building world models. Whereas most other forms cannot.
What these robots require, to have world models in my view, would be firstly these sensory-motor models and then a reliable way of 1) acquiring new SM models live (ie., learning motor techniques); and 2) reporting on what they have learned in a reasoning/cognitive context.
Robotics is just at stage0 here, the very basics of making a sensory-motor connection.
gsf_emergency · 1d ago
Sorry to go off on what may seem to be a tangent (equivocating only because I struggle to get the point across succinctly?)
This too could form the basis of a productive skepticism towards the usefulness of coding agents, unlike what has caught attention here. (Referring specifically to the post by tptacek)
For example, we could look at feedback from the lisp community (beyond anecdata) on the usefulness of LLMs? Since it's what one might call "syntax-lite", a lack of true generalization ability ("no possible world model for an unavoidably idiosyncratic DSL-friendly metalanguage") could show up as a lack of ability to not just generate code, but even to fix it..
Beyond that, there's the issue of how much the purported world-shattering usefulness of proof assistants based on, say, Lean4 must depend on interpolating, say, mathlib...
In short, please link the papers :)
>There are two follow up papers showing the representations are "entangled", a euphemism for statistical garbage, but I can't be bothered at the moment to find them.
gsf_emergency · 1d ago
>but I can't be bothered at the moment to find them
a token & misguided attempt to surface relevant lit might incite you to shove the obviousness down my throat :)
Here's one for disentangling reps in LLMs https://arxiv.org/abs/2505.18774v1
Could you give more details about what precisely you mean by interpolation and generalization? The commonplace use of “generalization” in the machine learning textbooks I’ve been studying is model performance (whatever metric is deemed relevant) on new data from the training distribution. In particular, it’s meaningful when you’re modeling p(y|x) and not the generative distribution p(x,y).
mjburgess · 1d ago
It's important to be aware that ML textbooks are conditionalising every term on ML being the domain of study, and along with all computer science, extremely unconcerned with words they borrow retaining their meaning.
Generalisation in the popular sense (science, stats, philosophy of science, popsci) is about reliability and validity, def. validity = does a model track the target properties of a system we expect; reliability = does it continue to do so in environments in which those features are present, but irrelevant permutations are made.
Interpolation is "curve fitting", which is almost all of ML/AI. The goal of curve fitting is to replace a general model with a summary of the measurement data. This is useful when you have no way of obtaining a model of the data generating process.
What people in ML assume is that there is some true distribution of measurements, and "generalisation" means interpolating the data so that you capture the measurement distribution.
I think it's highly likely there's a profound conceptual mistake in assuming measurements themselves have a true distribution, so even the sense of generalisation to mean "have we interpolated correctly" is, in most cases, meaningless.
Part of the problem is that ML textbooks frame all ML problems with the same set of assumptions (eg., that there exists an f: X->Y, that X has a "true distribution" Dx, so that finding f* implies learning Dx). For many datasets, these assumptions are false. Compare running a linear regression on photos of the sky, through stars, to get star signs, vs. running it on V=IR electric circuit data to get `R`.
In the former case, there is no f_star_sign to find; there is no "true distribution" of star sign measurements; etc. So any model of star signs cannot be a model even of measurements of star signs. ML textbooks do not treat "data" as having these kinds of constraints, or relationships to reality, which breeds pseudoscientific and credulous misunderstandings of issues (such as, indeed, the Othello paper).
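(As a small aside, here's a minimal sketch of the V=IR case above, with made-up numbers: a generating process really exists, so least squares recovers a physical quantity.)

    import numpy as np

    rng = np.random.default_rng(1)
    R_true = 4.7                                   # ohms, assumed for the example
    I = rng.uniform(0.1, 2.0, size=200)            # measured currents (A)
    V = R_true * I + 0.02 * rng.normal(size=200)   # measured voltages, with noise

    R_hat, _, _, _ = np.linalg.lstsq(I[:, None], V, rcond=None)
    print(f"estimated R = {R_hat[0]:.3f} ohms")    # ~4.7

    # A regression from sky photos to star signs still produces numbers,
    # but there is no f_star_sign for those numbers to approximate.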
abtinf · 2d ago
> army arrives to repeat adcopy from their favourite tech guru
This is painfully accurate.
The conversations go like this:
Me: “guys, I know what I’m talking about, I wrote my first neural network 30 years ago in middle school, this tech is cool but it isn’t magic and it isn’t good enough to do the thing you want without getting us sued or worse.”
Them: “Bro, I read a tweet that we are on the other side of the singularity. We have six months to make money before everything blows up.”
refulgentis · 2d ago
I can buy this, given a very wide meaning of "specifically trained on" and handwaving a bit about "as far as I know*", but then I read the actual wording of "new objects in new and unseen environments", and remember these were floating around Mountain View doing tasks involving new objects in novel environments years ago. Then I kinda gotta give up and admit to myself I'm distorting the conversation by emphasizing positivity over ground truth.
gyudin · 2d ago
They don’t use it because it’s unsafe and potentially life threatening lol
dghlsakjg · 2d ago
Plenty of things are unsafe and potentially life threatening, including machines with pre-programmed routines that we use today. We already have robots with limited intelligence interacting safely with humans in workplaces.
This learning technology didn't exist until this moment in time. That probably has more to do with why no one is using it in the wild.
lukan · 2d ago
Yes, you can just add other reliable safety measures. Meaning if a human comes too close, the robot stops.
Or the robot is supervised all the time.
Or just operates in an area without humans.
But so far this is research, not market ready.
DickingAround · 2d ago
I run thousands of robots in production. We can get a very high success rate but only for the task they're designed for. Production robots can't pick up stuff they drop yet. And this '80%' level is not actually acceptable or even state of art for just pick-and-place, but it's compelling for a robot that also knows how to do other things with equal quality (if JEPA does that).
torginus · 1d ago
Yeah, I also wonder how old school approaches using machine vision and IK and hard algorithms would compare, or perhaps some hybrid method?
robot · 1d ago
your comment is not aligned with how science is done.
For discoveries you certainly work with limited approaches and certainly don't know if there is a "clear trajectory".
cubefox · 2d ago
I think the fundamental idea behind JEPA (not necessarily this concrete Meta implementation) will ultimately be correct: predicting embeddings instead of concrete tokens. That's arguably what animals do. Next-token prediction (a probability distribution over the possible next tokens) works well for the discrete domain of text, but it doesn't work well for a continuous domain like video, which would be needed for real-time robotics.
For text, with a two-byte tokenizer you get 2^16 (~65,000) possible next tokens, and computing a probability distribution over them is very much doable. But the "possible next frames" in a video feed would already be an extremely large number. If one frame is 1 megabyte uncompressed (instead of just 2 bytes for a text token) there are 2^(8*2^20) possible next frames, which is far too large a number. So we somehow need to predict only an embedding of a frame, of how the next frame of a video feed will look approximately.
Moreover, for robotics we don't want to just predict the next (approximate) frame of a video feed. We want to predict future sensory data more generally. That's arguably what animals do, including humans. We constantly anticipate what happens to us in "the future", approximately, and where the farther future is predicted progressively less exactly. We are relatively sure of what happens in a second, but less and less sure of what happens in a minute, or a day, or a year.
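To illustrate the difference, here's a rough PyTorch-style sketch of my own (placeholder networks, not Meta's architecture): regress a compact embedding of the next frame instead of scoring raw pixels.

    import torch
    import torch.nn as nn

    D = 1024                                       # embedding size (assumed)
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, D))   # stand-in
    predictor = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))

    frame_t = torch.randn(8, 3, 224, 224)
    frame_next = torch.randn(8, 3, 224, 224)

    z_t = encoder(frame_t)
    with torch.no_grad():                          # target embedding, no gradient
        z_next = encoder(frame_next)

    # Regress D numbers per frame in latent space, rather than modelling a
    # distribution over the 2^(8*2^20) possible next frames in pixel space.
    loss = nn.functional.l1_loss(predictor(z_t), z_next)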
kaivi · 1d ago
> We constantly anticipate what happens to us in "the future", approximately, and where the farther future is predicted progressively less exactly
There's evidence for this in what's called Predictive Coding. When that future happens, a higher-level circuit decides how far off we were, and then releases appropriate neuromodulators to re-wire that circuit.
That would mean that to learn faster, you want to expose yourself to situations where you are often wrong: be often surprised and go down the wrong paths. Have a feedback mechanism which will tell you when you're wrong. This is maybe also why the best teachers are the ones who often ask the class questions for which there are counter-intuitive answers.
cubefox · 1d ago
> There's then evidence of what's called Predictive Coding. When that future happens, a higher level circuit decides how far off we were, and then releases appropriate neuromodulators to re-wire that circuit.
Yes, and ideally there would be whole backpropagation passes which update the entire model depending on how much the current observation diverges from past predictions. (Though brains use an updating mechanism which diverges from the backpropagation algorithm.)
Edit: Apparently the theory of this is broadly known (apart from "JEPA" and "predictive coding") also under the names "free energy principle" and "active inference": https://en.wikipedia.org/wiki/Free_energy_principle
krackers · 1d ago
I'm only a layman but at a high level how does the encoder + predictor of JEPA differ from an LLM?
An LLM takes in input, transforms it into an embedding, and makes predictions off that embedding. The only high-level difference I can see is that currently LLMs do it in a "single pass" where they output tokens directly (and COT is sort of a hack to get reasoning by "looping" in autoregressive output token space), but IIRC there are some experimental variants that do looped latent reasoning.
Any high-level comparison I can find almost strawmans LLMs: yes they take in token embeddings directly, but the first few layers of an LLM almost surely convert that to more abstract embeddings, as seen in repE research. Since the best way to predict is to actually internalize a world model, there's no reason to believe that multimodal LLMs can't make predictions about physical changes in the same way that JEPA claims to. That said JEPA may be able to do it more efficiently, attention almost surely isn't the _optimal_ architecture for doing all this
cubefox · 1d ago
LLMs simply take in text and return text, therefore they can just be trained via self-supervised learning on large amounts of text. Then they only need a little fine-tuning on top of that, and they are ready.
But an analogous pretraining approach isn't available for robotics. Robots take in sensory data and return movements, in real-time. There is no large data corpus of this pairing to do self-supervised learning on, like there is for text.
Even if we only consider pure video-to-video models, for which there is a large amount of training data for self-supervised learning, the autoregressive next-token predictor approach wouldn't work. That's why Veo 3 & Co are diffusion models. Because predicting the next frame directly doesn't work. It's far too much data. Text comes in relative tiny, discrete amounts with high useful information content per bit. Video is huge, basically continuous, and has quite low useful information content per bit (because of things like irrelevant details and noise), at least as far as robotics is concerned.
Moreover, even if next frame-prediction would work, this doesn't really do what we want for robotics. The robot doesn't just need a prediction about the next frame (or embedding of the next frame) when planning its movements, but potentially broadly about the next millions of frames, about things that are much further out in the future.
krackers · 19h ago
>The robot doesn't just need a prediction about the next frame
But the residual stream of LLMs doesn't "just" encode the next token prediction, it is high-level enough to encode predictions for a few tokens out, as seen with things like Multi-token prediction.
But yes I can see that in terms of input, you probably don't want to take in video frames directly and training via teacher-forcing is probably inefficient here. So some world-model-tailored embedding like JEPA is probably better. I guess my confusion is that Yann seems to frame it as JEPA vs LLM, but to me JEPA just seems like an encoder to generate embeddings that can be fed into an LLM. They seem complementary rather than a substitute.
naasking · 1d ago
> Robots take in sensory data and return movements, in real-time. There is no large data corpus of this pairing to do self-supervised learning on, like there is for text.
This is easily generated synthetically from a kinematic model, at least up to a certain level of precision.
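For instance, here's a hypothetical sketch of the kind of data such a kinematic model could emit (a toy 2-link planar arm, all numbers made up):

    import numpy as np

    LINK1, LINK2 = 0.30, 0.25                  # link lengths in metres (assumed)

    def forward_kinematics(q):
        """End-effector (x, y) position from joint angles q = (q1, q2)."""
        x = LINK1 * np.cos(q[0]) + LINK2 * np.cos(q[0] + q[1])
        y = LINK1 * np.sin(q[0]) + LINK2 * np.sin(q[0] + q[1])
        return np.array([x, y])

    rng = np.random.default_rng(0)
    q, dataset = np.zeros(2), []
    for _ in range(10_000):
        dq = rng.uniform(-0.05, 0.05, size=2)  # random joint command (action)
        dataset.append((forward_kinematics(q), dq, forward_kinematics(q + dq)))
        q = q + dq                             # (observation, action, next observation)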
cubefox · 1d ago
That would be like trying to pretrain GPT-1 from synthetically generated data only. It probably wouldn't work because the synthetic data doesn't resemble real world data enough.
It did work for AlphaGo Zero (and later AlphaZero), which were entirely trained on synthetic data. But that's for very simple games with strict formal rules, like Go and chess.
naasking · 1d ago
A kinematic model of the robot is a physics simulation of the robot. I don't see why that wouldn't resemble real world data enough.
cubefox · 23h ago
Not just the robot has to be simulated, the entire part of the world it interacts with also has to be. Even the most realistic video games resemble actual videos of the real world only very superficially.
naasking · 22h ago
Most realistic video games don't simulate all of the physics required. Even if we just stick to simulating the motion of the robot itself in an empty space, all of that data can be generated synthetically once at the appropriate precision and reused many times, just like training data for LLMs.
bytefactory · 1d ago
Can you clarify my understanding as a layman please?
Are you saying that LLMs hold concepts in latent space (weights?), but the actual predictions are always in tokens (thus inefficient and lossy), whereas JEPA operates directly on concepts in latent space (plus encoders/decoders)?
I might be using the jargon incorrectly!
cubefox · 21h ago
Yes that's right.
abraxas · 2d ago
But how do you go from predicting embeddings (which could be thought of as a type of lossy compression of the original data) back out to something usable, say a sequence of image/video tokens or a sequence of robot actions?
cubefox · 2d ago
A robot model would need to constantly convert the prediction (an embedding) of the future observations, together with a "plan" of what the robot tries to achieve, into an action. Into some kind of movement which takes both the action plan and the predicted sensory data into account.
That's very much an unsolved problem, and I don't know how far Meta is along that path. Not very far, I assume.
NitpickLawyer · 2d ago
If I understand your post correctly, they're also doing this:
> V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.
> After the actionless pre-training stage, the model can make predictions about how the world might evolve—however, these predictions don’t directly take into account specific actions that an agent would take. In the second stage of training, we focus on making the model more useful for planning by using robot data, which includes visual observations (video) and the control actions that the robot was executing. We incorporate this data into the JEPA training procedure by providing the action information to the predictor. After training on this additional data, the predictor learns to account for specific actions when making predictions and can then be used for control. We don’t need a lot of robot data for this second phase—in our technical report, we show that training with only 62 hours of robot data already results in a model that can be used for planning and control.
> We demonstrate how V-JEPA 2 can be used for zero-shot robot planning in new environments and involving objects not seen during training. Unlike other robot foundation models—which usually require that some training data come from the specific robot instance and environment where the model is deployed—we train the model on the open source DROID dataset and then deploy it directly on robots in our labs. We show that the V-JEPA 2 predictor can be used for foundational tasks like reaching, picking up an object, and placing it in a new location.
> For short-horizon tasks, such as picking or placing an object, we specify a goal in the form of an image. We use the V-JEPA 2 encoder to get embeddings of the current and goal states. Starting from its observed current state, the robot then plans by using the predictor to imagine the consequences of taking a collection of candidate actions and rating the candidates based on how close they get to the desired goal. At each time step, the robot re-plans and executes the top-rated next action toward that goal via model-predictive control. For longer horizon tasks, such as picking up an object and placing it in the right spot, we specify a series of visual subgoals that the robot tries to achieve in sequence, similar to visual imitation learning observed in humans. With these visual subgoals, V-JEPA 2 achieves success rates of 65% – 80% for pick-and-placing new objects in new and unseen environments.
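Reading that description, the planning loop seems to amount to something like the following sketch (my paraphrase; `encoder`, `predictor` and `robot` are placeholders, not Meta's actual interfaces):

    import torch

    def plan_step(encoder, predictor, robot, goal_image,
                  horizon=5, n_candidates=256):
        z_now = encoder(robot.observe())        # embedding of the current state
        z_goal = encoder(goal_image)            # goal specified as an image

        # Imagine the consequences of candidate action sequences by rolling
        # the predictor forward in representation space.
        actions = torch.randn(n_candidates, horizon, robot.action_dim)
        z = z_now.expand(n_candidates, -1)
        for t in range(horizon):
            z = predictor(z, actions[:, t])

        # Rate candidates by how close they get to the goal, execute only the
        # top-rated first action, then re-plan (model-predictive control).
        best = torch.norm(z - z_goal, dim=-1).argmin()
        robot.execute(actions[best, 0])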
bobosha · 2d ago
This is where the memory bit comes in, if you have a memory of past embeddings and associated label(s), it could be an ANN query to fetch the most similar embeddings and infer therefrom.
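Something like the following toy sketch of that lookup (illustrative only, nothing from the paper):

    import numpy as np

    memory_embeddings = np.random.randn(10_000, 512)   # past embeddings (toy data)
    memory_labels = np.arange(10_000)                  # whatever was associated with them

    def recall(query, k=5):
        # cosine similarity of the query against everything in memory
        m = memory_embeddings / np.linalg.norm(memory_embeddings, axis=1, keepdims=True)
        q = query / np.linalg.norm(query)
        nearest = np.argsort(m @ q)[-k:][::-1]
        return memory_labels[nearest]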
abraxas · 2d ago
But an embedding is more like a one way hash, kind of like sha1 or md5, no? You can get from input data to a hash value but not the other way around, right? I know that similarly placed embedding vectors will sit next to semantically related vectors but these clusters could be really sparse in such a massively dimensional hyperspace and so the nearest values in a cache may be too far away to be useful?
BTW I'm very much not an expert here and I'm just trying to understand how this system works end to end. Don't take anything I write here as authoritative.
rajman187 · 1d ago
That’s why you have encoders as well as decoders. For example, another model from Meta does this for translations; they have encoders and decoders into a single embedding space that represents semantic concepts for each language
The JEPA models give me hope that the future isn't just more tokens, more context, and more chain-of-thought.
siavosh · 2d ago
Does someone know how the "semantic" embeddings are learned? That seems like perhaps the main technical challenge here.
gglon · 22h ago
From the paper, section 2.1:
minimize_θ,φ,Δ ||P_φ(Δ, E_θ(x)) - sg(E_θ'(y))||_1
where
y - the full video
x - the masked video
E_θ(.) - the learned encoder (the semantic embedding)
P_φ(.) - the learned predictor
Δ - the learned mask tokens (indicating which patches of the video were dropped)
sg(.) - stop gradient, preventing gradient propagation into E_θ'(.), which in turn is an exponential moving average of E_θ(.), ie. θ'_new <- τ θ'_old + (1-τ) θ
So the loss is applied only to the predictions of the masked patches, while the target encoder of the full video follows the learned one. This asymmetry in learning prevents the encoder from collapsing to a trivial constant.
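In loose PyTorch terms, one training step might look like this (my paraphrase of the objective, not the authors' code; module interfaces are placeholders):

    import torch

    def jepa_step(encoder, target_encoder, predictor, optimizer,
                  x_masked, y_full, mask_tokens, tau=0.999):
        with torch.no_grad():
            target = target_encoder(y_full)              # sg(E_theta'(y))

        pred = predictor(mask_tokens, encoder(x_masked)) # P_phi(delta, E_theta(x))
        loss = (pred - target).abs().mean()              # L1 on the masked patches

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # theta' <- tau * theta' + (1 - tau) * theta  (EMA target encoder)
        for p_t, p in zip(target_encoder.parameters(), encoder.parameters()):
            p_t.data.mul_(tau).add_(p.data, alpha=1 - tau)
        return loss.item()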
fidotron · 2d ago
You have to wonder if the model is going to end up recreating Verlet integration in there somewhere, or if it's generating a pile of those optical acceleration cancelation type heuristics in neural net form.
It's one of those ideas I've had around for a while that if you fused decent object tracking with an understanding of Verlet integration you should, in principle, start being able to measure all sorts of physical quantities quite easily.
rar00 · 1d ago
the robot arm demonstration video jumps at the 00:28s mark...
artificialprint · 2d ago
Throw ARC-AGI 2 at it!
jadbox · 2d ago
I suspect it wouldn't help too much. This model is meant for physics-based world modeling, while nearly all the problems in ARC are symbolic reasoning.
artificialprint · 2d ago
I'd say world modeling can provide the foundations from which symbolic reasoning can emerge; after all, this is how we (humans) learn it too. There are a lot of tasks in ARC that are grounded in simple physics.
littlestymaar · 2d ago
> I'd say world modeling can provide the foundations from which symbolic reasoning can emerge, after all this is how we (humans) learn it too
As usual comparisons with humans provide little practical insight for what's achievable with ML. Humans don't have to learn everything from scratch like ML models do, you aren't expecting ML models to learn language out of a few thousands of tokens just because humans can, so similarly you shouldn't expect neural networks to learn reasoning from world interaction alone.
falcor84 · 2d ago
Yes, ARC-AGI 2 seems to have a lot of challenges that involve (a projection of) gravity and collisions, so I'd be quite interested in seeing whether it would generalize.
jcelerier · 2d ago
> That kind of physical intuition isn’t something adults obtain after years of education—young children develop this intuition by observing the world around them before they can even speak in full sentences.
I mean, it still takes them much more time than it takes to train even the largest LLMs we use (a couple months)
dist-epoch · 2d ago
In wall clock time. If you count in input tokens/pixels, humans learn with orders of magnitude less input data.
logicchains · 2d ago
That's not true at all; the amount of audiovisual data a human is exposed to in even just one year is incredibly vast. At sixty frames per second, sixteen hours per day gives over a billion frames per year, and each frame at such a high resolution would be hundreds of tokens.
dist-epoch · 2d ago
Let's take your numbers:
Human: 1000 tok * 60 * 86400 * 365 = 2 Trillion tokens / year
GPT-4: 13 Trillion tokens
Llama-3: 15 Trillion tokens
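(Checking the arithmetic with those assumptions, plus the 16 waking hours from the parent comment:)

    tokens_per_frame, fps = 1000, 60
    print(tokens_per_frame * fps * 86400 * 365)      # ~1.9e12 tokens/year at 24 h/day
    print(tokens_per_frame * fps * 3600 * 16 * 365)  # ~1.3e12 tokens/year at 16 h/day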
naasking · 1d ago
That's vision only mind you, no consideration of sound, taste, touch, and interoception. And GPT-4 is far more fluent and knowledgeable than a 6-7 year old where the total token count matches by this assessment (though I'm skeptical of the numbers given).
cluckindan · 1d ago
That’s why we tokenize very early in the vision pipeline.
This contains a common misstep (or misgeneralization of an analogy) among those who are much more familiar with computers than with the brain. The brain is not digital and concepts like frames per second and resolution don't make much sense for vision. First, there aren't frames, neuron activity is asynchronous with changes to sensory neuron firing rate responding to changes in the environment or according to saliency.
Between the non-uniformity of receptor density (eg fovea vs peripheral vision but this is general across all senses), dynamic receptor fields and the fact that information is encoded in terms of spike rate and timing patterns across neural populations, the idea of pixels in some bitmap at some resolution is beyond misleading. There is no pixel data, just sparsely coded feature representations capturing things like edges, textures, motion, color contrast and the like, already, at the retina.
While hundreds of trillions of photons might hit our photoreceptors, > 99% of that is filtered and or compressed before even reaching retinal ganglion cells. Only a tiny fraction, about 10 million bits/sec, of the original photon signal rate is transferred through the optic nerve (per eye). This pattern of filtering and attentive prioritization of information in signals continues as we go from sensory fields to thalamus to higher cortical areas.
So while we might encounter factoids like: on the order of a billion bits per second of data hit photoreceptors or [10Mb/s transferred](https://www.britannica.com/science/information-theory/Physio...) along optic nerves, it's important to keep in mind that a lot of the intuition gained from digital information processing does not transfer in any meaningful sense to the brain.
fc417fc802 · 1d ago
If you consider the entire biological pipeline then the filtering is part of that. The quantity of raw data remains much greater than that available to any vision model. If anything the filtering done by biology should make it clear that there's vast room for model architecture improvement.
naasking · 1d ago
Humans do not start as blank models, they have billions of years of pretraining from evolution.
lukan · 2d ago
But they use way less energy for it.
nlitened · 2d ago
I imagine that Russian-speaking team members had fun with naming the model V-JEPA
Tiberium · 2d ago
For the curious: "жопа" (which "JEPA" sounds like) means "ass" in Russian. Also V ("В") means "in" (although if we get into specifics, the casing would need to be "жопу" or "жопе" depending on the context)
koakuma-chan · 2d ago
Also the video thumbnail:
J.E.P.A.
momojo · 2d ago
Why is Meta investing into this research? What's the potential payoff?
MindTheAbstract · 1d ago
Like others have said, it's an interesting avenue for AGI. The joint embeddings would be closer to thinking than the current LLM token work. LLMs look like they have a lot of limitations for AGI (although who knows if we have another crazy scale-up? but that extra scale is looking difficult right now).
esafak · 2d ago
There is a world of money in AGI, and they have the resources, and notably the data, to achieve it.
aaroninsf · 2d ago
The goal is a Large Phenomenological Model.
A good definition of "real AGI" might be, a multimodal model which understands time-based media, space, and object behavior, and hence true agency.
Phenomenology is the philosophy of "things as they seem," not "knowledge (words) about things." Seem to our senses, not understood through language.
LLMs of course trade in language tokens.
We can extend their behavior with front ends which convert other media types into such tokens.
But we can do better with multimodal models which are trained directly on other inputs. E.g. integrating image classifiers with language models architecturally.
With those one can sort of understand time-based media, by sampling a stream and getting e.g. transcripts.
But again, it's even better to build a time-based multimodal model, which directly ingests time-based media rather than sampling. (Other architectures than transformers are going to be required to do this well IMO...)
The bootstrapping continues. This work is about training models to understand world and object properties by introducing agency.
Significant footnote: implicitly, models trained to interact with the world necessarily have a "self model" which interacts with the "world model." Presumably they are trained to preserve their expensive "self." Hmmmmm....
When we have a model that knows about things not just as nodes in a language graph but also how such things look, and sound, and move, and "feel" (how much mass do they have, how do they move, etc.)...
...well, that is approaching indistinguishable from one of us, at least wrt embodiment and agency.
DesiLurker · 2d ago
Possibly with their investment into AR/VR and gaming they may see a pathway to creating 'physical intelligence' and tapping into a much bigger untapped market. I mean, isn't Robotaxi the main carrot Musk has been holding in front of Tesla investors for a decade or so? Physical robots may provide a more 'incremental, fault tolerant' path to the application of AI.
dyauspitr · 2d ago
Physical robots as impressive as LLMs?
kp1197 · 1d ago
Robots that can do anything.
seydor · 1d ago
physical robots arguing endlessly with physical people
iLoveOncall · 2d ago
"World model" and "physical reasoning" is such a lie.
Those models don't have any understanding of physics, they just regurgitate what they see in their vision-based training set, just like any image or video generation model does.
Monkey see other monkey cannot go through wall, monkey don't try go through wall.
smokel · 2d ago
I think you are misinterpreting the terminology.
Of course these models are not understanding physics in the way a physicist or a mathematician would. But they do form a model of the world that can be used for forecasting and reasoning, in a way possibly not unlike how humans and other animals operate when interacting with the physical world.
dghlsakjg · 1d ago
You don't need to have taken a single physics class to be good at pool...
rayboy1995 · 2d ago
> Monkey see other monkey cannot go through wall, monkey don't try go through wall.
I mean... we are just monkeys. Did we not learn this way when we were younger?
RollingRo11 · 2d ago
Agreed! A really young child has no notion of "physics". They are learning through experience and observation.
These models/robots aren't superintelligent by any means, but "Monkey see other monkey cannot go through wall, monkey don't try go through wall" isn't far off from how some animals/humans "learn".
seydor · 1d ago
physics is phenomenological. the model sees phenomena
ldjkfkdsjnv · 2d ago
Leadership at Meta is dropping the ball with these non-LLM AI model side quests
jadbox · 2d ago
LLMs were once a side quest. I hope Meta invests more in alternatives as maybe we'll find something better. If not, then Meta just loses a bit of R&D budget. They are still heavily invested in regular LLM development, so it's not like they are trading one for the other.
linguistbreaker · 2d ago
I strongly agree. FAANG has the money to do the research. LLMs are far from intelligent - AGI will require a number of other advances.
rvz · 2d ago
AI research is more than just LLMs.
energy123 · 2d ago
Is this a sarcastic compliment? Diversity in research agendas is very important for pushing forward the frontier even if it's not good for the company investing in the high risk research. Good job, to an otherwise toxic company.
Instead of vicreg, they induced their latent state with sparse auto-encoding. Also they predicted in pixel, as opposed to latent, space. The white paper describing their tech is a little bit of a mess, but schematically, at least, the hierarchical architecture they describe bears a strong resemblance to the hierarchical JEPA models LeCunn outlined in his big paper from a few years ago. A notable difference, though, is that their thing is essentially a reflex agent, as opposed to possessing a planning/optimization loop.
Over the last few months I've been inventing this almost exact approach in my head as a hobby without consciously knowing it had already been done. I love their little RC car demo.
How does this compare with existing alternatives? Maybe I'm just lacking proper context, but a minimum 20% failure rate sounds pretty bad? The paper compares their results with older approaches, which apparently had something like a 15% success rate, so jumping to an 80% success rate does seem like a significant jump. If I'm reading the paper correctly, the amount of time required to compute and execute each action went down from 4 minutes to 16 seconds, which also seems significant.
Having to specify an end goal as an image seems pretty limited, but at least the authors acknowledge it in the paper:
> Second, as mentioned in Section 4, V-JEPA 2-AC currently relies upon tasks specified as image goals. Although this may be natural for some tasks, there are other situations where language-based goal specification may be preferable. Extending the V-JEPA 2-AC to accept language-based goals, e.g., by having a model that can embed language-based goals into the V-JEPA 2-AC representation space, is another important direction for future work. The results described in Section 7, aligning V-JEPA 2 with a language model, may serve as a starting point.
I think it would be interesting if the authors answered whether they think there's a clear trajectory towards a model that can be trained to achieve a >99% success rate.
You train a VLA (vision language action) model for a specific pair of robotic arms, for a specific task. The end actuator actions are embedded in the model (actions). So let's say you train a pair of arms to pick an apple. You cannot zero shot it to pick up a glass. What you see in demos is the result of lots of training and fine tuning (few shot) on specific object types and with specific robotic arms or bodies.
The language intermediary embedding brings some generalising skills to the table but it isn't much. The vision -> language -> action translation is, how do I put this, brittle at best.
What these guys are showing is a zero shot approach to new tasks in new environments with 80% accuracy. This is a big deal. Pi0 from Physical Intelligence is the best model to compare I think.
Work that was once done by 10 humans can now be done by 10 robots + 2 humans for the 20% failure cases, at a lower total cost.
All statistical models of the kind in use are interpolations through historical data -- there's no magic. So when you interpolate through historical texts, your model is of historical text.
Text is not a measure of the world, to say, "the sky is blue" is not even reliably associated with the blueness of the sky, let alone that the sky isnt blue (there is no sky, and the atmosphere isn't blue).
These models appear "capture more" only because when you interpret the text you attribute meaning/understanding to it as the cause of its generation -- but that wasnt the cause, this is necessarily an illusion. There is no model of the world in a model of historical text -- there is a model of the world in your head which you associate with text, and that association is exploited when you use LLMs to do more than mere syntax transformation.
LLMs excel most at "fuzzy retrieval" and things like coding -- the latter is principally a matter of syntax, and the former of recollection. As soon as you require the prompt-completion to maintain "semantic integrity" with non-syntactical/retrivable constraints, it falls apart.
One other nitpick is that you confine to "historical data", although other classes of data are trained on such as simulated and generative.
Generalisation is the opposite process, hypothecating a universal and finding counter-examples to constrain the universal generalisaton. Eg., "all fire burns" is hypotheticated by a competent animal upon encountering fire once.
Inductive "learners" take the opposite approach: fire burns in "all these cases", and if you have a case similar to those, then fire will burn you.
They can look the same within the region of interpolation, but look very different when you leave it: all of these systems fall over quickly when more than a handful of semantic constraints are imposed. This number is a measure of the distance from the interpolated boundary (e.g., consider this interpretation of apple's latest paper on reasoning in LLMs: the "environment complexity" is nothing other than a measure of interpolation-dissimilarity).
Early modern philosophers of science were very confused by this, but it's in Aristotle plain-as-day, and it's also extremely well establish since the 80s as the development of formal computational stats necessitated making this clear: interpolation is not generalisation. The former does not get you robustness to irrelevant permuation (ie., generalisation); it does not permit considering counterfactual scenarios (ie., generalisation); it does not give you a semantics/theory of the data generating process (ie., generalisation, ie. a world model).
Interpolation is a model of the data. Generalisation requires a model of the data generating process, the former does not give you the latter, though it can appear to under strong experimental assumptions of known causal models.
Here LLMs model the structure of language-as-symbolic-ordering, that structure "in the interpolated region" expresses reasoning, but it isnt a model of reasoning. It's a model of reasoning as captured in historical cases of it.
However the whole issue of othello is a nonsequiteur which indicates that people involved here don't really seem to understand the issue, or what a world model is.
A "world model" is a model of a data generating process which isn't reducible-to or constituted by its measures. Ie., we are concerned for the case where there's a measurement space (eg., that of the height of mercury in a thermometer) and a target property space (eg., that of the temperature of the coffee). So that there is gap between the data-as-measure and its causes. In language this gap is massive: the cause of my saying, "I'm hungry" may have nothing to do with my hunger, even if it often does. For "scientific measuring devices", these are constructed to minimize this gap as much as possible.
In any case, with board games and other mathematical objects, there is no gap. The data is the game. The "board state" is an abstract object constituted by all possible board states. The game "is made out of" its realisations.
However the world isnt made out of language, nor coffee made out of thermometers. So a model of the data isnt a mdoel of its generating process.
So whether an interpolation of board states "fully characterises", someway, an abstract mathematical object "the game" is so irrelevant to the question it betrays a fundamental lack of understanding of even what's at issue.
No one is arguing that a structured interpolative model (ie., one given an inductive bias by an NN architecture) doesn't express properties of the underlying domain in its structure. The question is what happens to this model of the data when you have the same data generating process, but you arent in the interpolated region.
This problem is, in the limit of large data, impossible for abstract games by their nature, eg., a model classifying the input X into legal/illegal board states is the game.
Another way of phrasing this is that in ML/AI textbooks often begin by assuming there's a function you're approximating. But in the vast majority of cases where NNs are used, there is no such function -- there is no function tokens -> meanings (eg., "i am hungry" is ambigious).
But in the abstract math case there is a function, {boards} -> Legal|Illegal is a function, there are no ambiguous boards
So: of the infinite number of f* approximations to f_game, any is valid in the limit len(X) -> inf. Of the infinite number f*_lang to f_language, all are invalid (each in their own way).
So is V-JEPA 2 actually generating a world model, as you've defined it here? Its still just sampling data - visual data, tactile feedback etc is all reducible to quantized data. It seems like you could build useful models that seem to generalize without that. For example, a model could learn to stop dropping things without ever developing a theory of gravity.
Probably I'm still misunderstanding too much for this to be useful, but what I've read from you in this thread is way more useful to my understanding than what I've seen before.
While they may not be world models under my definition above, they are something like world-model-generating-models. They work like our sensory-motor system which itself builds "procedural proxy models" of the world -- and these become world models when they are cognised (, conceptualised, made abstract, made available for the imagination, etc.).
Contrast a very simple animal which can move a leaf around vs., a more complex one (eg., a mouse, etc.) which can imagine the leaf in various orientations. It's that capacity, esp. of mammals (, birds, etc.) to reify their sensory-motor "world-model-generating" capacity, eg., in imagination, which allows them to form world models in their heads. We require something like imagination in order to be able to hypotheticate a general model, form a hypothetical action, and try that action out.
I'm less concerned about making this distinct clear for casual observes in the case of robotics, because imv, competent acting in the world can lead to building world models. Whereas most other forms cannot.
What these robots require, to have world models in my view, would be firstly these sensory-motor models and then a reliable way of 1) acquiring new SM mdoels live (ie., learning motor techniques); and 2) reporting on what they have learned in a reasoning/cognitive context.
Robotics is just at stage0 here, the very basics of making a sensory-motor connection.
This too could form the basis of a productive skepticism towards the usefulness of coding agents, unlike what has caught attention here. (Referring specifically to the post by tptacek)
For example, we could look at feedback from the lisp community (beyond anecdata) on the usefulness of LLMs? Since it's what one might call "syntax-lite", a lack of true generalization ability ("no possible world model for an unavoidably idiosyncratic DSL-friendly metalanguage") could show up as a lack of ability to not just generate code, but even to fix it..
Beyond that, the issue how much the purported world-shattering usefulness of proof assistants based on say, Lean4, must depend on interpolating say, mathlib..
In short, please link the papers :)
>There are two follow up papers showing the representations are "entangled", a euphemism for statistical garbage, but I can't be bothered at the moment to find them.
a token & misguided attempt to surface relevant lit might incite you to shove the obviousness down my throat :)
Here's one for disentangling reps in LLMs https://arxiv.org/abs/2505.18774v1
Generalisation in the popular sense (science, stats, philosophy of science, popsci) is about reliability and validity, def. validity = does a model track the target properties of a system we expect; reliability = does it continue to do so in environments in which those features are present, but irrelevant permutations are made.
Interpolation is "curve fitting", which is almost all of ML/AI. The goal of curve fitting is to replace a general model with a summary of the measurement data. This is useful when you have no way of obtaining a model of the data generating process.
What people in ML assume is that there is some true distribution of measurements, and "generalisation" means interpolating the data so that you capture the measurement distribution.
I think it's highly likely there's a profound conceptual mistake in assuming measurements themsleves have a true distribution, so even the sense of generalisation to mean "have we interpolated correctly" is, in most cases, meaningless.
Part of the problem is that ML textbooks frame all ML problems with the same set of assumptions (e.g., that there exists an f: X -> Y, that X has a "true distribution" Dx, so that finding f* implies learning Dx). For many datasets, these assumptions are false. Compare running a linear regression on photos of the sky to get star signs from stars, vs. running it on V = IR electric circuit data to get `R`.
In the former case, there is no f_star_sign to find; there is no "true distribution" of star-sign measurements; etc. So any model of star signs cannot be a model even of measurements of star signs. ML textbooks do not treat "data" as having these kinds of constraints, or relationships to reality, which breeds pseudoscientific and credulous misunderstandings of issues (such as, indeed, the Othello paper).
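To make the contrast concrete, here's a minimal sketch of the V = IR case, where a data-generating law actually exists; the "true" resistance, noise level, and sample size are assumptions for illustration:

```python
# Minimal sketch: fitting R from noisy (I, V) measurements of a circuit that
# really does obey V = I*R. All numbers here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
true_R = 4.7                                      # ohms (assumed for the example)
I = rng.uniform(0.1, 2.0, size=200)               # measured currents (A)
V = true_R * I + rng.normal(0, 0.05, size=200)    # noisy voltage measurements

# Least-squares fit of V = R*I (no intercept): R_hat = sum(I*V) / sum(I*I)
R_hat = (I @ V) / (I @ I)
print(f"estimated R ≈ {R_hat:.2f} ohms")
```

Because V = IR is a real law, the fitted coefficient tracks a property of the circuit; there is no analogous quantity for a regression from sky photos to star signs, which is the point above.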
This is painfully accurate.
The conversations go like this:
Me: “guys, I know what I’m talking about, I wrote my first neural network 30 years ago in middle school, this tech is cool but it isn’t magic and it isn’t good enough to do the thing you want without getting us sued or worse.”
Them: “Bro, I read a tweet that we are on the other side of the singularity. We have six months to make money before everything blows up.”
This learning technology didn't exist until this moment in time. That probably has more to do with why no one is using it in the wild.
Or the robot is supervised all the time.
Or just operates in an area without humans.
But so far this is research, not market ready.
For text, with a two-byte tokenizer you get 2^16 (~65,000) possible next tokens, and computing a probability distribution over them is very much doable. But the number of "possible next frames" in a video feed is astronomically larger. If one frame is 1 megabyte uncompressed (instead of just 2 bytes for a text token), there are 2^(8*2^20) possible next frames, which is far too large a number. So instead we need to predict only an embedding of the next frame: an approximation of how it will look.
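Roughly, the contrast looks like this; the layer sizes below are illustrative assumptions, not any particular model's configuration:

```python
# Rough sketch of the output-space contrast (all sizes are illustrative).
import torch
import torch.nn as nn

hidden = 1024
vocab = 2**16                             # ~65k next tokens: a softmax head is cheap
token_head = nn.Linear(hidden, vocab)     # 1024 x 65,536 ≈ 67M params, no problem

# A raw 1 MB frame would need 2**(8 * 2**20) classes -- not representable at all.
# Instead, predict a compact embedding of the next frame and compare in latent space.
embed_dim = 256
frame_head = nn.Linear(hidden, embed_dim)

h = torch.randn(1, hidden)                      # some hidden state
pred_embedding = frame_head(h)                  # predicted embedding of the next frame
target_embedding = torch.randn(1, embed_dim)    # stand-in for the encoder's output on the actual frame
loss = nn.functional.mse_loss(pred_embedding, target_embedding)
```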
Moreover, for robotics we don't want to just predict the next (approximate) frame of a video feed. We want to predict future sensory data more generally. That's arguably what animals do, including humans. We constantly anticipate what happens to us in "the future", approximately, and where the farther future is predicted progressively less exactly. We are relatively sure of what happens in a second, but less and less sure of what happens in a minute, or a day, or a year.
There's also evidence of what's called predictive coding: when that future happens, a higher-level circuit decides how far off we were, and then releases appropriate neuromodulators to re-wire that circuit.
That would mean that to learn faster, you want to expose yourself to situations where you are often wrong: be often surprised and go down the wrong paths. Have a feedback mechanism which will tell you when you're wrong. This is maybe also why the best teachers are the ones who often ask the class questions for which there are counter-intuitive answers.
Yes, and ideally there would be whole backpropagation passes which update the entire model depending on how much the current observation diverges from past predictions. (Though brains use an updating mechanism that differs from the backpropagation algorithm.)
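As a toy sketch of that idea, updating a predictor in proportion to its surprise via ordinary backprop (the model, data, and "surprise" measure here are made-up placeholders, not a faithful predictive-coding implementation):

```python
# Toy sketch: update a predictor in proportion to its prediction error ("surprise").
import torch
import torch.nn as nn

predictor = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))
opt = torch.optim.SGD(predictor.parameters(), lr=1e-2)

def step(observation, next_observation):
    prediction = predictor(observation)
    surprise = nn.functional.mse_loss(prediction, next_observation)  # prediction error
    opt.zero_grad()
    surprise.backward()   # larger error -> larger gradients -> larger update
    opt.step()
    return surprise.item()

# Seeking out situations where the predictor is often wrong yields larger updates,
# which is the "expose yourself to surprise to learn faster" intuition above.
obs, nxt = torch.randn(8), torch.randn(8)
print(step(obs, nxt))
```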
Edit: Apparently the theory of this is broadly known (apart from "JEPA" and "predictive coding") also under the names "free energy principle" and "active inference": https://en.wikipedia.org/wiki/Free_energy_principle
An LLM takes in input, transforms it into an embedding, and makes predictions off that embedding. The only high-level difference I can see is that currently LLMs do it in a "single pass" where they output tokens directly (and COT is sort of a hack to get reasoning by "looping" in autoregressive output token space), but IIRC there are some experimental variants that do looped latent reasoning.
Any high-level comparison I can find almost strawmans LLMs: yes, they take in token embeddings directly, but the first few layers of an LLM almost surely convert those to more abstract embeddings, as seen in repE research. Since the best way to predict is to actually internalize a world model, there's no reason to believe that multimodal LLMs can't make predictions about physical changes in the same way that JEPA claims to. That said, JEPA may be able to do it more efficiently; attention almost surely isn't the _optimal_ architecture for doing all this.
But an analogous pretraining approach isn't available for robotics. Robots take in sensory data and return movements, in real-time. There is no large data corpus of this pairing to do self-supervised learning on, like there is for text.
Even if we only consider pure video-to-video models, for which there is a large amount of training data for self-supervised learning, the autoregressive next-token-predictor approach wouldn't work. That's why Veo 3 & Co are diffusion models: predicting the next frame directly doesn't work; it's far too much data. Text comes in relatively tiny, discrete amounts with high useful information content per bit. Video is huge, basically continuous, and has quite low useful information content per bit (because of things like irrelevant details and noise), at least as far as robotics is concerned.
Moreover, even if next-frame prediction did work, it doesn't really do what we want for robotics. The robot doesn't just need a prediction about the next frame (or embedding of the next frame) when planning its movements, but potentially about millions of future frames, i.e., about things that are much further out in the future.
But the residual stream of LLMs doesn't "just" encode the next-token prediction; it is high-level enough to encode predictions a few tokens out, as seen with things like multi-token prediction.
But yes, I can see that in terms of input, you probably don't want to take in video frames directly, and training via teacher forcing is probably inefficient here. So some world-model-tailored embedding like JEPA is probably better. I guess my confusion is that Yann seems to frame it as JEPA vs. LLM, but to me JEPA just seems like an encoder to generate embeddings that can be fed into an LLM. They seem complementary rather than substitutes.
This is easily generated synthetically from a kinematic model, at least up to a certain level of precision.
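For what it's worth, here's a hedged sketch of what that could look like: generating (observation, action, next observation) tuples from a toy 2-joint planar-arm kinematic model. The arm geometry, action distribution, and dataset size are all assumptions for illustration:

```python
# Sketch: synthetic (state, action, next_state) data from a simple kinematic model
# of a 2-joint planar arm. All constants are illustrative assumptions.
import numpy as np

L1, L2 = 0.3, 0.25           # link lengths in meters (assumed)
rng = np.random.default_rng(1)

def forward_kinematics(q):
    """End-effector (x, y) position for joint angles q = (q1, q2)."""
    x = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    y = L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])
    return np.array([x, y])

dataset = []
q = rng.uniform(-np.pi, np.pi, size=2)
for _ in range(10_000):
    dq = rng.normal(0, 0.05, size=2)      # a small random joint-space action
    q_next = q + dq
    dataset.append((forward_kinematics(q), dq, forward_kinematics(q_next)))
    q = q_next
```

Of course, the gap between this kind of idealized kinematics and real sensors and actuators is exactly the "up to a certain level of precision" caveat.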
It did work for AlphaGo Zero (and later AlphaZero), which were entirely trained on synthetic data. But that's for very simple games with strict formal rules, like Go and chess.
Are you saying that LLMs hold concepts in latent space (weights?), but the actual predictions are always in tokens (thus inefficient and lossy), whereas JEPA operates directly on concepts in latent space (plus encoders/decoders)?
I might be using the jargon incorrectly!
That's very much an unsolved problem, and I don't know how far Meta is along that path. Not very far, I assume.
> V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.
> After the actionless pre-training stage, the model can make predictions about how the world might evolve—however, these predictions don’t directly take into account specific actions that an agent would take. In the second stage of training, we focus on making the model more useful for planning by using robot data, which includes visual observations (video) and the control actions that the robot was executing. We incorporate this data into the JEPA training procedure by providing the action information to the predictor. After training on this additional data, the predictor learns to account for specific actions when making predictions and can then be used for control. We don’t need a lot of robot data for this second phase—in our technical report, we show that training with only 62 hours of robot data already results in a model that can be used for planning and control.
> We demonstrate how V-JEPA 2 can be used for zero-shot robot planning in new environments and involving objects not seen during training. Unlike other robot foundation models—which usually require that some training data come from the specific robot instance and environment where the model is deployed—we train the model on the open source DROID dataset and then deploy it directly on robots in our labs. We show that the V-JEPA 2 predictor can be used for foundational tasks like reaching, picking up an object, and placing it in a new location.
> For short-horizon tasks, such as picking or placing an object, we specify a goal in the form of an image. We use the V-JEPA 2 encoder to get embeddings of the current and goal states. Starting from its observed current state, the robot then plans by using the predictor to imagine the consequences of taking a collection of candidate actions and rating the candidates based on how close they get to the desired goal. At each time step, the robot re-plans and executes the top-rated next action toward that goal via model-predictive control. For longer horizon tasks, such as picking up an object and placing it in the right spot, we specify a series of visual subgoals that the robot tries to achieve in sequence, similar to visual imitation learning observed in humans. With these visual subgoals, V-JEPA 2 achieves success rates of 65% – 80% for pick-and-placing new objects in new and unseen environments.
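Reading that, the planning loop seems to be a sample-and-score procedure in embedding space. Here's a schematic sketch of that loop; `encoder`, `predictor`, the random-shooting proposal distribution, and the action dimensionality are placeholders I made up, not Meta's actual implementation:

```python
# Schematic sketch of goal-image planning via a learned encoder + predictor:
# sample candidate actions, imagine their outcomes, pick the one closest to the goal.
import torch

def plan_next_action(encoder, predictor, current_frame, goal_frame,
                     num_candidates=256, action_dim=7):
    z_current = encoder(current_frame)    # embedding of what the robot sees now
    z_goal = encoder(goal_frame)          # embedding of the image goal

    candidates = torch.randn(num_candidates, action_dim)   # random-shooting proposals
    scores = []
    for action in candidates:
        z_predicted = predictor(z_current, action)      # imagined consequence of the action
        scores.append(torch.dist(z_predicted, z_goal))  # distance to the goal embedding
    best = torch.stack(scores).argmin()
    return candidates[best]
```

At each time step only the chosen action is executed, and the robot then re-plans from its new observation, which is the model-predictive-control loop the quote describes.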
BTW I'm very much not an expert here and I'm just trying to understand how this system works end to end. Don't take anything I write here as authoritative.
https://ai.meta.com/research/publications/sonar-sentence-lev...
The training objective is, schematically,

    minimize over θ, φ:  || P_φ(E_θ(x), Δ) - sg(E_θ'(y)) ||

where y is the full video, x the masked video, E_θ(.) the learned encoder (semantic embedding), P_φ(.) the learned predictor, Δ the mask (which patches in the video were dropped), and sg(.) a stop gradient that prevents gradient propagation into E_θ'(.), which in turn is an exponential moving average of E_θ(.), i.e. θ'_new <- τ θ'_old + (1 - τ) θ. So the loss is applied only to the predictions of the masked patches, while the target encoder of the full video follows the learned one. This asymmetry in learning prevents collapse of the encoder to a trivial constant.
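A minimal code sketch of that loss, assuming toy stand-ins for the encoder and predictor (the real models are video transformers; the shapes, the mean-pooling "predictor", and τ below are illustrative assumptions):

```python
# Minimal sketch of a masked latent-prediction loss with an EMA target encoder
# and a stop-gradient on the targets. Not the actual V-JEPA 2 code.
import copy
import torch
import torch.nn as nn

encoder = nn.Linear(768, 256)              # E_theta: stand-in for the video encoder
predictor = nn.Linear(256, 256)            # P_phi: predicts embeddings of masked patches
target_encoder = copy.deepcopy(encoder)    # E_theta': EMA copy, never backpropagated
for p in target_encoder.parameters():
    p.requires_grad_(False)

def training_step(patches, mask, tau=0.996):
    # patches: (num_patches, 768); mask: True where a patch was dropped from x
    z_context = encoder(patches[~mask])                    # embed the visible patches
    z_pred = predictor(z_context.mean(0, keepdim=True))    # crude stand-in for P_phi
    with torch.no_grad():                                  # sg(.): stop gradient on targets
        z_target = target_encoder(patches[mask]).mean(0, keepdim=True)
    loss = (z_pred - z_target).abs().mean()                # loss only on masked patches
    loss.backward()                                        # (optimizer step omitted)
    # EMA update: theta'_new <- tau * theta'_old + (1 - tau) * theta
    with torch.no_grad():
        for p_t, p in zip(target_encoder.parameters(), encoder.parameters()):
            p_t.mul_(tau).add_(p, alpha=1 - tau)
    return loss.item()
```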
It's one of those ideas I've had kicking around for a while: if you fused decent object tracking with an understanding of Verlet integration, you should, in principle, be able to start measuring all sorts of physical quantities quite easily.
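For instance, a rough sketch of that idea: given an object tracker's positions at a fixed frame rate, the same second difference that drives Verlet integration recovers the object's acceleration (the synthetic "tracked" trajectory below is an assumption standing in for real tracker output):

```python
# Estimate acceleration from tracked positions via the Verlet-style second difference.
import numpy as np

fps = 60.0
dt = 1.0 / fps
t = np.arange(0, 1.0, dt)
y = 2.0 - 0.5 * 9.81 * t**2       # a dropped object, standing in for tracker output

# a_i ≈ (y_{i+1} - 2*y_i + y_{i-1}) / dt^2
a = (y[2:] - 2 * y[1:-1] + y[:-2]) / dt**2
print(f"estimated acceleration ≈ {a.mean():.2f} m/s^2")   # ≈ -9.81
```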
As usual, comparisons with humans provide little practical insight into what's achievable with ML. Humans don't have to learn everything from scratch like ML models do: you don't expect ML models to learn language from a few thousand tokens just because humans can, so you similarly shouldn't expect neural networks to learn reasoning from world interaction alone.
I mean, it still takes them much more time than it takes to train even the largest LLMs we use (a couple months)
Human: 1000 tok * 60 * 86400 * 365 = 2 Trillion tokens / year
GPT-4: 13 Trillion tokens
Llama-3: 15 Trillion tokens
Related: https://en.wikipedia.org/wiki/Form_constant
Between the non-uniformity of receptor density (e.g., fovea vs. peripheral vision, though this is general across all senses), dynamic receptive fields, and the fact that information is encoded in spike rates and timing patterns across neural populations, the idea of pixels in some bitmap at some resolution is beyond misleading. There is no pixel data, just sparsely coded feature representations capturing things like edges, textures, motion, color contrast and the like, already at the retina.
While hundreds of trillions of photons might hit our photoreceptors, more than 99% of that is filtered and/or compressed before even reaching retinal ganglion cells. Only a tiny fraction, about 10 million bits/sec, of the original photon signal rate is transferred through the optic nerve (per eye). This pattern of filtering and attentive prioritization of information in signals continues as we go from sensory fields to thalamus to higher cortical areas.
So while we might encounter factoids like "on the order of a billion bits per second of data hit photoreceptors" or [10Mb/s transferred](https://www.britannica.com/science/information-theory/Physio...) along optic nerves, it's important to keep in mind that a lot of the intuition gained from digital information processing does not transfer in any meaningful sense to the brain.
J.E.P.A.
A good definition of "real AGI" might be: a multimodal model which understands time-based media, space, and object behavior, and hence true agency.
Phenomenology is the philosophy of "things as they seem," not "knowledge (words) about things." Seem to our senses, not understood through language.
LLMs, of course, trade in language tokens.
We can extend their behavior with front ends which convert other media types into such tokens.
But we can do better with multimodal models which are trained directly on other inputs. E.g. integrating image classifiers with language models architecturally.
With those one can sort of understand time-based media, by sampling a stream and getting e.g. transcripts.
But again, it's even better to build a time-based multimodal model which directly ingests time-based media rather than sampling it. (Other architectures than transformers are going to be required to do this well, IMO...)
The bootstrapping continues. This work is about training models to understand world and object properties by introducing agency.
Significant footnote: implicitly models trained to interact with the world necessarily have a "self model" which interacts with the "world model." Presumably they are trained to preserve their expensive "self." Hmmmmm....
When we have a model that knows about things not just as nodes in a language graph but also how such things look, and sound, and move, and "feel" (how much mass they have, how they move, etc.)...
...well, that is approaching indistinguishable from one of us, at least wrt embodiment and agency.
Those models don't have any understanding of physics; they just regurgitate what they see in their vision-based training set, just like any image or video generation model does.
Monkey see other monkey cannot go through wall, monkey don't try go through wall.
Of course these models don't understand physics in the way a physicist or a mathematician would. But they do form a model of the world that can be used for forecasting and reasoning, in a way possibly not unlike how humans and other animals operate when interacting with the physical world.
I mean... we are just monkeys. Did we not learn this way when we were younger?
These models/robots aren't superintelligent by any means, but "Monkey see other monkey cannot go through wall, monkey don't try go through wall" isn't far off from how some animals/humans "learn".