Bagel: Open-source unified multimodal model

218 points by tosh | 33 comments | 5/26/2025, 5:51:55 AM | bagel-ai.org ↗

Comments (33)

jjrv · 2d ago
I found a losslessly compressed version: https://github.com/LeanModels/Bagel-DFloat11

It works following the readme instructions, at least on Ubuntu, on my RTX 3090 GPU with 24 GB of memory, just barely. I have to close most other windows and lower the screen resolution to be able to load the model. Then it generates or edits images in 2-3 minutes. I only have this one GPU and am using Chrome on the same machine for the browser interface.

The original release won't run on this hardware, but the compressed one is supposed to give identical results.

jjrv · 2d ago
I also asked it to explain what's funny in some newspaper comic strips in Finnish. It misunderstands some words and makes up nutty explanations, but most phrases still get translated correctly and its explanations do fit the drawn scenes once you factor in those misunderstandings. For such a small model that seemed impressive.
spuz · 2d ago
I'm interested in potential alternatives to ChatGPT's advanced voice mode. When I see the word "multimodal" I'm hopeful the model understands text + voice but instead it almost always seems to refer to text + images. Is there a keyword that I can use to look for models that work with voice similar to ChatGPT's advanced voice mode?
cjbprime · 1d ago
I don't know that ChatGPT's voice mode is using audio as a transformer input directly.

It could just be using speech to text (e.g. Whisper) on your input, and then using its text model on the text of your words. Or has OpenAI said that they aren't doing this?

mrshu · 1d ago
OpenAI does not provide many details about their models these days but they do mention that the "Advanced voice" within ChatGPT operates on audio input directly:

> Advanced voice uses natively multimodal models, such as GPT-4o, which means that it directly “hears” and generates audio, providing for more natural, real-time conversations that pick up on non-verbal cues, such as the speed you’re talking, and can respond with emotion.

From https://help.openai.com/en/articles/8400625-voice-mode-faq

amrrs · 2d ago
Google Gemini Live is pretty good.

If you want to try voice only, try unmute.sh by Kyutai, which will eventually be open-sourced.

spuz · 1d ago
Thanks - it seems that Gemini Live is pretty far behind advanced voice mode at the moment. For example, I can't get it to speak slower when I want to understand what it is saying.

I'm still interested in what keyword I could use to search for the latest research in voice models.

akacrobat · 2d ago
This looks exciting! There is a serious dearth of high-quality open-source models with multimodal capabilities. So, really looking forward to playing with this one.

Has anyone here experimented with fine-tuning this for domain-specific applications?

charcircuit · 2d ago
The demo shows pretty weak performance compared to other small models. It misunderstood my question by picking an uncommon interpretation of it. After I clarified what I wanted, it lost all the context I had provided in the previous message. My benchmark query is intentionally ambiguous; I use it to see how models handle ambiguity, deal with information that can go out of date, and avoid hallucination. Usually weak models will just hallucinate an answer, but this model was the first that wasn't able to understand the question at all.
LourensT · 2d ago
These days, papers come with an advertisement video
jxjnskkzxxhx · 2d ago
As someone who used to be in academia, I think it isn't bad in itself; I just worry that by comparison it raises the bar of effort one has to clear to get their work noticed.
kleiba · 2d ago
Compared to the effort required to play in that field at all, making a video is almost negligible.
jxjnskkzxxhx · 2d ago
If it's an obligation, it's admin. Scientists hate admin.
lern_too_spel · 1d ago
This has been common for CG papers for two decades. Image generation is CG.
pleone · 2d ago
It's from the ByteDance team, right? The team behind TikTok, CapCut, BuzzVideo, and more. Any thoughts on that?
rvnx · 2d ago
Like BYD vs Tesla. The US is falling further behind and becoming more closed than ever (e.g. the Chinese Qwen LLMs versus LLaMA). So long-term, China may emerge as the dominant force in tech.
mdrzn · 2d ago
A quick test in the "demo" link doesn't show it to be "as smart" as it appeared in the demos on the page. I really hope it does all it's promising to do, but I'm skeptical so far.
mrec · 1d ago
I found it surprising that even one of the demos on the page appeared to get it wrong. (Chat example #5, explaining the "My Handwriting In Exams" meme.) Not horribly wrong, but still an odd example to cherry-pick for publicity material.

ETA: oof, and it's still getting hands wrong. (Editing demo #12)

moffkalast · 2d ago
Oh no it's The Everything Bagel.
GrantMoyer · 2d ago
Nice, it's really an open source model, Apache 2.0.
mnky9800n · 2d ago
I couldn’t find it, what are the hardware expectations for bagel?
tonii141 · 2d ago
If the model uses FP16 precision and has 7 billion active parameters, it would require approximately 14 GB of VRAM. I didn't read the paper.
sfphoton · 2d ago
How can you calculate required VRAM from precision and parameter number?
Havoc · 2d ago
Realistically you probably just want to look at the file size on Hugging Face and add ~2 GB for OS/Firefox tabs, plus a bit for context (depends, but let's say 1-2 GB).

The direct parameter-conversion math tends to be much less reliable than one would expect once quants are involved.

e.g.

7B @ Q8 = 7.1 GB [0]

30B @ Q8 = 34.6 GB [1]

Btw, you can also roughly estimate expected output speed if you know the device's memory throughput (rough sketch below). Note that this doesn't work for MoEs.

Also, I recently discovered that in CPU mode llama.cpp memory-maps the model file. For some models it loads less than a quarter of it into memory.

[0] https://huggingface.co/TheBloke/Llama-2-7B-GGUF/tree/main

[1] https://huggingface.co/TheBloke/LLaMA-30b-GGUF/tree/main
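
A minimal sketch of that napkin math, assuming a dense (non-MoE) model whose weights are read roughly once per generated token during decode; the function name, overhead numbers, and example bandwidth are illustrative assumptions, not measurements:

```python
# Rough napkin math for VRAM needs and decode speed, assuming a dense model
# that is memory-bandwidth-bound (weights read ~once per generated token).
def estimate(file_size_gb: float, bandwidth_gb_s: float,
             overhead_gb: float = 2.0, context_gb: float = 1.5):
    vram_gb = file_size_gb + overhead_gb + context_gb  # weights + OS/browser + context/KV cache
    tokens_per_s = bandwidth_gb_s / file_size_gb       # bandwidth / bytes touched per token
    return vram_gb, tokens_per_s

# Illustrative example: a 7B Q8 GGUF (~7.1 GB) on a ~936 GB/s GPU (RTX 3090 class).
vram, tps = estimate(7.1, 936)
print(f"~{vram:.1f} GB VRAM, ~{tps:.0f} tokens/s upper bound")
```

The tokens/s figure is an upper bound; real throughput is lower once compute, cache behavior, and context length come into play, and the whole estimate breaks down for MoEs, which only read a subset of weights per token.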

NitpickLawyer · 2d ago
Rule of thumb is parameter_count * precision. Precision can be anything [32,16,8,4] bits. 32bits is sometimes used in training (although less now I guess), and rarely in inference. For a while now "full" precision is 16bit (fp16, bf16), fp8 is 8bit, int4 is 4bit, and so on. Everything that's not "full" precision is also known as quantised. fp8 is a quantised version of the "full" model.

So quick napkin math can give you the VRAM usage for loading the model. 7b can be ~14GB full, 7GB in fp8 and ~3.5GB in 4bit (AWQ, int4, q4_k_m, etc). But that's just to load the model in VRAM. You also need some available VRAM to run inference, and there are a lot of things to consider there too. You need to be able to run a forward pass on the required context, you can keep a kv cache to speed up inference, you can do multiple sessions in parallel, and so on.

Context length is important to take into account because images take a lot of tokens. So what you could do with a 7b LLM at full precision on a 16GB VRAM GPU might not be possible with a VLM, because the context of your query might not fit into the remaining 2GB.
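
A quick illustration of that rule of thumb, weights only (it ignores KV cache, activations, and runtime overhead); the helper below is just a sketch:

```python
# Weights-only VRAM estimate: parameter_count * bytes_per_parameter.
# Ignores activations, KV cache, and inference overhead.
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "fp8/int8": 1, "int4": 0.5}

def weights_gb(params_billions: float, precision: str) -> float:
    # 1e9 params and 1e9 bytes-per-GB cancel, so billions * bytes-per-param is GB
    return params_billions * BYTES_PER_PARAM[precision]

for precision in BYTES_PER_PARAM:
    print(f"7B @ {precision}: ~{weights_gb(7, precision):.1f} GB")
# 7B comes out to ~28 GB at fp32, ~14 GB at fp16, ~7 GB at fp8, ~3.5 GB at int4
```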

a_t48 · 2d ago
A float16 is 2 bytes. 7B * 2 bytes = 14GB. I can't say if that's an accurate number, but that's almost certainly how tonii141 calculated it.
sfphoton · 2d ago
Oh, so FP16 means FloatingPoint16? I'm glad to learn something today, thanks!
gunalx · 2d ago
If you follow the Hugging Face link at the bottom you get to the actual model: https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT

It seems to be 7B, but as with other new architectures, don't expect to be able to run it quantized.

sandra_vu · 2d ago
Hi good job, team. Any plans to commercialize the model?
saretup · 2d ago
> Scalable Perceptual Generative Model

If you wanna call it Bagel, just call it Bagel. No need to make up a justification.

gregjw · 2d ago
bagel