Bagel: Open-source unified multimodal model

218 points by tosh | 33 comments | 5/26/2025, 5:51:55 AM | bagel-ai.org ↗

Comments (33)

jjrv · 2d ago
I found a losslessly compressed version: https://github.com/LeanModels/Bagel-DFloat11

It works following the readme instructions, at least on Ubuntu, on my RTX 3090 GPU with 24 GB of memory, just barely. I have to close most other windows and lower the screen resolution to be able to load the model. Then it generates or edits images in 2-3 minutes. I only have this one GPU and am using Chrome on the same machine for the browser interface.

The original release won't run on this hardware, but the compressed one is supposed to give identical results.

jjrv · 2d ago
I also asked it to explain what's funny in some newspaper comic strips in Finnish. It misunderstands some words and makes up nutty explanations, but most phrases still get translated correctly and its explanations do fit the drawn scenes once you factor in those misunderstandings. For such a small model that seemed impressive.
spuz · 2d ago
I'm interested in potential alternatives to ChatGPT's advanced voice mode. When I see the word "multimodal" I'm hopeful the model understands text + voice but instead it almost always seems to refer to text + images. Is there a keyword that I can use to look for models that work with voice similar to ChatGPT's advanced voice mode?
cjbprime · 1d ago
I don't know that ChatGPT's voice mode is using audio as a transformer input directly.

It could just be using speech to text (e.g. Whisper) on your input, and then using its text model on the text of your words. Or has OpenAI said that they aren't doing this?

mrshu · 1d ago
OpenAI does not provide many details about their models these days but they do mention that the "Advanced voice" within ChatGPT operates on audio input directly:

> Advanced voice uses natively multimodal models, such as GPT-4o, which means that it directly “hears” and generates audio, providing for more natural, real-time conversations that pick up on non-verbal cues, such as the speed you’re talking, and can respond with emotion.

From https://help.openai.com/en/articles/8400625-voice-mode-faq

amrrs · 2d ago
Google Gemini Live is pretty good.

If you want to try voice only, try unmute.sh by Kyutai, which will eventually be open-sourced.

spuz · 1d ago
Thanks - it seems that Gemini Live is pretty far behind advanced voice mode at the moment. For example, I can't get it to speak slower when I want to understand what it is saying.

I'm still interested in what keyword I could use to search for the latest research in voice models.

akacrobat · 2d ago
This looks exciting! There is a serious dearth of high-quality open-source models with multimodal capabilities. So, really looking forward to playing with this one.

Has anyone here experimented with fine-tuning this for domain-specific applications?

charcircuit · 2d ago
The demo shows pretty weak performance compared to other small models. It misunderstood my question by picking an uncommon interpretation of it. After I clarified what I wanted, it lost all the context I had provided in the previous message. My benchmark query is intentionally ambiguous; I use it to see how models handle ambiguity, deal with information that can go out of date, and avoid hallucination. Usually weak models will just hallucinate an answer, but this model was the first that wasn't able to understand the question at all.
LourensT · 2d ago
These days, papers come with an advertisement video
jxjnskkzxxhx · 2d ago
As someone who used to be in academia, I think it isn't bad in itself; I just worry that by comparison it raises the bar of effort one has to clear to get their work noticed.
kleiba · 2d ago
Compared to the effort required to play in that field at all, making a video is almost negligible.
jxjnskkzxxhx · 2d ago
If it's an obligation, it's admin. Scientists hate admin.
lern_too_spel · 1d ago
This has been common for CG papers for two decades. Image generation is CG.
pleone · 2d ago
It's from the ByteDance team, right? The team behind TikTok, CapCut, BuzzVideo, and more. Any thoughts on that?
rvnx · 2d ago
Like BYD vs Tesla. The US is falling further behind and becoming more closed than ever (e.g. the Chinese Qwen LLMs versus LLaMA). So long-term, China may emerge as the dominant force in tech.
mdrzn · 2d ago
A quick test in the "demo" link doesn't show it to be "as smart" as it appeared in the demos on the page. I really hope it does all it's promising to do, but I'm skeptical so far.
mrec · 1d ago
I found it surprising that even one of the demos on the page appeared to get it wrong. (Chat example #5, explaining the "My Handwriting In Exams" meme.) Not horribly wrong, but still an odd example to cherry-pick for publicity material.

ETA: oof, and it's still getting hands wrong. (Editing demo #12)

moffkalast · 2d ago
Oh no it's The Everything Bagel.
GrantMoyer · 2d ago
Nice, it's really an open source model, Apache 2.0.
mnky9800n · 2d ago
I couldn’t find it, what are the hardware expectations for bagel?
tonii141 · 2d ago
If the model uses FP16 precision and has 7 billion active parameters, it would require approximately 14 GB of VRAM. I didn't read the paper.
sfphoton · 2d ago
How can you calculate required VRAM from precision and parameter number?
Havoc · 2d ago
Realistically you probably just want to look at the file size on Hugging Face and add ~2 GB for OS/Firefox tabs, plus a bit for context (depends, but let's say 1-2 GB).

The direct parameter-conversion math tends to be much less reliable than one would expect once quants are involved.

e.g.

7B @ Q8 = 7.1 GB [0]

30B @ Q8 = 34.6 GB [1]

Btw, you can also roughly estimate expected output speed if you know the device's memory throughput (rough sketch below). Note that this doesn't work for MoEs.

Also, I recently discovered that in CPU mode llama.cpp memory-maps the model file. For some models it loads less than a quarter of it into memory.

[0] https://huggingface.co/TheBloke/Llama-2-7B-GGUF/tree/main

[1] https://huggingface.co/TheBloke/LLaMA-30b-GGUF/tree/main
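
A minimal sketch of that napkin math, assuming a dense (non-MoE) model whose weights are read roughly once per generated token during decode; the function name, overhead numbers, and example bandwidth are illustrative assumptions, not measurements:

```python
# Rough napkin math for VRAM needs and decode speed, assuming a dense model
# that is memory-bandwidth-bound (weights read ~once per generated token).
def estimate(file_size_gb: float, bandwidth_gb_s: float,
             overhead_gb: float = 2.0, context_gb: float = 1.5):
    vram_gb = file_size_gb + overhead_gb + context_gb  # weights + OS/browser + context/KV cache
    tokens_per_s = bandwidth_gb_s / file_size_gb       # bandwidth / bytes touched per token
    return vram_gb, tokens_per_s

# Illustrative example: a 7B Q8 GGUF (~7.1 GB) on a ~936 GB/s GPU (RTX 3090 class).
vram, tps = estimate(7.1, 936)
print(f"~{vram:.1f} GB VRAM, ~{tps:.0f} tokens/s upper bound")
```

The tokens/s figure is an upper bound; real throughput is lower once compute, cache behavior, and context length come into play, and the whole estimate breaks down for MoEs, which only read a subset of weights per token.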

NitpickLawyer · 2d ago
Rule of thumb is parameter_count * precision. Precision can be anything [32,16,8,4] bits. 32bits is sometimes used in training (although less now I guess), and rarely in inference. For a while now "full" precision is 16bit (fp16, bf16), fp8 is 8bit, int4 is 4bit, and so on. Everything that's not "full" precision is also known as quantised. fp8 is a quantised version of the "full" model.

So quick napkin math can give you the VRAM usage for loading the model. 7b can be ~14GB full, 7GB in fp8 and ~3.5GB in 4bit (AWQ, int4, q4_k_m, etc). But that's just to load the model in VRAM. You also need some available VRAM to run inference, and there are a lot of things to consider there too. You need to be able to run a forward pass on the required context, you can keep a kv cache to speed up inference, you can do multiple sessions in parallel, and so on.

Context length is important to take into account because images take a lot of tokens. So what you could do with a 7b LLM at full precision on a 16GB VRAM GPU might not be possible with a VLM, because the context of your query might not fit into the remaining 2GB.
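
A quick illustration of that rule of thumb, weights only (it ignores KV cache, activations, and runtime overhead); the helper below is just a sketch:

```python
# Weights-only VRAM estimate: parameter_count * bytes_per_parameter.
# Ignores activations, KV cache, and inference overhead.
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "fp8/int8": 1, "int4": 0.5}

def weights_gb(params_billions: float, precision: str) -> float:
    # 1e9 params and 1e9 bytes-per-GB cancel, so billions * bytes-per-param is GB
    return params_billions * BYTES_PER_PARAM[precision]

for precision in BYTES_PER_PARAM:
    print(f"7B @ {precision}: ~{weights_gb(7, precision):.1f} GB")
# 7B comes out to ~28 GB at fp32, ~14 GB at fp16, ~7 GB at fp8, ~3.5 GB at int4
```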

a_t48 · 2d ago
A float16 is 2 bytes. 7B * 2 bytes = 14GB. I can't say if that's an accurate number, but that's almost certainly how tonii141 calculated it.
sfphoton · 2d ago
Oh, so FP16 means FloatingPoint16? I'm glad to learn something today, thanks!
gunalx · 2d ago
If you follow the Hugging Face link at the bottom you get to the actual model: https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT

It seems to be 7B, but as with other new architectures, don't expect to be able to run it quantized.

sandra_vu · 2d ago
Hi good job, team. Any plans to commercialize the model?
saretup · 2d ago
> Scalable Perceptual Generative Model

If you wanna call it Bagel, just call it Bagel. No need to make up a justification.

gregjw · 2d ago
bagel