Show HN: Inworld TTS – high-quality, affordable, and low-latency TTS
High-quality voice APIs are usually expensive, slow, or both, while cheaper and faster options often lack realism. We built Inworld TTS to bridge that gap.
We just released two multilingual models. Our small model, TTS-1, is on par with SOTA models on quality, as measured by the objective metrics WER, SIM, and DNSMOS. The larger model, TTS-1-Max, is even better: it produces more nuanced speech and has ~3.5% better WER, averaged across all 11 supported languages. Both models also support markup tags (e.g. prepend "[happy]" to the text to make the generation sound more enthusiastic).
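For a sense of how the markup tags are used, here's a minimal Python sketch of a synthesis request; the endpoint URL, header, and field names are placeholders for illustration, not the documented API:

    import requests

    # Hypothetical request shape -- check the actual API docs for the real
    # endpoint and fields. The markup tag is simply prepended to the text.
    resp = requests.post(
        "https://api.inworld.ai/tts/v1/voice",            # placeholder URL
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "text": "[happy] Thanks for checking out the new models!",
            "voiceId": "default",                          # placeholder field
            "modelId": "inworld-tts-1",                    # placeholder field
        },
    )
    resp.raise_for_status()
    with open("out.wav", "wb") as f:
        f.write(resp.content)   # assumes the response body is raw audio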
The models use LLaMA 1B and LLaMA 8B as the SpeechLM backbones for TTS-1 and TTS-1-Max respectively. We up-trained both models on a mixture of text and audio, then fine-tuned them on text-audio pairs and polished the final checkpoints with GRPO on a small high-quality dataset. Our Speech Lab team (4 MLEs) started collecting audio data and exploring different audio codec architectures around late February. We were inspired by the simplicity of Xcodec2's single-vector-quantization neural audio codec architecture and decided to use a similar idea. Training started in early April; once the codec was ready, training the SpeechLMs took another month and a half, and we finished in mid-June, all on 32 H100 GPUs.
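If you're not familiar with the single-vector-quantization idea behind Xcodec2-style codecs, here's a generic PyTorch sketch of such a quantizer (classic VQ-VAE-style, not our actual codec code): encoder frames get snapped to the nearest codebook entry, which yields the discrete speech tokens a SpeechLM can be trained on.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SingleCodebookVQ(nn.Module):
        """Generic single-codebook vector quantizer, the idea used by
        single-VQ codecs like Xcodec2. Sizes here are illustrative."""
        def __init__(self, codebook_size=65536, dim=512, beta=0.25):
            super().__init__()
            self.codebook = nn.Embedding(codebook_size, dim)
            self.beta = beta

        def forward(self, z):                  # z: (batch, time, dim) encoder frames
            flat = z.reshape(-1, z.size(-1))
            # squared L2 distance from each frame to every codebook entry
            d = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))
            idx = d.argmin(dim=1).view(z.shape[:-1])     # discrete speech tokens
            z_q = self.codebook(idx)
            # codebook + commitment losses, straight-through gradient to encoder
            loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
            z_q = z + (z_q - z).detach()
            return z_q, idx, loss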
To make the models real-time ready at serving time, we collaborated with Modular to migrate from a vanilla vLLM setup to the Mojo-written MAX server. Our bet on keeping the serving architecture as simple as possible paid off: both models turned out to be really fast. TTS-1, which can be accessed via a streaming API, has ~500ms p90 latency to the first ~2 seconds of audio. Pricing is simple: $5 per 1M characters. API access to the larger model will open soon, and we'll share more details about the serving performance optimizations in the coming weeks.
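To make the latency claim concrete, here's a rough way to measure time-to-first-audio against a streaming endpoint in plain Python; the URL and payload fields are placeholders, not the actual API surface:

    import time
    import requests

    # Placeholder endpoint and fields -- illustration only.
    start = time.monotonic()
    with requests.post(
        "https://api.inworld.ai/tts/v1/voice:stream",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={"text": "[happy] Streaming latency check.", "modelId": "inworld-tts-1"},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=4096):
            # the ~500ms p90 figure refers to receiving this first chunk of audio
            print(f"first audio bytes after {time.monotonic() - start:.3f}s")
            break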
We are also about to release all of the training, modeling, and benchmarking code on GitHub to be transparent about how we built this. The repo is flexible and can easily be adapted to train an arbitrary neural net, but we'll release it with a focus on speech modeling. For multi-node/multi-GPU training we used PyTorch Lightning, which proved easy to use and reliable.
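For context on the Lightning setup, here's a minimal sketch of what multi-node/multi-GPU training looks like with it; the toy module and the numbers below are placeholders, not our actual SpeechLM or configs:

    import torch
    import torch.nn as nn
    import pytorch_lightning as pl

    class ToySpeechLM(pl.LightningModule):
        """Placeholder module standing in for the real SpeechLM backbone."""
        def __init__(self):
            super().__init__()
            self.net = nn.Linear(512, 65536)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = nn.functional.cross_entropy(self.net(x), y)
            self.log("train_loss", loss)
            return loss

        def configure_optimizers(self):
            return torch.optim.AdamW(self.parameters(), lr=1e-4)

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=8,            # GPUs per node
        num_nodes=4,          # 4 nodes x 8 GPUs = 32 GPUs total
        strategy="ddp",       # distributed data parallel across all GPUs
        precision="bf16-mixed",
        max_steps=100_000,
    )
    # trainer.fit(ToySpeechLM(), train_dataloaders=...)  # dataloader omitted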
--
Check the TTS out at https://inworld.ai/tts
Happy to answer any questions you have!