Ask HN: Best foundation model for CLM fine-tuning?
I have a largish (2 GB) corpus of curated, high-quality text in some low-resource language, and I want to build a model that would provide an advanced "auto complete" service for writers.
I'm thinking of taking a decoder-only model such as Llama, Mistral, or Gemma, slicing off the embedding layers (which are built around languages I don't need), creating new ones (perhaps initialized from a FastText model trained on the corpus) paired with a tokenizer trained from scratch on my corpus, and then training the model on my corpus until convergence.
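Roughly, the tokenizer/embedding swap I have in mind looks like this (a minimal sketch using Hugging Face transformers and the fasttext package; the corpus path, base model name, and vocab size are placeholders):

    import torch
    import fasttext
    from transformers import AutoTokenizer, AutoModelForCausalLM

    base = "mistralai/Mistral-7B-v0.3"  # placeholder; whichever base wins
    old_tok = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

    # Train a new tokenizer on the corpus, reusing the base tokenizer's algorithm.
    def corpus_lines(path="corpus.txt"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line

    new_tok = old_tok.train_new_from_iterator(corpus_lines(), vocab_size=32_000)

    # Resize the embedding matrix to the new vocab, then overwrite it with
    # FastText vectors trained at the model's hidden size (or projected up).
    model.resize_token_embeddings(len(new_tok))
    hidden = model.get_input_embeddings().weight.shape[1]
    ft = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=hidden)

    with torch.no_grad():
        emb = model.get_input_embeddings().weight
        for tok_id in range(len(new_tok)):
            emb[tok_id] = torch.tensor(
                ft.get_word_vector(new_tok.convert_ids_to_tokens(tok_id)),
                dtype=emb.dtype,
            )

One thing I'm aware of: if the base model doesn't tie its input and output embeddings (some of these models don't), the lm_head is a second fresh matrix that would also need initializing and training.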
Additional ideas I'm considering: a custom loss function for synonym-aware training (based on a custom high-quality thesaurus), where synonyms of the "correct" word receive partial credit; and POS-tagging the corpus with a language-specific POS-tagger, then adding a POS-tagging head to the model as a multi-task learning objective, to encourage grammatical generation.
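For the synonym part, what I have in mind is a soft-label cross-entropy: most of the probability mass stays on the gold token and a small amount is spread over its synonyms. A minimal sketch (synonym_ids is a hypothetical dict I'd build from the thesaurus, mapping token id -> synonym token ids; padding/ignore labels are assumed to be masked out beforehand):

    import torch
    import torch.nn.functional as F

    def synonym_soft_ce(logits, labels, synonym_ids, syn_mass=0.1):
        """logits: (batch, seq, vocab); labels: (batch, seq), no ignore_index."""
        target = torch.zeros_like(logits)
        target.scatter_(-1, labels.unsqueeze(-1), 1.0)    # start from one-hot
        for b in range(labels.size(0)):                   # slow reference loops;
            for t in range(labels.size(1)):               # precompute a sparse matrix in practice
                syns = synonym_ids.get(int(labels[b, t]), [])
                if syns:
                    target[b, t, labels[b, t]] = 1.0 - syn_mass
                    target[b, t, syns] = syn_mass / len(syns)
        log_probs = F.log_softmax(logits, dim=-1)
        return -(target * log_probs).sum(dim=-1).mean()

The POS head would then just be an extra linear layer over the final hidden states, with its tagging cross-entropy added to this loss at a small weight.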
To use a good model as the base, I'll probably have to rely on PEFT (LoRA). My current setup is whatever Colab Pro+ gives me, so I can probably handle models in the 7B-12B range?
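Concretely, the setup I'm imagining is QLoRA-style: load the base in 4-bit, put LoRA adapters on the attention/MLP projections, and train the fresh embeddings and lm_head fully via modules_to_save. A sketch with peft + bitsandbytes (model name, tokenizer path, and hyperparameters are placeholders; new_tok continues from the snippet above):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    base = "mistralai/Mistral-7B-v0.3"                       # placeholder
    new_tok = AutoTokenizer.from_pretrained("new_tokenizer")  # saved from the step above

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb)
    model.resize_token_embeddings(len(new_tok))

    lora = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        modules_to_save=["embed_tokens", "lm_head"],  # fresh layers get full updates
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()

The fully trained embedding matrices (and their optimizer states) would be the main extra memory cost on top of the adapters.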
My main question: which base model would be best for this task? (Again, this is for completing general writing of all kinds, not programming or advanced reasoning.)
Also, will the synonym and POS additions help or hurt?
Anything else I might be missing?
Thanks!