Ask HN: Best foundation model for CLM fine-tuning?
I have a largish (2 GB) corpus of curated, high-quality text in some low-resource language, and I want to build a model that would provide an advanced "auto complete" service for writers.
I'm thinking of taking a decoder-only model such as Llama, Mistral, or Gemma, slicing off the embedding layers (which are built around languages I don't need), creating new ones (perhaps initialized from a FastText model trained on the corpus) paired with a tokenizer trained from scratch on my corpus, and then training the model on my corpus until convergence.
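Roughly, the tokenizer/embedding swap I have in mind looks like this (a minimal sketch using Hugging Face transformers and the fasttext package; the corpus path, base model name, and vocab size are placeholders):

    import torch
    import fasttext
    from transformers import AutoTokenizer, AutoModelForCausalLM

    base = "mistralai/Mistral-7B-v0.3"  # placeholder; whichever base wins
    old_tok = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

    # Train a new tokenizer on the corpus, reusing the base tokenizer's algorithm.
    def corpus_lines(path="corpus.txt"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line

    new_tok = old_tok.train_new_from_iterator(corpus_lines(), vocab_size=32_000)

    # Resize the embedding matrix to the new vocab, then overwrite it with
    # FastText vectors trained at the model's hidden size (or projected up).
    model.resize_token_embeddings(len(new_tok))
    hidden = model.get_input_embeddings().weight.shape[1]
    ft = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=hidden)

    with torch.no_grad():
        emb = model.get_input_embeddings().weight
        for tok_id in range(len(new_tok)):
            emb[tok_id] = torch.tensor(
                ft.get_word_vector(new_tok.convert_ids_to_tokens(tok_id)),
                dtype=emb.dtype,
            )

One thing I'm aware of: if the base model doesn't tie its input and output embeddings (some of these models don't), the lm_head is a second fresh matrix that would also need initializing and training.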
Additional ideas I'm considering: a custom loss function for synonym-aware training (based on a custom high-quality thesaurus), where synonyms of the "correct" word receive partial credit; and POS-tagging the corpus with a language-specific POS-tagger, then adding a POS-tagging head to the model as a multi-task learning objective, to encourage grammatical generation.
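For the synonym part, what I have in mind is a soft-label cross-entropy: most of the probability mass stays on the gold token and a small amount is spread over its synonyms. A minimal sketch (synonym_ids is a hypothetical dict I'd build from the thesaurus, mapping token id -> synonym token ids; padding/ignore labels are assumed to be masked out beforehand):

    import torch
    import torch.nn.functional as F

    def synonym_soft_ce(logits, labels, synonym_ids, syn_mass=0.1):
        """logits: (batch, seq, vocab); labels: (batch, seq), no ignore_index."""
        target = torch.zeros_like(logits)
        target.scatter_(-1, labels.unsqueeze(-1), 1.0)    # start from one-hot
        for b in range(labels.size(0)):                   # slow reference loops;
            for t in range(labels.size(1)):               # precompute a sparse matrix in practice
                syns = synonym_ids.get(int(labels[b, t]), [])
                if syns:
                    target[b, t, labels[b, t]] = 1.0 - syn_mass
                    target[b, t, syns] = syn_mass / len(syns)
        log_probs = F.log_softmax(logits, dim=-1)
        return -(target * log_probs).sum(dim=-1).mean()

The POS head would then just be an extra linear layer over the final hidden states, with its tagging cross-entropy added to this loss at a small weight.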
To use a good model as the base, I'll probably have to rely on PEFT (LoRA). My current setup is whatever Colab Pro+ gives me, so I can probably handle models in the 7B-12B range?
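Concretely, the setup I'm imagining is QLoRA-style: load the base in 4-bit, put LoRA adapters on the attention/MLP projections, and train the fresh embeddings and lm_head fully via modules_to_save. A sketch with peft + bitsandbytes (model name, tokenizer path, and hyperparameters are placeholders; new_tok continues from the snippet above):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    base = "mistralai/Mistral-7B-v0.3"                       # placeholder
    new_tok = AutoTokenizer.from_pretrained("new_tokenizer")  # saved from the step above

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb)
    model.resize_token_embeddings(len(new_tok))

    lora = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        modules_to_save=["embed_tokens", "lm_head"],  # fresh layers get full updates
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()

The fully trained embedding matrices (and their optimizer states) would be the main extra memory cost on top of the adapters.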
My main question: which base model would be best for this task? (Again, this is for completing general writing of all kinds, not programming or advanced reasoning.)
Also, will the synonym and POS additions help or hurt?
Anything else I might be missing?
Thanks!