Show HN: Help improve language coverage in Common Crawl
8 ccgreg 0 6/24/2025, 3:36:53 PM
Hey HN, the Common Crawl Foundation is trying to expand the coverage
of our crawl to more languages, regions and cultures, and if you speak
a language other than English (LOTE), you can help in two ways:
- By validating Language Identification data (LangID or LID): https://dynabench.org/tasks/text-language-identification
- By contributing URLs for our seed crawl: https://github.com/commoncrawl/web-languages
We're also organizing a Workshop on Multilingual Data Quality Signals (WMDQS) with MLCommons and EleutherAI, which has an open call for papers (https://wmdqs.org/cfp/) and an upcoming shared task on language identification (https://wmdqs.org/shared-task/).