Show HN: Help improve language coverage in Common Crawl

8 ccgreg 0 6/24/2025, 3:36:53 PM
Hey HN, the Common Crawl Foundation is trying to expand the coverage of our crawl to more languages, regions, and cultures. If you speak a language other than English (LOTE), you can help in two ways:

By validating Language Identification data (LangID or LID): https://dynabench.org/tasks/text-language-identification

By contributing URLs for our seed crawl: https://github.com/commoncrawl/web-languages

We're also organizing a Workshop on Multilingual Data Quality Signals (WMDQS) with MLCommons and EleutherAI. The call for papers is open (https://wmdqs.org/cfp/), and there is an upcoming shared task on language identification (https://wmdqs.org/shared-task/).
