New research introduces HPLT v2, a significantly expanded and refined multilingual dataset designed to support the development of high-performance language and translation models across 193 languages.
This work builds on earlier HPLT releases with a 2.5x increase in data volume, offering over 8 trillion tokens of monolingual text and 380 million sentence pairs of parallel data. With improved filtering, more accurate language identification, and a fully transparent data pipeline, HPLT v2 sets a new standard for large-scale, reproducible multilingual corpora in NLP.
📚 Curious about the full findings?
Download the full paper to explore how HPLT v2 is powering the next generation of multilingual AI.