In a sample of 6.4 billion sentences in 90 languages from the Web, this study finds that 57.1% are low-quality machine translations. In addition, it is low-quality content produced in English (to generate ad revenue) that is translated en masse into other languages (again, to generate ad revenue).
The study discusses the negative implications for training large language models (garbage in, garbage out), but the increasingly poor quality of public web content is concerning in its own right.