“We wanted to make sure people with proximity to the data, their country, the language they speak, had a hand in choosing what language came into the model’s training,” says Jernite.
BLOOM, the first truly open-science, open-access, and multilingual (46 languages) large language model, with 176B parameters (slightly larger than GPT-3), will soon be released as a complete pretrained model. Behind the project is BigScience, a wide-scale collaboration of over 1,000 researchers.
The project is impressive overall, both for the scale of the collaboration and for its outcome. It's also an engineering delight to watch: the model has been trained on 384 A100 GPUs (with 80 GB of memory each) since March 11, 2022.
BigScience provides daily updates on training (having hit its initial target earlier than planned, the model is currently being trained for "a few more days"). See the links in the comments to follow the updates and download the model. The full model will be released on HuggingFace (also a partner of the project).

This is a significant step forward for at least two reasons: the way the training data was collected and the core values behind the initiative. BigScience seems to have prioritized data quality by hand-crafting the training data. In a world of models that favor kitchen-sink approaches (because they can!), this is progress. More obviously, BLOOM paves the way for true democratization by removing the strings that have been attached to the use of such models by OpenAI, Google, and Facebook (apply for API access, accredited researchers only, etc.).
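For anyone planning to try the model once the weights land on HuggingFace, loading it should follow the standard transformers workflow. Here is a minimal sketch, assuming the checkpoint is published under a `bigscience/bloom` model ID (an assumption until the official release) and keeping in mind that the full 176B model needs far more memory than a single GPU provides, so in practice you would start with a smaller checkpoint or a multi-GPU setup:

```python
# Minimal sketch: loading BLOOM with Hugging Face transformers once the
# checkpoint is public. The "bigscience/bloom" model ID is an assumption;
# the full 176B model requires hundreds of GB of memory, so expect to use
# a smaller released variant or sharded multi-GPU loading in practice.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom"  # hypothetical ID until the official release

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate a short continuation from a prompt.
inputs = tokenizer("BigScience is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

This is exactly the open-access point of the release: no API application or accreditation step, just a public checkpoint anyone can download.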