How do LLMs report scientific text?

In short, by exaggerating.

It seems so convenient: when you are short of time, you ask ChatGPT or another chatbot to summarise a scientific paper and quickly get the gist of it. But in up to 73 per cent of cases, these large language models (LLMs) produce inaccurate conclusions…

Over the course of a year, we collected 4,900 summaries. When we analysed them, we found that six out of ten models systematically exaggerated claims made in the original texts.

This prompts the question: why? My initial thought was that since these models are built to generalize, and exaggeration is a form of faulty generalization, they might actually be working as intended. This aligns with the study’s finding that LLMs are nearly five times more likely to produce broad generalizations than humans.

More interestingly, attempting to prompt LLMs for accuracy often backfires. They tend to double down on generalizations, becoming twice as likely to produce overgeneralized answers when asked to stick to the facts.
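To make concrete what "prompting for accuracy" can look like, here is a minimal sketch, not the study's actual setup: it compares a plain summarisation prompt with one that explicitly asks the model to stick to the facts. The OpenAI Python client, the model name and the prompt wording are all assumptions for illustration.

```python
# Minimal sketch: run the same abstract through a default prompt and an
# accuracy-focused prompt, then compare the two summaries by hand.
# Assumes the OpenAI Python client and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

ABSTRACT = "…paste the abstract of a scientific paper here…"

PROMPTS = {
    "default": "Summarise this abstract in two sentences:\n\n{text}",
    "accuracy": (
        "Summarise this abstract in two sentences. Stick strictly to the facts "
        "and do not generalise beyond what the authors claim:\n\n{text}"
    ),
}

for label, template in PROMPTS.items():
    resp = client.chat.completions.create(
        model="gpt-4o",  # hypothetical model choice; any chat model would do
        messages=[{"role": "user", "content": template.format(text=ABSTRACT)}],
    )
    print(f"--- {label} ---")
    print(resp.choices[0].message.content)
```

Checking both outputs against what the abstract actually claims gives a rough, small-scale feel for the overgeneralization the study measured across thousands of summaries.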

On the bright side, LLMs could soon replace the media when it comes to science reporting.

Source paper