Generative AI models mostly inaccurate when sourcing statistical data, finds trial
UK – Most current large language models failed to accurately answer a question focused on UK statistics, in an experiment published by the MRS Census and GeoDemographics Group (CGG).
According to the report, most of the AI models got the answer wrong or refused to answer the question. Only one system returned the correct answers (according to the ONS) the first time, while another got it right on the second attempt at prompting (with no changes to the prompt).
The report found that while the outputs looked coherent in terms of their vocabulary and grammar, the quality of the numbers provided was poor. Additionally, running the same question again was likely to result in a different answer.
Report authors Jaan Nellis and Peter Furness conducted the trials to examine how AI tools perform against a specific query about specific data, following a discussion at a CGG meeting earlier this year.
Speaking to Research Live, Furness explained: “It’s no longer the preserve of experts to get access to public datasets – anybody can type something into Google and get an AI summary, and it goes away and finds data and comes back with answers. That’s wonderful, but on the other hand, in the wrong hands – i.e. in the hands of perhaps less skilled people, people with axes to grind – it could be quite dangerous if the numbers being put out are not accurate. If there was a warning flag to be raised, we wanted to raise it.”
The report authors’ expectation, borne out in the test results, was that if you repeat a question, you get different results.
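As a rough illustration of that kind of repeatability check, the sketch below sends an identical prompt to an LLM several times and tallies the answers. It assumes the OpenAI Python client is available; the model name, prompt wording and run count are illustrative, not those used in the CGG trial.

```python
# Minimal sketch of a repeat-prompt consistency check (illustrative only;
# not the CGG experiment's actual setup).
from collections import Counter

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = "What was UK GDP in 2023, in current prices? Answer with a single figure."
RUNS = 5  # number of identical attempts

answers = []
for _ in range(RUNS):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": PROMPT}],
    )
    answers.append(resp.choices[0].message.content.strip())

# A consistent system would return the same answer every time; the tally
# below makes any run-to-run drift visible.
print(Counter(answers))
```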
Nellis said: “We tested Google’s chatbot and Google’s search agent in June/July. They should be one and the same, but they weren’t, back then – we got different results. Then we came back in September because we had some queries we wanted to resolve, and lo and behold Gemini was consistent across the Google platform – they were at least reporting the same things. [AI models] are refining all the time and you would hope that that refinement would lead to a better situation.
“We thought that asking for GDP was a simple thing to ask for, but it’s quite complicated because there are lots of different GDPs, so you need to be relatively cognisant of which GDP you want. The one we asked for was what we considered to be the most common, standard metric.”
The errors in the experiment results, according to Nellis, were primarily due to the search algorithm selecting old webpages with out-of-date figures.
There are two core ways an AI model can operate. In the first, you ask an LLM a question and receive a response drawn not from any external data but from what the model has already learned. Most systems, however, use the second: a technique called RAG (retrieval-augmented generation), which retrieves external sources to ground the answer.
Nellis explained: “RAG runs a search algorithm to get what it considers to be a tight sample of pages that will have the answer in there somewhere. That’s the way we would do it if we were doing it by hand, but you get the issues with the search algorithm – is the search algorithm accurate? It’s slightly arbitrary which pages get pushed to the front and which don’t.”
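To make the pattern Nellis describes concrete, here is a minimal, self-contained sketch of the retrieve-then-generate loop: a naive keyword search ranks a toy set of pages, and the top hits are pasted into a prompt that would then be sent to the LLM. The corpus, scoring rule and prompt wording are illustrative assumptions, not the internals of any of the systems tested.

```python
# Toy sketch of the RAG pattern: retrieve candidate pages, then build a prompt
# asking the model to answer only from those pages. The corpus and the scoring
# rule are deliberately naive stand-ins for a real search engine and index.

TOY_CORPUS = {
    "https://example.org/gdp-2023": "UK GDP in 2023 was ... (current prices).",
    "https://example.org/gdp-2019": "UK GDP in 2019 was ... (older release).",
    "https://example.org/weather":  "Rain is expected across the UK this week.",
}

def retrieve(question: str, k: int = 2) -> list[tuple[str, str]]:
    """Rank pages by crude keyword overlap with the question and keep the top k."""
    q_terms = set(question.lower().split())
    scored = [
        (len(q_terms & set(text.lower().split())), url, text)
        for url, text in TOY_CORPUS.items()
    ]
    # Which pages end up on top is the "slightly arbitrary" step Nellis warns
    # about: a stale page can outrank a current one.
    scored.sort(reverse=True)
    return [(url, text) for _, url, text in scored[:k]]

def build_prompt(question: str) -> str:
    """Assemble the augmented prompt that would be sent to the LLM."""
    sources = "\n".join(f"[{url}] {text}" for url, text in retrieve(question))
    return (
        "Answer the question using only the sources below and cite the URL used.\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

if __name__ == "__main__":
    print(build_prompt("What was UK GDP in 2023?"))
```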
The report also looked at whether there are tools in development that may be able to select statistical data more precisely. One tool, StatGPT, released by the IMF in partnership with EPAM Systems, reports only on high-quality sources by using SDMX-compliant data queries to source its RAG data. SDMX (Statistical Data and Metadata eXchange) is an ISO standard for describing statistical data and metadata and for standardising queries across data providers.
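For readers unfamiliar with what an “SDMX-compliant data query” looks like, the sketch below issues a data request in the SDMX 2.1 REST style: a dataflow reference plus a dimension key and a time filter. The endpoint, dataflow identifier and dimension codes are placeholders; real values depend on the provider publishing the API.

```python
# Illustrative SDMX 2.1 REST-style data request. The base URL, dataflow id and
# key are placeholders, not a real provider's endpoint or codes.
import requests

BASE = "https://sdmx.example-provider.org/rest"  # placeholder endpoint
FLOW = "AGENCY,GDP_DATAFLOW,1.0"                 # placeholder dataflow reference
KEY = "UK.CP_MEUR.A"                             # placeholder dimension key

resp = requests.get(
    f"{BASE}/data/{FLOW}/{KEY}",
    params={"startPeriod": "2019", "endPeriod": "2023"},
    # Many providers also honour an Accept header to choose SDMX-ML, SDMX-JSON or CSV.
    headers={"Accept": "application/vnd.sdmx.data+json"},
    timeout=30,
)
resp.raise_for_status()
print(resp.text[:500])  # structured, versioned series rather than scraped web pages
```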
Furness likened this to data producers having “nice little hooks sticking up in their data which the tools can then find and plug into”, but said this is not in place in the UK at the moment. While ONS, through its partner NOMIS, and the UK Data Service both provide an SDMX API, StatGPT is not currently connected to either. The CGG report recommends that the tool – initially for testing, and if shown to be useful, for customer-facing deployment – should be connected to these UK sources.