An in-depth analysis of the architecture and data used in foundational large language models (LLMs) found that these models carry significant inherent risks, including the use of polluted training data, a lack of information about the data on which a model was trained, and the opacity of the model’s architecture itself.
The analysis is the work of the Berryville Institute of Machine Learning (BIML), a group of security and machine learning experts, which examined how the owners of the major LLMs have built, trained, and deployed their models in an effort to identify the limitations and risks associated with them. Security researchers, privacy advocates, and lawmakers have expressed concerns about many aspects of the development and deployment of AI systems, often specifically about the foundational models on which user-facing AI systems are built. Those models are typically built and trained privately, and the users and developers who interact with them have little or no insight into the quality and content of the data used to train them.
For many researchers, that lack of visibility is the crux of the issue. The black box nature of LLMs built by companies such as OpenAI, Google, Meta, and others prevents observers from understanding exactly what kind of data the models are trained on and what vulnerabilities or weaknesses the models might have. That lack of visibility is itself a risk.
“The black box foundational model is the elephant in the room. When the companies that own these models make decisions about how to manage the risks, they don’t tell you how they did that. They just say, ‘Here's the box’,” said Gary McGraw, CEO of BIML, who has been studying AI and machine learning security for many years.
The BIML analysis, which was released Wednesday, identifies 81 individual risks associated with LLMs and specifically calls out 10 of them as the most concerning. At the top of the list is the recursive pollution of the data sets companies use to train their LLMs. That problem arises when bad output from an LLM is fed back into a training data pile that the same LLM or others then ingest, leading to further outputs based on bad data.
“It’s a feedback loop like a guitar and an amp. If we don’t get a handle on this soon we’re going to have so much pollution that we can’t ever get a handle on it,” McGraw said.
The BIML analysis paper puts it even more bluntly: “LLMs can sometimes be spectacularly wrong, and confidently so. Recursive pollution is a serious threat to LLM integrity. ML systems should not eat their own output just as mammals should not consume brains of their own species.”
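To make that feedback loop concrete, here is a toy sketch, not drawn from the BIML paper: the function name, the parameters, and the simple error-rate model are all illustrative assumptions. It simulates what happens when each generation of a hypothetical model is trained on a corpus that mixes the original human-written data with the previous generation’s partly erroneous output.

```python
# Toy simulation of recursive pollution: each "generation" of a hypothetical
# model is trained on the clean human corpus plus all earlier synthetic output,
# a fraction of which is wrong. The next generation's error rate is assumed to
# grow with the fraction of bad data already in its corpus.

def simulate_recursive_pollution(
    generations: int = 6,
    human_tokens: float = 1.0,      # size of the clean human corpus (normalized)
    synthetic_ratio: float = 0.5,   # synthetic tokens added per generation, relative to the human corpus
    base_error_rate: float = 0.05,  # fraction of a generation's output that is wrong
) -> list[float]:
    """Return the fraction of polluted data in the training corpus at each generation."""
    polluted = 0.0   # running total of bad synthetic tokens in the corpus
    synthetic = 0.0  # running total of all synthetic tokens in the corpus
    error_rate = base_error_rate
    history = []
    for _ in range(generations):
        corpus_size = human_tokens + synthetic
        history.append(polluted / corpus_size)
        # Each generation emits new output; its error rate reflects how much
        # bad data it was trained on -- the feedback loop described above.
        new_output = synthetic_ratio * human_tokens
        synthetic += new_output
        polluted += new_output * error_rate
        error_rate = base_error_rate + polluted / (human_tokens + synthetic)
    return history


if __name__ == "__main__":
    for gen, frac in enumerate(simulate_recursive_pollution()):
        print(f"generation {gen}: {frac:.1%} of training corpus is polluted")
```

Even with a modest base error rate, the polluted fraction of the corpus only grows from one generation to the next, which is the compounding effect McGraw’s guitar-and-amp analogy describes.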
In effect, each of the prominent LLMs is essentially its own species of one, raised in isolation and fed a random diet from a private walled farm. Each of those farms was originally part of a nearly unfathomably vast, wide open landscape, and contains its own unique mix of food, some of which may exist in the other private farms, but much of which is likely proprietary. Other species may exist outside of this ecosystem right now, but they will likely starve to death soon, shut off from the vital supply of data they need to survive.
Massive private data sets are now the norm and the companies that own them and use them to train their own LLMs are not much in the mood for sharing anymore. This creates a new type of inequality in which those who own the data sets control how and why they’re used, and by whom.
“The people who built the original LLM used the whole ocean of data, but then they started dividing it up, which leads to data feudalism. Which means you can’t build your own model because you don’t have access to the data,” McGraw said.
One way out of this situation is regulation and that’s already on the horizon in many places, including the United States, the European Union, and elsewhere. The EU AI Act, which goes into effect in 2026, will address some of the issues with foundational models. And a recent executive order from President Joe Biden regarding the development and use of AI systems puts much of the focus on identifying and mitigating security risks. The EO discusses standards for AI system development but doesn’t put much emphasis on the issues surrounding LLMs and data set privatization.
Whatever form it takes, AI regulation is an inevitability, and McGraw thinks it’s probably the best way forward, especially if it addresses the problems surrounding data pollution and opacity.
“The real answer has got to be regulation. The government needs to regulate LLM vendors. We should be saying, hey all the input you scraped off the Internet is full of crap, and we need you to give us a list of exactly what crap it was that you ate. Right now they’re not even doing that. Some control over which part of the data ocean got eaten would be extremely useful,” McGraw said.
“For whatever reason a lot of people in the government think we can red team or pen test our way out of this. It’s the dumbest thing I've ever heard in my life. The multinational corporations are so big that we can’t fix it any other way than regulation. Prompt injection is not the answer.”