The Curse of the AI Ouroboros

AI is slowly becoming inbred, and we need a way to deal with it.

The ancient Greeks used the term “ouroboros” to symbolize the three fundamental phases in the cycle of life: life, death, and renewal. The cycle was visually represented as a serpent swallowing its own tail.1

As the use of generative artificial intelligence (AI) and large language models (LLMs) spreads, the Internet is experiencing its own form of tail-eating. But instead of renewal, a significant share of generative AI use cases are incubating a form of data inbreeding called “data poisoning” (also known as “model collapse”) where the richness of the source data is irretrievably lost over time.2 As derived data – the growing “tail” of the LLM ouroboros – is introduced back into the digital realm, it gets consumed by the next generation of AI programs, thus contaminating the data pool that made AI work in the first place. This tail-eating is difficult to control and, without intervention, will accumulate if not accelerate.

While the nearest-term impact will be on LLM-like models, this contamination will also corrupt other Internet-scale information retrieval applications. Internet search tools, for example, require ever more user skill to avoid surfacing AI-generated materials (recently highlighted by the “Internet Slop” meme).3

Over time, foundational LLMs will become ubiquitous – the digital equivalent of the air we breathe, water we drink, and soil we sow. Note that both the digital and analog are closed systems; they must be managed if they are to remain viable for human use – digital ecosystems if you will. Generative AI is largely the domain of hyperscalers such as Google, Meta, OpenAI, and Perplexity who have yet to provide any real transparency about how their foundational models work, how they are built, or when we need to adjust our behaviors accordingly (the AI equivalent of an Ozone Action Day).4 That assumes that the hyperscalers know how their models work, an arguable point insofar as a self-modifying system inherently resists documentable stasis.

In addition to characterizing the problem at hand, this essay shows how we might leverage lessons learned from existing constructs in the Environmental Protection Agency (EPA), along with the crowdsourcing power of open source, to begin handling AI data quality.

Are We Slowly Going MAD?

While humans and other data creators (like sensors) will continue to generate data in all forms – call this “source” or “original” data – the sustained mixing of AI-generated data with source data increasingly contaminates or “dilutes” the source data to a point where it can dull, if not misdirect, the training of subsequent AI systems. Why? Because this process of contamination and dilution fundamentally changes the statistical characteristics of the data repositories such that they diverge from reality – i.e., from the original source data. Researchers from Rice and Stanford refer to this recursive data contamination as Model Autophagy Disorder (MAD), analogous to mad cow disease.5 (Reminder: do not eat the neural tissue of your own species.)

Perhaps the simplest way to illustrate this doom loop is by averaging a set of numbers.6 The act of averaging the set on its own has no impact on that set. All bets are off, however, if the average is taken repeatedly and that value is repeatedly inserted back into the data pool, as this will gradually underweight outlying values while simultaneously reinforcing the average. This is a non-issue if all you are interested in is a one-time average, but all other forms of summarization or sample-taking will suffer.
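The averaging loop described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the pool of 1,000 Gaussian values, the seed, and the helper names `feed_average_back` and `outlier_share` are all hypothetical and are not taken from the cited source.

```python
import random
import statistics


def feed_average_back(values, rounds):
    """Append the pool's current mean to the pool, `rounds` times over."""
    pool = list(values)
    for _ in range(rounds):
        pool.append(statistics.fmean(pool))
    return pool


def outlier_share(pool, center, cutoff):
    """Fraction of the pool lying more than `cutoff` away from `center`."""
    return sum(abs(v - center) > cutoff for v in pool) / len(pool)


# Hypothetical "source" data: 1,000 draws from a bell curve (an assumption).
rng = random.Random(0)
source = [rng.gauss(100.0, 15.0) for _ in range(1_000)]

# Derived data: the average is computed and re-inserted 1,000 times.
polluted = feed_average_back(source, rounds=1_000)

center = statistics.fmean(source)
cutoff = 2 * statistics.pstdev(source)  # "outlying" = beyond 2 source std devs

print("mean    :", center, "->", statistics.fmean(polluted))
print("spread  :", statistics.pstdev(source), "->", statistics.pstdev(polluted))
print("outliers:", outlier_share(source, center, cutoff),
      "->", outlier_share(polluted, center, cutoff))
```

In this sketch the mean is left untouched, which is why a one-time average is unaffected. Every other summary drifts: once the pool has doubled in size, its standard deviation has shrunk by a factor of about 0.71 (one over the square root of two) and the share of outlying values has been cut in half, even though no source value was ever deleted.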

In many respects, AI-driven model poisoning mirrors another autophagous loop – the ongoing, self-reinforcing contamination of the Earth’s air, soil, water, etc. And just as with the Earth, the time-to-degrade is much less than the time-to-recover, with some outcomes being irreversible (extinction of a single species) and/or disproportionate (extinction of a keystone species). In other words, model poisoning and environmental degradation are both hard-to-redirect trends.

Admittedly, AI data contamination does and will vary across sectors and applications. Some private data repositories, such as specialty medical data, are likely to be much more carefully curated than social network data — yet AI summarization as well as generation of medical records may be one of the first areas where we all start experiencing the impact of these forms of data dilution and contamination.7 Some doctors are already worried about dangerous LLM-based recommendations driven in part by medically oriented Internet “slop”.8 While the timing and relative impact can be debated, the fundamental problem is not going away any more than manmade plastics are magically going to disappear from our air, water, and soil. So, the fundamental question about our AI-augmented world remains: How does one operate effectively in an environment gradually being polluted by its own digital exhaust?

Fool’s Gold?

We reached out to both an industry expert and an academic expert to see if they share our concerns about model poisoning “dumbing down” LLMs. To our surprise, both believe our concerns are largely unwarranted. Why? Because the LLM ecosystem is largely controlled by a handful of hyperscalers who have already built infrastructure for the acquisition, filtering, and curation of source data into “golden repositories.” One expert compared the problem to spam filtering, which he characterized as “a persistent but manageable issue.” This seems a stretch, however, given that half the world’s email is now spam.9

Characterizing AI contamination as largely “a solved problem,” as one of our sources did, also seems overly sanguine. While misinformation has been with us since the dawn of civilization, the volume, variety, and velocity of misinformation have dramatically increased since the release of ChatGPT in November 2022. Rapid, iterative, adversarial advances have only compounded the challenges. And as we have written elsewhere, while countermeasures like digital watermarks can help, they are not leakproof.10

It took over two decades for spam filtering to become a multi-billion-dollar industry.11 AI-based fake news detection is expected to equal that by 2030.12 Note, however, that both spam filtering and fake news detection are against an adversary, not a side effect of posting information online. AI contamination becomes even more challenging when LLMs become commoditized and the maintenance of “golden data” exceeds the wherewithal of its operators. We believe the production of AI-generated data could eventually exceed that of new data. It is already accepted that pure source data for training of foundational models has become a scarce resource.13 This will only accentuate the contamination of AI-derived data. Are we prepared to pay the price? Do we even have a proper understanding of the long-term costs?

Other data scientists have identified additional methods for addressing model collapse, but all of them come with human and compute costs that will be challenging to scale effectively. Spam filtering is not a long-term exemplar for dealing with the curse of the AI Ouroboros.

Can AI Save Itself?

We have all heard the term “fighting fire with fire,” controlling wildfires by setting fires ahead of the burn front. A similar strategy could allow AI to supervise AI model poisoning – what we like to call AI-on-AI. If AI’s “explainability” problem is as difficult as it appears, then asking AI to justify its decisions may be a fool’s errand. It may be that only another AI can keep up with the speed and unexplainable complexity of AI systems. This is a critical research question.14

This idea of AI-on-AI is perhaps the best initial focus for meeting the ouroboros data challenge. The decline of AI model legitimacy stemming from the consumption of AI digital exhaust will not necessarily be overtly visible. Regardless, such “self-modification” will increasingly create downstream challenges. This requires a fail-safe that perhaps only another AI could provide. But who will fund this kind of research? How long will it take before it offers up useful counterforce? How can we begin?

An Environmental Approach to Data Assessment, Oversight, and Management

Could there be an AI data equivalent to capping global warming at 1.5° C? This cap is an outcome target. Although it is hard to measure, our current capacity to model the globe’s dynamics tells us that parts per million (PPM) figures for atmospheric pollutants such as CO2 (carbon dioxide) and O3 (ozone) are both measurable and causal predictors of global warming.15

Segueing to data pollution, we might pick thresholds that minimize undesirable AI outcomes, something equivalent to the 1.5° C cap. We could define AI’s causal pollutants to measure their “pollutant density” – a DPPM (Data Parts Per Million) analog to physical PPM. We could establish standards for detecting when DPPM exceeds acceptable thresholds. Another analogy is the concentrations of undesirable ingredients (bug parts) in foodstuffs. The idea is the same – keep unwanted contaminants below some margin of safety. So, we need to consider what a margin of safety might be.
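One way to make the DPPM idea concrete is to treat it exactly like physical PPM: contaminated items per million items in a repository. The sketch below assumes that some upstream detector has already flagged which items are AI-derived, which is the genuinely hard part and is not addressed here. The function name `dppm` and the sample counts are hypothetical.

```python
def dppm(contaminated_items, total_items):
    """Data Parts Per Million: flagged items per million items in a repository.

    Assumes the caller already knows how many items are contaminated;
    how that flagging is done is outside the scope of this sketch.
    """
    if total_items <= 0:
        raise ValueError("total_items must be positive")
    if not 0 <= contaminated_items <= total_items:
        raise ValueError("contaminated_items must be between 0 and total_items")
    return 1_000_000 * contaminated_items / total_items


# Hypothetical repository: 1,250 flagged records out of 5,000,000.
print(dppm(1_250, 5_000_000))  # 250.0
```

Counting items is only one possible unit. A working standard might instead weight by tokens, by bytes, or by the influence a record has on training, and that choice would itself need to be standardized.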

Effective detection of data pollution would require not only standardization of tools and methods, but also DPPM thresholds for specific use cases. Effective response would then require further calibration based on context (when, where, frequency, due process, etc.). While this is difficult, it is doable, as evidenced by decades of environmental policies and oversight. Environmental regulations define escalating levels for relevant pollutants along with the appropriate actions, e.g., ozone action days. The time has come to develop functionally equivalent constructs for AI-driven data pollution.
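Escalating levels of this kind could be encoded much as air quality bands are: an ordered list of thresholds per use case, each tied to a named response. The sketch below is illustrative only. The use cases, threshold values, and level names in `ACTION_LEVELS` are invented placeholders, not proposed standards, and `action_level` is a hypothetical helper.

```python
from bisect import bisect_right

# Hypothetical DPPM floors and responses per use case (placeholders only).
# Each list is sorted by floor; a level applies from its floor upward.
ACTION_LEVELS = {
    "medical_records": [
        (0, "normal"),
        (50, "advisory"),
        (500, "action day"),
        (5_000, "quarantine"),
    ],
    "social_media": [
        (0, "normal"),
        (1_000, "advisory"),
        (10_000, "action day"),
        (100_000, "quarantine"),
    ],
}


def action_level(use_case, dppm_value):
    """Return the named response level for a DPPM reading in a given use case."""
    if dppm_value < 0:
        raise ValueError("dppm_value must be non-negative")
    levels = ACTION_LEVELS[use_case]
    floors = [floor for floor, _ in levels]
    # bisect_right finds how many floors are <= the reading.
    return levels[bisect_right(floors, dppm_value) - 1][1]


# The same reading calls for different responses in different contexts.
print(action_level("medical_records", 250))  # advisory
print(action_level("social_media", 250))     # normal
```

Keeping the thresholds in a plain table, separate from the lookup logic, mirrors how environmental rules are revised: the numbers can be recalibrated per context without touching the mechanism that applies them.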

Inherent Challenges in Environmental Modeling

Environmental oversight comes with many challenges. Identifying the right policies and metrics can be elusive. Complex interactions between ecosystem artifacts can lead to unexpected outcomes when oversight is poor. Balancing environmental impact against socio-economic impact frequently requires compromise. Certain features require extra attention because of the oversized role they play in an ecosystem, such as keystone species. Attempts at modeling complex systems that predate the rise of AI have demonstrated these challenges in finance, weather forecasting, stabilizing electric grids, etc.16

Decades of environmental regulation have shown that when regulating sources of pollution, regulated entities may employ a variety of techniques to evade regulation and/or game the system, e.g., shipping regulated biologic material to intermediate locales willing to restate the place of origin.17 Another difficulty is that polluted and unpolluted data sets are likely to be indistinguishable to a first approximation, just as with adulterated versus unadulterated foodstuffs.

Addressing environmental pollution also requires the ability to deal with past failures – such as concentrated locales of pollution bad enough to be designated a Superfund site.18 Under the Comprehensive Environmental Response, Compensation, and Liability Act (CERCLA), clean-up of such pollution is taken over by the federal government.19 It is easy to imagine a similar need arising when large swaths of society are critically dependent on data repositories that are deemed overly polluted.

Leveraging Open Source

Crafting a coherent AI Data Ecosystem that resists the ouroboros threat will require the testing of concepts and architectures. Nature does this over geologic time, which we do not have. One option is to leverage the power of massive parallelism.

A relatively simple and proven force multiplier is an open-source initiative where participants can apply the power of crowdsourcing and iterative experimentation. Such a repository would serve as a focal point for assembling and developing use cases, models, data sets, contamination metrics, actionable thresholds, code sets, test results, and more. Techniques that hold up under scrutiny could then be considered for submission to technology standards bodies and/or a regulatory agency along the lines of a digital EPA. The maintenance as well as the instantiation of such an initiative would also require sustained funding and staffing.20

There are three paths in this situation. One is to conclude that we have raised a non-issue, and no one will get fired for ignoring it. A second is that we prepare ways to mitigate the danger should it later become a wildfire. The third is to declare an emergency research goal of understanding Shannon’s information entropy in a world of AI feeding on itself.21 Three directions, each with its own time constant. This is possibly the preeminent techno-policy challenge before us; we need a coalition of the willing and able while the questions are still relevant.

  1. “Ouroboros,” Encyclopedia Britannica, 2025. ↩︎
  2. Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson & Yarin Gal, “AI models collapse when trained on recursively generated data,” Nature, July 24, 2024. ↩︎
  3. Max Read, “Drowning in Slop,” New Yorker, September 26, 2024; Alexis McDonell, “Google search ends in frustration after AI images dominate results: ‘The internet is dead,’” The Cool Down, January 12, 2025. ↩︎
  4. Anonymous, “What is Ozone Pollution and what is an Ozone Action Day?,” Public Citizen, May 17, 2017. ↩︎
  5. Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, and Richard G. Baraniuk, “Self-Consuming Generative Models Go MAD,” arXiv, July 4, 2023. ↩︎
  6. Bill Schmarzo, “Avoid GenAI Model Collapse… and Death by Averages,” Data Science Central, July 22, 2024. ↩︎
  7. Kapil Parakh, “Garbage In, Garbage Out,” LinkedIn, 2024. ↩︎
  8. Thomas Maxwell, “Doctors Say AI Is Introducing Slop Into Patient Care,” Gizmodo, December 28, 2024. ↩︎
  9. Joydeep Bhattacharya, “The Current State of Spam Email: Key Trends and Data,” SEO Sandwich, 2024. ↩︎
  10. Bob Gleichauf and Dan Geer, “Digital Watermarks Are Not Ready for Large Language Models,” Lawfare, February 29, 2024. ↩︎
  11. Swasti Dharmadhikari, “Anti-Spam Software Market Report 2025 (Global Edition),” Cognitive Market Research, November 2024. ↩︎
  12. MarketsandMarkets, “Fake Image Detection Market Worth $3.9 Billion by 2029,” PR Newswire, April 2, 2024. ↩︎
  13. Matt O’Brien, “AI ‘gold rush’ for chatbot training data could run out of human-written text,” AP News, June 6, 2024. ↩︎
  14. Bob Gleichauf, “A Chatbot? Are you Sirious?,” Medium, September 19, 2016. ↩︎
  15. Intergovernmental Panel on Climate Change (IPCC), Global Warming of 1.5°C: IPCC Special Report on Impacts of Global Warming of 1.5°C above Pre-Industrial Levels in Context of Strengthening Response to Climate Change, Sustainable Development, and Efforts to Eradicate Poverty (Cambridge, U.K.: Cambridge University Press, 2022). ↩︎
  16. James Ladyman, James Lambert, and Karoline Wiesner, “What is a Complex System?” European Journal for Philosophy of Science 3 (2013): 33–67. ↩︎
  17. U.S. Environmental Protection Agency. “Criminal Investigations: Violation Types and Examples.” ↩︎
  18. U.S. Environmental Protection Agency, “What is a Superfund?” ↩︎
  19. U.S. Environmental Protection Agency, “Summary of the Comprehensive Environmental Response, Compensation, and Liability Act.” ↩︎
  20. For example: Digital AI EPA, “AI-Orouboros,” GitHub. ↩︎
  21. Robert M. Gray, Entropy and Information Theory (New York: Springer-Verlag, 2023). ↩︎