Recent advances in generative AI for images, text, and audio seem to promise endless potential. It has gotten to the point where researchers are exploring the use of AI-generated synthetic data to train next-generation models. However, some experts worry that doing so could trap these AI systems in harmful feedback loops.
Given the proliferation of AI-generated content across the internet today, there is a real chance that some of this synthetic content will be scraped by an AI developer and become part of another model's training corpus. This self-consuming feedback loop, dubbed Model Autophagy Disorder (MAD), risks amplifying flaws over successive generations.
In simpler terms, MAD is akin to an AI system continuously learning from its own generated data, which can be problematic. This is because the AI might start echoing its earlier mistakes or quirks, leading to a cycle where each generation of the AI could inherit and amplify these flaws. Outcomes of this process could range from subtle performance degradation to complete model failure, where the AI's outputs become less reliable or coherent over time.
Early simulation studies provide provisional evidence for how these autophagous processes manifest. Researchers have documented AI systems drifting from sensible outputs towards nonsensical gibberish after being recursively trained on their own synthetic data. However, real-world impacts likely depend on factors like model scale, monitoring practices, and whether training data maintains sufficient human grounding.
In this article, we will discuss the proliferation of synthetic data today and how it can lead to the autophagous loop phenomenon. We will also cover why MAD is unlikely to seriously impact next-generation models, thanks to precautionary measures already taken by AI developers.
Recent advances in generative AI models have sparked a corresponding explosion of synthetic content being created and deployed across the internet. Systems like DALL-E for imagery, GPT-3 for text, and WaveNet for audio can now conjure vast troves of realistic media for fractions of a penny per piece.
While human-created data requires expensive skills, time, and labor, synthetic alternatives promise on-demand, customizable outputs at near-infinite scale. It's no wonder that leading technology providers including Microsoft, Nvidia, and Stability AI are actively scaling up synthetic data production.
The diverse applications are particularly notable in sectors like the automotive industry, where synthetic data is used to train self-driving car algorithms, and healthcare, where it’s utilized for medical imaging AI. This trend highlights a growing recognition of the value that synthetic data brings to the table in terms of versatility and scalability.
The above benefits have contributed to steadily increasing adoption of synthetic data across industries. However, with the ratio of AI-generated to human-made content steadily tipping in AI's favor, how will this impact the training of future models? This is especially important to consider given how likely it is for AI developers to scrape content that is itself AI-generated.
The concept of an "Autophagous Loop" in AI model training refers to a self-referential cycle where generative models are trained on data that has been produced by previous generations of generative models.
This process can lead to the progressive degradation in the quality and diversity of the model outputs. The term 'autophagous' literally means 'self-consuming', which aptly describes the loop's nature. This loop is a significant concern in the field of AI because it could potentially lead to the amplification of biases and artifacts inherent in generative models, affecting their performance and the authenticity of their outputs.
There are three kinds of loops we can observe in model training, distinguished by how much real data enters each generation: the fully synthetic loop, the synthetic augmentation loop, and the fresh data loop.
In the fully synthetic loop, each generation of the model trains exclusively on data produced by its predecessors. Over time, biases and artifacts from the initial models get amplified, leading to a decrease in the precision and diversity of the output. This degradation happens because the model is iteratively learning from an increasingly distorted version of reality.
In the synthetic augmentation loop, a fixed real data set acts as an anchor, but the growing proportion of synthetic data can still lead to a gradual drift away from real-world accuracy. The key difference here is the rate at which quality and diversity decrease: slower than in the fully synthetic loop. The fresh data loop, by contrast, injects new real-world data at every generation, which largely keeps degradation in check.
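To make these dynamics concrete, here is a minimal, hypothetical sketch in Python that uses a one-dimensional Gaussian as a stand-in for a generative model. Each generation fits the "model" to data, then samples from it to train the next generation. The `run_loop` function, parameter names, and sample sizes are illustrative assumptions for this toy demo, not taken from any published experiment.

```python
import numpy as np

rng = np.random.default_rng(42)

REAL_MU, REAL_SIGMA = 0.0, 1.0  # the "real" data distribution

def run_loop(generations=200, n_synthetic=10, real_data=None):
    """Fit a 1-D Gaussian 'model', sample from it, and refit, repeatedly.

    With real_data=None this is the fully synthetic loop; passing a fixed
    array mixes that real data back in every generation, approximating
    the synthetic augmentation loop.
    """
    mu, sigma = REAL_MU, REAL_SIGMA
    for _ in range(generations):
        # Generate synthetic data from the current model...
        synthetic = rng.normal(mu, sigma, n_synthetic)
        # ...optionally anchored by a fixed real data set...
        data = synthetic if real_data is None else np.concatenate([real_data, synthetic])
        # ...then refit the model on whatever the next generation sees.
        mu, sigma = data.mean(), data.std()
    return sigma

real_anchor = rng.normal(REAL_MU, REAL_SIGMA, 200)
print(f"fully synthetic loop, final sigma: {run_loop():.3g}")
print(f"with real anchor, final sigma: {run_loop(real_data=real_anchor):.3g}")
```

With no real anchor, the fitted standard deviation collapses toward zero over generations, a toy analogue of the diversity loss described above; mixing the fixed real data set back in each generation keeps the estimate anchored near the true value.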
In the real world, the fresh data loop or the synthetic augmentation loop is the likeliest way synthetic data enters the learning process, depending on the use case. However, given the increasing amount of AI-generated data on the internet, there is a growing likelihood that scraped training data will skew towards AI-generated content. With that in mind, how can degradation of model performance be avoided?
Is it all doom and gloom? The internet would have you think so: there is no shortage of articles on how AI will consume its own generated content and fail. However, that's missing the forest for the trees.
We don't think this will be as big an issue as some of those articles make it out to be. It's a documented problem, true, but its scale seems exaggerated. Multiple strategies already used when training generative AI models help circumvent autophagy: curating and filtering scraped data to weed out suspected AI-generated content, anchoring training sets with verified human data, and monitoring model outputs for degradation across generations.
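As one illustration, the data-hygiene idea can be sketched as a simple corpus-assembly step. Everything here is hypothetical: `detector` stands in for a real AI-content classifier or provenance check, and the names, thresholds, and toy data are invented for illustration only.

```python
import random

def assemble_training_set(human_data, scraped_data, detector,
                          threshold=0.8, min_human_fraction=0.5, seed=0):
    """Drop scraped samples the (hypothetical) detector scores as likely
    AI-generated, then cap the scraped share so verified human data makes
    up at least min_human_fraction of the final training set."""
    kept = [s for s in scraped_data if detector(s) < threshold]
    # Largest scraped count S satisfying H / (H + S) >= min_human_fraction
    h = len(human_data)
    max_scraped = int(h * (1 - min_human_fraction) / min_human_fraction)
    random.Random(seed).shuffle(kept)
    return human_data + kept[:max_scraped]

# Toy usage with made-up data and a stand-in detector.
human = [f"human-{i}" for i in range(8)]
scraped = [f"scraped-{i}" for i in range(20)]
toy_detector = lambda s: 0.95 if s.endswith(("0", "5")) else 0.2
train = assemble_training_set(human, scraped, toy_detector)
print(len(train), sum(1 for s in train if s.startswith("human")))
```

The design point is that the cap works even when the detector is imperfect: no matter how much scraped content passes the filter, human data never drops below the chosen floor of the training set.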
With vigilance, proactive data hygiene practices, and responsible monitoring procedures, it's quite likely that any risks from autophagy can be contained. Much of the media coverage, however, has simply fueled alarmist rhetoric.
We at SUPA think that progress, especially in a field as new as generative AI, will always come with challenges. However, history has shown time and time again that we can overcome them through innovation. Hence, we're not saying autophagy is a problem to be ignored, but rather something to approach with cautious optimism.
Here at SUPA, we’ve spent over 7 years helping AI companies build better models. Find out more about our use cases here.