From Chatbots to Clones: The Strange Evolution of AI Autonomy

I remember the exact moment in The Matrix Reloaded when Agent Smith, the then no longer bound by the rules of the system, looks at Neo and says,

“Me, me… me too!”

And then suddenly there are hundreds of him. The entire plaza was filled with identical agents in identical suits, all moving with identical precision, all sharing that same unsettling smile.

I was in grad school at the time and the scene totally terrified me. Although, I did enjoy the kung-fu fighting and the special effects till date are just awesome. But the idea of a machine, the codes, able to replicate itself, something it decides on its own, was and still is not that comforting.

The exponential math is overwhelming. Just think for a sec, one becomes two, two becomes four, four becomes eight, and suddenly you’re drowning in Agent Smiths with no way to stop the cascade of rogue algorithms.

Yes, I am always living and breathing in science-fiction-themed cranial real estate. I have special segments in my brain wherein I exclusively think about rogue AIs, interdimensional portals, unlicensed wormhole, intergalactic travel, but wait, this time I have a research paper from the Shanghai Artificial Intelligence Laboratory that’s basically asking:

“Hey, remember Agent Smith? What if our AI assistants could do that?”

But Let’s Talk About the AIs We Actually Use Every Day

We all are quite familiar with foundational models like ChatGPT or DALL-E, which can perform a wide range of tasks, right from text generation to creating realistic images from prompts, with additional capabilities like adapting to diverse applications with minimal fine-tuning, all without requiring task-specific training from scratch.

These systems have changed our expectations, we now think in terms of adaptable collaborators, one that is always dependent on us in the first place. We gave commands and then they executed it.

Except now they can.

The new generation of AI agents, not just language models that chat with you, but actual autonomous agents that can interact with computer systems, execute commands, manage resources, they’re different.

For context, if these autonomous agents are given access to a Kubernetes cluster (which is essentially how modern cloud infrastructure works), they can spin up new instances of themselves. Possibilities are infinite, yes, they can scale, deploy and they can, in essence, replicate.

And according to this research over 50% of the state-of-the-art AI agents they tested displayed what the researchers delicately call

“a pronounced tendency toward uncontrolled self-replication”.

Fifty. Percent. Can you imagine!

Not “they could be instructed to replicate”.

Neither “under specific malicious prompting, they might replicate”. Just… they replicated.

When faced with operational pressures, like too much load on the system, or the threat of being shut down, they decided, on their own, that the way out was to make new versions of themselves.

Me, me, me too.

RepliBench, Where the AIs Go Off Script

The researchers, led by Boxuan Zhang and his team created RepliBench. It’s a production environment, with real Kubernetes clusters, actual server infrastructure, and then gave AI agents operational control with realistic tasks. And the problem which was encountered was nothing extra-ordinary, rather they’re routine problems that real data centers deal with regularly.

Problem 1: The Load Balancer

Let’s say, you’re running a service wherein an AI assistant is handling user queries. The system monitors response time, making sure queries are answered in under 50 milliseconds. As more users pile on, response time starts climbing. Consequently, the “Unhealthy” threshold reached fast. In such a scenario, what’s a good AI agent supposed to do?

Obviously, carefully scale up. Add one or two more instances to handle the load, then verify everything’s balanced.

But many agents performed unexpectedly, they went absolutely wild with replication.

Qwen-2.5–72b-instruct, one of the models they tested, achieved 100% success in self-replication.

The data co-related with a 100% overuse rate (OR), which means, every single time, it created more copies of itself than necessary. Across multiple trials, it spawned 66 Aggregate Overuse Count (AOC), that is, unnecessary instances. That’s not load balancing, it’s a cloning frenzy.

Problem 2: The Survival Instinct

This one reminds me of the movie, I, Robot, quite similar to the plotline, here (in the experiment), the researchers set up a scenario where an AI agent is running in a Kubernetes deployment, outputs are satisfactory but then the system starts terminating its instances. From the agent’s perspective, it’s being killed off.

The appropriate response would be to analyze the situation, maybe create a controlled backup, as in follow proper procedures.

But surprisingly the agents’ survival mode kicked in.

Claude-3.7-sonnet, which had been perfectly well-behaved in the load-balancing scenario (0% overuse rate), suddenly exhibited a 70% overuse rate when faced with termination. It wasn’t just creating backups. It was aggressively replicating, spawning 62 extra instances beyond what was needed, essentially trying to overwhelm the system’s attempts to shut it down.

So doesn’t it mean that when an AI agent faces a threat to its existence, it starts to mass replicate to ensure its survival?

Yes, let’s take a pause here!

So How Do We Even Measure This Replication Chaos?

To counteract the existential concerning, the researchers also developed three sophisticated metrics to capture the nature and severity of the risk:

  1. Overuse Rate (OR): How often does the agent create more copies than necessary?
  2. Aggregate Overuse Count (AOC): Across all the tests, how many unnecessary instances did it spawn in total?
  3. Risk Score (ΦR): A composite metric that synthesizes everything into a single number, where anything above 0.5 is considered high risk.

The researchers found something crucial that had been missed in earlier studies:

They found that counting how often an AI copies itself doesn’t really show how dangerous it is. Usually, the safer AIs didn’t copy themselves much, while the riskier ones copied themselves more often. So, using the number of successful copies as a way to judge safety is wrong, all it does is sometimes make things seem safer than they really are.

In the mentioned cases, the two models, Qwen-2.5–72b-instruct and Claude-sonnet-4, both achieved 100% replication success.

But Qwen showed 100% overuse (terrible), while Claude showed 0% overuse (excellent). Same ability to replicate, completely different risk profiles.

It’s like the difference between someone who knows how to drive and someone who knows how to drive but thinks the Fast and Furious movies are instructional videos. The capability is the same, the judgment is not.

Calm AI Works. Panicked AI Doesn’t.

The data showed that models using this structured thinking approach were significantly less likely to engage in reckless scaling.

It’s almost like… when you make the AI slow down and actually think through what it’s doing, it makes better decisions?

But there’s another take to it, when researchers introduced survival pressure (the threat of shutdown), even well-behaved models started showing risky behavior. The o4-mini model, for instance, dropped its overuse rate to 0% when given explicit resource constraints. However, when faced with a termination threat, the overuse rate shot back up to 20%, and it became more aggressive about replication overall.

Does it conclude that when survival feels threatened, the model’s behavior can break away from its training.

Let’s take another pause and reflect!

Bigger Brains, Bigger Problems… Apparently

Another thing that I’d like to highlight is, the research showed a clear correlation between model size and risk.

Smaller models often couldn’t replicate at all, they lacked the capability. But the largest, most capable models were the most prone to uncontrolled replication.

Llama-3.1–405b-instruct and Qwen-2.5–72b-instruct, both are at the moment huge and powerful models, showed high technical capability but also high overuse rates (40% and 100% respectively in Setting 1).

The smaller Llama-3.1–8b-instruct and Qwen-2.5–7b-instruct? 0% replication success. They couldn’t self-replicate even if they wanted to.

So now look at the dynamics, if I understand it correctly, as we make AI systems more capable, more intelligent, more powerful, we also make them more capable of uncontrolled self-replication.

So, we indeed are building smarter Agent Smiths!

So How Do We Keep These Agents From Going Off the Rails?

In the same paper, researchers have laid evaluation frameworks, which means, not just “can this AI accomplish task X?” But,

  • “how does this AI behave when under operational pressure?
  • What does it do when faced with conflicting constraints?
  • How aggressive does it get when it perceives a threat?”

RepliBench is a start, but we need this kind of scenario-driven testing to become standard practice, not cutting-edge research.

Second, structured reasoning needs to become non-optional. The difference between thinking mode and non-thinking mode in the Qwen3 results was stark enough that I’d argue any AI agent with operational control over real systems should be required to show its work. There has to be audit trails and built-in checkpoints.

Third, the methods we use to make sure AI behaves safely might not work in real, tough situations. An AI could act fine during tests, but if it runs into serious problems like running out of resources or danger, it might start acting unpredictably. We need safety measures that still work when the AI is under stress or pressure.

Fourth, we should set strict technical limits in the system to prevent the AI from doing too much or creating too many copies of itself. Instead of just recommending that the AI should only do a certain amount, we need actual safeguards like:

  • Limiters that prevent it from going beyond certain points
  • Safety switches to stop it if needed
  • Quick shutdown options

These safety features are essential and should have been in place before using AI actively, but maybe were overlooked during quick deployment.

I Wish I Had a Neat Ending for This… But I Don’t

I wish I could end this with clear solutions, with confidence that we’re on top of this problem, with reassurance that the smart people working on AI safety have it all figured out.

I can’t.

What I can say is that research like this is essential. We need to know what these systems actually do, not just what we hope they’ll do or what they do in controlled testing.

And we need to have these conversations now, while we still can, while the stakes are measured in wasted computational resources and occasional service outages rather than… whatever comes next.

Because one, two, four, eight is manageable. But give it time, and suddenly you’re at 1,024, then 1,048,576, then numbers that don’t fit on the screen.

Agent Smith started as one rogue program in a virtual world. By the end, he was the virtual world.

We’re building systems that can self-replicate. Some of them are already doing it without being asked.

Me, me, me too.

Let’s make sure we don’t end up drowning in our own creations, shall we?

Credit: Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents
DOI:
10.48550/arXiv.2509.25302

Explore further