Maintaining AI-Built Systems Without Losing the Plot

Drift, debt, and the maintenance problem AI created.

Series

Engineering in the Age of AI

Part 3 of 3

  1. Part 1: From Vibe Coding to Agentic Engineering
  2. Part 2: How to Actually Build with AI Agents Without Creating Chaos
  3. Part 3: Maintaining AI-Built Systems Without Losing the Plot

We closed part two by saying that getting to the initial release of a project is only half the story.

What happens after that?

That is the part I find a lot of AI coding discussions still skip over.

Getting something working is easier than it used to be, and faster too. That matters. But software is not judged at the moment it first launches. It is judged by whether it stays maintainable, adaptable, and coherent over the long haul, when new developers join, priorities change, and the codebase grows beyond what any one person can comfortably hold in their head.

That is where the harder part starts.

Getting to v1 is easier. Keeping coherence is harder.

One of the biggest changes AI brings to software engineering is that implementation gets cheaper. A lot of the time, that is genuinely useful. You can scaffold faster, move through feature work more quickly, and get from idea to working software in less time than before.

But cheaper implementation changes the shape of the problem. If code can be produced more quickly, inconsistency can be produced more quickly too. More local decisions get made. More code lands. More variations of the same idea appear across the system. The risk is not just that the AI gets something wrong. It is that the system slowly becomes less coherent while everything still looks productive on the surface.

Once the initial build is done, the question is no longer “can we produce this feature?” It becomes “can we keep this system making sense as it evolves?”

Why AI-built systems drift over time

Drift is not usually one dramatic failure. It is what happens when a series of reasonable-looking local decisions slowly stop adding up to one coherent system.

A feature works. A task gets closed. A review passes. Nothing looks obviously reckless. But over time, the codebase starts solving similar problems in different ways, boundaries get a little softer, and the documents that were supposed to anchor the system stop matching the reality of the code.

One cause is stale documentation. In most projects, documentation is one of the first things to fall behind after launch. In a traditional workflow, that already has a cost. In an AI-assisted workflow, it becomes much riskier. Once the documentation goes stale, the agent is no longer working against the current standards and patterns of the repository. It is working against an older version of the truth. At that point, the memory layer stops protecting the system and starts quietly pulling it off course.
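As a concrete illustration, a staleness check could flag memory-layer files that have fallen behind the code. Everything in the sketch below is an assumption for illustration: the function name, the 30-day threshold, and the file names are not a prescribed tool, and the last-commit dates are assumed to be gathered separately (for example from `git log -1 --format=%cs -- <path>`).

```python
from datetime import date

def stale_docs(last_touched: dict[str, date], doc_paths: set[str],
               max_lag_days: int = 30) -> list[str]:
    """Flag memory-layer files that lag the newest code change by too long."""
    # Dates for everything that is not a doc, i.e. the code itself.
    code_dates = [d for p, d in last_touched.items() if p not in doc_paths]
    if not code_dates:
        return []
    newest_code = max(code_dates)
    # A doc is stale if the newest code change happened well after its last edit.
    return sorted(
        p for p in doc_paths
        if (newest_code - last_touched.get(p, date.min)).days > max_lag_days
    )

history = {
    "src/billing.py": date(2025, 6, 10),
    "CLAUDE.md": date(2025, 1, 5),
    "REVIEW.md": date(2025, 6, 1),
}
print(stale_docs(history, {"CLAUDE.md", "REVIEW.md"}))  # → ['CLAUDE.md']
```

Even a rough check like this turns "the docs feel out of date" into something a CI job can surface before an agent starts working from the wrong version of the truth.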

Another cause is local optimisation. AI is very good at finding a solution that works for the task in front of it. That does not mean it will naturally choose the solution that best preserves the wider system. Left unchecked, it will usually optimise for local success over global coherence. That is how business logic creeps into the wrong layer, duplication appears for convenience, and patterns that once felt deliberate start to blur.

Memory matters, but it is not the whole story

One of the things AI coding discussions still get slightly wrong is treating memory as if it is mainly a model capability problem.

It is partly that. Long sessions still degrade. Important details still get buried. The middle still gets fuzzy. But that is not the main issue.

The real problem is that software teams depend on forms of memory that were never sitting in one person’s head, or one conversation, to begin with. Architecture decisions. Constraints. Patterns. Anti-patterns. Known failure modes. Why certain boundaries exist. What “good” looks like in this codebase.

Those things only become useful at scale once they are made explicit.

A simple way to think about that is as three layers.

Short-term memory is the current session. The task, the nearby code, the active train of thought. It is enough to get useful work done, but it is fragile.

Medium-term memory is the working memory of active delivery. In practice, this is where files like CLAUDE.md, REVIEW.md, implementation plans, acceptance criteria, and feature notes sit. Its job is to keep active work coherent.

Long-term memory is the curated memory of how the system should be built over time. This is where ADRs, architecture principles, and hardened conventions live.

Medium-term memory helps you deliver the system today. Long-term memory helps you preserve its integrity as it evolves.

But memory is only one mechanism. It protects coherence. It is not the end goal in itself. The goal is to design a workflow where the right knowledge moves out of the chat, into the repository, and eventually into the stable operating memory of the team.
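That movement of knowledge into stable memory can be sketched as a simple promotion rule: a note graduates to the next layer once it has proven stable in practice. The layer names, the `confirmations` counter, and the threshold below are all illustrative assumptions, not a prescribed mechanism.

```python
from dataclasses import dataclass

@dataclass
class MemoryNote:
    text: str
    layer: str  # "short" (session), "medium" (delivery docs), "long" (ADRs)
    confirmations: int = 0  # times the note survived review unchanged

def promote(note: MemoryNote, threshold: int = 3) -> MemoryNote:
    """Move a note up one layer once it has proven stable in practice."""
    order = ["short", "medium", "long"]
    i = order.index(note.layer)
    if note.confirmations >= threshold and i < len(order) - 1:
        # Reset the counter: the note must earn its place at the next layer too.
        return MemoryNote(note.text, order[i + 1], confirmations=0)
    return note

rule = MemoryNote("Repositories never import from the API layer", "medium", 3)
print(promote(rule).layer)  # → long
```

The point of the sketch is the one-way flow: nothing lands in long-term memory by accident, only after it has repeatedly held up in active delivery.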

The repository has to carry more of the system

Once you accept that drift is a maintenance problem, the role of the repository starts to change.

It cannot just hold code. It has to hold the current understanding of how that code should be written.

This is where part two naturally extends. Implementation instructions, review instructions, plans, acceptance criteria, and reference documents do not stop mattering after v1. They become part of the operating system of the project. They need to stay live.

CLAUDE.md cannot be written once and forgotten. REVIEW.md cannot stay frozen while the codebase evolves underneath it. Plans cannot keep describing a milestone the team moved past months ago.

If the repo is the memory layer, then maintaining it becomes maintenance work in exactly the same way refactoring, testing, and reviewing are maintenance work.

Refactoring and shadow technical debt

Refactoring has always mattered. It matters even more in AI-assisted development.

An agent can be very good at delivering the local solution for the task in front of it. But that is not the same as preserving the conceptual integrity of the whole system. That is why refactoring becomes more important, not less.

And there is a kind of debt that starts to matter more here too. Not just visible technical debt, but shadow technical debt. The build-up of small inconsistencies, duplicated patterns, weakened boundaries, stale conventions, and half-abandoned approaches that do not always look severe in isolation but make the system harder to reason about over time.

AI can accelerate that kind of debt because it accelerates local implementation. More code lands more quickly. More near-identical problems get solved in slightly different ways. More convenience decisions slip through because each one looks acceptable on its own.

So part of maintaining an AI-built system is not just fixing broken code. It is bringing the system back into alignment on purpose.

Commit discipline and change trails

One of the easier things to overlook in AI-assisted development is commit discipline.

When code is cheap to produce, it becomes tempting to let changes sprawl. A single task becomes several. Exploration gets mixed with implementation. Temporary fixes, experiments, and final decisions all end up bundled together in one large change.

That makes maintenance harder.

Small, atomic commits are not just tidy engineering hygiene here. They make review more trustworthy, regressions easier to trace, and the route into the current state of the system easier to understand.

The same applies to the memory layer itself. If conventions change, review rules get tightened, or new patterns are promoted into shared guidance, those changes should leave some kind of trail. That does not need heavy process. It just needs to be visible. Otherwise you end up with a codebase that has version history, but a memory layer that depends too much on people having been around at the right moment.
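As one hedged example, a pre-commit or CI check could flag changes that look like sprawl before they land. The file-count limit and the top-level-directory heuristic below are assumptions for illustration, not a standard.

```python
def commit_warnings(files: list[str], max_files: int = 10) -> list[str]:
    """Rough heuristics for spotting a commit that is doing too many things."""
    warnings = []
    if len(files) > max_files:
        warnings.append(f"touches {len(files)} files (limit {max_files})")
    # Top-level directory as a crude proxy for a concern (src, docs, infra...).
    areas = {f.split("/", 1)[0] for f in files}
    if len(areas) > 2:
        warnings.append(f"spans {len(areas)} top-level areas: mixed concerns?")
    return warnings

print(commit_warnings(["src/a.py", "docs/x.md", "tests/t.py", "infra/k8s.yaml"]))
# → ['spans 4 top-level areas: mixed concerns?']
```

A warning like this is a prompt for a human to split the change, not a hard gate; the value is in making sprawl visible while it is still cheap to undo.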

From solo workflow to team workflow

This is also where the jump from solo development to team development becomes more obvious.

A lot of what makes a solo AI-first workflow effective is implicit. You know the rules you are following. You know which patterns to prefer. You know when something does not feel right. None of that is obvious to someone else joining the project.

To make this work at team scale, the workflow has to shift from something personal to something shared. In practice, that means thinking about memory at different levels:

  1. System memory: repo-wide guidance such as PRDs, architecture docs, CLAUDE.md, REVIEW.md, ADRs, and standards.
  2. Workstream memory: implementation plans, feature briefs, active constraints, QA scenarios, and rollout notes for a given area of delivery.
  3. Task memory: the immediate acceptance criteria, branch notes, unresolved questions, and the detail needed to make the next change well.

The point is not to make everyone hold the whole project in their head. It is to make sure each person and each agent can work with the right level of context while still staying aligned with the same underlying system.
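One way to picture "the right level of context" is as nested scope: work at a given level loads its own memory plus everything above it. The `context_for` helper and the file names beyond CLAUDE.md and REVIEW.md are illustrative assumptions, not a prescribed layout.

```python
# Illustrative mapping of memory levels to artefacts; names are assumptions.
MEMORY_LEVELS = {
    "system": ["CLAUDE.md", "REVIEW.md", "docs/adr/", "docs/architecture.md"],
    "workstream": ["docs/plans/billing-plan.md", "docs/qa/billing-scenarios.md"],
    "task": ["branch notes", "acceptance criteria for the current change"],
}

def context_for(level: str) -> list[str]:
    """Everything work at `level` should load: its own level plus those above."""
    order = ["system", "workstream", "task"]
    return [f for l in order[: order.index(level) + 1] for f in MEMORY_LEVELS[l]]

print(context_for("workstream"))
```

A task-level change still sees the system-wide rules, but nobody has to carry the entire project in their head to make one change well.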

If that memory structure is weak, every developer and every agent ends up working from a slightly different version of the system. That is usually where drift becomes visible.

What to watch for in practice

Most of the failure modes here are not dramatic. The system does not usually break all at once. It just becomes a little less trustworthy over time.

The first thing you tend to notice is stale documentation. The files that were supposed to guide the agent no longer describe the system as it actually exists.

Then there is repository churn without clarity. Lots of code is moving, but the shape of the system is getting harder to explain, not easier.

Review can become performative too. The loop still exists, but the intent starts to fade. Reviews focus on whether the code works, not whether it fits the system. The same comments appear repeatedly, but nothing changes upstream.

You also start to see bypass behaviour. If the process becomes too rigid, too slow, or obviously out of date, people work around it.

And there is the risk of false confidence. A change goes through the loop. It passes review. Everything looks clean. But passing the loop does not guarantee conceptual integrity. In the same way that a passing test suite does not prove the absence of bugs, a passing workflow does not prove the absence of drift.

The system only stays healthy when code, memory, and process stay aligned.

The simple loop still works, but it is not the whole maintenance story

In part two, the implementation-review loop was introduced as a practical way to structure AI-assisted development. That still holds. For a lot of work, it is enough. You define the task, provide the right context, implement, review, refine, and move on.

But after v1, the more important question is not whether the loop exists. It is whether the system around the loop is staying healthy. Whether the standards are still current. Whether the docs are trusted. Whether learning from one workstream is making its way back into repo-wide memory. Whether commits are still small enough to explain the system’s evolution clearly. Whether refactoring is actually happening, or whether the team is mostly just shipping forward motion.

The simple loop helps you make good local changes. Maintenance is what stops those local changes from slowly turning into a messy global system.

Closing

In this series, I have looked at how AI is changing software engineering.

Implementation is getting cheaper, and getting to a v1 is faster than it has ever been. That shift is real. But it is not the full picture.

The codebase is still code. Systems still drift. And over time, coherence and maintainability matter far more than how quickly something was first built. If anything, they matter more now.

Refactoring and documentation cannot be treated as secondary tasks in an AI-first workflow. They have to be part of the system itself. The same goes for commit discipline, repository knowledge, and the small decision trails that help a team understand how the system is evolving.

AI amplifies whatever system sits around it. If that system is weak, the output will be too. That is why memory has to be externalised and kept in sync with the code, not left sitting in chats or individual sessions.

This is not about becoming a prompt engineer. It is still software engineering. But less of the effort goes into typing each line of code, and more goes into shaping the system, maintaining its boundaries, and making sure it continues to make sense as it evolves.

Because getting to v1 is not the hard part anymore. Keeping the system coherent after that is.

I hope some of these ideas resonate and are useful in your own work. Realistically, parts of this will age quickly. The tools will improve, the models will change, and I will probably revisit some of this as things evolve.

But the underlying problem is unlikely to go away.

Systems still need to make sense.