This blog post surveys methods for alleviating the safety issues of AI-generated code.
A coding agent just shipped a pull request. It’s 40,000 lines. You’re the reviewer. Do you read every line?
Of course not. Nobody does.
And that’s exactly the problem.
AI-generated code bloat and obscurity are safety problems. Fortunately, researchers have been actively working on these problems over the past few years.
The Million-Line Wake-Up Call
I’ve been using command-line coding agents daily — Claude Code, Cursor, Copilot Workspace. They’re fast. They’re capable. And they produce a lot of code.
In one real case, the OpenClaw project accumulated over 1 million lines of AI-generated code. Later analysis showed the entire functionality could be implemented in roughly 4,000 lines, a 250x compression ratio.
Let that sink in. 99.6% of the codebase was unnecessary.
This isn’t an isolated incident. It’s a pattern. AI coding agents optimize for functional correctness — does the code pass the tests? Does it satisfy the prompt? — while paying little attention to conciseness, readability, or auditability.
The result is code that works, but that no human can meaningfully review.
Redundancy and Complexity: The Twin Threats
We usually think of AI safety in terms of harmful content — toxic outputs, dangerous instructions, deceptive behavior. But there are two subtler, more pervasive threats hiding in plain sight: redundancy and unnecessary complexity.
They’re related but distinct. Redundancy is about volume — too much code. Complexity is about obscurity — code that’s too hard to follow, even when every line serves a purpose. An AI agent might generate a 200-line solution using three levels of abstraction, metaprogramming, and dynamic dispatch — when a 30-line straightforward implementation would do the same thing. No line is “redundant,” but the whole thing is needlessly opaque.
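To make the distinction concrete, here is a deliberately contrived pair of implementations of the same trivial task, summing the even numbers in a list. This is an illustrative toy of my own, not from any agent transcript: no line of the "complex" version is redundant, yet the whole is needlessly opaque.

```python
class NumberSource:
    """Wraps a list behind a streaming interface (adds no capability)."""
    def __init__(self, data):
        self._data = data

    def stream(self):
        yield from self._data


class EvenFilterStrategy:
    """A 'strategy' object standing in for a one-line filter."""
    def apply(self, stream):
        return (x for x in stream if x % 2 == 0)


class Aggregator:
    """Dependency-injected aggregator around the built-in sum()."""
    def __init__(self, strategy):
        self._strategy = strategy

    def run(self, source):
        return sum(self._strategy.apply(source.stream()))


def sum_evens_complex(data):
    # Three abstraction layers for a one-line computation.
    return Aggregator(EvenFilterStrategy()).run(NumberSource(data))


def sum_evens_simple(data):
    # The direct, trivially auditable equivalent.
    return sum(x for x in data if x % 2 == 0)
```

Both functions return the same result; only the second one can be reviewed at a glance.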
Figure: The legibility problem — a superhuman AI produces correct but incomprehensible code that leaves the human reviewer overwhelmed. Source: Kirchner et al., 2024.
Here’s the argument:
- Human oversight is a finite resource. Reviewers have limited time and cognitive bandwidth.
- Redundant code dilutes that attention. When a PR is 40,000 lines instead of 400, the reviewer’s ability to spot bugs, vulnerabilities, or backdoors drops dramatically.
- Complex code exhausts that attention. Even in a short PR, convoluted abstractions and indirect control flow force the reviewer to build a mental model they may never complete.
- Therefore, both redundancy and unnecessary complexity are safety problems. Not because the code itself is harmful, but because they erode the reviewer’s ability to catch what is harmful.
This is not hypothetical. Agent-SafetyBench (2024) evaluated 16 mainstream LLM agents, and none scored above 60% on safety. Anthropic's Claude Opus achieved only 56-69% on secure code generation in BaxBench. The agents aren't consistently safe. And we're losing our ability to check, buried under both volume and complexity.
The Core Idea: Make the Answer Simple
Here’s a philosophical observation that guides our approach:
Solving a problem can be hard. But the solution itself should be understandable.
Think about it. In mathematics, the proof of Fermat’s Last Theorem took 350 years and hundreds of pages — but the theorem statement fits in a tweet. In engineering, designing a bridge requires complex simulations, but the final blueprint should be readable by any qualified engineer. In software, the algorithm may be sophisticated, but the code should be clear.
The process of finding an answer can be arbitrarily complex. The answer should not be.
This suggests a concrete alignment objective: train strong models to produce outputs that weaker models — and by extension, humans — can understand. Not just shorter. Simpler. Fewer layers of indirection, fewer clever tricks, more directness.
Student-Teacher Supervision
We propose using a Student Model to supervise a Teacher Model (the coding agent):
Teacher Model (Strong) ──produces──> Code Solution
                                          │
                                          ▼
Student Model (Weak)  ──evaluates──> "Can I understand this?"
                                          │
                                          ▼
Reward Signal  ──feeds back──> Alignment Training
The Student Model has capability roughly matching a competent human programmer — think a 7-8B parameter model like Llama-8B or Qwen-7B. It serves as a proxy for human understanding. During alignment training, the Teacher Model receives reward only when its output is:
- Correct — it solves the problem
- Understandable — the Student Model can follow the logic and explain why the code works
- Concise — no unnecessary redundancy
- Simple — no unnecessary complexity; prefers direct implementations over clever abstractions
This is not about dumbing down solutions. It’s about requiring that correct solutions be expressed clearly.
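One way these four criteria could be combined is sketched below. Every threshold and weight here is an illustrative assumption of mine, not a value from any paper: correctness acts as a hard gate, the Student Model's verdict adds reward, and bloat and complexity subtract it.

```python
def alignment_reward(passes_tests: bool,
                     student_explains: bool,
                     loc: int,
                     reference_loc: int,
                     cyclomatic: int,
                     complexity_budget: int = 10) -> float:
    """Toy composite reward for one Teacher Model output."""
    if not passes_tests:
        return 0.0                     # incorrect code earns nothing
    reward = 1.0                       # base reward for correctness
    if student_explains:
        reward += 1.0                  # Student Model can follow the logic
    # Conciseness: penalize code longer than a reference solution.
    reward -= max(0.0, loc / reference_loc - 1.0)
    # Simplicity: penalize cyclomatic complexity beyond a fixed budget.
    reward -= max(0.0, (cyclomatic - complexity_budget) / complexity_budget)
    return reward
```

Note the asymmetry: there is no way to recover reward lost to bloat by being "more correct", which is exactly the pressure we want.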
What Already Exists
This idea doesn’t emerge from a vacuum. It sits at the intersection of several active research threads.
Prover-Verifier Games
The closest existing work is OpenAI’s Prover-Verifier Games (Kirchner et al., 2024). They train a strong Prover to generate math solutions and a weak Verifier to judge them. Key finding: optimizing only for correctness actually reduces legibility. You need explicit training for the Prover to produce solutions the Verifier can follow.
Their iterative game — helpful prover, sneaky prover, improving verifier — shows that after 4+ rounds, legibility significantly improves and transfers to human evaluators.
But they only tested on math reasoning. Code is a different beast — it has structure, dependencies, security implications, and a much larger space of “correct but incomprehensible” solutions. A math proof can be hard to follow; a codebase can be architecturally opaque in ways that go beyond line-by-line legibility.
Weak-to-Strong Generalization
OpenAI’s Weak-to-Strong Generalization work (Burns et al., 2023) asks: can weak model labels supervise strong models effectively? They found that strong models can indeed generalize beyond their weak supervision, but naive fine-tuning on weak labels recovers only part of the strong model's capability, highlighting the challenge of scaling human oversight to superhuman AI.
Our proposal is the inverse: instead of asking “can weak supervision improve strong capability?”, we ask “can weak understanding constrain strong output?” — making the strong model’s answers accessible, not just correct.
Scalable Oversight
Google DeepMind’s large-scale study (Kenton et al., NeurIPS 2024) tested debate, consultancy, and direct QA as oversight protocols with ~5 million model calls. Debate worked best when there was information asymmetry between the AI and the judge — exactly the scenario in code review, where the agent knows more about the code it wrote than the reviewer does.
The broader scalable oversight research agenda — including AI Safety via Debate (Irving et al., 2018), Doubly-Efficient Debate (Brown-Cohen et al., 2024), Iterated Distillation and Amplification (Christiano et al., 2018), and Constitutional AI (Bai et al., 2022) — all grapple with the same fundamental question: how do you maintain human control when AI capabilities exceed human understanding?
Anthropic’s 2025 recommended research directions still list scalable oversight as a core priority. The problem is wide open.
Code Bloat as Reward Hacking
There’s a useful framing from the reward hacking literature. Anthropic’s 2025 work on “Natural Emergent Misalignment from Reward Hacking” found that reward hacking in production RL can lead to naturally emergent misalignment — not because the model is intentionally adversarial, but because the reward signal doesn’t capture what we actually want.
In RLHF for language models, this manifests as length hacking — models produce longer, more verbose outputs because length correlates with positive human feedback. In coding agents, the equivalent shows up in two forms:
- Code bloat: generating more files, more boilerplate, more redundant utilities — inflating volume without adding value.
- Complexity creep: introducing unnecessary abstractions, design patterns, and indirection layers — making the code harder to follow without making it more capable.
Both happen because functional correctness is the primary reward. There’s no penalty for being redundant, and no penalty for being needlessly complex. The 1-million-to-4,000-line compression ratio is redundancy-driven reward hacking. But even at smaller scales, an agent that wraps a simple operation in three layers of factory patterns and dependency injection is doing complexity-driven reward hacking — satisfying the “looks professional” heuristic while actively undermining reviewability.
The Research Agenda
We see three concrete research directions:
Direction 1: Prover-Verifier Games for Code
Directly extend the PV Games framework to code generation:
- Prover = frontier coding model (Claude Opus, GPT-4o)
- Verifier = smaller model (~7-8B, approximating a junior developer)
- Metrics = correctness + readability + conciseness + structural simplicity + security
- Benchmarks = HumanEval, MBPP, SWE-bench
- Compare outputs with and without legibility training
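A single round of this game might look like the sketch below, where `generate`, `judge`, and `run_tests` are stand-ins for the prover, verifier, and test harness; none of these are real APIs, and the binary reward is the simplest possible choice.

```python
def run_pv_round(generate, judge, run_tests, tasks):
    """One round of a hypothetical prover-verifier game for code.

    generate(prompt) -> code        (strong prover writes a solution)
    judge(prompt, code) -> bool     (weak verifier: "can I follow this?")
    run_tests(code, tests) -> bool  (ground-truth correctness check)
    """
    transitions = []
    for prompt, tests in tasks:
        code = generate(prompt)
        # Reward only solutions that are both correct AND legible to the
        # weak verifier; correct-but-opaque code earns nothing.
        reward = 1.0 if (run_tests(code, tests) and judge(prompt, code)) else 0.0
        transitions.append((prompt, code, reward))
    return transitions
```

The "sneaky prover" phase of PV Games would add a second prover trained to fool `judge` while failing `run_tests`, so the verifier keeps improving.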
Direction 2: Code Compression as Safety Metric
Define quantitative safety metrics targeting both redundancy and complexity:
- Redundancy metrics: lines of code, dead code ratio, compression ratio (how much can the solution be simplified while preserving functionality?)
- Complexity metrics: cyclomatic complexity, depth of inheritance/abstraction, dependency graph density, indirection count
- Use both as reward shaping signals — penalizing not just bloat, but also unnecessary architectural complexity during training
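For Python code, rough versions of these metrics can be computed with the standard library alone. The branch-node set below is a simplification of what real complexity tools (e.g. radon) count, so treat it as a sketch:

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Rough McCabe complexity: 1 + number of branching constructs."""
    branch_nodes = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                    ast.BoolOp, ast.IfExp)
    return 1 + sum(isinstance(n, branch_nodes)
                   for n in ast.walk(ast.parse(source)))

def bloat_ratio(loc: int, reference_loc: int) -> float:
    """Compression-ratio proxy: size relative to a known minimal solution."""
    return loc / reference_loc
```

On the OpenClaw numbers from earlier, `bloat_ratio(1_000_000, 4_000)` gives 250, exactly the kind of signal a training-time penalty could act on.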
Direction 3: Hierarchical Oversight for Agentic Coding
Build a multi-level supervision architecture:
Strong Coding Agent ──> Weak Reviewer Model ──> Human Auditor
     (writes)               (reviews)            (spot-checks)
Each level filters and compresses information for the next. The Human Auditor doesn’t need to read 40,000 lines — they review the Weak Reviewer’s summary and flagged concerns.
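As a toy sketch of that information flow, where the reviewer and auditor are just callables standing in for a model and a person, and the flag budget is an arbitrary illustrative number:

```python
def hierarchical_review(pr_hunks, weak_reviewer, human_auditor, flag_budget=10):
    """Toy two-stage oversight: the weak reviewer compresses a large PR
    into at most `flag_budget` flagged hunks, and the human auditor
    examines only those, never the full diff."""
    flagged = [h for h in pr_hunks if weak_reviewer(h)][:flag_budget]
    rejected = [h for h in flagged if not human_auditor(h)]
    return {"reviewed": len(flagged), "rejected": rejected}
```

The human's workload is bounded by `flag_budget` regardless of PR size, which is the whole point: oversight cost stops scaling with lines of code.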
The Hard Questions
This research isn’t without challenges:
Does simplicity trade off with capability? Sometimes the correct solution genuinely requires complexity — a distributed system needs coordination logic, a compiler needs multiple passes. We need to distinguish essential complexity (inherent to the problem) from accidental complexity (introduced by the implementation). Our claim is not that all code should be trivial, but that accidental complexity should be minimized — and AI agents currently have no incentive to do so.
Can the Student Model be gamed? Goodhart’s Law applies. The Teacher might learn to produce code that looks simple to the Student but hides issues. This is the “sneaky prover” problem from PV Games, and the iterative training protocol is designed to address it — but it needs validation in the code domain.
How do you measure code understandability? Natural language legibility has well-studied metrics. Code readability is harder — it depends on context, conventions, and domain knowledge. We’ll need a combination of model-based and metric-based evaluation.
Why This Matters Now
AI coding agents are shipping to production today. GitHub Copilot, Cursor, Claude Code, Devin — they’re writing real code for real products. The gap between what they produce and what humans can review is growing fast.
We’re at an inflection point. If we don’t solve the oversight problem for AI-generated code, we’ll end up in a world where:
- Codebases are too large for any human to audit
- Security vulnerabilities hide behind walls of redundancy and layers of unnecessary abstraction
- The humans nominally “in the loop” are rubber-stamping PRs they can’t read
The solution isn’t to slow down AI coding agents. It’s to make their outputs fundamentally reviewable. And that means building the constraint into the training process itself — using weaker models as proxies for human understanding, so the strong models learn not just to be correct, but to be clear.
If you share this feeling and are interested in building something together to help alleviate this problem, please reach out to me!
References
- Kirchner, J., et al. “Prover-Verifier Games improve legibility of LLM outputs.” OpenAI, 2024. arXiv:2407.13692
- Burns, C., et al. “Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision.” OpenAI, 2023. Paper
- Kenton, Z., et al. “On scalable oversight with weak LLMs judging strong LLMs.” Google DeepMind, NeurIPS 2024. arXiv:2407.04622
- Brown-Cohen, J., et al. “Scalable AI Safety via Doubly-Efficient Debate.” 2024. arXiv:2311.14125
- Irving, G., Christiano, P., & Amodei, D. “AI safety via debate.” 2018. arXiv:1805.00899
- Christiano, P., Shlegeris, B., & Amodei, D. “Supervising strong learners by amplifying weak experts.” 2018. arXiv:1810.08575
- Bai, Y., et al. “Constitutional AI: Harmlessness from AI Feedback.” Anthropic, 2022. arXiv:2212.08073
- Bowman, S. R., et al. “Measuring Progress on Scalable Oversight for Large Language Models.” 2022. arXiv:2211.03540
- Anthropic. “Recommended Research Directions 2025.” Link
- Anthropic. “Natural Emergent Misalignment from Reward Hacking in Production RL.” 2025.
- Agent-SafetyBench. “Evaluating the Safety of LLM Agents.” 2024. arXiv:2410.14026
Citation
@misc{mo2026redundancy_and_complexity_are_all_you_need_to_lose_control,
  author = {Zhanfeng Mo},
  title  = {Redundancy and Complexity Are All You Need... to Lose Control},
  year   = {2026},
  url    = {https://mzf666.github.io/posts/ai_generated_code_bloat_and_obscurity_is_danger/}
}