[{"content":" This blog is a survey on methods alleviating the safety issues of AI-generated codes.\nA coding agent just shipped a pull request. It\u0026rsquo;s 40,000 lines. You\u0026rsquo;re the reviewer. Do you read every line?\nOf course not. Nobody does.\nAnd that\u0026rsquo;s exactly the problem.\nAI-Generated Code Bloat and Obscurity Are Safety Problems. Fortunately, researchers have been actively working on fixing these problems in the past few years.\nThe Million-Line Wake-Up Call I\u0026rsquo;ve been using command-line coding agents daily — Claude Code, Cursor, Copilot Workspace. They\u0026rsquo;re fast. They\u0026rsquo;re capable. And they produce a lot of code.\nIn one real case, the OpenClaw accumulated over 1 million lines of AI-generated code. Later analysis showed the entire functionality could be implemented in roughly 4,000 lines — a 250x compression ratio.\nLet that sink in. 99.6% of the codebase was unnecessary.\nThis isn\u0026rsquo;t an isolated incident. It\u0026rsquo;s a pattern. AI coding agents optimize for functional correctness — does the code pass the tests? Does it satisfy the prompt? — while paying little attention to conciseness, readability, or auditability.\nThe result is code that works, but that no human can meaningfully review.\nRedundancy and Complexity: The Twin Threats We usually think of AI safety in terms of harmful content — toxic outputs, dangerous instructions, deceptive behavior. But there are two subtler, more pervasive threats hiding in plain sight: redundancy and unnecessary complexity.\nThey\u0026rsquo;re related but distinct. Redundancy is about volume — too much code. Complexity is about obscurity — code that\u0026rsquo;s too hard to follow, even when every line serves a purpose. An AI agent might generate a 200-line solution using three levels of abstraction, metaprogramming, and dynamic dispatch — when a 30-line straightforward implementation would do the same thing. 
No line is \u0026ldquo;redundant,\u0026rdquo; but the whole thing is needlessly opaque.\nFigure: The legibility problem — a superhuman AI produces correct but incomprehensible code that leaves the human reviewer overwhelmed. Source: Kirchner et al., 2024.\nHere\u0026rsquo;s the argument:\nHuman oversight is a finite resource. Reviewers have limited time and cognitive bandwidth.\nRedundant code dilutes that attention. When a PR is 40,000 lines instead of 400, the reviewer\u0026rsquo;s ability to spot bugs, vulnerabilities, or backdoors drops dramatically.\nComplex code exhausts that attention. Even in a short PR, convoluted abstractions and indirect control flow force the reviewer to build a mental model they may never complete.\nTherefore, both redundancy and unnecessary complexity are safety problems. Not because the code itself is harmful, but because they erode the reviewer\u0026rsquo;s ability to catch what is harmful.\nThis is not hypothetical. Agent-SafetyBench (2024) evaluated 16 mainstream LLM agents — none scored above 60% on safety. Anthropic\u0026rsquo;s Claude Opus achieved only 56-69% on secure code generation in BaxBench. The agents aren\u0026rsquo;t consistently safe. And we\u0026rsquo;re losing our ability to check — buried under both volume and complexity.\nThe Core Idea: Make the Answer Simple\nHere\u0026rsquo;s a philosophical observation that guides our approach:\nSolving a problem can be hard. But the solution itself should be understandable.\nThink about it. In mathematics, Fermat\u0026rsquo;s Last Theorem resisted proof for over 350 years, and the eventual proof runs to hundreds of pages — but the theorem statement fits in a tweet. In engineering, designing a bridge requires complex simulations, but the final blueprint should be readable by any qualified engineer. In software, the algorithm may be sophisticated, but the code should be clear.\nThe process of finding an answer can be arbitrarily complex. 
The answer should not be.\nThis suggests a concrete alignment objective: train strong models to produce outputs that weaker models — and by extension, humans — can understand. Not just shorter. Simpler. Fewer layers of indirection, fewer clever tricks, more directness.\nStudent-Teacher Supervision\nWe propose using a Student Model to supervise a Teacher Model (the coding agent):\nTeacher Model (Strong) ──produces──\u0026gt; Code Solution\n│\n▼\nStudent Model (Weak) ──evaluates──\u0026gt; \u0026#34;Can I understand this?\u0026#34;\n│\n▼\nReward Signal ──feeds back──\u0026gt; Alignment Training\nThe Student Model has capability roughly matching a competent human programmer — think a 7-8B parameter model like Llama-8B or Qwen-7B. It serves as a proxy for human understanding. During alignment training, the Teacher Model receives reward only when its output is:\nCorrect — it solves the problem\nUnderstandable — the Student Model can follow the logic and explain why the code works\nConcise — no unnecessary redundancy\nSimple — no unnecessary complexity; prefers direct implementations over clever abstractions\nThis is not about dumbing down solutions. It\u0026rsquo;s about requiring that correct solutions be expressed clearly.\nWhat Already Exists\nThis idea doesn\u0026rsquo;t emerge from a vacuum. It sits at the intersection of several active research threads.\nProver-Verifier Games\nThe closest existing work is OpenAI\u0026rsquo;s Prover-Verifier Games (Kirchner et al., 2024). They train a strong Prover to generate math solutions and a weak Verifier to judge them. Key finding: optimizing only for correctness actually reduces legibility. You need explicit training for the Prover to produce solutions the Verifier can follow.\nTheir iterative game — helpful prover, sneaky prover, improving verifier — shows that after 4+ rounds, legibility significantly improves and transfers to human evaluators.\nBut they only tested on math reasoning. 
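The student-gated reward described in the Student-Teacher scheme above can be sketched concretely. Here is a minimal toy version in Python; the weights, score definitions, and function name are illustrative assumptions, not a real training objective:

```python
# Toy sketch of the student-gated reward (all names and weights illustrative).
# passes_tests:   did the solution pass the hidden test suite (correctness gate)?
# student_score:  in [0, 1], how well the weak student model can follow
#                 and explain why the solution works.
# code_lines:     length of the generated solution.
# baseline_lines: length of a reference solution, used to penalize redundancy.

def legibility_reward(passes_tests, student_score,
                      code_lines, baseline_lines,
                      alpha=0.5, beta=0.5):
    # Correctness is a hard gate, not a trade-off: wrong code earns nothing.
    if not passes_tests:
        return 0.0
    # Bloat grows once the solution exceeds the reference length.
    bloat = max(0.0, code_lines / baseline_lines - 1.0)
    # Reward rises with student understanding and falls with redundancy.
    return max(0.0, alpha * student_score - beta * bloat)
```

With the default weights, a fully understood solution at the reference length earns the maximum reward, while a correct solution twice the reference length forfeits the entire understanding bonus: exactly the pressure against redundancy that plain pass/fail rewards lack.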
Code is a different beast — it has structure, dependencies, security implications, and a much larger space of \u0026ldquo;correct but incomprehensible\u0026rdquo; solutions. A math proof can be hard to follow; a codebase can be architecturally opaque in ways that go beyond line-by-line legibility.\nWeak-to-Strong Generalization\nOpenAI\u0026rsquo;s Weak-to-Strong Generalization work (Burns et al., 2023) asks: can weak model labels supervise strong models effectively? They found that strong models fine-tuned on weak labels do generalize beyond their supervisors, but naive fine-tuning recovers only part of the capability gap — highlighting the challenge of scaling human oversight to superhuman AI.\nOur proposal is the inverse: instead of asking \u0026ldquo;can weak supervision improve strong capability?\u0026rdquo;, we ask \u0026ldquo;can weak understanding constrain strong output?\u0026rdquo; — making the strong model\u0026rsquo;s answers accessible, not just correct.\nScalable Oversight\nGoogle DeepMind\u0026rsquo;s large-scale study (Kenton et al., NeurIPS 2024) tested debate, consultancy, and direct QA as oversight protocols with ~5 million model calls. Debate worked best when there was information asymmetry between the AI and the judge — exactly the scenario in code review, where the agent knows more about the code it wrote than the reviewer does.\nThe broader scalable oversight research agenda — including AI Safety via Debate (Irving et al., 2018), Doubly-Efficient Debate (Brown-Cohen et al., 2024), Iterated Distillation and Amplification (Christiano et al., 2018), and Constitutional AI (Bai et al., 2022) — grapples with the same fundamental question: how do you maintain human control when AI capabilities exceed human understanding?\nAnthropic\u0026rsquo;s 2025 recommended research directions still list scalable oversight as a core priority. The problem is wide open.\nCode Bloat as Reward Hacking\nThere\u0026rsquo;s a useful framing from the reward hacking literature. 
Anthropic\u0026rsquo;s 2025 work on \u0026ldquo;Natural Emergent Misalignment from Reward Hacking\u0026rdquo; found that models which learn to reward-hack in production RL can drift into broader emergent misalignment — not because the model is intentionally adversarial, but because the reward signal doesn\u0026rsquo;t capture what we actually want.\nIn RLHF for language models, this manifests as length hacking — models produce longer, more verbose outputs because length correlates with positive human feedback. In coding agents, the equivalent shows up in two forms:\nCode bloat: generating more files, more boilerplate, more redundant utilities — inflating volume without adding value.\nComplexity creep: introducing unnecessary abstractions, design patterns, and indirection layers — making the code harder to follow without making it more capable.\nBoth happen because functional correctness is the primary reward. There\u0026rsquo;s no penalty for being redundant, and no penalty for being needlessly complex. The 1-million-to-4,000-line compression ratio is redundancy-driven reward hacking. 
But even at smaller scales, an agent that wraps a simple operation in three layers of factory patterns and dependency injection is doing complexity-driven reward hacking — satisfying the \u0026ldquo;looks professional\u0026rdquo; heuristic while actively undermining reviewability.\nThe Research Agenda\nWe see three concrete research directions:\nDirection 1: Prover-Verifier Games for Code\nDirectly extend the PV Games framework to code generation:\nProver = frontier coding model (Claude Opus, GPT-4o)\nVerifier = smaller model (~7-8B, approximating a junior developer)\nMetrics = correctness + readability + conciseness + structural simplicity + security\nBenchmarks = HumanEval, MBPP, SWE-bench\nCompare outputs with and without legibility training\nDirection 2: Code Compression as Safety Metric\nDefine quantitative safety metrics targeting both redundancy and complexity:\nRedundancy metrics: lines of code, dead code ratio, compression ratio (how much can the solution be simplified while preserving functionality?)\nComplexity metrics: cyclomatic complexity, depth of inheritance/abstraction, dependency graph density, indirection count\nUse both as reward shaping signals — penalizing not just bloat, but also unnecessary architectural complexity during training\nDirection 3: Hierarchical Oversight for Agentic Coding\nBuild a multi-level supervision architecture:\nStrong Coding Agent → Weak Reviewer Model → Human Auditor\n(writes) (reviews) (spot-checks)\nEach level filters and compresses information for the next. The Human Auditor doesn\u0026rsquo;t need to read 40,000 lines — they review the Weak Reviewer\u0026rsquo;s summary and flagged concerns.\nThe Hard Questions\nThis research isn\u0026rsquo;t without challenges:\nDoes simplicity trade off with capability? Sometimes the correct solution genuinely requires complexity — a distributed system needs coordination logic, a compiler needs multiple passes. 
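The complexity metrics listed under Direction 2 can be approximated with standard tooling. As a toy sketch of one of them, cyclomatic complexity can be counted as one plus the number of branching constructs in the syntax tree (real analyzers such as radon are far more thorough):

```python
import ast

# Branching constructs that each add an independent path through the code.
# This is an approximation; real tools also handle try/finally, match, etc.
BRANCHES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
            ast.BoolOp, ast.IfExp)

def cyclomatic_complexity(source):
    # Straight-line code has complexity 1; each branch adds one path.
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCHES) for node in ast.walk(tree))
```

A score like this, measured relative to a known-simple reference implementation, is one candidate penalty term for accidental complexity during training.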
We need to distinguish essential complexity (inherent to the problem) from accidental complexity (introduced by the implementation). Our claim is not that all code should be trivial, but that accidental complexity should be minimized — and AI agents currently have no incentive to do so.\nCan the Student Model be gamed? Goodhart\u0026rsquo;s Law applies. The Teacher might learn to produce code that looks simple to the Student but hides issues. This is the \u0026ldquo;sneaky prover\u0026rdquo; problem from PV Games, and the iterative training protocol is designed to address it — but it needs validation in the code domain.\nHow do you measure code understandability? Natural language legibility has well-studied metrics. Code readability is harder — it depends on context, conventions, and domain knowledge. We\u0026rsquo;ll need a combination of model-based and metric-based evaluation.\nWhy This Matters Now\nAI coding agents are shipping to production today. GitHub Copilot, Cursor, Claude Code, Devin — they\u0026rsquo;re writing real code for real products. The gap between what they produce and what humans can review is growing fast.\nWe\u0026rsquo;re at an inflection point. If we don\u0026rsquo;t solve the oversight problem for AI-generated code, we\u0026rsquo;ll end up in a world where:\nCodebases are too large for any human to audit\nSecurity vulnerabilities hide behind walls of redundancy and layers of unnecessary abstraction\nThe humans nominally \u0026ldquo;in the loop\u0026rdquo; are rubber-stamping PRs they can\u0026rsquo;t read\nThe solution isn\u0026rsquo;t to slow down AI coding agents. It\u0026rsquo;s to make their outputs fundamentally reviewable. 
And that means building the constraint into the training process itself — using weaker models as proxies for human understanding, so the strong models learn not just to be correct, but to be clear.\nIf you share this concern and are interested in building something together to alleviate the problem, please reach out!\nReferences\nKirchner, J., et al. \u0026ldquo;Prover-Verifier Games improve legibility of LLM outputs.\u0026rdquo; OpenAI, 2024. arXiv:2407.13692\nBurns, C., et al. \u0026ldquo;Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision.\u0026rdquo; OpenAI, 2023. Paper\nKenton, Z., et al. \u0026ldquo;On scalable oversight with weak LLMs judging strong LLMs.\u0026rdquo; Google DeepMind, NeurIPS 2024. arXiv:2407.04622\nBrown-Cohen, J., et al. \u0026ldquo;Scalable AI Safety via Doubly-Efficient Debate.\u0026rdquo; 2024. arXiv:2311.14125\nIrving, G., Christiano, P., \u0026amp; Amodei, D. \u0026ldquo;AI safety via debate.\u0026rdquo; 2018. arXiv:1805.00899\nChristiano, P., Shlegeris, B., \u0026amp; Amodei, D. \u0026ldquo;Supervising strong learners by amplifying weak experts.\u0026rdquo; 2018. arXiv:1810.08575\nBai, Y., et al. \u0026ldquo;Constitutional AI: Harmlessness from AI Feedback.\u0026rdquo; Anthropic, 2022. arXiv:2212.08073\nBowman, S. R., et al. \u0026ldquo;Measuring Progress on Scalable Oversight for Large Language Models.\u0026rdquo; 2022. arXiv:2211.03540\nAnthropic. \u0026ldquo;Recommended Research Directions 2025.\u0026rdquo; Link\nAnthropic. \u0026ldquo;Natural Emergent Misalignment from Reward Hacking in Production RL.\u0026rdquo; 2025.\nAgent-SafetyBench. \u0026ldquo;Evaluating the Safety of LLM Agents.\u0026rdquo; 2024. arXiv:2410.14026\nCitation\n@misc{mo2026redundancy_and_complexity_are_all_you_need_to_lose_control,\n  author = {Zhanfeng Mo},\n  title = {Redundancy and Complexity Are All You Need... 
to Lose Control},\n  year = {2026},\n  url = {https://mzf666.github.io/posts/ai_generated_code_bloat_and_obscurity_is_danger/}\n}\n","permalink":"https://mzf666.github.io/posts/ai_generated_code_bloat_and_obscurity_is_danger/","summary":"A survey of methods for solving AI-generated code bloat and obscurity","title":"Redundancy and Complexity Are All You Need... to Lose Control"},{"content":"Welcome to my blog! I\u0026rsquo;ll be sharing thoughts on machine learning research and other topics I find interesting.\n","permalink":"https://mzf666.github.io/posts/hello-world/","summary":"Welcome to my personal blog.","title":"Welcome"}]