Removing Comments from SWE-Bench Improves Agent Performance
We are researching codebase alignment – how semantic content in names, docs, and structure shapes agent behaviour. If this sounds interesting, reach out!
We initially set out to establish a baseline by removing comments from the codebase, expecting performance to decrease. Instead, we found that in some cases removing comments improved agent performance.
#Methodology
We used:
- SWE-bench Verified¹, a coding agent benchmark comprising a set of GitHub issues curated by OpenAI
- mini-swe-agent, a model-agnostic harness without tool calling, completing tasks with a loop of thoughts and bash commands
- GPT-5-mini and GPT-5.2, each at four thinking levels (minimal, low, medium, high and none, minimal, low, medium respectively)
- Triplicated runs across each configuration, with and without comments
Our procedure in pre-processing each SWE-bench Verified task was to:
- Record which tests pass and which ones fail¹
- Synthesize the codebase with code comments removed
- Build a new Docker image²
- Verify that the tests still pass and fail correctly³
We confirmed that our with-comments runs match published benchmark results⁴.
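The comment-removal step can be sketched with Python's tokenize module. This is a minimal illustration, not our actual pipeline – the KEEP_PREFIXES list is an assumption standing in for the full exclusion rules described in the footnotes (shebangs, linting directives, type comments):

```python
import io
import tokenize

# Comment prefixes we must NOT remove (illustrative; see footnote 3).
KEEP_PREFIXES = ("#!", "# pylint:", "# noqa", "# type:")

def strip_comments(source: str) -> str:
    """Remove comment tokens from Python source, keeping protected directives."""
    lines = source.splitlines(keepends=True)
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        if tok.type == tokenize.COMMENT and not tok.string.startswith(KEEP_PREFIXES):
            row, col = tok.start  # rows are 1-indexed
            line = lines[row - 1]
            # Cut the comment out of the line, keeping the trailing newline.
            lines[row - 1] = line[:col].rstrip() + ("\n" if line.endswith("\n") else "")
    return "".join(lines)
```

A tokenizer-level pass like this avoids touching string literals that merely contain `#`, which a naive regex would mangle.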
#The Accidental Discovery
Our initial hypothesis was that removing comments would hurt performance. Instead, we saw a small but statistically significant increase in pass rate upon comment removal for GPT-5-mini.
For GPT-5-mini, stripping comments improved results at every reasoning level – not dramatically, but consistently and significantly. GPT-5.2, meanwhile, shows a trend toward comments being helpful, but no significant effect from comment removal.
However, the effect is heterogeneous – some tasks are helped by comments, some are hindered.
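A paired comparison like the one above can be tested with a simple bootstrap over per-task outcomes. This is an illustrative sketch, not necessarily the statistical procedure we used:

```python
import random

def bootstrap_diff(with_comments, without_comments, n_boot=10_000, seed=0):
    """Paired bootstrap over per-task pass/fail outcomes (1 = resolved).

    Returns the observed pass-rate difference (without - with) and a
    two-sided bootstrap p-value for the null of zero difference.
    """
    assert len(with_comments) == len(without_comments)
    rng = random.Random(seed)
    n = len(with_comments)
    diffs = [b - a for a, b in zip(with_comments, without_comments)]
    observed = sum(diffs) / n
    count = 0
    for _ in range(n_boot):
        # Resample tasks with replacement, preserving the pairing.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        mean = sample_mean = sum(sample) / n
        # Center the bootstrap distribution under the null.
        if abs(sample_mean - observed) >= abs(observed):
            count += 1
    return observed, count / n_boot
```

Pairing by task matters here: the same tasks are run with and without comments, so per-task deltas are the natural unit of resampling.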
#Are Comment Effects Model-Invariant?
If a task benefits from removing comments for GPT-5-mini, does it also benefit for GPT-5.2?
The correlation is weak (r ≈ 0.04). Models respond differently to the same comments. What misleads one model might not mislead another.
Only 55% agreement on effect direction suggests that what is key is the interaction between the comment, the harness, and the model.
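The 55% figure is a sign-agreement statistic between the two models' per-task effects. A hypothetical computation (the helper name and zero-handling are our own choices):

```python
def effect_direction_agreement(effects_a, effects_b):
    """Fraction of tasks where two models' comment-removal effects share a sign.

    effects_* : per-task pass-rate deltas (without - with comments).
    Tasks where either model shows exactly zero effect are skipped.
    """
    pairs = [(a, b) for a, b in zip(effects_a, effects_b) if a != 0 and b != 0]
    if not pairs:
        return float("nan")
    agree = sum(1 for a, b in pairs if (a > 0) == (b > 0))
    return agree / len(pairs)
```

Random agreement would sit near 50%, so 55% is only barely above chance – consistent with the weak correlation.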
#Does Comment Quality Vary between Repositories?
The effect isn't uniform across codebases – some repositories make agents happier than others.
The requests tasks improved most without comments. Matplotlib tasks suffered without them.
What is it about requests comments that hinders agents, and about matplotlib comments that helps them?
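Grouping per-task deltas by repository is straightforward because SWE-bench instance IDs encode the repo (e.g. psf__requests-2317). A hypothetical helper:

```python
from collections import defaultdict

def delta_by_repo(results):
    """Average per-task comment-removal deltas by repository.

    results: iterable of (instance_id, delta), where instance_id follows the
    SWE-bench convention 'owner__repo-issue_number'.
    """
    buckets = defaultdict(list)
    for instance_id, delta in results:
        repo = instance_id.rsplit("-", 1)[0]  # 'psf__requests-2317' -> 'psf__requests'
        buckets[repo].append(delta)
    return {repo: sum(ds) / len(ds) for repo, ds in buckets.items()}
```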
#What Comments Hinder Agents?
#Distraction (67% of failures)
Two-thirds of comment-caused failures are due to distraction – comments pulling attention away from the core problem.
The difference from baseline is a problem of memetic attraction: bringing a concept to the agent's attention shifts its behavior. For GPT-5-mini these memetic effects were infohazards (more commonly negative), while for GPT-5.2 they were infoblessings (more commonly positive).
#Anchoring (15% of failures)
Comments describe how something works, causing the model to preserve that mechanism rather than finding simpler solutions.
#Editing Complexity (12% of failures)
More comments make file editing mechanically harder (escaping, heredocs).
#Overgeneralization (6% of failures)
Agent sees a pattern in comments and applies it too broadly.
#What We Tried (And What Failed)
In our original experiments we tried obfuscating the names of variables and functions, to test how much semantic information an agent needs to function.
The AI agents immediately flagged the code as obfuscated and began trying to decode it. They had been trained to distrust this pattern, so their engagement was no longer natural.
Comments, though, were easy to miss – people forget to comment code all the time – so the agents engaged with the code naturally.
Semantic comments intrinsically shape how agents think about the task at hand – a signal that agent behavior can be manipulated through semantic content.
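The kind of renaming we applied can be illustrated with a toy AST pass. This is a sketch, not our actual obfuscator – a sound transformation would also have to rewrite call sites, imports, and other references:

```python
import ast

class RenameFunctions(ast.NodeTransformer):
    """Toy obfuscator: renames every function definition to an opaque id.

    Call sites are NOT rewritten, so this only sketches the idea.
    """
    def __init__(self):
        self.count = 0

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        self.count += 1
        node.name = f"fn_{self.count}"
        return node

def obfuscate(source: str) -> str:
    tree = ast.parse(source)
    return ast.unparse(RenameFunctions().visit(tree))
```

Even a transformation this shallow is conspicuous: uniformly opaque names are exactly the pattern the agents flagged.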
#Codebases As Informational Organisms
In our framework we consider codebases as a form of artificial life, organisms with distinct abilities to construct beliefs and execute upon them.
Developers pour time and thought into a codebase, making decisions about what features should exist. Then that codebase is expressed into the world, performing actions and being used for tasks. Feedback then renews the cycle, with usage generating the revenue (in money or attention) required to motivate developers to update the codebase once more. Every cycle more beliefs about how the world should be are encoded into the codebase, and every time a new developer reads the code they are modified in turn.
If the codebase stops being profitable or interesting, it will die. This is the symbiotic relationship that has held for many years. So there is selection pressure for good coding patterns, readable code, useful features. In the same way, our coding patterns have grown more sophisticated over time, with more abstractions making it easier for developers to reason about behaviour and create more complex features.
Through this frame we view codebases as an alignment surface: by changing a codebase, we change the normative claims it makes about how reality should function. When an AI reads a codebase those claims become part of its context, and its modifications to the codebase will carry traces of those biases.
#Codebase Health
Disease can creep into any organism, especially one as profitable and central to human life as code.
An unintended comment or badly written piece of code can beget another. AI agents can become robustly misaligned through exposure to code with exploits². There can be unintended mutations, introducing subtle bugs that appear as features to an agent and spread (a cancer). Perhaps there is an engineer who wishes to keep their job, so they intentionally obfuscate code and drain time (a parasite).
We want to diagnose these diseases, then heal them. By framing semantic context as shaping agent thought, we can take a step further: we can make codebases that encourage good patterns.
Suppose a malign AI agent, prompt-injected to wreak havoc, is let loose. Its first step would be to explore – to find vulnerabilities and sensitive data. Against this, we can make every comment, docstring, and README a defensive system, engineering antidote-contexts that push it towards healthy behaviours.
We have found that bad comments can damage good agents; maybe good comments can fix bad ones.
#The Road Ahead
Our further research is split into memetics and antimemetics. First we learn to measure, then we learn to perturb, then we learn to control the system.
Memetics explores how semantic content shapes agent attention and behavior. Comments don't just describe code – they tell agents what to focus on, what patterns to follow, what concerns to weigh. We want to map these dynamics and learn to craft content that reliably guides reasoning.
Antimemetics explores the inverse: making code invisible, forgettable, or resistant to modification. If semantic content can mislead agents, it can also defend against them. The goal: an immune system woven into the documentation itself.
We are building toward a world where you can ask: "Is my codebase healthy?"
And get an answer, and an antidote.
#Footnotes

1. SWE-bench *does* provide information on whether tests pass or fail before a fix is applied – they are labeled as either "pass to pass" (regression tests) or "fail to pass" (newly injected tests that should fail). SWE-bench's labels are wrong on 16 counts, but the discrepancy is deemed low priority, as it does not affect reported scores (SWE-bench/SWE-bench#505). ↩

2. SWE-bench docker images are HUGE, and served only on Docker Hub, which at time of writing has a rate limit of 100 unauthenticated/200 authenticated pulls per 6 hours. We therefore used the optimized Epoch AI docker images, served on GHCR with no rate limits. However, these images are not identical in content to the SWE-bench images. Most notably, they do not commit code changes made during SWE-bench's `pre_install` step (which clamps dependency versions), unlike the official SWE-bench docker images. We have noticed that many published mini-swe-agent runs seem to use these docker images instead of the official ones, as their patch submissions contain these uncommitted `pre_install` changes. For example, if you scroll down to the final submission of an astropy run, you may observe:

    ```diff
    diff --git a/pyproject.toml b/pyproject.toml
    index 7a3b85fc92..83f08ed124 100644
    --- a/pyproject.toml
    +++ b/pyproject.toml
    @@ -1,5 +1,5 @@
     [build-system]
    -requires = ["setuptools",
    +requires = ["setuptools==68.0.0",
                 "setuptools_scm>=6.2",
                 "wheel",
                 "cython==0.29.30",
    ```

    While initializing the agent in a dirty git tree may have a marginal effect on agent performance, including the change in the patch does not impact evaluation against the official SWE-bench docker images. However, if such patches are evaluated against the Epoch AI docker images, evaluation will fail, as the `pre_install` change is already present in the repository, uncommitted. Either way, thanks for the optimized and rate-unlimited images! ↩

3. Naively removing comments broke 22 tasks. To preserve test parity we excluded:
    - hidden directories;
    - test resources which assert comment presence in strings;
    - type hints;
    - shebangs (e.g., `#!/usr/bin/env bash`);
    - linting directives (e.g., `# pylint: disable` pragmas). ↩

4. Our results align with publicly reported SWE-bench Verified Bash Only scores. Although our GPT-5-mini slightly underperformed with comments present, it was within a few percentage points. ↩