Against the METR Graph
A deep dive into A.I. understanders' favorite chart crime
Note: An edited version of this essay is now available here at Transformer.
1. Introduction
If you have been following A.I. research and development over the last couple of years, you have likely encountered one or both of the figures below.
They come from non-profit A.I. research institute METR’s “Measuring AI Ability to Complete Long Tasks” benchmark, which has come to be seen as a critical bellwether of A.I. capability growth. The Long Tasks benchmark measures trends in A.I. models’ abilities to complete software engineering tasks by comparison to how long those tasks take humans.
The AI Futures Project (the team responsible for AI 2027) recently called METR’s Long Tasks benchmark “the best benchmark currently available for extrapolating to very capable AIs.” In his 2025 letter, Google DeepMind engineer Zhengdong Wang describes METR’s benchmark as “many people’s first introduction to the case that AI will be a big deal.” Screenshots of what is often simply called “the METR graph” have been posted and reposted to death on X by a veritable army of A.I. researchers, engineers, and businesspeople.
The subtext (or often simply the text) of these posts is that A.I. models are improving so rapidly that they may soon displace a large proportion of human knowledge-workers, not to mention pose a wide variety of serious, even existential risks to our wellbeing. Alternatively, they could usher in a utopia of economy-wide automation, freeing humans from the yoke of necessary labor, and bringing about a world of once-unimaginable material abundance.
While I am skeptical of these extreme scenarios, my goal here is not to convince you that they are overblown. Instead, it is to convince you that METR’s Long Tasks benchmark does not even remotely constitute evidence in their favor. This benchmark suffers from such severe methodological problems that it is a hair’s breadth from being totally useless. Its influence constitutes a concerning signal about the epistemics of the A.I. community broadly construed, and raises serious questions about the trustworthiness of the many non-profit research institutions (such as METR) that have sprung up in the wake of the A.I. boom. While I am not the first to criticize this benchmark, I hope I am the last.
2. METR knowingly overpublicizes its more misleading results
The y-axis of the METR graph plots software engineering task durations at which models succeed 50% of the time (see the two graphs above). These durations are calculated by fitting a logistic curve to the models’ (average) success rates (across 8 runs per model/task pair) as a function of humans’ completion times for the same tasks, and identifying where the curve crosses 50%.
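To make the procedure concrete, here is a minimal sketch of the idea in Python. All numbers are invented for illustration, and the crude grid search below stands in for the proper logistic regression METR actually uses:

```python
import numpy as np

# Invented data (not METR's): human completion times in minutes for six
# tasks, and one model's mean success rate on each (averaged over 8 runs).
human_minutes = np.array([1.0, 4.0, 15.0, 60.0, 240.0, 960.0])
success_rate = np.array([0.97, 0.90, 0.75, 0.50, 0.20, 0.05])

def logistic(log_t, slope, midpoint):
    """Predicted P(success) as a function of log task duration."""
    return 1.0 / (1.0 + np.exp(slope * (log_t - midpoint)))

# Fit by crude least-squares grid search (illustration only).
log_t = np.log(human_minutes)
best = None
for slope in np.linspace(0.1, 3.0, 60):
    for midpoint in np.linspace(log_t.min(), log_t.max(), 200):
        err = np.sum((logistic(log_t, slope, midpoint) - success_rate) ** 2)
        if best is None or err < best[0]:
            best = (err, slope, midpoint)

# The fitted curve crosses 50% exactly at log_t == midpoint, so the
# "50% time horizon" reported on the y-axis is exp(midpoint).
_, slope, midpoint = best
horizon_50 = np.exp(midpoint)
print(f"50% time horizon: ~{horizon_50:.0f} minutes")
```

With these made-up success rates the fitted horizon comes out near one hour; the 80% figures are obtained the same way, reading off where the fitted curve crosses 0.8 instead of 0.5.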

METR has also published analogous figures replacing 50% with 80% success rates. Predictably, the results are much more modest (note the y-axis):
In METR’s public-facing communications, the 50% figures are always given clear priority. But consider that even 80% reliability is unacceptably low for the critical business functions many hope (or fear) will one day be completed by A.I. agents with minimal oversight. Even with oversight (suppose you are more interested in productivity gains in the here and now), 50% reliability is so low that agents this capable seem more liable to slow users down than to speed them up. It is hard for me to see what legitimate reason METR might have for prioritizing the 50% graph over the 80% one. Doing so makes the METR graph more difficult to interpret, and more likely to be overinterpreted.
That is even more so the case given that, as METR itself acknowledges, many of the tasks it tested models and human baseliners on are extremely unrealistic:
Real-world intellectual labor often involves messy details that benchmarks usually don’t include, such as being under-specified or poorly scoped, having unclear feedback loops or success criteria, or requiring coordination between multiple streams of work in real-time. We generally observed that agents struggle more on tasks that have these “messy” details (Section 5). (Kwa et al. 2025, p. 14)
More comprehensively, the authors of METR’s Long Tasks preprint discuss five “systematic differences between [their] tasks and real tasks”:
Automatic scoring All tasks we use are automatically scorable, meaning a piece of code running in the task environment determines the final score. This imposes constraints on e.g. the format of solutions, that tend to reduce task open-endedness, and the need for sensible value judgments.
No interaction with other agents None of our tasks involve interacting with other autonomous agents. Coordinating with, or competing with, other agents seems likely to increase task difficulty. For instance, by increasing the importance of strategic decisionmaking, real-time coordination, and predicting the actions of other complex agents.
Lax resource constraints None of our SWAA tasks, and few of our HCAST tasks saliently involve making efficient use of a limited resource—a common constraint in real-world tasks.
Unpunishing Similarly, very few of our tasks are punishing of single mistakes. This is in part to reduce the expected cost of collecting human baselines. Real world tasks can often be more punishing, for instance, when they involve competing against other agents. For instance, a single blunder in a chess game can greatly reduce the chance of winning the game.
Static environments Our tasks typically use environments that do not significantly change unless directly acted upon by the agent. In contrast, real tasks often occur in the context of a changing environment. (Kwa et al. 2025, p. 19)
Beyond identifying these differences in an informal manner, the authors also break them down into 16 distinct “messiness factors” (see pp. 39-40 for a full list) and use them to assign “messiness scores” to all tasks that take humans a minute or longer to complete (i.e. all those in the HCAST and RE-Bench datasets). They then plot A.I. models’ performance on the messiest versus least messy 50% of tasks. Below is the graph of models’ performance on the messiest 50% of tasks, oddly relegated to the Long Tasks paper’s final appendix (p. 42):

You are reading that right. Not a single model topped a success rate of 30% on the messiest—that is, the most realistic—half of tasks. There is a strong argument to be made that this graph ought to be considered METR’s topline result; better yet, they could produce updated versions of their influential time-horizon and doubling-time graphs using only tasks above the median messiness score. The authors openly acknowledge that these tasks are more representative of what software engineers actually do. Why publicize results with much less external validity if not for reasons orthogonal to being maximally informative?
All in all, I find METR’s research communications choices questionable. They consistently publicize their flashier, more easily misinterpreted results despite having conducted analysis implying those results are misleading. If truth were their priority, I would expect them to take a different approach.
3. METR’s human baselining methods are abysmal
These issues are practically typos, though, compared to the problems with METR’s human baselining methods.
Something I gather few people are aware of is that METR’s benchmark results are almost entirely driven by tasks from the “HCAST” dataset. These are software engineering tasks spanning ~1 min. to 30 hrs. in human completion time. Out of the 170 tasks in METR’s task-suite, 97 come from HCAST. The rest come from SWAA, which contains 66 single-step tasks taking humans between 1 and 30 seconds, and RE-Bench, which contains 7 complex A.I. R&D tasks taking humans at least 8 hrs. This means that the vast majority of the data points on the METR graph are HCAST tasks, since all but a couple of them fall into the ~1 min. to ~5 hr. range.
How, one might wonder, did the team responsible for HCAST (almost entirely METR employees) find out how long the dataset’s underlying tasks took humans to complete? Before continuing, consider how important this question is to METR’s results. These are the times on the METR graph’s y-axis; if they are severely compromised, then so is the entire benchmark.
Keeping this in mind, here is a description of how the HCAST team collected their human baselines.
They recruited engineers with >3 years of experience “primarily ... via the professional network of METR employees” (Rein et al. 2025, p. 6). Then they assigned them to complete, as far as I can tell, an average of ~4 software engineering tasks each; there were 140 baseliners and 563 task attempts (pp. 8-9). However, METR only ended up using 97 of HCAST’s 189 tasks (Kwa et al. 2025, p. 5), and 286 (successful) baselines for the Long Tasks benchmark (Kwa et al. 2025, p. 7). Presumably, this means they ended up with around 3 baselines per task (more realistically, easier tasks probably had more completions, and harder tasks fewer).
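The arithmetic behind these estimates, using only the figures cited above:

```python
# Figures from Rein et al. 2025 (pp. 8-9) and Kwa et al. 2025 (pp. 5, 7).
baseliners = 140
task_attempts = 563
baselines_used = 286      # successful baselines used in the Long Tasks benchmark
hcast_tasks_used = 97     # HCAST tasks used in the Long Tasks benchmark

print(task_attempts / baseliners)         # ~4.0 attempts per baseliner
print(baselines_used / hcast_tasks_used)  # ~2.9 baselines per used task
```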
During task completion, baseliners had to commit their progress to a repository, and record their activity through Loom. They were also instructed to “verbalize their thoughts and processes aloud during work.” Recordings were used to ensure adherence to anti-cheating protocols (Rein et al. 2025, p. 22).
Notably, baseliners were also paid “$50-$100 per hour, plus $25 - $150 per hour in performance bonuses for completing tasks quickly and successfully” (p. 6). One bonus, conditional on successful task completion, was half the receiving baseliner’s hourly rate multiplied by the average time spent on the task by all baseliners; another, conditional on completing a given task faster than all other baseliners, was the baseliner’s full hourly rate multiplied by this same average (p. 23).
So, taking stock of just these details (there are more to come): task completion times were recorded from not just a clearly biased, but an utterly minuscule sample. As I hope we can agree, you cannot generalize to all software engineers (much less all humans) the average task completion times from ~3 people per task, all recruited from the same social network (and that 3 is generous, especially as you move up the y-axis, because some longer tasks likely had only one or two completions, and the same people would have had to complete multiple tasks; remember, HCAST had only 140 baseliners in all).
This alone comes very near to making the entire benchmark useless. I expect some will take this verdict to be excessively uncharitable, so I want to take a moment to defend it. The idea that the average time it takes ~3 engineers to complete a given task tells you even roughly what the average would be for all engineers is false, full stop. You cannot infer anything about the latter from the former; the sample is simply too small, and too biased. Nor is it reasonable, assuming you are a software engineer, to simply gut-check whether the times seem right to you. This is unscientific in the extreme. You are one person. Neither you, nor your friends, nor people on X are representative of software engineers generally. Preventing these sorts of low-powered, highly-biased generalizations is why we have the sampling norms that we do in the empirical sciences.
I expect some of you will want to say something like: “well, even if the specific times are off, the METR graph still shows model capabilities doubling at a frightening rate.” No. Do not try this. If the specific times are wrong, so is their distribution. That is to say: were we to collect baselines from a much larger, more representative sample, some tasks might fall into different completion-time buckets. For instance, tasks METR found took 2-3 hours might turn out, in this larger sample, to take 1-1.5 hrs. But, unless all completion times shift downward by the exact same proportion, this would compress the overall distribution of task completion times. That would in turn change the rate at which A.I. models moved up METR’s y-axis. We have no way of ruling out this or similar possibilities, meaning that—given METR’s tiny sample—we cannot use their results to make general inferences as to how much better models are getting, or how fast. This is why I call the benchmark (nearly) useless: it is being sold as licensing inferences about A.I. models’ overall capability growth, and yet that is precisely what its meager sample means it cannot do.
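A toy calculation, with entirely invented numbers, shows how re-baselining could change the headline doubling time:

```python
import math

# Invented figures: two models released 12 months apart, with measured
# 50% time horizons of 60 and 240 minutes (a 6-month doubling time).
months_apart = 12
h_old, h_new = 60.0, 240.0

def doubling_time(h0, h1, months):
    """Months per doubling, given two time horizons measured `months` apart."""
    return months / math.log2(h1 / h0)

print(doubling_time(h_old, h_new, months_apart))  # 6.0

# Suppose a larger, better sample halved the human completion times of the
# longer tasks only, pulling the newer model's horizon down to 120 minutes
# while leaving the older model's horizon untouched:
print(doubling_time(h_old, h_new / 2, months_apart))  # 12.0
```

Nothing privileges halving, of course; the point is only that the measured growth rate is hostage to the measured task durations.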
Remarkably, its poor sampling is only one of several comparably serious problems with METR’s baselining methods. For instance, consider the likely effects of the HCAST team’s anti-cheating protocols and incentive scheme. Remember: baseliners were asked to verbalize their thoughts aloud while working—if you are a software engineer, consider whether doing this would interrupt or complicate your normal workflow. Much more importantly, though, they were literally paid by the hour—the longer they took to complete tasks, the more money they made.
Think for a second about how bizarre this is. If you are measuring the duration of a certain set of behaviors in a sample—and you want that measurement to have some external validity—the worst thing you can do is to pay your participants an hourly rate! It would be one thing if the target population were paid hourly for those behaviors ‘in the wild,’ but the vast majority of full-time software engineers are salaried.
I haven’t forgotten, to be clear, that the HCAST team also paid baseliners a bonus if they completed a task faster than all of their peers. This doesn’t dissolve my concern. Since the prospect of receiving this bonus was both uncertain and unlikely (baseliners wouldn’t have known how many other baseliners they were competing with), a hypothetical baseliner looking to maximize their returns is arguably going to go slower, not faster. Going faster would mean lowering one’s predictable hourly return in pursuit of an unlikely and uncertain bonus. Not only that, but the average time spent per task was used as a multiplier for both bonuses, further magnifying the incentive to slow down. As a result, it at the very least seems likely that the HCAST team’s incentive system increased completion times relative to other possible incentive designs.
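A back-of-the-envelope expected-value sketch makes the worry concrete. The hourly rate, average time, and win probability below are my assumptions, not METR's data:

```python
# Assumed figures for illustration only.
RATE = 75.0      # hourly rate in dollars (METR paid $50-$100/hr)
AVG_TIME = 3.0   # average completion time across baseliners, in hours

def expected_payout(own_hours, p_fastest):
    hourly_pay = RATE * own_hours                # paid per hour worked
    success_bonus = 0.5 * RATE * AVG_TIME        # for completing the task
    fastest_bonus = p_fastest * RATE * AVG_TIME  # for being the fastest
    return hourly_pay + success_bonus + fastest_bonus

# Going fast: finish in 2 hours with (generously) a 50% shot at the
# fastest-completion bonus.
fast = expected_payout(2.0, 0.5)   # 375.0
# Going slow: take 4 hours, forfeiting the fastest bonus entirely.
slow = expected_payout(4.0, 0.0)   # 412.5
print(fast, slow)
```

Even granting a coin-flip chance at the fastest bonus, the slower strategy pays more in expectation under these assumptions; a slower baseliner also nudges the average time upward, raising everyone's bonus multiplier.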
Continue digging through the HCAST and METR Long Tasks preprints, though, and the mere prospect of inflated baselines becomes a certainty. Given the course of my critique so far, you may be surprised to find that there is an entire section in the HCAST preprint titled “challenges with interpreting human baseline results.” Fascinatingly, not a single one of my criticisms so far makes an appearance! Here are a few important concerns that do, though (to the authors’ credit):
One fundamental limitation of our baselining process is that we measure the time tasks take for humans who have little to no outside context or prior experience on the tasks (though our tasks are also designed to require little additional context). We do this in part to avoid learning effects and in part due to practical challenges with acquiring sufficiently many high-quality human baseliners. This likely increases the amount of time baseliners spend on getting up to speed on the tasks (such as familiarizing themselves with the relevant tooling, reading documentation, and carefully understanding the instructions), compared to tasks people complete in their real work; this in turn may bias our time estimates for many of our tasks upwards …
Although we attempt to assign humans tasks aligned with their expertise, accurately matching baseliners to suitable tasks proves challenging and is occasionally imperfect. As a result, baseliners sometimes encounter tasks only loosely aligned with their specific expertise, potentially reducing their observed performance relative to typical professional work scenarios. In the real world, people mostly attempt tasks they and their employer have very high confidence they will be successful at. We also preferentially assigned tasks (particularly harder tasks) to the baseliners who seemed likely to perform best (from qualitative impressions of their qualifications and performance). This helps us obtain more successful baselines, but may have a systematic effect on the time estimates for easier versus harder tasks. (p. 14, emphasis my own)
So, on top of everything else, baselining engineers were regularly assigned tasks they had little prior experience with, in contradistinction to the sorts of tasks they would encounter in a typical work setting. To get a sense of how this might have impacted results, consider this pair of remarkable admissions from appendix B of the METR Long Tasks preprint:
We manually watched and annotated the recordings of 50 baselines to get a sense of how our baseliners spent their time. Based on this small survey of baselines:
1. On short (5-15 minute) tasks, a substantial fraction of the time (between 25–75%) is spent reading instructions or understanding how to submit correctly. Particularly for short tasks, our time estimates may be hard to interpret as there is some fixed overhead for reading the instructions, independent of the difficulty of actually performing the task per se.
2. Our tasks are designed to require relatively low amounts of context but baseliners still spend over 25% of their time reminding themselves of relevant information (e.g. how to do port forwarding). (p. 29)
This is strong evidence that METR’s human baselines are inflated. And it is not even the only piece of evidence to that effect in this appendix. METR also quietly tested frontier models on issues from an internal repository, and compared their solution-times to those of baseliners and—critically—repository maintainers. This data is informative because it gives us a rough sense of how task completion times might differ between baseliners who were less versus more familiar with their assigned tasks. Arguably, it makes more sense to assess A.I. models against the latter, since this would be more representative of work done by professional software engineers. In the words of the HCAST team, “in the real world, people mostly attempt tasks they and their employer have very high confidence they will be successful at.” With all this in mind, the following table is another very bad sign for the Long Tasks benchmark’s external validity:
“In practice, repo maintainers were 5-18x faster” than baseliners. At this point, we can essentially be certain that METR’s baselines are inflated, possibly by quite a lot. That would in turn entail that the model capabilities they report are inflated, since they are being evaluated by comparison to these baselines. It may be, for instance, that Opus 4.5 can (sometimes) complete a task that would take a typical HCAST baseliner five hours, but if that task would take an engineer with more relevant expertise only one hour, reporting the five hour figure as how long it takes “humans” to complete the task will mislead us as to Opus 4.5’s capacity to substitute for software engineering labor.
The human baselines used for METR’s Long Tasks benchmark are awful, end of story. They are based on an extremely small, non-representative sample. They are confounded by a bizarre incentive scheme in which participants were paid according to how much they increased the key value under study (i.e. task completion time). And to boot, baseliners had far less task-relevant expertise than would professional software engineers completing similar tasks at work. There is far too much noise for any genuine signal to break through here.
4. Conclusion
Despite the litany of serious issues with the METR graph, I imagine some of you will still feel I am being harsh. “But human baselines are very complex and expensive to collect,” you may say; “it was admirable of METR to review so many limitations of their work, and to conduct deep analysis impugning the realism of their very own task suite.”
All of this is true. But I have a very hard time seeing posts like the following, from METR’s initial X thread describing the Long Tasks benchmark, and feeling much sympathy for them:
Considering my criticisms above, this phrasing is ridiculous. It is an uncontroversial example of misleading science communication (and no, nothing in the rest of the thread makes it better). This, from the more detailed blog post on their website, is barely an improvement:
Summary: We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks.
METR is in no way measuring the “length of tasks AI agents can complete.” It is measuring whether A.I. can occasionally complete 97 highly contrived software engineering tasks whose “lengths” are spuriously determined. Nor does “extrapolating this trend” predict anything that can be understood in terms of what “humans,” as such, can do. You do not get to generalize to the entire species from a few of your LinkedIn connections. You especially do not get to do that if you claim to meaningfully measure the time it takes them to do something while also paying them by the hour.
This is all very bad, no matter how charitable one is inclined to be. It is not the sort of intellectual impropriety that deserves to be handled with kid gloves.
The A.I. boom has inspired a wealth of careful research by well-meaning people who are sincerely trying to understand the world-shaking changes happening under our feet. It has also inspired troves of lazy and dishonest grifters trying to ride the hype-wave to fame and fortune. It would be unfair to accuse the authors of the METR graph of being part of the second group. I remain sincerely uncertain, though, whether they are part of the first.

You're far too harsh:
1. On a fixed budget, I think a higher number of tasks, even with a lower number of participant-completions, was the right move. Yes, they've got very few, but 1 would be non-zero informative and 3 is in fact something. And I'd much rather 3 people doing 100 tasks than 100 people doing 3 tasks.
2. 50% reliability seems reasonable if your plan is "set one of my 5-8 simultaneously running agents off to do something, check back with it in a few hours". And that seems to be the workflow of many AI users I know. So long as the task is checkable...
3. One of the benefits of recruiting from people in your social network, rather than randoms you're paying, is that they're not going to deliberately slow their work, and you have some evidence of competence.
If anyone is interested I did a critical analysis of the METR paper from a statistics point of view a while ago: https://aichats.substack.com/p/are-ai-time-horizon-doubling-every?r=4tn68o