AI's Reliability Gap

9 hrs ago

High highs and low lows do not a machine God make

6 Comments

I've followed your takes on AI development with interest for a while, but this piece is just another level. Easily the best analytical piece within the broad "AI future" field I've read in a long time. This is the kind of healthy AI skepticism we need more of - and there are still many more avenues to explore. Kudos!

Nathan Witkin

That's very kind Jonno, thank you!

Jake

On that NBER working paper, I think it meaningfully overstates the productivity effect. Their sample is weighted towards small open source projects and does not control for quality.

The study below is the best I've been able to find on the impacts of LLM use in large commercial organizations. It's an observational study covering 22,000 developers across 4,000 teams over 2 years, comparing each org's lowest AI adoption quarter to its highest.

https://www.faros.ai/blog/ai-acceleration-whiplash-takeaways

Highlights:

1. 16% PR merge rate improvement

2. -11% release frequency (that's a negative sign)

3. +860% code churn increase (not a typo)

4. +480% feature lead time (also not a typo)

5. +250% incidents per pull request

I analyzed this data here: https://unessays.substack.com/p/talk-is-cheap

Your post here is the mechanism and I think the Faros data is the downstream outcome. And it makes sense. As your codebase becomes mature, changes have to be made such that the product is stable. If you're filtering human work through an unreliable mechanism, you're going to introduce changes that start to subtract value.

I wrote this in Talk is Cheap:

"Here’s the difference I see - in every case where a radical new technology has revolutionized industry - manufacturing machines, plumbing, planes, electricity, computers, the internet, etc - there is an increasing trend of reliability. Each one of those foundational technologies started from a place of unreliability and moved to a place of very high reliability. They became foundational because humans learned to trust them implicitly."

Simon Kinahan

I suspect the answer to AI unreliability is more AI. LLMs mimic humans, and humans are unreliable. How do we manage human unreliability? More humans.

The software industry is (maybe for the first time ever) a model here. Almost always when software fails it is because a human made a mistake. The software industry is built around processes of testing and review that screen errors down to an acceptable (but far from zero) level, and then react to address further errors as they appear. Almost the entire software lifecycle, except the tiny fun part at the start, is about addressing human cognitive failure.

Within that process there are multiple different human roles, but each of those roles can be formalized as encoding informal language into code or decoding code into informal language, so each of those roles can, potentially, be played by an LLM. Will they be as good as humans? Definitely not as good as the best humans. But will they be acceptable? Probably. And they're much faster, so to some extent you can replace ability with iteration speed. And if you have different models with only limited shared context writing specs, writing code, writing and running tests, doing hands-on testing, filing bugs and reviewing code, and you do it faster than humans would be able to do it, is the net result as reliable as a human result? I don't see why it shouldn't be.

I have a suspicion, though, that the cost of inference is still too high. Where we stand right now, we could easily double the fully loaded cost of a developer if we let them use as many tokens as they might want to automate their whole workflow.

Jake

2h · Edited

I don't think so, on multiple points.

>I suspect the answer to AI unreliability is more AI.

The data presented in this post argues against this point. Let's take a simple model. Say the LLMs you're working with have a pass@1 of 50%.

If you just take the implementer LLM's output and ship it straight into production, then on a single run you have a 50% chance of getting a working implementation and a 50% chance of shipping something buggy.

Now add a verifier LLM with the same pass@1:

pass implementation, pass verification: 50% × 50% = 25% (good - ship working code)

pass implementation, failed verification: 50% × 50% = 25% (bad - fail)

failed implementation, pass verification: 50% × 50% = 25% (bad - ship bug)

failed implementation, failed verification: 50% × 50% = 25% (good - fail)

If your verifier's check is uncorrelated with the correctness of the implementation, you get no gain in reliability. You've just spent more tokens.

I'd also note that in only one case do you get the right code shipped into production: the top one, so there's a 25% chance of that.

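And if you retry indefinitely on failure, the math converges to exactly your original 50% pass@1: every attempt is as likely to ship a bug as to ship working code, so half of what eventually ships is broken. Infinite retries with an uncorrelated verifier buy you nothing but token spend.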
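For anyone who wants to check the arithmetic, here is a minimal sketch of that toy model. The function names and the 90%/10% "informative verifier" figures are my own illustrative assumptions, not numbers from the post or the Faros data; the only inputs taken from the model above are the 50% pass rates.

```python
from fractions import Fraction


def single_attempt(p_impl, accept_good, accept_bad):
    """Outcome probabilities for one implement-then-verify attempt.

    p_impl      -- chance the implementer produces working code
    accept_good -- chance the verifier passes working code
    accept_bad  -- chance the verifier passes buggy code
    Returns (ship working code, ship a bug, reject and fail).
    """
    ship_good = p_impl * accept_good
    ship_bug = (1 - p_impl) * accept_bad
    reject = 1 - ship_good - ship_bug
    return ship_good, ship_bug, reject


def reliability_with_retries(p_impl, accept_good, accept_bad):
    """Chance that what eventually ships works, retrying on every rejection.

    Summing the geometric series over retries leaves
    ship_good / (ship_good + ship_bug).
    Assumes the verifier accepts at least some of the time.
    """
    ship_good, ship_bug, _ = single_attempt(p_impl, accept_good, accept_bad)
    return ship_good / (ship_good + ship_bug)


half = Fraction(1, 2)

# Uncorrelated verifier: passes 50% of attempts whether or not the code works.
single = single_attempt(half, half, half)
uncorrelated = reliability_with_retries(half, half, half)

# Hypothetical contrast (assumed numbers): a verifier whose verdict tracks
# correctness, passing 90% of working code and 10% of buggy code.
informative = reliability_with_retries(half, Fraction(9, 10), Fraction(1, 10))

print("single attempt (ship good, ship bug, reject):", [str(x) for x in single])
print("uncorrelated verifier, infinite retries:", uncorrelated)
print("informative verifier, infinite retries:", informative)
```

The point of the contrast is only that any gain has to come from the verifier's verdict being correlated with correctness; stacking another 50% coin flip on top changes nothing about what ships.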

>LLMs mimic humans

Not in many important ways. One of the most important is reliability. When you do something regularly, you routinely get more and more reliable at that thing. LLMs do not.

>Almost always when software fails it is because a human made a mistake.

Yes, but the evidence shows that LLMs increase the number of bugs shipped into production: https://www.faros.ai/blog/ai-acceleration-whiplash-takeaways

If your goal is to minimize defects, LLMs work against that goal.

Toiler On the Sea

It can be, and will prove, increasingly useful, but its current unreliability renders the current valuations completely unjustifiable. Chickens will roost.

Arachne
