The Hole I Thought I'd Fixed in Review Was Actually Somewhere Else
obsidian-import (my homegrown tool that turns external videos and articles into Obsidian notes) had video transcription locked to YouTube only. Every time I fed it a TikTok or Instagram video, it got processed as an article instead, and the friction finally added up enough that I decided to fix it.
Digging in, the YouTube restriction turned out to be a stand-in for a different goal: avoiding Whisper’s heavy load. Both subtitle fetching and description fetching were already site-agnostic yt-dlp calls under the hood. The YouTube check was a single regex at the entry point — replace it with the real condition (a duration cap) and the whole thing could be safely extended.
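A minimal sketch of what swapping the site check for the real condition could look like. The name `should_transcribe`, the constant `MAX_DURATION_SECONDS`, and the 30-minute value are hypothetical (the actual cap is not stated here), and the metadata dict is assumed to be yt-dlp's JSON output, where `duration` is in seconds and may be missing.

```python
from typing import Any, Mapping

# Hypothetical cap; the real value used by the tool is not given in this post.
MAX_DURATION_SECONDS = 30 * 60


def should_transcribe(info: Mapping[str, Any], cap: float = MAX_DURATION_SECONDS) -> bool:
    """Gate transcription on duration rather than on which site the URL is from."""
    duration = info.get("duration")
    # bool is a subclass of int, so reject it explicitly along with non-numbers.
    if isinstance(duration, bool) or not isinstance(duration, (int, float)):
        # Assumption: an unknown duration is treated as "too risky to transcribe".
        return False
    return 0 < duration <= cap
```

Treating a missing duration as a refusal is a design choice in this sketch: it keeps the expensive Whisper path closed unless the input is known to be short.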
That much is just tracking down a root cause and fixing it. What I want to write about is what came after: I had Claude Fable 5 write the spec, then switched to Claude Sonnet 5 for the implementation. This isn’t a story about “using two models” — the interesting part is why I drew the line where I did.
Writing a spec and writing code are different jobs
In a piece I wrote earlier, I said AI-written code is bottlenecked by review. What happened this time is the same thing showing up one level earlier — at the spec, before any code exists.
Writing the spec meant working through a pile of judgment calls: what breaks if I lift the YouTube restriction, what happens if a text-only post gets misclassified as a video, whether the URLs inside non-YouTube playlist entries can be trusted. All of these needed deciding before implementation, or they’d turn into rework afterward. I wanted that thinking done carefully, so I handed it to Fable 5.
Implementation, once the spec was settled, is a different kind of work — mostly turning a fixed set of requirements into code against a fixed test plan. That went to Sonnet 5.
Splitting judgment-heavy work from execution-heavy work and routing each to a different model isn’t unusual. What was interesting this time is that both stages turned out to have the same kind of hole in them.
Fable’s spec cracked on the first review
I didn’t send Fable 5’s finished spec straight to implementation. Right after it was written, I ran an adversarial review against it with a subagent. Three Highs came back.
The first was a missing SSRF guard on entry URLs. When enumerating a non-YouTube playlist, the design fed the URL straight out of yt-dlp’s JSON output into the next stage. That’s external data. Under the old YouTube-only design this was structurally safe — URLs were rebuilt from a verified video ID — but the moment it opened up to other sites, an externally-controlled string was flowing straight through.
The second was loose host matching. The check for whether a host was youtube.com used something like a substring match. Written that way, a spoofed host like youtube.com.evil.example would also match, letting input that should never pass ride the fast path where the guard gets skipped.
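To make the spoof concrete, here is a sketch contrasting a substring check with a boundary-aware one. Both function names are hypothetical, and the original check is not shown in this post, so `loose_match` only approximates the kind of bug described.

```python
from urllib.parse import urlsplit


def loose_match(url: str, domain: str) -> bool:
    """The risky pattern: the domain appearing anywhere in the URL counts as a match."""
    return domain in url


def host_matches(url: str, domain: str) -> bool:
    """Match only the exact host or a subdomain separated by a dot boundary."""
    try:
        host = urlsplit(url).hostname  # lowercased, with userinfo and port stripped
    except ValueError:
        return False
    if not host:
        return False
    host = host.rstrip(".")  # tolerate a trailing root dot
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)
```

With these definitions, `loose_match("https://youtube.com.evil.example/watch", "youtube.com")` is true while `host_matches` rejects the same URL, and `www.youtube.com` still passes.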
The third was a regression for “it’s a video site, but there’s no video.” Extractors for X, Instagram, and TikTok claim a whole domain. A text-only post, a photo post, or a profile page would still get classified as “this is a video site.” Fail on that naively, and every one of those URLs — previously handled fine as articles — starts erroring out.
All three showed up before a single line of implementation existed. A missing SSRF guard can still be caught later, in a post-implementation review, but by then fixing it means rethinking the shape of the spec itself. Catching it at the spec stage meant zero rework once implementation was underway.
I folded all three fixes back into the spec: entry URLs always pass a safety check before use; host matching switches from substring matching to exact match or dot-boundary suffix matching; and when video-site classification fails to find an actual video, it returns a dedicated signal that tells the caller to fall back to article processing.
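For the first of those fixes, a safety check on entry URLs might look roughly like the sketch below. The function name `is_safe_entry_url` and the injectable `resolve` parameter are hypothetical, and the specific policy (http/https only, every resolved address must be globally routable) is an assumption rather than the tool's actual rule.

```python
import ipaddress
import socket
from typing import Callable, Iterable, List
from urllib.parse import urlsplit

Resolver = Callable[[str], Iterable[str]]


def _system_resolver(host: str) -> List[str]:
    """Resolve a hostname to IP address strings using the OS resolver."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def is_safe_entry_url(url: str, resolve: Resolver = _system_resolver) -> bool:
    """Accept only http(s) URLs whose host resolves solely to public addresses."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        return False
    if parts.scheme not in ("http", "https") or not host:
        return False
    try:
        addresses = list(resolve(host))
    except OSError:
        return False  # unresolvable hosts are rejected, not retried
    if not addresses:
        return False
    for addr in addresses:
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            return False
        # is_global is False for loopback, private, link-local and similar ranges.
        if not ip.is_global:
            return False
    return True
```

A check like this validates the address at check time only. If the later fetch resolves the name again, the answer can differ, so this is a first line of defense rather than a complete SSRF fix.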
Another review pass after implementation
With the spec settled, I handed the implementation to Sonnet 5. Five requirements, 189 tests, all green.
Again, passing tests wasn’t enough to justify pushing. I ran another security review with a subagent over the full diff, covering SSRF, command injection, path traversal, prompt injection, resource exhaustion, and guard bypasses. No Critical, High, or Medium findings. One Low did turn up: one of the yt-dlp subprocess calls had no timeout set, meaning a non-responding host could hang the process. I fixed it before pushing.
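The shape of that Low finding is easy to show in isolation. This is a sketch, not the tool's code: `run_with_timeout` is a hypothetical helper, and the Python interpreter stands in for yt-dlp so the example runs without it installed.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_with_timeout(cmd: Sequence[str], timeout_s: float) -> Optional[str]:
    """Run cmd and return its stdout, or None on timeout or non-zero exit."""
    try:
        result = subprocess.run(
            list(cmd),
            capture_output=True,
            text=True,
            timeout=timeout_s,  # without this, run() waits indefinitely
            check=False,
        )
    except subprocess.TimeoutExpired:
        # run() kills the child process before raising, so nothing is left hanging.
        return None
    if result.returncode != 0:
        return None
    return result.stdout


# A child that sleeps far longer than the timeout simulates a non-responding host.
stalled = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_with_timeout(stalled, timeout_s=0.5))
```

The final call prints `None` after about half a second instead of blocking for the full five.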
Three Highs at the spec review, one Low at the implementation review. Splitting the work into stages meant each review caught the holes visible only at that stage — spec holes before code exists, implementation holes only after it does.
Where the review looked, and where it actually broke
Up to this point, this is the same discipline as the piece about hardening my content-ingestion tool through adversarial review — review the spec, review the implementation, only then move forward. This time there was one more thing that review alone couldn’t have told me.
As the fix for “it’s a video site with no video,” the spec built a dedicated fallback path: feed in an X post URL, its extractor gets classified as a video site, actually trying to enumerate the video fails, a dedicated signal comes back, and the caller switches to article processing. On paper, this was the mechanism that was supposed to correctly handle text-only X posts.
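The flow in that paragraph can be sketched as follows. Everything named here is hypothetical: the post only says "a dedicated signal", so modeling it as an exception called `NoVideoFound` is an assumption, as are `route`, `Routed`, and the stub enumerator.

```python
from dataclasses import dataclass
from typing import Callable, List


class NoVideoFound(Exception):
    """Dedicated signal: a known video site, but this URL carries no video."""


@dataclass
class Routed:
    kind: str  # "video" or "article"
    url: str


def route(
    url: str,
    is_video_site: Callable[[str], bool],
    enumerate_videos: Callable[[str], List[str]],
) -> Routed:
    """Send a URL to video or article processing, falling back on NoVideoFound."""
    if not is_video_site(url):
        return Routed("article", url)
    try:
        enumerate_videos(url)
    except NoVideoFound:
        # Only this specific signal degrades to article processing;
        # any other exception still propagates as a real error.
        return Routed("article", url)
    return Routed("video", url)


def fake_enumerate_text_post(url: str) -> List[str]:
    """Stub enumerator that behaves like a text-only post on a video site."""
    raise NoVideoFound(url)
```

Using a dedicated signal rather than a broad `except` is the point of the design: a text-only post falls back quietly, while a network failure or a bug in enumeration stays visible.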
After implementation was done, I checked it against real URLs. TikTok and Vimeo videos got classified as videos, as designed. An article URL didn’t get classified as a video, also as designed. So far, so expected.
A text-only X post (x.com/jack/status/20, one of the earliest posts on the platform) didn’t behave the way I expected. Before it ever reached the fallback path I’d built, yt-dlp itself failed with “no video could be found” and fell straight through to article processing. The safety net the spec had spent real design effort on never got exercised for this particular URL. The problem had already been resolved a step earlier than the review had assumed.
One more thing turned up while testing a non-YouTube playlist (a TikTok user page): the returned JSON sometimes had an empty extractor field with the value showing up in ie_key instead, as "TikTok". The spec had already noted, as a hedge, “the field might be missing, so use a format that checks both” — but there was no way to know in advance whether that would actually happen. Only running it for real confirmed that hedge was necessary, not just cautious.
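A check that reads both fields can be as small as the sketch below. The helper name `entry_extractor` is hypothetical, and lowercasing the result is an assumption made so that values from either field compare the same way.

```python
from typing import Any, Mapping, Optional


def entry_extractor(entry: Mapping[str, Any]) -> Optional[str]:
    """Return the extractor name from a playlist entry, checking both fields."""
    for key in ("extractor", "ie_key"):
        value = entry.get(key)
        # Skip missing, null, non-string, and empty or whitespace-only values.
        if isinstance(value, str) and value.strip():
            return value.strip().lower()
    return None
```

For the TikTok user page case described above, `entry_extractor({"extractor": "", "ie_key": "TikTok"})` yields `"tiktok"` rather than an empty string.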
A desk review can reason precisely about how to handle a failure once you know it can happen. It can’t tell you which layer that failure actually occurs at. That a carefully designed safety net turned out to have no work to do for this URL would never have surfaced from spec review alone.
Summary
| Stage | Owner | What it found or confirmed |
|---|---|---|
| Spec writing | Fable 5 | Replacing the YouTube restriction with the real condition (duration cap) |
| Spec review | Subagent | 3 Highs (missing SSRF guard, loose host matching, functional regression) |
| Implementation | Sonnet 5 | 5 requirements, 189 tests |
| Implementation review | Subagent | 1 Low (missing subprocess timeout) |
| Real-URL verification | Me | yt-dlp resolved the problem earlier than the designed safety net |
Closing
The thesis from the piece about hardening my content-ingestion tool through adversarial review — that AI-written code is bottlenecked by review — held up one level earlier, at the spec stage, before any code existed.
But no matter how carefully you review something on paper, there are things you only find by actually running it. The place I’d built a safety net and the place the problem actually got resolved didn’t match. That’s not a flaw in the spec. It’s a gap that neither spec review nor implementation review can close on its own, which is why I still don’t skip checking against real URLs as the final step after review.