Leonard Shelby can’t form new memories. In Memento, after the attack, every fifteen minutes or so his short-term recall resets. He has a problem most engineers would recognize: a fast working scratch space and no durable storage. So he builds a system. Polaroids with handwritten captions on the back. A wall of pinned notes. Tattoos for the load-bearing facts — the ones he can’t afford to lose, can’t afford to misfile, can’t afford to look up wrong.
What he doesn’t do is try to remember everything. He doesn’t carry a thicker notebook. He builds infrastructure to find what matters, fast, when he needs it. The pocket holds what’s relevant right now. The system decides what makes it into the pocket.
I’ve been thinking about Leonard a lot this year, because the prevailing argument in my corner of the internet is that LLMs no longer need him. The pocket got bigger. Frontier models can hold a million tokens of context. Token prices fell. So just stuff the pocket. Skip the indexing. Skip the tattoos. Read everything.
I think that’s the wrong read of where we’re going.
The real argument, fairly stated
The “RAG is dead” position isn’t crazy. It goes roughly like this: a handful of frontier models — Gemini 1.5, Claude’s extended context tiers — can now handle contexts approaching 1M tokens without choking. Token prices have fallen to around $0.60 per million for several frontier models. Standing up a real retrieval pipeline — chunking strategy, embedding model, vector store, hybrid scoring, reranker, eval harness — is 40 to 80 engineering hours, easily. For a small, static corpus you query a few thousand times a month, just dumping the whole thing into context with prompt caching is simpler and arguably cheaper.
Claude Code is the existence proof people point at. It doesn’t ship a vector database. It leans on ripgrep and file globs and reads files straight into context. And it works pretty well. So why bother with the rest?
I’ll concede the envelope. For a 300K-token corpus, single tenant, low query volume, mostly static — yeah, stuff it. Cache it. Read it. Don’t build a search stack to feel sophisticated. That’s a real position and I don’t want to strawman it.
But that envelope isn’t where most of the work lives. And the loud version of the argument — that grep replaces search, that long context replaces retrieval — confuses two different things: the tool in your hand and the problem it’s pointed at. The cracks show as soon as you push on it.
The pocket problem
Here’s the first crack. “Lost in the middle.” Liu et al., 2023 (published in TACL in 2024): when you stuff long contexts, model performance forms a U-shape. Strong recall at the beginning, strong recall at the end, degraded recall in the middle. The 2026-gen frontier models have improved on this — early evals suggest the curve is flatter — but it’s not gone.
So the first thing the bigger pocket buys you is the ability to put a lot of important stuff exactly where the model is most likely to underweight it. You can’t reorder a 700K-token dump by relevance unless you’ve already done retrieval. Which means the question doesn’t go away — it just hides.
Second crack: cost. RAG-style queries — retrieve relevant chunks, feed 8–16K tokens of context — run around $0.005–$0.008 per request at $0.60/million input tokens, when you count the full prompt. A naive full-context pass against a 500K-token corpus runs $0.30 in input tokens alone, before output. That’s a 40–60x ratio at list price, and it widens if you have high cache-miss rates or large output windows. At 50K daily queries — not enormous, this is a mid-sized internal tool — you’re choosing between a few hundred dollars a day and tens of thousands. That’s not a rounding error you absorb. That’s a re-platforming decision that wakes someone up.
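The arithmetic is easy to reproduce. Here’s the same back-of-envelope as a few lines of Python, using the list price and the round-number volumes from the paragraph above (the per-request token counts are illustrative, not measurements):

```python
# Back-of-envelope cost comparison at list price. Round numbers, not measurements.
price_per_token = 0.60 / 1_000_000        # $0.60 per million input tokens

rag_prompt_tokens = 10_000                 # retrieved chunks plus prompt, in the 8-16K range
full_context_tokens = 500_000              # naive full-corpus pass, input tokens only
queries_per_day = 50_000                   # a mid-sized internal tool

rag_cost = rag_prompt_tokens * price_per_token       # ~$0.006 per request
full_cost = full_context_tokens * price_per_token    # $0.30 per request

print(f"per request: ${rag_cost:.4f} vs ${full_cost:.2f}  ({full_cost / rag_cost:.0f}x)")
print(f"per day:     ${rag_cost * queries_per_day:,.0f} vs ${full_cost * queries_per_day:,.0f}")
```

Prompt caching and output tokens aren’t modeled here; they move the numbers, but not the order of magnitude.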
Third crack, and this is the one that actually makes me grumpy. Grep is not search. Grep is lexical pattern matching. Searching for auth_token finds the string auth_token. It finds it in commented-out dead code from 2023. It finds it in test fixtures with hardcoded sample values. It finds it in a vendor library you don’t even own. It misses the function called refresh_session_credential that does what you’re actually looking for, because the words are different. Vector search finds that function. Grep doesn’t. They’re not the same tool. Pretending they are is how you ship the wrong fix at 2 AM.
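To make the gap concrete, here’s a toy comparison using the all-MiniLM-L6-v2 sentence-transformer and a made-up snippet (mine, not from any real codebase). The substring test is what grep gives you; the embedding is what surfaces the renamed function:

```python
from sentence_transformers import SentenceTransformer, util

query = "where do we refresh the expired auth token?"
renamed_fn = "def refresh_session_credential(user): renew the caller's expired credential"

# Grep-style lexical match: no shared token, so the function is invisible.
print("auth_token" in renamed_fn)   # False

# Dense match: the embedding still places the function near the query, because
# "refresh ... credential" and "refresh ... auth token" mean roughly the same thing.
model = SentenceTransformer("all-MiniLM-L6-v2")
score = util.cos_sim(model.encode(query, convert_to_tensor=True),
                     model.encode(renamed_fn, convert_to_tensor=True))
print(float(score))  # expect a clearly non-trivial similarity; exact value depends on the model
```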
When people say “Claude Code uses grep, so search is over,” they’re describing a tool optimized for a bounded, file-system-structured corpus where the author controls the index shape. That’s a legitimate architecture for that problem. It does not generalize to “ten million product images” or “a 30-year codebase across 800 repos” or “tenant-scoped documents under per-row access control.” Long context can’t replace a knowledge graph, either. You can serialize a graph into text — flatten the edges, encode the relationships inline — but the moment you do, you’ve lost the typed edges and efficient query-time traversal. You can’t join across it. You can’t reason about edge types. You’ve discarded exactly the structure that made the graph useful.
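The graph half of that claim is easy to see in miniature. A toy sketch with networkx and invented node names: the question “what does billing.refund actually call?” is a one-line traversal over typed edges, which is exactly the query shape a flattened text dump can’t support.

```python
import networkx as nx

# Toy code graph with typed edges -- the structure a text serialization throws away.
g = nx.MultiDiGraph()
g.add_edge("billing.refund", "payments.capture", type="calls")
g.add_edge("billing.refund", "PaymentIntent", type="uses_type")
g.add_edge("payments.capture", "PaymentIntent", type="returns")

# Query-time traversal over one edge type: exact, cheap, and joinable with other queries.
callees = [v for _, v, d in g.out_edges("billing.refund", data=True)
           if d["type"] == "calls"]
print(callees)  # ['payments.capture']
```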
With me so far? The argument isn’t “long context is bad.” Long context is great. The argument is that long context is a consumer of retrieval, not a replacement for it.
What good retrieval actually looks like
A modern retrieval stack, the kind that holds up under real query volume, isn’t one thing. It’s a layered pipeline.
Lexical first — BM25, TF-IDF, the unfashionable stuff. It’s still excellent at exact tokens, identifiers, error messages, version numbers. The things vector embeddings smudge.
Dense second — vector search over embeddings. all-MiniLM-L6-v2 is the workhorse at 384 dimensions. Catches semantic neighbors, paraphrases, the function-named-different-but-doing-the-same-thing case.
Fuse them — Reciprocal Rank Fusion at k=60 is the standard, and it’s standard because it works (a minimal sketch follows this list). You don’t pick BM25 or vector. You merge their rank lists.
Rerank — a cross-encoder over the top-k from RRF. Slower per-pair, but you only run it on the candidates that made it through the cheap layers.
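The fusion step is small enough to show whole. Here’s a minimal sketch of Reciprocal Rank Fusion with k=60 and made-up document IDs; each document’s fused score is the sum of 1/(k + rank) across the rankers that returned it:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankers of 1 / (k + rank(d))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_7", "doc_2", "doc_9"]    # lexical ranking, best first
dense_hits = ["doc_2", "doc_4", "doc_7"]   # vector ranking, best first
print(rrf_fuse([bm25_hits, dense_hits]))   # ['doc_2', 'doc_7', 'doc_4', 'doc_9']
```

The documents both rankers agree on float to the top, with no score normalization required, which is most of why RRF is the default choice.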
The numbers from a 2026 benchmark pass I was reading — “From BM25 to Corrective RAG,” April 2026 — make the layered case cleanly:
Dense vectors alone: Recall@5 of 0.587
BM25 alone: 0.644
Hybrid via RRF: 0.695
Hybrid + neural reranker: 0.816
That last number is 39% better than dense-only. It’s also 17% better than RRF without the reranker. The layers compound. In systems that need both recall and precision at scale, you usually want all four layers. And — this part matters — none of them are replaced by a bigger context window. You still have to pick what goes into the pocket.
There’s a piece on top of this, too: intent. A query like “where is process_payment defined” is not the same shape as “how does the refund flow handle partial captures.” The first is a definition lookup; BM25 should dominate, you want exact identifier matching, you don’t want semantic neighbors fuzzing the result. The second is conceptual; dense embeddings should dominate, you want the chunk that explains the flow even if the words don’t match. Treating those queries identically is how you get plausible-but-wrong answers.
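Here’s a hedged sketch of what that routing looks like in practice: a rule-based classifier with per-intent weights between the lexical and dense scores. The intent names, patterns, and weights are illustrative, not the tuned values from the projects described below.

```python
import re

# Illustrative alpha/beta weights: (lexical, dense). Not tuned values.
INTENT_WEIGHTS = {
    "definition": (0.8, 0.2),   # exact identifier matching should dominate
    "conceptual": (0.3, 0.7),   # semantic neighbors should dominate
}

def classify(query: str) -> str:
    """Crude rule-based intent router; a real one would be a small model."""
    looks_like_lookup = re.search(r"\b(where|def|defined|definition)\b", query)
    has_identifier = re.search(r"\w+_\w+", query)
    return "definition" if looks_like_lookup and has_identifier else "conceptual"

def hybrid_score(lexical: float, dense: float, intent: str) -> float:
    alpha, beta = INTENT_WEIGHTS[intent]
    return alpha * lexical + beta * dense

print(classify("where is process_payment defined"))                  # definition
print(classify("how does the refund flow handle partial captures"))  # conceptual
```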
The EMNLP 2024 “Self-Route” paper made a related point I keep coming back to: let the model itself decide whether a query needs retrieval or full-context, based on complexity. They got better accuracy AND lower compute cost. The two approaches aren’t enemies. They’re complements with different cost/quality envelopes, and a working system uses both.
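The routing itself is simple enough to sketch. This is my paraphrase of the idea, not the paper’s code; call_llm is a hypothetical stand-in for whatever client you use, and the sentinel string is arbitrary.

```python
ROUTER_PROMPT = (
    "Answer the question using only the passages below. "
    "If the passages are insufficient, reply with exactly UNANSWERABLE.\n\n"
    "Question: {question}\n\nPassages:\n{passages}"
)

def answer(question, retrieved_chunks, full_corpus, call_llm):
    # Cheap path first: small retrieved context, most queries should stop here.
    draft = call_llm(ROUTER_PROMPT.format(question=question,
                                          passages="\n\n".join(retrieved_chunks)))
    if draft.strip() != "UNANSWERABLE":
        return draft
    # Fallback: pay for the long-context pass only when retrieval wasn't enough.
    return call_llm(f"{full_corpus}\n\nQuestion: {question}")
```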
(A note on scope: I’ve been using “search” and “memory retrieval” somewhat interchangeably here, which isn’t quite right. They’re the same pipeline mechanics — retrieve, rank, inject — but different problems. Memory retrieval has a temporal model search doesn’t: recency matters, decay matters, and the source is agent history rather than a document corpus. Memory is also primarily graph-based, because what you’re trying to recover is relationships and context across time, not just chunks of text. For the purposes of this argument — context window vs. retrieval discipline — the distinction doesn’t change anything. But it’s worth naming.)
What I’m building
I have two projects in this space. I’ll be specific about what each one is and isn’t, because the field is full of demos that don’t survive contact with real corpora.
mcp-vector-search is the older one. Python, per-project, designed to live next to a codebase as an MCP server. The retrieval core is hybrid: BM25 plus HNSW over MiniLM embeddings, with knowledge-graph expansion built from tree-sitter AST parsing — function definitions, call edges, import edges, type relationships. Fifteen-plus edge types in the graph. Cross-encoder reranking on top of RRF. MMR for diversity so you don’t get five near-duplicates in the top results. Temporal decay weighted by git blame age. Cyclomatic complexity factored into ranking, because dense, gnarly code is more often what you’re looking for than the trivial wrappers around it.
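One of those pieces, the MMR pass, is worth showing because it’s short and it’s the step people most often skip. This is a generic sketch of the technique, not mcp-vector-search’s actual code:

```python
import numpy as np

def mmr(query_vec, doc_vecs, top_k=5, lam=0.7):
    """Maximal Marginal Relevance: pick results that are relevant to the query
    but not redundant with the results already picked."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < top_k:
        best = max(
            candidates,
            key=lambda i: lam * cos(query_vec, doc_vecs[i])
            - (1 - lam) * max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0),
        )
        selected.append(best)
        candidates.remove(best)
    return selected  # indices into doc_vecs, most useful first
```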
trusty-search is what I’m building toward. Rust daemon, machine-wide rather than per-project, multi-tenant across all my work. Same retrieval core — RRF at k=60, MiniLM embeddings, BM25 plus HNSW — but with intent classification on the front. Queries get tagged Definition, Usage, Conceptual, or BugDebt, and each intent has pre-tuned alpha/beta weights between lexical and dense. Sub-10ms p50 on warm queries.
I’ll be honest: trusty-search is early-stage. Some of the features I just listed are scaffolding with stubs underneath. The intent classifier is rule-based at the moment and needs to become a small model. The two tools are complementary rather than competing for now — mcp-vector-search is the deeper analysis layer, trusty-search is the fast facts store you hit constantly during a session — though the long-term plan is for trusty-search to replace mcp-vector-search.
The reason I’m building both is that I keep running into the same wall: I can give an agent a million-token window and it will still ask the wrong question of the wrong file. The bottleneck is not how much I can stuff in. The bottleneck is which 8K of the 800K matters for this query. That’s a search problem. It’s been a search problem for sixty years. It’s still a search problem.
Back to Leonard
The reason Leonard works as an analogy isn’t the amnesia. It’s the discipline. He decides what makes it onto a Polaroid. He decides what gets tattooed and what stays loose. He writes “DON’T BELIEVE HIS LIES” on the back of a photo because he won’t have the context to evaluate trustworthiness later — so he encodes the conclusion now, into a system he’ll find when he needs it.
The context window is the pocket. Whatever Leonard pulls out and looks at right now. It can be enormous. It can be cached. It can be cheap. None of that decides what goes in.
Retrieval is the tattoo on his wrist. The Polaroid pinned to the wall. The system that decides which fact survives, which one is one query away, which one is buried. The job of that system is exactly the job that doesn’t go away when the pocket gets bigger.
The question was never “do I need RAG?” That framing turned a design discipline into a vendor category and made it easy to dismiss. The question is the one Leonard asks every morning: what do I actually need to find, and have I built the thing that finds it? At a million tokens, you’re not freed from that question. You’ve just made it more expensive to answer incorrectly.
Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.
Related reading:
Context Memory and Search: The Secrets to Effective Agentic Work — Why context management, memory, and retrieval are the three pillars of effective agentic systems
What’s In Your Second Brain? — Tooling for the modern CTO: how to build external memory that actually works
AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders