You’ve seen this problem: you ask an LLM to help with a codebase task, and it wanders. It grabs whatever files look vaguely relevant. Sometimes it finds the right context. Sometimes it hallucinates because it missed the one spec that would have told it what to do. You can’t reproduce the run. You can’t audit what it saw. You can’t explain to anyone why it did what it did.
Proactively Augmented Generation (PAG) is a way of feeding context to an LLM where the harness, not the model, decides what to load and how.
Instead of asking the model to “go fetch what you need” (classic RAG), you:
- Model your domain and artifacts up front (specs, ADRs, docs, code, schemas).
- Use deterministic rules to assemble a context packet from that graph before you call the model.
- Hand that packet to the LLM as its working set.
If LLMs are eager junior devs, PAG is the senior dev sitting down before the meeting and saying:
“Here’s the spec, ADRs, and the two files you actually need to read. Then we talk.”
Not:
“Here’s the whole repo. Go wander the wiki and call me when you think you’re done.”
How PAG differs from RAG
Most setups treat RAG (Retrieval-Augmented Generation) as “let the model decide what to retrieve, using similarity search or tools.”
In typical RAG:
- The model (or a retriever driven by model text) issues queries.
- A vector store or search index returns “relevant” documents.
- Those docs get stitched into the prompt for the next generation step.
- The retrieval path is model-driven and somewhat opaque.
In PAG:
- You already know what the important artifacts are (specs, ADRs, schemas, key docs, code hotspots).
- You have a structured graph of those artifacts (spec tree, directory layout, tags).
- A deterministic packet builder (not the model) walks that graph and picks context:
- following priority rules,
- under strict budgets (tokens, bytes, lines),
- with clear “must-have” vs “nice-to-have”.
The short version:
- RAG: model steers retrieval; context assembled reactively based on model text.
- PAG: harness steers retrieval; context assembled proactively from a known structure.
Both augment generation with external knowledge. PAG just moves the control point.
Scoped to one flow at a time
PAG doesn’t build “the one true context” for an entire codebase. It’s flow-scoped.
A flow is a specific kind of work:
- “Review this spec and propose tests.”
- “Explain the impact of this diff.”
- “Align this service with ADR-012.”
- “Summarize what changed in this module since the last release.”
At runtime, a typical pass looks like this:
- Request routing figures out what kind of work this is:
  - Classifies the intent (“review”, “explain”, “generate tests”, “compare”).
  - Extracts anchors: spec IDs, ADR IDs, file paths, services, domains (“billing”, “checkout”, “auth”).
  - Selects the flow and which analyzers to run.

  This can be NLP-based (a classifier + entity extractor), but it doesn’t have to be. Rule-based routing, pattern matching on file paths, git branch conventions, or explicit user selection all work. The key is that something maps the request to a flow before the packet builder runs (see the routing sketch below).
- Analyzers and tools run for that flow:
  - Quick queries over your existing graph, metrics, or coverage.
  - Direct lookups of specs, ADRs, and docs tied to the anchors.
  - Optionally, slower jobs that can run in the background.
- The flow harness hands PAG a flow library:
  - Upstream specs and ADRs.
  - Relevant docs and READMEs.
  - Code snippets or modules.
  - Analyzer outputs (e.g., “entities in this module”, “callers of this function”).
  - Previous receipts for similar flows.
The packet builder operates entirely inside that flow library. It doesn’t rediscover the world; it turns “here’s everything that might matter for this flow” into “here is the exact, budgeted packet this model will see for this run.”
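The routing step can be as lightweight as a handful of rules. Below is a minimal rule-based sketch, with hypothetical flow names and anchor patterns; a real router might instead be an NLP classifier or explicit user selection.

```rust
// Minimal rule-based router: map a free-text request to a flow and extract
// anchors (spec IDs, ADR IDs, file paths). All flow names and patterns are
// hypothetical; this only shows the shape of the step.

#[derive(Debug)]
enum Flow {
    ReviewSpec,
    GenerateTests,
    ExplainDiff,
    Unknown,
}

#[derive(Debug, Default)]
struct Anchors {
    spec_ids: Vec<String>,
    adr_ids: Vec<String>,
    paths: Vec<String>,
}

fn route(request: &str) -> (Flow, Anchors) {
    let lower = request.to_lowercase();
    // Intent classification by keyword; a real router might use a classifier
    // or explicit user selection instead.
    let flow = if lower.contains("review") {
        Flow::ReviewSpec
    } else if lower.contains("test") {
        Flow::GenerateTests
    } else if lower.contains("diff") || lower.contains("change") {
        Flow::ExplainDiff
    } else {
        Flow::Unknown
    };

    // Anchor extraction by token shape (ADR-012, specs/..., paths).
    let mut anchors = Anchors::default();
    for token in request.split_whitespace() {
        let t = token.trim_matches(|c: char| !(c.is_alphanumeric() || "/-._".contains(c)));
        if t.to_uppercase().starts_with("ADR-") {
            anchors.adr_ids.push(t.to_string());
        } else if t.starts_with("specs/") {
            anchors.spec_ids.push(t.to_string());
        } else if t.contains('/') {
            anchors.paths.push(t.to_string());
        }
    }
    (flow, anchors)
}

fn main() {
    let (flow, anchors) =
        route("Review specs/order-processing.md against ADR-012 and src/orders.rs");
    println!("{flow:?} {anchors:?}");
}
```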
Note that packets can be scoped at different levels:
- Flow-specific packets are built for a particular kind of work (“review spec”, “generate tests”) and carry artifacts relevant to that task.
- Agent-specific packets are built for a particular agent’s role in a multi-agent system, carrying the context that agent needs to do its job.
The same PAG principles apply at both levels: priorities, budgets, deterministic assembly, and receipts. The difference is what you’re scoping to.
This isn’t Copilot auto-context with a fancier name
Most “auto context” systems today (Copilot, Cursor, etc.) are some mix of:
- Local heuristics: open files, nearby code, recently edited files, project-wide symbols.
- Sometimes semantic search: vector search over the repo, “files similar to what you’re typing.”
From your point of view as a user, it’s basically: “we’ll try to find some relevant stuff and jam it into the prompt behind the scenes.”
You rarely get:
- A clear ordering (what’s considered more important than what),
- A persistent, inspectable packet you can diff,
- A notion of “must-have” vs “nice-to-have” context,
- Any explicit budgeting or failure modes (“if this doesn’t fit, fail” vs “just drop something”).
It’s helpful for autocomplete. It’s also opaque and opportunistic.
Concrete example: you’re editing orders.rs and the editor pulls in invoice.rs and customer.rs because they’re nearby in the file tree — even though the change you’re making is governed by a spec in specs/order-processing.md that it never sees. The heuristic grabbed proximity; it missed intent.
PAG is different: this isn’t the model or the editor heuristically grabbing random nearby files. It’s a curated, spec-first library with deterministic rules.
| Dimension | Editor auto-context | PAG |
|---|---|---|
| Who decides what’s relevant | Hidden heuristics + sometimes semantic search | Your harness + spec/ADR/doc graph + explicit priorities |
| When context is assembled | Just-in-time, often per keystroke, often opaque | Pre-call, single deliberate pass, with budgets and failure modes |
| Inspectability | No durable, inspectable “packet” per request | You can look at the packet, hash it, log it, diff it |
| Input library | “The repo and maybe the workspace” | A curated library: specs, summaries, explanations, selectively chosen code/docs |
If you just said “PAG = auto context”, that’d be hand-wavy. As defined here — spec-first, curated library, deterministic packet builder, receipts — it’s a distinct pattern.
Why this matters
RAG works well for broad knowledge bases, semi-unstructured corpora, exploratory question-answering.
It’s less great when:
- You already have a well-structured domain (specs, ADRs, code),
- You care about reproducibility and audit,
- You need per-run guarantees on what the model sees.
PAG tackles three issues in those settings:
Predictability
With RAG, if the embedding changes or the corpus shifts, your retrieval set can “wiggle” in ways that are hard to reason about.
With PAG: same config + same repo → same packet. If behaviour changes, you can diff input artifacts, packet contents, and environment (lockfiles/receipts) — not just the final prompt.
Governance
If you’re running LLMs in an SDLC or governance-heavy context, you need to answer:
- “What exactly did the model see?”
- “Why did it make that decision?”
- “Can you prove we didn’t leak secrets or skip required steps?”
RAG gives you a retrieval intent but not necessarily a clean, auditable packet.
PAG stores packet evidence (what files, which priorities, how they were chosen), writes receipts with hashes and metadata, and fits naturally into lockfiles, CI gates, and audits.
Speed and context efficiency
RAG often means multiple model calls (“what should I search?”, “what next?”), multiple retriever calls, and sometimes redundant or noisy context.
PAG tends to be faster (one packet builder pass over a known tree, one model call) and more context-efficient (only highest-priority artifacts make it in, budgets enforced up front).
In a repo with good structure, the harness can do a better job than “semantic nearest neighbours to this text.”
Reducing hallucination and Process Confabulation
Two common failure modes when LLMs work with codebases:
- AI hallucination: the model confidently asserts facts it never actually saw, because the relevant spec or ADR wasn’t in context.
- Process Confabulation: the model invents a plausible but wrong story to glue together a pile of loosely related snippets you stuffed into the window.
PAG attacks both from the context side:
- Must-have upstreams (specs, ADRs, contracts) are non-evictable, so the model is less likely to guess in the dark about requirements or design decisions.
- Budgets and priorities keep the packet focused, so you don’t flood the window with half-relevant noise and force the model to hallucinate the “glue work” between them.
- Flow scoping means each packet is assembled for a specific task, not “everything that might be vaguely useful.”
The goal isn’t to make the model smarter. It’s to make sure the model sees the right information with enough coverage to reduce hallucination, while keeping each chunk concise and scoped enough to avoid Process Confabulation.
How it works in practice
Build a domain graph
Start by modeling the things that matter:
- Specs (requirements, design docs, task plans),
- ADRs (architecture decision records),
- Contracts and schemas,
- Docs and READMEs,
- Code (possibly summarized),
- Tests and fixtures.
Organize them so tooling can reason about them: by path (specs/, docs/, src/), by tags (“core”, “upstream”, “doc”, “code”), by dependencies (“this design depends on this ADR”).
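As a rough illustration, the graph can start as nothing more than a list of artifacts with paths, kinds, tags, and dependency edges. The field names below are illustrative, not a prescribed schema.

```rust
// One node of the domain graph: an artifact plus the metadata tooling needs
// to reason about it (path, kind, tags, dependency edges). Field names are
// illustrative, not a prescribed schema.

#[allow(dead_code)]
#[derive(Debug, Clone)]
enum ArtifactKind {
    Spec,
    Adr,
    Schema,
    Doc,
    Code,
    Test,
}

#[derive(Debug, Clone)]
struct Artifact {
    id: String,              // e.g. "ADR-012" or "specs/order-processing.md"
    path: String,            // where it lives in the repo
    kind: ArtifactKind,
    tags: Vec<String>,       // e.g. ["core", "upstream", "billing"]
    depends_on: Vec<String>, // ids of artifacts this one depends on
}

fn main() {
    // A two-node graph: a spec that depends on an ADR.
    let graph = vec![
        Artifact {
            id: "specs/order-processing.md".into(),
            path: "specs/order-processing.md".into(),
            kind: ArtifactKind::Spec,
            tags: vec!["core".into(), "upstream".into()],
            depends_on: vec!["ADR-012".into()],
        },
        Artifact {
            id: "ADR-012".into(),
            path: "docs/adr/ADR-012.md".into(),
            kind: ArtifactKind::Adr,
            tags: vec!["core".into()],
            depends_on: vec![],
        },
    ];
    println!("{graph:#?}");
}
```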
Assign priorities and budgets
PAG is opinionated about what’s more important when context is tight:
- Priority 0: Non-evictable upstream artifacts (core specs, ADRs, core contracts).
- Priority 1: Other relevant specs/ADRs.
- Priority 2: Docs (READMEs, guides).
- Priority 3: Code snippets or local files.
Then define budgets: max tokens (or bytes/lines) per packet, max docs per type, optional fair-share rules (“never starve ADRs”).
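A minimal sketch of what those tiers and budgets can look like as data; the names and numeric limits are placeholders, not recommendations.

```rust
// Priority tiers and budgets as plain data. Priority 0 is non-evictable;
// the numeric limits are placeholders, not recommendations.

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Priority {
    MustHave = 0, // core specs, ADRs, core contracts: never evicted
    Specs = 1,    // other relevant specs/ADRs
    Docs = 2,     // READMEs, guides
    Code = 3,     // code snippets or local files
}

#[derive(Debug, Clone)]
struct Budget {
    max_tokens: usize,        // overall packet budget
    max_docs_per_type: usize, // cap per artifact kind
    min_adr_share: f32,       // fair-share rule: never starve ADRs
}

fn default_budget() -> Budget {
    Budget { max_tokens: 16_000, max_docs_per_type: 5, min_adr_share: 0.10 }
}

fn main() {
    println!("top tier: {:?}, budget: {:?}", Priority::MustHave, default_budget());
}
```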
Assemble the packet
The packet builder:
- Walks the graph, highest-priority first.
- Adds artifacts while staying under budgets.
- If must-have artifacts don’t fit → fails fast with a clear error (“your upstream spec is bigger than the packet budget”).
Result: a fixed packet with data, provenance, and ordering. Only now do you call the model.
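Here is a rough sketch of that assembly loop under the rules above (highest priority first, stay under budget, fail fast when a must-have doesn’t fit). The types and token estimates are simplified placeholders, not a reference implementation.

```rust
// Simplified assembly loop: sort candidates by priority, add artifacts while
// the token budget holds, and fail fast if a priority-0 artifact does not fit.
// Types and token estimates are placeholders.

#[derive(Debug, Clone)]
struct Candidate {
    id: String,
    priority: u8,  // 0 = must-have (non-evictable)
    tokens: usize, // estimated size of this artifact
}

#[derive(Debug)]
enum PackError {
    MustHaveTooBig { id: String, tokens: usize, remaining: usize },
}

fn build_packet(mut candidates: Vec<Candidate>, max_tokens: usize) -> Result<Vec<Candidate>, PackError> {
    // Highest priority first (0 before 1 before 2 ...); the sort is stable within a tier.
    candidates.sort_by_key(|c| c.priority);

    let (mut packet, mut used) = (Vec::new(), 0usize);
    for c in candidates {
        let remaining = max_tokens.saturating_sub(used);
        if c.tokens <= remaining {
            used += c.tokens;
            packet.push(c);
        } else if c.priority == 0 {
            // A must-have artifact does not fit: fail fast with a clear error.
            return Err(PackError::MustHaveTooBig { id: c.id, tokens: c.tokens, remaining });
        }
        // Lower-priority artifacts that do not fit are simply skipped.
    }
    Ok(packet)
}

fn main() {
    let candidates = vec![
        Candidate { id: "specs/order-processing.md".into(), priority: 0, tokens: 3_000 },
        Candidate { id: "docs/adr/ADR-012.md".into(), priority: 0, tokens: 1_200 },
        Candidate { id: "README.md".into(), priority: 2, tokens: 900 },
        Candidate { id: "src/orders.rs".into(), priority: 3, tokens: 5_000 },
    ];
    match build_packet(candidates, 8_000) {
        Ok(packet) => println!("packet: {:?}", packet.iter().map(|c| &c.id).collect::<Vec<_>>()),
        Err(e) => eprintln!("fail fast: {e:?}"),
    }
}
```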
Emit receipts
Each run records a receipt: exit code, error kind, model info, hashes, and packet evidence. Optionally pins the environment (lockfile) for drift detection. You can inspect and diff the actual inputs to the model, not just the prompts.
This closes the loop: iterate on packet rules (priorities, budgets) based on what receipts tell you.
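A receipt doesn’t need to be elaborate. A minimal sketch follows, using std’s DefaultHasher as a stand-in for a real content hash; a production harness would more likely use something like SHA-256 and serialize the receipt to disk.

```rust
// Sketch of a per-run receipt. std's DefaultHasher stands in for a real
// content hash (a production harness would more likely use SHA-256 and
// serialize this to disk); field names are illustrative.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Debug)]
struct Receipt {
    flow: String,
    model: String,
    exit_code: i32,
    error_kind: Option<String>,
    packet_hash: u64,          // hash over the exact packet contents
    packet_files: Vec<String>, // evidence: what was included, in what order
}

fn content_hash(parts: &[String]) -> u64 {
    let mut h = DefaultHasher::new();
    for p in parts {
        p.hash(&mut h);
    }
    h.finish()
}

fn main() {
    let packet_files = vec![
        "specs/order-processing.md".to_string(),
        "docs/adr/ADR-012.md".to_string(),
    ];
    let receipt = Receipt {
        flow: "review-spec".into(),
        model: "example-model".into(),
        exit_code: 0,
        error_kind: None,
        packet_hash: content_hash(&packet_files),
        packet_files,
    };
    println!("{receipt:#?}");
}
```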
Watch out for
Formalizing chaos. If your repo structure is random, your packet builder will be too. Invest a bit in organizing specs and docs before you encode them in rules.
Over-engineering the first version. Don’t spend weeks designing the perfect priority scheme. Start with “specs and ADRs are Priority 0, everything else is lower” and iterate based on what receipts tell you.
Receipts nobody reads. Receipts only matter if someone looks at them. Wire at least one check into CI — “fail if required specs are missing from the packet” — so the evidence has teeth.
Try this once in your repo
Before you build a full packet builder, run a single experiment.
- Pick one LLM task you already run (e.g., “summarize this spec and propose tests”).
- Manually list the 5–10 artifacts that must be in context: the main spec, any linked ADRs, one schema, at most 1–2 code files.
- Build a small `packet.json` with file paths, byte counts, and a simple `priority` field.
- Log that packet next to the model call.
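If it helps to see the shape, here is a sketch of what that packet file might contain, emitted from Rust with serde_json (an assumed dependency); the flow name, paths, byte counts, and priorities are made up for illustration.

```rust
// Emit a minimal packet.json for one task. Assumes serde_json = "1" in
// Cargo.toml; flow name, paths, byte counts, and priorities are hypothetical.

fn main() {
    let packet = serde_json::json!({
        "flow": "summarize-spec-and-propose-tests",
        "budget_bytes": 64_000,
        "artifacts": [
            { "path": "specs/order-processing.md", "bytes": 14_230, "priority": 0 },
            { "path": "docs/adr/ADR-012.md",       "bytes":  4_102, "priority": 0 },
            { "path": "schemas/order.json",        "bytes":  2_310, "priority": 1 },
            { "path": "src/orders.rs",             "bytes":  9_876, "priority": 3 }
        ]
    });
    println!("{}", serde_json::to_string_pretty(&packet).unwrap());
}
```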
You’ve just taken the first step toward PAG: a deterministic packet you can inspect and diff later, instead of opaque auto-context.
PAG as a spectrum, not a product
PAG isn’t a specific toolchain. It’s a way of thinking about context:
“Given this request and this budget, what’s the most useful, token-efficient set of inputs I can assemble on purpose, without flooding the window with garbage?”
You can apply this at very different levels of sophistication.
Level 0: Implicit policy (no PAG yet)
The default most teams start with:
- The model or editor decides what to load.
- You have a giant context window, so you “just shove a lot of stuff in.”
- There’s no real notion of priority, budgets, or receipts.
This is a policy—it’s just an implicit one: “whatever fits, in whatever order the tool picks.”
Level 1: Heuristics with intent
You don’t need a Rust harness to start doing PAG-style thinking. Even with a 2M token window and cheap tokens, you can:
- Build a heatmap of common files accessed recently.
- Bias context toward files touched in the current branch or PR, files that co-occur with the current target, specs/docs that mention the same entities.
- Cap low-value noise: limit “background” files to N, prefer artifacts that have been useful in this thread already.
Here, the “packet” might be nothing more than an ordered list of files, a few simple priority bands, and a soft budget (“never let ‘misc’ consume more than 20% of the window”).
You’re still using a big window, but you’re packing it deliberately instead of dumping the repo.
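A minimal sketch of that kind of Level 1 packing: score candidate files with a couple of cheap signals and cap how much of the window “misc” files can consume. The signals, weights, and thresholds are placeholders you would tune.

```rust
// Level-1 packing: rank candidate files with a couple of cheap signals and
// cap how much of the window "misc" files may consume. Signals and weights
// are placeholders.

#[derive(Debug, Clone)]
struct FileCandidate {
    path: String,
    tokens: usize,
    touched_in_branch: bool, // changed in the current branch/PR
    recent_accesses: u32,    // simple "heatmap" counter
    is_misc: bool,           // background / low-value context
}

fn score(c: &FileCandidate) -> f64 {
    let mut s = 0.0;
    if c.touched_in_branch { s += 10.0; }
    s += f64::from(c.recent_accesses).min(5.0); // cap the heatmap signal
    if c.is_misc { s -= 5.0; }
    s
}

fn pack(mut candidates: Vec<FileCandidate>, window_tokens: usize) -> Vec<FileCandidate> {
    let misc_budget = window_tokens / 5; // never let "misc" exceed 20% of the window
    candidates.sort_by(|a, b| score(b).partial_cmp(&score(a)).unwrap());

    let (mut used, mut misc_used) = (0usize, 0usize);
    let mut packed = Vec::new();
    for c in candidates {
        let over_misc = c.is_misc && misc_used + c.tokens > misc_budget;
        if used + c.tokens <= window_tokens && !over_misc {
            used += c.tokens;
            if c.is_misc { misc_used += c.tokens; }
            packed.push(c);
        }
    }
    packed
}

fn main() {
    let candidates = vec![
        FileCandidate { path: "src/orders.rs".into(), tokens: 4_000, touched_in_branch: true, recent_accesses: 7, is_misc: false },
        FileCandidate { path: "docs/style-guide.md".into(), tokens: 3_000, touched_in_branch: false, recent_accesses: 1, is_misc: true },
    ];
    for c in pack(candidates, 10_000) {
        println!("{} ({} tokens)", c.path, c.tokens);
    }
}
```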
Level 2: Structured graphs and light receipts
Once you have more structure—specs and ADRs in version control, some notion of modules/services, basic analyzers (graphs, coverage, metrics)—you can move to the more “classical” PAG shape:
- Flow-scoped library (specs, ADRs, docs, code, analyzer output).
- Explicit priority tiers (0–3).
- Hard budgets (tokens/bytes/lines).
- Minimal receipts (what was included, in what order).
This is where per-run reproducibility and CI integration become realistic.
Level 3: Full harness, multi-shot, receipts everywhere
At the far end:
- Flows defined as first-class objects.
- Request routing (NLP-based, rule-based, or hybrid) to classify requests and extract anchors.
- Analyzers categorized by latency (quick / medium / slow).
- Single-shot and multi-shot packets with priorities, budgets, strict fail-fast rules, and detailed receipts for every run.
You don’t have to jump straight to Level 3. If all you do is stop letting auto-context spray random neighbours into the prompt and instead rank candidates by “how likely is this to matter for this request” and enforce some notion of budgets and caps, you’re already doing a light version of PAG. The rest is tightening the rules and adding better evidence.
When PAG works (and when it doesn’t)
Works best when:
- You have structured, versioned artifacts: spec-driven development, ADRs, contracts, schemas, well-organized docs and code.
- You care about reproducibility, governance/audit, CI integration.
- You’re running LLMs inside tightly defined workflows (agents in an SDLC harness, architecture reviews, automated doc/code consistency checks).
Less ideal when:
- You’re in exploratory or open-world scenarios: generic question-answering over the web, broad knowledge base queries, “find anything relevant” search problems.
- Your input corpus is messy, unstructured, or constantly changing without a strong schema.
In those cases, classic RAG is still the better fit. You want the model to help steer retrieval because you don’t know the structure ahead of time.
Single-shot vs multi-shot PAG
So far, we’ve assumed a single-shot setup: for a given flow and phase, you build one packet, call the model once, and log the receipt.
In practice, you often want multi-shot PAG across a conversation.
Single-shot
For a single-shot call:
- Request routing classifies the intent, extracts anchors, picks the flow.
- Analyzers run the fast, cheap checks and contribute artifacts to the flow library.
- Packet builder assembles a packet under a clear latency and token budget.
- LLM gets one well-scoped packet and produces an answer.
Useful when you’re in CI or batch mode, you need a reproducible one-shot judgement, or you care more about determinism than interactivity.
Multi-shot
In a multi-shot setup, you stretch the same pattern across shots:
Shot 1:
- Use request routing + fast analyzers to build a packet that comfortably fits under your latency budget (e.g., 100–200ms).
- Call the model with that packet and return an answer.
Between shots:
- Run slower analyzers and heavier processing jobs in the background: deeper call graphs, cross-component diffs, heavy summarization of large subgraphs or logs, “what changed since last time?” analyses.
- Store their results as structured artifacts in the same flow library, tagged with which entities/specs/ADRs they relate to and which flow and thread they belong to.
Shot 2 and beyond:
- When the user asks the next question in that thread, request routing reads the new request + thread history and pulls in any completed slow artifacts that are relevant to this flow.
- The packet builder reruns its normal rules (priorities, budgets, fail-fast) over this richer flow library.
- The model now sees context that simply wasn’t ready in time for Shot 1.
You’re not “hoping it sees more stuff next time.” You’re deliberately refusing to block the first answer on slow analysis, letting slow analysis catch up while the user reads or types, and treating finished slow results as new, high-value artifacts for the next packet in the same flow.
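A sketch of the bookkeeping between shots, assuming a simple in-memory flow library: finished slow-analyzer results become ordinary artifacts, tagged with the flow and thread they belong to, so the next packet build can pick them up under the usual rules. All names here are hypothetical.

```rust
// Between-shot bookkeeping: finished slow-analyzer results become ordinary
// artifacts in the flow library, tagged with the flow and thread they belong
// to, so the next packet build can include them. All names are hypothetical.

#[derive(Debug, Clone)]
struct AnalysisArtifact {
    analyzer: String,     // e.g. "call-graph", "deploy-diff"
    flow: String,         // which flow this result belongs to
    thread_id: String,    // which conversation thread
    related: Vec<String>, // entities/specs/ADRs this result relates to
    summary: String,      // content the packet builder can include
}

#[derive(Debug, Default)]
struct FlowLibrary {
    artifacts: Vec<AnalysisArtifact>,
}

impl FlowLibrary {
    fn add(&mut self, artifact: AnalysisArtifact) {
        self.artifacts.push(artifact);
    }

    // Candidates for the next shot: completed results scoped to this thread.
    fn for_thread(&self, thread_id: &str) -> Vec<&AnalysisArtifact> {
        self.artifacts.iter().filter(|a| a.thread_id == thread_id).collect()
    }
}

fn main() {
    let mut library = FlowLibrary::default();
    // A slow analyzer finished while the user was reading Shot 1's answer.
    library.add(AnalysisArtifact {
        analyzer: "call-graph".into(),
        flow: "explain-behaviour".into(),
        thread_id: "thread-42".into(),
        related: vec!["checkout-service".into(), "ADR-021".into()],
        summary: "Critical path through checkout touches four components".into(),
    });
    println!("ready for shot 2: {} artifact(s)", library.for_thread("thread-42").len());
}
```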
Example: background analysis improving the second answer
Imagine a user asks:
“Why does the checkout service keep timing out?”
Shot 1:
- Request routing classifies this as “explain behaviour” for the checkout service, anchors to `checkout-service`, ADR-021, and `src/checkout/`.
- Flow harness pulls in the checkout spec, ADR-021, a recent incident note, and a latency snapshot.
- Packet builder assembles a small packet around those artifacts and calls the model.
The first answer can outline plausible causes and ask for clarification, without waiting on heavy analysis.
While the user is reading, the harness kicks off slower analyzers:
- A call-graph walk for the critical path through checkout.
- A diff against the last known-good deploy.
- A deeper scan of configuration files.
By the time the user asks:
“Can you show which components are on the slow path?”
those heavier analyzers have finished. Their results are already scoped to the same flow, tagged with the same service and ADRs, and available as new artifacts in the flow library.
Shot 2’s packet can now include a summarized call graph for the slow path and a focused diff for the relevant components, alongside the original specs and docs. Same PAG rules, strictly better inputs.
PAG + RAG together
You don’t have to pick one forever. A mature setup might:
- Use PAG to assemble a base packet from known, high-value artifacts (specs, ADRs, core docs).
- Allow RAG inside that packet for “long tail” lookups — only for rarely-touched files, only within a constrained subgraph.
Think of it as: “First, assemble the curated briefing folder (PAG). If you really need something extra, run a controlled search within these bounds (RAG).”
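One way to make “a controlled search within these bounds” concrete is to restrict any retrieval step to an allowlist derived from the packet’s subgraph. The sketch below uses plain substring matching purely for illustration; a real setup might put a vector index or grep behind the same allowlist.

```rust
// Constrained "long tail" lookup: retrieval is only allowed inside the
// subgraph the packet already scoped. Plain substring matching stands in for
// whatever retriever (vector index, grep, LSP query) you actually use.

fn constrained_lookup<'a>(
    query: &str,
    corpus: &'a [(String, String)], // (path, contents)
    allowed_prefixes: &[&str],      // bounds derived from the packet's subgraph
) -> Vec<&'a str> {
    let needle = query.to_lowercase();
    corpus
        .iter()
        .filter(|(path, _)| allowed_prefixes.iter().any(|&p| path.starts_with(p)))
        .filter(|(_, contents)| contents.to_lowercase().contains(&needle))
        .map(|(path, _)| path.as_str())
        .collect()
}

fn main() {
    let corpus = vec![
        ("src/checkout/retry.rs".to_string(), "timeout and retry policy".to_string()),
        ("src/billing/invoice.rs".to_string(), "timeout handling for invoices".to_string()),
    ];
    // Only the checkout subtree is in bounds for this packet.
    let hits = constrained_lookup("timeout", &corpus, &["src/checkout/"]);
    println!("{hits:?}"); // ["src/checkout/retry.rs"]
}
```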
How this relates to systems like Watson/DeepQA
If you’ve seen descriptions of IBM’s Watson DeepQA architecture, the pattern here should feel familiar.
DeepQA didn’t rely on a single model. It used:
- Question analysis to understand the query.
- Many small evidence sources and scorers.
- A pipeline that combined their outputs into a ranked answer.
PAG plays a similar role around an LLM:
- A routing layer (NLP-based, rule-based, or hybrid) classifies the request and extracts anchors (spec IDs, ADRs, services, files).
- A set of narrow tools and analyzers—each its own “mini QA system” over code, specs, logs, or metrics—produce structured evidence.
- The packet builder takes the flow-scoped library of artifacts and evidence and turns it into a deterministic, budgeted packet with receipts.
- The LLM sits at the end of that chain as the synthesizer, not as the only source of truth.
In a multi-shot setup, this aligns even more with DeepQA’s progressive evidence gathering:
- The first answer is built from what can be assembled quickly.
- Slower analyzers continue running in the background.
- Subsequent answers in the same thread get packets that include those slower, richer results—still scoped to the flow, still deterministic and auditable.
The goal isn’t to re-implement DeepQA. It’s to take the same “many small, specialized reasoners feeding a final answerer” pattern and pair it with modern LLMs and explicit, inspectable packets instead of opaque prompts.
The bottom line
If you’ve been frustrated by LLMs that wander aimlessly through your codebase, miss the specs that matter, and produce unreproducible results—PAG is the pattern that fixes it.
- RAG: “Here’s a Google account and the company wiki. Search for whatever you think you need.”
- PAG: “Here’s the spec, the relevant ADRs, analyzer output, and the two files that matter for this change. If you’re missing something after that, then ask.”
PAG doesn’t try to make the model smarter. It makes sure the model sees the right information with enough coverage to reduce AI hallucination, while keeping each chunk concise and scoped enough to avoid Process Confabulation.
You can apply this whether you have a tight 8K window or a 2M token context: the point isn’t “use less context,” it’s “use your budget on the highest-value context you can assemble on purpose.”
PAG is what you use when you’re serious about giving your LLM every chance to succeed—and still having receipts when something goes wrong.