RAG for Agents
Why This Matters
An agent's context window is finite — and knowledge has a cutoff. When your agent needs to reason over a large codebase, a documentation corpus, or a live dataset, you can't just stuff it all into the prompt. RAG solves this by making retrieval itself a tool: the agent asks for what it needs, gets relevant chunks back, and reasons on top of them. Senior engineers need to understand this not as a magic trick, but as a data pipeline with well-defined trade-offs.
Core Concepts
1. The RAG Pipeline
RAG has two phases: indexing (offline) and retrieval (runtime).
Indexing:
raw documents → chunking → embedding → vector store
Retrieval (at agent runtime):
query → embed query → similarity search → top-k chunks → inject into context
The key insight: you're not doing full-text search. You're finding semantically similar content. "What does the useAuth composable do?" retrieves code about auth even if it doesn't contain those exact words.
2. Chunking Strategy
How you split documents is as important as the model you use. Bad chunking = bad retrieval.
Fixed-size chunking — simplest, often good enough:
```javascript
function chunkText(text, chunkSize = 500, overlap = 50) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    start += chunkSize - overlap; // overlap preserves context across boundaries
  }
  return chunks;
}
```
Semantic chunking — split on meaningful boundaries (paragraphs, function definitions, headings). Better for code and structured docs.
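As a sketch, here is paragraph-level semantic chunking: split on blank lines, then pack paragraphs into chunks up to a size budget. The `maxSize` default and the blank-line boundary rule are assumptions, not from any particular library:

```javascript
// Sketch: paragraph-based semantic chunking. Blank lines mark
// boundaries; paragraphs are packed together until maxSize is reached.
function chunkByParagraph(text, maxSize = 500) {
  const paragraphs = text.split(/\n\s*\n/);
  const chunks = [];
  let current = "";
  for (const para of paragraphs) {
    // Start a new chunk when adding this paragraph would exceed the budget
    if (current && current.length + para.length > maxSize) {
      chunks.push(current.trim());
      current = "";
    }
    current += para + "\n\n";
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

Unlike fixed-size chunking, no paragraph is ever cut mid-sentence, so each chunk is self-contained prose.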
Gotchas:
- Too small: chunks lose context ("the function" — which function?)
- Too large: you burn tokens and dilute the relevant signal
- No overlap: you split important context across chunk boundaries
For code specifically, consider chunking at the function/class level rather than by character count.
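A naive sketch of function-level splitting for JavaScript source. A real implementation would walk an AST; the regex here is an illustrative assumption that only handles top-level `function` declarations:

```javascript
// Sketch: split a JS source file into one chunk per top-level function
// declaration. The zero-width lookahead keeps each "function" keyword
// at the start of its chunk instead of discarding it.
function chunkByFunction(source) {
  return source
    .split(/^(?=function\s)/m) // split *before* each top-level declaration
    .map((chunk) => chunk.trim())
    .filter(Boolean);
}
```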
3. Embedding & Vector Stores
An embedding model converts text into a dense vector. The closer two vectors sit in that space, the more semantically similar the texts are.
Minimal example with OpenAI + in-memory store:
```javascript
import OpenAI from "openai";

const client = new OpenAI();

async function embed(text) {
  const res = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return res.data[0].embedding;
}

function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, v) => sum + v * v, 0));
  const magB = Math.sqrt(b.reduce((sum, v) => sum + v * v, 0));
  return dot / (magA * magB);
}

async function retrieve(query, index, topK = 3) {
  const queryVec = await embed(query);
  return index
    .map((item) => ({ ...item, score: cosineSimilarity(queryVec, item.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```
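To complete the indexing side, here is a sketch of building the in-memory index that retrieve() expects: an array of `{ text, vector }` records. The embedding function is injected as a parameter (an assumption for testability); in a real run you would pass the embed() helper above:

```javascript
// Sketch of the offline indexing phase: embed each chunk and keep the
// results in a plain array — the shape retrieve() searches over.
async function buildIndex(chunks, embedFn) {
  const index = [];
  for (const text of chunks) {
    index.push({ text, vector: await embedFn(text) });
  }
  return index;
}
```

For a few hundred chunks this is plenty; a production vector store only becomes necessary when the index outgrows memory or needs to survive restarts.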
Production vector stores to know:
- pgvector — Postgres extension, great if you're already on Postgres
- Qdrant — purpose-built, fast, easy to self-host
- Pinecone — managed, no ops, scales well
- Chroma — local dev-friendly, good for prototyping
For most Node.js projects starting out, pgvector is the pragmatic choice — it keeps your stack simple.
4. Wiring RAG as an Agent Tool
This is the architectural move: retrieval isn't a preprocessing step, it's a tool the agent can call. The agent decides when to retrieve and what to query.
```javascript
const tools = [
  {
    name: "search_knowledge_base",
    description:
      "Search the project documentation and codebase for relevant context. " +
      "Use when you need to understand how a feature works, find relevant examples, " +
      "or locate existing implementations before writing new code.",
    input_schema: {
      type: "object",
      properties: {
        query: {
          type: "string",
          description: "Natural language search query",
        },
        top_k: {
          type: "number",
          description: "Number of results to return (default: 3)",
        },
      },
      required: ["query"],
    },
  },
];

// Tool handler
async function handleToolCall(toolName, toolInput) {
  if (toolName === "search_knowledge_base") {
    const results = await retrieve(toolInput.query, vectorIndex, toolInput.top_k ?? 3);
    return results
      .map((r, i) => `[Result ${i + 1}] (score: ${r.score.toFixed(3)})\n${r.text}`)
      .join("\n\n---\n\n");
  }
  throw new Error(`Unknown tool: ${toolName}`); // fail loudly instead of returning undefined
}
```
The agent now autonomously decides to call `search_knowledge_base` before answering questions, exactly like a developer who knows to grep the codebase before guessing.
Key design decision: Don't force retrieval on every turn. Let the agent decide. If the question is about general TypeScript syntax, retrieval adds noise. If the question is "how does our auth middleware work?", it's essential.
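Putting it together, here is a sketch of the agent loop around this tool. The message shapes loosely mirror the Anthropic Messages API tool-use format, but treat the details as assumptions to adapt to your SDK; `callModel` is injected so the wiring can be exercised without an API key:

```javascript
// Sketch of the agent loop: keep calling the model, executing any tool
// it requests and feeding results back, until it answers directly.
async function runAgent(userPrompt, callModel, handleToolCall) {
  const messages = [{ role: "user", content: userPrompt }];
  while (true) {
    const response = await callModel(messages);
    const toolUse = response.content.find((b) => b.type === "tool_use");
    if (!toolUse) return response; // no tool requested: final answer

    // Execute the tool, then hand the result back for another turn
    const result = await handleToolCall(toolUse.name, toolUse.input);
    messages.push({ role: "assistant", content: response.content });
    messages.push({
      role: "user",
      content: [{ type: "tool_result", tool_use_id: toolUse.id, content: result }],
    });
  }
}
```

Because the loop only retrieves when the model emits a tool call, the "let the agent decide" policy above falls out of the structure for free.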
Try This Today
Build a minimal RAG tool for your own codebase:
- Pick 5–10 files from `hawkbot-mission-control` (or any project).
- Chunk them at the function level using a simple parser or regex.
- Embed each chunk with `text-embedding-3-small`.
- Store vectors in memory (a plain array works for this exercise).
- Write a `retrieve(query)` function and test it with 3 real questions you'd ask about the codebase.
- Observe: which queries return relevant chunks? Which ones fail? Tune your chunking.
The goal is to feel the pipeline end-to-end — not to ship it yet.
Resources
- OpenAI Embeddings docs: https://platform.openai.com/docs/guides/embeddings — covers models, dimensions, and best practices
- pgvector README: https://github.com/pgvector/pgvector — if you're already on Postgres, this is the lowest-friction path to a production vector store