Agentic Complexity: Medium

Retrieval-Augmented Generation in Go

Ground LLM responses in authoritative documents by retrieving relevant chunks before generation so the model can cite real data instead of hallucinating.

The Problem

LLMs confidently fabricate facts when asked about domain-specific or recent information they weren’t trained on. Providing the entire knowledge base in every prompt is impractical — it exceeds context limits and wastes tokens. The model needs the right documents at the right time, not all documents all the time.

The Solution

RAG separates retrieval from generation. An Embedder converts the user's question into a dense vector. A Retriever finds the top-K most semantically similar document chunks. A PromptBuilder formats those chunks as context. Only then does the LLM generate a response, grounded in retrieved evidence rather than parametric memory. In Go, a RAGPipeline wires these steps, together with the final model call, into a single Answer() method.

Structure


Question Enters the Pipeline

The caller passes a natural-language question to RAGPipeline.Answer(). The pipeline owns the full retrieval-generation flow; callers don't interact with the embedder or retriever directly.
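From the caller's side the whole flow is a single method call. A hypothetical call site, assuming the RAGPipeline type sketched under Implementation below:

// handleQuestion is a hypothetical caller. It never touches the embedder
// or retriever directly; the pipeline owns the whole flow.
func handleQuestion(ctx context.Context, p *RAGPipeline, q string) (string, error) {
	return p.Answer(ctx, q)
}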

Implementation

package main

import "context"

// Chunk is a retrieved document fragment with its source reference.
type Chunk struct {
	ID      string
	Content string
	Source  string
	Score   float32 // cosine similarity, higher is better
}

// Embedder converts a text string into a dense vector.
type Embedder interface {
	Embed(ctx context.Context, text string) ([]float32, error)
}

// Retriever fetches the top-K chunks most similar to a query vector,
// as produced by an Embedder.
type Retriever interface {
	Retrieve(ctx context.Context, queryVec []float32, topK int) ([]Chunk, error)
}

// Document is a source document indexed in the retriever.
type Document struct {
	ID        string
	Content   string
	Source    string
	Embedding []float32
}
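A minimal sketch of how these pieces might compose into the RAGPipeline described above. The exact PromptBuilder and LLM shapes and the struct fields are assumptions (any chat-completion client would satisfy LLM); this continues the same file, so no new imports are needed:

// PromptBuilder formats retrieved chunks into a grounded prompt.
// Sketch only: the exact shape is an assumption, not a published API.
type PromptBuilder interface {
	Build(question string, chunks []Chunk) string
}

// LLM is a minimal generation interface, standing in for whatever
// chat-completion client the application uses.
type LLM interface {
	Generate(ctx context.Context, prompt string) (string, error)
}

// RAGPipeline wires embed -> retrieve -> prompt -> generate into one call.
type RAGPipeline struct {
	Embedder  Embedder
	Retriever Retriever
	Prompts   PromptBuilder
	Model     LLM
	TopK      int
}

// Answer runs the full retrieval-generation flow for one question.
func (p *RAGPipeline) Answer(ctx context.Context, question string) (string, error) {
	vec, err := p.Embedder.Embed(ctx, question)
	if err != nil {
		return "", err
	}
	chunks, err := p.Retriever.Retrieve(ctx, vec, p.TopK)
	if err != nil {
		return "", err
	}
	prompt := p.Prompts.Build(question, chunks)
	return p.Model.Generate(ctx, prompt)
}

Because Answer depends only on small interfaces, each stage can be faked independently in tests, and swapping vector stores is a one-line change to the Retriever field.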

Real-World Analogy

A lawyer preparing a brief: rather than memorizing every case ever decided, they search the legal database for the three most relevant precedents and cite them directly. The closing argument is grounded in retrieved evidence, not recollection. RAG does the same for LLMs.

Pros and Cons

Pros

  • Dramatically reduces hallucination on factual questions.
  • Knowledge can be updated without retraining the model.
  • Retrieved chunks provide a citation trail for auditability.
  • The Retriever interface makes swapping vector stores a one-line change.

Cons

  • Retrieval quality determines answer quality: garbage in, garbage out.
  • Embedding every document adds upfront indexing cost.
  • Top-K may miss relevant context if the query is ambiguous.
  • Cosine similarity on stub embeddings returns meaningless results in tests.

Best Practices

  • Chunk documents at semantic boundaries (paragraphs, sections), not fixed byte sizes; retrieval quality depends heavily on chunk coherence. A paragraph-based splitter is sketched after this list.
  • Store Source in every Chunk and include it in the prompt so the LLM can cite its references.
  • Set topK conservatively (3–5) — more chunks don’t always mean better answers and they consume context budget fast.
  • Cache embeddings for documents that don’t change; re-embedding on every query is wasteful. A caching decorator is sketched after this list.
  • Use a real embedding model in tests against a small fixture corpus, not a stub — RAG bugs are usually retrieval bugs, not generation bugs.
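
As a sketch of the chunking advice, a paragraph-boundary splitter might look like this. The helper name and the blank-line heuristic are illustrative assumptions; it reuses the Chunk and Document types from the Implementation section and needs "fmt" and "strings" added to the import block. Real splitters usually also respect headings and cap chunk size by tokens.

// chunkByParagraph splits a document at blank lines so every chunk is a
// coherent paragraph rather than an arbitrary fixed-size window.
func chunkByParagraph(doc Document) []Chunk {
	paras := strings.Split(doc.Content, "\n\n")
	chunks := make([]Chunk, 0, len(paras))
	for i, p := range paras {
		p = strings.TrimSpace(p)
		if p == "" {
			continue
		}
		chunks = append(chunks, Chunk{
			ID:      fmt.Sprintf("%s#%d", doc.ID, i),
			Content: p,
			Source:  doc.Source, // carried through so answers can cite it
		})
	}
	return chunks
}

For the caching advice, a decorator over the Embedder interface keeps the rest of the pipeline unchanged. cachedEmbedder is a hypothetical name; a production version would need eviction and a mutex around the map.

// cachedEmbedder memoizes embeddings by text. Not safe for concurrent
// use as written; guard the map before sharing across goroutines.
type cachedEmbedder struct {
	inner Embedder
	cache map[string][]float32
}

func (c *cachedEmbedder) Embed(ctx context.Context, text string) ([]float32, error) {
	if vec, ok := c.cache[text]; ok {
		return vec, nil
	}
	vec, err := c.inner.Embed(ctx, text)
	if err != nil {
		return nil, err
	}
	c.cache[text] = vec
	return vec, nil
}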

When to Use

  • Domain-specific Q&A over private documentation, codebases, or knowledge bases.
  • Any application where the model must cite sources or avoid fabrication.
  • Systems where the knowledge base changes frequently (news, support tickets, API docs).

When NOT to Use

  • General knowledge questions where the base model’s training data is sufficient.
  • Real-time applications where retrieval latency is unacceptable.
  • Tiny knowledge bases — a few documents can simply be included in the system prompt.