Retrieval Augmented Generation (RAG) is revolutionizing how we build intelligent question-answering systems. By combining the power of large language models (LLMs) with real-time data retrieval, RAG enables chatbots and AI assistants to deliver accurate, context-aware responses grounded in specific source material. Whether you're developing a customer support bot, internal knowledge assistant, or research tool, mastering RAG is essential.
This guide walks you through building a minimal yet fully functional RAG application using LangChain and LangGraph. We'll cover the core architecture, implement each step in code, and show how to enhance your system with query analysis for smarter retrieval.
Understanding the RAG Architecture
A typical RAG pipeline consists of two main phases: an offline indexing phase and an online retrieval-and-generation phase. Each plays a critical role in ensuring your app delivers relevant, up-to-date answers.
Indexing: Preparing Your Data
Indexing is the offline process of preparing your data for fast and accurate retrieval. It involves three key steps:
- Loading – Extract raw content from sources like websites, PDFs, or databases.
- Splitting – Break documents into smaller chunks to fit within model context limits.
- Storing – Convert text into vector embeddings and store them in a vector database.
This phase ensures that when a user asks a question, only the most relevant portions of your dataset are retrieved — not the entire document.
Retrieval and Generation: Answering Questions
At runtime, your app performs two dynamic actions:
- Retrieve – Use the user’s query to find the most relevant document chunks from your indexed data.
- Generate – Feed the retrieved context and original question into an LLM to produce a concise, informed answer.
This separation allows your model to stay grounded in facts while leveraging its reasoning capabilities.
Setting Up the Environment
Before diving into code, ensure you have the necessary dependencies installed:
```python
%pip install --quiet --upgrade langchain-text-splitters langchain-community langgraph langchain-openai
```

We'll also use LangSmith for tracing and debugging as our application grows in complexity. To enable logging:
```python
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "your-api-key-here"
```

LangSmith provides invaluable insights into each step of your RAG pipeline, helping optimize performance and accuracy.
Step-by-Step Implementation
1. Load and Prepare the Source Content
We’ll use a blog post on LLM-Powered Autonomous Agents by Lilian Weng as our data source. To extract only relevant sections (title, headers, content), we apply HTML parsing with BeautifulSoup via WebBaseLoader:
```python
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Only parse the post title, header, and body; skip navigation and footer markup.
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(class_=("post-content", "post-title", "post-header"))
    ),
)
docs = loader.load()
```

This yields a single Document object containing roughly 42,000 characters of clean text.
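A quick sanity check confirms what was loaded (the character count may vary slightly if the post is updated):

```python
# Verify that exactly one document was loaded and inspect its size.
assert len(docs) == 1
print(len(docs[0].page_content))   # roughly 42,000 characters
print(docs[0].page_content[:200])  # preview the opening text
```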
2. Split Documents into Chunks
Large texts exceed most LLMs’ context windows and are harder to search effectively. We split the document using RecursiveCharacterTextSplitter:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # maximum characters per chunk
    chunk_overlap=200,     # overlap preserves context across chunk boundaries
    add_start_index=True,  # record each chunk's position in the source document
)
all_splits = text_splitter.split_documents(docs)
```

The result? 66 manageable chunks, ideal for both embedding and efficient retrieval.
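Because add_start_index=True, each chunk's metadata records where it began in the source document. A quick inspection confirms the split (the values shown in the comments are illustrative):

```python
print(len(all_splits))         # 66 chunks for this post at the time of writing
print(all_splits[0].metadata)  # e.g. {'source': 'https://lilianweng.github.io/...', 'start_index': 8}
```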
3. Store Embeddings in a Vector Database
Next, we convert each chunk into a numerical representation (embedding) using OpenAI’s text-embedding-3-large model and store them in an in-memory vector store:
```python
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vector_store = InMemoryVectorStore(embeddings)
vector_store.add_documents(documents=all_splits)
```

Now, any query can be matched against these embeddings using semantic similarity search.
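Before wiring up the full pipeline, it's worth running a standalone similarity search as a sanity check (the query text here is just an example):

```python
# Fetch the two chunks most similar to an example query.
results = vector_store.similarity_search("What is task decomposition?", k=2)
for doc in results:
    print(doc.metadata.get("start_index"), doc.page_content[:100])
```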
Orchestrate Retrieval and Generation with LangGraph
To structure our application logic clearly and support advanced features like streaming and tracing, we use LangGraph, LangChain’s framework for stateful workflows.
Define Application State
We define a typed state dictionary to track inputs and outputs across steps:
```python
from langchain_core.documents import Document
from typing_extensions import List, TypedDict

class State(TypedDict):
    question: str
    context: List[Document]
    answer: str
```

Build the Processing Nodes
Two core functions handle retrieval and answer generation. They rely on a chat model (llm) and a prompt template (prompt), which we set up first.
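One minimal, illustrative setup uses an OpenAI chat model and a hand-written RAG prompt (the specific model name and prompt wording are assumptions, not requirements):

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Chat model used for answer generation.
llm = ChatOpenAI(model="gpt-4o-mini")

# A simple RAG prompt expecting "question" and "context" variables.
prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are an assistant for question-answering tasks. Use the retrieved "
     "context to answer the question. If you don't know the answer, say so. "
     "Keep the answer concise."),
    ("human", "Question: {question}\n\nContext: {context}"),
])
```

With those in place, the retrieval and generation nodes look like this: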
```python
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}


def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}
```

Connect the Flow
Using StateGraph, we chain these steps together:
```python
from langgraph.graph import START, StateGraph

graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
```

This creates a traceable, reusable application graph.
Testing the RAG Pipeline
Let’s ask: “What is Task Decomposition?”
```python
response = graph.invoke({"question": "What is Task Decomposition?"})
print(response["answer"])
```

Output:
Task decomposition is the process of breaking down complex tasks into smaller, manageable steps. Techniques like Chain of Thought (CoT) and Tree of Thoughts (ToT) guide models to think step-by-step, improving performance on intricate problems. It can be initiated through prompts, task-specific instructions, or human input.
The system successfully retrieves relevant sections and generates a concise summary — all in under 50 lines of code.
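Because the pipeline is a LangGraph graph, you can also stream intermediate results instead of waiting for the final answer, which is handy for showing progress in a UI. A small sketch using LangGraph's standard streaming API:

```python
# Emit each node's state update as soon as it finishes.
for step in graph.stream(
    {"question": "What is Task Decomposition?"},
    stream_mode="updates",
):
    print(step)
```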
Enhancing with Query Analysis
Raw user queries aren’t always optimal for retrieval. Query analysis uses an LLM to rewrite or enrich the input before searching.
For example, if a user asks:
"What does the end of the post say about Task Decomposition?"
We want to extract both:
- The search term: "Task Decomposition"
- The section filter: "end"
Using structured output, we define a schema:
```python
from typing import Literal

class Search(TypedDict):
    query: str
    section: Literal["beginning", "middle", "end"]
```

The application state also needs a query: Search field to hold this structured object (see the full wiring sketch at the end of this section). Then add a preprocessing node:
```python
def analyze_query(state: State):
    structured_llm = llm.with_structured_output(Search)
    result = structured_llm.invoke(state["question"])
    return {"query": result}
```

Now the retrieval step applies metadata filtering:
```python
def retrieve(state: State):
    query = state["query"]
    retrieved_docs = vector_store.similarity_search(
        query["query"],
        filter=lambda doc: doc.metadata.get("section") == query["section"],
    )
    return {"context": retrieved_docs}
```

This allows precise control over which parts of your data are searched, significantly improving relevance.
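For this filter to match anything, each chunk needs a section label in its metadata, the State needs the query field, and the graph needs the new node in front of retrieve. Here's a minimal sketch of that wiring, assuming we simply label the first, middle, and last thirds of the chunks:

```python
# Label each chunk as beginning / middle / end using a simple thirds heuristic.
total = len(all_splits)
third = total // 3
for i, doc in enumerate(all_splits):
    if i < third:
        doc.metadata["section"] = "beginning"
    elif i < 2 * third:
        doc.metadata["section"] = "middle"
    else:
        doc.metadata["section"] = "end"

# Re-index the labeled chunks.
vector_store = InMemoryVectorStore(embeddings)
vector_store.add_documents(documents=all_splits)

# Extend the state with the structured query produced by analyze_query.
class State(TypedDict):
    question: str
    query: Search
    context: List[Document]
    answer: str

# Rebuild the graph so query analysis runs before retrieval.
graph_builder = StateGraph(State).add_sequence([analyze_query, retrieve, generate])
graph_builder.add_edge(START, "analyze_query")
graph = graph_builder.compile()

response = graph.invoke(
    {"question": "What does the end of the post say about Task Decomposition?"}
)
print(response["answer"])
```

With this in place, retrieval for the example question only considers chunks tagged "end".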
Frequently Asked Questions
What is Retrieval Augmented Generation (RAG)?
RAG is a technique that enhances LLM responses by retrieving relevant information from external sources before generating an answer. This ensures outputs are factually grounded and up-to-date.
Why split documents before indexing?
Most LLMs have limited context windows (e.g., 32k tokens). Splitting ensures no single chunk exceeds this limit. Smaller chunks also improve retrieval precision by isolating specific topics.
Can I use my own data sources?
Yes! LangChain supports over 160 document loaders for formats like PDF, CSV, Notion, Wikipedia, and more. Simply replace WebBaseLoader with the appropriate loader.
How do embeddings work in RAG?
Embeddings convert text into numerical vectors that capture semantic meaning. Similar texts have similar vectors, enabling fast similarity searches in high-dimensional space.
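As a tiny illustration using the embeddings object created earlier (the sentences and the numpy-based cosine helper are just for demonstration), semantically related texts produce vectors with higher cosine similarity:

```python
import numpy as np

# Embed two related sentences and one unrelated sentence.
v1, v2, v3 = embeddings.embed_documents([
    "Agents decompose tasks into smaller steps.",
    "Task decomposition breaks a problem into subgoals.",
    "The weather in Paris is mild in spring.",
])

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(v1, v2))  # expected to be relatively high
print(cosine(v1, v3))  # expected to be lower
```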
What role does LangSmith play?
LangSmith provides full observability into your RAG pipeline — tracking inputs, intermediate steps, and outputs. It's essential for debugging, testing, and optimizing complex chains.
Is this approach scalable?
Absolutely. While this example uses an in-memory store, production systems can leverage cloud vector databases like Pinecone, Weaviate, or Chroma for scalability and persistence.
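As a concrete illustration (assuming the langchain-chroma package is installed; any supported vector store works similarly), swapping the in-memory store for a persistent Chroma collection only changes the setup step:

```python
from langchain_chroma import Chroma

# Persistent vector store; the rest of the pipeline is unchanged.
vector_store = Chroma(
    collection_name="rag_demo",
    embedding_function=embeddings,
    persist_directory="./chroma_db",
)
vector_store.add_documents(documents=all_splits)
```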
Final Thoughts
Building a RAG application doesn’t require complex infrastructure — with LangChain and LangGraph, you can create a functional prototype in minutes. The modular design allows easy extension: adding chat history, multi-hop retrieval, or human-in-the-loop approval.
As you advance, consider integrating:
- Chat memory for conversational context
- Hybrid search combining keyword and vector matching
- Citation tracking to return source references
With RAG, you’re not just generating answers — you’re building trusted knowledge systems powered by AI.