Retrieval-Augmented Generation (RAG) workflows are transforming how organizations keep their knowledge bases fresh, accurate, and contextually relevant. In this deep-dive tutorial, you’ll learn exactly how to build an automated RAG pipeline that ingests new documents, updates embeddings, and delivers up-to-date answers—all with testable code and reproducible steps.
If you’re new to the concept of RAG or want a broader overview of its architecture and use cases, see our Ultimate Guide to RAG Pipelines: Building Reliable Retrieval-Augmented Generation Systems. Here, we go hands-on with a specific, production-ready workflow for automated knowledge base updates.
Prerequisites
- Python 3.10+ (tested with 3.10 and 3.11)
- pip (Python package manager)
- Basic command line (Linux/Mac/Windows)
- Familiarity with LLMs, embeddings, and vector databases (see our LLM-based knowledge base guide for background)
- Git (for cloning example repositories)
- API keys for your chosen embedding model and LLM (we’ll use OpenAI for this tutorial, but you can swap in open-source models—see our comparison guide)
- Docker (optional, for running a local vector database)
1. Set Up Your Project Environment
- Create a new project directory:

  ```bash
  mkdir rag-knowledgebase-updates && cd rag-knowledgebase-updates
  ```
- Create and activate a virtual environment:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```
- Install required Python packages:

  ```bash
  pip install openai chromadb langchain pyyaml tqdm
  ```

  - openai: LLM and embedding APIs
  - chromadb: local vector database (swap for Pinecone, Weaviate, etc. if desired)
  - langchain: orchestrates the RAG workflow
  - pyyaml: config files
  - tqdm: progress bars for batch processing
- Set environment variables for API keys:

  ```bash
  export OPENAI_API_KEY='your-openai-key'
  ```
Screenshot description: Terminal showing successful installation of dependencies and activation of the virtual environment.
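Before running any of the scripts below, it can help to verify the key is actually set. A minimal sketch (the `require_api_key` helper is illustrative, not part of the tutorial's scripts; the variable name matches the export above):

```python
import os
import sys

# Fail fast if the API key is missing, rather than erroring mid-pipeline.
def require_api_key(name="OPENAI_API_KEY"):
    key = os.environ.get(name)
    if not key:
        sys.exit(f"Missing {name}; export it before running the pipeline.")
    return key
```

Calling this guard at the top of each script turns a confusing mid-run API error into an immediate, readable failure.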
2. Prepare Your Knowledge Base Data
- Organize your documents: place all source files (PDF, DOCX, TXT, Markdown, etc.) in a data/ directory.

  ```bash
  mkdir data
  ```
- Install document loader dependencies (the quotes keep shells like zsh from expanding the brackets):

  ```bash
  pip install "unstructured[all]" python-docx
  ```
- Write a loader script to parse documents (save as load_documents.py):

  ```python
  from langchain.document_loaders import DirectoryLoader, UnstructuredFileLoader

  def load_docs(directory):
      loader = DirectoryLoader(
          directory,
          glob="**/*.*",
          loader_cls=UnstructuredFileLoader
      )
      docs = loader.load()
      print(f"Loaded {len(docs)} documents.")
      return docs

  if __name__ == "__main__":
      docs = load_docs("data/")
  ```

- Test loading your documents:

  ```bash
  python load_documents.py
  ```

  Screenshot description: Output showing "Loaded X documents."
3. Chunk and Embed the Documents
- Chunk documents for better retrieval (save as chunk_and_embed.py):

  ```python
  from langchain.text_splitter import RecursiveCharacterTextSplitter

  from load_documents import load_docs

  def chunk_docs(docs, chunk_size=500, chunk_overlap=50):
      splitter = RecursiveCharacterTextSplitter(
          chunk_size=chunk_size,
          chunk_overlap=chunk_overlap
      )
      chunks = []
      for doc in docs:
          chunks.extend(splitter.split_documents([doc]))
      print(f"Split into {len(chunks)} chunks.")
      return chunks

  if __name__ == "__main__":
      docs = load_docs("data/")
      chunks = chunk_docs(docs)
  ```
- Generate embeddings for each chunk (OpenAI example; save as embed_chunks.py):

  ```python
  from langchain.embeddings import OpenAIEmbeddings
  from tqdm import tqdm

  from chunk_and_embed import chunk_docs
  from load_documents import load_docs

  def embed_chunks(chunks):
      embedder = OpenAIEmbeddings()
      embeddings = []
      for chunk in tqdm(chunks):
          emb = embedder.embed_documents([chunk.page_content])
          embeddings.append((emb[0], chunk.metadata, chunk.page_content))
      return embeddings

  if __name__ == "__main__":
      docs = load_docs("data/")
      chunks = chunk_docs(docs)
      embeddings = embed_chunks(chunks)
      print(f"Embedded {len(embeddings)} chunks.")
  ```

- Tip: For open-source embedding models, see Meta’s Llama-4 Open Weights: Accelerating RAG Workflow Innovation?.
4. Store Embeddings in a Vector Database
- Start a local ChromaDB server (optional, for persistent storage):

  ```bash
  docker run -d -p 8000:8000 chromadb/chroma
  ```

  Or use in-memory mode for testing.
- Write a script to store embeddings (save as store_embeddings.py):

  ```python
  import chromadb
  from chromadb.config import Settings

  from embed_chunks import embed_chunks
  from chunk_and_embed import chunk_docs
  from load_documents import load_docs

  def store_embeddings(embeddings):
      client = chromadb.Client(Settings(
          persist_directory="chroma_db"
      ))
      collection = client.get_or_create_collection(name="knowledgebase")
      for idx, (emb, meta, content) in enumerate(embeddings):
          collection.add(
              embeddings=[emb],
              metadatas=[meta],
              documents=[content],
              ids=[f"chunk_{idx}"]
          )
      client.persist()
      print("Embeddings stored and persisted.")

  if __name__ == "__main__":
      docs = load_docs("data/")
      chunks = chunk_docs(docs)
      embeddings = embed_chunks(chunks)
      store_embeddings(embeddings)
  ```

- Verify storage:

  ```bash
  ls chroma_db/
  ```

  Screenshot description: Directory listing showing ChromaDB files.
5. Build the Retrieval-Augmented Generation (RAG) Pipeline
- Write a retrieval function (save as retrieve.py):

  ```python
  import chromadb
  from chromadb.config import Settings
  from langchain.embeddings import OpenAIEmbeddings

  def retrieve(query, top_k=5):
      client = chromadb.Client(Settings(
          persist_directory="chroma_db"
      ))
      collection = client.get_collection("knowledgebase")
      embedder = OpenAIEmbeddings()
      query_emb = embedder.embed_query(query)
      results = collection.query(
          query_embeddings=[query_emb],
          n_results=top_k
      )
      return results['documents'][0]

  if __name__ == "__main__":
      query = "How do I reset my password?"
      docs = retrieve(query)
      print("Top documents:", docs)
  ```
- Integrate with an LLM for answer generation (save as generate_answer.py):

  ```python
  from openai import OpenAI

  from retrieve import retrieve

  def generate_answer(query):
      top_docs = retrieve(query)
      context = "\n\n".join(top_docs)
      prompt = f"""You are a helpful support assistant. Use ONLY the following context to answer the user's question.

  Context:
  {context}

  Question: {query}

  Answer:"""
      client = OpenAI()
      response = client.chat.completions.create(
          model="gpt-3.5-turbo",
          messages=[{"role": "user", "content": prompt}]
      )
      return response.choices[0].message.content.strip()

  if __name__ == "__main__":
      query = "How do I reset my password?"
      answer = generate_answer(query)
      print("Generated answer:", answer)
  ```

- Test your RAG pipeline:

  ```bash
  python generate_answer.py
  ```

  Screenshot description: Terminal output showing a generated answer to a sample query.
6. Automate Knowledge Base Updates
- Detect new or updated documents: use file hashes, timestamps, or a version control trigger to identify changes in the data/ directory.

  ```python
  import hashlib
  import os

  import yaml

  def file_hash(path):
      with open(path, "rb") as f:
          return hashlib.md5(f.read()).hexdigest()

  def detect_new_files(directory, hash_file="file_hashes.yaml"):
      old_hashes = {}
      if os.path.exists(hash_file):
          with open(hash_file) as f:
              old_hashes = yaml.safe_load(f) or {}
      new_hashes = {}
      for fname in os.listdir(directory):
          fpath = os.path.join(directory, fname)
          if os.path.isfile(fpath):
              new_hashes[fname] = file_hash(fpath)
      updated = [f for f in new_hashes if new_hashes[f] != old_hashes.get(f)]
      with open(hash_file, "w") as f:
          yaml.safe_dump(new_hashes, f)
      return updated

  if __name__ == "__main__":
      print("Updated files:", detect_new_files("data/"))
  ```

- Re-run chunking, embedding, and storage for updated files only.
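The incremental step above can be sketched as a small orchestrator that wires change detection to the earlier pipeline stages. The function below takes the stage functions as parameters so it stays decoupled; the wiring is illustrative, and `load_fn` is assumed to accept a single file path (the tutorial's `load_docs` takes a directory, so a per-file adapter would be needed):

```python
# Sketch of an incremental update: re-chunk, re-embed, and re-store only
# the files flagged as changed by detect_new_files. Pass in the pipeline
# stages (e.g. load/chunk/embed/store functions from earlier sections).
def update_changed(updated_paths, load_fn, chunk_fn, embed_fn, store_fn):
    if not updated_paths:
        print("No changes detected; nothing to update.")
        return 0
    docs = [doc for path in updated_paths for doc in load_fn(path)]
    chunks = chunk_fn(docs)
    store_fn(embed_fn(chunks))
    return len(chunks)
```

Keeping the stages as parameters also makes the orchestrator trivial to unit-test with stub functions before pointing it at the real embedding API.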
- Schedule the update workflow (e.g., daily with cron):

  ```bash
  crontab -e
  ```

  Add this line to run the update at 2am daily:

  ```
  0 2 * * * cd /path/to/rag-knowledgebase-updates && .venv/bin/python store_embeddings.py
  ```

  Note: store_embeddings.py as written re-embeds the full data/ directory; for large knowledge bases, point the cron entry at a script that combines detect_new_files with re-embedding of only the changed files.
7. (Optional) Expose Your RAG Workflow via API
- Install FastAPI:

  ```bash
  pip install fastapi uvicorn
  ```
- Write a simple API server (save as api_server.py; if the ChromaDB container is already bound to port 8000, pick another port here, e.g. 8001):

  ```python
  from fastapi import FastAPI, Query

  from generate_answer import generate_answer

  app = FastAPI()

  @app.get("/query")
  def query_knowledgebase(q: str = Query(...)):
      answer = generate_answer(q)
      return {"question": q, "answer": answer}

  if __name__ == "__main__":
      import uvicorn
      uvicorn.run(app, host="0.0.0.0", port=8000)
  ```

- Start the API server:

  ```bash
  python api_server.py
  ```

  Screenshot description: Terminal showing FastAPI server running on port 8000.
- Query your RAG-powered knowledge base:

  ```bash
  curl "http://localhost:8000/query?q=How%20do%20I%20reset%20my%20password?"
  ```
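The same endpoint can be called from Python. A minimal client sketch using only the standard library (the `ask` helper is illustrative; `urlencode` handles spaces and special characters in the question for you):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical helper around the /query endpoint from the previous step.
def ask(question, base="http://localhost:8000"):
    url = f"{base}/query?{urlencode({'q': question})}"
    with urlopen(url) as resp:  # requires the API server to be running
        return json.load(resp)

# Example (with the server up):
# print(ask("How do I reset my password?")["answer"])
```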
Common Issues & Troubleshooting
- Embeddings API errors: Double-check your OPENAI_API_KEY and rate limits. For open-source alternatives, see Open-Source AI Tools Surge in RAG Pipeline Adoption.
- ChromaDB connection failures: Ensure Docker is running and port 8000 is available, or switch to in-memory mode for quick testing.
- Document loader issues: Some formats may require extra dependencies (see unstructured[all]).
- LLM hallucinations: Ensure your prompt constrains the model to use only retrieved context. Advanced prompt engineering tips can be found in RAG for Enterprise Search: Advanced Prompt Engineering Patterns.
- Outdated answers after document updates: Confirm your update detection and embedding workflows are running as scheduled.
- Debugging pipeline failures: For a systematic approach, see Mastering Prompt Debugging: Diagnosing Workflow Failures in RAG and LLM Pipelines.
Next Steps
You now have a fully functional, automated RAG workflow for keeping your knowledge base up-to-date. To take your system further:
- Scale to 100K+ documents—see Scaling RAG for 100K+ Documents: Sharding, Caching, and Cost Control.
- Monitor and evaluate your RAG pipeline—see How to Monitor RAG Systems: Automated Evaluation Techniques.
- Experiment with different embedding models—see Comparing Embedding Models for Production RAG.
- Integrate with business process automation—see Integrating RAG and BPM: How to Supercharge Complex Business Processes.
- Deepen your RAG expertise—our Ultimate Guide to RAG Pipelines covers advanced strategies, scaling, and real-world case studies.
For more on automating knowledge base creation, see Automated Knowledge Base Creation with LLMs: Step-by-Step Guide for Enterprises.
Ready to build production-grade RAG workflows? Share your results, ask questions, or suggest improvements in the comments below!
