Retrieval-Augmented Generation (RAG) is transforming how we build LLM-powered applications, but scaling these systems to handle 100,000+ documents introduces unique challenges. In this deep-dive, we’ll walk through practical strategies—sharding, caching, and cost control—to help you scale your RAG pipeline efficiently and reliably.
If you’re new to RAG or want a broader overview of the pipeline, see The Ultimate Guide to RAG Pipelines: Building Reliable Retrieval-Augmented Generation Systems. This article builds on those fundamentals to focus on scaling.
Prerequisites
- Python 3.9+
- Vector Database: Qdrant (v1.6+), Pinecone (2024), or Weaviate (v1.22+)
- LLM API: OpenAI GPT-3.5/4, or an open-source LLM (e.g., Llama 2 via Hugging Face Transformers)
- Embeddings Model: sentence-transformers (v2.2+)
- Knowledge: Familiarity with RAG concepts, Python, and Docker basics
- Hardware: 16GB+ RAM recommended for local vector DB testing
1. Setting Up Your Scalable RAG Pipeline
- Install Core Dependencies

  ```bash
  pip install qdrant-client sentence-transformers fastapi uvicorn cachetools
  ```

  For Pinecone or Weaviate, replace `qdrant-client` with the appropriate SDK.

- Start Qdrant Locally (for testing)

  ```bash
  docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
  ```

  This launches Qdrant on `localhost:6333`.

- Prepare Your Embedding Model

  ```python
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer('all-MiniLM-L6-v2')
  ```

- Initialize Vector Store Connection

  ```python
  from qdrant_client import QdrantClient

  qdrant = QdrantClient("localhost", port=6333)
  COLLECTION_NAME = "docs_sharded"
  ```
2. Sharding: Partitioning Your Dataset for Scale
Sharding splits your document collection into manageable segments, distributing storage and query load. This is essential for search performance and parallelization as you approach or exceed 100K documents.
- Design a Sharding Strategy

  - By Document Type: News, blogs, manuals, etc.
  - By Time Period: Year, quarter, etc.
  - By Hash: Consistent hash of document ID

  For example, sharding by hash. Note that Python's built-in `hash()` is randomized per process for strings, so use a stable hash to keep shard assignments consistent across restarts:

  ```python
  import hashlib

  def get_shard(doc_id, num_shards):
      # Stable hash: the same doc_id always maps to the same shard
      digest = hashlib.md5(str(doc_id).encode("utf-8")).hexdigest()
      return int(digest, 16) % num_shards
  ```

- Create Sharded Collections in Qdrant

  ```python
  NUM_SHARDS = 4

  for shard in range(NUM_SHARDS):
      qdrant.recreate_collection(
          collection_name=f"{COLLECTION_NAME}_shard_{shard}",
          vectors_config={"size": 384, "distance": "Cosine"}
      )
  ```

- Insert Documents into Appropriate Shard

  ```python
  def insert_document(doc_id, text, num_shards=NUM_SHARDS):
      vec = model.encode([text])[0]
      shard = get_shard(doc_id, num_shards)
      qdrant.upsert(
          collection_name=f"{COLLECTION_NAME}_shard_{shard}",
          points=[{
              "id": doc_id,
              "vector": vec.tolist(),
              "payload": {"text": text}
          }]
      )
  ```

  This ensures each document is stored in the correct shard.

- Query All Shards in Parallel

  ```python
  import concurrent.futures

  def search_all_shards(query, top_k=5):
      vec = model.encode([query])[0]
      results = []
      with concurrent.futures.ThreadPoolExecutor() as executor:
          futures = [
              executor.submit(
                  qdrant.search,
                  collection_name=f"{COLLECTION_NAME}_shard_{shard}",
                  query_vector=vec.tolist(),
                  limit=top_k
              )
              for shard in range(NUM_SHARDS)
          ]
          for f in concurrent.futures.as_completed(futures):
              results.extend(f.result())
      # Sort by score, deduplicate by document ID, keep the global top_k
      results.sort(key=lambda x: x.score, reverse=True)
      seen, merged = set(), []
      for res in results:
          if res.id not in seen:
              seen.add(res.id)
              merged.append(res)
      return merged[:top_k]
  ```

  This approach enables horizontal scaling and parallelizes the per-shard searches.
3. Caching: Speeding Up Retrieval and Reducing Load
Caching is crucial for high-traffic RAG systems. It reduces vector DB queries, speeds up repeated searches, and lowers infrastructure costs.
- Add a Query Cache Layer (In-Memory Example)

  ```python
  from cachetools import LRUCache, cached

  query_cache = LRUCache(maxsize=10000)

  @cached(query_cache)
  def cached_search(query, top_k=5):
      return search_all_shards(query, top_k)
  ```

  For distributed caching (Redis), use `redis-py` or `aioredis` and serialize results.

- Cache Embeddings for Popular Documents

  ```python
  embedding_cache = {}

  def get_or_encode(text):
      if text in embedding_cache:
          return embedding_cache[text]
      vec = model.encode([text])[0]
      embedding_cache[text] = vec
      return vec
  ```

  This avoids recomputing embeddings for frequently seen queries or documents.

- Implement HTTP Caching in Your API

  ```python
  from fastapi import FastAPI, Request, Response
  from fastapi.responses import JSONResponse

  app = FastAPI()

  @app.get("/search")
  async def search_endpoint(query: str, request: Request):
      # Generate a weak ETag based on the query (quoted, per the HTTP spec)
      etag = f'W/"{hash(query)}"'
      # Honor If-None-Match for conditional requests
      if request.headers.get("if-none-match") == etag:
          return Response(status_code=304)
      results = cached_search(query)
      return JSONResponse(content={"results": results}, headers={"ETag": etag})
  ```

  This allows browser and proxy caching for repeated queries.
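When you move to a distributed cache, the cache key and the serialization format become part of your design. A minimal sketch of one approach, assuming JSON-serializable results (the `make_cache_key` and `serialize_results` helpers are illustrative, not part of any Redis client API):

```python
import hashlib
import json

def make_cache_key(query, top_k=5):
    # Normalize so "Scaling  RAG " and "scaling rag" share a cache entry
    normalized = " ".join(query.lower().split())
    # Include every parameter that affects results in the key
    payload = json.dumps({"q": normalized, "k": top_k}, sort_keys=True)
    return "rag:search:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()

def serialize_results(results):
    # Store plain dicts, not client-specific result objects
    return json.dumps([
        {"id": r["id"], "score": r["score"], "text": r["text"]}
        for r in results
    ])
```

Keys built this way are stable across processes, unlike Python's built-in `hash()`, which makes them safe to share between API replicas.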
4. Cost Control: Optimizing for Budget and Performance
At scale, RAG costs can balloon—especially for vector DBs, LLM queries, and cloud storage. Here’s how to keep expenses in check:
- Batch Insert and Query Operations

  ```python
  def batch_insert(docs):
      for shard in range(NUM_SHARDS):
          batch = [
              {
                  "id": doc["id"],
                  "vector": model.encode([doc["text"]])[0].tolist(),
                  "payload": {"text": doc["text"]}
              }
              for doc in docs
              if get_shard(doc["id"], NUM_SHARDS) == shard
          ]
          if batch:
              qdrant.upsert(
                  collection_name=f"{COLLECTION_NAME}_shard_{shard}",
                  points=batch
              )
  ```

  This reduces API calls and speeds up ingestion.
- Limit LLM Calls with Pre-Ranking

  ```python
  def pre_rank(results, query):
      # Use a lightweight model or BM25 to filter before the LLM;
      # here, simple keyword overlap as an example
      scored = [
          (res, len(set(query.lower().split()) & set(res.payload["text"].lower().split())))
          for res in results
      ]
      # Only the top 2 results go to the LLM
      return sorted(scored, key=lambda x: x[1], reverse=True)[:2]
  ```

  See more on this in Reducing Hallucinations in RAG Workflows: Prompting and Retrieval Strategies for 2026.
- Monitor and Tune Vector DB Usage

  ```python
  info = qdrant.get_collection(collection_name=f"{COLLECTION_NAME}_shard_0")
  print(info)
  ```

  Set up alerts for collection size and query latency, and scale the number of shards up or down as needed.
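The alerting itself can start as a simple threshold check; a sketch under assumed inputs (the thresholds, and the `points_count`/latency values, are illustrative — in practice pull them from `get_collection` and your request logs):

```python
def check_alerts(points_count, p95_latency_ms,
                 max_points=500_000, max_latency_ms=200):
    # Return human-readable alerts when a shard outgrows its budget
    alerts = []
    if points_count > max_points:
        alerts.append(f"shard too large: {points_count} points (limit {max_points})")
    if p95_latency_ms > max_latency_ms:
        alerts.append(f"slow queries: p95 {p95_latency_ms}ms (limit {max_latency_ms}ms)")
    return alerts
```

Running this on a schedule per shard gives you an early signal that it is time to re-shard.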
- Consider Hybrid Storage (Cold/Warm/Hot)

  - Keep top-used documents in the fast vector DB (hot)
  - Archive rarely used docs in cheap object storage (cold)
  - Periodically refresh shards based on access patterns

  This hybrid approach can cut costs dramatically for large, infrequently accessed corpora.
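One way to drive the hot/cold split is a simple access counter; a minimal sketch, where the thresholds and the `assign_tier` policy are assumptions for illustration:

```python
from collections import Counter

access_counts = Counter()

def record_access(doc_id):
    # Call this on every retrieval hit for the document
    access_counts[doc_id] += 1

def assign_tier(doc_id, hot_threshold=100, warm_threshold=10):
    # Hot docs stay in the vector DB; cold docs can move to object storage
    hits = access_counts[doc_id]
    if hits >= hot_threshold:
        return "hot"
    if hits >= warm_threshold:
        return "warm"
    return "cold"
```

A periodic job can then walk the corpus, demote documents that have gone cold, and reset the counters for the next window.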
Common Issues & Troubleshooting
- Slow Queries Across Shards: Ensure you're querying shards in parallel. Check CPU utilization and increase `ThreadPoolExecutor` workers if needed.
- Out-of-Memory Errors: Lower cache size, use disk-based caching (e.g., Redis), or increase server RAM.
- Duplicate Results: Deduplicate by document ID after merging shard results.
- Vector DB Connection Issues: Confirm Docker/network settings; check logs with `docker logs [container_id]`.
- Cache Not Reducing Latency: Profile cache hit rates. If low, adjust cache keys or increase cache size.
- LLM Cost Spikes: Double-check pre-ranking logic to ensure only a small set of top results are sent to the LLM.
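Profiling cache hit rates doesn't require extra tooling; a small wrapper around any dict-like cache can report them. A sketch (the class and method names are illustrative):

```python
class HitRateTracker:
    """Wrap any dict-like cache to count hits and misses."""

    def __init__(self, cache):
        self.cache = cache
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        # Serve from cache on a hit; otherwise compute and store
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = compute()
        self.cache[key] = value
        return value

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

If the reported hit rate stays low in production, that usually points at unstable cache keys (e.g., unnormalized queries) rather than an undersized cache.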
Next Steps
By combining sharding, caching, and cost control, you can reliably scale your RAG pipeline to 100K+ documents and beyond. For production, consider:
- Deploying your vector DB in a managed cloud for high availability
- Implementing distributed caching (e.g., Redis Cluster)
- Adding monitoring and automated scaling policies
- Experimenting with hybrid retrieval (BM25 + dense vectors)
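For the hybrid-retrieval idea above, reciprocal rank fusion (RRF) is a common way to merge BM25 and dense rankings without tuning score scales. A minimal sketch, where each ranking is just an ordered list of doc IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: one ranked list of doc IDs per retriever (e.g., BM25, dense)
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Standard RRF: 1 / (k + rank), summed across retrievers
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear near the top of both rankings float to the top of the fused list, which tends to be more robust than either retriever alone.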
For a comprehensive look at RAG architecture, see The Ultimate Guide to RAG Pipelines: Building Reliable Retrieval-Augmented Generation Systems. To further improve accuracy and reduce hallucinations, check out Reducing Hallucinations in RAG Workflows: Prompting and Retrieval Strategies for 2026.
With these techniques, your RAG system will be ready for real-world scale—without breaking the bank.
