Retrieval-Augmented Generation (RAG) is transforming how we build LLM-powered applications, but scaling these systems to handle 100,000+ documents introduces unique challenges. In this deep-dive, we’ll walk through practical strategies—sharding, caching, and cost control—to help you scale your RAG pipeline efficiently and reliably.
If you’re new to RAG or want a broader overview of the pipeline, see The Ultimate Guide to RAG Pipelines: Building Reliable Retrieval-Augmented Generation Systems. This article builds on those fundamentals to focus on scaling.
Prerequisites
- Python 3.9+
- Vector Database: Qdrant (v1.6+), Pinecone (2024), or Weaviate (v1.22+)
- LLM API: OpenAI GPT-3.5/4, or an open-source LLM (e.g., Llama 2 via Hugging Face Transformers)
- Embeddings Model: sentence-transformers (v2.2+)
- Knowledge: Familiarity with RAG concepts, Python, and Docker basics
- Hardware: 16GB+ RAM recommended for local vector DB testing
1. Setting Up Your Scalable RAG Pipeline
- Install Core Dependencies

  ```bash
  pip install qdrant-client sentence-transformers fastapi uvicorn cachetools
  ```

  For Pinecone or Weaviate, replace `qdrant-client` with the appropriate SDK.

- Start Qdrant Locally (for testing)

  ```bash
  docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
  ```

  This launches Qdrant on `localhost:6333`.

- Prepare Your Embedding Model

  ```python
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer('all-MiniLM-L6-v2')
  ```

- Initialize Vector Store Connection

  ```python
  from qdrant_client import QdrantClient

  qdrant = QdrantClient("localhost", port=6333)
  COLLECTION_NAME = "docs_sharded"
  ```
2. Sharding: Partitioning Your Dataset for Scale
Sharding splits your document collection into manageable segments, distributing storage and query load. This is essential for search performance and parallelization as you approach or exceed 100K documents.
- Design a Sharding Strategy

  - By Document Type: News, blogs, manuals, etc.
  - By Time Period: Year, quarter, etc.
  - By Hash: Consistent hash of document ID

  For example, sharding by hash. Note that Python's built-in `hash()` is randomized per process for strings, so use a stable hash to keep shard assignments consistent across restarts:

  ```python
  import hashlib

  def get_shard(doc_id, num_shards):
      # Stable hash: the same doc_id always maps to the same shard
      digest = hashlib.md5(str(doc_id).encode("utf-8")).hexdigest()
      return int(digest, 16) % num_shards
  ```

- Create Sharded Collections in Qdrant

  ```python
  NUM_SHARDS = 4

  for shard in range(NUM_SHARDS):
      qdrant.recreate_collection(
          collection_name=f"{COLLECTION_NAME}_shard_{shard}",
          vectors_config={"size": 384, "distance": "Cosine"}
      )
  ```

- Insert Documents into Appropriate Shard

  ```python
  def insert_document(doc_id, text, num_shards=NUM_SHARDS):
      vec = model.encode([text])[0]
      shard = get_shard(doc_id, num_shards)
      qdrant.upsert(
          collection_name=f"{COLLECTION_NAME}_shard_{shard}",
          points=[{
              "id": doc_id,
              "vector": vec.tolist(),
              "payload": {"text": text}
          }]
      )
  ```

  This ensures each document is stored in the correct shard.

- Query All Shards in Parallel

  ```python
  import concurrent.futures

  def search_all_shards(query, top_k=5):
      vec = model.encode([query])[0]
      results = []
      with concurrent.futures.ThreadPoolExecutor() as executor:
          futures = [
              executor.submit(
                  qdrant.search,
                  collection_name=f"{COLLECTION_NAME}_shard_{shard}",
                  query_vector=vec.tolist(),
                  limit=top_k
              )
              for shard in range(NUM_SHARDS)
          ]
          for f in concurrent.futures.as_completed(futures):
              results.extend(f.result())
      # Sort by score, deduplicate by document ID, keep the global top_k
      results.sort(key=lambda x: x.score, reverse=True)
      seen, merged = set(), []
      for res in results:
          if res.id not in seen:
              seen.add(res.id)
              merged.append(res)
      return merged[:top_k]
  ```

  This approach enables horizontal scaling and parallelizes the per-shard searches.
3. Caching: Speeding Up Retrieval and Reducing Load
Caching is crucial for high-traffic RAG systems. It reduces vector DB queries, speeds up repeated searches, and lowers infrastructure costs.
- Add a Query Cache Layer (In-Memory Example)

  ```python
  from cachetools import LRUCache, cached

  query_cache = LRUCache(maxsize=10000)

  @cached(query_cache)
  def cached_search(query, top_k=5):
      return search_all_shards(query, top_k)
  ```

  For distributed caching (Redis), use `redis-py` or `aioredis` and serialize results.

- Cache Embeddings for Popular Documents

  ```python
  embedding_cache = {}

  def get_or_encode(text):
      if text in embedding_cache:
          return embedding_cache[text]
      vec = model.encode([text])[0]
      embedding_cache[text] = vec
      return vec
  ```

  This avoids recomputing embeddings for frequently seen queries or documents.

- Implement HTTP Caching in Your API

  ```python
  from fastapi import FastAPI, Request, Response
  from fastapi.responses import JSONResponse

  app = FastAPI()

  @app.get("/search")
  async def search_endpoint(query: str, request: Request):
      # Generate a weak ETag based on the query (quoted, per the HTTP spec)
      etag = f'W/"{hash(query)}"'
      # Honor If-None-Match for conditional requests
      if request.headers.get("if-none-match") == etag:
          return Response(status_code=304)
      results = cached_search(query)
      return JSONResponse(content={"results": results}, headers={"ETag": etag})
  ```

  This allows browser and proxy caching for repeated queries.
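When you move to a distributed cache, the cache key and the serialization format become part of your design. A minimal sketch of one approach, assuming JSON-serializable results (the `make_cache_key` and `serialize_results` helpers are illustrative, not part of any Redis client API):

```python
import hashlib
import json

def make_cache_key(query, top_k=5):
    # Normalize so "Scaling  RAG " and "scaling rag" share a cache entry
    normalized = " ".join(query.lower().split())
    # Include every parameter that affects results in the key
    payload = json.dumps({"q": normalized, "k": top_k}, sort_keys=True)
    return "rag:search:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()

def serialize_results(results):
    # Store plain dicts, not client-specific result objects
    return json.dumps([
        {"id": r["id"], "score": r["score"], "text": r["text"]}
        for r in results
    ])
```

Keys built this way are stable across processes, unlike Python's built-in `hash()`, which makes them safe to share between API replicas.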
4. Cost Control: Optimizing for Budget and Performance
At scale, RAG costs can balloon—especially for vector DBs, LLM queries, and cloud storage. Here’s how to keep expenses in check:
- Batch Insert and Query Operations

  ```python
  def batch_insert(docs):
      for shard in range(NUM_SHARDS):
          batch = [
              {
                  "id": doc["id"],
                  "vector": model.encode([doc["text"]])[0].tolist(),
                  "payload": {"text": doc["text"]}
              }
              for doc in docs
              if get_shard(doc["id"], NUM_SHARDS) == shard
          ]
          if batch:
              qdrant.upsert(
                  collection_name=f"{COLLECTION_NAME}_shard_{shard}",
                  points=batch
              )
  ```

  This reduces API calls and speeds up ingestion.
- Limit LLM Calls with Pre-Ranking

  ```python
  def pre_rank(results, query):
      # Use a lightweight model or BM25 to filter before the LLM;
      # here, simple keyword overlap as an example
      scored = [
          (res, len(set(query.lower().split()) & set(res.payload["text"].lower().split())))
          for res in results
      ]
      # Only the top 2 results go to the LLM
      return sorted(scored, key=lambda x: x[1], reverse=True)[:2]
  ```

  See more on this in Reducing Hallucinations in RAG Workflows: Prompting and Retrieval Strategies for 2026.
- Monitor and Tune Vector DB Usage

  ```python
  info = qdrant.get_collection(collection_name=f"{COLLECTION_NAME}_shard_0")
  print(info)
  ```

  Set up alerts for collection size and query latency, and scale the number of shards up or down as needed.
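The alerting itself can start as a simple threshold check; a sketch under assumed inputs (the thresholds, and the `points_count`/latency values, are illustrative — in practice pull them from `get_collection` and your request logs):

```python
def check_alerts(points_count, p95_latency_ms,
                 max_points=500_000, max_latency_ms=200):
    # Return human-readable alerts when a shard outgrows its budget
    alerts = []
    if points_count > max_points:
        alerts.append(f"shard too large: {points_count} points (limit {max_points})")
    if p95_latency_ms > max_latency_ms:
        alerts.append(f"slow queries: p95 {p95_latency_ms}ms (limit {max_latency_ms}ms)")
    return alerts
```

Running this on a schedule per shard gives you an early signal that it is time to re-shard.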
- Consider Hybrid Storage (Cold/Warm/Hot)

  - Keep top-used documents in the fast vector DB (hot)
  - Archive rarely used docs in cheap object storage (cold)
  - Periodically refresh shards based on access patterns

  This hybrid approach can cut costs dramatically for large, infrequently accessed corpora.
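One way to drive the hot/cold split is a simple access counter; a minimal sketch, where the thresholds and the `assign_tier` policy are assumptions for illustration:

```python
from collections import Counter

access_counts = Counter()

def record_access(doc_id):
    # Call this on every retrieval hit for the document
    access_counts[doc_id] += 1

def assign_tier(doc_id, hot_threshold=100, warm_threshold=10):
    # Hot docs stay in the vector DB; cold docs can move to object storage
    hits = access_counts[doc_id]
    if hits >= hot_threshold:
        return "hot"
    if hits >= warm_threshold:
        return "warm"
    return "cold"
```

A periodic job can then walk the corpus, demote documents that have gone cold, and reset the counters for the next window.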
Common Issues & Troubleshooting
- Slow Queries Across Shards: Ensure you're querying shards in parallel. Check CPU utilization and increase `ThreadPoolExecutor` workers if needed.
- Out-of-Memory Errors: Lower cache size, use disk-based caching (e.g., Redis), or increase server RAM.
- Duplicate Results: Deduplicate by document ID after merging shard results.
- Vector DB Connection Issues: Confirm Docker/network settings; check logs with `docker logs [container_id]`.
- Cache Not Reducing Latency: Profile cache hit rates. If low, adjust cache keys or increase cache size.
- LLM Cost Spikes: Double-check pre-ranking logic to ensure only a small set of top results are sent to the LLM.
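Profiling cache hit rates doesn't require extra tooling; a small wrapper around any dict-like cache can report them. A sketch (the class and method names are illustrative):

```python
class HitRateTracker:
    """Wrap any dict-like cache to count hits and misses."""

    def __init__(self, cache):
        self.cache = cache
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        # Serve from cache on a hit; otherwise compute and store
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = compute()
        self.cache[key] = value
        return value

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

If the reported hit rate stays low in production, that usually points at unstable cache keys (e.g., unnormalized queries) rather than an undersized cache.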
Next Steps
By combining sharding, caching, and cost control, you can reliably scale your RAG pipeline to 100K+ documents and beyond. For production, consider:
- Deploying your vector DB in a managed cloud for high availability
- Implementing distributed caching (e.g., Redis Cluster)
- Adding monitoring and automated scaling policies
- Experimenting with hybrid retrieval (BM25 + dense vectors)
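For the hybrid-retrieval idea above, reciprocal rank fusion (RRF) is a common way to merge BM25 and dense rankings without tuning score scales. A minimal sketch, where each ranking is just an ordered list of doc IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: one ranked list of doc IDs per retriever (e.g., BM25, dense)
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Standard RRF: 1 / (k + rank), summed across retrievers
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear near the top of both rankings float to the top of the fused list, which tends to be more robust than either retriever alone.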
For a comprehensive look at RAG architecture, see The Ultimate Guide to RAG Pipelines: Building Reliable Retrieval-Augmented Generation Systems. To further improve accuracy and reduce hallucinations, check out Reducing Hallucinations in RAG Workflows: Prompting and Retrieval Strategies for 2026.
With these techniques, your RAG system will be ready for real-world scale—without breaking the bank.
