Automating contract workflows is fast becoming a game-changer for legal teams, operations, and procurement. By combining Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs), you can move beyond simple document automation to enable intelligent, context-aware contract review, drafting, and approval flows. This deep-dive tutorial will guide you through building a robust, end-to-end automated contract workflow using open-source tools and APIs.
For broader context on where contract automation fits in the enterprise, see our comprehensive guide to business process automation with AI. Here, we’ll focus specifically on the contract domain and the technical details of implementing RAG+LLM-powered automation.
Prerequisites
- Technical Skills: Intermediate Python, basic Docker, REST APIs, and familiarity with NLP concepts.
- System Requirements: Linux/macOS/Windows (WSL2), 16GB+ RAM recommended.
- Python: 3.9 or higher
- Docker: 20.10 or higher
- Git: 2.30 or higher
- LLM API Access: OpenAI API key (for GPT-4/3.5) or local LLM (e.g., Llama.cpp, Ollama)
- Vector Database: We’ll use ChromaDB (open source), but you can swap in Pinecone, Weaviate, or Qdrant.
- Sample Contracts: A folder of PDF or DOCX contracts for testing.
Step 1: Set Up Your Project Environment
- Clone the Starter Repository
  We’ll use a minimal RAG pipeline starter. Clone it:

      git clone https://github.com/your-org/rag-contract-workflow-starter.git
      cd rag-contract-workflow-starter

- Create and Activate a Python Virtual Environment

      python3 -m venv .venv
      source .venv/bin/activate

- Install Dependencies

      pip install -r requirements.txt

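  Key packages: langchain, chromadb, openai, pdfplumber, python-docx, fastapi. The import paths in this tutorial follow the classic langchain package layout; recent LangChain releases have moved several of these classes into companion packages (for example langchain-openai and langchain-community), so adjust the imports if your installed version differs.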
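
Before moving on, it can help to confirm that the environment matches the prerequisites. The following is a minimal sketch, not part of the starter repository: `check_environment` is a hypothetical helper, and it assumes you are using the OpenAI API (skip the key check if you run a local LLM).

```python
import os
import sys


def check_environment(env, version=sys.version_info):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # The tutorial's prerequisites ask for Python 3.9 or higher.
    if tuple(version[:2]) < (3, 9):
        problems.append("Python 3.9 or higher is required")
    # Only relevant when the OpenAI API is the chosen LLM backend.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


# Print a warning for each problem found in the current process environment.
for problem in check_environment(os.environ):
    print("WARNING:", problem)
```

Passing the environment and version in as arguments keeps the check easy to test without touching real settings.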
Step 2: Ingest and Chunk Contracts
- Extract Text from Contracts
  Place your sample contracts in ./contracts/. Use pdfplumber for PDFs and python-docx for DOCX files.

      import os  # used in later steps to read OPENAI_API_KEY
      import pdfplumber

      def extract_pdf_text(filepath):
          # extract_text() can return None for pages without a text layer
          # (e.g. scans), so call it once per page and skip empty results.
          with pdfplumber.open(filepath) as pdf:
              page_texts = (page.extract_text() for page in pdf.pages)
              return "\n".join(text for text in page_texts if text)

      pdf_text = extract_pdf_text('./contracts/sample_contract.pdf')
      print(pdf_text[:500])  # preview the first 500 characters

  For DOCX:

      from docx import Document

      def extract_docx_text(filepath):
          # Collect non-empty paragraphs; text inside tables is not included.
          doc = Document(filepath)
          return "\n".join(para.text for para in doc.paragraphs if para.text.strip())

      docx_text = extract_docx_text('./contracts/sample_contract.docx')

- Chunk the Contract Text
  RAG works best when documents are split into semantically meaningful chunks (e.g., clauses, sections).

      from langchain.text_splitter import RecursiveCharacterTextSplitter

      # chunk_size and chunk_overlap are measured in characters by default
      text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100)
      chunks = text_splitter.split_text(pdf_text)
      print(chunks[:2])

  Tip: Adjust chunk_size and chunk_overlap for your contract type.
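
Character-based splitting can cut a clause in half. As one possible way to get closer to the clause-level chunks mentioned above, the sketch below splits on lines that look like clause headings. The heading pattern and the `split_by_clause` helper are assumptions for illustration; real contracts number their clauses in many ways, so expect to tune the regular expression.

```python
import re

# Assumed heading styles: "1.", "2.3", "2.3.1", "Section 4", "ARTICLE IV".
# This is a heuristic: a wrapped line that starts with e.g. "3.50" also matches.
HEADING = re.compile(
    r"^(?:(?:section|article)\s+[\dIVXLC]+|\d+\.(?:\d+\.?)*)\s+\S",
    re.IGNORECASE | re.MULTILINE,
)


def split_by_clause(text):
    """Split contract text at lines that look like clause headings."""
    starts = [match.start() for match in HEADING.finditer(text)]
    if not starts:
        # No headings detected: return the whole text as a single chunk.
        return [text.strip()] if text.strip() else []
    if starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(text)]
    pieces = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [piece for piece in pieces if piece]


sample = (
    "MASTER SERVICES AGREEMENT\n"
    "1. Term\nThis agreement lasts one year.\n"
    "2. Termination\nEither party may terminate with 30 days notice.\n"
    "Section 3 Governing Law\nDelaware law applies."
)
clauses = split_by_clause(sample)
for clause in clauses:
    print(repr(clause[:40]))
```

Clauses that are still longer than your chunk size can be passed through RecursiveCharacterTextSplitter afterwards, so that each chunk stays within one clause.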
Step 3: Embed and Store Contract Chunks in a Vector Database
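- Set Up ChromaDB
  To run ChromaDB as a standalone server, start it with Docker:

      docker run -d -p 8000:8000 chromadb/chroma:latest

  Note that the storage code below runs Chroma in embedded mode and writes to ./chroma_db, so it works without this container. To use the Docker server instead, create a chromadb.HttpClient(host="localhost", port=8000) and pass it to the Chroma constructor as the client argument.

- Generate Embeddings for Each Chunk
  Use OpenAI or HuggingFace embedding models. Here’s how to use OpenAI:

      import os
      from langchain.embeddings import OpenAIEmbeddings

      embeddings = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_API_KEY"])
      # Shown for inspection only: add_texts() in the next step computes
      # embeddings itself, so this call is not required for storage.
      chunk_embeddings = embeddings.embed_documents(chunks)

- Store Chunks and Embeddings in ChromaDB

      from langchain.vectorstores import Chroma

      vectorstore = Chroma(
          collection_name="contracts",
          embedding_function=embeddings,
          persist_directory="./chroma_db",
      )
      vectorstore.add_texts(chunks)  # embeds each chunk and stores text + vector
      vectorstore.persist()  # flush to disk; newer Chroma versions persist automatically

  This stores all your contract chunks, indexed by semantic meaning for fast retrieval.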
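
Storing bare text makes it hard to tell later which contract a chunk came from, and re-running ingestion can add the same chunk twice. A minimal sketch of one way to address both is to attach metadata and deterministic IDs to each chunk. `build_records` is a hypothetical helper, and the metadata keys are assumptions rather than anything Chroma requires.

```python
import hashlib


def build_records(chunks, source):
    """Pair each chunk with a deterministic ID and traceability metadata."""
    ids, metadatas = [], []
    for index, chunk in enumerate(chunks):
        # Same source + position + text always yields the same ID.
        key = f"{source}:{index}:{chunk}".encode("utf-8")
        ids.append(hashlib.sha256(key).hexdigest()[:16])
        metadatas.append({"source": source, "chunk_index": index})
    return ids, metadatas


example_chunks = ["1. Term ...", "2. Termination ..."]
ids, metadatas = build_records(example_chunks, "sample_contract.pdf")
print(ids)
print(metadatas)

# Assumed usage with the vector store from this step; to the best of our
# knowledge add_texts() accepts optional metadatas and ids arguments:
# vectorstore.add_texts(example_chunks, metadatas=metadatas, ids=ids)
```

With metadata in place, the source documents returned by the retriever carry the file name and chunk position, which helps reviewers locate the clause in the original contract.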
Step 4: Build the RAG Pipeline for Contract QA and Review
- Define the Retrieval + Generation Chain

      from langchain.chains import RetrievalQA
      from langchain.llms import OpenAI

      # temperature=0 keeps answers as repeatable as possible for review work
      llm = OpenAI(temperature=0, openai_api_key=os.environ["OPENAI_API_KEY"])
      qa_chain = RetrievalQA.from_chain_type(
          llm=llm,
          retriever=vectorstore.as_retriever(),
          return_source_documents=True,  # keep the retrieved chunks for traceability
      )

- Test Contract Q&A
  Try asking a contract-specific question:

      query = "What is the termination clause in this contract?"
      result = qa_chain({"query": query})
      print("Answer:", result["result"])
      print("\nSource Chunks:", [doc.page_content[:200] for doc in result["source_documents"]])

  The output shows the extracted answer and the relevant contract chunk(s) for traceability.
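
To make the retrieval + generation step less of a black box, here is a simplified illustration of the general idea: the retrieved chunks are placed into a prompt ahead of the question. This is a sketch of the concept, not LangChain's actual implementation; `build_qa_prompt`, the prompt wording, and the `max_chars` budget are all assumptions.

```python
def build_qa_prompt(question, retrieved_chunks, max_chars=3000):
    """Concatenate retrieved chunks into one prompt, stopping at a size budget."""
    context_parts = []
    used = 0
    for number, chunk in enumerate(retrieved_chunks, start=1):
        block = f"[{number}] {chunk.strip()}"
        if used + len(block) > max_chars:
            break  # drop lower-ranked chunks that would exceed the budget
        context_parts.append(block)
        used += len(block)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the contract excerpts below. "
        "If the answer is not in the excerpts, say 'Not found'.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


prompt = build_qa_prompt(
    "What is the notice period for termination?",
    ["Either party may terminate with 30 days notice.", "x" * 5000],
)
print(prompt)
```

Numbering the excerpts gives the model something concrete to cite, and the explicit 'Not found' instruction matches the review prompts used in the next step.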
Step 5: Automate Contract Review Workflows
- Define Review Criteria as Prompts
  For example, check if a contract has a data privacy clause:

      review_prompt = """
      You are a contract analyst. Does the following contract contain a data privacy clause?
      If yes, summarize it. If not, state 'Not found'.

      Contract excerpt:
      {context}
      """

- Automate Multi-Step Review
  Loop through key questions or criteria:

      criteria = [
          "Does the contract specify governing law?",
          "Is there a limitation of liability clause?",
          "What is the payment schedule?",
      ]
      for crit in criteria:
          result = qa_chain({"query": crit})
          print(f"{crit}\n- {result['result']}\n")

  The console output lists each review criterion with the extracted answer, enabling checklist-style review.

- Trigger Automated Approvals or Escalations
  You can integrate this with workflow tools (e.g., Slack, email, Jira) using FastAPI endpoints:

      from fastapi import FastAPI, Request

      app = FastAPI()

      @app.post("/review_contract/")
      async def review_contract(request: Request):
          data = await request.json()
          contract_path = data["contract_path"]
          # (Extract, chunk, embed, QA as above.)
          # If all criteria met, trigger approval webhook
          # Else, send notification for manual review
          return {"status": "review_complete"}

  Tip: Use Zapier or n8n for no-code integration with business systems.
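
The endpoint above leaves the approve-or-escalate decision as comments. One way that logic could look is sketched below. The `review` and `is_satisfied` helpers, the result shape, and the rule that a negative-sounding answer counts as unmet are assumptions for illustration; a production workflow would likely need stricter, criterion-specific checks.

```python
import re

# Answers that begin like this are treated as "criterion not met".
# "Notice ..." does not match because of the word boundary after "no".
NEGATIVE = re.compile(r"^(no\b|not found|none\b)", re.IGNORECASE)


def is_satisfied(answer):
    """Treat empty or negative-sounding answers as an unmet criterion."""
    text = (answer or "").strip()
    return bool(text) and NEGATIVE.match(text) is None


def review(criteria, ask):
    """Run each criterion through `ask` and decide to approve or escalate."""
    findings = []
    for criterion in criteria:
        answer = ask(criterion)
        findings.append(
            {"criterion": criterion, "answer": answer, "met": is_satisfied(answer)}
        )
    unmet = [f["criterion"] for f in findings if not f["met"]]
    decision = "approve" if not unmet else "escalate"
    return {"decision": decision, "unmet": unmet, "findings": findings}


# Stubbed answers stand in for the LLM so the sketch runs offline.
# With the chain from Step 4, `ask` could be:
#   lambda c: qa_chain({"query": c})["result"]
stub_answers = {
    "Does the contract specify governing law?": "Delaware law governs.",
    "Is there a limitation of liability clause?": "Not found",
}
outcome = review(list(stub_answers), stub_answers.get)
print(outcome["decision"], outcome["unmet"])
```

Keeping the decision in a plain function, separate from the FastAPI handler, makes it possible to test the escalation rules without calling the LLM or any webhook.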
Step 6: (Optional) Add Contract Drafting or Redlining with LLMs
- Automate Clause Suggestions or Redlines
  Use the LLM to suggest missing clauses or generate redline text:

      drafting_prompt = """
      You are a contract lawyer. Given the following contract excerpt, suggest a standard data privacy clause if missing.

      Excerpt:
      {context}
      """
      # chunks[0] is only a placeholder; in practice pass the excerpt under review
      response = llm(drafting_prompt.format(context=chunks[0]))
      print(response)

- Integrate with Document Editing Tools
  Use python-docx or PDF libraries to insert LLM-generated clauses directly into contract drafts.
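
When the LLM proposes revised wording, reviewers usually want to see exactly what changed. A minimal sketch of a word-level redline using only the standard library follows. The `redline` helper and the `[-deleted-]` / `{+inserted+}` marker style are assumptions for illustration; they are plain-text markers, not tracked changes in a DOCX file.

```python
import difflib


def redline(original, revised):
    """Return a word-level redline with [-deleted-] and {+inserted+} markers."""
    old_words, new_words = original.split(), revised.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(old_words[i1:i2])
            continue
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(out)


marked = redline("Payment is due in 30 days.", "Payment is due in 45 days.")
print(marked)  # Payment is due in [-30-] {+45+} days.
```

The same comparison can be run between an existing clause and the LLM's suggested replacement before anything is written back into the draft.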
Common Issues & Troubleshooting
- Embeddings are not relevant / Poor answers: Check that chunking is granular enough and that the embedding model is appropriate for legal text. Try text-embedding-ada-002 or a domain-specific model.
- ChromaDB fails to start: Ensure Docker is running and port 8000 is free. Try docker ps and docker logs <container_id>.
- OpenAI API errors: Verify your API key, quota, and network connection. Run export OPENAI_API_KEY=sk-... before running scripts.
- Contract extraction fails: Some PDFs are scanned images, not text. Run them through an OCR tool such as tesseract first; text extractors like pdfminer.six (which pdfplumber builds on) only read an existing text layer.
- Performance issues with large contracts: Increase chunk size, batch embedding requests, or move to a vector database designed for larger collections (e.g., Qdrant).
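
Transient API errors such as rate limits or brief network failures are often resolved by simply trying again after a pause. The sketch below shows a generic retry with exponential backoff. `with_retries` is a hypothetical helper, and catching every exception is a simplification; in practice you would likely retry only on the specific error types your client library raises.

```python
import time


def with_retries(call, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `call()` and retry on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Wait 1x, 2x, 4x ... the base delay between attempts.
            sleep(base_delay * 2 ** attempt)


# Assumed usage with the chain from Step 4:
# result = with_retries(lambda: qa_chain({"query": "What is the payment schedule?"}))
```

The `sleep` parameter is injectable so that the backoff schedule can be checked in tests without actually waiting.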
Next Steps
- Expand to More Contract Types: Add support for additional formats and domain-specific review prompts.
- Integrate with E-signature and CLM platforms: Automate downstream actions post-review.
- Enhance Security: Add user authentication and audit logging to your FastAPI endpoints.
- Scale Up: Deploy your workflow on Kubernetes or with serverless functions for production use.
- Vendor Evaluation: For a comparison of commercial platforms versus open-source approaches, see our guide to evaluating AI business automation vendors.
By following this tutorial, you’ve built a foundational RAG+LLM contract workflow that can automate review, Q&A, and even drafting tasks. As we covered in our complete guide to business process automation with AI, contract workflows are just one area where these techniques can drive efficiency and compliance at scale.
