Document data extraction is at the heart of modern automation — from invoice processing to compliance workflows. While proprietary platforms abound, open-source AI tools now offer powerful, customizable, and cost-effective alternatives for organizations seeking control and transparency. This tutorial walks you through building a robust, production-ready document data extraction workflow using open-source AI in 2026. For a broader context on the impact and strategy of AI-powered document processing, see The Ultimate Guide to AI-Powered Document Processing Automation in 2026.
Prerequisites
- Technical Skills: Intermediate Python (3.10+), basic Linux command line, understanding of REST APIs.
- System Requirements: Linux (Ubuntu 22.04+ recommended), 16GB RAM+, GPU (NVIDIA RTX 30xx+ or A100 for large models), Docker (v25+).
- Open-Source Tools:
  - `pdf2image` (v1.17+): PDF-to-image conversion
  - Tesseract OCR (v5.3+): optical character recognition
  - Haystack (v2.0+): document AI pipeline framework
  - Transformers (v4.50+): foundation models (e.g., LayoutLMv3, Donut)
  - Docker Compose (v2.27+): orchestration
- Sample Data: PDF documents (invoices, forms, contracts, etc.)
Set Up Your Environment

Install System Dependencies

Ensure `git`, `python3`, `pip`, and `docker` are installed:

```shell
sudo apt update
sudo apt install -y git python3 python3-pip docker.io docker-compose poppler-utils tesseract-ocr
```
Set Up Python Virtual Environment

```shell
python3 -m venv doc-ai-env
source doc-ai-env/bin/activate
```
Install Required Python Packages

```shell
pip install pdf2image==1.17.0 pytesseract==0.3.10 haystack-ai==2.0.0 transformers==4.50.0 torch==2.2.0
```
Test GPU Availability (Optional but Recommended)

```shell
python -c "import torch; print(torch.cuda.is_available())"
```

If this prints `False`, ensure your NVIDIA drivers and CUDA toolkit are properly installed.
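So that the rest of the tutorial's scripts run on whichever hardware is available, a common pattern is to pick the device once at startup. A small sketch, not specific to any model in this tutorial:

```python
import torch

# Prefer the GPU when available, fall back to CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on: {device}")

# Models and tensors are then moved onto the chosen device, e.g.:
# model = model.to(device)
# pixel_values = pixel_values.to(device)
```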
Convert PDFs to Images
Most open-source document AI models work on images. We'll use `pdf2image` for conversion.

Convert a PDF to Images

```python
from pdf2image import convert_from_path

# Render each PDF page as a 300 DPI PNG
pages = convert_from_path('sample_invoice.pdf', dpi=300)
for i, page in enumerate(pages):
    page.save(f'page_{i+1}.png', 'PNG')
```

Screenshot description: A folder view showing `sample_invoice.pdf` and the generated `page_1.png`, `page_2.png`.
Run OCR to Extract Raw Text
We'll use Tesseract to extract text from the images. This step is essential for text-based models and for fallback when layout models struggle.
Run OCR on an Image

```python
import pytesseract
from PIL import Image

img = Image.open('page_1.png')
text = pytesseract.image_to_string(img)
print(text)
```

Screenshot description: Terminal output showing raw extracted text, including invoice numbers, dates, and line items.
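The raw OCR text also serves as the fallback when a layout model misses a field. A minimal sketch — the field names and regex patterns here are illustrative assumptions to adapt to your own documents:

```python
import re

def fallback_extract(text):
    """Pull common invoice fields out of raw OCR text with regexes."""
    patterns = {
        "invoice_number": r"Invoice\s*(?:No\.?|Number|#)\s*[:\s]\s*([A-Z0-9-]+)",
        "invoice_date": r"\b(\d{4}-\d{2}-\d{2})\b",
        "total_amount": r"Total\s*[:\s]\s*\$?([\d,]+\.\d{2})",
    }
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            fields[name] = match.group(1)
    return fields

sample = "Invoice No: INV-2041\nDate: 2026-01-15\nTotal: $1,234.50"
print(fallback_extract(sample))
```

Regexes are brittle across vendors, so treat this as a safety net behind the layout model rather than a primary extractor.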
Set Up a Layout-Aware Extraction Model
For structured documents (invoices, forms), layout-aware models like LayoutLMv3 or Donut (Document Understanding Transformer) excel at extracting key data fields. We'll use Hugging Face's Transformers and Haystack for orchestration.
Load and Run Donut Model for Key Information Extraction
```python
import re
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

image = Image.open("page_1.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Donut is steered by a task prompt; this checkpoint expects <s_cord-v2>
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
)

sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<s_.*?>", "", sequence, count=1).strip()  # drop the task prompt token
print(processor.token2json(sequence))
```

Screenshot description: Terminal output displaying a JSON-like structure with fields such as "invoice_date", "total_amount", "vendor_name", etc.
For more on comparing document classification and extraction models, see Comparing Top Document Classification Models for AI Workflow Automation.
Build a Modular Extraction Pipeline with Haystack
Haystack enables you to orchestrate multi-stage document workflows, combining OCR, layout models, and custom logic. We'll define a simple pipeline that:
- Converts PDFs to images
- Extracts text via OCR
- Runs a layout-aware model for key fields
- Outputs structured data (JSON)
Sample Haystack Pipeline (Python)

Haystack 2.x replaces the 1.x node API with components; the sketch below wraps Donut in a small custom component so the whole PDF-to-JSON flow runs as one pipeline:

```python
import re
from haystack import Pipeline, component
from pdf2image import convert_from_path
from transformers import DonutProcessor, VisionEncoderDecoderModel

@component
class DonutExtractor:
    # Runs Donut over each page of a PDF and returns the extracted fields.
    def __init__(self, model_name="naver-clova-ix/donut-base-finetuned-cord-v2"):
        self.processor = DonutProcessor.from_pretrained(model_name)
        self.model = VisionEncoderDecoderModel.from_pretrained(model_name)

    @component.output_types(fields=list)
    def run(self, pdf_path: str):
        fields = []
        for page in convert_from_path(pdf_path, dpi=300):
            pixel_values = self.processor(page.convert("RGB"), return_tensors="pt").pixel_values
            prompt_ids = self.processor.tokenizer(
                "<s_cord-v2>", add_special_tokens=False, return_tensors="pt"
            ).input_ids
            outputs = self.model.generate(
                pixel_values,
                decoder_input_ids=prompt_ids,
                max_length=self.model.decoder.config.max_position_embeddings,
            )
            seq = self.processor.batch_decode(outputs)[0]
            seq = seq.replace(self.processor.tokenizer.eos_token, "")
            seq = re.sub(r"<s_.*?>", "", seq, count=1).strip()
            fields.append(self.processor.token2json(seq))
        return {"fields": fields}

pipeline = Pipeline()
pipeline.add_component("extractor", DonutExtractor())
result = pipeline.run({"extractor": {"pdf_path": "sample_invoice.pdf"}})
print(result)
```

Screenshot description: Terminal output showing a structured dictionary with extracted fields and confidence scores.
Haystack pipelines can be extended with custom nodes, REST APIs, and database connectors. For inspiration on integrating external data sources, see Integrating External Data Sources: Best APIs for AI Document Workflow Automation (2026).
Deploy as a REST API Service (Optional)
To enable integration with other systems, deploy your pipeline as a REST API using Haystack's built-in FastAPI support.
Start Haystack REST API

```shell
haystack-api --pipeline_path=your_pipeline.yaml --port=8000
```

Screenshot description: Terminal showing the Haystack API running and accessible at http://localhost:8000/docs.
Test the API with curl

```shell
curl -X POST "http://localhost:8000/query" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{"file": "base64-encoded-pdf-data"}'
```
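The same request can be made from Python using only the standard library. The endpoint and the `{"file": ...}` payload shape below simply mirror the curl example and are assumptions about how your pipeline is configured:

```python
import base64
import json
from urllib import request

def build_payload(pdf_bytes):
    """Base64-encode the PDF and wrap it in the JSON body the API expects."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return json.dumps({"file": encoded}).encode("utf-8")

def extract_from_pdf(path, api_url="http://localhost:8000/query"):
    """POST a PDF to the extraction API and return the parsed JSON response."""
    with open(path, "rb") as f:
        payload = build_payload(f.read())
    req = request.Request(
        api_url,
        data=payload,
        headers={"accept": "application/json", "Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())

# result = extract_from_pdf("sample_invoice.pdf")
```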
Post-Processing: Clean, Validate, and Export Data
The last mile: ensure extracted data is reliable and ready for downstream systems.
Clean and Validate Fields

```python
import re, json

def validate_invoice(data):
    # Example: check for valid date and amount fields
    date_ok = re.match(r"\d{4}-\d{2}-\d{2}", data.get("invoice_date", ""))
    amount_ok = re.match(r"^\d+(\.\d{2})?$", str(data.get("total_amount", "")))
    return bool(date_ok and amount_ok)

with open("extracted_invoice.json") as f:
    invoice = json.load(f)

if validate_invoice(invoice):
    print("Invoice data valid!")
else:
    print("Validation failed:", invoice)
```
Export to CSV/Database

```python
import csv

with open('invoices.csv', 'w', newline='') as csvfile:
    fieldnames = ['invoice_date', 'total_amount', 'vendor_name']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow(invoice)
```
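For the database half of this step, the validated record can go into SQLite, which ships with Python. The `invoices` table and its columns are assumptions mirroring the CSV fields, and a sample record is inlined to keep the sketch self-contained:

```python
import sqlite3

# Hypothetical record; in the pipeline this comes from the validated JSON.
invoice = {"invoice_date": "2026-01-15", "total_amount": "1234.50", "vendor_name": "Acme Corp"}

conn = sqlite3.connect("invoices.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS invoices (
           invoice_date TEXT,
           total_amount TEXT,
           vendor_name TEXT
       )"""
)
# Parameterized insert keeps the values safely escaped.
conn.execute(
    "INSERT INTO invoices (invoice_date, total_amount, vendor_name) VALUES (?, ?, ?)",
    (invoice["invoice_date"], invoice["total_amount"], invoice["vendor_name"]),
)
conn.commit()
conn.close()
```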
Common Issues & Troubleshooting
- OCR Accuracy is Low: Try increasing image DPI (e.g., 300+), use image pre-processing (binarization, denoising), or check Tesseract language packs.
- Model Out of Memory: Use smaller models (e.g., `layoutlmv3-base`), batch process documents, or upgrade GPU resources.
- Fields Missing in Output: Fine-tune layout models on your document type, or add fallback regex extraction from OCR text.
- Pipeline API Not Responding: Check Docker logs, ensure all containers are healthy, and verify the `pipeline.yaml` configuration.
- Data Validation Fails: Add more robust post-processing, use business rules, and log problematic documents for manual review.
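The pre-processing mentioned for low OCR accuracy (binarization, denoising) can be done with Pillow before Tesseract sees the image. A minimal sketch, with the binarization threshold as a tunable assumption:

```python
from PIL import Image, ImageFilter

def preprocess_for_ocr(path, threshold=150):
    """Grayscale, denoise, and binarize an image to improve OCR accuracy."""
    img = Image.open(path).convert("L")            # grayscale
    img = img.filter(ImageFilter.MedianFilter(3))  # light denoising
    # Binarize: pixels above the threshold become white, the rest black.
    return img.point(lambda p: 255 if p > threshold else 0)

# cleaned = preprocess_for_ocr("page_1.png")
# text = pytesseract.image_to_string(cleaned)
```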
For more on monitoring and auto-remediation in production, see How to Monitor, Alert, and Auto-Remediate Failures in AI-Powered Document Workflows.
Next Steps
- Experiment with domain-specific fine-tuning on your own document sets.
- Integrate human-in-the-loop review for low-confidence extractions.
- Secure your workflow endpoints — see Securing Workflow Automation Endpoints: API Authentication Best Practices for 2026.
- Explore advanced use cases: document redaction, legal review, or regulated industry compliance (see LLM-Powered Document Workflows for Regulated Industries: 2026 Implementation Guide).
- Benchmark against commercial and open-source tools — see Top Open-Source AI Workflow Automation Tools for Developers in 2026.
Document data extraction with open-source AI is rapidly evolving. By combining modular pipelines, layout-aware models, and robust validation, you can build workflows that rival commercial offerings — with full control and transparency. For the latest best practices in prompt engineering for document AI, check out Prompt Engineering for Automated Document Processing: 2026’s Best Practices.
