Extracting structured data from unstructured documents—such as scanned PDFs, emails, contracts, and handwritten notes—is one of the most valuable and challenging tasks in modern business automation. AI-powered workflows can transform this process, enabling organizations to unlock insights, automate repetitive tasks, and dramatically reduce manual effort.
Our complete guide to automating complex document workflows with AI introduced this topic; this tutorial takes a deeper look. It walks you through a practical, reproducible workflow for extracting data from unstructured documents using pretrained AI models and open-source tools.
Prerequisites
- Operating System: Windows 10/11, macOS 12+, or Linux (Ubuntu 20.04+ recommended)
- Python: Version 3.8 or higher
- pip: Python package manager
- Basic Knowledge: Python scripting, command line usage, and JSON data format
- Tools & Libraries:
  - PyMuPDF (for PDF parsing)
  - pytesseract (for OCR)
  - transformers (Hugging Face, for AI models)
  - spaCy (for NLP tasks)
  - Tesseract OCR (system dependency)
- Sample Documents: Unstructured PDFs, images, or scanned documents for testing
Step 1: Set Up Your Environment
- Install Python and pip (if not already installed). On Ubuntu, also install python3-venv, which the virtual environment step below requires:

    sudo apt update
    sudo apt install python3 python3-pip python3-venv
- Install Tesseract OCR (required for image-based text extraction):
  - Ubuntu:

      sudo apt install tesseract-ocr

  - macOS (using Homebrew):

      brew install tesseract

  - Windows: Download and install from Tesseract's GitHub.
- Create and activate a Python virtual environment:

    python3 -m venv ai-doc-extract
    source ai-doc-extract/bin/activate  # On Windows: ai-doc-extract\Scripts\activate
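- Install required Python libraries (the quotes around transformers[torch] keep shells such as zsh from treating the brackets as a glob pattern):

    pip install pymupdf pytesseract pillow "transformers[torch]" spacy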
- Download a spaCy language model:

    python -m spacy download en_core_web_sm
Screenshot Description: Terminal showing successful installations of all dependencies and activation of the virtual environment.
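
Before moving on, it can help to confirm that everything installed in this step is visible from the active virtual environment. The following is a minimal sketch using only the standard library; the function name `check_environment` and the module list are illustrative assumptions, not part of any of the libraries above.

```python
import importlib.util
import shutil

# Import names for the libraries installed above (PyMuPDF imports as "fitz",
# Pillow as "PIL"). Adjust this tuple if your setup differs.
REQUIRED_MODULES = ("fitz", "pytesseract", "PIL", "transformers", "spacy")


def check_environment(modules=REQUIRED_MODULES, binary="tesseract"):
    """Return a dict mapping each dependency to True (found) or False (missing)."""
    status = {name: importlib.util.find_spec(name) is not None for name in modules}
    # shutil.which returns None when the executable is not on PATH.
    status[binary + " (binary)"] = shutil.which(binary) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

A MISSING entry for the tesseract binary usually points to a PATH problem rather than a Python problem. Note that this check does not verify that the en_core_web_sm model has been downloaded.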
Step 2: Ingest and Preprocess Unstructured Documents
- Extract text from PDFs (including scanned PDFs):
  - For text-based PDFs, use PyMuPDF:

    import fitz  # PyMuPDF

    def extract_text_from_pdf(pdf_path):
        doc = fitz.open(pdf_path)
        text = ""
        for page in doc:
            text += page.get_text()
        return text

    raw_text = extract_text_from_pdf("sample.pdf")
    print(raw_text[:500])  # Preview first 500 characters

  Screenshot Description: Python output previewing the extracted text.
- For image-based PDFs or scanned documents, apply OCR with pytesseract:

    from PIL import Image
    import pytesseract
    import fitz

    def extract_text_from_scanned_pdf(pdf_path):
        doc = fitz.open(pdf_path)
        text = ""
        for page_num in range(len(doc)):
            pix = doc[page_num].get_pixmap()
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            text += pytesseract.image_to_string(img)
        return text

    ocr_text = extract_text_from_scanned_pdf("scanned_sample.pdf")
    print(ocr_text[:500])

  Screenshot Description: Output showing OCR-extracted text from a scanned PDF.
- Save the extracted text for downstream processing:

    with open("extracted_text.txt", "w", encoding="utf-8") as f:
        f.write(raw_text or ocr_text)
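
Extracted and OCR'd text often contains irregular spacing and words split across line breaks, which can confuse the models used in the next step. The sketch below shows one possible cleanup pass using only the standard library; the function name `normalize_text` and the specific rules are assumptions for illustration, and you may need different rules for your documents.

```python
import re


def normalize_text(text):
    """Light cleanup for text extracted from PDFs or OCR output."""
    # Rejoin words hyphenated across a line break: "docu-\nment" -> "document".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Remove leading and trailing spaces on every line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Collapse three or more consecutive newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

One caveat: the de-hyphenation rule also merges genuine compounds that happen to break at a hyphen, so treat it as a heuristic and spot-check the results.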
Step 3: Apply AI Models for Information Extraction
- Named Entity Recognition (NER) with spaCy:

    import spacy

    nlp = spacy.load("en_core_web_sm")

    with open("extracted_text.txt", "r", encoding="utf-8") as f:
        document_text = f.read()

    doc = nlp(document_text)
    for ent in doc.ents:
        print(f"{ent.text} ({ent.label_})")

  Screenshot Description: Console output listing detected entities (e.g., names, dates, organizations).
- Custom Information Extraction Using Transformers (Hugging Face):
  - For more complex extraction (e.g., extracting invoice numbers, amounts, or custom fields), use a question-answering model:

    from transformers import pipeline

    qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
    context = document_text
    question = "What is the invoice number?"
    result = qa_pipeline(question=question, context=context)
    print(result["answer"])

  Screenshot Description: Output showing the extracted answer from the document.
- Structure the extracted data as JSON:

    import json

    extracted_data = {
        "entities": [{"text": ent.text, "label": ent.label_} for ent in doc.ents],
        "invoice_number": result["answer"],
    }

    with open("structured_data.json", "w", encoding="utf-8") as f:
        json.dump(extracted_data, f, indent=2)
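
The question-answering pipeline returns a confidence score alongside each answer, and it always returns some span even when the document does not contain the field. A rough sketch of asking several questions and discarding low-confidence answers follows. The helper name `extract_fields`, the field names in `QUESTIONS`, and the 0.3 threshold are all assumptions for illustration; a sensible threshold depends on your documents.

```python
def extract_fields(qa, context, questions, min_score=0.3):
    """Ask one question per field; keep answers scoring at least min_score.

    `qa` is any callable with the Hugging Face question-answering interface:
    qa(question=..., context=...) -> {"answer": str, "score": float, ...}.
    """
    fields = {}
    for name, question in questions.items():
        result = qa(question=question, context=context)
        confident = result["score"] >= min_score
        fields[name] = {
            # None marks a field the model was not confident about.
            "value": result["answer"] if confident else None,
            "score": round(float(result["score"]), 3),
        }
    return fields


# Hypothetical field names and questions; adapt them to your documents.
QUESTIONS = {
    "invoice_number": "What is the invoice number?",
    "total_amount": "What is the total amount due?",
}

# Usage with the pipeline from this step (downloads the model on first run):
#   from transformers import pipeline
#   qa_pipeline = pipeline("question-answering",
#                          model="distilbert-base-cased-distilled-squad")
#   fields = extract_fields(qa_pipeline, document_text, QUESTIONS)
```

Keeping the score in the output makes it easier to route uncertain fields to manual review later.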
Step 4: Automate the Workflow
- Wrap the workflow into a Python script (saved here as doc_extractor.py):

    import json
    import sys

    import fitz  # PyMuPDF
    import pytesseract
    import spacy
    from PIL import Image
    from transformers import pipeline


    def extract_text(pdf_path):
        doc = fitz.open(pdf_path)
        text = ""
        for page in doc:
            page_text = page.get_text()
            if page_text.strip():
                text += page_text
            else:
                # No embedded text on this page: fall back to OCR
                pix = page.get_pixmap()
                img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
                text += pytesseract.image_to_string(img)
        return text


    def extract_entities(text):
        nlp = spacy.load("en_core_web_sm")
        doc = nlp(text)
        return [{"text": ent.text, "label": ent.label_} for ent in doc.ents]


    def extract_invoice_number(text):
        qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
        question = "What is the invoice number?"
        result = qa_pipeline(question=question, context=text)
        return result["answer"]


    if __name__ == "__main__":
        # Exit with a usage message instead of an IndexError when no path is given
        if len(sys.argv) != 2:
            sys.exit("Usage: python doc_extractor.py <path-to-pdf>")
        pdf_path = sys.argv[1]
        text = extract_text(pdf_path)
        entities = extract_entities(text)
        invoice_number = extract_invoice_number(text)
        structured = {"entities": entities, "invoice_number": invoice_number}
        with open("structured_data.json", "w", encoding="utf-8") as f:
            json.dump(structured, f, indent=2)
        print("Extraction complete. Results in structured_data.json.")
- Run the script from the command line:

    python doc_extractor.py sample.pdf

  Screenshot Description: Terminal showing successful extraction and creation of structured_data.json.
- Integrate with workflow automation tools:
  - For advanced automation, integrate this script with platforms like Apache Airflow, Zapier, or custom REST APIs.
  - For secure API integration, see Building Secure API Gateways for AI Workflow Automation in 2026.
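
Whichever automation platform you choose, it will typically need a single entry point that processes many files and reports failures. The sketch below shows one possible shape for that entry point using only the standard library. The name `process_folder` and the injected `extract` callable are assumptions; in practice `extract` could wrap the functions from doc_extractor.py.

```python
import json
from pathlib import Path


def process_folder(input_dir, output_dir, extract):
    """Run `extract(pdf_path) -> dict` on every PDF and write one JSON file each.

    Returns a dict mapping file names to "ok" or an error message, so a
    scheduler (cron, Airflow, etc.) can log or alert on failures.
    """
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    report = {}
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        try:
            data = extract(pdf_path)
            target = output / (pdf_path.stem + ".json")
            with open(target, "w", encoding="utf-8") as f:
                json.dump(data, f, indent=2)
            report[pdf_path.name] = "ok"
        except Exception as exc:  # one bad file should not stop the batch
            report[pdf_path.name] = f"failed: {exc}"
    return report
```

Passing the extraction function in as a parameter keeps the batch logic independent of the models, which also makes it easy to test with a stub.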
Step 5: Validate and Refine Results
- Review the output in structured_data.json:

    {
      "entities": [
        {"text": "Acme Corp", "label": "ORG"},
        {"text": "2024-06-01", "label": "DATE"}
      ],
      "invoice_number": "INV-123456"
    }

  Screenshot Description: JSON file opened in VS Code showing extracted entities and fields.
- Iterate on questions and NER models:
  - Adjust the question for the QA pipeline to target different fields.
  - Train custom NER models with spaCy for domain-specific extraction if needed.
  - Test with multiple documents to ensure robustness.
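
Manual review does not scale to many documents, so simple automated checks can flag records that deserve a closer look. The following sketch assumes invoice numbers follow the "INV-" plus six digits format seen in the sample output and that each document should contain at least one ORG and one DATE entity. Both rules, and the name `validate_record`, are illustrative assumptions to replace with your own.

```python
import re

# Assumed format, based on the sample value "INV-123456"; change as needed.
INVOICE_PATTERN = re.compile(r"INV-\d{6}")


def validate_record(record, required_labels=("ORG", "DATE")):
    """Return a list of human-readable problems found in an extracted record."""
    problems = []
    invoice = (record.get("invoice_number") or "").strip()
    if not INVOICE_PATTERN.fullmatch(invoice):
        problems.append(f"invoice_number does not match expected pattern: {invoice!r}")
    # Collect the entity labels that were actually found in the document.
    found = {ent["label"] for ent in record.get("entities", [])}
    for label in required_labels:
        if label not in found:
            problems.append(f"no {label} entity found")
    return problems
```

An empty list means the record passed every check; anything else can be queued for human review.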
Common Issues & Troubleshooting
- TesseractNotFoundError: Ensure Tesseract OCR is installed and in your system PATH. On Windows, add the Tesseract installation directory to your PATH environment variable.
- Low OCR Accuracy: Try increasing image resolution or pre-processing images (binarization, noise removal). For advanced OCR, explore easyocr or commercial APIs.
- Out-of-memory errors with large PDFs: Process documents page-by-page and write intermediate results to disk.
- Incorrect entity extraction: Fine-tune or retrain your NER model with spaCy on labeled data for your document type.
- Slow inference with transformers: Use smaller models or leverage GPU acceleration if available.
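- JSON encoding errors: Open the output file with encoding="utf-8", as the examples in this tutorial do. Python strings are already Unicode, and json.dump escapes non-ASCII characters by default; pass ensure_ascii=False if you want them written as readable UTF-8 instead.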
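
The advice on large PDFs and JSON encoding can be combined into one pattern: stream pages one at a time and append each to a JSON Lines file. This is a sketch under stated assumptions; the helper names `write_pages_jsonl` and `iter_pdf_pages` and the one-object-per-line layout are illustrative choices, not something the libraries prescribe.

```python
import json


def write_pages_jsonl(pages, out_path):
    """Write each page's text as one JSON line; return the number of pages written.

    `pages` can be any iterable of strings, including a generator, so only one
    page needs to be held in memory at a time.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for number, text in enumerate(pages, start=1):
            # ensure_ascii=False keeps characters such as "ü" readable in the file.
            line = json.dumps({"page": number, "text": text}, ensure_ascii=False)
            f.write(line + "\n")
            count += 1
    return count


def iter_pdf_pages(pdf_path):
    """Yield page text lazily with PyMuPDF (imported here to keep the writer dependency-free)."""
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    try:
        for page in doc:
            yield page.get_text()
    finally:
        doc.close()
```

For example, `write_pages_jsonl(iter_pdf_pages("sample.pdf"), "pages.jsonl")` would produce one line per page, which later stages can read back incrementally.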
Next Steps
- Scale up: Integrate this workflow into batch-processing pipelines or cloud-based document management systems.
- Enhance accuracy: Explore custom AI models, advanced OCR (e.g., table extraction), and human-in-the-loop validation.
- Connect to business systems: For guidance on integrating with ERP and legacy systems, see Integrating AI Workflow Automation with ERP Systems: Strategies for 2026 and Integrating AI Workflow Automation with Legacy ERP Systems: Pitfalls & Solutions.
- Learn more: For a broader perspective on automating complex document workflows, revisit our Pillar: The 2026 Guide to Automating Complex Document Workflows with AI—Best Practices, Tools & Use Cases.