Extracting Data from Unstructured Documents: AI-Powered Workflow Solutions Explained

Master step-by-step techniques to extract actionable data from invoices, contracts, and emails using AI-powered workflows.

Extracting structured data from unstructured documents—such as scanned PDFs, emails, contracts, and handwritten notes—is one of the most valuable and challenging tasks in modern business automation. AI-powered workflows can transform this process, enabling organizations to unlock insights, automate repetitive tasks, and dramatically reduce manual effort.

As we covered in our complete guide to automating complex document workflows with AI, this area deserves a deeper look. This tutorial walks through a practical, reproducible workflow for extracting data from unstructured documents using pretrained AI models and open-source tools.

Prerequisites

Operating System: Windows 10/11, macOS 12+, or Linux (Ubuntu 20.04+ recommended)
Python: Version 3.8 or higher
pip: Python package manager
Basic Knowledge: Python scripting, command line usage, and JSON data format
Tools & Libraries:
- PyMuPDF (for PDF parsing)
- pytesseract (for OCR)
- transformers (Hugging Face for AI models)
- spaCy (for NLP tasks)
- Tesseract OCR (system dependency)
Sample Documents: Unstructured PDFs, images, or scanned documents for testing

Step 1: Set Up Your Environment

Install Python and pip (if not already installed). On Ubuntu:
```
sudo apt update
sudo apt install python3 python3-pip
```
Install Tesseract OCR (required for image-based text extraction):
- Ubuntu:
```
sudo apt install tesseract-ocr
```
- macOS (using Homebrew):
```
brew install tesseract
```
- Windows: Download and install from Tesseract's GitHub.

Create and activate a Python virtual environment:

python3 -m venv ai-doc-extract
source ai-doc-extract/bin/activate  # On Windows: ai-doc-extract\Scripts\activate

Install required Python libraries:

pip install pymupdf pytesseract pillow "transformers[torch]" spacy

Download a spaCy language model:
```
python -m spacy download en_core_web_sm
```

Screenshot Description: Terminal showing successful installations of all dependencies and activation of the virtual environment.
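
Before moving on, it can help to confirm that everything installed above is actually visible from inside the virtual environment. The following is a minimal sketch using only the standard library; the function name `check_environment` is made up for this example, and the module list simply mirrors the packages installed in this step.

```python
import importlib.util
import shutil

# Import names for the packages installed in this step; "fitz" is the module
# that PyMuPDF provides, and "PIL" is the module that Pillow provides.
REQUIRED_MODULES = ["fitz", "pytesseract", "PIL", "transformers", "spacy"]


def check_environment(modules=REQUIRED_MODULES, binary="tesseract"):
    """Report which modules are importable and whether the OCR binary is on PATH."""
    report = {name: importlib.util.find_spec(name) is not None for name in modules}
    report[binary + " (binary)"] = shutil.which(binary) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

Any line marked MISSING points to a package or system dependency that still needs to be installed, or to a virtual environment that is not activated.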

Step 2: Ingest and Preprocess Unstructured Documents

Extract text from PDFs (including scanned PDFs):

For text-based PDFs, use PyMuPDF:


import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text

raw_text = extract_text_from_pdf("sample.pdf")
print(raw_text[:500])  # Preview first 500 characters

Screenshot Description: Python output previewing the extracted text.

For image-based PDFs or scanned documents, apply OCR with pytesseract:


from PIL import Image
import pytesseract
import fitz

def extract_text_from_scanned_pdf(pdf_path, dpi=300):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        # Render above the 72 dpi default; OCR accuracy drops on low-resolution images
        pix = page.get_pixmap(dpi=dpi)
        img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
        text += pytesseract.image_to_string(img)
    return text

ocr_text = extract_text_from_scanned_pdf("scanned_sample.pdf")
print(ocr_text[:500])

Screenshot Description: Output showing OCR-extracted text from a scanned PDF.

Save the extracted text for downstream processing:


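# Fall back to the OCR result when the PDF has no usable text layer
final_text = raw_text if raw_text.strip() else ocr_text
with open("extracted_text.txt", "w", encoding="utf-8") as f:
    f.write(final_text)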
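
Text pulled from PDFs and OCR often carries layout artifacts such as ligature characters, words hyphenated across line breaks, and irregular spacing. The sketch below shows one possible cleanup pass using only the standard library. The function name `normalize_extracted_text` and the specific rules are assumptions for illustration; the hyphen rule deliberately joins only letter-to-lowercase-letter breaks so that identifiers like "INV-" followed by digits are left alone.

```python
import re
import unicodedata


def normalize_extracted_text(text):
    """Clean up common PDF/OCR artifacts before passing text to NLP models."""
    # Fold compatibility characters (such as the "fi" ligature) into plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words hyphenated across a line break: "pay-\nment" -> "payment".
    text = re.sub(r"([A-Za-z])-\n([a-z])", r"\1\2", text)
    # Collapse runs of spaces and tabs, but keep line breaks.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces around line breaks, then squeeze 3+ newlines down to 2.
    text = re.sub(r" *\n *", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Whether each rule helps depends on your documents, so it is worth comparing model output with and without normalization on a few samples.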

Step 3: Apply AI Models for Information Extraction

Named Entity Recognition (NER) with spaCy:


import spacy

nlp = spacy.load("en_core_web_sm")
with open("extracted_text.txt", "r", encoding="utf-8") as f:
    document_text = f.read()

doc = nlp(document_text)
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")

Screenshot Description: Console output listing detected entities (e.g., names, dates, organizations).

Custom Information Extraction Using Transformers (Hugging Face):

For more complex extraction (e.g., extracting invoice numbers, amounts, or custom fields), use a question-answering model:


from transformers import pipeline

qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = document_text
question = "What is the invoice number?"
result = qa_pipeline(question=question, context=context)
print(result["answer"])

Screenshot Description: Output showing the extracted answer from the document.
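
The question-answering pipeline returns a confidence score alongside each answer, which can be used to avoid accepting weak guesses. The sketch below asks one question per field and blanks out low-confidence answers. The function `extract_fields`, the field names, the questions, and the 0.3 threshold are all assumptions chosen for illustration; tune them against your own documents.

```python
def extract_fields(qa, context, questions, min_score=0.3):
    """Ask one question per field and keep only answers above a confidence threshold.

    `qa` is any callable with the question-answering pipeline's interface:
    qa(question=..., context=...) -> {"answer": str, "score": float, ...}.
    """
    fields = {}
    for field, question in questions.items():
        result = qa(question=question, context=context)
        confident = result["score"] >= min_score
        fields[field] = {
            "value": result["answer"] if confident else None,
            "score": round(result["score"], 3),
        }
    return fields


# Hypothetical field names and questions; adapt them to your documents.
QUESTIONS = {
    "invoice_number": "What is the invoice number?",
    "total_amount": "What is the total amount due?",
    "due_date": "What is the payment due date?",
}

# Usage with the pipeline defined above:
# fields = extract_fields(qa_pipeline, document_text, QUESTIONS)
```

Fields that come back with a value of None are natural candidates for manual review rather than automatic processing.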

Structure the extracted data as JSON:


import json

extracted_data = {
    "entities": [{ "text": ent.text, "label": ent.label_ } for ent in doc.ents],
    "invoice_number": result["answer"]
}

with open("structured_data.json", "w", encoding="utf-8") as f:
    json.dump(extracted_data, f, indent=2)
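
A flat entity list tends to contain duplicates and labels that are irrelevant downstream. One possible refinement, sketched here, groups entities by label and drops repeats before writing the JSON. The helper name `group_entities` and the default label selection are assumptions; the labels themselves are standard ones emitted by spaCy's English models.

```python
from collections import defaultdict


def group_entities(entities, keep_labels=("ORG", "DATE", "MONEY", "PERSON")):
    """Group entity dicts by label, dropping duplicates and unwanted labels."""
    grouped = defaultdict(list)
    for ent in entities:
        label, text = ent["label"], ent["text"].strip()
        # Keep only wanted labels, skip empty strings, and preserve first-seen order.
        if label in keep_labels and text and text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# Usage with the structure built above:
# extracted_data["entities_by_label"] = group_entities(extracted_data["entities"])
```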

Step 4: Automate the Workflow

Wrap the workflow into a Python script and save it as doc_extractor.py:



import fitz
import pytesseract
from PIL import Image
import spacy
from transformers import pipeline
import json
import sys

def extract_text(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        page_text = page.get_text()
        if page_text.strip():
            text += page_text
        else:
            pix = page.get_pixmap(dpi=300)  # the 72 dpi default is too coarse for reliable OCR
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            text += pytesseract.image_to_string(img)
    return text

def extract_entities(text):
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    return [{"text": ent.text, "label": ent.label_} for ent in doc.ents]

def extract_invoice_number(text):
    qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
    question = "What is the invoice number?"
    result = qa_pipeline(question=question, context=text)
    return result["answer"]

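if __name__ == "__main__":
    # Exit with a usage hint instead of an IndexError when no path is given
    if len(sys.argv) != 2:
        sys.exit("Usage: python doc_extractor.py <path-to-pdf>")
    pdf_path = sys.argv[1]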
    text = extract_text(pdf_path)
    entities = extract_entities(text)
    invoice_number = extract_invoice_number(text)
    structured = {
        "entities": entities,
        "invoice_number": invoice_number
    }
    with open("structured_data.json", "w", encoding="utf-8") as f:
        json.dump(structured, f, indent=2)
    print("Extraction complete. Results in structured_data.json.")

Run the script from the command line:
```
python doc_extractor.py sample.pdf
```
Screenshot Description: Terminal showing successful extraction and creation of structured_data.json.

Integrate with workflow automation tools:
- For advanced automation, integrate this script with platforms like Apache Airflow, Zapier, or custom REST APIs.
- For secure API integration, see Building Secure API Gateways for AI Workflow Automation in 2026.
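
Whichever orchestration platform you choose, the unit of work is usually "run the extractor over a set of files and record what happened". The sketch below shows that loop using only the standard library. The function name `process_folder` and the summary format are assumptions; `extract` stands for any function that takes a PDF path and returns a dict, for example one that combines the three extraction functions from the script above.

```python
import json
from pathlib import Path


def process_folder(input_dir, output_dir, extract):
    """Run `extract(pdf_path) -> dict` on every PDF in input_dir, one JSON file per document."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    summary = {"succeeded": [], "failed": []}
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        try:
            data = extract(pdf_path)
        except Exception as exc:  # keep the batch going if one document fails
            summary["failed"].append({"file": pdf_path.name, "error": str(exc)})
            continue
        out_path = output_dir / (pdf_path.stem + ".json")
        out_path.write_text(json.dumps(data, indent=2), encoding="utf-8")
        summary["succeeded"].append(pdf_path.name)
    return summary
```

Recording failures instead of stopping on the first one makes the same function usable from a scheduler task, where a single corrupt PDF should not block the rest of the batch.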

Step 5: Validate and Refine Results

Review the output in structured_data.json:


{
  "entities": [
    {"text": "Acme Corp", "label": "ORG"},
    {"text": "2024-06-01", "label": "DATE"}
  ],
  "invoice_number": "INV-123456"
}

Screenshot Description: JSON file opened in VSCode showing extracted entities and fields.

Iterate on questions and NER models:
- Adjust the question for the QA pipeline to target different fields.
- Train custom NER models with spaCy for domain-specific extraction if needed.
Test with multiple documents to ensure robustness.
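
Part of validation can be automated with simple rule checks that flag implausible values for review. The sketch below checks an extracted invoice number against an expected format and confirms that it appears in the source text. The pattern is a hypothetical format, not a standard, and the function name `validate_invoice_number` is made up for this example. The verbatim check mostly matters when the value was edited or came from a different extractor, since a span-based QA model returns text taken from the context it was given.

```python
import re

# Hypothetical format: adjust to match how your invoices are actually numbered.
INVOICE_PATTERN = re.compile(r"^[A-Z]{2,5}-\d{4,10}$")


def validate_invoice_number(value, source_text, pattern=INVOICE_PATTERN):
    """Return a list of problems with an extracted invoice number (empty list means it passed)."""
    if not value or not value.strip():
        return ["empty value"]
    problems = []
    value = value.strip()
    if not pattern.match(value):
        problems.append("does not match expected format")
    if value not in source_text:
        problems.append("not found verbatim in source text")
    return problems
```

Running this over a batch of outputs gives a quick list of documents to inspect by hand first.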

Common Issues & Troubleshooting

TesseractNotFoundError: Ensure Tesseract OCR is installed and on your system PATH. On Windows, add the Tesseract installation directory to your PATH environment variable, or set pytesseract.pytesseract.tesseract_cmd to the full path of tesseract.exe.
Low OCR Accuracy: Try increasing image resolution or pre-processing images (binarization, noise removal). For advanced OCR, explore easyocr or commercial APIs.
Out-of-memory errors with large PDFs: Process documents page-by-page and write intermediate results to disk.
Incorrect entity extraction: Fine-tune or retrain your NER model with spaCy on labeled data for your document type.
Slow inference with transformers: Use smaller models or leverage GPU acceleration if available.
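JSON encoding errors: Open output files with encoding="utf-8", as the examples in this tutorial do. Pass ensure_ascii=False to json.dump if you want non-ASCII characters written as-is rather than as \u escapes.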
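
For the out-of-memory case, the page-by-page advice can be sketched as follows: instead of concatenating the whole document into one string, write each page to disk as it is produced. The function name `write_pages_jsonl` and the JSON Lines layout are assumptions for illustration; any iterable of page strings works as input.

```python
import json


def write_pages_jsonl(pages, out_path):
    """Write one JSON object per page to out_path and return the number of pages written.

    `pages` can be any iterable of strings, including a generator, so only one
    page of text needs to be held in memory at a time.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for number, text in enumerate(pages, start=1):
            f.write(json.dumps({"page": number, "text": text}, ensure_ascii=False) + "\n")
            count = number
    return count


# Usage with PyMuPDF (as in Step 2):
# doc = fitz.open("sample.pdf")
# write_pages_jsonl((page.get_text() for page in doc), "pages.jsonl")
```

Downstream steps can then read the file line by line and run NER or question answering per page.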

Next Steps

Scale up: Integrate this workflow into batch-processing pipelines or cloud-based document management systems.
Enhance accuracy: Explore custom AI models, advanced OCR (e.g., table extraction), and human-in-the-loop validation.
Connect to business systems: For guidance on integrating with ERP and legacy systems, see Integrating AI Workflow Automation with ERP Systems: Strategies for 2026 and Integrating AI Workflow Automation with Legacy ERP Systems: Pitfalls & Solutions.
Learn more: For a broader perspective on automating complex document workflows, revisit our Pillar: The 2026 Guide to Automating Complex Document Workflows with AI—Best Practices, Tools & Use Cases.
