Tech Frontline Jun 24, 2026 5 min read

Extracting Data from Unstructured Documents: AI-Powered Workflow Solutions Explained

Master step-by-step techniques to extract actionable data from invoices, contracts, and emails using AI-powered workflows.

Tech Daily Shot Team

Extracting structured data from unstructured documents—such as scanned PDFs, emails, contracts, and handwritten notes—is one of the most valuable and challenging tasks in modern business automation. AI-powered workflows can transform this process, enabling organizations to unlock insights, automate repetitive tasks, and dramatically reduce manual effort.

As we covered in our complete guide to automating complex document workflows with AI, this area deserves a deeper look. This tutorial walks through a practical, reproducible workflow for extracting data from unstructured documents using pretrained AI models and open-source tools.

Prerequisites

  • Operating System: Windows 10/11, macOS 12+, or Linux (Ubuntu 20.04+ recommended)
  • Python: Version 3.8 or higher
  • pip: Python package manager
  • Basic Knowledge: Python scripting, command line usage, and JSON data format
  • Tools & Libraries:
    • PyMuPDF (for PDF parsing)
    • pytesseract (for OCR)
    • transformers (Hugging Face for AI models)
    • spaCy (for NLP tasks)
    • Tesseract OCR (system dependency)
  • Sample Documents: Unstructured PDFs, images, or scanned documents for testing

Step 1: Set Up Your Environment

  1. Install Python and pip (if not already installed). On Ubuntu:
    sudo apt update
    sudo apt install python3 python3-pip
  2. Install Tesseract OCR (required for image-based text extraction):
    • Ubuntu:
      sudo apt install tesseract-ocr
    • macOS (using Homebrew):
      brew install tesseract
    • Windows: Download and install from Tesseract's GitHub.
  3. Create and activate a Python virtual environment:
    python3 -m venv ai-doc-extract
    source ai-doc-extract/bin/activate  # On Windows: ai-doc-extract\Scripts\activate
  4. Install required Python libraries:
    pip install pymupdf pytesseract pillow "transformers[torch]" spacy  # quotes keep shells such as zsh from expanding the brackets
  5. Download a spaCy language model:
    python -m spacy download en_core_web_sm

Screenshot Description: Terminal showing successful installations of all dependencies and activation of the virtual environment.
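Before moving on, it can help to confirm that everything from this step is actually visible to Python. The following is a minimal sketch using only the standard library; the helper name check_environment is hypothetical, and the module list simply mirrors the packages installed above (PyMuPDF imports as fitz, Pillow as PIL).

```python
import importlib.util
import shutil

# Import names for the packages installed in this step.
# Note: PyMuPDF is imported as "fitz" and Pillow as "PIL".
REQUIRED_MODULES = ["fitz", "pytesseract", "PIL", "transformers", "spacy"]


def check_environment(modules=REQUIRED_MODULES, binary="tesseract"):
    """Return a dict mapping each dependency name to True (found) or False."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be imported
        status[name] = importlib.util.find_spec(name) is not None
    # shutil.which searches PATH for the executable
    status[binary] = shutil.which(binary) is not None
    return status


if __name__ == "__main__":
    for dep, ok in check_environment().items():
        print(f"{dep:<14}{'OK' if ok else 'MISSING'}")
```

Anything reported as MISSING points back to the corresponding install command in this step. The check only confirms that a module or executable can be found, not that it works end to end.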

Step 2: Ingest and Preprocess Unstructured Documents

  1. Extract text from PDFs (including scanned PDFs):
    • For text-based PDFs, use PyMuPDF:
    
    import fitz  # PyMuPDF
    
    def extract_text_from_pdf(pdf_path):
        doc = fitz.open(pdf_path)
        text = ""
        for page in doc:
            text += page.get_text()
        return text
    
    raw_text = extract_text_from_pdf("sample.pdf")
    print(raw_text[:500])  # Preview first 500 characters
          
    • Screenshot Description: Python output previewing the extracted text.
  2. For image-based PDFs or scanned documents, apply OCR with pytesseract:
    
    from PIL import Image
    import pytesseract
    import fitz
    
    def extract_text_from_scanned_pdf(pdf_path):
        doc = fitz.open(pdf_path)
        text = ""
        for page_num in range(len(doc)):
            # Render at 300 DPI; the 72 DPI default is usually too coarse for OCR
            pix = doc[page_num].get_pixmap(dpi=300)
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            text += pytesseract.image_to_string(img)
        return text
    
    ocr_text = extract_text_from_scanned_pdf("scanned_sample.pdf")
    print(ocr_text[:500])
          
    • Screenshot Description: Output showing OCR-extracted text from a scanned PDF.
  3. Save the extracted text for downstream processing:
    
    with open("extracted_text.txt", "w", encoding="utf-8") as f:
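        # A scanned PDF can yield whitespace-only text, which is still truthy,
        # so test the stripped string before falling back to the OCR result
        f.write(raw_text if raw_text.strip() else ocr_text)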
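The heading of this step mentions preprocessing, and text coming out of a PDF parser or OCR engine often carries irregular spacing and words split across line breaks. A minimal cleanup sketch follows; the function name clean_extracted_text and the specific rules are assumptions, not part of any library. Note that the de-hyphenation rule would also join an identifier that happens to break at a hyphen (for example "INV-" at the end of a line), so it may need adjusting for your documents.

```python
import re


def clean_extracted_text(text):
    """Normalize raw PDF/OCR text before passing it to NLP models."""
    # Rejoin words hyphenated across a line break: "invoi-\nce" -> "invoice"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Remove leading and trailing spaces on every line
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Keep at most one blank line between paragraphs
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

The cleaned string can be written to extracted_text.txt in place of the raw text.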
          

Step 3: Apply AI Models for Information Extraction

  1. Named Entity Recognition (NER) with spaCy:
    
    import spacy
    
    nlp = spacy.load("en_core_web_sm")
    with open("extracted_text.txt", "r", encoding="utf-8") as f:
        document_text = f.read()
    
    doc = nlp(document_text)
    for ent in doc.ents:
        print(f"{ent.text} ({ent.label_})")
          
    • Screenshot Description: Console output listing detected entities (e.g., names, dates, organizations).
  2. Custom Information Extraction Using Transformers (Hugging Face):
    • For more complex extraction (e.g., extracting invoice numbers, amounts, or custom fields), use a question-answering model:
    
    from transformers import pipeline
    
    qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
    
    context = document_text
    question = "What is the invoice number?"
    result = qa_pipeline(question=question, context=context)
    print(result["answer"])
          
    • Screenshot Description: Output showing the extracted answer from the document.
  3. Structure the extracted data as JSON:
    
    import json
    
    extracted_data = {
        "entities": [{ "text": ent.text, "label": ent.label_ } for ent in doc.ents],
        "invoice_number": result["answer"]
    }
    
    with open("structured_data.json", "w", encoding="utf-8") as f:
        json.dump(extracted_data, f, indent=2)
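The question-answering pipeline returns a confidence score alongside each answer, which makes it possible to extract several fields and discard weak guesses. The sketch below assumes a hypothetical helper extract_fields, hypothetical field names, and an arbitrary threshold of 0.3; it accepts any callable with the pipeline's question/context interface, so it can be tried without downloading a model.

```python
def extract_fields(qa, context, questions, min_score=0.3):
    """Ask one question per field; keep answers scoring at least min_score.

    qa is any callable with the question-answering pipeline interface:
    qa(question=..., context=...) -> {"answer": str, "score": float, ...}
    """
    fields = {}
    for field, question in questions.items():
        result = qa(question=question, context=context)
        # Low-confidence answers become None so they can be reviewed manually
        fields[field] = result["answer"] if result["score"] >= min_score else None
    return fields


# Hypothetical field names and questions; adapt them to your documents.
QUESTIONS = {
    "invoice_number": "What is the invoice number?",
    "total_amount": "What is the total amount due?",
}

# Usage with the pipeline from this step (downloads the model on first run):
# from transformers import pipeline
# qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
# print(extract_fields(qa_pipeline, document_text, QUESTIONS))
```

The threshold is a starting point only; a sensible value depends on the model and the document type, and is best chosen by checking a sample of answers by hand.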
          

Step 4: Automate the Workflow

  1. Wrap the workflow into a Python script and save it as doc_extractor.py:
    
    
    import fitz
    import pytesseract
    from PIL import Image
    import spacy
    from transformers import pipeline
    import json
    import sys
    
    def extract_text(pdf_path):
        doc = fitz.open(pdf_path)
        text = ""
        for page in doc:
            page_text = page.get_text()
            if page_text.strip():
                text += page_text
            else:
                # No embedded text: render at 300 DPI (the 72 DPI default is coarse for OCR)
                pix = page.get_pixmap(dpi=300)
                img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
                text += pytesseract.image_to_string(img)
        return text
    
    def extract_entities(text):
        nlp = spacy.load("en_core_web_sm")
        doc = nlp(text)
        return [{"text": ent.text, "label": ent.label_} for ent in doc.ents]
    
    def extract_invoice_number(text):
        qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
        question = "What is the invoice number?"
        result = qa_pipeline(question=question, context=text)
        return result["answer"]
    
    if __name__ == "__main__":
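        # Exit with a usage message instead of an IndexError when no path is given
        if len(sys.argv) != 2:
            sys.exit("Usage: python doc_extractor.py <path-to-pdf>")
        pdf_path = sys.argv[1]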
        text = extract_text(pdf_path)
        entities = extract_entities(text)
        invoice_number = extract_invoice_number(text)
        structured = {
            "entities": entities,
            "invoice_number": invoice_number
        }
        with open("structured_data.json", "w", encoding="utf-8") as f:
            json.dump(structured, f, indent=2)
        print("Extraction complete. Results in structured_data.json.")
          
  2. Run the script from the command line:
    python doc_extractor.py sample.pdf
    • Screenshot Description: Terminal showing successful extraction and creation of structured_data.json.
  3. Integrate with workflow automation tools, for example by calling the script from a scheduler or a pipeline step.
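One simple way to automate beyond a single file is to run the extraction over a whole folder. The following is a minimal sketch, not a definitive implementation: process_folder is a hypothetical helper, and extract_fn stands for any function that takes a PDF path and returns a JSON-serializable dict.

```python
import json
from pathlib import Path


def process_folder(input_dir, output_dir, extract_fn):
    """Run extract_fn on every PDF in input_dir; write one JSON file per PDF.

    extract_fn takes a path string and returns a JSON-serializable dict.
    Returns a dict mapping file names to "ok" or an error message.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    report = {}
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        try:
            data = extract_fn(str(pdf))
            target = out / (pdf.stem + ".json")
            target.write_text(json.dumps(data, indent=2), encoding="utf-8")
            report[pdf.name] = "ok"
        except Exception as exc:  # one bad file should not stop the batch
            report[pdf.name] = f"error: {exc}"
    return report
```

With the script from this step, extract_fn could be a small wrapper that calls extract_text, extract_entities, and extract_invoice_number and returns their results as one dict. The returned report makes failed files easy to spot and retry.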

Step 5: Validate and Refine Results

  1. Review the output in structured_data.json:
    
    {
      "entities": [
        {"text": "Acme Corp", "label": "ORG"},
        {"text": "2024-06-01", "label": "DATE"}
      ],
      "invoice_number": "INV-123456"
    }
          
    • Screenshot Description: JSON file opened in VSCode showing extracted entities and fields.
  2. Iterate on questions and NER models:
    • Adjust the question for the QA pipeline to target different fields.
    • Train custom NER models with spaCy for domain-specific extraction if needed.
  3. Test with multiple documents to ensure robustness.
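Part of the review can be automated with simple sanity checks on each output record. The sketch below is illustrative only: the INV-plus-six-digits pattern is an assumption taken from the sample output above, and the required entity labels (ORG, DATE) are likewise hypothetical choices to adapt to your documents.

```python
import re

# Assumed format, matching the sample output above (e.g. "INV-123456")
INVOICE_PATTERN = re.compile(r"^INV-\d{6}$")


def validate_record(record):
    """Return a list of problems found; an empty list means the record passed."""
    problems = []
    invoice = record.get("invoice_number")
    if not invoice:
        problems.append("invoice_number is missing")
    elif not INVOICE_PATTERN.match(invoice.strip()):
        problems.append(f"invoice_number has unexpected format: {invoice!r}")
    # Collect the entity labels that were found, tolerating a missing list
    labels = {e["label"] for e in (record.get("entities") or [])}
    for required in ("ORG", "DATE"):
        if required not in labels:
            problems.append(f"no {required} entity found")
    return problems
```

Records that return a non-empty list are good candidates for manual review, and recurring problems suggest which question or model to adjust.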

Common Issues & Troubleshooting

  • TesseractNotFoundError: Ensure Tesseract OCR is installed and in your system PATH. On Windows, add the Tesseract installation directory to your PATH environment variable.
  • Low OCR Accuracy: Try increasing image resolution or pre-processing images (binarization, noise removal). For advanced OCR, explore easyocr or commercial APIs.
  • Out-of-memory errors with large PDFs: Process documents page-by-page and write intermediate results to disk.
  • Incorrect entity extraction: Fine-tune or retrain your NER model with spaCy on labeled data for your document type.
  • Slow inference with transformers: Use smaller models or leverage GPU acceleration if available.
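  • JSON encoding errors: Open output files with encoding="utf-8", as in the examples above. By default json.dump writes non-ASCII characters as \uXXXX escapes; pass ensure_ascii=False to write them literally.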
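The advice on large PDFs and on encoding can be combined by streaming one JSON object per page to a JSON Lines file. This is a minimal sketch under stated assumptions: write_pages_jsonl is a hypothetical helper, and pages can be any iterable of strings, such as a generator over PyMuPDF pages.

```python
import json


def write_pages_jsonl(pages, out_path):
    """Write one JSON object per page to out_path; return the page count.

    pages is any iterable of strings, so a generator that yields one page
    at a time avoids building the whole document text in memory.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for number, text in enumerate(pages, start=1):
            # ensure_ascii=False writes characters such as "é" literally
            f.write(json.dumps({"page": number, "text": text}, ensure_ascii=False))
            f.write("\n")
            count += 1
    return count


# Usage with PyMuPDF (assumes `import fitz` and a file named sample.pdf):
# doc = fitz.open("sample.pdf")
# write_pages_jsonl((page.get_text() for page in doc), "pages.jsonl")
```

Each line of the output file can then be read back and processed independently with json.loads.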
