Document Processing Guide

Automate document ingestion, extraction, and analysis with AKIOS. This guide shows you how to process PDFs, Word documents, images, and text files securely.

Quick Start

Get started with document processing:

# Initialize document processing project
akios init my-doc-project

# Run example workflow
akios run templates/document-analysis.yml

Supported Document Types

AKIOS can process:

📝 Text Documents:

  • .txt - Plain text (auto-detects encoding)
  • .md - Markdown
  • .csv - Comma-separated values
  • .json - JSON data

📜 Office Documents:

  • .pdf - PDF documents (text extraction powered by pypdf, with OCR fallback)
  • .docx - Microsoft Word
  • .xlsx - Microsoft Excel (planned)
  • .pptx - PowerPoint (planned)

🖼️ Images (OCR):

  • .png - PNG images
  • .jpg / .jpeg - JPEG images
  • .tiff - TIFF images
  • .bmp - Bitmap images
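
If you want to pre-sort files before handing them to a workflow, the lists above can be turned into a simple lookup. This is a minimal sketch, not part of AKIOS: the group names and the `document_kind` helper are hypothetical, and only the extensions come from this guide.

```python
from pathlib import Path

# Extensions copied from the lists above; the group names are illustrative.
SUPPORTED = {
    "text": {".txt", ".md", ".csv", ".json"},
    "office": {".pdf", ".docx"},
    "image": {".png", ".jpg", ".jpeg", ".tiff", ".bmp"},
}
# Listed in the guide as planned, so treat them as not yet processable.
PLANNED = {".xlsx", ".pptx"}


def document_kind(path: str) -> str:
    """Return 'text', 'office', 'image', 'planned', or 'unsupported'."""
    ext = Path(path).suffix.lower()  # case-insensitive match on the extension
    for kind, extensions in SUPPORTED.items():
        if ext in extensions:
            return kind
    return "planned" if ext in PLANNED else "unsupported"
```

A check like this lets you route planned or unsupported files to a separate folder before a workflow ever tries to read them.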

Basic Document Analysis

Read a document, analyze it with an LLM, and save the results:

name: "Contract Analysis"
description: "Analyze legal contract and extract key terms"

steps:
  - name: "Read contract"
    agent: filesystem
    action: read
    parameters:
      path: "data/input/contract.pdf"
      
  - name: "Extract key information"
    agent: llm
    action: complete
    parameters:
      model: "gpt-4o"
      prompt: |
        Analyze this contract and extract:
        
        1. Parties involved
        2. Effective dates
        3. Financial terms
        4. Key obligations
        5. Termination clauses
        6. Risk factors
        
        Contract:
        {{previous_output}}
      max_tokens: 2000
      
  - name: "Save analysis"
    agent: filesystem
    action: write
    parameters:
      path: "data/output/contract_analysis.txt"
      content: "{{previous_output}}"

Save the workflow as workflows/contract-analysis.yml, then run it:

# Place your PDF in data/input/
cp contract.pdf data/input/

# Run analysis
akios run workflows/contract-analysis.yml

# Check results
cat data/output/contract_analysis.txt

Common Processing Patterns

Pattern 1: Batch Processing

Process multiple documents:

name: "Batch Document Processor"
description: "Process all documents in a directory"

steps:
  - name: "List documents"
    agent: filesystem
    action: list
    parameters:
      path: "data/input/"
      pattern: "*.pdf"
      
  - name: "Process each document"
    agent: llm
    action: complete
    parameters:
      prompt: "Summarize this document: {{file_content}}"
    for_each: "{{files}}"

Alternative using shell:

steps:
  - name: "Find PDFs"
    agent: tool_executor
    action: run
    parameters:
      command: ["find", "data/input", "-name", "*.pdf"]
      
  - name: "Process list"
    # Process each file from the list
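
To preview which files a batch run would pick up, you can reproduce the listing step outside the workflow. The sketch below assumes the `pattern` parameter behaves like a shell glob on file names; `list_documents` is a hypothetical helper, not an AKIOS function.

```python
from pathlib import Path


def list_documents(directory: str, pattern: str = "*.pdf") -> list[str]:
    """Return matching file paths in a stable, sorted order."""
    # Skip directories that happen to match the pattern.
    return sorted(str(p) for p in Path(directory).glob(pattern) if p.is_file())
```

Sorting keeps the processing order reproducible between runs, which makes failures easier to trace back to a specific file.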

Pattern 2: Document Classification

Classify documents and route to folders:

name: "Document Classifier"
description: "Classify and organize documents"

steps:
  - name: "Read document"
    agent: filesystem
    action: read
    parameters:
      path: "data/input/{{filename}}"
      
  - name: "Classify"
    agent: llm
    action: complete
    parameters:
      model: "gpt-4o"
      prompt: |
        Classify this document into one category:
        - invoice
        - contract
        - receipt
        - letter
        - other
        
        Document: {{document}}
        
        Return only the category name.
      
  - name: "Move to category folder"
    agent: tool_executor
    action: run
    parameters:
      # The source path matches the file read in the first step.
      command: ["mv", "data/input/{{filename}}", "data/output/{{category}}/"]
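
Model replies are not guaranteed to be exactly one of the five category names, so it can help to normalize the reply before using it in a path. This is a sketch under that assumption; `normalize_category` and `destination` are hypothetical helpers, and only the category names and the data/output/ layout come from this guide.

```python
from pathlib import PurePosixPath

# Category names from the classification prompt above.
CATEGORIES = ("invoice", "contract", "receipt", "letter", "other")


def normalize_category(raw: str) -> str:
    """Map a free-form model reply onto one of the known categories."""
    cleaned = raw.strip().strip(".\"'").lower()
    # Anything unexpected falls back to 'other' instead of creating a new folder.
    return cleaned if cleaned in CATEGORIES else "other"


def destination(filename: str, raw_category: str) -> str:
    """Build the target path under data/output/<category>/."""
    # Keep only the final path component so the file cannot land outside data/output/.
    safe_name = PurePosixPath(filename).name
    return str(PurePosixPath("data/output") / normalize_category(raw_category) / safe_name)
```

Falling back to "other" means a verbose or malformed reply never produces an unexpected directory name.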

Pattern 3: PII Redaction

Process documents with automatic PII removal:

name: "Redact Sensitive Data"
description: "Process document with PII protection"

steps:
  - name: "Read document"
    agent: filesystem
    action: read
    parameters:
      path: "data/input/customer_data.txt"
    # PII automatically redacted on read
    
  - name: "Analyze redacted version"
    agent: llm
    action: complete
    parameters:
      prompt: "Analyze: {{previous_output}}"
    # AI sees redacted version
    
  - name: "Save analysis"
    agent: filesystem
    action: write
    parameters:
      path: "data/output/analysis.txt"
      content: "{{previous_output}}"
    # Output also redacted

What gets redacted:

  • Email addresses → [REDACTED_EMAIL]
  • Phone numbers → [REDACTED_PHONE]
  • SSN → [REDACTED_SSN]
  • Credit cards → [REDACTED_CC]
  • 50+ other patterns
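
To make the mask strategy concrete, here is a small regex-based sketch that produces the same placeholder tokens. The patterns are simplified stand-ins chosen for illustration (US-style phone numbers and SSNs, 16-digit cards); they are not the patterns AKIOS ships, and `redact` is a hypothetical function.

```python
import re

# Illustrative patterns only. Order matters: SSNs and card numbers are
# replaced before the looser phone pattern can match parts of them.
PATTERNS = [
    ("[REDACTED_EMAIL]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("[REDACTED_SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[REDACTED_CC]", re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")),
    ("[REDACTED_PHONE]", re.compile(r"(?<!\d)(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\d)")),
]


def redact(text: str) -> str:
    """Replace each match with its placeholder token."""
    for token, pattern in PATTERNS:
        text = pattern.sub(token, text)
    return text


sample = "Mail jane@example.com or call 555-123-4567. SSN 123-45-6789, card 4111 1111 1111 1111."
print(redact(sample))
```

Real-world PII detection needs far more patterns and validation than this; the point is only to show what masked text looks like before it reaches the model.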

Pattern 4: Multi-Step Analysis

Complex document processing pipeline:

name: "Deep Document Analysis"
description: "Multi-stage document processing"

steps:
  - name: "Read document"
    agent: filesystem
    action: read
    parameters:
      path: "data/input/report.pdf"
      
  - name: "Extract entities"
    agent: llm
    action: complete
    parameters:
      prompt: "Extract all people, companies, dates: {{document}}"
      
  - name: "Summarize"
    agent: llm
    action: complete
    parameters:
      prompt: "Summarize main points: {{document}}"
      
  - name: "Sentiment analysis"
    agent: llm
    action: complete
    parameters:
      prompt: "Analyze sentiment (positive/negative/neutral): {{document}}"
      
  - name: "Generate report"
    agent: llm
    action: complete
    parameters:
      prompt: |
        Create final report:
        
        Entities: {{entities}}
        Summary: {{summary}}
        Sentiment: {{sentiment}}
      
  - name: "Save report"
    agent: filesystem
    action: write
    parameters:
      path: "data/output/complete_analysis.md"
      content: "{{report}}"

Handling Large Documents

Check Size Before Processing

steps:
  - name: "Check file size"
    agent: filesystem
    action: stat
    parameters:
      path: "data/input/large-doc.pdf"
      
  - name: "Process if small"
    agent: llm
    action: complete
    skip_if: "$(.size > 1048576)"  # Skip if > 1MB
    parameters:
      prompt: "Full analysis: {{document}}"
      
  - name: "Summarize if large"
    agent: llm
    action: complete
    skip_if: "$(.size <= 1048576)"  # Runs only if > 1MB
    parameters:
      prompt: "Brief summary only: {{document}}"
      max_tokens: 500  # Limit output
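
The two skip_if conditions are complementary, so exactly one of the two steps runs for any file. The sketch below mirrors that routing in Python with the same 1048576-byte threshold; `choose_mode` and `mode_for_file` are hypothetical names used only for illustration.

```python
import os

ONE_MIB = 1_048_576  # same threshold as the skip_if expressions above


def choose_mode(size_bytes: int, limit: int = ONE_MIB) -> str:
    """Return 'full' for files up to the limit and 'summary' for larger ones."""
    return "summary" if size_bytes > limit else "full"


def mode_for_file(path: str) -> str:
    """Look up the file size on disk and pick the processing mode."""
    return choose_mode(os.stat(path).st_size)
```

A file of exactly 1048576 bytes gets the full analysis, matching the `>` and `<=` comparisons in the workflow.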

Chunk Large Documents

steps:
  - name: "Split document"
    agent: tool_executor
    action: run
    parameters:
      # The chunks/ directory must already exist; split writes chunks/aa, chunks/ab, ...
      command: ["split", "-l", "1000", "data/input/large.txt", "chunks/"]
      
  - name: "Process chunks"
    # Process each chunk
    
  - name: "Combine results"
    agent: llm
    action: complete
    parameters:
      prompt: "Combine these analyses: {{chunk_results}}"
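
If you prefer to chunk in code rather than with split, the same 1000-line grouping can be sketched in a few lines of Python. `chunk_lines` is a hypothetical helper; it keeps line endings so the chunks concatenate back to the original text.

```python
def chunk_lines(text: str, lines_per_chunk: int = 1000) -> list[str]:
    """Split text into consecutive chunks of at most lines_per_chunk lines."""
    if lines_per_chunk < 1:
        raise ValueError("lines_per_chunk must be at least 1")
    lines = text.splitlines(keepends=True)  # keep newlines so nothing is lost
    return [
        "".join(lines[start:start + lines_per_chunk])
        for start in range(0, len(lines), lines_per_chunk)
    ]
```

Line-based chunks can still cut a paragraph or table in half, so expect the combine step to smooth over some boundary effects.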

Error Handling

OCR Failures

steps:
  - name: "Try to read PDF"
    agent: filesystem
    action: read
    parameters:
      path: "data/input/scanned.pdf"
    # Falls back to OCR automatically
    
  - name: "Handle if OCR fails"
    # Check if content is empty
    skip_if: "$(.content != '')"
    agent: filesystem
    action: write
    parameters:
      path: "data/failed/{{filename}}"
      content: "OCR failed"

File Not Found

steps:
  - name: "Check file exists"
    agent: filesystem
    action: exists
    parameters:
      path: "data/input/{{filename}}"
      
  - name: "Read if exists"
    skip_if: "$(.exists != true)"
    agent: filesystem
    action: read
    parameters:
      path: "data/input/{{filename}}"

Best Practices

✓ Do:

  • Keep inputs in data/input/
  • Save outputs to data/output/
  • Enable PII redaction for sensitive documents
  • Check file size before processing
  • Use audit logging for compliance
  • Handle OCR failures gracefully
  • Set max_tokens limits for large docs

× Don't:

  • Process files outside allowed_paths
  • Disable PII redaction without good reason
  • Process unlimited file sizes
  • Ignore encoding issues
  • Skip error handling

Configuration

Configure filesystem access:

# config.yaml
filesystem:
  allowed_paths:
    - "./data/input"
    - "./data/output"
  max_file_size_mb: 10
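
To reason about which paths this configuration admits, the sketch below checks a path against the same two directories and size cap. It is an approximation, not the AKIOS implementation: it assumes paths are resolved relative to the current directory and that a megabyte means 1024 × 1024 bytes. `is_allowed` and `within_size_limit` are hypothetical names.

```python
from pathlib import Path

# Values copied from the config.yaml example above.
ALLOWED_PATHS = ["./data/input", "./data/output"]
MAX_FILE_SIZE_MB = 10


def is_allowed(path: str, allowed=ALLOWED_PATHS) -> bool:
    """True if the resolved path sits inside one of the allowed directories."""
    target = Path(path).resolve()  # collapses '..' segments before comparing
    return any(target.is_relative_to(Path(root).resolve()) for root in allowed)


def within_size_limit(size_bytes: int, max_mb: int = MAX_FILE_SIZE_MB) -> bool:
    """True if the size is at or below the configured cap (assumes 1 MB = 1024 * 1024 bytes)."""
    return size_bytes <= max_mb * 1024 * 1024
```

Resolving before comparing is what stops a path such as data/input/../../etc/passwd from slipping through. `Path.is_relative_to` requires Python 3.9 or later.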

Enable PII protection:

# config.yaml
pii_redaction_enabled: true
pii_redaction_outputs: true
redaction_strategy: "mask"

Tips & Tricks

Store failures separately:

# Create failed directory
mkdir -p data/failed

# Move failed documents
mv data/input/bad.pdf data/failed/

Use size checks:

# Skip the full analysis for files over 1MB (summarize those instead)
skip_if: "$(.size > 1048576)"

Organize by type:

data/
├── input/
│   ├── pdfs/
│   ├── docx/
│   └── images/
└── output/
    ├── analyzed/
    └── extracted/

Log processing failures:

steps:
  - name: "Log failure"
    agent: filesystem
    action: write
    parameters:
      path: "data/logs/failed.txt"
      content: "Failed: {{filename}} - {{error}}"
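
This guide does not say whether the write action appends to an existing file or replaces it. If you need a running failure log, one option is to append from a small script; the sketch below assumes the same line format as the step above, and `log_failure` is a hypothetical helper.

```python
from pathlib import Path


def log_failure(log_path: str, filename: str, error: str) -> str:
    """Append one failure line to the log and return the line written."""
    line = f"Failed: {filename} - {error}"
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create data/logs/ on first use
    with path.open("a", encoding="utf-8") as handle:  # 'a' keeps earlier entries
        handle.write(line + "\n")
    return line
```

Calling `log_failure("data/logs/failed.txt", "bad.pdf", "OCR failed")` adds one line per failed document without touching earlier entries.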
