
Parse Complex PDFs Without Losing Formatting

Master five methods to extract content from complex PDFs while preserving tables, layouts, and formatting that traditional parsers destroy.

August 29, 2025

6 min read


Duy Bui

That 200-page annual report just turned into scrambled text soup. Tables are now random strings, multi-column layouts merged into gibberish, and don't even ask what happened to the charts. We've all watched in horror as sophisticated PDFs become unreadable messes after parsing.

Understanding Complex PDF Structures

Complex PDFs are like architectural blueprints where everything has a specific position and relationship. Unlike simple documents, they contain:

  • Multi-column layouts where reading order isn't left-to-right
  • Nested tables with merged cells and varying alignments
  • Headers and footers that shouldn't mix with body content
  • Floating elements like sidebars and callout boxes
  • Mixed content combining text, images, charts, and forms

Traditional parsers read PDFs linearly, ignoring these spatial relationships. That's why your perfectly formatted financial statement becomes word salad. Let's fix that.
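To see what "linear" reading does to a two-column page, compare a naive top-to-bottom, left-to-right sort of word boxes with a column-aware sort. A toy illustration with made-up coordinates:

```python
# Hypothetical word boxes (text, x, y) from a two-column page.
# Left column sits near x=0, right column near x=300.
words = [
    ("Revenue", 0, 0),   ("Costs", 300, 0),
    ("grew", 0, 20),     ("fell", 300, 20),
    ("18%.", 0, 40),     ("5%.", 300, 40),
]

# Naive linear order: sort by y, then x -- interleaves the columns.
linear = [w for w, x, y in sorted(words, key=lambda t: (t[2], t[1]))]

# Layout-aware order: split into columns by x, then read each top-down.
def column_order(words, column_gap=150):
    left = [w for w in words if w[1] < column_gap]
    right = [w for w in words if w[1] >= column_gap]
    ordered = sorted(left, key=lambda t: t[2]) + sorted(right, key=lambda t: t[2])
    return [w for w, x, y in ordered]

print(" ".join(linear))               # Revenue Costs grew fell 18%. 5%.
print(" ".join(column_order(words)))  # Revenue grew 18%. Costs fell 5%.
```

Real parsers face the same choice on every page; the methods below differ mainly in how much of this spatial reasoning they do for you.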

Method 1: Apache Tika

Apache Tika is a content detection and extraction framework that handles complex document structures better than basic libraries.

Python Implementation

from tika import parser

def parse_complex_pdf(pdf_path):
    # Parse with Tika, keeping XHTML structure instead of flat text
    parsed = parser.from_file(pdf_path, xmlContent=True)
    metadata = parsed['metadata']
    
    # Re-parse with PDF-specific options passed as Tika server headers
    # (enables inline image extraction and automatic word spacing)
    parsed_with_config = parser.from_file(
        pdf_path,
        xmlContent=True,
        requestOptions={'headers': {
            'X-Tika-PDFextractInlineImages': 'true',
            'X-Tika-PDFenableAutoSpace': 'true'
        }}
    )
    
    return {
        'content': parsed_with_config['content'],
        'metadata': metadata,
        'status': parsed_with_config['status']
    }

# Usage
result = parse_complex_pdf('annual_report.pdf')
print(f"Extracted {len(result['content'])} characters")
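With xmlContent=True, Tika returns XHTML rather than flat text; for PDFs, each page arrives wrapped in a div with class "page", so page boundaries survive and can be split back out with the standard library. A sketch against a minimal sample of that output:

```python
import xml.etree.ElementTree as ET

# Minimal sample of the XHTML Tika emits for a PDF with xmlContent=True:
# each page becomes a <div class="page"> inside <body>.
sample = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="page"><p>Page one text.</p></div>
<div class="page"><p>Page two text.</p></div>
</body></html>"""

def split_pages(xhtml):
    # Collect the text of each page div, preserving page boundaries
    ns = {"x": "http://www.w3.org/1999/xhtml"}
    root = ET.fromstring(xhtml)
    return [
        "".join(div.itertext()).strip()
        for div in root.findall(".//x:div[@class='page']", ns)
    ]

pages = split_pages(sample)
```

Feeding Tika's real output through the same function gives you per-page text instead of one undifferentiated string.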

Pros:

  • Maintains some structural information
  • Good metadata extraction
  • Free and open-source

Cons:

  • Requires Java runtime
  • Limited layout preservation
  • Tables still need post-processing

Method 2: Camelot for Table Extraction

Camelot specializes in extracting tables from PDFs with their structure intact.

Implementation

import camelot

def extract_tables_with_structure(pdf_path, pages='all'):
    # Try lattice method first (for bordered tables)
    try:
        tables_lattice = camelot.read_pdf(
            pdf_path, 
            pages=pages, 
            flavor='lattice',
            line_scale=40  # Adjust for better line detection
        )
        print(f"Found {len(tables_lattice)} tables using lattice method")
    except Exception as exc:
        print(f"Lattice parsing failed: {exc}")
        tables_lattice = []
    
    # Try stream method (for borderless tables)
    try:
        tables_stream = camelot.read_pdf(
            pdf_path, 
            pages=pages, 
            flavor='stream',
            edge_tol=50,  # Tolerance for edge detection
            column_tol=10  # Tolerance for column detection
        )
        print(f"Found {len(tables_stream)} tables using stream method")
    except Exception as exc:
        print(f"Stream parsing failed: {exc}")
        tables_stream = []
    
    # Combine results
    all_tables = []
    
    for table in tables_lattice:
        if table.accuracy > 80:  # Only high-quality extractions
            all_tables.append({
                'data': table.df,
                'accuracy': table.accuracy,
                'method': 'lattice',
                'shape': table.shape
            })
    
    for table in tables_stream:
        if table.accuracy > 80:
            all_tables.append({
                'data': table.df,
                'accuracy': table.accuracy,
                'method': 'stream',
                'shape': table.shape
            })
    
    return all_tables

# Usage
tables = extract_tables_with_structure('financial_report.pdf', pages='1-10')
for i, table in enumerate(tables):
    print(f"Table {i}: {table['shape']} with {table['accuracy']:.1f}% accuracy")
    print(table['data'].head())
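When both flavors fire on the same table you can get duplicate detections. One rough way to deduplicate (using shape as a stand-in for position, since the dicts above don't carry page numbers) is to keep only the higher-accuracy extraction. A sketch with hypothetical accuracy values:

```python
def pick_best_tables(all_tables):
    # When lattice and stream both detect the same table (same shape,
    # used here as a crude identity key), keep the higher-accuracy one.
    best = {}
    for t in all_tables:
        key = t["shape"]
        if key not in best or t["accuracy"] > best[key]["accuracy"]:
            best[key] = t
    return list(best.values())

# Hypothetical results in the shape returned by extract_tables_with_structure
candidates = [
    {"data": None, "accuracy": 95.2, "method": "lattice", "shape": (5, 4)},
    {"data": None, "accuracy": 88.7, "method": "stream",  "shape": (5, 4)},
    {"data": None, "accuracy": 91.0, "method": "stream",  "shape": (12, 3)},
]
winners = pick_best_tables(candidates)
```

In production you would key on page number and bounding box overlap rather than shape alone, but the selection logic is the same.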

Pros:

  • Excellent table structure preservation
  • Two methods for different table types
  • Outputs clean pandas DataFrames
  • Accuracy metrics included

Cons:

  • Only handles tables, not full document
  • Requires ghostscript dependency
  • Can miss complex nested tables
  • No text outside tables

Method 3: PDF Vector with LLM Enhancement

PDF Vector's Parse API uses AI to understand document layout and preserve formatting in clean markdown.

Implementation

import { PDFVector } from 'pdfvector';

const client = new PDFVector({ 
    apiKey: 'pdfvector_your_api_key' 
});

async function parseComplexDocument(documentUrl: string) {
    // Parse with LLM enhancement for complex layouts
    const result = await client.parse({
        url: documentUrl,
        useLLM: "always"  // Force AI parsing for better structure
    });
    
    console.log(`Processed ${result.pageCount} pages`);
    console.log(`Used ${result.creditCount} credits`);
    console.log(`AI Enhancement: ${result.usedLLM}`);
    
    return result.markdown;
}

// For local files
import { readFileSync } from 'node:fs';

async function parseLocalComplexPDF(filePath: string) {
    const fileBuffer = readFileSync(filePath);
    
    const result = await client.parse({
        data: fileBuffer,
        contentType: 'application/pdf',
        useLLM: "auto"  // Let API decide based on complexity
    });
    
    // The markdown preserves:
    // - Table structures with proper alignment
    // - Multi-column layouts with correct reading order
    // - Hierarchical headers and sections
    // - Lists and nested content
    
    return result.markdown;
}

// Example: Financial report with complex tables
const markdown = await parseComplexDocument('https://example.com/annual_report.pdf');

// Markdown output preserves structure:
// # Annual Report 2023
// 
// ## Financial Highlights
// 
// | Metric | 2023 | 2022 | Change |
// |--------|------|------|--------|
// | Revenue | $45.2M | $38.1M | +18.6% |
// | EBITDA | $12.3M | $9.8M | +25.5% |
// 
// ### Regional Performance
// 
// The company showed strong growth across all regions...
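Because the result is plain markdown, downstream table handling is ordinary string processing in whatever language consumes it. A minimal Python sketch (ours, not part of any PDF Vector SDK) that turns a pipe table like the example output above into rows:

```python
def markdown_table_to_rows(markdown):
    # Parse a simple markdown pipe table into a list of cell rows;
    # the |---|---| alignment separator line is skipped.
    rows = []
    for line in markdown.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if all(set(c) <= set("-: ") for c in cells):
            continue  # separator row, e.g. |------|------|
        rows.append(cells)
    return rows

table = """
| Metric | 2023 | 2022 | Change |
|--------|------|------|--------|
| Revenue | $45.2M | $38.1M | +18.6% |
"""
rows = markdown_table_to_rows(table)
```

From here the rows drop straight into csv.writer or a DataFrame constructor.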

Pros:

  • AI understands complex layouts automatically
  • Preserves tables, lists, and hierarchies in markdown
  • No configuration or post-processing needed
  • Handles scanned PDFs with built-in OCR

Cons:

  • Requires API key and internet connection
  • 2 credits per page with LLM enhancement
  • Not self-hosted

Method 4: Tesseract + Layout Parsers

For scanned documents, combining Tesseract OCR with layout analysis provides good results.

Implementation

import pytesseract
import pdf2image

def parse_with_layout_understanding(pdf_path):
    # Convert PDF pages to images for OCR
    images = pdf2image.convert_from_path(pdf_path, dpi=300)
    
    # The positional grouping below reconstructs layout from OCR boxes;
    # a model such as LayoutLMv3 can be layered on top for semantic labels
    
    full_text = []
    
    for page_num, image in enumerate(images):
        # Get OCR with bounding boxes
        ocr_data = pytesseract.image_to_data(
            image, 
            output_type=pytesseract.Output.DICT,
            config='--psm 3'  # Automatic page segmentation
        )
        
        # Extract text with position info
        page_elements = []
        n_boxes = len(ocr_data['level'])
        
        for i in range(n_boxes):
            if float(ocr_data['conf'][i]) > 60:  # Confidence threshold (-1 means no text)
                element = {
                    'text': ocr_data['text'][i],
                    'x': ocr_data['left'][i],
                    'y': ocr_data['top'][i],
                    'width': ocr_data['width'][i],
                    'height': ocr_data['height'][i],
                    'conf': ocr_data['conf'][i]
                }
                page_elements.append(element)
        
        # Group elements by position to maintain layout
        page_elements.sort(key=lambda x: (x['y'], x['x']))
        
        # Reconstruct text with layout hints
        current_line_y = -1
        line_text = []
        
        for element in page_elements:
            if element['text'].strip():
                # New line detection
                if abs(element['y'] - current_line_y) > 10:
                    if line_text:
                        full_text.append(' '.join(line_text))
                    line_text = [element['text']]
                    current_line_y = element['y']
                else:
                    line_text.append(element['text'])
        
        if line_text:
            full_text.append(' '.join(line_text))
    
    return '\n'.join(full_text)

# Usage
extracted_text = parse_with_layout_understanding('scanned_report.pdf')
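The line-grouping above handles rows but will still interleave side-by-side columns. A common next step is to cluster the OCR boxes into columns by looking for large horizontal gaps between their left edges; a sketch over the same element dicts, with a hypothetical gap threshold:

```python
def group_into_columns(elements, gap=100):
    # Find column boundaries: any jump in sorted left-edge x positions
    # larger than `gap` starts a new column.
    xs = sorted({e["x"] for e in elements})
    boundaries = [xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        if cur - prev > gap:
            boundaries.append(cur)
    columns = [[] for _ in boundaries]
    for e in elements:
        # Assign each element to the rightmost boundary at or left of it
        idx = max(i for i, b in enumerate(boundaries) if e["x"] >= b)
        columns[idx].append(e)
    for col in columns:
        col.sort(key=lambda e: e["y"])  # read each column top-down
    return columns

elements = [
    {"text": "left1", "x": 10, "y": 0},  {"text": "right1", "x": 400, "y": 0},
    {"text": "left2", "x": 12, "y": 30}, {"text": "right2", "x": 402, "y": 30},
]
cols = group_into_columns(elements)
```

The right threshold depends on page DPI and gutter width, so treat it as a tuning parameter, not a constant.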

Pros:

  • Works with scanned documents
  • Preserves spatial relationships
  • Can detect columns and tables
  • Highly customizable

Cons:

  • Complex setup with multiple dependencies
  • Slower processing (OCR + analysis)
  • Requires fine-tuning for specific layouts
  • May struggle with handwritten text

Method 5: Commercial Solutions (ABBYY, Kofax)

Enterprise platforms such as ABBYY FineReader and Kofax deliver the strongest layout preservation available, at enterprise prices.

ABBYY Cloud OCR SDK Example

import requests
import time
import xml.etree.ElementTree as ET

class ABBYYParser:
    def __init__(self, app_id, password):
        self.app_id = app_id
        self.password = password
        self.base_url = "https://cloud-westus.ocrsdk.com"
    
    def process_pdf(self, file_path):
        # Upload file
        with open(file_path, 'rb') as f:
            upload_response = requests.post(
                f"{self.base_url}/processDocument?exportFormat=xml&profile=documentConversion",
                auth=(self.app_id, self.password),
                files={'file': f}
            )
        
        # Get task ID
        task_id = ET.fromstring(upload_response.content).get('id')
        
        # Poll until processing completes or fails
        while True:
            status_response = requests.get(
                f"{self.base_url}/getTaskStatus?taskId={task_id}",
                auth=(self.app_id, self.password)
            )
            
            status = ET.fromstring(status_response.content).get('status')
            if status == 'Completed':
                download_url = ET.fromstring(status_response.content).get('resultUrl')
                break
            if status == 'ProcessingFailed':
                raise RuntimeError('ABBYY reported ProcessingFailed for this task')
            
            time.sleep(5)
        
        # Download result
        result = requests.get(download_url)
        return result.content

# Note: Requires ABBYY Cloud credentials
parser = ABBYYParser('your_app_id', 'your_password')
xml_result = parser.process_pdf('complex_document.pdf')
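The sample polls in a tight loop with a fixed five-second delay and no deadline; production code usually adds a timeout and exponential backoff. A generic sketch (poll_until and its parameters are our own, not part of the ABBYY SDK):

```python
import time

def poll_until(check, timeout=300, initial_delay=2, max_delay=30):
    # Call check() until it returns a non-None result, backing off
    # exponentially between attempts; raise TimeoutError past the deadline.
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while time.monotonic() < deadline:
        result = check()
        if result is not None:
            return result
        time.sleep(delay)
        delay = min(delay * 2, max_delay)
    raise TimeoutError("task did not complete in time")
```

Inside process_pdf you would wrap the status request in a small function that returns the result URL on 'Completed' and None otherwise, then hand it to poll_until.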

ABBYY Pros:

  • Industry-leading accuracy
  • Preserves complex layouts perfectly
  • Handles 200+ languages
  • Advanced table reconstruction

ABBYY Cons:

  • Expensive
  • Requires account setup
  • Cloud processing only

Performance Comparison

Method               | Complex Tables | Multi-Column | Scanned PDFs | Paid?
Apache Tika          | Yes            | No           | No           | No
Camelot              | Yes            | No           | No           | No
PDF Vector           | Yes            | Yes          | Yes          | Yes
Tesseract + LayoutLM | Yes            | Yes          | Yes          | No
ABBYY                | Yes            | Yes          | Yes          | Yes

Making the Right Choice

Use Apache Tika when:

  • You need to handle multiple file formats beyond PDFs
  • You're already in a Java ecosystem
  • You need metadata extraction alongside content
  • Free and open-source is a requirement

Use Camelot when:

  • Your primary concern is extracting tables with structure
  • You're working with financial reports or data-heavy PDFs
  • You need both bordered and borderless table extraction
  • You want pandas DataFrames as output

Use PDF Vector when:

  • You're dealing with complex multi-column layouts
  • You need AI to understand document structure
  • You want clean markdown output that preserves formatting
  • You're processing both digital and scanned PDFs
  • Development speed matters more than self-hosting

Use Tesseract + Layout Parsers when:

  • You're primarily working with scanned documents
  • You need fine-grained control over OCR settings
  • You have machine learning expertise on your team
  • You're building a custom solution for specific document types

Use ABBYY or Commercial Solutions when:

  • You need industry-leading accuracy
  • You're processing documents in multiple languages
  • You have enterprise budget and support requirements
  • You need certified accuracy for legal or compliance reasons

Start with your most complex document and test each approach. Most offer free trials or open-source options, so you can validate accuracy before committing.


Last updated on August 29, 2025
