Extract PDF Data to JSON Format


Learn how to convert PDF documents into structured JSON data using four different methods, from open-source libraries to API services.

August 29, 2025

6 min read

Duy Bui

You've got 50 invoices to process, and manually copying data is not an option. We've all been there, staring at a pile of PDFs that need to become structured data for your database, CRM, or analytics tool. The good news? You can automate this entire process and get clean JSON output in minutes, not hours.

Understanding PDF Data Extraction

PDFs were designed for consistent visual presentation, not data extraction. Unlike HTML or XML, most PDFs carry no logical structure that makes extracting data straightforward. Text might be stored as individually positioned characters, a table could be nothing more than text blocks placed on a grid, and don't even get me started on scanned documents, which are just images with no text layer at all.

That's where JSON comes in. As the universal data exchange format, JSON lets you transform unstructured PDF content into something your applications can actually use. Whether you're building an invoice processing system, extracting research data, or parsing forms, converting to JSON opens up endless possibilities.
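To make the "positioned characters" problem concrete, here is a minimal Python sketch of what extraction libraries do under the hood: group characters by vertical position, order them left to right, and emit JSON. The record shape (`text`, `x`, `y`) and the `y_tolerance` value are assumptions for illustration, not any library's actual format.

```python
import json

def chars_to_lines(chars, y_tolerance=2.0):
    """Group positioned characters into lines of text, top to bottom."""
    lines = []
    # Sort roughly top-to-bottom so characters on the same line end up adjacent
    for ch in sorted(chars, key=lambda c: (c["y"], c["x"])):
        if lines and abs(lines[-1]["y"] - ch["y"]) <= y_tolerance:
            lines[-1]["chars"].append(ch)
        else:
            lines.append({"y": ch["y"], "chars": [ch]})
    # Within each line, order characters left to right before joining
    return [
        "".join(c["text"] for c in sorted(line["chars"], key=lambda c: c["x"]))
        for line in lines
    ]

# Hypothetical records: stored out of reading order, with slight vertical jitter
sample = [
    {"text": "B", "x": 20.0, "y": 50.3},
    {"text": "A", "x": 10.0, "y": 50.0},
    {"text": "D", "x": 20.0, "y": 80.0},
    {"text": "C", "x": 10.0, "y": 80.4},
]

print(json.dumps({"lines": chars_to_lines(sample)}))
# {"lines": ["AB", "CD"]}
```

Real documents need more care (word spacing, columns, rotated text), which is why the methods below lean on existing libraries and services.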

Method 1: Using Python with pdfplumber

pdfplumber is a Python library that excels at extracting text and tables from PDFs. It's particularly good with tabular data, making it a solid choice for invoices and reports.

Installation

pip install pdfplumber

Implementation

import pdfplumber
import json

def extract_invoice_data(pdf_path):
    invoice_data = {
        "invoice_number": "",
        "date": "",
        "total": 0,
        "line_items": []
    }
    
    with pdfplumber.open(pdf_path) as pdf:
        first_page = pdf.pages[0]
        # Fall back to an empty string if the page has no text layer
        text = first_page.extract_text() or ""
        
        # Extract invoice number (simple pattern matching)
        if "Invoice #" in text:
            invoice_data["invoice_number"] = text.split("Invoice #")[1].split("\n")[0].strip()
        
        # Extract tables for line items
        tables = first_page.extract_tables()
        if tables:
            # Assume first table contains line items
            for row in tables[0][1:]:  # Skip header row
                if len(row) >= 3:
                    invoice_data["line_items"].append({
                        "description": row[0],
                        "quantity": row[1],
                        "price": row[2]
                    })
    
    return json.dumps(invoice_data, indent=2)

# Usage
result = extract_invoice_data("invoice.pdf")
print(result)
# Output: {"invoice_number": "INV-2024-001", "date": "", "total": 0, "line_items": [...]}
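Notice that the function above never fills in "date" or "total", and the table cells come back as strings. One way to close that gap is a pair of small helpers like the ones sketched here. The label formats ("Date: YYYY-MM-DD", a line starting with "Total") are assumptions about your invoices, so adjust the patterns to match your own documents.

```python
import re

def parse_amount(value):
    """Convert a cell like '$1,250.00' to a float; return None if it is not numeric."""
    if value is None:
        return None
    cleaned = re.sub(r"[^\d.\-]", "", value)
    try:
        return float(cleaned)
    except ValueError:
        return None

def parse_header_fields(text):
    """Pull the date and total out of page text (label formats are assumed)."""
    fields = {"date": "", "total": 0}
    date_match = re.search(r"Date:\s*(\d{4}-\d{2}-\d{2})", text)
    if date_match:
        fields["date"] = date_match.group(1)
    # Anchor at line start so "Subtotal" is not mistaken for "Total"
    total_match = re.search(r"^Total:?\s*(\S+)", text, flags=re.MULTILINE)
    if total_match:
        amount = parse_amount(total_match.group(1))
        if amount is not None:
            fields["total"] = amount
    return fields

sample_text = "Invoice #INV-2024-001\nDate: 2024-03-15\nSubtotal: $1,000.00\nTotal: $1,250.00"
print(parse_header_fields(sample_text))
# {'date': '2024-03-15', 'total': 1250.0}
```

You could merge the result into invoice_data with invoice_data.update(...) and run parse_amount over the quantity and price cells before appending each line item.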

Pros:

  • Free and open-source
  • Excellent table extraction capabilities
  • Works well with standard PDF layouts
  • Good documentation and community support

Cons:

  • Struggles with complex layouts or rotated text
  • No built-in OCR, so scanned PDFs need a separate OCR step first
  • Requires custom logic for each document type
  • No built-in AI understanding of content

Method 2: Using Node.js with pdf-parse

pdf-parse is a lightweight Node.js library for basic PDF text extraction. While it doesn't have advanced features, it's perfect for simple extraction tasks.

Installation

npm install pdf-parse

Implementation

const fs = require('fs');
const pdf = require('pdf-parse');

async function extractPDFData(pdfPath) {
    const dataBuffer = fs.readFileSync(pdfPath);
    
    try {
        const data = await pdf(dataBuffer);
        
        // Simple extraction requires parsing the text
        const lines = data.text.split('\n');
        const jsonData = {
            totalPages: data.numpages,
            extractedText: lines,
            metadata: data.info
        };
        
        // Custom parsing logic based on your PDF structure
        const invoiceData = {
            invoice_number: "",
            items: []
        };
        
        lines.forEach(line => {
            if (line.includes('Invoice #')) {
                invoiceData.invoice_number = line.split('#')[1]?.trim();
            }
            // Add more parsing logic as needed
        });
        
        return JSON.stringify(invoiceData, null, 2);
    } catch (error) {
        console.error('Error:', error);
        return null;
    }
}

// Usage
extractPDFData('./invoice.pdf').then(result => {
    console.log(result);
});

Pros:

  • Very lightweight (minimal dependencies)
  • Fast processing for simple PDFs
  • Easy to integrate into Node.js applications
  • Good for basic text extraction

Cons:

  • No table extraction capabilities
  • Limited formatting preservation
  • Requires extensive custom parsing logic
  • Not suitable for complex documents
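To give a feel for the "extensive custom parsing logic" a text-only extractor requires, here is a rough sketch of recovering line items from plain text lines with a regular expression. It is written in Python to match Method 1, but the same pattern ports directly to JavaScript. The row layout (description, integer quantity, price with two decimals) is an assumption, and real invoices will break it in creative ways.

```python
import re

# Assumed row layout: "<description> <quantity> <price>", e.g. "Widget A 2 19.99"
LINE_ITEM = re.compile(
    r"^(?P<description>.+?)\s+(?P<quantity>\d+)\s+\$?(?P<price>[\d,]+\.\d{2})$"
)

def parse_line_items(lines):
    """Return a dict for every line that looks like a line-item row."""
    items = []
    for line in lines:
        match = LINE_ITEM.match(line.strip())
        if match:
            items.append({
                "description": match["description"],
                "quantity": int(match["quantity"]),
                "price": float(match["price"].replace(",", "")),
            })
    return items

sample_lines = [
    "Invoice #INV-2024-001",
    "Widget A 2 19.99",
    "USB Cable 3m 10 $1,004.50",
    "Total 2,048.98",
]
print(parse_line_items(sample_lines))
```

Only the two middle lines match here: the header has no quantity and price, and the total line has no separate quantity column.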

Method 3: Using PDF Vector's Ask API

PDF Vector provides an AI-powered Ask API that can extract structured data directly into custom JSON schemas. This eliminates the need for complex parsing logic.

Installation

npm install pdfvector

Implementation

import { PDFVector } from 'pdfvector';

const client = new PDFVector({ 
    apiKey: 'pdfvector_your_api_key' 
});

async function extractInvoiceToJSON(pdfUrl: string) {
    const result = await client.ask({
        url: pdfUrl,
        prompt: "Extract invoice information including all line items",
        mode: "json",
        schema: {
            type: "object",
            properties: {
                invoiceNumber: { type: "string" },
                issueDate: { type: "string" },
                dueDate: { type: "string" },
                vendorInfo: {
                    type: "object",
                    properties: {
                        name: { type: "string" },
                        address: { type: "string" },
                        taxId: { type: "string" }
                    }
                },
                customerInfo: {
                    type: "object",
                    properties: {
                        name: { type: "string" },
                        address: { type: "string" }
                    }
                },
                lineItems: {
                    type: "array",
                    items: {
                        type: "object",
                        properties: {
                            description: { type: "string" },
                            quantity: { type: "number" },
                            unitPrice: { type: "number" },
                            total: { type: "number" }
                        }
                    }
                },
                subtotal: { type: "number" },
                tax: { type: "number" },
                total: { type: "number" }
            },
            required: ["invoiceNumber", "total", "lineItems"]
        }
    });
    
    return result.json;
}

// Usage with URL
const invoiceData = await extractInvoiceToJSON('https://example.com/invoice.pdf');
console.log(JSON.stringify(invoiceData, null, 2));

// Usage with local file
import { readFileSync } from 'fs';  // ES import, consistent with the import at the top
const fileBuffer = readFileSync('./invoice.pdf');
const localResult = await client.ask({
    data: fileBuffer,
    contentType: 'application/pdf',
    prompt: "Extract invoice information",
    mode: "json",
    schema: { /* same schema */ }
});

Pros:

  • AI understands context and document structure
  • No parsing logic needed since you just define your schema
  • Handles complex layouts, tables, and multi-page documents
  • Works with scanned PDFs (OCR built-in)
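Even when a service returns JSON shaped by your schema, it is worth checking the result before it reaches your database. Below is a hand-rolled sketch in Python that covers only a small subset of JSON Schema (type, properties, required, items). It is not part of the PDF Vector SDK, and a full validator library would be the better choice in production.

```python
TYPE_MAP = {"string": str, "number": (int, float), "object": dict, "array": list}

def check_against_schema(value, schema, path="$"):
    """Return a list of problems; an empty list means the value fits this schema subset."""
    problems = []
    expected = TYPE_MAP.get(schema.get("type"))
    # bool is a subclass of int in Python, so reject it explicitly
    if expected and (not isinstance(value, expected) or isinstance(value, bool)):
        problems.append(f"{path}: expected {schema['type']}")
        return problems
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                problems.append(f"{path}.{key}: missing required field")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                problems.extend(check_against_schema(value[key], sub_schema, f"{path}.{key}"))
    elif isinstance(value, list) and "items" in schema:
        for index, item in enumerate(value):
            problems.extend(check_against_schema(item, schema["items"], f"{path}[{index}]"))
    return problems

# A trimmed-down version of the invoice schema used above
invoice_schema = {
    "type": "object",
    "properties": {
        "invoiceNumber": {"type": "string"},
        "total": {"type": "number"},
        "lineItems": {
            "type": "array",
            "items": {"type": "object", "properties": {"quantity": {"type": "number"}}},
        },
    },
    "required": ["invoiceNumber", "total", "lineItems"],
}

bad_result = {"invoiceNumber": 7, "lineItems": [{"quantity": "two"}]}
for problem in check_against_schema(bad_result, invoice_schema):
    print(problem)
```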

Cons:

  • Requires API key (not self-hosted)
  • Costs 2 credits per page
  • Internet connection required
  • Processing time depends on document size
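Since pricing is per page, it helps to estimate usage before running a big batch. Here is a quick back-of-the-envelope calculation in Python, using the 2 credits per page figure from the list above. The page counts are made up for illustration.

```python
CREDITS_PER_PAGE = 2  # figure quoted in the cons list above

def estimate_credits(page_counts):
    """Total credits for a batch, given the page count of each document."""
    return CREDITS_PER_PAGE * sum(page_counts)

# Hypothetical batch: 50 invoices at 3 pages each
print(estimate_credits([3] * 50))  # 300
```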

Method 4: Using Adobe PDF Services API

Adobe PDF Services API offers enterprise-grade PDF processing, including data extraction capabilities.

Implementation

const PDFServicesSdk = require('@adobe/pdfservices-node-sdk');

const credentials = PDFServicesSdk.Credentials
    .servicePrincipalCredentialsBuilder()  // OAuth server-to-server credentials (client ID + secret)
    .withClientId("YOUR_CLIENT_ID")
    .withClientSecret("YOUR_CLIENT_SECRET")
    .build();

const executionContext = PDFServicesSdk.ExecutionContext.create(credentials);
const extractPDFOperation = PDFServicesSdk.ExtractPDF.Operation.createNew();

const input = PDFServicesSdk.FileRef.createFromLocalFile('invoice.pdf');
extractPDFOperation.setInput(input);

const options = new PDFServicesSdk.ExtractPDF.options.ExtractPdfOptions.Builder()
    .addElementsToExtract(
        PDFServicesSdk.ExtractPDF.options.ExtractElementType.TEXT,
        PDFServicesSdk.ExtractPDF.options.ExtractElementType.TABLES
    )
    .build();

extractPDFOperation.setOptions(options);

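extractPDFOperation.execute(executionContext)
    // The Extract API returns a ZIP archive, not a bare JSON file
    .then(result => result.saveAsFile('output.zip'))
    .then(() => {
        // Read structuredData.json from the archive (requires: npm install adm-zip)
        const AdmZip = require('adm-zip');
        const zip = new AdmZip('output.zip');
        const extractedData = JSON.parse(zip.readAsText('structuredData.json'));
        console.log(JSON.stringify(extractedData, null, 2));
    })
    .catch(err => console.error('Error:', err));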

Pros:

  • Enterprise-grade reliability and support
  • Excellent for high-volume processing
  • Comprehensive extraction options
  • Strong security and compliance features

Cons:

  • Complex setup and authentication
  • Higher cost for small projects
  • Requires Adobe account and credentials
  • More suited for enterprise applications
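Keep in mind that the JSON you get back describes the document's structure, not your invoice fields, so some post-processing is still needed. The sketch below (in Python, for consistency with Method 1) assumes the output contains an "elements" list whose entries carry a "Path" and, for text elements, a "Text" key. The sample data is invented, so check the shape against your own output before relying on it.

```python
def collect_text_by_path(structured, path_prefix):
    """Collect the text of every element whose Path starts with the given prefix."""
    return [
        element["Text"]
        for element in structured.get("elements", [])
        if element.get("Path", "").startswith(path_prefix) and "Text" in element
    ]

# Invented sample in the assumed shape of the extraction output
sample_output = {
    "elements": [
        {"Path": "//Document/H1", "Text": "Invoice INV-1"},
        {"Path": "//Document/Table/TR/TD/P", "Text": "Widget"},
        {"Path": "//Document/Figure"},  # non-text element, no "Text" key
    ]
}

print(collect_text_by_path(sample_output, "//Document/Table"))
# ['Widget']
```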

Comparing the Methods

Feature              | pdfplumber | pdf-parse | PDF Vector | Adobe PDF Services
Free to Use          | Yes        | Yes       | No         | No
Easy Setup           | Yes        | Yes       | Yes        | No
AI-Powered           | No         | No        | Yes        | No
Extracts Tables      | Yes        | No        | Yes        | Yes
Handles Scanned PDFs | No         | No        | Yes        | Yes
Custom JSON Schemas  | No         | No        | Yes        | No
Self-Hosted          | Yes        | Yes       | No         | No
Enterprise Support   | No         | No        | No         | Yes
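If you prefer to filter by requirements instead of scanning the comparison by eye, the same information fits in a small Python lookup. The short feature keys are just labels chosen for this sketch.

```python
# The comparison above, encoded as data (True = Yes, False = No)
FEATURES = {
    "pdfplumber": {"free": True, "easy_setup": True, "ai": False, "tables": True,
                   "scanned": False, "custom_schema": False, "self_hosted": True,
                   "enterprise_support": False},
    "pdf-parse": {"free": True, "easy_setup": True, "ai": False, "tables": False,
                  "scanned": False, "custom_schema": False, "self_hosted": True,
                  "enterprise_support": False},
    "PDF Vector": {"free": False, "easy_setup": True, "ai": True, "tables": True,
                   "scanned": True, "custom_schema": True, "self_hosted": False,
                   "enterprise_support": False},
    "Adobe PDF Services": {"free": False, "easy_setup": False, "ai": False, "tables": True,
                           "scanned": True, "custom_schema": False, "self_hosted": False,
                           "enterprise_support": True},
}

def tools_with(*required):
    """Names of the tools that offer every requested feature."""
    return [name for name, features in FEATURES.items()
            if all(features[key] for key in required)]

print(tools_with("tables", "scanned"))  # ['PDF Vector', 'Adobe PDF Services']
print(tools_with("free", "tables"))     # ['pdfplumber']
```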

Making the Right Choice

Use pdfplumber when:

  • You're working with simple PDFs that have clear table structures
  • You need a free, open-source solution
  • You're comfortable writing custom parsing logic
  • Your documents follow consistent layouts

Use pdf-parse when:

  • You only need basic text extraction
  • You're already in a Node.js environment
  • File size and performance are critical
  • You don't need table or formatting preservation

Use PDF Vector when:

  • You need structured JSON output with custom schemas
  • You're dealing with complex or variable document layouts
  • You want AI-powered understanding of content
  • You need to handle scanned PDFs with OCR
  • Development speed is more important than infrastructure control

Use Adobe PDF Services when:

  • You're building enterprise applications
  • You need guaranteed uptime and support
  • You have high-volume processing requirements
  • You're already invested in the Adobe ecosystem

The key is to match your tool to your specific needs. Start small, test with your actual PDFs, and scale up as needed. You now have everything you need to turn those PDFs into useful JSON data.
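Whichever method you pick, those 50 invoices still need a loop around them. Here is a minimal batch runner in Python that takes any extractor function, writes one JSON file per PDF, and keeps going when a file fails. The function name and the fake extractor in the demo are placeholders; you would swap in a real extractor (for pdfplumber, one that returns the dict itself, not a JSON string).

```python
import json
import tempfile
from pathlib import Path

def convert_folder(pdf_dir, out_dir, extract):
    """Run extract(path) on every PDF in pdf_dir and write <name>.json to out_dir."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    results = {"ok": [], "failed": []}
    for pdf_file in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            data = extract(pdf_file)
        except Exception as exc:  # one bad file should not stop the batch
            results["failed"].append((pdf_file.name, str(exc)))
            continue
        target = out_path / f"{pdf_file.stem}.json"
        target.write_text(json.dumps(data, indent=2), encoding="utf-8")
        results["ok"].append(pdf_file.name)
    return results

# Demo with a stand-in extractor and throwaway files
def fake_extract(path):
    if path.stat().st_size == 0:
        raise ValueError("empty file")
    return {"source": path.name}

with tempfile.TemporaryDirectory() as tmp:
    source_dir = Path(tmp) / "pdfs"
    source_dir.mkdir()
    (source_dir / "a.pdf").write_bytes(b"%PDF-1.4")
    (source_dir / "b.pdf").write_bytes(b"")
    summary = convert_folder(source_dir, Path(tmp) / "json", fake_extract)
    written = sorted(p.name for p in (Path(tmp) / "json").iterdir())

print(summary)
```

Collecting failures instead of raising makes it easy to test with a handful of your actual PDFs first and see which ones need a different method.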

Last updated on August 29, 2025
