Convert PDF Tables to JSON

Transform PDF tables into clean JSON data structures ready for your database, API, or analytics pipeline.

August 30, 2025

8 min read

Duy Bui

Stop copying PDF tables cell by cell into your database. Whether you're processing invoices, financial reports, or research data, manually transferring table data wastes hours and introduces errors. You need that quarterly sales report in your analytics pipeline, those invoice line items in your accounting system, and those research results in your data warehouse, all in clean, structured JSON.

Understanding PDF Table Structure

PDF tables aren't stored as tables. They're collections of positioned text elements that happen to look like tables when rendered. Each cell is just text at specific x,y coordinates. The visual alignment creates the illusion of structure, but there's no underlying table object to query.
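
To make that concrete, here is roughly what an extractor actually sees for a tiny two-column table (coordinates invented for illustration):

// What renders as a table is really a flat list of positioned text runs;
// nothing links "Widget" to "Qty" except geometry
const pageTextRuns = [
  { text: 'Item',   x: 72,  y: 700 },
  { text: 'Qty',    x: 220, y: 700 },
  { text: 'Widget', x: 72,  y: 684 },
  { text: '3',      x: 220, y: 684 }
];
// Rows must be inferred from shared y values, columns from shared x values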

Common table formats challenge extraction tools differently. Simple tables with clear borders and consistent spacing extract cleanly. Tables with merged cells require intelligent parsing to understand which cells span multiple columns or rows. Nested tables, where cells contain sub-tables, demand recursive extraction strategies.

JSON excels as the output format because it naturally represents hierarchical data, supports various data types, integrates with every programming language, and validates against schemas. Your downstream systems already speak JSON, making integration seamless.
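
As a preview of where each method below is headed, an extracted invoice table might land in JSON like this (values invented for illustration):

const invoiceJson = {
  invoiceNumber: 'INV-1042',
  lineItems: [
    { description: 'Widget', quantity: 3, unitPrice: 9.99, total: 29.97 },
    { description: 'Gadget', quantity: 1, unitPrice: 14.99, total: 14.99 }
  ],
  totals: { subtotal: 44.96, tax: 3.6, total: 48.56 }
};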

Method 1: Using Tabula for Simple Tables

Tabula specializes in extracting tables from PDFs with remarkable accuracy for well-structured documents.

Basic Table to JSON Conversion

import { exec } from 'child_process';
import { promisify } from 'util';
import fs from 'fs/promises';

const execAsync = promisify(exec);

async function extractWithTabula(pdfPath: string): Promise<any[]> {
  // Extract tables to JSON using the Tabula CLI (tabula-java)
  const outputPath = 'tables.json';
  
  await execAsync(
    `java -jar tabula.jar --format JSON --pages all "${pdfPath}" > "${outputPath}"`
  );
  
  const jsonContent = await fs.readFile(outputPath, 'utf-8');
  const rawTables = JSON.parse(jsonContent);
  
  // Tabula emits an array of table objects; each table's `data` field
  // holds rows of cell objects ({ top, left, width, height, text })
  return rawTables
    .map((table: any) => {
      const cellRows: any[][] = table.data || [];
      if (cellRows.length === 0) return null;
      
      // First row as headers
      const headers = cellRows[0].map((cell: any) => cell.text || '');
      
      // Convert remaining rows to objects keyed by header
      const rows = cellRows.slice(1).map((row: any[]) => {
        const obj: Record<string, string> = {};
        
        row.forEach((cell: any, index: number) => {
          const header = headers[index] || `column_${index}`;
          obj[header] = cell.text || '';
        });
        
        return obj;
      });
      
      return {
        headers,
        rows,
        rowCount: rows.length,
        columnCount: headers.length
      };
    })
    .filter((table: any) => table !== null);
}

// Convert specific table types
async function extractInvoiceTable(pdfPath: string) {
  const tables = await extractWithTabula(pdfPath);
  if (tables.length === 0) {
    throw new Error(`No tables found in ${pdfPath}`);
  }
  
  // Find the line items table (usually the largest)
  const lineItemsTable = tables.reduce((largest, current) =>
    current.rowCount > largest.rowCount ? current : largest
  );
  
  // parseFloat alone chokes on values like "$1,299.00";
  // strip currency symbols and thousands separators first
  const toNumber = (value?: string): number =>
    parseFloat((value || '0').replace(/[^0-9.-]/g, '')) || 0;
  
  // Transform to invoice structure
  const invoice = {
    lineItems: lineItemsTable.rows.map((row: any) => ({
      description: row['Description'] || row['Item'] || '',
      quantity: toNumber(row['Qty'] || row['Quantity']),
      unitPrice: toNumber(row['Unit Price'] || row['Price']),
      total: toNumber(row['Total'] || row['Amount'])
    })),
    subtotal: 0,
    tax: 0,
    total: 0
  };
  
  // Calculate totals
  invoice.subtotal = invoice.lineItems.reduce((sum, item) => sum + item.total, 0);
  
  return invoice;
}

// Usage
const tables = await extractWithTabula('report.pdf');
console.log(JSON.stringify(tables, null, 2));

Pros:

  • Excellent accuracy for well-formatted tables
  • Free and open-source
  • Supports multiple output formats
  • Works well with bordered tables

Cons:

  • Requires Java runtime
  • Struggles with borderless tables
  • No built-in schema validation (see the validation sketch after this list)
  • Limited customization options
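
Because Tabula gives no schema guarantees, it's worth validating extracted tables before they reach downstream systems. Here's a minimal sketch using the Zod library (our choice for illustration; any JSON validator works) against the shape extractWithTabula returns:

import { z } from 'zod';

// Shape we expect each table from extractWithTabula to have
const TableSchema = z.object({
  headers: z.array(z.string()),
  rows: z.array(z.record(z.string(), z.string())),
  rowCount: z.number().int().nonnegative(),
  columnCount: z.number().int().positive()
});

const check = TableSchema.safeParse(tables[0]);
if (!check.success) {
  console.error('Unexpected table shape:', check.error.issues);
}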

Method 2: Using PDFPlumber with Custom Parsing

PDFPlumber provides fine-grained control over table extraction with Python, but we can achieve similar results in TypeScript using pdf-parse.

Advanced Table Detection

import PDFParser from 'pdf-parse';
import fs from 'fs/promises';

interface TableCell {
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

async function extractWithCustomParsing(pdfPath: string) {
  const dataBuffer = await fs.readFile(pdfPath);
  const data = await PDFParser(dataBuffer);
  
  // Extract positioned text elements
  const textElements = extractTextElements(data);
  
  // Detect table regions
  const tables = detectTables(textElements);
  
  // Convert to structured JSON
  return tables.map(table => convertTableToJSON(table)).filter(t => t !== null);
}

function extractTextElements(pdfData: any): TableCell[] {
  const elements: TableCell[] = [];
  
  // Parse PDF content (implementation depends on PDF structure)
  // This is simplified - real implementation needs proper PDF parsing
  const lines = pdfData.text.split('\n');
  let y = 0;
  
  lines.forEach(line => {
    if (line.trim()) {
      // Simple column detection based on multiple spaces
      const cells = line.split(/\s{2,}/);
      let x = 0;
      
      cells.forEach(cell => {
        elements.push({
          text: cell.trim(),
          x,
          y,
          width: cell.length * 8, // Approximate
          height: 12
        });
        x += cell.length * 8 + 20;
      });
    }
    y += 15;
  });
  
  return elements;
}

function detectTables(elements: TableCell[]): TableCell[][] {
  // Group elements by proximity
  const tables: TableCell[][] = [];
  const used = new Set<number>();
  
  elements.forEach((element, index) => {
    if (used.has(index)) return;
    
    const table: TableCell[] = [element];
    used.add(index);
    
    // Find nearby elements
    elements.forEach((other, otherIndex) => {
      if (used.has(otherIndex)) return;
      
      // Check if elements are aligned
      const yDistance = Math.abs(element.y - other.y);
      const xDistance = Math.abs(element.x - other.x);
      
      if (yDistance < 20 || xDistance < 10) {
        table.push(other);
        used.add(otherIndex);
      }
    });
    
    if (table.length > 3) { // Minimum cells for a table
      tables.push(table);
    }
  });
  
  return tables;
}

function convertTableToJSON(tableCells: TableCell[]): any {
  // Sort cells by position
  tableCells.sort((a, b) => {
    if (Math.abs(a.y - b.y) < 5) {
      return a.x - b.x;
    }
    return a.y - b.y;
  });
  
  // Group into rows
  const rows: TableCell[][] = [];
  let currentRow: TableCell[] = [];
  let currentY = tableCells[0]?.y || 0;
  
  tableCells.forEach(cell => {
    if (Math.abs(cell.y - currentY) > 10) {
      if (currentRow.length > 0) {
        rows.push(currentRow);
      }
      currentRow = [cell];
      currentY = cell.y;
    } else {
      currentRow.push(cell);
    }
  });
  
  if (currentRow.length > 0) {
    rows.push(currentRow);
  }
  
  // Convert to JSON structure
  if (rows.length === 0) return null;
  
  const headers = rows[0].map(cell => cell.text);
  const data = rows.slice(1).map(row => {
    const obj: Record<string, string> = {};
    
    row.forEach((cell, index) => {
      const header = headers[index] || `col${index}`;
      obj[header] = cell.text;
    });
    
    return obj;
  });
  
  return {
    headers,
    data,
    metadata: {
      rowCount: data.length,
      columnCount: headers.length,
      position: {
        x: Math.min(...tableCells.map(c => c.x)),
        y: Math.min(...tableCells.map(c => c.y))
      }
    }
  };
}

// Usage with schema mapping
async function extractFinancialData(pdfPath: string) {
  const tables = await extractWithCustomParsing(pdfPath);
  
  // Map to financial schema
  const financialData = {
    quarters: [] as any[],
    totals: {} as any
  };
  
  tables.forEach(table => {
    if (table.headers.includes('Q1') || table.headers.includes('Quarter')) {
      // Quarterly data table
      financialData.quarters = table.data.map(row => ({
        metric: row['Metric'] || row['Description'],
        q1: parseFloat(row['Q1'] || '0'),
        q2: parseFloat(row['Q2'] || '0'),
        q3: parseFloat(row['Q3'] || '0'),
        q4: parseFloat(row['Q4'] || '0'),
        total: parseFloat(row['Total'] || '0')
      }));
    }
  });
  
  return financialData;
}
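
To see the row-grouping logic in isolation, you can feed convertTableToJSON a handful of synthetic cells (coordinates invented for the example):

// Two rows, two columns, expressed as positioned cells
const sampleCells: TableCell[] = [
  { text: 'Name',  x: 0,   y: 0,  width: 40, height: 12 },
  { text: 'Score', x: 100, y: 0,  width: 40, height: 12 },
  { text: 'Alice', x: 0,   y: 15, width: 40, height: 12 },
  { text: '91',    x: 100, y: 15, width: 40, height: 12 }
];

console.log(convertTableToJSON(sampleCells));
// => { headers: ['Name', 'Score'], data: [{ Name: 'Alice', Score: '91' }], ... }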

Pros:

  • Full control over extraction logic
  • Can handle complex layouts
  • Custom schema mapping
  • Only lightweight dependencies (pdf-parse)

Cons:

  • Requires significant development effort
  • Complex coordinate calculations
  • May miss edge cases
  • Harder to maintain

Method 3: PDF Vector's Ask API

PDF Vector's Ask API uses AI to understand table context and extract data directly into your defined JSON schema.

Define Your JSON Schema

import fs from 'fs/promises';
import { PDFVector } from 'pdfvector';

const client = new PDFVector({
  apiKey: 'pdfvector_your_api_key'
});

// Extract specific business data
async function extractSalesReport(pdfPath: string) {
  const pdfBuffer = await fs.readFile(pdfPath);
  
  const result = await client.ask({
    data: pdfBuffer,
    contentType: 'application/pdf',
    prompt: 'Extract sales data including products, quantities, revenues, and regional breakdowns',
    mode: 'json',
    schema: {
      type: 'object',
      properties: {
        reportPeriod: { type: 'string' },
        currency: { type: 'string' },
        productSales: {
          type: 'array',
          items: {
            type: 'object',
            properties: {
              productName: { type: 'string' },
              sku: { type: 'string' },
              category: { type: 'string' },
              unitsSold: { type: 'number' },
              revenue: { type: 'number' },
              averagePrice: { type: 'number' }
            },
            required: ['productName', 'unitsSold', 'revenue']
          }
        },
        regionalBreakdown: {
          type: 'array',
          items: {
            type: 'object',
            properties: {
              region: { type: 'string' },
              revenue: { type: 'number' },
              percentageOfTotal: { type: 'number' }
            }
          }
        },
        totals: {
          type: 'object',
          properties: {
            totalRevenue: { type: 'number' },
            totalUnits: { type: 'number' },
            averageOrderValue: { type: 'number' }
          }
        }
      }
    }
  });
  
  return result.json;
}

// Usage
const salesData = await extractSalesReport('Q4-sales-report.pdf');
console.log(`Total Revenue: ${salesData.totals.totalRevenue}`);
console.log(`Top Product: ${salesData.productSales[0].productName}`);
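
Because the result already conforms to your schema, handing it to the next system is straightforward. A sketch posting it to a hypothetical analytics endpoint (the URL and env var are placeholders):

// Ship the extracted JSON straight into a downstream pipeline
// (analytics.example.com is a placeholder, not a real service)
const ingest = await fetch('https://analytics.example.com/ingest/sales', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${process.env.ANALYTICS_TOKEN}`
  },
  body: JSON.stringify(salesData)
});

if (!ingest.ok) {
  throw new Error(`Ingest failed with status ${ingest.status}`);
}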

Pros:

  • AI understands context and relationships
  • Works with any table format
  • Direct schema-based extraction
  • Handles complex and nested tables
  • No preprocessing required

Cons:

  • Requires API key
  • Costs 3 credits per page
  • Internet connection required
  • Less control over extraction logic

Method 4: Using Apache Tika

Apache Tika provides a robust content extraction framework with table support.

Server Setup and Table Extraction

import fs from 'fs/promises';

async function extractWithTika(pdfPath: string) {
  const pdfBuffer = await fs.readFile(pdfPath);
  
  // Call the Tika server (assumed to be running on its default port, 9998)
  const response = await fetch('http://localhost:9998/tika', {
    method: 'PUT',
    headers: {
      'Accept': 'application/json',
      'Content-Type': 'application/pdf'
    },
    body: pdfBuffer
  });
  
  if (!response.ok) {
    throw new Error(`Tika server returned ${response.status}`);
  }
  
  const tikaOutput = await response.json();
  
  // Parse Tika's structured content
  return parseTableFromTikaOutput(tikaOutput);
}

function parseTableFromTikaOutput(tikaData: any): any[] {
  const tables: any[] = [];
  
  // The `table:table` / `table:row` / `table:cell` keys below follow
  // OpenDocument-style markup; whether Tika emits them depends on the
  // source format and parser. Plain PDFs often come back as unstructured
  // text and need coordinate heuristics like those in Method 2 instead.
  if (tikaData['table:table']) {
    const rawTables = Array.isArray(tikaData['table:table']) 
      ? tikaData['table:table'] 
      : [tikaData['table:table']];
    
    rawTables.forEach(table => {
      const rows = table['table:row'] || [];
      const structuredTable = {
        headers: [] as string[],
        data: [] as any[]
      };
      
      rows.forEach((row: any, index: number) => {
        const cells = row['table:cell'] || [];
        
        if (index === 0) {
          // First row as headers
          structuredTable.headers = cells.map((cell: any) => 
            cell['table:content'] || ''
          );
        } else {
          // Data rows
          const rowData: Record<string, any> = {};
          
          cells.forEach((cell: any, cellIndex: number) => {
            const header = structuredTable.headers[cellIndex] || `col${cellIndex}`;
            rowData[header] = cell['table:content'] || '';
          });
          
          structuredTable.data.push(rowData);
        }
      });
      
      tables.push(structuredTable);
    });
  }
  
  return tables;
}

// Transform to specific formats
async function extractAndTransform(pdfPath: string) {
  const tables = await extractWithTika(pdfPath);
  
  // Transform for data analysis
  const analysisReady = tables.map(table => ({
    dimensions: {
      rows: table.data.length,
      columns: table.headers.length
    },
    headers: table.headers,
    data: table.data,
    statistics: calculateTableStats(table.data)
  }));
  
  return analysisReady;
}

function calculateTableStats(data: any[]): any {
  const stats: any = {
    numericColumns: {},
    textColumns: {}
  };
  
  if (data.length === 0) return stats;
  
  // Analyze each column
  Object.keys(data[0]).forEach(column => {
    const values = data.map(row => row[column]);
    const numericValues = values
      .map(v => parseFloat(v))
      .filter(v => !isNaN(v));
    
    if (numericValues.length > values.length * 0.5) {
      // Mostly numeric column
      stats.numericColumns[column] = {
        min: Math.min(...numericValues),
        max: Math.max(...numericValues),
        avg: numericValues.reduce((a, b) => a + b, 0) / numericValues.length,
        sum: numericValues.reduce((a, b) => a + b, 0)
      };
    } else {
      // Text column
      stats.textColumns[column] = {
        uniqueValues: new Set(values).size,
        mostCommon: findMostCommon(values)
      };
    }
  });
  
  return stats;
}

function findMostCommon(arr: string[]): string {
  const counts = arr.reduce((acc, val) => {
    acc[val] = (acc[val] || 0) + 1;
    return acc;
  }, {} as Record<string, number>);
  
  return Object.entries(counts)
    .sort(([,a], [,b]) => b - a)[0]?.[0] || '';
}
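
Before extracting, it helps to confirm the server is actually reachable. Tika server exposes a plain-text /version endpoint, so a quick health check might look like:

// Fail fast if the Tika server isn't reachable on its default port
async function assertTikaAvailable(baseUrl = 'http://localhost:9998'): Promise<void> {
  const res = await fetch(`${baseUrl}/version`).catch(() => null);
  if (!res || !res.ok) {
    throw new Error(`Tika server not reachable at ${baseUrl}`);
  }
  console.log(`Tika server up: ${await res.text()}`);
}

await assertTikaAvailable();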

Pros:

  • Handles many file formats
  • Good multilingual support
  • Metadata extraction included
  • Can run as a service

Cons:

  • Requires server setup
  • Java dependency
  • Generic extraction, not table-specific
  • May need post-processing

Making the Right Choice

Use Tabula when:

  • Your PDFs have well-formatted, bordered tables
  • You need a simple, reliable solution
  • You're comfortable with Java dependencies
  • You can preprocess PDFs to improve quality
  • Free and open-source is a requirement

Use PDFPlumber when:

  • You need complete control over extraction
  • Your tables have unique formatting
  • You're building a specialized solution
  • You have development resources
  • Performance is critical

Use PDF Vector when:

  • You need to handle any table format
  • Accuracy is more important than cost
  • You want structured JSON output immediately
  • You're processing diverse document types
  • Development speed matters

Use Apache Tika when:

  • You're already using Tika for other extraction
  • You need to process multiple file formats
  • Metadata extraction is also required
  • You can run a Tika server
  • You need multilingual support
