Transform PDF tables into clean JSON data structures ready for your database, API, or analytics pipeline.
Stop copying PDF tables cell by cell into your database. Whether you're processing invoices, financial reports, or research data, manually transferring table data wastes hours and introduces errors. You need that quarterly sales report in your analytics pipeline, those invoice line items in your accounting system, and those research results in your data warehouse, all in clean, structured JSON.
Understanding PDF Table Structure
PDF tables aren't stored as tables. They're collections of positioned text elements that happen to look like tables when rendered. Each cell is just text at specific x,y coordinates. The visual alignment creates the illusion of structure, but there's no underlying table object to query.
Common table formats challenge extraction tools differently. Simple tables with clear borders and consistent spacing extract cleanly. Tables with merged cells require intelligent parsing to understand which cells span multiple columns or rows. Nested tables, where cells contain sub-tables, demand recursive extraction strategies.
JSON excels as the output format because it naturally represents hierarchical data, supports various data types, integrates with every programming language, and validates against schemas. Your downstream systems already speak JSON, making integration seamless.
Method 1: Using Tabula for Simple Tables
Tabula specializes in extracting tables from PDFs with remarkable accuracy for well-structured documents.
Basic Table to JSON Conversion
import { exec } from 'child_process';
import { promisify } from 'util';
import fs from 'fs/promises';
const execAsync = promisify(exec);
async function extractWithTabula(pdfPath: string): Promise<any[]> {
// Extract tables to JSON using Tabula CLI
const outputPath = 'tables.json';
await execAsync(
`java -jar tabula.jar --format JSON --pages all "${pdfPath}" > "${outputPath}"`
);
const jsonContent = await fs.readFile(outputPath, 'utf-8');
const rawTables = JSON.parse(jsonContent);
// Transform Tabula output to structured JSON
return rawTables.map((table: any[]) => {
if (table.length === 0) return null;
// First row as headers
const headers = table[0].map((cell: any) => cell.text || '');
// Convert remaining rows to objects
const rows = table.slice(1).map((row: any[]) => {
const obj: Record<string, string> = {};
row.forEach((cell: any, index: number) => {
const header = headers[index] || `column_${index}`;
obj[header] = cell.text || '';
});
return obj;
});
return {
headers,
rows,
rowCount: rows.length,
columnCount: headers.length
};
}).filter(table => table !== null);
}
// Convert specific table types
async function extractInvoiceTable(pdfPath: string) {
const tables = await extractWithTabula(pdfPath);
// Find the line items table (usually the largest)
const lineItemsTable = tables.reduce((largest, current) =>
current.rowCount > largest.rowCount ? current : largest
);
// Transform to invoice structure
const invoice = {
lineItems: lineItemsTable.rows.map(row => ({
description: row['Description'] || row['Item'],
quantity: parseFloat(row['Qty'] || row['Quantity'] || '0'),
unitPrice: parseFloat(row['Unit Price'] || row['Price'] || '0'),
total: parseFloat(row['Total'] || row['Amount'] || '0')
})),
subtotal: 0,
tax: 0,
total: 0
};
// Calculate totals
invoice.subtotal = invoice.lineItems.reduce((sum, item) => sum + item.total, 0);
return invoice;
}
// Usage
const tables = await extractWithTabula('report.pdf');
console.log(JSON.stringify(tables, null, 2));
Pros:
- Excellent accuracy for well-formatted tables
- Free and open-source
- Supports multiple output formats
- Works well with bordered tables
Cons:
- Requires Java runtime
- Struggles with borderless tables
- No built-in schema validation
- Limited customization options
Method 2: Using PDFPlumber with Custom Parsing
PDFPlumber provides fine-grained control over table extraction with Python, but we can achieve similar results in TypeScript using pdf-parse.
Advanced Table Detection
import PDFParser from 'pdf-parse';
import fs from 'fs/promises';
interface TableCell {
text: string;
x: number;
y: number;
width: number;
height: number;
}
async function extractWithCustomParsing(pdfPath: string) {
const dataBuffer = await fs.readFile(pdfPath);
const data = await PDFParser(dataBuffer);
// Extract positioned text elements
const textElements = extractTextElements(data);
// Detect table regions
const tables = detectTables(textElements);
// Convert to structured JSON
return tables.map(table => convertTableToJSON(table));
}
function extractTextElements(pdfData: any): TableCell[] {
const elements: TableCell[] = [];
// Parse PDF content (implementation depends on PDF structure)
// This is simplified - real implementation needs proper PDF parsing
const lines = pdfData.text.split('\n');
let y = 0;
lines.forEach(line => {
if (line.trim()) {
// Simple column detection based on multiple spaces
const cells = line.split(/\s{2,}/);
let x = 0;
cells.forEach(cell => {
elements.push({
text: cell.trim(),
x,
y,
width: cell.length * 8, // Approximate
height: 12
});
x += cell.length * 8 + 20;
});
}
y += 15;
});
return elements;
}
function detectTables(elements: TableCell[]): TableCell[][] {
// Group elements by proximity
const tables: TableCell[][] = [];
const used = new Set<number>();
elements.forEach((element, index) => {
if (used.has(index)) return;
const table: TableCell[] = [element];
used.add(index);
// Find nearby elements
elements.forEach((other, otherIndex) => {
if (used.has(otherIndex)) return;
// Check if elements are aligned
const yDistance = Math.abs(element.y - other.y);
const xDistance = Math.abs(element.x - other.x);
if (yDistance < 20 || xDistance < 10) {
table.push(other);
used.add(otherIndex);
}
});
if (table.length > 3) { // Minimum cells for a table
tables.push(table);
}
});
return tables;
}
function convertTableToJSON(tableCells: TableCell[]): any {
// Sort cells by position
tableCells.sort((a, b) => {
if (Math.abs(a.y - b.y) < 5) {
return a.x - b.x;
}
return a.y - b.y;
});
// Group into rows
const rows: TableCell[][] = [];
let currentRow: TableCell[] = [];
let currentY = tableCells[0]?.y || 0;
tableCells.forEach(cell => {
if (Math.abs(cell.y - currentY) > 10) {
if (currentRow.length > 0) {
rows.push(currentRow);
}
currentRow = [cell];
currentY = cell.y;
} else {
currentRow.push(cell);
}
});
if (currentRow.length > 0) {
rows.push(currentRow);
}
// Convert to JSON structure
if (rows.length === 0) return null;
const headers = rows[0].map(cell => cell.text);
const data = rows.slice(1).map(row => {
const obj: Record<string, string> = {};
row.forEach((cell, index) => {
const header = headers[index] || `col${index}`;
obj[header] = cell.text;
});
return obj;
});
return {
headers,
data,
metadata: {
rowCount: data.length,
columnCount: headers.length,
position: {
x: Math.min(...tableCells.map(c => c.x)),
y: Math.min(...tableCells.map(c => c.y))
}
}
};
}
// Usage with schema mapping
async function extractFinancialData(pdfPath: string) {
const tables = await extractWithCustomParsing(pdfPath);
// Map to financial schema
const financialData = {
quarters: [] as any[],
totals: {} as any
};
tables.forEach(table => {
if (table.headers.includes('Q1') || table.headers.includes('Quarter')) {
// Quarterly data table
financialData.quarters = table.data.map(row => ({
metric: row['Metric'] || row['Description'],
q1: parseFloat(row['Q1'] || '0'),
q2: parseFloat(row['Q2'] || '0'),
q3: parseFloat(row['Q3'] || '0'),
q4: parseFloat(row['Q4'] || '0'),
total: parseFloat(row['Total'] || '0')
}));
}
});
return financialData;
}
Pros:
- Full control over extraction logic
- Can handle complex layouts
- Custom schema mapping
- No external dependencies
Cons:
- Requires significant development effort
- Complex coordinate calculations
- May miss edge cases
- Harder to maintain
Method 3: PDF Vector's Ask API
PDF Vector's Ask API uses AI to understand table context and extract data directly into your defined JSON schema.
Define Your JSON Schema
import { PDFVector } from 'pdfvector';
const client = new PDFVector({
apiKey: 'pdfvector_your_api_key'
});
// Extract specific business data
async function extractSalesReport(pdfPath: string) {
const pdfBuffer = await fs.readFile(pdfPath);
const result = await client.ask({
data: pdfBuffer,
contentType: 'application/pdf',
prompt: 'Extract sales data including products, quantities, revenues, and regional breakdowns',
mode: 'json',
schema: {
type: 'object',
properties: {
reportPeriod: { type: 'string' },
currency: { type: 'string' },
productSales: {
type: 'array',
items: {
type: 'object',
properties: {
productName: { type: 'string' },
sku: { type: 'string' },
category: { type: 'string' },
unitsSold: { type: 'number' },
revenue: { type: 'number' },
averagePrice: { type: 'number' }
},
required: ['productName', 'unitsSold', 'revenue']
}
...



