Build Your Own PDF Conversion Service with JavaScript

Learn how to build your own PDF processing API from scratch using free npm packages. Complete implementation guide with multiple approaches.

August 29, 2025

6 min read


Duy Bui

Want to build your own PDF conversion service? This guide shows you exactly how to create a PDF processing API using JavaScript and free npm packages. We'll build a real working service, discover what challenges you'll face, and explore when different approaches make sense.

Quick Project Setup

Let's start by creating a new Node.js project:

mkdir pdf-service
cd pdf-service
npm init -y

# Install ONE of these packages based on your needs
npm install pdf-parse        # For simple text extraction
npm install pdf2json        # For text with positioning
npm install pdf-lib         # For PDF manipulation
npm install pdfjs-dist      # For browser-compatible extraction  
npm install pdf-table-extractor  # For table extraction only

Available NPM Packages

Here are the npm packages you can use to build your PDF service:

| Package | Type | What It Does | Installation |
|---------|------|--------------|--------------|
| pdf-parse | Free | Extract text from PDFs | npm install pdf-parse |
| pdf2json | Free | Text with positioning | npm install pdf2json |
| pdfjs-dist | Free | Mozilla's PDF reader | npm install pdfjs-dist |
| pdf-lib | Free | Create/modify PDFs | npm install pdf-lib |
| pdf-table-extractor | Free | Extract tables only | npm install pdf-table-extractor |
| pdfvector | Paid API | AI-powered extraction with schemas | npm install pdfvector |

Each package solves different problems. Choose based on your needs.

Building Your Service with Free NPM Packages

Let's build a PDF parsing service using these free packages. We'll create simple examples for each approach.

Using pdf-parse

Start with the most popular package, pdf-parse:

// pdfParse.js
const pdfParse = require('pdf-parse');
const fs = require('fs').promises;

async function extractWithPdfParse(pdfPath) {
  const dataBuffer = await fs.readFile(pdfPath);
  const data = await pdfParse(dataBuffer);
  
  return {
    text: data.text,
    pages: data.numpages,
    info: data.info
  };
}

// Usage (run inside an async function)
const result = await extractWithPdfParse('./invoice.pdf');
console.log(result.text); // Raw text, no formatting

This works great for simple text but loses all formatting and can't handle tables.

Using pdf2json

For more detailed extraction with text positions:

// pdf2json.js
const PDFParser = require('pdf2json');

function extractWithPdf2Json(pdfPath) {
  return new Promise((resolve, reject) => {
    const pdfParser = new PDFParser();
    
    pdfParser.on('pdfParser_dataError', (errData) => reject(errData.parserError));
    pdfParser.on('pdfParser_dataReady', (pdfData) => {
      // Extract text from the complex structure
      const pages = pdfData.Pages.map((page) => {
        const texts = page.Texts.map((text) => 
          decodeURIComponent(text.R[0].T)
        ).join(' ');
        return texts;
      });
      
      resolve({ 
        text: pages.join('\n'),
        rawData: pdfData // Contains positioning info
      });
    });
    
    pdfParser.loadPDF(pdfPath);
  });
}

You get positioning data but the output structure is complex.
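If you need the coordinates themselves, you can flatten the structure into simple records. A minimal sketch over the same pdfData object (note that x and y are in pdf2json's own page units, not PDF points):

// Flatten pdf2json output into one record per text item
function getPositionedText(pdfData) {
  return pdfData.Pages.flatMap((page, pageIndex) =>
    page.Texts.map((text) => ({
      page: pageIndex + 1,
      x: text.x,
      y: text.y,
      content: decodeURIComponent(text.R[0].T)
    }))
  );
}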

Using pdfjs-dist

Mozilla's PDF.js provides more control:

// pdfjsDist.js
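// Note: newer pdfjs-dist releases are ESM-first; in CommonJS you may
// need the legacy build, e.g. require('pdfjs-dist/legacy/build/pdf.js')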
const pdfjsLib = require('pdfjs-dist');
const fs = require('fs').promises;

async function extractWithPdfJs(pdfPath) {
  const data = new Uint8Array(await fs.readFile(pdfPath));
  const pdf = await pdfjsLib.getDocument(data).promise;
  
  let fullText = '';
  
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const textContent = await page.getTextContent();
    const pageText = textContent.items
      .map((item) => item.str)
      .join(' ');
    fullText += pageText + '\n';
  }
  
  return { text: fullText, pages: pdf.numPages };
}

More control over extraction but requires more setup.

Using pdf-lib

pdf-lib is mainly for creating PDFs but can read basic info:

// pdfLib.js
const { PDFDocument } = require('pdf-lib');
const fs = require('fs').promises;

async function extractWithPdfLib(pdfPath) {
  const existingPdfBytes = await fs.readFile(pdfPath);
  const pdfDoc = await PDFDocument.load(existingPdfBytes);
  
  return {
    pageCount: pdfDoc.getPageCount(),
    title: pdfDoc.getTitle(),
    author: pdfDoc.getAuthor(),
    // Note: pdf-lib can't extract text content
  };
}

Great for metadata but can't extract text content.
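Where pdf-lib shines is generation. Here's a minimal sketch of creating a one-page PDF; the page size and text position are arbitrary example values:

// createPdf.js
const { PDFDocument, StandardFonts } = require('pdf-lib');
const fs = require('fs').promises;

async function createSimplePdf(outputPath) {
  const pdfDoc = await PDFDocument.create();
  const font = await pdfDoc.embedFont(StandardFonts.Helvetica);
  const page = pdfDoc.addPage([612, 792]); // US Letter, in points

  page.drawText('Generated with pdf-lib', {
    x: 50,
    y: 742,
    size: 18,
    font
  });

  await fs.writeFile(outputPath, await pdfDoc.save());
}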

Using pdf-table-extractor

For table-specific extraction:

// pdfTableExtractor.js
const pdfTableExtractor = require('pdf-table-extractor');

function extractTables(pdfPath) {
  return new Promise((resolve, reject) => {
    // pdf-table-extractor takes separate success and error callbacks
    pdfTableExtractor(
      pdfPath,
      (result) => {
        // Each pageTables entry holds the tables found on one page
        resolve(result.pageTables.map((page) => page.tables));
      },
      (error) => reject(new Error(`Table extraction failed: ${error}`))
    );
  });
}

Only extracts tables. You need another library for text.
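As a usage sketch, you can flatten the resolved data into CSV lines, assuming each page entry is an array of rows of cell strings (the shape pdf-table-extractor reports under pageTables[].tables):

// Naive CSV conversion: no quoting or escaping of cell contents
async function tablesToCsv(pdfPath) {
  const pages = await extractTables(pdfPath);
  return pages
    .flat() // merge rows from all pages
    .map((row) => row.join(','))
    .join('\n');
}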

Using pdfvector

For AI-powered extraction with structured data:

const { PDFVector } = require('pdfvector');
const fs = require('fs').promises;

const client = new PDFVector({ apiKey: 'your_api_key' });

async function extractWithPDFVector(pdfPath) {
  const buffer = await fs.readFile(pdfPath);
  
  // Simple text extraction
  const result = await client.parse({
    data: buffer,
    contentType: 'application/pdf'
  });
  
  return {
    text: result.markdown,
    pageCount: result.pageCount
  };
}

// Extract structured data with schema
async function extractInvoiceWithPDFVector(pdfPath) {
  const buffer = await fs.readFile(pdfPath);
  
  const result = await client.ask({
    data: buffer,
    prompt: 'Extract invoice information',
    mode: 'json',
    schema: {
      type: 'object',
      properties: {
        invoiceNumber: { type: 'string' },
        total: { type: 'number' }
      }
    }
  });
  
  return result.json;
}

It's a hosted API service with AI understanding, and it works with both PDFs and Word documents.
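For the schema-driven call, the returned object mirrors the schema you passed in. A usage sketch (run inside an async function):

// The field names come from the schema defined above
const invoice = await extractInvoiceWithPDFVector('./invoice.pdf');
console.log(invoice.invoiceNumber, invoice.total);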

Create Basic API Endpoint

Now let's combine these into a basic API:

const express = require('express');
const multer = require('multer');
// Import your chosen parser functions

const app = express();
const upload = multer({ dest: 'uploads/' });

app.post('/api/parse', upload.single('pdf'), async (req, res) => {
  try {
    if (!req.file) {
      return res.status(400).json({ error: 'No file uploaded' });
    }

    // Try different parsers based on what you need
    const textResult = await extractWithPdfParse(req.file.path);
    const positionResult = await extractWithPdf2Json(req.file.path);
    const tables = await extractTables(req.file.path);
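    // In production, remove the multer temp file once parsing is done,
    // e.g. with fs.promises.unlink(req.file.path)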
    
    res.json({
      text: textResult.text,
      tables: tables,
      pageCount: textResult.pages
    });
  } catch (error) {
    res.status(500).json({ error: 'Extraction failed' });
  }
});

app.listen(3000, () => {
  console.log('PDF service running on port 3000');
});
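To exercise the endpoint, you can post a file from another Node process. A minimal sketch, assuming Node 18+ (which ships global fetch, FormData, and Blob); the file name and port are example values:

// client.js
const fs = require('fs').promises;

async function parseRemote(pdfPath) {
  const buffer = await fs.readFile(pdfPath);

  const form = new FormData();
  // The field name 'pdf' must match upload.single('pdf') in the endpoint
  form.append('pdf', new Blob([buffer], { type: 'application/pdf' }), 'upload.pdf');

  const response = await fetch('http://localhost:3000/api/parse', {
    method: 'POST',
    body: form
  });
  return response.json();
}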

Combining Packages for Complete Extraction

Here's the reality: you need multiple packages for a complete solution. Each package handles one thing, and you must combine them:

// Complete extraction requires multiple libraries
async function extractEverything(pdfPath) {
  // Text extraction
  const text = await extractWithPdfParse(pdfPath);
  
  // Table extraction  
  const tables = await extractTables(pdfPath);
  
  // Position data
  const positioned = await extractWithPdf2Json(pdfPath);
  
  // Metadata
  const metadata = await extractWithPdfLib(pdfPath);
  
  return {
    ...text,
    ...metadata,
    tables,
    positioning: positioned.rawData
  };
}
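Since the four extractors are independent of one another, the same function can run them concurrently with Promise.all:

// Parallel variant of extractEverything
async function extractEverythingParallel(pdfPath) {
  const [text, tables, positioned, metadata] = await Promise.all([
    extractWithPdfParse(pdfPath),
    extractTables(pdfPath),
    extractWithPdf2Json(pdfPath),
    extractWithPdfLib(pdfPath)
  ]);

  return {
    ...text,
    ...metadata,
    tables,
    positioning: positioned.rawData
  };
}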

What You Still Need to Handle

After building your basic service, here are the limitations of each package:

pdf-parse limitations:

  • No table extraction
  • Loses all formatting
  • Can't handle scanned PDFs
  • No positioning information

pdf2json limitations:

  • Complex nested JSON output
  • Callback-based API
  • Large file sizes in output
  • Difficult to extract specific fields

pdfjs-dist limitations:

  • Designed for browsers, not Node.js
  • Heavy dependencies
  • Complex setup required
  • No built-in table extraction

pdf-lib limitations:

  • Can't extract text at all
  • Only reads metadata
  • Meant for creating/modifying PDFs
  • Must combine with other libraries

pdf-table-extractor limitations:

  • Only extracts tables
  • Can't get regular text
  • Tables that span multiple pages get split
  • No support for complex layouts

pdfvector considerations:

  • Requires API key
  • Paid service
  • Internet connection required
  • Rate limits apply

Making the Decision

Time Investment

  • Building with free libraries: 1 day for basic implementation
  • Using API services: 30 minutes to integrate

The real time cost comes later: debugging edge cases, handling failures, and maintaining multiple libraries.

When Free Libraries Work

Free libraries make sense when:

  • You only need simple text extraction
  • You process low volumes (under 100 documents/day)
  • You're building a learning project to understand PDFs
  • Your budget is zero and accuracy isn't critical

When to Use API Services

Consider API services when:

  • You're building production applications that need reliability
  • You're handling complex documents like invoices, contracts, or forms
  • You need accuracy and structure, not just raw text
  • Your time is worth more than the API costs
  • You want to focus on your business logic, not PDF parsing

Conclusion

Building your own PDF conversion service is straightforward: you can have a basic version running in a day. Each npm package has its strengths and limitations.

For simple text extraction, free libraries like pdf-parse work well. For complex documents with tables and structured data, you'll need multiple libraries or an API service. For production applications, consider the maintenance cost of managing multiple packages versus using a single solution.

The code examples above give you everything needed to start. Try different packages to see what works for your use case.

Last updated on August 29, 2025
