Build Your Own PDF Conversion Service with JavaScript

Learn how to build your own PDF processing API from scratch using free npm packages. Complete implementation guide with multiple approaches.

August 29, 2025

6 min read


Duy Bui

Want to build your own PDF conversion service? This guide shows you exactly how to create a PDF processing API using JavaScript and free npm packages. We'll build a real working service, discover what challenges you'll face, and explore when different approaches make sense.

Quick Project Setup

Let's start by creating a new Node.js project:

mkdir pdf-service
cd pdf-service
npm init -y

# Install ONE of these packages based on your needs
npm install pdf-parse        # For simple text extraction
npm install pdf2json        # For text with positioning
npm install pdf-lib         # For PDF manipulation
npm install pdfjs-dist      # For browser-compatible extraction  
npm install pdf-table-extractor  # For table extraction only

Available NPM Packages

Here are the npm packages you can use to build your PDF service:

| Package | Type | What It Does | Installation |
|---------|------|--------------|--------------|
| pdf-parse | Free | Extract text from PDFs | npm install pdf-parse |
| pdf2json | Free | Text with positioning | npm install pdf2json |
| pdfjs-dist | Free | Mozilla's PDF reader | npm install pdfjs-dist |
| pdf-lib | Free | Create/modify PDFs | npm install pdf-lib |
| pdf-table-extractor | Free | Extract tables only | npm install pdf-table-extractor |
| pdfvector | Paid API | AI-powered extraction with schemas | npm install pdfvector |

Each package solves different problems. Choose based on your needs.

Building Your Service with Free NPM Packages

Let's build a PDF parsing service using these free packages. We'll create simple examples for each approach.

Using pdf-parse

Start with the most popular package, pdf-parse:

// pdfParse.js
const pdfParse = require('pdf-parse');
const fs = require('fs').promises;

async function extractWithPdfParse(pdfPath) {
  const dataBuffer = await fs.readFile(pdfPath);
  const data = await pdfParse(dataBuffer);
  
  return {
    text: data.text,
    pages: data.numpages,
    info: data.info
  };
}

// Usage (run inside an async function)
const result = await extractWithPdfParse('./invoice.pdf');
console.log(result.text); // Raw text, no formatting

This works great for simple text but loses all formatting and can't handle tables.

Using pdf2json

For more detailed extraction with text positions:

// pdf2json.js
const PDFParser = require('pdf2json');

function extractWithPdf2Json(pdfPath) {
  return new Promise((resolve, reject) => {
    const pdfParser = new PDFParser();
    
    pdfParser.on('pdfParser_dataError', (errData) => reject(errData.parserError));
    pdfParser.on('pdfParser_dataReady', (pdfData) => {
      // Extract text from the complex structure
      const pages = pdfData.Pages.map((page) => {
        const texts = page.Texts.map((text) => 
          decodeURIComponent(text.R[0].T)
        ).join(' ');
        return texts;
      });
      
      resolve({ 
        text: pages.join('\n'),
        rawData: pdfData // Contains positioning info
      });
    });
    
    pdfParser.loadPDF(pdfPath);
  });
}

You get positioning data but the output structure is complex.
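If you need the coordinates themselves, you can flatten the structure into simple records. A minimal sketch over the same pdfData object (note that x and y are in pdf2json's own page units, not PDF points):

// Flatten pdf2json output into one record per text item
function getPositionedText(pdfData) {
  return pdfData.Pages.flatMap((page, pageIndex) =>
    page.Texts.map((text) => ({
      page: pageIndex + 1,
      x: text.x,
      y: text.y,
      content: decodeURIComponent(text.R[0].T)
    }))
  );
}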

Using pdfjs-dist

Mozilla's PDF.js provides more control:

// pdfjsDist.js
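// Note: newer pdfjs-dist releases are ESM-first; in CommonJS you may
// need the legacy build, e.g. require('pdfjs-dist/legacy/build/pdf.js')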
const pdfjsLib = require('pdfjs-dist');
const fs = require('fs').promises;

async function extractWithPdfJs(pdfPath) {
  const data = new Uint8Array(await fs.readFile(pdfPath));
  const pdf = await pdfjsLib.getDocument(data).promise;
  
  let fullText = '';
  
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const textContent = await page.getTextContent();
    const pageText = textContent.items
      .map((item) => item.str)
      .join(' ');
    fullText += pageText + '\n';
  }
  
  return { text: fullText, pages: pdf.numPages };
}

More control over extraction but requires more setup.

Using pdf-lib

pdf-lib is mainly for creating PDFs but can read basic info:

// pdfLib.js
const { PDFDocument } = require('pdf-lib');
const fs = require('fs').promises;

async function extractWithPdfLib(pdfPath) {
  const existingPdfBytes = await fs.readFile(pdfPath);
  const pdfDoc = await PDFDocument.load(existingPdfBytes);
  
  return {
    pageCount: pdfDoc.getPageCount(),
    title: pdfDoc.getTitle(),
    author: pdfDoc.getAuthor(),
    // Note: pdf-lib can't extract text content
  };
}

Great for metadata but can't extract text content.
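Where pdf-lib shines is generation. Here's a minimal sketch of creating a one-page PDF; the page size and text position are arbitrary example values:

// createPdf.js
const { PDFDocument, StandardFonts } = require('pdf-lib');
const fs = require('fs').promises;

async function createSimplePdf(outputPath) {
  const pdfDoc = await PDFDocument.create();
  const font = await pdfDoc.embedFont(StandardFonts.Helvetica);
  const page = pdfDoc.addPage([612, 792]); // US Letter, in points

  page.drawText('Generated with pdf-lib', {
    x: 50,
    y: 742,
    size: 18,
    font
  });

  await fs.writeFile(outputPath, await pdfDoc.save());
}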

Using pdf-table-extractor

For table-specific extraction:

// pdfTableExtractor.js
const pdfTableExtractor = require('pdf-table-extractor');

function extractTables(pdfPath) {
  return new Promise((resolve, reject) => {
    // pdf-table-extractor takes separate success and error callbacks
    pdfTableExtractor(
      pdfPath,
      (result) => {
        // Each pageTables entry holds the tables found on one page
        resolve(result.pageTables.map((page) => page.tables));
      },
      (error) => reject(new Error(`Table extraction failed: ${error}`))
    );
  });
}

Only extracts tables. You need another library for text.
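As a usage sketch, you can flatten the resolved data into CSV lines, assuming each page entry is an array of rows of cell strings (the shape pdf-table-extractor reports under pageTables[].tables):

// Naive CSV conversion: no quoting or escaping of cell contents
async function tablesToCsv(pdfPath) {
  const pages = await extractTables(pdfPath);
  return pages
    .flat() // merge rows from all pages
    .map((row) => row.join(','))
    .join('\n');
}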

Using pdfvector

For AI-powered extraction with structured data:

const { PDFVector } = require('pdfvector');
const fs = require('fs').promises;

const client = new PDFVector({ apiKey: 'your_api_key' });

async function extractWithPDFVector(pdfPath) {
  const buffer = await fs.readFile(pdfPath);
  
  // Simple text extraction
  const result = await client.parse({
    data: buffer,
    contentType: 'application/pdf'
  });
  
  return {
    text: result.markdown,
    pageCount: result.pageCount
  };
}

// Extract structured data with schema
async function extractInvoiceWithPDFVector(pdfPath) {
  const buffer = await fs.readFile(pdfPath);
  
  const result = await client.ask({
    data: buffer,
    prompt: 'Extract invoice information',
    mode: 'json',
    schema: {
      type: 'object',
      properties: {
        invoiceNumber: { type: 'string' },
        total: { type: 'number' }
      }
    }
  });
  
  return result.json;
}

It's a hosted API service with AI understanding, and it works with both PDFs and Word documents.
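For the schema-driven call, the returned object mirrors the schema you passed in. A usage sketch (run inside an async function):

// The field names come from the schema defined above
const invoice = await extractInvoiceWithPDFVector('./invoice.pdf');
console.log(invoice.invoiceNumber, invoice.total);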

Create Basic API Endpoint

Now let's combine these into a basic API:

const express = require('express');
const multer = require('multer');
// Import your chosen parser functions

const app = express();
const upload = multer({ dest: 'uploads/' });

app.post('/api/parse', upload.single('pdf'), async (req, res) => {
  try {
    if (!req.file) {
      return res.status(400).json({ error: 'No file uploaded' });
    }

    // Try different parsers based on what you need
    const textResult = await extractWithPdfParse(req.file.path);
    const positionResult = await extractWithPdf2Json(req.file.path);
    const tables = await extractTables(req.file.path);
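    // In production, remove the multer temp file once parsing is done,
    // e.g. with fs.promises.unlink(req.file.path)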
    
    res.json({
      text: textResult.text,
      tables: tables,
      pageCount: textResult.pages
    });
  } catch (error) {
    res.status(500).json({ error: 'Extraction failed' });
  }
});

app.listen(3000, () => {
  console.log('PDF service running on port 3000');
});
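To exercise the endpoint, you can post a file from another Node process. A minimal sketch, assuming Node 18+ (which ships global fetch, FormData, and Blob); the file name and port are example values:

// client.js
const fs = require('fs').promises;

async function parseRemote(pdfPath) {
  const buffer = await fs.readFile(pdfPath);

  const form = new FormData();
  // The field name 'pdf' must match upload.single('pdf') in the endpoint
  form.append('pdf', new Blob([buffer], { type: 'application/pdf' }), 'upload.pdf');

  const response = await fetch('http://localhost:3000/api/parse', {
    method: 'POST',
    body: form
  });
  return response.json();
}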

Combining Packages for Complete Extraction

Here's the reality: you need multiple packages for a complete solution. Each package handles one thing, and you must combine them:

// Complete extraction requires multiple libraries
async function extractEverything(pdfPath) {
  // Text extraction
  const text = await extractWithPdfParse(pdfPath);
  
  // Table extraction  
  const tables = await extractTables(pdfPath);
  
  // Position data
  const positioned = await extractWithPdf2Json(pdfPath);
  
  // Metadata
  const metadata = await extractWithPdfLib(pdfPath);
  
  return {
    ...text,
    ...metadata,
    tables,
    positioning: positioned.rawData
  };
}
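Since the four extractors are independent of one another, the same function can run them concurrently with Promise.all:

// Parallel variant of extractEverything
async function extractEverythingParallel(pdfPath) {
  const [text, tables, positioned, metadata] = await Promise.all([
    extractWithPdfParse(pdfPath),
    extractTables(pdfPath),
    extractWithPdf2Json(pdfPath),
    extractWithPdfLib(pdfPath)
  ]);

  return {
    ...text,
    ...metadata,
    tables,
    positioning: positioned.rawData
  };
}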

What You Still Need to Handle

After building your basic service, here are the limitations of each package:

pdf-parse limitations:

  • No table extraction
  • Loses all formatting
  • Can't handle scanned PDFs
  • No positioning information

pdf2json limitations:

  • Complex nested JSON output
  • Callback-based API
  • Large file sizes in output
  • Difficult to extract specific fields

pdfjs-dist limitations:

  • Designed for browsers, not Node.js
  • Heavy dependencies
  • Complex setup required
  • No built-in table extraction

pdf-lib limitations:

  • Can't extract text at all
  • Only reads metadata
  • Meant for creating/modifying PDFs
  • Must combine with other libraries

pdf-table-extractor limitations:

  • Only extracts tables
  • Can't get regular text
  • Tables that span multiple pages get split
  • No support for complex layouts

pdfvector considerations:

  • Requires API key
  • Paid service
  • Internet connection required
  • Rate limits apply

Making the Decision

Time Investment

  • Building with free libraries: 1 day for basic implementation
  • Using API services: 30 minutes to integrate

The real time cost comes later: debugging edge cases, handling failures, and maintaining multiple libraries.

When Free Libraries Work

Free libraries make sense when:

  • You only need simple text extraction
  • You process low volumes (under 100 documents/day)
  • You're building a learning project to understand PDFs
  • Your budget is zero and accuracy isn't critical

When to Use API Services

Consider API services when:

  • You're building production applications that need reliability
  • You're handling complex documents like invoices, contracts, or forms
  • You need accuracy and structure, not just raw text
  • Your time is worth more than the API costs
  • You want to focus on your business logic, not PDF parsing

Conclusion

Building your own PDF conversion service is straightforward: you can have a basic version running in a day. Each npm package has its strengths and limitations.

For simple text extraction, free libraries like pdf-parse work well. For complex documents with tables and structured data, you'll need multiple libraries or an API service. For production applications, consider the maintenance cost of managing multiple packages versus using a single solution.

The code examples above give you everything needed to start. Try different packages to see what works for your use case.

Last updated on August 29, 2025
