Extract ArXiv Paper Metadata from XML Responses

Transform ArXiv's complex XML responses into clean, structured data you can actually use in your TypeScript applications.

August 29, 2025

6 min read

Duy Bui

If you've tried to get paper data from ArXiv's API, you've probably hit the same wall we all do. Instead of nice JSON, you get complex XML with multiple namespaces that breaks standard parsing. Let's fix that. We'll explore three ways to extract titles, authors, and abstracts from ArXiv API responses using TypeScript.

Understanding ArXiv API XML Structure

The ArXiv API returns data in Atom 1.0 format, which uses XML namespaces extensively. Here's what a typical response looks like:

<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Quantum Computing Fundamentals</title>
    <author>
      <name>John Doe</name>
    </author>
    <summary>This paper explores...</summary>
  </entry>
</feed>

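The challenge? Every element lives in the default namespace http://www.w3.org/2005/Atom, and real responses add prefixed elements from the arxiv: and opensearch: namespaces. Namespace-aware lookups such as XPath or getElementsByTagNameNS won't match a bare entry or title, so without handling the namespace correctly you can get empty results even when the data is right there.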

Method 1: Using xml2js Library

Implementation Guide

  1. Install the xml2js library and its types
  2. Configure the parser to handle namespaces
  3. Parse the XML and extract the data

Code Example

import { parseString } from 'xml2js';
import { promisify } from 'util';

const parseXML = promisify(parseString);

interface ArxivPaper {
  title: string;
  authors: string[];
  abstract: string;
  id: string;
}

async function fetchArxivPapers(query: string, maxResults: number = 10): Promise<ArxivPaper[]> {
  try {
    // Build URL with query parameters
    const params = new URLSearchParams({
      search_query: query,
      max_results: maxResults.toString()
    });
    
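    // Make request to ArXiv API
    const response = await fetch(`http://export.arxiv.org/api/query?${params}`);
    if (!response.ok) {
      throw new Error(`ArXiv API returned HTTP ${response.status}`);
    }
    const xmlData = await response.text();

    // Parse XML: explicitArray: false unwraps single children,
    // ignoreAttrs: true drops attributes we don't need here
    const result = await parseXML(xmlData, {
      explicitArray: false,
      ignoreAttrs: true
    });

    // Extract papers from the feed. entry is missing when there are no results,
    // an object for one result, and an array for several.
    const rawEntries = result?.feed?.entry;
    const entries = rawEntries === undefined
      ? []
      : Array.isArray(rawEntries) ? rawEntries : [rawEntries];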

    return entries.map((entry: any) => ({
      title: entry.title.replace(/\s+/g, ' ').trim(),
      authors: Array.isArray(entry.author) 
        ? entry.author.map((a: any) => a.name)
        : [entry.author.name],
      abstract: entry.summary.replace(/\s+/g, ' ').trim(),
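      // Strip the URL prefix rather than splitting on '/', so old-style IDs
      // such as hep-th/9901001v1 stay intact
      id: entry.id.replace(/^https?:\/\/arxiv\.org\/abs\//, '')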
    }));
  } catch (error) {
    console.error('Failed to fetch ArXiv papers:', error);
    return [];
  }
}

// Usage
const papers = await fetchArxivPapers('quantum computing', 5);
console.log(papers);
// Output: [{ title: "...", authors: ["..."], abstract: "...", id: "..." }, ...]
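
ArXiv entry IDs arrive as full URLs such as http://arxiv.org/abs/2301.00001v2. If you want the bare identifier and the version as separate values, a small helper along these lines may be useful. This is a minimal sketch: parseArxivId and ArxivId are hypothetical names introduced here, not part of xml2js or the ArXiv API, and the sketch assumes the ID contains an /abs/ segment or is already a bare identifier.

```typescript
// Hypothetical helper types and names, introduced for illustration only.
interface ArxivId {
  id: string;        // identifier without version, e.g. "2301.00001" or "hep-th/9901001"
  version?: number;  // numeric version, when the input carries a "vN" suffix
}

function parseArxivId(entryId: string): ArxivId | null {
  // Drop everything up to and including "/abs/" when the input is a full URL
  const marker = '/abs/';
  const markerIndex = entryId.indexOf(marker);
  const raw = (markerIndex >= 0
    ? entryId.slice(markerIndex + marker.length)
    : entryId
  ).trim();

  // Split an optional trailing "v<digits>" from the identifier itself
  const match = /^(.+?)(?:v(\d+))?$/.exec(raw);
  if (!match) return null; // empty input

  const id = match[1];
  const version = match[2];
  return version !== undefined ? { id, version: Number(version) } : { id };
}

// Usage
console.log(parseArxivId('http://arxiv.org/abs/2301.00001v2'));
console.log(parseArxivId('http://arxiv.org/abs/hep-th/9901001v1'));
```

Keeping the old-style category prefix (hep-th/) matters because the number alone is not unique across categories.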

When to Use This Method

  • Need fine control over parsing options
  • Working with other XML APIs in your project
  • Prefer callback-style APIs that can be promisified

Advantages and Limitations

Pros:

  • ✅ Extensive configuration options for complex XML structures
  • ✅ Mature namespace support with granular control
  • ✅ Large community with extensive Stack Overflow coverage
  • ✅ Flexible output formatting (can convert to different structures)

Cons:

Common Issues:

  • explicitArray confusion: Setting to false gives inconsistent data structures (single items as objects, multiple as arrays)
  • Namespace pollution: Default settings include namespace prefixes in keys, cluttering the output
  • Memory exhaustion: 80-90MB files can take 45+ seconds and spike RAM usage
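
If prefixed keys such as arxiv:comment do end up in your parsed output, one option is to strip the prefixes after parsing. The sketch below is library-independent and works on any plain parsed object; stripNamespacePrefixes and the Json type are hypothetical names introduced here. xml2js also documents a tagNameProcessors option with a built-in stripPrefix processor, which may be the simpler route if you only use that library.

```typescript
// Hypothetical helper, introduced for illustration only.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

function stripNamespacePrefixes(value: Json): Json {
  if (Array.isArray(value)) {
    return value.map(item => stripNamespacePrefixes(item));
  }
  if (value !== null && typeof value === 'object') {
    const out: { [key: string]: Json } = {};
    for (const [key, child] of Object.entries(value)) {
      // "arxiv:comment" -> "comment"; keys without a prefix are kept as-is.
      // Note: if two keys share a local name, the later one overwrites the earlier.
      const localName = key.includes(':') ? key.slice(key.indexOf(':') + 1) : key;
      out[localName] = stripNamespacePrefixes(child);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}

// Usage
const cleanedEntry = stripNamespacePrefixes({
  title: 'Quantum Computing Fundamentals',
  'arxiv:comment': '12 pages'
});
console.log(cleanedEntry);
```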

Method 2: Using fast-xml-parser

Different Implementation

fast-xml-parser offers better performance and a more modern API. It can strip namespace prefixes for you through the removeNSPrefix option, and it ships its own TypeScript type definitions.

Code Example

import { XMLParser } from 'fast-xml-parser';

interface ArxivEntry {
  title: string;
  author: { name: string } | { name: string }[];
  summary: string;
  id: string;
  published: string;
}

async function fetchArxivWithFastParser(query: string, maxResults: number = 10) {
  try {
    // Build URL with query parameters
    const params = new URLSearchParams({
      search_query: query,
      max_results: maxResults.toString()
    });
    
    const response = await fetch(`http://export.arxiv.org/api/query?${params}`);
    const xmlData = await response.text();

    // Configure parser
    const parser = new XMLParser({
      ignoreAttributes: true,
      removeNSPrefix: true, // This handles namespaces for us
      parseTagValue: false
    });

    const result = parser.parse(xmlData);
    
    // Handle single vs multiple entries
    const entries: ArxivEntry[] = result.feed.entry 
      ? (Array.isArray(result.feed.entry) ? result.feed.entry : [result.feed.entry])
      : [];

    return entries.map(entry => ({
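      // Collapse the line wraps ArXiv leaves inside titles and abstracts
      title: entry.title.replace(/\s+/g, ' ').trim(),
      authors: Array.isArray(entry.author) 
        ? entry.author.map(a => a.name)
        : [entry.author.name],
      abstract: entry.summary.replace(/\s+/g, ' ').trim(),
      // Strip the URL prefix rather than splitting on '/', so old-style IDs
      // such as hep-th/9901001v1 stay intact
      id: entry.id.replace(/^https?:\/\/arxiv\.org\/abs\//, ''),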
      published: entry.published
    }));
  } catch (error) {
    console.error('Failed to parse ArXiv response:', error);
    return [];
  }
}

// Usage with async/await
const papers = await fetchArxivWithFastParser('machine learning', 10);
papers.forEach(paper => {
  console.log(`Title: ${paper.title}`);
  console.log(`Authors: ${paper.authors.join(', ')}`);
  console.log('---');
});
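
The fetch examples so far pass only search_query and max_results. The ArXiv API also accepts start (for paging), sortBy, and sortOrder, and search_query supports field prefixes such as ti: (title), au: (author), and all:. A minimal sketch of a URL builder follows; buildArxivQueryUrl and ArxivQueryOptions are hypothetical names introduced here.

```typescript
// Hypothetical helper, introduced for illustration only.
interface ArxivQueryOptions {
  searchQuery: string;                       // e.g. 'ti:"quantum computing"' or 'au:doe'
  start?: number;                            // zero-based offset for paging
  maxResults?: number;
  sortBy?: 'relevance' | 'lastUpdatedDate' | 'submittedDate';
  sortOrder?: 'ascending' | 'descending';
}

function buildArxivQueryUrl(options: ArxivQueryOptions): string {
  // URLSearchParams takes care of escaping quotes, colons, and spaces
  const params = new URLSearchParams({ search_query: options.searchQuery });

  if (options.start !== undefined) params.set('start', String(options.start));
  if (options.maxResults !== undefined) params.set('max_results', String(options.maxResults));
  if (options.sortBy !== undefined) params.set('sortBy', options.sortBy);
  if (options.sortOrder !== undefined) params.set('sortOrder', options.sortOrder);

  return `http://export.arxiv.org/api/query?${params.toString()}`;
}

// Usage: second page of five results, newest submissions first
const pageTwoUrl = buildArxivQueryUrl({
  searchQuery: 'ti:"quantum computing"',
  start: 5,
  maxResults: 5,
  sortBy: 'submittedDate',
  sortOrder: 'descending'
});
console.log(pageTwoUrl);
```

Only the parameters you set are added to the URL, so the API's own defaults apply to the rest.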

Advantages and Limitations

Pros:

Cons:

Common Issues:

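  • Missing attributes: ignoreAttributes defaults to true, which is fine for titles, authors, and abstracts. Set ignoreAttributes: false when you need attribute-only data, such as a link's href or a category's term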
  • Boolean attribute parsing: Can fail in self-closing tags like <entry published="true"/>
  • Single vs array: Returns object for single entry, array for multiple. Always check with Array.isArray()
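
The single-versus-array check comes up for both entries and authors, so it can be worth pulling into one place. A minimal sketch, with toArray as a hypothetical helper name that is not part of either parser:

```typescript
// Hypothetical helper, introduced for illustration only.
// Normalizes "missing", "one item", and "many items" into a plain array.
function toArray<T>(value: T | T[] | undefined | null): T[] {
  if (value === undefined || value === null) return [];
  return Array.isArray(value) ? value : [value];
}

// Usage with the shapes a parser may return for <author>
interface AuthorNode { name: string }

const oneAuthor: AuthorNode | AuthorNode[] = { name: 'John Doe' };
const manyAuthors: AuthorNode | AuthorNode[] = [{ name: 'John Doe' }, { name: 'Jane Roe' }];

console.log(toArray(oneAuthor).map(a => a.name));
console.log(toArray(manyAuthors).map(a => a.name));
console.log(toArray<AuthorNode>(undefined).length);
```

With this in place, the mapping code no longer needs a separate branch for the one-result and zero-result cases.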

Method 3: Using PDF Vector's Academic Search API

PDF Vector's Academic Search API takes a different approach: it offers a unified API for multiple academic databases, including ArXiv.

Code Example

import { PDFVector } from 'pdfvector';

const pdfvector = new PDFVector({
  apiKey: 'pdfvector_xxx' // Get from dashboard
});

// Search for papers by query
async function searchArxivViaPDFVector(query: string) {
  try {
    const response = await pdfvector.academicSearch({
      query: query,
      providers: ['arxiv'], // Can add more: ['pubmed', 'semantic-scholar']
      limit: 20,
      yearFrom: 2020,  // Built-in date filtering
      yearTo: 2024,
      fields: ['title', 'authors', 'abstract', 'arxivId', 'pdfURL']
    });

    return response.results.map(paper => ({
      title: paper.title,
      authors: paper.authors.map(a => a.name),
      abstract: paper.abstract,
      arxivId: paper.providerData?.arxivId,
      pdfUrl: paper.pdfURL
    }));
  } catch (error) {
    console.error('PDF Vector search failed:', error);
    return [];
  }
}

// Fetch specific papers by ArXiv ID
async function fetchArxivPaperByID(arxivIds: string[]) {
  try {
    const response = await pdfvector.academicFetch({
      ids: arxivIds,
      fields: ['title', 'authors', 'abstract', 'arxivId', 'pdfURL', 'date']
    });

    return response.results.map(paper => ({
      id: paper.id,
      title: paper.title,
      authors: paper.authors.map(a => a.name),
      abstract: paper.abstract,
      publishedDate: paper.date,
      pdfUrl: paper.pdfURL
    }));
  } catch (error) {
    console.error('PDF Vector fetch failed:', error);
    return [];
  }
}

const papers = await searchArxivViaPDFVector('quantum computing');
console.log(`Found ${papers.length} papers`);

const specificPapers = await fetchArxivPaperByID(['2301.00001', '2103.14030']);
console.log(specificPapers);
// Output: [{ id: '2301.00001', title: '...', authors: [...], ... }]

Advantages and Limitations

Pros:

  • ✅ Returns clean JSON, no XML parsing needed
  • ✅ Built-in date filtering (yearFrom/yearTo)
  • ✅ Search multiple databases in one call
  • ✅ Enriched metadata for each paper

Cons:

  • ❌ Requires API key and registration
  • ❌ Costs 2 credits per search or fetch

Common Issues:

  • Credit limits: Free tier limited to 100 credits/month, which at 2 credits per call works out to about 50 searches or fetches.
  • Provider errors: Some providers may fail while others succeed. Check the errors array in the response.

Making the Right Decision

Time Investment Reality

Consider the full lifecycle of your integration:

Initial Setup Time:

  • How long will namespace handling and XML parsing take to implement?
  • What's the learning curve for your team?
  • How quickly do you need to ship?

Ongoing Maintenance Burden:

  • Who handles edge cases and format changes?
  • What happens when ArXiv updates their API?
  • Will future developers understand the XML parsing logic?

Key Considerations

Technical factors to evaluate:

  • Single source (ArXiv only) versus multi-database needs
  • Monthly query volume and rate limit constraints
  • Project type (prototype versus production application)
  • Team's XML parsing expertise and maintenance capacity
  • Budget constraints versus development time costs

API service benefits to consider:

  • Multiple database access through one interface
  • Consistent JSON responses across all providers
  • Time saved on parsing and error handling
  • Built-in metadata enrichment (citations, references)
  • Someone else maintains the integration

The best choice depends on your specific context, timeline, and resources.

Last updated on August 29, 2025
