Search ArXiv Papers Without XML Parsing Headaches

Learn how to search ArXiv papers and get clean JSON responses without dealing with complex XML namespace issues.

August 29, 2025

Duy Bui

If you've ever tried to search ArXiv programmatically, you've probably stared at XML responses wondering why something so simple has to be so complicated: nested namespaces, the Atom 1.0 format, and the constant worry about whether your XML parser will handle the next edge case correctly.

We've all been there. You just want to search for papers about "quantum computing" and get back a nice JSON array. Instead, you're debugging xmlns attributes at 2 AM.

Understanding the ArXiv XML Challenge

ArXiv's API returns results in Atom 1.0 format, which made sense in 2007 when the API was designed. Today, it creates unnecessary complexity for developers who expect JSON responses from modern APIs.

Here's what a typical ArXiv API response looks like:

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" 
        rel="self" type="application/atom+xml"/>
  <title xmlns="http://www.w3.org/2005/Atom">ArXiv Query</title>
  <entry>
    <id>http://arxiv.org/abs/2301.00001v1</id>
    <updated>2023-01-01T00:00:00Z</updated>
    <published>2023-01-01T00:00:00Z</published>
    <title>Quantum Computing Applications</title>
    <author>
      <name>John Doe</name>
    </author>
  </entry>
</feed>

The multiple namespaces and nested structure make parsing error-prone. Miss one namespace declaration and your entire parser breaks.
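To make that fragility concrete, here is a minimal sketch of one defensive step: stripping namespace prefixes from the keys of an already-parsed object, so that a field like `opensearch:totalResults` can be read as plain `totalResults`. The helper name `stripNamespacePrefixes` and the sample object are hypothetical; the object only mimics the kind of output a generic XML-to-JSON parser might produce for an ArXiv feed.

```typescript
// Any JSON-compatible value, as a generic XML-to-JSON parser might return it.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Hypothetical helper: recursively drops the "prefix:" part of every object key.
function stripNamespacePrefixes(value: Json): Json {
  if (Array.isArray(value)) {
    return value.map(stripNamespacePrefixes);
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, child] of Object.entries(value)) {
      // "opensearch:totalResults" -> "totalResults"; unprefixed keys pass through.
      const localName = key.includes(":") ? key.slice(key.indexOf(":") + 1) : key;
      out[localName] = stripNamespacePrefixes(child);
    }
    return out;
  }
  return value;
}

// Sample shape is an assumption, modeled on prefixed elements in ArXiv feeds.
const parsed: Json = {
  feed: {
    "opensearch:totalResults": 1,
    entry: { title: "Quantum Computing Applications", "arxiv:comment": "12 pages" },
  },
};

const normalized = stripNamespacePrefixes(parsed);
console.log(JSON.stringify(normalized));
```

One caveat with this approach: if two namespaces use the same local name, the later key overwrites the earlier one, so it only suits feeds where local names are unique.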

Method 1: Traditional ArXiv API with XML Parsing

Let's look at the traditional approach using the ArXiv API directly:

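import { XMLParser } from 'fast-xml-parser';

// Shape of the fields we read from each Atom <entry>
interface ArxivEntry {
  id: string;
  title: string;
  author?: { name: string } | { name: string }[];
  published: string;
  summary: string;
}

async function searchArxivTraditional(query: string) {
  const url = `http://export.arxiv.org/api/query?search_query=${encodeURIComponent(query)}&max_results=10`;

  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`ArXiv request failed with status ${response.status}`);
  }
  const xmlText = await response.text();

  const parser = new XMLParser({
    ignoreAttributes: false,
    removeNSPrefix: false
  });

  const result = parser.parse(xmlText);
  // One result parses as an object, several as an array, none as undefined
  const entries: ArxivEntry | ArxivEntry[] = result.feed?.entry ?? [];

  // Transform to clean JSON
  return (Array.isArray(entries) ? entries : [entries]).map(entry => ({
    id: entry.id,
    title: entry.title,
    // The same single-versus-array quirk applies to <author>
    authors: (Array.isArray(entry.author) ? entry.author : entry.author ? [entry.author] : [])
      .map(a => a.name),
    published: entry.published,
    summary: entry.summary
  }));
}

// Usage
const papers = await searchArxivTraditional("quantum computing");
console.log(papers);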

Pros:

  • Direct access to ArXiv API
  • No third-party API keys needed
  • Free to use

Cons:

  • Complex XML parsing logic
  • Namespace handling is fragile
  • No built-in error handling for malformed XML
  • Must handle rate limiting manually (a 3-second delay between requests)
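Handling the rate limit yourself could look something like the sketch below. It is one possible approach, not ArXiv's prescribed method: the names `delayBeforeNextRequest` and `createThrottled` are made up for illustration, and the 3000 ms default simply reuses the 3-second figure from the list above.

```typescript
// Assumed minimum gap between requests, taken from the 3-second figure above.
const MIN_INTERVAL_MS = 3000;

// Pure helper: how long to wait before the next request may be sent.
function delayBeforeNextRequest(
  lastRequestAt: number | null,
  now: number,
  minIntervalMs: number = MIN_INTERVAL_MS
): number {
  if (lastRequestAt === null) return 0; // first request goes out immediately
  const elapsed = now - lastRequestAt;
  return Math.max(0, minIntervalMs - elapsed);
}

// Wraps an async request function so calls start at least minIntervalMs apart.
function createThrottled<T>(
  request: (query: string) => Promise<T>,
  minIntervalMs: number = MIN_INTERVAL_MS
): (query: string) => Promise<T> {
  let lastRequestAt: number | null = null;
  let queue: Promise<unknown> = Promise.resolve();

  return (query: string) => {
    // Chain onto the queue so requests run one at a time, in call order.
    const run = queue.then(async () => {
      const wait = delayBeforeNextRequest(lastRequestAt, Date.now(), minIntervalMs);
      if (wait > 0) {
        await new Promise<void>((resolve) => setTimeout(resolve, wait));
      }
      lastRequestAt = Date.now();
      return request(query);
    });
    // Keep the chain alive even if one request fails.
    queue = run.catch(() => undefined);
    return run;
  };
}
```

Wrapping the search function from Method 1, as in `const throttledSearch = createThrottled(searchArxivTraditional)`, would then space out calls without changing how you invoke it.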

Method 2: Python ArXiv Library

The community-maintained arxiv package for Python simplifies things somewhat:

import arxiv
import json

def search_arxiv_python(query):
    # The client handles paging, delays between requests, and retries
    client = arxiv.Client()
    search = arxiv.Search(
        query=query,
        max_results=10,
        sort_by=arxiv.SortCriterion.SubmittedDate
    )

    results = []
    # Client.results() replaces the deprecated Search.results()
    for result in client.results(search):
        results.append({
            "id": result.entry_id,
            "title": result.title,
            "authors": [author.name for author in result.authors],
            "published": result.published.isoformat(),
            "summary": result.summary
        })

    # Serialize the list of dicts to a JSON string
    return json.dumps(results)

# Usage
papers = search_arxiv_python("quantum computing")
print(papers)

Pros:

  • Library handles XML parsing for you
  • Cleaner code than manual parsing
  • Automatic retry logic

Cons:

  • Python-only solution
  • Still requires JSON transformation
  • Not suitable for JavaScript/TypeScript projects

Method 3: PDF Vector Academic Search API

PDF Vector provides a modern alternative with native JSON responses:

async function searchArxivWithPDFVector(query: string) {
  const response = await fetch('https://www.pdfvector.com/v1/api/academic-search', {
    method: 'POST',
    headers: {
      // Placeholder key; replace with your own
      'Authorization': 'Bearer pdfvector_xxx',
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      query: query,
      providers: ['arxiv'],
      limit: 10,
      fields: ['title', 'authors', 'year', 'abstract', 'pdfURL']
    })
  });

  // Surface HTTP errors instead of trying to parse an error body as results
  if (!response.ok) {
    throw new Error(`PDF Vector request failed with status ${response.status}`);
  }

  const data = await response.json();
  return data.results;
}

// Usage
const papers = await searchArxivWithPDFVector("quantum computing");
console.log(papers);

// Clean JSON response:
// [
//   {
//     "title": "Quantum Computing Applications in Machine Learning",
//     "authors": [
//       { "name": "John Doe" },
//       { "name": "Jane Smith" }
//     ],
//     "year": 2023,
//     "abstract": "We explore the intersection of quantum computing...",
//     "pdfURL": "https://arxiv.org/pdf/2301.00001.pdf"
//   }
// ]
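Because `response.json()` returns untyped data, it can help to describe the result shape explicitly and check it at runtime. The sketch below is one way to do that; the `Paper` and `Author` interfaces and the helpers `isPaper` and `toPapers` are hypothetical names, and the field types are inferred only from the sample response above, not from an official schema.

```typescript
// Field names mirror the sample response above; the types are assumptions.
interface Author {
  name: string;
}

interface Paper {
  title: string;
  authors: Author[];
  year: number;
  abstract: string;
  pdfURL: string;
}

// Runtime check for a single entry, since parsed JSON carries no type information.
function isPaper(value: unknown): value is Paper {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.title === "string" &&
    Array.isArray(v.authors) &&
    v.authors.every(
      (a) =>
        typeof a === "object" &&
        a !== null &&
        typeof (a as Record<string, unknown>).name === "string"
    ) &&
    typeof v.year === "number" &&
    typeof v.abstract === "string" &&
    typeof v.pdfURL === "string"
  );
}

// Keeps only well-formed entries from an untyped results array.
function toPapers(results: unknown): Paper[] {
  return Array.isArray(results) ? results.filter(isPaper) : [];
}
```

Passing the parsed results through `toPapers` gives the rest of your code a typed `Paper[]` to work with. If you request a different set of `fields`, the interface would need to change to match.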

Want to search across multiple databases? Just add more providers:

const response = await fetch('https://www.pdfvector.com/v1/api/academic-search', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer pdfvector_xxx',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    query: "quantum computing applications",
    providers: ['arxiv', 'semantic-scholar', 'pubmed'],
    limit: 20
  })
});

// fetch resolves to a Response; the papers are in the parsed body
const { results } = await response.json();

Pros:

  • Clean JSON responses with no XML parsing needed
  • Search multiple academic databases simultaneously
  • Consistent data structure across all providers
  • No rate limiting issues for reasonable usage
  • TypeScript SDK available

Cons:

  • Requires API key (free tier available with 100 credits/month)
  • Credit-based system (2 credits per search)
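To see what the free tier works out to in practice, here is a quick back-of-the-envelope calculation. The two constants come straight from the list above; the function name `searchesAvailable` is made up, and the sketch assumes a flat cost per search regardless of how many providers you query, which the list does not spell out.

```typescript
// Figures taken from the list above; plan details may change.
const FREE_CREDITS_PER_MONTH = 100;
const CREDITS_PER_SEARCH = 2;

// Number of whole searches a given credit balance covers.
function searchesAvailable(credits: number, costPerSearch: number): number {
  if (costPerSearch <= 0) {
    throw new RangeError("costPerSearch must be positive");
  }
  return Math.floor(credits / costPerSearch);
}

console.log(searchesAvailable(FREE_CREDITS_PER_MONTH, CREDITS_PER_SEARCH)); // 50
```

So under these assumptions the free tier covers about 50 searches a month, which is plenty for a prototype but worth budgeting for in anything user-facing.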

Comparing the Approaches

| Aspect | ArXiv Direct | Python Library | PDF Vector |
|---|---|---|---|
| Response Format | XML | Python objects | JSON |
| Parsing Complexity | High | Medium | None |
| Error Handling | Manual | Built-in | Built-in |
| Multiple Databases | No | No | Yes |
| Setup Time | 1 day | 5 minutes | 5 minutes |

Making the Right Decision

Use ArXiv Direct API when:

  • You're building a one-off script or prototype
  • You have existing XML parsing infrastructure
  • You need unlimited free queries
  • You're comfortable handling XML namespaces
  • Rate limiting won't affect your use case

Use Python arxiv library when:

  • You're already in a Python environment
  • You want a widely used, maintained wrapper
  • You want built-in error handling
  • You can work within the rate limits
  • You prefer Python objects over raw XML

Use PDF Vector when:

  • You want clean JSON responses without XML parsing
  • You need to search multiple academic databases
  • You value development speed over free access
  • You're building a production application
  • You need consistent data structure across providers

Last updated on August 29, 2025
