Build research applications that search thousands of papers without hitting frustrating API rate limits using four different approaches.
Your research app just crashed after 100 requests in 5 minutes. You're building a literature review tool, citation analyzer, or research assistant, and suddenly everything stops working. The dreaded 429 error appears: "Too Many Requests." Sound familiar?
Understanding Academic API Rate Limits
Academic APIs implement rate limiting to protect their infrastructure and ensure fair access. But these limits can cripple legitimate research applications. Here's what you're up against:
Common Rate Limits:
- Semantic Scholar: 100 requests per 5 minutes (unauthenticated)
- PubMed (NCBI E-utilities): 3 requests per second without an API key, 10 with one
- Crossref: Varies by access pool and over time; each response reports the current limit in the X-Rate-Limit-Limit and X-Rate-Limit-Interval headers
- Core API: 10 requests per second (with key)
When you're analyzing citation networks, tracking research trends, or building recommendation systems, you exhaust these quotas fast. At 100 requests per 5 minutes, fetching full metadata for 1,000 papers one request at a time takes about 50 minutes.
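To put numbers on how long a job takes under a given limit, here is a minimal sketch. The helper name estimateSeconds is hypothetical, and it assumes one paper per request at the sustained rate the limit allows, ignoring network latency and bursts.

```typescript
// Rough lower bound on wall-clock time for `totalRequests` calls under a
// limit of `limitPerWindow` requests every `windowSeconds` seconds.
// Assumption: requests are spread evenly at the maximum sustained rate.
function estimateSeconds(
  totalRequests: number,
  limitPerWindow: number,
  windowSeconds: number
): number {
  return (totalRequests * windowSeconds) / limitPerWindow;
}

// 1,000 papers at 100 requests per 5 minutes
const unauthenticated = estimateSeconds(1000, 100, 300); // 3000 seconds = 50 minutes

// 1,000 papers at 1 request per second
const authenticated = estimateSeconds(1000, 1, 1); // 1000 seconds, under 17 minutes

console.log(`${unauthenticated / 60} min vs ${(authenticated / 60).toFixed(1)} min`);
```

Batch endpoints, where an API offers them, change this picture because one request can return many papers.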
Method 1: Semantic Scholar API with Authentication
Semantic Scholar offers higher rate limits with authentication, though you need to apply and justify your use case.
Getting Started
# No installation needed, just HTTP requests
curl -X GET "https://api.semanticscholar.org/graph/v1/paper/search?query=deep+learning" \
-H "x-api-key: YOUR_API_KEY"
TypeScript Implementation
interface SemanticScholarPaper {
  paperId: string;
  title: string;
  authors: { authorId: string; name: string }[];
  year: number;
  abstract?: string;
  citationCount: number;
}
class SemanticScholarClient {
  private apiKey: string;
  // Tail of a promise chain: each caller waits for the previous caller's slot
  private queue: Promise<void> = Promise.resolve();
  private lastRequestTime = 0;
  private requestDelay = 1000; // 1 request per second with auth
  constructor(apiKey: string) {
    this.apiKey = apiKey;
  }
  // Serializes callers so concurrent requests cannot claim the same time slot
  private rateLimit(): Promise<void> {
    const slot = this.queue.then(async () => {
      const wait = this.lastRequestTime + this.requestDelay - Date.now();
      if (wait > 0) {
        await new Promise(resolve => setTimeout(resolve, wait));
      }
      this.lastRequestTime = Date.now();
    });
    this.queue = slot;
    return slot;
  }
  async searchPapers(query: string, limit: number = 100): Promise<SemanticScholarPaper[]> {
    await this.rateLimit();
    const response = await fetch(
      `https://api.semanticscholar.org/graph/v1/paper/search?` +
      `query=${encodeURIComponent(query)}&limit=${limit}&` +
      `fields=paperId,title,authors,year,abstract,citationCount`,
      { headers: { 'x-api-key': this.apiKey } }
    );
    if (response.status === 429) {
      throw new Error('Rate limit exceeded even with authentication');
    }
    // Surface other HTTP errors instead of parsing an error body as results
    if (!response.ok) {
      throw new Error(`Semantic Scholar request failed: HTTP ${response.status}`);
    }
    const data = await response.json();
    return data.data ?? []; // Guard against a response with no "data" field
  }
  async getPaperDetails(paperId: string): Promise<SemanticScholarPaper> {
    await this.rateLimit();
    const response = await fetch(
      `https://api.semanticscholar.org/graph/v1/paper/${paperId}?` +
      `fields=paperId,title,authors,year,abstract,citationCount,references,citations`,
      { headers: { 'x-api-key': this.apiKey } }
    );
    // Surface HTTP errors instead of returning an error body as a paper
    if (!response.ok) {
      throw new Error(`Semantic Scholar request failed: HTTP ${response.status}`);
    }
    return response.json();
  }
}
// Usage
const client = new SemanticScholarClient('your_api_key');
const papers = await client.searchPapers('machine learning healthcare', 50);
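The client above gives up on the first 429. A common alternative is to retry with exponential backoff, honoring a Retry-After header if the server sends one. The sketch below shows one way to do that; backoffDelayMs and fetchWithRetry are hypothetical helper names, the 1-second base and 60-second cap are arbitrary choices, and whether a given API actually sends Retry-After is not something this article confirms.

```typescript
// Minimal shape of the response we rely on (the standard fetch Response fits it)
interface HttpResponseLike {
  status: number;
  headers: { get(name: string): string | null };
}

// Delay before retry number `attempt` (0-based). A numeric Retry-After value
// (in seconds) takes priority; otherwise the delay doubles each attempt.
// Both paths are capped at maxMs.
function backoffDelayMs(
  attempt: number,
  retryAfter: string | null,
  baseMs: number = 1000,
  maxMs: number = 60000
): number {
  const seconds = retryAfter ? Number(retryAfter) : NaN;
  if (Number.isFinite(seconds) && seconds >= 0) {
    return Math.min(seconds * 1000, maxMs);
  }
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Calls doFetch until it returns something other than 429, or until
// maxRetries retries have been used. The last response is returned either way.
async function fetchWithRetry<T extends HttpResponseLike>(
  doFetch: () => Promise<T>,
  maxRetries: number = 5
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    const response = await doFetch();
    if (response.status !== 429 || attempt >= maxRetries) {
      return response;
    }
    const delay = backoffDelayMs(attempt, response.headers.get('retry-after'));
    await new Promise(resolve => setTimeout(resolve, delay));
  }
}
```

Inside searchPapers you could wrap the existing call as fetchWithRetry(() => fetch(url, options)). If many workers share one key, adding random jitter to the delay helps keep them from retrying in lockstep.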
Pros:
- Official API with good documentation
- Reliable and well-maintained
- Access to paper embeddings (SPECTER)
- Free tier available
Cons:
- Still has rate limits (1 req/sec authenticated)
- Need to apply and wait for API key approval
- Limited to Semantic Scholar data only
- No cross-database search capabilities
Method 2: Download Semantic Scholar Datasets
For heavy analysis, downloading the full dataset eliminates rate limits entirely.
Implementation Approach
import { createReadStream } from 'fs';
import { createGunzip } from 'zlib';
import { createInterface } from 'readline';
interface S2Paper {
  corpusid: number;
  title: string;
  authors: { authorId: string; name: string }[];
  year: number;
  abstract?: string;
  citationcount: number;
}
class S2DatasetReader {
  private dataPath: string;
  constructor(dataPath: string) {
    this.dataPath = dataPath;
  }
  // Assumes the file is gzip-compressed JSON Lines (one paper object per line),
  // so it is read line by line instead of being parsed as one JSON document
  async* readPapers(): AsyncGenerator<S2Paper> {
    const lines = createInterface({
      input: createReadStream(this.dataPath).pipe(createGunzip()),
      crlfDelay: Infinity
    });
    for await (const line of lines) {
      if (line.trim() !== '') {
        yield JSON.parse(line) as S2Paper;
      }
    }
  }
  async searchPapers(query: string, limit: number = 100): Promise<S2Paper[]> {
    const results: S2Paper[] = [];
    const queryLower = query.toLowerCase();
    for await (const paper of this.readPapers()) {
      if (paper.title?.toLowerCase().includes(queryLower) ||
          paper.abstract?.toLowerCase().includes(queryLower)) {
        results.push(paper);
        if (results.length >= limit) {
          break;
        }
      }
    }
    return results;
  }
  async buildSearchIndex(): Promise<Map<string, S2Paper[]>> {
    const index = new Map<string, S2Paper[]>();
    for await (const paper of this.readPapers()) {
      // Simple keyword indexing; missing fields become empty strings so the
      // literal word "undefined" never ends up in the index
      const keywords = this.extractKeywords(`${paper.title ?? ''} ${paper.abstract ?? ''}`);
      for (const keyword of keywords) {
        if (!index.has(keyword)) {
          index.set(keyword, []);
        }
        index.get(keyword)!.push(paper);
      }
    }
    return index;
  }
  private extractKeywords(text: string): Set<string> {
    return new Set(
      text.toLowerCase()
        .split(/\W+/)
        .filter(word => word.length > 3)
    );
  }
}
// Usage
const reader = new S2DatasetReader('./s2-corpus-2024.gz');
const papers = await reader.searchPapers('neural networks', 1000);
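buildSearchIndex returns a Map from keyword to papers, but nothing above queries it. Here is a minimal sketch of one way to do so, requiring every query term to match (AND semantics). The names tokenize and queryIndex are hypothetical, and the tokenization is written to mirror extractKeywords so that query terms line up with index keys.

```typescript
// Same tokenization as extractKeywords: lowercase, split on non-word
// characters, drop words of 3 characters or fewer, remove duplicates.
function tokenize(text: string): string[] {
  const words = text.toLowerCase().split(/\W+/).filter(word => word.length > 3);
  return Array.from(new Set(words));
}

// Returns the papers that appear under every term of the query.
function queryIndex<T extends { corpusid: number }>(
  index: Map<string, T[]>,
  query: string
): T[] {
  const terms = tokenize(query);
  if (terms.length === 0) {
    return [];
  }
  // Start from the shortest posting list so each later filter has less work
  const postings = terms
    .map(term => index.get(term) ?? [])
    .sort((a, b) => a.length - b.length);
  let result = postings[0];
  for (const list of postings.slice(1)) {
    const ids = new Set(list.map(paper => paper.corpusid));
    result = result.filter(paper => ids.has(paper.corpusid));
  }
  return result;
}

// Tiny hand-built index standing in for the output of buildSearchIndex
const p1 = { corpusid: 1, title: 'Neural networks for imaging' };
const p2 = { corpusid: 2, title: 'Graph neural methods' };
const demoIndex = new Map([
  ['neural', [p1, p2]],
  ['networks', [p1]],
  ['graph', [p2]]
]);
const matches = queryIndex(demoIndex, 'neural networks');
console.log(matches.map(paper => paper.title));
```

Keep in mind that an in-memory Map over the full corpus will not fit in RAM on most machines, so for real workloads this pattern is usually applied to a filtered subset or replaced by a proper search engine.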
Pros:
- No rate limits at all
- Complete dataset access
- Can build custom indices
- Offline processing capability
Cons:
- Requires 300GB+ storage space
- Need to handle updates manually
- Initial download takes days
- Complex setup and maintenance
Method 3: Using PDF Vector's Academic Search
PDF Vector provides a unified API that searches multiple databases simultaneously without rate limiting concerns.
Installation
npm install pdfvector
Implementation
import { PDFVector } from 'pdfvector';
const client = new PDFVector({
  apiKey: 'pdfvector_your_api_key'
});
async function searchWithoutLimits() {
  // Search across multiple databases in one request
  const results = await client.academicSearch({
    query: 'machine learning healthcare applications',
    providers: ['semantic-scholar', 'pubmed', 'arxiv'],
    limit: 100, // Per provider
    yearFrom: 2020,
    yearTo: 2024,
    fields: [
      'doi', 'title', 'authors', 'year',
      'abstract', 'totalCitations', 'pdfURL'
    ]
  });
  console.log(`Found ${results.estimatedTotalResults} total papers`);
  // Process results from all providers
  for (const paper of results.results) {
    console.log(`${paper.title} (${paper.year}) - ${paper.provider}`);
    console.log(`Citations: ${paper.totalCitations}`);
  }
  // Check for any provider errors
  if (results.errors) {
    results.errors.forEach(error => {
      console.warn(`${error.provider}: ${error.message}`);
    });
  }
  return results;
}
// Fetch specific papers by ID
async function fetchPaperDetails() {
  const papers = await client.academicFetch({
    ids: [
      '10.1038/nature12373',        // DOI
      'semantic-scholar:248366416', // Semantic Scholar ID
      'arXiv:2103.14030',           // ArXiv ID
      'pubmed:35211145'             // PubMed ID
    ],
    fields: ['title', 'authors', 'abstract', 'totalCitations', 'references']
  });
  return papers.results;
}
// Pagination example
async function searchWithPagination(query: string, totalLimit: number = 500) {
  const allResults = [];
  let offset = 0;
  const batchSize = 100;
  while (allResults.length < totalLimit) {
    const batch = await client.academicSearch({
      query,
      providers: ['semantic-scholar', 'pubmed'],
      limit: batchSize,
      offset,
      fields: ['title', 'authors', 'year', 'doi']
    });
    allResults.push(...batch.results);
    if (batch.results.length < batchSize) {
      break; // No more results
    }
    offset += batchSize;
  }
  // The last batch can overshoot, so trim to the requested total
  return allResults.slice(0, totalLimit);
}
// Usage
const papers = await searchWithoutLimits();
const details = await fetchPaperDetails();
const allPapers = await searchWithPagination('covid vaccine', 1000);
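When you merge results from several providers, the same paper can show up more than once. This article does not say whether the service deduplicates for you, so here is a minimal client-side sketch. The helper name dedupePapers is hypothetical, and it assumes each result carries the doi and title fields requested above.

```typescript
// Only the fields this helper needs; real results carry more
interface PaperLike {
  doi?: string | null;
  title: string;
}

// Keeps the first occurrence of each paper. Papers are matched by DOI
// (case-insensitive) when present, otherwise by a normalized title.
function dedupePapers<T extends PaperLike>(papers: T[]): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const paper of papers) {
    const key = paper.doi
      ? `doi:${paper.doi.toLowerCase()}`
      : `title:${paper.title.toLowerCase().replace(/\W+/g, ' ').trim()}`;
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(paper);
    }
  }
  return unique;
}

const merged = dedupePapers([
  { doi: '10.1000/ABC', title: 'Deep Learning in Healthcare' },
  { doi: '10.1000/abc', title: 'Deep learning in healthcare' },
  { title: 'A Survey of Transformers!' },
  { title: 'a survey of transformers' }
]);
console.log(merged.length);
```

This is deliberately conservative: a record with a DOI and a record for the same paper without one are kept as two entries, since matching on title alone across those cases risks merging distinct papers.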
Pros:
- No rate limiting with proper authentication
- Search multiple databases simultaneously
- Unified JSON response format
- Handles provider errors gracefully
- Simple credit-based pricing
Cons:
- Requires API key
- Costs 2 credits per search (not per result)
- Not self-hosted
- Internet connection required
Method 4: Building a Caching Layer
Implement intelligent caching to minimize API calls regardless of which service you use.
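Before bringing in Redis, it helps to see the pattern in its smallest form: look in the cache first, call the API only on a miss, and expire entries after a fixed time. The sketch below is an in-memory version for a single process. The class name TtlCache and its methods are hypothetical, and the injectable clock exists only to make expiry easy to test.

```typescript
class TtlCache<T> {
  private entries = new Map<string, { value: T; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value, or undefined if it is missing or expired
  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) {
      return undefined;
    }
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Cache-aside: run `load` (the real API call) only on a miss
  async getOrLoad(key: string, load: () => Promise<T>): Promise<T> {
    const hit = this.get(key);
    if (hit !== undefined) {
      return hit;
    }
    const value = await load();
    this.set(key, value);
    return value;
  }
}

// Demo with a fake clock so expiry is deterministic
let fakeTime = 0;
const demoCache = new TtlCache<string>(1000, () => fakeTime);
demoCache.set('query:neural networks', 'cached result');
const fresh = demoCache.get('query:neural networks'); // still cached
fakeTime = 1000;
const expired = demoCache.get('query:neural networks'); // TTL has elapsed
console.log(fresh, expired);
```

An in-memory cache disappears on restart and is not shared between processes, which is the gap the Redis version below fills.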
Redis Caching Implementation
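import Redis from 'ioredis';
import crypto from 'crypto';
class AcademicSearchCache {
  private redis: Redis;
  private ttl: number;
  constructor(redisUrl: string, ttlHours: number = 24) {
    this.redis = new Redis(redisUrl);
    this.ttl = ttlHours * 3600; // Convert to seconds
  }
  // Recursively sort object keys so equivalent params hash to the same key.
  // (An array passed as the JSON.stringify replacer would also drop nested keys.)
  private normalize(value: any): any {
    if (Array.isArray(value)) {
      return value.map(item => this.normalize(item));
    }
    if (value && typeof value === 'object') {
      return Object.fromEntries(
        Object.keys(value).sort().map(key => [key, this.normalize(value[key])])
      );
    }
    return value;
  }
  private getCacheKey(params: any): string {
    const normalized = JSON.stringify(this.normalize(params));
    return `academic:${crypto.createHash('md5').update(normalized).digest('hex')}`;
  }
  async get<T>(params: any): Promise<T | null> {
    const key = this.getCacheKey(params);
    const cached = await this.redis.get(key);
    if (cached) {
      return JSON.parse(cached);
    }
    return null;
  }
  async set<T>(params: any, data: T): Promise<void> {
    const key = this.getCacheKey(params);
    await this.redis.setex(key, this.ttl, JSON.stringify(data));
  }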
as...



