Download Management
Parallel processing, intelligent naming, and file organization
Introduction
Downloading 50+ PDFs one at a time is painfully slow and inefficient. This chapter demonstrates how to build an intelligent download management system that handles parallel processing, implements smart retry logic, generates consistent filenames, and organizes files for easy retrieval.
What you'll build:
- Parallel download manager with configurable concurrency
- Intelligent retry logic with exponential backoff
- Consistent filename generation from paper metadata
- Multiple file organization strategies
Download Manager Architecture
Key Architecture Principles:
The download manager uses a queue-based approach with controlled concurrency to maximize throughput while respecting rate limits. The p-limit library ensures that at most N downloads run simultaneously (typically 5), which helps avoid overloading the server and exhausting local connections.
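To make the queue-based idea concrete, here is a minimal sketch of a concurrency limiter written without any dependency. It is not p-limit's actual implementation; `createLimiter`, `fakeDownload`, and `demo` are hypothetical names used only for illustration.

```javascript
// Minimal concurrency limiter (illustrative only; not p-limit's actual code).
function createLimiter(maxConcurrent) {
  const queue = []; // tasks waiting for a free slot
  let active = 0;   // tasks currently running

  const next = () => {
    if (active >= maxConcurrent || queue.length === 0) return;
    active++;
    const { task, resolve, reject } = queue.shift();
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        active--;
        next(); // a slot freed up, so start the next queued task
      });
  };

  // Returns a promise that settles with the task's own result.
  return (task) =>
    new Promise((resolve, reject) => {
      queue.push({ task, resolve, reject });
      next();
    });
}

// Hypothetical stand-in for a real download: resolves after `ms` milliseconds.
const fakeDownload = (id, ms) =>
  new Promise((resolve) => setTimeout(() => resolve(id), ms));

// Runs five fake downloads through a limiter of 2 and records peak parallelism.
async function demo() {
  const limit = createLimiter(2);
  let running = 0;
  let peak = 0;
  const tasks = [1, 2, 3, 4, 5].map((id) =>
    limit(async () => {
      running++;
      peak = Math.max(peak, running);
      const result = await fakeDownload(id, 20);
      running--;
      return result;
    })
  );
  const results = await Promise.all(tasks);
  return { results, peak };
}
```

All five tasks are queued immediately, but only two run at any moment; each completion pulls the next task off the queue.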
Initialize Download Manager
Create a DownloadManager class with configurable options for concurrency, timeout, retry attempts, and authentication state.
import fs from 'fs/promises';
import path from 'path';
import { chromium } from 'playwright';
import pLimit from 'p-limit';

class DownloadManager {
  constructor(options = {}) {
    this.downloadDir = options.downloadDir || './papers';
    this.concurrency = options.concurrency || 5;      // Parallel downloads
    this.timeout = options.timeout || 60000;          // 60 seconds per download
    this.retryAttempts = options.retryAttempts || 3;
    this.authState = options.authState || null;
    this.stats = {
      total: 0,
      completed: 0,
      failed: 0,
      skipped: 0
    };
  }

Configuration Parameters:
- concurrency: Number of simultaneous downloads (default: 5)
- timeout: Maximum time per download in milliseconds (default: 60000)
- retryAttempts: Number of retry attempts before failure (default: 3)
- authState: Playwright storage state for authenticated sessions
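One subtlety of `||` defaults is that falsy values such as `0` silently fall back to the default. The nullish coalescing operator `??` only replaces `undefined` and `null`. The sketch below shows one way option handling could be tightened; `resolveOptions` is a hypothetical helper and the validation rules are assumptions, not part of the original class.

```javascript
// Hypothetical helper: resolve options with defaults and basic validation.
function resolveOptions(options = {}) {
  const resolved = {
    downloadDir: options.downloadDir ?? './papers',
    concurrency: options.concurrency ?? 5,
    timeout: options.timeout ?? 60000, // milliseconds; an explicit 0 is kept as-is
    retryAttempts: options.retryAttempts ?? 3,
    authState: options.authState ?? null
  };

  // Assumed rules: at least one worker and at least one attempt.
  if (!Number.isInteger(resolved.concurrency) || resolved.concurrency < 1) {
    throw new RangeError('concurrency must be an integer >= 1');
  }
  if (!Number.isInteger(resolved.retryAttempts) || resolved.retryAttempts < 1) {
    throw new RangeError('retryAttempts must be an integer >= 1');
  }
  return resolved;
}
```

With this helper, `resolveOptions({ timeout: 0 })` keeps the explicit `0`, whereas `options.timeout || 60000` would replace it with the default.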
Implement Batch Download
Process multiple papers in parallel using a concurrency limiter to control the number of simultaneous downloads.
  /**
   * Download multiple PDFs in parallel
   */
  async downloadBatch(papers, progressCallback = null) {
    await this.ensureDownloadDir();

    const limit = pLimit(this.concurrency);
    this.stats.total = papers.length;

    // Wrap every download in the limiter so at most `concurrency` run at once
    const downloads = papers.map(paper =>
      limit(() => this.downloadPaper(paper, progressCallback))
    );

    const results = await Promise.allSettled(downloads);

    return {
      stats: this.stats,
      results: results.map((r, i) => ({
        paper: papers[i],
        status: r.status,
        // 'fulfilled' entries carry `value`, 'rejected' entries carry `reason`
        result: r.status === 'fulfilled' ? r.value : r.reason
      }))
    };
  }

How it works:
- pLimit(this.concurrency) creates a queue that allows only N concurrent operations
- Promise.allSettled waits for all downloads to complete (success or failure)
- Progress callback enables real-time status updates
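The behavior of `Promise.allSettled` is easy to see in isolation: it never rejects, and each entry is either `{ status: 'fulfilled', value }` or `{ status: 'rejected', reason }`. The following sketch splits such results into successes and failures; `summarizeSettled` and `demoSettled` are hypothetical names, and the sample promises stand in for real downloads.

```javascript
// Split Promise.allSettled results into successes and failures.
function summarizeSettled(items, settled) {
  const succeeded = [];
  const failed = [];
  settled.forEach((outcome, i) => {
    if (outcome.status === 'fulfilled') {
      succeeded.push({ item: items[i], value: outcome.value });
    } else {
      // `reason` is whatever the promise rejected with, usually an Error
      const reason = outcome.reason;
      failed.push({ item: items[i], error: String(reason?.message ?? reason) });
    }
  });
  return { succeeded, failed };
}

// Sample run with two resolved promises and one rejected promise.
async function demoSettled() {
  const items = ['a', 'b', 'c'];
  const settled = await Promise.allSettled([
    Promise.resolve('ok-a'),
    Promise.reject(new Error('403 Forbidden')),
    Promise.resolve('ok-c')
  ]);
  return summarizeSettled(items, settled);
}
```

Because results come back in input order, index `i` of the settled array always corresponds to index `i` of the input list, which is what lets `downloadBatch` pair each result with its paper.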
Add Retry Logic with Exponential Backoff
Implement robust retry logic to handle transient network errors and rate limiting.
  /**
   * Download a single paper with retry logic
   */
  async downloadPaper(paper, progressCallback) {
    const filename = this.generateFilename(paper);
    const filepath = path.join(this.downloadDir, filename);

    // Check if already downloaded
    const exists = await this.fileExists(filepath);
    if (exists) {
      const size = await this.getFileSize(filepath);
      if (size > 10000) { // At least 10KB
        this.stats.skipped++;
        if (progressCallback) {
          progressCallback({ paper, status: 'skipped', filepath });
        }
        return { status: 'skipped', filepath };
      }
    }

    // Try downloading with retries
    for (let attempt = 1; attempt <= this.retryAttempts; attempt++) {
      try {
        const result = await this.attemptDownload(paper, filepath);
        this.stats.completed++;
        if (progressCallback) {
          progressCallback({ paper, status: 'completed', filepath, attempt });
        }
        return result;
      } catch (error) {
        console.error(`Download attempt ${attempt} failed for ${paper.title}:`, error.message);

        if (attempt === this.retryAttempts) {
          this.stats.failed++;
          if (progressCallback) {
            progressCallback({ paper, status: 'failed', error: error.message });
          }
          throw error;
        }

        // Exponential backoff
        await this.delay(Math.pow(2, attempt) * 1000);
      }
    }
  }

Implement PDF Download with Playwright
Handle both direct PDF downloads and indirect downloads requiring button clicks.
  /**
   * Attempt to download PDF using Playwright
   */
  async attemptDownload(paper, filepath) {
    if (!paper.pdfUrl) {
      throw new Error('No PDF URL available');
    }

    const browser = await chromium.launch({ headless: true });

    try {
      const contextOptions = {
        acceptDownloads: true,
        viewport: { width: 1920, height: 1080 }
      };
      if (this.authState) {
        contextOptions.storageState = this.authState;
      }

      const context = await browser.newContext(contextOptions);
      const page = await context.newPage();

      // Ask the server for a PDF where content negotiation is supported
      await page.setExtraHTTPHeaders({
        'Accept': 'application/pdf,application/octet-stream'
      });

      // Navigate to PDF URL
      const response = await page.goto(paper.pdfUrl, {
        waitUntil: 'networkidle',
        timeout: this.timeout
      });

      // Check if response is PDF (goto can resolve to null, hence the ?.)
      const contentType = response?.headers()['content-type'];
      if (contentType?.includes('application/pdf')) {
        // Direct PDF download
        const buffer = await response.body();
        await fs.writeFile(filepath, buffer);
        return { status: 'completed', filepath, size: buffer.length };
      }

      // Handle download button click (indirect PDF).
      // Start listening before clicking so the event is not missed.
      const downloadPromise = page.waitForEvent('download', {
        timeout: this.timeout
      });
      // Avoid an unhandled rejection if no download ever starts
      downloadPromise.catch(() => {});

      // Try common download button patterns
      const clicked = await this.clickDownloadButton(page);
      if (clicked) {
        const download = await downloadPromise;
        await download.saveAs(filepath);
        return { status: 'completed', filepath };
      }

      throw new Error('Could not trigger PDF download');
    } finally {
      // Single cleanup point: runs on both success and failure
      await browser.close();
    }
  }

Handle Download Button Patterns
Try multiple selector patterns to find and click download buttons across different publisher sites.
  /**
   * Try clicking various download button patterns
   */
  async clickDownloadButton(page) {
    const selectors = [
      'a[href$=".pdf"]',
      'button:has-text("Download")',
      'a:has-text("PDF")',
      '.download-pdf',
      '[aria-label*="download"]',
      '[data-test="download-button"]'
    ];

    for (const selector of selectors) {
      try {
        const button = await page.$(selector);
        if (button && await button.isVisible()) {
          await button.click();
          await page.waitForTimeout(1000);
          return true;
        }
      } catch (error) {
        // Ignore and try the next selector
      }
    }

    return false;
  }

Why this approach works: Different publishers use different download button implementations. Testing multiple selectors in order of likelihood maximizes success rate across diverse sites.
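The same ordered-fallback idea can be written against Playwright's locator API (`page.locator`, `locator.first()`, `locator.isVisible()`), which the Playwright documentation recommends over element handles such as `page.$`. This is a sketch rather than a drop-in replacement; `findFirstVisible` is a hypothetical helper name.

```javascript
// Return the first selector whose first match is visible, or null if none is.
// `page` is expected to be a Playwright Page, or any object exposing the same
// locator(selector).first().isVisible() shape.
async function findFirstVisible(page, selectors) {
  for (const selector of selectors) {
    try {
      const candidate = page.locator(selector).first();
      if (await candidate.isVisible()) return selector;
    } catch {
      // Invalid selector or closed page: fall through to the next pattern
    }
  }
  return null;
}
```

A caller would then click the winning selector, for example `const s = await findFirstVisible(page, selectors); if (s) await page.locator(s).first().click();`. Separating "find" from "click" also makes the selector order easy to test with a stub page object.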
Rate Limiting Best Practices:
Keep concurrency at 5 or lower to avoid triggering rate limits. Exponential backoff (2^attempt seconds) prevents aggressive retry behavior. Always respect robots.txt and publisher Terms of Service. Consider adding random jitter to backoff delays to prevent synchronized retry storms.
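As a sketch of the jitter suggestion, the helper below computes the same `2^attempt` second delay, caps it, and then picks a random value below that cap (often called "full jitter"). The names `backoffDelay`, `sleep`, and `withRetry`, and the 30 second cap, are assumptions made for illustration.

```javascript
// Delay before retrying after failed attempt number `attempt` (1-based):
// baseMs * 2^attempt, capped at maxMs, then randomized into [0, cappedDelay).
function backoffDelay(attempt, { baseMs = 1000, maxMs = 30000, random = Math.random } = {}) {
  const exponential = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * exponential);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical retry wrapper around any async operation.
// Rethrows the last error once all attempts are used up.
async function withRetry(operation, { attempts = 3, ...backoff } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < attempts) await sleep(backoffDelay(attempt, backoff));
    }
  }
  throw lastError;
}
```

Randomizing the delay means that several downloads failing at the same moment do not all retry at the same moment. The `random` parameter exists only so the delay can be tested deterministically.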
Intelligent File Naming
Consistent, descriptive filenames make papers easy to find and organize. The system generates filenames in the format: LastName_Year_ShortTitle.pdf
  /**
   * Generate consistent filename from paper metadata
   */
  generateFilename(paper) {
    const firstAuthor = this.extractFirstAuthor(paper.authors);
    const year = paper.year || 'unknown';
    const title = this.sanitizeFilename(paper.title || 'untitled');

    // Format: LastName_Year_ShortTitle.pdf
    // Truncate without appending '...', which would put dots back into the name
    const maxTitleLength = 50;
    const shortTitle = title.substring(0, maxTitleLength).replace(/_+$/, '');

    return `${firstAuthor}_${year}_${shortTitle}.pdf`;
  }

Example output: Smith_2023_machine_learning_for_climate_prediction.pdf
  extractFirstAuthor(authors) {
    if (!authors || authors.length === 0) return 'Unknown';
    const first = Array.isArray(authors) ? authors[0] : authors;

    // "Last, First" -> text before the comma; "First Last" -> final word
    const name = first.includes(',')
      ? first.split(',')[0]
      : first.trim().split(/\s+/).pop();
    const lastName = name.trim().replace(/[^\w]/g, '');
    return lastName || 'Unknown';
  }

Handles multiple formats:
- "Smith, John" → Smith
- "John Smith" → Smith (last word)
- undefined → Unknown
  sanitizeFilename(str) {
    return str
      .replace(/[^\w\s-]/g, '')   // Remove special chars
      .replace(/\s+/g, '_')       // Spaces to underscores
      .replace(/_+/g, '_')        // Collapse multiple underscores
      .toLowerCase();
  }

Safe for all filesystems: Removes problematic characters like /, \, :, *, ?, ", <, >, | that cause errors on Windows/macOS/Linux.
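The three naming steps can be tried outside the class as plain functions. This is a standalone sketch, not the class code: `lastNameOf`, `slugify`, and `buildFilename` are hypothetical names, and taking the last word of a "First Last" name is an assumption about the intended behavior.

```javascript
// Standalone sketch of LastName_Year_ShortTitle.pdf generation.
function lastNameOf(authors) {
  const first = Array.isArray(authors) ? authors[0] : authors;
  if (typeof first !== 'string' || first.trim() === '') return 'Unknown';
  // "Smith, John" -> text before the comma; "John Smith" -> last word
  const raw = first.includes(',')
    ? first.split(',')[0]
    : first.trim().split(/\s+/).pop();
  return raw.replace(/[^\w]/g, '') || 'Unknown';
}

function slugify(text, maxLength = 50) {
  return String(text ?? 'untitled')
    .replace(/[^\w\s-]/g, '') // drop punctuation and path separators
    .trim()
    .replace(/\s+/g, '_')     // spaces to underscores
    .replace(/_+/g, '_')      // collapse repeated underscores
    .toLowerCase()
    .slice(0, maxLength)
    .replace(/_+$/, '');      // no trailing underscore after truncation
}

function buildFilename(paper) {
  const year = paper.year ?? 'unknown';
  return `${lastNameOf(paper.authors)}_${year}_${slugify(paper.title)}.pdf`;
}

console.log(buildFilename({
  title: 'Machine Learning for Climate Prediction',
  authors: ['Smith, John'],
  year: 2023
}));
// prints "Smith_2023_machine_learning_for_climate_prediction.pdf"
```

Note that `\w` in JavaScript regular expressions matches ASCII letters, digits, and underscore only, so accented characters in names or titles are dropped by this approach.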
File Organization Strategies
Different research workflows require different organization approaches. The system supports several organization strategies, four of which are covered here:
Organize by Topic
Group papers by research topic using keywords or categories from metadata.
  /**
   * Organize by topic (using keywords or categories)
   */
  async organizeByTopic(papers) {
    const organized = {};

    for (const paper of papers) {
      const topics = this.extractTopics(paper);

      for (const topic of topics) {
        const topicDir = path.join(this.baseDir, this.sanitizeDirName(topic));
        await fs.mkdir(topicDir, { recursive: true });

        const sourcePath = paper.localPath || paper.filepath;
        const targetPath = path.join(topicDir, path.basename(sourcePath));
        await this.copyOrLink(sourcePath, targetPath);

        if (!organized[topic]) organized[topic] = [];
        organized[topic].push(paper);
      }
    }

    await this.generateTopicIndexes(organized);
    return organized;
  }

Resulting structure:
research/
├── machine-learning/
│   ├── Smith_2023_neural_networks.pdf
│   ├── Jones_2022_deep_learning.pdf
│   └── INDEX.md
├── climate-science/
│   ├── Brown_2024_climate_modeling.pdf
│   └── INDEX.md
└── interdisciplinary/
    └── ...

Organize by Publication Year
Sort papers chronologically for historical analysis or literature surveys.
  /**
   * Organize by publication date
   */
  async organizeByDate(papers) {
    const organized = {};

    for (const paper of papers) {
      const year = paper.year || 'unknown';
      const yearDir = path.join(this.baseDir, year.toString());
      await fs.mkdir(yearDir, { recursive: true });

      const sourcePath = paper.localPath || paper.filepath;
      const targetPath = path.join(yearDir, path.basename(sourcePath));
      await this.copyOrLink(sourcePath, targetPath);

      if (!organized[year]) organized[year] = [];
      organized[year].push(paper);
    }

    await this.generateYearIndexes(organized);
    return organized;
  }

Resulting structure:
research/
├── 2024/
│   ├── Smith_2024_paper.pdf
│   └── INDEX.md
├── 2023/
│   ├── Jones_2023_paper.pdf
│   └── INDEX.md
└── 2022/
    └── ...

Two-Level Organization
Combine topic and date for hierarchical organization.
  /**
   * Two-level organization: topic/year/papers
   */
  async organizeByTopicAndDate(papers) {
    const organized = {};

    for (const paper of papers) {
      const topics = this.extractTopics(paper);
      const year = paper.year || 'unknown';

      for (const topic of topics) {
        const paperDir = path.join(
          this.baseDir,
          this.sanitizeDirName(topic),
          year.toString()
        );
        await fs.mkdir(paperDir, { recursive: true });

        const sourcePath = paper.localPath || paper.filepath;
        const targetPath = path.join(paperDir, path.basename(sourcePath));
        await this.copyOrLink(sourcePath, targetPath);

        const key = `${topic}/${year}`;
        if (!organized[key]) organized[key] = [];
        organized[key].push(paper);
      }
    }

    await this.generateHierarchicalIndexes(organized);
    return organized;
  }

Resulting structure:
research/
├── machine-learning/
│   ├── 2024/
│   │   ├── Smith_2024_neural_networks.pdf
│   │   └── Brown_2024_transformers.pdf
│   ├── 2023/
│   │   └── Jones_2023_deep_learning.pdf
│   └── INDEX.md
└── climate-science/
    ├── 2024/
    └── 2023/

Flat Structure with Metadata Index
Keep all files in one directory with a comprehensive markdown index.
  /**
   * Extract topics from paper metadata
   */
  extractTopics(paper) {
    const topics = new Set();

    // From explicit keywords
    if (paper.keywords && paper.keywords.length > 0) {
      paper.keywords.slice(0, 3).forEach(k => topics.add(k));
    }

    // From categories (arXiv)
    if (paper.categories && paper.categories.length > 0) {
      paper.categories.slice(0, 2).forEach(c => {
        const clean = c.replace(/^(cs|math|physics)\./, '');
        topics.add(clean);
      });
    }

    // Default topic if none found
    if (topics.size === 0) {
      topics.add('uncategorized');
    }

    return Array.from(topics);
  }

Best for: Small collections, full-text search tools, or when using external reference managers like Zotero.
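For the flat strategy, all files live in one directory and the index does the organizing. A minimal sketch under stated assumptions follows: `buildFlatIndex` is a hypothetical helper, each paper is assumed to carry `title`, `authors`, `year`, and a `filepath` inside the flat directory, and the table syntax assumes a renderer that supports GitHub-style markdown tables.

```javascript
// Hypothetical flat index: one markdown table row per paper, newest first.
function buildFlatIndex(papers) {
  // Last path segment, accepting both / and \ separators
  const basename = (p) => String(p).split(/[\\\/]/).pop();
  // Escape pipes so a title cannot break the table layout
  const cell = (value) => String(value ?? 'N/A').replace(/\|/g, '\\|');

  const rows = [...papers] // copy so the caller's array is not reordered
    .sort((a, b) => (b.year || 0) - (a.year || 0))
    .map((p) => {
      const authors = Array.isArray(p.authors) ? p.authors.join('; ') : p.authors;
      const file = basename(p.filepath);
      return `| ${cell(p.year)} | ${cell(authors)} | ${cell(p.title)} | [${file}](./${file}) |`;
    });

  return [
    '# Paper Index',
    '',
    `Total papers: ${papers.length}`,
    '',
    '| Year | Authors | Title | File |',
    '| --- | --- | --- | --- |',
    ...rows
  ].join('\n');
}
```

Writing the returned string to an `INDEX.md` next to the PDFs gives a single searchable overview without any directory hierarchy.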
Markdown Index Generation
Each organization strategy generates markdown index files for easy browsing and reference.
  /**
   * Generate markdown index file
   */
  generateIndexMarkdown(title, papers) {
    const lines = [
      `# ${title}`,
      '',
      `Total papers: ${papers.length}`,
      '',
      '## Papers',
      ''
    ];

    // Sort by year descending (on a copy, so the caller's array is not mutated)
    const sorted = [...papers].sort((a, b) => (b.year || 0) - (a.year || 0));

    for (const paper of sorted) {
      const file = path.basename(paper.localPath || paper.filepath);

      lines.push(`### ${paper.title}`);
      lines.push('');
      lines.push(`**Authors**: ${this.formatAuthors(paper.authors)}`);
      lines.push(`**Year**: ${paper.year || 'N/A'}`);
      if (paper.doi) lines.push(`**DOI**: ${paper.doi}`);
      if (paper.journal) lines.push(`**Journal**: ${paper.journal}`);
      lines.push('');

      if (paper.abstract) {
        // Only add an ellipsis when the abstract was actually truncated
        const excerpt = paper.abstract.length > 300
          ? paper.abstract.substring(0, 300) + '...'
          : paper.abstract;
        lines.push(`**Abstract**: ${excerpt}`);
        lines.push('');
      }

      lines.push(`**File**: [${file}](./${file})`);
      lines.push('');
      lines.push('---');
      lines.push('');
    }

    return lines.join('\n');
  }

Example index output:
# Machine Learning

Total papers: 15

## Papers

### Neural Networks for Climate Prediction

**Authors**: Smith, J., Brown, K., Jones, L.
**Year**: 2024
**DOI**: 10.1234/example.2024
**Journal**: Nature Machine Learning

**Abstract**: This paper presents a novel approach to climate prediction using deep neural networks...

**File**: [Smith_2024_neural_networks_for_climate_prediction.pdf](./Smith_2024_neural_networks_for_climate_prediction.pdf)

---

Index File Benefits:
Markdown indexes provide human-readable summaries without opening individual PDFs. They work with any text editor, support full-text search, and can be rendered as HTML for web-based browsing. The file links are relative paths, making the entire directory structure portable.
Usage Example
Put it all together with a complete download and organization workflow:
import DownloadManager from './download-manager.js';
import PaperOrganizer from './organizer.js';

// Initialize managers
const downloader = new DownloadManager({
  downloadDir: './downloads/temp',
  concurrency: 5,
  timeout: 60000,
  retryAttempts: 3,
  authState: './auth-state.json' // From Episode 3
});

const organizer = new PaperOrganizer('./downloads/organized');

// Papers from discovery phase
const papers = [
  {
    title: "Neural Networks for Climate Prediction",
    authors: ["Smith, J.", "Brown, K."],
    year: 2024,
    pdfUrl: "https://example.com/paper1.pdf",
    keywords: ["machine-learning", "climate-science"]
  },
  // ... more papers
];

// Download with progress tracking
const progressCallback = ({ paper, status, filepath, error }) => {
  console.log(`[${status.toUpperCase()}] ${paper.title}`);
  if (error) console.error(`  Error: ${error}`);
};

const downloadResults = await downloader.downloadBatch(papers, progressCallback);

console.log('\nDownload Statistics:');
console.log(downloader.getStats());

// Organize downloaded papers. Each entry merges the paper's metadata with its
// saved path, since the organizer needs keywords and year as well as the file.
const organized = await organizer.organize(
  downloadResults.results
    .filter(r => r.status === 'fulfilled')
    .map(r => ({ ...r.paper, filepath: r.result.filepath })),
  'topic-date' // Organization strategy
);

console.log(`\nOrganized ${Object.keys(organized).length} topic/year combinations`);

Expected output:
[COMPLETED] Neural Networks for Climate Prediction
[SKIPPED] Deep Learning for Weather Forecasting
[COMPLETED] Transformers for Time Series Analysis
[FAILED] Paper Behind Paywall
  Error: 403 Forbidden

Download Statistics:
{
  total: 4,
  completed: 2,
  failed: 1,
  skipped: 1,
  successRate: '50.0%'
}

Organized 3 topic/year combinations

Key Takeaways
The download manager provides robust, efficient PDF acquisition with intelligent organization. Key features include parallel processing with concurrency control, retry logic with exponential backoff, consistent filename generation, and multiple organization strategies for different research needs.
Performance gains: Parallel downloading reduces acquisition time from 2-4 minutes per paper to 30-60 seconds for batches of 20-30 papers. Smart deduplication prevents re-downloading existing files. Automatic organization eliminates manual filing time.
Next steps: The next chapter covers text extraction, converting downloaded PDFs into structured data for analysis and citation management.