Download Management

Parallel processing, intelligent naming, and file organization

Introduction

Downloading 50+ PDFs one at a time is painfully slow and inefficient. This chapter demonstrates how to build an intelligent download management system that handles parallel processing, implements smart retry logic, generates consistent filenames, and organizes files for easy retrieval.

What you'll build:

  • Parallel download manager with configurable concurrency
  • Intelligent retry logic with exponential backoff
  • Consistent filename generation from paper metadata
  • Multiple file organization strategies

Download Manager Architecture

Key Architecture Principles:

The download manager uses a queue-based approach with controlled concurrency to maximize throughput while respecting rate limits. The p-limit library ensures only N downloads run simultaneously (typically 5), preventing server overload and connection exhaustion.
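
To make the queue idea concrete, here is a minimal sketch of what a concurrency limiter does internally. It is not p-limit's actual implementation; createLimiter and the demo function are hypothetical names used only for illustration, and the chapter's code should keep using p-limit itself.

```javascript
// Minimal concurrency limiter, for illustration only.
// createLimiter is a hypothetical name; in the chapter's code p-limit fills this role.
function createLimiter(maxConcurrent) {
  let active = 0;      // tasks currently running
  const waiting = [];  // tasks queued until a slot frees up

  const next = () => {
    if (active >= maxConcurrent || waiting.length === 0) return;
    active++;
    const { task, resolve, reject } = waiting.shift();
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        active--;
        next(); // a slot is free, start the next queued task
      });
  };

  // The returned function queues a task and resolves with its result.
  return (task) =>
    new Promise((resolve, reject) => {
      waiting.push({ task, resolve, reject });
      next();
    });
}

// Demo: ten fake "downloads", at most three in flight at once.
async function demo() {
  const limit = createLimiter(3);
  let inFlight = 0;
  let peak = 0;

  const fakeDownload = (id) => async () => {
    inFlight++;
    peak = Math.max(peak, inFlight);
    await new Promise((r) => setTimeout(r, 20));
    inFlight--;
    return id;
  };

  const ids = await Promise.all(
    Array.from({ length: 10 }, (_, i) => limit(fakeDownload(i)))
  );
  return { ids, peak };
}
```

Results come back in submission order even though the tasks finish at different times, because each queued task resolves its own promise.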

Initialize Download Manager

Create a DownloadManager class with configurable options for concurrency, timeout, retry attempts, and authentication state.

download-manager.js
import fs from 'fs/promises';
import path from 'path';
import { chromium } from 'playwright';
import pLimit from 'p-limit';

class DownloadManager {
  constructor(options = {}) {
    this.downloadDir = options.downloadDir || './papers';
    this.concurrency = options.concurrency || 5; // Parallel downloads
    this.timeout = options.timeout || 60000; // 60 seconds per download
    this.retryAttempts = options.retryAttempts || 3;
    this.authState = options.authState || null;

    this.stats = {
      total: 0,
      completed: 0,
      failed: 0,
      skipped: 0
    };
  }

Configuration Parameters:

  • concurrency: Number of simultaneous downloads (default: 5)
  • timeout: Maximum time per download in milliseconds (default: 60000)
  • retryAttempts: Number of retry attempts before failure (default: 3)
  • authState: Playwright storage state for authenticated sessions

Implement Batch Download

Process multiple papers in parallel using a concurrency limiter to control the number of simultaneous downloads.

  /**
   * Download multiple PDFs in parallel
   */
  async downloadBatch(papers, progressCallback = null) {
    await this.ensureDownloadDir();

    const limit = pLimit(this.concurrency);
    this.stats.total = papers.length;

    const downloads = papers.map(paper =>
      limit(() => this.downloadPaper(paper, progressCallback))
    );

    const results = await Promise.allSettled(downloads);

    return {
      stats: this.stats,
      results: results.map((r, i) => ({
        paper: papers[i],
        status: r.status,
        result: r.status === 'fulfilled' ? r.value : r.reason
      }))
    };
  }

How it works:

  • pLimit(this.concurrency) creates a queue that allows only N concurrent operations
  • Promise.allSettled waits for all downloads to complete (success or failure)
  • Progress callback enables real-time status updates
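
Since Promise.allSettled never rejects, the caller has to inspect each outcome. The sketch below shows one way to split outcomes into successes and failures; summarizeSettled and demo are hypothetical helpers, not part of the DownloadManager class shown in this chapter.

```javascript
// summarizeSettled is a hypothetical helper, not part of the chapter's DownloadManager.
// It pairs each settled outcome with the paper at the same index.
function summarizeSettled(papers, settled) {
  const succeeded = [];
  const failed = [];

  settled.forEach((outcome, i) => {
    if (outcome.status === 'fulfilled') {
      succeeded.push({ paper: papers[i], result: outcome.value });
    } else {
      // outcome.reason is whatever the promise rejected with, usually an Error.
      const message = outcome.reason instanceof Error
        ? outcome.reason.message
        : String(outcome.reason);
      failed.push({ paper: papers[i], error: message });
    }
  });

  return { succeeded, failed };
}

// Demo with one resolved and one rejected promise.
async function demo() {
  const papers = [{ title: 'A' }, { title: 'B' }];
  const settled = await Promise.allSettled([
    Promise.resolve({ status: 'completed', filepath: './papers/a.pdf' }),
    Promise.reject(new Error('403 Forbidden'))
  ]);
  return summarizeSettled(papers, settled);
}
```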

Add Retry Logic with Exponential Backoff

Implement robust retry logic to handle transient network errors and rate limiting.

  /**
   * Download a single paper with retry logic
   */
  async downloadPaper(paper, progressCallback) {
    const filename = this.generateFilename(paper);
    const filepath = path.join(this.downloadDir, filename);

    // Check if already downloaded
    const exists = await this.fileExists(filepath);
    if (exists) {
      const size = await this.getFileSize(filepath);
      if (size > 10000) { // At least 10KB
        this.stats.skipped++;
        if (progressCallback) {
          progressCallback({ paper, status: 'skipped', filepath });
        }
        return { status: 'skipped', filepath };
      }
    }

    // Try downloading with retries
    for (let attempt = 1; attempt <= this.retryAttempts; attempt++) {
      try {
        const result = await this.attemptDownload(paper, filepath);

        this.stats.completed++;
        if (progressCallback) {
          progressCallback({ paper, status: 'completed', filepath, attempt });
        }

        return result;

      } catch (error) {
        console.error(`Download attempt ${attempt} failed for ${paper.title}:`, error.message);

        if (attempt === this.retryAttempts) {
          this.stats.failed++;
          if (progressCallback) {
            progressCallback({ paper, status: 'failed', error: error.message });
          }
          throw error;
        }

        // Exponential backoff
        await this.delay(Math.pow(2, attempt) * 1000);
      }
    }
  }
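
The listing above calls fileExists and getFileSize without showing them. As a sketch of the same skip check in one step, the hypothetical shouldSkipDownload below takes a stat function as a parameter; it is assumed to behave like stat from 'fs/promises', resolving to an object with a numeric size and rejecting with code 'ENOENT' when the file is missing.

```javascript
// shouldSkipDownload is a hypothetical standalone version of the
// fileExists/getFileSize check. `stat` is assumed to behave like
// stat from 'fs/promises'.
async function shouldSkipDownload(stat, filepath, minBytes = 10000) {
  try {
    const info = await stat(filepath);
    // Very small files are treated as failed or partial downloads.
    return info.size > minBytes;
  } catch (error) {
    if (error.code === 'ENOENT') return false; // not downloaded yet
    throw error; // other errors (e.g. permissions) should not be hidden
  }
}
```

Inside the class this could be called as shouldSkipDownload(fs.stat, filepath), using the fs import at the top of download-manager.js. Passing stat in keeps the check easy to test without touching the disk.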

Implement PDF Download with Playwright

Handle both direct PDF downloads and indirect downloads requiring button clicks.

  /**
   * Attempt to download PDF using Playwright
   */
  async attemptDownload(paper, filepath) {
    if (!paper.pdfUrl) {
      throw new Error('No PDF URL available');
    }

    const browser = await chromium.launch({ headless: true });
    const contextOptions = {
      acceptDownloads: true,
      viewport: { width: 1920, height: 1080 }
    };

    if (this.authState) {
      contextOptions.storageState = this.authState;
    }

    const context = await browser.newContext(contextOptions);
    const page = await context.newPage();

    try {
      // Prefer PDF responses via the Accept header
      await page.setExtraHTTPHeaders({
        'Accept': 'application/pdf,application/octet-stream'
      });

      // Navigate to PDF URL
      const response = await page.goto(paper.pdfUrl, {
        waitUntil: 'networkidle',
        timeout: this.timeout
      });

      // Check if response is PDF (goto can return null, hence the optional chaining)
      const contentType = response?.headers()['content-type'];

      if (contentType?.includes('application/pdf')) {
        // Direct PDF download
        const buffer = await response.body();
        await fs.writeFile(filepath, buffer);
        return { status: 'completed', filepath, size: buffer.length };
      }

      // Handle download button click (indirect PDF).
      // Start listening before the click so the event is not missed.
      const downloadPromise = page.waitForEvent('download', {
        timeout: this.timeout
      });
      // Avoid an unhandled rejection if no download ever starts
      downloadPromise.catch(() => {});

      // Try common download button patterns
      const clicked = await this.clickDownloadButton(page);

      if (clicked) {
        const download = await downloadPromise;
        await download.saveAs(filepath);
        return { status: 'completed', filepath };
      }

      throw new Error('Could not trigger PDF download');

    } finally {
      await browser.close();
    }
  }
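
A content-type header can be wrong, and paywalled sites sometimes return an HTML login page where a PDF was expected. As an optional extra check (not part of the original listing), the hypothetical looksLikePdf below tests whether a buffer starts with the ASCII signature "%PDF-" that PDF files normally begin with.

```javascript
// looksLikePdf is a hypothetical helper, not part of the chapter's class.
// It checks for the "%PDF-" signature at the start of the buffer.
function looksLikePdf(buffer) {
  const signature = Buffer.from('%PDF-', 'ascii');
  return buffer.length >= signature.length &&
    buffer.subarray(0, signature.length).equals(signature);
}
```

One possible use is to call it on the buffer before fs.writeFile in the direct-download branch and throw if it returns false, so the retry loop treats an HTML error page as a failed attempt.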

Handle Download Button Patterns

Try multiple selector patterns to find and click download buttons across different publisher sites.

  /**
   * Try clicking various download button patterns
   */
  async clickDownloadButton(page) {
    const selectors = [
      'a[href$=".pdf"]',
      'button:has-text("Download")',
      'a:has-text("PDF")',
      '.download-pdf',
      '[aria-label*="download"]',
      '[data-test="download-button"]'
    ];

    for (const selector of selectors) {
      try {
        const button = await page.$(selector);
        if (button && await button.isVisible()) {
          await button.click();
          await page.waitForTimeout(1000);
          return true;
        }
      } catch (error) {
        // Try next selector
        continue;
      }
    }

    return false;
  }

Why this approach works: Different publishers use different download button implementations. Testing multiple selectors in order of likelihood maximizes success rate across diverse sites.

Rate Limiting Best Practices:

Keep concurrency at 5 or lower to avoid triggering rate limits. Exponential backoff (2^attempt seconds) prevents aggressive retry behavior. Always respect robots.txt and publisher Terms of Service. Consider adding random jitter to backoff delays to prevent synchronized retry storms.
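
The jitter suggestion can be sketched as follows. backoffWithJitter is a hypothetical helper, and baseMs, maxMs, and the injectable random function are assumptions made for this example. It keeps half of the exponential delay fixed and randomizes the other half, so parallel workers that fail together do not all retry at the same instant.

```javascript
// backoffWithJitter is a hypothetical helper; baseMs and maxMs are assumed defaults.
// "Equal jitter": half of the capped exponential delay is fixed, half is random.
function backoffWithJitter(attempt, { baseMs = 1000, maxMs = 30000, random = Math.random } = {}) {
  const exponential = Math.min(maxMs, baseMs * 2 ** attempt);
  const half = exponential / 2;
  return Math.floor(half + random() * half);
}
```

In the retry loop this would replace the fixed delay, for example: await this.delay(backoffWithJitter(attempt)). For attempt 1 the result falls between 1000 and 2000 ms instead of always being 2000 ms.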


Intelligent File Naming

Consistent, descriptive filenames make papers easy to find and organize. The system generates filenames in the format: LastName_Year_ShortTitle.pdf

  /**
   * Generate consistent filename from paper metadata
   */
  generateFilename(paper) {
    const firstAuthor = this.extractFirstAuthor(paper.authors);
    const year = paper.year || 'unknown';
    const title = this.sanitizeFilename(paper.title || 'untitled');

    // Format: LastName_Year_ShortTitle.pdf
    const maxTitleLength = 50;
    // Truncate long titles; no ellipsis, so the name never ends in "....pdf"
    const shortTitle = title.length > maxTitleLength
      ? title.substring(0, maxTitleLength)
      : title;

    return `${firstAuthor}_${year}_${shortTitle}.pdf`;
  }

Example output: Smith_2023_machine_learning_for_climate_prediction.pdf

  extractFirstAuthor(authors) {
    if (!authors || authors.length === 0) return 'Unknown';
    const first = Array.isArray(authors) ? authors[0] : authors;

    // Extract last name
    const parts = first.split(',');
    const lastName = parts[0].trim().replace(/[^\w]/g, '');
    return lastName || 'Unknown';
  }

Handles multiple formats:

  • "Smith, John" → Smith
  • "John Smith" → JohnSmith (no comma, so the whole name is kept with the space removed)
  • undefined → Unknown

  sanitizeFilename(str) {
    return str
      .replace(/[^\w\s-]/g, '') // Remove special chars
      .replace(/\s+/g, '_')      // Spaces to underscores
      .replace(/_+/g, '_')       // Collapse multiple underscores
      .toLowerCase();
  }

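Safe across common filesystems: Removes characters such as /, \, :, *, ?, ", <, >, and | that are invalid or troublesome in filenames on Windows, macOS, or Linux. Because \w matches only ASCII letters, digits, and underscore, accented characters are removed as well.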
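
The extractFirstAuthor listing above splits only on commas, so a name written as "John Smith" is kept whole. If the metadata mixes both formats, a variant like the hypothetical lastNameOf below may help. It assumes that, when there is no comma, the final whitespace-separated token is the surname; that assumption fails for multi-word surnames such as "van der Berg".

```javascript
// lastNameOf is a hypothetical variant of extractFirstAuthor.
// Assumption: without a comma, the final whitespace-separated token is the surname.
function lastNameOf(authors) {
  if (!authors || authors.length === 0) return 'Unknown';
  const first = Array.isArray(authors) ? authors[0] : authors;
  if (typeof first !== 'string') return 'Unknown';

  const raw = first.includes(',')
    ? first.split(',')[0]               // "Smith, John" -> "Smith"
    : first.trim().split(/\s+/).pop();  // "John Smith"  -> "Smith"

  // Same cleanup as the original: keep only ASCII word characters.
  const lastName = raw.trim().replace(/[^\w]/g, '');
  return lastName || 'Unknown';
}
```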


File Organization Strategies

Different research workflows require different organization approaches. This section covers four strategies: by topic, by publication year, by topic and year combined, and a flat directory with a metadata index.

Organize by Topic

Group papers by research topic using keywords or categories from metadata.

  /**
   * Organize by topic (using keywords or categories)
   */
  async organizeByTopic(papers) {
    const organized = {};

    for (const paper of papers) {
      const topics = this.extractTopics(paper);

      for (const topic of topics) {
        const topicDir = path.join(this.baseDir, this.sanitizeDirName(topic));
        await fs.mkdir(topicDir, { recursive: true });

        const sourcePath = paper.localPath || paper.filepath;
        const targetPath = path.join(topicDir, path.basename(sourcePath));

        await this.copyOrLink(sourcePath, targetPath);

        if (!organized[topic]) organized[topic] = [];
        organized[topic].push(paper);
      }
    }

    await this.generateTopicIndexes(organized);
    return organized;
  }

Resulting structure:

research/
├── machine-learning/
│   ├── Smith_2023_neural_networks.pdf
│   ├── Jones_2022_deep_learning.pdf
│   └── INDEX.md
├── climate-science/
│   ├── Brown_2024_climate_modeling.pdf
│   └── INDEX.md
└── interdisciplinary/
    └── ...
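
The topic listing relies on a sanitizeDirName helper that is not shown. One possible version, consistent with the hyphenated directory names in the structure above, is sketched below. The exact rules are an assumption, including the 'uncategorized' fallback for names that sanitize to nothing.

```javascript
// One possible sanitizeDirName; the chapter does not show its implementation.
function sanitizeDirName(name) {
  const cleaned = String(name)
    .toLowerCase()
    .replace(/[^\w\s-]/g, '')  // drop everything except letters, digits, _, whitespace, -
    .trim()
    .replace(/[\s_]+/g, '-')   // spaces and underscores become hyphens
    .replace(/-+/g, '-')       // collapse runs of hyphens
    .replace(/^-|-$/g, '');    // no leading or trailing hyphen
  return cleaned || 'uncategorized';
}
```

Removing slashes and dots also means a keyword such as "../etc" cannot escape the base directory when it is passed to path.join.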

Organize by Publication Year

Sort papers chronologically for historical analysis or literature surveys.

  /**
   * Organize by publication date
   */
  async organizeByDate(papers) {
    const organized = {};

    for (const paper of papers) {
      const year = paper.year || 'unknown';
      const yearDir = path.join(this.baseDir, year.toString());
      await fs.mkdir(yearDir, { recursive: true });

      const sourcePath = paper.localPath || paper.filepath;
      const targetPath = path.join(yearDir, path.basename(sourcePath));

      await this.copyOrLink(sourcePath, targetPath);

      if (!organized[year]) organized[year] = [];
      organized[year].push(paper);
    }

    await this.generateYearIndexes(organized);
    return organized;
  }

Resulting structure:

research/
├── 2024/
│   ├── Smith_2024_paper.pdf
│   └── INDEX.md
├── 2023/
│   ├── Jones_2023_paper.pdf
│   └── INDEX.md
└── 2022/
    └── ...

Two-Level Organization

Combine topic and date for hierarchical organization.

  /**
   * Two-level organization: topic/year/papers
   */
  async organizeByTopicAndDate(papers) {
    const organized = {};

    for (const paper of papers) {
      const topics = this.extractTopics(paper);
      const year = paper.year || 'unknown';

      for (const topic of topics) {
        const paperDir = path.join(
          this.baseDir,
          this.sanitizeDirName(topic),
          year.toString()
        );
        await fs.mkdir(paperDir, { recursive: true });

        const sourcePath = paper.localPath || paper.filepath;
        const targetPath = path.join(paperDir, path.basename(sourcePath));

        await this.copyOrLink(sourcePath, targetPath);

        const key = `${topic}/${year}`;
        if (!organized[key]) organized[key] = [];
        organized[key].push(paper);
      }
    }

    await this.generateHierarchicalIndexes(organized);
    return organized;
  }

Resulting structure:

research/
├── machine-learning/
│   ├── 2024/
│   │   ├── Smith_2024_neural_networks.pdf
│   │   └── Brown_2024_transformers.pdf
│   ├── 2023/
│   │   └── Jones_2023_deep_learning.pdf
│   └── INDEX.md
└── climate-science/
    ├── 2024/
    └── 2023/

Flat Structure with Metadata Index

Keep all files in one directory with a comprehensive markdown index. The listing below is the extractTopics helper that the topic-based strategies above rely on.

  /**
   * Extract topics from paper metadata
   */
  extractTopics(paper) {
    const topics = new Set();

    // From explicit keywords
    if (paper.keywords && paper.keywords.length > 0) {
      paper.keywords.slice(0, 3).forEach(k => topics.add(k));
    }

    // From categories (arXiv)
    if (paper.categories && paper.categories.length > 0) {
      paper.categories.slice(0, 2).forEach(c => {
        const clean = c.replace(/^(cs|math|physics)\./, '');
        topics.add(clean);
      });
    }

    // Default topic if none found
    if (topics.size === 0) {
      topics.add('uncategorized');
    }

    return Array.from(topics);
  }

Best for: Small collections, full-text search tools, or when using external reference managers like Zotero.
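
With a flat layout, the index carries all of the structure. As a rough sketch of what such an index could look like, the hypothetical buildFlatIndex below renders one markdown table row per paper. The column choice and the "All Papers" title are assumptions, and the function only builds the string; writing it to disk is left to the caller.

```javascript
// buildFlatIndex is a hypothetical helper; the column choice is an assumption.
function buildFlatIndex(papers) {
  // Escape pipes so titles cannot break the table layout.
  const escapeCell = (value) => String(value ?? '').replace(/\|/g, '\\|');
  // Last path segment, for both / and \ separators.
  const baseName = (p) => String(p).split(/[\\/]/).pop();

  const rows = [...papers] // copy so the caller's array keeps its order
    .sort((a, b) => (b.year || 0) - (a.year || 0))
    .map((paper) => {
      const file = baseName(paper.localPath || paper.filepath || '');
      const topics = (paper.keywords || []).join(', ');
      return `| ${paper.year || 'N/A'} | ${escapeCell(paper.title)} | ${escapeCell(topics)} | [${file}](./${file}) |`;
    });

  return [
    '# All Papers',
    '',
    '| Year | Title | Topics | File |',
    '| --- | --- | --- | --- |',
    ...rows,
    ''
  ].join('\n');
}
```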


Markdown Index Generation

Each organization strategy generates markdown index files for easy browsing and reference.

  /**
   * Generate markdown index file
   */
  generateIndexMarkdown(title, papers) {
    const lines = [
      `# ${title}`,
      '',
      `Total papers: ${papers.length}`,
      '',
      '## Papers',
      ''
    ];

    // Sort by year descending
    // Copy before sorting so the caller's array is not reordered
    const sorted = [...papers].sort((a, b) => (b.year || 0) - (a.year || 0));

    for (const paper of sorted) {
      lines.push(`### ${paper.title}`);
      lines.push('');
      lines.push(`**Authors**: ${this.formatAuthors(paper.authors)}`);
      lines.push(`**Year**: ${paper.year || 'N/A'}`);
      if (paper.doi) lines.push(`**DOI**: ${paper.doi}`);
      if (paper.journal) lines.push(`**Journal**: ${paper.journal}`);
      lines.push('');
      if (paper.abstract) {
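        // Add an ellipsis only when the abstract was actually truncated
        lines.push(`**Abstract**: ${paper.abstract.substring(0, 300)}${paper.abstract.length > 300 ? '...' : ''}`);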
        lines.push('');
      }
      lines.push(`**File**: [${path.basename(paper.localPath || paper.filepath)}](./${path.basename(paper.localPath || paper.filepath)})`);
      lines.push('');
      lines.push('---');
      lines.push('');
    }

    return lines.join('\n');
  }

Example index output:

# Machine Learning

Total papers: 15

## Papers

### Neural Networks for Climate Prediction

**Authors**: Smith, J., Brown, K., Jones, L.
**Year**: 2024
**DOI**: 10.1234/example.2024
**Journal**: Nature Machine Learning

**Abstract**: This paper presents a novel approach to climate prediction using deep neural networks...

**File**: [Smith_2024_neural_networks_for_climate_prediction.pdf](./Smith_2024_neural_networks_for_climate_prediction.pdf)

---

Index File Benefits:

Markdown indexes provide human-readable summaries without opening individual PDFs. They work with any text editor, support full-text search, and can be rendered as HTML for web-based browsing. The file links are relative paths, making the entire directory structure portable.
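
generateIndexMarkdown calls a formatAuthors helper that is not shown. A minimal sketch that reproduces the author line in the example index is given below; the maxShown parameter and the "et al." cutoff are assumptions added for illustration.

```javascript
// One possible formatAuthors; maxShown and the "et al." cutoff are assumptions.
function formatAuthors(authors, maxShown = 3) {
  if (!authors || authors.length === 0) return 'Unknown';
  const list = Array.isArray(authors) ? authors : [authors];
  if (list.length <= maxShown) return list.join(', ');
  return `${list.slice(0, maxShown).join(', ')}, et al.`;
}
```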


Usage Example

Put it all together with a complete download and organization workflow:

import DownloadManager from './download-manager.js';
import PaperOrganizer from './organizer.js';

// Initialize managers
const downloader = new DownloadManager({
  downloadDir: './downloads/temp',
  concurrency: 5,
  timeout: 60000,
  retryAttempts: 3,
  authState: './auth-state.json' // From Episode 3
});

const organizer = new PaperOrganizer('./downloads/organized');

// Papers from discovery phase
const papers = [
  {
    title: "Neural Networks for Climate Prediction",
    authors: ["Smith, J.", "Brown, K."],
    year: 2024,
    pdfUrl: "https://example.com/paper1.pdf",
    keywords: ["machine-learning", "climate-science"]
  },
  // ... more papers
];

// Download with progress tracking
const progressCallback = ({ paper, status, filepath, error }) => {
  console.log(`[${status.toUpperCase()}] ${paper.title}`);
  if (error) console.error(`  Error: ${error}`);
};

const downloadResults = await downloader.downloadBatch(papers, progressCallback);

console.log('\nDownload Statistics:');
console.log(downloader.getStats());

// Organize downloaded papers
const organized = await organizer.organize(
  downloadResults.results
    .filter(r => r.status === 'fulfilled')
    // Keep the paper metadata and attach the saved file's path
    .map(r => ({ ...r.paper, filepath: r.result.filepath })),
  'topic-date' // Organization strategy
);

console.log(`\nOrganized ${Object.keys(organized).length} topic/year combinations`);

Expected output:

[COMPLETED] Neural Networks for Climate Prediction
[SKIPPED] Deep Learning for Weather Forecasting
[COMPLETED] Transformers for Time Series Analysis
[FAILED] Paper Behind Paywall
  Error: 403 Forbidden

Download Statistics:
{
  total: 4,
  completed: 2,
  failed: 1,
  skipped: 1,
  successRate: '50.0%'
}

Organized 3 topic/year combinations
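
The usage example calls getStats(), which the earlier listings do not define. Judging from the expected output, where 2 completed out of 4 total gives '50.0%', the success rate appears to be completed divided by total, with skipped files not counted as successes. The hypothetical summarizeStats below sketches that calculation under this assumption.

```javascript
// summarizeStats is a hypothetical stand-in for the body of getStats().
// Assumption: successRate = completed / total, inferred from the expected output.
function summarizeStats(stats) {
  const successRate = stats.total > 0
    ? `${((stats.completed / stats.total) * 100).toFixed(1)}%`
    : 'N/A'; // avoid dividing by zero for an empty batch
  return { ...stats, successRate };
}
```

Inside the class, getStats() could simply return summarizeStats(this.stats).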

Key Takeaways

The download manager provides robust, efficient PDF acquisition with intelligent organization. Key features include parallel processing with concurrency control, retry logic with exponential backoff, consistent filename generation, and multiple organization strategies for different research needs.

Performance gains: Parallel downloading reduces acquisition time from 2-4 minutes per paper to 30-60 seconds for batches of 20-30 papers. Smart deduplication prevents re-downloading existing files. Automatic organization eliminates manual filing time.

Next steps: The next chapter covers text extraction, converting downloaded PDFs into structured data for analysis and citation management.