MCP Server Integration

Complete PDF Research Assistant MCP server implementation

Integration: PDF Research Assistant MCP Server

Now for the complete integration: everything from discovery through extraction and organization, unified into a single MCP server that Claude Code can orchestrate.

This chapter covers the complete server architecture, tool definitions for search, download, extract, and organize operations, database coordination, and Claude Code integration patterns.

What This Server Provides:

The PDF Research Assistant MCP server exposes five core tools to Claude Code: search_papers (multi-database discovery), download_papers (parallel PDF downloads), extract_content (text and citation extraction), organize_papers (intelligent filing), and full_research_pipeline (complete automation from query to organized knowledge base).

All tools use JSON schemas for type safety, support progress callbacks, handle errors gracefully, and maintain state across operations. The server runs via stdio transport and can be invoked directly from Claude Code conversations.


Server Architecture

The MCP server coordinates multiple specialized modules into a unified interface.

Project Structure

Set up the complete server project:

pdf-research-assistant/
├── server.js                    # Main MCP server
├── pdf-hunter.js                # Multi-database coordinator
├── download-manager.js          # Parallel download engine
├── extraction-pipeline.js       # Text extraction system
├── organizer.js                 # File organization logic
├── deduplication.js             # Cross-database deduplication
├── discovery/
│   ├── pubmed-discovery.js      # PubMed API integration
│   ├── arxiv-discovery.js       # arXiv API integration
│   ├── jstor-discovery.js       # JSTOR scraping
│   └── ieee-discovery.js        # IEEE Xplore scraping
└── package.json

Install the MCP SDK:

npm install @modelcontextprotocol/sdk
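
Because server.js and the other modules use ES module imports, package.json must declare "type": "module". A minimal example (the SDK version is illustrative, and any dependencies used by the download, extraction, and scraping modules would be listed here as well):

{
  "name": "pdf-research-assistant",
  "version": "1.0.0",
  "type": "module",
  "main": "server.js",
  "dependencies": {
    "@modelcontextprotocol/sdk": "^1.0.0"
  }
}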

MCP Server Core

Create the main server with tool definitions and request handlers.

The server exposes five tools to Claude Code, each with JSON schema validation for input parameters. All tools return structured JSON responses with consistent error handling.

Database Coordination

The PDF Hunter module coordinates searches across multiple databases, handles failures gracefully with Promise.allSettled, and deduplicates results using DOI and title matching.

Each database returns a standardized paper object with title, authors, year, abstract, DOI, and source URL. The coordinator merges results and removes duplicates before returning to Claude Code.
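
The deduplicator keys on DOI when one is available and falls back to a normalized title. A minimal sketch of that logic (illustrative; the doi and title fields follow the standardized paper object above):

// deduplication.js - minimal DOI/title-based deduplication (keeps the first copy seen)
class PaperDeduplicator {
  deduplicate(papers) {
    const seen = new Set();
    const unique = [];

    for (const paper of papers) {
      // Prefer DOI as the identity key; fall back to a normalized title
      const key = paper.doi
        ? `doi:${paper.doi.toLowerCase()}`
        : `title:${(paper.title || '').toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim()}`;

      if (!seen.has(key)) {
        seen.add(key);
        unique.push(paper);
      }
    }

    return unique;
  }
}

export default PaperDeduplicator;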


Complete Server Implementation

// server.js - Main MCP server
import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { ListToolsRequestSchema, CallToolRequestSchema } from '@modelcontextprotocol/sdk/types.js';
import PDFHunter from './pdf-hunter.js';
import DownloadManager from './download-manager.js';
import ExtractionPipeline from './extraction-pipeline.js';
import PaperOrganizer from './organizer.js';

class PDFResearchServer {
  constructor() {
    this.server = new Server(
      {
        name: 'pdf-research-assistant',
        version: '1.0.0',
      },
      {
        capabilities: {
          tools: {},
        },
      }
    );

    this.setupHandlers();
  }

  setupHandlers() {
    // List available tools
    this.server.setRequestHandler(ListToolsRequestSchema, async () => ({
      tools: [
        {
          name: 'search_papers',
          description: 'Search multiple academic databases for papers',
          inputSchema: {
            type: 'object',
            properties: {
              query: { type: 'string', description: 'Search query' },
              databases: {
                type: 'array',
                items: { enum: ['pubmed', 'arxiv', 'jstor', 'ieee'] },
                description: 'Databases to search'
              },
              maxResults: { type: 'number', default: 50 },
              yearFrom: { type: 'number', description: 'Filter from year' }
            },
            required: ['query', 'databases']
          }
        },
        {
          name: 'download_papers',
          description: 'Download PDFs for discovered papers',
          inputSchema: {
            type: 'object',
            properties: {
              papers: { type: 'array', description: 'Papers to download' },
              downloadDir: { type: 'string', default: './papers' },
              concurrency: { type: 'number', default: 5 }
            },
            required: ['papers']
          }
        },
        {
          name: 'extract_content',
          description: 'Extract text and citations from PDFs',
          inputSchema: {
            type: 'object',
            properties: {
              pdfPaths: { type: 'array', description: 'Paths to PDFs' },
              useOCR: { type: 'boolean', default: false },
              extractCitations: { type: 'boolean', default: true }
            },
            required: ['pdfPaths']
          }
        },
        {
          name: 'organize_papers',
          description: 'Organize papers into directory structure',
          inputSchema: {
            type: 'object',
            properties: {
              papers: { type: 'array' },
              strategy: {
                type: 'string',
                enum: ['topic', 'date', 'topic-date', 'author'],
                default: 'topic-date'
              },
              baseDir: { type: 'string', default: './research' }
            },
            required: ['papers']
          }
        },
        {
          name: 'full_research_pipeline',
          description: 'Complete pipeline: search → download → extract → organize',
          inputSchema: {
            type: 'object',
            properties: {
              query: { type: 'string' },
              databases: { type: 'array' },
              maxResults: { type: 'number', default: 50 },
              yearFrom: { type: 'number' },
              downloadDir: { type: 'string', default: './papers' },
              organizeStrategy: { type: 'string', default: 'topic-date' },
              extractText: { type: 'boolean', default: true }
            },
            required: ['query', 'databases']
          }
        }
      ]
    }));

    // Tool execution handler
    this.server.setRequestHandler(CallToolRequestSchema, async (request) => {
      const { name, arguments: args } = request.params;

      try {
        let result;

        switch (name) {
          case 'search_papers':
            result = await this.handleSearchPapers(args);
            break;
          case 'download_papers':
            result = await this.handleDownloadPapers(args);
            break;
          case 'extract_content':
            result = await this.handleExtractContent(args);
            break;
          case 'organize_papers':
            result = await this.handleOrganizePapers(args);
            break;
          case 'full_research_pipeline':
            result = await this.handleFullPipeline(args);
            break;
          default:
            throw new Error(`Unknown tool: ${name}`);
        }

        return {
          content: [
            {
              type: 'text',
              text: JSON.stringify(result, null, 2)
            }
          ]
        };

      } catch (error) {
        return {
          content: [
            {
              type: 'text',
              text: `Error: ${error.message}`
            }
          ],
          isError: true
        };
      }
    });
  }

  async start() {
    const transport = new StdioServerTransport();
    await this.server.connect(transport);
    console.error('PDF Research Assistant MCP server running');
  }
}

// Start server
const server = new PDFResearchServer();
server.start().catch(console.error);

// pdf-hunter.js - Multi-database coordinator
import PubMedDiscovery from './discovery/pubmed-discovery.js';
import ArXivDiscovery from './discovery/arxiv-discovery.js';
import JSTORDiscovery from './discovery/jstor-discovery.js';
import IEEEDiscovery from './discovery/ieee-discovery.js';
import PaperDeduplicator from './deduplication.js';

class PDFHunter {
  constructor(options = {}) {
    this.authPaths = options.authPaths || {};
    this.deduplicator = new PaperDeduplicator();
  }

  /**
   * Search multiple databases and deduplicate results
   */
  async searchMultiple(query, databases, maxResults = 50, yearFrom = null) {
    const searchPromises = databases.map(db =>
      this.searchDatabase(db, query, maxResults, yearFrom)
    );

    const results = await Promise.allSettled(searchPromises);

    const allPapers = [];
    const byDatabase = {};

    for (let i = 0; i < results.length; i++) {
      const db = databases[i];
      const result = results[i];

      if (result.status === 'fulfilled') {
        const papers = result.value.papers.map(p => ({ ...p, source: db }));
        allPapers.push(...papers);
        byDatabase[db] = {
          count: papers.length,
          status: 'success'
        };
      } else {
        console.error(`${db} search failed:`, result.reason);
        byDatabase[db] = {
          count: 0,
          status: 'failed',
          error: result.reason.message
        };
      }
    }

    // Deduplicate across databases
    const uniquePapers = this.deduplicator.deduplicate(allPapers);

    return {
      papers: uniquePapers,
      totalBeforeDedup: allPapers.length,
      totalAfterDedup: uniquePapers.length,
      byDatabase
    };
  }

  async searchDatabase(database, query, maxResults, yearFrom) {
    switch (database) {
      case 'pubmed':
        return await this.searchPubMed(query, maxResults, yearFrom);
      case 'arxiv':
        return await this.searchArXiv(query, maxResults);
      case 'jstor':
        return await this.searchJSTOR(query, maxResults, yearFrom);
      case 'ieee':
        return await this.searchIEEE(query, maxResults);
      default:
        throw new Error(`Unknown database: ${database}`);
    }
  }

  async searchPubMed(query, maxResults, yearFrom) {
    const discovery = new PubMedDiscovery();
    return await discovery.search(query, maxResults, yearFrom);
  }

  async searchArXiv(query, maxResults) {
    const discovery = new ArXivDiscovery();
    return await discovery.search(query, maxResults);
  }

  async searchJSTOR(query, maxResults, yearFrom) {
    const discovery = new JSTORDiscovery(this.authPaths.jstor);
    return await discovery.search(query, maxResults, yearFrom);
  }

  async searchIEEE(query, maxResults) {
    const discovery = new IEEEDiscovery();
    return await discovery.search(query, maxResults);
  }
}

export default PDFHunter;

// Tool implementation methods (inside PDFResearchServer class)

async handleSearchPapers(args) {
  const hunter = new PDFHunter();
  const results = await hunter.searchMultiple(
    args.query,
    args.databases,
    args.maxResults,
    args.yearFrom
  );

  return {
    totalPapers: results.papers.length,
    papers: results.papers,
    byDatabase: results.byDatabase
  };
}

async handleDownloadPapers(args) {
  // Apply schema defaults explicitly; MCP clients do not fill them in automatically
  const manager = new DownloadManager({
    downloadDir: args.downloadDir ?? './papers',
    concurrency: args.concurrency ?? 5
  });

  const results = await manager.downloadBatch(args.papers, (progress) => {
    // Report progress on stderr; stdout carries the MCP protocol stream
    console.error(`Progress: ${progress.paper.title} - ${progress.status}`);
  });

  return {
    stats: manager.getStats(),
    results: results.results
  };
}

async handleExtractContent(args) {
  const pipeline = new ExtractionPipeline({
    useOCR: args.useOCR,
    extractCitations: args.extractCitations
  });

  const extracted = [];

  for (const pdfPath of args.pdfPaths) {
    const data = await pipeline.extractFromPDF(pdfPath);
    extracted.push(data);
  }

  return {
    totalProcessed: extracted.length,
    extracted
  };
}

async handleOrganizePapers(args) {
  // Apply schema defaults explicitly; MCP clients do not fill them in automatically
  const strategy = args.strategy ?? 'topic-date';
  const organizer = new PaperOrganizer(args.baseDir ?? './research');
  const organized = await organizer.organize(args.papers, strategy);

  return {
    strategy,
    categories: Object.keys(organized).length,
    organized
  };
}

async handleFullPipeline(args) {
  const results = {
    search: null,
    download: null,
    extraction: null,
    organization: null
  };

  // Step 1: Search
  // Progress logging goes to stderr; stdout carries the MCP protocol stream
  console.error('Step 1: Searching databases...');
  const hunter = new PDFHunter();
  results.search = await hunter.searchMultiple(
    args.query,
    args.databases,
    args.maxResults,
    args.yearFrom
  );

  // Step 2: Download
  console.error(`Step 2: Downloading ${results.search.papers.length} papers...`);
  const manager = new DownloadManager({
    downloadDir: args.downloadDir ?? './papers',
    concurrency: 5
  });
  results.download = await manager.downloadBatch(results.search.papers);

  // Step 3: Extract (if requested)
  // Schema default for extractText is true; only skip when explicitly disabled
  if (args.extractText !== false) {
    console.error('Step 3: Extracting content...');
    const pipeline = new ExtractionPipeline({
      extractCitations: true
    });

    const downloaded = results.download.results
      .filter(r => r.status === 'fulfilled')
      .map(r => r.result.filepath);

    results.extraction = [];
    for (const filepath of downloaded) {
      const data = await pipeline.extractFromPDF(filepath);
      results.extraction.push(data);
    }
  }

  // Step 4: Organize
  console.error('Step 4: Organizing papers...');
  const organizer = new PaperOrganizer('./research');

  // Attach extracted data to papers
  const papersWithData = results.search.papers.map(paper => {
    // Heuristic match: assumes downloaded filenames embed a title prefix (e.g. Author-2024-Title.pdf)
    const extracted = results.extraction?.find(e =>
      e.filepath.includes(paper.title.substring(0, 20))
    );
    return { ...paper, extracted };
  });

  results.organization = await organizer.organize(
    papersWithData,
    args.organizeStrategy ?? 'topic-date'
  );

  return {
    completed: true,
    summary: {
      papersFound: results.search.papers.length,
      papersDownloaded: results.download.stats.completed,
      papersExtracted: results.extraction?.length || 0,
      categories: Object.keys(results.organization).length
    },
    details: results
  };
}
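
Each discovery module imported by PDFHunter exposes a search() method that resolves to { papers: [...] } in the standardized shape described earlier. As a reference point, a minimal arXiv version might look like the following sketch (illustrative only; it uses the public arXiv Atom API, a dependency-free regex parse, and Node 18+ for global fetch; the real module may differ):

// discovery/arxiv-discovery.js - illustrative sketch only
class ArXivDiscovery {
  async search(query, maxResults = 50) {
    const url = `http://export.arxiv.org/api/query?search_query=all:${encodeURIComponent(query)}&max_results=${maxResults}`;
    const response = await fetch(url);
    if (!response.ok) throw new Error(`arXiv API error: HTTP ${response.status}`);
    const xml = await response.text();

    // Each result is an <entry> element in the Atom feed; a light regex parse
    // keeps the sketch dependency-free (a real implementation might use an XML parser)
    const entries = xml.match(/<entry>[\s\S]*?<\/entry>/g) || [];

    const papers = entries.map(entry => {
      const field = tag => {
        const m = entry.match(new RegExp(`<${tag}[^>]*>([\\s\\S]*?)</${tag}>`));
        return m ? m[1].trim() : null;
      };
      const authors = [...entry.matchAll(/<name>([\s\S]*?)<\/name>/g)].map(m => m[1].trim());
      const published = field('published');                   // e.g. "2024-03-01T00:00:00Z"
      const pdfLink = entry.match(/<link[^>]*title="pdf"[^>]*>/);

      return {
        title: field('title')?.replace(/\s+/g, ' '),
        authors,
        year: published ? Number(published.slice(0, 4)) : null,
        abstract: field('summary'),
        doi: field('arxiv:doi'),                              // only present when arXiv records a DOI
        pdfUrl: pdfLink ? (pdfLink[0].match(/href="([^"]+)"/) || [])[1] || null : null
      };
    });

    return { papers };
  }
}

export default ArXivDiscovery;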

Tool Definitions

Each MCP tool has a specific purpose, input schema, and return format.

search_papers

Purpose: Search multiple academic databases for papers matching a query.

Input Schema:

  • query (string, required): Search query text
  • databases (array, required): Array of database names (pubmed, arxiv, jstor, ieee)
  • maxResults (number, default: 50): Maximum results per database
  • yearFrom (number, optional): Filter papers from this year onwards

Returns:

{
  "totalPapers": 47,
  "papers": [
    {
      "title": "Paper title",
      "authors": ["Author 1", "Author 2"],
      "year": 2024,
      "doi": "10.1234/xyz",
      "abstract": "Paper abstract text...",
      "pdfUrl": "https://...",
      "source": "pubmed"
    }
  ],
  "byDatabase": {
    "pubmed": { "count": 25, "status": "success" },
    "arxiv": { "count": 22, "status": "success" }
  }
}

download_papers

Purpose: Download PDFs for discovered papers in parallel.

Input Schema:

  • papers (array, required): Array of paper objects from search_papers
  • downloadDir (string, default: "./papers"): Download directory path
  • concurrency (number, default: 5): Parallel download limit

Returns:

{
  "stats": {
    "total": 47,
    "completed": 45,
    "failed": 2,
    "skipped": 0
  },
  "results": [
    {
      "status": "fulfilled",
      "result": {
        "filepath": "./papers/Author-2024-Title.pdf",
        "paper": { "title": "...", "doi": "..." }
      }
    }
  ]
}

extract_content

Purpose: Extract text and citations from downloaded PDFs.

Input Schema:

  • pdfPaths (array, required): Array of PDF file paths
  • useOCR (boolean, default: false): Enable OCR for scanned documents
  • extractCitations (boolean, default: true): Extract citation information

Returns:

{
  "totalProcessed": 45,
  "extracted": [
    {
      "filepath": "./papers/Author-2024-Title.pdf",
      "text": "Full extracted text...",
      "metadata": {
        "title": "Paper Title",
        "authors": ["Author 1"],
        "year": 2024
      },
      "citations": [
        {
          "text": "Smith et al. (2023)...",
          "authors": ["Smith"],
          "year": 2023
        }
      ]
    }
  ]
}

organize_papers

Purpose: Organize papers into directory structure by topic, date, or author.

Input Schema:

  • papers (array, required): Array of papers with metadata
  • strategy (string, default: "topic-date"): Organization strategy (topic, date, topic-date, author)
  • baseDir (string, default: "./research"): Base directory for organization

Returns:

{
  "strategy": "topic-date",
  "categories": 8,
  "organized": {
    "Machine Learning/2024": {
      "papers": [...],
      "count": 12
    },
    "Natural Language Processing/2023": {
      "papers": [...],
      "count": 8
    }
  }
}

full_research_pipeline

Purpose: Execute complete pipeline from search to organized knowledge base.

Input Schema: Combines all parameters from search, download, extract, and organize tools.

Returns: Nested results object with summary and detailed results from each stage.
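
For example (counts illustrative; the shape matches the object assembled at the end of handleFullPipeline):

{
  "completed": true,
  "summary": {
    "papersFound": 47,
    "papersDownloaded": 45,
    "papersExtracted": 45,
    "categories": 8
  },
  "details": {
    "search": { ... },
    "download": { ... },
    "extraction": [ ... ],
    "organization": { ... }
  }
}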

Security Considerations:

The full_research_pipeline tool executes a complete workflow including file system operations, network requests, and PDF processing. Keep the following in mind:

  • Validate file paths to prevent directory traversal attacks (see the sketch below).
  • Run the MCP server with the minimum permissions it needs.
  • Rate-limit requests to external databases to comply with their terms of service.
  • Validate and sanitize all user input before passing it to Playwright automation.
  • Store authentication cookies securely, with restrictive file permissions.
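
A minimal sketch of the path check (an illustrative helper, not part of the server code above; uses Node's built-in path module):

// Resolve a user-supplied path against an allowed base directory and reject
// anything that escapes it
import path from 'node:path';

export function resolveWithinBase(baseDir, userPath) {
  const base = path.resolve(baseDir);
  const resolved = path.resolve(base, userPath);
  const relative = path.relative(base, resolved);

  // A relative result starting with '..' (or an absolute one) escapes baseDir
  if (relative.startsWith('..') || path.isAbsolute(relative)) {
    throw new Error(`Path escapes allowed directory: ${userPath}`);
  }
  return resolved;
}

// Example: validate downloadDir before handing it to DownloadManager
// const safeDir = resolveWithinBase(process.cwd(), args.downloadDir ?? './papers');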


Claude Code Integration

Add the server to Claude Code's MCP configuration:

{
  "mcpServers": {
    "pdf-research": {
      "command": "node",
      "args": ["/path/to/pdf-research-assistant/server.js"]
    }
  }
}

Restart Claude Code and verify:

Available MCP tools:
- search_papers
- download_papers
- extract_content
- organize_papers
- full_research_pipeline

Use the tools in conversation:

Example: "Search PubMed and arXiv for papers about 'transformer architecture' from 2020 onwards, download the top 20, extract their text, and organize by topic-date."

Claude Code will invoke:

full_research_pipeline({
  query: "transformer architecture",
  databases: ["pubmed", "arxiv"],
  maxResults: 20,
  yearFrom: 2020,
  downloadDir: "./papers",
  organizeStrategy: "topic-date",
  extractText: true
})

The server executes all four stages and returns structured results that Claude Code can reference in subsequent analysis.


Next Steps

With the MCP server running, move to the next chapter to see real-world workflow examples: literature reviews, citation network analysis, and systematic review automation.

The server provides the foundation - the workflows show how to orchestrate the tools for maximum research efficiency.