MCP Server Integration
Complete PDF Research Assistant MCP server implementation
Now comes the complete integration: everything from discovery to extraction, unified into a single MCP server that Claude Code can orchestrate.
This chapter covers the complete server architecture, tool definitions for search, download, extract, and organize operations, database coordination, and Claude Code integration patterns.
What This Server Provides:
The PDF Research Assistant MCP server exposes five core tools to Claude Code: search_papers (multi-database discovery), download_papers (parallel PDF downloads), extract_content (text and citation extraction), organize_papers (intelligent filing), and full_research_pipeline (complete automation from query to organized knowledge base).
All tools validate input with JSON schemas, report progress where applicable, and handle errors gracefully; the output of one tool (such as the paper list from search_papers) feeds directly into the next. The server runs via stdio transport and can be invoked directly from Claude Code conversations.
Server Architecture
The MCP server coordinates multiple specialized modules into a unified interface.
Project Structure
Set up the complete server project:
pdf-research-assistant/
├── server.js # Main MCP server
├── pdf-hunter.js # Multi-database coordinator
├── download-manager.js # Parallel download engine
├── extraction-pipeline.js # Text extraction system
├── organizer.js # File organization logic
├── deduplication.js # Cross-database deduplication
├── discovery/
│ ├── pubmed-discovery.js # PubMed API integration
│ ├── arxiv-discovery.js # arXiv API integration
│ ├── jstor-discovery.js # JSTOR scraping
│ └── ieee-discovery.js # IEEE Xplore scraping
└── package.json
Install the MCP SDK:
npm install @modelcontextprotocol/sdk
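Because server.js uses ES module imports, package.json must declare "type": "module" for Node to load it. A minimal sketch (the dependency version is a placeholder, not a pinned requirement):
{
  "name": "pdf-research-assistant",
  "version": "1.0.0",
  "type": "module",
  "dependencies": {
    "@modelcontextprotocol/sdk": "^1.0.0"
  }
}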
MCP Server Core
Create the main server with tool definitions and request handlers.
The server exposes five tools to Claude Code, each with JSON schema validation for input parameters. All tools return structured JSON responses with consistent error handling.
Database Coordination
The PDF Hunter module coordinates searches across multiple databases, handles failures gracefully with Promise.allSettled, and deduplicates results using DOI and title matching.
Each database returns a standardized paper object with title, authors, year, abstract, DOI, and source URL. The coordinator merges results and removes duplicates before returning to Claude Code.
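The deduplication.js module appears in the project structure but is not listed in this chapter. A minimal sketch of the approach described above, keying first on normalized DOI and falling back to a normalized title; only the deduplicate() method is assumed from the surrounding code, the keyFor helper is illustrative:

// deduplication.js - cross-database deduplication (sketch)
class PaperDeduplicator {
  deduplicate(papers) {
    const seen = new Set();
    const unique = [];
    for (const paper of papers) {
      const key = this.keyFor(paper);
      if (!seen.has(key)) {
        seen.add(key);
        unique.push(paper);
      }
    }
    return unique;
  }

  keyFor(paper) {
    // Prefer the DOI: it is a stable identifier across databases
    if (paper.doi) return `doi:${paper.doi.trim().toLowerCase()}`;
    // Fall back to a normalized title (lowercase, alphanumerics only)
    return `title:${(paper.title || '').toLowerCase().replace(/[^a-z0-9]/g, '')}`;
  }
}

export default PaperDeduplicator;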
Complete Server Implementation
// server.js - Main MCP server
import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { ListToolsRequestSchema, CallToolRequestSchema } from '@modelcontextprotocol/sdk/types.js';
import PDFHunter from './pdf-hunter.js';
import DownloadManager from './download-manager.js';
import ExtractionPipeline from './extraction-pipeline.js';
import PaperOrganizer from './organizer.js';
class PDFResearchServer {
constructor() {
this.server = new Server(
{
name: 'pdf-research-assistant',
version: '1.0.0',
},
{
capabilities: {
tools: {},
},
}
);
this.setupHandlers();
}
setupHandlers() {
// List available tools
this.server.setRequestHandler(ListToolsRequestSchema, async () => ({
tools: [
{
name: 'search_papers',
description: 'Search multiple academic databases for papers',
inputSchema: {
type: 'object',
properties: {
query: { type: 'string', description: 'Search query' },
databases: {
type: 'array',
items: { type: 'string', enum: ['pubmed', 'arxiv', 'jstor', 'ieee'] },
description: 'Databases to search'
},
maxResults: { type: 'number', default: 50 },
yearFrom: { type: 'number', description: 'Filter from year' }
},
required: ['query', 'databases']
}
},
{
name: 'download_papers',
description: 'Download PDFs for discovered papers',
inputSchema: {
type: 'object',
properties: {
papers: { type: 'array', description: 'Papers to download' },
downloadDir: { type: 'string', default: './papers' },
concurrency: { type: 'number', default: 5 }
},
required: ['papers']
}
},
{
name: 'extract_content',
description: 'Extract text and citations from PDFs',
inputSchema: {
type: 'object',
properties: {
pdfPaths: { type: 'array', description: 'Paths to PDFs' },
useOCR: { type: 'boolean', default: false },
extractCitations: { type: 'boolean', default: true }
},
required: ['pdfPaths']
}
},
{
name: 'organize_papers',
description: 'Organize papers into directory structure',
inputSchema: {
type: 'object',
properties: {
papers: { type: 'array' },
strategy: {
type: 'string',
enum: ['topic', 'date', 'topic-date', 'author'],
default: 'topic-date'
},
baseDir: { type: 'string', default: './research' }
},
required: ['papers']
}
},
{
name: 'full_research_pipeline',
description: 'Complete pipeline: search → download → extract → organize',
inputSchema: {
type: 'object',
properties: {
query: { type: 'string' },
databases: { type: 'array' },
maxResults: { type: 'number', default: 50 },
yearFrom: { type: 'number' },
downloadDir: { type: 'string', default: './papers' },
organizeStrategy: { type: 'string', default: 'topic-date' },
extractText: { type: 'boolean', default: true }
},
required: ['query', 'databases']
}
}
]
}));
// Tool execution handler
this.server.setRequestHandler(CallToolRequestSchema, async (request) => {
const { name, arguments: args } = request.params;
try {
let result;
switch (name) {
case 'search_papers':
result = await this.handleSearchPapers(args);
break;
case 'download_papers':
result = await this.handleDownloadPapers(args);
break;
case 'extract_content':
result = await this.handleExtractContent(args);
break;
case 'organize_papers':
result = await this.handleOrganizePapers(args);
break;
case 'full_research_pipeline':
result = await this.handleFullPipeline(args);
break;
default:
throw new Error(`Unknown tool: ${name}`);
}
return {
content: [
{
type: 'text',
text: JSON.stringify(result, null, 2)
}
]
};
} catch (error) {
return {
content: [
{
type: 'text',
text: `Error: ${error.message}`
}
],
isError: true
};
}
});
}
async start() {
const transport = new StdioServerTransport();
await this.server.connect(transport);
console.error('PDF Research Assistant MCP server running');
}
}
// Start server
const server = new PDFResearchServer();
server.start().catch(console.error);
// pdf-hunter.js - Multi-database coordinator
import PubMedDiscovery from './discovery/pubmed-discovery.js';
import ArXivDiscovery from './discovery/arxiv-discovery.js';
import JSTORDiscovery from './discovery/jstor-discovery.js';
import IEEEDiscovery from './discovery/ieee-discovery.js';
import PaperDeduplicator from './deduplication.js';
class PDFHunter {
constructor(options = {}) {
this.authPaths = options.authPaths || {};
this.deduplicator = new PaperDeduplicator();
}
/**
* Search multiple databases and deduplicate results
*/
async searchMultiple(query, databases, maxResults = 50, yearFrom = null) {
const searchPromises = databases.map(db =>
this.searchDatabase(db, query, maxResults, yearFrom)
);
const results = await Promise.allSettled(searchPromises);
const allPapers = [];
const byDatabase = {};
for (let i = 0; i < results.length; i++) {
const db = databases[i];
const result = results[i];
if (result.status === 'fulfilled') {
const papers = result.value.papers.map(p => ({ ...p, source: db }));
allPapers.push(...papers);
byDatabase[db] = {
count: papers.length,
status: 'success'
};
} else {
console.error(`${db} search failed:`, result.reason);
byDatabase[db] = {
count: 0,
status: 'failed',
error: result.reason.message
};
}
}
// Deduplicate across databases
const uniquePapers = this.deduplicator.deduplicate(allPapers);
return {
papers: uniquePapers,
totalBeforeDedup: allPapers.length,
totalAfterDedup: uniquePapers.length,
byDatabase
};
}
async searchDatabase(database, query, maxResults, yearFrom) {
switch (database) {
case 'pubmed':
return await this.searchPubMed(query, maxResults, yearFrom);
case 'arxiv':
return await this.searchArXiv(query, maxResults);
case 'jstor':
return await this.searchJSTOR(query, maxResults, yearFrom);
case 'ieee':
return await this.searchIEEE(query, maxResults);
default:
throw new Error(`Unknown database: ${database}`);
}
}
async searchPubMed(query, maxResults, yearFrom) {
const discovery = new PubMedDiscovery();
return await discovery.search(query, maxResults, yearFrom);
}
async searchArXiv(query, maxResults) {
const discovery = new ArXivDiscovery();
return await discovery.search(query, maxResults);
}
async searchJSTOR(query, maxResults, yearFrom) {
const discovery = new JSTORDiscovery(this.authPaths.jstor);
return await discovery.search(query, maxResults, yearFrom);
}
async searchIEEE(query, maxResults) {
const discovery = new IEEEDiscovery();
return await discovery.search(query, maxResults);
}
}
export default PDFHunter;
// Tool implementation methods (inside PDFResearchServer class)
async handleSearchPapers(args) {
const hunter = new PDFHunter();
const results = await hunter.searchMultiple(
args.query,
args.databases,
args.maxResults,
args.yearFrom
);
return {
totalPapers: results.papers.length,
papers: results.papers,
byDatabase: results.byDatabase
};
}
async handleDownloadPapers(args) {
const manager = new DownloadManager({
downloadDir: args.downloadDir,
concurrency: args.concurrency
});
const results = await manager.downloadBatch(args.papers, (progress) => {
// Log to stderr: stdout is reserved for MCP JSON-RPC messages
console.error(`Progress: ${progress.paper.title} - ${progress.status}`);
});
return {
stats: manager.getStats(),
results: results.results
};
}
async handleExtractContent(args) {
const pipeline = new ExtractionPipeline({
useOCR: args.useOCR,
extractCitations: args.extractCitations
});
const extracted = [];
for (const pdfPath of args.pdfPaths) {
const data = await pipeline.extractFromPDF(pdfPath);
extracted.push(data);
}
return {
totalProcessed: extracted.length,
extracted
};
}
async handleOrganizePapers(args) {
const organizer = new PaperOrganizer(args.baseDir);
const organized = await organizer.organize(args.papers, args.strategy);
return {
strategy: args.strategy,
categories: Object.keys(organized).length,
organized
};
}
async handleFullPipeline(args) {
const results = {
search: null,
download: null,
extraction: null,
organization: null
};
// Step 1: Search
console.error('Step 1: Searching databases...');
const hunter = new PDFHunter();
results.search = await hunter.searchMultiple(
args.query,
args.databases,
args.maxResults,
args.yearFrom
);
// Step 2: Download
console.error(`Step 2: Downloading ${results.search.papers.length} papers...`);
const manager = new DownloadManager({
downloadDir: args.downloadDir,
concurrency: 5
});
results.download = await manager.downloadBatch(results.search.papers);
// Step 3: Extract (if requested)
if (args.extractText) {
console.error('Step 3: Extracting content...');
const pipeline = new ExtractionPipeline({
extractCitations: true
});
const downloaded = results.download.results
.filter(r => r.status === 'fulfilled')
.map(r => r.result.filepath);
results.extraction = [];
for (const filepath of downloaded) {
const data = await pipeline.extractFromPDF(filepath);
results.extraction.push(data);
}
}
// Step 4: Organize
console.error('Step 4: Organizing papers...');
const organizer = new PaperOrganizer('./research');
// Attach extracted data to papers
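// NOTE: matching extraction results by title prefix is a heuristic and can
// miss or mismatch papers with similar titles; joining on DOI-derived
// filenames would be more robust.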
const papersWithData = results.search.papers.map(paper => {
const extracted = results.extraction?.find(e =>
e.filepath.includes(paper.title.substring(0, 20))
);
return { ...paper, extracted };
});
results.organization = await organizer.organize(
papersWithData,
args.organizeStrategy
);
return {
completed: true,
summary: {
papersFound: results.search.papers.length,
papersDownloaded: results.download.stats.completed,
papersExtracted: results.extraction?.length || 0,
categories: Object.keys(results.organization).length
},
details: results
};
}
Tool Definitions
Each MCP tool has a specific purpose, input schema, and return format.
search_papers
Purpose: Search multiple academic databases for papers matching a query.
Input Schema:
- query (string, required): Search query text
- databases (array, required): Array of database names (pubmed, arxiv, jstor, ieee)
- maxResults (number, default: 50): Maximum results per database
- yearFrom (number, optional): Filter papers from this year onwards
Returns:
{
"totalPapers": 47,
"papers": [
{
"title": "Paper title",
"authors": ["Author 1", "Author 2"],
"year": 2024,
"doi": "10.1234/xyz",
"abstract": "Paper abstract text...",
"pdfUrl": "https://...",
"source": "pubmed"
}
],
"byDatabase": {
"pubmed": { "count": 25, "status": "success" },
"arxiv": { "count": 22, "status": "success" }
}
}
download_papers
Purpose: Download PDFs for discovered papers in parallel.
Input Schema:
- papers (array, required): Array of paper objects from search_papers
- downloadDir (string, default: "./papers"): Download directory path
- concurrency (number, default: 5): Parallel download limit
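The download-manager.js module is likewise referenced but not listed here. A minimal sketch of a concurrency-limited batch downloader, assuming Node 18+ for global fetch and that each paper carries a pdfUrl field (as in the search_papers return format); the filename scheme is illustrative. It returns both stats and allSettled-style results, matching how the pipeline code above consumes results.download:

// download-manager.js - parallel download engine (sketch)
import fs from 'fs/promises';
import path from 'path';

class DownloadManager {
  constructor({ downloadDir = './papers', concurrency = 5 } = {}) {
    this.downloadDir = downloadDir;
    this.concurrency = concurrency;
    this.stats = { total: 0, completed: 0, failed: 0, skipped: 0 };
  }

  async downloadBatch(papers, onProgress = () => {}) {
    await fs.mkdir(this.downloadDir, { recursive: true });
    this.stats.total = papers.length;
    const results = [];
    // Process in chunks to cap the number of concurrent connections
    for (let i = 0; i < papers.length; i += this.concurrency) {
      const chunk = papers.slice(i, i + this.concurrency);
      const settled = await Promise.allSettled(
        chunk.map(paper => this.downloadOne(paper, onProgress))
      );
      for (const s of settled) {
        if (s.status === 'fulfilled') {
          this.stats.completed++;
          results.push({ status: 'fulfilled', result: s.value });
        } else {
          this.stats.failed++;
          results.push({ status: 'rejected', reason: s.reason.message });
        }
      }
    }
    return { stats: this.stats, results };
  }

  async downloadOne(paper, onProgress) {
    const response = await fetch(paper.pdfUrl);
    if (!response.ok) throw new Error(`HTTP ${response.status} for ${paper.pdfUrl}`);
    // Derive a filesystem-safe filename from the title (illustrative scheme)
    const filename = `${(paper.title || 'untitled').slice(0, 60).replace(/[^\w-]+/g, '-')}.pdf`;
    const filepath = path.join(this.downloadDir, filename);
    await fs.writeFile(filepath, Buffer.from(await response.arrayBuffer()));
    onProgress({ paper, status: 'completed' });
    return { filepath, paper };
  }

  getStats() {
    return this.stats;
  }
}

export default DownloadManager;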
Returns:
{
"stats": {
"total": 47,
"completed": 45,
"failed": 2,
"skipped": 0
},
"results": [
{
"status": "fulfilled",
"result": {
"filepath": "./papers/Author-2024-Title.pdf",
"paper": { "title": "...", "doi": "..." }
}
}
]
}
extract_content
Purpose: Extract text and citations from downloaded PDFs.
Input Schema:
- pdfPaths (array, required): Array of PDF file paths
- useOCR (boolean, default: false): Enable OCR for scanned documents
- extractCitations (boolean, default: true): Extract citation information
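As a rough sketch of the text-layer stage, here is what extractFromPDF might look like built on the pdf-parse library. This is an assumption about the underlying library, not the actual extraction-pipeline.js, which also handles OCR and metadata; the citation regex is purely illustrative:

// extraction-pipeline.js - text extraction (simplified sketch, no OCR)
import fs from 'fs/promises';
import pdfParse from 'pdf-parse';

class ExtractionPipeline {
  constructor({ extractCitations = true } = {}) {
    this.extractCitations = extractCitations;
  }

  async extractFromPDF(filepath) {
    // pdf-parse returns { text, numpages, info, ... } for a PDF buffer
    const data = await pdfParse(await fs.readFile(filepath));
    const result = { filepath, text: data.text, pages: data.numpages };
    if (this.extractCitations) {
      // Naive pattern for inline citations like "Smith et al. (2023)"
      const matches = data.text.match(/[A-Z][a-z]+ et al\. \(\d{4}\)/g) || [];
      result.citations = matches.map(text => ({ text }));
    }
    return result;
  }
}

export default ExtractionPipeline;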
Returns:
{
"totalProcessed": 45,
"extracted": [
{
"filepath": "./papers/Author-2024-Title.pdf",
"text": "Full extracted text...",
"metadata": {
"title": "Paper Title",
"authors": ["Author 1"],
"year": 2024
},
"citations": [
{
"text": "Smith et al. (2023)...",
"authors": ["Smith"],
"year": 2023
}
]
}
]
}
organize_papers
Purpose: Organize papers into directory structure by topic, date, or author.
Input Schema:
- papers (array, required): Array of papers with metadata
- strategy (string, default: "topic-date"): Organization strategy (topic, date, topic-date, author)
- baseDir (string, default: "./research"): Base directory for organization
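For a sense of how a strategy maps to directories, a sketch of the path derivation; it assumes each paper carries a topic field, whereas the real organizer.js presumably classifies papers itself:

// organizer.js - directory path derivation (sketch)
function categoryPath(paper, strategy) {
  const topic = paper.topic || 'Uncategorized'; // real code would classify the paper
  const year = paper.year || 'unknown-year';
  const firstAuthor = (paper.authors && paper.authors[0]) || 'unknown-author';
  switch (strategy) {
    case 'topic': return topic;
    case 'date': return String(year);
    case 'author': return firstAuthor;
    case 'topic-date':
    default: return `${topic}/${year}`;
  }
}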
Returns:
{
"strategy": "topic-date",
"categories": 8,
"organized": {
"Machine Learning/2024": {
"papers": [...],
"count": 12
},
"Natural Language Processing/2023": {
"papers": [...],
"count": 8
}
}
}
full_research_pipeline
Purpose: Execute complete pipeline from search to organized knowledge base.
Input Schema: Combines all parameters from search, download, extract, and organize tools.
Returns: Nested results object with summary and detailed results from each stage.
Security Considerations:
The full_research_pipeline tool executes a complete workflow including file system operations, network requests, and PDF processing. Ensure proper validation of file paths to prevent directory traversal attacks. Always run the MCP server with minimal necessary permissions. Rate limit requests to external databases to comply with terms of service. Validate and sanitize all user input before passing to Playwright automation. Store authentication cookies securely with appropriate file permissions.
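As one concrete example of the path-validation advice, a helper sketch that rejects paths escaping a base directory; the helper name is illustrative, not part of the server above:

import path from 'path';

// Resolve a user-supplied path and refuse anything outside baseDir,
// blocking ../ traversal in tool arguments like downloadDir or baseDir.
function resolveWithinBase(baseDir, userPath) {
  const base = path.resolve(baseDir);
  const resolved = path.resolve(base, userPath);
  if (resolved !== base && !resolved.startsWith(base + path.sep)) {
    throw new Error(`Path escapes base directory: ${userPath}`);
  }
  return resolved;
}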
Claude Code Integration
Add the server to Claude Code's MCP configuration:
{
"mcpServers": {
"pdf-research": {
"command": "node",
"args": ["/path/to/pdf-research-assistant/server.js"]
}
}
}
Restart Claude Code and verify:
Available MCP tools:
- search_papers
- download_papers
- extract_content
- organize_papers
- full_research_pipeline
Use the tools in conversation:
Example: "Search PubMed and arXiv for papers about 'transformer architecture' from 2020 onwards, download the top 20, extract their text, and organize by topic-date."
Claude Code will invoke:
full_research_pipeline({
query: "transformer architecture",
databases: ["pubmed", "arxiv"],
maxResults: 20,
yearFrom: 2020,
downloadDir: "./papers",
organizeStrategy: "topic-date",
extractText: true
})
The server executes all four stages and returns structured results Claude Code can reference in subsequent analysis.
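If the tools do not appear, the server can also be exercised outside Claude Code with the MCP Inspector, which is useful for debugging tool schemas and responses (assuming a current version of the inspector package):
npx @modelcontextprotocol/inspector node /path/to/pdf-research-assistant/server.js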
Next Steps
With the MCP server running, move to the next chapter to see real-world workflow examples: literature reviews, citation network analysis, and systematic review automation.
The server provides the foundation; the workflows show how to orchestrate the tools for maximum research efficiency.