Real-World Workflows
Practical usage patterns and complete workflow examples
Real-World Usage Patterns
The PDF intelligence system enables powerful automated workflows. This chapter demonstrates three production-ready patterns: comprehensive literature reviews, citation network analysis, and automated monitoring for new publications.
Example 1: Literature Review on Specific Topic
A complete workflow from database search to an organized knowledge base, finished with an auto-generated summary report.
Set Up the Literature Review Pipeline
Create the main review script that orchestrates database search, PDF download, text extraction, and organization.
// example-1-literature-review.js
import fs from 'fs/promises';
import PDFResearchServer from './server.js';
async function conductLiteratureReview() {
const server = new PDFResearchServer();
console.log('Starting literature review on reinforcement learning in robotics...\n');
// Execute full pipeline
const result = await server.handleFullPipeline({
query: 'reinforcement learning robotics manipulation',
databases: ['pubmed', 'arxiv', 'ieee'],
maxResults: 100,
yearFrom: 2020,
downloadDir: './papers/rl-robotics',
organizeStrategy: 'topic-date',
extractText: true
});
console.log('\n=== Literature Review Complete ===');
console.log(`Papers found: ${result.summary.papersFound}`);
console.log(`Papers downloaded: ${result.summary.papersDownloaded}`);
console.log(`Content extracted: ${result.summary.papersExtracted}`);
console.log(`Organized into ${result.summary.categories} categories`);
// Generate summary report
const report = generateSummaryReport(result);
await fs.writeFile('./papers/rl-robotics/REVIEW_SUMMARY.md', report);
console.log('\nSummary report saved to REVIEW_SUMMARY.md');
}

Implement Report Generation
Create the summary report generator that analyzes search results, citations, and paper organization.
function generateSummaryReport(result) {
const { search, extraction, organization } = result.details;
let report = `# Literature Review: Reinforcement Learning in Robotics\n\n`;
report += `Date: ${new Date().toISOString().split('T')[0]}\n\n`;
report += `## Overview\n\n`;
report += `- Total papers found: ${search.papers.length}\n`;
report += `- Papers downloaded: ${result.summary.papersDownloaded}\n`;
report += `- Databases searched: ${Object.keys(search.byDatabase).join(', ')}\n\n`;
report += `## Key Papers\n\n`;
// Top 10 most recent papers
const topPapers = search.papers
.sort((a, b) => (b.year || 0) - (a.year || 0))
.slice(0, 10);
for (const paper of topPapers) {
report += `### ${paper.title}\n`;
report += `**Authors**: ${formatAuthors(paper.authors)}\n`;
report += `**Year**: ${paper.year}\n`;
report += `**Source**: ${paper.source}\n`;
if (paper.doi) report += `**DOI**: ${paper.doi}\n`;
report += `\n`;
}
report += `## Citation Analysis\n\n`;
// Aggregate citation counts
const allCitations = extraction
?.flatMap(e => e.citations || [])
|| [];
report += `Total citations extracted: ${allCitations.length}\n\n`;
// Most cited years
const yearCounts = {};
allCitations.forEach(c => {
if (c.year) {
yearCounts[c.year] = (yearCounts[c.year] || 0) + 1;
}
});
report += `### Citations by Year\n\n`;
Object.entries(yearCounts)
.sort(([,a], [,b]) => b - a)
.slice(0, 10)
.forEach(([year, count]) => {
report += `- ${year}: ${count} citations\n`;
});
report += `\n## Topics Covered\n\n`;
Object.keys(organization).forEach(topic => {
const papers = organization[topic];
report += `- **${topic}**: ${papers.length} papers\n`;
});
return report;
}
function formatAuthors(authors) {
if (!authors || authors.length === 0) return 'Unknown';
if (!Array.isArray(authors)) return authors;
if (authors.length <= 3) {
return authors.join(', ');
} else {
return `${authors.slice(0, 3).join(', ')} et al.`;
}
}

Run the Review
Execute the literature review workflow and verify results.
// Run the review
conductLiteratureReview().catch(console.error);

Expected Output:
Starting literature review on reinforcement learning in robotics...
Step 1: Searching databases...
- PubMed: 45 papers found
- arXiv: 78 papers found
- IEEE: 62 papers found
- After deduplication: 127 unique papers
Step 2: Downloading 127 papers...
[Progress] 10/127 complete...
[Progress] 50/127 complete...
[Progress] 100/127 complete...
[Progress] 127/127 complete
- Downloaded: 119 papers
- Skipped (already exists): 3 papers
- Failed: 5 papers
Step 3: Extracting content...
[Progress] Extracting from 119 PDFs...
- Text extracted: 119 papers
- Citations found: 3,847 total
Step 4: Organizing papers...
- Created structure: papers/rl-robotics/
- Topics identified: 8 categories
- Index files generated: 8 files
=== Literature Review Complete ===
Papers found: 127
Papers downloaded: 119
Content extracted: 119
Organized into 8 categories
Summary report saved to REVIEW_SUMMARY.md

Optimization Tips: Use the yearFrom parameter to limit search scope and reduce processing time. Set maxResults conservatively (50-100) for initial reviews. Setting organizeStrategy: 'topic-date' categorizes papers by theme and publication year automatically.
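For an initial pass, it can help to tighten these knobs before committing to the full pipeline. The sketch below reuses the option names from the handleFullPipeline call above; the specific values are illustrative, not recommendations baked into the server.

// Hypothetical "quick pass" options for handleFullPipeline; the option
// names mirror the call in example-1-literature-review.js, the values
// are illustrative.
const quickPassOptions = {
  query: 'reinforcement learning robotics manipulation',
  databases: ['arxiv'],            // one database keeps the first run fast
  maxResults: 50,                  // conservative cap for an initial review
  yearFrom: 2022,                  // narrow the window to recent work
  downloadDir: './papers/rl-robotics-pilot',
  organizeStrategy: 'topic-date',  // auto-categorize by theme and year
  extractText: false               // defer extraction until the set is vetted
};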
Example 2: Following Citation Chains
Discover papers by following citations recursively to build comprehensive citation networks.
Create Citation Chain Explorer
Build the recursive citation tracking function that discovers papers through reference networks.
// example-2-citation-chain.js
import PDFHunter from './pdf-hunter.js';
import ExtractionPipeline from './extraction-pipeline.js';
import DownloadManager from './download-manager.js';
async function followCitationChain(seedPaper, depth = 2) {
const visited = new Set();
const allPapers = [];
const hunter = new PDFHunter();
const downloader = new DownloadManager({ downloadDir: './citation-chain' });
const extractor = new ExtractionPipeline();
async function explore(paper, currentDepth) {
if (currentDepth > depth) return;
const key = paper.doi || paper.title;
if (visited.has(key)) return;
visited.add(key);
console.log(`\nDepth ${currentDepth}: ${paper.title}`);
allPapers.push({ ...paper, depth: currentDepth });
// Download paper
const downloadResult = await downloader.downloadPaper(paper);
if (downloadResult.status === 'completed') {
// Extract citations
const extracted = await extractor.extractFromPDF(downloadResult.filepath);
console.log(` Found ${extracted.citations.length} citations`);
// Search for cited papers
for (const citation of extracted.citations.slice(0, 10)) {
if (!citation.title && !citation.doi) continue;
try {
const searchQuery = citation.doi || citation.title ||
`${citation.authors} ${citation.year}`;
const results = await hunter.searchMultiple(
searchQuery,
['pubmed', 'arxiv'],
1
);
if (results.papers.length > 0) {
await explore(results.papers[0], currentDepth + 1);
}
} catch (error) {
console.error(` Failed to find: ${citation.title || citation.doi}`);
}
// Rate limiting
await delay(2000);
}
}
}
await explore(seedPaper, 0);
return {
totalPapers: allPapers.length,
byDepth: groupByDepth(allPapers),
papers: allPapers
};
}
function groupByDepth(papers) {
const grouped = {};
papers.forEach(p => {
if (!grouped[p.depth]) grouped[p.depth] = [];
grouped[p.depth].push(p);
});
return grouped;
}
function delay(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}

Execute Citation Chain Discovery
Run the citation chain explorer with a seed paper and analyze the citation network.
// Example usage
const seedPaper = {
title: 'Deep Reinforcement Learning for Robotic Manipulation',
doi: '10.1109/ICRA.2018.8460692',
authors: ['Levine, S.', 'Pastor, P.', 'Krizhevsky, A.', 'Ibarz, J.', 'Quillen, D.'],
year: 2018
};
followCitationChain(seedPaper, 2)
.then(result => {
console.log('\n=== Citation Chain Complete ===');
console.log(`Total papers discovered: ${result.totalPapers}`);
console.log('By depth:');
Object.entries(result.byDepth).forEach(([depth, papers]) => {
console.log(` Depth ${depth}: ${papers.length} papers`);
});
})
.catch(console.error);

Expected Output:
Depth 0: Deep Reinforcement Learning for Robotic Manipulation
Found 42 citations
Depth 1: Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL
Found 38 citations
Depth 1: Hindsight Experience Replay
Found 31 citations
...
=== Citation Chain Complete ===
Total papers discovered: 87
By depth:
Depth 0: 1 papers
Depth 1: 10 papers
Depth 2: 76 papers

Performance Tips: Limit citation chain depth to 2-3 levels to avoid exponential growth. Use the .slice(0, 10) pattern to cap the citations followed per paper at the first ten. Insert 2-second delays between searches to respect API rate limits and avoid being blocked.
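The fixed delay(2000) above is the simplest possible throttle. If you later want to share one throttle across several call sites, a small promise-chain limiter is one option; this is a generic sketch, not part of the project's modules.

// Generic sketch: serialize async tasks and enforce a minimum gap between
// them. A reusable stand-in for the inline delay(2000) above.
function createRateLimiter(minGapMs) {
  let chain = Promise.resolve();
  return function schedule(task) {
    const run = chain.then(async () => {
      const result = await task();
      await new Promise(resolve => setTimeout(resolve, minGapMs));
      return result;
    });
    chain = run.catch(() => {}); // keep the chain alive if a task fails
    return run;
  };
}

// Usage inside the citation loop:
// const limited = createRateLimiter(2000);
// const results = await limited(() =>
//   hunter.searchMultiple(searchQuery, ['pubmed', 'arxiv'], 1));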
Example 3: Staying Current with New Publications
Monitor databases for new papers matching research interests with automated notifications.
Create Research Monitor Class
Build the monitoring system that tracks publication dates and filters new papers.
// example-3-stay-current.js
import fs from 'fs/promises';
import PDFHunter from './pdf-hunter.js';
import DownloadManager from './download-manager.js';
class ResearchMonitor {
constructor(configPath = './monitor-config.json') {
this.configPath = configPath;
this.hunter = new PDFHunter();
this.downloader = new DownloadManager({
downloadDir: './new-papers'
});
}
async loadConfig() {
const data = await fs.readFile(this.configPath, 'utf-8');
return JSON.parse(data);
}
async saveLastCheck(timestamp) {
const config = await this.loadConfig();
config.lastCheck = timestamp;
await fs.writeFile(this.configPath, JSON.stringify(config, null, 2));
}
async checkForNewPapers() {
const config = await this.loadConfig();
const currentDate = new Date();
const lastCheck = config.lastCheck ? new Date(config.lastCheck) : null;
console.log(`Checking for new papers since ${lastCheck?.toISOString() || 'never'}...\n`);
const allNewPapers = [];
for (const query of config.queries) {
console.log(`Searching: "${query.topic}"`);
const results = await this.hunter.searchMultiple(
query.topic,
query.databases,
50
);
// Filter papers published since last check
const newPapers = lastCheck
? results.papers.filter(p => {
const pubDate = this.parsePublicationDate(p);
return pubDate && pubDate > lastCheck;
})
: results.papers;
console.log(` Found ${newPapers.length} new papers`);
allNewPapers.push(...newPapers.map(p => ({ ...p, query: query.topic })));
}
if (allNewPapers.length > 0) {
// Download new papers
console.log(`\nDownloading ${allNewPapers.length} new papers...`);
const downloadResults = await this.downloader.downloadBatch(allNewPapers);
// Generate notification
await this.generateNotification(allNewPapers, downloadResults);
}
await this.saveLastCheck(currentDate.toISOString());
return {
newPapersFound: allNewPapers.length,
papers: allNewPapers
};
}
parsePublicationDate(paper) {
if (paper.published) {
return new Date(paper.published);
}
if (paper.year) {
return new Date(paper.year, 0, 1);
}
return null;
}
async generateNotification(papers, downloadResults) {
let notification = `# New Research Papers - ${new Date().toLocaleDateString()}\n\n`;
notification += `Found ${papers.length} new papers matching your interests.\n\n`;
// Group by query topic
const byTopic = {};
papers.forEach(p => {
if (!byTopic[p.query]) byTopic[p.query] = [];
byTopic[p.query].push(p);
});
for (const [topic, topicPapers] of Object.entries(byTopic)) {
notification += `## ${topic}\n\n`;
notification += `${topicPapers.length} new papers\n\n`;
for (const paper of topicPapers) {
notification += `### ${paper.title}\n`;
notification += `**Authors**: ${formatAuthors(paper.authors)}\n`;
notification += `**Year**: ${paper.year}\n`;
notification += `**Source**: ${paper.source}\n`;
if (paper.doi) notification += `**DOI**: ${paper.doi}\n`;
notification += `\n`;
}
}
notification += `\n---\n`;
notification += `**Download Summary**:\n`;
notification += `- Completed: ${downloadResults.stats.completed}\n`;
notification += `- Failed: ${downloadResults.stats.failed}\n`;
notification += `- Skipped: ${downloadResults.stats.skipped}\n`;
await fs.writeFile('./new-papers/NOTIFICATION.md', notification);
console.log('\nNotification saved to new-papers/NOTIFICATION.md');
}
}
function formatAuthors(authors) {
if (!authors || authors.length === 0) return 'Unknown';
if (!Array.isArray(authors)) return authors;
if (authors.length <= 3) {
return authors.join(', ');
} else {
return `${authors.slice(0, 3).join(', ')} et al.`;
}
}

Configure Monitoring Queries
Create the configuration file defining research topics to monitor.
{
"queries": [
{
"topic": "reinforcement learning robotics",
"databases": ["arxiv", "ieee"]
},
{
"topic": "neural architecture search",
"databases": ["arxiv", "pubmed"]
}
],
"lastCheck": null
}

Save this as monitor-config.json in the project directory.
Run Automated Monitoring
Execute the monitor to check for new publications and generate notifications.
// Run monitor
const monitor = new ResearchMonitor();
monitor.checkForNewPapers()
.then(result => {
console.log(`\n=== Monitoring Complete ===`);
console.log(`New papers found: ${result.newPapersFound}`);
})
.catch(console.error);

Expected Output:
Checking for new papers since 2025-01-01T00:00:00.000Z...
Searching: "reinforcement learning robotics"
Found 12 new papers
Searching: "neural architecture search"
Found 8 new papers
Downloading 20 new papers...
[Progress] 20/20 complete
Notification saved to new-papers/NOTIFICATION.md
=== Monitoring Complete ===
New papers found: 20

Automation Tips: Schedule the monitor to run daily or weekly; a cron entry such as 0 9 * * 1 node example-3-stay-current.js runs it every Monday at 9 AM. The persisted lastCheck timestamp prevents re-downloading papers you have already seen. For automatic alerts, pipe NOTIFICATION.md into your email service.
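If you prefer to keep scheduling inside Node rather than the system crontab, an in-process scheduler is an alternative. This sketch assumes the node-cron package (npm install node-cron) and assumes ResearchMonitor is exported from the monitor module; both are assumptions, not part of the examples above.

// Sketch: in-process scheduling with node-cron, matching the Monday
// 9 AM cron expression suggested above. Assumes ResearchMonitor is
// exported from example-3-stay-current.js.
import cron from 'node-cron';
import { ResearchMonitor } from './example-3-stay-current.js';

cron.schedule('0 9 * * 1', async () => {
  const result = await new ResearchMonitor().checkForNewPapers();
  console.log(`Scheduled check: ${result.newPapersFound} new papers`);
});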
Integration with Note-Taking Tools
These workflows integrate seamlessly with knowledge management systems:
Obsidian Integration: Save extracted text and citations as markdown files in your vault. Use frontmatter metadata for automatic linking and graph visualization (a minimal note-writer sketch follows this list).
Zotero Sync: Export paper metadata and PDFs to Zotero collections for citation management. Use BibTeX export for LaTeX manuscript integration.
Notion Database: Push paper metadata to Notion databases using the API. Create linked databases for topics, authors, and citation networks.
Roam Research: Import papers as pages with citation links. Use block references to connect ideas across papers.
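As a concrete illustration of the Obsidian pattern, here is a minimal note-writer sketch. The frontmatter keys (title, authors, year, doi, tags) are assumptions chosen to work with Obsidian's properties view, not a fixed schema; adjust them to your vault's conventions.

// Sketch: write one Obsidian-ready markdown note per paper. Frontmatter
// keys are illustrative, not a fixed schema.
import fs from 'fs/promises';
import path from 'path';

async function writeObsidianNote(vaultDir, paper, extractedText = '') {
  const safeTitle = paper.title.replace(/[\\/:*?"<>|]/g, '-');
  const frontmatter = [
    '---',
    `title: "${paper.title.replace(/"/g, "'")}"`,
    `authors: [${(paper.authors || []).map(a => `"${a}"`).join(', ')}]`,
    `year: ${paper.year ?? 'unknown'}`,
    paper.doi ? `doi: "${paper.doi}"` : null,
    'tags: [paper]',
    '---'
  ].filter(Boolean).join('\n');
  const body = `\n# ${paper.title}\n\n${extractedText}\n`;
  await fs.writeFile(path.join(vaultDir, `${safeTitle}.md`), frontmatter + body);
}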
Systematic Review Automation
Combine all three workflows for comprehensive systematic reviews:
Phase 1: Run Example 1 (literature review) with broad search terms to discover initial paper set (100-200 papers).
Phase 2: Execute Example 2 (citation chains) on top 20 most-cited papers from Phase 1 to expand the dataset through citation networks.
Phase 3: Deploy Example 3 (monitoring) to track new publications matching review criteria during the writing period.
Phase 4: Use the MCP server's search functionality to answer specific research questions across the entire corpus with semantic search.
This four-phase approach produces systematic reviews with 300-500 papers in 4-6 hours instead of 4-6 weeks using manual methods.
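Sketched as code, the phases chain together roughly as follows. conductLiteratureReview, followCitationChain, and ResearchMonitor come from Examples 1-3 (Example 1's script is assumed here to return its pipeline result rather than only logging it), while selectTopCited is a hypothetical helper you would write over the extracted citation counts.

// Sketch of the four-phase chain; selectTopCited is hypothetical.
async function runSystematicReview() {
  // Phase 1: broad discovery (Example 1, assumed to return its result)
  const review = await conductLiteratureReview();

  // Phase 2: expand through citation networks from the top 20 papers
  for (const seed of selectTopCited(review.papers, 20)) {
    await followCitationChain(seed, 2);
  }

  // Phase 3: monitor for new publications during the writing period
  await new ResearchMonitor().checkForNewPapers();

  // Phase 4: answer specific questions via the MCP server's search tools
}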
Knowledge Base Building
Transform downloaded papers into a queryable knowledge base:
Text Extraction: Extract full text from all PDFs using the extraction pipeline. Store in structured JSON format with metadata.
Embedding Generation: Generate text embeddings for semantic search using OpenAI embeddings API. Store in vector database (Pinecone, Weaviate, or Qdrant).
Citation Network: Build graph database of citation relationships using Neo4j or NetworkX. Enable network analysis and recommendation algorithms.
Semantic Search: Query the knowledge base using natural language questions. Retrieve relevant paper sections with citation tracking.
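To ground the embedding and search steps, here is a minimal sketch using the OpenAI SDK, with a plain in-memory index standing in for Pinecone, Weaviate, or Qdrant. The model name, chunking, and flat-array index are assumptions for illustration, not a production design.

// Sketch: embed paper chunks with the OpenAI SDK and answer questions by
// cosine similarity. An array stands in for a real vector database; the
// model name is illustrative.
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const index = []; // entries: { paperId, chunk, embedding }

async function embed(text) {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text
  });
  return res.data[0].embedding;
}

async function addPaper(paperId, chunks) {
  for (const chunk of chunks) {
    index.push({ paperId, chunk, embedding: await embed(chunk) });
  }
}

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function semanticSearch(question, topK = 5) {
  const q = await embed(question);
  return index
    .map(entry => ({ ...entry, score: cosine(q, entry.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}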
This workflow creates a personal research assistant that answers questions using only papers you've verified and downloaded, eliminating hallucination risks from general-purpose AI models.