End-to-End Example: Automated Literature Review

Complete 8-step literature review walkthrough from search to final document (2 hours vs 40 hours)

Execute a complete literature review from scratch. This is the integration showcase: every component working together.

Goal: Comprehensive literature review on "transformer models in NLP"

Timeline: 2 hours (vs. 40 hours manual)

Expected Output: 3,000-word literature review with 50+ citations

Step 1: Generate Targeted Search Queries (10 minutes)

Use Gemini's brainstorming capability to create diverse search queries covering different aspects of the topic.

// step1-generate-queries.ts

async function generateSearchQueries(topic: string) {
  const queries = await mcp.invoke('brainstorm', {
    prompt: `Generate 10 diverse academic search queries for: ${topic}

             Vary by:
             - Specificity (broad to narrow)
             - Sub-domains (e.g., efficiency, applications, theory)
             - Methodological approaches
             - Time periods (foundational vs. recent)

             Return queries optimized for academic database search.`,
    domain: 'research',
    methodology: 'divergent',
    ideaCount: 10,
    includeAnalysis: true
  });

  console.log('Generated search queries:');
  queries.ideas.forEach((q, i) => {
    console.log(`${i + 1}. ${q.text}`);
    console.log(`   Rationale: ${q.rationale}\n`);
  });

  return queries.ideas.map(q => q.text);
}

const searchQueries = await generateSearchQueries('transformer models in NLP');

Expected Output:

1. "transformer models natural language processing"
   Rationale: Broad foundational query
2. "attention mechanisms BERT GPT architecture"
   Rationale: Specific model architectures
3. "efficient transformers linear attention"
   Rationale: Focus on efficiency improvements
4. "transformer applications machine translation summarization"
   Rationale: Application-specific research
5. "transformer training optimization techniques"
   Rationale: Methodological focus
... (5 more)

Step 2: Search Multiple Databases (30 minutes)

Execute parallel searches across arXiv, PubMed, IEEE, ACM, and JSTOR with rate limiting.

// step2-multi-database-search.ts

async function searchAllDatabases(queries: string[]) {
  const databases = [
    { name: 'arxiv', category: 'cs.CL' },
    { name: 'pubmed', category: null },
    { name: 'ieee', category: 'Computer Science' },
    { name: 'acm', category: 'Computing methodologies' },
    { name: 'jstor', category: 'Computer Science' }
  ];

  const allResults = [];

  for (const query of queries) {
    console.log(`\nSearching for: "${query}"`);

    // Parallel search across databases
    const dbResults = await Promise.all(
      databases.map(async (db) => {
        try {
          const results = await mcp.invoke('search_database', {
            database: db.name,
            query: query,
            category: db.category,
            year_from: 2017, // Transformers paper published 2017
            max_results: 50
          });

          console.log(`  ${db.name}: ${results.length} papers found`);
          return results;
        } catch (error) {
          console.error(`  ${db.name}: Search failed - ${error.message}`);
          return [];
        }
      })
    );

    allResults.push(...dbResults.flat());

    // Rate limiting: wait between queries
    await delay(2000);
  }

  console.log(`\nTotal papers found: ${allResults.length}`);
  return allResults;
}

const rawResults = await searchAllDatabases(searchQueries);

Expected Output:

Searching for: "transformer models natural language processing"
  arxiv: 127 papers found
  pubmed: 34 papers found
  ieee: 89 papers found
  acm: 56 papers found
  jstor: 23 papers found

Searching for: "attention mechanisms BERT GPT architecture"
  arxiv: 98 papers found
  pubmed: 12 papers found
  ieee: 67 papers found
  acm: 45 papers found
  jstor: 8 papers found

... (8 more queries)

Total papers found: 1,247 papers
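The `delay` helper used for rate limiting in step 2 (and again in step 4) is never defined in the walkthrough; a minimal sketch:

```typescript
// Minimal sketch of the delay helper assumed by steps 2 and 4:
// resolves after the given number of milliseconds.
function delay(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```

With this in scope, `await delay(2000)` pauses roughly two seconds between queries.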

Step 3: Deduplication and Filtering (15 minutes)

Remove duplicates using DOI matching and fuzzy title matching, then filter by relevance using Gemini.

// step3-deduplicate-filter.ts

async function deduplicateAndFilter(papers: Paper[]) {
  // Step 3a: Remove exact duplicates by DOI
  const uniqueByDOI = new Map();
  papers.forEach(p => {
    if (p.doi && !uniqueByDOI.has(p.doi)) {
      uniqueByDOI.set(p.doi, p);
    }
  });

  console.log(`After DOI deduplication: ${uniqueByDOI.size} papers`);

  // Step 3b: Fuzzy matching for papers without DOI
  const withoutDOI = papers.filter(p => !p.doi);
  const fuzzyDuplicates = await detectFuzzyDuplicates(withoutDOI);

  // Step 3c: Merge deduplicated sets
  const uniquePapers = [
    ...Array.from(uniqueByDOI.values()),
    ...withoutDOI.filter(p => !fuzzyDuplicates.has(p.title))
  ];

  console.log(`After fuzzy deduplication: ${uniquePapers.length} papers`);

  // Step 3d: AI-powered relevance filtering
  console.log('Filtering by relevance with Gemini...');

  const filtered = await mcp.invoke('ask-gemini', {
    prompt: `Filter these papers by relevance to "transformer models in NLP".

             Criteria:
             - Focus on transformer architecture, attention mechanisms, or major models (BERT, GPT, T5, etc.)
             - Exclude: purely medical NLP without model innovation, non-transformer models
             - Include: theoretical papers, efficiency improvements, novel applications

             Return papers with relevance score >= 7/10.

             Papers (titles and abstracts):
             ${JSON.stringify(uniquePapers.map(p => ({
               id: p.id,
               title: p.title,
               abstract: p.abstract,
               year: p.year
             })))}`,
    model: 'gemini-2.5-pro'
  });

  console.log(`After relevance filtering: ${filtered.papers.length} papers`);

  return filtered.papers;
}

const filteredPapers = await deduplicateAndFilter(rawResults);

Expected Output:

After DOI deduplication: 892 papers
After fuzzy deduplication: 867 papers
Filtering by relevance with Gemini...
After relevance filtering: 287 papers (relevance >= 7/10)
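The `detectFuzzyDuplicates` helper called in step 3b is assumed to exist. A hypothetical sketch that treats two titles as duplicates when they normalize to the same string (a production version might use edit distance instead); it is synchronous, so the `await` in step 3 is harmless:

```typescript
// Hypothetical sketch of detectFuzzyDuplicates: flags later occurrences of
// titles that normalize (lowercase, punctuation and whitespace stripped)
// to a title already seen. Returns the set of duplicate titles.
interface TitledPaper { title: string }

function normalizeTitle(title: string): string {
  return title.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim();
}

function detectFuzzyDuplicates(papers: TitledPaper[]): Set<string> {
  const seen = new Set<string>();        // normalized titles already seen
  const duplicates = new Set<string>();  // original titles of later copies
  for (const p of papers) {
    const key = normalizeTitle(p.title);
    if (seen.has(key)) {
      duplicates.add(p.title); // keep the first occurrence, flag the rest
    } else {
      seen.add(key);
    }
  }
  return duplicates;
}
```

The `!fuzzyDuplicates.has(p.title)` filter in step 3c then keeps the first occurrence of each title and drops the flagged copies.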

Step 4: Download PDFs (40 minutes)

Download the top 100 papers, sorted by relevance and recency, using concurrent downloads with rate limiting.

// step4-download-pdfs.ts

async function downloadRelevantPDFs(papers: Paper[]) {
  const downloadDir = 'projects/transformer-nlp-review/papers';
  await fs.mkdir(downloadDir, { recursive: true });

  // Sort by relevance and year (prioritize recent, highly relevant)
  // Copy before sorting so the caller's array is not mutated in place
  const sorted = [...papers].sort((a, b) => {
    if (b.relevance_score !== a.relevance_score) {
      return b.relevance_score - a.relevance_score;
    }
    return b.year - a.year;
  });

  // Download top 100 papers (or fewer if unavailable)
  const targetCount = Math.min(100, sorted.length);
  console.log(`Attempting to download ${targetCount} PDFs...`);

  const downloaded = [];
  const failed = [];

  // Concurrent downloads with rate limiting
  const concurrency = 3;
  for (let i = 0; i < sorted.length && downloaded.length < targetCount; i += concurrency) {
    const batch = sorted.slice(i, i + concurrency);

    const results = await Promise.allSettled(
      batch.map(async (paper) => {
        try {
          const filename = sanitizeFilename(
            `${paper.authors[0]?.split(' ').pop()}_${paper.year}_${paper.title}`
          ).slice(0, 150) + '.pdf'; // Limit length, keep the extension

          await mcp.invoke('download_pdf', {
            url: paper.pdf_url,
            save_path: `${downloadDir}/${filename}`,
            timeout: 120000
          });

          return { paper, filename, status: 'success' };
        } catch (error) {
          return { paper, error: error.message, status: 'failed' };
        }
      })
    );

    results.forEach(result => {
      if (result.status === 'fulfilled' && result.value.status === 'success') {
        downloaded.push(result.value);
        console.log(`✓ Downloaded: ${result.value.filename}`);
      } else if (result.status === 'fulfilled') {
        failed.push(result.value);
        console.log(`✗ Failed: ${result.value.paper?.title || 'Unknown'}`);
      } else {
        // Promise rejected outside the per-paper try/catch
        failed.push({ error: result.reason?.message || 'Unknown error' });
        console.log(`✗ Failed: ${result.reason?.message || 'Unknown error'}`);
      }
    });

    // Rate limiting delay
    if (i + concurrency < sorted.length) {
      await delay(5000); // 5 seconds between batches
    }
  }

  console.log(`\nDownload summary:`);
  console.log(`  Success: ${downloaded.length} PDFs`);
  console.log(`  Failed: ${failed.length} papers`);

  // Save metadata
  await fs.writeFile(
    `${downloadDir}/download-metadata.json`,
    JSON.stringify({ downloaded, failed }, null, 2)
  );

  return downloaded;
}

const downloadedPDFs = await downloadRelevantPDFs(filteredPapers);

Expected Output:

Attempting to download 100 PDFs...
✓ Downloaded: Vaswani_2017_Attention_Is_All_You_Need.pdf
✓ Downloaded: Devlin_2019_BERT_Pretraining_Deep_Bidirectional.pdf
✓ Downloaded: Brown_2020_Language_Models_Few_Shot_Learners.pdf
✗ Failed: Some_Paper_Title (403 Forbidden)
✓ Downloaded: Raffel_2020_Exploring_Limits_Transfer_Learning.pdf
... (95 more)

Download summary:
  Success: 78 PDFs
  Failed: 22 papers
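The `sanitizeFilename` helper used in step 4 is also assumed; a hypothetical sketch that produces names in the same style as the expected output above:

```typescript
// Hypothetical sanitizeFilename helper: replaces characters unsafe in file
// names with underscores, collapses runs, and trims the ends.
function sanitizeFilename(name: string): string {
  return name
    .replace(/[^a-zA-Z0-9._-]+/g, '_') // replace unsafe characters
    .replace(/_+/g, '_')               // collapse runs of underscores
    .replace(/^_|_$/g, '');            // trim leading/trailing underscores
}
```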

Step 5: Extract Full-Text and Citations (20 minutes)

Extract full text, citations, sections, figures, and tables from all downloaded PDFs.

// step5-extract-content.ts

async function extractContentFromPDFs(downloads: DownloadedPaper[]) {
  console.log(`Extracting content from ${downloads.length} PDFs...`);

  const extracted = await Promise.all(
    downloads.map(async (download, index) => {
      try {
        console.log(`[${index + 1}/${downloads.length}] Extracting: ${download.filename}`);

        const content = await mcp.invoke('extract_pdf_content', {
          pdf_path: download.save_path,
          extract_citations: true,
          extract_sections: true,
          extract_figures: true,
          extract_tables: true
        });

        return {
          paper: download.paper,
          filename: download.filename,
          ...content,
          extraction_status: 'success'
        };
      } catch (error) {
        console.error(`  Failed to extract: ${error.message}`);
        return {
          paper: download.paper,
          filename: download.filename,
          extraction_status: 'failed',
          error: error.message
        };
      }
    })
  );

  const successful = extracted.filter(e => e.extraction_status === 'success');
  console.log(`\nExtraction complete: ${successful.length}/${downloads.length} successful`);

  // Save extracted content
  await fs.writeFile(
    'projects/transformer-nlp-review/extracted-content.json',
    JSON.stringify(extracted, null, 2)
  );

  return successful;
}

const extractedContent = await extractContentFromPDFs(downloadedPDFs);

Expected Output:

Extracting content from 78 PDFs...
[1/78] Extracting: Vaswani_2017_Attention_Is_All_You_Need.pdf
  ✓ Full text: 41,237 characters
  ✓ Citations: 37 references
  ✓ Sections: Abstract, Introduction, Background, Model Architecture, ...
  ✓ Figures: 4 figures extracted
[2/78] Extracting: Devlin_2019_BERT_Pretraining_Deep_Bidirectional.pdf
  ✓ Full text: 38,914 characters
  ✓ Citations: 42 references
  ...

Extraction complete: 76/78 successful
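Step 5 launches all extractions at once with `Promise.all`. If the extraction tool rate-limits or is memory-hungry, a small concurrency limiter (a sketch, not part of the original workflow) can cap in-flight work while preserving result order:

```typescript
// Sketch of a concurrency limiter: runs `worker` over `items` with at most
// `limit` tasks in flight; results come back in input order.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor: each runner claims the next unprocessed index
  async function run(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, run);
  await Promise.all(runners);
  return results;
}
```

In step 5, `Promise.all(downloads.map(...))` could then become `mapWithLimit(downloads, 5, ...)` with the same result shape.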

Step 6: Thematic Analysis and Categorization (15 minutes)

Use Gemini to identify major research themes, methodological patterns, key innovations, and research gaps.

// step6-thematic-analysis.ts

async function performThematicAnalysis(papers: ExtractedPaper[]) {
  console.log('Performing thematic analysis with Gemini...');

  // Prepare paper summaries for analysis
  const paperSummaries = papers.map(p => ({
    title: p.paper.title,
    authors: p.paper.authors,
    year: p.paper.year,
    abstract: p.paper.abstract,
    key_sections: p.sections?.map(s => ({ title: s.title, preview: s.content.slice(0, 500) })),
    methodology: p.methodology_summary,
    findings: p.key_findings
  }));

  const thematicAnalysis = await mcp.invoke('ask-gemini', {
    prompt: `Conduct a comprehensive thematic analysis of these ${papers.length} papers on transformer models in NLP.

             Your analysis should identify:

             1. **Major Research Themes** (5-7 themes)
                - Group papers by research focus
                - Identify theme evolution over time

             2. **Methodological Approaches**
                - Categorize by experimental design
                - Identify common datasets and benchmarks

             3. **Key Innovations**
                - Groundbreaking papers that shifted the field
                - Novel techniques and their impact

             4. **Research Gaps**
                - Under-explored areas
                - Methodological limitations
                - Future research directions

             5. **Performance Trends**
                - How metrics evolved (accuracy, efficiency, etc.)
                - Current state-of-the-art

             6. **Application Domains**
                - Where transformers have been successfully applied
                - Domain-specific innovations

             Papers:
             ${JSON.stringify(paperSummaries, null, 2)}

             Provide detailed analysis with specific paper citations for each point.`,
    model: 'gemini-2.5-pro'
  });

  // Save analysis
  await fs.writeFile(
    'projects/transformer-nlp-review/thematic-analysis.json',
    JSON.stringify(thematicAnalysis, null, 2)
  );

  console.log('Thematic analysis complete. Themes identified:');
  thematicAnalysis.themes.forEach((theme, i) => {
    console.log(`${i + 1}. ${theme.name} (${theme.paper_count} papers)`);
  });

  return thematicAnalysis;
}

const themes = await performThematicAnalysis(extractedContent);

Expected Output:

Performing thematic analysis with Gemini...
Thematic analysis complete. Themes identified:
1. Foundational Transformer Architectures (12 papers)
2. Pre-training and Transfer Learning (18 papers)
3. Attention Mechanism Efficiency (15 papers)
4. Model Scaling and Large Language Models (14 papers)
5. Domain-Specific Applications (11 papers)
6. Multilingual and Cross-lingual Models (9 papers)
7. Interpretability and Analysis (7 papers)

Step 7: Synthesize Findings with Citations (20 minutes)

Generate comprehensive literature review synthesis with proper academic citations.

// step7-synthesize-findings.ts

async function synthesizeFindings(papers: ExtractedPaper[], themes: ThematicAnalysis) {
  console.log('Synthesizing literature review with Gemini...');

  const synthesis = await mcp.invoke('ask-gemini', {
    prompt: `Write a comprehensive literature review synthesis on transformer models in NLP.

             Use this structure:

             1. **Introduction** (300 words)
                - Brief history of transformers in NLP
                - Significance of the architecture
                - Scope of this review

             2. **Thematic Analysis** (1,500 words)
                For each theme from the analysis:
                - Overview of the theme
                - Key papers and their contributions (cite specifically)
                - Evolution of approaches within the theme
                - Current state and trends

             3. **Cross-Cutting Findings** (500 words)
                - Common methodological patterns
                - Convergent findings across themes
                - Contradictions and debates

             4. **Research Gaps and Future Directions** (400 words)
                - Under-explored areas
                - Methodological limitations
                - Promising future directions

             5. **Conclusion** (300 words)
                - Summary of key insights
                - Impact on the field
                - Outlook

             CITATION REQUIREMENTS:
             - Use in-text citations in format: (Author, Year)
             - Cite specific papers from the provided set
             - Ensure every major claim is supported by citation
             - Include paper titles when first introducing key works

             Available papers with full details:
             ${JSON.stringify(papers.map(p => ({
               title: p.paper.title,
               authors: p.paper.authors,
               year: p.paper.year,
               abstract: p.paper.abstract,
               key_findings: p.key_findings,
               methodology: p.methodology_summary
             })), null, 2)}

             Themes and groupings:
             ${JSON.stringify(themes, null, 2)}

             Write in academic style, suitable for a journal literature review.`,
    model: 'gemini-2.5-pro'
  });

  return synthesis.review_text;
}

const reviewText = await synthesizeFindings(extractedContent, themes);

Step 8: Generate Literature Review Draft (15 minutes)

Compile final document with bibliography, appendices, and metadata.

// step8-generate-final-review.ts

async function generateFinalReview(synthesis: string, papers: ExtractedPaper[]) {
  console.log('Generating final literature review document...');

  // Extract all cited papers from the synthesis
  const citedPapers = extractCitations(synthesis, papers);

  // Generate bibliography
  const bibliography = await mcp.invoke('generate_bibliography', {
    papers: citedPapers,
    style: 'apa',
    sort_by: 'author'
  });

  // Compile final document
  const finalReview = `
# Transformer Models in Natural Language Processing: A Comprehensive Literature Review

**Generated**: ${new Date().toLocaleDateString()}
**Papers Reviewed**: ${papers.length}
**Databases Searched**: arXiv, PubMed, IEEE Xplore, ACM Digital Library, JSTOR
**Time Period**: 2017-2025

---

${synthesis}

---

## References

${bibliography}

---

## Appendix A: Thematic Distribution

${generateThematicChart(papers)}

## Appendix B: Chronological Distribution

${generateYearChart(papers)}

## Appendix C: Database Sources

${generateSourceChart(papers)}
`;

  // Save final review
  const outputPath = 'projects/transformer-nlp-review/literature-review.md';
  await fs.writeFile(outputPath, finalReview);

  console.log(`\n✓ Literature review generated: ${outputPath}`);
  console.log(`  Word count: ${countWords(finalReview)} words`);
  console.log(`  Citations: ${citedPapers.length} papers`);

  return outputPath;
}

const finalPath = await generateFinalReview(reviewText, extractedContent);

Expected Output:

Generating final literature review document...

✓ Literature review generated: projects/transformer-nlp-review/literature-review.md
  Word count: 3,247 words
  Citations: 52 papers

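Two helpers referenced in step 8, `extractCitations` and `countWords`, are assumed to exist elsewhere. Hypothetical sketches, matching the `(Author, Year)` citation format requested in step 7 and the nested `paper` shape of the extracted records:

```typescript
// Hypothetical extractCitations helper: keeps papers whose first author's
// surname and year appear as an in-text citation, e.g. "(Vaswani, 2017)"
// or "(Vaswani et al., 2017)". Assumes surnames contain no regex
// metacharacters; escape them for a production version.
interface ExtractedPaperLike {
  paper: { authors: string[]; year: number };
}

function extractCitations<T extends ExtractedPaperLike>(text: string, papers: T[]): T[] {
  return papers.filter((p) => {
    const surname = p.paper.authors[0]?.split(' ').pop();
    if (!surname) return false;
    const pattern = new RegExp(`\\(${surname}(?: et al\\.)?,\\s*${p.paper.year}\\)`);
    return pattern.test(text);
  });
}

// Hypothetical countWords helper: whitespace-delimited token count.
function countWords(text: string): number {
  return text.split(/\s+/).filter(Boolean).length;
}
```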

Complete Workflow Summary

Total Time: 2 hours 5 minutes

Breakdown:

  • Step 1: Generate search queries (10 min)
  • Step 2: Search databases (30 min)
  • Step 3: Deduplicate and filter (15 min)
  • Step 4: Download PDFs (40 min)
  • Step 5: Extract content (20 min)
  • Step 6: Thematic analysis (15 min)
  • Step 7: Synthesize findings (20 min)
  • Step 8: Generate final review (15 min)

Results:

  • Papers searched: 1,247 across 5 databases
  • Papers filtered: 287 highly relevant
  • PDFs downloaded: 78 full-text papers
  • Content extracted: 76 papers successfully processed
  • Final output: 3,247-word literature review with 52 citations
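The appendix helpers called in step 8 (`generateThematicChart`, `generateYearChart`, `generateSourceChart`) are likewise assumed. As one hypothetical example, `generateYearChart` could render a simple text histogram of papers per publication year:

```typescript
// Hypothetical sketch of generateYearChart: ASCII bar chart of papers per
// publication year, sorted chronologically. Assumes the nested paper.year
// shape used by the extracted records in step 8.
interface YearedPaper { paper: { year: number } }

function generateYearChart(papers: YearedPaper[]): string {
  const counts = new Map<number, number>();
  for (const p of papers) {
    counts.set(p.paper.year, (counts.get(p.paper.year) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .sort(([a], [b]) => a - b)          // chronological order
    .map(([year, n]) => `${year}: ${'█'.repeat(n)} (${n})`)
    .join('\n');
}
```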

Productivity Comparison

Manual approach requires 40+ hours over 2-3 weeks. Automated approach completed in 2 hours and 5 minutes with superior quality including more papers reviewed, better organization, and complete reproducibility. This represents a 19x productivity gain.