Beemo: Benchmark of Expert-edited Machine-generated Outputs
Beemo Benchmark Overview (Online Presentation)
Talk Overview
This presentation introduces Beemo, a groundbreaking benchmark that addresses a critical gap in machine-generated text (MGT) detection research. Unlike traditional benchmarks that only consider single-author scenarios, Beemo captures the reality of human-AI collaboration in text creation.
Key Contributions
1. Multi-Author Benchmark Design
- 19.6k total texts across five use cases: open-ended generation, rewriting, summarization, open QA, and closed QA
- Expert-edited content: 2,187 machine-generated texts refined by professional editors
- LLM-edited variants: 13.1k texts edited by GPT-4o and Llama-3.1-70B-Instruct using diverse prompts (a data-loading sketch follows this list)
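The composition above can be explored directly once the data is downloaded. The snippet below is a minimal sketch using the Hugging Face `datasets` library; the dataset ID `toloka/beemo` and the split and column contents are illustrative assumptions, so check the official release for the exact identifiers.

```python
# Minimal sketch of loading and inspecting the benchmark with the
# Hugging Face `datasets` library. The dataset ID and split name are
# assumptions for illustration, not confirmed identifiers.
from datasets import load_dataset

ds = load_dataset("toloka/beemo", split="train")  # assumed ID and split
print(ds)      # number of rows and column names
print(ds[0])   # one record: prompt, model output, edited output, etc.
```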
2. Comprehensive Evaluation
- 33 MGT detector configurations tested across multiple scenarios
- Zero-shot and pretrained detectors including Binoculars, DetectGPT, RADAR, and MAGE
- Novel task formulations examining detection performance on edited content (a minimal scoring sketch follows this list)
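Several of the evaluated detectors (e.g., Binoculars, DetectGPT) are zero-shot methods built on language-model likelihoods. The snippet below is a generic likelihood-based scorer to illustrate the idea, not a reimplementation of any detector from the talk; the proxy model `gpt2` and the scoring rule are assumptions made for illustration.

```python
# Minimal zero-shot MGT scoring sketch: average token log-likelihood under a
# small proxy causal LM. Higher (less negative) values mean the text is more
# predictable to the model, which likelihood-based detectors treat as a signal
# of machine generation. The model choice ("gpt2") is an illustrative assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def mgt_score(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    out = model(**enc, labels=enc["input_ids"])
    return -out.loss.item()  # negate mean NLL so larger = more machine-like

print(mgt_score("The quick brown fox jumps over the lazy dog."))
```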
3. Critical Findings
- Expert editing evades detection: detector AUROC drops by up to 22% on expert-edited content (see the AUROC sketch after this list)
- LLM-edited texts remain detectable: they are less likely than expert-edited texts to be classified as human-written
- Category-specific challenges: Detection performance varies significantly across text types
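The reported AUROC drop can be reproduced with a simple evaluation recipe: score human-written texts against original and expert-edited machine outputs with the same detector, then compare the two AUROC values. The sketch below uses made-up placeholder scores rather than results from the talk.

```python
# Sketch of the AUROC comparison behind the "expert editing evades detection"
# finding. Detector scores are hypothetical placeholders; in practice they
# would come from a detector such as the one sketched above.
from sklearn.metrics import roc_auc_score

def detector_auroc(human_scores, machine_scores):
    # Label 1 = machine-generated, 0 = human-written; higher score = more machine-like.
    labels = [0] * len(human_scores) + [1] * len(machine_scores)
    return roc_auc_score(labels, list(human_scores) + list(machine_scores))

human = [0.10, 0.20, 0.15, 0.30]               # hypothetical human-written scores
mgt_original = [0.80, 0.90, 0.85, 0.70]        # unedited machine outputs
mgt_expert_edited = [0.25, 0.45, 0.12, 0.35]   # expert-edited machine outputs

print("AUROC vs. original MGT:     ", detector_auroc(human, mgt_original))
print("AUROC vs. expert-edited MGT:", detector_auroc(human, mgt_expert_edited))
```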
Implications for AI Safety
This research reveals significant vulnerabilities in current MGT detection systems, particularly relevant for:
- Content moderation and misinformation detection
- Academic integrity in educational settings
- Publishing and journalism authenticity verification
- Legal and regulatory compliance for AI-generated content
Future Directions
Beemo opens new research avenues for developing more robust detection methods that account for the collaborative nature of human-AI text creation in real-world applications.
This work was conducted at The Pennsylvania State University in collaboration with Toloka AI, MIT Lincoln Laboratory, and University of Oslo.