Beemo - Benchmark of Expert-edited Machine-generated Outputs
Beemo Benchmark Overview (Online Presentation)
Talk Overview
This presentation introduces Beemo, a groundbreaking benchmark that addresses a critical gap in machine-generated text (MGT) detection research. Unlike traditional benchmarks that only consider single-author scenarios, Beemo captures the reality of human-AI collaboration in text creation.
Key Contributions
1. Multi-Author Benchmark Design
- 19.6k total texts across five use cases: open-ended generation, rewriting, summarization, open QA, and closed QA
- Expert-edited content: 2,187 machine-generated texts refined by professional editors
- LLM-edited variants: 13.1k texts edited by GPT-4o and Llama3.1-70B-Instruct using diverse prompts
2. Comprehensive Evaluation
- 33 MGT detector configurations tested across multiple scenarios
- Zero-shot and pretrained detectors including Binoculars, DetectGPT, RADAR, and MAGE
- Novel task formulations examining detection performance on edited content
3. Critical Findings
- Expert editing evades detection: AUROC scores drop by up to 22% for edited content
- LLM-edited texts remain detectable: Less likely to be classified as human-written
- Category-specific challenges: Detection performance varies significantly across text types
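The AUROC comparison behind the first finding can be sketched in a few lines. This is a minimal illustration only, not Beemo's actual evaluation code: the detector scores below are invented for demonstration, and AUROC is computed directly as the probability that a random machine-generated text outranks a random human text (ties count half).

```python
def auroc(scores_pos, scores_neg):
    """AUROC = P(random positive outranks random negative), ties counted as 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical detector scores (higher = "more likely machine-generated").
machine_raw    = [0.9, 0.8, 0.85, 0.7]   # unedited MGT
machine_edited = [0.6, 0.4, 0.55, 0.5]   # expert-edited MGT
human          = [0.3, 0.45, 0.2, 0.5]   # human-written

print(f"AUROC, raw MGT vs. human:    {auroc(machine_raw, human):.2f}")
print(f"AUROC, edited MGT vs. human: {auroc(machine_edited, human):.2f}")
# Expert edits pull machine scores toward the human range, shrinking separability.
```

In a real evaluation the scores would come from a detector such as Binoculars or RADAR, and the drop between the two AUROC values quantifies how much expert editing degrades detection.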
Implications for AI Safety
This research reveals significant vulnerabilities in current MGT detection systems, particularly relevant for:
- Content moderation and misinformation detection
- Academic integrity in educational settings
- Publishing and journalism authenticity verification
- Legal and regulatory compliance for AI-generated content
Future Directions
Beemo opens new research avenues in developing more robust detection methods that account for the collaborative nature of human-AI text creation in real-world applications.
This work was conducted at The Pennsylvania State University in collaboration with Toloka AI, MIT Lincoln Laboratory, and University of Oslo.

I am a PhD candidate in Informatics in the College of IST at Penn State, where I conduct research in the PIKE Research Lab under the guidance of Dr. Dongwon Lee. I specialize in AI/ML research on information integrity and safe, ethical AI, including combating harmful content across multiple languages and modalities. My research spans low-resource multilingual NLP, generative AI, and adversarial machine learning, with work extending across 79 languages. I have published 12 papers with 260+ citations in premier venues including ACL, EMNLP, IEEE, and NAACL.
My doctoral research focuses on bridging the digital language divide through transfer learning, classification (NLU), generation (NLG), adversarial attacks, and end-to-end AI pipelines that use RAG and agentic AI workflows to combat multilingual threats. Drawing on my Grenadian background and knowledge of local Creole languages, I bring a global perspective to AI challenges, working to democratize state-of-the-art AI capabilities for underserved linguistic communities worldwide. My mission is to build robust multilingual, multimodal systems, mitigate evolving security vulnerabilities, and broaden access to human language technology.
As an NSF LinDiv Fellow, I conduct transdisciplinary research advancing human-AI language interaction for social good. I actively mentor 5+ research interns and teach Applied Generative AI courses. Through industry experience at Lawrence Livermore National Lab, Interaction LLC, and Coalfire, I bridge academic research with practical applications in combating evolving security threats and enhancing global AI accessibility. I see multilingual advances and interdisciplinary collaboration as a competitive advantage, not a communication challenge. Beyond research, I stay active through dance, fitness, martial arts, and community service.