Beemo - Benchmark of Expert-edited Machine-generated Outputs

Beemo Benchmark Overview

Abstract

The rapid proliferation of large language models (LLMs) has increased the volume of machine-generated texts (MGTs) and blurred text authorship across various domains. However, most existing MGT benchmarks focus on single-author scenarios, comprising only human-written and machine-generated texts. This conventional design fails to capture practical multi-author scenarios, where users refine LLM responses for natural flow, coherence, and factual correctness. We introduce Beemo (Benchmark of Expert-edited Machine-generated Outputs), which includes 6.5k texts written by humans, generated by ten instruction-finetuned LLMs, and edited by expert annotators for various use cases ranging from creative writing to summarization. Beemo additionally comprises 13.1k machine-generated and LLM-edited texts, enabling diverse MGT detection evaluation across various edit types. We evaluate 33 configurations of MGT detectors in different experimental setups and find that expert-based editing evades MGT detection, while LLM-edited texts are unlikely to be recognized as human-written. Our work underscores the need for more sophisticated detection methods that account for the collaborative nature of human-AI text creation in real-world scenarios.

Date
Jun 24, 2025, 2:00–2:45 PM
Location
Online Presentation
University Park, Pennsylvania 16802-2440

Talk Overview

This presentation introduces Beemo, a groundbreaking benchmark that addresses a critical gap in machine-generated text (MGT) detection research. Unlike traditional benchmarks that only consider single-author scenarios, Beemo captures the reality of human-AI collaboration in text creation.

Key Contributions

1. Multi-Author Benchmark Design

  • 19.6k total texts across five use cases: open-ended generation, rewriting, summarization, open QA, and closed QA
  • Expert-edited content: 2,187 machine-generated texts refined by professional editors
  • LLM-edited variants: 13.1k texts edited by GPT-4o and Llama3.1-70B-Instruct using diverse prompts (a dataset-loading sketch follows this list)
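
To explore the corpus directly, the sketch below loads it with the Hugging Face `datasets` library. The dataset identifier `toloka/beemo`, the split name, and the fields inspected are assumptions here; check the dataset card for the authoritative identifier and schema.

```python
# Minimal sketch: browsing Beemo via the Hugging Face `datasets` library.
# ASSUMPTIONS: the dataset is published as "toloka/beemo" and exposes a
# "train" split; consult the dataset card for the real id and column names.
from datasets import load_dataset

beemo = load_dataset("toloka/beemo", split="train")
print(beemo.num_rows, beemo.column_names)

# Peek at one record to see how prompts, model outputs, and the edited
# variants are stored.
print(beemo[0])
```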

2. Comprehensive Evaluation

  • 33 MGT detector configurations tested across multiple scenarios
  • Zero-shot and pretrained detectors including Binoculars, DetectGPT, RADAR, and MAGE
  • Novel task formulations examining detection performance on edited content (a minimal scoring sketch follows this list)
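
Concretely, every detector configuration reduces to the same evaluation pattern: the detector assigns each text a real-valued "machine-likeness" score, and AUROC is computed against the human/machine labels. The sketch below illustrates that loop; `toy_detector` is an invented heuristic standing in for a real detector such as Binoculars (a perplexity ratio) or RADAR (a classifier probability), not an API from any of those projects.

```python
# Minimal sketch of MGT-detection scoring with AUROC. `toy_detector` is an
# invented stand-in for a real detector; higher scores = more "machine-like".
from sklearn.metrics import roc_auc_score

def toy_detector(text: str) -> float:
    # Illustrative heuristic only: the fraction of repeated tokens.
    tokens = text.lower().split()
    return 1.0 - len(set(tokens)) / max(len(tokens), 1)

texts = [
    "The cat sat on the mat and watched the rain.",           # human-written
    "The model model generates generates fluent text text.",  # machine-generated
]
labels = [0, 1]  # 0 = human-written, 1 = machine-generated
scores = [toy_detector(t) for t in texts]
print("AUROC:", roc_auc_score(labels, scores))
```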

3. Critical Findings

  • Expert editing evades detection: AUROC scores drop by up to 22% on expert-edited content (the sketch after this list shows how such a drop is computed)
  • LLM-edited texts remain detectable: They are less likely to be classified as human-written
  • Category-specific challenges: Detection performance varies significantly across text types
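
A drop like this comes from re-pairing the same human texts against different machine-output variants. The sketch below shows that computation with invented score values: AUROC on human vs. raw machine outputs, minus AUROC on human vs. expert-edited outputs.

```python
# How an AUROC drop is measured: score one detector on human texts, raw
# machine outputs, and expert-edited outputs, then compare the two pairings.
# All score values here are invented for illustration.
from sklearn.metrics import roc_auc_score

human  = [0.10, 0.20, 0.15, 0.25]  # detector scores on human-written texts
raw    = [0.90, 0.85, 0.80, 0.95]  # scores on unedited machine outputs
edited = [0.22, 0.40, 0.18, 0.30]  # scores on expert-edited machine outputs

labels = [0] * len(human) + [1] * len(raw)
auroc_raw    = roc_auc_score(labels, human + raw)
auroc_edited = roc_auc_score(labels, human + edited)
print(f"raw: {auroc_raw:.2f}  edited: {auroc_edited:.2f}  "
      f"drop: {auroc_raw - auroc_edited:.2f}")
```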

Implications for AI Safety

This research reveals significant vulnerabilities in current MGT detection systems, particularly relevant for:

  • Content moderation and misinformation detection
  • Academic integrity in educational settings
  • Publishing and journalism authenticity verification
  • Legal and regulatory compliance for AI-generated content

Future Directions

Beemo opens new research avenues for developing more robust detection methods that account for the collaborative nature of human-AI text creation in real-world applications.


This work was conducted at The Pennsylvania State University in collaboration with Toloka AI, MIT Lincoln Laboratory, and the University of Oslo.

Jason Lucas
Ph.D. Candidate in Informatics