Beemo - Benchmark of Expert-edited Machine-generated Outputs
Beemo Benchmark Overview (Online Presentation)
Talk Overview
This presentation introduces Beemo, a groundbreaking benchmark that addresses a critical gap in machine-generated text (MGT) detection research. Unlike traditional benchmarks that only consider single-author scenarios, Beemo captures the reality of human-AI collaboration in text creation.
Key Contributions
1. Multi-Author Benchmark Design
- 19.6k total texts across five use cases: open-ended generation, rewriting, summarization, open QA, and closed QA
- Expert-edited content: 2,187 machine-generated texts refined by professional editors
- LLM-edited variants: 13.1k texts edited by GPT-4o and Llama3.1-70B-Instruct using diverse prompts
2. Comprehensive Evaluation
- 33 MGT detector configurations tested across multiple scenarios
- Zero-shot and pretrained detectors including Binoculars, DetectGPT, RADAR, and MAGE
- Novel task formulations examining detection performance on edited content
3. Critical Findings
- Expert editing evades detection: AUROC scores drop by up to 22% for edited content
- LLM-edited texts remain detectable: Less likely to be classified as human-written
- Category-specific challenges: Detection performance varies significantly across text types
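The AUROC comparison behind the first finding can be sketched in a few lines. This is a minimal illustration only, not Beemo's actual evaluation code: the detector scores below are invented for demonstration, and AUROC is computed directly as the probability that a random machine-generated text outranks a random human text (ties count half).

```python
def auroc(scores_pos, scores_neg):
    """AUROC = P(random positive outranks random negative), ties counted as 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical detector scores (higher = "more likely machine-generated").
machine_raw    = [0.9, 0.8, 0.85, 0.7]   # unedited MGT
machine_edited = [0.6, 0.4, 0.55, 0.5]   # expert-edited MGT
human          = [0.3, 0.45, 0.2, 0.5]   # human-written

print(f"AUROC, raw MGT vs. human:    {auroc(machine_raw, human):.2f}")
print(f"AUROC, edited MGT vs. human: {auroc(machine_edited, human):.2f}")
# Expert edits pull machine scores toward the human range, shrinking separability.
```

In a real evaluation the scores would come from a detector such as Binoculars or RADAR, and the drop between the two AUROC values quantifies how much expert editing degrades detection.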
Implications for AI Safety
This research reveals significant vulnerabilities in current MGT detection systems, particularly relevant for:
- Content moderation and misinformation detection
- Academic integrity in educational settings
- Publishing and journalism authenticity verification
- Legal and regulatory compliance for AI-generated content
Future Directions
Beemo opens new research avenues in developing more robust detection methods that account for the collaborative nature of human-AI text creation in real-world applications.
This work was conducted at The Pennsylvania State University in collaboration with Toloka AI, MIT Lincoln Laboratory, and University of Oslo.

I am a PhD candidate in Informatics in the College of IST at Penn State, where I conduct research in the PIKE Research Lab under the guidance of Dr. Dongwon Lee. I specialize in AI/ML research on information integrity and safe, ethical AI, including combating harmful content across multiple languages and modalities. My research spans low-resource multilingual NLP, generative AI, and adversarial machine learning, with work extending across 79 languages. I have published 12 papers with 260+ citations in premier venues including ACL, EMNLP, IEEE, and NAACL.
My doctoral research focuses on bridging the digital language divide through transfer learning, classification (NLU), generation (NLG), adversarial attacks, and end-to-end AI pipelines that use RAG and agentic AI workflows to combat multilingual threats. Drawing on my Grenadian background and knowledge of local Creole languages, I bring a global perspective to AI challenges, working to democratize state-of-the-art AI capabilities for underserved linguistic communities worldwide. My mission is to build robust multilingual, multimodal systems, mitigate evolving security vulnerabilities, and broaden access to human language technology.
As an NSF LinDiv Fellow, I conduct transdisciplinary research advancing human-AI language interaction for social good. I actively mentor 5+ research interns and teach Applied Generative AI courses. Through industry experience at Lawrence Livermore National Lab, Interaction LLC, and Coalfire, I bridge academic research with practical applications in combating evolving security threats and enhancing global AI accessibility. I see multilingual advances and interdisciplinary collaboration as a competitive advantage, not a communication challenge. Beyond research, I stay active through dance, fitness, martial arts, and community service.