The rapid proliferation of large language models (LLMs) has increased the volume of machine-generated texts (MGTs) and blurred text authorship across various domains. However, most existing MGT detection benchmarks focus on single-author scenarios, comprising only human-written and machine-generated texts. This conventional design fails to capture practical multi-author scenarios in which users refine LLM responses for natural flow, coherence, and factual correctness. We introduce Beemo (Benchmark of Expert-edited Machine-generated Outputs), which includes 6.5k texts written by humans, generated by ten instruction-finetuned LLMs, and edited by expert annotators across use cases ranging from creative writing to summarization. Beemo additionally comprises 13.1k machine-generated and LLM-edited texts, enabling diverse MGT detection evaluation across edit types. We evaluate 33 configurations of MGT detectors in different experimental setups and find that expert-edited texts largely evade MGT detection, while LLM-edited texts are unlikely to be recognized as human-written. Our work underscores the need for more sophisticated detection methods that account for the collaborative nature of human-AI text creation in real-world scenarios.
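To make the evaluation setup concrete, the sketch below shows the kind of experiment Beemo enables: scoring human-written, machine-generated, and expert-edited texts with an off-the-shelf MGT detector and measuring how well the scores separate the classes. This is a minimal illustration, not the paper's actual pipeline; the detector checkpoint, its label names, and the toy texts are assumptions chosen only for demonstration.

```python
# Minimal sketch of an MGT detection evaluation (illustrative only).
# Assumptions: the HuggingFace checkpoint below and its "Fake"/"Real" labels;
# the toy texts stand in for Beemo's human-written, machine-generated,
# and expert-edited splits.
from transformers import pipeline
from sklearn.metrics import roc_auc_score

detector = pipeline(
    "text-classification",
    model="openai-community/roberta-base-openai-detector",  # assumed off-the-shelf detector
)

# (text, label) pairs; label 1 = machine-generated (possibly edited), 0 = human-written.
texts = [
    ("The committee met on Tuesday to review the quarterly budget.", 0),  # human-written
    ("As an AI language model, I can provide a summary of the report.", 1),  # machine-generated
    ("The report, lightly revised by an expert editor, covers Q3 results.", 1),  # expert-edited MGT
]

scores, labels = [], []
for text, label in texts:
    out = detector(text)[0]
    # Convert the classifier output to P(machine-generated). The label string
    # "Fake" is specific to this checkpoint and may differ for other detectors.
    p_machine = out["score"] if out["label"] == "Fake" else 1.0 - out["score"]
    scores.append(p_machine)
    labels.append(label)

print("AUROC:", roc_auc_score(labels, scores))
```

In the benchmark's framing, running such a detector separately on machine-generated, expert-edited, and LLM-edited subsets reveals how much each edit type degrades detection performance.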
This presentation introduces Beemo, a benchmark that addresses a critical gap in machine-generated text (MGT) detection research. Unlike traditional benchmarks that consider only single-author scenarios, Beemo captures the reality of human-AI collaboration in text creation.
This research reveals significant vulnerabilities in current MGT detection systems when texts are edited after generation.
Beemo opens new research avenues for developing more robust detection methods that account for the collaborative nature of human-AI text creation in real-world applications.
This work was conducted at The Pennsylvania State University in collaboration with Toloka AI, MIT Lincoln Laboratory, and the University of Oslo.