Beemo: Benchmark of Expert-edited Machine-generated Outputs

Beemo Benchmark Overview
Abstract
The rapid proliferation of large language models (LLMs) has increased the volume of machine-generated texts and blurred text authorship across various domains. However, most existing machine-generated text (MGT) benchmarks focus on single-author scenarios, comprising only human-written and machine-generated texts. This conventional design fails to capture practical multi-author scenarios where users refine LLM responses for natural flow, coherence, and factual correctness. We introduce Beemo (Benchmark of Expert-edited Machine-generated Outputs), which includes 6.5k texts written by humans, generated by ten instruction-finetuned LLMs, and edited by expert annotators for various use cases ranging from creative writing to summarization. Beemo additionally comprises 13.1k machine-generated and LLM-edited texts, enabling diverse MGT detection evaluation across various edit types. We evaluate 33 configurations of MGT detectors in different experimental setups and find that expert-based editing evades MGT detection, while LLM-edited texts are unlikely to be recognized as human-written. Our work underscores the need for more sophisticated detection methods that account for the collaborative nature of human-AI text creation in real-world scenarios.
Location

Online Presentation (University Park, Pennsylvania 16802-2440)

Note
Click on the Video link above to watch the full presentation on YouTube.

Talk Overview

This presentation introduces Beemo, a benchmark that addresses a critical gap in machine-generated text (MGT) detection research. Unlike traditional benchmarks, which consider only single-author scenarios, Beemo captures the reality of human-AI collaboration in text creation.

Key Contributions

1. Multi-Author Benchmark Design

  • 19.6k total texts across five use cases: open-ended generation, rewriting, summarization, open QA, and closed QA
  • Expert-edited content: 2,187 machine-generated texts refined by professional editors
  • LLM-edited variants: 13.1k texts edited by GPT-4o and Llama-3.1-70B-Instruct using diverse prompts (a minimal data-loading sketch follows this list)
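
As a concrete illustration, here is a minimal sketch of loading and inspecting Beemo-style data with the Hugging Face `datasets` library. The Hub identifier "toloka/beemo" and the column names below are assumptions for illustration, not a confirmed schema.

```python
# Minimal sketch: load a Beemo-style dataset and peek at its fields.
# The dataset ID "toloka/beemo" and the column names are hypothetical.
from datasets import load_dataset

ds = load_dataset("toloka/beemo", split="train")  # hypothetical ID and split
print(ds.column_names)

# Each prompt is assumed to be paired with several authorship variants.
row = ds[0]
for field in ("prompt", "human_text", "model_text", "expert_edited_text"):
    if field in row:
        print(f"{field}: {str(row[field])[:80]}...")
```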

2. Comprehensive Evaluation

  • 33 MGT detector configurations tested across multiple scenarios
  • Zero-shot and pretrained detectors including Binoculars, DetectGPT, RADAR, and MAGE (a simplified zero-shot baseline is sketched after this list)
  • Novel task formulations examining detection performance on edited content
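
To make the zero-shot setup concrete, below is a minimal sketch of a likelihood-based detector scored with AUROC. This is a simplified baseline in the spirit of the zero-shot detectors named above, not an implementation of Binoculars, DetectGPT, RADAR, or MAGE; the scoring model (GPT-2) and the toy texts are assumptions.

```python
# Simplified zero-shot MGT detection baseline: score each text by its mean
# per-token log-likelihood under a small causal LM, then measure how well
# the scores separate human from machine text using AUROC.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.metrics import roc_auc_score

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def avg_log_likelihood(text: str) -> float:
    # `loss` is the mean token-level negative log-likelihood; negate it so
    # a higher score means "more predictable", i.e. more machine-like.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    return -model(**enc, labels=enc["input_ids"]).loss.item()

# Replace with Beemo texts; labels: 0 = human-written, 1 = machine-generated.
texts = [
    "I scribbled this note on the train home, half asleep.",
    "The rapid proliferation of large language models has increased the volume of machine-generated text.",
]
labels = [0, 1]
scores = [avg_log_likelihood(t) for t in texts]
print("AUROC:", roc_auc_score(labels, scores))
```

The same score-then-AUROC loop, run separately on original and expert-edited subsets, is how one would observe the kind of AUROC drop reported in the findings below.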

3. Critical Findings

  • Expert editing evades detection: AUROC scores drop by up to 22% for edited content
  • LLM-edited texts remain detectable: Less likely to be classified as human-written
  • Category-specific challenges: Detection performance varies significantly across text types

Implications for AI Safety

This research reveals significant vulnerabilities in current MGT detection systems, particularly relevant for:

  • Content moderation and misinformation detection
  • Academic integrity in educational settings
  • Publishing and journalism authenticity verification
  • Legal and regulatory compliance for AI-generated content

Future Directions

Beemo opens new research avenues in developing more robust detection methods that account for the collaborative nature of human-AI text creation in real-world applications.


This work was conducted at The Pennsylvania State University in collaboration with Toloka AI, MIT Lincoln Laboratory, and the University of Oslo.