Beemo - Benchmark of Expert-edited Machine-generated Outputs

Beemo Benchmark Overview

Abstract

The rapid proliferation of large language models (LLMs) has increased the volume of machine-generated texts (MGTs) and blurred text authorship across various domains. However, most existing MGT benchmarks focus on single-author scenarios, comprising only human-written and machine-generated texts. This conventional design fails to capture practical multi-author scenarios, where users refine LLM responses for natural flow, coherence, and factual correctness. We introduce Beemo (Benchmark of Expert-edited Machine-generated Outputs), which includes 6.5k texts written by humans, generated by ten instruction-finetuned LLMs, and edited by expert annotators for various use cases ranging from creative writing to summarization. Beemo additionally comprises 13.1k machine-generated and LLM-edited texts, enabling diverse MGT detection evaluation across various edit types. We evaluate 33 configurations of MGT detectors in different experimental setups and find that expert-based editing evades MGT detection, while LLM-edited texts are unlikely to be recognized as human-written. Our work underscores the need for more sophisticated detection methods that account for the collaborative nature of human-AI text creation in real-world scenarios.

Date
Jun 24, 2025, 2:00–2:45 PM
Location
Online Presentation
University Park, Pennsylvania 16802-2440

Talk Overview

This presentation introduces Beemo, a groundbreaking benchmark that addresses a critical gap in machine-generated text (MGT) detection research. Unlike traditional benchmarks that only consider single-author scenarios, Beemo captures the reality of human-AI collaboration in text creation.

Key Contributions

1. Multi-Author Benchmark Design

  • 19.6k total texts across five use cases: open-ended generation, rewriting, summarization, open QA, and closed QA
  • Expert-edited content: 2,187 machine-generated texts refined by professional editors
  • LLM-edited variants: 13.1k texts edited by GPT-4o and Llama3.1-70B-Instruct using diverse prompts (a dataset-loading sketch follows this list)
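
To explore the corpus directly, the sketch below loads it with the Hugging Face `datasets` library. The dataset identifier `toloka/beemo`, the split name, and the fields inspected are assumptions here; check the dataset card for the authoritative identifier and schema.

```python
# Minimal sketch: browsing Beemo via the Hugging Face `datasets` library.
# ASSUMPTIONS: the dataset is published as "toloka/beemo" and exposes a
# "train" split; consult the dataset card for the real id and column names.
from datasets import load_dataset

beemo = load_dataset("toloka/beemo", split="train")
print(beemo.num_rows, beemo.column_names)

# Peek at one record to see how prompts, model outputs, and the edited
# variants are stored.
print(beemo[0])
```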

2. Comprehensive Evaluation

  • 33 MGT detector configurations tested across multiple scenarios
  • Zero-shot and pretrained detectors including Binoculars, DetectGPT, RADAR, and MAGE
  • Novel task formulations examining detection performance on edited content (a minimal scoring sketch follows this list)
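
Concretely, every detector configuration reduces to the same evaluation pattern: the detector assigns each text a real-valued "machine-likeness" score, and AUROC is computed against the human/machine labels. The sketch below illustrates that loop; `toy_detector` is an invented heuristic standing in for a real detector such as Binoculars (a perplexity ratio) or RADAR (a classifier probability), not an API from any of those projects.

```python
# Minimal sketch of MGT-detection scoring with AUROC. `toy_detector` is an
# invented stand-in for a real detector; higher scores = more "machine-like".
from sklearn.metrics import roc_auc_score

def toy_detector(text: str) -> float:
    # Illustrative heuristic only: the fraction of repeated tokens.
    tokens = text.lower().split()
    return 1.0 - len(set(tokens)) / max(len(tokens), 1)

texts = [
    "The cat sat on the mat and watched the rain.",           # human-written
    "The model model generates generates fluent text text.",  # machine-generated
]
labels = [0, 1]  # 0 = human-written, 1 = machine-generated
scores = [toy_detector(t) for t in texts]
print("AUROC:", roc_auc_score(labels, scores))
```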

3. Critical Findings

  • Expert editing evades detection: AUROC scores drop by up to 22% on expert-edited content (the sketch after this list shows how such a drop is computed)
  • LLM-edited texts remain detectable: They are less likely to be classified as human-written
  • Category-specific challenges: Detection performance varies significantly across text types
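
A drop like this comes from re-pairing the same human texts against different machine-output variants. The sketch below shows that computation with invented score values: AUROC on human vs. raw machine outputs, minus AUROC on human vs. expert-edited outputs.

```python
# How an AUROC drop is measured: score one detector on human texts, raw
# machine outputs, and expert-edited outputs, then compare the two pairings.
# All score values here are invented for illustration.
from sklearn.metrics import roc_auc_score

human  = [0.10, 0.20, 0.15, 0.25]  # detector scores on human-written texts
raw    = [0.90, 0.85, 0.80, 0.95]  # scores on unedited machine outputs
edited = [0.22, 0.40, 0.18, 0.30]  # scores on expert-edited machine outputs

labels = [0] * len(human) + [1] * len(raw)
auroc_raw    = roc_auc_score(labels, human + raw)
auroc_edited = roc_auc_score(labels, human + edited)
print(f"raw: {auroc_raw:.2f}  edited: {auroc_edited:.2f}  "
      f"drop: {auroc_raw - auroc_edited:.2f}")
```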

Implications for AI Safety

This research reveals significant vulnerabilities in current MGT detection systems, particularly relevant for:

  • Content moderation and misinformation detection
  • Academic integrity in educational settings
  • Publishing and journalism authenticity verification
  • Legal and regulatory compliance for AI-generated content

Future Directions

Beemo opens new research avenues for developing more robust detection methods that account for the collaborative nature of human-AI text creation in real-world applications.


This work was conducted at The Pennsylvania State University in collaboration with Toloka AI, MIT Lincoln Laboratory, and the University of Oslo.

Jason Lucas
Ph.D. Candidate in Informatics