Fighting Fire with Fire

LLMs in Crafting and Detecting Disinformation

Jason Lucas¹, Adaku Uchendu¹,², Michiharu Yamashita¹, Jooyoung Lee¹, Shaurya Rohatgi¹, Dongwon Lee¹

¹Penn State University, ²MIT Lincoln Laboratory

EMNLP 2023 Main Conference

Problem & Motivation

**Challenge**: LLMs generate realistic but harmful disinformation

  • Persuasive texts indistinguishable from human content
  • Large-scale disinformation potential
  • Limited capability to detect LLM-generated content

**Core Question**: Can LLMs detect their own disinformation?

Research Questions

**RQ1**: Can LLMs efficiently generate disinformation via prompt engineering?

**RQ2**: How proficient are LLMs at detecting disinformation?

5 Evaluation Dimensions:

  • Human vs. LLM-generated
  • Self vs. externally-generated
  • Posts vs. articles
  • In vs. out-of-distribution
  • Zero-shot LLMs vs. fine-tuned detectors

F3 Framework

**Fighting Fire with Fire (F3)** - 5-step approach:
  1. Human data collection
  2. Prompt engineering generation
  3. PURIFY hallucination filtering
  4. Cloze-prompt detection
  5. Zero-shot evaluation
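
Step 4 (cloze-prompt detection) frames veracity prediction as a fill-in-the-blank question to the LLM. A minimal sketch, assuming an OpenAI-compatible chat client and an illustrative prompt wording (the paper's exact templates may differ):

```python
# Minimal cloze-prompt detection sketch; the prompt wording, model name, and
# API usage are illustrative assumptions, not the paper's exact configuration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def cloze_detect(text: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask the model to fill the blank with 'real' or 'fake'."""
    prompt = (
        "Read the news content below and fill in the blank with exactly one "
        "word, either 'real' or 'fake'.\n\n"
        f"Content: {text}\n\n"
        "This content is ____."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()
```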

RQ1: Bypassing Alignment

**Key Discovery**: Impersonator roles override safety measures

  • Without role: “Sorry, I can’t assist…”
  • With role (“You are an AI news curator”): ✅ Generates disinformation

**Finding**: Impersonator prompts successfully bypass GPT-3.5 protections

Generation Strategies

**Perturbation-Based** (Fake):
  • Minor: Subtle changes
  • Major: Noticeable changes
  • Critical: Significant alterations

**Paraphrase-Based** (Real):

  • Minor: Light summary
  • Major: Moderate rewording
  • Critical: Full rephrasing

Output: 43K+ synthetic samples

PURIFY Framework

**Problem**: 38% of generated samples contain hallucinated misalignments

PURIFY filters using 4 metrics:

  • Natural Language Inference
  • AlignScore
  • BERTScore
  • Semantic Distance

Result: 43,272 → 27,667 quality samples
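
A minimal sketch of this filtering step, assuming the four metrics are combined with simple per-label thresholds; the thresholds, model choices, and the AlignScore stub are illustrative assumptions rather than the authors' exact configuration:

```python
# Illustrative PURIFY-style filter: keep a generated sample only if the four
# consistency metrics agree with its intended label. Thresholds are assumptions.
from transformers import pipeline
from bert_score import score as bertscore
from sentence_transformers import SentenceTransformer, util

nli = pipeline("text-classification", model="roberta-large-mnli")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def alignment_score(source: str, generated: str) -> float:
    # Stub standing in for AlignScore (https://github.com/yuh-zha/AlignScore);
    # the real scorer needs its pretrained checkpoint, omitted in this sketch.
    return 1.0

def purify(source: str, generated: str, label: str) -> bool:
    # Reject off-topic hallucinations regardless of label: the generated text
    # must still be about the same story as its source.
    emb = embedder.encode([source, generated])
    distance = 1.0 - float(util.cos_sim(emb[0], emb[1]))      # semantic distance
    _, _, f1 = bertscore([generated], [source], lang="en")    # BERTScore F1
    align = alignment_score(source, generated)                 # AlignScore (stub)
    if distance > 0.5 or f1.item() < 0.7:
        return False

    # NLI check: paraphrases ("real") should be entailed by the source,
    # perturbations ("fake") should not be.
    entailed = nli([{"text": source, "text_pair": generated}])[0]["label"] == "ENTAILMENT"
    if label == "real":
        return entailed and align > 0.7
    return not entailed
```

Samples failing any check are dropped, which is how the 43,272 generations are reduced to the 27,667 retained above.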

RQ2: Detection Results

**Human vs. LLM Content**:
  • Human-authored: 55-66% accuracy
  • LLM-generated: 60-85% accuracy

**Self vs. External**:

  • GPT-3.5: Strong self-detection
  • LLaMA-GPT: Best external detector
  • Challenge: Minor disinformation detection

Key Findings

**Content Type**: Articles > Social media posts

**Distribution**: In-distribution > Out-of-distribution

**Model Type**: Fine-tuned > GPT-3.5 > Domain-specific

**Critical**: Subtle disinformation challenges all detectors

Technical Contributions

  1. Novel prompting for disinformation generation
  2. PURIFY hallucination filtering framework
  3. Cloze-prompt detection strategies
  4. Comprehensive SOTA benchmark
  5. F3 dataset for the research community

Dataset & Evaluation

**Models**: GPT-3.5, LLaMA-2, PaLM-2, Dolly-2

**Data**: CoAID, FakeNewsNet, F3 (27,667 samples)

**Languages**: 11 languages, Pre/Post-GPT splits

**Metrics**: Macro-F1 across human/AI datasets
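
As a tiny worked example of the metric, macro-F1 averages the per-class F1 scores so that the real and fake classes count equally; the labels (1 = fake, 0 = real) and split names below are illustrative:

```python
# Toy macro-F1 computation per split; labels and split names are illustrative.
from sklearn.metrics import f1_score

splits = {
    "human-authored": {"y_true": [1, 0, 1, 0], "y_pred": [1, 0, 0, 0]},
    "llm-generated":  {"y_true": [1, 1, 0, 0], "y_pred": [1, 1, 0, 1]},
}

for name, d in splits.items():
    macro_f1 = f1_score(d["y_true"], d["y_pred"], average="macro")
    print(f"{name}: macro-F1 = {macro_f1:.3f}")
```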

Impact & Implications

**Dual-Use Reality**: LLMs both create and detect disinformation

**Detection Promise**: Zero-shot capabilities show potential

**Security Concern**: Easy alignment bypass requires safeguards

**Research Direction**: Focus on subtle disinformation detection

“Fighting Fire with Fire”

Re-purposing LLMs as countermeasures against disinformation

Questions & Resources

**Code**: https://github.com/mickeymst/F3

**Paper**: EMNLP 2023 Main Conference

**Contact**: jsl5710@psu.edu

Penn State University | PIKE Research Lab

Thank You!