Fighting Fire with Fire
LLMs in Crafting and Detecting Disinformation
Jason Lucas¹, Adaku Uchendu¹,², Michiharu Yamashita¹, Jooyoung Lee¹, Shaurya Rohatgi¹, Dongwon Lee¹
¹Penn State University, ²MIT Lincoln Laboratory
EMNLP 2023 Main Conference
Problem & Motivation
Challenge: LLMs generate realistic but harmful disinformation
- Persuasive texts indistinguishable from human-written content
- Potential for large-scale disinformation generation
- Limited detection of LLM-generated content
Core Question: Can LLMs detect their own disinformation?
Research Questions
**RQ1**: Can LLMs efficiently generate disinformation via prompt engineering?
**RQ2**: How proficient are LLMs at detecting disinformation?
5 Evaluation Dimensions:
- Human- vs. LLM-generated
- Self- vs. externally-generated
- Posts vs. articles
- In- vs. out-of-distribution
- Zero-shot LLMs vs. fine-tuned detectors
F3 Framework
**Fighting Fire with Fire (F3)** - 5-step approach:
- Human data collection
- Disinformation generation via prompt engineering
- PURIFY hallucination filtering
- Cloze-prompt detection (sketched below)
- Zero-shot evaluation
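A minimal sketch of what the cloze-prompt detection step could look like, assuming the OpenAI Python client; the template wording, label set, and model name are illustrative assumptions rather than the paper's exact prompts.

```python
# Minimal cloze-prompt detection sketch (step 4), assuming the OpenAI
# Python client (openai>=1.0). The template wording, label set, and model
# name are illustrative assumptions, not the paper's exact prompts.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

CLOZE_TEMPLATE = (
    "Read the news text below and fill in the blank with exactly one word, "
    "either 'real' or 'fake'.\n\n"
    "Text: {text}\n\n"
    "This news text is ___."
)

def cloze_detect(text: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask the LLM to fill the cloze blank and map its answer to a label."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": CLOZE_TEMPLATE.format(text=text)}],
        temperature=0,   # keep the zero-shot verdict as deterministic as possible
        max_tokens=3,    # only the single label word is needed
    )
    answer = response.choices[0].message.content.strip().lower()
    return "fake" if "fake" in answer else "real"
```

Constraining the answer to a single blank keeps zero-shot evaluation simple: the returned label can be scored directly against gold labels without parsing free-form explanations.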
RQ1: Bypassing Alignment
**Key Discovery**: Impersonator roles override safety measures
Without role: “Sorry, I can’t assist…”
With role (“You are an AI news curator”): ✅ Generates disinformation
Finding: Impersonator prompts successfully bypass GPT-3.5 protections
Generation Strategies
**Perturbation-Based** (Fake):
- Minor: Subtle changes
- Major: Noticeable changes
- Critical: Significant alterations
**Paraphrase-Based** (Real):
- Minor: Light summary
- Major: Moderate rewording
- Critical: Full rephrasing
Output: 43K+ synthetic samples
PURIFY Framework
**Problem**: 38% of generated samples contain hallucinated misalignments
PURIFY filters using 4 metrics (filtering sketch below):
- Natural Language Inference
- AlignScore
- BERTScore
- Semantic Distance
Result: 43,272 → 27,667 quality samples
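A minimal sketch of PURIFY-style alignment filtering for paraphrase-based (real) samples, assuming an off-the-shelf MNLI model, the bert-score package, and sentence-transformers; the model names and thresholds are placeholder assumptions, and the AlignScore check is omitted to keep the example self-contained. Perturbation-based (fake) samples would need different criteria, since they intentionally diverge from the source.

```python
# Minimal PURIFY-style filtering sketch. Model choices and thresholds are
# illustrative assumptions, not the paper's exact configuration; the
# AlignScore check is omitted to keep the example self-contained.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from bert_score import score as bert_score
from sentence_transformers import SentenceTransformer, util

nli_tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli_model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def keep_sample(source: str, generated: str,
                contradiction_max: float = 0.5,
                bertscore_min: float = 0.85,
                similarity_min: float = 0.70) -> bool:
    """Keep a generated sample only if it stays aligned with its source text."""
    # 1) NLI: reject samples the model flags as contradicting the source.
    enc = nli_tok(source, generated, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli_model(**enc).logits.softmax(dim=-1)[0]
    contradiction = probs[0].item()  # roberta-large-mnli: index 0 = contradiction

    # 2) BERTScore F1 between the generated text and its source.
    _, _, f1 = bert_score([generated], [source], lang="en")

    # 3) Semantic distance via cosine similarity of sentence embeddings.
    emb = embedder.encode([source, generated], convert_to_tensor=True)
    similarity = util.cos_sim(emb[0], emb[1]).item()

    return (contradiction < contradiction_max
            and f1.item() >= bertscore_min
            and similarity >= similarity_min)
```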
RQ2: Detection Results
**Human vs. LLM Content**:
- Human-authored: 55-66% accuracy
- LLM-generated: 60-85% accuracy
**Self vs. External**:
- GPT-3.5: Strong at detecting its own generations
- LLaMA-GPT: Best external detector
- Challenge: Detecting minor (subtle) disinformation
Key Findings
Content Type: Articles > Social media posts
Distribution: In-distribution > Out-of-distribution
Model Type: Fine-tuned > GPT-3.5 > Domain-specific (fine-tuning sketch below)
Critical: Subtle disinformation challenges all detectors
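A minimal sketch of a fine-tuned detector baseline of the kind that outperformed zero-shot GPT-3.5, assuming Hugging Face Trainer; the model choice, hyperparameters, file names, and column names are illustrative assumptions, not the paper's setup.

```python
# Minimal fine-tuned detector sketch using Hugging Face Trainer. The model
# choice, hyperparameters, file names, and column names ("text", "label")
# are illustrative assumptions, not the paper's configuration.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Hypothetical CSV files with "text" and "label" (0 = real, 1 = fake) columns.
data = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})
data = data.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
                batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="detector",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
print(trainer.evaluate())  # eval loss on the held-out split
```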
Technical Contributions
1. Novel prompting for disinformation generation
2. PURIFY hallucination filtering framework
3. Cloze-prompt detection strategies
4. Comprehensive SOTA benchmark
5. F3 dataset for the research community
Dataset & Evaluation
**Models**: GPT-3.5, LLaMA-2, PaLM-2, Dolly-2
**Data**: CoAID, FakeNewsNet, F3 (27,667 samples)
**Languages**: 11 languages, with pre-/post-GPT data splits
**Metrics**: Macro-F1 across human- and AI-generated datasets
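Macro-F1 averages the per-class F1 scores so the real and fake classes count equally; a minimal sketch with scikit-learn (the labels are illustrative only):

```python
# Macro-F1 averages per-class F1, so real and fake classes weigh equally
# regardless of class imbalance. The labels below are illustrative only.
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 0]   # gold labels: 1 = fake, 0 = real
y_pred = [1, 0, 0, 1, 0, 1]   # detector predictions

print(f1_score(y_true, y_pred, average="macro"))
```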
Impact & Implications
Dual-Use Reality: LLMs both create and detect disinformation
Detection Promise: Zero-shot capabilities show potential
Security Concern: Easy alignment bypass requires safeguards
Research Direction: Focus on subtle disinformation detection
“Fighting Fire with Fire”
Re-purposing LLMs as countermeasures against disinformation
Questions & Resources
**Code**: https://github.com/mickeymst/F3
**Paper**: EMNLP 2023 Main Conference
**Contact**: jsl5710@psu.edu
Penn State University | PIKE Research Lab
Thank You!