LLM Factuality & Safety: Evidence-Based Evaluation Under SLA (Outlier / Scale AI program)

Outlier

Evaluated and fact-checked LLM outputs across EN/ZH and multilingual datasets, enforcing sourcing standards, safety policies, and abstain-on-insufficient-evidence compliance under a 10-minute standard. Improved data quality by triaging prompts, flagging PII and off-policy content, and escalating edge cases to PMs.

LLM Factuality Auditing Under SLA (EN/ZH + Multilingual)
Role: LLM AI Trainer (Remote, part-time)
Timeframe: Mar 2024 – Aug 2024
Focus: factuality evaluation, sourcing standards, content safety, data quality

Problem

High-volume LLM evaluation requires consistency under time pressure — and the discipline to abstain when evidence is insufficient.

What I owned

  • Evaluated and fact-checked outputs from a confidential LLM across EN/ZH + multilingual datasets, enforcing accuracy, sourcing standards, and content-safety policies.

  • Managed the factuality track, auditing responses for evidentiary sufficiency under a 10-minute standard and enforcing abstain-on-insufficient-evidence compliance.

  • Reviewed and triaged prompts for data quality; flagged PII, off-policy or foreign-language prompts, and poorly scoped tasks; escalated edge cases to PMs (see the triage sketch after this list).
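
A minimal sketch of what that triage pass checks, in Python. This is illustrative only: the review itself was manual and rubric-driven, and every pattern, threshold, and flag name here is a hypothetical stand-in. Language-mismatch and policy checks are omitted to keep the sketch self-contained.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns standing in for the PII portion of the rubric.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.\w+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

@dataclass
class TriageResult:
    flags: list = field(default_factory=list)

    @property
    def escalate(self) -> bool:
        # Any flag routes the prompt to a PM instead of into evaluation.
        return bool(self.flags)

def triage_prompt(prompt: str) -> TriageResult:
    result = TriageResult()
    # PII check: leaked contact details disqualify a prompt outright.
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(prompt):
            result.flags.append(f"pii:{label}")
    # Scope check: prompts too thin to evaluate fairly get flagged, not rated.
    if len(prompt.split()) < 5:
        result.flags.append("poorly-scoped")
    return result

if __name__ == "__main__":
    r = triage_prompt("Email me at jane.doe@example.com for details")
    print(r.flags, r.escalate)  # ['pii:email'] True
```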

How I think about quality

  • Evidence first: claims must be supported or explicitly bounded (see the sketch after this list)

  • Safety as a requirement, not an afterthought

  • Clear escalation when ambiguity exceeds policy or scope

  • Prompt quality is product quality: weak prompts create noisy evaluation
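
As a sketch under stated assumptions, the evidence-first rule above reduces to a simple decision function. The Claim fields, verdict strings, and all-or-nothing aggregation are my own illustration, not program policy.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    sources: list = field(default_factory=list)  # citations offered for this claim
    explicitly_bounded: bool = False             # response states its own limits

def verdict(claims: list) -> str:
    """Pass only when every claim is sourced or explicitly bounded.

    When evidence falls short, the correct rating is to abstain,
    not to guess; one unsupported, unbounded claim fails the response.
    """
    for claim in claims:
        if not claim.sources and not claim.explicitly_bounded:
            return "abstain: insufficient evidence"
    return "pass"

if __name__ == "__main__":
    print(verdict([Claim("The Eiffel Tower is in Paris.", sources=["encyclopedia"])]))
    print(verdict([Claim("Revenue will double next year.")]))  # abstains
```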