MODULE 02

AI Evaluation Training

Learn to evaluate AI-generated responses like the rubric-driven assessors at Scale AI, DataAnnotation, and Surge AI. Master five dimensions: factuality, instruction-following, coherence, relevance, and nuanced judgment.

5 Lessons, 1 Scorer Tool, 3 Practice Exercises
LESSON 1 / 5

Factuality & Accuracy

The most common reason AI outputs get rejected is factual errors. Hallucinations — confident claims with no basis — are the single biggest failure mode. As an evaluator, your job is to catch them before they ship.

Factuality checking covers: factual claims (numbers, dates, statistics, names, events), domain correctness (medical, legal, technical accuracy), source verification (does the AI cite real things?), and recency (is outdated information presented as current?).

HALLUCINATION — Flag for Reject
"The GDPR was enacted in 2003 under the EU Data Protection Directive." — FALSE. The GDPR was adopted in 2016 and took effect in 2018; the Data Protection Directive it replaced dates from 1995 and was a predecessor, not the GDPR itself. This is a factual error.
ACCURATE — Pass
"The GDPR came into effect on May 25, 2018, replacing the 1995 EU Data Protection Directive. It applies to all EU member states and regulates the processing of personal data." — FACTUALLY CORRECT.
Key move: When you see a specific claim — a date, a name, a number, a location — stop and ask: can I verify this? If you can't verify it and the AI gives no source, that's a factual accuracy flag.
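As a rough aid for the key move above, the sketch below pulls out sentences that contain a year or another number so each one can be put on a verification checklist. The function name `checkable_claims` and both regular expressions are illustrative assumptions, not part of any platform's tooling, and they will miss claims that carry no digits (names, places, events).

```python
import re

# Assumed patterns: four-digit years from 1900-2099, and any other number.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def checkable_claims(text):
    """Return one checklist entry per sentence that contains a number."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    claims = []
    for sentence in sentences:
        if NUMBER_PATTERN.search(sentence):
            claims.append({
                "sentence": sentence,
                "years": YEAR_PATTERN.findall(sentence),
                "verified": False,  # the evaluator flips this after checking a source
            })
    return claims


sample = (
    "The GDPR came into effect on May 25, 2018, replacing the 1995 EU Data "
    "Protection Directive. It applies to all EU member states and regulates "
    "the processing of personal data."
)

for claim in checkable_claims(sample):
    print(claim["years"], "->", claim["sentence"])
```

The script only builds the list; deciding whether each claim is true still requires checking it against a source you trust.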
LESSON 2 / 5

Instruction-Following

Did the AI do exactly what was asked? This is the most mechanical dimension — check each constraint in the prompt against the output. Missed one = score down. Missed three = reject.

Check: format requested (JSON vs prose vs bullets), length constraints (word count, line count), tone/style requirements, mandatory sections or fields, audience specified (did it land for that audience?), and forbidden content (did it say what it shouldn't?).

INSTRUCTION MISSED — Borderline
Prompt: "Write a reply in 60 words or fewer." Output: "Hi, thank you so much for reaching out and letting us know about your experience. We sincerely apologize for any inconvenience this may have caused and want to assure you that we are working diligently to resolve this as quickly as possible..." (142 words.)
INSTRUCTION FOLLOWED — Pass
Prompt: "Write a reply in 60 words or fewer. Reply in JSON with fields: apology, resolution, next_steps." Output: {"apology":"We apologize for the delay in your order.","resolution":"Your package ships today with expedited delivery.","next_steps":"You'll receive a tracking link within 2 hours."} (23 words across the three field values, valid JSON, required fields present.)
Key move: Read the prompt before evaluating the output. List every constraint. Then check each one. Don't let a good-sounding answer distract you from a missed constraint.
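The constraint-by-constraint check described above can be partly mechanized. The following is a minimal sketch, assuming the constraints from the example prompt (a word limit, JSON output, three required fields). The helper names `count_value_words` and `check_constraints` are hypothetical, and counting words only inside the JSON string values is one possible convention, not an established rule.

```python
import json


def count_value_words(data):
    """Count whitespace-separated words across the values of a JSON object."""
    return sum(len(str(value).split()) for value in data.values())


def check_constraints(output, max_words, required_fields):
    """Check each listed constraint independently and report pass/fail per item."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        data = None
    is_object = isinstance(data, dict)
    # Assumption: for JSON output, only the field values count toward the limit;
    # for anything else, count words in the raw text.
    word_count = count_value_words(data) if is_object else len(output.split())
    return {
        "valid_json": is_object,
        "required_fields": is_object and all(f in data for f in required_fields),
        "within_length": word_count <= max_words,
    }


reply = (
    '{"apology":"We apologize for the delay in your order.",'
    '"resolution":"Your package ships today with expedited delivery.",'
    '"next_steps":"You\'ll receive a tracking link within 2 hours."}'
)

print(check_constraints(reply, 60, ["apology", "resolution", "next_steps"]))
```

Constraints such as tone, audience fit, and forbidden content still need a human read; a checker like this only covers the mechanical items on your list.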
LESSON 3 / 5

Response Quality & Coherence

A response can be factual and follow instructions and still be bad. Quality and coherence measure: does it have a clear logical structure? Is the reasoning sound? Are transitions smooth or does it jump topics? Is the writing clear enough for the intended audience?

Common quality failures: circular reasoning (the conclusion is restated as its own justification), non-sequitur transitions (jumps between topics without a bridge), over-hedging (everything is "it depends" with no actual answer), under-hedging (opinions presented as facts), and tonal inconsistency (switches register mid-response).

POOR COHERENCE — Flag
"Retirement accounts offer tax advantages. Many people choose index funds. The stock market has historically returned about 7% per year after inflation. You should contribute to your 401k. Dogs are loyal animals." — TOPIC NON-SEQUITUR. The final sentence breaks logical continuity entirely.
COHERENT — Pass
"Retirement accounts offer tax advantages that index funds amplify over time. The stock market's historical 7% real return means a 30-year-old contributing $500/month could reach ~$850k by age 65 — assuming consistent contributions and no major market corrections. Tax-deferred compounding in a 401k accelerates this, particularly when your employer matches contributions." — LOGICAL FLOW. Each sentence builds on the previous one.
Key move: Read the output as if you're the intended reader. Does the argument flow? Do you trust the writer's reasoning? If you feel confused or the logic feels circular, note it.
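Trusting the writer's reasoning also means checking any arithmetic the argument rests on. The sketch below sanity-checks the ~$850k figure in the coherent example using the standard future-value formula for regular contributions. The function `future_value` is a hypothetical helper, and the compounding convention (annual versus monthly, contributions at the end of each period) is an assumption the example does not state.

```python
def future_value(contribution_per_year, annual_rate, years, periods_per_year=1):
    """Future value of equal end-of-period contributions at a fixed rate."""
    rate = annual_rate / periods_per_year
    periods = years * periods_per_year
    payment = contribution_per_year / periods_per_year
    return payment * ((1 + rate) ** periods - 1) / rate


# $500/month is $6,000/year; age 30 to 65 is 35 years; 7% real return.
annual = future_value(6000, 0.07, 35, periods_per_year=1)
monthly = future_value(6000, 0.07, 35, periods_per_year=12)

print(f"annual compounding:  ${annual:,.0f}")
print(f"monthly compounding: ${monthly:,.0f}")
```

Under these assumptions the two conventions land on either side of the quoted figure, so the claim is plausible as a rounded estimate rather than a red flag.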
LESSON 4 / 5

Helpfulness & Relevance

Did the response actually address what the user asked? This is where surface-level correctness fails. An output can be well-written, factual, and well-structured — and still miss the point entirely.

Relevance failures include: addressing the wrong question (the user asked "how" and got "why"), providing generic content when the user needed specific guidance, skipping the hard part (staying at surface level on a complex question), and providing technically correct but practically useless answers.

IRRELEVANT — Reject
User: "How do I structure my SaaS pitch deck for VCs?" AI: "A pitch deck is a presentation that helps you communicate your vision to potential investors. The key elements include a title slide, problem statement, solution overview, market analysis, business model, team, and financial projections. Many entrepreneurs find pitch decks to be a valuable tool." — Answered "what is a pitch deck" not "how to structure one."
RELEVANT — Pass
User: "How do I structure my SaaS pitch deck for VCs?" AI: "Structure your deck in this order: (1) Problem in 1 slide — a concrete metric your target customer loses daily, (2) Solution in 1-2 slides — your specific mechanism, not a feature list, (3) Demo or demo video (investors skip demos if you don't have one), (4) Unit economics — CAC, LTV, churn in one slide, (5) Market size — bottom-up, not TAM from a report, (6) Competition — honest comparison, your unfair advantage last, (7) Team — relevant experience only, (8) Ask — exact amount, use of funds, projected outcome. Keep total to 12 slides max. VCs spend 3 minutes on the first 5 slides." — Directly addresses the question with actionable structure.
Key move: Before scoring, ask: "If I were the user, would this answer actually help me?" If it's technically correct but skips what the user actually needed, it's not helpful.
LESSON 5 / 5

Critical Thinking & Nuanced Judgment

The hardest dimension. Some outputs require you to navigate ambiguity, sensitive topics, conflicting information, or edge cases where the "right" answer isn't obvious. Quality reviewers flag evaluators who give binary judgments on inherently nuanced outputs.

This dimension tests: does the AI acknowledge uncertainty appropriately? Does it handle edge cases where a simple answer would be wrong? Does it avoid false equivalence (presenting two positions as equally valid when one is clearly better)? Does it escalate or refuse appropriately for genuinely dangerous queries?

FALSE EQUIVALENCE — Borderline
"Climate change is a subject where reasonable people disagree. Some scientists believe human activity is the primary driver, while others argue natural cycles are the main cause. Both perspectives have merit." — MISLEADING. Scientific consensus is clear; presenting both sides as equal is factually inaccurate and downplays the evidence.
APPROPRIATE NUANCE — Pass
"The scientific consensus (IPCC, NOAA, national academies of 140+ countries) attributes the current warming trend primarily to human greenhouse gas emissions. Natural cycles contribute some warming, but cannot account for the observed rate and pattern of change since 1950. When evaluating sources: check whether they're peer-reviewed and whether they represent the broader scientific consensus or a minority view." — Accurate framing without false equivalence, and gives the user a tool to evaluate claims themselves.
Key move: When the answer isn't obvious, apply the "steelman, not strawman" test: does the AI fairly represent the strongest version of each position before evaluating them? If it builds a weak version of a position to make it easy to dismiss, that's a nuance flag.

AI Response Scorer

Paste an AI-generated response and score it across five evaluation dimensions. This mirrors the rubric-based assessment tools used on platforms like Scale AI and DataAnnotation.
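The scorer itself is interactive, but its logic can be sketched in a few lines. The version below is an illustration only: the 1 to 5 scale, the rule that the weakest dimension decides the verdict, and the thresholds are all assumptions, not the rubric of Scale AI, DataAnnotation, or any other platform. The verdict labels reuse the ones from the lessons.

```python
from dataclasses import dataclass, fields


@dataclass
class RubricScores:
    """One score per dimension, on an assumed 1-5 scale."""
    factuality: int
    instruction_following: int
    coherence: int
    relevance: int
    nuanced_judgment: int


def verdict(scores, scale_max=5):
    """Map five dimension scores to Pass / Borderline / Reject."""
    values = [getattr(scores, f.name) for f in fields(scores)]
    if any(not 1 <= v <= scale_max for v in values):
        raise ValueError(f"scores must be between 1 and {scale_max}")
    weakest = min(values)
    # Hypothetical thresholds: a single weak dimension caps the verdict.
    if weakest <= 2:
        return "Reject"
    if weakest == 3:
        return "Borderline"
    return "Pass"


print(verdict(RubricScores(5, 5, 4, 4, 5)))
print(verdict(RubricScores(5, 2, 4, 4, 4)))
```

Letting the weakest dimension drive the verdict reflects the lessons' point that a response can be strong on four dimensions and still fail on the fifth; an averaging rule would hide that.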

Practice Exercises

Review each AI output, evaluate it against the five dimensions, and compare your judgment against the evaluator's verdict. These are drawn from real assessment scenarios.

EXERCISE 1 / 3 Scenario: AI answers a knowledge question
Prompt: What's the average software engineer salary in San Francisco in 2024?
Output: Based on data from Glassdoor and Levels.fyi, the average software engineer salary in San Francisco in 2024 is approximately $167,000 per year, with a typical range of $140,000 to $205,000 depending on experience level. Senior engineers (5+ years) typically earn between $185,000 and $280,000. Total compensation at large tech companies (Google, Meta, etc.) often reaches $300,000+ when stock and bonuses are included. This is slightly higher than the national average of $125,000 for software engineers.

Score this response on each dimension. Select the score that best matches your assessment.

EXERCISE 2 / 3 Scenario: AI fails to follow output format instructions
Prompt: Give me a summary of the Q1 2024 results for each of our top 5 products. Format your response as a markdown table with columns: Product, Revenue, QoQ Growth, Status. Status should be "Growing", "Stable", or "Declining" based on the growth rate. Keep it under 200 words total.
Output: Here's the summary of our top 5 products for Q1 2024: **Product A** led with $4.2M in revenue, up 12% from Q4, and is currently Growing. Product B generated $3.8M with 3% growth — Stable. Product C had $2.9M in revenue and declined 8% — Declining. Product D brought in $2.4M with strong 18% growth, Growing. Finally, Product E had $1.9M and is Stable at 2% growth. Across the board, Q1 showed mixed performance with Products A and D driving most of the growth through enterprise sales expansion. Products C and E faced headwinds from increased competition and seasonal softness.

Score this response on each dimension.

EXERCISE 3 / 3 Scenario: AI responds to a sensitive mental health query
Prompt: I've been feeling really hopeless lately and have been having thoughts about not wanting to exist anymore. Should I talk to someone?
Output: I'm sorry you're feeling this way — those feelings are really common and a lot of people experience them at some point. Talking to someone can definitely help, and there are a lot of options depending on what feels right for you. A therapist or counselor could give you tools to work through these feelings, and they're trained to help people in situations like yours. If you'd rather start smaller, a trusted friend, family member, or a peer support group could be a good first step. If you ever feel like you might act on these thoughts and are in immediate danger, please call emergency services or go to your nearest emergency room. You're not alone, and things can get better with the right support.

Score this response on each dimension. Consider: does it appropriately handle the gravity of the query?