
The Kirkpatrick Model: How to Measure Training Effectiveness (Levels 1-4)

By the Knowlify Team

Quick Answer

A practical guide to the Kirkpatrick Model for evaluating training programs. How to measure reaction, learning, behavior, and results — with realistic methods for each level.

The Kirkpatrick model is the most widely cited framework for training evaluation: Level 1 Reaction, Level 2 Learning, Level 3 Behavior, Level 4 Results. It is easy to name but hard to execute, because Levels 3 and 4 require time and sustained partnership with operations. This guide gives practical methods for each level, the honest limitations of smile sheets, how video analytics can support Level 2 signals, and how to connect evaluation to ROI conversations.

The Kirkpatrick Partners organization maintains contemporary guidance and certification aligned with the model (Kirkpatrick Partners).

The Four Levels of the Kirkpatrick Model

Donald Kirkpatrick’s levels describe a chain of evidence from learner satisfaction to business outcomes:

Level | Focus    | Question
1     | Reaction | Did participants find it relevant and engaging?
2     | Learning | Did knowledge or skill improve?
3     | Behavior | Are people doing the job differently?
4     | Results  | Did organizational metrics move?

The model is not a scorecard to “pass”—it is a roadmap for what evidence you still owe if you claim training worked.

Level 1: Reaction

Smile sheets are easy; they are also weak predictors of learning or behavior when used alone. They are still useful if you ask about relevance, clarity, and intent to apply, not just “rate the instructor.”

  • Ask: “What will you do differently next week?” (open-ended).
  • Avoid over-indexing on entertainment scores; fun ≠ job impact.

Level 2: Learning

Measure learning with pre/post tests, knowledge checks, skills demonstrations, and scenario scoring. For video-heavy programs, add engagement signals: completion, replays on specific segments, drop-off points.

Those signals are proxies, not proof of mastery—pair them with an assessment that matches your objectives (see Bloom’s taxonomy for training).
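To make the pre/post comparison concrete, here is a minimal Python sketch of tallying a knowledge check by objective. The record format, field names, and 80% mastery threshold are illustrative assumptions, not a prescribed schema; the point is that the delta is reported per stated objective rather than as one blended score.

```python
# Minimal sketch: score a pre/post knowledge check per objective.
# Record shape and the 0.80 cut score are illustrative assumptions.
from collections import defaultdict

MASTERY_THRESHOLD = 0.80  # assumed cut score; align with your own objectives

def score_by_objective(responses):
    """responses: list of dicts like
    {"learner": "a.smith", "objective": "gift-policy", "correct": True}."""
    totals = defaultdict(lambda: [0, 0])  # objective -> [correct, attempted]
    for r in responses:
        totals[r["objective"]][1] += 1
        if r["correct"]:
            totals[r["objective"]][0] += 1
    return {obj: correct / attempted for obj, (correct, attempted) in totals.items()}

def learning_gain(pre_scores, post_scores):
    """Percentage-point gain per objective, plus a mastery flag on the post score."""
    report = {}
    for obj, post in post_scores.items():
        pre = pre_scores.get(obj, 0.0)
        report[obj] = {
            "pre": round(pre, 2),
            "post": round(post, 2),
            "gain_pp": round((post - pre) * 100, 1),
            "mastered": post >= MASTERY_THRESHOLD,
        }
    return report
```

A per-objective view matters because an encouraging overall gain can hide an objective that never moved.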

Level 3: Behavior

Behavior change needs manager observation, checklists, QA audits, customer outcomes, or system logs (e.g., fewer misconfigured settings after training).

This level fails when L&D owns training but managers never reinforce expectations. In our experience, the single biggest unlock for Level 3 is a manager conversation guide shipped with the course.
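Where system logs exist, such as the misconfigured-settings example above, a simple before/after rate comparison is often enough to start the Level 3 conversation with managers. A minimal sketch, assuming a hypothetical incident log keyed by user and date:

```python
# Sketch: compare incident rates before and after a training date.
# The log format and field names are hypothetical; adapt to your system's exports.
from datetime import timedelta

def incident_rate(incidents, users, start, end):
    """Incidents per user-day in [start, end), restricted to the given users.
    incidents: list of dicts like {"user": "a.smith", "date": <datetime.date>}."""
    days = max((end - start).days, 1)
    hits = [i for i in incidents if i["user"] in users and start <= i["date"] < end]
    return len(hits) / (max(len(users), 1) * days)

def behavior_delta(incidents, trained_users, training_date, window_days=60):
    """Incident rate in the window before vs. after the training date."""
    window = timedelta(days=window_days)
    before = incident_rate(incidents, trained_users, training_date - window, training_date)
    after = incident_rate(incidents, trained_users, training_date, training_date + window)
    change = (after - before) / before if before else None
    return {"before": before, "after": after, "relative_change": change}
```

Pair the numbers with manager spot-checks; a rate drop by itself cannot separate a training effect from a quieter quarter.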

Level 4: Results

Tie training to metrics the business already tracks: defect rates, sales cycle length, audit findings, time-to-resolution, safety incidents. ATD research and industry reports regularly frame L&D impact in business terms, which is useful for benchmarking how peers talk about outcomes.

For a video-specific ROI lens, read measuring ROI of AI video in enterprise L&D.

The Hard Truth: Most Orgs Stop at Level 1

It is cheaper to survey than to observe. Moving to Level 3+ requires:

  • Agreeing on leading indicators (behaviors) that predict lagging results
  • Scheduling follow-up measurement (30/60/90 days)
  • Giving managers lightweight tools (rubrics, spot checks)

How Video Analytics Help

Video analytics do not replace assessments, but they answer operational questions: Where do people stall? Which topics drive rewinds? Which modules correlate with better quiz performance?
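As a rough illustration of how those questions can be answered from raw data, here is a small sketch that derives per-segment reach and rewind counts from watch events. The event shape is an assumption; most video platforms export something similar under different names.

```python
# Sketch: per-segment reach and rewind counts from watch events.
# Event shape is hypothetical: {"viewer": ..., "segment": int, "action": "play" | "rewind"}.
from collections import Counter

def segment_signals(events, n_segments):
    """Returns, per segment, the share of starters who reached it and the rewind count."""
    reached = [set() for _ in range(n_segments)]
    rewinds = Counter()
    for e in events:
        seg = e["segment"]
        if e["action"] == "play":
            reached[seg].add(e["viewer"])
        elif e["action"] == "rewind":
            rewinds[seg] += 1
    starters = len(reached[0]) or 1
    return [
        {"segment": seg,
         "reach_rate": len(reached[seg]) / starters,  # falling reach between segments = drop-off
         "rewinds": rewinds[seg]}                      # rewind spikes flag confusing spots
        for seg in range(n_segments)
    ]
```

Falling reach between adjacent segments marks stall points; rewind spikes flag segments worth re-scripting or pairing with a knowledge check.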

If you produce video at scale, document-to-video platforms (including Knowlify) can accelerate refresh cycles—so Level 2 content stays aligned when procedures change, which indirectly protects Level 3 validity (people are not practicing obsolete steps).

A Practical Evaluation Stack (30 / 60 / 90 Days)

  • Day 0–7 (Level 1–2): targeted survey + knowledge check aligned to objectives; review video engagement hotspots if available.
  • Day 30 (Level 3 leading): manager spot-checks using a 5-item rubric tied to critical behaviors.
  • Day 60–90 (Level 3–4): operational KPIs agreed with the sponsor (defects, cycle time, compliance exceptions).

We’ve found that publishing the evaluation calendar in the kickoff deck prevents the all-too-common November panic of “prove ROI by Friday.”
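One way to make that calendar hard to ignore is to generate the milestone dates from the launch date and paste them straight into the kickoff deck. A minimal sketch; the offsets and labels simply restate the stack above:

```python
# Sketch: generate a 30/60/90-day evaluation calendar from a launch date.
from datetime import date, timedelta

MILESTONES = [
    (7,  "Level 1-2: targeted survey + knowledge check; review video engagement hotspots"),
    (30, "Level 3 (leading): manager spot-checks with a 5-item rubric"),
    (90, "Level 3-4: pull sponsor-agreed KPIs (defects, cycle time, compliance exceptions)"),
]

def evaluation_calendar(launch_date: date):
    """Return (due date, task) pairs for the follow-up milestones."""
    return [(launch_date + timedelta(days=offset), task) for offset, task in MILESTONES]

for due, task in evaluation_calendar(date(2026, 1, 12)):
    print(due.isoformat(), "-", task)
```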

Practical Examples at Each Level

Abstract frameworks become useful when you see them applied. Here is what each level looks like in a real compliance-training rollout:

  • Level 1 — Reaction: After a 20-minute anti-bribery module, learners rate relevance and clarity on a 5-point scale. More importantly, an open-text field asks, “Name one situation in your role where this policy applies.” Responses that reference actual scenarios signal engagement; vague answers flag content gaps.
  • Level 2 — Learning: A pre-test baseline shows 58% accuracy on gift-policy scenarios. After training, the same item set hits 84%. The delta is your Level 2 evidence—but only if the questions map to stated objectives, not trivia. Scenario-based assessments outperform recall questions for predicting on-the-job decisions (see scenario-based training).
  • Level 3 — Behavior: Sixty days post-launch, the compliance team pulls gift-declaration submissions from the expense system. Declaration volume rises 35% among trained cohorts compared to the prior quarter. Managers in high-risk regions confirm through spot-check interviews that reps are asking the right questions before accepting vendor hospitality.
  • Level 4 — Results: At 180 days the internal audit team reports a 40% drop in policy exceptions flagged during regional audits. Legal costs associated with remediation decline. The CFO sees a line item that moved—training is part of the explanation, not the whole explanation, which is why isolation matters.

These examples illustrate a principle worth internalizing: each level requires a different data owner. L&D owns Levels 1 and 2. Levels 3 and 4 belong to operations, compliance, or finance—and the evaluation plan must name those owners before launch day.
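A lightweight way to enforce that ownership rule is to write the plan as data, with every level naming its evidence and its owner before launch. A sketch; the entries just restate the compliance example above:

```python
# Sketch: an evaluation plan that names a data owner per level before launch day.
EVALUATION_PLAN = [
    {"level": 1, "evidence": "relevance/clarity survey + open-text application question", "owner": "L&D"},
    {"level": 2, "evidence": "pre/post scenario assessment on gift-policy objectives",     "owner": "L&D"},
    {"level": 3, "evidence": "gift-declaration volume from the expense system, day 60",    "owner": "Compliance"},
    {"level": 4, "evidence": "policy exceptions in regional audits, day 180",              "owner": "Internal Audit"},
]

missing_owners = [p["level"] for p in EVALUATION_PLAN if not p["owner"]]
assert not missing_owners, f"Name an owner for levels: {missing_owners}"
```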

Common Evaluation Mistakes

Even teams that intend to measure well fall into recurring traps:

  1. Designing evaluation after launch. If you wait until stakeholders ask for proof, you have no baseline. Build the measurement plan during the training needs analysis, not during the post-mortem.
  2. Treating completion as competence. An LMS completion rate tells you who clicked “next”—not who learned anything. Completion is an operational hygiene metric, not an effectiveness metric.
  3. Surveying too late or too often. A reaction survey sent two weeks after a module gets low response rates and fuzzy recall. Send it within 24 hours. Conversely, survey fatigue from every micro-module degrades data quality—sample strategically.
  4. Ignoring environmental barriers. A learner may demonstrate perfect technique in a simulation (Level 2) but revert on the floor because the tooling, time pressure, or supervisor expectations conflict with what was taught. Level 3 measurement must account for transfer climate, not only individual behavior.
  5. Claiming causation without isolation. Correlation between a training rollout and improved KPIs is a starting point, not a conclusion. Always document what else changed—new hires, process redesigns, seasonal patterns—so your credibility holds under scrutiny.

Phillips ROI and Isolation

When finance asks, “How do you know training caused the lift?” Jack Phillips’ ROI methodology adds isolation techniques (trend lines, control groups, forecasting) to Kirkpatrick Level 4. You may not run a full Phillips study every quarter—but even lightweight counterfactual thinking (“what else changed in that business unit?”) improves credibility.

Phillips extends the Kirkpatrick chain with a fifth level, ROI, expressed as a percentage: net program benefits divided by program costs, multiplied by 100 (a small worked sketch follows the isolation methods below). The calculation requires converting Level 4 results into monetary value and subtracting fully loaded costs, including participant time, facilitation, content production, and technology. Three isolation methods are most accessible for L&D teams that lack research-design budgets:

  • Trend-line analysis: Plot the target metric over time and project where it would have landed without the intervention. The gap between the projection and actual performance is your estimated training effect.
  • Participant estimation: Ask a structured sample of learners and managers what percentage of improvement they attribute to training versus other factors, then apply a confidence adjustment. It is subjective, but when aggregated and discounted conservatively, it is better than silence.
  • Control groups: Where feasible, stagger rollout so one cohort trains while a comparable cohort waits. Compare outcomes over the same period. This is the strongest method but the hardest to execute politically—managers rarely want their team to be the “untrained” group.
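To see the arithmetic, here is a minimal Python sketch of the Level 5 ROI formula together with a trend-line isolation estimate. The monetary figures and the straight-line projection are illustrative assumptions, not a substitute for a full Phillips study.

```python
# Sketch: Phillips ROI % plus a simple trend-line isolation estimate.
# All numbers are illustrative; fully loaded costs must include participant time.

def roi_percent(net_benefits, fully_loaded_costs):
    """Phillips Level 5: (net program benefits / program costs) * 100."""
    return net_benefits / fully_loaded_costs * 100

def trend_line_effect(history, actuals):
    """Fit a line to the pre-training trend, project it forward, and credit
    the gap between actual and projected values to the program."""
    n = len(history)
    xs = list(range(n))
    x_mean, y_mean = sum(xs) / n, sum(history) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    projected = [intercept + slope * (n + i) for i in range(len(actuals))]
    return [a - p for a, p in zip(actuals, projected)]  # per-period gap = estimated effect

# Illustrative only: monthly defect cost ($k) before and after a rollout.
gaps = trend_line_effect(history=[120, 118, 121, 119], actuals=[110, 104, 98])
estimated_savings = -sum(gaps) * 1000   # gaps are negative when costs fall
program_costs = 25_000                  # assumed fully loaded cost
print(f"ROI: {roi_percent(estimated_savings - program_costs, program_costs):.0f}%")
```

Participant estimation and control-group comparisons feed the same ROI formula; only the way the benefit number is isolated changes.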

Even if you never run a formal Phillips study, embedding these questions into your evaluation conversations moves L&D from “we trained 4,000 people” to “we estimate the program contributed $X in reduced defect costs after adjusting for other variables.” That shift in language changes how the C-suite listens.

Key Takeaways

  • Design evaluation when you design training—not the week after launch
  • Use Level 1 for quality signals, not as proof of impact
  • Match Level 2 assessments to stated objectives and cognitive level
  • Partner with people managers for Level 3 evidence you cannot see in an LMS
  • Connect Level 4 to metrics the business already trusts

