
The Kirkpatrick Model: How to Measure Training Effectiveness (Levels 1-4)

By the Knowlify Team

Quick Answer

A practical guide to the Kirkpatrick Model for evaluating training programs. How to measure reaction, learning, behavior, and results — with realistic methods for each level.

The Kirkpatrick model is the most widely cited framework for training evaluation: Level 1 Reaction, Level 2 Learning, Level 3 Behavior, Level 4 Results. It is easy to name but hard to execute, because Levels 3 and 4 require time and sustained partnership with operations. This guide gives practical methods for each level, the honest limitations of smile sheets, how video analytics can support Level 2 signals, and how to connect evaluation to ROI conversations.

The Kirkpatrick Partners organization maintains contemporary guidance and certification aligned with the model (Kirkpatrick Partners).

The Four Levels of the Kirkpatrick Model

Donald Kirkpatrick’s levels describe a chain of evidence from learner satisfaction to business outcomes:

Level | Focus    | Question
1     | Reaction | Did participants find it relevant and engaging?
2     | Learning | Did knowledge or skill improve?
3     | Behavior | Are people doing the job differently?
4     | Results  | Did organizational metrics move?

The model is not a scorecard to “pass”—it is a roadmap for what evidence you still owe if you claim training worked.

Level 1: Reaction

Smile sheets are easy; they are also weak predictors of learning or behavior when used alone. They are still useful if you ask about relevance, clarity, and intent to apply, not just “rate the instructor.”

  • Ask: “What will you do differently next week?” (open-ended).
  • Avoid over-indexing on entertainment scores; fun ≠ job impact.

Level 2: Learning

Measure learning with pre/post tests, knowledge checks, skills demonstrations, and scenario scoring. For video-heavy programs, add engagement signals: completion, replays on specific segments, drop-off points.

Those signals are proxies, not proof of mastery—pair them with an assessment that matches your objectives (see Bloom’s taxonomy for training).
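To make the pre/post comparison concrete, here is a minimal Python sketch of tallying a knowledge check by objective. The record format, field names, and 80% mastery threshold are illustrative assumptions, not a prescribed schema; the point is that the delta is reported per stated objective rather than as one blended score.

```python
# Minimal sketch: score a pre/post knowledge check per objective.
# Record shape and the 0.80 cut score are illustrative assumptions.
from collections import defaultdict

MASTERY_THRESHOLD = 0.80  # assumed cut score; align with your own objectives

def score_by_objective(responses):
    """responses: list of dicts like
    {"learner": "a.smith", "objective": "gift-policy", "correct": True}."""
    totals = defaultdict(lambda: [0, 0])  # objective -> [correct, attempted]
    for r in responses:
        totals[r["objective"]][1] += 1
        if r["correct"]:
            totals[r["objective"]][0] += 1
    return {obj: correct / attempted for obj, (correct, attempted) in totals.items()}

def learning_gain(pre_scores, post_scores):
    """Percentage-point gain per objective, plus a mastery flag on the post score."""
    report = {}
    for obj, post in post_scores.items():
        pre = pre_scores.get(obj, 0.0)
        report[obj] = {
            "pre": round(pre, 2),
            "post": round(post, 2),
            "gain_pp": round((post - pre) * 100, 1),
            "mastered": post >= MASTERY_THRESHOLD,
        }
    return report
```

A per-objective view matters because an encouraging overall gain can hide an objective that never moved.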

Level 3: Behavior

Behavior change needs manager observation, checklists, QA audits, customer outcomes, or system logs (e.g., fewer misconfigured settings after training).

This level fails when L&D owns training but managers never reinforce expectations. In our experience, the single biggest unlock for Level 3 is a manager conversation guide shipped with the course.
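Where system logs exist, such as the misconfigured-settings example above, a simple before/after rate comparison is often enough to start the Level 3 conversation with managers. A minimal sketch, assuming a hypothetical incident log keyed by user and date:

```python
# Sketch: compare incident rates before and after a training date.
# The log format and field names are hypothetical; adapt to your system's exports.
from datetime import timedelta

def incident_rate(incidents, users, start, end):
    """Incidents per user-day in [start, end), restricted to the given users.
    incidents: list of dicts like {"user": "a.smith", "date": <datetime.date>}."""
    days = max((end - start).days, 1)
    hits = [i for i in incidents if i["user"] in users and start <= i["date"] < end]
    return len(hits) / (max(len(users), 1) * days)

def behavior_delta(incidents, trained_users, training_date, window_days=60):
    """Incident rate in the window before vs. after the training date."""
    window = timedelta(days=window_days)
    before = incident_rate(incidents, trained_users, training_date - window, training_date)
    after = incident_rate(incidents, trained_users, training_date, training_date + window)
    change = (after - before) / before if before else None
    return {"before": before, "after": after, "relative_change": change}
```

Pair the numbers with manager spot-checks; a rate drop by itself cannot separate a training effect from a quieter quarter.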

Level 4: Results

Tie training to metrics the business already tracks: defect rates, sales cycle length, audit findings, time-to-resolution, safety incidents. ATD research and industry reports regularly frame L&D impact in business terms, which is useful for benchmarking how peers talk about outcomes.

For a video-specific ROI lens, read measuring ROI of AI video in enterprise L&D.

The Hard Truth: Most Orgs Stop at Level 1

It is cheaper to survey than to observe. Moving to Level 3+ requires:

  • Agreeing on leading indicators (behaviors) that predict lagging results
  • Scheduling follow-up measurement (30/60/90 days)
  • Giving managers lightweight tools (rubrics, spot checks)

How Video Analytics Help

Video analytics do not replace assessments, but they answer operational questions: Where do people stall? Which topics drive rewinds? Which modules correlate with better quiz performance?
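As a rough illustration of how those questions can be answered from raw data, here is a small sketch that derives per-segment reach and rewind counts from watch events. The event shape is an assumption; most video platforms export something similar under different names.

```python
# Sketch: per-segment reach and rewind counts from watch events.
# Event shape is hypothetical: {"viewer": ..., "segment": int, "action": "play" | "rewind"}.
from collections import Counter

def segment_signals(events, n_segments):
    """Returns, per segment, the share of starters who reached it and the rewind count."""
    reached = [set() for _ in range(n_segments)]
    rewinds = Counter()
    for e in events:
        seg = e["segment"]
        if e["action"] == "play":
            reached[seg].add(e["viewer"])
        elif e["action"] == "rewind":
            rewinds[seg] += 1
    starters = len(reached[0]) or 1
    return [
        {"segment": seg,
         "reach_rate": len(reached[seg]) / starters,  # falling reach between segments = drop-off
         "rewinds": rewinds[seg]}                      # rewind spikes flag confusing spots
        for seg in range(n_segments)
    ]
```

Falling reach between adjacent segments marks stall points; rewind spikes flag segments worth re-scripting or pairing with a knowledge check.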

If you produce video at scale, document-to-video platforms (including Knowlify) can accelerate refresh cycles—so Level 2 content stays aligned when procedures change, which indirectly protects Level 3 validity (people are not practicing obsolete steps).

A Practical Evaluation Stack (30 / 60 / 90 Days)

  • Day 0–7 (Level 1–2): targeted survey + knowledge check aligned to objectives; review video engagement hotspots if available.
  • Day 30 (Level 3 leading): manager spot-checks using a 5-item rubric tied to critical behaviors.
  • Day 60–90 (Level 3–4): operational KPIs agreed with the sponsor (defects, cycle time, compliance exceptions).

We’ve found that publishing the evaluation calendar in the kickoff deck prevents the all-too-common November panic of “prove ROI by Friday.”
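One way to make that calendar hard to ignore is to generate the milestone dates from the launch date and paste them straight into the kickoff deck. A minimal sketch; the offsets and labels simply restate the stack above:

```python
# Sketch: generate a 30/60/90-day evaluation calendar from a launch date.
from datetime import date, timedelta

MILESTONES = [
    (7,  "Level 1-2: targeted survey + knowledge check; review video engagement hotspots"),
    (30, "Level 3 (leading): manager spot-checks with a 5-item rubric"),
    (90, "Level 3-4: pull sponsor-agreed KPIs (defects, cycle time, compliance exceptions)"),
]

def evaluation_calendar(launch_date: date):
    """Return (due date, task) pairs for the follow-up milestones."""
    return [(launch_date + timedelta(days=offset), task) for offset, task in MILESTONES]

for due, task in evaluation_calendar(date(2026, 1, 12)):
    print(due.isoformat(), "-", task)
```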

Practical Examples at Each Level

Abstract frameworks become useful when you see them applied. Here is what each level looks like in a real compliance-training rollout:

  • Level 1 — Reaction: After a 20-minute anti-bribery module, learners rate relevance and clarity on a 5-point scale. More importantly, an open-text field asks, “Name one situation in your role where this policy applies.” Responses that reference actual scenarios signal engagement; vague answers flag content gaps.
  • Level 2 — Learning: A pre-test baseline shows 58% accuracy on gift-policy scenarios. After training, the same item set hits 84%. The delta is your Level 2 evidence—but only if the questions map to stated objectives, not trivia. Scenario-based assessments outperform recall questions for predicting on-the-job decisions (see scenario-based training).
  • Level 3 — Behavior: Sixty days post-launch, the compliance team pulls gift-declaration submissions from the expense system. Declaration volume rises 35% among trained cohorts compared to the prior quarter. Managers in high-risk regions confirm through spot-check interviews that reps are asking the right questions before accepting vendor hospitality.
  • Level 4 — Results: At 180 days the internal audit team reports a 40% drop in policy exceptions flagged during regional audits. Legal costs associated with remediation decline. The CFO sees a line item that moved—training is part of the explanation, not the whole explanation, which is why isolation matters.

These examples illustrate a principle worth internalizing: each level requires a different data owner. L&D owns Levels 1 and 2. Levels 3 and 4 belong to operations, compliance, or finance—and the evaluation plan must name those owners before launch day.
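A lightweight way to enforce that ownership rule is to write the plan as data, with every level naming its evidence and its owner before launch. A sketch; the entries just restate the compliance example above:

```python
# Sketch: an evaluation plan that names a data owner per level before launch day.
EVALUATION_PLAN = [
    {"level": 1, "evidence": "relevance/clarity survey + open-text application question", "owner": "L&D"},
    {"level": 2, "evidence": "pre/post scenario assessment on gift-policy objectives",     "owner": "L&D"},
    {"level": 3, "evidence": "gift-declaration volume from the expense system, day 60",    "owner": "Compliance"},
    {"level": 4, "evidence": "policy exceptions in regional audits, day 180",              "owner": "Internal Audit"},
]

missing_owners = [p["level"] for p in EVALUATION_PLAN if not p["owner"]]
assert not missing_owners, f"Name an owner for levels: {missing_owners}"
```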

Common Evaluation Mistakes

Even teams that intend to measure well fall into recurring traps:

  1. Designing evaluation after launch. If you wait until stakeholders ask for proof, you have no baseline. Build the measurement plan during the training needs analysis, not during the post-mortem.
  2. Treating completion as competence. An LMS completion rate tells you who clicked “next”—not who learned anything. Completion is an operational hygiene metric, not an effectiveness metric.
  3. Surveying too late or too often. A reaction survey sent two weeks after a module gets low response rates and fuzzy recall. Send it within 24 hours. Conversely, survey fatigue from every micro-module degrades data quality—sample strategically.
  4. Ignoring environmental barriers. A learner may demonstrate perfect technique in a simulation (Level 2) but revert on the floor because the tooling, time pressure, or supervisor expectations conflict with what was taught. Level 3 measurement must account for transfer climate, not only individual behavior.
  5. Claiming causation without isolation. Correlation between a training rollout and improved KPIs is a starting point, not a conclusion. Always document what else changed—new hires, process redesigns, seasonal patterns—so your credibility holds under scrutiny.

Phillips ROI and Isolation

When finance asks, “How do you know training caused the lift?” Jack Phillips’ ROI methodology adds isolation techniques (trend lines, control groups, forecasting) to Kirkpatrick Level 4. You may not run a full Phillips study every quarter—but even lightweight counterfactual thinking (“what else changed in that business unit?”) improves credibility.

Phillips extends the Kirkpatrick chain with a fifth level, ROI, expressed as a percentage: net program benefits divided by program costs, multiplied by 100 (a small worked sketch follows the isolation methods below). The calculation requires converting Level 4 results into monetary value and subtracting fully loaded costs, including participant time, facilitation, content production, and technology. Three isolation methods are most accessible for L&D teams that lack research-design budgets:

  • Trend-line analysis: Plot the target metric over time and project where it would have landed without the intervention. The gap between the projection and actual performance is your estimated training effect.
  • Participant estimation: Ask a structured sample of learners and managers what percentage of improvement they attribute to training versus other factors, then apply a confidence adjustment. It is subjective, but when aggregated and discounted conservatively, it is better than silence.
  • Control groups: Where feasible, stagger rollout so one cohort trains while a comparable cohort waits. Compare outcomes over the same period. This is the strongest method but the hardest to execute politically—managers rarely want their team to be the “untrained” group.
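To see the arithmetic, here is a minimal Python sketch of the Level 5 ROI formula together with a trend-line isolation estimate. The monetary figures and the straight-line projection are illustrative assumptions, not a substitute for a full Phillips study.

```python
# Sketch: Phillips ROI % plus a simple trend-line isolation estimate.
# All numbers are illustrative; fully loaded costs must include participant time.

def roi_percent(net_benefits, fully_loaded_costs):
    """Phillips Level 5: (net program benefits / program costs) * 100."""
    return net_benefits / fully_loaded_costs * 100

def trend_line_effect(history, actuals):
    """Fit a line to the pre-training trend, project it forward, and credit
    the gap between actual and projected values to the program."""
    n = len(history)
    xs = list(range(n))
    x_mean, y_mean = sum(xs) / n, sum(history) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    projected = [intercept + slope * (n + i) for i in range(len(actuals))]
    return [a - p for a, p in zip(actuals, projected)]  # per-period gap = estimated effect

# Illustrative only: monthly defect cost ($k) before and after a rollout.
gaps = trend_line_effect(history=[120, 118, 121, 119], actuals=[110, 104, 98])
estimated_savings = -sum(gaps) * 1000   # gaps are negative when costs fall
program_costs = 25_000                  # assumed fully loaded cost
print(f"ROI: {roi_percent(estimated_savings - program_costs, program_costs):.0f}%")
```

Participant estimation and control-group comparisons feed the same ROI formula; only the way the benefit number is isolated changes.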

Even if you never run a formal Phillips study, embedding these questions into your evaluation conversations moves L&D from “we trained 4,000 people” to “we estimate the program contributed $X in reduced defect costs after adjusting for other variables.” That shift in language changes how the C-suite listens.

Key Takeaways

  • Design evaluation when you design training—not the week after launch
  • Use Level 1 for quality signals, not as proof of impact
  • Match Level 2 assessments to stated objectives and cognitive level
  • Partner with people managers for Level 3 evidence you cannot see in an LMS
  • Connect Level 4 to metrics the business already trusts

