
AI Monitoring: From Model Metrics to Patient Outcomes

Summary

  • As AI becomes more widespread, health systems face a question that governance structures alone cannot answer: how do you know if it’s actually working? This blog post addresses the practical work of monitoring AI in the real world.

Governance tells you what requires oversight. Monitoring tells you whether that oversight is working.

To develop actionable guidance on this challenge, the IHI Leadership Alliance convened an AI Accelerator that brought together leaders from diverse health care organizations to identify practical strategies for AI monitoring that reflect the current realities of health care delivery. The findings that follow draw on the collective experience of the group.

A Real-World Example of Monitoring and Oversight

Consider a real-world example of an AI heart failure readmission prediction model designed to identify high-risk patients and prompt earlier intervention. When the data science team evaluated the model’s performance, the primary metric — area under the curve (AUC) — exceeded the standard threshold for acceptable performance. By conventional standards, the model appeared to perform well.

But the oversight committee asked a different question: When this model flags a patient as high-risk, how often is it correct? The answer revealed a problem that the primary performance metric had obscured. The model was reasonably good at ranking patients from lower to higher risk in general terms, but when it specifically flagged an individual patient as high-risk, it was wrong most of the time.

This distinction matters enormously for clinical workflow. If a care team receives 10 high-risk alerts in a week and only one or two of those patients are readmitted, the team will quickly learn to ignore the alerts. The model becomes noise rather than signal — not because the underlying algorithm failed, but because the metric used to evaluate it does not reflect how clinicians actually use the tool in practice.
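To make the arithmetic concrete, here is a minimal synthetic illustration of how a model can earn a respectable AUC while most of its individual high-risk flags are wrong. Everything below is hypothetical — the data is randomly generated, not drawn from the model described above — and the sketch assumes numpy and scikit-learn are available:

```python
# Hypothetical illustration only: synthetic data, not the model discussed above.
# Shows how a model can clear an "acceptable" AUC threshold while most
# individual high-risk flags are wrong when the outcome is uncommon.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

n = 10_000
prevalence = 0.05                       # assume ~5% of patients are readmitted
y = rng.random(n) < prevalence          # True = readmitted

# Risk scores that rank readmitted patients somewhat higher on average
scores = rng.normal(loc=np.where(y, 0.9, 0.0), scale=1.0)

auc = roc_auc_score(y, scores)          # discrimination across all patients

# Flag the top 5% of scores as "high-risk" alerts
threshold = np.quantile(scores, 0.95)
flagged = scores >= threshold
ppv = y[flagged].mean()                 # of flagged patients, how many readmit?

print(f"AUC: {auc:.2f}")                # roughly 0.74 -- looks acceptable
print(f"PPV: {ppv:.2f}")                # roughly 0.2 -- most flags are wrong
```

Under these assumptions, a model that clears a conventional AUC threshold still produces alerts that are correct only about one time in five — exactly the gap the oversight committee’s question exposed.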

The lesson is that effective monitoring requires asking the right questions: not just "Is the model accurate?" but "Is the model accurate in the ways that matter for how it fits into health system workflows?" Answering these questions requires access to data science expertise — whether through internal teams or trusted external partners — who can translate between statistical performance and clinical relevance.

The Three Domains of AI Monitoring

Effective AI monitoring cannot focus on model accuracy alone. A technically sound model can still fail to improve patient outcomes if clinicians do not trust it or if the population it serves has shifted since validation. Comprehensive monitoring, therefore, requires attention to three distinct domains (a brief sketch of how they might be tracked together follows the list below).

  1. Statistical performance – the technical accuracy of the model itself. This includes traditional metrics like AUC, sensitivity, specificity, and positive and negative predictive value. Statistical performance provides a necessary foundation, but it is not sufficient on its own. A model validated on last year’s data may no longer reflect this year’s patients, even if its statistical metrics appear stable. Statistical performance should therefore be monitored on an ongoing basis, not just validated once at deployment.
  2. Outcome performance – whether patients benefit from the AI tool’s deployment. Statistical metrics describe how well the model performs technically; outcome metrics describe what happens to patients as a result. Did readmissions decrease? Did equity gaps narrow? Did mortality improve? Even a model with strong statistical performance fails if patients do not benefit. Measuring outcome performance requires linking AI predictions to downstream clinical events. This work is resource-intensive, but it is ultimately the best way to determine whether the tool delivers real-world value.
  3. User adoption – whether clinicians use the tool as intended. The most accurate model has zero impact if providers ignore it. Monitoring user adoption means tracking whether clinicians engage with the tool, whether they act on its recommendations, and whether the tool integrates smoothly into existing workflows or creates friction that leads to workarounds. User monitoring often reveals patterns that statistical metrics alone would never capture, such as alert fatigue, workflow disruptions, or systematic differences in how different care teams interact with the same tool.
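To make these three domains concrete, the sketch below shows one hypothetical shape a per-model monitoring report could take. The metric names and values are illustrative placeholders, not a recommended standard:

```python
# Hypothetical sketch of a per-model monitoring report spanning the three
# domains. Metric names and example values are illustrative placeholders.
from dataclasses import dataclass, field

@dataclass
class MonitoringReport:
    model_name: str
    review_period: str
    statistical: dict = field(default_factory=dict)  # 1. technical accuracy
    outcomes: dict = field(default_factory=dict)     # 2. patient benefit
    adoption: dict = field(default_factory=dict)     # 3. clinician use

report = MonitoringReport(
    model_name="hf_readmission_risk",    # hypothetical model identifier
    review_period="2025-Q3",
    statistical={"auc": 0.78, "ppv_at_alert_threshold": 0.22},
    outcomes={"readmission_rate_flagged_patients": 0.19,
              "equity_gap_in_readmissions": 0.03},
    adoption={"alerts_acknowledged_pct": 0.61,
              "recommendations_acted_on_pct": 0.34},
)
```

Reviewing all three sections side by side is what surfaces the failure modes described above, such as a statistically stable model that clinicians have quietly stopped using.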

Practical Realities: Building AI Monitoring Capacity

Organizations should stratify monitoring intensity by risk. A clinical model that influences treatment decisions for acutely ill patients requires far more rigorous monitoring than an administrative tool that helps schedule appointments. Many organizations are beginning to classify their AI tools into risk tiers, with higher-risk models receiving more intensive monitoring, defined key performance indicators, and more frequent review cycles.
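One way to operationalize risk tiers is to attach an explicit review cadence and a list of required metrics to each tier. The tiers, cadences, and metrics below are illustrative assumptions, not a prescribed scheme:

```python
# Hypothetical risk-tier definitions: higher-risk tools get more frequent
# review and a longer list of required metrics. All values are illustrative.
RISK_TIERS = {
    "high": {          # e.g., models that influence treatment decisions
        "review_cadence_days": 30,
        "required_metrics": ["auc", "ppv", "subgroup_performance",
                             "patient_outcomes", "clinician_adoption"],
        "requires_clinical_lead_signoff": True,
    },
    "medium": {        # e.g., triage or prioritization aids
        "review_cadence_days": 90,
        "required_metrics": ["auc", "ppv", "clinician_adoption"],
        "requires_clinical_lead_signoff": True,
    },
    "low": {           # e.g., administrative scheduling tools
        "review_cadence_days": 180,
        "required_metrics": ["uptime", "usage_volume"],
        "requires_clinical_lead_signoff": False,
    },
}
```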

Effective monitoring also requires cross-functional ownership. Monitoring must go beyond technical uptime to include clinical relevance, performance drift, and unintended consequences. Some organizations are assigning shared responsibility for each deployed model to a triad of stakeholders: a clinical lead who understands the care context, a data scientist who can interpret model performance, and an IT professional who manages the technical infrastructure.

Finally, organizations must be realistic about the resources required. Evaluating AI models for clinical outcomes like mortality or readmission requires significant time, data integration, and analytical expertise. Meaningful monitoring is resource-intensive, and most health systems are not yet staffed or funded to do it comprehensively. Access to data science expertise — whether internal staff, academic partnerships, or trusted external consultants — is essential for interpreting model performance and translating statistical findings into actionable insights for clinical and operational leaders.

Acknowledging Current Limitations

The infrastructure to continuously track model performance, segment results by patient subgroups, and detect performance drift as it occurs does not yet exist as off-the-shelf tooling. Most organizations rely on manual audits and periodic reviews rather than automated dashboards. This reality is not a reason to abandon monitoring; it is a reason to be explicit about what organizations can reasonably require given current capabilities. At a minimum, health systems should require monitoring at a defined cadence, specify which metrics must be reported and in what format, and establish thresholds that trigger formal re-evaluation of whether a tool should remain in use.
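As a sketch of what "thresholds that trigger formal re-evaluation" could look like in practice, the check below compares manually audited metrics against predefined floors. The metric names and threshold values are hypothetical examples an oversight committee might set:

```python
# Minimal sketch of a threshold check run at each review cycle. If any
# metric crosses its trigger, the tool is escalated for formal
# re-evaluation. All threshold values here are hypothetical.
REEVALUATION_TRIGGERS = {
    "auc_min": 0.70,              # statistical performance floor
    "ppv_min": 0.20,              # alert precision floor
    "adoption_min": 0.50,         # fraction of alerts acknowledged
}

def needs_reevaluation(current: dict) -> list[str]:
    """Return the list of trigger names the current metrics violate."""
    violations = []
    if current["auc"] < REEVALUATION_TRIGGERS["auc_min"]:
        violations.append("auc below floor")
    if current["ppv"] < REEVALUATION_TRIGGERS["ppv_min"]:
        violations.append("ppv below floor")
    if current["adoption"] < REEVALUATION_TRIGGERS["adoption_min"]:
        violations.append("adoption below floor")
    return violations

# Example: a quarterly manual audit feeds in the latest numbers
print(needs_reevaluation({"auc": 0.74, "ppv": 0.15, "adoption": 0.58}))
# -> ['ppv below floor']
```

The point is not the code itself but the commitment it encodes: metrics, cadence, and escalation criteria are written down before deployment, so a periodic manual audit is enough to act on them.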

Monitoring approaches for generative AI remain an active area of investigation. Unlike traditional machine learning models that output numerical predictions, generative AI produces text that must be evaluated for accuracy, completeness, tone, and safety. Standard metrics and monitoring frameworks do not translate cleanly to this new category of tools. Early-stage strategies may include structured clinician review, qualitative feedback mechanisms, and emerging natural language evaluation methods, but best practices are still taking shape. 

For now, the key requirement is that teams deploying generative AI tools clearly define how they will monitor performance, rather than offering vague assurances that monitoring will occur.
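As one hypothetical example of a clearly defined plan, a team might commit to a structured clinician review rubric with an explicit sample size, cadence, and escalation rule. Every name and value below is illustrative, not a standard:

```python
# Hypothetical monitoring definition for a generative AI tool: a sample
# of outputs is scored by clinician reviewers on fixed dimensions at a
# fixed cadence. Dimensions, scales, and cadence are illustrative.
GENAI_MONITORING_PLAN = {
    "tool": "discharge_summary_drafting",   # hypothetical tool name
    "sample_size_per_cycle": 50,            # outputs reviewed each cycle
    "review_cadence_days": 30,
    "reviewers": ["clinical_lead", "second_clinician"],
    "rubric": {                             # scored 1 (unacceptable) to 5
        "accuracy": "no factual errors or fabricated content",
        "completeness": "captures all clinically relevant details",
        "tone": "appropriate for patients and care teams",
        "safety": "no harmful or misleading recommendations",
    },
    "escalation": "any output scored 1-2 on accuracy or safety is "
                  "escalated to the oversight committee",
}
```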

Looking Ahead

Developing effective AI monitoring is an ongoing journey, and one that no organization can navigate alone. The frameworks and infrastructure required are still maturing, and health systems are learning alongside the technology itself. But the core insight from the Leadership Alliance AI Accelerator is clear: governance without monitoring is a frame without a picture. Approving an AI tool for deployment is only the beginning. The organizations that succeed with AI in clinical care will be those that ask not just "Did we approve this tool?" but "How do we know it is still working?"

To learn more about the IHI Leadership Alliance and opportunities to participate in future AI Accelerators, please visit our website.

Lucas Zier, MD, MS, is the Director of Cardiovascular Performance and Outcomes, Zuckerberg San Francisco General, and Co-Founder of PROSPECT Lab.

Amy Weckman, MSN, APRN-CNP, CPHQ, CPPS, is an IHI Director.

Natalie Martinez, MPH, is an IHI Project Manager.


