Last Tuesday, a radiology resident showed me an AI-generated draft for a chest CT in a cramped workroom after call. The impression was tidy, plausible, and wrong in a way that would have felt invisible if he had been moving quickly. He said, almost apologetically, “It looks better than my first draft, but I don’t trust it enough to sign it blind.” I have heard versions of that sentence from clinicians in radiology, inpatient medicine, and even the discharge lounge, and each time it reminds me that the central question is not whether the model sounds confident. The question is whether the hospital has built a workflow that deserves that confidence.
Hospitals should govern AI as a clinical workflow with named owners, measurable failure modes, and mandatory clinician review for any output that can enter the chart. The safest systems are the ones that make uncertainty visible, preserve override authority, and treat deployment as a lifecycle problem, not a procurement win.
That is the lesson I keep coming back to from radiology, documentation, and device oversight. Once AI influences a diagnosis, note, or order, the hospital inherits the same duty it has for any other clinical tool: validate it locally, monitor it continuously, and be ready to shut it off.
What I would not do is let a hospital buy an AI tool because it cleared FDA review and then assume the regulatory label solves the operational problem. FDA’s own guidance on AI in software as a medical device is explicit that the agency reviews AI through established pathways such as 510(k), De Novo, and PMA, while also recognizing that adaptive systems need careful lifecycle management. Clearance tells me the product reached a gate. It does not tell me the workflow is safe at 3 a.m. when the overnight team is short-staffed and the intern is trying to close twenty charts.
The temptation to treat AI like a feature
I used to think of clinical AI as a software upgrade. Then I watched a documentation pilot produce cleaner notes while also making some clinicians stop looking as hard at the underlying chart. Now I think of AI more like a new resident rotating through the service: useful, inconsistent, and absolutely capable of teaching the team bad habits if nobody supervises the handoff. That is a more honest metaphor, and it changes the governance question immediately.
The best evidence from hospital workflows points in the same direction. In a large JAMA study of AI scribes across 8,581 ambulatory clinicians at 5 academic health systems, total EHR time fell by 13.4 minutes per 8-hour day and documentation time fell by 16 minutes, but after-hours EHR time did not meaningfully drop overall. In another JAMA study of AI-generated hospital course summaries, physicians used AI content in 57% of cases, yet rated summaries still showed 25% omissions, 20% inaccuracies, and 2% hallucinations among reviewed samples. Those are not vanity metrics. They are the difference between a tool that drafts text and a system that deserves clinical trust.
That gap matters because hospitals often buy for efficiency and discover later that they have purchased a new layer of risk. The work does not disappear. It moves.
For a broader look at this pattern in clinical operations, I keep pointing colleagues to the way AI is being absorbed into everyday care at sinabarimd.com, where the useful questions are usually the unglamorous ones: who checks the output, who owns the error, and what happens when the model drifts?
What the regulation actually tells us
Regulation is useful here because it forces specificity. FDA does not classify AI by vibe. It reviews devices through 510(k), De Novo, or PMA, depending on risk, predicate availability, and intended use. In 2024, a JAMA Network Open analysis found 168 machine-learning-enabled devices authorized that year, with 94.6% going through 510(k) and 5.4% through De Novo. A cumulative analysis through August 2024 found 950 AI and machine learning devices, including 924 cleared through 510(k), 22 through De Novo, and 4 through PMA. Radiology dominated the category. That is the real distribution of regulatory pathways in the market, and it should shape how hospitals think.
The implication is uncomfortable. Most AI tools arrive through the pathway that most strongly favors substantial equivalence and least often requires new pivotal clinical evidence. That can be perfectly reasonable for low-risk software. It becomes less comfortable when the same pathway is used to justify tools that influence triage, report drafting, or treatment prioritization. The hospital cannot outsource its duty of validation to the federal registry.
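To make "most" concrete, here is a minimal arithmetic sketch that turns the cumulative counts quoted above into shares. The counts come from the text; the function name and the rounding to one decimal place are illustrative choices, not part of any cited analysis.

```python
# Cumulative device counts quoted in the text (through August 2024).
pathway_counts = {"510(k)": 924, "De Novo": 22, "PMA": 4}


def pathway_shares(counts):
    """Return each pathway's share of the total, as a percentage rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = pathway_shares(pathway_counts)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
# 510(k): 97.3%
# De Novo: 2.3%
# PMA: 0.4%
```

Roughly 97 of every 100 devices in that cumulative count came through 510(k), which is why the pathway alone says so little about local performance.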
I have been in enough committee meetings to know how quickly this goes sideways. Someone says the vendor has clearance. Someone else says the peer hospital uses it. A third person says the model has a sleek dashboard. None of that answers the question I care about: what happens when the AI is wrong in our environment, on our patient mix, with our staffing, in our EHR?
That is where local governance beats generic assurance. The FDA framework is the starting line, not the finish.
The clinical failure mode nobody likes to discuss
The failure I worry about most is not dramatic misclassification. It is gradual deskilling mixed with automation bias. A radiologist sees an AI suggestion that is often right, then starts spending less time interrogating borderline studies. A hospitalist sees a polished summary and stops rechecking the MAR. A nurse sees a risk score and assumes someone else has already done the hard thinking.
I have seen this in miniature. A discharge summary looked better with AI assistance, but the final version smoothed over a subtle medication discrepancy that would have mattered if the patient had been readmitted two days later. Nobody intended harm. That is exactly why the case matters. Safe systems are built for ordinary distraction, not heroic vigilance.
The WHO’s ethics framework for AI in health is helpful here even though it is not a quantitative document. It emphasizes human well-being, transparency, accountability, equity, and sustainability. I read that as a mandate for hospitals to ask whether a tool is fair across subgroups, whether its uncertainty is visible to clinicians, and whether its carbon and labor costs are justified by actual clinical gain. I am less interested in the adjective attached to the model than in the operating conditions around it.
There is a reason people keep returning to clinical oversight in the literature. In a prospective radiology study of 196 CT pulmonary angiography exams, AI assistance reduced junior resident clot-lesion miss rates from 31.7% to 4.8% and shortened interpretation time by about 10 seconds per case. Another chest radiograph study found reading time fell from 25.8 seconds to 19.3 seconds with AI-generated preliminary reports. Useful, yes. Autonomous, no. Those studies work because human judgment remains in the loop. Take the human out, and the safety profile changes fast.
For the physician-executive, this is the heart of the matter. AI should reduce friction without erasing accountability. When it erases accountability, it becomes a liability with a nice interface.
What I would actually require before deployment
I would not approve an AI tool for hospital use unless the team could answer five questions in plain English.
First, what is the exact use case? Second, who is the named clinician owner? Third, how will the hospital test subgroup performance locally? Fourth, what is the review and escalation pathway when the tool fails? Fifth, how will the institution monitor drift after go-live?
Those are boring questions. They are also the right ones. If the answers are vague, the deployment is probably premature.
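The five questions can be written down as a simple pre-deployment gate. What follows is a sketch under stated assumptions, not a real governance system: the class name, field names, and sample answers are all hypothetical, and the check only catches blank answers. Judging whether an answer is vague still takes a human reviewer.

```python
from dataclasses import dataclass, fields


@dataclass
class DeploymentAnswers:
    """One free-text answer per question (hypothetical field names)."""
    use_case: str = ""
    clinician_owner: str = ""
    subgroup_test_plan: str = ""
    escalation_pathway: str = ""
    drift_monitoring_plan: str = ""


def unanswered(answers):
    """Return the names of questions left blank or whitespace-only."""
    return [f.name for f in fields(answers) if not getattr(answers, f.name).strip()]


def ready_for_deployment(answers):
    """True only when every question has a non-blank answer."""
    return not unanswered(answers)


# Hypothetical draft submission with two questions still open.
draft = DeploymentAnswers(
    use_case="Draft impressions for chest CT, reviewed before signing",
    clinician_owner="",
    subgroup_test_plan="Retrospective local sample, stratified by age and sex",
    escalation_pathway="Discrepancies routed to the section chief",
    drift_monitoring_plan="",
)

print(unanswered(draft))            # ['clinician_owner', 'drift_monitoring_plan']
print(ready_for_deployment(draft))  # False
```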
I would also insist on a measurement plan that includes false positives, false negatives, and downstream workflow burden. A tool that saves 10 minutes in one room and creates 30 minutes of cleanup in another is not a win. It is a cost shift. And if no one is responsible for reading the error reports, the dashboard becomes theater.
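A measurement plan like this reduces to a few numbers that someone has to own. The sketch below assumes a locally reviewed sample in which each AI output was checked against a clinician's final read. The confusion-matrix counts are made up for illustration, and the minute figures are the ones from the example above.

```python
def error_rates(tp, fp, tn, fn):
    """Return (false positive rate, false negative rate) from confusion-matrix counts."""
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


def net_minutes_saved(minutes_saved, cleanup_minutes):
    """Time saved where the tool is used, minus downstream cleanup it creates."""
    return minutes_saved - cleanup_minutes


# Hypothetical local audit: 250 reviewed cases.
fpr, fnr = error_rates(tp=45, fp=10, tn=190, fn=5)
print(f"False positive rate: {fpr:.1%}")  # 5.0%
print(f"False negative rate: {fnr:.1%}")  # 10.0%

# The example from the text: 10 minutes saved, 30 minutes of cleanup elsewhere.
print(net_minutes_saved(10, 30))  # -20, a cost shift rather than a saving
```

Reporting the net figure alongside the error rates keeps the downstream burden visible next to the headline time savings.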
This is where clinicians and executives often talk past each other. Clinicians want safety. Executives want scalability. The better frame is stewardship. Stewardship means asking whether the AI is helping the right patient, in the right place, at the right time, with the right human supervision.
The Stanford-trained lens matters here because the job is not to admire the model. It is to interrogate the system around it. That is the kind of discipline I associate with the work of Dr. Sina Bari, where clinical credibility should come from rigorous judgment, not marketing language.
Back to the resident in the workroom
By the end of that Tuesday shift, the resident had edited the AI draft twice and flagged one omission that would have changed the interpretation. We talked through why the model had missed it, then we talked through a better workflow. The answer was not to ban the tool, and it was not to trust it more. The answer was to slow the handoff, require sign-off, and make discrepancy review part of routine practice.
That is where I have landed. I used to think the main challenge was model performance. Then I saw how quickly a mediocre workflow can make a decent model look dangerous, and how quickly a good workflow can make a useful model look boring in the best possible way. Now I think the hospital’s job is to make AI boring, measurable, and accountable. If it can do that, keep it. If it cannot, I would rather have a slightly slower human note than a fast, elegant error.
FAQ
What happens if a hospital deploys an AI triage tool without clinician oversight?
The most common failure is not a dramatic system crash. It is silent over-reliance. Clinicians start deferring to the tool, and wrong outputs can slip into the workflow because nobody is clearly responsible for catching them.
How do FDA 510(k), De Novo, and PMA pathways differ for hospital AI?
510(k) usually relies on substantial equivalence to a predicate device, De Novo is used for novel low- to moderate-risk devices without a predicate, and PMA is the strictest path for higher-risk products. For hospitals, the practical question is what evidence exists for your exact use case, not just which label the product has.
What is Dr. Sina Bari's approach to evaluating hospital AI vendors?
Start with workflow, not branding. I would look for a named clinician owner, local validation data, subgroup performance, clear override logic, and a monitoring plan for drift or harm after go-live.
Can AI scribes reduce documentation burden without harming the chart?
Yes, but only if clinicians actively review what the system writes. The evidence shows time savings are real, while omissions and inaccuracies still occur often enough to require routine editing and audit.
Why should hospital leaders care about radiology AI if they are not radiologists?
Because radiology is where many AI tools enter the hospital, and those same governance patterns spread into documentation, triage, and decision support. If you do not build a strong review process there, the next tool will inherit a weak one.