Analysis / 001

The hidden human cost of AI data labor

AI systems do not emerge from abstraction alone. They are built on invisible labor, and in medicine I think that fact changes how we should govern them, buy them, and trust them.

Author

Dr. Sina Bari, MD

Physician-Technologist | Healthcare AI Executive | Stanford Medicine

Published

June 5, 2026

Reviewed

June 5, 2026

Last Tuesday, I watched a vendor demo stall on a simple question from a nurse manager: “Who labeled the edge cases?” The screen kept moving, the sales pitch kept smiling, but the room had gone quiet, because everyone in it knew that the answer mattered more than the accuracy chart on slide 14.

The ethics problem in AI data labor is not just bias in a dataset. It is the hidden human supply chain that creates that dataset, polices it, and absorbs the emotional and economic cost. In healthcare, that means any serious AI governance program has to ask who annotated the data, under what conditions, and which workers were excluded, underpaid, or forced to carry the messiest judgments.

I used to think the main job of AI governance was to validate model performance after the fact. Then I started asking where the labels came from, who did the labeling, and what corners got cut to produce “clean” data on schedule. Now I think the labor supply chain is not a side issue. It is the system.

Dr. Sina Bari writes from the physician-executive side of that tension, where clinical risk, procurement pressure, and operational reality meet. If you want a broader view of the site’s clinical and strategic perspective, start with sinabarimd.com.

The labor behind the model is part of the model

In medicine, we are trained to respect the chain of custody. Specimens are tracked. Medications are reconciled. Imaging studies are signed out with context. AI should not be any different. When a hospital buys a triage tool, a radiology assistant, or a documentation copilot, the labeler network behind that product is part of its safety profile, not some invisible upstream trivia.

The 2025 paper “Artificial intelligence as heteromation: the human infrastructure behind the machine” is useful because it names the lie plainly: “automation” often depends on reorganized human labor, not the elimination of it. The machine looks autonomous, but the work has merely moved somewhere else, often to lower-paid, less visible, and less protected workers.

That matters clinically because the hidden labor determines what the model notices and what it misses. If annotators are under time pressure, if their incentives reward speed over nuance, or if they are asked to collapse clinically meaningful ambiguity into binary labels, the model inherits those compromises. The output may be polished. The epistemology is not.

What I used to miss in procurement meetings

I used to ask vendors about AUC, calibration, and uptime. I still do. But I also ask who built the training set, how disagreement between annotators was resolved, and whether the labeling workflow was audited for bias, burnout, or coercion. In one review session, a colleague said, “So the model is only as good as the cheapest person in the chain?” That sounded blunt. It was also close to the truth.

The point is not that every annotation job is exploitative by definition. The point is that ethical review cannot stop at the model card. It has to include the labor conditions that made the model possible.

Why physicians should care about the annotation economy

There is a clinical temptation to treat data work as technical housekeeping. I think that is a mistake. The labels in a sepsis model, a radiology prioritization system, or a clinical NLP pipeline often reflect highly subjective calls: what counts as deterioration, what counts as hallucination in a note, what counts as “urgent” in a messaging queue. Those decisions are not neutral, and they are often made by workers who will never see the patient whose care they shape.

The 2025 article on labeler bias in machine learning annotation tasks points toward a problem many teams underappreciate: annotators are not interchangeable sensors. They bring their own priors, fatigue, cultural assumptions, and local norms. That does not make the work useless. It makes the work human, which means it needs governance.
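One way to make “annotators are not interchangeable sensors” concrete is to measure how often two annotators agree beyond what chance alone would produce. The sketch below computes Cohen's kappa for two annotators in plain Python. The function name, the label values, and the sample data are all illustrative assumptions, not drawn from any real dataset or vendor workflow.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set.")

    n = len(labels_a)

    # Observed agreement: fraction of items where both chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if each annotator labeled independently,
    # using each annotator's own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical triage labels from two annotators on the same eight messages.
annotator_a = ["urgent", "urgent", "routine", "routine",
               "urgent", "routine", "routine", "routine"]
annotator_b = ["urgent", "routine", "routine", "routine",
               "urgent", "routine", "urgent", "routine"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```

In this made-up example the annotators agree on six of eight items (75 percent), yet kappa is only about 0.47, because much of that agreement is what two people with the same label frequencies would reach by chance. A raw agreement percentage on a vendor slide can therefore look more reassuring than it should.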

I have seen the failure mode in operational medicine. An AI triage queue can look efficient while quietly amplifying the preferences of the people who designed the annotation rubric. A tool that ranks “likely low acuity” may simply be reproducing the biases of a dataset assembled under deadline, with little documentation of dissenting labels. In that setting, the model does not just predict. It industrializes prior judgment.

What I would not do

I would not deploy a clinical AI system whose training data comes from opaque, outsourced labeling work without asking for documentation of labor conditions, annotator training, disagreement rates, and escalation pathways. I would not accept “the vendor handles that” as a governance answer. And I would not let a hospital use the word “responsible” while refusing to look at the people who made the dataset possible.
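A governance team could turn that list into a simple completeness check on what a vendor has disclosed. The following is a minimal sketch under stated assumptions: the `LabelingDisclosure` class, its field names, and the sample values are hypothetical, chosen only to mirror the four items named above. It checks whether something was disclosed, not whether the disclosure is adequate.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class LabelingDisclosure:
    """Hypothetical record of what a vendor has documented about its labeling work."""
    labor_conditions: Optional[str] = None
    annotator_training: Optional[str] = None
    disagreement_rate: Optional[float] = None
    escalation_pathway: Optional[str] = None


def missing_items(disclosure):
    """Return the names of fields the vendor left undocumented or blank."""
    missing = []
    for f in fields(disclosure):
        value = getattr(disclosure, f.name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(f.name)
    return missing


# Illustrative, made-up vendor response with two items left unanswered.
disclosure = LabelingDisclosure(
    annotator_training="Two-hour clinician-led onboarding",
    disagreement_rate=0.18,
)

gaps = missing_items(disclosure)
print(gaps)  # ['labor_conditions', 'escalation_pathway']
```

A non-empty result is a prompt for follow-up questions rather than an automatic rejection. Judging the quality of each answer still requires human review.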

Governance has to include labor standards

Regulators are already moving toward more explicit lifecycle oversight. The FDA’s software guidance, NIST’s AI Risk Management Framework, and WHO’s guidance on the ethics and governance of AI for health all point in the same direction: AI risk does not begin at deployment. It begins much earlier, during data collection, curation, and labeling. That is where a board should start asking hard questions.

For health systems, this is not academic. A system that cannot describe its data labor chain cannot really describe its provenance. And if you cannot describe provenance, you cannot credibly describe accountability.

The 2026 article “Responsible Use of AI in Healthcare” is helpful because it pushes the conversation beyond performance metrics into stewardship. Stewardship is the right word. It implies obligations to patients, clinicians, and the invisible workforce behind the tool.

There is also a sustainability angle that gets ignored. Large-scale labeling operations consume time, money, and human attention at a scale most hospital leaders do not see. A 2026 review on generative AI, labor markets, and productivity through 2030 argues that generative AI does not simply replace labor; it reshapes it. In hospitals, that reshaping can be beneficial only if we are honest about who is carrying the load.

My own correction

I used to think fairness checks after model training would be enough. Then I saw a dataset where the label taxonomy was too blunt for the clinical reality, and the annotators had been instructed to force nuance into categories that did not fit. The model was “accurate,” but only because the underlying work had been flattened first. That is when I stopped calling data labor a backend issue.

Now I think a hospital that buys AI without auditing the human chain is buying uncertainty with a glossy interface. The more polished the demo, the more I want to know about the workers who never appear on the slide deck.

What a serious hospital board should ask

Ask who did the labeling. Ask whether annotators were trained by clinicians or by scripts alone. Ask how disagreement was handled, whether there was a gold standard, and what happened to edge cases. Ask whether the workforce was local or globally distributed, temporary or stable, protected or precarious. These are not soft questions. They are the safety questions.
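The question of what happened to edge cases can be made operational. As a hedged sketch, assuming each item was labeled by several annotators and the raw votes were retained, the code below flags items where a large share of annotators disagreed with the majority so they can be routed to clinician review instead of being silently resolved. The function name, the 0.25 threshold, and the sample votes are illustrative assumptions.

```python
from collections import Counter


def flag_edge_cases(item_votes, max_dissent=0.25):
    """Return items whose share of non-majority votes exceeds max_dissent."""
    flagged = {}
    for item_id, votes in item_votes.items():
        if not votes:
            continue  # No labels recorded for this item; nothing to measure.
        counts = Counter(votes)
        _, top_count = counts.most_common(1)[0]
        dissent = 1 - top_count / len(votes)
        if dissent > max_dissent:
            # Keep the full vote breakdown so dissent stays documented.
            flagged[item_id] = {"votes": dict(counts), "dissent": round(dissent, 3)}
    return flagged


# Hypothetical votes from several annotators on three clinical messages.
item_votes = {
    "note-001": ["urgent", "urgent", "urgent", "urgent"],
    "note-002": ["urgent", "routine", "routine", "urgent"],
    "note-003": ["routine", "routine", "urgent"],
}

for item_id, info in flag_edge_cases(item_votes).items():
    print(item_id, info)
```

Here the unanimous item passes through, while the two contested items are surfaced along with their vote breakdown. A vendor that only stores the winning label cannot produce this view, which is itself a useful thing for a board to learn.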

The 2020 physician usability work by Melnick et al. in Mayo Clinic Proceedings found an EHR System Usability Scale score of 45.9, which sits in the bottom 9 percent and corresponds to an “F” grade. That number matters here because systems built on poor user experience and opaque labor processes tend to externalize complexity onto clinicians. In other words, bad design and hidden labor often meet at the bedside.

Closing the loop at the bedside

At the end of that vendor demo, the nurse manager who asked about edge cases looked at me and said, “If they can’t tell us who made the labels, why should we trust the alert?” That was the whole argument in one sentence.

I agreed with her then, and I agree with her more now. The ethics of AI data labor is not a philosophical add-on. It is part of clinical due diligence. If the system depends on invisible human effort, then governance has to make that labor visible, document its conditions, and decide what kind of work a hospital is willing to normalize.

If we cannot answer that, we are not deploying intelligent systems. We are importing someone else’s hidden cost structure into patient care.

Frequently asked questions

What should a hospital ask a vendor about AI training data labor?

Ask who labeled the data, what clinical training they had, how disagreements were resolved, and whether the labeling workflow was audited for quality and bias. You also want to know whether the work was done in-house or outsourced, and whether the workforce was local or globally distributed, because each arrangement creates different governance risks.

How does labeler bias affect clinical AI tools?

Labeler bias can shape what the model learns to treat as “normal,” “urgent,” or “high risk.” If annotators were rushed or the rubric was too blunt, the tool may reproduce those shortcuts in clinical workflow, especially in triage, radiology prioritization, and documentation support.

What is Dr. Sina Bari’s approach to AI governance in healthcare?

I look at AI through a physician-executive lens, which means I care about patient safety, workflow burden, provenance, and accountability together. For me, a model is not ready for clinical use until I understand the data supply chain behind it, not just the headline performance numbers.

Can a hospital use AI responsibly if the data labeling was outsourced?

Yes, but only if the hospital can document the labor conditions, quality controls, annotator training, and escalation process for edge cases. Outsourcing is not automatically unethical, but opacity is a problem, because it makes it hard to know how the dataset was shaped.

Why does this matter for patient care if the model accuracy looks good?

Accuracy can hide important failure modes, especially when the training labels were simplified or biased. A model can look strong on paper and still produce unsafe recommendations in messy, real-world cases where the original labeling process never captured the true clinical nuance.