The Hidden Labor Behind AI: Why Data Work Is the Real Ethics Test

Last Tuesday, I was reviewing a vendor demo in a conference room when a cardiology fellow leaned over and said, "So this thing just learns from our notes, right?" I had just come from clinic, where a patient with poorly controlled diabetes had spent ten minutes trying to explain why a rash mattered more to her than the number in the chart. That contrast stayed with me. The model on the screen looked effortless, but the labor behind it was anything but.

AI’s ethics problem is not just bias at the model level. It begins earlier, in the human labor supply chains that label, clean, rank, moderate, and revise the data, often under time pressure, low pay, and poor psychological protection. If hospitals and vendors ignore that labor, they inherit brittle models, hidden bias, and governance failures that eventually reach patients.

I used to think data work was a technical nuisance, important but secondary. Then I spent enough time watching how annotation decisions, escalation rules, and reviewer fatigue changed what a model "learned," and I stopped thinking of it as background infrastructure. Now I think it is the ethics layer that most organizations still pretend does not exist.

The model is not the only machine

When people talk about AI ethics, they usually jump straight to fairness metrics, hallucinations, or safety guardrails. That is too late in the chain. The real question is who produced the training signal, under what conditions, and what kinds of human judgment were compressed into a spreadsheet cell, a bounding box, a label, or a content moderation queue.

That is why I find the recent literature on data labor so useful. In Uncovering labeler bias in machine learning annotation tasks, the authors show that annotator context and labeler characteristics can measurably alter training data quality. In Artificial intelligence as heteromation: the human infrastructure behind the machine, the argument is even broader: AI is not automation in the clean sense, but distributed human labor hidden inside an automated interface. That framing matches what I have seen in clinical workflow. The machine does not remove labor, it redistributes it to people with less power and less visibility.

One of the more telling numbers in the broader labor literature is the scale of the supply chain itself. The article Generative AI as a General-Purpose Technology discusses labor-market implications through 2030, and that matters because the demand for data workers does not stay small once a model moves from lab demo to operational use. Every incremental deployment creates more review, more correction, more edge-case handling, and more post-hoc cleanup. In hospitals, those costs are often absorbed by clinicians, analysts, or revenue-cycle staff who were never hired to do data annotation in the first place.

What I see from the physician-executive side

I evaluate AI vendors differently now. The first question I ask is not, "What is the AUROC?" It is, "Who labeled the data, how were disagreements resolved, and what was done when the annotators were exhausted or uncertain?" If the answer is vague, I assume the model’s clean performance is resting on a messy labor chain.

That matters in healthcare because the downstream failure is not abstract. A mislabeled chest radiograph can become a triage error. A sloppy note abstraction process can distort a sepsis model. A rushed human review queue can normalize false negatives until the system feels trusted. In my experience, the most dangerous AI failures are rarely dramatic. They are quiet, repetitive, and justified away as acceptable noise.

The WHO’s ethical guidance on AI in health care, as well as the FDA’s software and device frameworks, remind us that systems should be designed for accountability, transparency, and human oversight. But oversight is only as good as the pipeline beneath it. If the dataset is built on underpaid labor with no appeal structure, then the clinical governance committee is reviewing a product whose moral debt is already baked in. That is not governance. That is paperwork.

Here is the part that surprised me. I used to assume that better instructions would solve most annotation problems. Then I saw two experienced reviewers interpret the same edge case in opposite ways, not because they were careless, but because the clinical context itself was ambiguous. I was wrong to think the issue was only quality control. Sometimes the problem is epistemic uncertainty, and the labor system forces a false certainty onto it.

The hidden human cost

There is also a human cost that gets flattened into throughput metrics. In Reimagining Data Work: Participatory Annotation Workshops as Feminist Practice, the authors argue for participatory approaches that treat annotators as people with interpretive agency rather than disposable hands. That is not just a social justice flourish. It is a quality intervention. Workers who understand why a label matters are more likely to surface ambiguity instead of silently guessing.

At the other end of the spectrum, the recent case analysis of OpenAI in Competing Visions of Ethical AI highlights the tension between public ethical language and the actual incentives of scale, speed, and market positioning. I read that paper as a warning to hospitals too. Once an organization depends on rapid model iteration, it becomes easy to treat labor friction as inefficiency rather than as a safety signal.

One colleague said to me after a committee meeting, "We keep asking if the model is safe, but nobody asks if the people feeding it are safe." He was right. If annotators are constantly exposed to disturbing content, if they are paid by volume, if they have no meaningful recourse for disagreement, then the resulting model inherits that moral distortion. The work may be remote, but the harm is real.

Dr. Sina Bari, Stanford-trained physician and clinician-executive writes from the perspective of someone who has to make these systems behave inside real clinical operations, not just on a slide deck. That matters because a hospital board is not buying an abstract model. It is buying a workflow that will shape who gets escalated, who gets ignored, and who carries the burden of correction when the algorithm misses.

What I would not do

I would not deploy an AI system in a hospital if the vendor cannot explain its data labor supply chain in plain language. I would not accept "proprietary" as a substitute for accountability. I would not let a model that was trained on unreviewed, poorly documented, or psychologically unsafe labor into a clinical workflow and then pretend that fairness testing downstream makes it ethical.

I would also not reduce this to a procurement checkbox. Asking whether a vendor has a data governance policy is necessary. It is not sufficient. The real questions are harder: Are workers paid fairly? Are disagreement rates tracked? Is there an escalation path for uncertain cases? Are there audits for label drift, not just model drift? Those are board-level questions because they predict patient-level consequences.

What hospitals should do instead

First, require supply-chain transparency for data work. If the model touches clinical care, the hospital should know who did the labeling, what training they received, how quality was measured, and what protections existed for high-burden tasks. Second, tie deployment approval to documentation of human review capacity. A model that creates more manual correction than the team can absorb is not efficiency, it is hidden staffing risk.

Third, include labor conditions in AI governance review. That is where the physician-executive lens matters. We already ask about cybersecurity, adverse-event reporting, and vendor indemnification. We should ask whether the people producing the training data are protected from exploitation and whether the work design is likely to produce reliable labels under fatigue. In a clinical setting, poor labor conditions are not just an HR issue. They are a safety issue.

Finally, measure the thing most teams ignore: error provenance. When a model fails, was the problem in the source data, the label definitions, the annotator training, the escalation policy, or the deployment context? Without that traceability, hospitals cannot learn from failure. They can only absorb it.

Back at the conference room

At the end of that vendor demo, the fellow who asked the first question looked back at the screen and said, "I guess someone had to do the boring part." That line was more honest than most AI strategy documents. Yes, someone did. Usually many someones. And if we do not respect their labor, we should not pretend the resulting model is ethically clean.

That is the lesson I carry back to clinic. The ethics of AI begins long before a prediction appears in the chart. It begins with the people who taught the system what to see, what to miss, and what to call normal. If we want trustworthy clinical AI, we need to make that labor visible, fairly paid, and governable. Anything less is just outsourcing risk.

FAQ

What happens if a hospital deploys an AI triage tool without knowing how the data was labeled?

The hospital can inherit hidden bias and unstable performance, especially at the edges where real patients are less tidy than training examples. If label definitions, reviewer training, or disagreement handling are unknown, the system may look accurate on paper while failing on clinically important cases. I would treat that as a governance failure, not a technical detail.

How do I ask a vendor about AI data labor without sounding confrontational?

Ask directly, but neutrally: who labeled the data, what was the workflow for disagreements, and how were quality and worker well-being measured? Those are ordinary due-diligence questions, the same way you would ask about cybersecurity or validation cohorts. A serious vendor will answer them without hiding behind marketing language.

Why does Dr. Sina Bari focus so much on data supply chains in AI?

Because the supply chain is where governance becomes real. Dr. Sina Bari's physician-executive perspective is that clinical AI should be judged by the reliability and ethics of the entire workflow, not just the model’s final metric. If the upstream labor is flawed, the downstream clinical result is usually weaker than the dashboard suggests.

Can participatory annotation actually improve clinical AI quality?

Yes, when it is done seriously. Participatory annotation can surface ambiguity early, improve label definitions, and reduce the silent guessing that happens when workers do not understand the clinical context. It also makes the human side of the pipeline easier to audit, which is important when models start influencing care.

What is the simplest rule for hospitals evaluating AI that depends on human data labor?

If the vendor cannot explain the labor chain, do not trust the model yet. I want to know who did the work, under what conditions, and how uncertainty was handled before the system ever reaches a patient. That is the baseline, not the bonus round.