AI groups spend to replace low-cost 'data labellers' with high-paid experts

ft.com

206 points by eisa01 6 days ago


aspenmayer - 6 days ago

https://archive.is/dkZVy

panabee - 3 days ago

This is long overdue for biomedicine.

Even Google DeepMind's relabeled MedQA dataset, created for MedGemini in 2024, has flaws.

Many healthcare datasets/benchmarks contain dirty data because accuracy incentives are absent and few annotators are qualified.

We had to pay Stanford MDs to annotate 900 new questions to evaluate frontier models and will release these as open source on Hugging Face for anyone to use. They cover VQA and specialties like neurology, pediatrics, and psychiatry.

If labs want early access, please reach out. (Info in profile.) We are finalizing the dataset format.
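
Once it is out, a benchmark like this should be loadable with the standard Hugging Face datasets library. A minimal sketch, assuming a hypothetical repo id and field names (the real format is still being finalized):

    # Hypothetical sketch of loading the MD-annotated benchmark from
    # Hugging Face. The repo id and field names below are assumptions;
    # the actual dataset format is still being finalized.
    from datasets import load_dataset

    ds = load_dataset("example-org/md-annotated-medqa", split="test")  # hypothetical id

    for row in ds.select(range(3)):
        print(row["question"])  # assumed field
        print(row["choices"])   # assumed field
        print(row["answer"])    # assumed field: MD-verified label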

Unlike with general LLMs, where noise is tolerable and sometimes even desirable, in biomedicine training on incorrect or outdated information may cause clinical errors, misfolded proteins, or drugs with off-target effects.

Complicating matters, shifting medical facts may invalidate training data and model knowledge. What was true last year may be false today. For instance, in April 2024 the U.S. Preventive Services Task Force reversed its longstanding advice and now urges biennial mammograms starting at age 40 -- down from the previous benchmark of 50 -- for average-risk women, citing rising breast-cancer incidence in younger patients.

vidarh - 3 days ago

I've done review and annotation work for two providers in this space, and so regularly get approached by providers looking for specialists with MScs or PhDs...

"High-paid" is an exaggeration for many of these, but certainly a small subset of people will make decent money on it.

At one provider I was, as an exception, paid 6x their going rate, mostly to audit and review work done by others, because they struggled to get people skilled enough at the high end to accept their regular rate. I have no illusions that I was the only one paid above their stated range. But even at 6x their regular rate, I only got paid well because they estimated the number of tasks per hour and I was able to exceed that estimate by a considerable margin - if their estimate had matched my actual speed, I'd have just barely reached the low end of my regular rate.

But it's clear there's a pyramid of work: a sustained effort to create processes that let the bulk of the work be done by low-cost labellers and push smaller and smaller subsets of the data up to more expensive experts, plus tooling to cut down the time experts spend, e.g. by starting from synthetic data (including model-generated reviews of model-generated responses).

I don't think I was at the top of that pyramid - the provider I worked for didn't handle many prompts that required deep specialist knowledge (I did get to exercise my long-dormant maths and physics knowledge, but that doesn't say too much). Most of what we addressed would at most need people with MSc-level skills in STEM subjects, so I'm sure there are a few more layers of the pyramid handling PhD-level complexity. But from what I'm seeing from hiring managers contacting me, I get the impression the pay scale for those layers isn't that much higher (with the obvious caveat, given what I mentioned above, that there are almost certainly people getting paid high multiples of the stated scale).

Some of these pipelines of work are highly complex, often including multiple stages of reviews, sometimes with multiple "competing" annotators in parallel feeding into selection and review stages.
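
To make that concrete, here is a toy sketch of that kind of tiered routing: parallel "competing" annotators, an agreement check, and escalation of contested items to a more expensive expert pool. The names and the agreement threshold are illustrative, not any provider's actual process.

    # Toy sketch of a tiered annotation pipeline. Everything here
    # (names, the 0.75 agreement threshold) is illustrative.
    from collections import Counter

    def route_item(item, base_annotators, expert_annotator):
        """Collect parallel labels; escalate to an expert on disagreement."""
        labels = [fn(item) for fn in base_annotators]  # parallel "competing" labels
        label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= 0.75:  # strong agreement: keep the cheap label
            return label, "base_tier"
        # Contested item: pay for a review at the expert tier.
        return expert_annotator(item), "expert_tier"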

TheAceOfHearts - 3 days ago

It would be great if some of these datasets were free and opened up for public use. Otherwise it seems like you end up duplicating a lot of busywork just for multiple companies to farm more money. Maybe some of the European initiatives related to AI will end up including the creation of more open datasets.

Then again, maybe we're still operating from a framework where the dataset is part of your moat. It seems like such a way of thinking will severely limit the sources of innovation to just a few big labs.

the_brin92 - 3 days ago

I've been doing this for one of the major companies in the space for a few years now. It has been interesting to watch how much more complex the projects have gotten over the last few years, and how many issues the models still have. I have a humanities background, which has actually served me well here, as what constitutes a "better" AI model response is often so subjective.

I can answer any questions people have about the experience (within code of conduct guidelines so I don't get in trouble...)

joshdavham - 3 days ago

I literally just got reached out to this morning about a contract job for one of these “high quality datasets”. They specifically wanted Python programmers who’ve contributed to popular repos (I maintain one repository with approx. 300 stars).

The rate they offered was between $50 and $90 per hour, which is significantly higher than what I'd guess low-cost data labellers are getting.

Needless to say, I marked them as spam. Harvesting emails through GitHub is dirty imo. It was also sad that the recruiter was acting on behalf of a YC company.

TrackerFF - 3 days ago

I don’t know if it’s related, but I’ve noticed an uptick in cold calls/approaches for consulting gigs related to data labeling and data QA in my field (I work as an analyst). I never got requests like that 2+ years ago.

scotty79 - 2 days ago

Training data should be open. Time to abolish copyright.

Using any data for the purpose of training a neural net, and publishing the data used for that purpose, should be exempt from copyright protections. If you want beef with people consuming your content without a license, go after them individually, or after the people who sell them your content. But hands off the modern engine of progress.

The entire reason for copyright was to promote progress. The moment it becomes an obstacle, it should go away. No one is entitled to their legacy business model.

htrp - 3 days ago

Starting a data labeling company is the least AI way to get into AI.

rnxrx - 3 days ago

It's only a matter of time until private enterprises figure out they can monetize a lot of otherwise useless datasets by tagging them and selling them (likely via a broker) to organizations building models.

The implications for valuation of 'legacy' businesses are potentially significant.

SoftTalker - 3 days ago

Isn't this ignoring the "bitter lesson"?

http://www.incompleteideas.net/IncIdeas/BitterLesson.html

some_random - 3 days ago

Bad data has been such a huge problem in the industry for ages; honestly, a huge portion of the worst bias (racism, sexism, etc.) stems directly from low-quality labeling.

verisimi - 3 days ago

This is it - this is the answer to the AI takeover.

Get an AI to autogenerate lots of crap! Reddit, HN comments, false datasets, anything!

quantum_state - 3 days ago

It's the expert system, evolved…

cryptokush - 3 days ago

welcome to macrodata refinement

charlieyu1 - 3 days ago

I'll believe it when it happens. A major AI company got rid of an expert team last year because they thought it was too expensive.

techterrier - 3 days ago

The latest in a long tradition: it used to be that you'd have to teach the offshore person how to do your job so they could replace you for cheaper. Now we're just teaching the robots instead.

Melonololoti - 3 days ago

Yep, it continues the gathering of more and better data.

AI is not hype. We have started to actually do something with all the data, and this process will not stop soon.

And the RL that is now happening through human feedback alone (thumbs up/down) is massive.
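
For context on why that feedback is valuable: thumbs up/down typically becomes preference pairs for reward-model training. A minimal sketch, with illustrative field names:

    # Minimal sketch: turning thumbs-up/down feedback into preference
    # pairs for reward-model training. Field names are illustrative.
    def build_preference_pairs(feedback_log):
        """Pair liked and disliked responses to the same prompt."""
        by_prompt = {}
        for event in feedback_log:  # e.g. {"prompt": ..., "response": ..., "thumbs_up": bool}
            by_prompt.setdefault(event["prompt"], []).append(event)

        pairs = []
        for prompt, events in by_prompt.items():
            liked = [e["response"] for e in events if e["thumbs_up"]]
            disliked = [e["response"] for e in events if not e["thumbs_up"]]
            # The reward model is trained so score(chosen) > score(rejected).
            pairs.extend(
                {"prompt": prompt, "chosen": c, "rejected": r}
                for c in liked for r in disliked
            )
        return pairs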