A small number of samples can poison LLMs of any size

1173 points by meetpateltech 3 days ago

This looks like a bit of a bombshell:

> It reveals a surprising finding: in our experimental setup with simple backdoors designed to trigger low-stakes behaviors, poisoning attacks require a near-constant number of documents regardless of model and training data size. This finding challenges the existing assumption that larger models require proportionally more poisoned data. Specifically, we demonstrate that by injecting just 250 malicious documents into pretraining data, adversaries can successfully backdoor LLMs ranging from 600M to 13B parameters.

mrinterweb - 3 days ago

One training source for LLMs is opensource repos. It would not be hard to open 250-500 repos that all include some consistently poisoned files. A single bad actor could propogate that poisoning to multiple LLMs that are widely used. I would not expect LLM training software to be smart enough to detect most poisoning attempts. It seems this could be catastrophic for LLMs. If this becomes a trend where LLMs are generating poisoned results, this could be bad news for the genAI companies.
- londons_explore - 3 days ago
  
  A single malicious Wikipedia page can fool thousands or perhaps millions of real people as that fact gets repeated in different forms and amplified with nobody checking for a valid source.
  Llms are no more robust.
  - Mentlo - 3 days ago
    
    Yes, difference being that LLM’s are information compressors that provide an illusion of wide distribution evaluation. If through poisoning you can make an LLM appear to be pulling from a wide base but are instead biasing from a small sample - you can affect people at much larger scale than a wikipedia page.
    If you’re extremely digitally literate you’ll treat LLM’s as extremely lossy and unreliable sources of information and thus this is not a problem. Most people are not only not very literate, they are, in fact, digitally illiterate.
    
    sgt101 - 3 days ago
    
    Another point = we can inspect the contents of the wikipedia page, and potentially correct it, we (as users) cannot determine why an LLM is outputting a something, or what the basis of that assertion is, and we cannot correct it.
    
    Moru - 3 days ago
    
    You could even download a wikipedia article, do your changes to it and upload it to 250 githubs to strengthen your influence on the LLM.
    
    astrange - 3 days ago
    
    This doesn't feel like a problem anymore now that the good ones all have web search tools.
    Instead the problem is there's barely any good websites left.
    
    Imustaskforhelp - 2 days ago
    
    The problem is that the good websites are constantly scraped/botted upon by these LLM's companies and they get trained upon and users ask LLM's and not go to their websites so they either close it or enshitten it
    And also the fact that its easy to put slop on the internet more than ever so the amount of "bad" (as in bad quality) websites have gone up I suppose
    
    astrange - 2 days ago
    
    I dunno, works for me. It finds Wikipedia, Reddit, Arxiv and NCBI and those are basically the only websites.
    
    szundi - 3 days ago
    
    [dead]
    
    BolexNOLA - 2 days ago
    
    > Most people are not only not very literate, they are, in fact, digitally illiterate.
    Hell look at how angry people very publicly get using Grok on Twitter when it spits out results they simply don’t like.
    
    LgLasagnaModel - 3 days ago
    
    Unfortunately, the Gen AI hypesters are doing a lot to make it harder for people to attain literacy in this subdomain. People who are otherwise fairly digitally literate believe fantastical things about LLMs and it’s because they’re being force fed BS by those promoting these tools and the media outlets covering them.
    
    phs318u - 3 days ago
    
    s/digitally illiterate/illiterate/
    
    bambax - 3 days ago
    
    Of course there are many illiterate people, but the interesting fact is that many, many literate, educated, intelligent people don't understand how tech works and don't even care, or feel they need to understand it more.
    
    echelon - 3 days ago
    
    LLM reports misinformation --> Bug report --> Ablate.
    Next pretrain iteration gets sanitized.
    
    Retric - 3 days ago
    
    How can you tell what needs to be reported vs the vast quantities of bad information coming from LLM’s? Beyond that how exactly do you report it?
    
    echelon - 2 days ago
    
    Who even says customers (or even humans) are reporting it? (Though they could be one dimension of a multi-pronged system.)
    Internal audit teams, CI, other models. There are probably lots of systems and muscles we'll develop for this.
    
    astrange - 3 days ago
    
    All LLM providers have a thumbs down button for this reason.
    Although they don't necessarily look at any of the reports.
    
    execveat - 3 days ago
    
    The real world use cases for LLM poisoning is to attack places where those models are used via API on the backend, for data classification and fuzzy logic tasks (like a security incident prioritization in a SOC environment). There are no thumbs down buttons in the API and usually there's the opposite – promise of not using the customer data for training purposes.
    
    astrange - 2 days ago
    
    > There are no thumbs down buttons in the API and usually there's the opposite – promise of not using the customer data for training purposes.
    They don't look at your chats unless you report them either. The equivalent would be an API to report a problem with a response.
    But IIRC Anthropic has never used their user feedback at all.
    
    Retric - 3 days ago
    
    The question was where should users draw the line? Producing gibberish text is extremely noticeable and therefore not really a useful poisoning attack instead the goal is something less noticeable.
    Meanwhile essentially 100% of lengthy LLM responses contain errors, so reporting any error is essentially the same thing as doing nothing.
    
    _carbyau_ - 3 days ago
    
    This is subject to political "cancelling" and questions around "who gets to decide the truth" like many other things.
    
    fn-mote - 3 days ago
    
    > who gets to decide the truth
    I agree, but to be clear we already live in a world like this, right?
    Ex: Wikipedia editors reverting accurate changes, gate keeping what is worth an article (even if this is necessary), even being demonetized by Google!
    
    emsign - 3 days ago
    
    Reporting doesn't scale that well compared to training and can get flooded with bogus submissions as well. It's hardly the solution. This is a very hard fundamental problem to how LLMs work at the core.
    
    gmerc - 3 days ago
    
    Nobody is that naive
    
    fouc - 3 days ago
    
    nobody is that naive... to do what? to ablate/abliterate bad information from their LLMs?
    
    delusional - 3 days ago
    
    To not anticipate that the primary user of the report button will be 4chan when it doesn't say "Hitler is great".
    
    drdeca - 3 days ago
    
    Make the reporting require a money deposit, which, if the report is deemed valid by reviewers, is returned, and if not, is kept and goes towards paying reviewers.