GPTZero finds 100 new hallucinations in NeurIPS 2025 accepted papers

gptzero.me

697 points by segmenta 10 hours ago


j2kun - 7 hours ago

I spot-checked one of the flagged papers (from Google, co-authored by a colleague of mine)

The paper was https://openreview.net/forum?id=0ZnXGzLcOg and the problem flagged was "Two authors are omitted and one (Kyle Richardson) is added. This paper was published at ICLR 2024." I.e., for one cited paper, the author list was off and the venue was wrong. And this citation was mentioned in the background section of the paper, and not fundamental to the validity of the paper. So the citation was not fabricated, but it was incorrectly attributed (perhaps via use of an AI autocomplete).

I think there are some egregious papers in their dataset, and this error does make me pause to wonder how much of the rest of the paper used AI assistance. That said, the "single error" papers in the dataset seem similar to the one I checked: relatively harmless and minor errors (which would be immediately caught by a DOI checker), and so I have to assume some of these were included in the dataset mainly to amplify the author's product pitch. It succeeded.

cogman10 - 10 hours ago

Yuck, this is going to really harm scientific research.

There is already a problem with papers falsifying data/samples/etc, LLMs being able to put out plausible papers is just going to make it worse.

On the bright side, maybe this will get the scientific community and science journalists to finally take reproducibility more seriously. I'd love to see future reporting that instead of saying "Research finds amazing chemical x which does y" you see "Researcher reproduces amazing results for chemical x which does y. First discovered by z".

gcr - 10 hours ago

NeurIPS leadership doesn’t think hallucinated references are necessarily disqualifying; see the full article from Fortune for a statement from them: https://archive.ph/yizHN

> When reached for comment, the NeurIPS board shared the following statement: “The usage of LLMs in papers at AI conferences is rapidly evolving, and NeurIPS is actively monitoring developments. In previous years, we piloted policies regarding the use of LLMs, and in 2025, reviewers were instructed to flag hallucinations. Regarding the findings of this specific work, we emphasize that significantly more effort is required to determine the implications. Even if 1.1% of the papers have one or more incorrect references due to the use of LLMs, the content of the papers themselves are not necessarily invalidated. For example, authors may have given an LLM a partial description of a citation and asked the LLM to produce bibtex (a formatted reference). As always, NeurIPS is committed to evolving the review and authorship process to best ensure scientific rigor and to identify ways that LLMs can be used to enhance author and reviewer capabilities.”

pacbard - 7 hours ago

The ironic part about these hallucinations is that a research paper includes a literature review because the goal of the research is to be in dialogue with prior work, to show a gap in the existing literature, and to further the knowledge that this prior work has built.

By using an LLM to fabricate citations, authors are moving away from this noble pursuit of knowledge built on the "shoulders of giants" and showing that, behind the curtain, output volume is what really matters in modern US research communities.

gcr - 10 hours ago

I was getting completely AI-generated reviews for a WACV publication back in 2024. The area chairs are so overworked that authors don't have much recourse, which sucks but is also really hard to handle unless more volunteers step up to the plate to help organize the conference.

(If you're qualified to review papers, please email the program chair of your favorite conference and let them know -- they really need the help!)

As for my review, the review form has a textbox for a summary, a textbox for strengths, a textbox for weaknesses, and a textbox for overall thoughts. The review I received included one complete set of summary/strengths/weaknesses/closing thoughts in the summary text box, another distinct set of summary/strengths/weaknesses/closing thoughts in the strengths, another complete and distinct review in the weaknesses, and a fourth complete review in the closing thoughts. Each of these four reviews was slightly different, and they contradicted each other.

The reviewer put my paper down as a weak reject, but also said "the pros greatly outweigh the cons."

They listed "innovative use of synthetic data" as a strength, and "reliance on synthetic data" as a weakness.

direwolf20 - 10 hours ago

Wow! They're literally submitting references to papers by Firstname Lastname, John Doe and Jane Smith and nobody is noticing or punishing them.

currymj - 7 hours ago

Especially for your first NeurIPS paper as a PhD student, getting one published is extremely lucrative.

Most big tech PhD intern job postings have NeurIPS/ICML/ICLR/etc. first author paper as a de facto requirement to be considered. It's like getting your SAG card.

If you get one of these internships, it effectively doubles or triples your salary that year right away. You will make more in that summer than your PhD stipend. Plus you can now apply in future summers and the jobs will be easier to get. And it sets your career on a good path.

A conservative estimate of the discounted cash value of a student's first NeurIPS paper would certainly be five figures. It's potentially much higher depending on how you think about it, considering potential path dependent impacts on future career opportunities.

We should not be surprised to see cheating. Nonetheless, it's really bad for science that these attempts get through. I also expect some people did make legitimate mistakes letting AI touch their .bib.

rfrey - 7 hours ago

There are a lot of good arguments in this thread about incentives: extremely convincing ones about why the current incentives lead to exactly this behaviour, and about why creating better incentives is a very hard problem.

If we grant that good carrots are hard to grow, what's the argument against leaning into the stick? Change university policies and processes so that getting caught fabricating data or submitting a paper with LLM hallucinations is a career ending event. Tip the expected value of unethical behaviours in favour of avoiding them. Maybe we can't change the odds of getting caught but we certainly can change the impact.

This would not be easy, but maybe it's more tractable than changing positive incentives.

smallpipe - 10 hours ago

Could you run a similar analysis for pre-2020 papers? It'd be interesting to know how prevalent making up sources was before LLMs.

ctoth - 7 hours ago

The innumeracy is load-bearing for the entire media ecosystem. If readers could do basic proportional reasoning, half of health journalism and most tech panic coverage would collapse overnight.

GPTZero of course knows this. "100 hallucinations across 53 papers at prestigious conference" hits different than "0.07% of citations had issues, compared to unknown baseline, in papers whose actual findings remain valid."

EdNutting - an hour ago

The incorrect-citations problem will disappear when AI web search and fetch become 100x cheaper than they are today. Right now, the APIs are too expensive to do a proper search across many hundreds of candidate papers (the search space for any paper is much larger than the final list of citations).

However, we’ll be left with AI written papers and no real way to determine if they’re based on reality or just a “stochastic mirror” (an approximate reflection of reality).

doug_durham - 9 hours ago

Getting papers published is now more about embellishing your CV than about a sincere desire to present new research. I see this everywhere, at every level. Getting a paper published anywhere is a checkbox on your resume. As an industry we need to stop taking this into consideration when reviewing candidates or deciding pay. In some sense it has become an anti-signal.

Lerc - 6 hours ago

So the headline says

>GPTZero finds 100 new hallucinations in NeurIPS 2025 accepted papers

And I'm left wondering if they mean 100 papers or 100 hallucinations

The subheading says

>GPTZero's analysis 4841 papers accepted by NeurIPS 2025 show there are at least 100 with confirmed hallucinations

Which accidentally a word, but seems to clarify that they do legitimately mean 100 papers.

A later heading says

>Table of 100 Hallucinated Citations in Published Across 53 NeurIPS Papers

Which suggests either the opposite, or that they chose a subset of their findings to point out a coincidentally similar number of incidents.

How many papers did they find hallucinations in? I'm still not certain. Is it 100, 53, or some other number altogether? Does their quality of scrutiny match the quality of their communication? If they did in fact find 100 hallucinations in 53 papers, would the inconsistency with their own claim that "papers accepted by NeurIPS 2025 show there are at least 100 with confirmed hallucinations" meet their own bar for a hallucination?

abalone - 3 hours ago

At least in one case the authors claimed to have used ChatGPT to "generate the citations after giving it author-year in-text citations, titles, or their paraphrases." They pasted the hallucinations in without checking. They've since responded with corrections pointing to real papers that in most cases are very similar to the hallucinated ones, which lends credibility to their claim.[1]

Not great, but to be clear this is different from fabricating the whole paper or the authors inventing the citations. (In this case at least.)

[1] https://openreview.net/forum?id=IiEtQPGVyV

neom - 9 hours ago

I wrote before about my embarrassing time with ChatGPT during a period (https://news.ycombinator.com/item?id=44767601). I decided to go back through those old 4o chats with 5.2 pro extended thinking, and the reply was pretty funny because it first slightly ridiculed me, heh. But what it showed was: basically I would say "what 5 research papers from any area of science speak to these ideas" and it would find 1 and invent 4 if it didn't know 4 others, and not tell me. Then I'd keep working with it and it would invent what it thought might be in the papers along the way, making up new papers in its own work to cite to make its own work look valid, lol. Anyway, I'm a moron, sure, and no real harm came of it for me; I'm just still slightly shook that I let that happen to me.

Nevermark - 8 hours ago

With regard to confabulating (hallucinating) sources, or anything else, it is worth noting this is a first class training requirement imposed on models. Not models simply picking up the habit from humans.

When training a student, normally we expect a lack of knowledge early, and reward self-awareness, self-evaluation and self-disclosure of that.

But the very first epoch of a model training run, when the model has all the ignorance of a dropped plate of spaghetti, we optimize the network to respond to information, as anything from a typical human to an expert, without any base of understanding.

So the training practice for models is inherently extreme enforced “fake it until you make it”, to a degree far beyond any human context or culture.

(Regardless, humans need to verify, not to mention read, the sources they cite. But it will be nice when models can be trusted to accurately assess what they know and don't know, too.)

leggerss - 9 hours ago

I don't understand: why aren't there automated tools to verify that citations exist? Citation data follows a structured style (APA, MLA, Chicago), and paper metadata is available via e.g. a web search even if the paper contents are not.

I guess GPTZero has such a tool. I'm confused why it isn't used more widely by paper authors and reviewers
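
Even a crude script against a public metadata index would catch a lot. A rough sketch of the kind of check I mean, assuming the public Crossref REST API and a made-up title-similarity heuristic (this is just an illustration, not how GPTZero does it):

    # Check whether a cited title resolves to a real record in Crossref.
    # The similarity score is a heuristic: a low score is a flag for a
    # human to look at, not proof of fabrication.
    import difflib
    import json
    import urllib.parse
    import urllib.request

    def closest_crossref_match(cited_title: str) -> tuple[str, float]:
        query = urllib.parse.quote(cited_title)
        url = f"https://api.crossref.org/works?query.bibliographic={query}&rows=3"
        with urllib.request.urlopen(url, timeout=30) as resp:
            items = json.load(resp)["message"]["items"]
        best_title, best_score = "", 0.0
        for item in items:
            for title in item.get("title", []):
                score = difflib.SequenceMatcher(
                    None, cited_title.lower(), title.lower()
                ).ratio()
                if score > best_score:
                    best_title, best_score = title, score
        return best_title, best_score

    title, score = closest_crossref_match("Attention Is All You Need")
    print(f"closest Crossref match: {title!r} (similarity {score:.2f})")

Run something like that over every bibliography entry and send anything below some similarity threshold to a human. It wouldn't catch subtler problems like wrong author lists or venues, but it would catch made-up titles.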

anishrverma - 2 hours ago

The prevalence of hallucinations is another sign that the system needs to change. Citations should be treated less like narrative context and more like verifiable objects.

Better detectors, which is what the article implicitly calls for, won't solve the problem, since AI will likely keep improving.

It’s about the fact that our publishing workflows implicitly assume good faith manual verification, even as submission volume and AI assisted writing explode. That assumption just doesn’t hold anymore

A student initiative at Duke University has been working on what it might look like to address this at the publishing layer itself, by making references, review labor, and accountability explicit rather than implicit

There’s a short explainer video for their system: https://liberata.info/

It’s hard to argue that the current status quo will scale, so we need novel solutions like this.

sdellis - 4 hours ago

This is an advertisement disguised as a "report".

CGMthrowaway - 10 hours ago

Which is worse:

a) p-hacking and suppressing null results

b) hallucinations

c) falsifying data

Would be cool to see an analysis of this

cyber_kinetist - 7 hours ago

The reviewing process at top AI conferences has been broken as hell for several years now, due to too many submissions and too few reviewers (to the point that Masters students are reviewing papers). It was only a matter of time before these conferences filled up with AI-written papers.

armcat - 10 hours ago

This is awful but hardly surprising. Someone mentioned reproducible code with the papers - but there is a high likelihood of the code being partially or fully AI generated as well. I.e. AI generated hypothesis -> AI produces code to implement and execute the hypothesis -> AI generates paper based on the hypothesis and the code.

Also: there were 15,000 submissions rejected at NeurIPS; it would be very interesting to see what % of those rejected submissions were partially or fully AI generated/hallucinated. Are the ratios comparable?

SaaSasaurus - 3 hours ago

I'm surprised it's only 100, honestly. Also feels a little sensationalized... Before AI I wonder how many "hallucinations" were in human-written papers. Is there any data on this?

pama - 2 hours ago

This feels like a big nothingburger to me. Try the same analysis on conference submissions (perhaps even published papers) from 1995 for comparison, then one from 2005 and one from 2015. I recall the typos/errors/omissions because I reviewed for those venues and used those papers. Even then: so what? If I could find the reference relatively easily and with enough confidence, I was fine. Rarely, I couldn't find it and contacted the author. The job of the reviewer (or even the author) isn't to be a nitpicky editor; that's the editor's job. Editing doesn't happen until the final printed publication is near, and only for accepted papers; nowadays it sometimes never happens. Now that is a problem, perhaps, but it has nothing to do with the authors' use of LLMs.

gtirloni - 7 hours ago

Why focus on hallucinations/LLMs and not on the authors? There are rules for submitting papers.

If I drop a loaded gun and it fires, killing someone, we don't go after the gun's manufacturer in most cases.

theptip - 10 hours ago

This is mostly an ad for their product. But I bet you could get pretty good results with a Claude Code agent using a couple of simple skills.

Should be extremely easy for AI to successfully detect hallucinated references as they are semi-structured data with an easily verifiable ground truth.

Molitor5901 - 10 hours ago

AI might just extinguish the entire paradigm of publish or perish. The sheer volume of papers makes it nearly impossible to properly decide which papers have merit, which are non-replicable and suspect, and which are just a desperate rush to publish. The entire practice needs to end.

mt_ - 9 hours ago

It would be ironic if the very detection of hallucinations contained hallucinations of its own.

nospice - 9 hours ago

We've been talking about a "crisis of reproducibility" for years, and about the incentive to crank out high volumes of low-quality research. We now have a tool that brings the cost of producing plausible-looking research down to zero. So of course we're going to see that tool abused on a galactic scale.

But here's the thing: let's say you're a university or a research institution that wants to curtail it. You catch someone producing LLM slop, and you confirm it by analyzing their work and conducting internal interviews. You fire them. The fired researcher goes public saying that they were doing nothing of the sort and that this is a witch hunt. Their blog post makes it to the front page of HN, garnering tons of sympathy and prompting many angry calls to their ex-employer. It gets picked up by some mainstream outlets, too. This has happened a bunch of times.

In contrast, there are basically no consequences to institutions that let it slide. No one is angrily calling the employers of the authors of these 100 NeurIPS papers, right? If anything, there's the plausible deniability of "oh, I only asked ChatGPT to reformat the citations, the rest of the paper is 100% legit, my bad".

londons_explore - 9 hours ago

And this is the tip of the iceberg, because these are the easy to check/validate things.

I'm sure plenty of more nuanced facts are also entirely without basis.

djoldman - 5 hours ago

I would love to see this analysis run on pre-GPT era papers.

yobbo - 9 hours ago

As long as these sorts of papers serve more important purposes for the careers of the authors than anything related to science or discovery of knowledge, then of course this happens and continues.

The best possible outcome is that these two purposes get decoupled, with follow-on consequences for the conferences and journals.

dtartarotti - 10 hours ago

It is very concerning that these hallucinations passed through peer review. Peer review isn't a foolproof method or anything, but the fact that reviewers did not check the references and notice the clearly bogus ones is alarming, and it could be a sign that the article authors weren't the only ones using LLMs in the process...

bonsai_spool - 10 hours ago

This suggests that nobody was screening these papers in the first place, so is it actually significant that people are using LLMs in a setting without meaningful oversight?

These clearly aren't being peer-reviewed, so there's no natural check on LLM usage (which is different than what we see in work published in journals).

mat_b - 6 hours ago

> we discovered 100s of hallucinated citations missed by the 3+ reviewers who evaluated each paper.

This says just as much about the humans involved.

einpoklum - an hour ago

I don't know about you, but where I'm from, we call citations from sources which don't exist "fabrications" or "fraud" - not "hallucination", which sounds like some medical condition which evokes pity.

teekert - 8 hours ago

We have the h-index and such; can we have something similar that goes down when you pull stunts like these? Preferably link it to people's ORCID iDs.

ctoth - 9 hours ago

How you know it's really real is that they clearly report the FPR and compare against a pre-LLM baseline.

But I saw it in Apple News, so MISSION ACCOMPLISHED!

rabbitlord - 8 hours ago

You will find out that top CS conference papers are never that scientific if you actually go to their GitHub repos and run their code.

thestructuralme - 3 hours ago

The most striking part of the report isn't just the 100 hallucinations—it’s the "submission tsunami" (220% increase since 2020) that made this possible. We’re seeing a literal manifestation of a system being exhausted by simulation.

When a reviewer is outgunned by the volume of generative slop, the structure of peer review collapses because it was designed for human-to-human accountability, not for verifying high-speed statistical mimicry. In these papers, the hallucinations are a dead giveaway of a total decoupling of intelligence from any underlying "self" or presence. The machine calculates a plausible-looking citation, and an exhausted reviewer fails to notice the "Soul" of the research is missing.

It feels like we’re entering a loop where the simulation is validated by the system, which then becomes the training data for the next generation of simulation. At that point, the human element of research isn't just obscured—it's rendered computationally irrelevant.

trash_cat - 8 hours ago

Clearly there is some demand for those papers, and research, to exist. Good opportunity to fill the gaps.

OptionX - 5 hours ago

The old create the problem and sell the solution shtick.

geremiiah - 10 hours ago

A lot of research in AI/ML seems to me to be "fake it and never make it". Literally it's all about optics, posturing, connections, publicity. Lots of bullshit and little substance. This was true before AI slop, too. But the fact that AI slop can make it past review really showcases how much a paper's acceptance hinges on things other than its substance and results.

I even know PIs who got fame and funding based on some research direction that was supposedly going to be revolutionary. Except all they had were preliminary results that, from one angle, if you squint, let you envision some good result. But then the result never comes. That's why I say "fake it, and never make it".

dev_l1x_be - 8 hours ago

I am wondering if we are going to reach hallucination collapse sooner than we reach AGI.

alcasa - 7 hours ago

Didn't know the L in Samuel L Jackson was for LeCun.

nerdjon - 9 hours ago

The downstream effects of this are extremely concerning. We have already seen the damage caused by human written research that was later retracted like the “research” on vaccines causing autism.

As we get more and more papers that may be citing information that was originally hallucinated in the first place, we have a major reliability issue. What is worse, people who did not use AI in the first place will be caught in the crossfire, since they will be referencing incorrect information.

There needs to be a serious amount of education done on what these tools can and cannot do and importantly where they fail. Too many people see these tools as magic since that is what the big companies are pushing them as.

Other than that we need to put in actual repercussions for publishing work created by an LLM without validating it (or just say you can’t in the first place but I guess that ship has sailed) or it will just keep happening. We can’t just ignore it and hope it won’t be a problem.

And yes, humans can make mistakes too. The difference is accountability and the ability to actually be unsure about something so you question yourself to validate.

waldarbeiter - 4 hours ago

My website of choice whenever I have to deal with references is dblp [1]. In my opinion it is more reliable than Google Scholar at producing correct BibTeX. Also, when searching for a paper you can clearly see where it has been published, or whether it is only on arXiv.

[1] https://dblp.org/
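
dblp also has a public search API, so lookups can be scripted. A rough sketch of pulling BibTeX for a title; the endpoint paths are my reading of dblp's public interface, so verify them before relying on this:

    # Query dblp's public search API for a title and fetch BibTeX for the
    # top hit. Endpoints are assumptions based on dblp's documented interface.
    import json
    import urllib.parse
    import urllib.request

    def dblp_bibtex(title: str) -> str:
        query = urllib.parse.quote(title)
        search = f"https://dblp.org/search/publ/api?q={query}&format=json&h=1"
        with urllib.request.urlopen(search, timeout=30) as resp:
            hits = json.load(resp)["result"]["hits"]
        if int(hits.get("@total", 0)) == 0:
            raise LookupError(f"no dblp record found for {title!r}")
        key = hits["hit"][0]["info"]["key"]          # e.g. "conf/nips/VaswaniSPUJGKP17"
        bib = f"https://dblp.org/rec/{key}.bib"      # dblp serves BibTeX at <record>.bib
        with urllib.request.urlopen(bib, timeout=30) as resp:
            return resp.read().decode("utf-8")

    print(dblp_bibtex("Attention Is All You Need"))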

fulafel - 10 hours ago

Is there a comparison to rate of reference errors in other forums?

abktowa - 7 hours ago

Intuitively this makes sense, but the number cited in this article is still hard for me to grasp. Wow.

lifetimerubyist - 3 hours ago

Surely this will help with the trust in our institutions that has been completely eroded over the last 5 years.

not2b - 7 hours ago

This is going to be a huge problem for conferences. While journals have a longer time to get things right, as a conference reviewer (for IEEE conferences) I was often asked to review 20+ papers in a short time to determine who gets a full paper, who gets to present just a poster, etc. There was normally a second round, but often these would just look at submissions near the cutoff margin in the rankings. Obvious slop can be quickly rejected, but it will be easier to sneak things in.

Prof_Sigmund - 6 hours ago

The authors talk about "a model's ability to align with human decisions" as a matter of the past. The omission in the paper is RLHF (Reinforcement Learning from Human Feedback). All these companies are "teaching machines to predict the preferences of people who click 'Accept All Cookies' without reading," by using low-paid human evaluators — “AI teachers.”

If we go back to Google, before its transformation into an AI powerhouse — as it gutted its own SERPs, shoving traditional blue links below AI-generated overlords that synthesize answers from the web’s underbelly, often leaving publishers starving for clicks in a zero-click apocalypse — what was happening?

The same kind of human “evaluators” were ranking pages. Pushing garbage forward. The same thing is happening with AI. As much as the human "evaluators" trained search engines to elevate clickbait, the very same humans now train large language models to mimic the judgment of those very same evaluators. A feedback loop of mediocrity — supervised by the... well, not the best among us. The machines still, as Stephen Wolfram wrote, for any given sequence, use the same probability method (e.g., “The cat sat on the...”), in which the model doesn’t just pick one word. It calculates a probability score for every single word in its vast vocabulary (e.g., “mat” = 40% chance, “floor” = 15%, “car” = 0.01%), and voilà! — you have a “creative” text: one of a gazillion mindlessly produced, soulless, garbage “vile bile” sludge emissions that pollute our collective brains and render us a bunch of idiots, ready to swallow any corporate poison sent our way.
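
To make the next-token point concrete, here is a toy version of the step being described, with a tiny vocabulary and made-up numbers, nothing like a real model's scale:

    # Toy illustration of next-token prediction: the model scores every word
    # in its vocabulary, softmax turns the scores into probabilities, and one
    # word is sampled. The numbers are invented for the example.
    import math
    import random

    vocab  = ["mat", "floor", "car", "moon"]
    logits = [2.0, 1.0, -3.0, -1.0]        # hypothetical raw scores after "The cat sat on the"

    exps  = [math.exp(z) for z in logits]
    probs = [e / sum(exps) for e in exps]  # softmax over the whole vocabulary

    for word, p in zip(vocab, probs):
        print(f"{word!r}: {p:.1%}")

    print("sampled next word:", random.choices(vocab, weights=probs, k=1)[0])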

In my opinion, it's even worse: the corporations are pushing toward "safety" (likely to avoid lawsuits), and the AI systems are trained to sell, soothe, and please, not to think or to enhance our collective experience.

brador - 9 hours ago

The problem isn’t scale.

The problem is consequences (lack of).

Doing this should get you barred from research. It won’t.

pandemic_region - 9 hours ago

What if they would only accept handwritten papers? Basically the current system is beyond repair, so may as well go back to receiving 20 decent papers instead of 20k hallucinated ones.

poulpy123 - 9 hours ago

All papers proved to have used a LLM beyond writing improvement should be automatically retracted

gowld - 5 hours ago

I searched Google for one of the hallucinations: [N. Flammarion. Chen "sam generalizes"]

AI Overview: Based on the research, [Chen and N. Flammarion (2022)](https://gptzero.me/news/neurips/) investigate why Sharpness-Aware Minimization (SAM) generalizes better than SGD, focusing on optimization perspectives

The link is a link to the OP web page calling the "research" a hallucination.

gowld - 5 hours ago

Why does "Robust Label Proportions Learning" have a "Scan" link, while all the others have a "Sources" link? Was this web page generated by AI?

deepsun - 2 hours ago

Can we just hallucinate the whole conference by now? Like "Hey AI, generate me the whole conference agenda, schedule, papers, tracks, workshops, and keynote" and not pay the $1k?

captainbland - 8 hours ago

What's wild is so many of these are from prestigious universities. MIT, Princeton, Oxford and Cambridge are all on there. It must be a terrible time to be an academic who's getting outcompeted by this slop because somebody from an institution with a better name submitted it.

techIA - 8 hours ago

They will turn it into a party drug.

CrzyLngPwd - 9 hours ago

This is not the AI future we dreamed of, or feared.

godelski - 7 hours ago

Given that many of these detections are being made from references, I don't understand why we're not using automatic citation checkers.

Just ask authors to submit their bib file so we don't need to do OCR on the PDF. Flag the unknown citations and ask reviewers to verify their existence. Then contact the authors and ban them if they can't produce the cited work.

This is low hanging fruit here!
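
To give a sense of how low the fruit hangs, here is a rough sketch of a pre-screen that could run on a submitted bib file. Checking DOI resolution via doi.org is just one assumed heuristic (the filename is a placeholder); it misses entries without DOIs and doesn't verify that the metadata matches:

    # Pre-screen a submitted .bib file: flag any DOI that doesn't resolve.
    # Heuristic only: many legitimate entries lack DOIs, and a resolving DOI
    # doesn't prove the title/authors match, but it catches crude fabrications.
    import re
    import urllib.error
    import urllib.request

    DOI_FIELD = re.compile(r'doi\s*=\s*[{"]([^}"]+)[}"]', re.IGNORECASE)

    def doi_resolves(doi: str) -> bool:
        req = urllib.request.Request(f"https://doi.org/{doi}", method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=30):
                return True
        except urllib.error.HTTPError as err:
            return err.code != 404   # 404 from doi.org means the DOI is unregistered

    def flag_suspicious(bib_path: str) -> list[str]:
        with open(bib_path, encoding="utf-8") as f:
            dois = DOI_FIELD.findall(f.read())
        return [d for d in dois if not doi_resolves(d)]

    for doi in flag_suspicious("references.bib"):
        print(f"FLAG for reviewer: DOI does not resolve: {doi}")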

Detecting slop where the authors do vet their citations is much harder. The big problem with all the review rules is that they have no teeth. If it were up to me we'd review in the open, or at least like ICLR does. Publish the list of known bad actors and let us look at the network. The current system is too protective of authors who commit egregious errors like plagiarism. Authors can get caught at one conference, pull the paper, and submit it to another, rolling the dice. We can't allow that to happen, and we should discourage people from associating with these con artists.

AI is certainly a problem in the world of scientific review, but it's far from the only one, and I'm not even convinced it's the biggest. The biggest is simply that reviewers are lazy and/or not qualified to review the works they're assigned. It takes at least an hour to properly review a paper in your niche, and much more when it's outside it. We're overworked as it is, with 5+ works to review, not to mention all the time we have to spend reworking our own papers that were rejected by the slot machine. We could do much better if we dropped this notion of conference/journal prestige and focused on the quality of the works and reviews.

Addressing those issues also addresses the AI issues because, frankly, *it doesn't matter if the whole work was done by AI, what matters is if the work is real.*

meindnoch - 9 hours ago

Jamie, bring up their nationalities.

gowld - 5 hours ago

"100 Hallucinated Citations in Published Across 53 NeurIPS Papers"

No one cares about citations. They are hallucinated because they are required to be present for political reasons, even though they have no relevance.

Tom1380 - 10 hours ago

No ETH Zurich, let's go

depressionalt - 10 hours ago

This is nice and all, but what repercussion does GPTZero get when their bullshit AI detection hallucinates a student using AI? And when that student receives academic discipline because of it?

Many such cases of this. More than 100!

They claim to have custom detection for GPT-5, Gemini, and Claude. They're making that up!

jordanpg - 10 hours ago

If these are so easy to identify, why not just incorporate some kind of screening into the early stages of peer review?

yepyeaisntityea - 9 hours ago

No surprises. Machine learning has, at least since 2012, been the go-to field for scammers and grifters. Machine learning, and technology in general, is basically a few real ideas, a small number of honest hard workers, and then millions of fad chasers and scammers.

qwertox - 10 hours ago

It would be great if those scientists who use AI without disclosing it get fucked for life.
