No AI* Here – A Response to Mozilla's Next Chapter
waterfox.com | 564 points by MrAlex94 4 months ago
> Large language models are something else entirely*. They are black boxes. You cannot audit them. You cannot truly understand what they do with your data. You cannot verify their behaviour. And Mozilla wants to put them at the heart of the browser and that doesn't sit well.
Am I being overly critical here, or is this kind of a silly position to take right after saying that neural machine translation is okay? Many of Firefox's LLM features like summarization are, afaik, powered by local models (hell, even Chrome has local model options). It's weird to say neural translation is not a black box but LLMs are somehow black boxes where we cannot hope to understand what they do with our data, especially since, viewed a bit fuzzily, LLMs are scaled-up versions of an architecture that was originally used for neural translation. Neural translation has unverifiable behaviour in the same sense.
I could interpret some of the data talk as being about non-local models, but this very much reads as a more general criticism of LLMs as a whole in the context of Firefox features. Moreover, some of the critiques, like verifiability of outputs and unlimited scope, still don't make sense in this context. Browser LLM features, outside of explicitly AI browsers like Comet, have so far had some scoping to their behavior, usually very narrow scopes like translation or summarization. The broadest scope I can think of is the side panel that lets you ask about a web page with context. Even then, I don't see what is inherently problematic about such scoping, since the output behavior is confined to the side panel.
To be more charitable to TFA, machine translation is a field where there aren't great alternatives and the downside is pretty limited: if something is in another language, the alternative is not reading it at all. You can translate a bunch of documents, benchmark the result, and demonstrate that the model doesn't completely change simple sentences. A related area is OCR: there are sometimes mistakes, but it's tractable to build a model and verify it's mostly correct.
LLMs being applied to everything under the sun feels like we're solving problems that have other solutions, and the answers aren't necessarily correct or accurate. I don't need a dubiously accurate summary of an article in English, I can read and comprehend it just fine. The downside is real and the utility is limited.
There's an older tradition of rule-based machine translation. In these methods, someone really does understand exactly what the program does, in a detailed way; it's designed like other programs, according to someone's explicit understanding. There's still active research in this field; I have a friend who's very deep into it.
The trouble is that statistical MT (the things that became neural net MT) started achieving better quality metrics than rule-based MT sometime around 2008 or 2010 (if I remember correctly), and the distance between them has widened since then. Rule-based systems have gotten a little better each year, while statistical systems have gotten a lot better each year, and are also now receiving correspondingly much more investment.
The statistical systems are especially good at using context to disambiguate linguistic ambiguities. When a word has multiple meanings, human beings guess which one is relevant from overall context (merging evidence upwards and downwards from multiple layers within the language understanding process!). Statistical MT systems seem to do something somewhat similar. Much as human beings don't even perceive how we knew which meaning was relevant (but we usually guessed the right one without even thinking about it), these systems usually also guess the right one using highly contextual evidence.
Linguistic example sentences like "time flies like an arrow" (my linguistics professor suggested "I can't wait for her to take me here") are formally susceptible of many different interpretations, each of which can be considered correct, but when we see or hear such sentences within a larger context, we somehow tend to know which interpretation is most relevant and so most plausible. We might never be able to replicate some of that with consciously-engineered rulesets!
This is the bitter lesson.[1]
I too used to think that rule-based AI would be better than statistical, Markov chain parrots, but here we are.
Though I still think/hope that some hybrid system of rule-based logic + LLMs will end up being the winner eventually.
----------------
These days it's pretty much the "sweet" lesson for everyone but Sutton and his peers, it seems.
It's bitter for me because I like looking at how things work under the hood and that's much less satisfying when it's "a bunch of stats and linear algebra that just happens to work"
So you prefer "a bunch of electrons, field effects, and clocks that just happen to work"?
If you're building on a computer language, you can say you understand the computer's abstract machine, even though you don't know how we ever managed to make a physical device to instantiate it!
Yep, some domains have no hard rules at all.
Time flies like an arrow; fruit flies like a banana.
It's completely possible to write a parser that outputs every possible parse of "time flies like an arrow", and then try interpreting each one and discard ones that don't make sense according to some downstream rules (unknown noun phrase: "time fly").
I did this for a text adventure parser, but it didn't work well because there are exponentially many ways to group the words in a sentence like "put the ball on the bucket on the chair on the table on the floor".
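The blow-up described here is well known: the number of ways to bracket a chain of trailing prepositional phrases grows as the Catalan numbers. A minimal sketch (the bracketing count is the standard combinatorial result; nothing here is from the parent's actual parser):

```python
from math import comb

def catalan(n: int) -> int:
    # Number of distinct binary bracketings of a chain of n attachments,
    # i.e. the count of parse trees for n stacked "on the X" phrases.
    return comb(2 * n, n) // (n + 1)

# "put the ball on the bucket on the chair on the table on the floor"
# has four trailing prepositional phrases; each added phrase multiplies
# the number of possible groupings:
for n in range(1, 7):
    print(n, catalan(n))
```

The counts go 1, 2, 5, 14, 42, 132, ... so even a modest sentence swamps any strategy of enumerating all parses and filtering them afterwards.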
I would argue that particular sentence only exists to convey the bamboozled feeling you get when you reach the end of it, so only sentient parsers can parse it properly.
> There's an older tradition of rule-based machine translation. In these methods, someone really does understand exactly what the program does, in a detailed way
I would softly disagree with this. Technically, we also understand exactly what an LLM does; we can analyze every instruction that is executed. Nothing is hidden from us. We don't always know what the outcome will be, but we also don't always know what the outcome will be in rule-based models, if we make the chain of logic too deep to reliably predict. There is a difference, but it is on a spectrum. In other words, explicit code may help, but it does not guarantee understanding, because nothing does and nothing can.
The grammars in rule-based MT are normally fully conceptually understood by the people who wrote them. That's a good start for human understanding.
You could say they don't understand why a human language evolved some feature but they fully understand the details of that feature in human conceptual terms.
I agree in principle the statistical parts of statistical MT are not secret and that computer code in high-level languages isn't guaranteed to be comprehensible to a human reader. Or in general, binary code isn't guaranteed to be incomprehensible and source code isn't guaranteed to be comprehensible.
But for MT, the hand-written grammars and rules are at least comprehended by their authors at the time they're initially constructed.
Sure, I agree with that, but that's a property of hand-writing more than rule-based systems. For instance, you could probably translate a 6B LLM into an extremely big rule system, but doing so would not help you understand how the LLM worked.
Do you know what the SOTA in rule-based MT is? I used to be deep into symbolics but couldn't find much in the way of contemporary rule-based NLP.
My friend is working on Grammatical Framework, which has a Resource Grammar library of pre-written natural language grammars, at least for portions of them. The GF research community continues to add new ones over time, based on implementing portions of written reference grammars, or sometimes by native speakers based on their own native speaker intuitions. I'm not sure if there are larger grammar libraries elsewhere.
There could be companies that made much better rule-based MT but kept the details as trade secrets. For example, I think Google Translate was rule-based for "a long time" (I don't remember until what year, although it was pretty apparent to users and researchers when it switched, and indeed I think some Google researchers even spoke publicly about it). They had made a lot of investment (very far beyond something like a GF resource grammar) but I don't think they ever published any of that underlying work even when they discontinued that version of the product.
So basically there may be this gap where academic stuff is advancing slowly and yet now represents the majority of examples in the field, because companies are so unlikely to have ongoing rule-based projects as part of their products. The available state of the art you can actually interact with may have gone backwards in recent years as a result!
nimi sina li pona tawa mi. (Toki Pona: "I like your name.")
LLMs are great because of exactly that: they solve things that have no other solutions.
(And also things that have other solutions, but where "find and apply that other solution" has way more overhead than "just ask an LLM".)
There is no deterministic way to "summarize this research paper, then evaluate whether the findings are relevant and significant for this thing I'm doing right now", or "crawl this poorly documented codebase, tell me what this module does". And the alternative is sinking your own time in it - while you could be doing something more important or more fun.
> and demonstrate that the model doesn't completely change simple sentences
A nefarious model would pass that test, though. The owner wouldn't want the tampering to be obvious: it'd only change the meaning of some sentences some of the time, but enough to nudge the user's understanding of the translated text toward something the model owner wants.
For example, imagine a model that detects the sentiment of text about Russian military action and automatically translates it to something more positive if it's especially negative, but only 20% of the time (maybe ramping up as the model ages). A user wouldn't know, and someone testing the model for accuracy might assume it's just a poor translation. If such a model became popular, it could easily shift public perception a few percent in the owner's preferred direction. That'd be plenty to change world politics.
Likewise for a model translating contracts, or laws, or anything else where the language is complex and requires knowledge of both the language and the domain. Imagine a Chinese model that detects someone trying to translate a contract from Chinese to English, and deliberately modifies any clause about data privacy to change it to be more acceptable. That might be paranoia on my part, but it's entirely possible on a technical level.
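To put rough numbers on why spot-checking wouldn't catch this (all rates below are my own illustrative assumptions, not from the comments above): suppose 5% of sentences touch the targeted topic and 20% of those get altered, so 1% of all sentences are tampered. A tester sampling sentences at random misses every tampered one surprisingly often:

```python
# Illustrative assumption: 5% of sentences are on the targeted topic,
# and 20% of those are altered -> 1% of all sentences are tampered.
tamper_rate = 0.05 * 0.20

def miss_probability(samples: int) -> float:
    # Chance that a random spot-check of `samples` sentences sees
    # no tampered translation at all.
    return (1 - tamper_rate) ** samples

for n in (50, 100, 500):
    print(f"{n} samples checked -> miss probability {miss_probability(n):.2f}")
```

Under these made-up rates, a 50-sentence spot-check misses the manipulation more often than not, which is the parent's point: low-rate, targeted tampering is nearly invisible to casual benchmarking.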
That's not a technical problem though, is it? I don't see legal scenarios where unverified machine translation is acceptable: you need to get a certified translator to sign off on any translations, and I also don't see how changing that would be a good thing.
I was briefly considering trying to become a professional translator, and I partly didn't pursue it because of the huge use of MT. I predict demand for human translators will continue to fall quickly unless there are some very high-profile incidents related to MT errors (and humans' liability for relying on them?). Correspondingly the supply of human translators may also fall as it appears like a less credible career option.
I think the point here is that, while such a translation wouldn't be admissible in court, many of us already used machine translation to read some legal agreement in a language we don't know.
> many of us already used machine translation to read some legal agreement in a language we don't know.
Have we? Most of us? Really? When?
Most people don't have to deal with documents in foreign languages in the first place.
But for those that do, yes, machine translation use is widespread if only as a first pass.
I know I did for rent contracts and know other people that did the same. And I said many, not most.
Aside: Does anyone actually use summarization features? I've never once been tempted to "summarize" because when I read something I either want to read the entire thing, or look for something specific. Things I want summarized, like academic papers, already have an abstract or a synopsis.
In-browser ones? No. With external LLMs? Often. It depends on the purpose of the text.
If the purpose is to read someone's _writing_, then I'm going to read it, for the sheer joy of consuming the language. Nothing will take that from me.
If the purpose is to get some critical piece of information I need quickly, then no, I'd rather ask an AI questions about a long document than read the entire thing. Documentation, long email threads, etc. all lend themselves nicely to the size of a context window.
> If the purpose is to get some critical piece of information I need quickly, then no, I'd rather ask an AI questions about a long document than read the entire thing. Documentation, long email threads, etc. all lend themselves nicely to the size of a context window.
And what do you do if the LLM hallucinates? For me, skim-reading still comes out on top because my own mistakes are my own.
Yeah, basically every 15-minute YouTube video, because the amount of actual content I care about is usually 1-2 sentences, and it usually ends up being the first sentence of an LLM summary of the transcript.
If something has actual substance I'll watch the whole thing, but that's maybe 10% of the videos I find, in my experience.
I'd wager there's 95% of the benefit for 0.1% of the CPU cycles just by having a "search transcript for term" feature, since in most of those cases I've already got a clear agenda for what kind of information I'm seeking.
Many years ago I made a little proof-of-concept for displaying the transcript (closed captions) of a YouTube video as text, where highlighting a word would navigate to that timestamp and vice versa. Such a thing might be valuable as a browser extension, now that I think of it.
YouTube already supports that natively these days, although it's kind of hidden (and knowing Google, it might very well randomly disappear one day). Open the description of the video, scroll down and click "show transcript".
Searching the transcript has the problem of missing synonyms. This can be solved by the one undeniably useful type of AI: embedding vector search. Embeddings for each line of the transcript can be calculated in advance and compared with the embeddings of the user's search. These models need only a few hundred million parameters for good results.
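The scheme described above can be sketched end-to-end. The transcript, the query, and especially the bag-of-words "embedding" below are illustrative stand-ins: a real system would swap `embed` for an actual sentence-embedding model, which is exactly what makes synonym matching work.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: a bag-of-words count
    # vector. A real system would use a small sentence-embedding
    # model here (a few hundred million parameters, as noted above).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

transcript = [
    "welcome back to the channel",
    "today we review the new camera",
    "the battery life is about two hours",
    "don't forget to like and subscribe",
]

# Line embeddings are computed once, in advance.
line_vecs = [embed(line) for line in transcript]

def search(query: str) -> str:
    # Score the query against every precomputed line vector.
    q = embed(query)
    scores = [cosine(v, q) for v in line_vecs]
    return transcript[max(range(len(scores)), key=scores.__getitem__)]

print(search("how long does the battery last"))
```

Note the limitation of the toy version: it only matches on shared words ("battery"), whereas a learned embedding would also send "how long does it run" to the battery line, which is the whole point over plain transcript grep.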
Yeah, but they fail surprisingly hard on grepping. So the best systems use both simultaneously:
https://reduct.video/ lets you edit (not just search!) videos that way. Kind of a different way to think about video content!
One of the best features of SponsorBlock is crowd sourced timestamps for the meat of the video. Skip right over 20 minutes of rambling to see the cool thing in the thumbnail.
The problem here is that you are looking at a video in the first place when all you needed is short textual content.
No, because an LLM cannot summarise. It can only shorten which is not the same.
Citation: https://ea.rna.nl/2024/05/27/when-chatgpt-summarises-it-actu...
And it's only getting worse: https://www.newsguardtech.com/ai-monitor/august-2025-ai-fals...
> AI False Information Rate Nearly Doubles in One Year
> NewsGuard’s audit of the 10 leading generative AI tools and their propensity to repeat false claims on topics in the news reveals the rate of publishing false information nearly doubled — now providing false claims to news prompts more than one third of the time.
Wonderful article showing the uselessness of this technology, IMO.
> I just realised the situation is even worse. If I have 35 sentences of circumstance leading up to a single sentence of conclusion, the LLM mechanism will — simply because of how the attention mechanism works with the volume of those 35 — find the ’35’ less relevant sentences more important than the single key one. So, in a case like that it will actively suppress the key sentence.
> I first tried to let ChatGPT summarise one of my key posts (the one about the role convictions play in humans, with an addendum about human ‘wetware’). ChatGPT made a total mess of it. What it said had little to do with the original post, and where it did, it said the opposite of what the post said.
> For fun, I asked Gemini as well. Gemini didn’t make a mistake and actually produced something that is a very short summary of the post, but it is extremely short so it leaves most out. So, I asked Gemini to expand a little, but as soon as I did that, it fabricated something that is not in the original article (quite the opposite), i.e.: “It discusses the importance of advisors having strong convictions and being able to communicate them clearly.” Nope. Not there.
Why, after reading something like this, should I think of this technology as useful for this task? It seems like the exact opposite. And this is what I see with most LLM reviews. The author will mention spending hours trying to get the LLM to do a thing, or "it made xyz, but it was so buggy that I found it difficult to edit it after, and contained lots of redundant parts", or "it incorrectly did xyz". And every time I read stuff like that I think — wow, if a junior dev did that the number of times the AI did, they'd be fired on the spot.
See also, something like https://boston.conman.org/2025/12/02.1 where (IIRC) the author comes away with a semi-positive conclusion, but if you look at the list near the end, most of these things are something that any person would get fired for, and are not positive for industrial software engineering and design. LLMs appear to do a "lot", but still confabulate and repeat themselves incessantly, making them worthless to depend on for practical purposes unless you want to spend hours chasing your own tail over something they hallucinated. I don't see why this isn't the case. I thought we were trying to reduce the error rate in professional software development, not increase it.
You mean you don't summarize those terrible articles you happen to come across, where you're a little intrigued, hoping that there's some substance, and then you read, and it just repeats the same thing over and over again in different wording? Anyway, I sometimes still give them the benefit of the doubt and end up doing a summary. Often they get summarized into 1 or 2 sentences.
No, not really. I don't even know how to really respond to this, but maybe:
1. I don't read "terrible articles". I can skim an article and figure out if it's something I'm interested in.
2. I actually do read terrible articles and I have terrible taste
3. Any "summarization" I do that isn't from my direct reading is evaluated by the discussion around it. Though nowadays that's more and more spotty.
Yes, several times a day. I use summarization for webpages, messages, documents and YouTube videos. It’s super handy.
I mainly use a custom prompt using ChatGPT via the Raycast app and the Raycast browser extension.
That said, I don’t feel comfortable with the level of AI being shoved into browsers by their vendors.
Aren't you worried it will fuck up your comprehension skills? Reading or listening.
No, I still read a ton of articles and books. This just lets me be pickier about what I read. There's a lot of low quality and/or AI generated writing on the Internet.
Not him, but no. I read a ton already. Using LLMs to summarize a document is a good way to find out if I should bother reading it myself, or if I should read something else.
Skimming and being able to quickly decide if something is worth actually reading is itself a valuable skill.
There's a limit to how fast I can feasibly skim, and LLMs definitely do it faster.
I occasionally use the "summarize" button on the iPhone Mobile Safari reader view if I land on a blog entry and it's quite long and I want to get a quick idea of if it's worth reading the whole thing or not.
Yes. I use it sometimes in Firefox with my local LLM server. Sometimes I come across an article I'm curious about but don't have the time or energy to read. Then I get a TL;DR from it. I know it's not perfect, but the alternative is not reading it at all.
If it does interest me then I can explore it. I guess I do this once a week or so, not a lot.
I highly doubt that no information would be worse than wrong information. Both wars in Ukraine and Gaza show this very clearly.
I just use it for personal information, I'm not involved in any wars :) I don't base any decisions on it; for example, if I buy something I don't go by just AI stuff to make a decision. I use the AI to screen reviews, things like that (generally I prefer really deep reviews and not glossy consumer-focused ones). Then I read the reviews that are suitable to me.
And even reading an article about those myself doesn't make me insusceptible to misinformation of course. Most of the misinformation about these wars is spread on purpose by the parties involved themselves. AI hallucination doesn't really cause that, it might exacerbate it a little bit. Information warfare is a huge thing and it has been before AI came on the scene.
Ok, as a more specific example: recently I was thinking of buying the new Xreal Air 2. I have the older one, but I have 3 specific issues with it. I used AI to find references about these issues being solved. This was the case, and the AI confirmed it directly with references, but in further digging myself I found that there was also a new issue introduced with that model involving blurry edges. So in the end I decided not to buy the thing. The AI didn't identify that issue (though to be fair I didn't ask it to look for any).
So yeah, it's not an all-knowing oracle and it makes mistakes, but it can help me shave some time off such investigations. Especially now that search engines like Google are so full of clickbait crap and sifting through that shit is tedious.
In that case I used OpenWebUI with a local LLM model that speaks to my SearXNG server which in turn uses different search engines as a backend. It tends to work pretty well I have to say, though perplexity does it a little better. But I prefer self-hosting as much as I can (of course the search engine part is out of scope there).
Even if you know about and act against mis- and disinformation, it affects you, and you voluntarily increase your exposure to it. And the situation is already terrible.
I gave the example of wars because it's obvious, even for you, and you won't relativize it away the same way you just did with AI misinformation, which affects you the exact same way.
Haven’t tried them but I can see these features being really useful for screen reader users.
Yes.
Most recently, a new ISP contract: it's low-stakes enough that I don't care much about inaccuracies (it's a bog-standard contract from a run-of-the-mill ISP), there's basically no information in there that the cloud vendor doesn't already have (if they have my billing details), but I was still curious whether anything might jump out, all while not really wanting to read the 5 pages of the thing.
Just went back to that: it got all of the main items (pricing, contract terms, my details) right, and also the annoying fine print (which I cross-referenced, just in case). It also works pretty well across languages, though that depends a bunch on the model in question.
I feel like if browsers or whatever get the UX of this down, people will upload all sorts of data into those vendors that they normally shouldn't. I also think that with nuanced enough data, we'll eventually have the LLM equivalent of Excel messing up data due to some formatting BS.
Looking back with fresh eyes, I definitely think I could’ve presented what I’m trying to say better.
You're right that I'm drawing a distinction that may not hold up on purely technical grounds. Maybe the better framing is: I trust constrained, single-purpose models with somewhat verifiable outputs (seeing text go in and translated text come out, and comparing for consistency) more than I trust general-purpose models with broad access to my browsing context, regardless of whether they're both neural networks under the hood.
WRT the "scope", maybe I have picked up the wrong end of the stick with what Mozilla are planning to do, but they've already picked all the low-hanging fruit of AI integration with the features you've mentioned, and the fact that they seem to want to dig their heels in further signals, at least to me, that they want deeper integration? Although who knows; the post from the new CEO may also be a litmus test to see what response it elicits, and then go from there.
I still don’t understand what you mean by “what they do with your data” - because it sounds like exfiltration fear mongering, whereas LLMs are a static series of weights. If you don’t explicitly call your “send_data_to_bad_actor” function with the user’s I/O, nothing can happen.
I disagree that it’s fear mongering. Have we not had numerous articles on HN about data exfiltration in recent memory? Why would an LLM that is in the drivers seat of a browser (not talking about current feature status in Firefox wrt to sanitised data being interacted with) not have the same pitfalls?
Seems as if we’d be 3 for 3 in the “agents rule of 2” in the context of the web and a browser?
> [A] An agent can process untrustworthy inputs
> [B] An agent can have access to sensitive systems or private data
> [C] An agent can change state or communicate externally
https://simonwillison.net/2025/Nov/2/new-prompt-injection-pa...
Even if we weren’t talking about such malicious hypotheticals, hallucinations are a common occurrence as are CLI agents doing things it thinks best, sometimes to the detriment of the data it interacts with. I personally wouldn’t want my history being modified or deleted, same goes with passwords and the like.
It is a bit doomerist, I doubt it’ll have such broad permissions but it just doesn’t sit well which I suppose is the spirit of the article and the stance Waterfox takes.
> Have we not had numerous articles on HN about data exfiltration in recent memory?
there’s also an article on the front page of HN right now claiming LLMs are black boxes and we don’t know how they work, which is plainly false. this point is hardly evidence of anything and equivalent to “people are saying”
This is true though. While we know what they do on a mechanistic level, we cannot reliably analyze why the model outputs any particular answer in functional terms without a heroic effort at the "arxiv paper" level.
that’s true of analyzing individual atoms in a combustion engine — yet I doubt you’d claim we don’t know how they work
also this went from “we can’t analyze” to “we can’t analyze reliably [without a lot of effort]” quite quickly
In the digital world, we should be able to go back from output to input unless the intention of the function is to "not do that". Like hashing.
LLMs not being able to go from output back to input deterministically, and us not understanding why, is very important; most of our issues with LLMs stem from this. It's why mechanistic interpretability research is so hot right now.
The car analogy is not good because models are digital components and a car is a real world thing. They are not comparable.
I mean, fluid dynamics is an unsolved issue. But even so we know *considerably* less about how LLMs work in functional terms than about how combustion engines work.