Building a Simple Search Engine That Works

karboosx.net

240 points by freediver 14 hours ago


marginalia_nu - 9 hours ago

The idea behind search itself is very simple, and it's a fun problem domain that I encourage anyone to explore[1].

The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.

A DBMS-backed approach breaks down surprisingly fast. It's probably perfectly fine if you're indexing your own website, but it will likely choke on something the size of English Wikipedia.

[1] The SeIRP e-book is a good (free) starting point https://ciir.cs.umass.edu/irbook/

franczesko - an hour ago

A long time ago, I really enjoyed a course by David Evans from the University of Virginia about building a search engine and the concepts of computer science.

Building a "classic" search engine is a very fun project to go through.

https://www.cs.virginia.edu/~evans/courses/cs101/

List: https://www.youtube.com/watch?v=9nkR2LLPiYo&list=PLAwxTw4SYa...

Dave's profile: https://www.cs.virginia.edu/~evans/

entropoem - 7 hours ago

Searching in general is difficult. It is really a difficult thing.

If you haven't felt it, look at companies like Apple, Microsoft, or "the most important AI research lab in the world", OpenAI. Their products have terrible search features even though their resources (money, technology) can be considered top-notch.

tombert - 10 hours ago

About a decade ago, I was working with a guy who was getting a PhD in search engine design, which I knew/know nothing about.

It was actually a lot of fun to chat with him, because he was so enthusiastic about how searching works and how it can integrate with databases, and he was eager to explain this all to anyone who would listen. I learned a fair amount from him, though admittedly I still don't know much about the intricacies of how search engines work.

Some day, I am going to really go through the guts of Apache Solr and Lucene to understand the internals (like I did for Kafka a few years ago), and maybe I'll finally be competent with it.

renegade-otter - 6 hours ago

It's an interesting exercise. Having built search before OSS products were readily available, and when even the commercial offerings sucked, my advice is: do not ever build your own a) database or b) search engine unless you can clearly state the reason for doing so.

Entire cubicle farms of people have been devoted to this problem for years, and if you dare to do this for work because "I think I can", you will find yourself in an ocean of hurt.

"Hey, so it won't be so hard to add 'did you mean' functionality, right? And we were thinking of adding a taxonomy next year for easy navigation..."

Check. Mate.

wink - 3 hours ago

My pet peeve for search engines for content I use is that they regularly ignore 2-letter and 3-letter "words" or acronyms. If all I need is a search for "mp3" then stripping exactly that is not useful ;) (was just the first file extension that came to my mind, but "PHP" works just as well).
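
A minimal sketch of how this can happen, assuming a tokenizer that applies a minimum token length filter. The `tokenize` function and its `min_length` parameter are hypothetical and not taken from any particular engine:

```python
import re


def tokenize(text, min_length=1):
    """Lowercase, split on non-alphanumerics, drop tokens shorter than min_length."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if len(t) >= min_length]


doc = "Convert mp3 files with PHP"

# A length filter of 4 silently drops the short terms a user may search for.
filtered = tokenize(doc, min_length=4)

# Keeping every token preserves "mp3" and "php" in the index.
unfiltered = tokenize(doc)

print(filtered)
print(unfiltered)
```

If the same filter is applied to queries, a search for "mp3" ends up with no terms at all, which would explain the behavior described above.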

dominicrose - 3 hours ago

I wonder how well it would scale. Elasticsearch's performance is impressive even at an unrecommended scale.

nmstoker - 10 hours ago

Reminds me of reading Programming Collective Intelligence by Toby Segaran, which inspired me with a range of things, like building search, recommenders, classifiers etc.

mobeigi - 11 hours ago

Great read. It makes you wonder how heavily optimised the tokenizers used by popular search engines truly are.

journal - 5 hours ago

why isn't there a place to post something where someone else will find it when searching that doesn't require auth? i get the logistics of what i'm asking, but i really think we need a global index.

precompute - 10 hours ago

Incredible article. Does what it claims in the title, is written well and follows a linear chain of reasoning with a minimum of surprises.

jillesvangurp - 8 hours ago

Building a simple text search engine isn't that hard. People show them off on HN on a fairly regular basis. Most of those are fairly primitive. Unfortunately, building a good search engine isn't that straightforward. There's more to it than just implementing BM25 (the go-to ranking algorithm), which you can vibe code in a few minutes these days. This is easy because it is nineties-era research that is well publicized and documented, and not all that hard once you figure it out.
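
For reference, a minimal sketch of BM25 scoring, assuming whitespace tokenization and one common formulation of the IDF term. The function names and the defaults `k1=1.2`, `b=0.75` are illustrative choices, not taken from the article or any specific library:

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization, no stemming.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = ["the quick brown fox", "the lazy dog", "quick quick fox jumps"]
scores = bm25_scores("quick fox", docs)
print(scores)
```

This covers only the ranking formula. Everything else a production engine does (analysis, index structures, query parsing) is left out.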

Building your own search engine is a nice exercise for understanding how search works. It gets you to the same level as a very long tail of "Elasticsearch alternatives" that really aren't coming even close to implementing a tiny percentage of its feature set. That can be useful as long as you are aware of what you are missing out on.

For a few years I've been consulting for companies moving from in-house coded solutions to something proper (typically OpenSearch/Elasticsearch). Usually people fight themselves into a corner where their in-house solution starts simple and then grows more complicated as they inevitably deal with ranking problems their users encounter. Usual symptoms: "it's slow" (they are doing silly shit with multiple queries against Postgres or whatever), "it's returning the wrong things" (it turns out that trigrams aren't a one-size-fits-all solution and return false positives), etc. Add aggregations and other things to the mix and you basically have a perfect use case for Elasticsearch about 10 years ago, before they started making it faster, smarter, and better.
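
To illustrate the trigram false-positive problem, here is a simplified sketch using Jaccard similarity over character trigrams. It assumes no padding or word splitting, so it is not how any specific database extension computes similarity, and the `THRESHOLD` value is a hypothetical cutoff:

```python
def trigrams(word):
    """Set of overlapping 3-character substrings of a lowercased word."""
    w = word.lower()
    return {w[i:i + 3] for i in range(len(w) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity between the trigram sets of two strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


THRESHOLD = 0.3  # hypothetical cutoff for treating two terms as a match

# "search" and "research" share four of six distinct trigrams.
sim = trigram_similarity("search", "research")
is_match = sim >= THRESHOLD

print(round(sim, 3), is_match)
```

A user looking for "search" may not want documents that only mention "research", yet character overlap alone treats them as close matches.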

The usual arguments against Elasticsearch & OpenSearch:

"Elasticsearch/OpenSearch are hard to run". In reality, there isn't a whole lot to configure these days. Yes, you might want to take care of monitoring, backups, and a few other things, as you would with any server product. But it mostly self-configures. In particular, you shouldn't have to fiddle with heap settings, garbage collection, etc. The out-of-the-box defaults work fine. Get a managed setup if all this scares you; those typically run with the same defaults. Honestly, running Postgres is harder. There's way more to configure for that, especially for high availability setups. The hardest part is sizing your VMs correctly and making sure you don't blow through your limits by indexing too much data. Most of your optimizations are going to be at the index mapping level, not in the configuration.

"It's slow". That depends on what you do and how you use it. Most of the simple alternatives have some hard limitations. If you under-engineer your search (poor ranking, lots of false positives), it's probably going to be faster. That's what happens if you skip all the fancy algorithmic stuff that could make your search better. I've seen all the rookie mistakes that people make with Elasticsearch that impact performance. They are usually fairly easy to fix (e.g. let's turn off dynamic mapping and not index all those text fields you never query on that fill up your disk and memory and hurt your indexing performance ...).

"I don't need all that fancy stuff". Yes you do. You just don't know it yet because you haven't figured out what's actually needed. Look, if your search isn't great and it doesn't matter, it's all fine. But if search quality matters and you lose users' interest when they fail to find stuff in your app/website, it can quickly become an existential problem. Especially if you have competitors that do much better. That fancy stuff is what you would need to build to solve that.

Unless you employ some hardcore search ranking experts, your internally crafted thing is probably not going to be great. If you can afford to run at ~2005 era state of the art (Lucene existed, Solr & Elasticsearch did not, Lucene was fairly limited in scope), then go for it. But it's going to be quite limited when you need those extra features after all.

There are some nice search products out there other than Elasticsearch & OpenSearch that I would consider fit for purpose, especially if you want to do vector search. And in fairness, using a search engine properly still requires a bit of skill. But that isn't any different if you do things yourself. Except it involves a lot less wheel reinvention.

There just is a bit of necessary complexity to building a good search product.
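
As a rough illustration of the mapping-level tuning mentioned earlier (turning off dynamic mapping and not indexing text fields you never query), here is a sketch of the kind of body you might send when creating an index. The field names are made up for illustration; `dynamic` and `index` are Elasticsearch mapping parameters, but check the documentation for your version before relying on the details:

```python
import json

# Hypothetical index mapping; the field names are made up for illustration.
index_body = {
    "mappings": {
        # Fields not listed under "properties" are kept in the stored
        # document but are not added to the mapping or indexed.
        "dynamic": False,
        "properties": {
            "title": {"type": "text"},
            "body": {"type": "text"},
            # Stored with the document but not searchable.
            "raw_notes": {"type": "text", "index": False},
        },
    }
}

request_json = json.dumps(index_body, indent=2)
print(request_json)
```

The point is that only fields you actually query on pay the indexing cost.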

shevy-java - 10 hours ago

Good. Now please someone replace Google's search engine.

I am always annoyed by how bad it is these days. Then I try the alternatives such as DuckDuckGo and they manage to be even worse.

Qwant is semi-ok but it also omits tons of things that Google Search finds (and also is slower, for some weird reason).

Google's UI nerf is also annoying: so much useless stuff. In the past I could disable that via uBlock Origin, but Google killed that for Chrome.

We need to do something against this Evil that Google brought into this world.

pelasaco - 6 hours ago

Love the style, colors and the cookie popup on https://karboosx.net/. Does anyone know if an open source framework/style/tool is being used here, or is it just the superb web dev skills of the author?

eduction - 12 hours ago

I completely agree with the insight that full text search has been complexified. People seem to want to jump straight to clustering or other enterprise level things.

I also appreciate the moxie of getting in there and building it yourself.

Myself, I reach for Lucene. Then you don't need to build all this yourself if you don't want to. It lives in a dir on disk. True, it's a separate database, but one optimized for this problem.