Vector database that can index 1B vectors in 48M

93 points by mathewpregasen 14 hours ago

I would like to see a “DataFusion for Vector databases,” i.e. an embeddable library that Does One Thing Well – fast embedding generation, index builds, retrieval, etc. – so that different systems can glue it into their engines without reinventing the core vector capabilities every time. Call it a generic “vector engine” (or maybe “embedding engine” to avoid confusion with “vectorized query engine.”)

Currently, every new solution is either baked into an existing database (Elastic, pgvector, Mongo, etc) or an entirely separate system (Milvus, now Vectroid, etc.)

There is a clear argument in favor of the pgvector approach, since it simply brings new capabilities to 30 years of battle-tested database tech. That’s more compelling than something like Milvus that has to re-invent “the rest of the database.” And Milvus is also a second system that needs to be kept in sync with the source database.

But pgvector is still _just for Postgres_. It’s nice that it’s an extension, but in the same way Milvus has to reinvent the database, pgvector needs to reinvent the vector engine. I can’t load pgvector into DuckDB as an extension.

Is there any effort to make a pure, Unix-style, batteries not included, “vector engine?” A library with best-in-class index building, retrieval, storage… that can be glued into a Postgres extension just as easily as it can be glued into a DuckDB extension?

talipozturk - 8 hours ago

I think we have so many of those nice open source libraries, but the problem is not the library or the algorithm (HNSW or IVF derivatives). The problem is figuring out the right distributed architecture to balance cost, accuracy (recall) and speed (latency). I believe no single library will give you all that. For instance, if you don't separate writes (indexing) from reads (queries) and scale them separately, then either your indexing will suck or your indexing will destroy your read latency. You won't be able to scale as easily either. I believe that is why AWS created Aurora and Google Cloud created AlloyDB: to scale relational databases (MySQL/PostgreSQL) by separating reads from writes, implementing a scalable storage backend, and offloading a lot of shared work (replication, compaction, indexing) to a cluster of machines.
- chatmasta - 8 hours ago
  
  Yeah, I feel like these libraries are all one level lower than what I’m asking for. We need something that makes more assumptions (e.g. “I’m running as a component of some kind of database”) but… makes fewer decisions? Is more flexible? Idk. This is the hard part.
  DataFusion nailed this balance between an embedded query engine and a standalone database system. It brings just the right amount of batteries that it’s not a super generic thing that does nothing useful out of the box, but it doesn’t bring so many that it needs to compete with full database systems.
  I believe the maintainers refer to it as “the IR of databases” and I’ve always liked that analogy. That’s what I’d like to see for vector engines.
  Maybe what we need as a prerequisite is the equivalent of the Arrow/Parquet ecosystem for vectors. DataFusion really leverages those standards for interoperability and performance. This also goes a long way toward the architectural decisions you reference: Arrow and Parquet are a solid, “good enough” choice for in-memory and storage formats that are efficient, flexible, and well supported. Is there something similar for vector storage?
- whakim - 7 hours ago
  
  I couldn't agree with this more. I don't think the majority of problems with vector search at scale are vector search problems (although filtering + ANN is definitely interesting), they're search-problems-at-scale problems.
AlexClickHouse - 8 hours ago

USearch is this type of library: https://github.com/unum-cloud/usearch
Used in ClickHouse and a few other DBMS.
maxxen - 7 hours ago

Soo… usearch? It's literally one header file (of what used to be strict C++11). Funnily enough, that is what is used in the official duckdb-vss extension.
Disclaimer: I wrote duckdb-vss
jeadie - 7 hours ago

We’re building vector indexes into DataFusion for search (starting with S3 vectors).
Open source at https://github.com/spiceai/spiceai
itake - 7 hours ago

why not use this? https://github.com/facebookresearch/faiss

ge96 - 14 hours ago

M is minutes

HarHarVeryFunny - 13 hours ago

I was starting to think this was impressive, if not impossible. 1B vectors in 48 MB of storage => < 1 bit per vector.
Maybe not impossible using shared/lossy storage if they were sparsely scattered over a large space ?
But anyways - minutes. Thanks.
Edit: Gemini suggested that this sort of (lossy) storage size could be achieved using "Product Quantization" (sub vectors, clustering, cluster indices), giving an example of 256 dimensional vectors being stored at an average of 6 bits per vector, with ANN being one application that might use this.
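For reference, the arithmetic behind the "< 1 bit per vector" remark, together with the usual code-size formula for product quantization, can be sketched as follows. This assumes 48 MB means 48 × 10^6 bytes. The function names and the PQ parameters (8 sub-vectors, 256 centroids each) are illustrative choices, not figures from the thread.

```python
import math

def bits_per_vector(total_bytes: int, num_vectors: int) -> float:
    """Average storage budget per vector, in bits."""
    return total_bytes * 8 / num_vectors

def pq_code_bits(num_subvectors: int, centroids_per_subspace: int) -> int:
    """Size of one PQ code: each sub-vector is stored as a centroid index."""
    return num_subvectors * math.ceil(math.log2(centroids_per_subspace))

# Hypothetical reading of the title: 1B vectors in 48 MB (decimal megabytes).
budget = bits_per_vector(48 * 10**6, 10**9)

# Illustrative PQ setup (assumed, not from the thread): 8 sub-vectors, 256 centroids each.
code = pq_code_bits(8, 256)

print(f"budget: {budget:.3f} bits/vector, example PQ code: {code} bits/vector")
```

A PQ code needs at least one bit per vector (one sub-vector, two centroids), and the codebooks take space on top of that, so per-vector PQ codes alone would not fit a budget below one bit.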
gaogao - 11 hours ago

Yeah, the SI symbol for minutes is min, if you're going to abbreviate it in a technical context. Super funky using M.
- williamscales - 10 hours ago
  
  Agree the correct abbreviation is min.
  Nitpick: could be wrong but I don’t think minutes is an SI derived unit.
stevemk14ebr - 13 hours ago

Thank you, title needs edited.
ikanade - 14 hours ago

Legend
l5870uoo9y - 13 hours ago

Thankfully not months.
- softwaredoug - 13 hours ago
  
  Oh, the horrors of search indexing I've seen... including weeks / months to rebuild an index.

softwaredoug - 13 hours ago

Not trying to be snarky, just curious -- How is this different from TurboPuffer and other serverless, object storage backed vector DBs?

hungarianhc - 12 hours ago

Hey! It's a great question. Co-founder of Vectroid here.
Today, the differences are going to be performance, price, accuracy, flexibility, and some intangible UI elegance.
Performance: We actually INITIALLY built Vectroid for the use-case of billions of vectors and near single digit millisecond latency. During the process of building and talking to users, we found that there are just not that many use-cases (yet!) that are at that scale and require that latency. We still believe the market will get there, but it's not there today. So we re-focused on building a general purpose vector search platform, but we stayed close to our high performance roots, and we're seeing better query performance than the other serverless, object storage backed vector DBs. We think we can get way faster too.
Price: We optimized the heck out of this thing with object storage, pre-emptible virtual machines, etc. We've driven our cost down, and we're passing this to the user, starting with a free tier of 100GB. Actual pricing beyond that coming soon.
Accuracy: In our initial testing, we see recall greater than or equal to that of competitors out there, all while being faster.
Flexibility: We are going to have a self managed version for users who want to run on their own infra, but admittedly, we don't have that today. Still working on it.
Other Product Elegance: My co-founder, Talip, made Hazelcast, and I've always been impressed by how easy it is to use and how the end to end experience is so elegant. As we continue to develop Vectroid, that same level of polish and focus on the UX will be there. As an example, one neat thing we rolled out is direct import of data from Hugging Face. We have lots of other cool ideas.
Apologies for the long winded answer. Feel free to ping us with any additional questions.
- f311a - 11 hours ago
  
  I’m curious, what’s the tech stack behind this?
  - talipozturk - 8 hours ago
    
    Vectroid is a pure Java solution based on a modified version of Lucene. We use a custom-built FileSystem to work directly with GCS (Google Cloud Storage). It is a Terraform/Helm-managed Kubernetes deployment.

1999-03-31 - 12 hours ago

1B vectors is nothing. You don’t need to index them. You can hold them in VRAM on a single node and run queries with perfect accuracy in milliseconds
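What "no index, perfect accuracy" amounts to is an exhaustive scan: score the query against every stored vector and keep the best k. Below is a minimal NumPy sketch on CPU with a toy dataset. The name `exact_top_k` and the sizes are made up for illustration; a GPU version would run the same matrix-vector product against data held in VRAM.

```python
import numpy as np

def exact_top_k(vectors: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k vectors with the largest dot product with query, best first."""
    scores = vectors @ query                    # one score per stored vector
    top = np.argpartition(-scores, k - 1)[:k]   # the k best, in no particular order
    return top[np.argsort(-scores[top])]        # order those k by descending score

# Toy data: 10,000 unit-length vectors of 64 dimensions (illustrative sizes).
rng = np.random.default_rng(0)
vectors = rng.standard_normal((10_000, 64)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Querying with a stored vector: on unit vectors its own score is the maximum.
query = vectors[42]
nearest = exact_top_k(vectors, query, k=5)
print(nearest)
```

The scan touches every byte of the dataset on every query, so whether this is practical at 1B vectors comes down to how much memory the vectors need and how fast it can be read.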

eknkc - 10 hours ago

I guess for 2D vectors that would work?
For 1024 dimensions, even with 8-bit quantization, you are looking at a terabyte of data. Let's make it binary vectors; it is still 128GB of VRAM.
WAT?
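The footprint numbers above follow from vectors × dimensions × bits per dimension. A small sketch, assuming decimal units (1 GB = 10^9 bytes) and counting only the raw vector data; `footprint_bytes` is a hypothetical helper.

```python
def footprint_bytes(num_vectors: int, dims: int, bits_per_dim: int) -> int:
    """Raw storage for the vectors alone, with no index or metadata."""
    return num_vectors * dims * bits_per_dim // 8

N = 1_000_000_000  # 1B vectors

int8_1024 = footprint_bytes(N, 1024, 8)    # 8-bit quantization, 1024 dims
binary_1024 = footprint_bytes(N, 1024, 1)  # 1 bit per dimension, 1024 dims
float32_2d = footprint_bytes(N, 2, 32)     # the 2D float32 case

for name, size in [("int8 x 1024", int8_1024),
                   ("binary x 1024", binary_1024),
                   ("float32 x 2", float32_2d)]:
    print(f"{name}: {size / 10**9:,.0f} GB")
```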
adastra22 - 10 hours ago

1B x 4096 = 4T scalars.
That doesn't fit in anyone's video ram.
- kingstnap - 6 hours ago
  
  Well we have AI GPUs now so you could do it.
  Each MI325x has 256 GB of HBM. So you would need ~32 of em if it was 2 bytes per scalar.
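Turning that into a device count is a ceiling division. The sketch below assumes decimal gigabytes and ignores memory needed for queries, scratch buffers, or the runtime, so a real deployment would need somewhat more; `gpus_needed` is a hypothetical helper.

```python
def gpus_needed(num_vectors: int, dims: int, bytes_per_scalar: int,
                hbm_bytes_per_gpu: int) -> int:
    """Smallest number of GPUs whose combined memory holds the raw vectors."""
    total = num_vectors * dims * bytes_per_scalar
    return -(-total // hbm_bytes_per_gpu)  # ceiling division on integers

# 1B vectors x 4096 dims x 2 bytes, on GPUs with 256 GB of HBM each.
count = gpus_needed(10**9, 4096, 2, 256 * 10**9)
print(count)
```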
lyu07282 - 11 hours ago

Show your math lol
- Voloskaya - 10 hours ago
  
  I assume by "node" OP meant something like a DGX node. Which yea, that would work, but not everyone (no one?) wants to buy a 500k system to do vector search.
  B200 spec:
  * 8TB/sec HBM bandwidth
  * 10 PetaOPs assuming int8.
  * 186GB of VRAM.
  If we work with 512-dimensional int8 embeddings, then we need 512GB VRAM to hold them, so assuming we have 8xB200 node (~500k$++), we can easily hold them (125M vectors per GPU).
  It takes about 1000 OPs to do the dot product between two vectors, so we need to do 1000*1B = 1TeraOPs, spread over 8 GPUs, that's 125 GigaOPs per GPU, so a fraction of a ms.
  Now the bottleneck will be data movement between HBM -> chips, since we have 125M vectors per GPU, aka 64GB, we can move them in ~8 ms.
  Here you go, the most expensive vector search in history, giving you the same performance as a regular CPU-based vectorDB for only 1000x the price.
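The estimate above can be written out as a short calculation that compares compute time with memory-read time per query. The hardware figures are the ones quoted in this comment, not independently checked specs. The function name is hypothetical, and the sketch counts one multiply plus one add per dimension (1024 operations for 512 dimensions, close to the "about 1000" used above).

```python
def brute_force_estimate(num_vectors: float, dims: int, bytes_per_dim: int,
                         num_gpus: int, ops_per_sec_per_gpu: float,
                         hbm_bytes_per_sec_per_gpu: float) -> tuple[float, float]:
    """Per-query compute time and memory-read time in milliseconds, per GPU."""
    vectors_per_gpu = num_vectors / num_gpus
    ops_per_gpu = vectors_per_gpu * 2 * dims              # multiply + add per dimension
    bytes_per_gpu = vectors_per_gpu * dims * bytes_per_dim
    compute_ms = ops_per_gpu / ops_per_sec_per_gpu * 1e3
    memory_ms = bytes_per_gpu / hbm_bytes_per_sec_per_gpu * 1e3
    return compute_ms, memory_ms

# 1B int8 vectors of 512 dims on 8 GPUs, 10 PetaOPs and 8 TB/s each (as quoted above).
compute_ms, memory_ms = brute_force_estimate(1e9, 512, 1, 8, 10e15, 8e12)
print(f"compute: {compute_ms:.4f} ms, memory read: {memory_ms:.1f} ms")
```

Under these assumptions the scan is bound by memory bandwidth rather than arithmetic, which matches the conclusion above.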
  - lyu07282 - 4 hours ago
    
    Thanks for doing the math! I suppose if we are charitable in practice we would of course index and only offload partially to VRAM (FAISS does that with IVF/PQ and similar).

kgeist - 10 hours ago

There was recently this paper: https://arxiv.org/abs/2508.21038

They show that with 4096-dimensional vectors, accuracy starts to fail at 250 mln documents (fundamental limits of embedding models). For 512-dim, it's just 500k.

Is 1 bln vectors practical?

yorwba - 9 hours ago

Those numbers are for the case where you want all possible pairs of two vectors to have a corresponding query that returns those vectors as the top two results.
If you mostly just want to find a particular single vector if possible and don't care so much what the second-best result is, you can get away with much smaller embeddings.
And if you do want to cover all possible pairs, 6500 dimensions or so should be enough. (Their empirical results roughly fit a cubic polynomial.)
OutOfHere - 10 hours ago

I would think that 1 bln refers to the row count, not to a vector's length.

ashvardanian - 13 hours ago

Very curious about the hardware setup used for this benchmark!

talipozturk - 11 hours ago

No special hardware. Google Cloud vms. We use multiple of them during index building.
- ashvardanian - 10 hours ago
  
  The question is how many, and what kind of VMs you use? It greatly affects performance :)
  I run a lot of search-related benchmarks (https://github.com/ashvardanian) and I'm curious if you’ve compared to other engines on the same hardware setup, tracing recall, NDCG, indexing, and query speeds.
  - talipozturk - 7 hours ago
    
    We shard the data and index on about 6 x n2-standard-96 spot instances, so the total cost of indexing the entire deep1b is less than $12. We are working on getting it down to $6. We separate indexing and query VMs; for queries we use dedicated VMs. USearch numbers look great and are better than ours if you run the query and indexing on the same VM/node. We believe a distributed, task-oriented design is the right way to handle vector search for thousands of tenants with different-sized datasets. Data ingest is also a separate task for us, so ingest, index, and query are all handled by different clusters of VMs.
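As a rough sanity check on those figures, the implied ceiling on the price per VM-hour can be worked out as below. This assumes exactly six VMs running for the full 48 minutes from the title, which the comment does not state ("about 6"), and it uses no actual cloud pricing; `implied_hourly_rate` is a hypothetical helper.

```python
def implied_hourly_rate(total_cost_usd: float, num_vms: int, minutes: float) -> float:
    """Cost per VM-hour implied by a total bill, a VM count, and a wall-clock time."""
    vm_hours = num_vms * minutes / 60
    return total_cost_usd / vm_hours

current_ceiling = implied_hourly_rate(12.0, 6, 48)  # from "less than $12"
target_ceiling = implied_hourly_rate(6.0, 6, 48)    # from the stated $6 goal

print(f"under ${current_ceiling:.2f} per VM-hour today, "
      f"${target_ceiling:.2f} at the $6 target")
```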

esafak - 12 hours ago

By the creator of the real-time data platform https://en.wikipedia.org/wiki/Hazelcast.

cluckindan - 11 hours ago

How is this different from running tuned HNSW vector indices on Elasticsearch?

talipozturk - 7 hours ago

Co-founder of Vectroid here. We forked Lucene. Lucene is awesome for search in general, filters, and obviously full-text search. It is very mature and well supported by so many big names and amazing engineers. So we take advantage of that, but we had to change a few things to make it work perfectly for the vector use case. We basically think vectors should be the main data type, as they are the most difficult one to deal with. For instance, we modified Lucene to use X CPUs/threads to build a single segment index. As a result, if and when needed, we can use hundreds of CPUs to index more quickly and generate fewer segments, which enables lower query latency. We also built a custom file system Directory for Lucene to work off of GCS directly (or S3 later on). It can bypass the kernel, read from the network, and write directly into memory... no SSD, no page cache, no mmap involved. Perhaps I should not say more...
wwdmaxwell - 9 hours ago

Aside from being serverless, this is like Elasticsearch but with a kind of built-in Redis-like layer, I think.

OutOfHere - 13 hours ago

Proprietary closed-source lock-in. Nothing to see here.

CuriouslyC - 13 hours ago

Seriously. The amount of lift a SaaS product needs to give me is insane for me to even bother evaluating it, and there's a near zero percent chance I'll use it in my core.
- kcb - 12 hours ago
  
  Especially a product that demands access to large quantities of your most sensitive data to be useful.
- esseph - 5 hours ago
  
  I really feel like we're heading down the slope of a large section of the internet dying off, and if that happens I think it may fracture even more than it already has globally.
HEmanZ - 13 hours ago

What do you think an alternative is for someone who:
1. Has a technical system they think could be worth a fortune to large enterprises, containing at least a few novel insights to the industry.
2. Knows that competitors and open source alternatives could copy/implement these in a year or so if the product starts off open source.
3. Has to put food on the table and doesn’t want to give massive corporations extremely valuable software for free.
Open source has its place, but it is IMO one of the ways to give monopolies massive value for free. There are plenty of open source alternatives around for vector DBs. Do we (developers) need to give everything away to the rich?
- mhuffman - 11 hours ago
  
  Traditionally the most profitable approach is offering enterprise support and consulting.
  - cluckindan - 11 hours ago
    
    Enterprises are so very fond of choosing novel open source technologies, too!
    (not)
    
    gloomyday - 10 hours ago
    
    I have been working for 4 years with "enterprise" software, and I feel like the whole field is some kind of collective insanity.
- OutOfHere - 10 hours ago
  
  Let's say the best open source product has a feature score of 70/100, and the best closed source product has a feature score of 85/100, and this is me being generous with the latter. The issue is that just by being closed source, it immediately loses 20/100, bringing its score to 65/100, which is below the open offering. A closed source product carries substantial risk if the company behind it were to stop maintaining it, which is why the adjustment by -20 applies.
  Secondly, as far as I know, the blocker with approximate nearest neighbor search is often not insertion, but search. And if this search was worth a fortune to me, I'd simply embarrassingly parallelize it on CPUs or on GPUs.
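One way to read "embarrassingly parallelize" is: split the vectors into shards, search each shard independently, then merge the per-shard results. Here is a minimal sketch of that pattern using exact dot-product search and a thread pool standing in for separate workers or machines. All names and sizes are illustrative, and this is not a description of any particular product.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def search_shard(shard: np.ndarray, offset: int, query: np.ndarray, k: int):
    """Exact top-k inside one shard, returned as (score, global_index) pairs."""
    if len(shard) == 0:
        return []
    scores = shard @ query
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]
    return [(float(scores[i]), offset + int(i)) for i in top]

def parallel_top_k(vectors: np.ndarray, query: np.ndarray, k: int,
                   num_shards: int = 4) -> list[int]:
    """Search every shard independently, then merge the partial results."""
    shards = np.array_split(vectors, num_shards)
    offsets, start = [], 0
    for shard in shards:          # global index of each shard's first row
        offsets.append(start)
        start += len(shard)
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        futures = [pool.submit(search_shard, s, o, query, k)
                   for s, o in zip(shards, offsets)]
        candidates = [c for f in futures for c in f.result()]
    candidates.sort(reverse=True)  # highest score first
    return [idx for _, idx in candidates[:k]]

rng = np.random.default_rng(1)
data = rng.standard_normal((5_000, 32))
q = rng.standard_normal(32)
print(parallel_top_k(data, q, k=10))
```

Shards never need to talk to each other during the search; the only shared step is merging at most k candidates per shard, which is what makes the pattern easy to spread across CPUs, GPUs, or nodes.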
hungarianhc - 12 hours ago

Not that locked in - you can just move your vectors to another platform, no?
Vectroid co-founder here. We're huge fans of open source. My co-founder, Talip, made Hazelcast, which is open source.
It might make sense to open source all or part of Vectroid at some point in the future, but at the moment, we feel that would slow us down.
I hate vendor lock-in just as much as the next person. I believe data portability is the ACTUAL counter to vendor lock-in. If we have clean APIs to get your data in, get your data out, and the ability to bulk export your data (which we need to implement soon!), then there's less of a concern, in my opinion.
I also totally understand and respect that some people only want open source software. I'm certainly like that w/ my homelab setup! Except for Plex... Love Plex... Usually.
stronglikedan - 13 hours ago

Nothing for you to see here. Surely you just aren't their target customer.
- OutOfHere - 13 hours ago
  
  So who is? Who really needs to index 1 billion new vectors every 48 minutes, or perhaps equivalently 1 million new vectors every 3 seconds?
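The equivalence between the two rates can be checked with a short calculation, taking the title's 1B vectors in 48 minutes at face value; `vectors_per_second` is a hypothetical helper.

```python
def vectors_per_second(num_vectors: int, minutes: float) -> float:
    """Sustained indexing rate implied by a total count and a wall-clock time."""
    return num_vectors / (minutes * 60)

rate = vectors_per_second(10**9, 48)
seconds_per_million = 10**6 / rate

print(f"{rate:,.0f} vectors/s, or 1M vectors every {seconds_per_million:.2f} s")
```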
  - hansvm - 11 hours ago
    
    If HNSW were accurate enough (and if this DB were much faster) then I'd have a use case. I wound up going down a different route to create a differentiable database for ML shenanigans though.