Ask HN: How are you doing RAG locally?

184 points by tmaly 21 hours ago


I am curious how people are doing RAG locally with minimal dependencies for internal code or complex documents?

Are you using a vector database, some type of semantic search, a knowledge graph, a hypergraph?

oliveiracwb - 15 minutes ago

We handle ~300k customer interactions per day, so latency and precision really matter. We built an internal RAG-based portal on top of our knowledge base (basically a much better FAQ).

On the retrieval side, I built a custom search/indexing layer (Node) specifically for service traceability and discovery. It uses a hybrid approach — embeddings + full-text search + IVF-HNSW — to index and cross-reference our APIs, services, proxies and orchestration repos. The RAG pipelines sit on top of this layer, which gives us reasonable recall and predictable latency.

Compliance and observability are still a problem. Every year new vendors show up promising audits, data lineage and observability, but none of them really handle the informational sprawl of ~600 distributed systems. The entropy keeps increasing.

Lately I’ve been experimenting with a more semantic/logical KAG approach on top of knowledge graphs to map business rules scattered across those systems. The goal is to answer higher-level questions about how things actually work — Palantir-like outcomes, but with explicit logic instead of magic.

Curious if others are moving beyond “pure RAG” toward graph-based or hybrid reasoning setups.

beklein - an hour ago

Most of my complex documents are, luckily, Markdown files.

I can recommend https://github.com/tobi/qmd/ . It’s a simple CLI tool for searching in these kinds of files. My previous workflow was based on fzf, but this tool gives better results and enables even more fuzzy queries. I don’t use it for code, though.

jackfranklyn - 13 minutes ago

For document processing in a side project, I've been using a local all-MiniLM model with FAISS. Works well enough for semantic matching against ~50k transaction descriptions.

The real challenge wasn't model quality - it was the chunking strategy. Financial data is weirdly structured and breaking it into sensible chunks that preserve context took more iteration than expected. Eventually settled on treating each complete record as a chunk rather than doing sliding windows over raw text. The "obvious" approaches from tutorials didn't work well at all for structured tabular-ish data.
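To make the chunking point above concrete, here is a minimal sketch contrasting the two strategies. The sample CSV, the column names and both function names are hypothetical; the original project's data format is not described in the comment.

```python
import csv
import io

def sliding_window_chunks(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Tutorial-style chunking: fixed-size character windows with overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def record_chunks(csv_text: str) -> list[str]:
    """One chunk per complete record, with column names kept inline for context."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["; ".join(f"{key}={value}" for key, value in row.items()) for row in reader]

# Hypothetical sample data, not taken from the original project.
SAMPLE = (
    "date,description,amount\n"
    "2024-01-03,CARD PAYMENT ACME COFFEE LONDON,-3.50\n"
    "2024-01-04,DIRECT DEBIT CITY ENERGY LTD,-82.10\n"
)

if __name__ == "__main__":
    print(sliding_window_chunks(SAMPLE)[1])  # a window that cuts across record boundaries
    for chunk in record_chunks(SAMPLE):
        print(chunk)  # each chunk is one self-describing transaction
```

The sliding windows split records mid-field, while the record chunks each carry their own field names, which is roughly the context-preserving property the comment is after.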

__jf__ - 3 hours ago

For vector generation I started using Meta-Llama-3-8B in April 2024 with Python and Transformers for each text chunk on an RTX-A6000. Wow, that thing was fast but noisy, and it also burns 500W. So a year ago I switched to an M1 Ultra and only had to replace Transformers with Apple's MLX Python library. Approximately the same speed but less heat and noise. The Llama model has 4k dimensions, so at fp16 that's 8 kilobytes per chunk, which I store in a BLOB column in SQLite via numpy.save(). Between running on the RTX and the M1 there is a very small difference in vector output, but not enough to change retrieval results or to make me regenerate the vectors or switch to another LLM.

For retrieval I load all the vectors from the SQLite database into a numpy.array and hand it to FAISS. faiss-gpu was impressively fast on the A6000, and faiss-cpu is slower on the M1 Ultra but still fast enough for my purposes (I'm firing a few queries per day, not per minute). For 5 million chunks memory usage is around 40 GB, which both fits into the A6000 and easily fits into the 128 GB of the M1 Ultra. It works, I'm happy.
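A rough, dependency-free sketch of the storage and retrieval flow described above. It is not the commenter's code: it swaps numpy.save() for the standard library's struct module (numpy.save() also writes a small header on top of the raw values) and swaps FAISS for a plain exact scan, which is the same idea a flat index implements far faster. The table name, helper names and placeholder vectors are assumptions. The sizes do line up with the comment: 4,096 dimensions x 2 bytes is 8,192 bytes per chunk, and 5,000,000 such chunks come to about 41 GB if they stay at fp16.

```python
import heapq
import math
import sqlite3
import struct

DIM = 4096  # "4k dimensions" from the comment; 2 bytes per value at fp16

def pack_fp16(vec: list[float]) -> bytes:
    """Serialise a vector as little-endian IEEE 754 half-precision floats."""
    return struct.pack(f"<{len(vec)}e", *vec)

def unpack_fp16(blob: bytes) -> list[float]:
    return list(struct.unpack(f"<{len(blob) // 2}e", blob))

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query: list[float], vectors: dict[int, list[float]], k: int = 2):
    """Exact scan: (score, row_id) pairs for the k most similar vectors, best first."""
    scored = ((cosine(query, vec), row_id) for row_id, vec in vectors.items())
    return heapq.nlargest(k, scored)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, embedding BLOB)")

# Placeholder vectors standing in for real model output.
samples = {
    1: [1.0] + [0.0] * (DIM - 1),
    2: [0.0, 1.0] + [0.0] * (DIM - 2),
    3: [0.5, 0.5] + [0.0] * (DIM - 2),
}
conn.executemany(
    "INSERT INTO chunks (id, embedding) VALUES (?, ?)",
    [(row_id, pack_fp16(vec)) for row_id, vec in samples.items()],
)

blob_size = conn.execute("SELECT length(embedding) FROM chunks WHERE id = 1").fetchone()[0]

# Load every stored vector into memory, as the comment does before handing off to FAISS.
loaded = {row_id: unpack_fp16(blob)
          for row_id, blob in conn.execute("SELECT id, embedding FROM chunks")}

query = [1.0, 0.2] + [0.0] * (DIM - 2)
results = top_k(query, loaded)

if __name__ == "__main__":
    print(blob_size)                         # bytes stored per chunk
    print([row_id for _, row_id in results])  # ids of the closest chunks
```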

lmeyerov - 37 minutes ago

Claude Code / Codex, which internally use ripgrep (I'm unsure whether it runs in parallel mode), plus project-specific static analyzers.

Studies generally show that agentic retrieval with plain text search is already pretty good. Adding vector retrieval and graph RAG, so the typical parallel multi-retrieval followed by reranking, gives a bit of speedup and quality lift. That lines up with my local flow experience: the lift is only big enough that I want it in $$$$ consumer/prosumer tools, and not easy enough to DIY that I want to invest in it locally. For those who struggle with tools like Spotlight running when it shouldn't, that kind of thing turns me off on the cost/benefit side.

For code, I experiment with unsound tools (semgrep, ...) vs sound flow analyzers, carefully set up for the project. Basically, AI coders love to use grep/sed for global replace refactors and other global needs, but keep getting tripped up on sound flow analysis. Similar to lint and type checking, that needs to be set up for a project and taught as a skill. I'm not happy with any of my experiments here yet, however :(

CuriouslyC - 7 hours ago

Don't use a vector database for code; embeddings are slow and bad for code. Code likes BM25 + trigram, which gets better results while keeping search responses snappy.
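As an illustration of what "BM25 + trigram" can look like, here is a small sketch that tokenises text into character trigrams and scores documents with the standard Okapi BM25 formula. It is one possible reading of the comment, not a description of any particular tool; the class name, the k1 and b defaults and the sample files are assumptions.

```python
import math
from collections import Counter

def trigrams(text: str) -> list[str]:
    """Overlapping 3-character tokens, lowercased."""
    s = text.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]

class TrigramBM25:
    def __init__(self, docs: dict[str, str], k1: float = 1.2, b: float = 0.75):
        self.k1, self.b = k1, b
        self.n = len(docs)
        self.tf = {name: Counter(trigrams(body)) for name, body in docs.items()}
        self.length = {name: sum(counts.values()) for name, counts in self.tf.items()}
        self.avg_len = sum(self.length.values()) / max(self.n, 1)
        # Document frequency: iterating a Counter yields each trigram once per document.
        self.df = Counter(t for counts in self.tf.values() for t in counts)

    def score(self, query: str, name: str) -> float:
        total = 0.0
        for t in set(trigrams(query)):
            f = self.tf[name].get(t, 0)
            if f == 0:
                continue
            idf = math.log(1 + (self.n - self.df[t] + 0.5) / (self.df[t] + 0.5))
            norm = f + self.k1 * (1 - self.b + self.b * self.length[name] / self.avg_len)
            total += idf * f * (self.k1 + 1) / norm
        return total

    def search(self, query: str, k: int = 5) -> list[tuple[str, float]]:
        scored = [(name, self.score(query, name)) for name in self.tf]
        return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

# Hypothetical mini code base.
DOCS = {
    "auth.py": "def verify_token(token): return jwt.decode(token)",
    "db.py": "def connect_database(url): return engine.connect(url)",
    "util.py": "def slugify(title): return title.lower().replace(' ', '-')",
}

if __name__ == "__main__":
    index = TrigramBM25(DOCS)
    print(index.search("verify_tok", k=1))  # partial identifiers still match via trigrams
```

Trigram tokens let a partial identifier such as `verify_tok` match without any stemming or embedding step, which is why this combination tends to suit code.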

acutesoftware - 2 hours ago

I am using LangChain with a SQLite database. It works pretty well on a 16 GB GPU, but I started running it on a crappy NUC, which also worked, though with worse results.

The real lightbulb moment is when you realise the ONLY thing a RAG passes to the LLM is a short string of search results with small chunks of text. This changes it from 'magic' to 'ahh, ok - I need better search results'. With small models you cannot pass a lot of search results ( TOP_K=5 is probably the limit ), otherwise the small models 'forget context'.

It is fun trying to get decent results, and it is a rabbit hole; the next step I am looking into is pre-summarising files and folders.

I open sourced the code I was using - https://github.com/acutesoftware/lifepim-ai-core
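The "lightbulb" point above, that the model only ever sees a short string built from the top search results, can be sketched like this. The prompt wording and the function name are hypothetical and not taken from the linked repository; TOP_K=5 is the limit suggested in the comment.

```python
TOP_K = 5  # limit suggested in the comment for small models

def build_prompt(question: str, hits: list[tuple[float, str]], top_k: int = TOP_K) -> str:
    """Keep only the top_k highest-scoring chunks and paste them into the prompt."""
    best = sorted(hits, key=lambda hit: hit[0], reverse=True)[:top_k]
    context = "\n\n".join(f"[{i}] {text}" for i, (_, text) in enumerate(best, start=1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

if __name__ == "__main__":
    # Hypothetical (score, chunk) pairs as a search layer might return them.
    hits = [(i / 10, f"chunk {i}") for i in range(8)]
    print(build_prompt("What does the backup job do?", hits))
```

Everything the model knows about the documents is in that one string, so improving answers mostly means improving which chunks make the cut.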

esperent - 4 hours ago

I'm lucky enough to have 95% of my docs in small Markdown files so I'm just... not (+). I'm using SQLite FTS5 (full text search) to build a normal search index and using that. Well, I already had the index so I just wired it up to my Mastra agents. Each file has a short description field, so if a keyword search surfaces the doc they check the description and if it matches, load the whole doc.

This took about one hour to set up and works very well.

(+) At least, I don't think this counts as RAG. I'm honestly a bit hazy on the definition. But there's no vectordb anyway.
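A minimal sketch of the FTS5 setup described above, using Python's built-in sqlite3 module. The table layout, function names and sample rows are hypothetical, and it assumes the SQLite build that Python links against has FTS5 compiled in (most do). The agent wiring is left out; the two functions are the kind of tools an agent could be given.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# path is stored but not indexed; description and body are full-text indexed.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(path UNINDEXED, description, body)")
conn.executemany(
    "INSERT INTO docs (path, description, body) VALUES (?, ?, ?)",
    [
        ("deploy.md", "How to deploy the API to staging",
         "Run the deploy script, then check the health endpoint."),
        ("backup.md", "Nightly database backup procedure",
         "Backups run at 02:00 and are kept for 30 days."),
    ],
)

def search(query: str, limit: int = 5) -> list[tuple[str, str]]:
    """Keyword search returning (path, description) so the caller can judge relevance."""
    rows = conn.execute(
        "SELECT path, description FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()

def load_doc(path: str) -> str:
    """Load the whole document once its description looks relevant."""
    row = conn.execute("SELECT body FROM docs WHERE path = ?", (path,)).fetchone()
    return row[0] if row else ""

if __name__ == "__main__":
    for path, description in search("backup"):
        print(path, "-", description)
        print(load_doc(path))
```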

codebolt - an hour ago

Giving the LLM tools with an OData query interface has worked well for me. In C# it's pretty trivial to set up an MCP server with OData querying for an arbitrary data model. At work we have an Excel sheet with 40k rows which the LLM was able to quickly and reliably analyse using this method.

spqw - 4 hours ago

I am surprised to see very few setups leveraging LSP (Language Server Protocol) support. It was added to Claude Code last month. Most setups rely on naive grep.

bzGoRust - 3 hours ago

In my company, we built an internal chatbot based on RAG using LangChain + Milvus + an LLM. Since the documents are well formatted, overlapping chunking is easy, and all the chunks are then inserted into the Milvus vector DB. Hybrid search (combining dense and sparse search) is natively supported in Milvus and helps us retrieve better, which leads to better-quality answers.
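For readers wondering how dense and sparse results get combined, reciprocal rank fusion (RRF) is one common way to merge two ranked lists. The sketch below is plain Python and makes no claim about how Milvus fuses results internally; the document IDs are made up and k=60 is just the conventional default.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical result lists from a dense (vector) and a sparse (keyword) search.
dense = ["doc3", "doc1", "doc7"]
sparse = ["doc1", "doc9", "doc3"]
fused = reciprocal_rank_fusion([dense, sparse])

if __name__ == "__main__":
    for doc_id, score in fused:
        print(doc_id, round(score, 4))
```

Documents that appear in both lists rise to the top, without having to put dense and sparse scores on a common scale.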

podgietaru - 2 hours ago

I made a small RAG database just using Postgres. I outlined it in the blog post below. I use it for RSS Feed organisation, and searching. They are small blobs. I do the labeling using a pseudo-KNN algorithm.

https://aws.amazon.com/blogs/machine-learning/use-language-e...

The code for it is here: https://github.com/aws-samples/rss-aggregator-using-cohere-e...

The example link no longer works, as I no longer work at AWS.
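The "pseudo-KNN" labelling is not spelled out in the comment, so the following is only a guess at the general shape: label a new item by majority vote among the k labelled embeddings closest to it. The vectors, labels and function names are invented for illustration, and the linked blog post may well do it differently.

```python
import math
from collections import Counter

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_label(query_vec: list[float], labelled: list[tuple[list[float], str]], k: int = 3) -> str:
    """Majority label among the k labelled vectors most similar to query_vec."""
    nearest = sorted(labelled, key=lambda item: cosine(query_vec, item[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-d embeddings; real ones would come from an embedding model.
LABELLED = [
    ([0.9, 0.1], "tech"),
    ([0.8, 0.2], "tech"),
    ([0.1, 0.9], "sport"),
    ([0.2, 0.8], "sport"),
]

if __name__ == "__main__":
    print(knn_label([1.0, 0.0], LABELLED))
```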

tebeka - 5 hours ago

https://duckdb.org/2024/05/03/vector-similarity-search-vss

robotswantdata - 2 hours ago

You don’t need a vector database or graph; it really depends on your existing infrastructure, file types and needs.

The newer “agent” search approach can just query a file system or API. It’s slightly slower but easier to set up and maintain, as there is no extra infrastructure.
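As a sketch of the "agent search" idea, the function below is the kind of grep-like tool an agent could be handed instead of a vector index. The function name, parameters and result shape are assumptions; how it is exposed to the model (tool calling, MCP, and so on) is left out.

```python
import re
from pathlib import Path

def search_files(root: str, pattern: str, glob: str = "*.md", max_hits: int = 20) -> list[dict]:
    """Grep-like tool: return matching lines with their file and line number."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits: list[dict] = []
    for path in sorted(Path(root).rglob(glob)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append({"file": str(path), "line": lineno, "text": line.strip()})
                if len(hits) >= max_hits:
                    return hits  # cap the output so it fits in the model's context
    return hits

if __name__ == "__main__":
    for hit in search_files(".", r"retry"):
        print(f'{hit["file"]}:{hit["line"]}: {hit["text"]}')
```

There is no index to build or keep in sync, which is the maintenance advantage the comment describes; the cost is that every query scans the files again.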

init0 - 6 hours ago

I built a lib for myself https://pypi.org/project/piragi/

autogn0me - 5 hours ago

https://github.com/ggozad/haiku.rag/ - the embedded lancedb is convenient and has benchmarks; uses docling. qwen3-embedding:4b, 2560 w/ gpt-oss:20b.

rahimnathwani - 19 hours ago

If your data aren't too large, you can use faiss-cpu and pickle

https://pypi.org/project/faiss-cpu/

cbcoutinho - 5 hours ago

The Nextcloud MCP Server [0] supports Qdrant as a vectordb to store embeddings and provide semantic search across your personal documents. This turns any LLM & MCP client (e.g. Claude Code) into a RAG system that you can use to chat with your files.

For local deployments, Qdrant supports storing embeddings in memory as well as in a local directory (similar to sqlite) - for larger deployments Qdrant supports running as a standalone service/sidecar and can be made available over the network.

[0] https://github.com/cbcoutinho/nextcloud-mcp-server

lsb - 3 hours ago

I'm using Sonnet with 1M Context Window at work, just stuffing everything in a window (it works fine for now), and I'm hoping to investigate Recursive Language Models with DSPy when I'm using local models with Ollama

Bombthecat - 4 hours ago

AnythingLLM for documents, amazing tool!

softwaredoug - 4 hours ago

I built a Pandas extension, SearchArray, and I just use that (plus in-memory embeddings) for any toy thing.

https://github.com/softwaredoug/searcharray

beret4breakfast - 4 hours ago

For the purposes of learning, I’ve built a chatbot using ollama, streamlit, chromadb and docling. Mostly playing around with embedding and chunking on a document library.

dvorka - 5 hours ago

Any suggestions on what to use as an embedding model runtime and for semantic search in C++?

geuis - 4 hours ago

I don't. I actually write code.

To answer the question more directly, I've spent the last couple of years with a few different quant models mostly running on llama.cpp and ollama, depending. The results are way slower than the paid token api versions, but they are completely free of external influence and cost.

However, the models I've tested generally turn out to be pretty dumb at the quant level I need to run for them to be relatively fast. And their code generation capabilities are just a mess not to be dealt with.

sinandrei - 2 hours ago

Anyone use these approaches with academic pdfs?

ehsanu1 - 5 hours ago

Embedded usearch vector database. https://github.com/unum-cloud/USearch

SamLeBarbare - 2 hours ago

sqlite + FTS + sqlite-vec + local LLM for reranking results (reasoning model)

lormayna - 5 hours ago

I have done some experiments with nomic embedding through Ollama and ChromaDB.

Works well, but I haven't tested it at a larger scale.

eajr - 20 hours ago

Local LibreChat which bundles a vector db for docs.

motakuk - 19 hours ago

LightRAG, Archestra as a UI with LightRAG mcp

baalimago - 4 hours ago

I thought that context building via tooling was shown to be more effective than rag in practically every way?

Question being: WHY would I be doing RAG locally?

lee1012 - 7 hours ago

lee101/gobed https://github.com/lee101/gobed uses static embedding models, so text is embedded in milliseconds, and searches on the GPU with a CAGRA-style on-GPU index. It has a few things for speed, like int8 quantization of the embeddings and fusing embedding and search in the same kernel, since the embedding really is just a trained map of embeddings per token plus averaging.
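To illustrate the two ideas named above, a static embedding as a per-token lookup followed by averaging, and int8 quantization of the result, here is a toy CPU sketch. It is not gobed's code or API: the token table, the whitespace tokenizer and the symmetric quantization scheme are all assumptions, and the GPU index and fused kernel are out of scope.

```python
def embed(text: str, table: dict[str, list[float]], dim: int) -> list[float]:
    """Static embedding: look up each known token's vector and average them."""
    vectors = [table[tok] for tok in text.lower().split() if tok in table]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def quantize_int8(vec: list[float]) -> tuple[list[int], float]:
    """Symmetric int8 quantization: the largest magnitude maps to 127."""
    peak = max((abs(x) for x in vec), default=0.0)
    scale = peak / 127 if peak else 1.0
    return [round(x / scale) for x in vec], scale

def dequantize(quantized: list[int], scale: float) -> list[float]:
    return [q * scale for q in quantized]

# Hypothetical 3-d token table; a trained model would supply these vectors.
TABLE = {"vector": [1.0, 0.0, 0.5], "search": [0.0, 1.0, 0.5]}

if __name__ == "__main__":
    vec = embed("vector search", TABLE, dim=3)
    quantized, scale = quantize_int8(vec)
    print(vec, quantized, dequantize(quantized, scale))
```

Because embedding is just a table lookup and a mean, there is no model forward pass, which is where the millisecond latency comes from.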

Strift - 4 hours ago

I just use a web server and a search engine.

TL;DR:
- chunk files, index chunks
- vector/hybrid search over the index
- Node app to handle requests (was the quickest to implement, LLMs understand OpenAPI well)

I wrote about it here: https://laurentcazanove.com/blog/obsidian-rag-api

nineteen999 - 16 hours ago

A little BM25 can get you quite a way with an LLM.

jeanloolz - 6 hours ago

Sqlite-vec

jeffchuber - 7 hours ago

Try out Chroma, or better yet ask Opus to!

electroglyph - 7 hours ago

simple lil setup with qdrant

pdyc - 6 hours ago

sqlite's bm25

whattheheckheck - 20 hours ago

AnythingLLM is promising

ramesh31 - 16 hours ago

SQLite with FTS5
