Launch HN: Greptile (YC W24) - RAG on codebases that actually works

253 points by dakshgupta 2 years ago


Hi HN, we're the co-founders of Greptile, a tool that can accurately answer questions about complex codebases. Developers use us to spend less time wrestling with codebases and more time actually writing code. Here's a demo: https://youtu.be/qI24eKO1YX0. You can try it on 100 popular repos here: https://app.greptile.com/repo, and on your own repo (if you give permission - more on that below) here: https://app.greptile.com.

We are far from the first people to try "RAG on your codebase". We focus on full codebase comprehension: using LLMs to accurately answer difficult questions with full context of large, complex, and even multi-repo codebases.

Simple RAG alone is not sufficient for this task. Codebases aren’t like most PDFs, docs, or other similar data types. They are graphs—complex puzzles where each piece is interlinked. So Greptile does a few things past simple RAG:

(1) Instead of directly embedding code, we parse the AST of the codebase, recursively generate docstrings for each node in the tree, and then embed the docstrings.

(2) Alongside vector similarity search and keyword search, we do “agentic search” where an agent reviews the relevance of the search results, and scans the source code to follow references that might lead to something important. Then it returns the relevant sources.
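To make step (1) concrete, here is a minimal sketch in Python using the standard-library `ast` module. It is an illustration under stated assumptions, not Greptile's implementation: `summarize` is a hypothetical stand-in for the LLM call that would write a docstring, `describe` and `SOURCE` are made-up names, and the embedding step is left out entirely. The point it shows is the bottom-up order, where each node is described only after its children have been.

```python
import ast

# Node types treated as "definitions" worth describing (an assumption of this sketch).
DEFINITION_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def summarize(node, child_summaries):
    """Hypothetical stand-in for an LLM call that writes a docstring.

    Builds a one-line description from the node's name, its existing
    docstring (if any), and the summaries of its nested definitions.
    """
    kind = type(node).__name__
    name = getattr(node, "name", "<module>")
    doc = ast.get_docstring(node) or "no docstring"
    children = "; ".join(child_summaries) or "no nested definitions"
    return f"{kind} {name}: {doc} (contains: {children})"


def describe(node, out):
    """Post-order walk: describe nested definitions first, then the node itself."""
    child_summaries = [
        describe(child, out)
        for child in getattr(node, "body", [])
        if isinstance(child, DEFINITION_TYPES)
    ]
    summary = summarize(node, child_summaries)
    out.append(summary)  # in a real pipeline, this text is what would be embedded
    return summary


SOURCE = '''
class Cache:
    """In-memory cache."""
    def get(self, key):
        """Look up a key."""
        return None
'''

summaries = []
describe(ast.parse(SOURCE), summaries)
for line in summaries:
    print(line)
```

Because the walk is post-order, the description of `Cache` already includes the description of `get`, so a query that matches the method's behavior can still surface the enclosing class.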

For example, here are a couple questions that this system is able to answer in our test repo that simple RAG couldn’t (in our experience):

“Where are the auth providers configured?” (They are in an array inside an options.ts file, and it's not obvious from looking at the file that it is auth-related. However, because that array is imported into the auth/route.ts file, Greptile’s agent traces the import and finds it.)

“How would I add a Postgres connector?” (The best way to answer this is to see how the Redis connector is set up and mirror it. Simple RAG sometimes retrieves some of the code for the Redis connector, but Greptile’s agent follows the connections to retrieve all the code that the Redis connector touches, and uses that to write instructions.)
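The "follow the references" idea behind both examples can be sketched as a breadth-first walk over imports. This is a simplified illustration, not how Greptile's agent works: the real agent is described as using an LLM to judge relevance, whereas this sketch follows every import up to a fixed depth. `REPO`, `imported_modules`, and `expand_hits` are hypothetical names, and the repo is modeled as Python modules rather than the TypeScript files mentioned above.

```python
import ast
from collections import deque

# Hypothetical in-memory repo: module name -> source text.
REPO = {
    "auth.route": "from options import PROVIDERS\n\ndef handler():\n    return PROVIDERS\n",
    "options": "PROVIDERS = ['github', 'google']\n",
    "billing": "import json\n",
}


def imported_modules(source):
    """Return the set of module names imported anywhere in the source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module)
    return names


def expand_hits(seed_hits, repo, max_depth=2):
    """Start from search hits and follow imports breadth-first.

    Only modules that exist in the repo are followed, and the walk
    stops expanding once max_depth hops have been taken.
    """
    seen = set(seed_hits)
    queue = deque((name, 0) for name in seed_hits)
    while queue:
        name, depth = queue.popleft()
        if depth == max_depth:
            continue
        for dep in sorted(imported_modules(repo[name])):
            if dep in repo and dep not in seen:
                seen.add(dep)
                queue.append((dep, depth + 1))
    return seen


# A search for "auth providers" might only hit auth.route; the walk adds options.
related = expand_hits(["auth.route"], REPO)
print(sorted(related))
```

Here `options` never mentions auth, so a similarity search alone could miss it, but it is reachable in one hop from the file that search did find.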

Developers (including at Stripe and Microsoft) are using Greptile for things like:

Debugging—you can paste in an error message and it does a pretty good job of diagnosing the root cause and suggesting fixes.

Grokking OSS repos—for example, if you're forking a repo, modifying it for your use case, or just integrating it, Greptile lets you add multiple repos and dependencies in the same chat session so it has full context.

Parsing legacy code at work—especially if original engineers have left the company.

Since we're accessing your private code, we're very careful with security. We don't store any code on our servers after initial processing, and just pull snippets as needed from the GitHub API.

Quick note: when you sign in with GH, it might ask for permission to "act on your behalf". This is a quirk of GitHub's wording—our permissions are read-only and the only thing we do "on your behalf" is read code, so we can index the repo.

We came up with this idea while working at AWS—the codebase was super complicated, the docs were sparse and out of date, and our team was remote so it was slow to get answers to questions. We picked "greptile" because of "grep" and also we just wanted a somewhat silly name.

Try it out! It's a work in progress, so any feedback is appreciated. Here are the links again: for popular open source repos see https://app.greptile.com/repo, and to get it working on your own repo, start at https://app.greptile.com.

If you have experience working with a complex codebase at work or for a project, I’d love to hear about it. It really helps inform our product direction. Looking forward to comments!

Edit: For those who want to try this on large or private repos, here is a promo code for a free month: HACKERNEWS100

moritonal - 2 years ago

Works well. Today I was working on how Rails handles BigDecimals, so (knowing the answer) I asked:

"When using "as_json" in a controller to return the JSON of a model, how are BigDecimal's encoded?"

Answer: "When using as_json in a controller to return the JSON of a model, BigDecimal values are encoded as strings. This behavior is defined in the active_support/core_ext/object/json.rb file, specifically in the BigDecimal class extension for JSON encoding. The rationale behind this approach is that most..." which is exactly the case, as I learnt through various PRs, issues and code review.

This would have saved me about 30mins of work. I wonder if it takes into account the metadata, such as authors, related comments, issues and PRs?

dvt - 2 years ago

Ran it on a "real" OSS project of mine (https://github.com/dvx/lofi/), and it was stuck at 99% loading for about 30 minutes. Then, when it finally parsed the codebase, when asked anything it always returns "Error: Internal error while locating sources." Specifically, I wanted to see if it can context switch between TypeScript (used for the front-end), ObjectiveC (used for a few Mac features), C++ (used for Windows volume features), and GLSL (used for visualizations). But alas.

At one point, this random prompt popped up: https://imgur.com/a/mYeluaU —what's "Onboard?" Is this some kind of weird LLM leakage/hallucination?

With all respect, this is like a pre-MVP quality product. The codebase isn't even particularly large and the experience is extremely sub-par. Charging for something like this is honestly highway robbery.

jasonkester - 2 years ago

You’re going to want to define the acronym RAG before you use it a dozen times in your marketing copy.

Presumably it’s great news that I can RAG on my codebase. But I’m not sure whether I’ve ever ragged anything in my career or whether I’ll want to now.

If you told us what it meant, we could probably understand what your thing does.

alalani1 - 2 years ago

I like clever project names :)

This looks great - I just tried to generate sample code in the react repo and was pleasantly surprised. Do you have a sense of whether this works well for generating code in general, e.g. generating an API route to return X data that works similarly to the other API routes?

simonw - 2 years ago

"We don't store any code on our servers after initial processing"

Are you storing the embedding vectors you've calculated from the code? If so, those are likely quite easily reversible - so I would still consider that source code stored on your servers from the point of view of a security audit.

As a result, I might actually prefer to have copies of my code stored on your servers if it resulted in faster performance.

koeng - 2 years ago

I'd love to try it, but pretty much all my repos are >10mb. It's not because there is that much code, but because I am doing bioinformatics and the test files (for the unit tests) inflate the repo size. It would be great if there was a way to test it on just 1 large repo for perhaps a week or something, because I balk at the idea of spending $20 a month on something that I don't even know works well.

This is important because I'm not deeply familiar with public projects, so I can't accurately assess if the tool is worthwhile. Whereas with one of my repos, I'd be able to tell quality pretty quickly.

Tsarp - 2 years ago

How does it compare with something like Bloop, which also uses a combination of a syntax tree, Embeddings, FTS and LLMs?

gdcbe - 2 years ago

“Where we going we don’t need docs”. That scares me… docs are among other things there to provide context and info for things not clear from why certain choices were made or not made… no way your AI is going to guess that I put that restriction because of an explicit request from product, despite it looking wrong…

Conscat - 2 years ago

I've tried it on my own C++ codebase. It's fun, and I'm impressed that it could tell me which C++ standard is used (a question which is often difficult to find an answer to on random codebases), but it's really bad at analyzing templates. The answers it gives me are always incomplete and usually at least partly or mostly incorrect. I'm surprised by this in some cases, because my questions are answered by comments in the source code.

https://app.greptile.com/share/4953cbff-13ec-4427-b0af-02889...

luke-stanley - 2 years ago

It's cool to see tools like this. I ran into some issues though:

1. "We will email you"... "once the repositories have finished processing" Not sure you're supposed to do that without consent, when the intent was just to connect GitHub! Email use is supposed to be opt-in.

2. My tiny repo (https://github.com/lukestanley/ChillTranslator) won't load.

3. The UI for selecting a GitHub repo is hard to find and fiddly to use.

4. I couldn't see where to put the promo code.

IceDane - 2 years ago

Not a single repo I've tried works. A lot of them seem not to have finished processing, but even the ones that have finished don't work.

ram417 - 2 years ago

Love this idea and just signed up. Thanks for the promo code! Also, I really like your blog post about shipping faster: https://greptile.com/blog/ship-faster. Shipping code is so fun that we should all be looking for ways to do more of it.

nomoreipg - 2 years ago

How's this different from Adrenaline or Cursor or Bloop?

cdtwigg - 2 years ago

Apparently I’m the only one here who doesn’t know this but: What is RAG?

pivic - 2 years ago

I only get 'Error: Internal error while processing request.' when I try to run queries. I tested three different repos, same error message appeared for each repo.

nico - 2 years ago

Cool, will check it out

Does it integrate with Visual Studio, does it provide code suggestions?

Been doing a lot of back and forth iteration with ChatGPT to build a python project from scratch

It’s been a really good experience although frustratingly slow at times (from going back and forth between the browser and code and having to wait for gpt’s answers)

Can more documentation be automatically added? For example, it might be useful in a rails project to be able to get answers about the ruby and rails documentations

fuzzythinker - 2 years ago

After giving permission, it asked to:

"Link Your Code Hosting Providers Connect your accounts for seamless integration, and to access private repositories."

What does this mean?

DavidFerris - 2 years ago

Super cool! btw I love the name "Greptile" :)

theckel - 2 years ago

I just keep getting: "Error: Internal error while locating sources." when trying to talk to a repo that is green and "up to date"

drcongo - 2 years ago

I've been looking for something like this, but local-only. Any plans to let people self-host and point at local repositories?

alchemist1e9 - 2 years ago

Does it use tree-sitter for all the AST parsing?

anton-107 - 2 years ago

Getting "Error: Internal error while processing request." while trying on my personal public github repo. HN effect?

jbellis - 2 years ago

Looks like some kind of bug on repos w/ many branches. Loading https://github.com/datastax/cassandra/, I search for `vsearch` and it presents me with CNDB-8708-vsearch and DSP-23946-vsearch, but not vsearch itself.

peter_d_sherman - 2 years ago

Related: https://greptile.com/pricing

sidcool - 2 years ago

Congrats on launching. However I don't like the 'Act on your behalf' permission this needs.

mcfig - 2 years ago

Asking questions of any repo on “repo” fails with “Error: Internal error while processing request.” This is probably because I unlinked my GitHub connection after trying it out, but it shouldn’t be trying to use that in this case.

intalentive - 2 years ago

The AST approach should be integrated into code generation. Instead of generating text, generate AST nodes. Something like “Copilot with Intellisense” could be a game changer.

tom_ - 2 years ago

What does RAG stand for?

stuaxo - 2 years ago

Nice reading the steps you take to analyse the code.

I had scrolled past this article without clicking and had the same thoughts about how I'd approach this.

ankit84 - 2 years ago

Can it answer customer support questions on API's cryptic error messages? E.g. Give hints on changes needed in the request payload.

obiefernandez - 2 years ago

Will it work with a large Ruby on Rails codebase?

doctorpangloss - 2 years ago

I tried asking a question about Porter and I see the error:

> Oops

> We couldn't access this repo.

> You may need to log in to view this repository, or it might not exist.

setgree - 2 years ago

very nice! FYI the 'free coffee' link (https://calendly.com/dakshgupta/free-coffee) identifies you as " Daksh, co-founder/CEO at Onboard."

Also I am getting the 'Error: Internal error while processing request'

sourabh03agr - 2 years ago

Congrats on the launch! Do you need Github permissions to answer questions on open-source repos as well?

iknownthing - 2 years ago

Looks good, but there are many competitors that do exactly the same thing (even open-source ones)

minhoryang - 2 years ago

I like your 100 repo selections :)

sanity - 2 years ago

I linked to my github but can't find where to use the promo code :-/

nahimn - 2 years ago

10/10 on the name

hazelnutcloud - 2 years ago

> This repo failed to process

nice

shw1n - 2 years ago

This is super cool, my co-founder and I were brainstorming how to essentially expand the context window via first-order concepts for this exact purpose last night

Excited to try it out

pgt - 2 years ago

Can Greptile read Clojure codebases?

cbsks - 2 years ago

Can you add the Linux kernel?

stevemadere - 2 years ago

Can you guys add huggingface transformers as one of the public demo repos? I have some very specific use cases where I've seen ChatGPT with GPT4 totally fall on its face Dunning-Kruger style.

I'd like to see if your tech solves those issues.

hellospike - 2 years ago

constantly got Error: Internal error while processing request.

geospatialover - 2 years ago

awesome! congrats on the launch

zainhoda - 2 years ago

[flagged]