Show HN: CLI tool for detecting non-exact code duplication with embedding models

github.com

75 points by rkochanowski 10 hours ago


rkochanowski - 10 hours ago

I built Slopo to solve one specific problem: finding the similar code that other tools, AI coding agents, and humans have the hardest time detecting.

It uses embeddings to find similar-looking code, which catches more than copy-paste clones or clones with minor changes. The trade-off is that similar code is often not a clone worth refactoring, so initial results need to be verified, but coding agents can do this quickly. Example prompts are available at https://slopo.dev

Additionally, similar code that sits far apart in the codebase is ranked higher, to focus on less obvious duplication.
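To illustrate the ranking idea, here is a minimal sketch, not Slopo's actual implementation. It assumes each chunk already has an embedding vector, and it uses a hypothetical distance measure (directory steps between the two files) and hypothetical parameters `min_similarity` and `distance_weight` to boost pairs that are far apart.

```python
import math
from pathlib import PurePosixPath


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def path_distance(p, q):
    """Directory steps between two files (an assumed, illustrative measure)."""
    a = PurePosixPath(p).parent.parts
    b = PurePosixPath(q).parent.parts
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def rank_pairs(chunks, min_similarity=0.8, distance_weight=0.05):
    """chunks: list of (path, embedding). Returns (score, path, path) tuples,
    best first. Pairs below min_similarity are dropped; the rest get a bonus
    proportional to how far apart the two files are."""
    scored = []
    for i in range(len(chunks)):
        for j in range(i + 1, len(chunks)):
            (p, u), (q, v) = chunks[i], chunks[j]
            sim = cosine(u, v)
            if sim < min_similarity:
                continue
            scored.append((sim + distance_weight * path_distance(p, q), p, q))
    return sorted(scored, reverse=True)
```

With equal similarity, a pair split across `src/a/` and `lib/z/` outranks a pair inside the same directory. The weight is a tuning knob: set it too high and distance starts to outweigh similarity.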

The results vary a lot by codebase. Sometimes most of the detected duplicates are false positives, but the remaining ones are strong candidates for refactoring, or even bugs. In other codebases it reveals much more real duplication.

supriyo-biswas - 6 hours ago

Cool project, I've been meaning to do this myself at work for a codebase, and it's nice to see that this exists now.

Does the project simply compute embeddings for every function unit and cluster them, or does it also mean-pool significant dependencies of a function? In other words, given the function

    def a():
      b()
      c()
      d()

Do we also embed b, c, and d and combine them somehow in the embedding of a?
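To make the question concrete, here is a rough sketch of what mean-pooling callees could look like. This is not how Slopo is known to work. The `own` and `calls` mappings and the `callee_weight` parameter are hypothetical, and the embeddings are assumed to be computed elsewhere.

```python
def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pooled_embedding(name, own, calls, callee_weight=0.5):
    """Blend a function's own embedding with the mean of its direct callees.

    own:   dict of function name -> embedding (list of floats)
    calls: dict of function name -> list of callee names
    """
    callees = [own[c] for c in calls.get(name, []) if c in own]
    if not callees:
        # Leaf function, or callees outside the codebase: keep its own vector.
        return list(own[name])
    pooled = mean_pool(callees)
    return [
        (1 - callee_weight) * s + callee_weight * p
        for s, p in zip(own[name], pooled)
    ]
```

For the example above, `pooled_embedding("a", own, {"a": ["b", "c", "d"]})` would mix the vector for `a` with the average of the vectors for `b`, `c`, and `d`.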

romanoonhn - 2 hours ago

Looks very cool! I'd be very interested in applying this to my Elixir projects. What does it take to add proper support for a new language?

vander_elst - 7 hours ago

I implemented this for a large monorepo last year. It runs as an analysis during code review and shows snippets that may be similar to the code under review. It was a very nice project. It also lets you see the most common constructs for each language across the repo. This could also help detect whether some code has been copied, e.g. from open source projects.

mempko - an hour ago

Nice, what's the chunking level? I would want sub-function, logical blocks, etc.

murats - 9 hours ago

Nice idea. I can see this being useful before refactors, especially when the duplication is semantic rather than copy-paste.

forhadahmed - 7 hours ago

self plug (for similar tool): https://github.com/forhadahmed/refactor

philajan - 8 hours ago

This is neat. Have you noticed any difference in duplicate detection between strongly typed and loosely typed languages / code bases?

janalsncm - 4 hours ago

Did you benchmark it against simpler methods like BM25?

BrandiATMuhkuh - 8 hours ago

What a simple and smart idea. Wonderful

hdz - 8 hours ago

Very nice. I can imagine putting this into a pre push hook to keep things clean after an initial sweep.

noashavit - 3 hours ago

Looks super useful, thanks for sharing!

rohanat - 7 hours ago

Have you considered a deterministic tier before the embedding pass? I feel that approach could be more efficient.
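As a sketch of what such a deterministic tier might look like (an assumption, not something Slopo is known to do): hash each chunk after stripping comments and whitespace differences, group exact matches, and send only one representative per group to the embedding pass. The function names here are hypothetical, and the example handles Python source only.

```python
import hashlib
import io
import tokenize
from collections import defaultdict


def fingerprint(source):
    """Hash of the token stream, ignoring comments, blank lines, and the
    exact amount of indentation. Raises if the source cannot be tokenized."""
    parts = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue
        if tok.type in (tokenize.INDENT, tokenize.DEDENT, tokenize.NEWLINE):
            # Keep the structure but not the literal whitespace.
            parts.append(tokenize.tok_name[tok.type])
        else:
            parts.append(tok.string)
    return hashlib.sha256("\x00".join(parts).encode()).hexdigest()


def split_exact_clones(chunks):
    """chunks: dict of chunk id -> source text.

    Returns (clone_groups, remaining_ids): groups of ids with identical
    fingerprints, and one representative id per fingerprint for the
    (more expensive) embedding pass."""
    groups = defaultdict(list)
    for chunk_id, source in chunks.items():
        groups[fingerprint(source)].append(chunk_id)
    clone_groups = [ids for ids in groups.values() if len(ids) > 1]
    remaining_ids = [ids[0] for ids in groups.values()]
    return clone_groups, remaining_ids
```

This only catches clones that are identical up to formatting and comments. Renamed variables or reordered statements would still need the embedding pass.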

NYCHMPAI - 9 hours ago

This is a great use case for embeddings. Code deduplication across distant modules is notoriously hard for traditional AST-based tools.

How do you handle chunking and parsing for different languages to make sure the embeddings capture semantic meaning effectively? For instance, do you chunk by functions/classes, or use a fixed token window? If a function is too long or too short, it can drastically skew the embedding similarity.
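For reference, function-level chunking with length bounds can be sketched in a few lines for Python source using the standard `ast` module. This is only an illustration of the "chunk by functions" option, not a description of Slopo's parser. The `min_lines` and `max_lines` thresholds are hypothetical values chosen for the example.

```python
import ast


def function_chunks(source, min_lines=3, max_lines=80):
    """Return (name, source_text) for each function whose length falls
    within the bounds. Very short functions tend to look alike and very
    long ones blur together, so both are skipped in this sketch."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1
            if min_lines <= length <= max_lines:
                chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks
```

A fixed token window would avoid the need for a parser per language, at the cost of chunks that start and end mid-statement.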

SpyCoder77 - 8 hours ago

I think that this is pretty cool, but is there any reason why we would want to remove similar/possible duplicate code?