Show HN: SemHash – Fast Semantic Text Deduplication for Cleaner Datasets
github.com
16 points by Pringled 8 hours ago
We’ve just open-sourced SemHash, a lightweight package for semantic text deduplication. It lets you effortlessly clean up your datasets and avoid pitfalls caused by duplicate samples in semantic search, RAG, and machine learning.
Main Features:
- Fast and hardware-friendly: Deduplicate datasets with millions of records in minutes, on a CPU.
- Flexible: Works on single or multiple datasets (e.g., train/test deduplication) and on multi-column data (e.g., question-answering datasets).
- Lightweight: Minimal dependencies (largest is NumPy).
- Explainable: Easily inspect duplicates and what caused them, and view the lowest-similarity duplicates to adjust the threshold for your dataset.
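As a quick illustration, basic usage looks roughly like this (a minimal sketch: it assumes the SemHash.from_records / self_deduplicate API and uses ag_news purely as an example dataset; check the repo README for the exact API):

    # Minimal sketch of single-dataset deduplication.
    # Assumes the semhash package exposes SemHash.from_records and
    # self_deduplicate as shown; exact names may differ in the repo.
    from datasets import load_dataset
    from semhash import SemHash

    # Example dataset; any list of strings works
    texts = load_dataset("ag_news", split="train")["text"]

    # Build an index over the records (embeds them; runs on CPU)
    semhash = SemHash.from_records(records=texts)

    # Remove semantic duplicates within the dataset
    result = semhash.self_deduplicate()
    clean_texts = result.deduplicated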
We found that text deduplication is more complex than it appears, so we built SemHash to simplify the process. Duplicate samples can skew model training, reduce generalization, and cause train-test leakage, leading to unreliable results. Techniques like MinHash handle exact or near-exact duplicates, but semantic deduplication also catches semantically redundant samples, which we believe is an important part of the problem. MinHash also makes it hard to see why a given sample was removed, which is why we focused on explainability. Our benchmarks, included in the repo, have already surfaced some interesting results on well-known datasets.
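To make the train-test leakage point concrete, cross-dataset deduplication follows the same pattern; again a sketch under the same API assumptions as above, keeping the removed duplicates around for inspection:

    # Sketch: drop test records that are semantic duplicates of
    # anything in train, to avoid train-test leakage.
    from datasets import load_dataset
    from semhash import SemHash

    train_texts = load_dataset("ag_news", split="train")["text"]
    test_texts = load_dataset("ag_news", split="test")["text"]

    # Index the training records, then deduplicate test against them
    semhash = SemHash.from_records(records=train_texts)
    result = semhash.deduplicate(records=test_texts)
    clean_test = result.deduplicated

    # Inspect what was removed, e.g. to sanity-check the threshold
    # (the .duplicates attribute name is assumed here)
    for duplicate in result.duplicates[:5]:
        print(duplicate)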
We are curious to hear your feedback! Do you currently deduplicate your datasets before training, and what techniques do you use?
skeptrune - 6 hours ago
I really appreciate that explainability was taken into consideration during the design here. It is complicated and this is a very nice contribution!
carschno - 5 hours ago
Do you have any quantitative evaluation in terms of precision and recall?
Pringled - 4 hours ago
Thanks for the kind words! Unfortunately, it's hard to directly measure the precision/recall (or other metrics) since there are no real labels. This is one of the reasons we tried to design this in a way that's as explainable as possible, so that you can easily look at examples of deduplication for your given threshold and decide if they make sense. We are still thinking about more/better ways to evaluate this in addition to the benchmarks we've already done.
dmezzetti - 5 hours ago
Glad to see this getting visibility. Model2Vec is great and The Minish Lab is doing some excellent work!