OpenAI Privacy Filter

openai.com

160 points by tanelpoder 3 days ago


nl - 2 hours ago

I'm nowhere near as smart as OpenAI of course, but I did build https://tools.nicklothian.com/webner/index.html, which uses a BERT-based named-entity-recognition model running in your browser to do a subset of PII redaction.

It works pretty well for the use cases I was playing with.

The OpenAI model is small enough that I might enhance my tool to use it.

hiAndrewQuinn - 3 days ago

I'm surprised nobody else has commented on this. This is a very straightforward and useful thing for a small locally runnable model to do.

stratos123 - 3 days ago

There are some interesting technical details in this release:

> Privacy Filter is a bidirectional token-classification model with span decoding. It begins from an autoregressive pretrained checkpoint and is then adapted into a token classifier over a fixed taxonomy of privacy labels. Instead of generating text token by token, it labels an input sequence in one pass and then decodes coherent spans with a constrained Viterbi procedure.

> The released model has 1.5B total parameters with 50M active parameters.

> [To build it] we converted a pretrained language model into a bidirectional token classifier by replacing the language modeling head with a token-classification head and post-training it with a supervised classification objective.
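
To make the "constrained Viterbi" part concrete, here is a minimal sketch in Python. It assumes a BIO tagging scheme and made-up per-token log-scores; the label names, the scores, and the `viterbi_bio` and `extract_spans` helpers are illustrative assumptions, not taken from the release. The point is only that decoding picks the best label sequence that never puts an I- tag after the wrong predecessor, whereas a per-token argmax can.

```python
import math

# Hypothetical label set for illustration; the real taxonomy is not reproduced here.
LABELS = ["O", "B-PER", "I-PER", "B-EMAIL", "I-EMAIL"]


def allowed(prev, cur):
    """BIO constraint: I-X may only follow B-X or I-X."""
    if cur.startswith("I-"):
        return prev in ("B-" + cur[2:], "I-" + cur[2:])
    return True


def viterbi_bio(scores):
    """scores: one dict per token mapping label -> log-score.
    Returns the highest-scoring label sequence that satisfies the BIO constraint."""
    if not scores:
        return []
    # A sequence may not start with an I- tag.
    best = [{lab: (-math.inf if lab.startswith("I-") else scores[0][lab])
             for lab in LABELS}]
    back = []
    for t in range(1, len(scores)):
        cur_best, cur_back = {}, {}
        for cur in LABELS:
            candidates = [(best[-1][prev], prev) for prev in LABELS if allowed(prev, cur)]
            top, arg = max(candidates, key=lambda c: c[0])
            cur_best[cur] = top + scores[t][cur]
            cur_back[cur] = arg
        best.append(cur_best)
        back.append(cur_back)
    # Follow back-pointers from the best final label.
    path = [max(LABELS, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def extract_spans(tokens, labels):
    """Group B-/I- runs into (entity_type, text) spans."""
    spans, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            current = [lab[2:], [tok]]
            spans.append(current)
        elif lab.startswith("I-") and current is not None:
            current[1].append(tok)
        else:
            current = None
    return [(kind, " ".join(toks)) for kind, toks in spans]


def row(overrides):
    """Build a score row; unspecified labels get a low default score."""
    return {lab: overrides.get(lab, -5.0) for lab in LABELS}


tokens = ["Email", "Ada", "Lovelace", "now"]
scores = [
    row({"O": -0.1}),
    row({"O": -0.7, "B-PER": -0.9, "I-PER": -3.0}),
    row({"I-PER": -0.2, "B-PER": -2.0, "O": -2.5}),
    row({"O": -0.1}),
]

greedy = [max(r, key=r.get) for r in scores]  # per-token argmax, may be invalid
decoded = viterbi_bio(scores)
print(greedy)
print(decoded)
print(extract_spans(tokens, decoded))  # [('PER', 'Ada Lovelace')]
```

With these toy scores the greedy labels contain an I-PER directly after O, which is not a valid span, while the constrained decode yields a single coherent PER span.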

aubinkure - 3 days ago

Exciting! I took a look through the code and found what appear to be the entity types for future releases - this release (V2 config) supports 8 entity types, but the V4 and V7 taxonomies have >20, mostly more personal ID types. Given this is a preview release, I imagine they'll release these.

Details in my review article here: https://piieraser.ai/blog/openai-privacy-filter. Disclaimer: I also build PII detection systems.

flashdesk - 27 minutes ago

This is exactly where stochastic approaches feel uncomfortable.

For anything touching security or privacy, even small inconsistencies can quickly erode trust.

mplanchard - 3 days ago

It would be nice if their examples weren’t mostly things that are easy to catch with regex, but it’s cool to see it released as an open, local model.
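
For a sense of what "easy to catch with regex" covers, a rough sketch in Python follows. The two patterns (a simplified email matcher and a US-style phone matcher) and the `redact_with_regex` name are assumptions for illustration only, not anything from the release. Structured identifiers get caught, while a free-form name passes straight through.

```python
import re

# Simplified, illustrative patterns; real-world formats vary much more than this.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact_with_regex(text):
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Ada Lovelace: ada@example.com, 555-123-4567"
print(redact_with_regex(sample))  # Ada Lovelace: [EMAIL], [PHONE]
```

The name "Ada Lovelace" survives untouched, which is the kind of case a regex-only pass cannot be expected to handle.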

mayneack - 2 hours ago

Curious how this compares to Presidio, which mixes regex with a model: https://microsoft.github.io/presidio/

freakynit - 3 hours ago

Can someone explain how I can reconstruct the original entities if there is, for example, more than one person's name?
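
One common way to handle this, sketched below in Python, is to replace each distinct entity with a numbered placeholder and keep the placeholder-to-original mapping locally so the text can be restored later. This is not something the release is known to provide: the `(start, end, label)` span format and the `pseudonymize` and `restore` helpers are assumptions, and the regex stands in for whatever detector produces the spans.

```python
import re


def pseudonymize(text, spans):
    """spans: non-overlapping (start, end, label) character offsets.
    Returns (redacted_text, mapping of placeholder -> original text)."""
    mapping = {}   # placeholder -> original text
    seen = {}      # (label, original) -> placeholder, so repeats reuse one tag
    counters = {}  # label -> how many distinct entities seen so far
    pieces, pos = [], 0
    for start, end, label in sorted(spans):
        original = text[start:end]
        key = (label, original)
        if key not in seen:
            counters[label] = counters.get(label, 0) + 1
            seen[key] = f"[{label}_{counters[label]}]"
            mapping[seen[key]] = original
        pieces.append(text[pos:start])
        pieces.append(seen[key])
        pos = end
    pieces.append(text[pos:])
    return "".join(pieces), mapping


def restore(text, mapping):
    """Substitute the original entities back in for their placeholders."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


text = "Alice emailed Bob, then Alice called Bob."
# Stand-in for a detector: find the two names with a regex.
spans = [(m.start(), m.end(), "PERSON") for m in re.finditer(r"Alice|Bob", text)]

redacted, mapping = pseudonymize(text, spans)
print(redacted)  # [PERSON_1] emailed [PERSON_2], then [PERSON_1] called [PERSON_2].
print(restore(redacted, mapping) == text)  # True
```

Because the mapping is keyed by distinct entity, two different names get different placeholders and the same name always gets the same one, so the round trip is unambiguous as long as the mapping is kept.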

7777777phil - 3 days ago

> The model is available today under the Apache 2.0 license on Hugging Face and Github.

Bringing back the Open to OpenAI..

Havoc - 3 days ago

50M active parameters is impressively light. Is there a similarly light model on the prompt injection side? Most of the mainstream ones seem heavier.

mentalgear - 2 days ago

SuperagentLM already made on-edge PII redaction models available a few years ago, in sizes 20B, 3B, and 200M. They still seem to be available via their legacy API and are well worth comparing against this one. https://docs.superagent.sh/legacy/llms/superagent-lm-redact-...

ndom91 - 3 days ago

Where's the gguf from Unsloth and co?
