Character Prefix Conditioning

cursor.com

29 points by mntruell 5 days ago


kcarnold - 2 days ago

This was the subject of https://arxiv.org/abs/2412.03719. (I suspect you can do something simpler than the paper's solution if you're only interested in the top-k.)

A related topic is "token healing", although some implementations (unfortunately including the one in HuggingFace Transformers) make some big assumptions that aren't always true (like treating spaces as special).
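
Roughly, token healing means backing up over the trailing token of the prompt and then only sampling next tokens that start with the characters you removed. A toy sketch with a made-up vocabulary and next-token distribution (real implementations also have to deal with the whitespace edge cases I mentioned):

    # Toy token-healing sketch: tokens are plain strings and the "model" is a
    # fixed next-token distribution, both invented for illustration.
    prompt_tokens = ["print", "(", "ap"]      # user typed "print(ap"
    next_probs = {"ap": 0.05, "apple": 0.30, "apply": 0.10, "banana": 0.55}

    # 1. Back up: drop the trailing (possibly partial-word) token and remember
    #    its characters.
    removed = prompt_tokens.pop()             # "ap"

    # 2. Constrain: keep only next tokens that reproduce those characters,
    #    then renormalize. (This is the step where implementations differ,
    #    e.g. in how they treat leading spaces.)
    allowed = {t: p for t, p in next_probs.items() if t.startswith(removed)}
    z = sum(allowed.values())
    print({t: p / z for t, p in allowed.items()})
    # {'ap': 0.111, 'apple': 0.667, 'apply': 0.222} (approximately)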

viraptor - 2 days ago

> Can you construct an efficient algorithm for sampling from q(t_k ∣ t_1, …, t_{k−1}), that minimizes calls to the original language model?

I feel like I'm missing some issue here... Can't you query with the prompt cut at the last full token boundary, then reject any results that don't match the character prefix and continue from there with the completion? Kind of like masking invalid actions when doing reinforcement learning on games? Or is that losing too much info?
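
Something like this toy loop is what I have in mind (the model call and candidate completions are just stand-ins); my guess is it samples the right thing and the only cost is wasted calls when matching samples are rare:

    import random

    # Toy rejection loop: complete from the last full token boundary and
    # discard samples that don't reproduce the remaining typed characters.
    def rejection_sample(remaining_chars, draw_completion, max_tries=10_000):
        for _ in range(max_tries):
            text = draw_completion()
            if text.startswith(remaining_chars):
                return text[len(remaining_chars):]   # keep only the new part
        raise RuntimeError("rejected every sample within the budget")

    # draw_completion stands in for an actual model call.
    completions = ["apple sauce", "appraisal", "apartment", "banana bread"]
    print(rejection_sample("app", lambda: random.choice(completions)))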

do_not_redeem - 2 days ago

So here is ChatGPT's token list: https://gist.github.com/s-macke/ae83f6afb89794350f8d9a1ad8a0...

Is there some reason it isn't alphabetical? (More specifically, lexically sorted by codepoint) If you had a model with sorted tokens, you'd be able to solve this by constraining output to tokens with the desired prefix, probably with some mechanism similar to how this works: https://github.com/ggerganov/llama.cpp/blob/master/grammars/...
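
With a sorted vocab, "all tokens starting with the typed characters" is one contiguous slice, so finding it is a binary search plus a walk over the matches. Toy illustration (the vocab is made up):

    import bisect

    vocab = sorted(["ap", "app", "apple", "applesauce", "banana", "cat"])

    def tokens_with_prefix(sorted_vocab, prefix):
        # In a lexically sorted vocab, all tokens sharing a prefix form one
        # contiguous slice; binary-search to its start, then walk it.
        lo = bisect.bisect_left(sorted_vocab, prefix)
        hi = lo
        while hi < len(sorted_vocab) and sorted_vocab[hi].startswith(prefix):
            hi += 1
        return sorted_vocab[lo:hi]

    print(tokens_with_prefix(vocab, "app"))   # ['app', 'apple', 'applesauce']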

_andrei_ - 20 hours ago

Do they have the solution (I'd assume Supermaven, which they recently acquired, does) and are they just giving out an interesting challenge to readers? Or... what is this?

amanrs - 2 days ago

This is harder than it looks.

First "token-healing" doesn't work. Consider the case "app" where the most likely options are "ap|praisal" or "apple|sauce". You can't just sample all tokens that start with app, or you'd miss appraisal.

Second, it's easy to come up with a naive algorithm that samples from the true distribution. It's very difficult to make this algorithm efficient.
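
Concretely: with remaining characters "app", a token is usable if it starts with "app" or is itself a prefix of "app" (leaving the rest to later tokens); a prefix-only filter drops the second case. Toy check over a made-up vocab:

    vocab = ["ap", "app", "apple", "praisal", "lesauce", "banana"]
    remaining = "app"

    def compatible(token, remaining):
        # Either the token extends the typed characters ("apple"), or the
        # typed characters extend the token ("ap", leaving "p" for later).
        return token.startswith(remaining) or remaining.startswith(token)

    print([t for t in vocab if t.startswith(remaining)])   # ['app', 'apple']
    print([t for t in vocab if compatible(t, remaining)])  # ['ap', 'app', 'apple']

And even with the right candidate set, sampling from the true distribution means weighting each candidate by the probability that the model goes on to produce the rest of the typed characters, which is where the model calls pile up.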

yorwba - 2 days ago

Ideally you'd have a language model that can predict a good continuation after any byte. If an existing model can't do that because it's too reliant on a specific tokenization, you might nonetheless be able to fine-tune it until it can gracefully handle the unexpected tokenizations that result from splitting at a random byte.
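
For example, you could build fine-tuning examples by cutting training text at a random character position and tokenizing the two halves separately, so the model sees the token sequences a mid-word split produces. Toy sketch, with a whitespace split standing in for a real subword tokenizer:

    import random

    def tokenize(text):
        # Stand-in for a real subword tokenizer.
        return text.split()

    def random_split_tokens(text):
        # Cut at a random character position and tokenize each half on its
        # own, yielding token sequences normal encoding would never produce.
        cut = random.randrange(1, len(text))
        return tokenize(text[:cut]) + tokenize(text[cut:])

    print(tokenize("hello world"))             # ['hello', 'world']
    print(random_split_tokens("hello world"))  # e.g. ['hel', 'lo', 'world']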

teaearlgraycold - 2 days ago

Not sure if this is free labor or a means to source candidates.