Perplexity is dropping a second OSS banger in a week


We're open-sourcing the Unigram tokenizer we rebuilt to reduce CPU utilization by 5-6×. Small rerankers and embedders run in single-digit milliseconds on GPU, which makes CPU tokenization a meaningful share of total latency.

The work targets XLM-RoBERTa’s 250K-token Unigram vocabulary, commonly used for ranking and retrieval. The encoder produces the same tokens as the reference implementation, but avoids rebuilding strings and chasing hash maps while deciding how text should be split.

At production input lengths, the encoder cuts p50 latency by roughly 5× vs. Hugging Face tokenizers, 2× vs. SentencePiece C++, and 1.5× vs. IREE C. At 514 tokens, it runs in 63 μs with zero heap allocations.

GitHub: https://lnkd.in/gCeFN6F6

Read more about improving Unigram tokenizer CPU performance on our blog: https://lnkd.in/g-BEigqF
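For readers unfamiliar with how a Unigram tokenizer decides where to split text, here is a minimal sketch of the textbook approach: a Viterbi search for the segmentation with the highest total log-probability. It is not Perplexity's implementation. The vocabulary, scores, and the names `TOY_VOCAB` and `unigram_segment` are made up for illustration. The sketch deliberately slices a new substring and does a dictionary lookup for every candidate piece, which is the kind of string rebuilding and hash-map work the post says the new encoder avoids.

```python
import math

# Hypothetical toy vocabulary with made-up log-probabilities.
# A real XLM-RoBERTa vocabulary has roughly 250K entries.
TOY_VOCAB = {
    "uni": -2.2, "un": -2.0, "gram": -2.5, "i": -3.0,
    "u": -4.0, "n": -4.0, "g": -4.0, "r": -4.0, "a": -4.0, "m": -4.0,
}


def unigram_segment(text, vocab, max_piece_len=8):
    """Return (pieces, score) for the highest-scoring segmentation of text."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start of the last piece ending at i
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_piece_len), end):
            if best[start] == -math.inf:
                continue  # text[:start] cannot be segmented at all
            piece = text[start:end]   # allocates a new string per candidate
            score = vocab.get(piece)  # one hash lookup per candidate
            if score is None:
                continue
            candidate = best[start] + score
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start

    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    pos = n
    while pos > 0:
        pieces.append(text[back[pos]:pos])
        pos = back[pos]
    pieces.reverse()
    return pieces, best[n]
```

With this toy vocabulary, `unigram_segment("unigram", TOY_VOCAB)` picks `["uni", "gram"]` over `["un", "i", "gram"]` because the two-piece split has the higher total score. Production tokenizers typically replace the per-candidate slice and dictionary lookup with a trie or similar structure walked in place, and handle unknown characters rather than raising.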


Nick Davidov’s Post
