A developer compared OpenAI's open-source privacy-filter with GLiNER on 600 PII (Personally Identifiable Information, e.g. names, emails, phone numbers) samples. Under boundary-overlap scoring, OpenAI won with an F1 of 0.498 vs. 0.416; under strict matching, however, OpenAI scored only 0.155, a drop of more than 30 points. The reminder: evaluation metrics can deceive, and the tokenizer's character offsets are the real culprit.

What this is

Both models do the same thing: finding and tagging PII in text. GLiNER large-v2.1 has about 300M parameters and works zero-shot (no training data needed; you simply supply the entity type names to detect), so it supports custom entity types. OpenAI's privacy-filter has 1.5B total parameters but uses a sparse MoE (Mixture of Experts, where only a fraction of the parameters is activated per inference), with only 50M active parameters per forward pass.
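The total-vs-active parameter gap comes from sparse routing: each input wakes only a few experts. Here is a toy, stdlib-only illustration of the idea; nothing in it reflects privacy-filter's actual architecture, and the experts are stand-in functions rather than real network layers.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, experts, router_weights, k=2):
    """Toy sparse MoE: a router scores every expert, but only the
    top-k experts are actually evaluated for this input."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    gates = softmax(logits)
    top = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:k]
    # The non-selected experts' parameters stay idle, which is why
    # total and active parameter counts can differ so sharply.
    return sum(gates[i] * experts[i](x) for i in top)
```

With, say, 16 experts and k=2, only an eighth of the expert parameters participate in any single forward pass, which is the sense in which a 1.5B-parameter model can have 50M active parameters.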

Measured CPU inference speed: privacy-filter processes about 2.8 items/second, GLiNER about 1.1, making OpenAI's model roughly 2.5x faster. However, privacy-filter has a pitfall: its GPT-style BPE tokenizer prepends a space to most tokens, producing a one-position offset when tokens are decoded back to character positions. Under strict match scoring it looks abysmal; switch to boundary overlap (correct as long as the character ranges overlap at all and the label matches), and it actually wins.
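The scoring gap is easy to see in a minimal sketch (hypothetical spans, not the benchmark's actual data): a prediction whose character range is shifted by one position fails strict matching but still passes boundary overlap.

```python
def strict_match(pred, gold):
    """Exact character boundaries and an identical label are required."""
    return pred == gold

def overlap_match(pred, gold):
    """Any character overlap counts, as long as the label matches."""
    (p_start, p_end, p_label), (g_start, g_end, g_label) = pred, gold
    return p_label == g_label and p_start < g_end and g_start < p_end

# Gold span for "Alice" in "Call Alice now"; a tokenizer's leading
# space shifts the predicted start one character to the left.
gold = (5, 10, "NAME")
pred = (4, 10, "NAME")

print(strict_match(pred, gold))   # False: boundaries differ by one
print(overlap_match(pred, gold))  # True: ranges overlap, label matches
```

Every prediction with this off-by-one counts as a miss under strict scoring, which is enough to explain a model-wide F1 collapse without any real detection failure.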

By category, privacy-filter wins on names, emails, phone numbers, and dates; GLiNER is better on addresses. Email detection is essentially solved: F1 of 0.987 on English and 1.000 on multilingual samples.
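Email is the easiest category because its surface form is so regular; a simple pattern sketch (not either model's actual approach) already captures most cases.

```python
import re

# Simplified pattern; the real email grammar (RFC 5322) is far messier.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def find_emails(text):
    """Return (start, end, match) tuples for email-like substrings."""
    return [(m.start(), m.end(), m.group()) for m in EMAIL_RE.finditer(text)]

print(find_emails("Contact jane.doe+hr@example.co.uk or call 555-0100."))
```

Categories like names and addresses lack this kind of rigid structure, which is where model quality, rather than pattern matching, actually matters.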

Industry view

We note that the core value of this test lies not in who wins, but in its exposure of two practical issues in PII detection deployment. First, the choice of evaluation standard directly shapes the conclusion: strict matching is safer for production, while boundary overlap is closer to actual needs. Second, GLiNER's default threshold of 0.5 leaves F1 on the table; raising it to 0.7 gains about 8 points, a reminder that the default configurations of open-source models are not necessarily optimal.
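The threshold effect can be reproduced with a stdlib-only sketch (toy spans and scores, not the benchmark data): sweep the confidence cutoff and keep the one that maximizes F1 on a held-out set.

```python
def f1_at_threshold(preds, gold, threshold):
    """preds: list of (span, score) pairs; gold: set of true spans."""
    kept = {span for span, score in preds if score >= threshold}
    if not kept or not gold:
        return 0.0
    tp = len(kept & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(kept)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy predictions: true spans tend to score high, noise scores low,
# so a cutoff above the noise band lifts precision without losing recall.
gold = {"a", "b", "c"}
preds = [("a", 0.95), ("b", 0.80), ("c", 0.72), ("x", 0.55), ("y", 0.51)]

best = max((round(t * 0.05, 2) for t in range(1, 20)),
           key=lambda t: f1_at_threshold(preds, gold, t))
print(best, f1_at_threshold(preds, gold, best))
```

The caveat is the usual one: a threshold tuned on one dataset will not necessarily transfer, so the sweep belongs in your own evaluation pipeline rather than being copied from a blog post.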

Dissenting voices are equally worth noting: privacy-filter currently requires trust_remote_code=True and depends on the dev branch of transformers; the model class has not yet landed in a stable release. This means production deployments carry supply-chain risk. Furthermore, it supports only 8 preset entity types with no way to add more; if you need to detect custom fields such as "employee ID" or "contract number," GLiNER's zero-shot interface is the only choice.

Impact on regular people

For enterprise IT: local PII-detection deployments have gained a high-performance option, but privacy-filter lacks production stability; we recommend waiting for an official release. Scenarios requiring high recall (better to over-redact than to miss) should still choose GLiNER.

For individual careers: data desensitization toolchains are maturing rapidly. Automatic scrubbing before handling customer data is becoming a standard capability worth understanding.

For the consumer market: regular people won't notice anything for now, but advances in enterprise-grade PII detection mean your chat logs and documents are increasingly likely to be automatically desensitized before entering AI training.