privacy-filter-finetuned
A finetuned version of openai/privacy-filter with three additional detection categories on top of the original eight — built for procurement, finance, and sales datasets where company names, prices, and business reference IDs are as sensitive as personal data.
Runs entirely locally. No data leaves the device during inference.
New Categories
This checkpoint adds three categories to the base model's original eight:
| Category | What it detects | Examples |
|---|---|---|
| company_name | Supplier names, client account names, corporate entities | Meridian Logistics Ltd, Apex Manufacturing PLC, TechStart Solutions Inc |
| price | Monetary values with currency symbols or codes | £4,250.00, $12,500, €5,000, 2,500 GBP |
| id_number | Purchase orders, invoice numbers, reference IDs, case numbers, internal codes | PO-00442, INV/2024/00567, REF-2024-001, CASE-20240315-001, ORD-78542 |
Full Label Space (11 categories)
{
  "category_version": "custom_v1_extended",
  "span_class_names": [
    "O",
    "private_person",
    "private_email",
    "private_phone",
    "private_address",
    "account_number",
    "private_url",
    "private_date",
    "secret",
    "company_name",
    "price",
    "id_number"
  ]
}
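As a quick sanity check on the label space, the sketch below parses the JSON above and separates the base categories from the three added ones. It assumes that `"O"` is the conventional "outside" (no entity) class and is not counted among the 11 categories. The variable names are illustrative, and how the model maps these names to token-level tag IDs is not covered here.

```python
import json

# The label space exactly as listed above, inlined for a self-contained example.
label_space = json.loads("""
{
  "category_version": "custom_v1_extended",
  "span_class_names": [
    "O", "private_person", "private_email", "private_phone",
    "private_address", "account_number", "private_url",
    "private_date", "secret", "company_name", "price", "id_number"
  ]
}
""")

# Assumption: "O" marks text outside any span, so it is not a detection category.
categories = [name for name in label_space["span_class_names"] if name != "O"]

# The three categories this checkpoint adds, per the table above.
ADDED = {"company_name", "price", "id_number"}
base_categories = [name for name in categories if name not in ADDED]
added_categories = [name for name in categories if name in ADDED]

print(len(categories), len(base_categories), len(added_categories))
```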
Usage
Install the official opf package from the openai/privacy-filter repo:
git clone https://github.com/openai/privacy-filter
cd privacy-filter
pip install -e .
Download this checkpoint:
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="galexdav/privacy-filter-finetuned",
    local_dir="./privacy-filter-finetuned",
)
Run inference:
from opf import OPF

model = OPF(model="./privacy-filter-finetuned", device="cpu")
result = model.redact("PO-00442 raised for Meridian Logistics Ltd, total £4,250.00.")

for span in result.detected_spans:
    print(f"{span.label}: '{span.text}'")
# id_number: 'PO-00442'
# company_name: 'Meridian Logistics Ltd'
# price: '£4,250.00'
Or via the CLI:
opf --checkpoint ./privacy-filter-finetuned --device cpu \
"Invoice INV/2024/00567 from Apex Manufacturing PLC — amount £9,600.00"
Training Details
| Setting | Value |
|---|---|
| Base model | openai/privacy-filter |
| Training examples | ~830 |
| Validation examples | ~130 |
| Epochs | 3 |
| Hardware | 1× NVIDIA L4 (24 GB) via Hugging Face Jobs |
| Training time | ~9 minutes |
Training data was generated programmatically by crossing sentence templates with entity lists (a template × entity cross-product). It covers all 11 categories, including the original 8, to limit catastrophic forgetting of the base model's labels.
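The template × entity cross-product idea can be sketched roughly as follows. This is an illustration only: the templates, entity pools, the `fill` and `generate` helpers, and the output record format are all hypothetical, since the actual generation script is not described in this card.

```python
import string
from itertools import product

# Hypothetical templates and entity pools, reusing example values from this card.
TEMPLATES = [
    "{id_number} raised for {company_name}, total {price}.",
    "Invoice {id_number} from {company_name} for {price}.",
]
ENTITIES = {
    "id_number": ["PO-00442", "INV/2024/00567"],
    "company_name": ["Meridian Logistics Ltd", "Apex Manufacturing PLC"],
    "price": ["£4,250.00", "$12,500"],
}


def fill(template, values):
    """Render one template and record the character span of each inserted value."""
    text, spans = "", []
    for literal, field, _spec, _conv in string.Formatter().parse(template):
        text += literal
        if field is not None:
            start = len(text)
            text += values[field]
            spans.append({"label": field, "start": start, "end": len(text)})
    return {"text": text, "spans": spans}


def generate(templates, entities):
    """Cross every template with every combination of entity values."""
    fields = sorted(entities)
    examples = []
    for template in templates:
        for combo in product(*(entities[f] for f in fields)):
            examples.append(fill(template, dict(zip(fields, combo))))
    return examples


examples = generate(TEMPLATES, ENTITIES)
print(len(examples))        # 2 templates x (2 x 2 x 2) combinations = 16
print(examples[0]["text"])
```

Because the span offsets are recorded while the text is built, each label stays aligned with its text without a separate annotation pass. A real dataset would presumably need far more templates and entity values than shown here.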
Test Results
22/24 tests passing on the held-out test suite:
| Category | Result |
|---|---|
| company_name (Ltd, PLC, Inc, LLP) | ✅ All pass |
| price (£, $, €, written codes) | ✅ All pass |
| id_number (PO, INV, REF, CASE, ORD) | ✅ All pass |
| Multi-category (company + id + price in one sentence) | ✅ Pass |
| private_person, private_email, private_phone | ✅ Pass |
| private_address, account_number, private_date | ✅ Pass |
| secret (API keys) | ⚠️ Occasional bleed into id_number |
| Negative (no PII) | ⚠️ Occasional company_name false positive |
Known Limitations
- Company names in isolated cells: the model relies on surrounding sentence context to classify noun phrases as company names, so a bare cell value with no surrounding text may not trigger detection.
- Price span boundaries: in some formats the currency symbol is occasionally left outside the detected span.
- secret vs id_number: API keys with alphanumeric patterns (e.g. sk-live-...) can be tagged as id_number rather than secret.
- English only: inherited from the base model; performance on non-English text and regional naming conventions is limited.
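For the price span boundary issue, one possible workaround is to widen a detected price span after inference so that it includes an adjacent currency symbol or code. The sketch below is a post-processing idea, not part of the opf package. It assumes character offsets (`start`, `end`) are available for each span, which this card does not confirm, and the `widen_price_span` helper and its list of symbols and codes are hypothetical.

```python
import re

# Assumption: only these symbols and codes matter; extend the lists as needed.
LEADING_SYMBOL = re.compile(r"[£$€]\s?\Z")
TRAILING_CODE = re.compile(r"\s?(?:GBP|USD|EUR)\b")


def widen_price_span(text, start, end):
    """Return (start, end) widened to cover an adjacent currency symbol or code."""
    # Look for a currency symbol immediately before the span.
    lead = LEADING_SYMBOL.search(text[:start])
    if lead:
        start = lead.start()
    # Look for a currency code immediately after the span.
    trail = TRAILING_CODE.match(text, end)
    if trail:
        end = trail.end()
    return start, end


text = "total £4,250.00 due"
start = text.index("4,250.00")
new_start, new_end = widen_price_span(text, start, start + len("4,250.00"))
print(text[new_start:new_end])
```

Spans with no neighbouring symbol or code are returned unchanged, so the helper can be applied to every price span.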
Licence
Apache 2.0 — same as the base model. Commercial use permitted.