privacy-filter-finetuned

A finetuned version of openai/privacy-filter that adds three detection categories to the original eight. It is built for procurement, finance, and sales datasets, where company names, prices, and business reference IDs are as sensitive as personal data.

Runs entirely locally. No data leaves the device during inference.

New Categories

This checkpoint adds three categories to the base model's original eight:

| Category | What it detects | Examples |
| --- | --- | --- |
| `company_name` | Supplier names, client account names, corporate entities | `Meridian Logistics Ltd`, `Apex Manufacturing PLC`, `TechStart Solutions Inc` |
| `price` | Monetary values with currency symbols or codes | `£4,250.00`, `$12,500`, `€5,000`, `2,500 GBP` |
| `id_number` | Purchase orders, invoice numbers, reference IDs, case numbers, internal codes | `PO-00442`, `INV/2024/00567`, `REF-2024-001`, `CASE-20240315-001`, `ORD-78542` |

Full Label Space (11 categories)

```json
{
  "category_version": "custom_v1_extended",
  "span_class_names": [
    "O",
    "private_person",
    "private_email",
    "private_phone",
    "private_address",
    "account_number",
    "private_url",
    "private_date",
    "secret",
    "company_name",
    "price",
    "id_number"
  ]
}
```
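As a minimal sketch of how this label space can be consumed, the snippet below parses the same JSON and builds label-to-index lookups. The variable names are illustrative, and treating a class index as its position in `span_class_names` is an assumption; the source does not state how the checkpoint maps indices internally.

```python
import json

# The label-space JSON from this section, embedded for the example.
LABEL_SPACE_JSON = """
{
  "category_version": "custom_v1_extended",
  "span_class_names": [
    "O", "private_person", "private_email", "private_phone",
    "private_address", "account_number", "private_url", "private_date",
    "secret", "company_name", "price", "id_number"
  ]
}
"""

label_space = json.loads(LABEL_SPACE_JSON)
class_names = label_space["span_class_names"]

# Assumption: a class index is simply its position in the list.
label_to_id = {name: i for i, name in enumerate(class_names)}
id_to_label = {i: name for name, i in label_to_id.items()}

# By the usual tagging convention, "O" marks text outside any span,
# so it is not counted among the 11 categories.
categories = [name for name in class_names if name != "O"]
new_categories = categories[-3:]

print(len(class_names), len(categories))  # 12 11
print(new_categories)
```

The list has 12 entries because `O` sits alongside the 11 detection categories, and the three new categories are appended after the base model's eight.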

Usage

Install the official opf package from the openai/privacy-filter repo:

```bash
git clone https://github.com/openai/privacy-filter
cd privacy-filter
pip install -e .
```

Download this checkpoint:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="galexdav/privacy-filter-finetuned",
    local_dir="./privacy-filter-finetuned",
)
```

Run inference:

```python
from opf import OPF

# Load the finetuned checkpoint from the local directory and run on CPU.
model = OPF(model="./privacy-filter-finetuned", device="cpu")
result = model.redact("PO-00442 raised for Meridian Logistics Ltd, total £4,250.00.")

# Print each detected span with its category label.
for span in result.detected_spans:
    print(f"{span.label}: '{span.text}'")

# Expected output:
# id_number: 'PO-00442'
# company_name: 'Meridian Logistics Ltd'
# price: '£4,250.00'
```
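To illustrate one thing that can be done with detected spans, here is a hedged sketch that replaces each span's text with a bracketed label. It works on plain `(label, text)` tuples rather than on the `opf` span objects, because only the `label` and `text` attributes appear in the example above. `mask_spans` is a hypothetical helper, not part of `opf`.

```python
from typing import Iterable, Tuple


def mask_spans(text: str, spans: Iterable[Tuple[str, str]]) -> str:
    """Replace the first occurrence of each span's text with [label].

    Hypothetical helper: matching by substring is a simplification and
    would misbehave if one span's text is contained in another's.
    """
    masked = text
    for label, span_text in spans:
        masked = masked.replace(span_text, f"[{label}]", 1)
    return masked


sentence = "PO-00442 raised for Meridian Logistics Ltd, total £4,250.00."
# Spans copied from the expected output of the inference example.
spans = [
    ("id_number", "PO-00442"),
    ("company_name", "Meridian Logistics Ltd"),
    ("price", "£4,250.00"),
]

print(mask_spans(sentence, spans))
# [id_number] raised for [company_name], total [price].
```

With real results, the tuples could presumably be built as `[(s.label, s.text) for s in result.detected_spans]`, assuming those two attributes behave as shown in the inference example.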

Or via the CLI:

```bash
opf --checkpoint ./privacy-filter-finetuned --device cpu \
  "Invoice INV/2024/00567 from Apex Manufacturing PLC — amount £9,600.00"
```

Training Details


| Setting | Value |
| --- | --- |
| Base model | openai/privacy-filter |
| Training examples | ~830 |
| Validation examples | ~130 |
| Epochs | 3 |
| Hardware | 1× NVIDIA L4 (24 GB) via Hugging Face Jobs |
| Training time | ~9 minutes |

Training data was generated programmatically using a template × entity cross-product approach, covering all 11 categories including the original 8 (to prevent catastrophic forgetting).
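The template × entity cross-product can be sketched roughly as follows. The templates, slot names, entity pools, and the `(text, spans)` output format with character offsets are all illustrative assumptions; the actual generator and data schema are not described here, and the real dataset covers all 11 categories rather than the three shown.

```python
from itertools import product

# Illustrative templates; the slot names {po}, {company}, {price} are made up.
TEMPLATES = [
    "{po} raised for {company}, total {price}.",
    "Invoice {po} from {company}: amount {price}",
]

# Illustrative entity pools, reusing examples from the category table.
# Each slot maps to (label, candidate values).
ENTITIES = {
    "company": ("company_name", ["Meridian Logistics Ltd", "Apex Manufacturing PLC"]),
    "price": ("price", ["£4,250.00", "2,500 GBP"]),
    "po": ("id_number", ["PO-00442", "INV/2024/00567"]),
}


def generate_examples():
    """Yield (text, spans) for every template x entity combination."""
    slots = list(ENTITIES)
    pools = [ENTITIES[slot][1] for slot in slots]
    for template, values in product(TEMPLATES, product(*pools)):
        fill = dict(zip(slots, values))
        text = template.format(**fill)
        spans = []
        for slot, value in fill.items():
            start = text.index(value)  # each value occurs once per text here
            spans.append((start, start + len(value), ENTITIES[slot][0]))
        yield text, sorted(spans)


examples = list(generate_examples())
print(len(examples))  # 2 templates x (2 x 2 x 2) entity choices = 16
print(examples[0])
```

Because the labels are attached at generation time, every example comes with exact span boundaries and needs no manual annotation.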

Test Results

22/24 tests passing on the held-out test suite:

| Category | Result |
| --- | --- |
| company_name (Ltd, PLC, Inc, LLP) | ✅ All pass |
| price (£, $, €, written codes) | ✅ All pass |
| id_number (PO, INV, REF, CASE, ORD) | ✅ All pass |
| Multi-category (company + id + price in one sentence) | ✅ Pass |
| private_person, private_email, private_phone | ✅ Pass |
| private_address, account_number, private_date | ✅ Pass |
| secret (API keys) | ⚠️ Occasional bleed into id_number |
| Negative (no PII) | ⚠️ Occasional company_name false positive |

Known Limitations

Company names in isolated cells: the model relies on surrounding sentence context to classify noun phrases as company names, so a bare cell value with no surrounding text may not trigger detection.
Price span boundaries: in some formats, the currency symbol is occasionally left outside the detected span.
secret vs id_number: API keys with alphanumeric patterns (e.g. sk-live-...) can be tagged as id_number rather than secret.
English only: inherited from the base model; performance on non-English text and regional naming conventions is limited.
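For the price span boundary issue, one possible workaround is a small post-processing step that widens a span by one character when a currency symbol sits directly before it. This is only a sketch: `widen_price_span` is a hypothetical helper, and it assumes character offsets are available for each span, which the examples above do not show.

```python
from typing import Tuple

CURRENCY_SYMBOLS = "£$€"


def widen_price_span(text: str, start: int, end: int) -> Tuple[int, int]:
    """Extend a price span left by one character if that character is a currency symbol.

    Hypothetical post-processing step; `start` and `end` are character offsets.
    """
    if start > 0 and text[start - 1] in CURRENCY_SYMBOLS:
        return start - 1, end
    return start, end


text = "Total due: £4,250.00"
# Suppose the detected span covered only the digits.
start = text.index("4,250.00")
end = start + len("4,250.00")

new_start, new_end = widen_price_span(text, start, end)
print(text[new_start:new_end])  # £4,250.00
```

Spans that already include the symbol, or that have no symbol in front of them, are returned unchanged.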

Licence

Apache 2.0 — same as the base model. Commercial use permitted.

Model weights: Safetensors format, 1B params, BF16 tensors.
