privacy-filter-finetuned

A finetuned version of openai/privacy-filter that adds three detection categories to the original eight. It is built for procurement, finance, and sales datasets where company names, prices, and business reference IDs are as sensitive as personal data.

Runs entirely locally. No data leaves the device during inference.


New Categories

This checkpoint adds three categories to the base model's original eight:

| Category | What it detects | Examples |
|---|---|---|
| company_name | Supplier names, client account names, corporate entities | Meridian Logistics Ltd, Apex Manufacturing PLC, TechStart Solutions Inc |
| price | Monetary values with currency symbols or codes | £4,250.00, $12,500, €5,000, 2,500 GBP |
| id_number | Purchase orders, invoice numbers, reference IDs, case numbers, internal codes | PO-00442, INV/2024/00567, REF-2024-001, CASE-20240315-001, ORD-78542 |

Full Label Space (11 categories plus the background label O)

{
  "category_version": "custom_v1_extended",
  "span_class_names": [
    "O",
    "private_person",
    "private_email",
    "private_phone",
    "private_address",
    "account_number",
    "private_url",
    "private_date",
    "secret",
    "company_name",
    "price",
    "id_number"
  ]
}
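As a small illustration of how this label space can be read, the sketch below parses the JSON above and separates the background label from the detection categories. The variable names are made up for the example, and treating the last three entries as the new categories is an assumption based on the order of the listing, not on anything in the opf package.

```python
import json

# The label space exactly as listed above.
LABEL_SPACE_JSON = """
{
  "category_version": "custom_v1_extended",
  "span_class_names": [
    "O",
    "private_person",
    "private_email",
    "private_phone",
    "private_address",
    "account_number",
    "private_url",
    "private_date",
    "secret",
    "company_name",
    "price",
    "id_number"
  ]
}
"""

label_space = json.loads(LABEL_SPACE_JSON)
names = label_space["span_class_names"]

# "O" is the background label; every other entry is a detection category.
categories = [name for name in names if name != "O"]

# Assumption: the three added categories are the last three entries in the list.
new_categories = categories[-3:]

print(len(categories))   # 11
print(new_categories)    # ['company_name', 'price', 'id_number']
```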

Usage

Install the official opf package from the openai/privacy-filter repo:

git clone https://github.com/openai/privacy-filter
cd privacy-filter
pip install -e .

Download this checkpoint:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="galexdav/privacy-filter-finetuned",
    local_dir="./privacy-filter-finetuned"
)

Run inference:

from opf import OPF

model = OPF(model="./privacy-filter-finetuned", device="cpu")
result = model.redact("PO-00442 raised for Meridian Logistics Ltd, total £4,250.00.")

for span in result.detected_spans:
    print(f"{span.label}: '{span.text}'")

# id_number: 'PO-00442'
# company_name: 'Meridian Logistics Ltd'
# price: '£4,250.00'

Or via the CLI:

opf --checkpoint ./privacy-filter-finetuned --device cpu \
  "Invoice INV/2024/00567 from Apex Manufacturing PLC — amount £9,600.00"

Training Details

| Detail | Value |
|---|---|
| Base model | openai/privacy-filter |
| Training examples | ~830 |
| Validation examples | ~130 |
| Epochs | 3 |
| Hardware | 1× NVIDIA L4 (24 GB) via Hugging Face Jobs |
| Training time | ~9 minutes |

Training data was generated programmatically using a template × entity cross-product approach, covering all 11 categories including the original 8 (to prevent catastrophic forgetting).
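The template × entity cross-product idea can be sketched roughly as follows. This is not the generation script used for this checkpoint: the templates, entity lists, helper names (`fill`, `generate`), and the output record layout are all hypothetical, and only three of the 11 categories are shown to keep the example short.

```python
import itertools
import string

# Hypothetical templates and entity values, for illustration only.
TEMPLATES = [
    "{id_number} raised for {company_name}, total {price}.",
    "Invoice {id_number} from {company_name} for {price}.",
]
ENTITIES = {
    "company_name": ["Meridian Logistics Ltd", "Apex Manufacturing PLC"],
    "price": ["£4,250.00", "2,500 GBP"],
    "id_number": ["PO-00442", "INV/2024/00567"],
}

def fill(template, values):
    """Render one template and record the character offsets of each inserted entity."""
    text, spans = "", []
    for literal, field, _spec, _conv in string.Formatter().parse(template):
        text += literal
        if field is not None:
            start = len(text)
            text += values[field]
            spans.append({"label": field, "start": start, "end": len(text)})
    return {"text": text, "spans": spans}

def generate(templates, entities):
    """Yield one labelled example per (template, entity combination) pair."""
    labels = sorted(entities)
    for template in templates:
        for combo in itertools.product(*(entities[label] for label in labels)):
            yield fill(template, dict(zip(labels, combo)))

examples = list(generate(TEMPLATES, ENTITIES))
print(len(examples))        # 2 templates x (2 x 2 x 2) entity combinations = 16
print(examples[0]["text"])  # PO-00442 raised for Meridian Logistics Ltd, total £4,250.00.
```

Because the entity text is inserted programmatically, the span offsets are known exactly and no manual annotation is needed. A full cross-product grows quickly, so a real script would probably sample from it rather than keep every combination.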


Test Results

22/24 tests passing on the held-out test suite:

| Category | Result |
|---|---|
| company_name (Ltd, PLC, Inc, LLP) | ✅ All pass |
| price (£, $, €, written codes) | ✅ All pass |
| id_number (PO, INV, REF, CASE, ORD) | ✅ All pass |
| Multi-category (company + id + price in one sentence) | ✅ Pass |
| private_person, private_email, private_phone | ✅ Pass |
| private_address, account_number, private_date | ✅ Pass |
| secret (API keys) | ⚠️ Occasional bleed into id_number |
| Negative (no PII) | ⚠️ Occasional company_name false positive |
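For readers who want to build a similar check, a pass/fail tally of this kind can be sketched as below. The actual test suite is not reproduced here: the cases, the exact-match pass criterion, and the `run_suite` and `fake_detect` helpers are assumptions made for illustration. A stand-in detector is used so the sketch runs without the model.

```python
from typing import Callable, Iterable

Span = tuple[str, str]  # (label, text)

def run_suite(detect: Callable[[str], Iterable[Span]],
              cases: list[tuple[str, set[Span]]]) -> tuple[int, int]:
    """Count cases where the detected (label, text) pairs equal the expected set."""
    passed = 0
    for text, expected in cases:
        if set(detect(text)) == expected:
            passed += 1
    return passed, len(cases)

def fake_detect(text: str) -> list[Span]:
    """Stand-in for the model: a lookup over two known strings."""
    known = {"PO-00442": "id_number", "Meridian Logistics Ltd": "company_name"}
    return [(label, value) for value, label in known.items() if value in text]

# Hypothetical cases: one positive example and one negative (no PII) example.
CASES = [
    ("PO-00442 raised for Meridian Logistics Ltd.",
     {("id_number", "PO-00442"), ("company_name", "Meridian Logistics Ltd")}),
    ("The meeting is at noon.", set()),
]

passed, total = run_suite(fake_detect, CASES)
print(f"{passed}/{total} tests passing")  # 2/2 tests passing
```

To run this against the checkpoint, `fake_detect` could be replaced by a wrapper that returns `(span.label, span.text)` for each entry in `model.redact(text).detected_spans`, following the inference example in the Usage section.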

Known Limitations

  • Company names in isolated cells — the model relies on surrounding sentence context to classify noun phrases as company names. A bare cell value with no surrounding text may not trigger detection.
  • Price span boundaries — currency symbols are occasionally left outside the detected span boundary in some formats.
  • secret vs id_number — API keys with alphanumeric patterns (e.g. sk-live-...) can be tagged as id_number rather than secret.
  • English only — inherited from the base model; performance on non-English text and regional naming conventions is limited.
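Two of the limitations above could be softened with light post-processing. The sketch below is a workaround under stated assumptions, not part of the opf package: it assumes character offsets are available for each span, and the currency symbol set, the `sk-` prefix list, and both helper names are hypothetical.

```python
# Assumed symbol set; extend as needed for other currencies.
CURRENCY_SYMBOLS = "£$€"

# Hypothetical prefixes that mark a value as a secret rather than a reference ID.
SECRET_PREFIXES = ("sk-",)

def widen_price_span(text: str, start: int, end: int) -> tuple[int, int]:
    """Extend [start, end) one character left if a currency symbol directly precedes it."""
    if start > 0 and text[start - 1] in CURRENCY_SYMBOLS:
        start -= 1
    return start, end

def relabel_secret(label: str, span_text: str) -> str:
    """Move id_number spans that look like API keys to the secret category."""
    if label == "id_number" and span_text.startswith(SECRET_PREFIXES):
        return "secret"
    return label

text = "total £4,250.00."
start = text.index("4,250.00")          # simulate a span that missed the symbol
end = start + len("4,250.00")
start, end = widen_price_span(text, start, end)
print(text[start:end])                  # £4,250.00

print(relabel_secret("id_number", "sk-live-PLACEHOLDER"))  # secret
print(relabel_secret("id_number", "PO-00442"))             # id_number
```

Both helpers are deliberately narrow: they only touch spans the model has already detected, so they cannot introduce new false positives.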

Licence

Apache 2.0 — same as the base model. Commercial use permitted.

Model size: 1B params · Tensor type: BF16 · Format: Safetensors