---
language: en
license: apache-2.0
library_name: transformers
tags:
- financial
- numbers
- modernbert
- mlm
base_model: answerdotai/ModernBERT-base
---

# FinancialModernBERT

A number-aware BERT model for financial document understanding, built on [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base).

## What this model does differently

Standard language models tokenize numbers as arbitrary subword pieces ("12,345" becomes tokens like "12", ",", "345"), losing their numerical meaning. FinancialModernBERT addresses this with:

1. **Number tagging**: A preprocessing step wraps numbers in `...` tags
2. **Log-magnitude encoding**: Each number is reduced to its log₁₀ magnitude (e.g. 1000 → 3.0), which is mapped to a learned embedding via interpolated magnitude bins
3. **Dual prediction heads**: MLM head for text tokens + magnitude head for number tokens, trained jointly
4. **Table-aware tokenization**: HTML tables are linearized with structural delimiters (`[TABLE_START]`, `\t`, `\n`, `[TABLE_END]`)

The model handles magnitudes from 10⁻¹² to 10¹² (configurable).

## Installation

```bash
pip install git+https://huggingface.co/edereynal/financial_bert
```

Or clone and install:

```bash
git clone https://huggingface.co/edereynal/financial_bert
cd financial_bert
pip install -e .
```

## Quick start

### Preprocessing: tag numbers in your text

Before tokenizing, numbers in your text must be wrapped in `` tags. Use the built-in tagger:

```python
from financial_bert import tag_numbers_in_text

raw_text = "Revenue increased to $1,234,567 from $987,654, a 25% increase."
tagged = tag_numbers_in_text(raw_text)
# "Revenue increased to $1234567 from $987654, a 25% increase."
```

### Tokenization

```python
from financial_bert import FinancialBertTokenizer

tokenizer = FinancialBertTokenizer("answerdotai/ModernBERT-base")
text = "Revenue was $1234567 in Q3."
encoded = tokenizer(text, max_length=128)
# Returns dict with:
#   input_ids:       standard token IDs (numbers replaced with placeholder)
#   attention_mask:  1 for real tokens, 0 for padding
#   is_number_mask:  1 at number positions, 0 elsewhere
#   number_values:   log10(magnitude) at number positions, 0.0 elsewhere
```

### Loading the model

```python
import torch
from huggingface_hub import hf_hub_download
from financial_bert import FinancialModernBert, FinancialModernBertConfig

config = FinancialModernBertConfig.from_pretrained("answerdotai/ModernBERT-base")
config.num_magnitude_bins = 128
model = FinancialModernBert(config)

# MLM pretrained weights (text + number prediction)
weights_path = hf_hub_download("edereynal/financial_bert", "checkpoints/mlm_weights.pt")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))

# Or: CLS encoder weights (trained with encoder/decoder bottleneck objective; better for embeddings)
# Running both loads leaves the model with the CLS weights, so keep only the one you need.
weights_path = hf_hub_download("edereynal/financial_bert", "checkpoints/cls_encoder_weights.pt")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
```

To build a fresh model from pretrained ModernBERT (no financial fine-tuning):

```python
from financial_bert import build_model

model = build_model("answerdotai/ModernBERT-base")
```

### MLM inference

```python
import torch
from financial_bert import FinancialBertTokenizer

# `model` is the FinancialModernBert instance from "Loading the model"
tokenizer = FinancialBertTokenizer()
model.eval()

text = "Total assets of $5000000 and liabilities of $3000000."
encoded = tokenizer(text, max_length=128)

with torch.no_grad():
    outputs = model(
        input_ids=encoded["input_ids"],
        number_values=encoded["number_values"],
        is_number_mask=encoded["is_number_mask"],
        attention_mask=encoded["attention_mask"],
    )

# outputs["text_logits"]:      (batch, seq_len, vocab_size)
# outputs["magnitude_logits"]: (batch, seq_len, num_magnitude_bins)
```

### CLS sentence embedding

The CLS token (position 0) captures a document-level representation.
It is trained with a CLS-bottleneck encoder/decoder objective in which the decoder reconstructs masked chunks from the encoder's CLS embedding alone.

```python
import torch
from financial_bert import FinancialBertTokenizer

# `model` is the FinancialModernBert instance from "Loading the model"
tokenizer = FinancialBertTokenizer()
model.eval()

text = "Revenue grew 25% year-over-year to $1500000."
encoded = tokenizer(text, max_length=512)

with torch.no_grad():
    cls_embedding = model.get_cls_embedding(
        input_ids=encoded["input_ids"],
        number_values=encoded["number_values"],
        is_number_mask=encoded["is_number_mask"],
        attention_mask=encoded["attention_mask"],
    )
# shape: (1, 768)
```

Use CLS embeddings for downstream tasks like classification, regression, or retrieval.

## Fine-tuning

### MLM pre-training

The MLM pipeline trains all parameters (backbone, number embedder, and number head) jointly:

```python
from financial_bert import build_model, FinancialBertTokenizer, tag_numbers_in_text
import torch

# Build model (initialized from pretrained ModernBERT)
model = build_model("answerdotai/ModernBERT-base")
tokenizer = FinancialBertTokenizer("answerdotai/ModernBERT-base")

# Prepare a training example
text = tag_numbers_in_text("Net income was $42,000,000 in fiscal year 2023.")
encoded = tokenizer(text, max_length=256)

# Create MLM labels (mask ~15% of tokens)
# Clone the tensors that are modified in place so `encoded` stays untouched
input_ids = encoded["input_ids"].clone()
is_number_mask = encoded["is_number_mask"]
number_values = encoded["number_values"].clone()
attention_mask = encoded["attention_mask"]

# Random masking
mask_prob = 0.15
rand = torch.rand_like(input_ids, dtype=torch.float)
mask_positions = (rand < mask_prob) & (attention_mask == 1)
mask_positions[:, 0] = False  # don't mask CLS

# Text labels: -100 everywhere except masked text positions
labels_text = torch.full_like(input_ids, -100)
text_mask_positions = mask_positions & (is_number_mask == 0)
labels_text[text_mask_positions] = input_ids[text_mask_positions]
input_ids[text_mask_positions] = tokenizer.mask_token_id

# Number labels: -100.0 everywhere except masked number positions
labels_magnitude = torch.full_like(number_values, -100.0)
num_mask_positions = mask_positions & (is_number_mask == 1)
labels_magnitude[num_mask_positions] = number_values[num_mask_positions]
number_values[num_mask_positions] = model.config.magnitude_max + 1.0  # sentinel
input_ids[num_mask_positions] = tokenizer.mask_token_id

# Forward pass
outputs = model(
    input_ids=input_ids,
    number_values=number_values,
    is_number_mask=is_number_mask,
    attention_mask=attention_mask,
    labels_text=labels_text,
    labels_magnitude=labels_magnitude,
)
loss = outputs["loss"]  # combined text CE + magnitude bin loss
loss.backward()
```

### Classification / regression head

```python
import torch.nn as nn

class FinancialClassifier(nn.Module):
    """Linear head on top of the encoder's CLS embedding."""

    def __init__(self, encoder, num_classes):
        super().__init__()
        self.encoder = encoder
        # For regression, use num_classes=1
        self.head = nn.Linear(encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, number_values, is_number_mask, attention_mask):
        cls = self.encoder.get_cls_embedding(
            input_ids, number_values, is_number_mask, attention_mask
        )
        return self.head(cls)

# Wrap the FinancialModernBert instance loaded earlier
model = FinancialClassifier(encoder=model, num_classes=3)
```

## Benchmarks

### Numeracy ordering (29 test groups)

Each test group has three structurally identical sentences that differ only in numerical magnitude (low, mid, high), with a tight ~5x spread within the same unit (e.g. $74.1M / $192.8M / $381.5M). The groups include prose statements (dollar amounts, percentages, ratios, per-share figures) and HTML financial tables (income statements, balance sheets, cash flow, per-share data).

Here d(a,b) is the distance between the CLS embeddings of sentences a and b:

- **Hard pass**: d(low,mid) < d(low,high) AND d(mid,high) < d(low,high), i.e. low and high are the farthest-apart pair, so mid lies between them in embedding space
- **Soft pass**: avg(d(low,mid), d(mid,high)) < d(low,high)

Distance metric: MSE on raw (unnormalized) CLS embeddings.

| Model | Hard | Soft |
|---|---|---|
| **CLS (enc/dec)** | **17/29 (59%)** | **24/29 (83%)** |
| ModernBERT-base | 11/29 (38%) | 13/29 (45%) |
| BGE-base-v1.5 | 10/29 (34%) | 15/29 (52%) |

The CLS encoder/decoder model preserves numerical ordering in its embeddings more often than either baseline, even at tight magnitude spreads.
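
The hard and soft criteria can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the benchmark script: `mse` and `ordering_pass` are hypothetical helper names, and the toy 2-D vectors stand in for real CLS embeddings.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors (hypothetical helper)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def ordering_pass(low, mid, high):
    """Return (hard_pass, soft_pass) for one low/mid/high group (hypothetical helper)."""
    d_lm = mse(low, mid)
    d_mh = mse(mid, high)
    d_lh = mse(low, high)
    # Hard: low and high must be the farthest-apart pair
    hard = d_lm < d_lh and d_mh < d_lh
    # Soft: the average of the two mid distances must be below d(low, high)
    soft = (d_lm + d_mh) / 2 < d_lh
    return hard, soft


# Toy vectors standing in for CLS embeddings (assumed values, for illustration only)
print(ordering_pass([0.0, 0.0], [1.0, 0.0], [2.0, 0.0]))  # (True, True)
print(ordering_pass([0.0, 0.0], [2.5, 0.0], [2.0, 0.0]))  # (False, True)
```

In the first toy group mid sits between low and high, so both criteria hold. In the second, mid overshoots high, so the hard criterion fails while the averaged one still passes.
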
ModernBERT-base and BGE-base-v1.5 both fall to near chance on the hard criterion (if the three pairwise distances were ordered at random, d(low,high) would be the largest about one time in three). This suggests that the enc/dec training objective gives the model magnitude sensitivity beyond what the pretrained backbone or a general embedding model provides.

### Semantic retrieval (20 query-match pairs)

Each query is a financial statement with specific numbers; each match is a paraphrase with rounded/restated figures. All 20 matches form the distractor pool. Metrics: Recall@1 and mean reciprocal rank (MRR), using cosine similarity on L2-normalized CLS embeddings.

| Model | Recall@1 | MRR |
|---|---|---|
| BGE-base-v1.5 | **20/20** | **1.000** |
| **CLS (enc/dec)** | 14/20 | 0.770 |
| ModernBERT-base | 1/20 | 0.207 |

The CLS encoder/decoder objective gives the model much stronger semantic matching (14/20 Recall@1) than the ModernBERT-base backbone without this training (1/20), though it does not match a purpose-built embedding model like BGE.

## Architecture details

| Component | Description |
|---|---|
| **Backbone** | ModernBERT-base (149M params, 8192 token context, RoPE, Flash Attention) |
| **NumberEmbedder** | 129 magnitude bins (128 + mask), interpolated embeddings |
| **NumberHead** | Gated projection → LayerNorm → linear to magnitude bins |
| **PredictionHead** | Dense → GELU → LayerNorm → tied decoder (standard MLM head) |

## License

Apache 2.0