Adding a few other things

Files changed (5)

financial_bert.egg-info/PKG-INFO +248 -0
financial_bert.egg-info/SOURCES.txt +13 -0
financial_bert.egg-info/dependency_links.txt +1 -0
financial_bert.egg-info/requires.txt +7 -0
financial_bert.egg-info/top_level.txt +1 -0

financial_bert.egg-info/PKG-INFO ADDED

	@@ -0,0 +1,248 @@

+Metadata-Version: 2.4
+Name: financial-bert
+Version: 0.1.0
+Summary: Number-aware BERT for financial document understanding
+Author: Eloi de Reynal
+License-Expression: Apache-2.0
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+Requires-Dist: torch>=2.0
+Requires-Dist: transformers>=4.48
+Requires-Dist: beautifulsoup4>=4.12
+Provides-Extra: train
+Requires-Dist: tqdm; extra == "train"
+Requires-Dist: datasets; extra == "train"
+---
+language: en
+license: apache-2.0
+library_name: transformers
+tags:
+  - financial
+  - numbers
+  - modernbert
+  - mlm
+base_model: answerdotai/ModernBERT-base
+---
+# FinancialModernBERT
+A number-aware BERT model for financial document understanding, built on [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base).
+## What this model does differently
+Standard language models tokenize numbers as arbitrary subword pieces: "12,345" becomes tokens like "12", ",", "345", which obscures the number's magnitude. FinancialModernBERT addresses this with:
+1. **Number tagging**: A preprocessing step wraps numbers in `<number>...</number>` tags
+2. **Log-magnitude encoding**: Each number is reduced to its log₁₀ magnitude (e.g. 1000 → 3.0), which is mapped to a learned embedding by interpolating between magnitude bins
+3. **Dual prediction heads**: MLM head for text tokens + magnitude head for number tokens, trained jointly
+4. **Table-aware tokenization**: HTML tables are linearized with structural delimiters (`[TABLE_START]`, `\t`, `\n`, `[TABLE_END]`)
+The model handles magnitudes from 10⁻¹² to 10¹² (configurable).
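The log-magnitude encoding described above can be illustrated with a minimal sketch. It assumes 128 bins spaced evenly over log₁₀ magnitudes from -12 to 12, linear interpolation between the two nearest bins, and zero clamped to the smallest magnitude. The function names `magnitude_bin_weights` and `interpolate_embedding` are hypothetical; the package's actual `NumberEmbedder` may differ in all of these details.

```python
import math


def magnitude_bin_weights(value, num_bins=128, mag_min=-12.0, mag_max=12.0):
    """Return (lower_bin, upper_bin, upper_weight) for a number's log10 magnitude."""
    if value == 0:
        log_mag = mag_min  # assumption: zero is clamped to the smallest magnitude
    else:
        log_mag = math.log10(abs(value))  # the sign is ignored in this sketch
    log_mag = min(max(log_mag, mag_min), mag_max)  # clamp to the supported range

    # Fractional position on the bin axis, from 0 to num_bins - 1.
    position = (log_mag - mag_min) / (mag_max - mag_min) * (num_bins - 1)
    lower = min(int(math.floor(position)), num_bins - 2)
    upper = lower + 1
    return lower, upper, position - lower


def interpolate_embedding(table, value):
    """Blend the two nearest rows of an embedding table (a list of float lists)."""
    lower, upper, w = magnitude_bin_weights(value, num_bins=len(table))
    return [(1.0 - w) * a + w * b for a, b in zip(table[lower], table[upper])]
```

Under these assumptions, 1000 has log₁₀ magnitude 3.0 and falls between bins 79 and 80, so nearby magnitudes receive nearby embeddings instead of unrelated token vectors.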
+## Installation
+```bash
+pip install git+https://huggingface.co/edereynal/financial_bert
+```
+Or clone and install:
+```bash
+git clone https://huggingface.co/edereynal/financial_bert
+cd financial_bert
+pip install -e .
+```
+## Quick start
+### Preprocessing: tag numbers in your text
+Before tokenizing, numbers in your text must be wrapped in `<number>` tags. Use the built-in tagger:
+```python
+from financial_bert import tag_numbers_in_text
+raw_text = "Revenue increased to $1,234,567 from $987,654, a 25% increase."
+tagged = tag_numbers_in_text(raw_text)
+# "Revenue increased to $<number>1234567</number> from $<number>987654</number>, a <number>25</number>% increase."
+```
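To make the tagging step concrete, here is a simplified regex-based sketch of what such a tagger might do. It is not the implementation of `tag_numbers_in_text`; the name `tag_numbers_sketch` and the pattern are assumptions for illustration, and the sketch ignores signs, scientific notation, and locale-specific separators.

```python
import re

# Digits with optional thousands separators and an optional decimal part.
# The lookbehind skips digits that are glued to a word or a dot (e.g. "Q3").
_NUMBER = re.compile(r"(?<![\w.])(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?")


def tag_numbers_sketch(text):
    """Wrap each number in <number> tags, dropping thousands separators."""

    def wrap(match):
        digits = match.group(0).replace(",", "")
        return f"<number>{digits}</number>"

    return _NUMBER.sub(wrap, text)


tagged = tag_numbers_sketch("Revenue increased to $1,234,567 from $987,654, a 25% increase.")
```

On this sentence the sketch produces the same tagged string as the example above; on other inputs the built-in tagger may behave differently.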
+### Tokenization
+```python
+from financial_bert import FinancialBertTokenizer
+tokenizer = FinancialBertTokenizer("answerdotai/ModernBERT-base")
+text = "Revenue was $<number>1234567</number> in Q3."
+encoded = tokenizer(text, max_length=128)
+# Returns dict with:
+#   input_ids:      standard token IDs (numbers replaced with placeholder)
+#   attention_mask:  1 for real tokens, 0 for padding
+#   is_number_mask:  1 at number positions, 0 elsewhere
+#   number_values:   log10(magnitude) at number positions, 0.0 elsewhere
+```
+### Loading the model
+```python
+import torch
+from huggingface_hub import hf_hub_download
+from financial_bert import FinancialModernBert, FinancialModernBertConfig
+config = FinancialModernBertConfig.from_pretrained("answerdotai/ModernBERT-base")
+config.num_magnitude_bins = 128
+model = FinancialModernBert(config)
+# MLM pretrained weights (text + number prediction)
+weights_path = hf_hub_download("edereynal/financial_bert", "checkpoints/mlm_weights.pt")
+model.load_state_dict(torch.load(weights_path, map_location="cpu"))
+# Or, instead of the MLM weights: CLS encoder weights (trained with a T5-style contrastive objective; better for embeddings)
+weights_path = hf_hub_download("edereynal/financial_bert", "checkpoints/cls_encoder_weights.pt")
+model.load_state_dict(torch.load(weights_path, map_location="cpu"))
+```
+To build a fresh model from pretrained ModernBERT (no financial fine-tuning):
+```python
+from financial_bert import build_model
+model = build_model("answerdotai/ModernBERT-base")
+```
+### MLM inference
+```python
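+import torch
+from financial_bert import FinancialBertTokenizer
+# `model` is the FinancialModernBert instance loaded in the previous section
+tokenizer = FinancialBertTokenizer()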
+model.eval()
+text = "Total assets of $<number>5000000</number> and liabilities of $<number>3000000</number>."
+encoded = tokenizer(text, max_length=128)
+with torch.no_grad():
+    outputs = model(
+        input_ids=encoded["input_ids"],
+        number_values=encoded["number_values"],
+        is_number_mask=encoded["is_number_mask"],
+        attention_mask=encoded["attention_mask"],
+    )
+# outputs["text_logits"]:      (batch, seq_len, vocab_size)
+# outputs["magnitude_logits"]: (batch, seq_len, num_magnitude_bins)
+```
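One way to read the magnitude head's output is to take the highest-scoring bin at a number position and map it back to a value. The sketch below works on a plain list of logits for a single position and assumes bins spaced evenly over log₁₀ magnitudes from -12 to 12; the helper names `bin_to_value` and `decode_magnitude` are hypothetical, and the model's real bin layout should be checked against its config.

```python
def bin_to_value(bin_index, num_bins=128, mag_min=-12.0, mag_max=12.0):
    """Map a bin index to the value at that bin's log10 magnitude."""
    log_mag = mag_min + bin_index * (mag_max - mag_min) / (num_bins - 1)
    return 10.0 ** log_mag


def decode_magnitude(logits, mag_min=-12.0, mag_max=12.0):
    """Return (best_bin, approximate_value) for one position's magnitude logits."""
    best = max(range(len(logits)), key=logits.__getitem__)  # argmax over bins
    return best, bin_to_value(best, len(logits), mag_min, mag_max)
```

With the tensors from the block above, the logits for one position could be passed in as `outputs["magnitude_logits"][0, i].tolist()`. The result is only an order-of-magnitude estimate, not an exact number.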
+### CLS sentence embedding
+The CLS token (position 0) captures a document-level representation. It is trained with a T5-style encoder-decoder objective plus a supervised contrastive loss, so that chunks from the same document have similar CLS embeddings.
+```python
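+import torch
+from financial_bert import FinancialBertTokenizer
+# `model` is the FinancialModernBert instance loaded earlier (ideally with the CLS encoder weights)
+tokenizer = FinancialBertTokenizer()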
+model.eval()
+text = "Revenue grew <number>25</number>% year-over-year to $<number>1500000</number>."
+encoded = tokenizer(text, max_length=512)
+with torch.no_grad():
+    cls_embedding = model.get_cls_embedding(
+        input_ids=encoded["input_ids"],
+        number_values=encoded["number_values"],
+        is_number_mask=encoded["is_number_mask"],
+        attention_mask=encoded["attention_mask"],
+    )  # shape: (1, 768)
+```
+Use CLS embeddings for downstream tasks like classification, regression, or retrieval.
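For retrieval, a common approach is to rank candidate documents by cosine similarity to a query embedding. The sketch below uses plain Python lists so it runs without the model; `cosine_similarity` and `rank_by_similarity` are illustrative helpers, not part of `financial_bert`, and cosine similarity itself is an assumed choice of metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query, c), i) for i, c in enumerate(candidates)]
    return [i for _, i in sorted(scored, key=lambda pair: pair[0], reverse=True)]


# Toy 2-dimensional vectors standing in for 768-dimensional CLS embeddings.
ranking = rank_by_similarity([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]])
```

A real embedding can be converted with `cls_embedding[0].tolist()` before being passed to these helpers.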
+## Fine-tuning
+### MLM pre-training
+The MLM pipeline jointly trains all parameters: the backbone, the number embedder, and the number head.
+```python
+from financial_bert import build_model, FinancialBertTokenizer, tag_numbers_in_text
+import torch
+# Build model (initialized from pretrained ModernBERT)
+model = build_model("answerdotai/ModernBERT-base")
+tokenizer = FinancialBertTokenizer("answerdotai/ModernBERT-base")
+# Prepare a training example
+text = tag_numbers_in_text("Net income was $42,000,000 in fiscal year 2023.")
+encoded = tokenizer(text, max_length=256)
+# Create MLM labels (mask ~15% of tokens)
+input_ids = encoded["input_ids"].clone()
+is_number_mask = encoded["is_number_mask"]
+number_values = encoded["number_values"].clone()  # cloned because masked positions are overwritten below
+attention_mask = encoded["attention_mask"]
+# Random masking
+mask_prob = 0.15
+rand = torch.rand_like(input_ids, dtype=torch.float)
+mask_positions = (rand < mask_prob) & (attention_mask == 1)
+mask_positions[:, 0] = False  # don't mask CLS
+# Text labels
+labels_text = torch.full_like(input_ids, -100)
+text_mask_positions = mask_positions & (is_number_mask == 0)
+labels_text[text_mask_positions] = input_ids[text_mask_positions]
+input_ids[text_mask_positions] = tokenizer.mask_token_id
+# Number labels
+labels_magnitude = torch.full_like(number_values, -100.0)
+num_mask_positions = mask_positions & (is_number_mask == 1)
+labels_magnitude[num_mask_positions] = number_values[num_mask_positions]
+number_values[num_mask_positions] = model.config.magnitude_max + 1.0  # sentinel
+input_ids[num_mask_positions] = tokenizer.mask_token_id
+# Forward pass
+outputs = model(
+    input_ids=input_ids,
+    number_values=number_values,
+    is_number_mask=is_number_mask,
+    attention_mask=attention_mask,
+    labels_text=labels_text,
+    labels_magnitude=labels_magnitude,
+)
+loss = outputs["loss"]  # combined text CE + magnitude bin loss
+loss.backward()
+```
+### Classification / regression head
+```python
+import torch.nn as nn
+class FinancialClassifier(nn.Module):
+    def __init__(self, encoder, num_classes):
+        super().__init__()
+        self.encoder = encoder
+        self.head = nn.Linear(encoder.config.hidden_size, num_classes)
+    def forward(self, input_ids, number_values, is_number_mask, attention_mask):
+        cls = self.encoder.get_cls_embedding(
+            input_ids, number_values, is_number_mask, attention_mask
+        )
+        return self.head(cls)
+# `model` is the FinancialModernBert encoder loaded earlier
+classifier = FinancialClassifier(encoder=model, num_classes=3)
+```
+## Architecture details
+| Component | Description |
+|---|---|
+| **Backbone** | ModernBERT-base (149M params, 8192 token context, RoPE, Flash Attention) |
+| **NumberEmbedder** | 129 magnitude bins (128 + mask), interpolated embeddings |
+| **NumberHead** | Gated projection → LayerNorm → linear to magnitude bins |
+| **PredictionHead** | Dense → GELU → LayerNorm → tied decoder (standard MLM head) |
+## License
+Apache 2.0

financial_bert.egg-info/SOURCES.txt ADDED

	@@ -0,0 +1,13 @@

+README.md
+pyproject.toml
+financial_bert/__init__.py
+financial_bert/modeling.py
+financial_bert/table_utils.py
+financial_bert/tag_numbers.py
+financial_bert/tokenizer.py
+financial_bert.egg-info/PKG-INFO
+financial_bert.egg-info/SOURCES.txt
+financial_bert.egg-info/dependency_links.txt
+financial_bert.egg-info/requires.txt
+financial_bert.egg-info/top_level.txt
+tests/test_financial_numeracy.py

financial_bert.egg-info/dependency_links.txt ADDED

	@@ -0,0 +1 @@


+

financial_bert.egg-info/requires.txt ADDED

	@@ -0,0 +1,7 @@

+torch>=2.0
+transformers>=4.48
+beautifulsoup4>=4.12
+
+[train]
+tqdm
+datasets

financial_bert.egg-info/top_level.txt ADDED

	@@ -0,0 +1 @@


+financial_bert