edereynal committed
Commit 4de8e4b · 1 Parent(s): cdcbf7d

Adding a few other things

financial_bert.egg-info/PKG-INFO ADDED
@@ -0,0 +1,248 @@
+ Metadata-Version: 2.4
+ Name: financial-bert
+ Version: 0.1.0
+ Summary: Number-aware BERT for financial document understanding
+ Author: Eloi de Reynal
+ License-Expression: Apache-2.0
+ Requires-Python: >=3.9
+ Description-Content-Type: text/markdown
+ Requires-Dist: torch>=2.0
+ Requires-Dist: transformers>=4.48
+ Requires-Dist: beautifulsoup4>=4.12
+ Provides-Extra: train
+ Requires-Dist: tqdm; extra == "train"
+ Requires-Dist: datasets; extra == "train"
+
+ ---
+ language: en
+ license: apache-2.0
+ library_name: transformers
+ tags:
+ - financial
+ - numbers
+ - modernbert
+ - mlm
+ base_model: answerdotai/ModernBERT-base
+ ---
+
+ # FinancialModernBERT
+
+ A number-aware BERT model for financial document understanding, built on [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base).
+
+ ## What this model does differently
+
+ Standard language models tokenize numbers as arbitrary subword pieces ("12,345" becomes tokens like "12", ",", "345"), which obscures the number's magnitude. FinancialModernBERT addresses this with:
+
+ 1. **Number tagging**: A preprocessing step wraps numbers in `<number>...</number>` tags
+ 2. **Log-magnitude encoding**: Each number is represented by its log₁₀ magnitude (e.g. 1000 → 3.0), which is mapped to a learned embedding by interpolating between magnitude bins
+ 3. **Dual prediction heads**: An MLM head for text tokens and a magnitude head for number tokens, trained jointly
+ 4. **Table-aware tokenization**: HTML tables are linearized with structural delimiters (`[TABLE_START]`, `\t`, `\n`, `[TABLE_END]`)
+
+ The model handles magnitudes from 10⁻¹² to 10¹² (configurable).
43
+ ## Installation
44
+
45
+ ```bash
46
+ pip install git+https://huggingface.co/edereynal/financial_bert
47
+ ```
48
+
49
+ Or clone and install:
50
+
51
+ ```bash
52
+ git clone https://huggingface.co/edereynal/financial_bert
53
+ cd financial_bert
54
+ pip install -e .
55
+ ```
+
+ ## Quick start
+
+ ### Preprocessing: tag numbers in your text
+
+ Before tokenizing, numbers in your text must be wrapped in `<number>` tags. Use the built-in tagger:
+
+ ```python
+ from financial_bert import tag_numbers_in_text
+
+ raw_text = "Revenue increased to $1,234,567 from $987,654, a 25% increase."
+ tagged = tag_numbers_in_text(raw_text)
+ # "Revenue increased to $<number>1234567</number> from $<number>987654</number>, a <number>25</number>% increase."
+ ```
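
To make the tagging step concrete, here is a minimal regex-based sketch that reproduces the example output above. The function `tag_numbers_sketch` and its pattern are hypothetical stand-ins; the real `tag_numbers_in_text` may cover more cases (negatives, scientific notation, table cells) and is not shown in this commit.

```python
import re

# Digits with optional thousands separators and an optional decimal part.
# A trailing comma that is not followed by three digits is left untouched.
_NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

def tag_numbers_sketch(text):
    """Hypothetical tagger: wrap numbers in <number> tags, dropping commas."""
    def wrap(match):
        digits = match.group(0).replace(",", "")
        return "<number>" + digits + "</number>"
    return _NUMBER.sub(wrap, text)

raw_text = "Revenue increased to $1,234,567 from $987,654, a 25% increase."
print(tag_numbers_sketch(raw_text))
```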
+
+ ### Tokenization
+
+ ```python
+ from financial_bert import FinancialBertTokenizer
+
+ tokenizer = FinancialBertTokenizer("answerdotai/ModernBERT-base")
+
+ text = "Revenue was $<number>1234567</number> in Q3."
+ encoded = tokenizer(text, max_length=128)
+
+ # Returns dict with:
+ #   input_ids:       standard token IDs (numbers replaced with placeholder)
+ #   attention_mask:  1 for real tokens, 0 for padding
+ #   is_number_mask:  1 at number positions, 0 elsewhere
+ #   number_values:   log10(magnitude) at number positions, 0.0 elsewhere
+ ```
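
The relationship between the parallel outputs can be sketched with a toy encoder that splits on whitespace instead of using the real subword vocabulary. Everything here (`toy_encode`, the `[NUM]` placeholder string, storing 0.0 for a zero value) is an assumption for illustration and not the behavior of `FinancialBertTokenizer`.

```python
import math
import re

_TAG = re.compile(r"<number>(.*?)</number>")

def toy_encode(tagged_text):
    """Hypothetical encoder: one placeholder per tagged number, aligned masks."""
    tokens, is_number_mask, number_values = [], [], []

    def add_words(chunk):
        # Ordinary text: whitespace-separated words, no number information.
        for word in chunk.split():
            tokens.append(word)
            is_number_mask.append(0)
            number_values.append(0.0)

    pos = 0
    for match in _TAG.finditer(tagged_text):
        add_words(tagged_text[pos:match.start()])
        value = float(match.group(1))
        tokens.append("[NUM]")
        is_number_mask.append(1)
        # log10 of the magnitude; zero is stored as 0.0 in this sketch.
        number_values.append(math.log10(abs(value)) if value != 0 else 0.0)
        pos = match.end()
    add_words(tagged_text[pos:])
    return {"tokens": tokens, "is_number_mask": is_number_mask,
            "number_values": number_values}

enc = toy_encode("Revenue was $<number>1234567</number> in Q3.")
print(enc["tokens"])
print(enc["is_number_mask"])
```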
+
+ ### Loading the model
+
+ ```python
+ import torch
+ from huggingface_hub import hf_hub_download
+ from financial_bert import FinancialModernBert, FinancialModernBertConfig
+
+ config = FinancialModernBertConfig.from_pretrained("answerdotai/ModernBERT-base")
+ config.num_magnitude_bins = 128
+ model = FinancialModernBert(config)
+
+ # MLM pretrained weights (text + number prediction)
+ weights_path = hf_hub_download("edereynal/financial_bert", "checkpoints/mlm_weights.pt")
+ model.load_state_dict(torch.load(weights_path, map_location="cpu"))
+
+ # Or: CLS encoder weights (trained with a T5-style contrastive objective; better for embeddings)
+ weights_path = hf_hub_download("edereynal/financial_bert", "checkpoints/cls_encoder_weights.pt")
+ model.load_state_dict(torch.load(weights_path, map_location="cpu"))
+ ```
+
+ To build a fresh model from pretrained ModernBERT (no financial fine-tuning):
+
+ ```python
+ from financial_bert import build_model
+ model = build_model("answerdotai/ModernBERT-base")
+ ```
+
+ ### MLM inference
+
+ ```python
+ import torch
+ from financial_bert import FinancialBertTokenizer
+
+ # `model` is the FinancialModernBert instance loaded in the previous section
+ tokenizer = FinancialBertTokenizer()
+ model.eval()
+
+ text = "Total assets of $<number>5000000</number> and liabilities of $<number>3000000</number>."
+ encoded = tokenizer(text, max_length=128)
+
+ with torch.no_grad():
+     outputs = model(
+         input_ids=encoded["input_ids"],
+         number_values=encoded["number_values"],
+         is_number_mask=encoded["is_number_mask"],
+         attention_mask=encoded["attention_mask"],
+     )
+
+ # outputs["text_logits"]:      (batch, seq_len, vocab_size)
+ # outputs["magnitude_logits"]: (batch, seq_len, num_magnitude_bins)
+ ```
+
+ ### CLS sentence embedding
+
+ The CLS token (position 0) captures a document-level representation. It is trained with a T5-style encoder-decoder objective plus a supervised contrastive loss, which pulls the CLS embeddings of chunks from the same document together.
+
+ ```python
+ # Assumes `torch`, `FinancialBertTokenizer`, and a loaded `model` from the previous sections
+ tokenizer = FinancialBertTokenizer()
+ model.eval()
+
+ text = "Revenue grew <number>25</number>% year-over-year to $<number>1500000</number>."
+ encoded = tokenizer(text, max_length=512)
+
+ with torch.no_grad():
+     cls_embedding = model.get_cls_embedding(
+         input_ids=encoded["input_ids"],
+         number_values=encoded["number_values"],
+         is_number_mask=encoded["is_number_mask"],
+         attention_mask=encoded["attention_mask"],
+     )  # shape: (1, 768)
+ ```
+
+ Use CLS embeddings for downstream tasks like classification, regression, or retrieval.
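
For the retrieval case, one common approach is to rank candidates by cosine similarity to a query embedding. The sketch below uses plain Python lists with made-up 3-dimensional vectors so that it runs without the model; in practice each vector might come from something like `cls_embedding[0].tolist()`. The helper names are hypothetical, and cosine similarity is an assumed choice of metric rather than one this README prescribes.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, candidates):
    """Return candidate indices ordered from most to least similar."""
    scored = [(cosine_similarity(query, c), i) for i, c in enumerate(candidates)]
    return [i for _, i in sorted(scored, reverse=True)]

# Toy stand-ins for CLS embeddings (real ones would have 768 dimensions).
query = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
print(rank_by_similarity(query, candidates))
```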
+
+ ## Fine-tuning
+
+ ### MLM pre-training
+
+ The MLM pipeline trains all parameters (backbone, number embedder, and number head) jointly:
+
+ ```python
+ from financial_bert import build_model, FinancialBertTokenizer, tag_numbers_in_text
+ import torch
+
+ # Build model (initialized from pretrained ModernBERT)
+ model = build_model("answerdotai/ModernBERT-base")
+ tokenizer = FinancialBertTokenizer("answerdotai/ModernBERT-base")
+
+ # Prepare a training example
+ text = tag_numbers_in_text("Net income was $42,000,000 in fiscal year 2023.")
+ encoded = tokenizer(text, max_length=256)
+
+ # Create MLM labels (mask ~15% of tokens)
+ # Clone the tensors that are modified in place so `encoded` stays intact
+ input_ids = encoded["input_ids"].clone()
+ is_number_mask = encoded["is_number_mask"]
+ number_values = encoded["number_values"].clone()
+ attention_mask = encoded["attention_mask"]
+
+ # Random masking
+ mask_prob = 0.15
+ rand = torch.rand_like(input_ids, dtype=torch.float)
+ mask_positions = (rand < mask_prob) & (attention_mask == 1)
+ mask_positions[:, 0] = False  # don't mask CLS
+
+ # Text labels
+ labels_text = torch.full_like(input_ids, -100)
+ text_mask_positions = mask_positions & (is_number_mask == 0)
+ labels_text[text_mask_positions] = input_ids[text_mask_positions]
+ input_ids[text_mask_positions] = tokenizer.mask_token_id
+
+ # Number labels
+ labels_magnitude = torch.full_like(number_values, -100.0)
+ num_mask_positions = mask_positions & (is_number_mask == 1)
+ labels_magnitude[num_mask_positions] = number_values[num_mask_positions]
+ number_values[num_mask_positions] = model.config.magnitude_max + 1.0  # sentinel
+ input_ids[num_mask_positions] = tokenizer.mask_token_id
+
+ # Forward pass
+ outputs = model(
+     input_ids=input_ids,
+     number_values=number_values,
+     is_number_mask=is_number_mask,
+     attention_mask=attention_mask,
+     labels_text=labels_text,
+     labels_magnitude=labels_magnitude,
+ )
+
+ loss = outputs["loss"]  # combined text CE + magnitude bin loss
+ loss.backward()
+ ```
+
+ ### Classification / regression head
+
+ ```python
+ import torch.nn as nn
+
+ class FinancialClassifier(nn.Module):
+     def __init__(self, encoder, num_classes):
+         super().__init__()
+         self.encoder = encoder
+         self.head = nn.Linear(encoder.config.hidden_size, num_classes)
+
+     def forward(self, input_ids, number_values, is_number_mask, attention_mask):
+         # Pass by keyword, matching the get_cls_embedding call shown earlier
+         cls = self.encoder.get_cls_embedding(
+             input_ids=input_ids,
+             number_values=number_values,
+             is_number_mask=is_number_mask,
+             attention_mask=attention_mask,
+         )
+         return self.head(cls)
+
+ # Wrap the loaded encoder; a separate name keeps `model` available
+ classifier = FinancialClassifier(encoder=model, num_classes=3)
+ ```
+
+ ## Architecture details
+
+ | Component | Description |
+ |---|---|
+ | **Backbone** | ModernBERT-base (149M params, 8192 token context, RoPE, Flash Attention) |
+ | **NumberEmbedder** | 129 magnitude bins (128 + mask), interpolated embeddings |
+ | **NumberHead** | Gated projection → LayerNorm → linear to magnitude bins |
+ | **PredictionHead** | Dense → GELU → LayerNorm → tied decoder (standard MLM head) |
+
+ ## License
+
+ Apache 2.0
financial_bert.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,13 @@
+ README.md
+ pyproject.toml
+ financial_bert/__init__.py
+ financial_bert/modeling.py
+ financial_bert/table_utils.py
+ financial_bert/tag_numbers.py
+ financial_bert/tokenizer.py
+ financial_bert.egg-info/PKG-INFO
+ financial_bert.egg-info/SOURCES.txt
+ financial_bert.egg-info/dependency_links.txt
+ financial_bert.egg-info/requires.txt
+ financial_bert.egg-info/top_level.txt
+ tests/test_financial_numeracy.py
financial_bert.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
+
financial_bert.egg-info/requires.txt ADDED
@@ -0,0 +1,7 @@
+ torch>=2.0
+ transformers>=4.48
+ beautifulsoup4>=4.12
+
+ [train]
+ tqdm
+ datasets
financial_bert.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
+ financial_bert