Hi Friends
Is there a way to get intermediate tokens and merges used during BPE tokenization?
Example:
- vocab: "a", "b", "c", "ab", "", "abc"
- merges: "a b", "b c", "ab c"
What I want: tokenize("abc"): {"intermediate_tokens": ["a", "b", "ab", "c"], "intermediate_merges": ["a b"]}
I currently solve this by manually implementing BPE in Python, but my implementation is too slow.
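To make the question concrete, here is a minimal sketch of the kind of manual approach I mean. It is not my exact code and not any library's API: the function name `bpe_with_trace` and the returned keys are made up for illustration. It assumes merges are listed in priority order, and it treats "intermediate" as any token or merge whose result does not survive into the final tokenization.

```python
def bpe_with_trace(word, merges):
    """Greedy BPE on one word, recording tokens and merges seen along the way.

    Hypothetical helper for illustration. `merges` is a list of strings such
    as "a b", assumed to be ordered from highest to lowest priority.
    """
    # Lower rank means higher priority.
    ranks = {tuple(m.split(" ")): i for i, m in enumerate(merges)}
    symbols = list(word)                      # start from single characters
    seen = list(dict.fromkeys(symbols))       # tokens in order of first appearance
    applied = []                              # (merge string, produced token)

    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break                             # no applicable merge is left

        # Merge every occurrence of the best pair, scanning left to right.
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged

        new_token = best[0] + best[1]
        applied.append((" ".join(best), new_token))
        if new_token not in seen:
            seen.append(new_token)

    final = set(symbols)
    return {
        "tokens": symbols,
        # Assumption: "intermediate" excludes anything present in the final output.
        "intermediate_tokens": [t for t in seen if t not in final],
        "intermediate_merges": [m for m, tok in applied if tok not in final],
    }


result = bpe_with_trace("abc", ["a b", "b c", "ab c"])
print(result)
```

On the example above this gives the final token "abc", the same four intermediate tokens (in order of first appearance rather than the order I wrote them in), and the single intermediate merge "a b". The loop rescans all pairs after every merge, which is roughly why a pure-Python version like this gets slow on a large corpus.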