TokenClassificationPipeline produces entities with "##" characters

dbejean · May 19, 2025, 10:35am

Hi,

When I use TokenClassificationPipeline with some models, I get entities containing ## characters, for instance with the “elastic/distilbert-base-cased-finetuned-conll03-english” model.

It looks like a tokenization issue, possibly related to accented characters.
Any ideas on what causes this and how to fix it?

Here is my code and sample output.

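from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline

NER_MODEL_NAME = ...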

tokenizer = AutoTokenizer.from_pretrained(NER_MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(NER_MODEL_NAME)

# Create a pipeline for NER
ner_pipeline = TokenClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

# Run NER
text = "Côme habite à Aix-en-Provence et travaille pour l’INRIA."
entities = ner_pipeline(text)

# Print results
for entity in entities:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")

Here is the output with two models.

With NER_MODEL_NAME = ‘CATIE-AQ/NERmembert-base-4entities’
Entities are correct:

    Côme -> PER (0.67)
    Aix-en-Provence -> LOC (1.00)
    INRIA -> ORG (1.00)

With NER_MODEL_NAME = ‘elastic/distilbert-base-cased-finetuned-conll03-english’
Entities are not correct:

    C -> PER (0.50)
    ##ôme -> ORG (0.40)
    Ai -> LOC (0.62)
    ##x -> LOC (0.97)
    - -> LOC (0.64)
    en -> LOC (0.91)
    - -> LOC (0.81)
    Provence -> LOC (0.87)
    et travaille pour l ’ INRIA -> ORG (0.89)

Regards.

Dominique

John6666 · May 19, 2025, 12:44pm

It seems you’ve unearthed an ancient bug. A living fossil of a bug.

from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline

NER_MODEL_NAME = "elastic/distilbert-base-cased-finetuned-conll03-english"
"""
C -> PER (0.50)
##ôme -> ORG (0.40)
Ai -> LOC (0.62)
##x -> LOC (0.97)
- -> LOC (0.64)
en -> LOC (0.91)
- -> LOC (0.81)
Provence -> LOC (0.87)
et travaille pour l ’ INRIA -> ORG (0.89)
"""
NER_MODEL_NAME = "CATIE-AQ/NERmembert-base-4entities"
"""
Côme -> PER (0.67)
Aix-en-Provence -> LOC (1.00)
INRIA -> ORG (1.00)
"""
NER_MODEL_NAME = "elastic/distilbert-base-uncased-finetuned-conll03-english"
# aix - en - provence et travaille pour l ’ inria -> ORG (0.87)
NER_MODEL_NAME = "distilbert/distilbert-base-multilingual-cased"
"""
Côme -> LABEL_0 (0.57)
habite -> LABEL_1 (0.53)
à -> LABEL_0 (0.53)
Aix - en - -> LABEL_1 (0.52)
Provence et travaille pour l -> LABEL_0 (0.55)
’ -> LABEL_1 (0.51)
INRIA. -> LABEL_0 (0.55)
"""
NER_MODEL_NAME = "distilbert/distilbert-base-uncased"
"""
come -> LABEL_0 (0.54)
habit -> LABEL_1 (0.53)
##e -> LABEL_0 (0.52)
a aix -> LABEL_1 (0.52)
- -> LABEL_0 (0.52)
en - provence et -> LABEL_1 (0.55)
tr -> LABEL_0 (0.51)
##ava -> LABEL_1 (0.50)
##ille -> LABEL_0 (0.51)
pour -> LABEL_1 (0.54)
l -> LABEL_0 (0.52)
’ -> LABEL_1 (0.52)
inria -> LABEL_0 (0.52)
. -> LABEL_1 (0.50)
"""

tokenizer = AutoTokenizer.from_pretrained(NER_MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(NER_MODEL_NAME).to("cuda")

# Create a pipeline for NER
ner_pipeline = TokenClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    #aggregation_strategy="first" # this would work
    aggregation_strategy="simple"
)

# Run NER
text = "Côme habite à Aix-en-Provence et travaille pour l’INRIA."
entities = ner_pipeline(text)

# Print results
for entity in entities:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")
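The uncased runs above print `come` and `inria` because uncased BERT-style tokenizers lowercase the text and strip accents before looking anything up in the vocabulary. Here is a rough stdlib-only sketch of that normalization. The function name is made up for illustration, and this is only an approximation of what the tokenizer does, not its actual code.

```python
import unicodedata


def strip_accents_and_lowercase(text: str) -> str:
    """Approximate the normalization an uncased BERT-style tokenizer applies."""
    # NFD splits "ô" into "o" plus a combining circumflex.
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop the combining marks (Unicode category "Mn"), keep everything else.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


text = "Côme habite à Aix-en-Provence et travaille pour l’INRIA."
print(strip_accents_and_lowercase(text))
```

The cased checkpoints skip this step, so "ô" reaches the vocabulary lookup as-is, and an English vocabulary is unlikely to contain many whole words with that character.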

dbejean · May 19, 2025, 12:54pm

Hi,
The issue is with aggregation_strategy=“simple”. The other strategies do not return ##, but the entities they produce are not satisfactory.
The ‘elastic/distilbert-base-cased-finetuned-conll03-english’ model works fine when embedded directly in an Elasticsearch ingest pipeline.
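One possible workaround while staying on "simple" is to post-process the result and fold the ## pieces back into the preceding entity, using the `start` and `end` offsets the pipeline returns. The sketch below is only an illustration: `merge_wordpiece_entities` is a hypothetical helper, the labels and scores are copied from the output above, and the character offsets are filled in by hand rather than taken from a real run.

```python
def merge_wordpiece_entities(text, entities):
    """Fold '##' continuation pieces into the entity that precedes them.

    Hypothetical helper: `entities` is a list of dicts shaped like the
    pipeline output ('word', 'entity_group', 'score', 'start', 'end').
    """
    merged = []
    for ent in entities:
        is_continuation = ent["word"].startswith("##")
        if merged and is_continuation and merged[-1]["end"] == ent["start"]:
            prev = merged[-1]
            prev["end"] = ent["end"]
            # Recover the surface form from the original text, not from tokens.
            prev["word"] = text[prev["start"]:prev["end"]]
            # Keep the first piece's label and report the weakest score.
            prev["score"] = min(prev["score"], ent["score"])
        else:
            merged.append(dict(ent))  # copy so the input list is left untouched
    return merged


text = "Côme habite à Aix-en-Provence et travaille pour l’INRIA."
# Labels and scores as printed above; offsets are assumed, filled in by hand.
entities = [
    {"word": "C", "entity_group": "PER", "score": 0.50, "start": 0, "end": 1},
    {"word": "##ôme", "entity_group": "ORG", "score": 0.40, "start": 1, "end": 4},
    {"word": "Ai", "entity_group": "LOC", "score": 0.62, "start": 14, "end": 16},
    {"word": "##x", "entity_group": "LOC", "score": 0.97, "start": 16, "end": 17},
]
for ent in merge_wordpiece_entities(text, entities):
    print(f"{ent['word']} -> {ent['entity_group']} ({ent['score']:.2f})")
```

Keeping the label of the first piece mirrors what the "first" strategy does. This only hides the ## markers; it does not make an English model any better at French text.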
Dominique

mahmutc · May 19, 2025, 12:57pm

Why not use French NER?

John6666 · May 19, 2025, 1:01pm

It may be difficult to fix this issue on the library side.
He said to change the model.

github.com/huggingface/transformers

`aggregations_strategies` for TokenClassificationPipeline seem broken when not `simple`

opened 11:10PM - 06 Dec 23 UTC

closed 09:27PM - 14 Dec 23 UTC

antoine-lizee

### System Info

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `transformers` version: 4.35.2
- Platform: macOS-13.6.2-arm64-arm-64bit
- Python version: 3.11.6
- Huggingface_hub version: 0.19.4
- Safetensors version: 0.4.1
- Accelerate version: 0.25.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no

### Who can help?

Blame gives roughly: @luccailliau @Narsil

### Information

- [ ] The official example scripts
- [X] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)

### Reproduction

```
from pprint import pprint

tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/camembert-ner-with-dates")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/camembert-ner-with-dates")

nlp_no_agg = pipeline('ner', model=model, tokenizer=tokenizer)
nlp_simple = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
nlp_first = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="first")
nlp_avg = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="average")
nlp_max = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="max")

for example in [
    "Bonjour,je suis le docteur Brice Saintclair",
    "Je vous renvoie en Dermatologie.",
]:
    print(example)
    print("no agg")
    pprint(nlp_no_agg(example))
    print("simple")
    pprint(nlp_simple(example))
    print("first")
    pprint(nlp_first(example))
    print("avg")
    pprint(nlp_avg(example))
    print("max")
    pprint(nlp_max(example))
```

Result:

```
Bonjour,je suis le docteur Brice Saintclair
no agg
[{'end': 30, 'entity': 'I-PER', 'index': 7, 'score': 0.9949898, 'start': 26, 'word': '▁Bri'},
 {'end': 32, 'entity': 'I-PER', 'index': 8, 'score': 0.99483263, 'start': 30, 'word': 'ce'},
 {'end': 38, 'entity': 'I-PER', 'index': 9, 'score': 0.9943815, 'start': 32, 'word': '▁Saint'},
 {'end': 43, 'entity': 'I-PER', 'index': 10, 'score': 0.9938929, 'start': 38, 'word': 'clair'}]
simple
[{'end': 43, 'entity_group': 'PER', 'score': 0.9945242, 'start': 26, 'word': 'Brice Saintclair'}]
first
[{'end': 43, 'entity_group': 'PER', 'score': 0.99468565, 'start': 26, 'word': 'BriceSaintclair'}]
avg
[{'end': 43, 'entity_group': 'PER', 'score': 0.9945242, 'start': 26, 'word': 'BriceSaintclair'}]
max
[{'end': 43, 'entity_group': 'PER', 'score': 0.99468565, 'start': 26, 'word': 'BriceSaintclair'}]
Je vous renvoie en Dermatologie.
no agg
[{'end': 22, 'entity': 'I-ORG', 'index': 5, 'score': 0.46623757, 'start': 18, 'word': '▁Der'},
 {'end': 25, 'entity': 'I-ORG', 'index': 6, 'score': 0.4892864, 'start': 22, 'word': 'mat'},
 {'end': 31, 'entity': 'I-ORG', 'index': 7, 'score': 0.49201807, 'start': 25, 'word': 'ologie'}]
simple
[{'end': 31, 'entity_group': 'ORG', 'score': 0.48251402, 'start': 18, 'word': 'Dermatologie'}]
first
[{'end': 32, 'entity_group': 'ORG', 'score': 0.46623757, 'start': 18, 'word': 'Dermatologie.'}]
avg
[{'end': 32, 'entity_group': 'ORG', 'score': 0.3619019, 'start': 18, 'word': 'Dermatologie.'}]
max
[]
```

### Expected behavior

Given the non-aggregated results, it seems that there are 2 bugs:

- 1/ The space between `Brice Saintclair` is ommited when the tokens are fused by any aggregation strategy that is not "simple". I would expect the space to remain given that it's part of the tagged token.
- 2/ The period after "Dermatologie" is fused with it. It makes the whole word be classified as "0" with `max`. I would expect the period to be counted as outside the word given that it is its own token.
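To make the second point of that issue easier to follow, here is a toy re-implementation of how the word-level strategies pick one label per word, as I understand them from the pipeline documentation: "first" uses the first sub-token, "average" averages the scores over the sub-tokens before taking the best label, and "max" uses the single sub-token with the highest score. The function name and all probabilities below are invented for illustration, and the sketch assumes the period has been grouped into the same word, as the issue describes.

```python
def aggregate_word(token_scores, strategy):
    """Pick one label for a word from its sub-token label scores.

    token_scores: one {label: probability} dict per sub-token of the word.
    Toy illustration only, not the library code.
    """
    labels = list(token_scores[0])
    if strategy == "first":
        scores = token_scores[0]
    elif strategy == "average":
        n = len(token_scores)
        scores = {lab: sum(tok[lab] for tok in token_scores) / n for lab in labels}
    elif strategy == "max":
        # The sub-token holding the single highest probability wins.
        scores = max(token_scores, key=lambda tok: max(tok.values()))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    best = max(scores, key=scores.get)
    return best, scores[best]


# Invented probabilities for the sub-tokens "▁Der", "mat", "ologie" and ".".
word = [
    {"I-ORG": 0.47, "I-LOC": 0.43, "O": 0.10},
    {"I-ORG": 0.49, "I-LOC": 0.41, "O": 0.10},
    {"I-ORG": 0.49, "I-LOC": 0.41, "O": 0.10},
    {"I-ORG": 0.01, "I-LOC": 0.01, "O": 0.98},
]
for strategy in ("first", "average", "max"):
    print(strategy, aggregate_word(word, strategy))
```

With these numbers "first" and "average" still pick I-ORG, while "max" picks O because the period is the most confident sub-token. A word labelled O is not reported as an entity, which would be consistent with the empty `max` result in the issue.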

dbejean · May 19, 2025, 1:17pm

In fact, I don’t really need to use this model for English text. I just tested it and don’t understand why it works fine directly in ES but not with the Hugging Face API, so I would like to know if I did something wrong.

Anyway, as my text is in French, I use either
NER_MODEL_NAME = “Jean-Baptiste/camembert-ner”
or
NER_MODEL_NAME = ‘CATIE-AQ/NERmembert-base-4entities’

“Jean-Baptiste/camembert-ner” is the best for my use case.

Dominique

John6666 · May 19, 2025, 1:42pm

I’m not sure whether this behavior of “simple” is expected, but the ## prefixes themselves seem to come from the tokenizer.

The output of this method is a list of strings, or tokens:

[‘Using’, ‘a’, ‘transform’, ‘##er’, ‘network’, ‘is’, ‘simple’]

This tokenizer is a subword tokenizer: it splits the words until it obtains tokens that can be represented by its vocabulary. That’s the case here with transformer, which is split into two tokens: transform and ##er.

https://stackoverflow.com/questions/67026731/is-there-a-way-to-use-huggingface-pretrained-tokenizer-with-wordpiece-prefix
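As a rough illustration of where the prefix comes from, here is a minimal greedy longest-match WordPiece splitter in plain Python. The vocabulary is a made-up toy one, not the real vocabulary of any of the models above, and the function name is only for this sketch.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into WordPiece tokens by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, invented for this example.
toy_vocab = {"transform", "##er", "network", "C", "##ô", "##me"}
print(wordpiece("transformer", toy_vocab))  # ['transform', '##er']
print(wordpiece("Côme", toy_vocab))         # ['C', '##ô', '##me']
```

So any word that is not in the vocabulary as a whole gets cut into pieces, and every piece after the first is marked with ##. If the pipeline then groups those pieces into different entities, the markers show up in the output.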
