TokenClassificationPipeline produces entities with "##" characters

Hi,

When I use TokenClassificationPipeline with some models I get entities with ## characters. For instance with the “elastic/distilbert-base-cased-finetuned-conll03-english” model.

It looks like a tokenization issue that may be related to accented characters.
Any ideas on what causes this issue and how to fix it?

Here is my code and sample output.

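from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline

NER_MODEL_NAME = ...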

tokenizer = AutoTokenizer.from_pretrained(NER_MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(NER_MODEL_NAME)

# Create a pipeline for NER
ner_pipeline = TokenClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

# Run NER
text = "Côme habite à Aix-en-Provence et travaille pour l’INRIA."
entities = ner_pipeline(text)

# Print results
for entity in entities:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")

And here is the output with two models:

With NER_MODEL_NAME = "CATIE-AQ/NERmembert-base-4entities"
Entities are correct:

    Côme -> PER (0.67)
    Aix-en-Provence -> LOC (1.00)
    INRIA -> ORG (1.00)

With NER_MODEL_NAME = "elastic/distilbert-base-cased-finetuned-conll03-english"
Entities are not correct:

    C -> PER (0.50)
    ##ôme -> ORG (0.40)
    Ai -> LOC (0.62)
    ##x -> LOC (0.97)
    - -> LOC (0.64)
    en -> LOC (0.91)
    - -> LOC (0.81)
    Provence -> LOC (0.87)
    et travaille pour l ’ INRIA -> ORG (0.89)

Regards.

Dominique

It seems you’ve unearthed an ancient bug. A living fossil of a bug.

from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline

NER_MODEL_NAME = "elastic/distilbert-base-cased-finetuned-conll03-english"
"""
C -> PER (0.50)
##ôme -> ORG (0.40)
Ai -> LOC (0.62)
##x -> LOC (0.97)
- -> LOC (0.64)
en -> LOC (0.91)
- -> LOC (0.81)
Provence -> LOC (0.87)
et travaille pour l ’ INRIA -> ORG (0.89)
"""
NER_MODEL_NAME = "CATIE-AQ/NERmembert-base-4entities"
"""
Côme -> PER (0.67)
Aix-en-Provence -> LOC (1.00)
INRIA -> ORG (1.00)
"""
NER_MODEL_NAME = "elastic/distilbert-base-uncased-finetuned-conll03-english"
# aix - en - provence et travaille pour l ’ inria -> ORG (0.87)
NER_MODEL_NAME = "distilbert/distilbert-base-multilingual-cased"
"""
Côme -> LABEL_0 (0.57)
habite -> LABEL_1 (0.53)
à -> LABEL_0 (0.53)
Aix - en - -> LABEL_1 (0.52)
Provence et travaille pour l -> LABEL_0 (0.55)
’ -> LABEL_1 (0.51)
INRIA. -> LABEL_0 (0.55)
"""
NER_MODEL_NAME = "distilbert/distilbert-base-uncased"
"""
come -> LABEL_0 (0.54)
habit -> LABEL_1 (0.53)
##e -> LABEL_0 (0.52)
a aix -> LABEL_1 (0.52)
- -> LABEL_0 (0.52)
en - provence et -> LABEL_1 (0.55)
tr -> LABEL_0 (0.51)
##ava -> LABEL_1 (0.50)
##ille -> LABEL_0 (0.51)
pour -> LABEL_1 (0.54)
l -> LABEL_0 (0.52)
’ -> LABEL_1 (0.52)
inria -> LABEL_0 (0.52)
. -> LABEL_1 (0.50)
"""

tokenizer = AutoTokenizer.from_pretrained(NER_MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(NER_MODEL_NAME).to("cuda")

# Create a pipeline for NER
ner_pipeline = TokenClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    #aggregation_strategy="first" # this would work
    aggregation_strategy="simple"
)

# Run NER
text = "Côme habite à Aix-en-Provence et travaille pour l’INRIA."
entities = ner_pipeline(text)

# Print results
for entity in entities:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")
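
To illustrate why "first" would work where "simple" does not, here is a rough sketch of the two strategies on toy data. It is a simplification, not the library's actual implementation: it ignores B-/I- prefixes and scores, treats a "##" prefix as the only sign of a word continuation, and uses made-up labels. The names join_pieces, aggregate_simple and aggregate_first are hypothetical.

```python
# Toy per-token predictions, loosely modelled on the output above.
# The labels are illustrative, not real model output.
tokens = [
    ("C", "PER"),
    ("##ôme", "ORG"),
    ("habite", "O"),
    ("Provence", "LOC"),
]


def join_pieces(pieces):
    """Join WordPiece tokens; '##' is only stripped when a piece follows another one."""
    text = pieces[0]
    for piece in pieces[1:]:
        text += piece[2:] if piece.startswith("##") else " " + piece
    return text


def aggregate_simple(tokens):
    """Group adjacent tokens that share a label, ignoring word boundaries."""
    groups = []
    for piece, label in tokens:
        if groups and groups[-1][1] == label:
            groups[-1][0].append(piece)
        else:
            groups.append(([piece], label))
    return [(join_pieces(p), l) for p, l in groups if l != "O"]


def aggregate_first(tokens):
    """Rebuild whole words first, then give each word the label of its first piece."""
    words = []
    for piece, label in tokens:
        if piece.startswith("##") and words:
            words[-1][0].append(piece)
        else:
            words.append(([piece], label))
    return aggregate_simple([(join_pieces(p), l) for p, l in words])


print(aggregate_simple(tokens))  # the word "Côme" is cut in two, "##" survives
print(aggregate_first(tokens))   # the word stays whole and keeps the first label
```

In the sketch, the "##" only survives when a continuation piece ends up at the start of its own group, which is what happens when two pieces of the same word get different labels.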

Hi,
The issue is with aggregation_strategy="simple". The other strategies do not return ##, but the entities they produce are not satisfactory.
The "elastic/distilbert-base-cased-finetuned-conll03-english" model works fine when embedded directly in an Elasticsearch ingest pipeline.
Dominique

Why not use French NER?

It may be difficult to fix this issue on the library side.
The suggestion was to change the model.
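
If switching models is not an option, the split pieces can also be glued back together on the user side after the pipeline has run. This is only a minimal sketch, assuming the pipeline returns dicts with word, entity_group, score, start and end keys. merge_subword_entities is a hypothetical helper, not something transformers provides, and keeping the first label and the lowest score is an arbitrary choice.

```python
def merge_subword_entities(entities):
    """Fold any entity whose word starts with '##' into the entity before it.

    Hypothetical post-processing helper, not part of transformers.
    The merged entity keeps the label of its first piece and the lowest score.
    """
    merged = []
    for entity in entities:
        if entity["word"].startswith("##") and merged:
            previous = merged[-1]
            previous["word"] += entity["word"][2:]
            previous["end"] = entity["end"]
            previous["score"] = min(previous["score"], entity["score"])
        else:
            merged.append(dict(entity))  # copy so the input is left untouched
    return merged


# Hand-written sample in the shape the pipeline returns.
# Words and scores are taken from the output above; offsets were computed by hand.
sample = [
    {"entity_group": "PER", "score": 0.50, "word": "C", "start": 0, "end": 1},
    {"entity_group": "ORG", "score": 0.40, "word": "##ôme", "start": 1, "end": 4},
    {"entity_group": "LOC", "score": 0.87, "word": "Provence", "start": 21, "end": 29},
]
for entity in merge_subword_entities(sample):
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")
```

This only hides the symptom: the model still disagreed with itself about the label inside the word, so the merged entity should be treated with caution.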

In fact, I don’t really need to use this model for English text. I just tested it and don’t understand why it works fine directly in ES but doesn’t work with the Hugging Face API. So I would like to know if I did something wrong.

Anyway, as my text is in French, I use either
NER_MODEL_NAME = "Jean-Baptiste/camembert-ner"
or
NER_MODEL_NAME = "CATIE-AQ/NERmembert-base-4entities"

"Jean-Baptiste/camembert-ner" is the best for my use case.

Dominique

I’m not sure whether this “simple” behavior is as intended, but the tokenizer seems to be what adds the prefixes.

The output of this method is a list of strings, or tokens:

['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']

This tokenizer is a subword tokenizer: it splits the words until it obtains tokens that can be represented by its vocabulary. That’s the case here with transformer, which is split into two tokens: transform and ##er.
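
The splitting described in the quote can be sketched with a toy greedy longest-match tokenizer. This is an illustration under assumptions, not the real tokenizer: the vocabulary is made up, wordpiece is a hypothetical function name, and real tokenizers also handle normalization and punctuation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Tiny made-up vocabulary, only for illustration.
toy_vocab = {"Using", "a", "transform", "##er", "network", "is", "simple"}
sentence = "Using a transformer network is simple"
tokens = [p for w in sentence.split() for p in wordpiece(w, toy_vocab)]
print(tokens)
# ['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']
```

The "##" prefix is therefore part of the token string itself, and whatever turns tokens back into text has to remove it.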

https://stackoverflow.com/questions/67026731/is-there-a-way-to-use-huggingface-pretrained-tokenizer-with-wordpiece-prefix