Character code errors still occur in 2024…
Apparently some of them can be avoided by explicitly specifying the encoding at load time.
If this does not work, there may be another cause.
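One way to narrow down that other cause is to check where the raw bytes stop being valid UTF-8. The sketch below is only an illustration: `first_invalid_utf8` is a made-up helper name, not part of datasets, and it relies on nothing beyond the `start` and `end` attributes of Python's built-in `UnicodeDecodeError`.

```python
def first_invalid_utf8(raw: bytes):
    """Return (offset, offending_bytes) for the first invalid UTF-8 sequence.

    Returns None when the whole byte string decodes cleanly as UTF-8.
    Hypothetical helper for debugging, not a datasets API.
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start / err.end delimit the bytes the decoder rejected.
        return err.start, raw[err.start:err.end]
    return None


# "café" saved as Latin-1 ends in a lone 0xE9 byte, which is not valid UTF-8.
latin1_result = first_invalid_utf8("café".encode("latin-1"))
utf8_result = first_invalid_utf8("café".encode("utf-8"))
```

Running it over `open(path, "rb").read()` for a suspect file should show whether the problem is the file's encoding or something else entirely.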
I've moved on from that project at this point, so unfortunately I can't give a stack trace. But I will say that I think it was more a data problem than a datasets problem. I still see the error with some of the data I'm using now, but I've started including chardet in my data pipeline, which seems to fix it (though it's a bit pokey).
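A chardet step like that might look roughly like the sketch below. This is a guess at the approach, not the poster's actual pipeline: `decode_guessing` is a hypothetical helper name, and the only chardet call used is `chardet.detect`, which takes bytes and returns a dict with an `"encoding"` key. The import is wrapped so the sketch still runs (falling back to UTF-8) when chardet is not installed.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None  # sketch still runs, but always uses the fallback


def decode_guessing(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode bytes using chardet's best guess (hypothetical helper).

    chardet.detect() may report None as the encoding when it cannot
    decide; in that case, or without chardet, `fallback` is used.
    """
    guess = chardet.detect(raw)["encoding"] if chardet else None
    # errors="replace" keeps the pipeline moving on undecodable bytes.
    return raw.decode(guess or fallback, errors="replace")


text = decode_guessing(b"plain ascii text")
```

Detection has to scan the bytes, which is probably where the slowness comes from; guessing from only the first few kilobytes of each file is one possible trade-off.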
Thank you very much!!! Problem solved! Your answer really helps me a lot!
Working on a translator, hoping to do fine-tuning with a utf-16 dataset so I can get all the French accents etc.
Datasets load_dataset() doesn't seem to like non-utf-8
Is there a way to specify or does it HAVE to be utf-8?
If it has to be utf-8, any suggestions for special characters?
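On the special-characters point: UTF-8 can encode every Unicode character, French accents included, so if the loader insists on UTF-8, one option is to re-encode the files first. A minimal sketch using only the standard library, assuming the source files really are UTF-16; `transcode_to_utf8` and the file names are made up for illustration.

```python
import tempfile
from pathlib import Path


def transcode_to_utf8(src, dst, src_encoding="utf-16"):
    """Rewrite a text file as UTF-8 and return the destination path."""
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return Path(dst)


# Round trip on a throwaway file to check that the accents survive.
sample = "Où est la bibliothèque ? Ça dépend, près du café."
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "fr_utf16.txt"
    src.write_text(sample, encoding="utf-16")
    dst = transcode_to_utf8(src, Path(tmp) / "fr_utf8.txt")
    round_trip = dst.read_text(encoding="utf-8")
```

The converted files can then be loaded without any encoding argument.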
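import datasets

# ds_root is the local cache directory, defined elsewhere.
dataset = datasets.load_dataset(
    "jxu124/OpenX-Embodiment",
    "berkeley_gnm_cory_hall",
    streaming=False,
    split="train",
    cache_dir=ds_root,
    trust_remote_code=True,
    encoding="utf-16",  # extra keyword arguments are forwarded to the builder config
)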
or
dataset = datasets.load_dataset(
    "jxu124/OpenX-Embodiment",
    "berkeley_gnm_cory_hall",
    streaming=False,
    split="train",
    cache_dir=ds_root,
    trust_remote_code=True,
    encoding="utf-8",
)