Datasets.load_datasets fails

John6666 · October 4, 2024, 11:47pm

Character encoding errors are still happening in 2024…
Apparently, in some cases you can avoid them by explicitly specifying the encoding at load time.
If this does not work, there may be another cause.

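import datasets

# ds_root is your cache directory, defined elsewhere
dataset = datasets.load_dataset(
    "jxu124/OpenX-Embodiment",
    "berkeley_gnm_cory_hall",
    streaming=False,
    split="train",
    cache_dir=ds_root,
    trust_remote_code=True,
    encoding="utf-16",  # explicitly specify the encoding
)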

or

dataset = datasets.load_dataset(
    "jxu124/OpenX-Embodiment",
    "berkeley_gnm_cory_hall",
    streaming=False,
    split="train",
    cache_dir=ds_root,
    trust_remote_code=True,
    encoding="utf-8",
)
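
If you are not sure which of the two to try, one rough way to decide is to look at the first few bytes of one of the raw data files. This is only a sketch using the standard library: the function name guess_encoding_from_bom is my own (it is not part of datasets), and it only recognizes files that start with a byte order mark. A "can't decode byte 0xff in position 0" error is typical of a UTF-16 BOM being read as UTF-8. Files without a BOM fall back to "utf-8", which may still be wrong.

```python
import codecs


def guess_encoding_from_bom(data: bytes) -> str:
    """Guess a text encoding from a leading byte order mark (BOM).

    Hypothetical helper for illustration; falls back to "utf-8"
    when no BOM is present, which is only a guess.
    """
    # UTF-32 must be checked before UTF-16: the UTF-32 LE BOM
    # (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return "utf-8"


def guess_file_encoding(path: str) -> str:
    """Read the first four bytes of a file and guess its encoding."""
    with open(path, "rb") as f:
        return guess_encoding_from_bom(f.read(4))
```

You could then call guess_file_encoding on a file from your cache directory and pass the result as the encoding argument. Whether that argument is honored depends on how the dataset is loaded, so treat this as a diagnostic rather than a guaranteed fix.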

