Streaming for Saving

sinamoeini · January 26, 2025, 10:30am

Hi,
I am looking for a way to download a large dataset, transform it, and then upload it to another location. Note that the transformation of each instance is independent of the others.

I can load the dataset in streaming mode and start the transformation, but I cannot find a way to write to the Hugging Face Hub (in batches) while the download and transformation are still ongoing. I am wondering whether such a pattern exists.

John6666 · January 26, 2025, 11:47am

In the case of the datasets library’s push_to_hub, I think you can’t upload the data unless all of it is available…
If the files are written out frequently, then in the worst case you can upload them manually, one after another, using HfApi…
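A minimal sketch of that manual approach, assuming the transformation step drops finished shard files into a local directory. The helper names (upload_new_files, make_hub_uploader), the data/ prefix in the repo, and the .parquet suffix are illustrative choices, not anything from the thread; HfApi.upload_file is the huggingface_hub call used for the actual upload.

```python
from pathlib import Path
from typing import Callable, List, Set


def upload_new_files(out_dir: str, upload: Callable[[Path], None], already_done: Set[str]) -> List[str]:
    """Upload every shard file in out_dir that has not been uploaded yet.

    `already_done` is updated in place, so the function can be called
    repeatedly while new shards keep appearing.
    """
    uploaded = []
    for path in sorted(Path(out_dir).glob("*.parquet")):
        if path.name in already_done:
            continue
        upload(path)
        already_done.add(path.name)
        uploaded.append(path.name)
    return uploaded


def make_hub_uploader(repo_id: str, repo_type: str = "dataset") -> Callable[[Path], None]:
    """Build an uploader that pushes one file per call with HfApi.upload_file."""
    from huggingface_hub import HfApi  # imported lazily; needs a logged-in token

    api = HfApi()

    def upload(path: Path) -> None:
        api.upload_file(
            path_or_fileobj=str(path),
            path_in_repo=f"data/{path.name}",  # assumed layout inside the repo
            repo_id=repo_id,
            repo_type=repo_type,
        )

    return upload
```

Calling upload_new_files(out_dir, make_hub_uploader("user/target-dataset"), done) after each shard is written (the repo id here is a placeholder) would push only the files that have not gone up yet. Each call creates its own commit, so larger shards mean fewer commits.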

Zoe0427 · September 12, 2025, 6:56pm

What if I just want to transform the dataset and then save it in a streaming way? If the dataset is large, CPU memory usage keeps growing while the dataset is being transferred. Or should I transfer and save the data in parts instead of waiting until the entire dataset has been transferred?

John6666 · September 12, 2025, 11:37pm

Yeah. It is now possible to save parquet files per shard or upload them incrementally.
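One way to keep memory bounded by hand is to cut the stream into fixed-size shards and write each one to its own parquet file as soon as it is full. The sketch below is only an illustration of that idea, not necessarily the built-in feature referred to above; iter_shards, write_parquet, and the shard file naming are made-up names, and write_parquet assumes pyarrow is installed.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple


def iter_shards(
    examples: Iterable[dict],
    transform: Callable[[dict], dict],
    shard_size: int,
) -> Iterator[Tuple[str, List[dict]]]:
    """Yield (file_name, transformed_rows) pairs, holding one shard in memory at a time."""
    it = iter(examples)
    index = 0
    while True:
        rows = [transform(ex) for ex in islice(it, shard_size)]
        if not rows:
            return
        yield f"shard-{index:05d}.parquet", rows
        index += 1


def write_parquet(rows: List[dict], path: str) -> None:
    """Write one shard to disk as a parquet file."""
    import pyarrow as pa  # imported lazily so iter_shards works without pyarrow
    import pyarrow.parquet as pq

    pq.write_table(pa.Table.from_pylist(rows), path)


# Usage sketch (the repo id and my_transform are placeholders):
# from datasets import load_dataset
# stream = load_dataset("user/source-dataset", split="train", streaming=True)
# for name, rows in iter_shards(stream, my_transform, 10_000):
#     write_parquet(rows, name)
```

Because each instance is transformed independently, a shard can be written (and uploaded) the moment it is complete, and its rows can then be discarded before the next shard is read.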
