Streaming for Saving

Hi,
I am looking for a way to download a large dataset, transform it, and then upload it to another location. Note that the transformation of each instance is independent of the others.

I can load the dataset in streaming mode and start the transformation, but I cannot find a way to write to the Hugging Face Hub (in batches) while the download and transformation are still ongoing. I am wondering whether such a pattern exists.

As for the datasets library’s push_to_hub, I think you can’t upload the data until all of it is available…
If the files are written out frequently, then in the worst case you can upload them manually, one after another, using HfApi…
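
To make the "one after another" idea concrete, here is a rough sketch, not a tested recipe. It assumes your transformation job drops finished Parquet shards into a local folder, and that you call this function periodically. `upload_new_shards`, `repo_path_for`, the `data/` prefix and the injectable `upload_fn` are names I made up for illustration; the only Hub call used is `HfApi.upload_file`.

```python
from pathlib import Path


def repo_path_for(local_path, prefix="data"):
    """Map a local shard file to its destination path inside the repo."""
    return f"{prefix}/{Path(local_path).name}"


def upload_new_shards(shard_dir, repo_id, already_uploaded, upload_fn=None):
    """Upload every *.parquet file in shard_dir that was not uploaded before.

    already_uploaded is a set of file names and is updated in place.
    upload_fn(local_path, path_in_repo) is a hypothetical hook so the loop
    can be tried offline; by default it calls HfApi.upload_file.
    """
    if upload_fn is None:
        # Imported lazily; needs a logged-in token with write access.
        from huggingface_hub import HfApi

        api = HfApi()

        def upload_fn(local_path, path_in_repo):
            api.upload_file(
                path_or_fileobj=str(local_path),
                path_in_repo=path_in_repo,
                repo_id=repo_id,
                repo_type="dataset",
            )

    uploaded_now = []
    for path in sorted(Path(shard_dir).glob("*.parquet")):
        if path.name in already_uploaded:
            continue  # this shard was handled in an earlier call
        upload_fn(path, repo_path_for(path))
        already_uploaded.add(path.name)
        uploaded_now.append(path.name)
    return uploaded_now
```

One caveat: a shard that is still being written could get picked up half-finished, so it is probably safer to write each shard under a temporary name and rename it to `*.parquet` only once it is complete.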

What if I just want to transform the dataset and then save it in a streaming way? If the dataset is large, CPU memory usage keeps growing while the dataset is being transferred. Or should I transfer and save the data in parts instead of waiting until the entire dataset has been transferred?

Yeah. It is now possible to save Parquet files per shard or upload them incrementally.
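
For the "save in parts" route, something like the following might work as a starting point. It is only a sketch under assumptions: it writes the shards by hand with pyarrow rather than through any built-in datasets feature, and `batched`, `save_in_shards`, `write_parquet`, the shard naming and the default shard size are all my own choices. Only one shard's worth of rows is held in memory at a time.

```python
from itertools import islice
from pathlib import Path


def batched(iterable, size):
    """Yield lists of at most `size` items without materializing the input."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def write_parquet(rows, path):
    """Write a list of dict rows to one Parquet file (pyarrow imported lazily)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.Table.from_pylist(rows), str(path))


def save_in_shards(examples, transform, out_dir, rows_per_shard=10_000,
                   write_fn=write_parquet):
    """Transform examples lazily and write one file per rows_per_shard rows."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    transformed = (transform(example) for example in examples)  # lazy generator
    paths = []
    for index, rows in enumerate(batched(transformed, rows_per_shard)):
        path = out_dir / f"shard-{index:05d}.parquet"
        write_fn(rows, path)  # rows can be garbage-collected after this call
        paths.append(path)
    return paths


# Possible usage with a streamed dataset (repo name is a placeholder):
#   from datasets import load_dataset
#   stream = load_dataset("user/source-dataset", split="train", streaming=True)
#   save_in_shards(stream, my_transform, "out_shards")
```

Because each transformation is independent, nothing needs to be kept around after a shard is written, which is what keeps memory roughly flat.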