Hi everyone! I’ve been building DinoDS, a modular dataset system for LLM training built around lane-based dataset bundles.
The idea is simple: instead of treating training data as one giant premade dump, I’m organizing it into capability-focused bundles that map to specific assistant behaviors and failure types, such as:
- retrieval grounding
- workflow / tool routing
- memory and continuity
- structured outputs
- identity and behavior shaping
I’ve started publishing some of these dataset bundle previews on Hugging Face, and I also made a Space that helps people explore which dataset bundle might actually be useful for their use case.
So the current flow is:
- explore the DinoDS concept
- identify what kind of assistant behavior you want to improve
- see which bundle / lane family fits
- check out the related dataset previews
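To make the "identify a behavior, then find the lane" step concrete, here is a minimal sketch of the kind of lookup it implies. This is not how the Space is implemented: the failure labels and the `suggest_lane` helper are hypothetical and invented for illustration, and only the lane names come from the list earlier in this post.

```python
from typing import Optional

# Hypothetical mapping from an observed failure label to a lane family.
# Lane names mirror the list in this post; the failure labels are made up.
LANES_BY_FAILURE = {
    "hallucinated_citation": "retrieval grounding",
    "wrong_tool_called": "workflow / tool routing",
    "forgot_earlier_context": "memory and continuity",
    "invalid_json": "structured outputs",
    "off_persona_reply": "identity and behavior shaping",
}


def suggest_lane(failure: str) -> Optional[str]:
    """Return the lane family for a failure label, or None if it is unmapped."""
    # Normalize whitespace and case so user-typed labels still match.
    return LANES_BY_FAILURE.get(failure.strip().lower())


if __name__ == "__main__":
    print(suggest_lane("invalid_json"))
```

A real mapping would probably be many-to-many (one failure can point at several lanes), so treat the one-to-one dictionary here as the simplest possible starting point.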
I’d really love feedback from the HF community on a few things:
- Does this bundle-first / lane-based way of presenting datasets make sense?
- Is the Space + dataset bundle flow intuitive?
- What would make these previews more useful for people evaluating training data?
- Would you rather explore by failure type, capability, or use case?
You can check out the bundles, the Space, and the website here:
- Hugging Face Space: Dinodataset Failure Mapper
- Dataset bundles: DinoDS (DinoDS Labs)
- Website: www.dinodsai.com
Would love thoughts, criticism, and suggestions, especially from people building assistants, copilots, routing systems, or structured-output workflows.