Hi everyone,
I’ve been quietly building a Python failure dataset for DPO / RLHF
training over the past couple of weeks, running 24/7 on a single
RTX 4060.
The basic idea: an autopilot pipeline generates Python code attempts
for various CS domains (FFT, Monte Carlo, ZKP, etc.), runs each in a
sandboxed pytest container, and keeps the genuine failures with
error logs as rejected-side training data.
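For anyone who wants a concrete picture, the run-and-keep-failures step could be sketched roughly as below. This is a minimal illustration, not the actual pipeline: `run_attempt`, the file names `solution.py` / `test_solution.py`, and the row fields `prompt` / `rejected` / `error_log` are all hypothetical, and a temp directory plus subprocess stands in for the real sandboxed container (it is not a security boundary).

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_attempt(prompt, code, test_code, runner=("-m", "pytest", "-q"), timeout=30):
    """Run one generated attempt against its tests.

    Returns a rejected-side row (dict) if the run fails, or None if it passes.
    All names and the row schema here are hypothetical.
    """
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        # Write the generated code and its tests side by side so the
        # test file can simply `import solution`.
        (tmp_path / "solution.py").write_text(code)
        (tmp_path / "test_solution.py").write_text(test_code)
        try:
            proc = subprocess.run(
                [sys.executable, *runner, "test_solution.py"],
                cwd=tmp_path,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # Hangs are recorded as failures too, with a synthetic log line.
            return {
                "prompt": prompt,
                "rejected": code,
                "error_log": f"timeout after {timeout}s",
            }

    if proc.returncode == 0:
        return None  # attempt passed, so it is not a failure example

    # Any nonzero exit is kept here; a real pipeline would probably filter
    # out harness problems (e.g. no tests collected) before keeping the row.
    log = proc.stdout + proc.stderr
    return {"prompt": prompt, "rejected": code, "error_log": log[-4000:]}
```

Calling `run_attempt(prompt, generated_code, test_code)` on each attempt and appending the non-`None` results to a JSONL file would give rows of the kind described above. Passing `runner=()` runs the test file with the plain interpreter instead of pytest, which is handy for a quick local check.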
Quick stats:
- ~2K failure rows shipped (v1, v2)
- 19 CS domains covered
- 146 downloads since launch
Two questions for DPO / RLHF practitioners here:
1. How are you currently sourcing negative examples for DPO?
Do you have your own pipeline, or rely on synthetic data from larger
models? Curious about the trade-offs you’ve found.
2. What domains do you most need failure data for?
I can re-prioritize the autopilot’s domains within a few days, so
concrete requests directly shape what gets generated next.
Free sample (100 rows):
Even one-line replies help calibrate the next release.
-- namakoo