Made a Python failure dataset for DPO/RLHF — how do you source negative examples?

Hi everyone,

I’ve been quietly building a Python failure dataset for DPO / RLHF
training over the past couple of weeks, with the pipeline running
24/7 on a single RTX 4060.

The basic idea: an autopilot pipeline generates Python code attempts
across various CS domains (FFT, Monte Carlo, ZKP, etc.), runs each one
in a sandboxed pytest container, and keeps the genuine failures,
together with their error logs, as rejected-side training data.
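
To make the loop concrete, here is a minimal sketch of the "run, then keep the failures" step. It is not the actual pipeline: the function names, the row fields (`prompt`, `domain`, `rejected`, `error_log`), and the timeout are hypothetical, and a plain subprocess in a temp directory stands in for the sandboxed container, so it provides no real isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Assumed default runner: pytest invoked through the current interpreter.
PYTEST_CMD = [sys.executable, "-m", "pytest", "-q", "--tb=short"]


def run_attempt(code, test_code, cmd=PYTEST_CMD, timeout=30):
    """Write an attempt and its tests to a temp dir, run them, return (passed, log)."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "attempt.py").write_text(code)
        Path(tmp, "test_attempt.py").write_text(test_code)
        try:
            proc = subprocess.run(
                [*cmd, "test_attempt.py"],
                cwd=tmp,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # A hang counts as a failure; the log records why.
            return False, f"timeout after {timeout}s"
        return proc.returncode == 0, proc.stdout + proc.stderr


def to_rejected_row(prompt, domain, code, log):
    """Package a failed attempt as a rejected-side record (hypothetical schema)."""
    return {"prompt": prompt, "domain": domain, "rejected": code, "error_log": log}


def collect_failures(attempts):
    """Keep only attempts whose tests fail; passing attempts are dropped."""
    rows = []
    for a in attempts:
        passed, log = run_attempt(a["code"], a["tests"], a.get("cmd", PYTEST_CMD))
        if not passed:
            rows.append(to_rejected_row(a["prompt"], a["domain"], a["code"], log))
    return rows
```

Each item passed to `collect_failures` is assumed to be a dict with `prompt`, `domain`, `code`, and `tests` keys. The captured log is stored verbatim so the failure mode can be inspected or filtered on later.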

Quick stats:

  • ~2K failure rows shipped across v1 and v2
  • 19 CS domains covered
  • 146 downloads since launch

Two questions for DPO / RLHF practitioners here:

1. How are you currently sourcing negative examples for DPO?
Do you have your own pipeline, or rely on synthetic data from larger
models? Curious about the trade-offs you’ve found.

2. What domains do you most need failure data for?
I can change the autopilot’s domain priorities within a few days, so
concrete requests directly shape what gets generated next.

Free sample (100 rows):

Even one-line replies help calibrate the next release.

-- namakoo