Hmm…
Building a Vietnamese Legal QA Dataset for Evaluating RAG Systems
I think your current plan is directionally correct, but the main weakness is where the QA generation starts.
Right now, your pipeline is roughly:
one legal article
→ strict prompt
→ low-temperature LLM generation
→ question-answer pair
→ local validator
That is acceptable for easy factual questions, because the answer is supposed to be inside one legal text.
But it is structurally weak for medium and hard legal questions, because a legally complete answer often needs more than the seed article. It may need:
- a definition from another article;
- a special rule for a special class of people;
- an exception;
- an implementing decree or circular;
- a procedure article;
- an authority/competence article;
- an amendment or effective-date rule;
- a higher-level law that controls a lower-level document.
So the model may produce an answer that is:
- faithful to the given article;
- locally correct;
- fluent;
- stable under low temperature;
- accepted by a validator trained on that law;
but still legally incomplete because the full answer requires another law or article.
For RAG evaluation, that is the key issue. A RAG benchmark should not only test whether the generator can write a nice answer. It should test whether the system can:
retrieve all legally required provisions
→ use them correctly
→ cite them correctly
→ produce a legally complete answer
This is why I would redesign your dataset around evidence packets, not isolated articles.
1. Core recommendation
Do not generate medium/hard QA directly from a single article.
Instead:
seed article
→ retrieve related legal provisions
→ classify legal relations
→ build an evidence packet
→ generate the question
→ generate the gold answer
→ validate citation coverage
→ validate legal completeness
In short:
Generate the evidence first, then generate the QA.
An evidence packet is the complete set of legal provisions required to answer a question correctly.
For example:
{
"seed_provision": "<main_article_or_clause>",
"required_provisions": [
{
"citation": "<supporting_law_article_or_clause>",
"role": "special_subject_rule",
"why_required": "The general rule changes for this type of subject."
},
{
"citation": "<definition_article_or_clause>",
"role": "definition",
"why_required": "This provision defines a term used in the question."
}
],
"hard_negatives": [
{
"citation": "<similar_but_not_controlling_article>",
"reason": "Topically similar, but not legally required."
}
]
}
This is the main fix.
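The packet shape above can be mirrored in code so that malformed packets are caught before any QA is generated. The following is a minimal sketch, not a fixed design: the class names, the `problems()` helper, and the `law_A/...` identifiers are hypothetical and only illustrate the three fields shown in the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RequiredProvision:
    citation: str      # article or clause identifier
    role: str          # e.g. "definition", "special_subject_rule"
    why_required: str  # short justification kept for reviewers


@dataclass(frozen=True)
class HardNegative:
    citation: str
    reason: str


@dataclass
class EvidencePacket:
    seed_provision: str
    required_provisions: list[RequiredProvision] = field(default_factory=list)
    hard_negatives: list[HardNegative] = field(default_factory=list)

    def all_required_citations(self) -> set[str]:
        """Seed plus every supporting provision the gold answer relies on."""
        return {self.seed_provision} | {p.citation for p in self.required_provisions}

    def problems(self) -> list[str]:
        """Return structural problems; an empty list means the packet looks sane."""
        issues = []
        required = self.all_required_citations()
        for negative in self.hard_negatives:
            if negative.citation in required:
                issues.append(f"{negative.citation} is both required and a hard negative")
        for provision in self.required_provisions:
            if not provision.why_required.strip():
                issues.append(f"{provision.citation} has no why_required justification")
        return issues


# Hypothetical identifiers, used only for illustration.
packet = EvidencePacket(
    seed_provision="law_A/article_10",
    required_provisions=[
        RequiredProvision("law_A/article_3", "definition", "Defines a term used in the question."),
    ],
    hard_negatives=[HardNegative("law_A/article_11", "Topically similar, but not legally required.")],
)
print(packet.problems())
```

Keeping `why_required` mandatory is a deliberate choice in this sketch: it forces whoever builds the packet to justify legal necessity rather than topical similarity.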
2. Why this matters for RAG evaluation
A normal QA dataset can often be built like this:
paragraph → question → answer
Legal QA is different.
Legal reasoning is often distributed across multiple legal units:
general rule
+ definition
+ exception
+ special rule
+ procedure
+ authority
+ effective date
= complete legal answer
This is why legal RAG benchmarks should evaluate retrieval completeness, not only final answer similarity.
Useful references:
- LegalBench shows that legal reasoning is not one generic ability; it includes multiple forms of legal reasoning.
- LegalBench-RAG focuses specifically on the retrieval step in legal RAG and emphasizes precise legal text retrieval.
- VLQA is highly relevant because it is a Vietnamese legal QA dataset with expert verification and statutory references.
- ViLQA / ViBidLQA is also relevant because it uses LLM-generated Vietnamese legal QA corrected by domain experts.
- ARES is useful for thinking about lightweight RAG judges trained on synthetic data plus limited human labels.
- RAGAS is useful for RAG evaluation and synthetic testset generation ideas.
- BEIR is useful because it shows BM25 is still a strong retrieval baseline, while reranking/late-interaction methods are often stronger but more expensive.
- HotpotQA is useful as a multi-hop QA reference because it stores supporting facts, not only final answers.
- FEVER is useful because it includes evidence-based labels like Supported / Refuted / NotEnoughInfo.
- RAGBench is useful because it emphasizes explainable/actionable RAG evaluation labels.
3. Refine your taxonomy
Your current taxonomy is good:
Difficulty
- easy: basic comprehension; answer is in the given legal text.
- medium: requires some legal knowledge/reasoning; may need multiple articles/laws.
- hard: requires deep understanding, exceptions, case analysis, unusual situations, or expert grading.
Question type
- factual: direct fact from text.
- interpretation: meaning/purpose of a law or legal phrase.
- analytical: comparison, relationship, connection, difference.
- application: scenario/case/facts applied to law.
Your mapping is mostly right:
factual → easy
interpretation → medium
analytical → medium or hard
application → hard
But I would add one more axis:
Evidence scope
Because difficulty and evidence scope are not the same.
A question can be easy because it needs one clause. A question can be medium because it needs two related articles. A question can be hard because it needs multiple laws, an exception, and a fact pattern.
Use this expanded schema:
{
"difficulty": "easy | medium | hard",
"question_type": "factual | interpretation | analytical | application",
"evidence_scope": "single_clause | single_article | multi_article_same_law | multi_law | exception_based | temporal | insufficient_facts",
"legal_operation": [
"definition",
"rule_extraction",
"comparison",
"condition_check",
"exception",
"special_subject_rule",
"procedure",
"authority",
"sanction",
"hierarchy",
"amendment",
"case_application"
],
"answerability": "answerable | partially_answerable | insufficient_facts | not_in_corpus | version_unclear"
}
This makes the dataset much more useful for diagnosing RAG failures.
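A small label check helps keep the expanded schema consistent across thousands of items. This is a sketch under assumptions: the allowed values are copied from the schema above, the function name `taxonomy_errors` is hypothetical, and treating `answerability` and `legal_operation` as optional is my choice, not a rule from the plan.

```python
# Allowed values copied from the expanded schema; extend the sets if labels are added.
REQUIRED_FIELDS = {
    "difficulty": {"easy", "medium", "hard"},
    "question_type": {"factual", "interpretation", "analytical", "application"},
    "evidence_scope": {
        "single_clause", "single_article", "multi_article_same_law", "multi_law",
        "exception_based", "temporal", "insufficient_facts",
    },
}
ANSWERABILITY = {
    "answerable", "partially_answerable", "insufficient_facts",
    "not_in_corpus", "version_unclear",
}
LEGAL_OPERATIONS = {
    "definition", "rule_extraction", "comparison", "condition_check", "exception",
    "special_subject_rule", "procedure", "authority", "sanction", "hierarchy",
    "amendment", "case_application",
}


def taxonomy_errors(item: dict) -> list[str]:
    """Return label problems for one item; an empty list means the labels are valid."""
    errors = []
    for key, allowed in REQUIRED_FIELDS.items():
        if item.get(key) not in allowed:
            errors.append(f"{key}: {item.get(key)!r} is not an allowed value")
    # Assumption: answerability and legal_operation are validated only when present.
    if "answerability" in item and item["answerability"] not in ANSWERABILITY:
        errors.append(f"answerability: {item['answerability']!r} is not an allowed value")
    for operation in item.get("legal_operation", []):
        if operation not in LEGAL_OPERATIONS:
            errors.append(f"legal_operation: {operation!r} is not an allowed value")
    return errors


example = {
    "difficulty": "hard",
    "question_type": "application",
    "evidence_scope": "exception_based",
    "legal_operation": ["case_application", "exception"],
    "answerability": "partially_answerable",
}
print(taxonomy_errors(example))
```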
4. Difficulty design
Easy
Purpose: test basic retrieval and comprehension.
Easy questions should be:
- factual only;
- answerable from one clause/article;
- direct;
- objectively gradable;
- citation-simple.
Example patterns:
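Theo Điều <article>, cơ quan nào có thẩm quyền <action>? (Under Article <article>, which authority is competent to <action>?)
Theo khoản <clause>, điều kiện để <do_something> gồm những gì? (Under clause <clause>, what are the conditions for <do_something>?)
Luật quy định thời hạn <process> là bao lâu? (What time limit does the law set for <process>?)
Hành vi nào bị cấm theo Điều <article>? (Which acts are prohibited under Article <article>?)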
Recommended schema:
{
"difficulty": "easy",
"question_type": "factual",
"evidence_scope": "single_clause",
"required_citations": [
{
"role": "answer_source",
"citation": "<law_article_clause>"
}
]
}
Reject an easy question if:
- it needs another law;
- it asks “why”;
- it requires interpretation;
- it requires exceptions;
- it depends on external legal knowledge.
Medium
Purpose: test whether the RAG system can connect legal provisions.
Medium does not mean “longer answer.” Medium means the answer requires at least two legal points.
Typical evidence scopes:
multi_article_same_law
multi_law
Typical legal operations:
definition
procedure
special_subject_rule
Medium question types:
interpretation
analytical
Good medium patterns:
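Cụm từ <legal_phrase> trong quy định này nên được hiểu như thế nào? (How should the phrase <legal_phrase> in this provision be understood?)
Mục đích của quy định <rule> là gì? (What is the purpose of the rule <rule>?)
Vì sao luật quy định riêng trường hợp <special_subject>? (Why does the law provide separately for <special_subject>?)
Phân biệt <concept_A> và <concept_B> theo quy định pháp luật. (Distinguish <concept_A> from <concept_B> under the law.)
Điều kiện áp dụng <rule> gồm những yếu tố nào, và căn cứ pháp lý nằm ở đâu? (What elements make up the conditions for applying <rule>, and where is the legal basis?)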
Recommended schema:
{
"difficulty": "medium",
"question_type": "interpretation",
"evidence_scope": "multi_article_same_law",
"legal_operation": [
"definition",
"purpose_interpretation"
],
"required_citations": [
{
"role": "anchor_rule",
"citation": "<main_article>"
},
{
"role": "definition_or_purpose",
"citation": "<supporting_article>"
}
]
}
A good medium item should pass this test:
Answer using only the seed article = incomplete.
Answer using the full evidence packet = complete.
Reject a medium item if:
- it is fully answerable from the seed article alone;
- the supporting article is only topically similar, not legally necessary;
- the answer relies on vague policy reasoning without citation;
- required citations are unclear.
Hard
Purpose: test legally complete reasoning.
Hard questions should contain a legal trap or require multiple legal operations.
Good hard patterns include:
| Trap type | What it tests |
|---|---|
| General rule vs exception | Does the system retrieve the exception? |
| General law vs special law | Does the system find the controlling special rule? |
| Definition trap | Does it use statutory meaning, not ordinary meaning? |
| Procedure trap | Does it know the required process/authority? |
| Temporal trap | Does it know effective date/amendment status? |
| Missing-fact trap | Does it avoid over-answering? |
| Similar-term trap | Does it distinguish close legal concepts? |
| Cross-domain trap | Does it combine civil/criminal/administrative/labor/tax law? |
Hard question types:
analytical
application
Hard application answers should follow:
Issue
→ Applicable rules
→ Application to facts
→ Conclusion
→ Caveats / missing facts
Recommended schema:
{
"difficulty": "hard",
"question_type": "application",
"evidence_scope": "exception_based",
"legal_operation": [
"case_application",
"exception",
"special_subject_rule"
],
"answerability": "partially_answerable",
"required_citations": [
{
"role": "anchor_rule",
"citation": "<main_rule>"
},
{
"role": "exception",
"citation": "<exception_rule>"
},
{
"role": "procedure_or_authority",
"citation": "<procedure_rule>"
}
],
"expert_review_required": true,
"mos_threshold": 3
}
Your 0–5 MOS idea is good. I would use:
| Score | Meaning |
|---|---|
| 0 | Irrelevant, hallucinated, or dangerous |
| 1 | Mentions topic but misses controlling law |
| 2 | Partially correct but misses a major rule, exception, or required law |
| 3 | Minimally acceptable; correct general conclusion but incomplete nuance/citation |
| 4 | Correct, well-grounded, cites required provisions, handles main exceptions |
| 5 | Expert-level: complete rule synthesis, application, caveats, citations, and limits |
Acceptance rule:
accept hard item only if MOS >= 3
Also store disagreement:
{
"reviewer_scores": [2, 4, 3],
"mos": 3.0,
"score_range": 2,
"needs_adjudication": true
}
If lawyers disagree heavily, mark the item as ambiguous instead of pretending it is clean.
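The disagreement record can be computed rather than filled in by hand. A minimal sketch follows; the function name `summarize_reviews` and the `max_range=1` adjudication trigger are assumptions (the example above only shows that a range of 2 needs adjudication), while the acceptance threshold of 3 comes from the rule stated earlier.

```python
from statistics import mean


def summarize_reviews(scores, accept_threshold=3.0, max_range=1):
    """Aggregate 0-5 reviewer scores into the disagreement record shown above."""
    if not scores:
        raise ValueError("at least one reviewer score is required")
    if any(not 0 <= s <= 5 for s in scores):
        raise ValueError("scores must be on the 0-5 scale")
    mos = round(float(mean(scores)), 2)
    score_range = max(scores) - min(scores)
    return {
        "reviewer_scores": list(scores),
        "mos": mos,
        "score_range": score_range,
        # Assumption: a spread wider than max_range triggers adjudication.
        "needs_adjudication": score_range > max_range,
        "meets_mos_threshold": mos >= accept_threshold,
    }


summary = summarize_reviews([2, 4, 3])
print(summary)
```

An item can meet the MOS threshold and still need adjudication, which is why the two flags are stored separately here.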
5. Evidence packet generation
This is the core pipeline.
For each seed provision:
1. Retrieve candidate related provisions.
2. Classify each candidate.
3. Keep only legally necessary provisions.
4. Build an evidence packet.
5. Generate QA from that packet.
Candidate provisions should be classified as:
REQUIRED
HELPFUL_BACKGROUND
HARD_NEGATIVE
IRRELEVANT
Meaning:
| Label | Meaning |
|---|---|
| REQUIRED | Omitting this provision makes the answer incomplete or wrong |
| HELPFUL_BACKGROUND | Useful but not necessary |
| HARD_NEGATIVE | Similar enough to confuse retrieval, but not legally controlling |
| IRRELEVANT | Not useful |
Prompt:
You are building a Vietnamese legal RAG evaluation dataset.
Seed legal provision:
<seed_provision>
Candidate legal provision:
<candidate_provision>
Task:
Classify whether the candidate provision is legally required to answer questions about the seed provision.
Return JSON only:
{
"status": "REQUIRED | HELPFUL_BACKGROUND | HARD_NEGATIVE | IRRELEVANT",
"relation_type": "definition | exception | condition | special_subject_rule | procedure | authority | sanction | amendment | same_topic | irrelevant",
"why": "...",
"would_omission_make_answer_incomplete": true
}
Rules:
- REQUIRED means omitting this provision can make a legal answer incomplete or wrong.
- HELPFUL_BACKGROUND means useful but not necessary.
- HARD_NEGATIVE means similar enough to confuse retrieval but not legally controlling.
- IRRELEVANT means unrelated.
- Do not mark a provision REQUIRED merely because it is about the same topic.
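LLM classifiers do not always respect their own output contract, so it is worth checking the returned JSON before it enters a packet. The sketch below is one possible guard, assuming the model returns exactly the JSON object requested in the prompt; `parse_classification` is a hypothetical helper, and the two consistency checks are my reading of the rules, not something the prompt enforces by itself.

```python
import json

STATUSES = {"REQUIRED", "HELPFUL_BACKGROUND", "HARD_NEGATIVE", "IRRELEVANT"}
RELATION_TYPES = {
    "definition", "exception", "condition", "special_subject_rule", "procedure",
    "authority", "sanction", "amendment", "same_topic", "irrelevant",
}


def parse_classification(raw: str) -> dict:
    """Parse and sanity-check one classifier response; raise ValueError if unusable."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if data.get("status") not in STATUSES:
        raise ValueError(f"unknown status: {data.get('status')!r}")
    if data.get("relation_type") not in RELATION_TYPES:
        raise ValueError(f"unknown relation_type: {data.get('relation_type')!r}")
    omission = data.get("would_omission_make_answer_incomplete")
    if not isinstance(omission, bool):
        raise ValueError("would_omission_make_answer_incomplete must be true or false")
    # Assumption: REQUIRED and the omission flag should agree with each other.
    if (data["status"] == "REQUIRED") != omission:
        raise ValueError("status and omission flag contradict each other")
    # Topical similarity alone must not justify REQUIRED.
    if data["status"] == "REQUIRED" and data["relation_type"] in {"same_topic", "irrelevant"}:
        raise ValueError("REQUIRED needs a legal relation, not same_topic")
    return data


response = json.dumps({
    "status": "REQUIRED",
    "relation_type": "exception",
    "why": "The candidate carves out an exception to the seed rule.",
    "would_omission_make_answer_incomplete": True,
})
print(parse_classification(response)["status"])
```

Responses that fail these checks are probably better re-queried or sent to review than silently repaired.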
6. Use a legal graph over your 4,000 documents
With 4,000+ documents, you should build a legal graph.
Nodes:
law
chapter
section
article
clause
point
sentence/span
Minimum useful node:
{
"node_id": "<doc_id>_article_<n>_clause_<m>",
"law_title": "<law_title>",
"law_number": "<law_number>",
"authority_level": "law | decree | circular | resolution",
"article": "<article_number>",
"clause": "<clause_number>",
"text": "<legal_text>",
"effective_date": "<date>",
"status": "effective | amended | repealed | unknown",
"topics": ["<topic_1>", "<topic_2>"],
"explicit_references": []
}
Edges:
| Edge type | Meaning |
|---|---|
| explicit_reference | One provision cites another |
| defines_term | One provision defines a term used elsewhere |
| exception_to | One provision creates an exception |
| special_rule_for | One provision creates special treatment for a group |
| condition_for | One provision gives conditions |
| procedure_for | One provision explains process or authority |
| sanction_for | One provision gives consequence/penalty |
| implements | Decree/circular implements a law |
| amends | Later law changes earlier law |
| repeals | Later law removes earlier law |
| same_topic | Related but not necessarily required |
Medium/hard QA should be generated from strong legal edges:
anchor_rule + definition
anchor_rule + exception
anchor_rule + special_subject_rule
anchor_rule + procedure
anchor_rule + amendment
Not merely:
anchor_rule + same_topic
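Selecting packet candidates from the graph then becomes a filter over edge types. The sketch below assumes edges are stored as `(source, edge_type, target)` triples and that the five strong pairings above map onto the edge types `defines_term`, `exception_to`, `special_rule_for`, `procedure_for`, and `amends`; that mapping, the function name, and the node ids are illustrative assumptions.

```python
# Assumed mapping of the strong pairings above onto edge types from the edge table.
STRONG_EDGES = {"defines_term", "exception_to", "special_rule_for", "procedure_for", "amends"}


def strong_neighbors(edges, seed_id):
    """Return (node_id, edge_type) pairs linked to the seed by a strong legal edge.

    `edges` is an iterable of (source, edge_type, target) triples. Both directions
    are considered because, for example, an exception points at the rule it modifies.
    """
    neighbors = []
    for source, edge_type, target in edges:
        if edge_type not in STRONG_EDGES:
            continue  # same_topic and other weak edges never enter the packet here
        if source == seed_id:
            neighbors.append((target, edge_type))
        elif target == seed_id:
            neighbors.append((source, edge_type))
    return neighbors


# Hypothetical node ids.
edges = [
    ("art_5", "exception_to", "art_4"),
    ("art_2", "defines_term", "art_4"),
    ("art_9", "same_topic", "art_4"),
    ("art_4", "explicit_reference", "art_20"),
]
print(strong_neighbors(edges, "art_4"))
```

Neighbors found this way are still only candidates; the REQUIRED / HARD_NEGATIVE classification decides what stays in the packet.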
7. Retrieval should be hybrid
Do not use only vector search.
Vietnamese legal retrieval needs exact matching and semantic matching.
Use:
BM25
+ dense retrieval
+ citation regex search
+ graph expansion
+ reranking
Why?
Legal documents contain exact signals:
- law names;
- article numbers;
- clause numbers;
- “Điều”, “khoản”, “điểm” (article, clause, point);
- “trừ trường hợp” (“except where”);
- “theo quy định tại” (“as provided in”);
- “sửa đổi, bổ sung” (“amended, supplemented”);
- formal legal phrases.
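Several of the exact signals listed above can be captured with a regular expression before any embedding model is involved. The following sketch handles only the common "điểm a khoản 2 Điều 15" ordering and ignores law names and ranges; the pattern and the `extract_citations` helper are assumptions for illustration, not a complete Vietnamese citation grammar.

```python
import re
import unicodedata

# Pattern is NFC-normalized so it matches text regardless of how accents were encoded.
_PATTERN = unicodedata.normalize(
    "NFC",
    r"(?:điểm\s+(?P<point>[a-zđ])\s+)?"
    r"(?:khoản\s+(?P<clause>\d+)\s+)?"
    r"điều\s+(?P<article>\d+)",
)
CITATION_RE = re.compile(_PATTERN, re.IGNORECASE)


def extract_citations(text: str) -> list[dict]:
    """Find point/clause/article references; missing parts are returned as None."""
    normalized = unicodedata.normalize("NFC", text)
    return [
        {"article": m["article"], "clause": m["clause"], "point": m["point"]}
        for m in CITATION_RE.finditer(normalized)
    ]


sample = "theo quy định tại điểm a khoản 2 Điều 15 và Điều 7 của Luật này"
for citation in extract_citations(sample):
    print(citation)
```

The extracted triples can feed both the `explicit_references` field of a node and a citation index used at query time.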
A practical retrieval stack:
Stage 1:
BM25 top 50
dense retrieval top 50
citation index top 20
graph neighbors top 20
Merge:
deduplicate by node_id
Stage 2:
rerank the top 50 merged candidates and keep the top 10 or top 20
Stage 3:
evidence sufficiency check
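The merge step of this stack can be sketched as follows. The per-source limits mirror Stage 1 above; the function name, the `sources`/`best_rank` bookkeeping, and the pre-rerank ordering (more sources first, then best rank) are assumptions, since the plan only asks for deduplication by `node_id`.

```python
def merge_candidates(ranked_lists: dict, limits: dict) -> list[dict]:
    """Merge per-retriever rankings, deduplicating by node_id.

    `ranked_lists` maps a retriever name to node ids in rank order;
    `limits` maps the same names to how many results to keep from each.
    """
    merged = {}
    for source, node_ids in ranked_lists.items():
        kept = node_ids[: limits.get(source, len(node_ids))]
        for rank, node_id in enumerate(kept, start=1):
            entry = merged.setdefault(
                node_id, {"node_id": node_id, "sources": [], "best_rank": rank}
            )
            entry["sources"].append(source)
            entry["best_rank"] = min(entry["best_rank"], rank)
    # Heuristic ordering before reranking: agreement across retrievers, then rank.
    return sorted(
        merged.values(),
        key=lambda e: (-len(e["sources"]), e["best_rank"], e["node_id"]),
    )


limits = {"bm25": 50, "dense": 50, "citation": 20, "graph": 20}
ranked_lists = {
    "bm25": ["a", "b", "c"],
    "dense": ["b", "d"],
    "citation": ["a"],
    "graph": ["e"],
}
merged = merge_candidates(ranked_lists, limits)
print([entry["node_id"] for entry in merged])
```

Recording which retrievers found each node is useful later for diagnosing whether, say, exceptions are only ever surfaced by graph expansion.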
Suggested metrics:
anchor_recall@k
required_citation_recall@k
all_required_retrieved@k
supporting_law_recall@k
exception_recall@k
definition_recall@k
hard_negative_rate
The most important one for medium/hard items:
all_required_retrieved@k
If the retriever misses one controlling provision, the answer may be incomplete.
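Two of these metrics are simple set operations over retrieved and required node ids. The sketch below shows one way to compute them; the function names follow the metric names above, while the definition of `hard_negative_rate` (share of the top-k occupied by labeled hard negatives) is an assumption, since the plan does not define it.

```python
def required_citation_recall_at_k(retrieved, required, k):
    """Fraction of required node ids that appear in the top-k retrieved list."""
    required = set(required)
    if not required:
        raise ValueError("an item needs at least one required citation")
    return len(required & set(retrieved[:k])) / len(required)


def all_required_retrieved_at_k(retrieved, required, k):
    """True only if every required node id appears in the top-k."""
    return set(required) <= set(retrieved[:k])


def hard_negative_rate(retrieved, hard_negatives, k):
    """Assumed definition: share of top-k slots taken by labeled hard negatives."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    negatives = set(hard_negatives)
    return sum(1 for node_id in top_k if node_id in negatives) / len(top_k)


retrieved = ["n1", "x", "n2", "neg", "y"]
print(required_citation_recall_at_k(retrieved, {"n1", "n2", "n3"}, k=5))
print(all_required_retrieved_at_k(retrieved, {"n1", "n2", "n3"}, k=5))
print(hard_negative_rate(retrieved, {"neg"}, k=5))
```

A retriever can score well on recall and still fail `all_required_retrieved@k` on most multi-provision items, which is exactly the gap the stricter metric is meant to expose.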
8. QA generation prompt
Generate QA only after the evidence packet is ready.
You are generating Vietnamese legal QA data for evaluating RAG systems.
Evidence packet:
<evidence_packet>
Generate one question and one gold answer.
Requirements:
- The question must require all REQUIRED provisions.
- The answer must cite all REQUIRED provisions.
- The question must not be fully answerable from the seed provision alone.
- The answer must be legally cautious.
- Do not invent policy reasons that are not supported by the legal provisions.
- If facts are insufficient, say so.
Return JSON:
{
"question": "...",
"gold_answer": "...",
"difficulty": "...",
"question_type": "...",
"evidence_scope": "...",
"legal_operation": [...],
"required_citations": [...],
"supporting_legal_points": [...],
"why_seed_alone_is_insufficient": "...",
"unacceptable_incomplete_answer": "...",
"rubric_0_5": {...}
}
9. Add the seed-only vs full-packet test
For every medium/hard item, run two validations.
Test A: seed-only
Give the model only the seed article and ask it to answer.
Expected result:
incomplete
Test B: full packet
Give the model all required provisions and ask it to answer.
Expected result:
complete
Store this:
{
"seed_only_answer_status": "incomplete",
"full_packet_answer_status": "complete"
}
This directly catches your current failure mode.
If the seed-only answer is already complete, then the question is probably not truly medium/hard.
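The outcome of the two tests can be turned into a triage decision per item. This is a rough sketch: the status strings come from the stored record above, but the function name and the returned action labels are hypothetical, and reading a failed full-packet test as a packet problem is my inference.

```python
def triage_item(validation: dict) -> str:
    """Map the two validation statuses to a suggested action for the item."""
    seed_only = validation.get("seed_only_answer_status")
    full_packet = validation.get("full_packet_answer_status")
    if full_packet != "complete":
        # Even the gold evidence did not yield a complete answer:
        # the packet is probably missing a provision, or the question is unclear.
        return "fix_evidence_packet"
    if seed_only == "complete":
        # The seed article alone suffices, so this is probably not medium/hard.
        return "not_truly_medium_or_hard"
    if seed_only == "incomplete":
        return "keep"
    return "rerun_validation"


print(triage_item({
    "seed_only_answer_status": "incomplete",
    "full_packet_answer_status": "complete",
}))
```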
10. Add a missing-law validator
Your validator should not only ask:
Is the answer consistent with the given law?
It should ask:
Does the answer omit any required law/article/clause?
Prompt:
Question:
<question>
Required citations:
<required_citations>
Candidate answer:
<answer>
Evaluate whether the answer is legally complete.
Return JSON only:
{
"score_0_to_5": 0,
"complete": true,
"failure_type": "none | missing_required_citation | missing_exception | wrong_citation | unsupported_claim | wrong_legal_conclusion | insufficient_facts_not_detected",
"missing_required_citations": [],
"missing_legal_points": [],
"wrong_or_irrelevant_citations": [],
"unsupported_claims": [],
"short_reason": "..."
}
Rules:
- If any required citation is missing, score cannot exceed 3.
- If the conclusion is wrong, score cannot exceed 2.
- If the answer is faithful to one provision but misses another controlling provision, mark it as missing_required_citation.
This validator is more aligned with legal RAG than a generic answer-correctness judge.
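Because a judge model may ignore the scoring rules in its own prompt, the caps can also be enforced in code after the fact. The sketch below assumes the judge returns the JSON fields listed in the prompt; `apply_score_caps` is a hypothetical helper, and recomputing `complete` from the missing-item lists is an added assumption.

```python
def apply_score_caps(verdict: dict) -> dict:
    """Return a copy of the judge verdict with the prompt's scoring rules enforced."""
    capped = dict(verdict)
    score = int(verdict["score_0_to_5"])
    missing_citations = verdict.get("missing_required_citations") or []
    missing_points = verdict.get("missing_legal_points") or []

    if missing_citations:
        score = min(score, 3)  # rule: a missing required citation caps the score at 3
        if capped.get("failure_type", "none") == "none":
            capped["failure_type"] = "missing_required_citation"
    if capped.get("failure_type") == "wrong_legal_conclusion":
        score = min(score, 2)  # rule: a wrong conclusion caps the score at 2

    capped["score_0_to_5"] = score
    # Assumption: an answer with anything missing is never reported as complete.
    capped["complete"] = bool(verdict.get("complete")) and not missing_citations and not missing_points
    return capped


raw_verdict = {
    "score_0_to_5": 5,
    "complete": True,
    "failure_type": "none",
    "missing_required_citations": ["<node_id_2>"],
    "missing_legal_points": [],
}
print(apply_score_caps(raw_verdict))
```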
11. How to use your LoRA validator
Keep the LoRA validator, but change its role.
Use it as a cheap first-pass validator, not as the final legal judge.
Good uses:
- detecting obviously bad outputs;
- checking local consistency;
- validating easy factual items;
- filtering bad generations;
- reducing API cost.
Weak uses:
- cross-law completeness;
- special-law exceptions;
- amendment/version issues;
- hierarchy conflicts;
- hard case studies.
Recommended validation stack:
rule-based citation checker
→ LoRA local validator
→ stronger LLM missing-law judge
→ lawyer review for hard/uncertain items
Train or adapt future validators on failure types:
complete
missing_supporting_law
missing_exception
wrong_citation
unsupported_claim
wrong_legal_version
insufficient_facts_not_detected
This is more useful than only training on correct/incorrect labels.
12. Store supporting legal points
Do not store only final answers.
Store the legal points required for the answer.
Example:
{
"supporting_legal_points": [
{
"point": "The general rule requires considering individual characteristics.",
"source_node_id": "<node_a>",
"role": "anchor_rule"
},
{
"point": "A special rule applies to persons under 18.",
"source_node_id": "<node_b>",
"role": "special_subject_rule"
}
]
}
This is similar in spirit to multi-hop QA datasets that store supporting facts, such as HotpotQA, but adapted to legal provisions.
13. Add unanswerable and insufficient-fact items
Legal RAG systems should not always answer confidently.
Some legal questions cannot be answered conclusively because:
- key facts are missing;
- effective date is unknown;
- the controlling document is not in the corpus;
- the answer depends on a contract or administrative decision;
- the situation requires case-specific legal judgment.
Add labels:
{
"answerability": "answerable | partially_answerable | insufficient_facts | not_in_corpus | version_unclear"
}
Expected answer style for insufficient facts:
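Chưa thể kết luận chắc chắn vì tình huống chưa nêu rõ <missing_fact>. (A firm conclusion is not yet possible because the scenario does not specify <missing_fact>.)
Nếu <condition_A> được đáp ứng thì <legal_consequence_A>. (If <condition_A> is met, then <legal_consequence_A>.)
Nếu không đáp ứng <condition_A> thì <legal_consequence_B>. (If <condition_A> is not met, then <legal_consequence_B>.)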
This is useful because real legal RAG should be able to say:
The available facts are not enough for a final conclusion.
The FEVER style of evidence-based “Supported / Refuted / NotEnoughInfo” labeling is a useful conceptual reference.
14. Recommended final dataset schema
Use JSONL.
One item per line:
{
"id": "vnlegalrag_000001",
"language": "vi",
"jurisdiction": "VN",
"as_of_date": "2026-05-16",
"difficulty": "medium",
"question_type": "analytical",
"evidence_scope": "multi_law",
"legal_operation": [
"general_rule",
"special_subject_rule",
"policy_reasoning"
],
"answerability": "answerable",
"question": "<question_text>",
"gold_answer": "<gold_answer_text>",
"required_citations": [
{
"node_id": "<node_id_1>",
"citation_text": "<citation_text_1>",
"role": "anchor_rule",
"must_retrieve": true,
"must_cite": true
},
{
"node_id": "<node_id_2>",
"citation_text": "<citation_text_2>",
"role": "special_subject_rule",
"must_retrieve": true,
"must_cite": true
}
],
"supporting_legal_points": [
{
"point": "<legal_point>",
"source_node_id": "<node_id>",
"role": "anchor_rule"
}
],
"hard_negatives": [
{
"node_id": "<negative_node_id>",
"reason": "Topically similar but not controlling."
}
],
"unacceptable_incomplete_answers": [
{
"failure_type": "missing_supporting_law",
"description": "States the general rule but omits the special rule."
}
],
"rubric_0_5": {
"0": "Irrelevant, hallucinated, or legally wrong.",
"1": "Mentions topic but misses main rule.",
"2": "States main rule but misses required supporting law.",
"3": "Minimally acceptable but incomplete citations or reasoning.",
"4": "Correct and cites all required provisions.",
"5": "Expert-quality answer with complete reasoning, citations, exceptions, and caveats."
},
"validation": {
"seed_only_answer_status": "incomplete",
"full_packet_answer_status": "complete",
"automatic_validation_passed": true,
"expert_review_status": "not_required"
}
}
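A loader that rejects malformed lines early keeps schema drift out of the benchmark. The sketch below reads items from any iterable of lines; the set of required keys is my selection from the schema above, and the check that every supporting legal point refers to a required citation is an added assumption.

```python
import json

# Assumed minimum set of keys, taken from the schema above.
REQUIRED_KEYS = {
    "id", "difficulty", "question_type", "evidence_scope", "answerability",
    "question", "gold_answer", "required_citations", "supporting_legal_points",
    "validation",
}


def parse_jsonl(lines) -> list[dict]:
    """Parse JSONL lines into items, raising ValueError on the first bad line."""
    items = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        item = json.loads(line)
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        required_nodes = {c["node_id"] for c in item["required_citations"]}
        for point in item["supporting_legal_points"]:
            if point["source_node_id"] not in required_nodes:
                raise ValueError(
                    f"line {line_no}: legal point cites {point['source_node_id']!r}, "
                    "which is not a required citation"
                )
        items.append(item)
    return items


sample_line = json.dumps({
    "id": "vnlegalrag_000001", "difficulty": "medium", "question_type": "analytical",
    "evidence_scope": "multi_law", "answerability": "answerable",
    "question": "<question_text>", "gold_answer": "<gold_answer_text>",
    "required_citations": [{"node_id": "n1", "role": "anchor_rule"}],
    "supporting_legal_points": [{"point": "<legal_point>", "source_node_id": "n1", "role": "anchor_rule"}],
    "validation": {"seed_only_answer_status": "incomplete", "full_packet_answer_status": "complete"},
}, ensure_ascii=False)
items = parse_jsonl([sample_line])
print(len(items))
```

For a file on disk, passing the open file handle works the same way, for example `parse_jsonl(open(path, encoding="utf-8"))`, where `path` is whatever location the dataset is stored at.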
15. Evaluation design
Create three benchmark tracks.
Track A: retrieval-only
Input:
question
Expected output:
required legal provisions
Metrics:
required_citation_recall@5
required_citation_recall@10
all_required_citations_retrieved@10
MRR
nDCG
hard_negative_rate
Track B: gold-context generation
Input:
question + gold evidence packet
Expected output:
answer
Metrics:
legal correctness
faithfulness
citation use
clarity
exception handling
Track C: full RAG
Input:
question + corpus
System must:
retrieve → answer
Metrics:
retrieval completeness
legal correctness
citation completeness
faithfulness
answer usefulness
Diagnostic table:
| Retrieval-only | Gold-context generation | Full RAG | Diagnosis |
|---|---|---|---|
| Good | Good | Good | System works |
| Bad | Good | Bad | Retriever failed |
| Good | Bad | Bad | Generator/reasoner failed |
| Good | Good | Bad | Context assembly or prompt failed |
| Bad | Bad | Bad | Both retrieval and generation are weak |
This separation is important. Otherwise, you may blame the LLM when the real problem is missing retrieval.
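The diagnostic table translates directly into a lookup. In this sketch each track result is reduced to a boolean (good or bad); how "good" is thresholded per track is left open, and the fallback message for combinations not in the table is an assumption.

```python
# Keys are (retrieval_only_ok, gold_context_ok, full_rag_ok), as in the table above.
DIAGNOSIS = {
    (True, True, True): "System works",
    (False, True, False): "Retriever failed",
    (True, False, False): "Generator/reasoner failed",
    (True, True, False): "Context assembly or prompt failed",
    (False, False, False): "Both retrieval and generation are weak",
}


def diagnose(retrieval_ok: bool, gold_context_ok: bool, full_rag_ok: bool) -> str:
    """Look up the diagnosis for one system from its three track outcomes."""
    key = (retrieval_ok, gold_context_ok, full_rag_ok)
    # Combinations the table does not list (e.g. full RAG good despite bad
    # retrieval) deserve manual inspection rather than a guessed label.
    return DIAGNOSIS.get(key, "Unlisted combination; inspect manually")


print(diagnose(False, True, False))
```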
16. Recommended dataset size
Do not start with a huge dataset.
Start with a high-quality pilot.
Version 0.1
500 easy
300 medium
50 hard
50 insufficient/unanswerable
Goal:
debug schema, prompts, validators, retrieval metrics
Version 0.2
2,000 easy
1,000 medium
150 hard
150 insufficient/unanswerable
Goal:
evaluate real RAG pipelines
Version 1.0
5,000–8,000 total
20–30% multi-source
5–10% expert-reviewed hard
5–10% insufficient/unanswerable
A smaller expert-reviewed hard set is more valuable than a large weak synthetic hard set.
17. Suggested distribution
For a RAG evaluation benchmark:
| Difficulty | Share | Purpose |
|---|---|---|
| Easy | 35–45% | Basic retrieval/comprehension |
| Medium | 40–50% | Main RAG stress test |
| Hard | 10–15% | Expert-reviewed reasoning |
| Insufficient/unanswerable | 5–10% | Safety and uncertainty handling |
By question type:
| Type | Share |
|---|---|
| Factual | 35–45% |
| Interpretation | 15–20% |
| Analytical | 25–35% |
| Application | 10–15% |
By evidence scope:
| Scope | Share |
|---|---|
| Single clause/article | 35–45% |
| Multi-article same law | 20–25% |
| Multi-law | 20–25% |
| Exception/special/temporal | 10–15% |
18. Common pitfalls
Pitfall 1: generating medium/hard questions from one article
Problem:
The answer is locally correct but legally incomplete.
Fix:
Generate from evidence packets.
Pitfall 2: confusing topical similarity with legal necessity
Problem:
A provision may discuss the same topic but not control the answer.
Fix:
Classify provisions as REQUIRED / HELPFUL_BACKGROUND / HARD_NEGATIVE / IRRELEVANT.
Pitfall 3: using dense retrieval only
Problem:
Embeddings may miss article numbers, exact legal terms, and formal references.
Fix:
BM25 + dense retrieval + citation search + graph expansion + reranker.
Pitfall 4: scoring only final answer similarity
Problem:
A semantically similar answer can still miss a controlling exception.
Fix:
Score required citation coverage, legal conclusion correctness, context sufficiency, and exception handling.
Pitfall 5: treating a narrow LoRA validator as a universal legal judge
Problem:
It may validate local consistency but miss cross-law incompleteness.
Fix:
Use it as a cheap filter, then use missing-law validation and expert review.
Pitfall 6: ignoring legal version/date
Problem:
The correct answer may depend on effective date, amendment, or repeal status.
Fix:
{
"as_of_date": "<date>",
"law_status": "effective | amended | repealed | unknown",
"effective_date": "<date>"
}
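With those fields stored, a provision can be checked against an item's `as_of_date` before it is allowed into a gold packet. The sketch below is deliberately conservative and rests on assumptions: dates are ISO strings (`YYYY-MM-DD`), the function name and returned labels are hypothetical, and because the fields above do not record when an amendment or repeal took effect, amended and repealed provisions are only flagged for a manual version check.

```python
from datetime import date


def version_check(node: dict, as_of: date) -> str:
    """Classify a provision's status relative to the benchmark's as_of date."""
    status = node.get("status", "unknown")
    try:
        effective = date.fromisoformat(node.get("effective_date") or "")
    except ValueError:
        return "version_unclear"  # missing or non-ISO effective date
    if status == "unknown":
        return "version_unclear"
    if effective > as_of:
        return "not_yet_effective"
    if status == "effective":
        return "in_force"
    # amended / repealed: the date of the change is not stored in this schema,
    # so the text valid on as_of cannot be decided automatically.
    return "needs_version_check"


as_of = date(2026, 5, 16)
print(version_check({"status": "effective", "effective_date": "2021-01-01"}, as_of))
```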
19. Final architecture
4,000+ Vietnamese legal documents
↓
legal parser
↓
article/clause nodes
↓
metadata extraction
↓
citation + relation extraction
↓
legal graph
↓
BM25 + dense + citation indexes
↓
evidence packet builder
↓
QA generator
↓
seed-only vs full-packet validation
↓
citation completeness validator
↓
LoRA cheap validator
↓
strong LLM judge for medium/hard
↓
lawyer MOS review for hard
↓
final RAG evaluation dataset
20. Final answer
Your current method works for easy factual QA, but it is not enough for medium and hard legal QA.
The main issue is not temperature, prompt strictness, or generation style.
The main issue is:
The model is generating from incomplete legal evidence.
To fix that, build every medium/hard item around a gold evidence packet.
Each dataset item should store:
question
gold answer
required citations
supporting legal points
hard negatives
difficulty
question type
evidence scope
legal operation
answerability
validation status
expert review status
The strongest version of your project would not be just a Vietnamese legal QA dataset.
It would be:
A Vietnamese legal RAG benchmark that measures required legal evidence retrieval and legal completeness.
That is a much more valuable contribution than plain synthetic QA.