Legal data creation

Hi, I’m trying to create my own Vietnamese legal dataset to help evaluate RAG systems. So far I have obtained more than 4,000 legal documents. I want to use an LLM to synthesize the dataset, which consists of question-answer pairs. I divide the dataset as follows:

3 types of difficulties:

  • easy: basic comprehension questions; the answer is in the given legal text.

  • medium: requires some legal knowledge and reasoning; the answer may need to cover multiple laws or articles to be considered correct.

  • hard: deep understanding and analysis; asks about exceptions to the rules, case studies, or how to handle unusual situations. Requires multiple lawyers to grade the answer on a 0-5 scale, and the item needs a MOS (Mean Opinion Score) of at least 3.

4 types of questions:

  • factual: direct questions about specific facts mentioned in the text. This type is always easy, and easy questions can only be of this type.

  • interpretation: asks for the meaning of, or the goal behind, a law. It focuses on what a legal term actually means in the real world. Ex: explain the meaning of the phrase “mineral processing activities not tied to an investment project” in the law.

  • analytical: requires breaking down the law to find connections, differences, or relationships between concepts. Ex: compare the rules for exploring a mine with the rules for mining it, and explain why those differences exist.

  • application: made-up scenarios, case studies, or real-life cases (already resolved, with a known answer).

=> {Factual} ∈ Easy, {Interpretation, Analytical} ∈ Medium, {Analytical, Application} ∈ Hard

The way I synthesize my data is simple: I write a prompt with strict guidelines, give it a law article to generate a question and answer from, set a low temperature, and call the API. I also have a validator, fine-tuned on one Vietnamese law using LoRA (I’m poor, can’t train more), which I use to validate the generated questions and answers for that law. The problem I ran into is that the synthesized medium and hard answers were correct only with respect to the given article; the fully correct answer needed another law. Example

I haven’t thought of anything that could help, so PLEASE HELP ME!

Hmm…


Building a Vietnamese Legal QA Dataset for Evaluating RAG Systems

I think your current plan is directionally correct, but the main weakness is where the QA generation starts.

Right now, your pipeline is roughly:

one legal article
→ strict prompt
→ low-temperature LLM generation
→ question-answer pair
→ local validator

That is acceptable for easy factual questions, because the answer is supposed to be inside one legal text.

But it is structurally weak for medium and hard legal questions, because a legally complete answer often needs more than the seed article. It may need:

  • a definition from another article;
  • a special rule for a special class of people;
  • an exception;
  • an implementing decree or circular;
  • a procedure article;
  • an authority/competence article;
  • an amendment or effective-date rule;
  • a higher-level law that controls a lower-level document.

So the model may produce an answer that is:

  • faithful to the given article;
  • locally correct;
  • fluent;
  • stable under low temperature;
  • accepted by a validator trained on that law;

but still legally incomplete because the full answer requires another law or article.

For RAG evaluation, that is the key issue. A RAG benchmark should not only test whether the generator can write a nice answer. It should test whether the system can:

retrieve all legally required provisions
→ use them correctly
→ cite them correctly
→ produce a legally complete answer

This is why I would redesign your dataset around evidence packets, not isolated articles.


1. Core recommendation

Do not generate medium/hard QA directly from a single article.

Instead:

seed article
→ retrieve related legal provisions
→ classify legal relations
→ build an evidence packet
→ generate the question
→ generate the gold answer
→ validate citation coverage
→ validate legal completeness

In short:

Generate the evidence first, then generate the QA.

An evidence packet is the complete set of legal provisions required to answer a question correctly.

For example:

{
  "seed_provision": "<main_article_or_clause>",
  "required_provisions": [
    {
      "citation": "<supporting_law_article_or_clause>",
      "role": "special_subject_rule",
      "why_required": "The general rule changes for this type of subject."
    },
    {
      "citation": "<definition_article_or_clause>",
      "role": "definition",
      "why_required": "This provision defines a term used in the question."
    }
  ],
  "hard_negatives": [
    {
      "citation": "<similar_but_not_controlling_article>",
      "reason": "Topically similar, but not legally required."
    }
  ]
}

This is the main fix.


2. Why this matters for RAG evaluation

A normal QA dataset can often be built like this:

paragraph → question → answer

Legal QA is different.

Legal reasoning is often distributed across multiple legal units:

general rule
+ definition
+ exception
+ special rule
+ procedure
+ authority
+ effective date
= complete legal answer

This is why legal RAG benchmarks should evaluate retrieval completeness, not only final answer similarity.

Useful references:

  • LegalBench shows that legal reasoning is not one generic ability; it includes multiple forms of legal reasoning.
  • LegalBench-RAG focuses specifically on the retrieval step in legal RAG and emphasizes precise legal text retrieval.
  • VLQA is highly relevant because it is a Vietnamese legal QA dataset with expert verification and statutory references.
  • ViLQA / ViBidLQA is also relevant because it uses LLM-generated Vietnamese legal QA corrected by domain experts.
  • ARES is useful for thinking about lightweight RAG judges trained on synthetic data plus limited human labels.
  • RAGAS is useful for RAG evaluation and synthetic testset generation ideas.
  • BEIR is useful because it shows BM25 is still a strong retrieval baseline, while reranking/late-interaction methods are often stronger but more expensive.
  • HotpotQA is useful as a multi-hop QA reference because it stores supporting facts, not only final answers.
  • FEVER is useful because it includes evidence-based labels like Supported / Refuted / NotEnoughInfo.
  • RAGBench is useful because it emphasizes explainable/actionable RAG evaluation labels.

3. Refine your taxonomy

Your current taxonomy is good:

Difficulty

  • easy: basic comprehension; answer is in the given legal text.
  • medium: requires some legal knowledge/reasoning; may need multiple articles/laws.
  • hard: requires deep understanding, exceptions, case analysis, unusual situations, or expert grading.

Question type

  • factual: direct fact from text.
  • interpretation: meaning/purpose of a law or legal phrase.
  • analytical: comparison, relationship, connection, difference.
  • application: scenario/case/facts applied to law.

Your mapping is mostly right:

factual       → easy
interpretation → medium
analytical    → medium or hard
application   → hard

But I would add one more axis:

Evidence scope

Because difficulty and evidence scope are not the same.

A question can be easy because it needs one clause. A question can be medium because it needs two related articles. A question can be hard because it needs multiple laws, an exception, and a fact pattern.

Use this expanded schema:

{
  "difficulty": "easy | medium | hard",
  "question_type": "factual | interpretation | analytical | application",
  "evidence_scope": "single_clause | single_article | multi_article_same_law | multi_law | exception_based | temporal | insufficient_facts",
  "legal_operation": [
    "definition",
    "rule_extraction",
    "comparison",
    "condition_check",
    "exception",
    "special_subject_rule",
    "procedure",
    "authority",
    "sanction",
    "hierarchy",
    "amendment",
    "case_application"
  ],
  "answerability": "answerable | partially_answerable | insufficient_facts | not_in_corpus | version_unclear"
}

This makes the dataset much more useful for diagnosing RAG failures.
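
A minimal sketch of how these labels could be checked automatically before an item enters the dataset. The allowed values are copied from the schema above; the function name `validate_labels` and the difficulty-to-type table are illustrative assumptions based on the mapping in this thread, not part of any existing tool.

```python
from typing import Any

# Allowed values, copied from the expanded schema above.
ALLOWED = {
    "difficulty": {"easy", "medium", "hard"},
    "question_type": {"factual", "interpretation", "analytical", "application"},
    "evidence_scope": {
        "single_clause", "single_article", "multi_article_same_law",
        "multi_law", "exception_based", "temporal", "insufficient_facts",
    },
    "answerability": {
        "answerable", "partially_answerable", "insufficient_facts",
        "not_in_corpus", "version_unclear",
    },
}
LEGAL_OPERATIONS = {
    "definition", "rule_extraction", "comparison", "condition_check",
    "exception", "special_subject_rule", "procedure", "authority",
    "sanction", "hierarchy", "amendment", "case_application",
}
# Assumed difficulty -> question type mapping, taken from this thread.
TYPES_BY_DIFFICULTY = {
    "easy": {"factual"},
    "medium": {"interpretation", "analytical"},
    "hard": {"analytical", "application"},
}


def validate_labels(item: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the labels look valid."""
    errors = []
    for field, allowed in ALLOWED.items():
        if item.get(field) not in allowed:
            errors.append(f"{field}: unexpected value {item.get(field)!r}")
    for op in item.get("legal_operation", []):
        if op not in LEGAL_OPERATIONS:
            errors.append(f"legal_operation: unexpected value {op!r}")
    allowed_types = TYPES_BY_DIFFICULTY.get(item.get("difficulty"), set())
    question_type = item.get("question_type")
    if allowed_types and question_type in ALLOWED["question_type"] \
            and question_type not in allowed_types:
        errors.append(
            f"question_type {question_type!r} does not fit "
            f"difficulty {item['difficulty']!r}"
        )
    return errors
```

Running a check like this on every generated item catches label drift early, for example an "easy" item that the generator tagged as an application question.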


4. Difficulty design

Easy

Purpose: test basic retrieval and comprehension.

Easy questions should be:

  • factual only;
  • answerable from one clause/article;
  • direct;
  • objectively gradable;
  • citation-simple.

Example patterns:

Theo Điều <article>, cơ quan nào có thẩm quyền <action>?
Theo khoản <clause>, điều kiện để <do_something> gồm những gì?
Luật quy định thời hạn <process> là bao lâu?
Hành vi nào bị cấm theo Điều <article>?

Recommended schema:

{
  "difficulty": "easy",
  "question_type": "factual",
  "evidence_scope": "single_clause",
  "required_citations": [
    {
      "role": "answer_source",
      "citation": "<law_article_clause>"
    }
  ]
}

Reject an easy question if:

  • it needs another law;
  • it asks “why”;
  • it requires interpretation;
  • it requires exceptions;
  • it depends on external legal knowledge.

Medium

Purpose: test whether the RAG system can connect legal provisions.

Medium does not mean “longer answer.” Medium means the answer requires at least two legal points.

Typical evidence scopes (the last three are finer-grained labels; add them to the evidence_scope enum if you use them):

multi_article_same_law
multi_law
definition_based
procedure_based
special_subject_based

Medium question types:

interpretation
analytical

Good medium patterns:

Cụm từ <legal_phrase> trong quy định này nên được hiểu như thế nào?
Mục đích của quy định <rule> là gì?
Vì sao luật quy định riêng trường hợp <special_subject>?
Phân biệt <concept_A> và <concept_B> theo quy định pháp luật.
Điều kiện áp dụng <rule> gồm những yếu tố nào, và căn cứ pháp lý nằm ở đâu?

Recommended schema:

{
  "difficulty": "medium",
  "question_type": "interpretation",
  "evidence_scope": "multi_article_same_law",
  "legal_operation": [
    "definition",
    "purpose_interpretation"
  ],
  "required_citations": [
    {
      "role": "anchor_rule",
      "citation": "<main_article>"
    },
    {
      "role": "definition_or_purpose",
      "citation": "<supporting_article>"
    }
  ]
}

A good medium item should pass this test:

Answer using only the seed article = incomplete.
Answer using the full evidence packet = complete.

Reject a medium item if:

  • it is fully answerable from the seed article alone;
  • the supporting article is only topically similar, not legally necessary;
  • the answer relies on vague policy reasoning without citation;
  • required citations are unclear.

Hard

Purpose: test legally complete reasoning.

Hard questions should contain a legal trap or require multiple legal operations.

Good hard patterns include:

Trap type | What it tests
General rule vs exception | Does the system retrieve the exception?
General law vs special law | Does the system find the controlling special rule?
Definition trap | Does it use statutory meaning, not ordinary meaning?
Procedure trap | Does it know the required process/authority?
Temporal trap | Does it know effective date/amendment status?
Missing-fact trap | Does it avoid over-answering?
Similar-term trap | Does it distinguish close legal concepts?
Cross-domain trap | Does it combine civil/criminal/administrative/labor/tax law?

Hard question types:

analytical
application

Hard application answers should follow:

Issue
→ Applicable rules
→ Application to facts
→ Conclusion
→ Caveats / missing facts

Recommended schema:

{
  "difficulty": "hard",
  "question_type": "application",
  "evidence_scope": "exception_based",
  "legal_operation": [
    "case_application",
    "exception",
    "special_subject_rule"
  ],
  "answerability": "partially_answerable",
  "required_citations": [
    {
      "role": "anchor_rule",
      "citation": "<main_rule>"
    },
    {
      "role": "exception",
      "citation": "<exception_rule>"
    },
    {
      "role": "procedure_or_authority",
      "citation": "<procedure_rule>"
    }
  ],
  "expert_review_required": true,
  "mos_threshold": 3
}

Your 0–5 MOS idea is good. I would use:

Score | Meaning
0 | Irrelevant, hallucinated, or dangerous
1 | Mentions topic but misses controlling law
2 | Partially correct but misses a major rule, exception, or required law
3 | Minimally acceptable; correct general conclusion but incomplete nuance/citation
4 | Correct, well-grounded, cites required provisions, handles main exceptions
5 | Expert-level: complete rule synthesis, application, caveats, citations, and limits

Acceptance rule:

accept hard item only if MOS >= 3

Also store disagreement:

{
  "reviewer_scores": [2, 4, 3],
  "mos": 3.0,
  "score_range": 2,
  "needs_adjudication": true
}

If lawyers disagree heavily, mark the item as ambiguous instead of pretending it is clean.
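
A small sketch of how the review record above could be computed from raw lawyer scores. The MOS threshold of 3 comes from this thread; the adjudication rule (a score range of 2 or more) is an assumption chosen so that the example scores [2, 4, 3] get flagged, and `summarize_reviews` is a hypothetical helper name.

```python
def summarize_reviews(
    scores: list[int],
    mos_threshold: float = 3.0,
    adjudication_range: int = 2,  # assumption: a gap of 2+ points counts as heavy disagreement
) -> dict:
    """Aggregate 0-5 reviewer scores into the stored review record."""
    if not scores:
        raise ValueError("at least one reviewer score is required")
    if any(not 0 <= s <= 5 for s in scores):
        raise ValueError("scores must be in the range 0-5")
    mos = round(sum(scores) / len(scores), 2)
    score_range = max(scores) - min(scores)
    return {
        "reviewer_scores": list(scores),
        "mos": mos,
        "score_range": score_range,
        "needs_adjudication": score_range >= adjudication_range,
        "passes_mos_threshold": mos >= mos_threshold,
    }
```

Keeping `needs_adjudication` separate from `passes_mos_threshold` means an item can clear the MOS bar and still be routed to a second look when reviewers disagree.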


5. Evidence packet generation

This is the core pipeline.

For each seed provision:

1. Retrieve candidate related provisions.
2. Classify each candidate.
3. Keep only legally necessary provisions.
4. Build an evidence packet.
5. Generate QA from that packet.

Candidate provisions should be classified as:

REQUIRED
HELPFUL_BACKGROUND
HARD_NEGATIVE
IRRELEVANT

Meaning:

Label | Meaning
REQUIRED | Omitting this provision makes the answer incomplete or wrong
HELPFUL_BACKGROUND | Useful but not necessary
HARD_NEGATIVE | Similar enough to confuse retrieval, but not legally controlling
IRRELEVANT | Not useful

Prompt:

You are building a Vietnamese legal RAG evaluation dataset.

Seed legal provision:
<seed_provision>

Candidate legal provision:
<candidate_provision>

Task:
Classify whether the candidate provision is legally required to answer questions about the seed provision.

Return JSON only:
{
  "status": "REQUIRED | HELPFUL_BACKGROUND | HARD_NEGATIVE | IRRELEVANT",
  "relation_type": "definition | exception | condition | special_subject_rule | procedure | authority | sanction | amendment | same_topic | irrelevant",
  "why": "...",
  "would_omission_make_answer_incomplete": true
}

Rules:
- REQUIRED means omitting this provision can make a legal answer incomplete or wrong.
- HELPFUL_BACKGROUND means useful but not necessary.
- HARD_NEGATIVE means similar enough to confuse retrieval but not legally controlling.
- IRRELEVANT means unrelated.
- Do not mark a provision REQUIRED merely because it is about the same topic.
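
A rough sketch of how the classifier output could be parsed and folded into an evidence packet. It assumes the model returns the JSON shape requested in the prompt above; `parse_classification` and `build_evidence_packet` are hypothetical helper names, and the consistency check between `status` and `would_omission_make_answer_incomplete` is an added safeguard, not something the prompt enforces by itself.

```python
import json

VALID_STATUS = {"REQUIRED", "HELPFUL_BACKGROUND", "HARD_NEGATIVE", "IRRELEVANT"}


def parse_classification(raw: str) -> dict:
    """Parse one classifier response and reject malformed or contradictory labels."""
    data = json.loads(raw)
    if data.get("status") not in VALID_STATUS:
        raise ValueError(f"unknown status: {data.get('status')!r}")
    # Assumed sanity check: REQUIRED should agree with the omission flag.
    if data["status"] == "REQUIRED" and not data.get(
        "would_omission_make_answer_incomplete", False
    ):
        raise ValueError("REQUIRED label contradicts the omission flag")
    return data


def build_evidence_packet(seed_citation: str, classified: list[tuple[str, dict]]) -> dict:
    """Keep REQUIRED provisions and HARD_NEGATIVE distractors; drop everything else."""
    packet = {
        "seed_provision": seed_citation,
        "required_provisions": [],
        "hard_negatives": [],
    }
    for citation, label in classified:
        if label["status"] == "REQUIRED":
            packet["required_provisions"].append({
                "citation": citation,
                "role": label.get("relation_type", ""),
                "why_required": label.get("why", ""),
            })
        elif label["status"] == "HARD_NEGATIVE":
            packet["hard_negatives"].append({
                "citation": citation,
                "reason": label.get("why", ""),
            })
    return packet
```

HELPFUL_BACKGROUND provisions are dropped here on purpose, so that the packet holds only what the gold answer must cite.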

6. Use a legal graph over your 4,000 documents

With 4,000+ documents, you should build a legal graph.

Nodes:

law
chapter
section
article
clause
point
sentence/span

Minimum useful node:

{
  "node_id": "<doc_id>_article_<n>_clause_<m>",
  "law_title": "<law_title>",
  "law_number": "<law_number>",
  "authority_level": "law | decree | circular | resolution",
  "article": "<article_number>",
  "clause": "<clause_number>",
  "text": "<legal_text>",
  "effective_date": "<date>",
  "status": "effective | amended | repealed | unknown",
  "topics": ["<topic_1>", "<topic_2>"],
  "explicit_references": []
}

Edges:

Edge type | Meaning
explicit_reference | One provision cites another
defines_term | One provision defines a term used elsewhere
exception_to | One provision creates an exception
special_rule_for | One provision creates special treatment for a group
condition_for | One provision gives conditions
procedure_for | One provision explains process or authority
sanction_for | One provision gives consequence/penalty
implements | Decree/circular implements a law
amends | Later law changes earlier law
repeals | Later law removes earlier law
same_topic | Related but not necessarily required

Medium/hard QA should be generated from strong legal edges:

anchor_rule + definition
anchor_rule + exception
anchor_rule + special_subject_rule
anchor_rule + procedure
anchor_rule + amendment

Not merely:

anchor_rule + same_topic
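
To make the graph idea concrete, here is a tiny in-memory sketch. The `LegalGraph` class is hypothetical, and the set of "strong" edge types is my own mapping from the pairings above to the edge table; you may well want to include `condition_for`, `sanction_for`, or `implements` too.

```python
from collections import defaultdict

# Assumed mapping from the strong pairings above to edge types in the edge table.
STRONG_EDGE_TYPES = {
    "defines_term", "exception_to", "special_rule_for", "procedure_for", "amends",
}


class LegalGraph:
    """Tiny in-memory graph; every edge is indexed on both endpoints."""

    def __init__(self) -> None:
        self._edges: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def add_edge(self, source: str, edge_type: str, target: str) -> None:
        # Example: add_edge(exception_node, "exception_to", general_rule_node)
        self._edges[source].append((edge_type, target))
        self._edges[target].append((edge_type, source))

    def strong_neighbors(self, node_id: str) -> list[tuple[str, str]]:
        """Return (edge_type, other_node_id) pairs worth considering for a packet."""
        return [
            (edge_type, other)
            for edge_type, other in self._edges.get(node_id, [])
            if edge_type in STRONG_EDGE_TYPES
        ]
```

Indexing both endpoints matters because an exception usually points at the general rule, while generation starts from the general rule and needs to find the exception.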

7. Retrieval should be hybrid

Do not use only vector search.

Vietnamese legal retrieval needs exact matching and semantic matching.

Use:

BM25
+ dense retrieval
+ citation regex search
+ graph expansion
+ reranking

Why?

Legal documents contain exact signals:

  • law names;
  • article numbers;
  • clause numbers;
  • “Điều”, “khoản”, “điểm”;
  • “trừ trường hợp”;
  • “theo quy định tại”;
  • “sửa đổi, bổ sung”;
  • formal legal phrases.
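
These exact signals are easy to pull out with plain regular expressions. The sketch below is only a starting point: it handles simple forms such as "Điều 12", "khoản 2 Điều 5", and "điểm a khoản 1 Điều 10", and it does not cover lists ("điểm a, b"), ranges, or citations that name another law. The signal labels in `SIGNAL_PHRASES` are my own assumption.

```python
import re
import unicodedata


def _nfc(text: str) -> str:
    # Vietnamese text can arrive composed or decomposed; normalize before matching.
    return unicodedata.normalize("NFC", text)


# Optional "điểm <letter>", optional "khoản <number>", then "Điều <number>".
CITATION_RE = re.compile(
    _nfc(
        r"(?:điểm\s+(?P<point>[a-zđ])\s+)?"
        r"(?:khoản\s+(?P<clause>\d+)\s+)?"
        r"điều\s+(?P<article>\d+)"
    ),
    re.IGNORECASE,
)

# Assumed labels for the trigger phrases listed above.
SIGNAL_PHRASES = {
    "trừ trường hợp": "exception",
    "theo quy định tại": "explicit_reference",
    "sửa đổi, bổ sung": "amendment",
}


def extract_citations(text: str) -> list[dict]:
    """Return article/clause/point references found in a provision's text."""
    citations = []
    for match in CITATION_RE.finditer(_nfc(text)):
        clause = match.group("clause")
        point = match.group("point")
        citations.append({
            "article": int(match.group("article")),
            "clause": int(clause) if clause else None,
            "point": point.lower() if point else None,
        })
    return citations


def find_signals(text: str) -> list[str]:
    """Return the sorted signal labels whose trigger phrase occurs in the text."""
    lowered = _nfc(text).lower()
    return sorted({label for phrase, label in SIGNAL_PHRASES.items() if _nfc(phrase) in lowered})
```

The extracted references can feed both the citation index and the `explicit_reference` edges of the legal graph.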

A practical retrieval stack:

Stage 1:
  BM25 top 50
  dense retrieval top 50
  citation index top 20
  graph neighbors top 20

Merge:
  deduplicate by node_id

Stage 2:
  rerank the top 50 merged candidates down to the top 10 or top 20

Stage 3:
  evidence sufficiency check
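
For the merge step, one common option is reciprocal rank fusion (RRF). Note that RRF is my suggestion here, not something the stack above prescribes; a plain deduplicated union would also satisfy "deduplicate by node_id". The constant k=60 is the value commonly used with RRF, and `reciprocal_rank_fusion` is a hypothetical helper name.

```python
def reciprocal_rank_fusion(
    ranked_lists: dict[str, list[str]],
    k: int = 60,
    top_n: int = 50,
) -> list[str]:
    """Merge several ranked lists of node_ids into one deduplicated ranking.

    Each node scores sum(1 / (k + rank)) over the lists that contain it, so
    nodes found by several retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in ranked_lists.values():
        seen: set[str] = set()
        for rank, node_id in enumerate(ranking, start=1):
            if node_id in seen:  # ignore duplicates inside a single list
                continue
            seen.add(node_id)
            scores[node_id] = scores.get(node_id, 0.0) + 1.0 / (k + rank)
    # Sort by score (descending), then node_id for a deterministic order.
    return sorted(scores, key=lambda node_id: (-scores[node_id], node_id))[:top_n]
```

The fused list is what would then go to the reranker in Stage 2.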

Suggested metrics:

anchor_recall@k
required_citation_recall@k
all_required_retrieved@k
supporting_law_recall@k
exception_recall@k
definition_recall@k
hard_negative_rate

The most important one for medium/hard items:

all_required_retrieved@k

If the retriever misses one controlling provision, the answer may be incomplete.
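
A minimal sketch of three of these metrics for a single question. The recall and all-required definitions follow directly from the names; the definition of `hard_negative_rate` (share of the top-k results that are known hard negatives) is my assumption, since it is not spelled out above.

```python
def required_citation_recall_at_k(retrieved: list[str], required: list[str], k: int) -> float:
    """Fraction of required node_ids that appear in the top-k retrieved node_ids."""
    required_set = set(required)
    if not required_set:
        return 1.0  # nothing to retrieve, so nothing was missed
    return len(required_set & set(retrieved[:k])) / len(required_set)


def all_required_retrieved_at_k(retrieved: list[str], required: list[str], k: int) -> bool:
    """True only if every required node_id is in the top-k."""
    return set(required) <= set(retrieved[:k])


def hard_negative_rate(retrieved: list[str], hard_negatives: list[str], k: int) -> float:
    """Assumed definition: share of the top-k results that are known hard negatives."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    negatives = set(hard_negatives)
    return sum(node_id in negatives for node_id in top_k) / len(top_k)
```

Averaging `all_required_retrieved_at_k` over all medium/hard items gives a single number that drops whenever even one controlling provision is missed.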


8. QA generation prompt

Generate QA only after the evidence packet is ready.

You are generating Vietnamese legal QA data for evaluating RAG systems.

Evidence packet:
<evidence_packet>

Generate one question and one gold answer.

Requirements:
- The question must require all REQUIRED provisions.
- The answer must cite all REQUIRED provisions.
- The question must not be fully answerable from the seed provision alone.
- The answer must be legally cautious.
- Do not invent policy reasons that are not supported by the legal provisions.
- If facts are insufficient, say so.

Return JSON:
{
  "question": "...",
  "gold_answer": "...",
  "difficulty": "...",
  "question_type": "...",
  "evidence_scope": "...",
  "legal_operation": [...],
  "required_citations": [...],
  "supporting_legal_points": [...],
  "why_seed_alone_is_insufficient": "...",
  "unacceptable_incomplete_answer": "...",
  "rubric_0_5": {...}
}

9. Add the seed-only vs full-packet test

For every medium/hard item, run two validations.

Test A: seed-only

Give the model only the seed article and ask it to answer.

Expected result:

incomplete

Test B: full packet

Give the model all required provisions and ask it to answer.

Expected result:

complete

Store this:

{
  "seed_only_answer_status": "incomplete",
  "full_packet_answer_status": "complete"
}

This directly catches your current failure mode.

If the seed-only answer is already complete, then the question is probably not truly medium/hard.
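
The two tests can share one small harness. This is a sketch under the assumption that you supply two callables yourself: `answer_fn` (your LLM call) and `is_complete_fn` (your completeness judge). Both names, and the `keep_as_medium_or_hard` field, are hypothetical.

```python
from typing import Callable

AnswerFn = Callable[[str, list[str]], str]  # (question, context passages) -> answer
CompleteFn = Callable[[str, str], bool]     # (question, answer) -> legally complete?


def seed_vs_packet_check(
    question: str,
    seed_text: str,
    supporting_texts: list[str],
    answer_fn: AnswerFn,
    is_complete_fn: CompleteFn,
) -> dict:
    """Run Test A (seed only) and Test B (full packet) and record both statuses."""
    seed_answer = answer_fn(question, [seed_text])
    full_answer = answer_fn(question, [seed_text, *supporting_texts])
    seed_status = "complete" if is_complete_fn(question, seed_answer) else "incomplete"
    full_status = "complete" if is_complete_fn(question, full_answer) else "incomplete"
    return {
        "seed_only_answer_status": seed_status,
        "full_packet_answer_status": full_status,
        # Keep the item as medium/hard only when the extra provisions really matter.
        "keep_as_medium_or_hard": seed_status == "incomplete" and full_status == "complete",
    }
```

Items that fail the check are not necessarily bad; a seed-complete item can simply be relabeled as easy.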


10. Add a missing-law validator

Your validator should not only ask:

Is the answer consistent with the given law?

It should ask:

Does the answer omit any required law/article/clause?

Prompt:

Question:
<question>

Required citations:
<required_citations>

Candidate answer:
<answer>

Evaluate whether the answer is legally complete.

Return JSON only:
{
  "score_0_to_5": 0,
  "complete": true,
  "failure_type": "none | missing_required_citation | missing_exception | wrong_citation | unsupported_claim | wrong_legal_conclusion | insufficient_facts_not_detected",
  "missing_required_citations": [],
  "missing_legal_points": [],
  "wrong_or_irrelevant_citations": [],
  "unsupported_claims": [],
  "short_reason": "..."
}

Rules:
- If any required citation is missing, score cannot exceed 3.
- If the conclusion is wrong, score cannot exceed 2.
- If the answer is faithful to one provision but misses another controlling provision, mark it as missing_required_citation.
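
LLM judges do not always obey scoring rules stated in a prompt, so it may be worth enforcing them again in code. A sketch, assuming the judge returns the JSON shape above; `apply_score_caps` is a hypothetical helper, and forcing `complete` to false in these two cases is my reading of the rules rather than something they state explicitly.

```python
def apply_score_caps(verdict: dict) -> dict:
    """Return a copy of the judge verdict with the scoring rules enforced."""
    result = dict(verdict)
    score = int(verdict.get("score_0_to_5", 0))
    missing = verdict.get("missing_required_citations") or []
    failure = verdict.get("failure_type", "none")

    if missing:
        score = min(score, 3)  # rule: a missing required citation caps the score at 3
        if failure == "none":
            failure = "missing_required_citation"
    if failure == "wrong_legal_conclusion":
        score = min(score, 2)  # rule: a wrong conclusion caps the score at 2
    if missing or failure == "wrong_legal_conclusion":
        result["complete"] = False  # assumption: neither case can count as complete

    result["score_0_to_5"] = score
    result["failure_type"] = failure
    return result
```

Comparing the raw and capped scores over time also shows how often the judge ignores its own instructions.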

This validator is more aligned with legal RAG than a generic answer-correctness judge.


11. How to use your LoRA validator

Keep the LoRA validator, but change its role.

Use it as a cheap first-pass validator, not as the final legal judge.

Good uses:

  • detecting obviously bad outputs;
  • checking local consistency;
  • validating easy factual items;
  • filtering bad generations;
  • reducing API cost.

Weak uses:

  • cross-law completeness;
  • special-law exceptions;
  • amendment/version issues;
  • hierarchy conflicts;
  • hard case studies.

Recommended validation stack:

rule-based citation checker
→ LoRA local validator
→ stronger LLM missing-law judge
→ lawyer review for hard/uncertain items

Train or adapt future validators on failure types:

complete
missing_supporting_law
missing_exception
wrong_citation
unsupported_claim
wrong_legal_version
insufficient_facts_not_detected

This is more useful than only training on correct/incorrect labels.


12. Store supporting legal points

Do not store only final answers.

Store the legal points required for the answer.

Example:

{
  "supporting_legal_points": [
    {
      "point": "The general rule requires considering individual characteristics.",
      "source_node_id": "<node_a>",
      "role": "anchor_rule"
    },
    {
      "point": "A special rule applies to persons under 18.",
      "source_node_id": "<node_b>",
      "role": "special_subject_rule"
    }
  ]
}

This is similar in spirit to multi-hop QA datasets that store supporting facts, such as HotpotQA, but adapted to legal provisions.


13. Add unanswerable and insufficient-fact items

Legal systems should not always answer confidently.

Some legal questions are incomplete because:

  • key facts are missing;
  • effective date is unknown;
  • the controlling document is not in the corpus;
  • the answer depends on a contract or administrative decision;
  • the situation requires case-specific legal judgment.

Add labels:

{
  "answerability": "answerable | partially_answerable | insufficient_facts | not_in_corpus | version_unclear"
}

Expected answer style for insufficient facts:

Chưa thể kết luận chắc chắn vì tình huống chưa nêu rõ <missing_fact>.
Nếu <condition_A> được đáp ứng thì <legal_consequence_A>.
Nếu không đáp ứng <condition_A> thì <legal_consequence_B>.

This is useful because real legal RAG should be able to say:

The available facts are not enough for a final conclusion.

The FEVER style of evidence-based “Supported / Refuted / NotEnoughInfo” labeling is a useful conceptual reference.


14. Recommended final dataset schema

Use JSONL.

One item per line:

{
  "id": "vnlegalrag_000001",
  "language": "vi",
  "jurisdiction": "VN",
  "as_of_date": "2026-05-16",

  "difficulty": "medium",
  "question_type": "analytical",
  "evidence_scope": "multi_law",
  "legal_operation": [
    "general_rule",
    "special_subject_rule",
    "policy_reasoning"
  ],
  "answerability": "answerable",

  "question": "<question_text>",
  "gold_answer": "<gold_answer_text>",

  "required_citations": [
    {
      "node_id": "<node_id_1>",
      "citation_text": "<citation_text_1>",
      "role": "anchor_rule",
      "must_retrieve": true,
      "must_cite": true
    },
    {
      "node_id": "<node_id_2>",
      "citation_text": "<citation_text_2>",
      "role": "special_subject_rule",
      "must_retrieve": true,
      "must_cite": true
    }
  ],

  "supporting_legal_points": [
    {
      "point": "<legal_point>",
      "source_node_id": "<node_id>",
      "role": "anchor_rule"
    }
  ],

  "hard_negatives": [
    {
      "node_id": "<negative_node_id>",
      "reason": "Topically similar but not controlling."
    }
  ],

  "unacceptable_incomplete_answers": [
    {
      "failure_type": "missing_supporting_law",
      "description": "States the general rule but omits the special rule."
    }
  ],

  "rubric_0_5": {
    "0": "Irrelevant, hallucinated, or legally wrong.",
    "1": "Mentions topic but misses main rule.",
    "2": "States main rule but misses required supporting law.",
    "3": "Minimally acceptable but incomplete citations or reasoning.",
    "4": "Correct and cites all required provisions.",
    "5": "Expert-quality answer with complete reasoning, citations, exceptions, and caveats."
  },

  "validation": {
    "seed_only_answer_status": "incomplete",
    "full_packet_answer_status": "complete",
    "automatic_validation_passed": true,
    "expert_review_status": "not_required"
  }
}
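
A short sketch of reading and writing this JSONL format with the standard library. The set of required keys is my own selection from the schema above, and `write_jsonl` / `read_jsonl` are hypothetical helper names. `ensure_ascii=False` keeps Vietnamese text readable in the file instead of escaping it.

```python
import json
from typing import IO, Iterable

# Assumed minimum set of keys, selected from the schema above.
REQUIRED_KEYS = {
    "id", "difficulty", "question_type", "evidence_scope", "answerability",
    "question", "gold_answer", "required_citations", "validation",
}


def write_jsonl(items: Iterable[dict], fp: IO[str]) -> None:
    """Write one JSON object per line, keeping Vietnamese characters unescaped."""
    for item in items:
        fp.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_jsonl(fp: IO[str]) -> list[dict]:
    """Read items back and fail loudly if a line lacks a required key."""
    items = []
    for line_no, line in enumerate(fp, start=1):
        line = line.strip()
        if not line:
            continue
        item = json.loads(line)
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        items.append(item)
    return items
```

When opening real files, pass `encoding="utf-8"` to `open()` so the output does not depend on the platform default.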

15. Evaluation design

Create three benchmark tracks.

Track A: retrieval-only

Input:

question

Expected output:

required legal provisions

Metrics:

required_citation_recall@5
required_citation_recall@10
all_required_citations_retrieved@10
MRR
nDCG
hard_negative_rate

Track B: gold-context generation

Input:

question + gold evidence packet

Expected output:

answer

Metrics:

legal correctness
faithfulness
citation use
clarity
exception handling

Track C: full RAG

Input:

question + corpus

System must:

retrieve → answer

Metrics:

retrieval completeness
legal correctness
citation completeness
faithfulness
answer usefulness

Diagnostic table:

Retrieval-only | Gold-context generation | Full RAG | Diagnosis
Good | Good | Good | System works
Bad | Good | Bad | Retriever failed
Good | Bad | Bad | Generator/reasoner failed
Good | Good | Bad | Context assembly or prompt failed
Bad | Bad | Bad | Both retrieval and generation are weak

This separation is important. Otherwise, you may blame the LLM when the real problem is missing retrieval.
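
The diagnostic table can be turned into a lookup so that each question gets a diagnosis automatically. A sketch: deciding what counts as "Good" per track (for example a score threshold) is left to you, and the fallback message for combinations the table does not list is my addition.

```python
# Keys are (retrieval_ok, gold_context_ok, full_rag_ok), mirroring the table above.
DIAGNOSIS = {
    (True, True, True): "System works",
    (False, True, False): "Retriever failed",
    (True, False, False): "Generator/reasoner failed",
    (True, True, False): "Context assembly or prompt failed",
    (False, False, False): "Both retrieval and generation are weak",
}


def diagnose(retrieval_ok: bool, gold_context_ok: bool, full_rag_ok: bool) -> str:
    """Map the three track outcomes for one question to a diagnosis."""
    key = (retrieval_ok, gold_context_ok, full_rag_ok)
    # Combinations not in the table (e.g. bad retrieval but a good final answer)
    # are flagged rather than guessed at.
    return DIAGNOSIS.get(key, "Unlisted combination; inspect manually")
```

Counting diagnoses across the whole benchmark shows at a glance whether retrieval or generation deserves the next round of work.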


16. Recommended dataset size

Do not start with a huge dataset.

Start with a high-quality pilot.

Version 0.1

500 easy
300 medium
50 hard
50 insufficient/unanswerable

Goal:

debug schema, prompts, validators, retrieval metrics

Version 0.2

2,000 easy
1,000 medium
150 hard
150 insufficient/unanswerable

Goal:

evaluate real RAG pipelines

Version 1.0

5,000–8,000 total
20–30% multi-source
5–10% expert-reviewed hard
5–10% insufficient/unanswerable

A smaller expert-reviewed hard set is more valuable than a large weak synthetic hard set.


17. Suggested distribution

For a RAG evaluation benchmark:

Difficulty | Share | Purpose
Easy | 35–45% | Basic retrieval/comprehension
Medium | 40–50% | Main RAG stress test
Hard | 10–15% | Expert-reviewed reasoning
Insufficient/unanswerable | 5–10% | Safety and uncertainty handling

By question type:

Type | Share
Factual | 35–45%
Interpretation | 15–20%
Analytical | 25–35%
Application | 10–15%

By evidence scope:

Scope | Share
Single clause/article | 35–45%
Multi-article same law | 20–25%
Multi-law | 20–25%
Exception/special/temporal | 10–15%

18. Common pitfalls

Pitfall 1: generating medium/hard questions from one article

Problem:

The answer is locally correct but legally incomplete.

Fix:

Generate from evidence packets.

Pitfall 2: confusing topical similarity with legal necessity

Problem:

A provision may discuss the same topic but not control the answer.

Fix:

Classify provisions as REQUIRED / HELPFUL_BACKGROUND / HARD_NEGATIVE / IRRELEVANT.

Pitfall 3: using dense retrieval only

Problem:

Embeddings may miss article numbers, exact legal terms, and formal references.

Fix:

BM25 + dense retrieval + citation search + graph expansion + reranker.

Pitfall 4: scoring only final answer similarity

Problem:

A semantically similar answer can still miss a controlling exception.

Fix:

Score required citation coverage, legal conclusion correctness, context sufficiency, and exception handling.

Pitfall 5: treating a narrow LoRA validator as a universal legal judge

Problem:

It may validate local consistency but miss cross-law incompleteness.

Fix:

Use it as a cheap filter, then use missing-law validation and expert review.

Pitfall 6: ignoring legal version/date

Problem:

The correct answer may depend on effective date, amendment, or repeal status.

Fix:

{
  "as_of_date": "<date>",
  "law_status": "effective | amended | repealed | unknown",
  "effective_date": "<date>"
}

19. Final architecture

4,000+ Vietnamese legal documents
        ↓
legal parser
        ↓
article/clause nodes
        ↓
metadata extraction
        ↓
citation + relation extraction
        ↓
legal graph
        ↓
BM25 + dense + citation indexes
        ↓
evidence packet builder
        ↓
QA generator
        ↓
seed-only vs full-packet validation
        ↓
citation completeness validator
        ↓
LoRA cheap validator
        ↓
strong LLM judge for medium/hard
        ↓
lawyer MOS review for hard
        ↓
final RAG evaluation dataset

20. Final answer

Your current method works for easy factual QA, but it is not enough for medium and hard legal QA.

The main issue is not temperature, prompt strictness, or generation style.

The main issue is:

The model is generating from incomplete legal evidence.

To fix that, build every medium/hard item around a gold evidence packet.

Each dataset item should store:

question
gold answer
required citations
supporting legal points
hard negatives
difficulty
question type
evidence scope
legal operation
answerability
validation status
expert review status

The strongest version of your project would not be just a Vietnamese legal QA dataset.

It would be:

A Vietnamese legal RAG benchmark that measures required legal evidence retrieval and legal completeness.

That is a much more valuable contribution than plain synthetic QA.