[Continuation] bryła semantic representation: ablation + masked loss results

This is a continuation of my earlier thread, which auto-closed before I could share results: Looking for feedback: a custom "bryła" semantic representation for small models

@john6666 - thank you again for the detailed feedback. I’ve spent the last two days running the experiments you suggested, plus a few I came up with along the way. Here’s what happened - including the failures, because I think they’re as informative as the successes.

I’ve also added an English version of my project documentation to my HF repo: krzysiekpl/bryla-kris - see README_EN.md for the full story.

What I did

1. Field ablation (your suggestion: “split bryła into field families”)

Built 3 schema variants:

  • MIN (3 fields: type, polarity, sep)
  • MID (7 fields: + scope, intent, intensity, core)
  • FULL (20 fields: all original fields)

Trained 4 variants (RAW + 3 bryła) x 3 seeds on 695 Q/A pairs (welding/materials).

Result: MID won in 2 of 3 seeds, about 3% better than RAW in perplexity. FULL did NOT beat MID (worse by ~9 perplexity points). This suggests the 13 “default” fields in FULL are noise. The effect is small (~3%), but the direction is consistent.

2. Honest perplexity metric (my own addition)

When scaling to Wikipedia (decoder-only LM), I noticed standard val_ppl was misleading because tags are deterministic and easy to predict. I added:

  • val_ppl_std (standard, on all tokens)
  • val_ppl_clean (ONLY on Polish text after [SEP_BRYLA])
  • val_ppl_tags (only on bryła tags)

For FULL bryła on Wikipedia: val_ppl_std = 2.03 but val_ppl_clean = 3.10. The standard metric was hiding ~35% of the real perplexity. I now think this is a methodological lesson: any ablation that adds prefix tokens should report a target-only perplexity.

3. Three types of leakage I caught

  1. surface_text duplicated inside bryła AND after [SEP_BRYLA] (model copying)
  2. [FACTS] block included previous Polish text
  3. Anchors contained 80-char surface_text snippets (still leaking after I removed (1))

Each time I had to retrain. The last attempt still showed a suspiciously low standard deviation between seeds (±0.01), which suggested the model was matching templates from a biased corpus (Wikipedia has ~5000 nearly-identical village descriptions).

4. Token economy and latency (your suggestion)

| Variant | Tokens vs RAW | Training | Inference |
|---|---|---|---|
| RAW | 1.0x | 4 min | 5.7 ms/tok |
| MIN | 1.81x | 8 min | 5.9 ms/tok |
| MID | 2.82x | 12 min | 5.8 ms/tok |
| FULL | 6.06x | 30 min | 5.8 ms/tok |

FULL costs ~6x the tokens of RAW for a perplexity improvement of ~3% at most. The tradeoff is poor.

5. Masked loss (the most interesting experiment)

I had an intuition: “bryła should CARRY information needed to generate the answer. The text is what the model should learn. The bryla is just context the model receives.”

This is equivalent to prefix-LM / conditional generation - loss only on Polish text after [SEP_BRYLA], bryła as context only.

Result: val_ppl_clean almost identical (3.10 vs 3.18). Numerically neutral.

BUT when I tested the masked model in a mini-chat with manually-crafted bryła prefixes differing only in polarity:

[OTHER] [POL:neutral]  -> geographic / astronomical content
[OTHER] [POL:positive] -> villages, places  
[OTHER] [POL:negative] -> sports, competition

Three different topical distributions for three polarities. The model IS reading the bryla as conditioning information. The numerical val_ppl doesn’t show this, but the generation does.

Honest summary

What I think I showed:

  • Field ablation: fewer informative fields > many fields with defaults
  • val_ppl_clean is a necessary metric when tags are added to sequences
  • Three types of leakage to watch for
  • Conditional generation works: bryła as prefix DOES condition the output

What I did NOT show:

  • That bryla “helps” in a strong sense (gains are small, ~3%)
  • That the approach scales (33M tokens is ~5% of Chinchilla optimal)
  • That the parser is good enough (87% of Wikipedia sentences got [OTHER] - my parser was built for technical Q/A, not general text)

A question, if you have a minute

I’m thinking about what to try next. The parser bottleneck (87% [OTHER]) suggests two options:

  1. Extend the parser with more domain-specific rules
  2. Build a smaller but balanced, multi-domain corpus (~200-500 examples per domain: biographies, geography, technique, daily life, science)

Since Bielik 11B was built by volunteers (SpeakLeash community), I’m wondering: do you (or anyone reading this) know of clean, diverse Polish-language Q/A datasets, or have suggestions for community-driven small dataset construction?

Even pointers to papers/projects that did small-data multi-domain ablation studies well would be very helpful - I’m in territory where I don’t quite know what good practice looks like.

Thanks again. Whatever happens next, this conversation has taught me more about experimental methodology than the actual results.

Best,
Krzysztof
krzysiekpl/bryla-kris

Hi! For now:


Answer / suggestions for the current Bryła results

This update is useful because it moves the project from:

Maybe Bryła helps small models.

to the much better research question:

Which parts of Bryła carry useful signal, which parts are noise, and how do we measure the effect without leakage or metric artifacts?

That is already a more serious position.

The most important result is not simply “MID improved PPL by about 3%.” The stronger result is:

compact useful structure > large default-heavy structure

That is exactly what I would expect if Bryła is acting as an inductive bias rather than as magic. A small set of informative fields can help; a large bundle of weak/default fields can add noise, token cost, and evaluation artifacts.

My short recommendation:

Do not expand FULL right now.
Freeze MID, add stronger controls, use clean target-only loss, measure parser coverage, and build a small balanced diagnostic corpus before adding many new parser rules.


1. How I read the ablation result

You tested:

RAW

MIN:
  type
  polarity
  sep

MID:
  type
  polarity
  scope
  intent
  intensity
  core
  sep

FULL:
  all 20 fields

Result:

MID won 2/3 seeds.
MID was about +3% vs RAW.
FULL did not beat MID.
FULL lost by about 9 PPL vs MID.

The safest interpretation:

Bryła has a useful compact region. The 20-field version is currently too noisy, too default-heavy, or too expensive.

This is actually a good sign. If FULL had automatically won just because it had more tags, I would be more suspicious of leakage or metric artifacts. The fact that MID > FULL suggests that the useful signal is concentrated in a smaller subset.

This connects well to older work on adding explicit linguistic input features. Sennrich and Haddow showed that neural MT can benefit from additional input features such as morphology, POS tags, and dependency labels, improving perplexity and translation metrics. The relevant lesson for Bryła is not “add every possible annotation,” but “external structure can help when the features are informative and controlled.”

So I would phrase the result like this:

In a tiny Polish technical QA setting, a compact 7-field Bryła representation gives a small but repeatable gain over raw input, while the 20-field version adds substantial token cost and does not improve over the compact version.

That is cautious, but credible.


2. Freeze MID as BRYLA-MID-v1

I would now freeze the current MID schema as:

BRYLA-MID-v1

Fields:

TYPE
POLARITY
SCOPE
INTENT
INTENSITY
CORE
SEP

Do not keep changing this during the next phase. If the schema keeps moving, you cannot tell whether a gain came from Bryła, field order, tokenization, defaults, leakage removal, parser changes, or random seed effects.

For each field, document:

| Field | What to define |
|---|---|
| TYPE | allowed values, default, missing value, expected effect |
| POLARITY | allowed values, whether it means sentiment, stance, or something else |
| SCOPE | local/general/contextual meaning |
| INTENT | ask/inform/warn/instruct/etc. |
| INTENSITY | low/mid/high or another fixed scale |
| CORE | whether this marks central content/salience |
| SEP | fixed boundary marker |

I would keep:

RAW = baseline
MIN = cheap lower bound
MID = main representation
FULL = diagnostic / stress test only

For now, FULL should not be your main story.


3. val_ppl_clean is essential, not optional

Your split into:

val_ppl_std   = all tokens
val_ppl_clean = only Polish text after [SEP_BRYLA]
val_ppl_tags  = only Bryła tags

is one of the most important improvements in the project.

The issue is simple:

Full-sequence PPL answers:
Can the model predict tags + text?

Clean PPL answers:
Does the prefix help predict the Polish target text?

Those are different questions.

Your example is very clear:

FULL Bryła on Wikipedia:

val_ppl_std   = 2.03
val_ppl_clean = 3.10

That means the standard metric was partially rewarding the model for predicting deterministic, low-entropy schema tokens.

For Bryła experiments, the primary metric should be:

val_ppl_clean

not:

val_ppl_std

This fits the general warning around perplexity: the value depends on exactly which tokens are included in the likelihood calculation and how evaluation context is handled.
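
A minimal sketch of how the three metrics can be derived from one evaluation pass, assuming you already have a per-token negative log-likelihood (in nats) and a boolean mask marking prefix tokens. The function name `split_perplexities` and the toy numbers are illustrative, not taken from the project:

```python
import math

def split_perplexities(token_nll, is_tag):
    """Return (ppl_std, ppl_clean, ppl_tags) from per-token NLL values.

    token_nll: list of floats, NLL in nats for each predicted token.
    is_tag:    list of bools, True for Bryła prefix tokens (incl. separator).
    """
    def ppl(values):
        # Perplexity is exp(mean NLL); undefined on an empty subset.
        return math.exp(sum(values) / len(values)) if values else float("nan")

    clean = [nll for nll, tag in zip(token_nll, is_tag) if not tag]
    tags = [nll for nll, tag in zip(token_nll, is_tag) if tag]
    return ppl(token_nll), ppl(clean), ppl(tags)

# Toy numbers (made up): four easy tag tokens, four harder text tokens.
nll = [0.1, 0.1, 0.1, 0.1, 1.2, 1.0, 1.1, 1.3]
mask = [True] * 4 + [False] * 4
std, clean, tags = split_perplexities(nll, mask)
print(f"std={std:.2f} clean={clean:.2f} tags={tags:.2f}")
```

Because the cheap tag tokens pull the mean NLL down, the full-sequence value always looks better than the target-only value when tags are easy to predict.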

For decoder-only Bryła-prefix training, masked loss should become the default:

input:
  [BRYLA PREFIX] [SEP_BRYLA] [POLISH TEXT]

labels:
  [-100 ... -100] [-100]      [POLISH TEXT LABELS]

Conceptually:

Bryła = context
Polish text = prediction target
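
As a rough sketch of the label layout above, assuming a training loop that skips label `-100` (the default `ignore_index` of PyTorch's `CrossEntropyLoss`). The helper `mask_prefix_labels` and the token ids are hypothetical:

```python
IGNORE_INDEX = -100  # skipped by PyTorch's CrossEntropyLoss by default

def mask_prefix_labels(input_ids, sep_id):
    """Copy input_ids to labels, masking everything up to and including sep_id."""
    if sep_id not in input_ids:
        raise ValueError("separator token not found in sequence")
    cut = input_ids.index(sep_id) + 1  # first position of the Polish target text
    return [IGNORE_INDEX] * cut + list(input_ids[cut:])

# Hypothetical ids: 900-series are Bryła tags, 999 stands for [SEP_BRYLA].
ids = [901, 902, 999, 17, 42, 8]
labels = mask_prefix_labels(ids, sep_id=999)
print(labels)  # [-100, -100, -100, 17, 42, 8]
```

Raising on a missing separator is a deliberate choice here: a silently unmasked example would put tag tokens back into the loss.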

Report all three metrics, but treat them differently:

| Metric | Role |
|---|---|
| val_ppl_clean | primary result |
| val_ppl_std | diagnostic only |
| val_ppl_tags | diagnostic only |
| chrF | useful Polish-friendly generation metric |
| token F1 / EM | useful QA metric if answers are short |
| small blind human eval | final sanity check |

4. The leakage failures are valuable findings

You found three leakage paths:

1. surface_text duplicated inside Bryła and after [SEP_BRYLA]
2. [FACTS] block included previous Polish text
3. anchors contained 80-character surface_text snippets

This is not embarrassing. This is exactly what makes the experiment more credible now. Structured-prefix methods are very vulnerable to accidental copy shortcuts.

Add a permanent leakage-check script.

Minimum checks:

| Check | Why |
|---|---|
| target text appears in prefix | direct copying |
| long n-grams shared between prefix and target | partial copying |
| anchors longer than a threshold | hidden text leakage |
| [FACTS] contains answer-bearing text | retrieval-style leakage |
| same source document in train/dev | source leakage |
| near-duplicate pages across splits | template leakage |
| same generated/paraphrased seed across splits | synthetic leakage |
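
A minimal sketch of the first three checks, assuming prefix and target are available as plain strings. The `[ANCHOR:...]` syntax, the 5-word n-gram size, and the 30-character threshold are assumptions for illustration; adapt them to the real serialization:

```python
def shared_ngrams(prefix, target, n=5):
    """Return word n-grams that occur in both prefix and target."""
    def ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return ngrams(prefix) & ngrams(target)

def leakage_flags(prefix, target, n=5, max_anchor_chars=30):
    """Flag direct copies, long shared n-grams, and over-long anchors."""
    flags = []
    if target.strip() and target.strip().lower() in prefix.lower():
        flags.append("target_in_prefix")
    if shared_ngrams(prefix, target, n):
        flags.append("shared_ngram")
    # Assumed convention: anchors are written as [ANCHOR:...] in the prefix.
    for chunk in prefix.split("[ANCHOR:")[1:]:
        if len(chunk.split("]")[0]) > max_anchor_chars:
            flags.append("long_anchor")
            break
    return flags

prefix = "[TYPE:fact] [ANCHOR:the weld pool must be shielded by argon gas at all times]"
target = "The weld pool must be shielded by argon gas."
print(leakage_flags(prefix, target))
```

Running this over every train and dev example before each training run makes the check permanent rather than something done after a suspicious result.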

This is especially important for Wikipedia-like data. If there are thousands of near-identical village descriptions, random row splitting is dangerous.

Use:

split_group = source_article_id

or:

split_group = template_cluster_id

Then split by group, not by row.
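
One way to do this without extra dependencies is to hash the group id, so every row from the same article or template cluster lands in the same split. This is a sketch under that assumption; `assign_split` and the 10%/10% proportions are illustrative:

```python
import hashlib

def assign_split(split_group, dev_pct=10, test_pct=10):
    """Map a group id to train/dev/test deterministically via a stable hash."""
    digest = hashlib.sha256(split_group.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"

rows = [
    {"id": "geo_1", "split_group": "village_template_A"},
    {"id": "geo_2", "split_group": "village_template_A"},
    {"id": "bio_1", "split_group": "article_123"},
]
for row in rows:
    row["split"] = assign_split(row["split_group"])
```

The split proportions are only approximate at the row level, because groups differ in size, but no group can straddle two splits.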


5. Token economy basically rejects FULL for now

Your token table is one of the clearest results:

| Variant | Tokens vs RAW | Training | Inference |
|---|---|---|---|
| RAW | 1.0x | 4 min | 5.7 ms/tok |
| MIN | 1.81x | 8 min | 5.9 ms/tok |
| MID | 2.82x | 12 min | 5.8 ms/tok |
| FULL | 6.06x | 30 min | 5.8 ms/tok |

FULL costs about 6x the source tokens and gives no clear win over MID. For weak hardware, token count is budget. Every extra prefix token costs memory, training time, attention work, context length, and overfitting risk.

The design target should become:

maximum useful information per token

not:

maximum number of semantic/pragmatic fields

Add a reporting table like this:

| Variant | Clean PPL | Δ vs RAW | Tokens vs RAW | Practical verdict |
|---|---|---|---|---|
| RAW | <value> | | 1.00x | baseline |
| MIN | <value> | <value> | 1.81x | cheap but maybe under-informative |
| MID | <value> | <value> | 2.82x | current best tradeoff |
| FULL | <value> | <value> | 6.06x | too expensive / noisy |

The current engineering conclusion:

MID is the best current quality/cost point. FULL should be paused.


6. The masked-loss result is not a failure

You got:

val_ppl_clean almost identical:
3.10 vs 3.18

Numerically neutral.

But generation changed when you manually changed Bryła polarity:

[OTHER] [POL:neutral]  -> geographic / astronomical content
[OTHER] [POL:positive] -> villages / places
[OTHER] [POL:negative] -> sports / competition

This proves one thing:

The model is reading the prefix.

It does not yet prove:

The model understands polarity semantically.

It may mean:

POLARITY has become a hidden domain/topic label.

This is exactly the kind of thing known from control-code language modeling. CTRL trained a conditional Transformer on control codes that specify domain, subdomain, entities, relationships, dates, and task behavior. Such control codes steer generation, but the model learns corpus correlations, not human definitions of the labels.

So I would phrase your masked-loss result like this:

Masked-loss training confirms that Bryła tokens can act as conditioning signals. However, the observed polarity effect may be entangled with domain/topic correlations, so the next step is to compare Bryła against explicit DOMAIN controls and counterfactual prompts where topic is held constant.

That is precise and defensible.


7. Add DOMAIN as the mandatory next control

This is the most important next control.

Your prefix may be helping because it encodes:

technical
geography
sports
biography
science
daily_life

rather than because it encodes deeper semantic-pragmatic structure.

Run this ladder:

RAW
DOMAIN + RAW
MID + RAW
DOMAIN + MID + RAW
MID shuffled values + RAW
MID shuffled field order + RAW
random tags same distribution + RAW

Interpretation:

| Result | Meaning |
|---|---|
| MID > DOMAIN > RAW | strong: Bryła adds information beyond domain |
| MID ≈ DOMAIN > RAW | Bryła is currently mostly domain conditioning |
| DOMAIN > MID | current Bryła fields are noisy or parser is weak |
| DOMAIN + MID > both | domain and Bryła are complementary |
| shuffled values ≈ MID | field-value semantics are weak |
| random tags help | formatting/regularization artifact |
| shuffled order hurts badly | serialization order is part of the method |

The decisive comparison is:

MID + RAW
vs
DOMAIN + RAW

If MID beats DOMAIN, Bryła has a stronger claim.
If DOMAIN matches MID, the current story becomes simpler: metadata/domain conditioning helps, but semantic-pragmatic structure is not yet proven.

This is still useful; it just changes the claim.
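
The three corrupted-prefix controls in the ladder can be generated from the real prefixes. This is one possible reading, assuming each prefix is a dict of field to value: "shuffled values" reassigns whole prefixes to other examples, "shuffled order" keeps the pairs but changes serialization order, and "random tags" samples each field independently from its observed distribution. All function names are hypothetical:

```python
import random

def shuffled_values(prefixes, rng):
    """Give each example some other example's prefix (breaks text alignment)."""
    permuted = list(prefixes)
    rng.shuffle(permuted)
    return permuted

def shuffled_order(prefix, rng):
    """Keep each field-value pair but randomize the serialization order."""
    items = list(prefix.items())
    rng.shuffle(items)
    return dict(items)

def random_tags(prefixes, rng):
    """Draw every field independently from its empirical distribution."""
    fields = list(prefixes[0])
    pools = {f: [p[f] for p in prefixes] for f in fields}
    return [{f: rng.choice(pools[f]) for f in fields} for _ in prefixes]

rng = random.Random(0)  # fixed seed so the controls are reproducible
prefixes = [
    {"TYPE": "fact", "POL": "neutral"},
    {"TYPE": "question", "POL": "negative"},
    {"TYPE": "fact", "POL": "positive"},
]
controls = {
    "shuffled_values": shuffled_values(prefixes, rng),
    "shuffled_order": [shuffled_order(p, rng) for p in prefixes],
    "random_tags": random_tags(prefixes, rng),
}
```

Generate the controls once per split and store them, so every seed trains on the same corrupted prefixes.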


8. Do not blindly expand parser rules

You saw:

87% of Wikipedia sentences -> [OTHER]

That is a parser coverage problem.

But I would not immediately add many domain-specific rules. That risks building:

a parser for Wikipedia village templates

instead of:

a general Polish semantic-pragmatic parser

Semantic-representation systems usually have this problem. AMR and DRS work both show that representation quality and parser quality are part of the system, not preprocessing details.

The next step should be:

measure parser behavior first

not:

add rules until [OTHER] decreases

9. Build a parser dashboard

For each domain:

| Domain | Parsed % | Partial % | OTHER % | Main failure type |
|---|---|---|---|---|
| technical/welding | <value> | <value> | <value> | <note> |
| geography | <value> | <value> | <value> | <note> |
| biography | <value> | <value> | <value> | <note> |
| daily life | <value> | <value> | <value> | <note> |
| science | <value> | <value> | <value> | <note> |
| sports/events | <value> | <value> | <value> | <note> |

For each field:

| Field | Default rate | Entropy | Missing rate | Top values | Domain correlation |
|---|---|---|---|---|---|
| TYPE | <value> | <value> | <value> | <values> | <value> |
| POLARITY | <value> | <value> | <value> | <values> | <value> |
| SCOPE | <value> | <value> | <value> | <values> | <value> |
| INTENT | <value> | <value> | <value> | <values> | <value> |
| INTENSITY | <value> | <value> | <value> | <values> | <value> |
| CORE | <value> | <value> | <value> | <values> | <value> |

This tells you:

  • which fields are dead/default-heavy;
  • which fields actually vary;
  • which fields are domain proxies;
  • which domains the parser cannot handle;
  • whether [OTHER] hides several different failure modes.

A field with 95–99% default rate probably does not deserve tokens.
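
A small sketch of the per-field columns, assuming parser output is collected as a list of values per field with `None` for missing. The function `field_stats` is illustrative and leaves out the domain-correlation column:

```python
import math
from collections import Counter

def field_stats(values, default):
    """Default rate, missing rate, Shannon entropy (bits), top values."""
    total = len(values)
    if total == 0:
        raise ValueError("no values to summarize")
    counts = Counter(v for v in values if v is not None)
    present = sum(counts.values())
    entropy = 0.0
    for c in counts.values():
        p = c / present
        entropy -= p * math.log2(p)
    return {
        "default_rate": counts[default] / total,
        "missing_rate": (total - present) / total,
        "entropy_bits": entropy,
        "top_values": counts.most_common(3),
    }

# Made-up parser output for one field.
polarity = ["neutral"] * 8 + ["negative", None]
stats = field_stats(polarity, default="neutral")
print(stats)
```

Entropy near zero together with a high default rate is the signature of a field that costs tokens without carrying information.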


10. Replace one broad [OTHER] with typed unknowns

[OTHER] is too destructive.

Instead of:

[OTHER]

try:

[TYPE:unknown] [DOMAIN:geography]
[TYPE:unknown] [DOMAIN:biography]
[TYPE:unknown] [DOMAIN:sports]
[TYPE:unknown] [DOMAIN:technical]

or:

[PARSE:partial] [DOMAIN:geography] [INTENT:inform]

This separates:

the parser does not know the semantic type

from:

the system knows nothing at all

A partial prefix can still carry useful information.


11. Create a tiny oracle-Bryła set

Take 100–200 examples and manually assign correct MID fields.

Then compare:

RAW
DOMAIN-only
parser-MID
manual/oracle-MID

Interpretation:

| Result | Meaning |
|---|---|
| oracle-MID helps, parser-MID does not | parser bottleneck |
| parser-MID ≈ oracle-MID | parser is good enough |
| neither helps | representation/model/task issue |
| DOMAIN ≈ oracle-MID | Bryła mostly encodes domain |
| oracle-MID > DOMAIN | semantic-pragmatic fields add real signal |

This is one of the cleanest possible experiments because it separates:

representation quality

from:

parser quality

Even 100 examples can be enough to identify the bottleneck.


12. Build the smaller balanced corpus first

Between:

1. Extend the parser with more domain-specific rules.
2. Build a smaller balanced multi-domain corpus.

I would choose:

Build the smaller balanced corpus first, then extend the parser based on measured failures.

Start with a diagnostic set:

6 domains × 50 examples = 300 examples

Suggested domains:

| Domain | Why include it |
|---|---|
| welding/materials/technical | original strongest domain |
| geography/places | tests template-heavy factual text |
| biographies | tests people, roles, dates, events |
| daily life/practical advice | tests intent, urgency, pragmatic cues |
| science explanations | tests definitions and causality |
| sports/events | tests competitions, events, temporal facts |

Then scale later:

6 domains × 200 examples = 1,200 examples

or:

6 domains × 500 examples = 3,000 examples

Do not start with another huge uncontrolled corpus. If the parser fails on 300 balanced examples, it will also fail on 3,000.


13. Polish QA/data resources worth using

Use Polish datasets by role, not as one mixed pool.

PolQA

PolQA is one of the strongest Polish QA references. It contains 7,000 questions, 87,525 manually labeled evidence passages, and over 7 million candidate passages. It also classifies questions by formulation, question type, and answer entity type.

Use it for:

  • question-type analysis;
  • answer-type analysis;
  • evidence-aware QA;
  • retrieval + abstractive reader experiments;
  • annotation-design inspiration.

Be careful: OpenQA adds retrieval as another variable. For mechanism tests, use a controlled subset.

PoQuAD

PoQuAD is a Polish QA dataset modeled on SQuAD 2.0. It includes impossible questions and a generative answer layer.

Use it for:

  • passage-grounded QA;
  • impossible/answerability cases;
  • testing SCOPE, SOURCE, CERTAINTY, CORE, INTENT;
  • generation metrics beyond PPL.

PolEval 2024 QA / Reading Comprehension

PolEval 2024 Task 1 gives systems a question with a paired passage; some questions are impossible.

Use it for:

  • Polish QA evaluation protocol;
  • answerability scoring;
  • passage-grounded experiments;
  • moving beyond PPL.

PUGG

PUGG is especially relevant because it is not only a dataset, but a semi-automated construction methodology for Polish KBQA, MRC, and IR.

Use it for:

  • community-driven construction ideas;
  • semi-automated Polish QA/MRC/IR design;
  • baseline reporting style;
  • low-resource dataset-building patterns.

SpeakLeash / Polish LLM ecosystem

Use this ecosystem for:

  • Polish data discovery;
  • community contacts;
  • documentation examples;
  • possible weak teacher/evaluator models, with caution.

Do not frame Bryła as competing with large Polish LLMs. Frame it as:

explicit structure for very small Polish models under weak-hardware / low-data constraints

14. Dataset strategy

Do not mix all data into one pool.

| Role | Good sources | Purpose |
|---|---|---|
| clean controlled benchmark | your own balanced set, PoQuAD subset, PolEval subset | mechanism isolation |
| evidence/OpenQA experiments | PolQA | retrieval + answer generation |
| construction methodology | PUGG | semi-automated dataset building |
| weak training / stress testing | larger Polish corpora | pretraining or parser stress |
| final claim | small clean human-verified test | credible result |

Avoid:

PolQA + PoQuAD + Wikipedia + generated data -> one mixed pool -> one aggregate PPL

Prefer:

small clean benchmark
+ clear controls
+ separate weak-data experiments

15. Community-driven small dataset construction

A useful first dataset could be:

Bryła-MiniPL-QA v0.1

Start with:

300 diagnostic examples

Then:

1,200 benchmark examples = 6 domains × 200

Then, only if the signal is real:

3,000 examples = 6 domains × 500

Suggested schema:

id: geo_000123
domain: geography
source_type: manual | wikipedia | public_domain | synthetic_seeded
license: CC-BY-SA | CC0 | own | other
question: "..."
context: "..."
answer: "..."
answer_type: entity | date | number | yes_no | definition | procedure | explanation | list | unanswerable
is_answerable: true
bryla_mid: "..."
parser_status: parsed | partial | other | failed | oracle
parser_version: parser_v0.3
schema_version: bryla_mid_v1
split_group: source_article_or_template_id
split: train | dev | test
notes: "optional"
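
A minimal validation sketch for records in this schema, assuming each record is loaded as a Python dict. Which fields count as required, and the rule that `is_answerable` must agree with `answer_type`, are assumptions added here for illustration:

```python
ANSWER_TYPES = {"entity", "date", "number", "yes_no", "definition",
                "procedure", "explanation", "list", "unanswerable"}
PARSER_STATUSES = {"parsed", "partial", "other", "failed", "oracle"}
SPLITS = {"train", "dev", "test"}
# Assumed required set; context, license, source_type, notes treated as optional.
REQUIRED = ("id", "domain", "question", "answer", "answer_type",
            "is_answerable", "bryla_mid", "parser_status",
            "parser_version", "schema_version", "split_group", "split")

def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = [f"missing field: {k}" for k in REQUIRED if k not in record]
    if record.get("answer_type") not in ANSWER_TYPES:
        problems.append("unknown answer_type")
    if record.get("parser_status") not in PARSER_STATUSES:
        problems.append("unknown parser_status")
    if record.get("split") not in SPLITS:
        problems.append("unknown split")
    # Assumed rule: "unanswerable" items must have is_answerable == False.
    unanswerable = record.get("answer_type") == "unanswerable"
    if unanswerable == bool(record.get("is_answerable")):
        problems.append("answer_type and is_answerable disagree")
    return problems
```

Running this in the maintainer step, alongside the leakage checks, keeps volunteer submissions consistent without asking contributors to learn the schema rules.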

Most important fields:

domain
answer_type
parser_status
split_group
schema_version
parser_version

Community workflow:

1. Contributor writes question/context/answer.
2. Script runs parser and creates Bryła MID.
3. Reviewer checks answer correctness.
4. Bryła reviewer checks fields on a subset.
5. Maintainer runs leakage checks and split generation.

Keep volunteer tasks small. Do not require every contributor to understand the whole parser.

Review policy:

100% single review
10–20% double review
all disagreements saved

Disagreements are useful because they reveal ambiguous schema definitions.


16. Experiments I would run next

Experiment A: control ladder

This is the most important next experiment.

RAW
DOMAIN + RAW
MID + RAW
DOMAIN + MID + RAW
MID shuffled values + RAW
MID shuffled order + RAW
random tags same frequency + RAW

Use:

masked loss
val_ppl_clean
chrF / F1 if possible
tokens vs RAW
same seeds
same split
same tokenizer

Main question:

Does MID actually beat simple domain conditioning?

Experiment B: field survival tournament

Start from MID.

Leave-one-out:

MID
MID - TYPE
MID - POLARITY
MID - SCOPE
MID - INTENT
MID - INTENSITY
MID - CORE

Single-field versions:

TYPE only
POLARITY only
SCOPE only
INTENT only
INTENSITY only
CORE only
DOMAIN only

Interpretation:

| Pattern | Meaning |
|---|---|
| field helps alone and hurts when removed | strong useful field |
| field helps alone but not in MID | redundant |
| field only helps with another field | interaction |
| field does nothing | remove |
| field only helps without DOMAIN | likely domain proxy |

This is more informative than only MIN/MID/FULL.
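
The run list for this tournament can be generated rather than typed by hand. A small sketch, with `tournament_variants` as a hypothetical helper that maps each variant name to the fields it keeps:

```python
MID_FIELDS = ["TYPE", "POLARITY", "SCOPE", "INTENT", "INTENSITY", "CORE"]

def tournament_variants(fields=MID_FIELDS, extra_single=("DOMAIN",)):
    """Return {variant_name: kept_fields} for leave-one-out and single-field runs."""
    variants = {"MID": list(fields)}
    for f in fields:
        variants[f"MID - {f}"] = [x for x in fields if x != f]
        variants[f"{f} only"] = [f]
    for f in extra_single:
        variants[f"{f} only"] = [f]
    return variants

variants = tournament_variants()
for name, kept in variants.items():
    print(f"{name:16s} {kept}")
```

With six MID fields this gives 14 variants, so at 3 seeds the tournament is 42 training runs before adding RAW.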


Experiment C: serialization variants

Test the same information in different formats.

MID-symbolic
MID-verbalized
MID-hybrid
MID-no-defaults
MID-shuffled-order

Examples:

Symbolic:
[TYPE:fact] [POL:neutral] [SCOPE:general] [INTENT:inform] [INTENSITY:low] [CORE:yes]

Verbalized:
This is a neutral factual statement with general scope. The intent is to inform. The main content is central.

Hybrid:
[type: factual statement] [polarity: neutral] [scope: general] [intent: inform] [core: yes]

Why: structured representation format matters. SR-LLM argues that code-like structured representations can be less effective than natural-language descriptions, depending on model and setting.
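
A sketch of three of these formats generated from one field dict, so that only the surface form differs between runs. The function names, the `long_names` mapping, and the defaults are assumptions; the verbalized variant needs hand-written templates and is left out:

```python
def symbolic(fields):
    """[TYPE:fact] [POL:neutral] ... style."""
    return " ".join(f"[{name}:{value}]" for name, value in fields.items())

def hybrid(fields, long_names):
    """[type: fact] style with lower-case, spelled-out field names."""
    return " ".join(
        f"[{long_names.get(name, name.lower())}: {value}]"
        for name, value in fields.items()
    )

def no_defaults(fields, defaults):
    """Drop fields whose value equals the schema default, to save tokens."""
    return {n: v for n, v in fields.items() if defaults.get(n) != v}

fields = {"TYPE": "fact", "POL": "neutral", "SCOPE": "general",
          "INTENT": "inform", "INTENSITY": "low", "CORE": "yes"}
print(symbolic(fields))
print(hybrid(fields, {"POL": "polarity"}))
print(symbolic(no_defaults(fields, {"POL": "neutral", "INTENSITY": "low"})))
```

Deriving every variant from the same dict guarantees that a difference in clean PPL comes from the format, not from different field values.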


Experiment D: cooldown

This is one of the most interesting directions.

MeCo trains with metadata prefixes, then uses a cooldown phase on standard text so the model can function without metadata at inference time.

For Bryła, test:

RAW baseline

MID + text
eval: MID + text

MID + text for 80–90% of training
RAW text only for final 10–20%
eval: RAW text

DOMAIN + text for 80–90%
RAW text only for final 10–20%
eval: RAW text

random MID + text for 80–90%
RAW text only for final 10–20%
eval: RAW text

Main question:

Is Bryła an inference-time dependency or a training scaffold?

If cooldown preserves some gain, that is a much stronger story.
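
A minimal sketch of the schedule, assuming training is driven by a global step counter and examples are built as strings before tokenization. `use_prefix` and `build_example` are hypothetical helpers, not part of MeCo or any library:

```python
def use_prefix(step, total_steps, cooldown_frac=0.1):
    """True while the prefix should be prepended; False during cooldown."""
    if not 0.0 <= cooldown_frac <= 1.0:
        raise ValueError("cooldown_frac must be in [0, 1]")
    cooldown_steps = round(total_steps * cooldown_frac)
    return step < total_steps - cooldown_steps

def build_example(prefix, text, step, total_steps, cooldown_frac=0.1):
    """Prepend 'prefix [SEP_BRYLA]' in the main phase, raw text afterwards."""
    if use_prefix(step, total_steps, cooldown_frac):
        return f"{prefix} [SEP_BRYLA] {text}"
    return text

# With 1000 steps and a 10% cooldown, steps 900..999 see raw text only.
print(build_example("[TYPE:fact]", "Stal jest stopem.", 899, 1000))
print(build_example("[TYPE:fact]", "Stal jest stopem.", 900, 1000))
```

The same schedule works for the DOMAIN and random-MID arms; only the `prefix` argument changes.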


Experiment E: counterfactual prefix tests

Formalize your mini-chat test.

Create 20–50 fixed content prompts. For each prompt, vary one field only:

same topic + different POLARITY
same topic + different INTENT
same topic + different INTENSITY
same topic + different CORE
same topic + different SCOPE

Example topic:

gas cylinder leak during welding

Variants:

[INTENT:inform]
[INTENT:warn]
[INTENT:instruct]

Manual scoring:

| Criterion | 0 | 1 | 2 |
|---|---|---|---|
| topic preserved | no | partly | yes |
| intended control effect | no | partly | yes |
| factual consistency | no | partly | yes |
| no domain drift | no | partly | yes |
| answer usefulness | no | partly | yes |

This separates:

prefix changes output distribution

from:

prefix controls the intended property

Those are not the same thing.
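
A sketch of how the probe set and the manual scores could be organized, assuming prefixes are field dicts and each rating is a dict of criterion to 0/1/2. Both helper names are hypothetical:

```python
def counterfactual_probes(base_fields, field, values, topic):
    """One probe per value of `field`; other fields and the topic stay fixed."""
    return [
        {"topic": topic, "varied": field, "fields": {**base_fields, field: v}}
        for v in values
    ]

def mean_scores(ratings):
    """Average 0/1/2 ratings per criterion over a list of rating dicts."""
    criteria = ratings[0].keys()
    return {c: sum(r[c] for r in ratings) / len(ratings) for c in criteria}

probes = counterfactual_probes(
    base_fields={"TYPE": "fact", "INTENT": "inform"},
    field="INTENT",
    values=["inform", "warn", "instruct"],
    topic="gas cylinder leak during welding",
)
# Made-up ratings for two generations.
summary = mean_scores([
    {"topic preserved": 2, "no domain drift": 0},
    {"topic preserved": 1, "no domain drift": 2},
])
```

Comparing "topic preserved" and "intended control effect" averages per varied field shows directly whether a field steers the intended property or just shifts the topic.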


17. What would convince me Bryła is doing something useful?

A convincing pattern would be:

| Test | Desired result |
|---|---|
| MID > RAW | yes |
| MID > DOMAIN | yes |
| MID > shuffled values | yes |
| MID > random tags | yes |
| clean PPL improves | yes |
| improvement is not only full-sequence PPL | yes |
| at least one task metric improves | yes |
| parser coverage is reported | yes |
| leakage checks pass | yes |
| group splits are used | yes |
| useful fields are identified by ablation | yes |
| counterfactual tests preserve topic | yes |
| cooldown preserves some gain | very strong bonus |

The first four are especially important:

MID > RAW
MID > DOMAIN
MID > shuffled MID
MID > random tags

That would make the result much harder to dismiss.


18. What would make me skeptical?

| Outcome | Why it is a problem |
|---|---|
| DOMAIN ≈ MID | Bryła may mostly encode domain |
| shuffled values ≈ real MID | field meanings may not matter |
| random tags help | formatting/regularization artifact |
| only val_ppl_std improves | tag-prediction artifact |
| val_ppl_clean does not improve | no target-text gain |
| one field changes topic instead of style | control is not semantic |
| parser mostly outputs [OTHER] | model receives little structure |
| seed std is extremely tiny on template data | near-duplicate/template issue |
| random row split on Wikipedia | contamination risk |
| FULL wins only when tags are included in loss | metric artifact |

These are not reasons to stop. They are diagnostics.


19. Recommended 4-week plan

Week 1 — freeze and instrument

Deliverables:

BRYLA-MID-v1 frozen
masked loss implemented
val_ppl_clean / val_ppl_std / val_ppl_tags reported
parser dashboard created
leakage checks scripted
DOMAIN prefix added

Do not run many big trainings yet.

Week 2 — run the control ladder

Run:

RAW
DOMAIN + RAW
MID + RAW
DOMAIN + MID + RAW
MID shuffled values + RAW
MID shuffled order + RAW
random tags same frequency + RAW

Minimum:

3 seeds

Better:

5 seeds

Report:

clean PPL
std PPL
tag PPL
tokens vs RAW
train time
inference time
win count
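
A small sketch of the per-variant summary, assuming clean PPL is recorded per seed with seeds in the same order for every variant. The function `summarize` and the numbers are placeholders, not results:

```python
from statistics import mean, stdev

def summarize(results, baseline="RAW"):
    """results: {variant: [clean_ppl per seed]}, same seed order everywhere."""
    base = results[baseline]
    table = {}
    for variant, ppls in results.items():
        # A "win" is a seed where the variant has lower clean PPL than baseline.
        wins = sum(p < b for p, b in zip(ppls, base))
        table[variant] = {
            "mean": mean(ppls),
            "std": stdev(ppls) if len(ppls) > 1 else 0.0,
            "delta_pct": 100.0 * (mean(ppls) - mean(base)) / mean(base),
            "wins_vs_baseline": wins,
        }
    return table

# Placeholder numbers only, to show the shape of the output.
table = summarize({"RAW": [10.0, 10.0, 10.0], "MID": [9.0, 10.5, 9.5]})
for variant, row in table.items():
    print(variant, row)
```

Reporting the win count next to the mean and standard deviation makes it obvious when an average gain is driven by a single seed.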

Week 3 — build 300-example diagnostic set

Create:

6 domains × 50 examples

Domains:

technical
geography
biography
daily life
science
sports/events

For each example:

question
context
answer
domain
answer_type
parser_status
bryla_mid
split_group

Run parser diagnostics first. Do not scale yet.

Week 4 — oracle Bryła + counterfactual probes

Create:

100–200 manually corrected MID examples

Compare:

RAW
DOMAIN
parser-MID
oracle-MID

Also create:

20–50 counterfactual prefix probes

This will tell you whether the next bottleneck is parser quality or representation design.


20. Best public framing

I would write the current state like this:

I found that the compact MID schema is a better tradeoff than the full 20-field schema: it gives a small but repeatable improvement in the technical QA setting, while FULL adds many mostly-default fields and a large token cost. I also found that full-sequence perplexity is misleading for prefix-tag experiments, so I now report target-only clean PPL after the separator. Masked-loss training shows that the model does read Bryła prefixes as conditioning information, but the observed polarity effect may be entangled with domain/topic correlations. The next step is to test MID against DOMAIN-only, shuffled-field, and random-tag controls under clean masked loss, and to build a small balanced multi-domain Polish QA set to measure parser coverage outside the original technical domain.

Avoid saying:

Bryła proves semantic understanding.
Bryła replaces raw text.
Bryła scales generally.
FULL Bryła is better.
Polarity controls semantics.

Use:

Bryła conditions generation.
MID is the current best tradeoff.
Clean PPL is required.
Parser coverage is the bottleneck.
Domain controls are necessary.
Cooldown is the next high-value test.

21. Direct answer to the two options

Between:

1. Extend the parser with more domain-specific rules.
2. Build a smaller balanced multi-domain corpus.

I would choose:

Build the smaller balanced corpus first. Then extend the parser only where that corpus shows failures.

Reason:

  • rule expansion without a balanced diagnostic set can overfit the parser to one corpus;
  • the current parser failure is a coverage problem, but you need coverage by domain/type;
  • a balanced dataset separates semantic usefulness from domain/template effects;
  • a small clean dataset is more useful than a large noisy one at this stage.

Best immediate target:

300 examples for diagnostics

then:

1,200 examples for real experiments

not another large uncontrolled Wikipedia run.


Short summary

  • The update is good because the failures make the result more credible.
  • MID > FULL is important: compact informative fields beat default-heavy annotation.
  • val_ppl_clean should be the primary metric from now on.
  • Masked loss is the right objective for Bryła-as-context.
  • The polarity generation result proves conditioning, but may also reveal domain leakage.
  • Add DOMAIN as a mandatory control.
  • Test MID against DOMAIN, shuffled MID, and random tags.
  • Do not expand FULL now.
  • Do not blindly add parser rules.
  • Build a small balanced multi-domain diagnostic corpus first.
  • Use PolQA, PoQuAD, PolEval, and PUGG as references/resources.
  • Add parser dashboards, leakage checks, group splits, oracle-Bryła examples, and counterfactual prefix tests.
  • The strongest next claim would be:
compact Bryła helps beyond domain conditioning under clean target-only loss.