Row Segmentation QA: Catch Merged Lines Before They Break Bank Reconciliation
A row-segmentation QA workflow to detect merged, split, and page-break rows before they get into your export files.
When a statement row gets merged, everything downstream starts lying in a very calm voice.
Your CSV still exports. Your XLSX still opens. Your JSON still validates. But the data is wrong because two statement lines got glued together into one row, or one row got split into two. That breaks reconciliation in ways totals-only checks often miss.
Row segmentation QA is the method for detecting that problem before import. It focuses on structure, not just values:
- does each row correspond to exactly one transaction line?
- are wrapped descriptions getting merged into neighboring rows?
- do date/amount/description patterns still look like a statement table, not OCR soup?
This post gives you a practical workflow to catch row merges, line splits, and boundary drift before they reach your ledger.
Why row segmentation is its own problem
People lump row segmentation into OCR quality, but that’s too vague.
A statement can have excellent OCR and still fail row segmentation because of:
- wrapped descriptions that cross line boundaries
- page breaks that continue a transaction onto the next page
- table layouts with thin grid lines or no grid lines
- bank-specific formatting where a fee or adjustment block uses a different structure
The key point: a row can be textually readable and still structurally wrong.
That’s why segmentation QA needs its own gates.
What row segmentation QA must guarantee
A safe transaction export should guarantee:
- One transaction per row
  - no accidental merges of adjacent lines
  - no accidental splits of one transaction into multiple rows
- Stable column shape
  - each row should preserve the same field pattern (date, description, amount, maybe balance)
- Boundary consistency
  - when a statement wraps text or breaks across pages, the row boundary rule must remain consistent
- Export parity
  - the same row structure should be present in CSV, XLSX, and JSON
If row segmentation fails, dedupe, merchant normalization, and sequence QA all get noisier.
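The export parity guarantee can be checked mechanically. Here is a minimal sketch for two of the three formats, assuming the CSV has a header row, the JSON is a list of objects, and both use the hypothetical field names `date` and `amount` with amounts stored as strings. XLSX is left out to keep the example standard-library only.

```python
import csv
import io
import json

def parity_issues(csv_text, json_text, key_fields=("date", "amount")):
    """Compare row count and key fields between a CSV and a JSON export."""
    csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
    json_rows = json.loads(json_text)
    issues = []
    if len(csv_rows) != len(json_rows):
        issues.append(f"row count differs: CSV={len(csv_rows)} JSON={len(json_rows)}")
    # Compare rows position by position; zip stops at the shorter export
    for index, (csv_row, json_row) in enumerate(zip(csv_rows, json_rows)):
        for field in key_fields:
            if str(csv_row.get(field, "")) != str(json_row.get(field, "")):
                issues.append(f"row {index}: {field} differs")
    return issues
```

An empty list means the two exports agree on row count and on the anchor fields. Any entry is a reason to look at segmentation before import.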
A simple segmentation signal set
You don’t need a big model. You need a few strong signals.
Signal 1: date presence
For most transaction rows, a date appears at the start or within a predictable field. If a row suddenly loses the date while keeping a long description, that’s a merge candidate.
Signal 2: amount presence
A transaction row should almost always include exactly one amount. If an export row has two amounts, or no amount at all and only text, investigate.
Signal 3: description length distribution
Healthy descriptions usually live in a normal range.
- too short: possible truncation or split row
- too long: possible merge or wrapped line absorbed into the row
Signal 4: balance consistency
If the statement includes running balances, a merged row often causes a mismatch in the balance step pattern.
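A minimal sketch of that balance step check, assuming each row is a hypothetical `(amount, balance)` pair of strings, with debits negative and credits positive:

```python
from decimal import Decimal

def balance_step_mismatches(rows, opening_balance):
    """Return indexes of rows whose printed balance does not equal
    the previous balance plus the row's signed amount."""
    mismatches = []
    previous = Decimal(opening_balance)
    for index, (amount, balance) in enumerate(rows):
        if previous + Decimal(amount) != Decimal(balance):
            mismatches.append(index)
        # Continue from the printed balance so one break does not cascade
        previous = Decimal(balance)
    return mismatches
```

For example, with an opening balance of 100.00, a merged row that kept only the final amount (-2.75) but shows the balance 88.75 is flagged, because 100.00 - 2.75 is 97.25, not 88.75.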
Signal 5: page boundary anomalies
Rows near page breaks are high risk. Any row with unusual continuation patterns should be scored higher.
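The text-based signals (date, amount, description length) can be pulled out with a few regular expressions. This is a rough sketch, assuming MM/DD dates and two-decimal amounts. The patterns and the `RowSignals` name are illustrative, so adjust them to the statement format you actually see.

```python
import re
from dataclasses import dataclass

# Assumed formats: MM/DD dates; amounts like 8.50, 1234.56 or 1,234.56
DATE_RE = re.compile(r"\b\d{2}/\d{2}\b")
AMOUNT_RE = re.compile(r"-?(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2}\b")

@dataclass
class RowSignals:
    has_date: bool
    amount_count: int
    description_length: int

def row_signals(text: str) -> RowSignals:
    """Extract the date, amount, and description-length signals for one row."""
    has_date = DATE_RE.search(text) is not None
    amount_count = len(AMOUNT_RE.findall(text))
    # Treat whatever is left after removing dates and amounts as the description
    description = AMOUNT_RE.sub(" ", DATE_RE.sub(" ", text))
    description = " ".join(description.split())
    return RowSignals(has_date, amount_count, len(description))
```

A row with `amount_count` of 2 is a merge candidate. A row with a date but `amount_count` of 0, or an amount but no date, is a split candidate.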
The segmentation QA workflow
Use this order.
Step 1: capture the statement table shape
Before any row-level logic, detect:
- number of visible columns
- where date, description, amount, and balance fields usually land
- whether the page has a repeated header or repeated summary block
This is the row equivalent of layout fingerprinting.
Step 2: score every candidate row
For each candidate row, compute a basic segmentation score from:
- date confidence
- amount confidence
- description length
- field alignment with prior rows
Rows with unusually low scores are merge/split candidates.
Step 3: group suspect rows into break regions
Don’t investigate row-by-row in isolation.
If one bad row appears, check the rows immediately before and after it. Merges and splits almost always create local neighborhoods of weirdness.
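One way to build those neighborhoods is to widen each suspect row by a small window and merge windows that overlap or touch. This is a sketch only. The `radius` parameter and the function name are assumptions, not part of any fixed method.

```python
def break_regions(suspect_indexes, row_count, radius=1):
    """Expand each suspect row by `radius` neighbors and merge overlapping
    or adjacent windows into inclusive (start, end) regions."""
    regions = []
    for index in sorted(set(suspect_indexes)):
        start = max(0, index - radius)
        end = min(row_count - 1, index + radius)
        if regions and start <= regions[-1][1] + 1:
            # Window touches the previous region, so extend it
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions
```

With suspects at rows 3, 4, and 10 on a 12-row page, this yields two regions to review, rows 2 through 5 and rows 9 through 11, instead of three isolated rows.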
Step 4: classify the failure
Classify as:
- merged row
- split row
- page-break carryover
- layout drift for a statement variant
Step 5: repair the smallest block
Fix the affected segment only, then re-run QA.
A practical table of segmentation failure patterns
| Pattern | What it looks like | Likely cause | Repair lever |
|---|---|---|---|
| Merged row | one row has unusually long text and maybe two transaction ideas inside it | OCR glued adjacent lines together | re-segment the affected block |
| Split row | one transaction is spread across two rows, with one row missing amount/date | OCR broke a wrapped line | merge based on adjacency and column continuity |
| Page break carryover | the last row of a page and the first row of the next page look connected | page continuation not handled | stitch page boundary using continuation rules |
| Layout drift | same bank, different statement template variant changes row shape | fingerprint mismatch | select the correct layout mapping |
This table is the whole game. If you can classify the pattern, you can fix it.
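The table can be turned into a first-pass classifier. The sketch below is one possible rule ordering, not a definitive one. All inputs are hypothetical per-row signals, `layout_matches` is assumed to come from comparing the row against the table shape captured in Step 1, and the 2.0 length ratio is an assumed threshold.

```python
from enum import Enum

class Failure(Enum):
    MERGED_ROW = "merged row"
    SPLIT_ROW = "split row"
    PAGE_BREAK_CARRYOVER = "page-break carryover"
    LAYOUT_DRIFT = "layout drift"
    NONE = "no failure detected"

def classify(has_date, amount_count, length_ratio, at_page_boundary, layout_matches):
    """Map row signals to a failure pattern.

    length_ratio is the description length divided by the statement median.
    """
    if not layout_matches:
        return Failure.LAYOUT_DRIFT
    incomplete = (not has_date) or amount_count == 0
    if incomplete and at_page_boundary:
        return Failure.PAGE_BREAK_CARRYOVER
    if incomplete:
        return Failure.SPLIT_ROW
    if amount_count > 1 or length_ratio > 2.0:
        return Failure.MERGED_ROW
    return Failure.NONE
```

Layout drift is checked first here because a wrong layout mapping makes every other signal unreliable. That ordering is a design choice you may want to revisit for your own statements.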
Worked example 1: a merged row that still looks “parseable”
The failure
You have two adjacent transactions:
04/28 Coffee Shop 8.50
04/28 Metro 2.75
OCR merges them into:
04/28 Coffee Shop 8.50 Metro 2.75
This row still looks parseable if your extractor finds a date and a final amount. But it’s wrong.
What QA catches
- description length spikes far above normal
- there are two merchant ideas in one row
- row count is lower than expected for the page
If running balances exist, the row’s balance step also won’t line up with a two-transaction sequence.
Repair
- split the row using the same table shape that produced the merge
- preserve both amounts as separate transactions
- re-run segmentation QA and sequence QA
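A rough sketch of the split, assuming MM/DD dates, two-decimal amounts, and no running-balance column (a balance would look like a second amount and needs separate handling). Because the second date was lost in the merge, the sketch inherits the previous date and marks it with a hypothetical `date_inherited` flag so a reviewer can confirm it.

```python
import re

LEADING_DATE_RE = re.compile(r"^\s*(\d{2}/\d{2})\s+")
AMOUNT_RE = re.compile(r"-?(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2}\b")

def split_merged_row(text):
    """Split a merged row into one record per amount found in the text."""
    records = []
    cursor = 0
    current_date = None
    for match in AMOUNT_RE.finditer(text):
        segment = text[cursor:match.start()]
        date_match = LEADING_DATE_RE.match(segment)
        if date_match:
            current_date = date_match.group(1)
            segment = segment[date_match.end():]
        records.append({
            "date": current_date,
            "description": segment.strip(),
            "amount": match.group(),
            # True when this record had no date of its own
            "date_inherited": date_match is None,
        })
        cursor = match.end()
    return records

rows = split_merged_row("04/28 Coffee Shop 8.50 Metro 2.75")
# rows[0]: 04/28, "Coffee Shop", 8.50
# rows[1]: 04/28 (inherited), "Metro", 2.75
```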
Worked example 2: wrapped descriptions split into two fake rows
The failure
A transaction description continues onto the next line:
- Row 1: 04/29 ONLINE PAYMENT AUTHORIZED
- Row 2: BY MERCHANT X 42.00
If the parser treats these as separate transactions, you create a fake transaction.
What QA catches
- Row 1 has no amount, only description
- Row 2 has amount but no date
- the pair shares a continuation pattern
Repair
- merge the continuation line back into a single transaction row
- keep the date from the first row and the amount from the final amount-bearing line
- re-run export parity to confirm CSV/XLSX/JSON all match the same merged record
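The merge step can be sketched as a single pass that starts a new record at every date anchor and folds dateless lines into the record before them. It assumes MM/DD dates at the start of a line and a two-decimal amount at the end. The `continuation_lines` counter is a hypothetical field added so stitched records stay auditable.

```python
import re

DATE_RE = re.compile(r"^\s*(\d{2}/\d{2})\s+(.*)$")
TRAILING_AMOUNT_RE = re.compile(
    r"^(.*?)\s*(-?(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2})\s*$"
)

def stitch_continuations(lines):
    """Fold dateless continuation lines into the preceding dated record."""
    records = []
    current = None
    for line in lines:
        date_match = DATE_RE.match(line)
        rest = date_match.group(2) if date_match else line.strip()
        amount_match = TRAILING_AMOUNT_RE.match(rest)
        text = amount_match.group(1).strip() if amount_match else rest.strip()
        amount = amount_match.group(2) if amount_match else None
        if date_match or current is None:
            # A date anchor (or the very first line) starts a new record
            current = {
                "date": date_match.group(1) if date_match else None,
                "description": text,
                "amount": amount,
                "continuation_lines": 0,
            }
            records.append(current)
        else:
            current["description"] = f"{current['description']} {text}".strip()
            if amount is not None:
                # Keep the amount from the final amount-bearing line
                current["amount"] = amount
            current["continuation_lines"] += 1
    return records
```

Run on the two rows above, this produces one record dated 04/29 with the description "ONLINE PAYMENT AUTHORIZED BY MERCHANT X" and the amount 42.00.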
Boundary rules that reduce false positives
A good segmentation system needs a few hard rules.
Rule 1: one date anchor per transaction row
If a row has no date but depends on a nearby date line, it might be a continuation, not a new transaction.
Rule 2: one amount anchor per transaction row
If a row carries two amounts, treat it as a merge candidate. If a row has an amount but no date, and the previous row has a date but no amount, the two rows are probably one split transaction.
Rule 3: continuation text should be explicitly marked
When a line clearly continues a description, keep it as continuation, not as a row.
Rule 4: rows near page breaks get special scrutiny
Continuations often happen there.
These rules are simple, but they stop a lot of garbage from slipping through.
How row segmentation affects other QA gates
Row segmentation failure doesn’t stay in one lane.
It contaminates:
- Merchant normalization, because merchants get glued to the wrong transaction
- Deduplication, because duplicated fragments can look like separate rows
- Sequence QA, because row order and balance steps become inconsistent
- Export parity, because formats may split or merge differently
That’s why segmentation QA belongs near the top of the pipeline.
A quick scoring template you can operationalize
Score each row 0–100.
Start with 100 and subtract:
- 30 if there is no date anchor where one is expected
- 25 if the amount is missing or duplicated
- 20 if the description is more than 2× the statement's median length
- 15 if the row sits at a page boundary and lacks a clear continuation marker
- 10 if field alignment differs from nearby rows
Then use buckets:
- 80–100: likely clean
- 50–79: inspect nearby rows
- < 50: high-risk segmentation break
This is not perfect, but it gives operators a triage path.
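The template translates almost directly into code. In this sketch the flag names are hypothetical and the penalties and bucket boundaries are the ones listed above. How each flag gets set is left to your own signal extraction.

```python
from dataclasses import dataclass

@dataclass
class RowFlags:
    missing_date: bool = False
    amount_missing_or_duplicated: bool = False
    description_too_long: bool = False   # well above the statement median
    unmarked_page_boundary: bool = False
    misaligned_fields: bool = False

# Penalty per flag, taken from the list above
PENALTIES = {
    "missing_date": 30,
    "amount_missing_or_duplicated": 25,
    "description_too_long": 20,
    "unmarked_page_boundary": 15,
    "misaligned_fields": 10,
}

def segmentation_score(flags: RowFlags) -> int:
    """Start at 100 and subtract the penalty for every raised flag."""
    penalty = sum(p for name, p in PENALTIES.items() if getattr(flags, name))
    return max(100 - penalty, 0)

def bucket(score: int) -> str:
    if score >= 80:
        return "likely clean"
    if score >= 50:
        return "inspect nearby rows"
    return "high-risk segmentation break"
```

Note that a single flag never pushes a row below 50 on its own. It takes at least two of the heavier flags, such as a missing date plus a missing amount (score 45), to land in the high-risk bucket.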
FAQ
1) Can segmentation QA replace OCR QA?
No. OCR QA catches character-level issues. Segmentation QA catches row-structure issues. You need both.
2) What if the statement has no running balances?
You can still catch merges/splits using date, amount, description length, and page boundary rules.
3) What if a row is legitimately very long?
That happens. The fix is not to ban long rows, it’s to compare against layout-specific baselines and nearby row shapes.
4) Should dedupe happen before segmentation QA?
No. Segment first, then normalize merchants, then dedupe.
5) What’s the most common real-world mistake?
Treating a wrapped line as a new transaction row. That error causes fake rows and missed rows at the same time.
Bottom line
Row segmentation QA catches the structural mistakes that value checks miss.
If you can detect merged rows, split rows, and page-break carryover early, your exports stop lying about transaction counts. Then reconciliation becomes a math problem again, not a detective job.