Ashish Bhutani · 39 min read

Case Study: Automating Mortgage Processing with LLM Agents in a Hybrid Cloud Bank

AI Agents · System Design · AI Engineering · Interview

This post applies the 9-step case study structure from the GenAI System Design Framework.

Problem Statement

A mid-size bank processes 3,000 to 5,000 mortgage applications per month across multiple countries. Each application involves 15 to 25 documents: pay stubs, tax returns, employment letters, bank statements, property appraisals, title deeds, and sometimes more. Today, loan officers manually review every document, cross-reference data across the full set, check compliance against country-specific regulations, and write underwriting narratives summarizing their findings.

Average processing time: 3 to 4 weeks per application. Of that, 30 to 40% is spent on document review and data extraction alone. The rest is waiting on third-party verifications, internal approvals, and the actual underwriting decision. The document work is the bottleneck the bank can actually control.

The bank runs on GCP for compute and AI workloads, but all customer data (core banking records, credit histories, identity documents) lives in on-premise databases. This is not a choice. It is a regulatory and security requirement that varies by country but applies everywhere the bank operates. The bank operates in 3 countries today (UK, Germany, US) with plans to expand to 5 more in the next 2 years.

Primary users: Loan officers who process applications daily. They see extracted data, flagged inconsistencies, and draft narratives. They do not see raw documents through this system (they can always access originals through existing document management systems).

Secondary users: Compliance teams who audit processing decisions and regional operations teams who configure country-specific rules.

What This System Is Not

This is not a credit scoring system. The bank already has ML models for credit scoring, and those are heavily regulated, requiring full explainability. Replacing them with an LLM would be a regulatory non-starter.

This is not a customer-facing chatbot. Mortgage applicants do not interact with this system at all. They submit documents through existing channels.

This is not replacing the loan officer’s judgment on approval decisions. The agent handles the tedious document work: extraction, cross-referencing, inconsistency detection, narrative drafting. It surfaces structured findings so humans can make faster, better-informed decisions. The human always decides.

Step 0: Why GenAI?

The first question is always: does this need an LLM at all? A surprising amount of mortgage processing is already automated or automatable with traditional software.

Where Deterministic Automation Already Works (and Stays)

| Component | Approach | Why It Stays Deterministic |
|---|---|---|
| Credit scoring | Traditional ML models (XGBoost, logistic regression) | Heavily regulated, requires full explainability, well-established |
| Interest rate calculation | Rule engine | Pure math based on rate tables, loan term, credit tier |
| LTV/DTI threshold checks | Rule engine | Loan-to-value and debt-to-income are arithmetic. No ambiguity |
| KYC/identity verification | Specialized vendors (Jumio, Onfido) | Purpose-built, certified, regulatory-compliant |
| Workflow orchestration | Durable workflow engine (Temporal) | State machine logic, retries, timeouts. No reasoning needed |
| Document routing | Classification model + rules | Which department handles which document type. Static logic |

These components are cheaper, faster, more reliable, and more explainable than any LLM-based alternative. Replacing them with an agent would be engineering malpractice.

Where Deterministic Approaches Break Down

The boundary between structured and unstructured is where GenAI earns its cost.

Document format variation. A German Lohnsteuerbescheinigung (wage tax certificate) looks nothing like a UK P60. Even within one country, every employer formats pay stubs differently. Layout, field names, ordering, even which fields are included all vary. Rule-based OCR pipelines work when you have 5 document templates. When you have 500 employers across 3 countries, template maintenance becomes a full-time job for multiple engineers.

Cross-document reasoning. An employment letter says the applicant joined in 2023. Their tax return shows income from the same employer since 2020. Is this a contradiction? Maybe the letter is for a new role at the same company. Maybe someone made a typo. Catching these requires reading multiple documents together and reasoning about what the discrepancy means. This is not pattern matching. It is inference.

Multi-language processing. Expanding to new countries means documents in new languages. Building a German extraction pipeline, then a French one, then a Spanish one, is expensive and slow. A single LLM handles all of them with prompt-level configuration. The extraction quality is not identical across languages (German compound nouns are harder to parse than English field labels), but it is good enough to avoid building per-language systems.

Underwriting narrative generation. Loan officers spend 30 to 60 minutes per application writing up their findings in a structured narrative. Given that the structured data is already extracted and verified, this is a summarization task. An LLM drafts the narrative in seconds. The loan officer reviews and edits, which takes 5 to 10 minutes instead of an hour.

Cost Math

The ROI case needs to be concrete, not hand-wavy.

| Cost Component | Manual Processing | LLM-Assisted Processing |
|---|---|---|
| Loan officer time per application (document review) | 8-12 hours at ~$45/hr = $360-540 | 1-2 hours review + edits at $45/hr = $45-90 |
| LLM inference per application (extraction + verification + narrative) | $0 | ~$2.50-4.00 (15-25 docs x $0.10-0.15 per doc + narrative) |
| Cloud Interconnect bandwidth per application | $0 | ~$0.05 (extracted text, not raw images, for most docs) |
| Template maintenance (per country per year) | $50K-80K engineering cost | $5K-10K prompt tuning cost |
| Error correction downstream | ~$50 per application (rework from missed inconsistencies) | ~$15 per application (fewer misses, some LLM errors) |
| Total per application | $410-590 | $65-115 |

At 4,000 applications per month, that is the difference between $1.6M-2.4M and $260K-460K in monthly document processing costs. Even accounting for infrastructure, model costs, and the engineering team to build and maintain the system, the payback period is under 6 months.

The inference cost per document deserves more detail. A typical mortgage document is 1 to 3 pages. Using a vision-capable model (Gemini 1.5 Pro on Vertex AI), each page costs roughly $0.003 for input tokens (image) plus $0.01-0.02 for output tokens (structured extraction). A 20-document application with an average of 2 pages per document runs about $0.80-1.20 for extraction alone. Cross-document verification adds another $0.50-0.80 (multiple documents in context). Narrative generation adds $0.30-0.50. Total: $1.60-2.50 per application at current Vertex AI pricing.
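The per-stage estimates above can be sanity-checked with a small cost model. This is a back-of-the-envelope sketch using the illustrative per-page and per-stage figures from the text, not quoted Vertex AI prices:

```python
# Back-of-the-envelope cost model for one application. The per-page and
# per-stage figures are the illustrative estimates from the text, not
# quoted Vertex AI prices.

def extraction_cost(num_docs, pages_per_doc,
                    input_per_page=0.003, output_per_page=0.015):
    """Vision-model extraction: image input tokens plus structured output."""
    return num_docs * pages_per_doc * (input_per_page + output_per_page)

def application_cost(num_docs=20, pages_per_doc=2,
                     verification=0.65, narrative=0.40):
    # Extraction + cross-document verification + narrative generation.
    return round(extraction_cost(num_docs, pages_per_doc)
                 + verification + narrative, 2)

print(application_cost())  # → 1.77, inside the $1.60-2.50 range above
```

Varying the document count and page count over the stated ranges keeps the total roughly within the $1.60-2.50 band.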

Step 1: Requirements

Functional Requirements

  • Extract structured data fields from mortgage documents across all supported document types and languages
  • Cross-reference extracted data across the full document set for a single application
  • Flag inconsistencies with severity levels (blocking vs. informational) and supporting evidence
  • Generate compliance disclosures per jurisdiction (TILA/RESPA for US, FCA disclosures for UK, BaFin requirements for Germany)
  • Draft underwriting narratives from structured findings
  • Support human override on any extracted field or flag

Non-Functional Requirements

  • Data residency: Customer documents never leave on-prem infrastructure. Only extracted structured data (field names, values, confidence scores) moves to GCP. For countries with stricter rules (Germany), even extracted text may need to stay on-prem with only aggregate results moving to cloud.
  • Latency: Full document set processing within 15 minutes for a standard application (20 documents). Individual document extraction under 30 seconds. Loan officers should not be waiting for the system.
  • Auditability: Every extraction decision must be traceable. Input document reference, extracted value, confidence score, model version, prompt version, and whether a human modified the result.
  • Multi-country extensibility: Adding a new country should be a configuration change (new document types, compliance rules, tool registrations), not a code change.
  • Availability: 99.5% during business hours. Documents queue during downtime. No data loss.

Scale Assumptions

| Dimension | Current | Year 2 Target |
|---|---|---|
| Applications per month | 3,000-5,000 | 8,000-12,000 |
| Documents per application | 15-25 | 15-30 (more countries = more doc types) |
| Pages per document (average) | 2 | 2 |
| Total pages per month | 90,000-250,000 | 240,000-720,000 |
| Languages | 3 (English, German, limited French) | 6-8 |
| Countries | 3 | 8 |
| Concurrent applications in processing | 200-400 | 500-1,200 |
| Peak extraction requests per minute | 50-80 | 150-300 |

This is not a high-QPS inference problem. Peak throughput is 300 extraction requests per minute, which is about 5 per second. The challenge is correctness, auditability, and the hybrid cloud data flow, not raw throughput.

Quality Metrics

| Metric | Target | Why This Number |
|---|---|---|
| Field extraction accuracy | >94% | Below 90%, human review volume exceeds manual processing cost. 94% is the breakeven for net time savings |
| Cross-doc flag precision | >85% | False flags waste loan officer time. Below 85%, officers start ignoring flags |
| Cross-doc flag recall | >92% | Missed inconsistencies are the dangerous failure. Higher recall is worth some false positives |
| Narrative edit rate | Under 30% of text modified | If officers rewrite more than 30%, the draft isn’t saving meaningful time |
| Compliance disclosure accuracy | >99% | Regulatory requirement. Wrong disclosures create legal liability |
| End-to-end processing time | Under 15 minutes | Must be faster than the 8-12 hours of manual review to justify the system |

Trade-offs to Acknowledge

| Trade-off | Option A | Option B | Our Lean |
|---|---|---|---|
| Data boundary: what crosses to GCP? | Raw documents to GCP (simpler, better extraction) | Only extracted text to GCP (safer, loses visual layout) | Country-dependent. UK/US: raw docs allowed. Germany: text only |
| Extraction coverage vs accuracy | Attempt all fields, accept lower accuracy | High-confidence extraction on 80% of fields, flag rest for human | Option B. Wrong extractions are worse than missing ones |
| Single model vs specialized models | One frontier model for everything | Smaller models for classification, frontier for extraction | Hybrid. Classifier is a fine-tuned BERT. Extraction uses Gemini Pro |
| On-prem LLM vs cloud LLM | Run models on-prem for full data control | Use Vertex AI, accept data transit over Cloud Interconnect | Cloud (Vertex AI). On-prem GPU infrastructure is 3-5x more expensive to operate and model updates are slower |

Step 2: Architecture, Hybrid Cloud Data Flow

The core design constraint shapes everything: LLM inference runs on GCP (Vertex AI), but customer documents and banking data live on-prem. The architecture must bridge this gap without violating data residency requirements that vary by country.

Components

On-prem document gateway. Receives documents from the bank’s existing document management system. Runs initial digitization (OCR for scanned documents, PDF text extraction for digital documents) using on-prem infrastructure. Depending on the country’s data residency policy, it sends either the raw document images or the extracted text to GCP over Cloud Interconnect (a dedicated, encrypted connection between the bank’s data center and GCP, not the public internet).

GCP processing layer. The agent runtime (document extraction agents, cross-reference agents, narrative generation) runs on GCP, using Vertex AI for LLM inference. The durable workflow engine (Temporal, running on GKE) orchestrates the full application lifecycle.

On-prem data API. A secure, read-only API that agents on GCP call to query banking data: credit records, account history, employment verification records. This data never leaves on-prem. The agent sends a query (“what is the average monthly deposit for account X over the last 12 months?”), and the API returns the answer. The agent never sees raw account data.

Result store. Extracted structured data, flags, and narratives are written back to on-prem systems through the same Cloud Interconnect link. The on-prem result store is the system of record. The GCP result cache is ephemeral.

What Crosses the Boundary?

This is the single most important architectural decision. Three options, each with real trade-offs:

| Option | What Goes to GCP | Pros | Cons | When to Use |
|---|---|---|---|---|
| Raw documents | Full document images and PDFs | Vision model sees layout, tables, signatures. Best extraction accuracy | Regulatory risk in strict jurisdictions. Larger bandwidth | Countries with permissive data rules (UK, US) |
| OCR’d text only | Extracted text from on-prem OCR | No document images leave premises. Lower bandwidth | Loses visual layout context. Tables become garbled text. ~5-8% accuracy drop | Countries with strict data residency (Germany) |
| On-prem extraction, structured output only | Just field-value pairs | Maximum data protection. Minimal bandwidth | Limits LLM reasoning. Can’t handle unusual formats. Requires on-prem GPU | Countries that prohibit any customer data in cloud |

The answer is not one-size-fits-all. The country configuration (covered in Step 4) determines which option applies. For UK and US applications, raw documents go to GCP. For German applications, only OCR’d text goes to GCP, with the on-prem OCR service handling the visual extraction. If a future country prohibits even text transit, the architecture supports running a smaller extraction model on-prem with only structured output crossing the boundary.

The Cloud Interconnect link is provisioned at 10 Gbps with encryption in transit. At peak load (300 pages per minute, average 500KB per page for document images), bandwidth usage is about 150 MB/min or 2.5 MB/sec. This is well within the link capacity, but latency matters more than bandwidth. Each round trip between on-prem and GCP adds 5-15ms of network latency depending on the physical distance between the data center and the GCP region. For a single document extraction, this is negligible. For cross-document verification that makes 3-5 on-prem API calls, it adds up to 50-75ms. Still acceptable, but worth monitoring.
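The bandwidth figures above are straightforward to verify. A quick arithmetic check, using the peak page rate and average page size from the text:

```python
# Quick check of the Cloud Interconnect bandwidth math: peak pages per
# minute times average page size, converted to MB/min and MB/s.

PAGES_PER_MIN = 300      # peak extraction load
KB_PER_PAGE = 500        # average document image size

mb_per_min = PAGES_PER_MIN * KB_PER_PAGE / 1000
mb_per_sec = mb_per_min / 60

print(mb_per_min, round(mb_per_sec, 1))  # → 150.0 2.5
```

At 2.5 MB/s against a 10 Gbps (~1,250 MB/s) link, utilization is well under 1%, which is why the text focuses on round-trip latency rather than throughput.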

Hybrid Cloud Architecture

Step 3: The Durable Workflow Engine, Where Intelligence Lives (and Doesn’t)

The mortgage lifecycle (Application Received, Document Collection, Verification, Underwriting, Decision, Closing) is a state machine. A durable workflow engine owns this state machine. Not an LLM. Not an agent.

This is a point worth emphasizing because it is tempting to build the whole thing as an agentic loop: “here’s a mortgage application, figure out what to do.” That approach fails for three reasons.

First, regulatory auditability. Regulators want to see a defined process with clear checkpoints. “The model decided to check compliance after extraction” is not auditable. “Step 4 of the workflow is compliance check, triggered after step 3 completion” is.

Second, failure recovery. If the system crashes mid-processing, a durable workflow engine (Temporal, in this case) picks up exactly where it left off. An agentic loop would need to re-reason about the entire application state.

Third, human intervention points. When a loan officer needs to review a flag, the workflow pauses at a defined checkpoint and resumes when the officer acts. A free-form agent loop has no natural pause points.

What Is an Agent and What Is Not

| Workflow Step | What Runs | Why |
|---|---|---|
| Document intake and storage | Durable workflow engine | Deterministic: receive document, validate format, store in document management system, update application status |
| Document type classification | Fine-tuned BERT classifier | Classification, not generation. 50+ document types across 3 countries. A 110M parameter model handles this at sub-10ms latency |
| Data extraction from individual documents | LLM agent (Gemini Pro via Vertex AI) | Unstructured documents, varied formats, needs visual understanding for tables and layouts |
| Cross-document verification | LLM agent with tool access | Requires reading multiple documents together, reasoning about contradictions, querying on-prem APIs for corroboration |
| Compliance threshold checks | Deterministic rules engine | LTV must be under 80%. DTI must be under 43%. These are arithmetic checks with country-specific parameters |
| Compliance disclosure generation | LLM generation with retrieval | Country-specific disclosure language, but the content is standardized. RAG over disclosure templates |
| Underwriting narrative draft | LLM generation | Summarizing structured findings into human-readable narrative. Pure generation task |
| Final decision | Human (loan officer) | The agent surfaces findings. The human decides. Always |

The agents (extraction, cross-document verification, narrative generation) run as activities within the Temporal workflow. Each activity has a timeout (extraction: 60 seconds per document, cross-document verification: 120 seconds, narrative generation: 90 seconds). If an activity times out, the workflow engine retries with exponential backoff, up to 3 attempts. If all retries fail, the application is routed to a manual processing queue with all partial results preserved.
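Temporal provides this retry behavior natively through its activity retry policies. As a plain-Python sketch of the semantics the workflow relies on (the function names and the manual-queue payload shape are illustrative, not the Temporal SDK API):

```python
import time

# Plain-Python sketch of the retry semantics described above: up to 3
# attempts with exponential backoff, then route to a manual processing
# queue with partial results preserved. In production this is Temporal's
# built-in RetryPolicy; these names are illustrative stand-ins.

def run_with_retries(activity, payload, max_attempts=3, base_delay=1.0,
                     sleep=time.sleep):
    for attempt in range(1, max_attempts + 1):
        try:
            return activity(payload)
        except TimeoutError:
            if attempt == max_attempts:
                # All retries exhausted: preserve what we have for humans.
                return {"status": "manual_review", "partial": payload}
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s...

# Demo: an activity that times out twice, then succeeds on attempt 3.
calls = []
def flaky_extraction(doc):
    calls.append(doc)
    if len(calls) < 3:
        raise TimeoutError
    return {"status": "extracted"}

result = run_with_retries(flaky_extraction, "paystub.pdf", sleep=lambda s: None)
print(result)  # → {'status': 'extracted'}
```

The key property, which Temporal also guarantees, is that a final failure does not lose work: partial results travel with the application into the manual queue.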

The Agent’s Think-Act Loop

The cross-document verification agent is the most genuinely “agentic” component. It does not follow a fixed script. It reasons about what it sees and decides what to investigate next. Here is a concrete example:

  1. Extract fields from the pay stub: employer “Acme Corp,” annual salary $85,000, pay period March 2026.
  2. Compare with employment letter: employer “Acme Inc,” start date January 2023, annual salary $75,000.
  3. Notice two discrepancies: employer name mismatch (“Corp” vs “Inc”) and salary mismatch ($85,000 vs $75,000).
  4. Reason: The name could be a legal name vs. DBA situation. The salary difference could be a raise between when the letter was written and the current pay stub.
  5. Act: Query the tax return via on-prem data API. The tax return shows employer TIN (Tax Identification Number) matching “Acme Corp” and reported income of $83,500 for the previous year.
  6. Evaluate: TIN match confirms same entity (name variant, not a different employer). Income trajectory ($83,500 last year, $85,000 current) is consistent with a raise. The employment letter’s $75,000 is likely the starting salary from 2023.
  7. Resolve: Flag as informational (name variant detected, income consistent with career progression), not blocking.

This is genuine ReAct-style reasoning. But it is bounded inside a single Temporal activity with a 120-second timeout. The agent cannot decide to go check the applicant’s social media or call an external verification service. Its tools are limited to: read extracted data from other documents in the application, query the on-prem data API for banking records, and write flags with severity and evidence.

Workflow State Machine

Step 4: Multi-Country Expansion, Country as Configuration, Not a Fork

The bank operates in 3 countries today and plans to expand to 8. The worst possible architecture would be: fork the codebase per country. Separate UK pipeline, separate German pipeline, separate US pipeline. Every bug fix applied 3 times. Every model update deployed 3 times. Every new feature built 3 times.

Instead, country differences are captured in configuration that the workflow engine and agents read at runtime. Adding a new country means adding a configuration object, not writing new code.

What Changes Per Country

| Dimension | UK | Germany | US |
|---|---|---|---|
| Key document types | P60, P45, council tax bill, HMRC statement | Lohnsteuerbescheinigung, Grundbuchauszug, Schufa report | W-2, 1040, HUD-1, Social Security statement |
| Compliance framework | FCA mortgage conduct rules, MCOB | BaFin Wohnimmobilienkreditrichtlinie | TILA, RESPA, Dodd-Frank QM rules |
| Max LTV | 95% (with mortgage insurance) | 80% (typical, varies by lender) | 97% (conventional with PMI) |
| Data residency policy | Cloud processing allowed | Strict: extracted text only, no document images | Cloud processing allowed |
| Credit bureau API | Experian UK, Equifax UK | SCHUFA | Equifax, TransUnion, Experian US |
| Property valuation API | Land Registry, RICS valuation | Grundbuchamt, Gutachterausschuss | County recorder, Zillow/Redfin API |
| Primary language | English | German | English |
| Disclosure templates | ESIS (European Standardised Information Sheet) | ESIS (German version) + BaFin-specific | Loan Estimate, Closing Disclosure |

Country Configuration Structure

# Country config loaded at workflow start, passed to all agents
country_config = {
    "country_code": "DE",
    "data_residency": {
        "raw_documents_to_cloud": False,
        "extracted_text_to_cloud": True,
        "structured_output_to_cloud": True,
        "on_prem_extraction_required": False
    },
    "document_types": {
        "income_proof": ["lohnsteuerbescheinigung", "gehaltsabrechnung"],
        "property_title": ["grundbuchauszug"],
        "credit_report": ["schufa_auskunft"],
        "employment": ["arbeitgeberbescheinigung"],
        "bank_statements": ["kontoauszug"]
    },
    "compliance_rules": {
        "max_ltv": 0.80,
        "max_dti": 0.40,
        "disclosure_template": "esis_de_v3",
        "regulatory_framework": "bafin_wikr"
    },
    "tool_registry": {
        "credit_bureau": "schufa_api_v2",
        "property_registry": "grundbuchamt_api",
        "employment_verification": "elster_api"
    },
    "extraction_config": {
        "primary_language": "de",
        "extraction_model": "gemini-1.5-pro",
        "ocr_service": "document_ai_de",
        "prompt_template_version": "de_v4"
    }
}

When a new application arrives, the workflow engine reads the country code, loads the corresponding configuration, and passes it to every downstream activity. The extraction agent uses it to select the right prompt templates and document type schemas. The compliance check uses it to load the right rules. The tool calls use it to route to the right credit bureau and property registry.
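The routing logic this enables can be sketched in a few lines. This is a minimal illustration of how the data residency flags drive what crosses the boundary; the config shapes follow the DE example above, and the UK/US entries are condensed stand-ins:

```python
# Illustrative config-driven routing: the workflow reads the country
# code and the data residency flags decide whether raw images or only
# OCR'd text cross Cloud Interconnect. Config entries are condensed
# stand-ins for the full country config shown above.

COUNTRY_CONFIGS = {
    "DE": {"data_residency": {"raw_documents_to_cloud": False}},
    "UK": {"data_residency": {"raw_documents_to_cloud": True}},
    "US": {"data_residency": {"raw_documents_to_cloud": True}},
}

def route_document_payload(country_code, doc):
    cfg = COUNTRY_CONFIGS[country_code]
    if cfg["data_residency"]["raw_documents_to_cloud"]:
        return {"kind": "raw_image", "payload": doc["image"]}
    # Strict jurisdictions: only on-prem OCR output crosses the boundary.
    return {"kind": "ocr_text", "payload": doc["ocr_text"]}
```

Adding France would mean adding an `"FR"` entry, not touching `route_document_payload`.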

Adding a new country (say, France) means creating a new configuration file with French document types, French compliance rules, and French tool registrations. The workflow code, agent code, and infrastructure are unchanged. The extraction prompts need tuning for French document types, which is a prompt engineering task, not a software engineering task.

The one exception: if a new country has entirely different data residency requirements (say, on-prem LLM inference required), the infrastructure team needs to provision on-prem GPU capacity. This is an infrastructure change, not a code change, but it is not trivial.

Country Config Routing

Step 5: Document Processing Deep Dive

Document extraction is the hardest technical problem in this system. It is where the LLM does the heavy lifting, and where most of the accuracy challenges live.

The Document Processing Pipeline

Each document flows through a multi-stage pipeline:

Stage 1: Type classification. Before extracting anything, the system needs to know what kind of document it is looking at. A pay stub, a tax return, an employment letter, and a bank statement all have different extraction schemas. The classifier is a fine-tuned BERT model (not an LLM, classification does not need generation) trained on 10,000+ labeled mortgage documents across all supported types. Accuracy: 97.5% on known document types, with an “unknown/other” category for documents that do not match any known type.

For unknown documents, the system falls back to a generic extraction prompt that asks the LLM to identify the document type and extract whatever structured fields it can find. These always go to human review.

Stage 2: Vision model vs. OCR + text approach. This is not a binary choice. The system uses both, depending on document quality and data residency.

| Approach | When to Use | Accuracy | Cost | Data Residency |
|---|---|---|---|---|
| Vision model (Gemini Pro with image input) | High-quality scans, digital PDFs, countries allowing raw doc transit | 94-96% field accuracy | $0.01-0.03 per page | Document image goes to GCP |
| OCR (Document AI) + text-based LLM | Poor quality scans, handwritten notes, countries restricting image transit | 88-92% field accuracy | $0.005-0.01 per page (OCR) + $0.005-0.01 (LLM) | Only text goes to GCP |
| Vision model on-prem | Countries prohibiting any cloud transit | 90-93% (smaller model) | $0.02-0.05 per page (GPU amortization) | Nothing leaves on-prem |

The quality assessment happens on-prem. The document gateway runs a lightweight image quality check (resolution, contrast, skew angle) and a text extraction attempt. If the text extraction produces garbled output (common with poor scans or handwritten documents), the document is flagged for vision model processing.

Stage 3: Field extraction with structured output. The extraction prompt is specific to the document type (loaded from the country config). It includes the expected output schema, a few examples of correct extractions for that document type, and explicit instructions for handling edge cases.

# Extraction prompt template for UK P60
extraction_prompt = """
You are extracting structured data from a UK P60
(End of Year Certificate).

Extract the following fields. For each field, provide:
- field_name: the canonical field name
- value: the extracted value
- confidence: your confidence score (0.0 to 1.0)
- source_location: where on the document you found this

Required fields:
- employer_name (text)
- employer_paye_ref (format: ###/XXXX)
- employee_name (text)
- national_insurance_number (format: XX######X)
- tax_year (format: YYYY-YYYY)
- total_pay (numeric, GBP)
- total_tax_deducted (numeric, GBP)
- employee_ni_contributions (numeric, GBP)

If a field is not present, set value to null and
confidence to 0.0.

If a field is partially legible, extract what you can
and set confidence accordingly.

Respond in the following JSON schema:
{schema}

Document:
{document_content}
"""

The key design choice: structured output with a defined JSON schema. The LLM is not generating free-form text. It is filling in a schema. This means downstream systems always get clean, typed data. No parsing surprises. Vertex AI’s structured output mode (response_schema parameter) enforces the schema at the token level, so the model cannot produce output that does not conform.
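Even with token-level schema enforcement, a cheap defensive check downstream is good practice. A minimal post-validation sketch for the P60 output, using only the standard library (field names follow the prompt above; the format regexes are simplified assumptions):

```python
import json
import re

# Minimal post-validation sketch for the P60 extraction output. Field
# names follow the extraction prompt above; the NI-number regex is a
# simplified assumption, not the full HMRC validation rule.

REQUIRED = {"employer_name", "employer_paye_ref", "employee_name",
            "national_insurance_number", "tax_year", "total_pay",
            "total_tax_deducted", "employee_ni_contributions"}

def validate_p60(raw_json):
    data = json.loads(raw_json)
    fields = {f["field_name"]: f for f in data["fields"]}
    missing = REQUIRED - fields.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    ni = fields["national_insurance_number"]["value"]
    # Null values are allowed (field not present); non-null must match format.
    if ni is not None and not re.fullmatch(r"[A-Z]{2}\d{6}[A-Z]", ni):
        raise ValueError("NI number format check failed")
    return fields
```

If the schema-enforced output ever drifts (a model or prompt version change, say), this check fails loudly instead of letting malformed data reach the loan officer.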

Stage 4: Confidence scoring. Each extracted field gets a confidence score from the LLM. But LLM confidence scores are not calibrated out of the box. A model that says “0.95 confidence” might be wrong 15% of the time at that score level. The system includes a calibration layer trained on historical extractions where human reviewers verified the correct values.

The calibration works like this: for every (document type, field, raw LLM confidence) triple, the system looks up the historical accuracy at that confidence level. If the LLM says 0.90 for “employer_name” on German pay stubs, and historically that field is correct 96% of the time at 0.90 raw confidence, the calibrated confidence is 0.96. If the LLM says 0.90 for “total_pay” on handwritten receipts, and historically that is correct only 78% of the time, the calibrated confidence is 0.78.

Fields with calibrated confidence below 0.85 are flagged for human review. Fields below 0.70 are not used in downstream processing at all (they are shown to the loan officer as “low confidence, please verify manually”).
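The calibration lookup and the review thresholds can be sketched together. The table entries below are the illustrative numbers from the text; a real system would bin raw scores and populate the table from the historical review data:

```python
# Sketch of the calibration lookup described above: map (doc type, field,
# binned raw confidence) to historically observed accuracy. The two table
# entries are the illustrative numbers from the text.

CALIBRATION = {
    ("de_paystub", "employer_name", 0.9): 0.96,
    ("handwritten_receipt", "total_pay", 0.9): 0.78,
}

def calibrate(doc_type, field, raw_conf):
    binned = round(raw_conf, 1)  # bin raw scores to one decimal
    # Fall back to the raw score when no history exists for this triple.
    return CALIBRATION.get((doc_type, field, binned), raw_conf)

def review_action(calibrated):
    if calibrated < 0.70:
        return "exclude_and_show_warning"  # not used downstream at all
    if calibrated < 0.85:
        return "human_review"
    return "accept"
```

Note how the same 0.90 raw score produces different actions: the German pay stub field is accepted, while the handwritten field drops to 0.78 and goes to human review.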

Stage 5: Cross-reference check. This is where the verification agent takes over (covered in detail in Step 3’s think-act loop). The extraction agent produces structured data for each document independently. The verification agent reads the full set and looks for inconsistencies.

How “RAG for Knowledge, Fine-tune for Behavior” Applies

The extraction system uses both retrieval and fine-tuning, but for different purposes.

RAG for knowledge: When the extraction agent encounters an unfamiliar document format (a pay stub from an employer it has not seen before), it retrieves similar document examples from an indexed store of previously processed documents (with PII redacted). These examples are included in the prompt as few-shot demonstrations. The retrieval is by document type and visual similarity (embedded document layout features). This means the agent’s knowledge of document formats updates continuously as new documents are processed, without any model retraining.

Fine-tuning for behavior: The extraction agent’s behavior (how it formats confidence scores, how it handles ambiguous fields, when it flags vs. resolves) is shaped by fine-tuning on 5,000 human-reviewed extraction examples. The fine-tuned model consistently follows the output schema, assigns calibrated confidence scores, and handles edge cases (partial values, multiple possible interpretations) in a predictable way. This behavior does not change when new document formats are added. It is baked into the model weights.

The boundary is clean: new document knowledge arrives through retrieval (updated continuously). Extraction behavior is set through fine-tuning (updated quarterly, with offline evaluation before deployment).
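The retrieval half of this boundary is simple enough to sketch. A minimal version, assuming the layout embedding comes from an upstream model and the example store already holds redacted extractions:

```python
import math

# Sketch of the few-shot retrieval described above: filter the redacted
# example store by document type, rank by cosine similarity of layout
# embeddings, and return the top-k to splice into the prompt. The layout
# embedding itself is assumed to come from an upstream model.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve_examples(doc_type, layout_vec, example_store, k=3):
    candidates = [ex for ex in example_store if ex["doc_type"] == doc_type]
    candidates.sort(key=lambda ex: cosine(layout_vec, ex["layout_vec"]),
                    reverse=True)
    return candidates[:k]
```

Because new examples only enter through this store, knowledge of new document formats updates continuously with zero retraining, exactly the division of labor the text describes.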

Document Processing Pipeline

Handling Multi-Page Documents

Many mortgage documents span multiple pages. A bank statement is 3 to 12 pages. A tax return can be 5 to 15 pages. The extraction approach depends on document length.

For documents under 5 pages: send all pages in a single LLM call. The context window of Gemini 1.5 Pro (1M tokens) easily handles this. The model sees the full document and can reason about cross-page references (like a subtotal on page 1 that should match line items on pages 2 and 3).

For documents over 5 pages: split into logical sections (detected by headers, page breaks, or section markers) and extract each section independently, then run a reconciliation pass that checks for cross-section consistency. This is not about context window limits (the model can handle the full document). It is about extraction accuracy. Empirically, extraction accuracy drops by 2 to 4% for documents over 5 pages in a single call, likely because the model’s attention becomes diffuse over very long visual inputs.
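The routing decision can be sketched as a simple planner. This version uses a naive fixed-size page chunker as a stand-in for the header-based section splitter described above:

```python
# Sketch of the length-based routing for multi-page documents: short
# documents go to the model in one call; longer ones are split into
# sections and later reconciled. Fixed-size chunking here is a naive
# stand-in for the header/section-marker splitter described in the text.

def plan_extraction(pages, single_call_max=5, section_size=4):
    if len(pages) <= single_call_max:
        return [pages]  # one call with the full document
    return [pages[i:i + section_size]
            for i in range(0, len(pages), section_size)]
```

A 3-page employment letter yields one call; a 12-page bank statement yields three section calls plus the reconciliation pass.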

Handling Poor Quality Scans

About 10 to 15% of documents are poor quality scans: low resolution, skewed, partially cut off, or with handwritten annotations. The pipeline handles these through a quality-aware routing system.

  1. On-prem quality assessment: resolution check (below 150 DPI is flagged), skew detection, text extraction attempt
  2. If quality is sufficient: proceed with vision model extraction
  3. If quality is borderline: run both vision model and OCR + text extraction, use the higher-confidence result
  4. If quality is poor (below 100 DPI, heavy skew, significant portions cut off): flag for human review with a note explaining the quality issue. Do not attempt automated extraction on documents that are likely to produce unreliable results. A missing extraction is better than a wrong one.

Handwritten annotations (common on employment letters and property appraisals) get special handling. The vision model can read most handwriting, but confidence scores on handwritten fields are systematically lower. The calibration layer accounts for this: a 0.80 confidence on a handwritten field maps to roughly 0.65 calibrated confidence, which triggers human review.
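The routing rules in the numbered list above reduce to a small decision function. The DPI thresholds follow the text; the skew flag is assumed to come from the on-prem quality check:

```python
# Sketch of the quality-aware routing described above. The 150/100 DPI
# thresholds follow the text; heavy_skew is assumed to be produced by
# the on-prem quality assessment.

def route_document(dpi, heavy_skew):
    if dpi < 100 or heavy_skew:
        # Do not attempt automated extraction on documents likely to
        # produce unreliable results; a missing extraction beats a wrong one.
        return "human_review"
    if dpi < 150:
        # Borderline: run both paths, keep the higher-confidence result.
        return "dual_path"
    return "vision_model"
```
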

Step 6: Cross-Document Verification Deep Dive

The verification agent is the most complex component in the system. It is also where the most business value comes from, because it catches inconsistencies that human reviewers often miss when they are processing documents one at a time.

What the Agent Verifies

| Check | Documents Involved | What It Looks For | Severity |
|---|---|---|---|
| Income consistency | Pay stubs, tax returns, employment letter, bank statements | Salary figures should be consistent across documents (allowing for timing differences and raises) | Blocking if >15% discrepancy |
| Employment continuity | Employment letter, tax returns (multi-year) | Continuous employment claimed should match tax filing history | Blocking if gaps not explained |
| Property value vs. purchase price | Appraisal, purchase agreement | Appraisal at or above purchase price for LTV calculation | Blocking if appraisal < purchase price |
| Identity consistency | All documents | Name, address, dates of birth should match across documents (allowing for name variants) | Blocking if name mismatch cannot be resolved |
| Debt obligations | Bank statements, credit report, loan application | Declared debts should match what shows up in credit report and bank statements | Informational (for DTI calculation accuracy) |
| Date consistency | All documents | Document dates should be recent (within bank’s recency requirements, typically 30-90 days) | Blocking if documents are stale |

The Verification Agent’s Architecture

The verification agent runs as a ReAct loop within a Temporal activity. It has access to three tools:

  1. read_extracted_data(document_id): Returns the structured extraction for a specific document in the application
  2. query_banking_data(account_id, query_type, params): Queries the on-prem data API for banking records (account balances, transaction history, credit bureau data)
  3. write_flag(field, severity, evidence, recommendation): Records a verification finding

The agent’s system prompt includes the verification checklist (the table above) and instructions for how to reason about discrepancies. Critically, it also includes examples of discrepancies that are NOT problems:

  • Name spelled “MacDonald” on one document and “Mcdonald” on another (common variant)
  • Salary on pay stub is slightly higher than employment letter (raise since letter was written)
  • Address on tax return is different from current application (applicant moved)

These “false positive” examples are important. Without them, the agent flags every minor inconsistency, and loan officers learn to ignore the flags. The false positive rate goes up, the system’s credibility goes down, and eventually nobody looks at the flags at all.

(Figure: Agent Verification Loop)

Bounding the Agent

The verification agent is the closest thing to a “free-form reasoning” component in this system. That makes it the riskiest. Two guardrails keep it bounded:

Timeout: The Temporal activity has a 120-second timeout. If the agent has not completed verification within 120 seconds, the activity fails and the workflow engine either retries or routes to human review. In practice, verification completes in 30 to 60 seconds for a typical application. The 120-second timeout catches runaway loops where the agent keeps querying for corroboration that does not exist.

Tool budget: The agent is limited to 15 tool calls per verification run. Each read_extracted_data call counts as 1. Each query_banking_data call counts as 1. Each write_flag counts as 1. This prevents the agent from making 50 API calls trying to resolve an ambiguous situation. If it cannot resolve a discrepancy in 15 tool calls, it flags it for human review and moves on.
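The two guardrails can be sketched as a wrapper around the ReAct loop. This is a simplified illustration, not Temporal's actual timeout mechanism (in production the activity timeout is enforced by the workflow engine itself); `llm_step` and the tool functions are hypothetical stand-ins for the agent runtime.

```python
# Sketch of the two guardrails: a wall-clock deadline and a tool-call
# budget around a ReAct-style loop. Exceeding either routes the
# application to human review instead of letting the agent run on.
import time

class BudgetExceeded(Exception):
    pass

def run_verification(llm_step, tools, max_tool_calls=15, timeout_s=120):
    deadline = time.monotonic() + timeout_s
    calls = 0
    state = {"flags": [], "done": False}
    while not state["done"]:
        if time.monotonic() > deadline:
            raise TimeoutError("verification exceeded timeout; route to human review")
        action = llm_step(state)            # model decides the next tool call
        if action["tool"] == "finish":
            state["done"] = True
            continue
        calls += 1
        if calls > max_tool_calls:
            raise BudgetExceeded("tool budget exhausted; flag for human review")
        state[action["tool"]] = tools[action["tool"]](**action["args"])
    return state["flags"]
```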

These constraints are not about cost (though they help). They are about predictability. A loan officer needs to know that when they come in to work, the overnight batch has processed and results are ready. An agent that occasionally spends 10 minutes on one application and delays the rest of the queue breaks that expectation.

Failure Modes

Every production system fails. The question is how.

Silent Extraction Errors (The Dangerous One)

What happens: The LLM extracts a field value with high confidence, but the value is wrong. For example, it reads “$85,000” as the annual salary from a pay stub, but the actual value is “$65,000” (the “6” in the scan looks like an “8”).

How likely: About 1 to 3% of fields at the 0.90+ confidence level, based on calibration data. At 20 documents per application with 8 to 12 fields each, that is 1 to 5 silently wrong fields per application.

How to detect: The cross-document verification agent is the primary defense. If the pay stub says $85,000 but bank deposits average $5,400/month ($64,800 annualized), the agent catches the discrepancy. But this only works when there is a corroborating document to check against. For fields that appear in only one document (like a property’s lot size from the appraisal), there is no cross-reference.

How to handle: (1) Calibrated confidence scoring, so the system knows which fields at which confidence levels are actually reliable. (2) Mandatory human review for high-impact fields (income, property value, outstanding debts) regardless of confidence score. (3) Dual extraction: run the extraction twice with different temperature settings and flag any field where the two runs disagree. This catches about 60% of silent errors at the cost of doubling extraction compute.
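The dual-extraction check (item 3) is straightforward to sketch: run the extraction twice and flag any field where the runs disagree. `extract` here is a hypothetical wrapper around the model call that accepts a temperature parameter.

```python
# Dual-extraction sketch: two runs at different temperatures; any field
# where they disagree is a silent-error candidate and goes to human review.
def dual_extract(document, extract) -> dict:
    run_a = extract(document, temperature=0.0)
    run_b = extract(document, temperature=0.4)
    agreed, disputed = {}, []
    for field in set(run_a) | set(run_b):
        if run_a.get(field) == run_b.get(field):
            agreed[field] = run_a[field]
        else:
            disputed.append(field)
    return {"fields": agreed, "disputed": sorted(disputed)}
```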

On-Prem Data API Downtime

What happens: The verification agent tries to query banking records (credit history, account balances) and the on-prem API is unavailable. The agent cannot cross-reference.

How likely: On-prem systems typically have lower availability than cloud services. Budget for 1 to 2 hours of downtime per month during maintenance windows and occasional unplanned outages.

How to detect: API health checks every 30 seconds. The workflow engine checks API availability before starting verification activities.

How to handle: The workflow engine pauses verification activities and moves to the next application in the queue. When the API comes back, paused applications resume. If the API is down for more than 4 hours, applications are routed to a manual processing queue with a note that automated cross-referencing was not possible. Partial results (document-to-document verification that does not require API calls) are still provided.

Unsupported Document Format or Language

What happens: An applicant submits a document in a language or format the system does not support. A French tax return in a system configured for UK, Germany, and US.

How likely: Rare if the intake process is well-configured (document requirements are communicated upfront). But it happens, especially for applicants who recently relocated from a non-supported country.

How to detect: The document type classifier outputs “unknown” with high confidence (the model is trained to recognize what it does not know). Language detection runs on extracted text.

How to handle: Route to human review with the original document and a note: “Document type not recognized. Language detected: French. Manual processing required.” Do not attempt extraction on unsupported types. A failed extraction with partial results is worse than no extraction, because downstream systems might use the partial results.

Name Mismatches That Are Not Fraud

What happens: The verification agent flags a name discrepancy that is actually a legitimate variant. “Catherine Smith” vs. “Kate Smith.” “Mohammed Al-Rahman” vs. “Mohammad Alrahman.”

How likely: Very common. In multi-country processing, name transliteration alone accounts for 5 to 10% of all name mismatches across documents. Married name vs. maiden name is another frequent source.

How to detect: The agent tries to resolve these using the on-prem data API (check if both names are associated with the same customer record). The calibration layer also tracks historical false positive rates for name mismatches per country, feeding back into the agent’s examples.

How to handle: The agent’s system prompt includes explicit examples of legitimate name variants (shortened names, transliteration differences, maiden names). When a name mismatch is detected, the agent first checks for these patterns before flagging. If the mismatch fits a known pattern and is supported by other corroborating identity data (same date of birth, same address, same national ID number), it resolves as informational rather than blocking.
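The pattern-check-before-flagging step can be sketched deterministically. The normalization and the nickname table below are simplified assumptions (the real system leans on the LLM plus the on-prem customer record); the "two or more corroborating identity fields" rule mirrors the text.

```python
# Sketch of name-variant resolution: normalize both names, and only treat
# a residual mismatch as blocking when corroborating identity fields
# (DOB, address, national ID) also fail to line up.
import unicodedata

NICKNAMES = {"kate": "catherine", "mike": "michael"}  # illustrative subset

def normalize(name: str) -> str:
    # strip accents and punctuation, lowercase, expand known short forms
    s = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    s = "".join(ch for ch in s.lower() if ch.isalpha() or ch == " ")
    return " ".join(NICKNAMES.get(tok, tok) for tok in s.split())

def assess_name_mismatch(name_a: str, name_b: str, corroborating_matches: int) -> str:
    if normalize(name_a) == normalize(name_b):
        return "resolved"
    return "informational" if corroborating_matches >= 2 else "blocking"
```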

Workflow Engine Failure Mid-Application

What happens: The Temporal cluster or a worker node crashes while processing an application.

How likely: Temporal is designed for exactly this scenario. Worker crashes are expected, not exceptional. The cluster itself has high availability with multiple replicas.

How to detect: Temporal’s built-in heartbeat mechanism. Workers send heartbeats during long-running activities. If a heartbeat is missed, the cluster reassigns the activity to another worker.

How to handle: This is Temporal’s core value proposition. The workflow state is durably persisted. When a worker crashes, a new worker picks up the workflow from the last checkpoint. No data is lost, and no completed work is repeated; an interrupted activity is simply retried, which is safe because activities are idempotent by design. The loan officer does not even know it happened.

Cloud Interconnect Latency Spikes

What happens: Network latency between on-prem and GCP spikes from the normal 5 to 15ms to 200ms or higher. Every API call to the on-prem data API slows down. Verification that normally takes 30 to 60 seconds takes 3 to 5 minutes.

How likely: Uncommon but not rare. Network congestion, maintenance on the interconnect, or routing changes can cause transient spikes. Budget for 1 to 2 events per month lasting 15 to 60 minutes.

How to detect: Latency monitoring on every cross-boundary call. P95 latency alerts at 50ms (warning) and 200ms (critical).

How to handle: (1) The workflow engine increases activity timeouts dynamically when it detects elevated latency (from 120 seconds to 300 seconds). (2) Non-critical on-prem API calls are batched to reduce round trips. (3) Results from on-prem API calls are cached in the GCP result cache for the duration of an application’s processing. If the same banking data is needed by multiple verification checks, it is fetched once. (4) If latency exceeds 500ms sustained for more than 10 minutes, new application processing is paused and queued until the network stabilizes.

Operational Concerns

Monitoring

Three monitoring dimensions matter most:

Extraction accuracy per country and document type. This is the core quality metric. Tracked through a continuous feedback loop: when loan officers modify an extracted value, that correction is logged as ground truth. Weekly accuracy reports per (country, document type, field) triple. If accuracy for German pay stubs drops below 90%, the on-call team investigates (prompt regression? new employer format?).

Latency across the hybrid boundary. Every cross-boundary call (document transit to GCP, on-prem API queries, result writes back to on-prem) is instrumented with traces. Dashboards show P50, P95, and P99 latencies broken down by call type and country. Alerts fire on P95 regression.

Agent tool call success rates. Track the verification agent’s tool calls: how often does query_banking_data succeed vs. fail? How many tool calls per verification run on average? A spike in tool call count (say, from an average of 8 to an average of 14) might indicate the agent is struggling with a new document type or a prompt regression that causes it to chase false leads.

Cost Breakdown

| Cost Category | Monthly Cost (3 countries, 4K applications) | Notes |
| --- | --- | --- |
| Vertex AI inference (extraction) | $4,000-5,000 | ~$1.00-1.25 per application for extraction |
| Vertex AI inference (verification) | $2,500-3,200 | ~$0.60-0.80 per application |
| Vertex AI inference (narrative) | $1,500-2,000 | ~$0.35-0.50 per application |
| Cloud Interconnect (10 Gbps dedicated) | $1,700 | Fixed monthly cost, shared across all workloads |
| GKE cluster (Temporal workers + agent runtime) | $3,000-4,000 | 8-12 nodes, autoscaling |
| On-prem OCR service (Document AI on-prem) | $2,000-3,000 | Hardware amortization + licensing |
| On-prem data API infrastructure | $1,500 | Allocated cost of API servers and load balancers |
| Total infrastructure | $16,200-20,400 | |
| Total per application | $4.05-5.10 | |
| Manual processing cost per application (document review only) | $360-540 | 8-12 hours at $45/hr |
| Net savings per application | $355-535 | |
| Monthly savings | $1.42M-2.14M | At 4,000 applications |

The cost comparison is not even close. Even doubling the infrastructure costs for safety margin and adding a 3-person engineering team ($75K/month loaded cost), the system pays for itself many times over. The real risk is not cost. It is accuracy. If extraction quality is bad enough that loan officers spend nearly as much time correcting the system’s output as they would have spent doing it manually, the time savings evaporate.

Compliance Audit Trail

Every interaction between the system and a mortgage application is logged:

  • Document ingestion: timestamp, document hash, source channel, document type classification result
  • Extraction: input document reference, model version, prompt version, every extracted field with its value, raw confidence, calibrated confidence
  • Verification: every tool call (query sent, response received), every comparison made, every flag written with evidence
  • Narrative generation: input structured data, generated narrative text, model version
  • Human modifications: every field that a loan officer changed, with before/after values and timestamp

This audit trail is stored on-prem (it contains references to customer documents). It is immutable (append-only, no updates or deletes). Compliance teams can query it to answer questions like: “For application X, show me every extracted value, the confidence level, whether it was human-verified, and the evidence chain for every flag.”

For regulatory exams, the bank can demonstrate that (1) every LLM decision has a traceable input, (2) confidence levels are calibrated and monitored, (3) humans reviewed all blocking flags, and (4) model and prompt versions are tracked so any extraction can be reproduced.

Model Updates and Rollouts

Prompt and model updates are the riskiest operational change. A new prompt template that improves German pay stub extraction might degrade German employment letter extraction. A Vertex AI model version update might change extraction behavior in subtle ways.

The rollout strategy:

  1. Shadow mode: New prompt/model runs alongside the current production version. Both extract from the same documents. Results are compared but only the production version’s output is used. Run for 1 to 2 weeks, comparing field-by-field accuracy.
  2. Canary rollout: New version handles 5% of applications for a specific country. Loan officers see both versions’ outputs and can flag which is better. Run for 1 week.
  3. Gradual rollout: 5% to 25% to 50% to 100%, with accuracy monitoring at each stage. Any accuracy regression triggers automatic rollback.
  4. In-flight applications: Applications that started processing on version N complete on version N. The workflow engine records the model/prompt version at the start of processing and uses that version for all subsequent activities on that application. This prevents an application from getting half its documents extracted with one version and half with another.

Human Override Flow

When a loan officer disagrees with an extracted value or a flag:

  1. Officer modifies the value in the review interface. The modification is logged with before/after values.
  2. The system re-runs downstream checks that depended on the modified value. If the officer changes the income figure, the DTI check re-runs automatically.
  3. The correction is added to the training data pool for future fine-tuning cycles (with officer permission and PII handling).
  4. If the same field from the same document type is corrected more than 10% of the time across all officers, an automatic alert fires to the ML team to investigate whether the extraction prompt needs tuning.
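The 10% correction-rate alert (step 4) can be sketched as a simple counter. The storage, the alert sink, and the minimum-sample guard are assumptions; the real system would back this with persistent telemetry.

```python
# Sketch of the correction-rate alert: when a (country, doc type, field)
# triple is corrected by officers more than 10% of the time, fire an alert
# to the ML team. min_samples avoids alerting on tiny samples (assumption).
from collections import defaultdict

class CorrectionTracker:
    def __init__(self, threshold: float = 0.10, min_samples: int = 50):
        self.seen = defaultdict(int)
        self.corrected = defaultdict(int)
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, key, was_corrected: bool) -> bool:
        """Returns True when this observation puts the key over threshold."""
        self.seen[key] += 1
        if was_corrected:
            self.corrected[key] += 1
        rate = self.corrected[key] / self.seen[key]
        return self.seen[key] >= self.min_samples and rate > self.threshold
```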

The correction feedback loop is critical for continuous improvement. Without it, the system’s accuracy is frozen at whatever level it launched with. With it, the system gets better every month as more corrections accumulate and feed back into prompt tuning and calibration.

Going Deeper

Fine-Tuning Per Document Type vs. Few-Shot Prompting

The extraction system could take two approaches for handling the variety of document formats:

Few-shot prompting (current approach for most document types): Include 2 to 3 examples of successful extractions for the specific document type in the prompt. The examples are retrieved from a curated library based on the document type classification. This approach is flexible (adding a new document type means adding examples, not retraining) and works well when the base model is strong enough.

Fine-tuning a specialized extraction model (used for the highest-volume document types): For document types that represent more than 20% of total volume (UK P60s, German Lohnsteuerbescheinigungen, US W-2s), fine-tuning a smaller model (Gemini Flash or equivalent) on 2,000 to 5,000 human-verified extraction examples produces 2 to 4% higher accuracy than few-shot prompting with a frontier model, at 60 to 70% lower inference cost.

The trade-off: fine-tuned models are harder to update. If the P60 format changes (it has changed twice in the last 5 years), you need new training data and a retraining cycle. Few-shot prompting adapts immediately to format changes by updating the examples. For high-volume, stable document types, fine-tuning wins. For the long tail of less common document types, few-shot prompting is more practical.

Caching Strategies for On-Prem API Calls

The on-prem data API is the latency bottleneck in the hybrid architecture. Every call crosses the Cloud Interconnect with 5 to 15ms base latency, plus the API’s own processing time.

Two caching layers help:

Application-level cache (in the GCP result cache): When processing an application, the first query for a customer’s credit history caches the result. Subsequent queries during the same application’s processing (verification agent checking income against credit report, then checking debts against credit report) hit the cache. TTL: 1 hour (long enough for a single application’s processing, short enough that stale data is not a concern).

Cross-application cache (shared, with strict invalidation): For data that does not change frequently (property registry lookups, employer verification), results can be cached for 24 hours. This helps when multiple applications involve the same property (refinancing a previously processed property) or employer. Cache invalidation fires on any update to the underlying on-prem record.

The caching is not optional. Without it, a single application’s verification phase makes 15 to 25 on-prem API calls. At 10ms per call, that is 150 to 250ms of network latency alone. With caching, unique calls drop to 5 to 8 per application.
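The two layers can be sketched with a TTL cache. In production these would be backed by a shared store with the invalidation hooks described above; the in-memory dict here is illustrative.

```python
# Sketch of the two caching layers: a per-application cache (1h TTL) and a
# cross-application cache (24h TTL with explicit invalidation on updates
# to the underlying on-prem record).
import time

class TTLCache:
    def __init__(self, ttl_s: float):
        self.ttl_s, self.store = ttl_s, {}

    def get(self, key):
        hit = self.store.get(key)
        if hit and time.monotonic() - hit[1] < self.ttl_s:
            return hit[0]
        return None

    def put(self, key, value):
        self.store[key] = (value, time.monotonic())

    def invalidate(self, key):          # fired on on-prem record updates
        self.store.pop(key, None)

app_cache = TTLCache(ttl_s=3600)        # one application's processing window
shared_cache = TTLCache(ttl_s=86400)    # property registry, employer checks

def fetch_banking_data(key, api_call):
    cached = app_cache.get(key)
    if cached is None:
        cached = api_call(key)          # single interconnect round trip
        app_cache.put(key, cached)
    return cached
```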

Building a Document Type Classifier

The document type classifier deserves more detail because it is the first stage in the pipeline and its errors cascade.

Architecture: A fine-tuned BERT model (110M parameters) that takes the first 512 tokens of OCR’d text from a document and classifies it into one of 50+ document types (across all supported countries). Training data: 10,000 labeled documents from the bank’s historical archives, augmented with 5,000 synthetic examples generated by paraphrasing and reformatting real documents.

Why not use the LLM for classification? Cost and latency. The classifier runs in under 10ms on a single CPU core. The LLM would take 1 to 2 seconds and cost 100x more. At 4,000 applications per month with 20 documents each, that is 80,000 classification calls per month. At $0.001 per LLM call, the classifier saves $80/month. Not a huge number, but it adds up, and the latency improvement matters more than the cost savings.

The classifier also includes a calibrated “unknown” threshold. If the maximum softmax probability is below 0.70, the document is classified as “unknown” rather than forced into the highest-probability category. This prevents the classifier from confidently misclassifying a document type it has not been trained on (which would cause the extraction agent to use the wrong schema and produce garbage output).
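The rejection threshold is a few lines over the classifier head's logits. A minimal sketch, with the 0.70 cutoff from the text and illustrative logits:

```python
# Sketch of the calibrated "unknown" threshold: softmax over the classifier
# logits, and refuse to classify below 0.70 max probability.
import math

def classify_with_rejection(logits: dict[str, float], threshold: float = 0.70):
    z = max(logits.values())
    exp = {k: math.exp(v - z) for k, v in logits.items()}  # stable softmax
    total = sum(exp.values())
    label, p = max(((k, v / total) for k, v in exp.items()), key=lambda kv: kv[1])
    return (label, p) if p >= threshold else ("unknown", p)
```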

Handling Regulatory Changes as Config Updates

Mortgage regulations change. LTV limits get adjusted. Disclosure requirements are updated. New compliance checks are added. The system needs to absorb these changes without code deployments.

The compliance rules engine is configured through versioned rule files (stored in the country config):

# compliance_rules_uk_v12.yaml
rules:
  - name: ltv_check
    type: threshold
    field: calculated_ltv
    max_value: 0.95
    condition: "if mortgage_insurance then 0.95 else 0.75"
    effective_date: "2026-01-01"

  - name: dti_check
    type: threshold
    field: calculated_dti
    max_value: 0.45
    effective_date: "2025-06-01"

  - name: stress_test
    type: calculation
    formula: "monthly_payment_at(current_rate + 3.0) / monthly_income"
    max_value: 0.45
    effective_date: "2024-01-01"

  - name: disclosure_esis
    type: template
    template_id: "esis_uk_v8"
    required_fields: ["apr", "total_cost", "monthly_payment", "early_repayment_terms"]
    effective_date: "2025-09-01"

When a regulation changes, the compliance team updates the rule file and increments the version. The workflow engine picks up the new version for new applications. In-flight applications continue with the version that was active when they started processing. The audit trail records which rule version was applied to each application.
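How a rules engine might evaluate threshold rules like the ones in the config above can be sketched briefly. Only the `threshold` rule type is shown (`calculation` and `template` rules would need their own evaluators), and the rule dicts stand in for parsed YAML.

```python
# Sketch of threshold-rule evaluation: skip rules not yet in force, compare
# the application's computed field against max_value, collect failures.
from datetime import date

def evaluate_threshold_rules(rules: list[dict], application: dict, today: date):
    failures = []
    for rule in rules:
        if rule["type"] != "threshold":
            continue   # calculation/template rules handled elsewhere
        if date.fromisoformat(rule["effective_date"]) > today:
            continue   # rule not yet effective
        value = application.get(rule["field"])
        if value is not None and value > rule["max_value"]:
            failures.append(rule["name"])
    return failures
```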

Disclosure template updates work similarly. The narrative generation agent retrieves the current disclosure template for the country and populates it with the application’s data. A template update is a content change, not a model change.

The risk: a regulatory change that requires a genuinely new type of check (not a parameter change, but a new calculation or validation that the rules engine does not support). This requires engineering work to extend the rules engine. The goal is not to make all regulatory changes zero-code, but to make 80 to 90% of them configuration-only. The remaining 10 to 20% are engineering projects with clear scope.

References

  1. Google Cloud Interconnect Documentation - Dedicated connections between on-prem and GCP
  2. Temporal.io Documentation - Durable workflow execution engine
  3. Vertex AI Structured Output - Enforcing JSON schemas on LLM output
  4. Anthropic: Building Effective Agents - Agent design principles (bounded agents over autonomous loops)
  5. FCA MCOB Handbook - UK mortgage conduct regulations
  6. BaFin Wohnimmobilienkreditrichtlinie - German residential mortgage lending regulations
  7. TILA-RESPA Integrated Disclosure Rule - US mortgage disclosure requirements
  8. Google Document AI - On-prem OCR and document processing
  9. BERT: Pre-training of Deep Bidirectional Transformers - Architecture used for document type classification
  10. Gemini 1.5 Pro Technical Report - Vision-capable model used for document extraction
  11. ReAct: Synergizing Reasoning and Acting in Language Models - Think-act loop pattern used in verification agent
  12. Heavybit: RAG vs Fine-tuning - RAG for knowledge, fine-tune for behavior
  13. SCHUFA Credit Bureau - German credit reporting
  14. UK Land Registry - Property title verification for UK
  15. Platt Scaling for Calibrated Confidence - Calibrating LLM confidence scores
  16. LangGraph Documentation - State machine patterns for bounded agents
  17. Cloud Interconnect SLA - Availability guarantees for hybrid connectivity

Note: This blog represents my technical views and production experience. I use AI-based tools to help with drafting and formatting to keep these posts coming daily.
