IntelliCredit-X is a multi-agent RL environment where Mistral-7B learns to investigate fraud signals,
manage a live loan portfolio across 50-step episodes, and make RBI-compliant credit decisions —
trained via 2-stage GRPO directly on the live environment.
Each episode simulates a full credit committee lifecycle. The agent must gather evidence, reason through risk, and submit compliant decisions under real regulatory pressure.
01 Receive Application
Agent receives a 55D observation: financials, forensic alerts, portfolio state, macro conditions, and memory of past decisions.
02 Investigate with Tools
Makes up to 4 tool calls per step, using tools such as get_financial_report(), check_compliance(), and get_market_intel().
03 Submit Decision
Calls submit_decision(action, reasoning) with reasoning of at least 50 characters. Six hard rules automatically override any decision that violates them.
04 Face Consequences
Loans mature 10–30 steps after the decision (T+10 to T+30). Regulator audits fire at steps ~10/20/30/40/50. Three failed audits trigger a shutdown (−50 reward).
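The format constraints in steps 02 and 03 can be sketched in a few lines of Python. This is an illustrative sketch only, not the environment's actual code: `StepBudget` and `validate_decision` are hypothetical names, and the only values taken from the text are the 4-call tool budget, the 50-character reasoning minimum, and the three action labels.

```python
from dataclasses import dataclass

MAX_TOOL_CALLS_PER_STEP = 4   # from the text: up to 4 tool calls per step
MIN_REASONING_CHARS = 50      # from the text: reasoning must be at least 50 chars
VALID_ACTIONS = {"APPROVE", "CONDITIONAL", "REJECT"}


@dataclass
class StepBudget:
    """Hypothetical helper that tracks tool usage within one environment step."""
    tool_calls: int = 0

    def use_tool(self, name: str) -> bool:
        """Return True if the call fits the per-step budget, False otherwise."""
        if self.tool_calls >= MAX_TOOL_CALLS_PER_STEP:
            return False
        self.tool_calls += 1
        return True


def validate_decision(action: str, reasoning: str) -> list[str]:
    """Return a list of format problems; an empty list means well-formed."""
    problems = []
    if action not in VALID_ACTIONS:
        problems.append(f"unknown action: {action!r}")
    if len(reasoning.strip()) < MIN_REASONING_CHARS:
        problems.append("reasoning shorter than 50 characters")
    return problems
```

A check like this only covers the shape of a submission. The six hard rules and the reward calculation are not specified here, so they are left out.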
Multi-Agent System
Three Agents, One Environment
The environment simulates the full credit ecosystem — not just individual decisions.
Credit Officer (LLM)
Your Agent Under Training
Mistral-7B fine-tuned via GRPO. Receives the 55D observation, calls investigation tools, and submits APPROVE / CONDITIONAL / REJECT with written reasoning.
Borrower Agent
Adversarial Pressure
Rejected borrowers reapply up to 3x with improved surface metrics but unchanged hidden PD — forcing the agent to learn true risk signals.
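One way to picture the borrower's behaviour is a re-application that improves what the officer sees while leaving the true default risk untouched. The sketch below makes that concrete under stated assumptions: the `Application` fields, the `polish` increment, and the `reapply` function are all hypothetical, and only the limit of 3 re-applications and the unchanged hidden PD come from the text.

```python
from dataclasses import dataclass, replace
from typing import Optional

MAX_REAPPLICATIONS = 3  # from the text: rejected borrowers reapply up to 3x


@dataclass(frozen=True)
class Application:
    surface_score: float  # hypothetical: visible quality signal in [0, 1]
    hidden_pd: float      # hypothetical: true probability of default, never shown
    attempt: int = 0      # 0 means the first application


def reapply(app: Application, polish: float = 0.1) -> Optional[Application]:
    """Return a polished re-application, or None once the retry limit is hit."""
    if app.attempt >= MAX_REAPPLICATIONS:
        return None
    return replace(
        app,
        surface_score=min(1.0, app.surface_score + polish),  # looks better
        attempt=app.attempt + 1,                              # hidden_pd untouched
    )
```

Because `hidden_pd` never changes across attempts, an agent that keys on surface metrics alone will eventually approve a loan it correctly rejected earlier.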
Regulator Agent
Compliance Enforcer
Audits portfolio at steps ≈10/20/30/40/50 (±1 jitter). Checks NPA rate, CRAR, sector concentration. Episode shutdown on 3 consecutive fails.
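The audit schedule and shutdown rule described above can be sketched as follows. This is a minimal illustration, not the environment's implementation: the function names are hypothetical, the uniform choice of jitter and the clamping of the last audit to step 50 are assumptions, and the actual NPA, CRAR, and concentration thresholds are not given, so each audit is reduced to a pass/fail flag.

```python
import random

AUDIT_ANCHORS = (10, 20, 30, 40, 50)  # from the text
EPISODE_LENGTH = 50                   # from the text: 50-step episodes
MAX_CONSECUTIVE_FAILS = 3             # from the text
SHUTDOWN_REWARD = -50.0               # from the text


def sample_audit_steps(rng: random.Random) -> list[int]:
    """Jitter each anchor by -1, 0, or +1 (assumed uniform), clamped to the episode."""
    return [min(EPISODE_LENGTH, a + rng.choice((-1, 0, 1))) for a in AUDIT_ANCHORS]


def shutdown_triggered(audit_passed: list[bool]) -> bool:
    """True once three audits in a row have failed (True in the list = pass)."""
    streak = 0
    for passed in audit_passed:
        streak = 0 if passed else streak + 1
        if streak >= MAX_CONSECUTIVE_FAILS:
            return True
    return False
```

Resetting the streak on every pass is what makes the rule "consecutive": a single clean audit between failures keeps the episode alive.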
Benchmark Results
Before vs. After GRPO
Evaluated across 3 task difficulties. Zero regressions — every metric improved or held steady.
| Task | Difficulty | Metric | Base Mistral-7B | GRPO Model | Delta |
| --- | --- | --- | --- | --- | --- |
| Task 1 | Easy | Accuracy | 80.0% | 86.7% | +6.7% ✓ |
| Task 1 | Easy | Capital Utilization | 40.0% | 60.0% | +20.0% ✓ |
| Task 2 | Medium | Total Reward | 10.305 | 10.584 | +0.279 ✓ |
| Task 3 | Hard | Total Reward | 0.215 | 2.491 | +10x ✓ |
| Task 3 | Hard | NPA Rate | 16.7% | 8.3% | -8.3% ✓ |
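The Delta column can be re-derived from the two model columns. The short script below does that arithmetic using only the reported values. The helper names are illustrative, and differences on percentage metrics are in percentage points.

```python
# (task, metric, base, grpo) copied from the reported benchmark values
ROWS = [
    ("Task 1", "Accuracy (%)", 80.0, 86.7),
    ("Task 1", "Capital Utilization (%)", 40.0, 60.0),
    ("Task 2", "Total Reward", 10.305, 10.584),
    ("Task 3", "Total Reward", 0.215, 2.491),
    ("Task 3", "NPA Rate (%)", 16.7, 8.3),
]


def absolute_delta(base: float, tuned: float) -> float:
    """Difference tuned - base, rounded to 3 decimals to hide float noise."""
    return round(tuned - base, 3)


def improvement_ratio(base: float, tuned: float) -> float:
    """Multiplicative change tuned / base (meaningful for positive rewards)."""
    return tuned / base


for task, metric, base, tuned in ROWS:
    print(
        f"{task:7s} {metric:24s} "
        f"delta={absolute_delta(base, tuned):+.3f} "
        f"ratio={improvement_ratio(base, tuned):.2f}x"
    )
```

Two details are worth noting. The Task 3 reward ratio from the reported figures is about 11.6×, so "+10x" reads as a rounded-down summary. The NPA difference from the rounded percentages is −8.4 points, and the reported −8.3 was presumably computed from unrounded rates, though that is an assumption.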
Training Curves
What the Training Curves Tell Us
Four panels show what the model learned and when, across three curriculum stages (dashed lines mark transitions).
Mean reward climbs from −2.0 to between +0.5 and +1.0, format compliance rises from 0% to 40–65%,
and KL divergence stays below 0.12, suggesting the policy changed without drifting far enough from the reference model to lose its language capabilities.
Figure 1: GRPO v2 Training Curves (3-Stage Curriculum)
GRPO Loss: Starts near zero, climbs to 0.02–0.05, indicating healthy policy divergence from the reference.
Mean Reward: −2.0 → 0 at Stage 1 end → stable +0.5–1.0. Stage 3 dip then re-stabilises.
submit_pct (format compliance): 0% → 40–65%. The model acquired the vocabulary of the task.
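For readers unfamiliar with GRPO, the reward curve is driven by group-relative advantages: several completions are sampled for the same prompt, and each is scored against the mean of its own group. The sketch below shows the commonly described normalisation. The function name and `eps` value are illustrative, and the trainer settings actually used for these runs (group size, standard-deviation variant, KL coefficient) are not stated here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise each reward against its sampling group: (r - mean) / (std + eps).

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead, which only rescales the result.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four completions for one loan application, rewards from the environment.
advantages = group_relative_advantages([-2.0, 0.0, 1.0, 1.0])
```

Completions that beat their group's average get a positive advantage and are reinforced, and those below it are discouraged. If every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.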
Evaluation
Before vs. After GRPO — Full Comparison
Per-task, per-metric comparison of base Mistral-7B (blue) vs. GRPO-trained IntelliCredit model (green).
Zero regressions across all 24 metric-task combinations.
The hardest task (Task 3) shows the largest improvement: NPA rate cut in half (16.7% → 8.3%) and total reward up more than 10× (0.215 → 2.491).
Figure 2: Base Mistral-7B vs. GRPO IntelliCredit (All Tasks)
Task 1 (Easy): Accuracy +6.7%, capital utilization +20%. The GRPO model deploys more capital into correctly identified safe loans.
Task 2 (Medium): Both models hit perfect Task Score (1.000). GRPO squeezes +0.28 extra reward from better capital efficiency.