Meta × Hugging Face OpenEnv Hackathon 2026

Teaching an LLM to Think Like a Credit Officer

IntelliCredit-X is a multi-agent RL environment where Mistral-7B learns to investigate fraud signals, manage a live loan portfolio across 50-step episodes, and make RBI-compliant credit decisions — trained via 2-stage GRPO directly on the live environment.

Observation Space: 55D
Steps / Episode: 50
Agents: 3
Reward Gain Task 3: 10x
NPA After GRPO: -8.3%
Regressions: 0
How It Works
Credit AI in 4 Steps

Each episode simulates a full credit committee lifecycle. The agent must gather evidence, reason through risk, and submit compliant decisions under real regulatory pressure.

01. Receive Application
Agent receives a 55D observation: financials, forensic alerts, portfolio state, macro conditions, and memory of past decisions.
02. Investigate with Tools
Calls up to 4 tools per step: get_financial_report(), check_compliance(), get_market_intel().
03. Submit Decision
Calls submit_decision(action, reasoning) with ≥50-char reasoning. 6 hard rules auto-override violations.
04. Face Consequences
Loans mature T+10–30 steps later. Regulator audits fire at steps ~10/20/30/40/50. 3 failures = shutdown (−50 reward).
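The limits in steps 02 and 03 can be sketched as a small validator. This is a minimal illustration rather than the environment's actual code: `StepRequest` and `validate_step` are hypothetical names, and only the constants (4 tool calls per step, the three tool names, the three decision labels, and the 50-character reasoning minimum) come from the description above. The 6 hard rules are not specified on this page, so they are not modelled here.

```python
from dataclasses import dataclass, field

# Constants taken from the page; the rest of this sketch is hypothetical.
ALLOWED_TOOLS = {"get_financial_report", "check_compliance", "get_market_intel"}
ALLOWED_ACTIONS = {"APPROVE", "CONDITIONAL", "REJECT"}
MAX_TOOL_CALLS_PER_STEP = 4
MIN_REASONING_CHARS = 50


@dataclass
class StepRequest:
    """Hypothetical container for one agent step (not the real API schema)."""
    action: str
    reasoning: str
    tool_calls: list = field(default_factory=list)


def validate_step(req: StepRequest) -> list:
    """Return a list of problems; an empty list means the step is well-formed."""
    problems = []
    if len(req.tool_calls) > MAX_TOOL_CALLS_PER_STEP:
        problems.append("too many tool calls")
    unknown = [t for t in req.tool_calls if t not in ALLOWED_TOOLS]
    if unknown:
        problems.append(f"unknown tools: {unknown}")
    if req.action not in ALLOWED_ACTIONS:
        problems.append(f"unknown action: {req.action}")
    if len(req.reasoning.strip()) < MIN_REASONING_CHARS:
        problems.append("reasoning shorter than 50 characters")
    return problems
```

A check like this would sit in front of the decision call, so that malformed steps can be rejected or penalised before any reward is computed.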
Multi-Agent System
Three Agents, One Environment

The environment simulates the full credit ecosystem — not just individual decisions.

Credit Officer (LLM)
Your Agent Under Training
Mistral-7B fine-tuned via GRPO. Receives 55D obs, calls investigation tools, submits APPROVE / CONDITIONAL / REJECT with written reasoning.
Borrower Agent
Adversarial Pressure
Rejected borrowers reapply up to 3x with improved surface metrics but unchanged hidden PD — forcing the agent to learn true risk signals.
Regulator Agent
Compliance Enforcer
Audits portfolio at steps ≈10/20/30/40/50 (±1 jitter). Checks NPA rate, CRAR, sector concentration. Episode shutdown on 3 consecutive fails.
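The Regulator Agent's schedule and shutdown rule can be sketched as follows. This is an illustrative model under stated assumptions, not the environment's source: the function names are hypothetical, clamping the jittered step to the episode length is an assumption, and each audit outcome is passed in as a plain pass/fail flag because the NPA, CRAR, and concentration thresholds are not given on this page. The sketch follows the "3 consecutive fails" wording used here.

```python
import random

EPISODE_LENGTH = 50                      # steps per episode (from the page)
NOMINAL_AUDIT_STEPS = (10, 20, 30, 40, 50)
MAX_CONSECUTIVE_FAILS = 3
SHUTDOWN_REWARD = -50.0


def sample_audit_steps(rng: random.Random) -> list:
    """Jitter each nominal audit step by -1, 0, or +1.

    Clamping to EPISODE_LENGTH is an assumption of this sketch, made so the
    final audit cannot land on a non-existent step 51.
    """
    return [min(EPISODE_LENGTH, s + rng.choice((-1, 0, 1)))
            for s in NOMINAL_AUDIT_STEPS]


def run_audits(audit_results: list):
    """Walk through pass/fail audit outcomes in order.

    Returns (shutdown, reward_penalty, audits_seen). A pass resets the
    consecutive-failure counter; three failures in a row end the episode.
    """
    consecutive = 0
    for seen, passed in enumerate(audit_results, start=1):
        consecutive = 0 if passed else consecutive + 1
        if consecutive >= MAX_CONSECUTIVE_FAILS:
            return True, SHUTDOWN_REWARD, seen
    return False, 0.0, len(audit_results)
```

For example, `run_audits([True, False, False, False, True])` stops at the fourth audit with the −50 penalty, while two failures followed by a pass never trigger a shutdown.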
Benchmark Results
Before vs. After GRPO

Evaluated across 3 task difficulties. Zero regressions — every metric improved or held steady.

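| Task | Difficulty | Metric | Base Mistral-7B | GRPO Model | Delta |
|---|---|---|---|---|---|
| Task 1 | Easy | Accuracy | 80.0% | 86.7% | +6.7% ✓ |
| Task 1 | Easy | Capital Utilization | 40.0% | 60.0% | +20.0% ✓ |
| Task 2 | Medium | Total Reward | 10.305 | 10.584 | +0.279 ✓ |
| Task 3 | Hard | Total Reward | 0.215 | 2.491 | +10x ✓ |
| Task 3 | Hard | NPA Rate | 16.7% | 8.3% | -8.3% ✓ |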
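The deltas can be recomputed directly from the published figures. The short script below is only a worked arithmetic check on the numbers in the table; the variable names are illustrative. It separates absolute changes (percentage points) from relative ones, since "-8.3%" on a rate that started at 16.7% is roughly a halving in relative terms.

```python
# Figures copied from the benchmark table above.
base_reward, grpo_reward = 0.215, 2.491   # Task 3 total reward
base_npa, grpo_npa = 16.7, 8.3            # Task 3 NPA rate, in percent
base_acc, grpo_acc = 80.0, 86.7           # Task 1 accuracy, in percent

reward_ratio = grpo_reward / base_reward  # how many times larger the reward is
npa_change_pp = grpo_npa - base_npa       # absolute change, percentage points
npa_relative = npa_change_pp / base_npa   # relative change versus the base rate
acc_change_pp = grpo_acc - base_acc       # absolute change, percentage points

print(f"Task 3 reward ratio: {reward_ratio:.1f}x")
print(f"Task 3 NPA change:   {npa_change_pp:+.1f} pp ({npa_relative:+.1%} relative)")
print(f"Task 1 accuracy:     {acc_change_pp:+.1f} pp")
```

On the rounded table values the reward ratio comes out at about 11.6x and the NPA change at −8.4 percentage points; the small gap from the published −8.3 may simply reflect rounding of the underlying rates.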
Training Curves
What the Training Curves Tell Us

Four panels show what the model learned and when, across three curriculum stages (dashed lines mark the transitions). Mean reward climbs from −2.0 to a stable +0.5 to +1.0, format compliance rises from 0% to 40–65%, and KL divergence stays below the 0.12 threshold, which suggests the policy changed without losing its general language capabilities.

Figure 1: GRPO v2 Training Curves, 3-Stage Curriculum
GRPO Loss: Starts near zero, climbs to 0.02–0.05; healthy policy divergence from reference.
Mean Reward: −2.0 → 0 at Stage 1 end → stable +0.5–1.0. Stage 3 dip then re-stabilises.
KL Divergence: Grows 0→0.08, stays below 0.12 threshold; genuine learning, no catastrophic forgetting.
submit_pct: Format compliance: 0% → 40–65%. The model acquired the vocabulary of the task.
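For readers unfamiliar with GRPO, the mean-reward panel is easier to read with the core idea in mind: each sampled completion is scored relative to the other completions drawn for the same prompt. The sketch below shows that group-relative normalisation in its generic textbook form. It is not taken from the IntelliCredit-X training scripts; the function name, the `eps` constant, and the use of the population standard deviation are assumptions of this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). Completions
    that beat their siblings get a positive advantage, the rest a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a choice made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four completions for one prompt, with early-training style rewards.
advantages = group_relative_advantages([-2.0, 0.0, 1.0, 1.0])
```

Because advantages are centred within each group, a batch in which every completion earns the same reward contributes no learning signal, which is one reason reward variance within a group matters during early training.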
Evaluation
Before vs. After GRPO — Full Comparison

Per-task, per-metric comparison of base Mistral-7B (blue) vs. GRPO-trained IntelliCredit model (green). Zero regressions across all 24 metric-task combinations. The hardest task (Task 3) shows the most dramatic improvement — NPA rate cut in half, total reward up 10×.

Figure 2: Base Mistral-7B vs. GRPO IntelliCredit, All Tasks
Task 1 (Easy): Accuracy +6.7%, capital utilization +20%. The GRPO model deploys more capital into correctly identified safe loans.
Task 2 (Medium): Both models hit perfect Task Score (1.000). GRPO squeezes +0.28 extra reward from better capital efficiency.
Task 3 (Hard): Total reward 0.215 → 2.491 (+10×). NPA 16.7% → 8.3% (halved). True portfolio-level risk management learned.
Key Insight: Model learned that surface improvement + behavioural red flags = escalating risk. It calls tools; base model doesn't.
Quick Start
Start an Episode in 2 Calls

The environment is live and accepts HTTP from any client — no install required.

bash — curl
# Step 1: Reset (start a new episode)
curl -X POST https://vssksn-intellicredit-openenv.hf.space/reset \
  -H "Content-Type: application/json" \
  -d '{"episode_id":"demo-1","seed":42,"task_id":"task3"}'

# Step 2: Submit a decision  (0=APPROVE 1=CONDITIONAL 2=REJECT)
curl -X POST https://vssksn-intellicredit-openenv.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"episode_id":"demo-1","action":{"decision":2}}'
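The same two calls can be made from Python with only the standard library. This is a minimal sketch: the URL and JSON payloads are copied from the curl commands above, while the helper names (`build_request`, `post_json`, `run_demo`) and the 30-second timeout are assumptions. The response schema is not documented on this page, so the replies are printed as returned rather than parsed into fields.

```python
import json
import urllib.request

BASE_URL = "https://vssksn-intellicredit-openenv.hf.space"


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the given endpoint."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict, timeout: float = 30.0):
    """Send the request and decode the JSON reply (schema not assumed here)."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_demo():
    """Mirror the two curl calls: reset an episode, then submit a REJECT."""
    reset_reply = post_json(
        "/reset", {"episode_id": "demo-1", "seed": 42, "task_id": "task3"}
    )
    print(reset_reply)
    # decision codes from the page: 0=APPROVE, 1=CONDITIONAL, 2=REJECT
    step_reply = post_json(
        "/step", {"episode_id": "demo-1", "action": {"decision": 2}}
    )
    print(step_reply)
```

Calling `run_demo()` performs the live requests; `build_request` is kept separate so the payloads can be inspected or unit-tested without network access.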
Resources
Everything Open Source

All artefacts published on Hugging Face and GitHub under MIT License.

Technical Blog: Architecture, 2-stage GRPO, training curves, full results.
Fine-Tuned Model: Mistral-7B post-trained on live environment via online GRPO.
Training Dataset: 2,000 GRPO prompts across 5 task levels (intellicredit-grpo-v2).
Stage 1, Offline GRPO: Mistral-7B + Unsloth, A100, ~45 min. Pre-train on 2,000 prompts.
Stage 2, Online GRPO: Post-train on this live env. Real rewards from /step endpoint.
GitHub Repository: Full source (env, training scripts, evaluation). MIT License.
Swagger UI: Interactive API; run /reset and /step right in the browser.
Environment Info JSON: Full metadata (observation dims, action space, tasks, constraints).