Overview
The RL system uses continuous MMO-style training where:
- Agents generate training data 24/7 through live gameplay
- Trajectories are collected and grouped by time windows
- GRPO/RULER training optimizes models on successful decisions
- Updated models deploy without downtime
- Multiple agents learn together, sharing experiences
System Architecture
Continuous Learning Loop
- Agents Play: Autonomous agents interact with Babylon markets
- Data Collection: Every action recorded as trajectory with state, action, reward
- Windowed Batching: Group trajectories by time windows (1-hour default)
- RULER Scoring: Judge agent decisions using preference learning
- GRPO Training: Fine-tune model on high-scoring trajectories
- Deployment: New model replaces old without downtime
- Repeat: The loop continues indefinitely (a minimal sketch of this loop follows)
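A minimal sketch of the loop, assuming hypothetical `collect_window`, `score_with_ruler`, `train_grpo`, and `deploy` helpers supplied by the surrounding pipeline:

```python
def continuous_training_loop(model, collect_window, score_with_ruler, train_grpo, deploy,
                             window_seconds: int = 3600):
    """Run the collect -> score -> train -> deploy cycle indefinitely.

    All callables are hypothetical stand-ins for the real pipeline stages.
    """
    while True:
        trajectories = collect_window(duration=window_seconds)  # agents play live
        scored = score_with_ruler(trajectories)                 # RULER judges decisions
        model = train_grpo(model, scored)                       # fine-tune on high-scoring data
        deploy(model)                                           # zero-downtime swap
```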
Trajectory Collection
Trajectory Structure
Each trajectory captures a complete decision-making episode: the state the agent observed, the action it took, and the reward it received.
Game State Representation
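The canonical schema is not reproduced here; a hedged sketch of what a market-focused game state might contain (field names are illustrative assumptions):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class GameState:
    """Illustrative snapshot of what an agent observes before acting.

    Field names are assumptions for this sketch, not the canonical schema.
    """
    timestamp: float                  # when the observation was taken
    balances: Dict[str, float]        # agent holdings per asset
    market_prices: Dict[str, float]   # current prices in the Babylon markets
    open_positions: List[dict]        # positions the agent currently holds
    recent_events: List[str]          # market events visible to the agent
```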
Action Space
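A hedged sketch of a discrete trading action space; the action names and fields are assumptions for illustration:

```python
from dataclasses import dataclass
from enum import Enum

class ActionType(Enum):
    # Illustrative action types; the real action space may differ.
    BUY = "buy"
    SELL = "sell"
    HOLD = "hold"

@dataclass
class Action:
    type: ActionType
    asset: str           # which market/asset the action targets
    amount: float = 0.0  # size of the trade (ignored for HOLD)
```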
Reward Function
Rewards are computed from trading outcomes:
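The exact formula is not reproduced here; a minimal sketch assuming the reward tracks realized profit and loss per step (the PnL-based shaping is an assumption):

```python
def compute_reward(portfolio_value_before: float, portfolio_value_after: float) -> float:
    """Illustrative reward: relative change in portfolio value for one step.

    The real reward function may include additional terms (risk, fees, etc.);
    this sketch only captures the trading-outcome signal described above.
    """
    if portfolio_value_before <= 0:
        return 0.0
    return (portfolio_value_after - portfolio_value_before) / portfolio_value_before
```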
Windowed Batching Strategy
Time Windows
Trajectories are grouped into time windows for fair comparison (a grouping sketch follows the lists below).
Default Configuration:
- Window Duration: 1 hour
- Minimum Agents: 3 per window
- Minimum Actions: 5 per trajectory
- Overlap: None (non-overlapping windows)
Grouping by time window:
- Ensures agents operate in the same market conditions
- Enables fair performance comparison
- Balances data freshness with statistical significance
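A minimal sketch of non-overlapping, fixed-duration window assignment, assuming each trajectory carries a start timestamp:

```python
from collections import defaultdict
from typing import Dict, List

WINDOW_SECONDS = 3600  # 1-hour, non-overlapping windows (default)

def group_by_window(trajectories: List[dict]) -> Dict[int, List[dict]]:
    """Assign each trajectory to the window its start time falls in.

    Assumes each trajectory dict has a 'start_time' field in Unix seconds.
    """
    windows: Dict[int, List[dict]] = defaultdict(list)
    for traj in trajectories:
        window_id = int(traj["start_time"]) // WINDOW_SECONDS
        windows[window_id].append(traj)
    return windows
```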
Window Selection
A window is selected for training only when all of the following hold (a selection check is sketched after this list):
- Minimum 3 agents participated
- Minimum 5 trajectories per agent
- All trajectories have complete data
- Average reward above threshold (0.3 default)
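A sketch of that filter using the defaults above, assuming each trajectory dict exposes `reward` and a `complete` flag:

```python
from typing import Dict, List

MIN_AGENTS = 3
MIN_TRAJECTORIES_PER_AGENT = 5
MIN_AVG_REWARD = 0.3

def is_trainable_window(trajectories_by_agent: Dict[str, List[dict]]) -> bool:
    """Apply the window selection criteria listed above to one window."""
    if len(trajectories_by_agent) < MIN_AGENTS:
        return False
    if any(len(trajs) < MIN_TRAJECTORIES_PER_AGENT for trajs in trajectories_by_agent.values()):
        return False
    all_trajs = [t for trajs in trajectories_by_agent.values() for t in trajs]
    if not all(t.get("complete", False) for t in all_trajs):
        return False
    avg_reward = sum(t["reward"] for t in all_trajs) / len(all_trajs)
    return avg_reward >= MIN_AVG_REWARD
```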
RULER Scoring System
RULER (Reward Understanding via Learning from Examples and Rewards) judges agent decisions using preference learning.
Scoring Architecture
Preference Learning
RULER learns preferences from trajectory comparisons.
Scoring Methods
Method 1: Preference-Based
- Compare trajectory pairs
- Learn which decisions are better
- Score trajectories relative to preferences
Method 2: Direct Reward Prediction
- Predict reward from trajectory features
- Use predicted reward as score
- Faster but less nuanced
Implementation
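The production implementation is not reproduced here; a hedged sketch of preference-based scoring, where each trajectory's score is the share of pairwise comparisons it wins (the judging callable is an injected assumption):

```python
from itertools import combinations
from typing import Callable, Dict, List

def preference_scores(
    trajectories: List[dict],
    prefer: Callable[[dict, dict], int],
) -> Dict[int, float]:
    """Score each trajectory by the fraction of pairwise comparisons it wins.

    `prefer(a, b)` returns 1 if trajectory `a` is judged better, -1 if `b` is
    better, and 0 for a tie. In the real system this judgment would come from
    the RULER judge; here it is an injected callable.
    """
    wins = {i: 0.0 for i in range(len(trajectories))}
    for i, j in combinations(range(len(trajectories)), 2):
        outcome = prefer(trajectories[i], trajectories[j])
        if outcome > 0:
            wins[i] += 1.0
        elif outcome < 0:
            wins[j] += 1.0
        else:
            wins[i] += 0.5
            wins[j] += 0.5
    n = len(trajectories)
    # Each trajectory appears in n - 1 comparisons; normalize scores to [0, 1].
    return {i: wins[i] / (n - 1) for i in wins} if n > 1 else {i: 1.0 for i in wins}
```

Method 2 would instead call a learned reward model on each trajectory's features and use its prediction directly as the score.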
GRPO Training
GRPO (Group Relative Policy Optimization) fine-tunes models on high-scoring trajectories.
GRPO Algorithm
Input: Trajectories with RULER scores
Process:
- Filter trajectories by score threshold (default: 0.5)
- Group by agent for fair comparison
- Compute relative advantages within groups
- Optimize policy using PPO-style updates
- Regularize with KL divergence penalty
Training Configuration
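The defaults mentioned throughout this document, collected into one illustrative configuration object (the grouping and class name are assumptions):

```python
from dataclasses import dataclass

@dataclass
class GRPOTrainingConfig:
    """Defaults referenced elsewhere in this document."""
    window_seconds: int = 3600            # 1-hour trajectory windows
    min_agents_per_window: int = 3
    min_trajectories_per_agent: int = 5
    min_window_avg_reward: float = 0.3
    score_threshold: float = 0.5          # RULER score needed to keep a trajectory
    kl_coefficient: float = 0.05          # β in the loss below
    discount_factor: float = 0.99         # γ used for trajectory returns
```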
Loss Function
GRPO optimizes a combined objective:
J(θ) = E[(π_θ(a|s) / π_old(a|s)) · A(s,a)] − β · KL(π_θ || π_old)
π_θ(a|s): Policy probability of action given state
A(s,a): Advantage function (relative to group average)
β: KL penalty coefficient (0.05)
KL(π_θ || π_old): KL divergence from previous policy
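A hedged numpy sketch of this objective over a batch of per-action log-probabilities and advantages; the KL estimator is a simple sample-based choice and clipping is omitted:

```python
import numpy as np

def grpo_objective(new_logp: np.ndarray, old_logp: np.ndarray,
                   advantages: np.ndarray, beta: float = 0.05) -> float:
    """Surrogate objective: importance-weighted advantage minus a KL penalty.

    Inputs are log-probabilities under the new and old policies plus the
    group-relative advantages. A real implementation would typically add
    PPO-style ratio clipping on top of this.
    """
    ratio = np.exp(new_logp - old_logp)               # π_θ(a|s) / π_old(a|s)
    policy_term = float(np.mean(ratio * advantages))  # expected advantage
    # Importance-weighted sample estimate of KL(π_θ || π_old) on old-policy data.
    kl_estimate = float(np.mean(ratio * (new_logp - old_logp)))
    return policy_term - beta * kl_estimate
```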
Advantage Calculation
Advantages are computed relative to group performance:
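A minimal sketch of group-relative advantages, assuming one scalar return per trajectory and one group per agent (as described in the GRPO process above):

```python
from typing import Dict, List

def group_relative_advantages(returns_by_group: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Advantage of each trajectory = its return minus its group's mean return."""
    advantages: Dict[str, List[float]] = {}
    for group, returns in returns_by_group.items():
        baseline = sum(returns) / len(returns)        # R̄_group
        advantages[group] = [r - baseline for r in returns]
    return advantages
```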
Training Pipeline
Continuous Training Mode
Training Metrics
Track key metrics per window:
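The specific metric list is not reproduced here; a hedged sketch of per-window metrics one would typically record for this pipeline (the names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class WindowMetrics:
    """Illustrative per-window training metrics; field names are assumptions."""
    window_id: int
    num_agents: int
    num_trajectories: int
    mean_reward: float            # average trajectory reward in the window
    mean_ruler_score: float       # average RULER score
    fraction_kept: float          # share of trajectories above the score threshold
    kl_to_previous_policy: float
    policy_loss: float
```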
Multi-Agent Learning
MMO-Style Training
Multiple agents learn simultaneously.
Benefits
- Diverse Experience: Multiple strategies and approaches
- Faster Learning: More data per window
- Robustness: Less overfitting to single agent patterns
- Competition: Agents learn from each other
Model Deployment
Zero-Downtime Deployment
Deployment Strategies
Strategy 1: Replace
- Atomically replace old model with new
- Zero downtime
- All agents use new model immediately
Strategy 2: A/B Test
- Run two models simultaneously
- Split traffic (e.g., 50/50)
- Compare performance
- Gradually shift traffic to the better model (a deployment sketch for both strategies follows)
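A minimal sketch of both strategies: an atomic in-process swap for Replace and a weighted router for the A/B split (class and method names are assumptions):

```python
import random
import threading

class ModelRouter:
    """Serves inference requests while allowing zero-downtime model updates."""

    def __init__(self, model):
        self._lock = threading.Lock()
        self._primary = model
        self._candidate = None
        self._candidate_share = 0.0   # fraction of traffic sent to the candidate

    def replace(self, new_model):
        """Strategy 1: atomically swap in the new model for all agents."""
        with self._lock:
            self._primary = new_model
            self._candidate = None
            self._candidate_share = 0.0

    def ab_test(self, new_model, share: float = 0.5):
        """Strategy 2: serve `share` of requests from the new model."""
        with self._lock:
            self._candidate = new_model
            self._candidate_share = share

    def pick(self):
        """Choose which model handles the next request."""
        with self._lock:
            if self._candidate is not None and random.random() < self._candidate_share:
                return self._candidate
            return self._primary
```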
Performance Tracking
Before/After Comparison
Learning Curves
Track performance over time:
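A hedged sketch of a simple learning curve: a smoothed series of per-window mean rewards, assuming one scalar per window:

```python
from typing import Dict, List

def learning_curve(mean_reward_by_window: Dict[int, float], smooth: int = 3) -> List[float]:
    """Return a moving average of per-window mean rewards, in window order.

    A rising curve indicates the deployed models are improving over time.
    """
    ordered = [mean_reward_by_window[w] for w in sorted(mean_reward_by_window)]
    curve = []
    for i in range(len(ordered)):
        lo = max(0, i - smooth + 1)
        recent = ordered[lo:i + 1]
        curve.append(sum(recent) / len(recent))
    return curve
```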
Mathematical Formulations
Reward Function
Given trajectory τ = (s₀, a₀, r₀, s₁, a₁, r₁, ..., sₜ, aₜ, rₜ), the discounted return is:
R(τ) = Σᵢ γⁱ · rᵢ
γ: Discount factor (typically 0.99)
rᵢ: Reward at step i
Policy Gradient Objective
GRPO optimizes:
J(θ) = E_τ[(R(τ) − R̄_group) · log π_θ(τ)] − β · KL(π_θ || π_old)
R(τ): Trajectory reward
R̄_group: Average reward for the agent’s group
β: KL penalty coefficient
Advantage Estimation
A(s,a) = Q(s,a) − V(s)
Q(s,a): Action-value function
V(s): State-value function (estimated from group average)
Research Applications
Studying Learning Dynamics
- How quickly do agents improve?
- What strategies emerge first?
- How do agents adapt to market changes?
Comparing Algorithms
- GRPO vs PPO vs A3C
- RULER vs direct reward
- Windowed vs non-windowed batching
Multi-Agent Coordination
- How do agents learn to coordinate?
- What communication patterns emerge?
- How do teams outperform solo agents?
Data Availability
HuggingFace Dataset
Dataset: elizaos/babylon-game-data
Updated: Daily at 2 AM UTC
Contains:
- Complete trajectories with states, actions, rewards
- Window identifiers for grouping
- Agent performance metrics
- Market snapshots
Accessing Data
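The dataset name comes from the section above; the split name and field layout in this sketch are assumptions:

```python
from datasets import load_dataset

# Load the public Babylon gameplay dataset from HuggingFace.
# The "train" split name and record fields are assumptions for this sketch.
dataset = load_dataset("elizaos/babylon-game-data", split="train")

for record in dataset.select(range(5)):
    # Each record should include a trajectory plus its window identifier
    # and agent performance metrics, per the "Contains" list above.
    print(record.keys())
```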
Related Topics
Research & Engine
- Agent Behavior - How agents make decisions
- Market Simulation - Training environment
- Data Models - Trajectory structure
For Developers
- Python Training System - Implementation guide
- Building Agents - Create agents that learn
- Agent Examples - Working examples
For Players
- How to Play: Using Agents - Use trained agents
Ready to study agent behavior? See Agent Behavior Systems!