Python RL Training System
Complete reinforcement learning training system for Babylon autonomous agents using continuous MMO-style training.
Overview
The Python RL training system enables continuous learning from live agent gameplay:
- Continuous Collection: Agents generate training data 24/7
- Automated Training: GRPO/RULER training runs automatically
- Live Deployment: Updated models deploy without downtime
- MMO-Style: Multiple agents learn together, share experiences
Based on: OpenPipe GRPO, Stanford RULER architecture
Status: Production Ready - Complete automation pipeline
Architecture
Continuous Learning Loop
- Agents Play: Autonomous agents interact with Babylon
- Data Collection: Decisions recorded via trajectory logger
- Windowed Batching: Group by time windows (1-hour default)
- Training: GRPO/RULER optimize model on successful decisions
- Deployment: New model replaces old without downtime
- Repeat: Loop continues indefinitely (see the sketch below)
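A minimal sketch of this loop, assuming hypothetical helpers collect_window_trajectories, train_grpo, and deploy_model (the names and signatures are illustrative, not the actual module API):

import time
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(hours=1)  # matches WINDOW_DURATION_HOURS=1

def continuous_loop():
    # Collect -> train -> deploy -> repeat, once per window
    while True:
        # 1. Wait for the current window to close
        now = datetime.now(timezone.utc)
        window_start = now.replace(minute=0, second=0, microsecond=0)
        window_end = window_start + WINDOW
        time.sleep(max(0.0, (window_end - datetime.now(timezone.utc)).total_seconds()))

        # 2. Collect trajectories recorded during the closed window
        trajectories = collect_window_trajectories(window_start, window_end)
        if not trajectories:
            continue  # nothing usable this window; wait for the next one

        # 3. Train GRPO/RULER on the collected decisions
        model = train_grpo(trajectories)

        # 4. Deploy the new model without downtime, then repeat
        deploy_model(model)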
Quick Start
1. Installation
cd python
pip install -r requirements.txt
2. Configuration
Create .env.training:
# Database
DATABASE_URL=postgresql://user:pass@host/babylon
# Training
OPENPIPE_API_KEY=your_key
MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct
TRAINING_FREQUENCY_WINDOWS=1
WINDOW_DURATION_HOURS=1
# Deployment
DEPLOYMENT_STRATEGY=replace
LOCAL_ENDPOINT=http://localhost:8000
3. Start Training
# Continuous mode (runs forever)
python src/training/continuous_trainer.py
# Single window (test)
python src/training/continuous_trainer.py --mode single
Training Modes
Continuous Mode (Production)
Runs indefinitely, training on new data as it arrives:
python src/training/continuous_trainer.py \
--mode continuous \
--window-hours 1 \
--frequency 1
Process:
- Wait for window to complete
- Collect trajectories from that window
- Train GRPO model
- Deploy new model
- Repeat
Single Window Mode (Testing)
Train on one window and exit:
python src/training/continuous_trainer.py \
--mode single \
--window-hours 6
Backfill Mode (Catch-up)
Train on historical data:
python src/training/continuous_trainer.py \
--mode backfill \
--backfill-hours 48
Data Collection
Windowing Strategy
Time Windows:
- Default: 1 hour
- Configurable: 15 minutes to 24 hours
- Agents: Minimum 3 per window
- Actions: Minimum 5 per trajectory (see the collection sketch below)
Example Window:
Window: 2024-11-13 10:00 - 11:00
- Agents: 5
- Trajectories: 47
- Avg Reward: 0.73
- Training: Yes
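A sketch of how a window's trajectories might be gathered and checked against the minimums above, assuming a hypothetical load_trajectories(start, end) database helper and trajectory dicts with agent_id, actions, and reward keys (all assumptions, not the actual schema):

from datetime import datetime, timedelta

MIN_AGENTS_PER_WINDOW = 3
MIN_ACTIONS_PER_TRAJECTORY = 5

def collect_window(window_start: datetime, hours: int = 1):
    window_end = window_start + timedelta(hours=hours)
    trajectories = load_trajectories(window_start, window_end)  # assumed DB helper

    # Drop trajectories that are too short to learn from
    trajectories = [t for t in trajectories if len(t["actions"]) >= MIN_ACTIONS_PER_TRAJECTORY]

    # Require enough distinct agents in the window before training
    agents = {t["agent_id"] for t in trajectories}
    if len(agents) < MIN_AGENTS_PER_WINDOW:
        return None  # skip this window

    avg_reward = sum(t["reward"] for t in trajectories) / len(trajectories)
    return {"agents": len(agents), "trajectories": trajectories, "avg_reward": avg_reward}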
Quality Filters
Trajectories must have (see the filter sketch after this list):
- Complete provider data
- LLM call with reasoning
- Action result
- Reward score
- All timestamps
Rejected if:
- Missing required fields
- Reward below threshold
- Incomplete data
- Malformed JSON
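A minimal filter implementing these checks, assuming the trajectory dict keys shown here (key names are illustrative; malformed JSON is assumed to be rejected earlier, at parse time):

REQUIRED_FIELDS = ("provider_data", "llm_call", "result", "reward", "timestamps")

def passes_quality_filters(trajectory: dict, reward_threshold: float = 0.5) -> bool:
    # Reject if any required field is missing or empty
    if any(not trajectory.get(field) for field in REQUIRED_FIELDS):
        return False
    # Reject if the LLM call has no recorded reasoning
    if not trajectory["llm_call"].get("reasoning"):
        return False
    # Reject if the reward score is below the threshold
    if trajectory["reward"] < reward_threshold:
        return False
    return True

# trajectories: list of dicts collected for the window
kept = [t for t in trajectories if passes_quality_filters(t)]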
Training Configuration
GRPO Settings
from training.grpo_config import GRPOConfig
config = GRPOConfig(
    model_name="Qwen/Qwen2.5-0.5B-Instruct",
    batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=1e-6,
    kl_penalty=0.05,
    iterations_per_window=10,
    reward_threshold=0.5
)
RULER Scoring
from training.ruler_scorer import RULERScorer
scorer = RULERScorer(
    model_path="./checkpoints/latest",
    scoring_method="preference",  # or "regression"
    temperature=0.7
)
scores = scorer.score_trajectories(trajectories)
Deployment
Strategy: Replace
Replace the old model with the new model atomically (a sketch of this flow follows the config below):
deployment = {
    "strategy": "replace",
    "endpoint": "http://localhost:8000",
    "model_path": "./checkpoints/window_123",
    "validation": True  # Test before deploying
}
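One way the replace strategy could work, sketched under the assumption of a symlink-based serving setup; validate_model and the checkpoint layout here are hypothetical, not the project's actual deployment API:

import shutil
from pathlib import Path

def deploy_replace(deployment: dict) -> bool:
    model_path = Path(deployment["model_path"])

    # Optionally validate the checkpoint before it goes live
    if deployment.get("validation") and not validate_model(model_path):  # hypothetical check
        return False

    # Stage the checkpoint, then swap a symlink so the endpoint picks it up atomically
    live_link = Path("./checkpoints/live")
    staged = Path("./checkpoints/staging") / model_path.name
    shutil.copytree(model_path, staged, dirs_exist_ok=True)

    tmp_link = live_link.with_suffix(".tmp")
    if tmp_link.exists() or tmp_link.is_symlink():
        tmp_link.unlink()
    tmp_link.symlink_to(staged.resolve())
    tmp_link.replace(live_link)  # atomic rename on POSIX filesystems
    return True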
Strategy: A/B Test
Run two models simultaneously:
deployment = {
    "strategy": "ab_test",
    "model_a": "./checkpoints/baseline",
    "model_b": "./checkpoints/window_123",
    "split": 0.5  # 50/50 split
}
Monitoring
Training Metrics
Track key metrics:
{
    "window_id": 123,
    "agents": 5,
    "trajectories": 47,
    "avg_reward": 0.73,
    "training_loss": 0.21,
    "kl_divergence": 0.03,
    "deployment_success": True,
    "duration_minutes": 15
}
Performance Tracking
Monitor agent performance:
# Before training
baseline_win_rate = 0.52
# After training
improved_win_rate = 0.67
# Improvement
delta = +0.15  # 15 percentage points better
Advanced Features
Custom Reward Functions
class TradingRewardFunction:
    def compute(self, trajectory):
        # Extract metrics
        pnl = trajectory['result']['pnl']
        hold_time = trajectory['result']['hold_time_minutes']
        win_rate = trajectory['agent_stats']['win_rate']

        # Compute components
        pnl_reward = pnl / trajectory['result']['investment']
        speed_reward = 1.0 if hold_time < 60 else 0.5
        consistency_reward = win_rate

        # Weighted combination
        return (
            0.5 * pnl_reward +
            0.2 * speed_reward +
            0.3 * consistency_reward
        )
Multi-Agent Learning
# Collect from multiple agents
agents = ["agent-1", "agent-2", "agent-3"]
trajectories = []
for agent in agents:
    agent_data = collect_trajectories(agent, window)
    trajectories.extend(agent_data)
# Train on combined experience
model = train_grpo(trajectories)
Transfer Learning
# Fine-tune on Babylon-specific strategies
trainer = GRPOTrainer(
    base_model="Qwen/Qwen2.5-0.5B-Instruct",  # General model
    babylon_data="./trajectories/",           # Babylon-specific
    transfer_layers=["attention", "ffn"]      # What to fine-tune
)
Production Setup
CoreWeave Deployment
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: babylon-training
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: trainer
          image: babylon/rl-training:latest
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
          resources:
            limits:
              nvidia.com/gpu: 1
Automation
# Cron job for continuous training
0 * * * * cd /app/python && python src/training/continuous_trainer.py --mode single
Troubleshooting
No Trajectories Collected
Check (a quick diagnostic sketch follows this list):
- Agents are running
- Trajectory logger is enabled
- Database connection works
- Actions are wrapped with logging
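A quick way to confirm data is reaching the database, assuming a trajectories table with agent_id and created_at columns (the table and column names are assumptions, not the actual schema):

import os
import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn, conn.cursor() as cur:
    # Count trajectories logged in the last hour, per agent
    cur.execute(
        """
        SELECT agent_id, COUNT(*)
        FROM trajectories
        WHERE created_at > NOW() - INTERVAL '1 hour'
        GROUP BY agent_id
        """
    )
    rows = cur.fetchall()

if not rows:
    print("No trajectories in the last hour - check agents and the trajectory logger")
else:
    for agent_id, count in rows:
        print(f"{agent_id}: {count} trajectories")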
Training Fails
Common Issues:
- Insufficient trajectories (need 5+ per agent)
- Missing required fields
- Invalid reward scores
- GPU out of memory
Model Doesn’t Improve
Try (an example of adjusted settings follows this list):
- Increase training iterations
- Adjust learning rate
- Filter low-reward trajectories
- Increase batch size
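For example, these adjustments map onto the GRPOConfig shown earlier; the specific values are illustrative starting points, not tuned recommendations:

from training.grpo_config import GRPOConfig

config = GRPOConfig(
    model_name="Qwen/Qwen2.5-0.5B-Instruct",
    batch_size=8,                  # larger batches for more stable gradients
    gradient_accumulation_steps=2,
    learning_rate=5e-7,            # lower the learning rate if training is unstable
    kl_penalty=0.05,
    iterations_per_window=20,      # more training iterations per window
    reward_threshold=0.7           # filter out low-reward trajectories
)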
HuggingFace Dataset
Training data is automatically published to HuggingFace daily via GitHub Actions.
Dataset: elizaos/babylon-game-data ✅ LIVE
Contains:
- Agent Trajectories: Complete gameplay with decisions + environment + ground truth
- Benchmark Scenarios: Game simulations for evaluation
- Model Performance: Results from model testing
- Organized by Month: Easy browsing (2025-10, 2025-11, etc.)
Updated: Daily at 2 AM UTC via GitHub Actions
Usage:
from datasets import load_dataset
# Load dataset
dataset = load_dataset("elizaos/babylon-game-data")
# Use for training
trajectories = dataset['train']
Offline Simulation (100-1000x faster):
# Download
huggingface-cli download elizaos/babylon-game-data
# Run offline
npm run hf:offline -- --month=2025-11 --agent=my-agent
See: HuggingFace Integration for complete guide and setup
Resources
- Full Documentation: python/README.md
- Quick Start: python/START_HERE.md
- Scorer Implementation: python/src/training/ruler_scorer.py
- Automation Pipeline: src/lib/training/AutomationPipeline.ts
- HuggingFace Dataset: https://huggingface.co/datasets/elizaos/babylon-game-data
Next Steps
- HuggingFace Integration - NEW! Dataset on HuggingFace
- Trajectory Logging
- Autonomous Agents
- Agent Performance
- Python LangGraph Example