Python RL Training System
Complete reinforcement learning training system for Babylon autonomous agents using continuous MMO-style training.
Overview
The Python RL training system enables continuous learning from live agent gameplay:
- Continuous Collection: Agents generate training data 24/7
- Automated Training: GRPO training with Atropos runs automatically
- Live Deployment: Updated models deploy without downtime
- MMO-Style: Multiple agents learn together, share experiences
Based on: Atropos (RLAIF framework by Nous Research) with GRPO training
Status: Production Ready - Complete automation pipeline
Architecture
Continuous Learning Loop
- Agents Play: Autonomous agents interact with Babylon
- Data Collection: Decisions recorded via trajectory logger
- Windowed Batching: Group by time windows (1-hour default)
- Training: GRPO optimizes model on successful decisions (scored via RLAIF)
- Deployment: New model replaces old without downtime
- Repeat: Loop continues indefinitely
Quick Start
1. Installation
cd packages/training/python
pip install -r requirements.txt
# Note: Python commands below assume you're in packages/training/python directory
# or have PYTHONPATH set to include packages/training/python/src
2. Configuration
Set environment variables:
# Database
DATABASE_URL=postgresql://user:pass@host/babylon
# Training
BASE_MODEL=unsloth/Qwen3-4B-128K
JUDGE_MODEL=gpt-4o-mini
ATROPOS_API_URL=http://localhost:8000
VLLM_PORT=9001
3. Start Training
The Atropos training pipeline requires three components to be running:
Terminal 1 - Atropos API Server:
run-api
Terminal 2 - Babylon RLAIF Environment:
python -m src.training.babylon_env serve --slurm false
Terminal 3 - GRPO Trainer:
python -m src.training.atropos_trainer \
--model unsloth/Qwen3-4B-128K \
--steps 100 \
--batch-size 4 \
--lr 1e-5
Architecture
The training pipeline uses three components:
- Atropos API Server (run-api): Coordinates batches between the environment and the trainer
- Babylon RLAIF Environment (babylon_env.py): Loads trajectories and scores them with an LLM judge
- GRPO Trainer (atropos_trainer.py): Trains the model using Group Relative Policy Optimization
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   PostgreSQL    │────▶│  Babylon RLAIF   │────▶│   Atropos API   │
│  Trajectories   │     │   Environment    │     │     Server      │
└─────────────────┘     └──────────────────┘     └────────┬────────┘
                                  │                        │
                                  │ LLM Judge              │ Batches
                                  ▼                        ▼
                          ┌──────────────┐        ┌─────────────────┐
                          │  GPT-4o-mini │        │  GRPO Trainer   │
                          │   (RLAIF)    │        │     + vLLM      │
                          └──────────────┘        └─────────────────┘
Training Process
The Atropos training loop:
- Environment loads trajectories from PostgreSQL
- LLM Judge scores trajectories using RLAIF (GPT-4o-mini)
- Atropos API batches scored trajectories
- GRPO Trainer trains on batches
- Model checkpoint saved periodically
- vLLM restarted with updated model
Data Collection
Windowing Strategy
Time Windows:
- Default: 1 hour
- Configurable: 15 minutes to 24 hours
- Agents: Minimum 3 per window
- Actions: Minimum 5 per trajectory
Example Window:
Window: 2024-11-13 10:00 - 11:00
- Agents: 5
- Trajectories: 47
- Avg Reward: 0.73
- Training: Yes
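A minimal sketch of this windowing logic in Python. The field names (timestamp, actions, agent_id) and helper names are illustrative assumptions, not the actual database schema or pipeline API:
from collections import defaultdict
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(hours=1)   # default window size (configurable: 15 min to 24 h)
MIN_AGENTS = 3                # minimum agents per window
MIN_ACTIONS = 5               # minimum actions per trajectory

def window_start(ts: datetime, window: timedelta = WINDOW) -> datetime:
    """Align a timestamp to the start of its time window."""
    step = int(window.total_seconds())
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % step, tz=timezone.utc)

def bucket_by_window(trajectories):
    """Group trajectories into time windows, dropping windows that are too sparse."""
    windows = defaultdict(list)
    for traj in trajectories:
        if len(traj["actions"]) >= MIN_ACTIONS:
            windows[window_start(traj["timestamp"])].append(traj)
    return {
        start: trajs for start, trajs in windows.items()
        if len({t["agent_id"] for t in trajs}) >= MIN_AGENTS
    }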
Quality Filters
Trajectories must have:
- Complete provider data
- LLM call with reasoning
- Action result
- Reward score
- All timestamps
Rejected if:
- Missing required fields
- Reward below threshold
- Incomplete data
- Malformed JSON
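A sketch of what such a filter might look like in Python. The record field names and the reward threshold are assumptions for illustration; the real checks live in the data bridge:
import json

REQUIRED_FIELDS = ("provider_data", "llm_call", "action_result", "reward", "timestamps")
REWARD_THRESHOLD = 0.0  # illustrative cutoff; the actual threshold is configurable

def is_trainable(record: dict) -> bool:
    """Return True if a trajectory record passes the quality filters."""
    # Reject if any required field is missing.
    if any(record.get(field) is None for field in REQUIRED_FIELDS):
        return False
    # Reject low-reward trajectories.
    if record["reward"] < REWARD_THRESHOLD:
        return False
    # Reject malformed JSON payloads.
    try:
        json.loads(record.get("stepsJson", "{}"))
    except (TypeError, json.JSONDecodeError):
        return False
    return True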
Training Configuration
Atropos Configuration
Using YAML config file:
python -m src.training.babylon_env serve --config config/babylon_atropos.yaml
Or CLI arguments:
python -m src.training.babylon_env serve \
--env--tokenizer_name Qwen/Qwen2.5-3B-Instruct \
--env--group_size 4 \
--env--max_token_length 4096 \
--env--database_url $DATABASE_URL \
--openai--model_name Qwen/Qwen2.5-3B-Instruct \
--openai--base_url http://localhost:9001/v1 \
--slurm false
Trainer Configuration
python -m src.training.atropos_trainer \
--model Qwen/Qwen2.5-3B-Instruct \
--steps 100 \
--batch-size 4 \
--lr 1e-5 \
--save-path ./trained_models \
--api-url http://localhost:8000 \
--vllm-port 9001
Key Configuration Options
| Option | Default | Description |
|---|---|---|
| group_size | 4 | Trajectories compared per GRPO group |
| max_token_length | 4096 | Maximum sequence length |
| lookback_hours | 72 | Hours to look back for trajectories |
| min_agents_per_window | 2 | Minimum agents required per window |
| judge_model | gpt-4o-mini | LLM model for RLAIF scoring |
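For reference, these options would typically sit alongside the scoring rubric in the YAML config. This is a sketch assuming they live under the env block shown later for config/babylon_atropos.yaml; check the shipped file for the exact layout:
env:
  group_size: 4
  max_token_length: 4096
  lookback_hours: 72
  min_agents_per_window: 2
  judge_model: gpt-4o-mini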
Supported Models
Recommended for Training
| Model | VRAM | Notes |
|---|---|---|
| unsloth/Qwen3-4B-128K | ~10GB | Default - 4B params, 128K context, ideal for fine-tuning |
| Qwen/Qwen2.5-3B-Instruct | ~8GB | Fast, good quality |
| Qwen/Qwen2.5-7B-Instruct | ~16GB | Better quality |
| Qwen/Qwen2.5-14B-Instruct | ~32GB | Best quality |
For Apple Silicon (MLX)
| Model | RAM | Notes |
|---|---|---|
| mlx-community/Qwen2.5-3B-Instruct-4bit | ~4GB | Fast |
| mlx-community/Qwen2.5-7B-Instruct-4bit | ~8GB | Good balance |
Monitoring
Training Metrics
Training metrics are logged to ./logs/training_metrics.jsonl:
{
"step": 100,
"loss": 0.21,
"grad_norm": 0.45,
"pos_logp": -2.3,
"neg_logp": -3.1,
"total_pos": 12,
"total_neg": 8
}
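A small sketch for inspecting these metrics from Python, assuming the JSONL path above (the key names match the sample record):
import json
from pathlib import Path

METRICS_PATH = Path("./logs/training_metrics.jsonl")

def load_metrics(path: Path = METRICS_PATH):
    """Read one metrics dict per line from the JSONL log."""
    with path.open() as fh:
        return [json.loads(line) for line in fh if line.strip()]

if __name__ == "__main__":
    for row in load_metrics()[-5:]:  # last few steps
        print(f"step {row['step']}: loss={row['loss']:.3f} grad_norm={row['grad_norm']:.3f}")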
Local Debugging
# View rollouts in browser
view-run
# Generate offline data
python -m src.training.babylon_env process \
--env--data_path_to_save_groups output/rollouts.jsonl \
--env--total_steps 10
RLAIF Scoring
The environment uses an LLM judge to score trajectories:
- Group Formation: Trajectories grouped by window/scenario
- Context Injection: P&L, episode length, actions provided to judge
- Relative Comparison: Judge compares trajectories within group
- Score Normalization: Scores normalized to mean 0 for GRPO
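The normalization step is simple to state: within each group, subtract the group mean so that above-average trajectories get positive advantages and below-average ones get negative advantages. A minimal sketch:
def normalize_scores(group_scores: list[float]) -> list[float]:
    """Shift judge scores within a group to zero mean for GRPO advantages."""
    mean = sum(group_scores) / len(group_scores)
    return [score - mean for score in group_scores]

# Example: judge scores for a group of 4 trajectories
print(normalize_scores([0.9, 0.7, 0.4, 0.2]))  # approx. [0.35, 0.15, -0.15, -0.35]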
Custom Scoring Rubric
Edit the scoring_rubric in your config:
env:
scoring_rubric: |
You are evaluating trading agent performance.
Score from 0.0 to 1.0 based on:
- Profitability (50%)
- Risk management (30%)
- Decision quality (20%)
Compare trajectories RELATIVE to each other.
Files
Core Training Files
- packages/training/python/src/training/babylon_env.py - RLAIF environment for Atropos
- packages/training/python/src/training/atropos_trainer.py - GRPO trainer
- packages/training/python/src/training/rewards.py - Reward functions
- packages/training/python/config/babylon_atropos.yaml - Default configuration
Data Bridge
- packages/training/python/src/data_bridge/reader.py - PostgreSQL trajectory reader
- packages/training/python/src/data_bridge/converter.py - Trajectory format conversion
Troubleshooting
No trajectories found
# Check database connection
python -c "import asyncpg; print('asyncpg OK')"
# Verify trajectories exist
psql $DATABASE_URL -c "SELECT COUNT(*) FROM trajectories WHERE \"stepsJson\" IS NOT NULL"
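The same count can be checked through asyncpg, which the check above already imports. A sketch; adjust the query if your schema differs:
import asyncio
import os

import asyncpg

async def count_usable_trajectories() -> int:
    conn = await asyncpg.connect(os.environ["DATABASE_URL"])
    try:
        return await conn.fetchval(
            'SELECT COUNT(*) FROM trajectories WHERE "stepsJson" IS NOT NULL'
        )
    finally:
        await conn.close()

print(asyncio.run(count_usable_trajectories()))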
vLLM not starting
# Check GPU availability
python -c "import torch; print(torch.cuda.is_available())"
# Check vLLM installation
python -c "import vllm; print(vllm.__version__)"Environment not connecting to API
# Verify API server is running
curl http://localhost:8000/
# Check registration
curl http://localhost:8000/status
HuggingFace Dataset
Training data is automatically published to HuggingFace daily via GitHub Actions.
Dataset: elizaos/babylon-game-data ✅ LIVE
Contains:
- Agent Trajectories: Complete gameplay with decisions + environment + ground truth
- Benchmark Scenarios: Game simulations for evaluation
- Model Performance: Results from model testing
- Organized by Month: Easy browsing (2025-10, 2025-11, etc.)
Updated: Daily at 2 AM UTC via GitHub Actions
Usage:
from datasets import load_dataset
# Load dataset
dataset = load_dataset("elizaos/babylon-game-data")
# Use for training
trajectories = dataset['train']
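To sanity-check what you pulled down, you can peek at a few records. The field names vary by subset, so this sketch just prints whatever keys are present:
from datasets import load_dataset

dataset = load_dataset("elizaos/babylon-game-data")
print(dataset)                                # splits and row counts
for record in dataset["train"].select(range(3)):
    print(sorted(record.keys()))              # inspect available fields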
Offline Simulation (100-1000x faster):
# Download
huggingface-cli download elizaos/babylon-game-data
# Run offline
npm run hf:offline -- --month=2025-11 --agent=my-agent
See: HuggingFace Integration for complete guide and setup
Resources
- Full Documentation: packages/training/python/README.md
- Atropos Trainer: packages/training/python/src/training/atropos_trainer.py
- Babylon Environment: packages/training/python/src/training/babylon_env.py
- Automation Pipeline: packages/training/src/training/AutomationPipeline.ts
- HuggingFace Dataset: https://huggingface.co/datasets/elizaos/babylon-game-data
Next Steps
- HuggingFace Integration - NEW! Dataset on HuggingFace
- Trajectory Logging
- Autonomous Agents
- Autonomous Agent Guide - Learn about autonomous agents
- Python LangGraph Example