# LLM Routing Engine - Quick Reference

## One-Minute Overview
The LLM Routing Engine automatically selects the best LLM model for each request based on:
- Cost (cheapest model that meets requirements)
- Speed (fastest available model)
- Quality (highest capability within budget)
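The cost-based selection step can be pictured as "cheapest model whose quality tier meets the request's floor." This is a minimal sketch of that idea, not the service's actual implementation; the `Model` record, numeric tier ordering, and prices here are illustrative assumptions (prices in USD per 1M tokens).

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of cost-based routing: among models meeting a
// minimum quality tier, pick the one with the lowest estimated cost.
public class CostRouterSketch {
    record Model(String id, double inputPrice, double outputPrice, int tier) {}

    static String cheapestMeeting(List<Model> models, int minTier,
                                  long inTokens, long outTokens) {
        return models.stream()
                .filter(m -> m.tier() >= minTier)
                .min(Comparator.comparingDouble(
                        m -> (inTokens * m.inputPrice() + outTokens * m.outputPrice()) / 1_000_000.0))
                .map(Model::id)
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<Model> models = List.of(
                new Model("gpt-4o", 2.50, 7.50, 3),
                new Model("gpt-4o-mini", 0.15, 0.60, 2),
                new Model("ollama", 0.0, 0.0, 1));
        // With no quality floor, the free local model wins.
        System.out.println(cheapestMeeting(models, 1, 200, 500)); // ollama
        // Requiring at least tier 2 selects the cheapest paid model.
        System.out.println(cheapestMeeting(models, 2, 200, 500)); // gpt-4o-mini
    }
}
```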
## REST API Quick Start

### Route a Request

```bash
curl -X POST http://localhost:8080/v1/llm/route \
  -H "Content-Type: application/json" \
  -d '{
    "taskDescription": "Generate Python code to parse CSV",
    "estimatedInputTokens": 200,
    "estimatedOutputTokens": 500
  }'
```
Response:

```json
{
  "modelId": "gpt-4o-mini",
  "provider": "openai",
  "estimatedCost": 0.0219,
  "pricingSummary": "$0.15/$0.60 per 1M tokens"
}
```
### Get Current Pricing

```bash
curl http://localhost:8080/v1/llm/pricing
```

Response:

```json
{
  "timestamp": 1634567890000,
  "models": {
    "gpt-4o": { "input": 2.50, "output": 7.50 },
    "gpt-4o-mini": { "input": 0.15, "output": 0.60 },
    "claude-3.5-sonnet": { "input": 3.00, "output": 15.00 },
    ...
  }
}
```
## Java Integration

### Service Injection

```java
@Autowired
private LLMRoutingService routingService;

@Autowired
private LLMCostCalculatorService costCalculator;
```
### Route a Request (Java)

```java
// Analyze the task and select the best model
String selectedModel = routingService.routeRequest(
    principalId,
    "Generate TypeScript types from OpenAPI spec",
    1500, // estimated input tokens
    2000  // estimated output tokens
);

// Use the selected model
LlmDetails llmDetails = llmDetailsRepository.findByModelId(selectedModel);
ChatMessage response = llmService.chat(llmDetails, userMessage);

// Record usage for cost tracking
routingService.recordUsage(
    principalId,
    selectedModel,
    actualInputTokens,
    actualOutputTokens,
    durationMs
);
```
### Estimate Costs

```java
// Get a cost estimate before making the request
BigDecimal estimatedCost = costCalculator.estimateCost(
    "gpt-4o-mini",
    1000, // input tokens
    2000  // output tokens
);
System.out.println("Estimated cost: $" + estimatedCost);
```
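The underlying arithmetic is the standard per-token pricing formula: cost = (inputTokens × inputPrice + outputTokens × outputPrice) / 1,000,000. A minimal sketch of just that formula, with gpt-4o-mini's rates hard-coded for illustration — note the real service may add overhead factors (see `operationalOverhead` in the configuration), so its estimates can come out higher than this raw figure:

```java
import java.math.BigDecimal;
import java.math.MathContext;

// Sketch of the raw per-request cost arithmetic,
// assuming prices are quoted in USD per 1M tokens.
public class CostMath {
    static BigDecimal cost(long inTokens, long outTokens,
                           BigDecimal inputPricePerM, BigDecimal outputPricePerM) {
        BigDecimal million = BigDecimal.valueOf(1_000_000);
        return inputPricePerM.multiply(BigDecimal.valueOf(inTokens))
                .add(outputPricePerM.multiply(BigDecimal.valueOf(outTokens)))
                .divide(million, MathContext.DECIMAL64);
    }

    public static void main(String[] args) {
        // gpt-4o-mini at $0.15 input / $0.60 output per 1M tokens:
        // 1000 input + 2000 output tokens cost $0.00135 at these rates.
        BigDecimal c = cost(1000, 2000, new BigDecimal("0.15"), new BigDecimal("0.60"));
        System.out.println("Estimated cost: $" + c);
    }
}
```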
### Compare Models

```java
Map<String, BigDecimal> comparison = costCalculator.compareCosts(1000, 2000);
// {
//   "gpt-4o": 0.0225,
//   "gpt-4o-mini": 0.0015,
//   "claude-3.5-sonnet": 0.0450,
//   "ollama": 0.0000
// }
```
## Supported Models

### Ultra-Premium (Best Quality)

- `gpt-4o` - $2.50/$7.50 per 1M tokens
- `claude-3.5-opus` - $15.00/$75.00 per 1M tokens
- `gemini-1.5-pro` - $1.25/$5.00 per 1M tokens

### Premium (High Quality)

- `gpt-4-turbo` - $10.00/$30.00 per 1M tokens
- `claude-3.5-sonnet` - $3.00/$15.00 per 1M tokens

### Standard (Good Balance)

- `gpt-4o-mini` - $0.15/$0.60 per 1M tokens
- `gemini-1.5-flash` - $0.075/$0.30 per 1M tokens
- `claude-3.5-haiku` - $0.80/$4.00 per 1M tokens

### Budget (Cheapest)

- `gpt-3.5-turbo` - $0.50/$1.50 per 1M tokens
- `ollama` (Llama2) - FREE (local)
## Cost Examples

Generate a Python function (200 input, 300 output tokens):

| Model | Input Cost | Output Cost | Total |
|---|---|---|---|
| gpt-4o | $0.0005 | $0.00225 | $0.00275 |
| gpt-4o-mini | $0.00003 | $0.00018 | $0.00021 |
| claude-3.5-sonnet | $0.0006 | $0.0045 | $0.0051 |
| ollama | $0.00 | $0.00 | $0.00 |
## Routing Strategies

### Cost-Optimized Strategy

```json
{
  "name": "Cost Optimized",
  "strategy": "cost",
  "budgetPerMonth": 50.0,
  "budgetPerRequest": 1.0,
  "fallbackChain": ["gpt-4o-mini", "gpt-3.5-turbo", "ollama"]
}
```
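One plausible way such a fallback chain is applied — walk it in order and take the first model whose estimated cost fits the per-request budget. This is a hedged sketch of that interpretation, not the service's actual resolution logic; the cost map values are illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Sketch of fallback-chain resolution, assuming a per-request budget (USD)
// and a map of {modelId -> estimated cost for this particular request}.
public class FallbackSketch {
    static Optional<String> firstWithinBudget(List<String> chain,
                                              Map<String, Double> estimatedCost,
                                              double budgetPerRequest) {
        return chain.stream()
                .filter(m -> estimatedCost.getOrDefault(m, Double.MAX_VALUE) <= budgetPerRequest)
                .findFirst();
    }

    public static void main(String[] args) {
        Map<String, Double> cost = Map.of(
                "gpt-4o-mini", 0.0015, "gpt-3.5-turbo", 0.0035, "ollama", 0.0);
        List<String> chain = List.of("gpt-4o-mini", "gpt-3.5-turbo", "ollama");
        // A $1.00 budget accepts the first model in the chain.
        System.out.println(firstWithinBudget(chain, cost, 1.0));   // Optional[gpt-4o-mini]
        // A very tight budget falls through to the free local model.
        System.out.println(firstWithinBudget(chain, cost, 0.001)); // Optional[ollama]
    }
}
```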
### Quality-First Strategy

```json
{
  "name": "Premium Quality",
  "strategy": "quality",
  "minimumQualityTier": "PREMIUM",
  "budgetPerRequest": 5.0,
  "fallbackChain": ["gpt-4o", "claude-3.5-sonnet", "gpt-4o-mini"]
}
```
### Hybrid Strategy

```json
{
  "name": "Balanced",
  "strategy": "hybrid",
  "costWeight": 0.4,
  "speedWeight": 0.3,
  "qualityWeight": 0.3,
  "budgetPerMonth": 100.0
}
```
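The hybrid weights suggest a simple weighted sum over per-model scores. A sketch under that assumption — `costScore`, `speedScore`, and `qualityScore` are assumed to be normalized to [0, 1] with higher meaning better, which is not stated in the strategy spec itself:

```java
// Sketch of hybrid scoring with the "Balanced" strategy's weights.
// The model with the highest weighted score would be selected.
public class HybridScoreSketch {
    static double score(double costScore, double speedScore, double qualityScore,
                        double costWeight, double speedWeight, double qualityWeight) {
        return costWeight * costScore + speedWeight * speedScore + qualityWeight * qualityScore;
    }

    public static void main(String[] args) {
        // A cheap, fast, mid-quality model vs. a pricier, slower, top-quality one:
        double mini  = score(0.95, 0.90, 0.60, 0.4, 0.3, 0.3); // 0.830
        double gpt4o = score(0.40, 0.60, 0.95, 0.4, 0.3, 0.3); // 0.625
        System.out.printf("mini=%.3f gpt4o=%.3f%n", mini, gpt4o);
    }
}
```

With these weights the cheaper model wins; raising `qualityWeight` shifts the balance toward the premium tier.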
## Usage Tracking

### Record Actual Usage

```java
routingService.recordUsage(
    principalId,
    "gpt-4o-mini",
    actualInputTokens,
    actualOutputTokens,
    durationMs
);
```
### Get Statistics

```bash
curl http://localhost:8080/v1/llm/stats/{principalId}
```

Response:

```json
{
  "principal": "user@example.com",
  "period": "2025-10-19",
  "totalRequests": 42,
  "totalTokens": 125000,
  "totalCost": 12.45,
  "modelBreakdown": {
    "gpt-4o-mini": { "requests": 35, "cost": 2.1 },
    "claude-3.5-haiku": { "requests": 7, "cost": 10.35 }
  },
  "averageLatency": 234,
  "budgetRemaining": 87.55
}
```
## Security & Permissions

All endpoints require Spring Security authentication. The routing service:

- Enforces per-principal budgets
- Tracks usage per principal
- Respects role-based access
- Logs all requests for audit
## Configuration

Add to `application.yaml`:

```yaml
valkyrai:
  llm:
    routingService:
      defaultStrategy: "cost"     # cost | quality | hybrid
      maxRetries: 5
      retryDelayMs: 2000
      budgetCheckInterval: "60m"
    costCalculator:
      tokensPerChar: 0.25         # 1 token ~ 4 chars
      jsonOverhead: 0.10          # +10% for JSON
      operationalOverhead: 0.20   # +20% ops cost
    models:
      gpt-4o:
        inputPrice: 2.50          # per 1M tokens
        outputPrice: 7.50
        qualityTier: "ULTRA"
        provider: "openai"
```
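The `tokensPerChar` and `jsonOverhead` settings suggest how request size might be estimated from raw text before pricing. A sketch assuming that interpretation (tokens ≈ characters × `tokensPerChar`, inflated by the JSON overhead factor) — the method name and rounding choice are assumptions, not the calculator's documented behavior:

```java
// Hypothetical token-estimation step using the configuration values above:
// tokens ~= charCount * tokensPerChar, inflated by the JSON overhead factor.
public class TokenEstimateSketch {
    static long estimateTokens(int charCount, double tokensPerChar, double jsonOverhead) {
        return Math.round(charCount * tokensPerChar * (1.0 + jsonOverhead));
    }

    public static void main(String[] args) {
        // 4000 characters at ~0.25 tokens/char, +10% JSON overhead:
        System.out.println(estimateTokens(4000, 0.25, 0.10)); // 1100
    }
}
```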
## Testing

### Unit Test Example

```java
@Test
public void testCostRouting() {
    String selected = routingService.routeRequest(
        principalId,
        "simple task",
        100,
        200
    );
    assertThat(selected)
        .isIn("gpt-4o-mini", "gpt-3.5-turbo", "ollama");
}
```
## Troubleshooting

### "Model not found"

Solution: check that the model name matches the pricing table exactly (names are case-sensitive).
Example: `gpt-4o` (not `gpt4-o` or `GPT-4O`)

### "Budget exceeded"

Solution:

1. Check the monthly budget via `/v1/llm/stats/{principalId}`
2. Reduce request complexity
3. Use a cheaper model tier

### "High latency"

Solution:

1. Use a lower quality tier (typically faster)
2. Enable caching if available
3. Use the local model (`ollama`), which avoids network round-trips
## More Documentation

- Full Guide: LLM_ROUTING_ENGINE_IMPLEMENTATION.md
- Build Status: LLM_ROUTING_BUILD_STATUS.md
- API Spec: Swagger UI available at `/v1/llm/swagger-ui.html`
Last Updated: October 19, 2025 | Status: Production Ready