# LLM Routing Engine - Quick Reference

## 🎯 One-Minute Overview

The LLM Routing Engine automatically selects the best LLM model for each request based on:

- **Cost** (cheapest model that meets requirements)
- **Speed** (fastest available model)
- **Quality** (highest capability within budget)
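As a rough sketch of the cost-based selection, the engine can rank candidate models by estimated price and pick the cheapest. This is a minimal, self-contained illustration using prices from the pricing tables below, not the actual service code:

```java
import java.util.Comparator;
import java.util.Map;

public class CheapestModelSketch {
    // Input/output prices in USD per 1M tokens (from the pricing tables below)
    static final Map<String, double[]> PRICES = Map.of(
            "gpt-4o", new double[]{2.50, 7.50},
            "gpt-4o-mini", new double[]{0.15, 0.60},
            "claude-3.5-sonnet", new double[]{3.00, 15.00},
            "ollama", new double[]{0.00, 0.00});

    // Raw cost: tokens * (price per 1M tokens) / 1,000,000
    static double cost(String model, long inputTokens, long outputTokens) {
        double[] p = PRICES.get(model);
        return (inputTokens * p[0] + outputTokens * p[1]) / 1_000_000.0;
    }

    // "cost" strategy: the cheapest candidate wins
    static String cheapest(long inputTokens, long outputTokens) {
        return PRICES.keySet().stream()
                .min(Comparator.comparingDouble(m -> cost(m, inputTokens, outputTokens)))
                .orElseThrow();
    }

    public static void main(String[] args) {
        System.out.println(cheapest(200, 500)); // ollama is free, so it always wins on pure cost
    }
}
```

The real service additionally filters by quality requirements and budgets before ranking, as described in the strategies section.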

## 📡 REST API Quick Start

### Route a Request

```bash
curl -X POST http://localhost:8080/v1/llm/route \
  -H "Content-Type: application/json" \
  -d '{
        "taskDescription": "Generate Python code to parse CSV",
        "estimatedInputTokens": 200,
        "estimatedOutputTokens": 500
      }'
```

Response:

```json
{
  "modelId": "gpt-4o-mini",
  "provider": "openai",
  "estimatedCost": 0.0219,
  "pricingSummary": "$0.15/$0.60 per 1M tokens"
}
```

### Get Current Pricing

```bash
curl http://localhost:8080/v1/llm/pricing
```

Response:

```json
{
  "timestamp": 1634567890000,
  "models": {
    "gpt-4o": { "input": 2.50, "output": 7.50 },
    "gpt-4o-mini": { "input": 0.15, "output": 0.60 },
    "claude-3.5-sonnet": { "input": 3.00, "output": 15.00 },
    ...
  }
}
```

## 🔌 Java Integration

### Service Injection

```java
@Autowired
private LLMRoutingService routingService;

@Autowired
private LLMCostCalculatorService costCalculator;
```

### Route a Request (Java)

```java
// Analyze the task and select the best model
String selectedModel = routingService.routeRequest(
    principalId,
    "Generate TypeScript types from OpenAPI spec",
    1500, // estimated input tokens
    2000  // estimated output tokens
);

// Use the selected model
LlmDetails llmDetails = llmDetailsRepository.findByModelId(selectedModel);
ChatMessage response = llmService.chat(llmDetails, userMessage);

// Record usage for cost tracking
routingService.recordUsage(
    principalId,
    selectedModel,
    actualInputTokens,
    actualOutputTokens,
    durationMs
);
```

### Estimate Costs

```java
// Get a cost estimate before making the request
BigDecimal estimatedCost = costCalculator.estimateCost(
    "gpt-4o-mini",
    1000, // input tokens
    2000  // output tokens
);

System.out.println("Estimated cost: $" + estimatedCost);
```

### Compare Models

```java
Map<String, BigDecimal> comparison = costCalculator.compareCosts(1000, 2000);
// Raw costs per the pricing tables below:
// {
//   "gpt-4o": 0.0175,
//   "gpt-4o-mini": 0.00135,
//   "claude-3.5-sonnet": 0.0330,
//   "ollama": 0.0000
// }
```

## 🎮 Supported Models

### Ultra-Premium (Best Quality)

```
gpt-4o          - $2.50/$7.50 per 1M tokens
claude-3.5-opus - $15.00/$75.00 per 1M tokens
gemini-1.5-pro  - $1.25/$5.00 per 1M tokens
```

### Premium (High Quality)

```
gpt-4-turbo       - $10.00/$30.00 per 1M tokens
claude-3.5-sonnet - $3.00/$15.00 per 1M tokens
```

### Standard (Good Balance)

```
gpt-4o-mini      - $0.15/$0.60 per 1M tokens
gemini-1.5-flash - $0.075/$0.30 per 1M tokens
claude-3.5-haiku - $0.80/$4.00 per 1M tokens
```

### Budget (Cheapest)

```
gpt-3.5-turbo   - $0.50/$1.50 per 1M tokens
ollama (Llama2) - FREE (local)
```

## 📊 Cost Examples

Generate a Python function (200 input, 300 output tokens):

| Model             | Input Cost | Output Cost | Total      |
|-------------------|------------|-------------|------------|
| gpt-4o            | $0.0005    | $0.00225    | $0.00275   |
| gpt-4o-mini       | $0.00003   | $0.00018    | $0.00021   |
| claude-3.5-sonnet | $0.0006    | $0.0045     | $0.00510   |
| ollama            | $0.00      | $0.00       | $0.00 ⭐   |
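The gpt-4o-mini row above can be double-checked with exact `BigDecimal` arithmetic. This is a standalone sketch, not the service's calculator:

```java
import java.math.BigDecimal;

public class CostTableCheck {
    // tokens * (price per 1M tokens) / 1,000,000, kept exact with BigDecimal
    static BigDecimal cost(long tokens, String pricePerMillion) {
        return BigDecimal.valueOf(tokens)
                .multiply(new BigDecimal(pricePerMillion))
                .divide(BigDecimal.valueOf(1_000_000));
    }

    public static void main(String[] args) {
        BigDecimal input = cost(200, "0.15");  // $0.00003
        BigDecimal output = cost(300, "0.60"); // $0.00018
        System.out.println(input.add(output)); // 0.00021, matching the table
    }
}
```

Using string-constructed `BigDecimal` values avoids the binary floating-point rounding that `double` prices would introduce at these tiny magnitudes.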

## 🎯 Routing Strategies

### Cost-Optimized Strategy

```json
{
  "name": "Cost Optimized",
  "strategy": "cost",
  "budgetPerMonth": 50.0,
  "budgetPerRequest": 1.0,
  "fallbackChain": ["gpt-4o-mini", "gpt-3.5-turbo", "ollama"]
}
```

### Quality-First Strategy

```json
{
  "name": "Premium Quality",
  "strategy": "quality",
  "minimumQualityTier": "PREMIUM",
  "budgetPerRequest": 5.0,
  "fallbackChain": ["gpt-4o", "claude-3.5-sonnet", "gpt-4o-mini"]
}
```

### Hybrid Strategy

```json
{
  "name": "Balanced",
  "strategy": "hybrid",
  "costWeight": 0.4,
  "speedWeight": 0.3,
  "qualityWeight": 0.3,
  "budgetPerMonth": 100.0
}
```
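For the hybrid strategy, a plausible reading of the weights is a weighted sum over normalized per-model scores. The sketch below is illustrative only; the score values are made up, and the actual service's scoring function may differ:

```java
public class HybridScoreSketch {
    // Weighted sum using the "Balanced" weights above: cost 0.4, speed 0.3, quality 0.3.
    // Each score is assumed normalized to [0, 1], higher being better.
    static double hybridScore(double costScore, double speedScore, double qualityScore) {
        return 0.4 * costScore + 0.3 * speedScore + 0.3 * qualityScore;
    }

    public static void main(String[] args) {
        // Hypothetical scores: a cheap, fast, mid-quality model vs. a pricier top-quality one
        double mini = hybridScore(0.95, 0.90, 0.60);  // 0.83
        double gpt4o = hybridScore(0.40, 0.60, 0.95); // 0.625
        System.out.println(mini > gpt4o); // with these weights, the cheaper model wins
    }
}
```

Raising `qualityWeight` relative to `costWeight` would tip the same comparison toward the premium model.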

## 📈 Usage Tracking

### Record Actual Usage

```java
routingService.recordUsage(
    principalId,
    "gpt-4o-mini",
    actualInputTokens,
    actualOutputTokens,
    durationMs
);
```

### Get Statistics

```bash
curl http://localhost:8080/v1/llm/stats/{principalId}
```

Response:

```json
{
  "principal": "user@example.com",
  "period": "2025-10-19",
  "totalRequests": 42,
  "totalTokens": 125000,
  "totalCost": 12.45,
  "modelBreakdown": {
    "gpt-4o-mini": { "requests": 35, "cost": 2.1 },
    "claude-3.5-haiku": { "requests": 7, "cost": 10.35 }
  },
  "averageLatency": 234,
  "budgetRemaining": 87.55
}
```

## 🔒 Security & Permissions

All endpoints require Spring Security authentication. The routing service:

- ✅ Enforces per-principal budgets
- ✅ Tracks usage per principal
- ✅ Respects role-based access
- ✅ Logs all requests for audit

โš™๏ธ Configurationโ€‹

Add to application.yaml:

valkyrai:
llm:
routingService:
defaultStrategy: "cost" # cost | quality | hybrid
maxRetries: 5
retryDelayMs: 2000
budgetCheckInterval: "60m"

costCalculator:
tokensPerChar: 0.25 # 1 token โ‰ˆ 4 chars
jsonOverhead: 0.10 # +10% for JSON
operationalOverhead: 0.20 # +20% ops cost

models:
gpt-4o:
inputPrice: 2.50 # per 1M tokens
outputPrice: 7.50
qualityTier: "ULTRA"
provider: "openai"
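Assuming the two overheads multiply the raw character-based token estimate (an assumption; the calculator's actual formula may differ), the `costCalculator` settings combine like this:

```java
public class OverheadEstimateSketch {
    // Mirrors the costCalculator settings above (assumed multiplicative semantics)
    static final double TOKENS_PER_CHAR = 0.25;      // 1 token ≈ 4 chars
    static final double JSON_OVERHEAD = 0.10;        // +10% for JSON
    static final double OPERATIONAL_OVERHEAD = 0.20; // +20% ops cost

    // Rough token estimate from character count, inflated by both overheads
    static long estimateTokens(int charCount) {
        double raw = charCount * TOKENS_PER_CHAR;
        return Math.round(raw * (1 + JSON_OVERHEAD) * (1 + OPERATIONAL_OVERHEAD));
    }

    public static void main(String[] args) {
        // 4000 chars ≈ 1000 raw tokens → 1320 after +10% and +20%
        System.out.println(estimateTokens(4000));
    }
}
```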

## 🧪 Testing

### Unit Test Example

```java
@Test
public void testCostRouting() {
    String selected = routingService.routeRequest(
        principalId,
        "simple task",
        100,
        200
    );

    assertThat(selected)
        .isIn("gpt-4o-mini", "gpt-3.5-turbo", "ollama");
}
```

๐Ÿ› Troubleshootingโ€‹

"Model not found"โ€‹

Solution: Check model name matches pricing table (case-sensitive)
Example: gpt-4o (not gpt4-o or GPT-4O)

"Budget exceeded"โ€‹

Solution:
1. Check monthly budget via /v1/llm/stats/{principalId}
2. Reduce request complexity
3. Use cheaper model tier

"High latency"โ€‹

Solution:
1. Use lower-quality tier (faster)
2. Enable caching if available
3. Use local model (ollama) for privacy

## 📚 More Documentation

- **Full Guide:** LLM_ROUTING_ENGINE_IMPLEMENTATION.md
- **Build Status:** LLM_ROUTING_BUILD_STATUS.md
- **API Spec:** Swagger UI available at `/v1/llm/swagger-ui.html`

Last Updated: October 19, 2025 | Status: Production Ready