# LLM Routing Engine - Quick Reference

## 🎯 One-Minute Overview

The LLM Routing Engine automatically selects the best LLM model for each request based on:

- **Cost** (cheapest model that meets requirements)
- **Speed** (fastest available model)
- **Quality** (highest capability within budget)
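As a rough sketch of the cost-based selection, the engine can rank candidate models by estimated price and pick the cheapest. This is a minimal, self-contained illustration using prices from the pricing tables below, not the actual service code:

```java
import java.util.Comparator;
import java.util.Map;

public class CheapestModelSketch {
    // Input/output prices in USD per 1M tokens (from the pricing tables below)
    static final Map<String, double[]> PRICES = Map.of(
            "gpt-4o", new double[]{2.50, 7.50},
            "gpt-4o-mini", new double[]{0.15, 0.60},
            "claude-3.5-sonnet", new double[]{3.00, 15.00},
            "ollama", new double[]{0.00, 0.00});

    // Raw cost: tokens * (price per 1M tokens) / 1,000,000
    static double cost(String model, long inputTokens, long outputTokens) {
        double[] p = PRICES.get(model);
        return (inputTokens * p[0] + outputTokens * p[1]) / 1_000_000.0;
    }

    // "cost" strategy: the cheapest candidate wins
    static String cheapest(long inputTokens, long outputTokens) {
        return PRICES.keySet().stream()
                .min(Comparator.comparingDouble(m -> cost(m, inputTokens, outputTokens)))
                .orElseThrow();
    }

    public static void main(String[] args) {
        System.out.println(cheapest(200, 500)); // ollama is free, so it always wins on pure cost
    }
}
```

The real service additionally filters by quality requirements and budgets before ranking, as described in the strategies section.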

## 📡 REST API Quick Start

### Route a Request

```bash
curl -X POST http://localhost:8080/v1/llm/route \
  -H "Content-Type: application/json" \
  -d '{
        "taskDescription": "Generate Python code to parse CSV",
        "estimatedInputTokens": 200,
        "estimatedOutputTokens": 500
      }'
```

Response:

```json
{
  "modelId": "gpt-4o-mini",
  "provider": "openai",
  "estimatedCost": 0.0219,
  "pricingSummary": "$0.15/$0.60 per 1M tokens"
}
```

### Get Current Pricing

```bash
curl http://localhost:8080/v1/llm/pricing
```

Response:

```json
{
  "timestamp": 1634567890000,
  "models": {
    "gpt-4o": { "input": 2.50, "output": 7.50 },
    "gpt-4o-mini": { "input": 0.15, "output": 0.60 },
    "claude-3.5-sonnet": { "input": 3.00, "output": 15.00 },
    ...
  }
}
```

## 🔌 Java Integration

### Service Injection

```java
@Autowired
private LLMRoutingService routingService;

@Autowired
private LLMCostCalculatorService costCalculator;
```

### Route a Request (Java)

```java
// Analyze the task and select the best model
String selectedModel = routingService.routeRequest(
    principalId,
    "Generate TypeScript types from OpenAPI spec",
    1500, // estimated input tokens
    2000  // estimated output tokens
);

// Use the selected model
LlmDetails llmDetails = llmDetailsRepository.findByModelId(selectedModel);
ChatMessage response = llmService.chat(llmDetails, userMessage);

// Record usage for cost tracking
routingService.recordUsage(
    principalId,
    selectedModel,
    actualInputTokens,
    actualOutputTokens,
    durationMs
);
```

### Estimate Costs

```java
// Get a cost estimate before making the request
BigDecimal estimatedCost = costCalculator.estimateCost(
    "gpt-4o-mini",
    1000, // input tokens
    2000  // output tokens
);

System.out.println("Estimated cost: $" + estimatedCost);
```

### Compare Models

```java
Map<String, BigDecimal> comparison = costCalculator.compareCosts(1000, 2000);
// Raw costs per the pricing tables below:
// {
//   "gpt-4o": 0.0175,
//   "gpt-4o-mini": 0.00135,
//   "claude-3.5-sonnet": 0.0330,
//   "ollama": 0.0000
// }
```

## 🎮 Supported Models

### Ultra-Premium (Best Quality)

```
gpt-4o          - $2.50/$7.50 per 1M tokens
claude-3.5-opus - $15.00/$75.00 per 1M tokens
gemini-1.5-pro  - $1.25/$5.00 per 1M tokens
```

### Premium (High Quality)

```
gpt-4-turbo       - $10.00/$30.00 per 1M tokens
claude-3.5-sonnet - $3.00/$15.00 per 1M tokens
```

### Standard (Good Balance)

```
gpt-4o-mini      - $0.15/$0.60 per 1M tokens
gemini-1.5-flash - $0.075/$0.30 per 1M tokens
claude-3.5-haiku - $0.80/$4.00 per 1M tokens
```

### Budget (Cheapest)

```
gpt-3.5-turbo   - $0.50/$1.50 per 1M tokens
ollama (Llama2) - FREE (local)
```

## 📊 Cost Examples

Generate a Python function (200 input, 300 output tokens):

| Model             | Input Cost | Output Cost | Total      |
|-------------------|------------|-------------|------------|
| gpt-4o            | $0.0005    | $0.00225    | $0.00275   |
| gpt-4o-mini       | $0.00003   | $0.00018    | $0.00021   |
| claude-3.5-sonnet | $0.0006    | $0.0045     | $0.00510   |
| ollama            | $0.00      | $0.00       | $0.00 ⭐   |
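The gpt-4o-mini row above can be double-checked with exact `BigDecimal` arithmetic. This is a standalone sketch, not the service's calculator:

```java
import java.math.BigDecimal;

public class CostTableCheck {
    // tokens * (price per 1M tokens) / 1,000,000, kept exact with BigDecimal
    static BigDecimal cost(long tokens, String pricePerMillion) {
        return BigDecimal.valueOf(tokens)
                .multiply(new BigDecimal(pricePerMillion))
                .divide(BigDecimal.valueOf(1_000_000));
    }

    public static void main(String[] args) {
        BigDecimal input = cost(200, "0.15");  // $0.00003
        BigDecimal output = cost(300, "0.60"); // $0.00018
        System.out.println(input.add(output)); // 0.00021, matching the table
    }
}
```

Using string-constructed `BigDecimal` values avoids the binary floating-point rounding that `double` prices would introduce at these tiny magnitudes.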

## 🎯 Routing Strategies

### Cost-Optimized Strategy

```json
{
  "name": "Cost Optimized",
  "strategy": "cost",
  "budgetPerMonth": 50.0,
  "budgetPerRequest": 1.0,
  "fallbackChain": ["gpt-4o-mini", "gpt-3.5-turbo", "ollama"]
}
```

### Quality-First Strategy

```json
{
  "name": "Premium Quality",
  "strategy": "quality",
  "minimumQualityTier": "PREMIUM",
  "budgetPerRequest": 5.0,
  "fallbackChain": ["gpt-4o", "claude-3.5-sonnet", "gpt-4o-mini"]
}
```

### Hybrid Strategy

```json
{
  "name": "Balanced",
  "strategy": "hybrid",
  "costWeight": 0.4,
  "speedWeight": 0.3,
  "qualityWeight": 0.3,
  "budgetPerMonth": 100.0
}
```
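For the hybrid strategy, a plausible reading of the weights is a weighted sum over normalized per-model scores. The sketch below is illustrative only; the score values are made up, and the actual service's scoring function may differ:

```java
public class HybridScoreSketch {
    // Weighted sum using the "Balanced" weights above: cost 0.4, speed 0.3, quality 0.3.
    // Each score is assumed normalized to [0, 1], higher being better.
    static double hybridScore(double costScore, double speedScore, double qualityScore) {
        return 0.4 * costScore + 0.3 * speedScore + 0.3 * qualityScore;
    }

    public static void main(String[] args) {
        // Hypothetical scores: a cheap, fast, mid-quality model vs. a pricier top-quality one
        double mini = hybridScore(0.95, 0.90, 0.60);  // 0.83
        double gpt4o = hybridScore(0.40, 0.60, 0.95); // 0.625
        System.out.println(mini > gpt4o); // with these weights, the cheaper model wins
    }
}
```

Raising `qualityWeight` relative to `costWeight` would tip the same comparison toward the premium model.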

## 📈 Usage Tracking

### Record Actual Usage

```java
routingService.recordUsage(
    principalId,
    "gpt-4o-mini",
    actualInputTokens,
    actualOutputTokens,
    durationMs
);
```

### Get Statistics

```bash
curl http://localhost:8080/v1/llm/stats/{principalId}
```

Response:

```json
{
  "principal": "user@example.com",
  "period": "2025-10-19",
  "totalRequests": 42,
  "totalTokens": 125000,
  "totalCost": 12.45,
  "modelBreakdown": {
    "gpt-4o-mini": { "requests": 35, "cost": 2.1 },
    "claude-3.5-haiku": { "requests": 7, "cost": 10.35 }
  },
  "averageLatency": 234,
  "budgetRemaining": 87.55
}
```

## 🔒 Security & Permissions

All endpoints require Spring Security authentication. The routing service:

- ✅ Enforces per-principal budgets
- ✅ Tracks usage per principal
- ✅ Respects role-based access
- ✅ Logs all requests for audit

โš™๏ธ Configurationโ€‹

Add to application.yaml:

valkyrai:
llm:
routingService:
defaultStrategy: "cost" # cost | quality | hybrid
maxRetries: 5
retryDelayMs: 2000
budgetCheckInterval: "60m"

costCalculator:
tokensPerChar: 0.25 # 1 token โ‰ˆ 4 chars
jsonOverhead: 0.10 # +10% for JSON
operationalOverhead: 0.20 # +20% ops cost

models:
gpt-4o:
inputPrice: 2.50 # per 1M tokens
outputPrice: 7.50
qualityTier: "ULTRA"
provider: "openai"
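Assuming the two overheads multiply the raw character-based token estimate (an assumption; the calculator's actual formula may differ), the `costCalculator` settings combine like this:

```java
public class OverheadEstimateSketch {
    // Mirrors the costCalculator settings above (assumed multiplicative semantics)
    static final double TOKENS_PER_CHAR = 0.25;      // 1 token ≈ 4 chars
    static final double JSON_OVERHEAD = 0.10;        // +10% for JSON
    static final double OPERATIONAL_OVERHEAD = 0.20; // +20% ops cost

    // Rough token estimate from character count, inflated by both overheads
    static long estimateTokens(int charCount) {
        double raw = charCount * TOKENS_PER_CHAR;
        return Math.round(raw * (1 + JSON_OVERHEAD) * (1 + OPERATIONAL_OVERHEAD));
    }

    public static void main(String[] args) {
        // 4000 chars ≈ 1000 raw tokens → 1320 after +10% and +20%
        System.out.println(estimateTokens(4000));
    }
}
```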

## 🧪 Testing

### Unit Test Example

```java
@Test
public void testCostRouting() {
    String selected = routingService.routeRequest(
        principalId,
        "simple task",
        100,
        200
    );

    assertThat(selected)
        .isIn("gpt-4o-mini", "gpt-3.5-turbo", "ollama");
}
```

๐Ÿ› Troubleshootingโ€‹

"Model not found"โ€‹

Solution: Check model name matches pricing table (case-sensitive)
Example: gpt-4o (not gpt4-o or GPT-4O)

"Budget exceeded"โ€‹

Solution:
1. Check monthly budget via /v1/llm/stats/{principalId}
2. Reduce request complexity
3. Use cheaper model tier

"High latency"โ€‹

Solution:
1. Use lower-quality tier (faster)
2. Enable caching if available
3. Use local model (ollama) for privacy

## 📚 More Documentation

- **Full Guide:** LLM_ROUTING_ENGINE_IMPLEMENTATION.md
- **Build Status:** LLM_ROUTING_BUILD_STATUS.md
- **API Spec:** Swagger UI available at `/v1/llm/swagger-ui.html`

Last Updated: October 19, 2025 | Status: Production Ready