ValkyrAI Workflow Engine v2.0 — Quick Start Guide
Status: Phase 1 Complete ✅ — OpenAPI schemas & endpoints defined
Next: Regenerate code → Implement services → Ship to production
🎯 What Just Happened
We've implemented the foundation for ValkyrAI Workflow Engine v2.0 based on your PRD:
✅ New Data Models (ThorAPI schemas)
- WorkflowExecution — Separate execution instances from workflow definitions
- Run — Granular task attempt tracking with idempotency + leasing
- DeadLetterQueue — Quarantine & replay failed runs
- CircuitBreakerState — Protect external dependencies
✅ New API Endpoints
- /WorkflowExecution/{id}/cancel|pause|resume: Execution control
- /Run/{id}/heartbeat: Runner keepalive
- /DeadLetterQueue/{id}/requeue|discard: DLQ operations
The standard CRUD endpoints (GET/POST/PUT/DELETE for list, read, create, update, and delete) will be auto-generated by ThorAPI.
🚀 Next Steps (Do This Now)
1. Regenerate Code
cd /Users/johnmcmahon/workspace/2025/valkyr/ValkyrAI  # adjust to your local checkout of the ValkyrAI repo
# Generate Java models, repositories, services, controllers
mvn -pl thorapi clean install
# Verify new models generated:
ls valkyrai/generated/spring/src/main/java/com/valkyrlabs/model/WorkflowExecution.java
ls valkyrai/generated/spring/src/main/java/com/valkyrlabs/model/Run.java
ls valkyrai/generated/spring/src/main/java/com/valkyrlabs/model/DeadLetterQueue.java
ls valkyrai/generated/spring/src/main/java/com/valkyrlabs/model/CircuitBreakerState.java
# Regenerate web TypeScript clients
cd web
npm run generate:api # or equivalent ThorAPI command
2. Database Migration
Create the migration file, then run it:
# Create migration file
cat > valkyrai/src/main/resources/db/migration/V2.0__workflow_execution_tracking.sql <<'EOF'
-- See WORKFLOW_ENGINE_V2_PHASE1_COMPLETE.md for full SQL
CREATE TABLE workflow_execution (...);
ALTER TABLE run ADD COLUMN exec_module_id UUID;
-- ... (see Phase 1 doc for complete script)
EOF
# Run migration (if using Flyway/Liquibase)
mvn flyway:migrate
# OR restart app with migration enabled
3. Implement Core Services
Priority Order:
- RunService (2-3 days)
  - Idempotency key generation
  - Lease management
  - Heartbeat tracking
  - File: valkyrai/src/main/java/com/valkyrlabs/workflow/service/RunService.java
- WorkflowExecutionService (1-2 days)
  - Wrap existing ValkyrWorkflowService
  - Track execution lifecycle
  - File: valkyrai/src/main/java/com/valkyrlabs/workflow/service/WorkflowExecutionService.java
- DLQService (1 day)
  - Requeue/discard logic
  - File: valkyrai/src/main/java/com/valkyrlabs/workflow/service/DLQService.java
- RunnerService (2-3 days)
  - Poll for pending runs
  - Execute with heartbeat
  - Zombie reaper
  - File: valkyrai/src/main/java/com/valkyrlabs/workflow/runner/RunnerService.java
4. Implement Controllers
All custom (non-CRUD) endpoints:
- WorkflowExecutionController: cancel/pause/resume operations
- DLQController: requeue/discard operations
Note: ThorAPI generates the base CRUD controllers in step 1; you only need to add the custom operation methods.
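The cancel/pause/resume operations imply a small state machine behind the controller. The following is a minimal sketch of one way to guard those transitions, not the generated ThorAPI code: the class name ExecutionControl and the four states are hypothetical, and the real WorkflowExecution status values may differ.

```java
/**
 * Hypothetical guard for execution control operations.
 * State names are assumptions; they are not taken from the ThorAPI schema.
 */
final class ExecutionControl {
    enum State { RUNNING, PAUSED, CANCELLED, COMPLETED }

    private State state = State.RUNNING;

    synchronized State state() {
        return state;
    }

    /** Pausing is only valid while the execution is running. */
    synchronized void pause() {
        transition(State.RUNNING, State.PAUSED);
    }

    /** Resuming is only valid from the paused state. */
    synchronized void resume() {
        transition(State.PAUSED, State.RUNNING);
    }

    /** Cancelling is valid from any non-terminal state. */
    synchronized void cancel() {
        if (state == State.CANCELLED || state == State.COMPLETED) {
            throw new IllegalStateException("Cannot cancel an execution in state " + state);
        }
        state = State.CANCELLED;
    }

    private void transition(State expected, State next) {
        if (state != expected) {
            throw new IllegalStateException("Cannot move from " + state + " to " + next);
        }
        state = next;
    }
}
```

A controller method could catch the IllegalStateException and translate it into a client error response; which status code to use is left to the implementation.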
5. Frontend Components
- Generate RTK Query hooks (auto-generated from OpenAPI)
- WorkflowExecutionMonitor component (see Phase 1 doc)
- DLQBrowser component (see Phase 1 doc)
- Update WorkflowStudio to use executions instead of direct workflow runs
📚 Reference Documents
- WORKFLOW_ENGINE_V2_IMPLEMENTATION.md — Full implementation plan with code samples
- WORKFLOW_ENGINE_V2_PHASE1_COMPLETE.md — This phase summary + SQL migration
- Original PRD — Your requirements document (v1.0)
🎓 Key Concepts
Idempotency
Every run gets a content hash of inputs + config:
String idempotencyKey = SHA256(inputs) + ":" + SHA256(config);
Duplicate requests with same key → deduplicated (no duplicate side effects).
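The one-liner above is pseudocode. A runnable sketch using the JDK's MessageDigest might look like the following; the class and method names are hypothetical, and it assumes inputs and config have already been serialized to a canonical string (for example JSON with sorted keys), since two serializations of the same content would otherwise hash differently.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper; not part of the generated ThorAPI code. */
final class IdempotencyKeys {
    private IdempotencyKeys() {}

    /** Returns the lowercase hex SHA-256 digest of a UTF-8 string. */
    static String sha256Hex(String value) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) {
                hex.append(Character.forDigit((b >> 4) & 0xF, 16));
                hex.append(Character.forDigit(b & 0xF, 16));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Builds the key as hash(inputs) + ":" + hash(config), as described above. */
    static String idempotencyKey(String canonicalInputs, String canonicalConfig) {
        return sha256Hex(canonicalInputs) + ":" + sha256Hex(canonicalConfig);
    }
}
```

Deduplication then reduces to a uniqueness check on this key, for instance a unique index on the run table; that storage choice is an assumption and not something this guide specifies.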
Lease Mechanism
Runners acquire a lease on a run:
- Lease expires after 2 minutes (configurable)
- Heartbeat every 2 seconds extends lease
- Zombie reaper reclaims expired leases
- Prevents double execution across crashes
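The lease rules above can be sketched with an in-memory table. This is an illustration under stated assumptions, not the RunService implementation: the LeaseTable class and its method names are hypothetical, a real implementation would presumably keep lease state in the database, and the current time is passed in explicitly so the behavior is easy to test.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory lease table keyed by run id. */
final class LeaseTable {
    private static final class Lease {
        final String runnerId;
        final Instant expiresAt;

        Lease(String runnerId, Instant expiresAt) {
            this.runnerId = runnerId;
            this.expiresAt = expiresAt;
        }
    }

    private final Duration ttl;
    private final Map<String, Lease> leases = new HashMap<>();

    LeaseTable(Duration ttl) {
        this.ttl = ttl;
    }

    /** Grants the lease if the run is unleased or its current lease has expired. */
    synchronized boolean tryAcquire(String runId, String runnerId, Instant now) {
        Lease current = leases.get(runId);
        if (current != null && current.expiresAt.isAfter(now)) {
            return false;
        }
        leases.put(runId, new Lease(runnerId, now.plus(ttl)));
        return true;
    }

    /** Extends the lease, but only for the runner that still holds an unexpired lease. */
    synchronized boolean heartbeat(String runId, String runnerId, Instant now) {
        Lease current = leases.get(runId);
        if (current == null || !current.runnerId.equals(runnerId) || !current.expiresAt.isAfter(now)) {
            return false;
        }
        leases.put(runId, new Lease(runnerId, now.plus(ttl)));
        return true;
    }

    /** Zombie reaper pass: drops expired leases and returns the run ids to reschedule. */
    synchronized List<String> reapExpired(Instant now) {
        List<String> reclaimed = new ArrayList<>();
        leases.entrySet().removeIf(entry -> {
            boolean expired = !entry.getValue().expiresAt.isAfter(now);
            if (expired) {
                reclaimed.add(entry.getKey());
            }
            return expired;
        });
        return reclaimed;
    }
}
```

With the 2-minute lease from the list above, a runner would construct the table as new LeaseTable(Duration.ofMinutes(2)) and call heartbeat on its own schedule. A runner whose heartbeat returns false has lost the lease and should stop working on that run.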
DLQ (Dead Letter Queue)
Runs that fail permanently (max retries, permanent errors) → quarantined:
- Operator can requeue with input overrides
- Operator can discard with notes
- Tracks resolution workflow
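As a rough sketch of the requeue and discard operations, the class below models a single quarantined run. The DeadLetterEntry class, its Resolution values, and the merge-on-requeue behavior are assumptions made for illustration; the actual DeadLetterQueue schema is defined in the ThorAPI spec and may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of one quarantined run and its resolution. */
final class DeadLetterEntry {
    enum Resolution { PENDING, REQUEUED, DISCARDED }

    private final String runId;
    private final Map<String, Object> inputs;
    private Resolution resolution = Resolution.PENDING;
    private String notes;

    DeadLetterEntry(String runId, Map<String, Object> inputs) {
        this.runId = runId;
        this.inputs = new HashMap<>(inputs); // defensive copy of the failed run's inputs
    }

    String runId() { return runId; }
    Resolution resolution() { return resolution; }
    String notes() { return notes; }

    /** Marks the entry requeued and returns the inputs for the replay, with overrides applied. */
    Map<String, Object> requeue(Map<String, Object> overrides) {
        requirePending();
        Map<String, Object> merged = new HashMap<>(inputs);
        merged.putAll(overrides);
        resolution = Resolution.REQUEUED;
        return merged;
    }

    /** Marks the entry discarded and records the operator's notes. */
    void discard(String operatorNotes) {
        requirePending();
        notes = operatorNotes;
        resolution = Resolution.DISCARDED;
    }

    private void requirePending() {
        if (resolution != Resolution.PENDING) {
            throw new IllegalStateException("Entry already resolved: " + resolution);
        }
    }
}
```

Rejecting a second resolution keeps the audit trail unambiguous: each quarantined run ends up either requeued or discarded, never both.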
🧪 Testing Checklist
- Idempotency test: Submit duplicate run → verify deduplicated
- Crash recovery test: Kill runner mid-execution → verify resume without duplication
- DLQ test: Force permanent failure → verify quarantine → requeue → success
- Lease expiry test: Stop heartbeat → verify zombie reaper reclaims
- Performance test: 10k concurrent executions → verify P95 ≤ 75ms dispatch
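For the performance test, the P95 check needs a percentile definition. The sketch below uses the nearest-rank method, which is one common choice and an assumption here; the LatencyStats class is hypothetical, and the samples are assumed to be dispatch latencies in milliseconds collected by the test harness.

```java
import java.util.Arrays;

/** Hypothetical helper for checking the dispatch latency target. */
final class LatencyStats {
    private LatencyStats() {}

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * p percent of all samples are less than or equal to it.
     */
    static double percentile(double[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("No samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        double[] sorted = samplesMs.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    /** True when the P95 latency is within the 75 ms target from the checklist. */
    static boolean meetsDispatchTarget(double[] samplesMs) {
        return percentile(samplesMs, 95.0) <= 75.0;
    }
}
```

A load test would collect one latency sample per dispatched execution and assert meetsDispatchTarget on the full set at the end of the run.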
🏗️ Architecture Principles
- ThorAPI-First: All models defined in OpenAPI → code generated
- Separation of Concerns: Workflow = definition; WorkflowExecution = runtime instance
- Crash-Safe: Lease + heartbeat + zombie reaper = no lost tasks
- Idempotent: Content-based deduplication = no duplicate side effects
- Observable: Every run tracked; OpenTelemetry spans; DLQ for failures
🚦 Current Status
✅ Phase 1: Data Models & API Endpoints (COMPLETE)
🔄 Phase 2: Code Generation (IN PROGRESS — run mvn install)
⏳ Phase 3: Service Layer (NEXT — 3-5 days)
⏳ Phase 4: Runner Pool (2-3 days)
⏳ Phase 5: Controllers (1 day)
⏳ Phase 6: Observability (2 days)
⏳ Phase 7: Frontend (3-4 days)
⏳ Phase 8: Integration & Testing (3-5 days)
Estimated Time to Production: 2-3 weeks
💡 Pro Tips
- Backward Compatibility: Existing ValkyrWorkflowService still works; we're wrapping it
- Incremental Rollout: Start with select workflows; gradually migrate all
- Monitoring: Set up OpenTelemetry exporter for Jaeger/Grafana Tempo
- DLQ Dashboard: Build operator UI early for visibility
🤝 Team Coordination
Backend:
- Implement services (RunService, WorkflowExecutionService, DLQService)
- Database migrations
- Integration tests
Frontend:
- Regenerate TypeScript clients
- Build WorkflowExecutionMonitor component
- Build DLQ browser
- Update WorkflowStudio
DevOps:
- Configure runner pods (K8s deployment)
- Set up OpenTelemetry collector
- Database migration automation
📞 Questions?
See the detailed implementation plan listed under Reference Documents, or check the PRD for the original requirements.
Ready to ship Imperial-class workflow orchestration. 😈🚀
"All I am surrounded by is fear… and dead queues." — ValkyrAI to n8n