ποΈ ValkyrAI Workflow Engine v2.0 β System Architecture
The N8N Killer: Built on ThorAPI, Hardened for Production
π― High-Level Architectureβ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β CLIENT LAYER β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β React Flow β β DLQ Browser β β Execution β β
β β Studio β β & Replay UI β β Monitor β β
β ββββββββ¬ββββββββ ββββββββ¬ββββββββ ββββββββ¬ββββββββ β
β β β β β
β ββββββββββββββββββββ΄βββββββββββββββββββ β
β β β
β RTK Query Hooks β
ββββββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββ
β
β HTTPS/JSON
β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β REST API LAYER β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β WorkflowExecution API β Run API β DLQ API β β
β β /WorkflowExecution β /Run β /DeadLetterQueueβ β
β β - POST /{id}/execute β - POST /{id}/heartbeat β β
β β - POST /{id}/cancel β - GET /{id} β β
β β - POST /{id}/pause β β β
β β - POST /{id}/resume β /DeadLetterQueue β
β β β - POST /{id}/requeue β
β β β - POST /{id}/discard β
β ββββββββββ¬ββββββββββββ¬βββββββββ¬βββββββββββ¬βββββββββββββββββ β
β β β β β β
β Spring Security β β Spring MVC β
β RBAC + ACL β β @RestController β
βββββββββββββΌββββββββββββΌβββββββββΌβββββββββββΌββββββββββββββββββββββ
β β β β
β β β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β SERVICE LAYER β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β ValkyrWorkflowExecutionService (Orchestration) β β
β β - execute(workflowId, triggerType, payload) β β
β β - pause/resume/cancel operations β β
β β - state transition management β β
β ββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β ββββ ValkyrRunService (Idempotency Engine) β
β β ββ generateIdempotencyKey() [SHA-256] β
β β ββ acquireLease() [DB row lock] β
β β ββ updateHeartbeat() [extend lease] β
β β ββ completeRun() [store outputs] β
β β ββ failRun() [classify error] β
β β ββ reapZombieRuns() [@Scheduled 30s] β
β β β
β ββββ ValkyrDLQService (Error Recovery) β
β β ββ quarantine() [failed runs] β
β β ββ requeue() [with overrides] β
β β ββ discard() [operator notes] β
β β β
β ββββ ValkyrCircuitBreakerService (Resilience) β
β β ββ isCircuitOpen() β
β β ββ recordFailure() β
β β ββ openCircuit() / closeCircuit() β
β β β
β ββββ ValkyrRunnerService (Execution Engine) β
β ββ pollAndExecute() [@Scheduled 100ms] β
β ββ executeWithHeartbeat() β
β ββ enforceResourceLimits() β
β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β ValkyrBudgetService & ValkyrQuotaService β β
β β - checkBudget() [kill switch] β β
β β - checkQuota() [rate limits] β β
β β - trackCost() [per-principal] β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββ
β
β JPA/Hibernate
β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β REPOSITORY LAYER β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β ValkyrRunRepository (Custom Queries) β β
β β - findByIdempotencyKey() [O(1) lookup] β β
β β - findByLeaseUntilBefore() [zombie detection] β β
β β - findByStateOrderByCreated() [FIFO queue] β β
β β - findRunsReadyForRetry() [auto-retry] β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β ValkyrDLQRepository (DLQ Queries) β β
β β - findByResolution() β β
β β - countByResolution() β β
β β - findByFailureType() β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β + WorkflowExecutionRepository, BudgetRepository, β
β QuotaRepository, CircuitBreakerStateRepository β
βββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββ
β
β JDBC
β
ββββββββββββββββββββββββββββββββββββββββββββββββ βββββββββββββββββββ
β DATA LAYER β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β PostgreSQL Database (ACID Transactions) β β
β β β β
β β Tables: β β
β β - workflow_execution (state tracking) β β
β β - run (idempotency + lease) β β
β β - dead_letter_queue (quarantine) β β
β β - circuit_breaker_state (resilience) β β
β β - budget (cost control) β β
β β - quota (rate limiting) β β
β β β β
β β Indexes: β β
β β - idx_run_idempotency (UNIQUE) β β
β β - idx_run_lease_until (zombie reaper) β β
β β - idx_run_state (queue polling) β β
β β - idx_dlq_resolution (operator queries) β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β‘ Execution Flow: End-to-Endβ
ββββββββββββ
β USER β POST /WorkflowExecution/{workflowId}/execute
β CLIENT β {params: {...}}
ββββββ¬ββββββ
β
β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β WorkflowExecutionController β
β ββ @PreAuthorize("hasPermission(...)") β
β ββ workflowExecutionService.execute(...) β
ββββββ¬βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ValkyrWorkflowExecutionService β
β ββ Create WorkflowExecution (PENDING β RUNNING) β
β ββ Load Workflow.tasks β
β ββ For each Task: β
β ββ runService.createRun(execution, task, inputs, 1) β
β ββ Store Run (PENDING state) β
ββββββ¬ββββββββββββ βββββββββββββββββββββββββββββββββββββββββββββ
β
β (Async polling loop β RunnerService)
β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ValkyrRunnerService [@Scheduled(fixedDelay = 100ms)] β
β ββ runService.findPendingRuns(limit=10) β
β ββ For each pending Run: β
β ββ runService.acquireLease(runId, runnerId) β
β β ββ Check existing lease holder β
β β ββ Set leaseUntil = now + 2min β
β β ββ Set state = LEASED β
β β ββ Save Run β
β β β
β ββ Start heartbeat thread (every 10s): β
β β ββ runService.updateHeartbeat(runId, runnerId) β
β β ββ Verify lease ownership β
β β ββ Extend leaseUntil β
β β ββ Update heartbeatAt β
β β β
β ββ circuitBreakerService.isCircuitOpen(module)? β
β β ββ YES β failRun(CIRCUIT_OPEN) β
β β ββ NO β continue β
β β β
β ββ budgetService.checkBudget(principal)? β
β β ββ EXCEEDED β failRun + kill switch β
β β ββ OK β continue β
β β β
β ββ quotaService.checkQuota(principal, resource)? β
β β ββ EXCEEDED β failRun β
β β ββ OK β continue β
β β β
β ββ Execute VModule: β
β β ββ Load ExecModule β
β β ββ module.execute(inputs) β outputs β
β β ββ Track duration & cost β
β β β
β ββ On SUCCESS: β
β β ββ runService.completeRun(runId, outputs, ...) β
β β ββ state = SUCCESS β
β β ββ Store outputs, duration, cost β
β β ββ Release lease β
β β β
β ββ On FAILURE: β
β ββ Classify error type: β
β β ββ TRANSIENT β failRun(TRANSIENT) β
β β β ββ Calculate backoff (exp + jitter) β
β β β ββ Set retryReadyAt timestamp β
β β β β
β β ββ PERMANENT β failRun(PERMANENT) β
β β β ββ runService.promoteToDeadLetterQueue() β
β β β ββ dlqService.quarantine() β
β β β β
β β ββ TIMEOUT β failRun(TIMEOUT) β
β β β
β ββ circuitBreakerService.recordFailure() β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Background Jobs (Scheduled) β
β β
β ValkyrRunService.reapZombieRuns() [@Scheduled 30s] β
β ββ Find runs where leaseUntil < now β
β ββ AND state IN (LEASED, RUNNING) β
β ββ For each zombie: β
β β ββ Log warning (runner X crashed) β
β β ββ state = PENDING β
β β ββ Clear lease fields β
β β ββ Save (will be picked up again) β
β ββ Auto-recovery in < 30 seconds β
β β
β CircuitBreakerService.checkHalfOpen() [@Scheduled 60s] β
β ββ Find circuits in OPEN state β
β ββ If nextRetryAt < now: β
β β ββ state = HALF_OPEN (allow 1 test request) β
β ββ On success β CLOSED, on failure β OPEN β
β β
β QuotaService.resetCounters() [@Scheduled hourly/daily] β
β ββ Reset hourly counters at hourResetAt β
β ββ Reset daily counters at dayResetAt β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
π― Key Differentiators vs n8nβ
| Feature | n8n | ValkyrAI v2.0 |
|---|---|---|
| Idempotency | β None | β SHA-256 keys + DB checks |
| Lease Management | β None | β 2-min leases + heartbeats |
| Crash Recovery | β Lost work | β Zombie reaper (<30s) |
| Exactly-Once | β At-least-once | β Lease + idempotency |
| DLQ & Replay | β Manual | β Operator-guided w/ overrides |
| Circuit Breakers | β None | β Auto-recovery |
| Budget Controls | β None | β Per-principal + kill switch |
| Cost Tracking | β None | β Per-step token tracking |
| Rate Limiting | β οΈ Basic | β Concurrent + hourly/daily |
| Observability | β οΈ Logs only | β OpenTelemetry + flame charts |
| Typed I/O | β None | β JSONSchema validation |
| Multi-Tenancy | β οΈ Weak | β RBAC + ACL + quotas |
π Security Architectureβ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Authentication Layer β
β ββ Spring Security β
β ββ JWT tokens (stateless) β
β ββ Principal (user/system/agent) β
ββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Authorization Layer (RBAC) β
β ββ Roles: ADMIN, USER, SYSTEM, WORKFLOW β
β ββ @PreAuthorize("hasRole('ADMIN')") β
β ββ Method-level security β
ββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ACL Layer (Fine-Grained) β
β ββ Per-object permissions (READ, WRITE, DELETE, ...)β
β ββ @PreAuthorize("hasPermission(#id, 'execute')") β
β ββ ValkyrACL service β
ββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Data Encryption β
β ββ @SecureField annotation β
β ββ AES-256 encryption at rest β
β ββ Transparent encrypt/decrypt via AspectJ β
β ββ KMS integration for key rotation β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
π Data Flow: Idempotency in Actionβ
Request: Execute Task with inputs {userId: "123", action: "purchase"}
Step 1: Generate Idempotency Key
βββββββββββββββββββββββββββββββββ
Input: executionId + taskId + moduleId + SHA256({userId:"123",action:"purchase"})
Output: "a3f5c9d8e2b1..." (64-char hex)
Step 2: Check Existing Run
βββββββββββββββββββββββββββββββββ
Query: SELECT * FROM run WHERE idempotency_key = 'a3f5c9d8e2b1...'
IF found AND state = SUCCESS:
β
Return cached outputs (no duplicate work)
Exit
IF found AND state IN (RUNNING, LEASED):
β οΈ Duplicate request detected, return existing run
Exit
ELSE:
β‘οΈ Continue to Step 3
Step 3: Create New Run
βββββββββββββββββββββββββββββββββ
INSERT INTO run (
id, execution_id, task_id, state, idempotency_key, ...
) VALUES (
uuid(), execution_id, task_id, 'PENDING', 'a3f5c9d8e2b1...', ...
)
Commit transaction
Step 4: Runner Picks Up Run
βββββββββββββββββββββββββββββββββ
SELECT * FROM run WHERE state = 'PENDING' ORDER BY created_date LIMIT 10
FOR EACH run:
BEGIN TRANSACTION
UPDATE run
SET state = 'LEASED',
lease_until = NOW() + INTERVAL '2 minutes',
leased_by = 'runner-pod-123'
WHERE id = run.id
AND (lease_until IS NULL OR lease_until < NOW())
COMMIT
IF rows_affected = 1:
β
Lease acquired
ELSE:
β Another runner got it, skip
Step 5: Execute with Heartbeat
βββββββββββββββββββββββββββββββββ
Start heartbeat thread (every 10s):
UPDATE run
SET heartbeat_at = NOW(),
lease_until = NOW() + INTERVAL '2 minutes'
WHERE id = run.id AND leased_by = 'runner-pod-123'
Execute module.execute(inputs)
On success:
UPDATE run
SET state = 'SUCCESS',
finished_at = NOW(),
outputs = '{"orderId":"abc123"}',
duration_ms = 1234,
cost_tokens = 0.05,
lease_until = NULL,
leased_by = NULL
WHERE id = run.id
Step 6: Future Request (Same Inputs)
βββββββββββββββββββββββββββββββββ
Same idempotency key: 'a3f5c9d8e2b1...'
Query: SELECT * FROM run WHERE idempotency_key = 'a3f5c9d8e2b1...'
Result: state = 'SUCCESS', outputs = '{"orderId":"abc123"}'
β
Return cached outputs immediately (no re-execution)
π Scalability Architectureβ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Load Balancer (ALB/nginx) β
β ββ Route to available API pods β
β ββ Health checks: /actuator/health β
ββββββ¬βββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
ββββββββββββββββ¬βββββββββββββββ¬βββββββββββββββ
β β β β
ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ
β API Pod 1β β API Pod 2β β API Pod 3β β API Pod Nβ
β (Stateless) β (Stateless) β (Stateless) β (Stateless)
ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ
β β β β
ββββββββββββββββ΄βββββββββββββββ΄βββββββββββββββ
β
β
ββββββββββββββββββββββββββββββββββββββββββββ
β Shared Database (PostgreSQL) β
β ββ Connection pooling (HikariCP) β
β ββ Read replicas for queries β
β ββ Write leader for mutations β
βββββββββββββββ βββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Runner Pods (Scheduled Workers) β
β ββ Auto-scaling: scale by queue depth β
β ββ Each pod polls queue independently β
β ββ Lease-based coordination (no leader election) β
β ββ Horizontal: 1 β N pods (linear scale) β
β β
β Runner Pod 1 (runnerId: runner-abc-123) β
β ββ @Scheduled pollAndExecute() every 100ms β
β ββ Fetch 10 pending runs β
β ββ Acquire leases (DB row lock) β
β ββ Execute with heartbeat β
β ββ Release on completion β
β β
β Runner Pod 2 (runnerId: runner-def-456) β
β ββ (same) β
β β
β Runner Pod N (runnerId: runner-xyz-789) β
β ββ (same) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Performance Targets:
- 10k concurrent executions
- 250k active tasks
- P95 dispatch < 75ms
- Runner CPU < 70% at scale
π― Summaryβ
ValkyrAI Workflow Engine v2.0 is architected for production resilience with:
- Exactly-once execution via idempotency keys + lease management
- Automatic crash recovery via zombie reaper (<30s)
- Operator-friendly DLQ with replay + input overrides
- Cost control via budgets + quotas + kill switches
- Horizontal scaling via stateless design + DB coordination
- Security by default via RBAC + ACL + field encryption
This is the n8n killer. Built on ThorAPI. Shipping Q4 2025. π