Live-Mode Operations Guide
This guide focuses on running live mode reliably outside lab conditions.
1. Validate Model Pack Readiness
Run doctor in live mode before operational use:
wraithrun --doctor --live --model C:/models/llm.onnx --tokenizer C:/models/tokenizer.json --introspection-format json
For auto-remediation of common setup issues, run:
wraithrun --doctor --live --fix --model C:/models/llm.onnx --introspection-format json
Required PASS checks for live readiness:
- live-model-path
- live-model-size
- live-tokenizer-path
- live-tokenizer-size
- live-tokenizer-json
- live-runtime-compatibility (requires --features inference_bridge/onnx or vitis)
Expected warnings that should be reviewed but do not necessarily block a run:
- live-model-format: warning if the file extension is not .onnx
- live-runtime-compatibility: warning when the ONNX inference feature is not enabled in the build
When a check fails, the doctor output now includes a remediation field with actionable fix guidance. For example:
{
"status": "fail",
"name": "live-runtime-compatibility",
"reason_code": "runtime_session_init_failed",
"remediation": "Verify the model file is a valid ONNX model. Re-download if corrupted."
}
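A readiness gate over the doctor output could be sketched as follows. The check names and the status, name, reason_code, and remediation fields come from this guide; the top-level `checks` array and the lowercase `pass` status value are assumptions about the JSON layout and should be adjusted to the actual output.

```python
import json

# Check names listed in this guide as required PASS checks.
REQUIRED_CHECKS = [
    "live-model-path",
    "live-model-size",
    "live-tokenizer-path",
    "live-tokenizer-size",
    "live-tokenizer-json",
    "live-runtime-compatibility",
]


def blocking_checks(doctor_json: str) -> list:
    """Return the required checks that are missing or not passing."""
    report = json.loads(doctor_json)
    # ASSUMPTION: checks are listed under a top-level "checks" array.
    by_name = {c.get("name"): c for c in report.get("checks", [])}
    problems = []
    for name in REQUIRED_CHECKS:
        check = by_name.get(name)
        if check is None:
            problems.append({"name": name, "status": "missing"})
        # ASSUMPTION: a passing check reports status "pass".
        elif check.get("status") != "pass":
            problems.append(check)
    return problems
```

An empty result would mean every required check passed; otherwise each returned entry carries the reason_code and remediation text to surface to the operator.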
Common reason codes
| Reason Code | Meaning |
|---|---|
| model_path_missing | Model file does not exist at the configured path |
| tokenizer_path_missing | Tokenizer file does not exist at the configured path |
| tokenizer_json_malformed | Tokenizer file is not valid JSON |
| tokenizer_json_missing_model_key | Tokenizer JSON lacks required top-level model key |
| runtime_session_init_failed | ONNX session could not be initialized |
| runtime_model_invalid | Model file is not a valid ONNX model |
| runtime_external_data_file_missing | External data file referenced by model is missing |
| runtime_external_initializer_unresolved | External initializer tensors could not be resolved |
| runtime_vitis_provider_missing | Vitis AI execution provider library not found |
| runtime_ort_dylib_missing | ONNX Runtime shared library not found |
| runtime_custom_ops_unavailable | Custom operator library required by model not available |
| runtime_ep_assignment_failed | Model nodes could not be assigned to execution provider |
| runtime_input_ids_missing | Model does not expose an input_ids/tokens input |
| runtime_input_unsupported | Model requires inputs not supported by the runtime |
| runtime_input_dtype_unsupported | Model input uses an unsupported tensor element type |
| runtime_logits_output_missing | Model outputs do not include logits |
| runtime_cache_output_missing | Model has cache inputs but no matching cache outputs |
| onnx_feature_disabled | Build does not have ONNX inference support enabled |
2. Compare Presets and Packs
Use the model-pack manager to compare presets and live profiles before selecting one for active runs:
wraithrun models list
wraithrun models benchmark --introspection-format json
Validate all discovered packs (or a specific one via --profile) before promotion:
wraithrun models validate --introspection-format json
wraithrun models validate --profile live-balanced --introspection-format json
3. Configure Predictable Fallback
Use fallback policy when live inference must not block triage completion:
wraithrun --task "Investigate unauthorized SSH keys" --live --model C:/models/llm.onnx --live-fallback-policy dry-run-on-error
Policy behavior:
- none: a live inference error returns non-zero immediately.
- dry-run-on-error: the runtime retries once in dry-run mode and records live_fallback_decision.
When fallback is triggered, live_fallback_decision.reason_code provides structured classification for automation and alert routing.
When --live is enabled, run output also includes live_run_metrics for operational telemetry (first_token_latency_ms, total_run_duration_ms, live_success_rate, fallback_rate, and top_failure_reasons).
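For alert routing, the fallback and telemetry fields might be pulled out of the run output along these lines. The field names are the ones named in this guide; that both objects sit at the top level of the run JSON, and that live_fallback_decision is absent or null when no fallback occurred, are assumptions.

```python
import json


def summarize_live_run(run_json: str) -> dict:
    """Extract the fallback decision and key telemetry from run output."""
    run = json.loads(run_json)
    # ASSUMPTION: both objects are top-level keys of the run output.
    # ASSUMPTION: live_fallback_decision is missing or null without fallback.
    decision = run.get("live_fallback_decision")
    metrics = run.get("live_run_metrics") or {}
    return {
        "fell_back": decision is not None,
        "reason_code": (decision or {}).get("reason_code"),
        "live_error": (decision or {}).get("live_error"),
        "first_token_latency_ms": metrics.get("first_token_latency_ms"),
        "total_run_duration_ms": metrics.get("total_run_duration_ms"),
    }
```

Keeping the summary flat makes it straightforward to forward to an alerting system keyed on reason_code.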
4. Pipeline Gating Pattern
For automation pipelines, combine fallback and exit policy:
wraithrun --task "Investigate unauthorized SSH keys" --live --model C:/models/llm.onnx --live-fallback-policy dry-run-on-error --automation-adapter findings-v1 --exit-policy severity-threshold --exit-threshold high
This keeps ingestion deterministic while preserving incident signaling:
- adapter output stays machine-consumable,
- severity threshold still controls process exit code,
- fallback details and live telemetry are preserved in output for auditability.
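A pipeline step wrapping this command could look roughly like the sketch below. The flags are exactly those shown above; the helper names `build_gated_command` and `run_gated` are hypothetical, and the meaning of specific exit code values is not documented here, so the sketch only returns the code for the caller to interpret.

```python
import subprocess


def build_gated_command(task: str, model: str, threshold: str = "high") -> list:
    """Assemble the gated live-mode invocation as an argv list."""
    return [
        "wraithrun",
        "--task", task,
        "--live",
        "--model", model,
        "--live-fallback-policy", "dry-run-on-error",
        "--automation-adapter", "findings-v1",
        "--exit-policy", "severity-threshold",
        "--exit-threshold", threshold,
    ]


def run_gated(task: str, model: str, threshold: str = "high"):
    """Run the command and return (exit_code, stdout) without raising."""
    proc = subprocess.run(
        build_gated_command(task, model, threshold),
        capture_output=True,
        text=True,
        check=False,  # the exit code is the gate signal, not an error here
    )
    return proc.returncode, proc.stdout
```

Passing the arguments as a list avoids shell quoting problems with task strings that contain spaces.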
5. Troubleshooting Checklist
If live mode repeatedly falls back:
- Verify --doctor --live failures first. Check remediation fields for actionable fix guidance.
- Confirm model path points to local readable storage.
- Confirm tokenizer JSON parses and includes top-level model.
- Confirm Vitis paths (--vitis-config, --vitis-cache-dir) are valid when used.
- Review live_fallback_decision.reason_code and live_fallback_decision.live_error in run output and capture both in incident notes.
- If live-runtime-compatibility reports FAIL, verify the model is a valid ONNX file and matches the runtime (CPU vs Vitis).
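The tokenizer item in the checklist can be reproduced by hand with a short script. This is an independent sketch that mirrors the tokenizer reason codes from the table above, not the doctor's own implementation; the function name `check_tokenizer` is hypothetical.

```python
import json
from pathlib import Path


def check_tokenizer(path: str):
    """Return a doctor-style reason code, or None if the basic checks pass."""
    p = Path(path)
    if not p.is_file():
        return "tokenizer_path_missing"
    try:
        data = json.loads(p.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        return "tokenizer_json_malformed"
    # The guide requires a top-level "model" key in the tokenizer JSON.
    if not isinstance(data, dict) or "model" not in data:
        return "tokenizer_json_missing_model_key"
    return None
```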
6. Session Caching and Performance (v1.6.0)
WraithRun caches the ONNX session and tokenizer across investigation steps within a single run. This eliminates per-step session rebuild overhead for multi-step investigations.
The agent also tracks prompt prefix reuse across steps. When consecutive prompts share a common prefix (e.g., system prompt + prior context), the prefix hit/miss ratio is logged for observability. Full KV-state reuse is scaffolded for a future release.
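To make the prefix-reuse idea concrete, here is one plausible way to measure it over tokenized prompts. How WraithRun actually defines its hit/miss ratio is not stated in this guide, so the token-level definition below and the function names are assumptions for illustration only.

```python
def common_prefix_len(a, b) -> int:
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_hit_ratio(prompts) -> float:
    """Fraction of tokens in steps 2..N that repeat the previous step's prefix.

    ASSUMPTION: a "hit" is a token covered by the shared prefix with the
    immediately preceding prompt; everything else counts as a miss.
    """
    hits = 0
    total = 0
    for prev, cur in zip(prompts, prompts[1:]):
        hits += common_prefix_len(prev, cur)
        total += len(cur)
    return hits / total if total else 0.0
```

A ratio near 1.0 would indicate that most of each prompt repeats the previous step, which is the situation where future KV-state reuse would pay off most.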
Since v1.8.0, the prefill attention mask correctly accounts for forced cache padding on models that lack a use_cache branch toggle (#136). Previously, models like Qwen2.5 and Llama 3.2 could crash with a shape broadcast error during prefill because the attention mask length did not include the initial cache dimension.
Also since v1.8.0, execution provider reporting now detects DirectML and CUDA backend overrides (#142), so model_capability.execution_provider in JSON output accurately reflects the active backend instead of always showing CPUExecutionProvider.
Temperature controls affect live inference behavior:
- --temperature 0 (or omit): greedy decoding, the fastest option with fully deterministic output.
- --temperature 0.1–0.3: low-entropy sampling, with slight variation while staying focused.
- --temperature 0.5+: higher-entropy sampling, more creative but less predictable.
For incident response triage, 0 is recommended when findings must be exactly reproducible; 0.1 is acceptable when slight variation can be tolerated.
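The effect of the temperature setting can be illustrated with a generic sampler. This is the textbook temperature-scaled softmax, shown as a sketch; it is not taken from WraithRun's decoder, and the function name `sample_token` is hypothetical.

```python
import math
import random


def sample_token(logits, temperature=0.0, rng=None):
    """Pick a token index from raw logits.

    temperature <= 0 means greedy decoding (argmax), which is deterministic.
    Higher temperatures flatten the distribution and add variation.
    """
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

At low temperatures the scaled logits spread far apart, so the top token dominates the weights and sampling behaves almost like greedy decoding.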
7. Operator Recording Guidance
When fallback is triggered during an active case:
- keep the run output with live_fallback_decision,
- preserve evidence bundle artifacts,
- record whether fallback affected analyst confidence or timeline.