Requirements and acceptance criteria
Requirements and acceptance criteria
This document captures high-level requirements and acceptance criteria for k8s LogJedi. It complements the ROADMAP and ADRs.
User stories
SRE / Operator
- As an SRE, I want failed workloads (e.g. ImagePullBackOff, CrashLoopBackOff) to be analyzed automatically so that I get a clear summary and suggested fix without manually reading events and logs.
- As an SRE, I want to choose between auto-apply of safe patches or manual review (e.g. Slack/Teams) so that I can balance speed and control.
- As an SRE, I want the operator to only suggest or apply changes in namespaces I allow so that I don’t risk changes in system or other teams’ namespaces.
- As an SRE, I want secrets and sensitive data redacted from what is sent to the LLM so that no credentials leak outside the cluster.
- As an SRE, I want a step-by-step report and an analyses view of what the LLM did so that I can audit and debug.
Platform / Security
- As a platform owner, I want the operator to use a least-privilege RBAC so that blast radius is limited.
- As a platform owner, I want optional authentication (e.g. Bearer token) between the operator and the LLM service so that only my operator can call the service.
Acceptance criteria (MVP)
| ID | Requirement | Acceptance criteria |
|---|---|---|
| AC1 | Failure detection | Operator detects CrashLoopBackOff, ImagePullBackOff, ErrImagePull, OOMKilled, failed Jobs, and unavailable Deployments for watched resources. |
| AC2 | Context collection | Operator collects Kubernetes Events and recent pod logs (and optional historical logs via configured backend) for the failed resource. |
| AC3 | Redaction | Specs sent to the LLM have env vars with secret-like names and valueFrom (secretKeyRef/configMapKeyRef) redacted. |
| AC4 | LLM analysis | LLM service returns JSON with summary, root_cause, recommendation, and optional k8s_patch action. |
| AC5 | Apply or notify | If APPLY_MODE=auto and patch is in allowed scope, operator applies patch; otherwise notifies (Slack/Teams) and/or logs suggested kubectl patch. |
| AC6 | Scope control | Operator respects WATCH_NAMESPACES, EXCLUDE_NAMESPACES, and AUTO_APPLY_NAMESPACES. |
| AC7 | Cooldown | Same resource is not re-analyzed within ANALYZE_COOLDOWN_MINUTES. |
| AC8 | Reports | GET /analyses and GET /reports return recent analyses and step-by-step reports; operator can report outcome so “Operator applied” appears in reports. |
Out of scope (current)
- Formal SLOs (e.g. “analysis within 2 minutes”).
- Multi-tenant or per-team policy (beyond namespace filters).
- Persistent audit store (beyond operator logs and in-memory reports).
- Official support matrix (k8s versions, providers) – documented best-effort in README.
Requirements and acceptance criteria will be updated as the ROADMAP evolves.