k8s LogJedi Roadmap
May the logs be with you. Your AI SRE Jedi that reads failures, writes fixes.
This document outlines the current state of k8s LogJedi and planned improvements. It is intended for contributors and operators who want to extend or adopt the system.
Current state (v0.1 – MVP)
The following is implemented and usable today:
Operator and LLM service
- Go operator: controller-runtime; watches Pods, Deployments, Jobs; detects failure signals (CrashLoopBackOff, ImagePullBackOff, ErrImagePull, OOMKilled, failed Jobs, unavailable Deployments); collects Events and recent pod logs; redacts secrets from spec; POSTs to LLM service; applies patch (auto) or notifies (manual). Resource limits set in deploy (requests/limits for memory and CPU).
- LLM service: FastAPI
POST /analyze with Pydantic validation; Strands agent with structured output. Multiple LLM providers: mock (default), openai, bedrock, gemini, configured via LLM_PROVIDER and provider-specific env (see README). .env loading: python-dotenv loads llm-service/.env for API keys and provider config. Health: GET /health (liveness), GET /ready (readiness). OpenAPI: GET /openapi.json. Resource limits in deploy manifest.
- Log backend: Pluggable interface with in-cluster Kubernetes backend and generic HTTP backend (query params: namespace, pod, since_minutes).
- Notifications: Slack (incoming webhook) and Microsoft Teams (message card); payload includes summary, root cause, recommendation, and proposed patch.
Apply logic and safety
- Apply logic:
APPLY_MODE=auto applies filtered strategic-merge patches (Deployment/Job/Pod); manual sends to Slack/Teams and logs kubectl patch. Scope limited to same namespace and allowed fields (image, env, replicas, resources). Dry-run before apply: optional server-side dry-run (DRY_RUN_BEFORE_APPLY) before applying; only apply if dry-run succeeds.
- Cooldown: Per-resource cooldown (
ANALYZE_COOLDOWN_MINUTES, default 15) so the same resource is not re-analyzed within that window.
- Prompt / log size limits:
MAX_RECENT_LOG_LINES and MAX_HISTORICAL_LOG_LINES (0 = no cap) cap payload sent to the LLM.
- Namespace and apply scope:
WATCH_NAMESPACES, EXCLUDE_NAMESPACES, and AUTO_APPLY_NAMESPACES (comma-separated) restrict which namespaces are watched and where auto-apply is allowed.
- Operator–LLM auth: Optional
LLM_SERVICE_AUTH_HEADER (e.g. Bearer token) for operator → LLM service calls.
- Audit: Operator logs “audit: applied patch to deployment/job/pod” when a patch is applied in auto mode.
- Stronger redaction: Redacts env values whose names suggest secrets; redacts
valueFrom (secretKeyRef/configMapKeyRef) so the LLM never sees secret references.
- Deploy: RBAC, ConfigMap, Deployment manifests; sample failing deployment; README quickstart for kind/minikube.
- Runbook: RUNBOOK.md for operator not reconciling, LLM returns no action, Slack/Teams not receiving, LLM unreachable, patch apply failures.
- ADRs: docs/adr/ for operator/LLM split and strategic merge patch.
- Makefile:
make build, make docker-build, make deploy, make test (operator: go vet + go test; llm-service: pytest if available), make dev (build + docker-build with instructions for cluster + deploy).
Near term (stability and production readiness)
| Area |
Description |
| Strands reliability |
Add retries and fallback when Strands structured output fails; optional timeout for agent invocation. |
| Operator logging |
Use structured logging (e.g. logr) consistently; add request IDs or resource UIDs for tracing analyze requests. |
| Tests |
Unit tests for failure detection, redaction, and patch filtering; integration test that runs operator + stub LLM in-process or in-cluster. |
Mid term (features and integrations)
Security and audit
| Area |
Description |
| Audit trail persistence |
Beyond operator logs: record when the operator applies a patch (resource, namespace, patch hash, timestamp) in a CRD, log stream, or external audit service. |
| Further redaction |
Redact image pull secret names and volume paths that may expose secrets (in addition to current env and valueFrom redaction). |
Scope and policy
| Area |
Description |
| Label filters |
Restrict which resources the operator watches by label (in addition to existing namespace filters). |
| Cost guardrails |
Optional cap on analyze requests per hour (or per namespace) to avoid LLM cost blow-up. |
Observability
| Area |
Description |
| Metrics |
Expose Prometheus metrics: analyze requests (success/failure), patch apply count, notification errors. |
| Distributed tracing |
OpenTelemetry (or similar) so one trace links operator reconcile → HTTP to LLM → Strands/LLM call. |
LLM and prompts
| Area |
Description |
| Configurable prompts |
Move Kubernetes SRE system prompt out of code into config (ConfigMap or file) for tuning without redeploy. |
| Interactive approval (Slack/Teams) |
Buttons or actions in notifications to “Approve” or “Reject” a patch; operator applies only after approval (webhook or callback). |
| Loki adapter |
Small service or option that translates namespace/pod/since into Loki LogQL and returns lines. |
| More workload types |
Watch StatefulSets, DaemonSets, CronJobs; extend failure detection and patch allowlist. |
| Config from ConfigMap |
Document and optionally extend to reload ConfigMap without restart. |
| Log backend auth |
Support optional headers or basic auth for HTTP log backend. |
Notifications
| Area |
Description |
| PagerDuty / Opsgenie / email |
Implement Notifier for PagerDuty, Opsgenie, or email so on-call gets analyses where they already are. |
| Severity or routing |
Optional routing by severity (e.g. only PagerDuty for “critical” or certain namespaces). |
Packaging and distribution
| Area |
Description |
| Helm chart |
Add a Helm chart (e.g. charts/logjedi/) that deploys both the operator and the LLM service with a single helm install. Expose key options as values: namespace, image tags, LLM_SERVICE_URL, Slack/Teams webhooks, APPLY_MODE, namespace filters, cooldown, log backend, etc. Enables one-command install and easier upgrades. |
Long term / exploratory
| Area |
Description |
| LogJediAnalysis CRD |
Store analysis results in a custom resource for audit and UI; optional controller to sync from CR to Slack/Teams or apply. |
| Multi-cluster |
Operator that can target multiple clusters and a single LLM service. |
| Web UI |
Dashboard to list analyses, view proposed patches, approve/reject. |
| Additional actions |
Beyond k8s_patch: e.g. full manifest, scaling, or script (with strong safety and review). |
Developer and CI experience
| Area |
Description |
| Single-command local stack |
Extend make dev (or add Tilt/Skaffold) to optionally bring up kind, deploy, and create sample failing workload in one command. |
| E2E test |
E2E test in kind: deploy operator + LLM + failing workload, assert analysis is produced (and optionally notification or patch). |
| CI checks |
In CI: make test (operator: go vet + go test; llm-service: pytest); optional make docker-build. Add ruff/linting for LLM service. |
Versioning and releases
- Releases will be tagged (e.g.
v0.2.0) when a set of roadmap items is done; changelog can live in CHANGELOG.md.
- Compatibility: The operator and LLM service communicate via the documented JSON request/response schema; adding optional fields is backward compatible; changing required fields will be done with a version bump.
Contributing
If you want to work on a roadmap item:
- Open an issue describing the change and how it fits the roadmap.
- Prefer small, reviewable PRs (e.g. one item or sub-item per PR).
- Keep the README and this ROADMAP updated when adding config options or behavior.