k8s LogJedi Roadmap

May the logs be with you. Your AI SRE Jedi that reads failures, writes fixes.

This document outlines the current state of k8s LogJedi and planned improvements. It is intended for contributors and operators who want to extend or adopt the system.

Current state (v0.1 – MVP)

The following is implemented and usable today:

Operator and LLM service

Go operator: controller-runtime; watches Pods, Deployments, Jobs; detects failure signals (CrashLoopBackOff, ImagePullBackOff, ErrImagePull, OOMKilled, failed Jobs, unavailable Deployments); collects Events and recent pod logs; redacts secrets from spec; POSTs to LLM service; applies patch (auto) or notifies (manual). Resource limits set in deploy (requests/limits for memory and CPU).
LLM service: FastAPI POST /analyze with Pydantic validation; Strands agent with structured output. Multiple LLM providers: mock (default), openai, bedrock, gemini, configured via LLM_PROVIDER and provider-specific env (see README). .env loading: python-dotenv loads llm-service/.env for API keys and provider config. Health: GET /health (liveness), GET /ready (readiness). OpenAPI: GET /openapi.json. Resource limits in deploy manifest.
Log backend: Pluggable interface with in-cluster Kubernetes backend and generic HTTP backend (query params: namespace, pod, since_minutes).
Notifications: Slack (incoming webhook) and Microsoft Teams (message card); payload includes summary, root cause, recommendation, and proposed patch.

Apply logic and safety

Apply logic: APPLY_MODE=auto applies filtered strategic-merge patches (Deployment/Job/Pod); manual sends to Slack/Teams and logs kubectl patch. Scope limited to same namespace and allowed fields (image, env, replicas, resources). Dry-run before apply: optional server-side dry-run (DRY_RUN_BEFORE_APPLY) before applying; only apply if dry-run succeeds.
Cooldown: Per-resource cooldown (ANALYZE_COOLDOWN_MINUTES, default 15) so the same resource is not re-analyzed within that window.
Prompt / log size limits: MAX_RECENT_LOG_LINES and MAX_HISTORICAL_LOG_LINES (0 = no cap) cap payload sent to the LLM.
Namespace and apply scope: WATCH_NAMESPACES, EXCLUDE_NAMESPACES, and AUTO_APPLY_NAMESPACES (comma-separated) restrict which namespaces are watched and where auto-apply is allowed.
Operator–LLM auth: Optional LLM_SERVICE_AUTH_HEADER (e.g. Bearer token) for operator → LLM service calls.
Audit: Operator logs “audit: applied patch to deployment/job/pod” when a patch is applied in auto mode.
Stronger redaction: Redacts env values whose names suggest secrets; redacts valueFrom (secretKeyRef/configMapKeyRef) so the LLM never sees secret references.

Docs and tooling

Deploy: RBAC, ConfigMap, Deployment manifests; sample failing deployment; README quickstart for kind/minikube.
Runbook: RUNBOOK.md for operator not reconciling, LLM returns no action, Slack/Teams not receiving, LLM unreachable, patch apply failures.
ADRs: docs/adr/ for operator/LLM split and strategic merge patch.
Makefile: make build, make docker-build, make deploy, make test (operator: go vet + go test; llm-service: pytest if available), make dev (build + docker-build with instructions for cluster + deploy).

Near term (stability and production readiness)

Area	Description
Strands reliability	Add retries and fallback when Strands structured output fails; optional timeout for agent invocation.
Operator logging	Use structured logging (e.g. logr) consistently; add request IDs or resource UIDs for tracing analyze requests.
Tests	Unit tests for failure detection, redaction, and patch filtering; integration test that runs operator + stub LLM in-process or in-cluster.

Mid term (features and integrations)

Security and audit

Area	Description
Audit trail persistence	Beyond operator logs: record when the operator applies a patch (resource, namespace, patch hash, timestamp) in a CRD, log stream, or external audit service.
Further redaction	Redact image pull secret names and volume paths that may expose secrets (in addition to current env and valueFrom redaction).

Scope and policy

Area	Description
Label filters	Restrict which resources the operator watches by label (in addition to existing namespace filters).
Cost guardrails	Optional cap on analyze requests per hour (or per namespace) to avoid LLM cost blow-up.

Observability

Area	Description
Metrics	Expose Prometheus metrics: analyze requests (success/failure), patch apply count, notification errors.
Distributed tracing	OpenTelemetry (or similar) so one trace links operator reconcile → HTTP to LLM → Strands/LLM call.

LLM and prompts

Area	Description
Configurable prompts	Move Kubernetes SRE system prompt out of code into config (ConfigMap or file) for tuning without redeploy.
Interactive approval (Slack/Teams)	Buttons or actions in notifications to “Approve” or “Reject” a patch; operator applies only after approval (webhook or callback).
Loki adapter	Small service or option that translates `namespace`/`pod`/`since` into Loki LogQL and returns lines.
More workload types	Watch StatefulSets, DaemonSets, CronJobs; extend failure detection and patch allowlist.
Config from ConfigMap	Document and optionally extend to reload ConfigMap without restart.
Log backend auth	Support optional headers or basic auth for HTTP log backend.

Notifications

Area	Description
PagerDuty / Opsgenie / email	Implement `Notifier` for PagerDuty, Opsgenie, or email so on-call gets analyses where they already are.
Severity or routing	Optional routing by severity (e.g. only PagerDuty for “critical” or certain namespaces).

Packaging and distribution

Area	Description
Helm chart	Add a Helm chart (e.g. `charts/logjedi/`) that deploys both the operator and the LLM service with a single `helm install`. Expose key options as values: namespace, image tags, `LLM_SERVICE_URL`, Slack/Teams webhooks, `APPLY_MODE`, namespace filters, cooldown, log backend, etc. Enables one-command install and easier upgrades.

Long term / exploratory

Area	Description
LogJediAnalysis CRD	Store analysis results in a custom resource for audit and UI; optional controller to sync from CR to Slack/Teams or apply.
Multi-cluster	Operator that can target multiple clusters and a single LLM service.
Web UI	Dashboard to list analyses, view proposed patches, approve/reject.
Additional actions	Beyond `k8s_patch`: e.g. full manifest, scaling, or script (with strong safety and review).

Developer and CI experience

Area	Description
Single-command local stack	Extend `make dev` (or add Tilt/Skaffold) to optionally bring up kind, deploy, and create sample failing workload in one command.
E2E test	E2E test in kind: deploy operator + LLM + failing workload, assert analysis is produced (and optionally notification or patch).
CI checks	In CI: `make test` (operator: go vet + go test; llm-service: pytest); optional `make docker-build`. Add ruff/linting for LLM service.

Versioning and releases

Releases will be tagged (e.g. v0.2.0) when a set of roadmap items is done; changelog can live in CHANGELOG.md.
Compatibility: The operator and LLM service communicate via the documented JSON request/response schema; adding optional fields is backward compatible; changing required fields will be done with a version bump.

Contributing

If you want to work on a roadmap item:

Open an issue describing the change and how it fits the roadmap.
Prefer small, reviewable PRs (e.g. one item or sub-item per PR).
Keep the README and this ROADMAP updated when adding config options or behavior.