# k8s LogJedi Runbook
k8s LogJedi is an AI-native Kubernetes sidekick that watches your pods, reads the logs, and turns failures into clear fixes—before they become outages. This runbook gives short procedures for common operational issues. See also README.md and ROADMAP.md.
## Operator not reconciling
Symptoms: Failed workloads (e.g. CrashLoopBackOff) are not triggering analysis; no logs from the operator about “LLM analysis received”.
Checks:

- Operator running: `kubectl -n logjedi get pods -l app=logjedi-operator`; the pod should be Running.
- RBAC: The operator needs get/list on pods, pods/log, and events, and get/list/patch on deployments and jobs. Check the ClusterRole and ClusterRoleBinding: `kubectl get clusterrole logjedi-operator -o yaml`.
- Namespace filter: If `WATCH_NAMESPACES` or `EXCLUDE_NAMESPACES` is set, the failing resource’s namespace might be excluded. Check the ConfigMap `logjedi-operator-config` and env vars.
- Cooldown: If the same resource was analyzed recently, the operator skips it until `ANALYZE_COOLDOWN_MINUTES` has passed. Check operator logs for “RequeueAfter” or wait for the cooldown to expire.
- Logs: `kubectl -n logjedi logs -l app=logjedi-operator --tail=200` and look for errors (e.g. “LLM analyze failed”, “list events”, “get pod logs”).
Actions: Fix RBAC or the namespace config; ensure the LLM service is reachable (see below); restart the operator if needed: `kubectl -n logjedi rollout restart deployment/logjedi-operator`.
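If RBAC turns out to be the gap, a ClusterRole along these lines covers the permissions listed above. This is a sketch derived from this runbook, not a copy of the project’s actual manifest — verify it against the manifests shipped with logjedi before applying, and note that a watching operator typically also needs the `watch` verb:

```yaml
# Hypothetical ClusterRole matching the permissions this runbook describes.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: logjedi-operator
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "events"]
    verbs: ["get", "list"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "patch"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "patch"]
```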
## LLM returns no action
Symptoms: Operator logs “LLM analysis received” but no patch is applied and no notification contains a suggested patch.
Checks:

- Mock provider: With `LLM_PROVIDER=mock`, the LLM service returns a fixed response; for Deployments it includes a sample action, but for Pods/Jobs the mock may set `action: null`. So “no action” can be expected for some resources when using mock.
- Real LLM: If using a real provider, the model might not output a valid structured action (e.g. a missing or malformed patch). Check LLM service logs for errors or validation failures.
- APPLY_MODE: In `manual` mode the operator does not apply; it only notifies. So “no action” in the message might mean the notification payload has an empty patch; check Slack/Teams or the logged `kubectl patch` command.
Actions: For mock, this is expected when no patch is returned. For a real LLM, improve prompts or add retries/fallback (see ROADMAP). Confirm `APPLY_MODE` and the notification config.
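For orientation, a “no action” analysis in the mock case might look roughly like this. The field names are illustrative only — check the LLM service’s actual response schema before relying on them:

```json
{
  "summary": "Pod is in CrashLoopBackOff; container exits during startup.",
  "action": null
}
```

A real provider returning the same shape with a malformed `action` (instead of `null`) is what typically shows up as a validation failure in the LLM service logs.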
## Slack / Teams not receiving messages
Symptoms: Operator runs and gets an analysis, but no message appears in Slack or Teams.
Checks:

- Webhook URL: Ensure `SLACK_WEBHOOK_URL` or `TEAMS_WEBHOOK_URL` is set in the operator ConfigMap/env: `kubectl -n logjedi get configmap logjedi-operator-config -o yaml`.
- APPLY_MODE: Notifications are sent in `manual` mode, or in `auto` mode after applying (if notifiers are configured). In `auto` with no notifiers, only audit logs are written.
- Errors in logs: Look for “send notification failed” in operator logs. That usually means the webhook returned a non-2xx status or there was a network error.
- Webhook validity: Test the webhook manually, e.g. `curl -X POST -H 'Content-Type: application/json' -d '{"text":"test"}' <SLACK_WEBHOOK_URL>`. For Teams, ensure the URL is an incoming webhook URL.
Actions: Fix or rotate webhook URLs; confirm ConfigMap is mounted and env is set; restart operator after changing config.
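A minimal ConfigMap carrying the webhook setting could look like the following. The key names are assumed to mirror the env vars above; confirm how the operator deployment actually maps ConfigMap keys to environment variables:

```yaml
# Assumed key names mirroring the env vars this runbook mentions;
# the webhook URL is a placeholder.
apiVersion: v1
kind: ConfigMap
metadata:
  name: logjedi-operator-config
  namespace: logjedi
data:
  SLACK_WEBHOOK_URL: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
  APPLY_MODE: "manual"
```

After editing the ConfigMap, a `kubectl -n logjedi rollout restart deployment/logjedi-operator` is needed for the operator pods to pick up the new values.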
## LLM service unreachable
Symptoms: Operator logs “LLM analyze failed” with connection refused, timeout, or 5xx.
Checks:

- Service and pod: `kubectl -n logjedi get svc llm-service` and `kubectl -n logjedi get pods -l app=llm-service`. The pod should be Running and the service should target the pod.
- URL: The operator must use the in-cluster URL when both run in the same cluster, e.g. `http://llm-service.logjedi.svc.cluster.local:8000`. Check `LLM_SERVICE_URL` in the operator deployment.
- Network policy: If the cluster uses network policies, allow traffic from the operator pod(s) to the LLM service on port 8000.
- Health: From a pod in the cluster, `curl http://llm-service.logjedi.svc.cluster.local:8000/health` should return `{"status":"ok"}`.
Actions: Fix the service/deployment; correct `LLM_SERVICE_URL`; relax or add a network policy; ensure the LLM service has resources and is not OOMKilled (check `kubectl describe pod`).
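If a default-deny network policy is in place, an ingress rule along these lines would let the operator reach the LLM service. The label selectors are assumed from the `app=` labels used elsewhere in this runbook; adjust them to match your actual pod labels:

```yaml
# Allows ingress to the LLM service on port 8000 from operator pods only;
# labels are assumptions based on the selectors used in this runbook.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-operator-to-llm
  namespace: logjedi
spec:
  podSelector:
    matchLabels:
      app: llm-service
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: logjedi-operator
      ports:
        - protocol: TCP
          port: 8000
```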
## Patch apply failed (auto mode)
Symptoms: Operator logs “dry-run patch failed” or “apply patch to deployment/job/pod” with an error.
Checks:

- Dry-run: If `DRY_RUN_BEFORE_APPLY=true`, the first patch is a server-side dry-run. If it fails, the real patch is not applied. Fix the patch (e.g. an invalid or immutable field).
- Scope: The operator only applies when the target namespace/name matches the failed resource and (if set) the namespace is in `AUTO_APPLY_NAMESPACES`. Check the config.
- Patch content: The LLM may suggest an invalid or immutable change. Check operator logs for the patch size and consider logging a redacted patch for debugging. Inspect the resource: `kubectl get deployment <name> -n <ns> -o yaml`.
Actions: Correct the patch (manually if needed); tighten prompt or allowlist so the LLM does not suggest invalid fields; disable dry-run only if you accept the risk.
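When a patch keeps failing the dry-run, it helps to compare it against the narrowest change that would work. As an illustration only (the container name and image are placeholders, and the exact patch shape the LLM emits depends on the prompt), a strategic-merge patch that changes nothing but a container image looks like:

```json
{
  "spec": {
    "template": {
      "spec": {
        "containers": [
          { "name": "app", "image": "registry.example.com/app:fixed" }
        ]
      }
    }
  }
}
```

Such a patch can be validated by hand with `kubectl -n <ns> patch deployment <name> --type=strategic --patch '<json>' --dry-run=server`, which runs full server-side admission and validation without persisting anything.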
## Escalation
If the issue is not covered here:
- Collect operator and LLM service logs, and (if applicable) a sample of the failing resource and events.
- Open an issue in the repo with the runbook section that best matches the symptom and what you’ve already checked.
- For security-sensitive issues (e.g. accidental secret exposure), do not paste full specs or logs; describe the scenario and redact.