Better alerting and monitoring
Description
We want to know when something is wrong with k8s, so we have to:
- Ingest data
- Make dashboards
- Define alerts
Currently, node-exporter runs on all nodes, and the Grafana Agent operator is running and set up to export node-exporter and kubelet metrics.
We should create dashboards and alerts for these.
We also want to export, visualise and alert on data from:
- [ ] Traefik (provides HTTP(S) ingress)
- [ ] Kyverno (validates k8s objects according to custom rules)
- [ ] Rook (connects Ceph storage to Kubernetes)
- More?
Resources
- The Grafana Agent will export metrics based on any Prometheus CRD in the namespace `grafana-agent` that has the label `instance=sysmans`
- Prometheus CRD definitions (only ServiceMonitor, PodMonitor, Probe)
- Existing observability stuff
- Grafana Instance
- Each thing will probably have its own docs
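
As a starting point for the items above, here is a minimal sketch of what a ServiceMonitor for one of the components might look like, built as a Python dict and printed as JSON. The only values taken from this issue are the namespace `grafana-agent` and the label `instance=sysmans`. The helper `service_monitor`, the target namespace, the Service labels, and the port name are all hypothetical and need checking against the actual deployment and each component's docs.

```python
import json


def service_monitor(name, target_namespace, match_labels, port, path="/metrics"):
    """Build a ServiceMonitor manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": name,
            # From this issue: the agent picks up CRDs in this namespace
            # that carry this label.
            "namespace": "grafana-agent",
            "labels": {"instance": "sysmans"},
        },
        "spec": {
            # Namespace where the Service to scrape lives.
            "namespaceSelector": {"matchNames": [target_namespace]},
            # Labels on the Service to scrape.
            "selector": {"matchLabels": match_labels},
            # Named port on that Service that serves metrics.
            "endpoints": [{"port": port, "path": path}],
        },
    }


manifest = service_monitor(
    name="traefik",
    target_namespace="traefik",  # assumption: where Traefik is deployed
    match_labels={"app.kubernetes.io/name": "traefik"},  # assumption
    port="metrics",  # assumption: name of the metrics port on the Service
)
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON manifests, the printed output could be piped to `kubectl apply -f -` once the assumed values have been replaced with real ones. The same helper could be reused for Kyverno and Rook with their own selectors and ports.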
Edited by Aria Shrimpton