1. Overview
This article describes how to use Prometheus, together with kube-state-metrics, cAdvisor, and Grafana, to monitor a Kubernetes cluster, raise alerts, and present an overall monitoring dashboard.
2. Environment
- Kubernetes: v1.12.5
- Prometheus: v2.3.1
- kube-state-metrics: v1.3.1 (collects state data for resource objects in the cluster)
- cAdvisor (already integrated into Kubernetes via the kubelet, no separate install needed; collects resource usage; see the quick check after this list)
- Grafana: v5.3.4
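cAdvisor needs no deployment step, but you can confirm up front that every kubelet is exposing its metrics. A minimal check through the API server proxy, assuming kubectl is pointed at this cluster (the node is just whichever one happens to be listed first):

```bash
# Read cAdvisor metrics for one node via the API server proxy
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/metrics/cadvisor" | head -n 20
```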
3. Deploying the Monitoring Components
3.1 Deploying kube-state-metrics
kube-state-metrics is installed from the following manifests; if you use an internal registry, change the image path in kube-state-metrics-deployment.yaml to point at it.
```
$ tree kube-state-metrics/
kube-state-metrics
├── kube-state-metrics-cluster-role-binding.yaml
├── kube-state-metrics-cluster-role.yaml
├── kube-state-metrics-deployment.yaml
├── kube-state-metrics-role-binding.yaml
├── kube-state-metrics-role.yaml
├── kube-state-metrics-service-account.yaml
└── kube-state-metrics-service.yaml

$ kubectl create -f kube-state-metrics/

$ kubectl get pod -n kube-system -o wide | grep "kube-state-metrics"
kube-state-metrics-7fd5dcc9b6-kpxmm   2/2   Running   4   526d   10.244.3.39   wx-2-centos53   <none>
```
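Once the pod is Running, it is worth a spot check that metrics are actually being served. A minimal sketch, assuming your kubectl (≥ 1.10) can port-forward to a Service; the grep pattern is just one metric that the alert rules later in this article rely on:

```bash
# Forward the kube-state-metrics Service locally and sample its metrics
kubectl -n kube-system port-forward svc/kube-state-metrics 8080:8080 &
curl -s localhost:8080/metrics | grep '^kube_node_status_condition' | head
```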
The individual manifests are as follows:
(1) kube-state-metrics-cluster-role-binding.yaml

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
```
(2) kube-state-metrics-cluster-role.yaml

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: ["list", "watch"]
- apiGroups: ["extensions"]
  resources:
  - daemonsets
  - deployments
  - replicasets
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources:
  - statefulsets
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources:
  - cronjobs
  - jobs
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources:
  - horizontalpodautoscalers
  verbs: ["list", "watch"]
```
(3) kube-state-metrics-deployment.yaml (the addon-resizer sidecar runs pod_nanny, which scales the kube-state-metrics container's CPU and memory with cluster size; the image paths below point at an internal registry)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: kube-state-metrics
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: registry.feidee.org/library/kube-state-metrics:v1.3.1
        ports:
        - name: http-metrics
          containerPort: 8080
        - name: telemetry
          containerPort: 8081
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
      - name: addon-resizer
        image: registry.feidee.org/library/addon-resizer:1.7
        resources:
          limits:
            cpu: "4"
            memory: 4096Mi
          requests:
            cpu: "2"
            memory: 2048Mi
        env:
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        command:
        - /pod_nanny
        - --container=kube-state-metrics
        - --cpu=1000m
        - --extra-cpu=500m
        - --memory=2048Mi
        - --extra-memory=1024Mi
        - --threshold=5
        - --deployment=kube-state-metrics
```
(4) kube-state-metrics-role-binding.yaml

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kube-state-metrics
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kube-state-metrics-resizer
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
```
(5) kube-state-metrics-role.yaml

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: kube-system
  name: kube-state-metrics-resizer
rules:
- apiGroups: [""]
  resources:
  - pods
  verbs: ["get"]
- apiGroups: ["extensions"]
  resources:
  - deployments
  resourceNames: ["kube-state-metrics"]
  verbs: ["get", "update"]
```
(6) kube-state-metrics-service-account.yaml

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
```
(7) kube-state-metrics-service.yaml (the prometheus.io/scrape: 'true' annotation is what lets the kubernetes-service-endpoints scrape job defined later discover this Service automatically)

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    k8s-app: kube-state-metrics
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
    protocol: TCP
  - name: telemetry
    port: 8081
    targetPort: telemetry
    protocol: TCP
  selector:
    k8s-app: kube-state-metrics
```
3.2 Deploying Prometheus
First, store Prometheus's main configuration file and its alerting rules in Kubernetes ConfigMaps.
kubectl create -f prometheus.config.yml
The contents of prometheus.config.yml:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: kube-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - 192.168.31.11:9093
    rule_files:
    - /etc/alter.yml
    scrape_configs:
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: 'kubernetes-nodes'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics
    - job_name: 'kubernetes-cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
    - job_name: 'kubernetes-services'
      kubernetes_sd_configs:
      - role: service
      metrics_path: /probe
      params:
        module: [http_2xx]
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter.example.com:9115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: kubernetes_name
    - job_name: 'kubernetes-ingresses'
      kubernetes_sd_configs:
      - role: ingress
      relabel_configs:
      - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
        regex: (.+);(.+);(.+)
        replacement: ${1}://${2}${3}
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter.example.com:9115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_ingress_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_ingress_name]
        target_label: kubernetes_name
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
    - job_name: 'kubernetes-nodes-cadvisor'
      scrape_interval: 10s
      scrape_timeout: 10s
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      metric_relabel_configs:
      - action: replace
        source_labels: [id]
        regex: '^/machine\.slice/machine-rkt\\x2d([^\\]+)\\.+/([^/]+)\.service$'
        target_label: rkt_container_name
        replacement: '${2}-${1}'
      - action: replace
        source_labels: [id]
        regex: '^/system\.slice/(.+)\.service$'
        target_label: systemd_service_name
        replacement: '${1}'
```
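Before creating the ConfigMap, the rendered configuration can be sanity-checked with promtool, which ships with Prometheus. A sketch, assuming you run it on a host where the two files are laid out as the config expects (rule_files points at /etc/alter.yml, so put a copy there or adjust the path first):

```bash
# Validate the scrape config; this also loads and checks /etc/alter.yml
promtool check config prometheus.yml
```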
kubectl create -f alter.yaml
The contents of alter.yaml (alert names are written as single identifiers because Prometheus rejects names containing spaces):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alter-conf
  labels:
    name: alter-conf
  namespace: kube-system
data:
  alter.yml: |
    groups:
    - name: k8s_alerts
      rules:
      - alert: ApiServerDown
        expr: up{job=~"kubernetes-apiservers"} == 0
        for: 1m
        labels:
          severity: High
          group: production
        annotations:
          description: '{{$labels.instance}} API server is down'
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status!="true"} == 1
        for: 1m
        labels:
          severity: High
          group: production
        annotations:
          description: 'Cluster node {{$labels.node}} is down.'
      - alert: ContainerNotRunning
        expr: kube_pod_container_status_running == 0
        for: 5m
        labels:
          severity: High
          group: production
        annotations:
          description: '{{$labels.pod}}-{{$labels.container}} is not in the running state.'
      - alert: PodFrequentRestarts
        expr: changes(kube_pod_container_status_restarts_total[30m]) > 10
        for: 5m
        labels:
          severity: Average
          group: production
        annotations:
          description: '{{$labels.pod}} restarted more than 10 times in the last 30 minutes'
      - alert: PodCpuUsageHigh
        # 1.6 cores, i.e. 80% assuming a 2-core allowance per pod
        expr: sum by (pod_name)(rate(container_cpu_usage_seconds_total{image!="", pod_name!=""}[1m])) > 1.6
        for: 5m
        labels:
          severity: Average
          group: production
        annotations:
          description: '{{$labels.pod_name}} Pod CPU usage is above 80%'
```
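The rules file can also be validated on its own; among other things this catches alert names that are not valid identifiers, which is why the names above avoid spaces:

```bash
promtool check rules alter.yml
```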
Grant Prometheus the permissions it needs:
kubectl create -f cluster-role.yaml
cluster-role.yaml contains both the ClusterRole and its binding:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: default
  namespace: kube-system
```
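Since the binding attaches the ClusterRole to the default ServiceAccount in kube-system (which the Prometheus pod below runs under), kubectl's impersonation check can confirm the grant took effect:

```bash
# Should print "yes" once the ClusterRoleBinding is in place
kubectl auth can-i list pods --as=system:serviceaccount:kube-system:default
```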
Deploy the Prometheus Deployment:
kubectl create -f prometheus-deployment.yaml
prometheus-deployment.yaml:

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus
  namespace: kube-system
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      nodeSelector:
        node: prometheus-monitor
      containers:
      - name: prometheus
        image: prom/prometheus:v2.3.1
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus/"
        - "--storage.tsdb.retention=180d"
        ports:
        - containerPort: 9090
        volumeMounts:
        - mountPath: /etc/prometheus/prometheus.yml
          name: prometheus-config-volume
          subPath: prometheus.yml
        - mountPath: /etc/alter.yml
          name: prometheus-alter-volume
          subPath: alter.yml
        - mountPath: /prometheus/
          name: prometheus-data
      volumes:
      - configMap:
          defaultMode: 420
          name: prometheus-server-conf
        name: prometheus-config-volume
      - configMap:
          defaultMode: 420
          name: alter-conf
        name: prometheus-alter-volume
      - hostPath:
          path: /prometheus/
          type: ""
        name: prometheus-data
```
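Note how nodeSelector and hostPath work together here: the pod only schedules onto a node labeled node=prometheus-monitor, and the TSDB lives in /prometheus/ on that node's local disk. A one-time preparation sketch, with the node name as a placeholder (the prom/prometheus 2.x image runs as the unprivileged "nobody" user, hence the chown; verify the UID for your image):

```bash
# Label the node that should run Prometheus (placeholder name)
kubectl label node <node-name> node=prometheus-monitor

# On that node: create the data directory and make it writable
# by the container user ("nobody", UID 65534 in the official image)
mkdir -p /prometheus && chown 65534:65534 /prometheus
```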
Create the Prometheus Service:
kubectl create -f prometheus-service.yaml
prometheus-service.yaml:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: kube-system
spec:
  selector:
    app: prometheus
  type: NodePort
  ports:
  - port: 8080
    targetPort: 9090
    nodePort: 30013
```
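With the NodePort in place, Prometheus is reachable on port 30013 of any node. A quick way to confirm that the scrape jobs configured above discovered their targets (the node IP is a placeholder):

```bash
# List scrape targets and their health via the Prometheus HTTP API
curl -s http://<node-ip>:30013/api/v1/targets | python -m json.tool | grep -E '"job"|"health"'
```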
Alertmanager was already deployed as part of our virtual machine monitoring, so there is no need to run another copy inside the cluster; Prometheus simply points at the external instance (192.168.31.11:9093 in the config above). Grafana was likewise already deployed for VM monitoring, so all that remains is to add the Kubernetes cluster as a new data source in Grafana; given the Service configuration above, the data source URL is http://IP:30013.
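As an alternative to clicking through the UI, Grafana 5.x can provision data sources from a file; a minimal sketch with the node IP as a placeholder, dropped into /etc/grafana/provisioning/datasources/:

```yaml
# k8s-prometheus.yml: data source provisioning for Grafana 5.x (placeholder IP)
apiVersion: 1
datasources:
  - name: k8s-prometheus
    type: prometheus
    access: proxy
    url: http://<node-ip>:30013
    isDefault: false
```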
A ready-made Grafana dashboard template is available for download; the screenshots below show the finished monitoring panels:


