Introduction: The Missing Observability Layer
One of the biggest gaps in SPIFFE/SPIRE deployments is comprehensive observability. While the system provides powerful identity management, understanding its health, performance, and security posture requires sophisticated monitoring. In this guide, we’ll build a complete observability stack that provides visibility into every aspect of your SPIFFE/SPIRE deployment.
After years of operating SPIRE in production, I’ve learned that monitoring workload identity is fundamentally different from traditional infrastructure monitoring. We need to track identity lifecycle, attestation success rates, certificate rotation health, and federation status: metrics that don’t exist in standard monitoring solutions.
Understanding SPIFFE/SPIRE Observability Requirements
Key Metrics Categories
```mermaid
graph TB
    subgraph "SPIRE Server Metrics"
        SS1[Registration Entries]
        SS2[Agent Connections]
        SS3[API Request Rates]
        SS4[Database Performance]
        SS5[CA Certificate Health]
    end

    subgraph "SPIRE Agent Metrics"
        SA1[SVID Renewal Success]
        SA2[Workload Attestations]
        SA3[Sync Failures]
        SA4[Cache Performance]
    end

    subgraph "Workload Metrics"
        WM1[SVID Acquisition Time]
        WM2[Certificate Expiry]
        WM3[mTLS Connection Success]
        WM4[Identity Validation]
    end

    subgraph "Security Metrics"
        SM1[Failed Attestations]
        SM2[Unauthorized Access]
        SM3[Certificate Violations]
        SM4[Federation Issues]
    end

    style SS1 fill:#e1f5fe
    style SA1 fill:#f3e5f5
    style WM1 fill:#e8f5e8
    style SM1 fill:#ffebee
```
Observability Architecture
```mermaid
graph LR
    subgraph "SPIRE Components"
        SERVER[SPIRE Server<br/>:9988/metrics]
        AGENT[SPIRE Agent<br/>:9988/metrics]
        WORKLOAD[Workload Apps<br/>Custom Metrics]
    end

    subgraph "Collection Layer"
        PROM[Prometheus]
        OTEL[OpenTelemetry<br/>Collector]
    end

    subgraph "Storage & Analysis"
        TSDB[Time Series DB]
        GRAFANA[Grafana]
        AM[AlertManager]
    end

    subgraph "Notifications"
        SLACK[Slack]
        PD[PagerDuty]
        EMAIL[Email]
    end

    SERVER --> PROM
    AGENT --> PROM
    WORKLOAD --> OTEL
    OTEL --> PROM

    PROM --> TSDB
    PROM --> AM
    TSDB --> GRAFANA

    AM --> SLACK
    AM --> PD
    AM --> EMAIL
```
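Before wiring up the collection layer, it helps to see what Prometheus actually consumes from the `:9988/metrics` endpoints in the diagram: plain-text exposition-format samples. The sketch below parses that format by hand, which is handy for ad-hoc debugging with `curl`; the sample metric names are illustrative, not guaranteed SPIRE output.

```python
import re

def parse_exposition(text: str) -> dict:
    """Parse Prometheus text exposition format into {metric{labels}: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE metadata
        m = re.match(r'^([a-zA-Z_:][a-zA-Z0-9_:]*(?:\{[^}]*\})?)\s+(\S+)', line)
        if m:
            samples[m.group(1)] = float(m.group(2))
    return samples

# Illustrative sample, as if fetched from a SPIRE server's telemetry port
sample = """\
# HELP spire_server_uptime_in_seconds Server uptime
# TYPE spire_server_uptime_in_seconds gauge
spire_server_uptime_in_seconds 3600
spire_server_rpc{method="BatchCreateEntry"} 42
"""

metrics = parse_exposition(sample)
print(metrics["spire_server_uptime_in_seconds"])  # 3600.0
```

In practice you would fetch the text with `requests.get("http://<pod-ip>:9988/metrics").text` (or use the official `prometheus_client` parser), but a hand-rolled parse makes it clear there is no magic in the pipeline: just labeled name/value pairs.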
Step 1: Enable SPIRE Telemetry
SPIRE Server Telemetry Configuration
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spire-server-telemetry-config
  namespace: spire-system
data:
  server.conf: |
    server {
      bind_address = "0.0.0.0"
      bind_port = "8081"
      trust_domain = "prod.example.com"
      data_dir = "/run/spire/data"
      log_level = "INFO"

      # Enable detailed logging for monitoring
      log_format = "json"

      # Health check endpoints
      health_checks {
        listener_enabled = true
        bind_address = "0.0.0.0"
        bind_port = "8080"
        live_path = "/live"
        ready_path = "/ready"
      }
    }

    plugins {
      DataStore "sql" {
        plugin_data {
          database_type = "postgres"
          connection_string = "host=postgres port=5432 dbname=spire user=spire password=secret sslmode=require"

          # Enable connection pooling metrics
          connection_pool {
            max_open_conns = 100
            max_idle_conns = 50
            conn_max_lifetime = "1h"
          }
        }
      }

      NodeAttestor "k8s_psat" {
        plugin_data {
          cluster = "production"
        }
      }

      KeyManager "disk" {
        plugin_data {
          keys_path = "/run/spire/data/keys"
        }
      }

      UpstreamAuthority "disk" {
        plugin_data {
          cert_file_path = "/run/spire/ca/intermediate.crt"
          key_file_path = "/run/spire/ca/intermediate.key"
          bundle_file_path = "/run/spire/ca/root.crt"
        }
      }
    }

    # Comprehensive telemetry configuration
    telemetry {
      # Prometheus metrics
      Prometheus {
        host = "0.0.0.0"
        port = 9988

        # Include detailed labels
        include_labels = true

        # Custom metric prefixes
        prefix = "spire_server"
      }

      # StatsD for additional metrics aggregation
      Statsd {
        address = "statsd-exporter.monitoring:9125"
        prefix = "spire.server"
      }

      # Enable all available metrics
      AllowedPrefixes = []  # Allow all metrics
      BlockedPrefixes = []  # Block none

      # Include detailed labels for better filtering
      AllowedLabels = [
        "method",
        "status_code",
        "error_type",
        "attestor_type",
        "selector_type",
        "trust_domain"
      ]
    }
---
# Update server deployment with telemetry
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: spire-server
  namespace: spire-system
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9988"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: spire-server
          ports:
            - containerPort: 9988
              name: telemetry
              protocol: TCP
            - containerPort: 8080
              name: health
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 30
            timeoutSeconds: 5
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
            timeoutSeconds: 5
```
SPIRE Agent Telemetry Configuration
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spire-agent-telemetry-config
  namespace: spire-system
data:
  agent.conf: |
    agent {
      data_dir = "/run/spire"
      log_level = "INFO"
      log_format = "json"
      server_address = "spire-server"
      server_port = "8081"
      socket_path = "/run/spire/sockets/agent.sock"
      trust_bundle_path = "/run/spire/bundle/bundle.crt"
      trust_domain = "prod.example.com"

      # Health check configuration
      health_checks {
        listener_enabled = true
        bind_address = "0.0.0.0"
        bind_port = "8080"
        live_path = "/live"
        ready_path = "/ready"
      }

      # Performance settings for monitoring
      sync_interval = "30s"

      # Enable SDS for better observability
      sds {
        default_svid_name = "default"
        default_bundle_name = "ROOTCA"
      }
    }

    plugins {
      NodeAttestor "k8s_psat" {
        plugin_data {
          cluster = "production"
          token_path = "/run/secrets/tokens/spire-agent"
        }
      }

      KeyManager "memory" {
        plugin_data {}
      }

      WorkloadAttestor "k8s" {
        plugin_data {
          # Increase sync interval for monitoring
          pod_info_sync_interval = "30s"
          skip_kubelet_verification = true

          # Enable detailed workload labeling
          use_new_container_locator = true
        }
      }
    }

    # Agent telemetry configuration
    telemetry {
      Prometheus {
        host = "0.0.0.0"
        port = 9988
        prefix = "spire_agent"
        include_labels = true
      }

      Statsd {
        address = "statsd-exporter.monitoring:9125"
        prefix = "spire.agent"
      }

      # Include node and pod information in metrics
      AllowedLabels = [
        "node_name",
        "pod_name",
        "pod_namespace",
        "workload_selector",
        "attestor_type"
      ]
    }
---
# Update agent daemonset with telemetry
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: spire-agent
  namespace: spire-system
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9988"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: spire-agent
          ports:
            - containerPort: 9988
              name: telemetry
              protocol: TCP
              hostPort: 9988  # Allow direct access from Prometheus
            - containerPort: 8080
              name: health
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
```
Step 2: Prometheus Configuration
Service Discovery and Scrape Config
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: 'production'
        region: 'us-east-1'

    rule_files:
      - "/etc/prometheus/rules/*.yml"

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093

    scrape_configs:
      # SPIRE Server metrics
      - job_name: 'spire-server'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names: ['spire-system']

        relabel_configs:
          # Only scrape pods with the correct labels
          - source_labels: [__meta_kubernetes_pod_label_app]
            action: keep
            regex: spire-server

          # Add useful labels
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod_name
          - source_labels: [__meta_kubernetes_pod_node_name]
            target_label: node_name
          - source_labels: [__meta_kubernetes_namespace]
            target_label: kubernetes_namespace

          # Build the scrape address from the pod IP plus the annotated port
          - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: (.+);(.+)
            replacement: ${1}:${2}
            target_label: __address__

        metric_relabel_configs:
          # Add server instance information
          - source_labels: [__name__]
            regex: 'spire_server_.*'
            target_label: component
            replacement: 'spire-server'

          # Keep only SPIRE-related metrics
          - source_labels: [__name__]
            regex: 'spire_server_.*|up|process_.*'
            action: keep

      # SPIRE Agent metrics
      - job_name: 'spire-agent'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names: ['spire-system']

        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            action: keep
            regex: spire-agent

          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod_name
          - source_labels: [__meta_kubernetes_pod_node_name]
            target_label: node_name
          # Agents expose hostPort 9988, so scrape via the host IP
          - source_labels: [__meta_kubernetes_pod_host_ip]
            target_label: __address__
            regex: (.+)
            replacement: ${1}:9988

        metric_relabel_configs:
          - source_labels: [__name__]
            regex: 'spire_agent_.*'
            target_label: component
            replacement: 'spire-agent'

          - source_labels: [__name__]
            regex: 'spire_agent_.*|up|process_.*'
            action: keep

      # Workload metrics (applications using SPIFFE)
      - job_name: 'spiffe-workloads'
        kubernetes_sd_configs:
          - role: pod

        relabel_configs:
          # Only scrape pods with SPIFFE annotation
          - source_labels: [__meta_kubernetes_pod_annotation_spiffe_io_enabled]
            action: keep
            regex: true

          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true

          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)

          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__

          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name

        metric_relabel_configs:
          # Add workload identity information
          - source_labels: [__name__]
            regex: 'spiffe_.*'
            target_label: component
            replacement: 'spiffe-workload'

      # Node Exporter for infrastructure metrics
      - job_name: 'node-exporter'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names: ['monitoring']

        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            action: keep
            regex: node-exporter

          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: node
```
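Relabeling rules are easy to get subtly wrong, so it is worth tracing them by hand. The following is a small, hypothetical re-implementation of the two `keep` rules and the port rewrite from the `spiffe-workloads` job; Prometheus does this internally, and the annotation keys mirror the ones used in the scrape config above.

```python
import re

def should_scrape(annotations: dict) -> bool:
    # Mirrors the two `keep` relabel rules: both annotations must equal "true"
    return (annotations.get("spiffe.io/enabled") == "true"
            and annotations.get("prometheus.io/scrape") == "true")

def scrape_address(address: str, annotations: dict) -> str:
    # Mirrors the __address__ rewrite: drop any existing port, then
    # append the one from prometheus.io/port
    port = annotations.get("prometheus.io/port")
    if not port:
        return address
    m = re.match(r"([^:]+)(?::\d+)?$", address)
    return f"{m.group(1)}:{port}" if m else address

ann = {
    "spiffe.io/enabled": "true",
    "prometheus.io/scrape": "true",
    "prometheus.io/port": "9102",
}
print(should_scrape(ann))                    # True
print(scrape_address("10.0.1.5:8080", ann))  # 10.0.1.5:9102
```

If a pod is unexpectedly missing from the targets page, checking its annotations against logic like this (or against the "Service Discovery" tab in the Prometheus UI) usually finds the mismatch quickly.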
Custom Metrics Collection
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spiffe-metrics-collector
  namespace: spire-system
data:
  collector.py: |
    #!/usr/bin/env python3
    import time
    import requests
    import json
    from prometheus_client import start_http_server, Gauge, Counter, Histogram
    from kubernetes import client, config
    import logging

    # Set up logging
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    # Custom metrics
    spiffe_identity_count = Gauge('spiffe_total_identities', 'Total number of SPIFFE identities')
    spiffe_expired_certs = Gauge('spiffe_expired_certificates', 'Number of expired certificates')
    spiffe_expiring_soon = Gauge('spiffe_certificates_expiring_soon', 'Certificates expiring within 24h')
    spiffe_attestation_failures = Counter('spiffe_attestation_failures_total', 'Total attestation failures')
    spiffe_svid_fetch_time = Histogram('spiffe_svid_fetch_duration_seconds', 'Time to fetch SVID')

    class SPIFFEMetricsCollector:
        def __init__(self):
            self.spire_server_url = "http://spire-server.spire-system:8081"

        def collect_registration_metrics(self):
            """Collect registration entry metrics"""
            try:
                # Use SPIRE Server API to get entries
                # NOTE: the stock SPIRE server API is gRPC; this assumes an
                # HTTP shim/proxy exposing entries as JSON at /entries
                response = requests.get(f"{self.spire_server_url}/entries")
                if response.status_code == 200:
                    entries = response.json().get('entries', [])
                    spiffe_identity_count.set(len(entries))

                    # Count expired and expiring certificates
                    now = time.time()
                    expired = 0
                    expiring_soon = 0

                    for entry in entries:
                        expiry = entry.get('expiry', 0)
                        if expiry < now:
                            expired += 1
                        elif expiry < (now + 86400):  # 24 hours
                            expiring_soon += 1

                    spiffe_expired_certs.set(expired)
                    spiffe_expiring_soon.set(expiring_soon)

            except Exception as e:
                logger.error(f"Failed to collect registration metrics: {e}")

        def collect_workload_metrics(self):
            """Collect workload-specific metrics"""
            try:
                # Load Kubernetes config
                config.load_incluster_config()
                v1 = client.CoreV1Api()

                # Get all pods with SPIFFE annotations
                pods = v1.list_pod_for_all_namespaces(
                    label_selector="spiffe=enabled"
                )

                for pod in pods.items:
                    if pod.status.phase == "Running":
                        # Simulate SVID fetch time measurement
                        # In reality, this would be instrumented in the workload
                        fetch_time = self.measure_svid_fetch_time(pod)
                        if fetch_time:
                            spiffe_svid_fetch_time.observe(fetch_time)

            except Exception as e:
                logger.error(f"Failed to collect workload metrics: {e}")

        def measure_svid_fetch_time(self, pod):
            """Measure time to fetch SVID for a pod"""
            # This is a placeholder - in production, instrument your workloads
            return 0.1  # Mock 100ms fetch time

        def run(self):
            logger.info("Starting SPIFFE metrics collector")
            while True:
                try:
                    self.collect_registration_metrics()
                    self.collect_workload_metrics()
                    time.sleep(30)  # Collect every 30 seconds
                except Exception as e:
                    logger.error(f"Error in collection cycle: {e}")
                    time.sleep(10)

    if __name__ == '__main__':
        # Start Prometheus metrics server
        start_http_server(8000)
        logger.info("Metrics server started on port 8000")

        # Start collector
        collector = SPIFFEMetricsCollector()
        collector.run()
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spiffe-metrics-collector
  namespace: spire-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spiffe-metrics-collector
  template:
    metadata:
      labels:
        app: spiffe-metrics-collector
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
    spec:
      serviceAccountName: spiffe-metrics-collector
      containers:
        - name: collector
          image: python:3.9-slim
          command: ["python", "/app/collector.py"]
          ports:
            - containerPort: 8000
              name: metrics
          env:
            - name: PYTHONUNBUFFERED
              value: "1"
          volumeMounts:
            - name: app
              mountPath: /app
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "200m"
      volumes:
        - name: app
          configMap:
            name: spiffe-metrics-collector
            defaultMode: 0755
```
Step 3: Grafana Dashboards
Comprehensive SPIRE Dashboard
```json
{
  "dashboard": {
    "id": null,
    "title": "SPIFFE/SPIRE Comprehensive Monitoring",
    "description": "Complete observability for SPIFFE/SPIRE deployment",
    "tags": ["spiffe", "spire", "security", "identity"],
    "timezone": "UTC",
    "refresh": "30s",
    "time": { "from": "now-1h", "to": "now" },
    "templating": {
      "list": [
        { "name": "cluster", "type": "query", "query": "label_values(up{job=\"spire-server\"}, cluster)", "refresh": 1 },
        { "name": "server_instance", "type": "query", "query": "label_values(up{job=\"spire-server\", cluster=\"$cluster\"}, instance)", "refresh": 1, "multi": true, "includeAll": true }
      ]
    },
    "panels": [
      { "id": 1, "title": "SPIRE Server Health Overview", "type": "stat",
        "targets": [ { "expr": "up{job=\"spire-server\", cluster=\"$cluster\"}", "legendFormat": "{{instance}}" } ],
        "fieldConfig": { "defaults": { "mappings": [ { "type": "value", "value": "0", "text": "DOWN" }, { "type": "value", "value": "1", "text": "UP" } ], "thresholds": { "steps": [ { "color": "red", "value": 0 }, { "color": "green", "value": 1 } ] } } },
        "gridPos": { "h": 4, "w": 6, "x": 0, "y": 0 } },
      { "id": 2, "title": "Registration Entries Count", "type": "stat",
        "targets": [ { "expr": "sum(spire_server_registration_entries{cluster=\"$cluster\", instance=~\"$server_instance\"})", "legendFormat": "Total Entries" } ],
        "fieldConfig": { "defaults": { "unit": "short", "color": { "mode": "thresholds" }, "thresholds": { "steps": [ { "color": "green", "value": 0 }, { "color": "yellow", "value": 1000 }, { "color": "red", "value": 10000 } ] } } },
        "gridPos": { "h": 4, "w": 6, "x": 6, "y": 0 } },
      { "id": 3, "title": "Connected Agents", "type": "stat",
        "targets": [ { "expr": "sum(spire_server_connected_agents{cluster=\"$cluster\", instance=~\"$server_instance\"})", "legendFormat": "Active Agents" } ],
        "gridPos": { "h": 4, "w": 6, "x": 12, "y": 0 } },
      { "id": 4, "title": "API Request Rate", "type": "stat",
        "targets": [ { "expr": "sum(rate(spire_server_api_requests_total{cluster=\"$cluster\", instance=~\"$server_instance\"}[5m]))", "legendFormat": "Requests/sec" } ],
        "fieldConfig": { "defaults": { "unit": "reqps" } },
        "gridPos": { "h": 4, "w": 6, "x": 18, "y": 0 } },
      { "id": 5, "title": "API Request Latency", "type": "timeseries",
        "targets": [
          { "expr": "histogram_quantile(0.95, sum(rate(spire_server_api_request_duration_seconds_bucket{cluster=\"$cluster\", instance=~\"$server_instance\"}[5m])) by (le, method))", "legendFormat": "p95 - {{method}}" },
          { "expr": "histogram_quantile(0.50, sum(rate(spire_server_api_request_duration_seconds_bucket{cluster=\"$cluster\", instance=~\"$server_instance\"}[5m])) by (le, method))", "legendFormat": "p50 - {{method}}" }
        ],
        "fieldConfig": { "defaults": { "unit": "s" } },
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 } },
      { "id": 6, "title": "Error Rate by API Method", "type": "timeseries",
        "targets": [ { "expr": "sum(rate(spire_server_api_errors_total{cluster=\"$cluster\", instance=~\"$server_instance\"}[5m])) by (method)", "legendFormat": "{{method}}" } ],
        "fieldConfig": { "defaults": { "unit": "reqps" } },
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 } },
      { "id": 7, "title": "Database Connection Pool", "type": "timeseries",
        "targets": [
          { "expr": "spire_server_datastore_connections_active{cluster=\"$cluster\", instance=~\"$server_instance\"}", "legendFormat": "Active - {{instance}}" },
          { "expr": "spire_server_datastore_connections_idle{cluster=\"$cluster\", instance=~\"$server_instance\"}", "legendFormat": "Idle - {{instance}}" }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 12 } },
      { "id": 8, "title": "Database Query Performance", "type": "timeseries",
        "targets": [ { "expr": "histogram_quantile(0.95, sum(rate(spire_server_datastore_query_duration_seconds_bucket{cluster=\"$cluster\", instance=~\"$server_instance\"}[5m])) by (le, operation))", "legendFormat": "p95 - {{operation}}" } ],
        "fieldConfig": { "defaults": { "unit": "s" } },
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 12 } },
      { "id": 9, "title": "Certificate Expiry Timeline", "type": "timeseries",
        "targets": [
          { "expr": "(spire_server_ca_certificate_expiry_timestamp{cluster=\"$cluster\", instance=~\"$server_instance\"} - time()) / 86400", "legendFormat": "CA Cert - {{instance}}" },
          { "expr": "spiffe_certificates_expiring_soon{cluster=\"$cluster\"}", "legendFormat": "Expiring Soon" }
        ],
        "fieldConfig": { "defaults": { "unit": "d" } },
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 20 } },
      { "id": 10, "title": "Agent Sync Success Rate", "type": "timeseries",
        "targets": [ { "expr": "sum(rate(spire_agent_sync_success_total{cluster=\"$cluster\"}[5m])) / sum(rate(spire_agent_sync_attempts_total{cluster=\"$cluster\"}[5m])) * 100", "legendFormat": "Success Rate %" } ],
        "fieldConfig": { "defaults": { "unit": "percent", "min": 0, "max": 100 } },
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 20 } },
      { "id": 11, "title": "Memory Usage by Component", "type": "timeseries",
        "targets": [
          { "expr": "process_resident_memory_bytes{job=\"spire-server\", cluster=\"$cluster\", instance=~\"$server_instance\"} / 1024 / 1024 / 1024", "legendFormat": "Server - {{instance}}" },
          { "expr": "process_resident_memory_bytes{job=\"spire-agent\", cluster=\"$cluster\"} / 1024 / 1024 / 1024", "legendFormat": "Agent - {{instance}}" }
        ],
        "fieldConfig": { "defaults": { "unit": "GB" } },
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 28 } },
      { "id": 12, "title": "CPU Usage by Component", "type": "timeseries",
        "targets": [
          { "expr": "rate(process_cpu_seconds_total{job=\"spire-server\", cluster=\"$cluster\", instance=~\"$server_instance\"}[5m]) * 100", "legendFormat": "Server - {{instance}}" },
          { "expr": "rate(process_cpu_seconds_total{job=\"spire-agent\", cluster=\"$cluster\"}[5m]) * 100", "legendFormat": "Agent - {{instance}}" }
        ],
        "fieldConfig": { "defaults": { "unit": "percent" } },
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 28 } }
    ]
  }
}
```
Security-Focused Dashboard
```json
{
  "dashboard": {
    "title": "SPIFFE/SPIRE Security Monitoring",
    "description": "Security incidents, attestation failures, and threat detection",
    "panels": [
      { "id": 1, "title": "Failed Attestations by Type", "type": "timeseries",
        "targets": [ { "expr": "sum(rate(spire_server_attestation_failures_total{cluster=\"$cluster\"}[5m])) by (attestor_type, error_type)", "legendFormat": "{{attestor_type}} - {{error_type}}" } ],
        "alert": { "conditions": [ { "query": { "queryType": "", "refId": "A" }, "reducer": { "type": "last", "params": [] }, "evaluator": { "params": [0.1], "type": "gt" } } ], "executionErrorState": "alerting", "noDataState": "no_data", "frequency": "60s", "handler": 1, "name": "High Attestation Failure Rate", "message": "Attestation failure rate exceeds threshold" } },
      { "id": 2, "title": "Unauthorized Access Attempts", "type": "timeseries",
        "targets": [ { "expr": "sum(rate(spire_server_api_unauthorized_total{cluster=\"$cluster\"}[5m]))", "legendFormat": "Unauthorized Requests" } ] },
      { "id": 3, "title": "Certificate Validation Failures", "type": "stat",
        "targets": [ { "expr": "sum(increase(spire_server_certificate_validation_failures_total{cluster=\"$cluster\"}[1h]))", "legendFormat": "Last Hour" } ],
        "fieldConfig": { "defaults": { "thresholds": { "steps": [ { "color": "green", "value": 0 }, { "color": "yellow", "value": 10 }, { "color": "red", "value": 50 } ] } } } },
      { "id": 4, "title": "Anomalous Registration Patterns", "type": "timeseries",
        "targets": [
          { "expr": "rate(spire_server_registration_created_total{cluster=\"$cluster\"}[5m])", "legendFormat": "Registration Rate" },
          { "expr": "avg_over_time(rate(spire_server_registration_created_total{cluster=\"$cluster\"}[5m])[7d:1h])", "legendFormat": "7-day Average" }
        ] },
      { "id": 5, "title": "Top Error Sources", "type": "table",
        "targets": [ { "expr": "topk(10, sum by (source_ip, error_type) (increase(spire_server_errors_total{cluster=\"$cluster\"}[1h])))", "format": "table", "instant": true } ],
        "transformations": [ { "id": "organize", "options": { "excludeByName": { "Time": true }, "renameByName": { "source_ip": "Source IP", "error_type": "Error Type", "Value": "Count" } } } ] }
    ]
  }
}
```
Step 4: Intelligent Alerting
Critical Alert Rules
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spire-critical-alerts
  namespace: spire-system
spec:
  groups:
    - name: spire.critical
      interval: 30s
      rules:
        # Server Availability
        - alert: SPIREServerDown
          expr: up{job="spire-server"} == 0
          for: 2m
          labels:
            severity: critical
            component: spire-server
          annotations:
            summary: "SPIRE Server instance is down"
            description: "SPIRE Server {{ $labels.instance }} has been down for more than 2 minutes. This affects workload identity issuance."
            runbook_url: "https://wiki.company.com/spire-runbooks#server-down"

        # Database Connectivity
        - alert: SPIREDatabaseConnectionFailure
          expr: spire_server_datastore_connections_active == 0
          for: 5m
          labels:
            severity: critical
            component: datastore
          annotations:
            summary: "SPIRE Server cannot connect to database"
            description: "SPIRE Server {{ $labels.instance }} has no active database connections for 5 minutes."

        # High Error Rate
        - alert: SPIREHighErrorRate
          expr: |
            sum(rate(spire_server_api_errors_total[5m])) by (instance)
            /
            sum(rate(spire_server_api_requests_total[5m])) by (instance)
            > 0.05
          for: 10m
          labels:
            severity: warning
            component: api
          annotations:
            summary: "High error rate in SPIRE Server API"
            description: "SPIRE Server {{ $labels.instance }} API error rate is {{ $value | humanizePercentage }} over the last 10 minutes."

        # Certificate Expiry
        - alert: SPIRECACertificateExpiringSoon
          expr: |
            (spire_server_ca_certificate_expiry_timestamp - time()) / 86400 < 30
          for: 1h
          labels:
            severity: warning
            component: certificates
          annotations:
            summary: "SPIRE CA certificate expiring soon"
            description: "SPIRE CA certificate will expire in {{ $value }} days. Plan for rotation."

        - alert: SPIRECACertificateExpired
          expr: |
            (spire_server_ca_certificate_expiry_timestamp - time()) < 0
          for: 1m
          labels:
            severity: critical
            component: certificates
          annotations:
            summary: "SPIRE CA certificate has expired"
            description: "SPIRE CA certificate has expired. Immediate action required."

        # Agent Issues
        - alert: SPIREAgentSyncFailures
          expr: |
            rate(spire_agent_sync_failures_total[5m]) > 0.1
          for: 15m
          labels:
            severity: warning
            component: agent
          annotations:
            summary: "High agent sync failure rate"
            description: "SPIRE Agent {{ $labels.instance }} sync failure rate is {{ $value }} failures/second."

        - alert: SPIREAgentDisconnected
          expr: |
            (time() - spire_agent_last_sync_timestamp) > 300
          for: 5m
          labels:
            severity: critical
            component: agent
          annotations:
            summary: "SPIRE Agent disconnected"
            description: "SPIRE Agent {{ $labels.instance }} hasn't synced for {{ $value }} seconds."

        # Security Alerts
        - alert: SPIREUnauthorizedAccessSpike
          expr: |
            sum(rate(spire_server_api_unauthorized_total[5m])) > 1
          for: 5m
          labels:
            severity: warning
            component: security
          annotations:
            summary: "Spike in unauthorized access attempts"
            description: "Unauthorized access attempts: {{ $value }} requests/second to SPIRE Server."

        - alert: SPIREAttestationFailureSpike
          expr: |
            sum(rate(spire_server_attestation_failures_total[5m])) by (attestor_type) > 0.5
          for: 10m
          labels:
            severity: warning
            component: security
          annotations:
            summary: "High attestation failure rate"
            description: "Attestation failures for {{ $labels.attestor_type }}: {{ $value }} failures/second."

        # Performance Alerts
        - alert: SPIREHighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(spire_server_api_request_duration_seconds_bucket[5m])) by (le, method)
            ) > 5
          for: 15m
          labels:
            severity: warning
            component: performance
          annotations:
            summary: "High API latency"
            description: "95th percentile latency for {{ $labels.method }} is {{ $value }}s."

        - alert: SPIREHighMemoryUsage
          expr: |
            process_resident_memory_bytes{job="spire-server"} / (1024*1024*1024) > 4
          for: 15m
          labels:
            severity: warning
            component: resources
          annotations:
            summary: "High memory usage"
            description: "SPIRE Server {{ $labels.instance }} using {{ $value }}GB of memory."

        # Capacity Planning
        - alert: SPIREEntryCountHigh
          expr: |
            spire_server_registration_entries > 50000
          for: 30m
          labels:
            severity: warning
            component: capacity
          annotations:
            summary: "High number of registration entries"
            description: "SPIRE Server has {{ $value }} registration entries. Consider capacity planning."
```
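As a sanity check on `SPIREHighErrorRate`, the division-based expression reduces to a plain ratio test. Here is that logic in Python; the 5% threshold comes from the rule above, while the zero-traffic guard is an assumption about how you'd want the no-data case handled (PromQL simply produces no sample when the denominator is absent).

```python
def high_error_rate(error_rps: float, request_rps: float,
                    threshold: float = 0.05) -> bool:
    """Mirror of the SPIREHighErrorRate expression: errors/requests > 5%."""
    if request_rps == 0:
        return False  # no traffic -> no sample -> no alert
    return error_rps / request_rps > threshold

print(high_error_rate(0.4, 10.0))  # 4% error rate -> False
print(high_error_rate(1.2, 10.0))  # 12% error rate -> True
```

Worked through a concrete case: at 10 requests/second, the rule only starts pending once errors exceed 0.5/second, and even then it must hold for the full `for: 10m` window before firing.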
AlertManager Configuration
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      smtp_smarthost: 'smtp.company.com:587'
      smtp_from: 'alerts@company.com'
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

    route:
      group_by: ['alertname', 'cluster', 'component']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'default'
      routes:
        # Critical alerts go to multiple channels
        - match:
            severity: critical
          receiver: 'critical-alerts'
          routes:
            # SPIRE-specific critical alerts
            - match:
                component: spire-server
              receiver: 'spire-critical'
            - match:
                component: certificates
              receiver: 'security-team'

        # Security alerts
        - match:
            component: security
          receiver: 'security-alerts'

        # Performance warnings
        - match:
            component: performance
          receiver: 'performance-alerts'

    receivers:
      - name: 'default'
        slack_configs:
          - channel: '#alerts'
            title: 'SPIRE Alert: {{ .GroupLabels.alertname }}'
            text: |
              {{ range .Alerts }}
              {{ .Annotations.summary }}
              {{ .Annotations.description }}
              {{ end }}

      - name: 'critical-alerts'
        slack_configs:
          - channel: '#critical-alerts'
            color: 'danger'
            title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
            text: |
              {{ range .Alerts }}
              *Summary:* {{ .Annotations.summary }}
              *Description:* {{ .Annotations.description }}
              *Cluster:* {{ .Labels.cluster }}
              *Instance:* {{ .Labels.instance }}
              {{ if .Annotations.runbook_url }}*Runbook:* {{ .Annotations.runbook_url }}{{ end }}
              {{ end }}
        pagerduty_configs:
          - routing_key: 'YOUR_PAGERDUTY_KEY'
            description: 'SPIRE Critical Alert: {{ .GroupLabels.alertname }}'

      - name: 'spire-critical'
        slack_configs:
          - channel: '#spire-ops'
            color: 'danger'
            title: '🔑 SPIRE CRITICAL: {{ .GroupLabels.alertname }}'
            text: |
              {{ range .Alerts }}
              {{ .Annotations.summary }}

              *Impact:* Workload identity operations may be affected
              *Action Required:* Immediate investigation needed

              {{ .Annotations.description }}
              {{ end }}

      - name: 'security-alerts'
        slack_configs:
          - channel: '#security-alerts'
            color: 'warning'
            title: '🛡️ Security Alert: {{ .GroupLabels.alertname }}'
            text: |
              {{ range .Alerts }}
              {{ .Annotations.summary }}
              {{ .Annotations.description }}
              {{ end }}
        email_configs:
          - to: 'security-team@company.com'
            subject: 'SPIRE Security Alert: {{ .GroupLabels.alertname }}'
            body: |
              {{ range .Alerts }}
              {{ .Annotations.description }}
              {{ end }}

      - name: 'performance-alerts'
        slack_configs:
          - channel: '#performance'
            color: 'warning'
            title: '📈 Performance Alert: {{ .GroupLabels.alertname }}'

    inhibit_rules:
      # Don't alert on agent issues if server is down
      - source_match:
          alertname: SPIREServerDown
        target_match:
          component: agent
        equal: ['cluster']

      # Don't alert on API errors if database is down
      - source_match:
          alertname: SPIREDatabaseConnectionFailure
        target_match:
          component: api
        equal: ['instance']
```
Step 5: Custom Workload Instrumentation
Go Application with SPIFFE Metrics
```go
// spiffe-metrics.go - Instrument Go applications with SPIFFE metrics
package main

import (
	"context"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"github.com/spiffe/go-spiffe/v2/spiffeid"
	"github.com/spiffe/go-spiffe/v2/spiffetls/tlsconfig"
	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

var (
	// SPIFFE-specific metrics
	spiffeSVIDFetchDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "spiffe_svid_fetch_duration_seconds",
			Help:    "Time taken to fetch SVID from Workload API",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"result"},
	)

	spiffeSVIDRotations = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "spiffe_svid_rotations_total",
			Help: "Total number of SVID rotations",
		},
		[]string{"result"},
	)

	spiffeMTLSConnections = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "spiffe_mtls_connections_total",
			Help: "Total mTLS connections made",
		},
		[]string{"target_id", "result"},
	)

	spiffeSVIDExpiry = promauto.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "spiffe_svid_expiry_timestamp",
			Help: "SVID expiry timestamp",
		},
		[]string{"spiffe_id"},
	)
)

type SPIFFEInstrumentedClient struct {
	source      *workloadapi.X509Source
	httpClient  *http.Client
	currentSVID string
}

func NewSPIFFEInstrumentedClient(ctx context.Context) (*SPIFFEInstrumentedClient, error) {
	start := time.Now()

	// An X509Source implements both the SVID and bundle Source interfaces
	// that tlsconfig expects, and keeps the SVID up to date automatically.
	source, err := workloadapi.NewX509Source(ctx,
		workloadapi.WithClientOptions(
			workloadapi.WithAddr("unix:///spiffe-workload-api/spire-agent.sock")))
	if err != nil {
		spiffeSVIDFetchDuration.WithLabelValues("error").Observe(time.Since(start).Seconds())
		return nil, err
	}

	// Fetch initial SVID
	svid, err := source.GetX509SVID()
	if err != nil {
		spiffeSVIDFetchDuration.WithLabelValues("error").Observe(time.Since(start).Seconds())
		return nil, err
	}

	spiffeSVIDFetchDuration.WithLabelValues("success").Observe(time.Since(start).Seconds())
	spiffeSVIDExpiry.WithLabelValues(svid.ID.String()).Set(float64(svid.Certificates[0].NotAfter.Unix()))

	// Create HTTP client with mTLS
	tlsConfig := tlsconfig.MTLSClientConfig(source, source, tlsconfig.AuthorizeAny())
	httpClient := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: tlsConfig,
		},
	}

	sic := &SPIFFEInstrumentedClient{
		source:      source,
		httpClient:  httpClient,
		currentSVID: svid.ID.String(),
	}

	// Start SVID rotation monitoring
	go sic.monitorSVIDRotation(ctx)

	return sic, nil
}

func (s *SPIFFEInstrumentedClient) monitorSVIDRotation(ctx context.Context) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			start := time.Now()

			svid, err := s.source.GetX509SVID()
			if err != nil {
				spiffeSVIDFetchDuration.WithLabelValues("error").Observe(time.Since(start).Seconds())
				continue
			}

			spiffeSVIDFetchDuration.WithLabelValues("success").Observe(time.Since(start).Seconds())

			currentID := svid.ID.String()

			// Check if SVID rotated
			if currentID != s.currentSVID {
				spiffeSVIDRotations.WithLabelValues("success").Inc()
				s.currentSVID = currentID
			}

			// Update expiry metric
			spiffeSVIDExpiry.WithLabelValues(currentID).Set(float64(svid.Certificates[0].NotAfter.Unix()))
		}
	}
}

func (s *SPIFFEInstrumentedClient) CallService(ctx context.Context, targetID, url string) (*http.Response, error) {
	// Create a client that only accepts the target's SPIFFE ID
	id := spiffeid.RequireFromString(targetID)
	tlsConfig := tlsconfig.MTLSClientConfig(s.source, s.source, tlsconfig.AuthorizeID(id))

	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: tlsConfig,
		},
	}

	req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
	if err != nil {
		return nil, err
	}

	resp, err := client.Do(req)

	// Record metrics
	if err != nil {
		spiffeMTLSConnections.WithLabelValues(targetID, "error").Inc()
	} else {
		spiffeMTLSConnections.WithLabelValues(targetID, "success").Inc()
	}

	return resp, err
}

func main() {
	ctx := context.Background()

	client, err := NewSPIFFEInstrumentedClient(ctx)
	if err != nil {
		panic(err)
	}
	defer client.source.Close()

	// Expose metrics
	http.Handle("/metrics", promhttp.Handler())

	// Health check
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("healthy"))
	})

	// Example business logic
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Call another service
		resp, err := client.CallService(ctx, "spiffe://prod.example.com/backend", "https://backend:8443/data")
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer resp.Body.Close()

		w.WriteHeader(http.StatusOK)
		w.Write([]byte("Request successful"))
	})

	// Start server
	if err := http.ListenAndServe(":8080", nil); err != nil {
		panic(err)
	}
}
```

Note: the client uses `workloadapi.NewX509Source` rather than a raw `workloadapi.Client`, since the source type satisfies the SVID and bundle `Source` interfaces that `tlsconfig.MTLSClientConfig` requires.
Step 6: Log Analysis and Correlation
Structured Logging with ELK Stack
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-spire-config
  namespace: logging
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/spire-server-*.log
      pos_file /var/log/fluentd-spire-server.log.pos
      tag kubernetes.spire.server
      format json
      time_key timestamp
      time_format %Y-%m-%dT%H:%M:%S.%NZ
    </source>

    <source>
      @type tail
      path /var/log/containers/spire-agent-*.log
      pos_file /var/log/fluentd-spire-agent.log.pos
      tag kubernetes.spire.agent
      format json
      time_key timestamp
      time_format %Y-%m-%dT%H:%M:%S.%NZ
    </source>

    # Parse SPIRE structured logs
    <filter kubernetes.spire.**>
      @type parser
      key_name log
      reserve_data true
      <parse>
        @type json
        json_parser yajl
      </parse>
    </filter>

    # Extract security events
    <filter kubernetes.spire.**>
      @type grep
      <regexp>
        key level
        pattern ^(ERROR|WARN)$
      </regexp>
    </filter>

    # Enrich with Kubernetes metadata
    <filter kubernetes.spire.**>
      @type kubernetes_metadata
      kubernetes_url https://kubernetes.default.svc
      bearer_token_file /var/run/secrets/kubernetes.io/serviceaccount/token
      ca_file /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    </filter>

    # Send to Elasticsearch
    <match kubernetes.spire.**>
      @type elasticsearch
      host elasticsearch.logging.svc.cluster.local
      port 9200
      index_name spire-logs
      type_name _doc

      <buffer>
        @type file
        path /var/log/fluentd-buffers/spire.buffer
        flush_mode interval
        flush_interval 5s
        chunk_limit_size 2M
        queue_limit_length 8
        retry_max_interval 30
        retry_forever true
      </buffer>
    </match>
```
Elasticsearch Index Templates
```json
{
  "index_patterns": ["spire-logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.refresh_interval": "5s",
      "index.max_result_window": 50000
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "level": { "type": "keyword" },
        "msg": { "type": "text", "analyzer": "standard" },
        "component": { "type": "keyword" },
        "spiffe_id": { "type": "keyword" },
        "attestor_type": { "type": "keyword" },
        "error": { "type": "text" },
        "kubernetes": {
          "properties": {
            "pod_name": { "type": "keyword" },
            "namespace_name": { "type": "keyword" },
            "node_name": { "type": "keyword" }
          }
        },
        "metrics": {
          "properties": {
            "duration_ms": { "type": "long" },
            "count": { "type": "long" }
          }
        }
      }
    }
  }
}
```
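With this mapping in place, security incident analysis becomes a matter of querying the `spire-logs-*` indices. As a sketch, the following query (Kibana Dev Tools syntax) pulls recent error-level events that mention attestation; it assumes SPIRE's log lines populate the `msg` and `level` fields defined in the template above:

```json
GET /spire-logs-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "level": "ERROR" } },
        { "match": { "msg": "attestation" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "sort": [{ "@timestamp": "desc" }]
}
```

Keyword fields like `spiffe_id` and `attestor_type` can be added as `term` filters to narrow an investigation to a single workload or attestor.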
Conclusion
Comprehensive observability for SPIFFE/SPIRE requires:
- Multi-Layer Monitoring: Server, agent, and workload metrics
- Security Focus: Track attestation failures and unauthorized access
- Performance Insights: API latency, database performance, resource usage
- Intelligent Alerting: Context-aware alerts with proper escalation
- Log Correlation: Structured logging for security incident analysis
Key takeaways:
- ✅ Enable telemetry on all SPIRE components
- ✅ Use custom metrics for workload-specific monitoring
- ✅ Implement layered alerting with proper escalation
- ✅ Monitor security events and attestation health
- ✅ Track certificate lifecycle and rotation
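As a concrete example of lifecycle tracking, a Prometheus alerting rule can fire when any SVID approaches expiry, using the `spiffe_svid_expiry_timestamp` gauge exported by the instrumented client earlier in this post. This is an illustrative sketch; tune the threshold to your SVID TTL:

```yaml
groups:
  - name: spiffe-svid-lifecycle
    rules:
      - alert: SVIDExpiringSoon
        # SPIRE normally rotates SVIDs well before expiry, so an SVID
        # within 10 minutes of expiry usually means rotation is stuck.
        expr: spiffe_svid_expiry_timestamp - time() < 600
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SVID for {{ $labels.spiffe_id }} expires in under 10 minutes"
```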
In the next post, we’ll explore advanced workload attestation using TPM hardware roots of trust and cloud provider attestors.
Additional Resources
Building comprehensive observability for identity infrastructure is crucial for production success. Share your monitoring strategies and lessons learned in the comments.