
SPIFFE/SPIRE High Availability in Kubernetes: Production Deployment Patterns


Introduction: From Single Instance to Enterprise Scale

After deploying SPIFFE/SPIRE in development, the next challenge is scaling it for production. A single SPIRE server might work for a proof of concept, but enterprise environments demand high availability, disaster recovery, and the ability to handle thousands of workloads without a single point of failure.

This guide covers everything you need to build a production-grade SPIFFE/SPIRE deployment: multi-server architectures, database selection and optimization, geographic distribution, zero-downtime operations, and disaster recovery strategies. We’ll move beyond the basics to address real-world challenges I’ve encountered scaling SPIRE to handle 100,000+ workloads.

Understanding SPIRE’s Scalability Architecture

Before diving into HA configurations, let’s understand SPIRE’s architecture at scale:

graph TB
    subgraph "Region 1 - Primary"
        LB1[Load Balancer]
        SS1[SPIRE Server 1<br/>Leader]
        SS2[SPIRE Server 2<br/>Follower]
        SS3[SPIRE Server 3<br/>Follower]
        DB1[(PostgreSQL<br/>Primary)]

        LB1 --> SS1
        LB1 --> SS2
        LB1 --> SS3

        SS1 --> DB1
        SS2 --> DB1
        SS3 --> DB1
    end

    subgraph "Region 2 - Standby"
        LB2[Load Balancer]
        SS4[SPIRE Server 4<br/>Standby]
        SS5[SPIRE Server 5<br/>Standby]
        SS6[SPIRE Server 6<br/>Standby]
        DB2[(PostgreSQL<br/>Read Replica)]

        LB2 --> SS4
        LB2 --> SS5
        LB2 --> SS6

        SS4 --> DB2
        SS5 --> DB2
        SS6 --> DB2
    end

    subgraph "Agents"
        A1[Agent Pod 1]
        A2[Agent Pod 2]
        A3[Agent Pod N]
    end

    DB1 -.->|Streaming Replication| DB2

    A1 --> LB1
    A2 --> LB1
    A3 --> LB1

    A1 -.->|Failover| LB2
    A2 -.->|Failover| LB2
    A3 -.->|Failover| LB2

Key Scalability Factors

  1. Database Performance: The #1 bottleneck in SPIRE deployments
  2. Agent Synchronization: Each agent syncs every 5 seconds by default
  3. Entry Cache Size: Impacts memory usage and query performance
  4. Network Latency: Critical for multi-region deployments
  5. Certificate Rotation: SVIDs expire and need renewal
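
To put factor 2 above in perspective, a quick back-of-envelope calculation of the steady-state sync load helps when sizing server replicas and the database. The numbers below are placeholders for illustration; substitute your own fleet size and intervals.

#!/bin/bash
# estimate-sync-load.sh -- rough steady-state request rate generated by agent syncs
# (illustrative numbers; replace with your own fleet size and sync interval)
AGENTS=5000          # connected agents
SYNC_INTERVAL=5      # seconds between syncs (SPIRE's default)
SERVERS=3            # SPIRE server replicas behind the load balancer

SYNC_QPS=$(( AGENTS / SYNC_INTERVAL ))
PER_SERVER_QPS=$(( SYNC_QPS / SERVERS ))

echo "Total sync requests/sec:      ${SYNC_QPS}"
echo "Per-server sync requests/sec: ${PER_SERVER_QPS}"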

Step 1: Production Database Setup

PostgreSQL Configuration

First, let’s set up a production-grade PostgreSQL cluster:

# postgres-ha.yaml
apiVersion: v1
kind: Secret
metadata:
  name: postgres-credentials
  namespace: spire-system
type: Opaque
stringData:
  POSTGRES_DB: spire
  POSTGRES_USER: spire
  # Shell substitution does not run inside a manifest. Generate values up front
  # (e.g. `openssl rand -base64 32`) and substitute them before applying, or
  # create the Secret imperatively with `kubectl create secret generic`.
  POSTGRES_PASSWORD: "<generated-password>"
  REPLICATION_USER: replicator
  REPLICATION_PASSWORD: "<generated-password>"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-config
  namespace: spire-system
data:
  postgresql.conf: |
    # Connection settings
    listen_addresses = '*'
    max_connections = 500

    # Memory settings (adjust based on available RAM)
    shared_buffers = 2GB
    effective_cache_size = 6GB
    maintenance_work_mem = 512MB
    work_mem = 32MB

    # Write performance
    wal_buffers = 64MB
    checkpoint_completion_target = 0.9
    checkpoint_timeout = 15min
    max_wal_size = 4GB
    min_wal_size = 1GB

    # Query optimization
    random_page_cost = 1.1  # For SSD storage
    effective_io_concurrency = 200

    # Logging
    log_statement = 'mod'
    log_duration = on
    log_min_duration_statement = 100ms
    log_checkpoints = on
    log_connections = on
    log_disconnections = on
    log_lock_waits = on

    # Replication
    wal_level = replica
    max_wal_senders = 10
    max_replication_slots = 10
    hot_standby = on

    # SPIRE-specific optimizations
    # Increase autovacuum frequency for entries table
    autovacuum_vacuum_scale_factor = 0.05
    autovacuum_analyze_scale_factor = 0.02

  pg_hba.conf: |
    # TYPE  DATABASE        USER            ADDRESS                 METHOD
    local   all             all                                     trust
    host    all             all             127.0.0.1/32            trust
    host    all             all             ::1/128                 trust
    host    all             all             10.0.0.0/8              md5
    host    replication     replicator      10.0.0.0/8              md5
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-primary
  namespace: spire-system
spec:
  serviceName: postgres-primary
  replicas: 1
  selector:
    matchLabels:
      app: postgres-primary
  template:
    metadata:
      labels:
        app: postgres-primary
        postgres-role: primary
    spec:
      containers:
        - name: postgres
          image: postgres:15-alpine
          ports:
            - containerPort: 5432
              name: postgres
          env:
            - name: POSTGRES_DB
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: POSTGRES_DB
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: POSTGRES_USER
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: POSTGRES_PASSWORD
            # Note: the POSTGRES_REPLICATION_* variables below are honored by
            # Bitnami-style PostgreSQL images; the official postgres image
            # ignores them, so adapt the image or configure replication manually.
            - name: POSTGRES_REPLICATION_MODE
              value: "master"
            - name: POSTGRES_REPLICATION_USER
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: REPLICATION_USER
            - name: POSTGRES_REPLICATION_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: REPLICATION_PASSWORD
          volumeMounts:
            - name: postgres-storage
              mountPath: /var/lib/postgresql/data
              subPath: postgres
            - name: postgres-config
              mountPath: /etc/postgresql/postgresql.conf
              subPath: postgresql.conf
            - name: postgres-config
              mountPath: /etc/postgresql/pg_hba.conf
              subPath: pg_hba.conf
            - name: init-scripts
              mountPath: /docker-entrypoint-initdb.d
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
          livenessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - spire
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - spire
            initialDelaySeconds: 5
            periodSeconds: 5
      volumes:
        - name: postgres-config
          configMap:
            name: postgres-config
        - name: init-scripts
          configMap:
            name: postgres-init
  volumeClaimTemplates:
    - metadata:
        name: postgres-storage
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 100Gi
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-init
  namespace: spire-system
data:
  01-spire-optimizations.sql: |
    -- Create SPIRE database with optimizations
    \c spire;

    -- Optimize for SPIRE's access patterns
    ALTER DATABASE spire SET random_page_cost = 1.1;
    ALTER DATABASE spire SET effective_io_concurrency = 200;
    ALTER DATABASE spire SET work_mem = '64MB';

    -- Create indexes for common queries (run after SPIRE has created its
    -- tables, e.g. with: SELECT create_spire_indexes();)
    -- Note: CREATE INDEX CONCURRENTLY cannot run inside a function (it must
    -- run outside a transaction block), so plain CREATE INDEX is used here.
    CREATE OR REPLACE FUNCTION create_spire_indexes()
    RETURNS void AS $$
    BEGIN
        -- Index for selector lookups (selectors live in their own table)
        IF NOT EXISTS (SELECT 1 FROM pg_indexes WHERE indexname = 'idx_selectors_type_value') THEN
            CREATE INDEX idx_selectors_type_value
            ON selectors(type, value);
        END IF;

        -- Index for entry lookups by SPIFFE ID
        IF NOT EXISTS (SELECT 1 FROM pg_indexes WHERE indexname = 'idx_entries_spiffe_id') THEN
            CREATE INDEX idx_entries_spiffe_id
            ON registered_entries(spiffe_id);
        END IF;

        -- Index for node lookups
        IF NOT EXISTS (SELECT 1 FROM pg_indexes WHERE indexname = 'idx_nodes_spiffe_id') THEN
            CREATE INDEX idx_nodes_spiffe_id
            ON attested_node_entries(spiffe_id);
        END IF;

        -- Index on entry expiry (a partial index using NOW() is not allowed,
        -- since index predicates must use immutable functions)
        IF NOT EXISTS (SELECT 1 FROM pg_indexes WHERE indexname = 'idx_entries_expiry') THEN
            CREATE INDEX idx_entries_expiry
            ON registered_entries(expiry);
        END IF;
    END;
    $$ LANGUAGE plpgsql;

    -- Create replication slot for standby
    SELECT pg_create_physical_replication_slot('standby_slot');
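
The create_spire_indexes() helper above is not invoked automatically. Once the first SPIRE server has connected and created its tables, call it manually; a minimal sketch using the pod and database names from the manifests above:

# Run once after the first SPIRE server has created its schema
kubectl exec -n spire-system postgres-primary-0 -- \
  psql -U spire -d spire -c "SELECT create_spire_indexes();"

# Confirm the indexes were created
kubectl exec -n spire-system postgres-primary-0 -- \
  psql -U spire -d spire -c "\di idx_*"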

Database Connection Pooling

For high-throughput environments, use PgBouncer:

# pgbouncer.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pgbouncer-config
  namespace: spire-system
data:
  pgbouncer.ini: |
    [databases]
    spire = host=postgres-primary.spire-system.svc.cluster.local port=5432 dbname=spire

    [pgbouncer]
    listen_port = 6432
    listen_addr = *
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    pool_mode = transaction
    max_client_conn = 1000
    default_pool_size = 25
    min_pool_size = 10
    reserve_pool_size = 5
    reserve_pool_timeout = 3
    server_lifetime = 3600
    server_idle_timeout = 600
    log_connections = 1
    log_disconnections = 1
    log_pooler_errors = 1
    stats_period = 60

  userlist.txt: |
    # PgBouncer expects md5(password || username) prefixed with "md5". Shell
    # substitution does not run inside a ConfigMap, so compute the hash offline
    # (e.g. echo -n '<password>spire' | md5sum) and paste it here.
    "spire" "md5<paste-computed-hash>"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgbouncer
  namespace: spire-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pgbouncer
  template:
    metadata:
      labels:
        app: pgbouncer
    spec:
      containers:
        - name: pgbouncer
          image: pgbouncer/pgbouncer:latest
          ports:
            - containerPort: 6432
              name: pgbouncer
          volumeMounts:
            - name: config
              mountPath: /etc/pgbouncer
          resources:
            requests:
              memory: "256Mi"
              cpu: "500m"
            limits:
              memory: "512Mi"
              cpu: "1"
      volumes:
        - name: config
          configMap:
            name: pgbouncer-config
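
Before pointing SPIRE at the pooler, a quick smoke test confirms connectivity end to end. This sketch assumes the Secret and Services defined above; the throwaway pod name is arbitrary:

# Pull the database password and run a one-off client pod through PgBouncer
PGPASSWORD="$(kubectl get secret postgres-credentials -n spire-system \
  -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d)"

kubectl run pgbouncer-check --rm -it --restart=Never \
  --image=postgres:15-alpine -n spire-system \
  --env="PGPASSWORD=${PGPASSWORD}" -- \
  psql "host=pgbouncer.spire-system.svc.cluster.local port=6432 dbname=spire user=spire" \
  -c "SELECT 1;"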

Step 2: Multi-Server SPIRE Deployment

High Availability SPIRE Configuration

# spire-ha-values.yaml
global:
  spire:
    trustDomain: "prod.example.com"
    bundleEndpoint:
      address: "0.0.0.0"
      port: 8443

spire-server:
  replicaCount: 3

  # Database configuration
  dataStore:
    sql:
      databaseType: postgres
      connectionString: "host=pgbouncer.spire-system.svc.cluster.local port=6432 dbname=spire user=spire password=${SPIRE_DB_PASSWORD} sslmode=require pool_max_conns=20"

  # Performance tuning
  config:
    server:
      # Increase cache size for large deployments
      cache_size: 50000

      # Agent synchronization settings
      agent_ttl: "1h"

      # Registration entry settings
      default_svid_ttl: "12h"

      # Audit logging
      audit_log_enabled: true

      # Experimental features for performance
      experimental:
        # Enable entry cache replication
        cache_reload_interval: "5s"

        # Prune expired entries more frequently
        events_based_cache: true

  # Leader election for certain operations
  controllerManager:
    enabled: true
    leaderElection: true

  # Pod disruption budget
  podDisruptionBudget:
    enabled: true
    minAvailable: 2

  # Anti-affinity to spread servers
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - spire-server
          topologyKey: kubernetes.io/hostname

  # Resources for production
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:
      memory: "4Gi"
      cpu: "2"

  # Autoscaling
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80

  # Monitoring
  telemetry:
    prometheus:
      enabled: true
      port: 9988

  # Health checks with proper timeouts
  livenessProbe:
    httpGet:
      path: /live
      port: 8080
    initialDelaySeconds: 60
    periodSeconds: 30
    timeoutSeconds: 5
    failureThreshold: 3

  readinessProbe:
    httpGet:
      path: /ready
      port: 8080
    initialDelaySeconds: 30
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3

spire-agent:
  # Agent configuration for HA
  config:
    agent:
      # Increase sync interval to reduce load
      sync_interval: "30s"

      # Enable SDS for better performance
      sds:
        default_svid_name: "default"
        default_bundle_name: "ROOTCA"

  # Resources
  resources:
    requests:
      memory: "256Mi"
      cpu: "100m"
    limits:
      memory: "512Mi"
      cpu: "500m"

  # Host network for better performance
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet

Deploy the HA configuration:

# Create namespace and secrets
kubectl create namespace spire-system
kubectl create secret generic spire-db-password \
  --from-literal=SPIRE_DB_PASSWORD=$(openssl rand -base64 32) \
  -n spire-system

# Deploy SPIRE in HA mode
helm upgrade --install spire spiffe/spire \
  --namespace spire-system \
  --values spire-ha-values.yaml \
  --wait
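
Once the release is up, a few sanity checks confirm that all replicas are healthy and reading from the same datastore. This is a sketch assuming the pod names and labels used throughout this guide:

# All replicas should be Running
kubectl get pods -n spire-system -l app=spire-server

# Each replica should pass its own healthcheck
for i in 0 1 2; do
  kubectl exec -n spire-system spire-server-$i -- \
    /opt/spire/bin/spire-server healthcheck
done

# Entry counts should match across replicas since they share the database
kubectl exec -n spire-system spire-server-0 -- \
  /opt/spire/bin/spire-server entry count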

Step 3: Load Balancing and Service Discovery

Internal Load Balancer for Agents

# spire-server-lb.yaml
apiVersion: v1
kind: Service
metadata:
  name: spire-server-lb
  namespace: spire-system
  annotations:
    # For cloud providers
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800 # 3 hours
  selector:
    app: spire-server
  ports:
    - name: agent-api
      port: 8081
      targetPort: 8081
      protocol: TCP
    - name: bundle-endpoint
      port: 8443
      targetPort: 8443
      protocol: TCP
---
# Headless service for direct pod access
apiVersion: v1
kind: Service
metadata:
  name: spire-server-headless
  namespace: spire-system
spec:
  clusterIP: None
  selector:
    app: spire-server
  ports:
    - name: agent-api
      port: 8081
      targetPort: 8081
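
Two quick checks verify that both Services are wired up correctly; a sketch assuming the manifests above:

# The load-balanced Service should list every healthy server pod as an endpoint
kubectl get endpoints spire-server-lb -n spire-system

# The headless Service should resolve to one A record per server pod
kubectl run dns-check --rm -it --restart=Never --image=busybox -n spire-system -- \
  nslookup spire-server-headless.spire-system.svc.cluster.local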

Agent Configuration for HA

# agent-ha-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spire-agent-ha-config
  namespace: spire-system
data:
  agent.conf: |
    agent {
      data_dir = "/run/spire"
      log_level = "INFO"
      server_address = "spire-server-lb.spire-system.svc.cluster.local"
      server_port = "8081"
      socket_path = "/run/spire/sockets/agent.sock"
      trust_bundle_path = "/run/spire/bundle/bundle.crt"
      trust_domain = "prod.example.com"
      
      # HA-specific settings
      # Keep SVIDs valid at least this long so agents can ride out short
      # server outages (availability_target takes a duration, not a keyword)
      availability_target = "24h"
      
      # Connection management
      retry_bootstrap = true
      bootstrap_timeout = "60s"
      
      # Performance tuning
      sync_interval = "30s"
      
      # Enable SDS
      sds {
        default_svid_name = "default"
        default_bundle_name = "ROOTCA"
      }
    }

    plugins {
      NodeAttestor "k8s_psat" {
        plugin_data {
          cluster = "production"
          
          # Use token projection for better security
          token_path = "/run/secrets/tokens/spire-agent"
        }
      }
      
      KeyManager "memory" {
        plugin_data {}
      }
      
      WorkloadAttestor "k8s" {
        plugin_data {
          # Increase pod info sync interval
          pod_info_sync_interval = "1m"
          
          # Skip kubelet certificate verification (simpler setup at the cost
          # of weaker validation of the kubelet endpoint)
          skip_kubelet_verification = true
        }
      }
    }

    health_checks {
      listener_enabled = true
      bind_address = "0.0.0.0"
      bind_port = "8080"
      live_path = "/live"
      ready_path = "/ready"
    }

    telemetry {
      Prometheus {
        host = "0.0.0.0"
        port = 9988
      }
    }
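
With agents pointed at the load balancer, the quickest end-to-end check is fetching an SVID over the Workload API from one of the agent pods. A sketch assuming the socket path configured above and an agent DaemonSet labeled app=spire-agent:

# Pick one agent pod and fetch an X.509 SVID over the Workload API
AGENT_POD=$(kubectl get pods -n spire-system -l app=spire-agent \
  -o jsonpath='{.items[0].metadata.name}')

kubectl exec -n spire-system "$AGENT_POD" -- \
  /opt/spire/bin/spire-agent api fetch x509 \
  -socketPath /run/spire/sockets/agent.sock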

Step 4: Multi-Region Deployment

Primary Region Configuration

# primary-region-values.yaml
global:
  spire:
    trustDomain: "prod.example.com"
    region: "us-east-1"

spire-server:
  config:
    server:
      # Federation configuration for multi-region
      federation {
        bundle_endpoint {
          address = "0.0.0.0"
          port = 8443

          # Use DNS for external access
          acme {
            domain_name = "spire-east.example.com"
            email = "security@example.com"
            tos_accepted = true
          }
        }
      }

      # Configure for primary region
      ca_subject {
        country = ["US"]
        organization = ["Example Corp"]
        common_name = "SPIRE CA US-EAST-1"
      }

  # Expose bundle endpoint
  service:
    type: LoadBalancer
    annotations:
      external-dns.alpha.kubernetes.io/hostname: spire-east.example.com
    ports:
      bundle:
        port: 8443
        targetPort: 8443
        protocol: TCP

Standby Region Configuration

# standby-region-values.yaml
global:
  spire:
    trustDomain: "prod.example.com"
    region: "us-west-2"

spire-server:
  # Point to read replica
  dataStore:
    sql:
      connectionString: "host=postgres-replica-west.spire-system.svc.cluster.local port=5432 dbname=spire user=spire_read password=${SPIRE_DB_PASSWORD} sslmode=require"

  config:
    server:
      # Read-only mode for standby
      experimental:
        read_only_mode: true

      # Different CA subject for region
      ca_subject {
        country = ["US"]
        organization = ["Example Corp"]
        common_name = "SPIRE CA US-WEST-2"
      }

      # Federation with primary
      federation {
        bundle_endpoint {
          address = "0.0.0.0"
          port = 8443

          acme {
            domain_name = "spire-west.example.com"
            email = "security@example.com"
            tos_accepted = true
          }
        }

        federates_with {
          "prod.example.com" {
            bundle_endpoint_address = "spire-east.example.com"
            bundle_endpoint_port = 8443
            bundle_endpoint_spiffe_id = "spiffe://prod.example.com/spire/server"
          }
        }
      }
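
Once both regions are running, confirm that the bundle endpoints are reachable and that federation relationships are in place. A sketch using the SPIRE server CLI (the federation subcommands are available in recent SPIRE releases):

# From either region: list configured federation relationships
kubectl exec -n spire-system spire-server-0 -- \
  /opt/spire/bin/spire-server federation list

# Show the local trust bundle that the bundle endpoint serves
kubectl exec -n spire-system spire-server-0 -- \
  /opt/spire/bin/spire-server bundle show -format spiffe

# The primary's bundle endpoint should answer over HTTPS with its ACME-issued cert
curl -s https://spire-east.example.com:8443 | head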

Cross-Region Database Replication

# postgres-replica.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-replica
  namespace: spire-system
spec:
  serviceName: postgres-replica
  replicas: 2 # Multiple read replicas
  selector:
    matchLabels:
      app: postgres-replica
  template:
    metadata:
      labels:
        app: postgres-replica
        postgres-role: replica
    spec:
      containers:
        - name: postgres
          image: postgres:15-alpine
          env:
            - name: POSTGRES_REPLICATION_MODE
              value: "slave"
            - name: POSTGRES_MASTER_SERVICE
              value: "postgres-primary.spire-system.svc.cluster.local"
            - name: POSTGRES_REPLICATION_USER
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: REPLICATION_USER
            - name: POSTGRES_REPLICATION_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: REPLICATION_PASSWORD
          command:
            - /bin/bash
            - -c
            - |
              # Wait for the primary to be ready
              until pg_isready -h $POSTGRES_MASTER_SERVICE -U replicator; do
                echo "Waiting for primary..."
                sleep 2
              done

              # Set up streaming replication. -R writes standby.signal and
              # primary_conninfo for us (recovery.conf was removed in PostgreSQL 12+);
              # PGPASSWORD avoids the interactive prompt that -W would trigger.
              export PGPASSWORD="$POSTGRES_REPLICATION_PASSWORD"
              pg_basebackup -h $POSTGRES_MASTER_SERVICE -D /var/lib/postgresql/data \
                -U replicator -v -P -R

              # Start PostgreSQL as a hot standby
              postgres
          volumeMounts:
            - name: postgres-storage
              mountPath: /var/lib/postgresql/data
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
            limits:
              memory: "4Gi"
              cpu: "2"
  volumeClaimTemplates:
    - metadata:
        name: postgres-storage
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 100Gi

Step 5: Zero-Downtime Operations

Rolling Updates

# update-strategy.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spire-update-strategy
  namespace: spire-system
data:
  update.sh: |
    #!/bin/bash
    set -e

    # Function to check SPIRE server health
    check_health() {
      local server=$1
      kubectl exec -n spire-system $server -- \
        /opt/spire/bin/spire-server healthcheck
    }

    # Function to drain connections from a server
    drain_server() {
      local server=$1
      echo "Draining connections from $server..."
      
      # Remove from load balancer (the spire-server Service selector must
      # include serving: "true" for this label change to actually drain traffic)
      kubectl label pod $server -n spire-system \
        serving=false --overwrite
      
      # Wait for connections to drain
      sleep 60
    }

    # Get all SPIRE server pods
    servers=$(kubectl get pods -n spire-system -l app=spire-server -o name)

    # Update one server at a time
    for server in $servers; do
      server_name=$(echo $server | cut -d'/' -f2)
      
      echo "Updating $server_name..."
      
      # Drain the server
      drain_server $server_name
      
      # Delete the pod (StatefulSet will recreate)
      kubectl delete pod $server_name -n spire-system
      
      # Wait for new pod to be ready
      kubectl wait --for=condition=ready pod/$server_name \
        -n spire-system --timeout=300s
      
      # Verify health
      check_health $server_name
      
      # Re-enable in load balancer
      kubectl label pod $server_name -n spire-system \
        serving=true --overwrite
      
      echo "$server_name updated successfully"
      sleep 30
    done

Database Migration Strategy

-- migration-strategy.sql
-- Safe schema migrations for zero downtime

-- Step 1: Add new columns as nullable
ALTER TABLE registered_entries
ADD COLUMN IF NOT EXISTS new_field VARCHAR(255);

-- Step 2: Backfill data in batches
-- Note: a DO block runs as a single transaction, so for very large tables run
-- the batches as separate transactions from an external script instead.
DO $$
DECLARE
    batch_size INTEGER := 1000;
    offset_val INTEGER := 0;
    total_rows INTEGER;
BEGIN
    SELECT COUNT(*) INTO total_rows FROM registered_entries;

    WHILE offset_val < total_rows LOOP
        UPDATE registered_entries
        SET new_field = 'default_value'
        WHERE id IN (
            SELECT id FROM registered_entries
            WHERE new_field IS NULL
            ORDER BY id
            LIMIT batch_size
        );

        offset_val := offset_val + batch_size;

        -- Pause between batches to avoid locking
        PERFORM pg_sleep(0.1);

        RAISE NOTICE 'Processed % of % rows', offset_val, total_rows;
    END LOOP;
END $$;

-- Step 3: Add constraints after backfill
ALTER TABLE registered_entries
ALTER COLUMN new_field SET NOT NULL;

-- Step 4: Create indexes concurrently
CREATE INDEX CONCURRENTLY idx_new_field
ON registered_entries(new_field);

Step 6: Monitoring and Alerting

Prometheus Configuration

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spire-alerts
  namespace: spire-system
spec:
  groups:
    - name: spire.rules
      interval: 30s
      rules:
        # Server availability
        - alert: SPIREServerDown
          expr: up{job="spire-server"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "SPIRE Server is down"
            description: "SPIRE Server {{ $labels.instance }} has been down for more than 5 minutes."

        # High error rate
        - alert: SPIREHighErrorRate
          expr: |
            rate(spire_server_api_errors_total[5m]) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High SPIRE API error rate"
            description: "SPIRE Server API error rate is {{ $value }} errors per second."

        # Database connection issues
        - alert: SPIREDatabaseConnectionFailure
          expr: |
            spire_server_datastore_connections_active == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "SPIRE database connection failure"
            description: "SPIRE Server has no active database connections."

        # Entry cache size
        - alert: SPIREEntryCacheFull
          expr: |
            spire_server_entry_cache_size / spire_server_entry_cache_max_size > 0.9
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "SPIRE entry cache nearly full"
            description: "SPIRE entry cache is {{ $value | humanizePercentage }} full."

        # Agent sync failures
        - alert: SPIREAgentSyncFailures
          expr: |
            rate(spire_agent_sync_failures_total[5m]) > 0.1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High agent sync failure rate"
            description: "Agent {{ $labels.instance }} sync failure rate is {{ $value }} per second."

        # Certificate expiry
        - alert: SPIRECertificateExpiringSoon
          expr: |
            (spire_server_ca_certificate_expiry_timestamp - time()) / 86400 < 30
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "SPIRE CA certificate expiring soon"
            description: "SPIRE CA certificate will expire in {{ $value }} days."

        # High memory usage
        - alert: SPIREHighMemoryUsage
          expr: |
            container_memory_usage_bytes{pod=~"spire-server-.*"} 
            / container_spec_memory_limit_bytes{pod=~"spire-server-.*"} > 0.8
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage on SPIRE server"
            description: "SPIRE Server {{ $labels.pod }} memory usage is {{ $value | humanizePercentage }}."
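
Before building dashboards, verify that the servers actually expose metrics on the telemetry port configured earlier (9988); a quick sketch:

# Port-forward one server's Prometheus endpoint and spot-check SPIRE metrics
kubectl port-forward -n spire-system spire-server-0 9988:9988 &
PF_PID=$!
sleep 2
curl -s http://localhost:9988/metrics | grep -i 'spire_server' | head
kill $PF_PID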

Grafana Dashboard

{
  "dashboard": {
    "title": "SPIRE High Availability Monitoring",
    "panels": [
      {
        "title": "SPIRE Server Availability",
        "targets": [
          {
            "expr": "up{job=\"spire-server\"}",
            "legendFormat": "{{ instance }}"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
      },
      {
        "title": "Registration Entries by Server",
        "targets": [
          {
            "expr": "spire_server_registration_entries",
            "legendFormat": "{{ instance }}"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
      },
      {
        "title": "API Request Rate",
        "targets": [
          {
            "expr": "rate(spire_server_api_requests_total[5m])",
            "legendFormat": "{{ instance }} - {{ method }}"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 }
      },
      {
        "title": "Database Query Performance",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(spire_server_datastore_query_duration_seconds_bucket[5m]))",
            "legendFormat": "p95 Query Time"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 }
      },
      {
        "title": "Agent Connections by Server",
        "targets": [
          {
            "expr": "spire_server_connected_agents",
            "legendFormat": "{{ instance }}"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 }
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "container_memory_usage_bytes{pod=~\"spire-server-.*\"} / 1024 / 1024 / 1024",
            "legendFormat": "{{ pod }}"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 }
      }
    ]
  }
}

Step 7: Disaster Recovery

Backup Strategy

# backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: spire-backup
  namespace: spire-system
spec:
  schedule: "0 */6 * * *" # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              # Needs both pg_dump and the AWS CLI: postgres:15-alpine alone does
              # not ship `aws`, so use a custom image (or install awscli at runtime)
              # and provide credentials, e.g. via IRSA or a mounted secret.
              image: postgres:15-alpine
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: POSTGRES_PASSWORD
              command:
                - /bin/bash
                - -c
                - |
                  set -e

                  # Create backup
                  BACKUP_FILE="/backup/spire-$(date +%Y%m%d-%H%M%S).sql"
                  pg_dump -h postgres-primary -U spire -d spire \
                    --verbose --no-owner --no-acl \
                    --format=custom --compress=9 \
                    > $BACKUP_FILE

                  # Upload to S3
                  aws s3 cp $BACKUP_FILE s3://example-spire-backups/

                  # Keep only the 30 most recent backups (about 7 days at a 6-hour cadence)
                  aws s3 ls s3://example-spire-backups/ | \
                    awk '{print $4}' | \
                    sort | \
                    head -n -30 | \
                    xargs -I {} aws s3 rm s3://example-spire-backups/{}

                  # Verify backup
                  pg_restore --list $BACKUP_FILE > /dev/null
                  echo "Backup completed successfully"
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              emptyDir: {}
          restartPolicy: OnFailure
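
Backups are only useful if they restore cleanly. A hedged restore sketch against a scratch database (bucket and file locations follow the CronJob above):

# Fetch the most recent backup from S3
LATEST=$(aws s3 ls s3://example-spire-backups/ | sort | tail -n 1 | awk '{print $4}')
aws s3 cp "s3://example-spire-backups/${LATEST}" /tmp/restore.dump

# Restore into a scratch database first to validate the dump
kubectl exec -n spire-system postgres-primary-0 -- \
  psql -U spire -d postgres -c "CREATE DATABASE spire_restore_test;"
kubectl cp /tmp/restore.dump spire-system/postgres-primary-0:/tmp/restore.dump
kubectl exec -n spire-system postgres-primary-0 -- \
  pg_restore -U spire -d spire_restore_test --no-owner --no-acl /tmp/restore.dump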

Disaster Recovery Procedure

#!/bin/bash
# disaster-recovery.sh

# Step 1: Promote standby region
promote_standby() {
    echo "Promoting standby region to primary..."

    # Promote the PostgreSQL replica (trigger files were removed in PostgreSQL 12+;
    # promote with pg_ctl, or alternatively SELECT pg_promote())
    kubectl exec -n spire-system postgres-replica-0 -- \
        pg_ctl promote -D /var/lib/postgresql/data

    # Update SPIRE servers to write mode
    kubectl patch configmap spire-server-config -n spire-system \
        --type merge -p '{"data":{"experimental.read_only_mode":"false"}}'

    # Restart SPIRE servers
    kubectl rollout restart statefulset spire-server -n spire-system
}

# Step 2: Redirect traffic
redirect_traffic() {
    echo "Redirecting traffic to standby region..."

    # Update DNS
    aws route53 change-resource-record-sets \
        --hosted-zone-id Z123456789 \
        --change-batch '{
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "spire.example.com",
                    "Type": "A",
                    "AliasTarget": {
                        "HostedZoneId": "Z098765432",
                        "DNSName": "spire-west.example.com",
                        "EvaluateTargetHealth": true
                    }
                }
            }]
        }'
}

# Step 3: Verify health
verify_health() {
    echo "Verifying system health..."

    # Check SPIRE servers
    for i in 0 1 2; do
        kubectl exec -n spire-system spire-server-$i -- \
            /opt/spire/bin/spire-server healthcheck
    done

    # Check database
    kubectl exec -n spire-system postgres-replica-0 -- \
        psql -U spire -d spire -c "SELECT COUNT(*) FROM registered_entries;"
}

# Main execution
case "$1" in
    promote)
        promote_standby
        ;;
    redirect)
        redirect_traffic
        ;;
    verify)
        verify_health
        ;;
    full)
        promote_standby
        redirect_traffic
        verify_health
        ;;
    *)
        echo "Usage: $0 {promote|redirect|verify|full}"
        exit 1
        ;;
esac

Step 8: Performance Optimization

Database Query Optimization

-- optimize-queries.sql
-- Analyze query performance (pg_stat_statements must be listed in
-- shared_preload_libraries before the extension can be created)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Most expensive queries (PostgreSQL 13+ renamed the timing columns
-- to total_exec_time / mean_exec_time)
SELECT
    query,
    calls,
    total_exec_time,
    mean_exec_time,
    rows
FROM pg_stat_statements
WHERE query LIKE '%registered_entries%'
ORDER BY total_exec_time DESC
LIMIT 10;

-- Create materialized view for complex queries
CREATE MATERIALIZED VIEW entry_selector_summary AS
SELECT
    e.id,
    e.spiffe_id,
    array_agg(s.type || ':' || s.value) as selectors,
    e.ttl,
    e.expiry
FROM registered_entries e
JOIN selectors s ON e.id = s.registered_entry_id
GROUP BY e.id, e.spiffe_id, e.ttl, e.expiry;

-- REFRESH ... CONCURRENTLY (below) requires a unique index on the view
CREATE UNIQUE INDEX idx_entry_selector_summary_id
ON entry_selector_summary(id);

-- Index on materialized view for selector lookups
CREATE INDEX idx_entry_selector_summary_selectors
ON entry_selector_summary USING gin(selectors);

-- Refresh materialized view periodically
CREATE OR REPLACE FUNCTION refresh_entry_selector_summary()
RETURNS void AS $$
BEGIN
    REFRESH MATERIALIZED VIEW CONCURRENTLY entry_selector_summary;
END;
$$ LANGUAGE plpgsql;

-- Schedule refresh (requires the pg_cron extension)
SELECT cron.schedule('refresh-entry-selectors', '*/5 * * * *',
    'SELECT refresh_entry_selector_summary()');

SPIRE Server Tuning

# performance-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spire-performance-config
  namespace: spire-system
data:
  server.conf: |
    server {
      bind_address = "0.0.0.0"
      bind_port = "8081"
      trust_domain = "prod.example.com"
      data_dir = "/run/spire/data"
      log_level = "INFO"
      
      # Performance optimizations
      
      # Increase cache size for large deployments
      cache_size = 100000
      
      # Experimental performance features
      experimental {
        # Enable events-based cache updates
        events_based_cache = true
        
        # Reduce cache reload interval
        cache_reload_interval = "5s"
        
        # Enable entry pruning
        prune_expired_entries = true
        prune_interval = "1h"
        
        # Batch registration updates
        batch_registration_updates = true
        batch_size = 100
      }
      
      # Connection pooling (in SPIRE these knobs live in the SQL DataStore
      # plugin's plugin_data: max_open_conns, max_idle_conns, conn_max_lifetime)
      connection_pool {
        max_open_conns = 100
        max_idle_conns = 50
        conn_max_lifetime = "1h"
      }
      
      # Rate limiting
      rate_limit {
        attestation = 1000  # per second
        signing = 5000      # per second
        registration = 100  # per second
      }
    }

Step 9: Scaling Strategies

Nested SPIRE for Massive Scale

graph TB
    subgraph "Global SPIRE"
        GS[Global SPIRE Server<br/>Root CA]
    end

    subgraph "Regional SPIRE Clusters"
        RS1[Regional SPIRE 1<br/>US-EAST]
        RS2[Regional SPIRE 2<br/>US-WEST]
        RS3[Regional SPIRE 3<br/>EU-WEST]
    end

    subgraph "Local SPIRE Clusters"
        LS1[Local SPIRE 1<br/>K8s Cluster 1]
        LS2[Local SPIRE 2<br/>K8s Cluster 2]
        LS3[Local SPIRE 3<br/>K8s Cluster 3]
        LS4[Local SPIRE 4<br/>K8s Cluster 4]
    end

    GS --> RS1
    GS --> RS2
    GS --> RS3

    RS1 --> LS1
    RS1 --> LS2
    RS2 --> LS3
    RS3 --> LS4

Configuration for nested deployment:

# nested-spire-config.yaml
# Regional SPIRE server that acts as downstream
apiVersion: v1
kind: ConfigMap
metadata:
  name: regional-spire-config
  namespace: spire-system
data:
  server.conf: |
    server {
      bind_address = "0.0.0.0"
      bind_port = "8081"
      trust_domain = "prod.example.com"
      
      # Upstream authority - Global SPIRE
      upstream_authority {
        spire {
          server_address = "global-spire.example.com"
          server_port = "8081"
          workload_api_socket = "/run/spire/sockets/workload.sock"
        }
      }
      
      # This server can mint identities for downstream workloads
      ca_subject {
        country = ["US"]
        organization = ["Example Corp"]
        common_name = "Regional SPIRE CA - US-EAST"
      }
    }
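
For the regional server to obtain its intermediate CA, its identity must be registered on the global (upstream) server with the downstream flag. A sketch run against the global server; the SPIFFE ID, parent ID, and selectors below are illustrative:

# On the global SPIRE server: register the regional server as a downstream
# entity so it is allowed to mint identities of its own
kubectl exec -n spire-system spire-server-0 -- \
  /opt/spire/bin/spire-server entry create \
    -spiffeID spiffe://prod.example.com/regional/us-east \
    -parentID spiffe://prod.example.com/spire/agent/k8s_psat/production/node-us-east \
    -selector k8s:ns:spire-system \
    -selector k8s:sa:spire-server \
    -downstream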

Conclusion and Best Practices

Building a production-grade SPIFFE/SPIRE deployment requires careful attention to:

  1. Database Performance: Your deployment is only as fast as your database
  2. Network Architecture: Minimize latency between components
  3. Monitoring: You can’t improve what you don’t measure
  4. Disaster Recovery: Plan for failure before it happens
  5. Scaling Strategy: Choose between horizontal scaling or nested deployments

In the next post, we’ll explore observability in depth, building comprehensive Prometheus and Grafana dashboards for SPIFFE/SPIRE monitoring.

Have you deployed SPIFFE/SPIRE at scale? Share your experiences and lessons learned in the comments or reach out on the SPIFFE Slack.