Building a Secure Service Mesh with SPIFFE/SPIRE - Complete Implementation Guide
In the era of microservices and distributed systems, securing service-to-service communication has become paramount. This guide provides a comprehensive implementation of a secure service mesh using SPIFFE (Secure Production Identity Framework For Everyone) and SPIRE (SPIFFE Runtime Environment), complete with detailed architecture diagrams and production-ready configurations.
Service Mesh Architecture Overview
A service mesh provides a dedicated infrastructure layer for managing service-to-service communication. When combined with SPIFFE/SPIRE, it creates a zero-trust security model where every workload has a cryptographically verifiable identity.
```mermaid
graph TB
    subgraph "Service Mesh Architecture"
        subgraph "Control Plane"
            SPIRE_Server[SPIRE Server]
            Policy_Engine[Policy Engine]
            Config_Manager[Configuration Manager]
            Cert_Authority[Certificate Authority]
            Telemetry[Telemetry Collector]
        end
        subgraph "Data Plane"
            subgraph "Service A"
                A_App[Application A]
                A_Proxy[Envoy Proxy]
                A_Agent[SPIRE Agent]
                A_Workload[Workload API]
            end
            subgraph "Service B"
                B_App[Application B]
                B_Proxy[Envoy Proxy]
                B_Agent[SPIRE Agent]
                B_Workload[Workload API]
            end
            subgraph "Service C"
                C_App[Application C]
                C_Proxy[Envoy Proxy]
                C_Agent[SPIRE Agent]
                C_Workload[Workload API]
            end
        end
        subgraph "Infrastructure"
            K8s[Kubernetes API]
            Registry[Service Registry]
            KV_Store[Key-Value Store]
        end
    end

    SPIRE_Server --> Cert_Authority
    SPIRE_Server --> Policy_Engine
    SPIRE_Server --> KV_Store
    A_Agent --> SPIRE_Server
    B_Agent --> SPIRE_Server
    C_Agent --> SPIRE_Server
    A_Agent --> A_Workload
    B_Agent --> B_Workload
    C_Agent --> C_Workload
    A_App --> A_Proxy
    B_App --> B_Proxy
    C_App --> C_Proxy
    A_Proxy -.-> B_Proxy
    B_Proxy -.-> C_Proxy
    A_Proxy -.-> C_Proxy
    Config_Manager --> Registry
    Registry --> K8s
    Telemetry --> A_Proxy
    Telemetry --> B_Proxy
    Telemetry --> C_Proxy

    style SPIRE_Server fill:#f96,stroke:#333,stroke-width:4px
    style A_Proxy fill:#9f9,stroke:#333,stroke-width:2px
    style B_Proxy fill:#9f9,stroke:#333,stroke-width:2px
    style C_Proxy fill:#9f9,stroke:#333,stroke-width:2px
```
Key Components
- SPIRE Server: Central authority for workload attestation and SVID issuance
- SPIRE Agent: Node-level component that attests workloads and manages SVIDs
- Envoy Proxy: Data plane proxy handling mTLS and traffic management
- Workload API: Unix domain socket for workload-to-SPIRE communication
- Policy Engine: Centralized policy management and enforcement
SPIFFE/SPIRE Identity Flow
Understanding how SPIFFE identities (SVIDs) are created, distributed, and verified is crucial for implementing a secure service mesh.
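Every participant in this flow is named by a SPIFFE ID (trust domain plus workload path). As a rough illustration of that structure, here is a minimal Python sketch; it checks only a few of the spec's validity rules, and real deployments should use an official SPIFFE library rather than hand-rolled parsing:

```python
from urllib.parse import urlparse


def parse_spiffe_id(spiffe_id: str) -> tuple[str, str]:
    """Split a SPIFFE ID into (trust_domain, workload_path).

    Illustrative sketch only: a conformant validator enforces many more
    rules (allowed characters, no trailing slash, path segment rules).
    """
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe":
        raise ValueError("SPIFFE IDs must use the spiffe:// scheme")
    if not parsed.netloc:
        raise ValueError("a SPIFFE ID must name a trust domain")
    if "@" in parsed.netloc or ":" in parsed.netloc:
        # Trust domains carry no userinfo or port component
        raise ValueError("trust domain must not contain userinfo or a port")
    return parsed.netloc, parsed.path


td, path = parse_spiffe_id(
    "spiffe://production.company.com/ns/payments/sa/processor")
# td == "production.company.com", path == "/ns/payments/sa/processor"
```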
```mermaid
sequenceDiagram
    participant W as Workload
    participant A as SPIRE Agent
    participant S as SPIRE Server
    participant CA as Certificate Authority
    participant R as Registration API

    Note over W,CA: Initial Workload Registration
    R->>S: Register workload entry
    S->>S: Store registration

    Note over W,CA: Workload Attestation & SVID Issuance
    W->>A: Connect to Workload API
    A->>A: Perform workload attestation
    A->>A: Verify workload selectors
    A->>S: Request SVID for workload
    S->>S: Verify agent identity
    S->>S: Check workload registration
    S->>CA: Generate key pair & CSR
    CA->>CA: Sign certificate
    CA->>S: Return X.509 SVID
    S->>A: Send SVID bundle
    A->>W: Provide SVID via Workload API
    W->>W: Configure TLS with SVID

    Note over W,CA: SVID Rotation
    loop Every 30 minutes
        A->>S: Check SVID expiration
        alt SVID expiring soon
            A->>S: Request SVID renewal
            S->>CA: Generate new SVID
            CA->>S: Return new SVID
            S->>A: Send updated SVID
            A->>W: Hot-reload new SVID
        end
    end

    Note over W,CA: Service-to-Service Communication
    W->>W: Initiate TLS connection
    W->>W: Present SVID
    W->>A: Validate peer SVID
    A->>A: Check trust bundle
    A->>W: Validation result
```
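The rotation loop above renews an SVID before it expires. A hedged sketch of the renewal decision in Python (the half-TTL threshold is an assumption for illustration; SPIRE's actual rotation policy is internal to the agent and configurable):

```python
from datetime import datetime, timedelta, timezone


def should_rotate(not_after: datetime, ttl: timedelta, now: datetime) -> bool:
    """Renew once less than half of the SVID's original TTL remains."""
    remaining = not_after - now
    return remaining < ttl / 2


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
ttl = timedelta(hours=1)
# Expires at 12:30 -> exactly half the TTL left, not yet due for renewal
print(should_rotate(now + timedelta(minutes=30), ttl, now))  # False
# Expires at 12:20 -> 20 min left, below the half-TTL threshold: rotate
print(should_rotate(now + timedelta(minutes=20), ttl, now))  # True
```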
SPIFFE Identity Structure
```text
# SPIFFE ID Format
spiffe://trust-domain/path/to/workload

# Example Identities
spiffe://production.company.com/ns/default/sa/frontend
spiffe://production.company.com/ns/payments/sa/processor
spiffe://production.company.com/region/us-east/service/api-gateway
```
Network Policy Enforcement Flow
The service mesh enforces network policies at multiple levels, providing defense in depth:
```mermaid
graph TB
    subgraph "Policy Enforcement Layers"
        subgraph "Layer 1: Network Policies"
            NP_Ingress[Ingress Rules]
            NP_Egress[Egress Rules]
            NP_CIDR[CIDR Blocks]
        end
        subgraph "Layer 2: Service Mesh Policies"
            SM_Auth[Authentication Policy]
            SM_Authz[Authorization Policy]
            SM_Traffic[Traffic Policy]
        end
        subgraph "Layer 3: Application Policies"
            APP_RBAC[RBAC Rules]
            APP_Custom[Custom Logic]
            APP_Rate[Rate Limiting]
        end
    end

    subgraph "Enforcement Points"
        subgraph "Network Level"
            CNI[CNI Plugin]
            IPTables[iptables/nftables]
            eBPF[eBPF Programs]
        end
        subgraph "Proxy Level"
            Envoy[Envoy Proxy]
            WASM[WASM Filters]
            Lua[Lua Scripts]
        end
        subgraph "Application Level"
            SDK[Service Mesh SDK]
            Middleware[Middleware]
            Interceptors[gRPC Interceptors]
        end
    end

    NP_Ingress --> CNI
    NP_Egress --> IPTables
    NP_CIDR --> eBPF
    SM_Auth --> Envoy
    SM_Authz --> WASM
    SM_Traffic --> Lua
    APP_RBAC --> SDK
    APP_Custom --> Middleware
    APP_Rate --> Interceptors

    style SM_Auth fill:#f96,stroke:#333,stroke-width:2px
    style Envoy fill:#9f9,stroke:#333,stroke-width:2px
```
Policy Decision Flow
```mermaid
sequenceDiagram
    participant Client
    participant Envoy as Envoy Proxy
    participant OPA as Open Policy Agent
    participant SPIRE as SPIRE Agent
    participant Service

    Client->>Envoy: HTTPS Request with SVID
    Envoy->>Envoy: Validate TLS/SVID
    Envoy->>SPIRE: Verify SVID
    SPIRE->>Envoy: SVID Valid
    Envoy->>Envoy: Extract request context
    Note over Envoy: Method, Path, Headers, SPIFFE ID
    Envoy->>OPA: Authorization check
    Note over OPA: {
    Note over OPA: "subject": "spiffe://...",
    Note over OPA: "resource": "/api/users",
    Note over OPA: "action": "GET"
    Note over OPA: }
    OPA->>OPA: Evaluate policies
    OPA->>Envoy: Decision (Allow/Deny)
    alt Allowed
        Envoy->>Service: Forward request
        Service->>Envoy: Response
        Envoy->>Client: Response
    else Denied
        Envoy->>Client: 403 Forbidden
    end
```
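The decision handed to OPA in this flow is a default-deny match on (subject, resource, action). A toy Python model of that check, for intuition only; the rule format here is invented for illustration, and real policies live in Rego:

```python
# Hypothetical rule table; in production these rules come from Rego policies.
RULES = [
    {"subject": "spiffe://production.company.com/ns/production/sa/frontend",
     "resource": "/api/users", "action": "GET"},
]


def authorize(subject: str, resource: str, action: str) -> bool:
    """Default deny: allow only if an explicit rule matches all three fields."""
    return any(
        r["subject"] == subject
        and r["resource"] == resource
        and r["action"] == action
        for r in RULES
    )


print(authorize("spiffe://production.company.com/ns/production/sa/frontend",
                "/api/users", "GET"))     # True  -> forward request
print(authorize("spiffe://production.company.com/ns/production/sa/frontend",
                "/api/users", "DELETE"))  # False -> 403 Forbidden
```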
Implementation Guide
Prerequisites
Before implementing the secure service mesh, ensure you have:
- Kubernetes cluster (1.19+)
- Helm 3.x installed
- kubectl configured
- Storage class for persistent volumes
- Load balancer or ingress controller
Step 1: Install SPIRE
```bash
# Add SPIRE Helm repository
helm repo add spiffe https://spiffe.github.io/helm-charts
helm repo update

# Create SPIRE namespace
kubectl create namespace spire

# Install SPIRE with custom values
cat > spire-values.yaml << EOF
spire-server:
  image:
    tag: 1.8.0
  controllerManager:
    enabled: true
  notifier:
    k8sbundle:
      enabled: true
  dataStore:
    sql:
      databaseType: postgres
      connectionString: "postgresql://spire:password@postgres:5432/spire"
  trustDomain: production.company.com
  ca_subject:
    country: US
    organization: Company
    common_name: SPIRE CA
  persistence:
    enabled: true
    size: 10Gi
  nodeAttestor:
    k8sPsat:
      enabled: true

spire-agent:
  image:
    tag: 1.8.0
  workloadAttestors:
    k8s:
      enabled: true
    unix:
      enabled: true
  sockets:
    admin:
      enabled: true
EOF

helm install spire spiffe/spire \
  --namespace spire \
  --values spire-values.yaml
```
Step 2: Deploy Service Mesh Control Plane
```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: control-plane
spec:
  values:
    pilot:
      env:
        PILOT_ENABLE_WORKLOAD_ENTRY_AUTOREGISTRATION: true
        PILOT_ENABLE_CROSS_CLUSTER_WORKLOAD_ENTRY: true
    telemetry:
      v2:
        prometheus:
          configOverride:
            inboundSidecar:
              disable_host_header_fallback: true
            outboundSidecar:
              disable_host_header_fallback: true
  meshConfig:
    defaultConfig:
      proxyStatsMatcher:
        inclusionRegexps:
        - ".*outlier_detection.*"
        - ".*circuit_breakers.*"
        - ".*upstream_rq_retry.*"
        - ".*upstream_rq_pending.*"
    trustDomain: production.company.com
    extensionProviders:
    - name: spire
      envoyExtAuthzGrpc:
        service: spire-server.spire.svc.cluster.local
        port: 8081
    defaultProviders:
      accessLogging:
      - otel
```
Step 3: Configure Workload Registration
```mermaid
graph LR
    subgraph "Registration Flow"
        K8s[Kubernetes Controller]
        Reg[Registration Controller]
        SPIRE[SPIRE Server]
        DB[(Registration DB)]
    end

    K8s -->|Watch Events| Reg
    Reg -->|Create Entry| SPIRE
    SPIRE -->|Store| DB

    style Reg fill:#9f9,stroke:#333,stroke-width:2px
```
```yaml
apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterSPIFFEID
metadata:
  name: default-workloads
spec:
  spiffeIDTemplate: "spiffe://{{ .TrustDomain }}/ns/{{ .PodMeta.Namespace }}/sa/{{ .PodSpec.ServiceAccountName }}"
  podSelector:
    matchLabels:
      spiffe.io/enabled: "true"
  workloadSelectorTemplates:
  - "k8s:ns:{{ .PodMeta.Namespace }}"
  - "k8s:sa:{{ .PodSpec.ServiceAccountName }}"
  - "k8s:pod-name:{{ .PodMeta.Name }}"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: frontend
  namespace: production
  labels:
    spiffe.io/enabled: "true"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
        spiffe.io/enabled: "true"
    spec:
      serviceAccountName: frontend
      containers:
      - name: app
        image: frontend:latest
        env:
        - name: SPIFFE_ENDPOINT_SOCKET
          value: unix:///spiffe-workload-api/spire-agent.sock
        volumeMounts:
        - name: spiffe-workload-api
          mountPath: /spiffe-workload-api
          readOnly: true
      - name: envoy
        image: envoyproxy/envoy:v1.28-latest
        args:
        - -c
        - /etc/envoy/envoy.yaml
        volumeMounts:
        - name: envoy-config
          mountPath: /etc/envoy
        - name: spiffe-workload-api
          mountPath: /spiffe-workload-api
          readOnly: true
      volumes:
      - name: spiffe-workload-api
        csi:
          driver: "csi.spiffe.io"
          readOnly: true
      - name: envoy-config
        configMap:
          name: envoy-config
```
Step 4: Implement mTLS Configuration
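The Envoy configuration in the next step repeats an identical SDS stanza for every TLS certificate and validation context. If you generate envoy.yaml rather than hand-writing it, a small helper avoids that duplication. This is an illustrative sketch; the `spire_agent` cluster name is an assumption matching the cluster defined in the config below:

```python
def sds_config_stanza(name: str, cluster: str = "spire_agent") -> dict:
    """Build one SDS secret config pointing at the SPIRE Agent's SDS API."""
    return {
        "name": name,
        "sds_config": {
            "resource_api_version": "V3",
            "api_config_source": {
                "api_type": "GRPC",
                "transport_api_version": "V3",
                "grpc_services": [{"envoy_grpc": {"cluster_name": cluster}}],
            },
        },
    }


# One stanza for the workload's own SVID, one for the trust bundle
cert = sds_config_stanza("spiffe://production.company.com/ns/production/sa/frontend")
bundle = sds_config_stanza("spiffe://production.company.com")
```

Serializing these dicts with a YAML library yields the stanzas shown in the ConfigMap below.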
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-config
  namespace: production
data:
  envoy.yaml: |
    node:
      id: frontend
      cluster: frontend-cluster

    static_resources:
      listeners:
      - name: ingress
        address:
          socket_address:
            address: 0.0.0.0
            port_value: 8080
        filter_chains:
        - filters:
          - name: envoy.filters.network.http_connection_manager
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
              stat_prefix: ingress_http
              route_config:
                name: local_route
                virtual_hosts:
                - name: backend
                  domains: ["*"]
                  routes:
                  - match:
                      prefix: "/"
                    route:
                      cluster: backend_cluster
              http_filters:
              - name: envoy.filters.http.ext_authz
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz
                  grpc_service:
                    envoy_grpc:
                      cluster_name: opa_cluster
              - name: envoy.filters.http.router
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
              common_tls_context:
                tls_certificate_sds_secret_configs:
                - name: "spiffe://production.company.com/ns/production/sa/frontend"
                  sds_config:
                    resource_api_version: V3
                    api_config_source:
                      api_type: GRPC
                      transport_api_version: V3
                      grpc_services:
                      - envoy_grpc:
                          cluster_name: spire_agent
                validation_context_sds_secret_config:
                  name: "spiffe://production.company.com"
                  sds_config:
                    resource_api_version: V3
                    api_config_source:
                      api_type: GRPC
                      transport_api_version: V3
                      grpc_services:
                      - envoy_grpc:
                          cluster_name: spire_agent

      clusters:
      - name: backend_cluster
        connect_timeout: 30s
        type: STRICT_DNS
        lb_policy: ROUND_ROBIN
        load_assignment:
          cluster_name: backend_cluster
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: backend-service
                    port_value: 8080
        transport_socket:
          name: envoy.transport_sockets.tls
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
            common_tls_context:
              tls_certificate_sds_secret_configs:
              - name: "spiffe://production.company.com/ns/production/sa/frontend"
                sds_config:
                  resource_api_version: V3
                  api_config_source:
                    api_type: GRPC
                    transport_api_version: V3
                    grpc_services:
                    - envoy_grpc:
                        cluster_name: spire_agent
              validation_context_sds_secret_config:
                name: "spiffe://production.company.com"
                sds_config:
                  resource_api_version: V3
                  api_config_source:
                    api_type: GRPC
                    transport_api_version: V3
                    grpc_services:
                    - envoy_grpc:
                        cluster_name: spire_agent

      - name: spire_agent
        connect_timeout: 1s
        type: STATIC
        lb_policy: ROUND_ROBIN
        load_assignment:
          cluster_name: spire_agent
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  pipe:
                    path: /spiffe-workload-api/spire-agent.sock
```
Step 5: Deploy Policy Engine
```mermaid
graph TB
    subgraph "Policy Architecture"
        subgraph "Policy Sources"
            Git[Git Repository]
            API[Policy API]
            ConfigMap[K8s ConfigMap]
        end
        subgraph "Policy Engine"
            OPA[Open Policy Agent]
            Bundle[Bundle Server]
            Cache[Policy Cache]
        end
        subgraph "Enforcement Points"
            Envoy1[Envoy Proxy 1]
            Envoy2[Envoy Proxy 2]
            Envoy3[Envoy Proxy 3]
        end
    end

    Git --> Bundle
    API --> Bundle
    ConfigMap --> OPA
    Bundle --> Cache
    Cache --> OPA
    Envoy1 --> OPA
    Envoy2 --> OPA
    Envoy3 --> OPA

    style OPA fill:#f96,stroke:#333,stroke-width:2px
```
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: opa-policy
  namespace: production
data:
  policy.rego: |
    package envoy.authz

    import input.attributes.request.http as http_request
    import input.attributes.source.address as source_address

    default allow = false

    # Extract SPIFFE ID from certificate
    spiffe_id = id {
      [_, id] := split(http_request.headers["x-forwarded-client-cert"], "URI=")
    }

    # Allow health checks
    allow {
      http_request.path == "/health"
    }

    # Service-to-service authorization rules
    allow {
      http_request.method == "GET"
      http_request.path == "/api/users"
      spiffe_id == "spiffe://production.company.com/ns/production/sa/frontend"
    }

    allow {
      http_request.method == "POST"
      http_request.path == "/api/orders"
      spiffe_id == "spiffe://production.company.com/ns/production/sa/order-service"
    }

    # Rate limiting rules
    rate_limit[decision] {
      service := split(spiffe_id, "/")[4]
      limits := {
        "frontend": 1000,
        "backend": 500,
        "database": 100
      }
      decision := {
        "allowed": true,
        "headers": {
          "X-RateLimit-Limit": limits[service]
        }
      }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: opa
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: opa
  template:
    metadata:
      labels:
        app: opa
    spec:
      containers:
      - name: opa
        image: openpolicyagent/opa:0.59.0-envoy
        ports:
        - containerPort: 9191
        args:
        - "run"
        - "--server"
        - "--config-file=/config/config.yaml"
        - "/policies"
        volumeMounts:
        - name: opa-policy
          mountPath: /policies
        - name: opa-config
          mountPath: /config
        livenessProbe:
          httpGet:
            path: /health
            port: 8181
          initialDelaySeconds: 5
          periodSeconds: 5
        readinessProbe:
          httpGet:
            path: /health?bundle=true
            port: 8181
          initialDelaySeconds: 5
          periodSeconds: 5
      volumes:
      - name: opa-policy
        configMap:
          name: opa-policy
      - name: opa-config
        configMap:
          name: opa-config
```
Advanced Security Features
Zero Trust Network Architecture
```mermaid
graph TB
    subgraph "Zero Trust Principles"
        subgraph "Never Trust"
            NT1[No Implicit Trust]
            NT2[Verify Every Request]
            NT3[Assume Breach]
        end
        subgraph "Always Verify"
            AV1[Identity Verification]
            AV2[Device Verification]
            AV3[Context Verification]
        end
        subgraph "Least Privilege"
            LP1[Minimal Access]
            LP2[Just-In-Time Access]
            LP3[Adaptive Access]
        end
    end

    subgraph "Implementation"
        subgraph "Identity"
            SPIFFE[SPIFFE IDs]
            mTLS[Mutual TLS]
            Tokens[JWT Tokens]
        end
        subgraph "Policy"
            RBAC[Role-Based Access]
            ABAC[Attribute-Based Access]
            Context[Contextual Policies]
        end
        subgraph "Monitoring"
            Audit[Audit Logs]
            Metrics[Security Metrics]
            Alerts[Real-time Alerts]
        end
    end

    NT1 --> SPIFFE
    NT2 --> mTLS
    NT3 --> Audit
    AV1 --> SPIFFE
    AV2 --> Context
    AV3 --> ABAC
    LP1 --> RBAC
    LP2 --> Context
    LP3 --> ABAC

    style SPIFFE fill:#f96,stroke:#333,stroke-width:2px
    style mTLS fill:#9f9,stroke:#333,stroke-width:2px
```
Secret Management Integration
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spire-server-config
  namespace: spire
data:
  server.conf: |
    server {
      bind_address = "0.0.0.0"
      bind_port = "8081"
      trust_domain = "production.company.com"
      data_dir = "/run/spire/data"
      log_level = "INFO"

      ca_key_type = "rsa-2048"
      ca_ttl = "24h"

      jwt_issuer = "https://spire.production.company.com"
    }

    plugins {
      DataStore "sql" {
        plugin_data {
          database_type = "postgres"
          connection_string = "${SPIRE_DB_CONNECTION_STRING}"
        }
      }

      KeyManager "disk" {
        plugin_data {
          keys_path = "/run/spire/data/keys.json"
        }
      }

      UpstreamAuthority "vault" {
        plugin_data {
          vault_addr = "https://vault.production.company.com"
          pki_mount_point = "pki/spire"
          ca_cert_path = "/run/secrets/vault-ca.crt"

          token_auth {
            token = "${VAULT_TOKEN}"
          }

          # Or use AppRole auth
          # approle_auth {
          #   approle_id = "${VAULT_APPROLE_ID}"
          #   approle_secret_id = "${VAULT_APPROLE_SECRET_ID}"
          # }
        }
      }

      NodeAttestor "k8s_psat" {
        plugin_data {
          clusters = {
            "production" = {
              service_account_allow_list = ["spire:spire-agent"]
            }
          }
        }
      }
    }
```
Advanced Monitoring and Observability
```mermaid
graph LR
    subgraph "Data Collection"
        Envoy[Envoy Metrics]
        SPIRE[SPIRE Metrics]
        Apps[Application Metrics]
        Traces[Distributed Traces]
    end

    subgraph "Processing"
        Prometheus[Prometheus]
        Jaeger[Jaeger]
        FluentBit[Fluent Bit]
    end

    subgraph "Storage"
        TSDB[Time Series DB]
        TraceDB[Trace Storage]
        LogDB[Log Storage]
    end

    subgraph "Visualization"
        Grafana[Grafana]
        Kibana[Kibana]
        Custom[Custom Dashboards]
    end

    Envoy --> Prometheus
    SPIRE --> Prometheus
    Apps --> Prometheus
    Traces --> Jaeger
    Prometheus --> TSDB
    Jaeger --> TraceDB
    FluentBit --> LogDB
    TSDB --> Grafana
    TraceDB --> Grafana
    LogDB --> Kibana
    Grafana --> Custom
    Kibana --> Custom

    style Prometheus fill:#f96,stroke:#333,stroke-width:2px
    style Grafana fill:#9f9,stroke:#333,stroke-width:2px
```
Security Dashboard Configuration
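The first panel in the dashboard below divides the rate of TLS downstream connections by the rate of all downstream connections. The same arithmetic outside PromQL, as a quick sanity-check sketch:

```python
def mtls_adoption_pct(ssl_cx_per_sec: float, total_cx_per_sec: float) -> float:
    """Percentage of downstream connections negotiated over TLS."""
    if total_cx_per_sec == 0:
        return 0.0  # avoid division by zero when there is no traffic
    return ssl_cx_per_sec / total_cx_per_sec * 100.0


print(mtls_adoption_pct(45.0, 50.0))  # 90.0
```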
```json
{
  "dashboard": {
    "title": "Service Mesh Security Dashboard",
    "panels": [
      {
        "title": "mTLS Adoption Rate",
        "targets": [
          {
            "expr": "sum(rate(envoy_http_downstream_cx_ssl_total[5m])) / sum(rate(envoy_http_downstream_cx_total[5m])) * 100"
          }
        ]
      },
      {
        "title": "Authorization Denials",
        "targets": [
          {
            "expr": "sum(rate(envoy_http_ext_authz_denied[5m])) by (service)"
          }
        ]
      },
      {
        "title": "SVID Rotation Events",
        "targets": [
          {
            "expr": "sum(rate(spire_agent_svid_rotations_total[5m])) by (trust_domain)"
          }
        ]
      },
      {
        "title": "Policy Violations",
        "targets": [
          {
            "expr": "sum(rate(opa_decisions_total{decision=\"deny\"}[5m])) by (policy)"
          }
        ]
      }
    ]
  }
}
```
Production Deployment Considerations
High Availability Configuration
```mermaid
graph TB
    subgraph "HA Architecture"
        subgraph "Region 1"
            LB1[Load Balancer]
            SPIRE1[SPIRE Server 1]
            SPIRE2[SPIRE Server 2]
            DB1[(Primary DB)]
        end
        subgraph "Region 2"
            LB2[Load Balancer]
            SPIRE3[SPIRE Server 3]
            SPIRE4[SPIRE Server 4]
            DB2[(Replica DB)]
        end
        subgraph "Global"
            GLB[Global Load Balancer]
            GSLB[Global Service LB]
        end
    end

    GLB --> LB1
    GLB --> LB2
    LB1 --> SPIRE1
    LB1 --> SPIRE2
    LB2 --> SPIRE3
    LB2 --> SPIRE4
    SPIRE1 --> DB1
    SPIRE2 --> DB1
    SPIRE3 --> DB2
    SPIRE4 --> DB2
    DB1 -.->|Replication| DB2

    style GLB fill:#f96,stroke:#333,stroke-width:2px
    style DB1 fill:#9f9,stroke:#333,stroke-width:2px
```
Disaster Recovery Plan
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: spire-backup
  namespace: spire
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: postgres:15-alpine
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
            command:
            - /bin/sh
            - -c
            - |
              # Backup SPIRE database
              pg_dump -h postgres -U spire -d spire > /backup/spire-$(date +%Y%m%d-%H%M%S).sql

              # Backup SPIRE Server data
              kubectl exec -n spire spire-server-0 -- tar czf - /run/spire/data > /backup/spire-data-$(date +%Y%m%d-%H%M%S).tar.gz

              # Upload to S3
              aws s3 cp /backup/ s3://company-backups/spire/ --recursive

              # Cleanup old backups (keep last 30 days)
              find /backup -name "*.sql" -mtime +30 -delete
              find /backup -name "*.tar.gz" -mtime +30 -delete
            volumeMounts:
            - name: backup
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: backup-pvc
```
Performance Tuning
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-performance
  namespace: production
data:
  envoy.yaml: |
    static_resources:
      clusters:
      - name: service_cluster
        connect_timeout: 0.25s
        type: STRICT_DNS
        lb_policy: LEAST_REQUEST

        # Circuit breaker configuration
        circuit_breakers:
          thresholds:
          - priority: DEFAULT
            max_connections: 1000
            max_pending_requests: 1000
            max_requests: 1000
            max_retries: 3

        # Health checking
        health_checks:
        - timeout: 5s
          interval: 10s
          unhealthy_threshold: 2
          healthy_threshold: 2
          http_health_check:
            path: /health

        # Connection pooling
        upstream_connection_options:
          tcp_keepalive:
            keepalive_probes: 3
            keepalive_time: 10
            keepalive_interval: 5

        # HTTP/2 optimization
        typed_extension_protocol_options:
          envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
            "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
            explicit_http_config:
              http2_protocol_options:
                max_concurrent_streams: 100
                initial_stream_window_size: 65536
                initial_connection_window_size: 1048576
```
Troubleshooting Guide
Common Issues and Solutions
| Issue | Symptoms | Root Cause | Solution |
|---|---|---|---|
| SVID Not Issued | no identity issued | Workload not registered | Check workload registration and selectors |
| mTLS Handshake Failure | tls: bad certificate | Certificate validation failed | Verify trust bundle distribution |
| Policy Denial | 403 Forbidden | Authorization policy mismatch | Review OPA logs and policy rules |
| High Latency | Slow response times | Policy evaluation overhead | Optimize policy rules, enable caching |
| Memory Pressure | OOM kills | Large policy bundles | Implement policy sharding |
Debug Commands
```bash
# Check SPIRE Server health
kubectl exec -n spire spire-server-0 -- \
  /opt/spire/bin/spire-server healthcheck

# List registered workloads
kubectl exec -n spire spire-server-0 -- \
  /opt/spire/bin/spire-server entry show

# Debug workload attestation
kubectl exec -n production frontend-pod -- \
  /opt/spire/bin/spire-agent api fetch x509 \
  -socketPath /spiffe-workload-api/spire-agent.sock

# Check Envoy configuration
kubectl exec -n production frontend-pod -c envoy -- \
  curl -s localhost:15000/config_dump | jq .

# Validate OPA policies
kubectl exec -n production opa-pod -- \
  opa test /policies
```
Security Best Practices
Defense in Depth Strategy
```mermaid
graph TB
    subgraph "Security Layers"
        L1[Network Security]
        L2[Transport Security]
        L3[Application Security]
        L4[Data Security]
        L5[Operational Security]
    end

    subgraph "Controls"
        C1[Firewalls & Network Policies]
        C2[mTLS & Encryption]
        C3[Authentication & Authorization]
        C4[Encryption at Rest]
        C5[Audit & Monitoring]
    end

    L1 --> C1
    L2 --> C2
    L3 --> C3
    L4 --> C4
    L5 --> C5

    style L2 fill:#f96,stroke:#333,stroke-width:2px
    style C2 fill:#9f9,stroke:#333,stroke-width:2px
```
Security Checklist
- Enable mTLS for all service communication
- Implement strict workload identity verification
- Configure least-privilege authorization policies
- Enable comprehensive audit logging
- Implement rate limiting and circuit breaking
- Regular security scanning of container images
- Automated certificate rotation (< 24 hours)
- Network segmentation with policies
- Encrypted secrets management
- Regular security audits and penetration testing
Conclusion
Implementing a secure service mesh with SPIFFE/SPIRE provides a robust foundation for zero-trust security in microservices architectures. The combination of cryptographic workload identity, policy-based authorization, and comprehensive observability creates a defense-in-depth strategy that significantly enhances your security posture.
Key takeaways:
- Identity-First Security: Every workload has a cryptographically verifiable identity
- Policy as Code: Authorization rules are version-controlled and auditable
- Automated Security: Certificate rotation and policy updates happen automatically
- Observable Security: Rich metrics and logs provide security visibility
- Scalable Architecture: Designed for high availability and performance
By following this implementation guide and adapting it to your specific requirements, you can build a production-ready secure service mesh that provides both strong security guarantees and operational flexibility.