Building a Secure Service Mesh with SPIFFE/SPIRE - Complete Implementation Guide#

In the era of microservices and distributed systems, securing service-to-service communication has become paramount. This guide provides a comprehensive implementation of a secure service mesh using SPIFFE (Secure Production Identity Framework For Everyone) and SPIRE (SPIFFE Runtime Environment), complete with detailed architecture diagrams and production-ready configurations.

Service Mesh Architecture Overview#

A service mesh provides a dedicated infrastructure layer for managing service-to-service communication. When combined with SPIFFE/SPIRE, it creates a zero-trust security model where every workload has a cryptographically verifiable identity.

graph TB
subgraph "Service Mesh Architecture"
subgraph "Control Plane"
SPIRE_Server[SPIRE Server]
Policy_Engine[Policy Engine]
Config_Manager[Configuration Manager]
Cert_Authority[Certificate Authority]
Telemetry[Telemetry Collector]
end
subgraph "Data Plane"
subgraph "Service A"
A_App[Application A]
A_Proxy[Envoy Proxy]
A_Agent[SPIRE Agent]
A_Workload[Workload API]
end
subgraph "Service B"
B_App[Application B]
B_Proxy[Envoy Proxy]
B_Agent[SPIRE Agent]
B_Workload[Workload API]
end
subgraph "Service C"
C_App[Application C]
C_Proxy[Envoy Proxy]
C_Agent[SPIRE Agent]
C_Workload[Workload API]
end
end
subgraph "Infrastructure"
K8s[Kubernetes API]
Registry[Service Registry]
KV_Store[Key-Value Store]
end
end
SPIRE_Server --> Cert_Authority
SPIRE_Server --> Policy_Engine
SPIRE_Server --> KV_Store
A_Agent --> SPIRE_Server
B_Agent --> SPIRE_Server
C_Agent --> SPIRE_Server
A_Agent --> A_Workload
B_Agent --> B_Workload
C_Agent --> C_Workload
A_App --> A_Proxy
B_App --> B_Proxy
C_App --> C_Proxy
A_Proxy -.-> B_Proxy
B_Proxy -.-> C_Proxy
A_Proxy -.-> C_Proxy
Config_Manager --> Registry
Registry --> K8s
Telemetry --> A_Proxy
Telemetry --> B_Proxy
Telemetry --> C_Proxy
style SPIRE_Server fill:#f96,stroke:#333,stroke-width:4px
style A_Proxy fill:#9f9,stroke:#333,stroke-width:2px
style B_Proxy fill:#9f9,stroke:#333,stroke-width:2px
style C_Proxy fill:#9f9,stroke:#333,stroke-width:2px

Key Components#

  1. SPIRE Server: Central authority for workload attestation and SVID issuance
  2. SPIRE Agent: Node-level component that attests workloads and manages SVIDs
  3. Envoy Proxy: Data plane proxy handling mTLS and traffic management
  4. Workload API: Unix domain socket for workload-to-SPIRE communication
  5. Policy Engine: Centralized policy management and enforcement

SPIFFE/SPIRE Identity Flow#

Understanding how SPIFFE identities (SVIDs) are created, distributed, and verified is crucial for implementing a secure service mesh.

sequenceDiagram
participant W as Workload
participant A as SPIRE Agent
participant S as SPIRE Server
participant CA as Certificate Authority
participant R as Registration API
Note over W,CA: Initial Workload Registration
R->>S: Register workload entry
S->>S: Store registration
Note over W,CA: Workload Attestation & SVID Issuance
W->>A: Connect to Workload API
A->>A: Perform workload attestation
A->>A: Verify workload selectors
A->>S: Request SVID for workload
S->>S: Verify agent identity
S->>S: Check workload registration
S->>CA: Generate key pair & CSR
CA->>CA: Sign certificate
CA->>S: Return X.509 SVID
S->>A: Send SVID bundle
A->>W: Provide SVID via Workload API
W->>W: Configure TLS with SVID
Note over W,CA: SVID Rotation
loop Every 30 minutes
A->>S: Check SVID expiration
alt SVID expiring soon
A->>S: Request SVID renewal
S->>CA: Generate new SVID
CA->>S: Return new SVID
S->>A: Send updated SVID
A->>W: Hot-reload new SVID
end
end
Note over W,CA: Service-to-Service Communication
W->>W: Initiate TLS connection
W->>W: Present SVID
W->>A: Validate peer SVID
A->>A: Check trust bundle
A->>W: Validation result
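
In code, the Workload API step of this flow is a short client call. Below is a minimal Go sketch using the go-spiffe v2 library to fetch an X.509 SVID over the agent socket; the socket path matches the mount used in the deployment later in this guide, and the snippet is illustrative rather than part of the SPIRE distribution.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Socket path assumed to match the agent socket mounted into the pod.
	svid, err := workloadapi.FetchX509SVID(ctx,
		workloadapi.WithAddr("unix:///spiffe-workload-api/spire-agent.sock"))
	if err != nil {
		log.Fatalf("unable to fetch X.509 SVID: %v", err)
	}

	// The SPIFFE ID and certificate chain come straight from the agent;
	// for long-running workloads, an X509Source handles rotation transparently.
	fmt.Printf("SPIFFE ID: %s\n", svid.ID)
	fmt.Printf("Expires:   %s\n", svid.Certificates[0].NotAfter)
}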

SPIFFE Identity Structure#

# SPIFFE ID Format
spiffe://trust-domain/path/to/workload
# Example Identities
spiffe://production.company.com/ns/default/sa/frontend
spiffe://production.company.com/ns/payments/sa/processor
spiffe://production.company.com/region/us-east/service/api-gateway
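
These IDs can also be parsed and validated programmatically. A small Go sketch with go-spiffe's spiffeid package, using one of the example identities above:

package main

import (
	"fmt"
	"log"

	"github.com/spiffe/go-spiffe/v2/spiffeid"
)

func main() {
	// FromString rejects anything that is not a well-formed SPIFFE ID.
	id, err := spiffeid.FromString("spiffe://production.company.com/ns/payments/sa/processor")
	if err != nil {
		log.Fatalf("invalid SPIFFE ID: %v", err)
	}

	fmt.Println("trust domain:", id.TrustDomain()) // production.company.com
	fmt.Println("path:", id.Path())                // /ns/payments/sa/processor

	// Trust-domain membership checks are a one-liner.
	td := spiffeid.RequireTrustDomainFromString("production.company.com")
	fmt.Println("member of trust domain:", id.MemberOf(td)) // true
}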

Network Policy Enforcement Flow#

The service mesh enforces network policies at multiple levels, providing defense in depth:

graph TB
subgraph "Policy Enforcement Layers"
subgraph "Layer 1: Network Policies"
NP_Ingress[Ingress Rules]
NP_Egress[Egress Rules]
NP_CIDR[CIDR Blocks]
end
subgraph "Layer 2: Service Mesh Policies"
SM_Auth[Authentication Policy]
SM_Authz[Authorization Policy]
SM_Traffic[Traffic Policy]
end
subgraph "Layer 3: Application Policies"
APP_RBAC[RBAC Rules]
APP_Custom[Custom Logic]
APP_Rate[Rate Limiting]
end
end
subgraph "Enforcement Points"
subgraph "Network Level"
CNI[CNI Plugin]
IPTables[iptables/nftables]
eBPF[eBPF Programs]
end
subgraph "Proxy Level"
Envoy[Envoy Proxy]
WASM[WASM Filters]
Lua[Lua Scripts]
end
subgraph "Application Level"
SDK[Service Mesh SDK]
Middleware[Middleware]
Interceptors[gRPC Interceptors]
end
end
NP_Ingress --> CNI
NP_Egress --> IPTables
NP_CIDR --> eBPF
SM_Auth --> Envoy
SM_Authz --> WASM
SM_Traffic --> Lua
APP_RBAC --> SDK
APP_Custom --> Middleware
APP_Rate --> Interceptors
style SM_Auth fill:#f96,stroke:#333,stroke-width:2px
style Envoy fill:#9f9,stroke:#333,stroke-width:2px
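
To make the application-level enforcement point concrete, here is a hedged Go sketch of a gRPC unary interceptor that reads the caller's SPIFFE ID from the verified mTLS peer certificate and checks it against an allow-list. The allow-list and server wiring are invented for illustration.

package main

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/peer"
	"google.golang.org/grpc/status"
)

// allowedCallers is a hypothetical allow-list keyed by SPIFFE ID.
var allowedCallers = map[string]bool{
	"spiffe://production.company.com/ns/production/sa/frontend": true,
}

// peerSPIFFEID pulls the URI SAN (the SPIFFE ID) out of the verified client certificate.
func peerSPIFFEID(ctx context.Context) (string, bool) {
	p, ok := peer.FromContext(ctx)
	if !ok {
		return "", false
	}
	tlsInfo, ok := p.AuthInfo.(credentials.TLSInfo)
	if !ok || len(tlsInfo.State.PeerCertificates) == 0 {
		return "", false
	}
	cert := tlsInfo.State.PeerCertificates[0]
	if len(cert.URIs) == 0 {
		return "", false
	}
	return cert.URIs[0].String(), true
}

// AuthzInterceptor is an app-level enforcement point: deny unless the caller is allowed.
func AuthzInterceptor(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler) (interface{}, error) {
	id, ok := peerSPIFFEID(ctx)
	if !ok || !allowedCallers[id] {
		return nil, status.Error(codes.PermissionDenied, "caller not authorized")
	}
	return handler(ctx, req)
}

Register it with grpc.NewServer(grpc.UnaryInterceptor(AuthzInterceptor)) on a server whose TLS credentials require and verify client certificates.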

Policy Decision Flow#

sequenceDiagram
participant Client
participant Envoy as Envoy Proxy
participant OPA as Open Policy Agent
participant SPIRE as SPIRE Agent
participant Service
Client->>Envoy: HTTPS Request with SVID
Envoy->>Envoy: Validate TLS/SVID
Envoy->>SPIRE: Verify SVID
SPIRE->>Envoy: SVID Valid
Envoy->>Envoy: Extract request context
Note over Envoy: Method, Path, Headers, SPIFFE ID
Envoy->>OPA: Authorization check
Note over OPA: {
Note over OPA: "subject": "spiffe://...",
Note over OPA: "resource": "/api/users",
Note over OPA: "action": "GET"
Note over OPA: }
OPA->>OPA: Evaluate policies
OPA->>Envoy: Decision (Allow/Deny)
alt Allowed
Envoy->>Service: Forward request
Service->>Envoy: Response
Envoy->>Client: Response
else Denied
Envoy->>Client: 403 Forbidden
end

Implementation Guide#

Prerequisites#

Before implementing the secure service mesh, ensure you have:

  1. Kubernetes cluster (1.19+)
  2. Helm 3.x installed
  3. kubectl configured
  4. Storage class for persistent volumes
  5. Load balancer or ingress controller

Step 1: Install SPIRE#

# Add SPIRE Helm repository
helm repo add spiffe https://spiffe.github.io/helm-charts
helm repo update
# Create SPIRE namespace
kubectl create namespace spire
# Install SPIRE with custom values
cat > spire-values.yaml << EOF
spire-server:
  image:
    tag: 1.8.0
  controllerManager:
    enabled: true
  notifier:
    k8sbundle:
      enabled: true
  dataStore:
    sql:
      databaseType: postgres
      connectionString: "postgresql://spire:password@postgres:5432/spire"
  trustDomain: production.company.com
  ca_subject:
    country: US
    organization: Company
    common_name: SPIRE CA
  persistence:
    enabled: true
    size: 10Gi
  nodeAttestor:
    k8sPsat:
      enabled: true
spire-agent:
  image:
    tag: 1.8.0
  workloadAttestors:
    k8s:
      enabled: true
    unix:
      enabled: true
  sockets:
    admin:
      enabled: true
EOF
helm install spire spiffe/spire \
  --namespace spire \
  --values spire-values.yaml

Step 2: Deploy Service Mesh Control Plane#

istio-control-plane.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: control-plane
spec:
  values:
    pilot:
      env:
        PILOT_ENABLE_WORKLOAD_ENTRY_AUTOREGISTRATION: true
        PILOT_ENABLE_CROSS_CLUSTER_WORKLOAD_ENTRY: true
    telemetry:
      v2:
        prometheus:
          configOverride:
            inboundSidecar:
              disable_host_header_fallback: true
            outboundSidecar:
              disable_host_header_fallback: true
  meshConfig:
    defaultConfig:
      proxyStatsMatcher:
        inclusionRegexps:
        - ".*outlier_detection.*"
        - ".*circuit_breakers.*"
        - ".*upstream_rq_retry.*"
        - ".*upstream_rq_pending.*"
    trustDomain: production.company.com
    extensionProviders:
    - name: spire
      envoyExtAuthzGrpc:
        service: spire-server.spire.svc.cluster.local
        port: 8081
    defaultProviders:
      accessLogging:
      - otel

Step 3: Configure Workload Registration#

graph LR
subgraph "Registration Flow"
K8s[Kubernetes Controller]
Reg[Registration Controller]
SPIRE[SPIRE Server]
DB[(Registration DB)]
end
K8s -->|Watch Events| Reg
Reg -->|Create Entry| SPIRE
SPIRE -->|Store| DB
style Reg fill:#9f9,stroke:#333,stroke-width:2px
workload-registration.yaml
apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterSPIFFEID
metadata:
  name: default-workloads
spec:
  spiffeIDTemplate: "spiffe://{{ .TrustDomain }}/ns/{{ .PodMeta.Namespace }}/sa/{{ .PodSpec.ServiceAccountName }}"
  podSelector:
    matchLabels:
      spiffe.io/enabled: "true"
  workloadSelectorTemplates:
  - "k8s:ns:{{ .PodMeta.Namespace }}"
  - "k8s:sa:{{ .PodSpec.ServiceAccountName }}"
  - "k8s:pod-name:{{ .PodMeta.Name }}"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: frontend
  namespace: production
  labels:
    spiffe.io/enabled: "true"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
        spiffe.io/enabled: "true"
    spec:
      serviceAccountName: frontend
      containers:
      - name: app
        image: frontend:latest
        env:
        - name: SPIFFE_ENDPOINT_SOCKET
          value: unix:///spiffe-workload-api/spire-agent.sock
        volumeMounts:
        - name: spiffe-workload-api
          mountPath: /spiffe-workload-api
          readOnly: true
      - name: envoy
        image: envoyproxy/envoy:v1.28-latest
        args:
        - -c
        - /etc/envoy/envoy.yaml
        volumeMounts:
        - name: envoy-config
          mountPath: /etc/envoy
        - name: spiffe-workload-api
          mountPath: /spiffe-workload-api
          readOnly: true
      volumes:
      - name: spiffe-workload-api
        csi:
          driver: "csi.spiffe.io"
          readOnly: true
      - name: envoy-config
        configMap:
          name: envoy-config
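
To see what the spiffeIDTemplate above expands to, the following Go sketch renders it with text/template against stand-in structs (the field names mirror those referenced in the template; the types themselves are illustrative, not the SPIRE controller manager's actual Go types):

package main

import (
	"log"
	"os"
	"text/template"
)

// Stand-in types mirroring the fields the template references; illustrative only.
type templateData struct {
	TrustDomain string
	PodMeta     struct{ Namespace, Name string }
	PodSpec     struct{ ServiceAccountName string }
}

func main() {
	tmpl := template.Must(template.New("spiffeid").Parse(
		"spiffe://{{ .TrustDomain }}/ns/{{ .PodMeta.Namespace }}/sa/{{ .PodSpec.ServiceAccountName }}"))

	data := templateData{TrustDomain: "production.company.com"}
	data.PodMeta.Namespace = "production"
	data.PodMeta.Name = "frontend-abc123"
	data.PodSpec.ServiceAccountName = "frontend"

	// Prints: spiffe://production.company.com/ns/production/sa/frontend
	if err := tmpl.Execute(os.Stdout, data); err != nil {
		log.Fatal(err)
	}
}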

Step 4: Implement mTLS Configuration#

envoy-mtls-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-config
  namespace: production
data:
  envoy.yaml: |
    node:
      id: frontend
      cluster: frontend-cluster
    static_resources:
      listeners:
      - name: ingress
        address:
          socket_address:
            address: 0.0.0.0
            port_value: 8080
        filter_chains:
        - filters:
          - name: envoy.filters.network.http_connection_manager
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
              stat_prefix: ingress_http
              route_config:
                name: local_route
                virtual_hosts:
                - name: backend
                  domains: ["*"]
                  routes:
                  - match:
                      prefix: "/"
                    route:
                      cluster: backend_cluster
              http_filters:
              - name: envoy.filters.http.ext_authz
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz
                  grpc_service:
                    envoy_grpc:
                      # NOTE: an opa_cluster entry pointing at the OPA gRPC
                      # endpoint (port 9191 in the deployment from Step 5)
                      # must also be defined under clusters below.
                      cluster_name: opa_cluster
              - name: envoy.filters.http.router
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
              common_tls_context:
                tls_certificate_sds_secret_configs:
                - name: "spiffe://production.company.com/ns/production/sa/frontend"
                  sds_config:
                    resource_api_version: V3
                    api_config_source:
                      api_type: GRPC
                      transport_api_version: V3
                      grpc_services:
                      - envoy_grpc:
                          cluster_name: spire_agent
                validation_context_sds_secret_config:
                  name: "spiffe://production.company.com"
                  sds_config:
                    resource_api_version: V3
                    api_config_source:
                      api_type: GRPC
                      transport_api_version: V3
                      grpc_services:
                      - envoy_grpc:
                          cluster_name: spire_agent
      clusters:
      - name: backend_cluster
        connect_timeout: 30s
        type: STRICT_DNS
        lb_policy: ROUND_ROBIN
        load_assignment:
          cluster_name: backend_cluster
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: backend-service
                    port_value: 8080
        transport_socket:
          name: envoy.transport_sockets.tls
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
            common_tls_context:
              tls_certificate_sds_secret_configs:
              - name: "spiffe://production.company.com/ns/production/sa/frontend"
                sds_config:
                  resource_api_version: V3
                  api_config_source:
                    api_type: GRPC
                    transport_api_version: V3
                    grpc_services:
                    - envoy_grpc:
                        cluster_name: spire_agent
              validation_context_sds_secret_config:
                name: "spiffe://production.company.com"
                sds_config:
                  resource_api_version: V3
                  api_config_source:
                    api_type: GRPC
                    transport_api_version: V3
                    grpc_services:
                    - envoy_grpc:
                        cluster_name: spire_agent
      - name: spire_agent
        connect_timeout: 1s
        type: STATIC
        lb_policy: ROUND_ROBIN
        load_assignment:
          cluster_name: spire_agent
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  pipe:
                    path: /spiffe-workload-api/spire-agent.sock
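
Envoy terminates mTLS on behalf of the application here, but a workload can also establish SPIFFE mTLS directly. A minimal Go sketch with go-spiffe's spiffetls package, assuming the agent socket configured above and a placeholder backend address:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/spiffe/go-spiffe/v2/spiffeid"
	"github.com/spiffe/go-spiffe/v2/spiffetls"
	"github.com/spiffe/go-spiffe/v2/spiffetls/tlsconfig"
)

func main() {
	ctx := context.Background()

	// Only accept a server presenting exactly this SPIFFE ID.
	serverID := spiffeid.RequireFromString(
		"spiffe://production.company.com/ns/production/sa/backend")

	// Dial fetches our own SVID from the Workload API (located via the
	// SPIFFE_ENDPOINT_SOCKET env var set in the deployment) and performs
	// mutual TLS; "backend-service:8080" is a placeholder address.
	conn, err := spiffetls.Dial(ctx, "tcp", "backend-service:8080",
		tlsconfig.AuthorizeID(serverID))
	if err != nil {
		log.Fatalf("mTLS dial failed: %v", err)
	}
	defer conn.Close()

	fmt.Fprintln(conn, "hello over SPIFFE mTLS")
}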

Step 5: Deploy Policy Engine#

graph TB
subgraph "Policy Architecture"
subgraph "Policy Sources"
Git[Git Repository]
API[Policy API]
ConfigMap[K8s ConfigMap]
end
subgraph "Policy Engine"
OPA[Open Policy Agent]
Bundle[Bundle Server]
Cache[Policy Cache]
end
subgraph "Enforcement Points"
Envoy1[Envoy Proxy 1]
Envoy2[Envoy Proxy 2]
Envoy3[Envoy Proxy 3]
end
end
Git --> Bundle
API --> Bundle
ConfigMap --> OPA
Bundle --> Cache
Cache --> OPA
Envoy1 --> OPA
Envoy2 --> OPA
Envoy3 --> OPA
style OPA fill:#f96,stroke:#333,stroke-width:2px
opa-deployment.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: opa-policy
  namespace: production
data:
  policy.rego: |
    package envoy.authz

    import input.attributes.request.http as http_request
    import input.attributes.source.address as source_address

    default allow = false

    # Extract SPIFFE ID from the forwarded client certificate header
    spiffe_id = id {
        [_, id] := split(http_request.headers["x-forwarded-client-cert"], "URI=")
    }

    # Allow health checks
    allow {
        http_request.path == "/health"
    }

    # Service-to-service authorization rules
    allow {
        http_request.method == "GET"
        http_request.path == "/api/users"
        spiffe_id == "spiffe://production.company.com/ns/production/sa/frontend"
    }

    allow {
        http_request.method == "POST"
        http_request.path == "/api/orders"
        spiffe_id == "spiffe://production.company.com/ns/production/sa/order-service"
    }

    # Rate limiting rules
    rate_limit[decision] {
        # Service account segment of spiffe://<domain>/ns/<ns>/sa/<name>
        service := split(spiffe_id, "/")[6]
        limits := {
            "frontend": 1000,
            "backend": 500,
            "database": 100
        }
        decision := {
            "allowed": true,
            "headers": {
                "X-RateLimit-Limit": limits[service]
            }
        }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: opa
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: opa
  template:
    metadata:
      labels:
        app: opa
    spec:
      containers:
      - name: opa
        image: openpolicyagent/opa:0.59.0-envoy
        ports:
        - containerPort: 9191
        args:
        - "run"
        - "--server"
        - "--config-file=/config/config.yaml"
        - "/policies"
        volumeMounts:
        - name: opa-policy
          mountPath: /policies
        - name: opa-config
          mountPath: /config
        livenessProbe:
          httpGet:
            path: /health
            port: 8181
          initialDelaySeconds: 5
          periodSeconds: 5
        readinessProbe:
          httpGet:
            path: /health?bundle=true
            port: 8181
          initialDelaySeconds: 5
          periodSeconds: 5
      volumes:
      - name: opa-policy
        configMap:
          name: opa-policy
      - name: opa-config
        configMap:
          name: opa-config
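
To exercise the policy outside Envoy, POST an input document shaped like the ext_authz attributes to OPA's REST Data API. The Go sketch below assumes OPA is reachable at localhost:8181 (port-forwarded, for example); the input mirrors only the fields the policy actually reads:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Input mirrors the fields the Rego policy reads.
	input := map[string]interface{}{
		"input": map[string]interface{}{
			"attributes": map[string]interface{}{
				"request": map[string]interface{}{
					"http": map[string]interface{}{
						"method": "GET",
						"path":   "/api/users",
						"headers": map[string]string{
							"x-forwarded-client-cert": "By=spiffe://production.company.com/ns/production/sa/backend;URI=spiffe://production.company.com/ns/production/sa/frontend",
						},
					},
				},
			},
		},
	}

	body, err := json.Marshal(input)
	if err != nil {
		log.Fatal(err)
	}

	// OPA's Data API evaluates the rule at the given document path.
	resp, err := http.Post("http://localhost:8181/v1/data/envoy/authz/allow",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var out struct {
		Result bool `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}
	fmt.Println("allow:", out.Result) // expect true for the frontend SPIFFE ID
}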

Advanced Security Features#

Zero Trust Network Architecture#

graph TB
subgraph "Zero Trust Principles"
subgraph "Never Trust"
NT1[No Implicit Trust]
NT2[Verify Every Request]
NT3[Assume Breach]
end
subgraph "Always Verify"
AV1[Identity Verification]
AV2[Device Verification]
AV3[Context Verification]
end
subgraph "Least Privilege"
LP1[Minimal Access]
LP2[Just-In-Time Access]
LP3[Adaptive Access]
end
end
subgraph "Implementation"
subgraph "Identity"
SPIFFE[SPIFFE IDs]
mTLS[Mutual TLS]
Tokens[JWT Tokens]
end
subgraph "Policy"
RBAC[Role-Based Access]
ABAC[Attribute-Based Access]
Context[Contextual Policies]
end
subgraph "Monitoring"
Audit[Audit Logs]
Metrics[Security Metrics]
Alerts[Real-time Alerts]
end
end
NT1 --> SPIFFE
NT2 --> mTLS
NT3 --> Audit
AV1 --> SPIFFE
AV2 --> Context
AV3 --> ABAC
LP1 --> RBAC
LP2 --> Context
LP3 --> ABAC
style SPIFFE fill:#f96,stroke:#333,stroke-width:2px
style mTLS fill:#9f9,stroke:#333,stroke-width:2px

Secret Management Integration#

vault-integration.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spire-server-config
  namespace: spire
data:
  server.conf: |
    server {
        bind_address = "0.0.0.0"
        bind_port = "8081"
        trust_domain = "production.company.com"
        data_dir = "/run/spire/data"
        log_level = "INFO"
        ca_key_type = "rsa-2048"
        ca_ttl = "24h"
        jwt_issuer = "https://spire.production.company.com"
    }
    plugins {
        DataStore "sql" {
            plugin_data {
                database_type = "postgres"
                connection_string = "${SPIRE_DB_CONNECTION_STRING}"
            }
        }
        KeyManager "disk" {
            plugin_data {
                keys_path = "/run/spire/data/keys.json"
            }
        }
        UpstreamAuthority "vault" {
            plugin_data {
                vault_addr = "https://vault.production.company.com"
                pki_mount_point = "pki/spire"
                ca_cert_path = "/run/secrets/vault-ca.crt"
                token_auth {
                    token = "${VAULT_TOKEN}"
                }
                # Or use AppRole auth
                # approle_auth {
                #     approle_id = "${VAULT_APPROLE_ID}"
                #     approle_secret_id = "${VAULT_APPROLE_SECRET_ID}"
                # }
            }
        }
        NodeAttestor "k8s_psat" {
            plugin_data {
                clusters = {
                    "production" = {
                        service_account_allow_list = ["spire:spire-agent"]
                    }
                }
            }
        }
    }

Advanced Monitoring and Observability#

graph LR
subgraph "Data Collection"
Envoy[Envoy Metrics]
SPIRE[SPIRE Metrics]
Apps[Application Metrics]
Traces[Distributed Traces]
end
subgraph "Processing"
Prometheus[Prometheus]
Jaeger[Jaeger]
FluentBit[Fluent Bit]
end
subgraph "Storage"
TSDB[Time Series DB]
TraceDB[Trace Storage]
LogDB[Log Storage]
end
subgraph "Visualization"
Grafana[Grafana]
Kibana[Kibana]
Custom[Custom Dashboards]
end
Envoy --> Prometheus
SPIRE --> Prometheus
Apps --> Prometheus
Traces --> Jaeger
Prometheus --> TSDB
Jaeger --> TraceDB
FluentBit --> LogDB
TSDB --> Grafana
TraceDB --> Grafana
LogDB --> Kibana
Grafana --> Custom
Kibana --> Custom
style Prometheus fill:#f96,stroke:#333,stroke-width:2px
style Grafana fill:#9f9,stroke:#333,stroke-width:2px

Security Dashboard Configuration#

grafana-dashboard.json
{
  "dashboard": {
    "title": "Service Mesh Security Dashboard",
    "panels": [
      {
        "title": "mTLS Adoption Rate",
        "targets": [
          {
            "expr": "sum(rate(envoy_http_downstream_cx_ssl_total[5m])) / sum(rate(envoy_http_downstream_cx_total[5m])) * 100"
          }
        ]
      },
      {
        "title": "Authorization Denials",
        "targets": [
          {
            "expr": "sum(rate(envoy_http_ext_authz_denied[5m])) by (service)"
          }
        ]
      },
      {
        "title": "SVID Rotation Events",
        "targets": [
          {
            "expr": "sum(rate(spire_agent_svid_rotations_total[5m])) by (trust_domain)"
          }
        ]
      },
      {
        "title": "Policy Violations",
        "targets": [
          {
            "expr": "sum(rate(opa_decisions_total{decision=\"deny\"}[5m])) by (policy)"
          }
        ]
      }
    ]
  }
}

Production Deployment Considerations#

High Availability Configuration#

graph TB
subgraph "HA Architecture"
subgraph "Region 1"
LB1[Load Balancer]
SPIRE1[SPIRE Server 1]
SPIRE2[SPIRE Server 2]
DB1[(Primary DB)]
end
subgraph "Region 2"
LB2[Load Balancer]
SPIRE3[SPIRE Server 3]
SPIRE4[SPIRE Server 4]
DB2[(Replica DB)]
end
subgraph "Global"
GLB[Global Load Balancer]
GSLB[Global Service LB]
end
end
GLB --> LB1
GLB --> LB2
LB1 --> SPIRE1
LB1 --> SPIRE2
LB2 --> SPIRE3
LB2 --> SPIRE4
SPIRE1 --> DB1
SPIRE2 --> DB1
SPIRE3 --> DB2
SPIRE4 --> DB2
DB1 -.->|Replication| DB2
style GLB fill:#f96,stroke:#333,stroke-width:2px
style DB1 fill:#9f9,stroke:#333,stroke-width:2px

Disaster Recovery Plan#

backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: spire-backup
  namespace: spire
spec:
  schedule: "0 */6 * * *" # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            # NOTE: postgres:15-alpine provides pg_dump only; the kubectl and
            # aws steps below require a custom backup image that bundles them.
            image: postgres:15-alpine
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
            command:
            - /bin/sh
            - -c
            - |
              # Backup SPIRE database
              pg_dump -h postgres -U spire -d spire > /backup/spire-$(date +%Y%m%d-%H%M%S).sql
              # Backup SPIRE Server data
              kubectl exec -n spire spire-server-0 -- tar czf - /run/spire/data > /backup/spire-data-$(date +%Y%m%d-%H%M%S).tar.gz
              # Upload to S3
              aws s3 cp /backup/ s3://company-backups/spire/ --recursive
              # Cleanup old backups (keep last 30 days)
              find /backup -name "*.sql" -mtime +30 -delete
              find /backup -name "*.tar.gz" -mtime +30 -delete
            volumeMounts:
            - name: backup
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: backup-pvc

Performance Tuning#

performance-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-performance
  namespace: production
data:
  envoy.yaml: |
    static_resources:
      clusters:
      - name: service_cluster
        connect_timeout: 0.25s
        type: STRICT_DNS
        lb_policy: LEAST_REQUEST
        # Circuit breaker configuration
        circuit_breakers:
          thresholds:
          - priority: DEFAULT
            max_connections: 1000
            max_pending_requests: 1000
            max_requests: 1000
            max_retries: 3
        # Health checking
        health_checks:
        - timeout: 5s
          interval: 10s
          unhealthy_threshold: 2
          healthy_threshold: 2
          http_health_check:
            path: /health
        # Connection pooling
        upstream_connection_options:
          tcp_keepalive:
            keepalive_probes: 3
            keepalive_time: 10
            keepalive_interval: 5
        # HTTP/2 optimization
        typed_extension_protocol_options:
          envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
            "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
            explicit_http_config:
              http2_protocol_options:
                max_concurrent_streams: 100
                initial_stream_window_size: 65536
                initial_connection_window_size: 1048576

Troubleshooting Guide#

Common Issues and Solutions#

| Issue | Symptoms | Root Cause | Solution |
| --- | --- | --- | --- |
| SVID Not Issued | "no identity issued" | Workload not registered | Check workload registration and selectors |
| mTLS Handshake Failure | "tls: bad certificate" | Certificate validation failed | Verify trust bundle distribution |
| Policy Denial | 403 Forbidden | Authorization policy mismatch | Review OPA logs and policy rules |
| High Latency | Slow response times | Policy evaluation overhead | Optimize policy rules, enable caching |
| Memory Pressure | OOM kills | Large policy bundles | Implement policy sharding |

Debug Commands#

# Check SPIRE Server health
kubectl exec -n spire spire-server-0 -- \
/opt/spire/bin/spire-server healthcheck
# List registered workloads
kubectl exec -n spire spire-server-0 -- \
/opt/spire/bin/spire-server entry list
# Debug workload attestation
kubectl exec -n production frontend-pod -- \
/opt/spire/bin/spire-agent api fetch x509 \
-socketPath /spiffe-workload-api/spire-agent.sock
# Check Envoy configuration
kubectl exec -n production frontend-pod -c envoy -- \
curl -s localhost:15000/config_dump | jq .
# Validate OPA policies
kubectl exec -n production opa-pod -- \
opa test /policies

Security Best Practices#

Defense in Depth Strategy#

graph TB
subgraph "Security Layers"
L1[Network Security]
L2[Transport Security]
L3[Application Security]
L4[Data Security]
L5[Operational Security]
end
subgraph "Controls"
C1[Firewalls & Network Policies]
C2[mTLS & Encryption]
C3[Authentication & Authorization]
C4[Encryption at Rest]
C5[Audit & Monitoring]
end
L1 --> C1
L2 --> C2
L3 --> C3
L4 --> C4
L5 --> C5
style L2 fill:#f96,stroke:#333,stroke-width:2px
style C2 fill:#9f9,stroke:#333,stroke-width:2px

Security Checklist#

  • Enable mTLS for all service communication
  • Implement strict workload identity verification
  • Configure least-privilege authorization policies
  • Enable comprehensive audit logging
  • Implement rate limiting and circuit breaking
  • Regular security scanning of container images
  • Automated certificate rotation (< 24 hours); see the rotation-watch sketch after this list
  • Network segmentation with policies
  • Encrypted secrets management
  • Regular security audits and penetration testing
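
The rotation-watch sketch referenced above: a workload can subscribe to X.509 context updates from the agent and log each rotation, which makes the rotation cadence observable in pod logs. A minimal Go sketch using go-spiffe's watch API, assuming the same agent socket as before:

package main

import (
	"context"
	"log"

	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

// rotationLogger implements workloadapi.X509ContextWatcher.
type rotationLogger struct{}

func (rotationLogger) OnX509ContextUpdate(c *workloadapi.X509Context) {
	svid := c.DefaultSVID()
	log.Printf("SVID updated: id=%s notAfter=%s", svid.ID, svid.Certificates[0].NotAfter)
}

func (rotationLogger) OnX509ContextWatchError(err error) {
	log.Printf("watch error: %v", err)
}

func main() {
	ctx := context.Background()
	// Blocks, invoking the watcher on every SVID rotation.
	err := workloadapi.WatchX509Context(ctx, rotationLogger{},
		workloadapi.WithAddr("unix:///spiffe-workload-api/spire-agent.sock"))
	log.Fatal(err)
}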

Conclusion#

Implementing a secure service mesh with SPIFFE/SPIRE provides a robust foundation for zero-trust security in microservices architectures. The combination of cryptographic workload identity, policy-based authorization, and comprehensive observability creates a defense-in-depth strategy that significantly enhances your security posture.

Key takeaways:

  1. Identity-First Security: Every workload has a cryptographically verifiable identity
  2. Policy as Code: Authorization rules are version-controlled and auditable
  3. Automated Security: Certificate rotation and policy updates happen automatically
  4. Observable Security: Rich metrics and logs provide security visibility
  5. Scalable Architecture: Designed for high availability and performance

By following this implementation guide and adapting it to your specific requirements, you can build a production-ready secure service mesh that provides both strong security guarantees and operational flexibility.
