Building a Secure Service Mesh with SPIFFE/SPIRE - Complete Implementation Guide
In the era of microservices and distributed systems, securing service-to-service communication has become paramount. This guide provides a comprehensive implementation of a secure service mesh using SPIFFE (Secure Production Identity Framework For Everyone) and SPIRE (SPIFFE Runtime Environment), complete with detailed architecture diagrams and production-ready configurations.
Service Mesh Architecture Overview
A service mesh provides a dedicated infrastructure layer for managing service-to-service communication. When combined with SPIFFE/SPIRE, it creates a zero-trust security model where every workload has a cryptographically verifiable identity.
graph TB
subgraph "Service Mesh Architecture"
subgraph "Control Plane"
SPIRE_Server[SPIRE Server]
Policy_Engine[Policy Engine]
Config_Manager[Configuration Manager]
Cert_Authority[Certificate Authority]
Telemetry[Telemetry Collector]
end
subgraph "Data Plane"
subgraph "Service A"
A_App[Application A]
A_Proxy[Envoy Proxy]
A_Agent[SPIRE Agent]
A_Workload[Workload API]
end
subgraph "Service B"
B_App[Application B]
B_Proxy[Envoy Proxy]
B_Agent[SPIRE Agent]
B_Workload[Workload API]
end
subgraph "Service C"
C_App[Application C]
C_Proxy[Envoy Proxy]
C_Agent[SPIRE Agent]
C_Workload[Workload API]
end
end
subgraph "Infrastructure"
K8s[Kubernetes API]
Registry[Service Registry]
KV_Store[Key-Value Store]
end
end
SPIRE_Server --> Cert_Authority
SPIRE_Server --> Policy_Engine
SPIRE_Server --> KV_Store
A_Agent --> SPIRE_Server
B_Agent --> SPIRE_Server
C_Agent --> SPIRE_Server
A_Agent --> A_Workload
B_Agent --> B_Workload
C_Agent --> C_Workload
A_App --> A_Proxy
B_App --> B_Proxy
C_App --> C_Proxy
A_Proxy -.-> B_Proxy
B_Proxy -.-> C_Proxy
A_Proxy -.-> C_Proxy
Config_Manager --> Registry
Registry --> K8s
Telemetry --> A_Proxy
Telemetry --> B_Proxy
Telemetry --> C_Proxy
style SPIRE_Server fill:#f96,stroke:#333,stroke-width:4px
style A_Proxy fill:#9f9,stroke:#333,stroke-width:2px
style B_Proxy fill:#9f9,stroke:#333,stroke-width:2px
style C_Proxy fill:#9f9,stroke:#333,stroke-width:2px
Key Components
- SPIRE Server: Central authority for workload attestation and SVID issuance
- SPIRE Agent: Node-level component that attests workloads and manages SVIDs
- Envoy Proxy: Data plane proxy handling mTLS and traffic management
- Workload API: Unix domain socket for workload-to-SPIRE communication
- Policy Engine: Centralized policy management and enforcement
SPIFFE/SPIRE Identity Flow
Understanding how SPIFFE identities (SVIDs) are created, distributed, and verified is crucial for implementing a secure service mesh.
sequenceDiagram
participant W as Workload
participant A as SPIRE Agent
participant S as SPIRE Server
participant CA as Certificate Authority
participant R as Registration API
Note over W,CA: Initial Workload Registration
R->>S: Register workload entry
S->>S: Store registration
Note over W,CA: Workload Attestation & SVID Issuance
W->>A: Connect to Workload API
A->>A: Perform workload attestation
A->>A: Match workload selectors
A->>A: Generate key pair & CSR
A->>S: Submit CSR for workload SVID
S->>S: Verify agent identity
S->>S: Check workload registration
S->>CA: Forward CSR for signing
CA->>CA: Sign certificate
CA->>S: Return X.509 SVID
S->>A: Send SVID & trust bundle
A->>W: Provide SVID via Workload API
W->>W: Configure TLS with SVID
Note over W,CA: SVID Rotation
loop Periodically (before the SVID TTL elapses)
A->>A: Check SVID expiration
alt SVID expiring soon
A->>S: Request SVID renewal
S->>CA: Generate new SVID
CA->>S: Return new SVID
S->>A: Send updated SVID
A->>W: Hot-reload new SVID
end
end
Note over W,CA: Service-to-Service Communication
W->>W: Initiate TLS connection
W->>W: Present SVID
A->>W: Push current trust bundle
W->>W: Validate peer SVID against bundle
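From the workload's point of view, the entire flow above collapses into a single Workload API call. Here is a minimal sketch in Go using the official go-spiffe v2 library; the socket path is an assumption matching the SPIFFE_ENDPOINT_SOCKET mount used in the deployment manifests later in this guide, and the X509Source handles the rotation loop from the diagram transparently:
// fetch_svid.go: minimal sketch of a workload obtaining its X.509 SVID.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

func main() {
	ctx := context.Background()

	// Connect to the SPIRE agent's Workload API over the Unix domain socket.
	source, err := workloadapi.NewX509Source(ctx,
		workloadapi.WithClientOptions(
			workloadapi.WithAddr("unix:///spiffe-workload-api/spire-agent.sock"),
		),
	)
	if err != nil {
		log.Fatalf("unable to create X509Source: %v", err)
	}
	defer source.Close()

	// GetX509SVID always returns the current SVID; the source re-fetches it
	// in the background whenever the agent rotates the certificate.
	svid, err := source.GetX509SVID()
	if err != nil {
		log.Fatalf("unable to fetch SVID: %v", err)
	}
	fmt.Printf("SPIFFE ID: %s\n", svid.ID)
	fmt.Printf("expires:   %s\n", svid.Certificates[0].NotAfter)
}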
SPIFFE Identity Structure
# SPIFFE ID Format
spiffe://trust-domain/path/to/workload
# Example Identities
spiffe://production.company.com/ns/default/sa/frontend
spiffe://production.company.com/ns/payments/sa/processor
spiffe://production.company.com/region/us-east/service/api-gateway
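These IDs are structured URIs, and go-spiffe's spiffeid package parses and compares them programmatically. A short sketch reusing the example identities above:
// spiffe_id.go: parsing and checking SPIFFE IDs.
package main

import (
	"fmt"
	"log"

	"github.com/spiffe/go-spiffe/v2/spiffeid"
)

func main() {
	id, err := spiffeid.FromString("spiffe://production.company.com/ns/payments/sa/processor")
	if err != nil {
		log.Fatalf("invalid SPIFFE ID: %v", err)
	}

	td := spiffeid.RequireTrustDomainFromString("production.company.com")
	fmt.Println(id.TrustDomain()) // production.company.com
	fmt.Println(id.Path())        // /ns/payments/sa/processor
	fmt.Println(id.MemberOf(td))  // true
}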
Network Policy Enforcement Flow
The service mesh enforces network policies at multiple levels, providing defense in depth:
graph TB
subgraph "Policy Enforcement Layers"
subgraph "Layer 1: Network Policies"
NP_Ingress[Ingress Rules]
NP_Egress[Egress Rules]
NP_CIDR[CIDR Blocks]
end
subgraph "Layer 2: Service Mesh Policies"
SM_Auth[Authentication Policy]
SM_Authz[Authorization Policy]
SM_Traffic[Traffic Policy]
end
subgraph "Layer 3: Application Policies"
APP_RBAC[RBAC Rules]
APP_Custom[Custom Logic]
APP_Rate[Rate Limiting]
end
end
subgraph "Enforcement Points"
subgraph "Network Level"
CNI[CNI Plugin]
IPTables[iptables/nftables]
eBPF[eBPF Programs]
end
subgraph "Proxy Level"
Envoy[Envoy Proxy]
WASM[WASM Filters]
Lua[Lua Scripts]
end
subgraph "Application Level"
SDK[Service Mesh SDK]
Middleware[Middleware]
Interceptors[gRPC Interceptors]
end
end
NP_Ingress --> CNI
NP_Egress --> IPTables
NP_CIDR --> eBPF
SM_Auth --> Envoy
SM_Authz --> WASM
SM_Traffic --> Lua
APP_RBAC --> SDK
APP_Custom --> Middleware
APP_Rate --> Interceptors
style SM_Auth fill:#f96,stroke:#333,stroke-width:2px
style Envoy fill:#9f9,stroke:#333,stroke-width:2px
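At the application layer, the "gRPC Interceptors" enforcement point from the diagram can be as small as a unary interceptor that checks the caller's SPIFFE ID from its client certificate. A sketch, under the assumption that the server already terminates SPIFFE mTLS (for example with go-spiffe credentials) so peer certificates are available in the connection state:
// authz_interceptor.go: admit only a single allow-listed peer SPIFFE ID.
package main

import (
	"context"

	"github.com/spiffe/go-spiffe/v2/spiffeid"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/peer"
	"google.golang.org/grpc/status"
)

func AuthzInterceptor(allowed spiffeid.ID) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req any, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (any, error) {
		p, ok := peer.FromContext(ctx)
		if !ok {
			return nil, status.Error(codes.Unauthenticated, "no peer info")
		}
		tlsInfo, ok := p.AuthInfo.(credentials.TLSInfo)
		if !ok || len(tlsInfo.State.PeerCertificates) == 0 {
			return nil, status.Error(codes.Unauthenticated, "no client certificate")
		}
		// The SPIFFE ID travels in the URI SAN of the leaf certificate.
		cert := tlsInfo.State.PeerCertificates[0]
		if len(cert.URIs) == 0 {
			return nil, status.Error(codes.Unauthenticated, "no SPIFFE ID in certificate")
		}
		id, err := spiffeid.FromURI(cert.URIs[0])
		if err != nil || id.String() != allowed.String() {
			return nil, status.Error(codes.PermissionDenied, "caller not authorized")
		}
		return handler(ctx, req)
	}
}
Registering it is one line: grpc.NewServer(grpc.UnaryInterceptor(AuthzInterceptor(allowedID))).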
Policy Decision Flow
sequenceDiagram
participant Client
participant Envoy as Envoy Proxy
participant OPA as Open Policy Agent
participant SPIRE as SPIRE Agent
participant Service
Client->>Envoy: HTTPS Request with SVID
Envoy->>Envoy: Validate TLS/SVID
Envoy->>SPIRE: Verify SVID
SPIRE->>Envoy: SVID Valid
Envoy->>Envoy: Extract request context
Note over Envoy: Method, Path, Headers, SPIFFE ID
Envoy->>OPA: Authorization check
Note over OPA: {
Note over OPA: "subject": "spiffe://...",
Note over OPA: "resource": "/api/users",
Note over OPA: "action": "GET"
Note over OPA: }
OPA->>OPA: Evaluate policies
OPA->>Envoy: Decision (Allow/Deny)
alt Allowed
Envoy->>Service: Forward request
Service->>Envoy: Response
Envoy->>Client: Response
else Denied
Envoy->>Client: 403 Forbidden
end
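The same decision can be exercised against OPA's REST Data API, which is handy for testing a policy before wiring it into ext_authz. A Go sketch: the localhost:8181 address assumes OPA's default REST port, and the document path matches the envoy.authz package defined in Step 5.
// opa_query.go: ask OPA whether a frontend GET /api/users would be allowed.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Mirror the shape Envoy's ext_authz filter sends to OPA.
	input := map[string]any{
		"input": map[string]any{
			"attributes": map[string]any{
				"request": map[string]any{
					"http": map[string]any{
						"method": "GET",
						"path":   "/api/users",
						"headers": map[string]string{
							"x-forwarded-client-cert": "By=spiffe://production.company.com/ns/production/sa/backend;URI=spiffe://production.company.com/ns/production/sa/frontend",
						},
					},
				},
			},
		},
	}
	body, err := json.Marshal(input)
	if err != nil {
		log.Fatal(err)
	}

	resp, err := http.Post("http://localhost:8181/v1/data/envoy/authz/allow",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var out struct {
		Result bool `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}
	fmt.Println("allowed:", out.Result)
}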
Implementation Guide
Prerequisites
Before implementing the secure service mesh, ensure you have:
- Kubernetes cluster (1.19+)
- Helm 3.x installed
- kubectl configured
- Storage class for persistent volumes
- Load balancer or ingress controller
Step 1: Install SPIRE
# Add SPIRE Helm repository
helm repo add spiffe https://spiffe.github.io/helm-charts
helm repo update
# Create SPIRE namespace
kubectl create namespace spire
# Install SPIRE with custom values
cat > spire-values.yaml << EOF
spire-server:
  image:
    tag: 1.8.0
  controllerManager:
    enabled: true
  notifier:
    k8sbundle:
      enabled: true
  dataStore:
    sql:
      databaseType: postgres
      connectionString: "postgresql://spire:password@postgres:5432/spire"
  trustDomain: production.company.com
  ca_subject:
    country: US
    organization: Company
    common_name: SPIRE CA
  persistence:
    enabled: true
    size: 10Gi
  nodeAttestor:
    k8sPsat:
      enabled: true

spire-agent:
  image:
    tag: 1.8.0
  workloadAttestors:
    k8s:
      enabled: true
    unix:
      enabled: true
  sockets:
    admin:
      enabled: true
EOF

helm install spire spiffe/spire \
  --namespace spire \
  --values spire-values.yaml
Step 2: Deploy Service Mesh Control Plane
# istio-control-plane.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: control-plane
spec:
  values:
    pilot:
      env:
        PILOT_ENABLE_WORKLOAD_ENTRY_AUTOREGISTRATION: true
        PILOT_ENABLE_CROSS_CLUSTER_WORKLOAD_ENTRY: true
    telemetry:
      v2:
        prometheus:
          configOverride:
            inboundSidecar:
              disable_host_header_fallback: true
            outboundSidecar:
              disable_host_header_fallback: true
  meshConfig:
    defaultConfig:
      proxyStatsMatcher:
        inclusionRegexps:
          - ".*outlier_detection.*"
          - ".*circuit_breakers.*"
          - ".*upstream_rq_retry.*"
          - ".*upstream_rq_pending.*"
    trustDomain: production.company.com
    extensionProviders:
      - name: spire
        envoyExtAuthzGrpc:
          service: spire-server.spire.svc.cluster.local
          port: 8081
    defaultProviders:
      accessLogging:
        - otel
Step 3: Configure Workload Registration
graph LR
subgraph "Registration Flow"
K8s[Kubernetes Controller]
Reg[Registration Controller]
SPIRE[SPIRE Server]
DB[(Registration DB)]
end
K8s -->|Watch Events| Reg
Reg -->|Create Entry| SPIRE
SPIRE -->|Store| DB
style Reg fill:#9f9,stroke:#333,stroke-width:2px
# workload-registration.yaml
apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterSPIFFEID
metadata:
  name: default-workloads
spec:
  spiffeIDTemplate: "spiffe://{{ .TrustDomain }}/ns/{{ .PodMeta.Namespace }}/sa/{{ .PodSpec.ServiceAccountName }}"
  podSelector:
    matchLabels:
      spiffe.io/enabled: "true"
  workloadSelectorTemplates:
    - "k8s:ns:{{ .PodMeta.Namespace }}"
    - "k8s:sa:{{ .PodSpec.ServiceAccountName }}"
    - "k8s:pod-name:{{ .PodMeta.Name }}"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: frontend
  namespace: production
  labels:
    spiffe.io/enabled: "true"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
        spiffe.io/enabled: "true"
    spec:
      serviceAccountName: frontend
      containers:
        - name: app
          image: frontend:latest
          env:
            - name: SPIFFE_ENDPOINT_SOCKET
              value: unix:///spiffe-workload-api/spire-agent.sock
          volumeMounts:
            - name: spiffe-workload-api
              mountPath: /spiffe-workload-api
              readOnly: true
        - name: envoy
          image: envoyproxy/envoy:v1.28-latest
          args:
            - -c
            - /etc/envoy/envoy.yaml
          volumeMounts:
            - name: envoy-config
              mountPath: /etc/envoy
            - name: spiffe-workload-api
              mountPath: /spiffe-workload-api
              readOnly: true
      volumes:
        - name: spiffe-workload-api
          csi:
            driver: "csi.spiffe.io"
            readOnly: true
        - name: envoy-config
          configMap:
            name: envoy-config
Step 4: Implement mTLS Configuration
# envoy-mtls-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-config
  namespace: production
data:
  envoy.yaml: |
    node:
      id: frontend
      cluster: frontend-cluster
    static_resources:
      listeners:
        - name: ingress
          address:
            socket_address:
              address: 0.0.0.0
              port_value: 8080
          filter_chains:
            - filters:
                - name: envoy.filters.network.http_connection_manager
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                    stat_prefix: ingress_http
                    route_config:
                      name: local_route
                      virtual_hosts:
                        - name: backend
                          domains: ["*"]
                          routes:
                            - match:
                                prefix: "/"
                              route:
                                cluster: backend_cluster
                    http_filters:
                      - name: envoy.filters.http.ext_authz
                        typed_config:
                          "@type": type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz
                          grpc_service:
                            envoy_grpc:
                              cluster_name: opa_cluster
                      - name: envoy.filters.http.router
                        typed_config:
                          "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
              transport_socket:
                name: envoy.transport_sockets.tls
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
                  # Require and verify a client SVID (mutual TLS)
                  require_client_certificate: true
                  common_tls_context:
                    tls_certificate_sds_secret_configs:
                      - name: "spiffe://production.company.com/ns/production/sa/frontend"
                        sds_config:
                          resource_api_version: V3
                          api_config_source:
                            api_type: GRPC
                            transport_api_version: V3
                            grpc_services:
                              - envoy_grpc:
                                  cluster_name: spire_agent
                    validation_context_sds_secret_config:
                      name: "spiffe://production.company.com"
                      sds_config:
                        resource_api_version: V3
                        api_config_source:
                          api_type: GRPC
                          transport_api_version: V3
                          grpc_services:
                            - envoy_grpc:
                                cluster_name: spire_agent
      clusters:
        - name: backend_cluster
          connect_timeout: 30s
          type: STRICT_DNS
          lb_policy: ROUND_ROBIN
          load_assignment:
            cluster_name: backend_cluster
            endpoints:
              - lb_endpoints:
                  - endpoint:
                      address:
                        socket_address:
                          address: backend-service
                          port_value: 8080
          transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
              common_tls_context:
                tls_certificate_sds_secret_configs:
                  - name: "spiffe://production.company.com/ns/production/sa/frontend"
                    sds_config:
                      resource_api_version: V3
                      api_config_source:
                        api_type: GRPC
                        transport_api_version: V3
                        grpc_services:
                          - envoy_grpc:
                              cluster_name: spire_agent
                validation_context_sds_secret_config:
                  name: "spiffe://production.company.com"
                  sds_config:
                    resource_api_version: V3
                    api_config_source:
                      api_type: GRPC
                      transport_api_version: V3
                      grpc_services:
                        - envoy_grpc:
                            cluster_name: spire_agent
        # SDS is served over gRPC, so the SPIRE agent cluster must speak HTTP/2
        - name: spire_agent
          connect_timeout: 1s
          type: STATIC
          lb_policy: ROUND_ROBIN
          typed_extension_protocol_options:
            envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
              "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
              explicit_http_config:
                http2_protocol_options: {}
          load_assignment:
            cluster_name: spire_agent
            endpoints:
              - lb_endpoints:
                  - endpoint:
                      address:
                        pipe:
                          path: /spiffe-workload-api/spire-agent.sock
        # ext_authz target; OPA's Envoy plugin listens on gRPC port 9191 (see Step 5)
        - name: opa_cluster
          connect_timeout: 1s
          type: STRICT_DNS
          lb_policy: ROUND_ROBIN
          typed_extension_protocol_options:
            envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
              "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
              explicit_http_config:
                http2_protocol_options: {}
          load_assignment:
            cluster_name: opa_cluster
            endpoints:
              - lb_endpoints:
                  - endpoint:
                      address:
                        socket_address:
                          address: opa.production.svc.cluster.local
                          port_value: 9191
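Envoy-terminated mTLS is not the only option: a workload can also terminate SPIFFE mTLS itself. A sketch using go-spiffe to serve HTTPS with an auto-rotating SVID while authorizing a single caller identity; the port, socket path, and IDs are illustrative:
// mtls_server.go: app-level SPIFFE mTLS without a sidecar.
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/spiffe/go-spiffe/v2/spiffeid"
	"github.com/spiffe/go-spiffe/v2/spiffetls/tlsconfig"
	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

func main() {
	ctx := context.Background()

	// One source supplies both our SVID and the trust bundle, and keeps both
	// current as the agent rotates them.
	source, err := workloadapi.NewX509Source(ctx,
		workloadapi.WithClientOptions(
			workloadapi.WithAddr("unix:///spiffe-workload-api/spire-agent.sock"),
		),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer source.Close()

	// Only the frontend service account may call this server.
	allowed := spiffeid.RequireFromString(
		"spiffe://production.company.com/ns/production/sa/frontend")

	server := &http.Server{
		Addr:      ":8443",
		TLSConfig: tlsconfig.MTLSServerConfig(source, source, tlsconfig.AuthorizeID(allowed)),
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			w.Write([]byte("hello over SPIFFE mTLS\n"))
		}),
	}
	// Certificates come from TLSConfig, so the file arguments stay empty.
	log.Fatal(server.ListenAndServeTLS("", ""))
}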
Step 5: Deploy Policy Engine
graph TB
subgraph "Policy Architecture"
subgraph "Policy Sources"
Git[Git Repository]
API[Policy API]
ConfigMap[K8s ConfigMap]
end
subgraph "Policy Engine"
OPA[Open Policy Agent]
Bundle[Bundle Server]
Cache[Policy Cache]
end
subgraph "Enforcement Points"
Envoy1[Envoy Proxy 1]
Envoy2[Envoy Proxy 2]
Envoy3[Envoy Proxy 3]
end
end
Git --> Bundle
API --> Bundle
ConfigMap --> OPA
Bundle --> Cache
Cache --> OPA
Envoy1 --> OPA
Envoy2 --> OPA
Envoy3 --> OPA
style OPA fill:#f96,stroke:#333,stroke-width:2px
# opa-deployment.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: opa-policy
  namespace: production
data:
  policy.rego: |
    package envoy.authz

    import input.attributes.request.http as http_request

    default allow = false

    # Extract the SPIFFE ID from the XFCC header that Envoy forwards
    spiffe_id = id {
      [_, id] := split(http_request.headers["x-forwarded-client-cert"], "URI=")
    }

    # Allow health checks
    allow {
      http_request.path == "/health"
    }

    # Service-to-service authorization rules
    allow {
      http_request.method == "GET"
      http_request.path == "/api/users"
      spiffe_id == "spiffe://production.company.com/ns/production/sa/frontend"
    }

    allow {
      http_request.method == "POST"
      http_request.path == "/api/orders"
      spiffe_id == "spiffe://production.company.com/ns/production/sa/order-service"
    }

    # Rate limiting rules; the service account name is path segment 6 of the
    # SPIFFE ID (spiffe://<trust-domain>/ns/<namespace>/sa/<name>)
    rate_limit[decision] {
      service := split(spiffe_id, "/")[6]
      limits := {
        "frontend": 1000,
        "backend": 500,
        "database": 100
      }
      decision := {
        "allowed": true,
        "headers": {
          "X-RateLimit-Limit": sprintf("%d", [limits[service]])
        }
      }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: opa
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: opa
  template:
    metadata:
      labels:
        app: opa
    spec:
      containers:
        - name: opa
          image: openpolicyagent/opa:0.59.0-envoy
          ports:
            - containerPort: 9191
          args:
            - "run"
            - "--server"
            - "--config-file=/config/config.yaml"
            - "/policies"
          volumeMounts:
            - name: opa-policy
              mountPath: /policies
            - name: opa-config
              mountPath: /config
          livenessProbe:
            httpGet:
              path: /health
              port: 8181
            initialDelaySeconds: 5
            periodSeconds: 5
          readinessProbe:
            httpGet:
              path: /health?bundle=true
              port: 8181
            initialDelaySeconds: 5
            periodSeconds: 5
      volumes:
        - name: opa-policy
          configMap:
            name: opa-policy
        - name: opa-config
          configMap:
            name: opa-config
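Because Rego policies are plain code, they can be unit tested before rollout (the opa test command shown in the troubleshooting section does this in-cluster). A sketch using OPA's Go API; only a fragment of the policy is inlined here for brevity:
// policy_test.go: evaluate the envoy.authz policy in-process.
package authz_test

import (
	"context"
	"testing"

	"github.com/open-policy-agent/opa/rego"
)

const policySrc = `
package envoy.authz

import input.attributes.request.http as http_request

default allow = false

allow {
	http_request.path == "/health"
}
`

func TestHealthCheckAllowed(t *testing.T) {
	ctx := context.Background()

	pq, err := rego.New(
		rego.Query("data.envoy.authz.allow"),
		rego.Module("policy.rego", policySrc),
	).PrepareForEval(ctx)
	if err != nil {
		t.Fatal(err)
	}

	input := map[string]any{
		"attributes": map[string]any{
			"request": map[string]any{
				"http": map[string]any{"method": "GET", "path": "/health"},
			},
		},
	}
	rs, err := pq.Eval(ctx, rego.EvalInput(input))
	if err != nil {
		t.Fatal(err)
	}
	if len(rs) == 0 || rs[0].Expressions[0].Value != true {
		t.Fatalf("expected /health to be allowed, got %v", rs)
	}
}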
Advanced Security Features
Zero Trust Network Architecture
graph TB
subgraph "Zero Trust Principles"
subgraph "Never Trust"
NT1[No Implicit Trust]
NT2[Verify Every Request]
NT3[Assume Breach]
end
subgraph "Always Verify"
AV1[Identity Verification]
AV2[Device Verification]
AV3[Context Verification]
end
subgraph "Least Privilege"
LP1[Minimal Access]
LP2[Just-In-Time Access]
LP3[Adaptive Access]
end
end
subgraph "Implementation"
subgraph "Identity"
SPIFFE[SPIFFE IDs]
mTLS[Mutual TLS]
Tokens[JWT Tokens]
end
subgraph "Policy"
RBAC[Role-Based Access]
ABAC[Attribute-Based Access]
Context[Contextual Policies]
end
subgraph "Monitoring"
Audit[Audit Logs]
Metrics[Security Metrics]
Alerts[Real-time Alerts]
end
end
NT1 --> SPIFFE
NT2 --> mTLS
NT3 --> Audit
AV1 --> SPIFFE
AV2 --> Context
AV3 --> ABAC
LP1 --> RBAC
LP2 --> Context
LP3 --> ABAC
style SPIFFE fill:#f96,stroke:#333,stroke-width:2px
style mTLS fill:#9f9,stroke:#333,stroke-width:2px
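The identity column above lists JWT tokens alongside X.509 SVIDs: when a hop crosses a TLS-terminating proxy, a workload can present a short-lived JWT-SVID instead of relying on the connection's client certificate. A sketch of fetching one over the Workload API; the audience value is illustrative:
// fetch_jwt.go: obtain a JWT-SVID scoped to one audience.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/spiffe/go-spiffe/v2/svid/jwtsvid"
	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

func main() {
	ctx := context.Background()

	// Scoping the token to a single audience keeps it from being replayed
	// against other services.
	svid, err := workloadapi.FetchJWTSVID(ctx,
		jwtsvid.Params{Audience: "spiffe://production.company.com/ns/production/sa/backend"},
		workloadapi.WithAddr("unix:///spiffe-workload-api/spire-agent.sock"),
	)
	if err != nil {
		log.Fatalf("unable to fetch JWT-SVID: %v", err)
	}
	fmt.Printf("subject: %s\n", svid.ID)
	fmt.Printf("token:   %s...\n", svid.Marshal()[:32])
}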
Secret Management Integration
# vault-integration.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spire-server-config
  namespace: spire
data:
  server.conf: |
    server {
      bind_address = "0.0.0.0"
      bind_port = "8081"
      trust_domain = "production.company.com"
      data_dir = "/run/spire/data"
      log_level = "INFO"
      ca_key_type = "rsa-2048"
      ca_ttl = "24h"
      jwt_issuer = "https://spire.production.company.com"
    }

    plugins {
      DataStore "sql" {
        plugin_data {
          database_type = "postgres"
          connection_string = "${SPIRE_DB_CONNECTION_STRING}"
        }
      }

      KeyManager "disk" {
        plugin_data {
          keys_path = "/run/spire/data/keys.json"
        }
      }

      UpstreamAuthority "vault" {
        plugin_data {
          vault_addr = "https://vault.production.company.com"
          pki_mount_point = "pki/spire"
          ca_cert_path = "/run/secrets/vault-ca.crt"
          token_auth {
            token = "${VAULT_TOKEN}"
          }
          # Or use AppRole auth:
          # approle_auth {
          #   approle_id = "${VAULT_APPROLE_ID}"
          #   approle_secret_id = "${VAULT_APPROLE_SECRET_ID}"
          # }
        }
      }

      NodeAttestor "k8s_psat" {
        plugin_data {
          clusters = {
            "production" = {
              service_account_allow_list = ["spire:spire-agent"]
            }
          }
        }
      }
    }
Advanced Monitoring and Observability
graph LR
subgraph "Data Collection"
Envoy[Envoy Metrics]
SPIRE[SPIRE Metrics]
Apps[Application Metrics]
Traces[Distributed Traces]
end
subgraph "Processing"
Prometheus[Prometheus]
Jaeger[Jaeger]
FluentBit[Fluent Bit]
end
subgraph "Storage"
TSDB[Time Series DB]
TraceDB[Trace Storage]
LogDB[Log Storage]
end
subgraph "Visualization"
Grafana[Grafana]
Kibana[Kibana]
Custom[Custom Dashboards]
end
Envoy --> Prometheus
SPIRE --> Prometheus
Apps --> Prometheus
Traces --> Jaeger
Prometheus --> TSDB
Jaeger --> TraceDB
FluentBit --> LogDB
TSDB --> Grafana
TraceDB --> Grafana
LogDB --> Kibana
Grafana --> Custom
Kibana --> Custom
style Prometheus fill:#f96,stroke:#333,stroke-width:2px
style Grafana fill:#9f9,stroke:#333,stroke-width:2px
Security Dashboard Configuration
# grafana-dashboard.json
{
  "dashboard": {
    "title": "Service Mesh Security Dashboard",
    "panels": [
      {
        "title": "mTLS Adoption Rate",
        "targets": [
          {
            "expr": "sum(rate(envoy_http_downstream_cx_ssl_total[5m])) / sum(rate(envoy_http_downstream_cx_total[5m])) * 100"
          }
        ]
      },
      {
        "title": "Authorization Denials",
        "targets": [
          {
            "expr": "sum(rate(envoy_http_ext_authz_denied[5m])) by (service)"
          }
        ]
      },
      {
        "title": "SVID Rotation Events",
        "targets": [
          {
            "expr": "sum(rate(spire_agent_svid_rotations_total[5m])) by (trust_domain)"
          }
        ]
      },
      {
        "title": "Policy Violations",
        "targets": [
          {
            "expr": "sum(rate(opa_decisions_total{decision=\"deny\"}[5m])) by (policy)"
          }
        ]
      }
    ]
  }
}
Production Deployment Considerations
High Availability Configuration
graph TB
subgraph "HA Architecture"
subgraph "Region 1"
LB1[Load Balancer]
SPIRE1[SPIRE Server 1]
SPIRE2[SPIRE Server 2]
DB1[(Primary DB)]
end
subgraph "Region 2"
LB2[Load Balancer]
SPIRE3[SPIRE Server 3]
SPIRE4[SPIRE Server 4]
DB2[(Replica DB)]
end
subgraph "Global"
GLB[Global Load Balancer]
GSLB[Global Service LB]
end
end
GLB --> LB1
GLB --> LB2
LB1 --> SPIRE1
LB1 --> SPIRE2
LB2 --> SPIRE3
LB2 --> SPIRE4
SPIRE1 --> DB1
SPIRE2 --> DB1
SPIRE3 --> DB2
SPIRE4 --> DB2
DB1 -.->|Replication| DB2
style GLB fill:#f96,stroke:#333,stroke-width:2px
style DB1 fill:#9f9,stroke:#333,stroke-width:2px
Disaster Recovery Plan
# backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: spire-backup
  namespace: spire
spec:
  schedule: "0 */6 * * *" # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              # Assumes a custom image layering kubectl and the AWS CLI on top
              # of the postgres client; stock postgres:15-alpine has neither.
              image: postgres:15-alpine
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-secret
                      key: password
              command:
                - /bin/sh
                - -c
                - |
                  # Back up the SPIRE database
                  pg_dump -h postgres -U spire -d spire > /backup/spire-$(date +%Y%m%d-%H%M%S).sql
                  # Back up SPIRE Server data
                  kubectl exec -n spire spire-server-0 -- tar czf - /run/spire/data > /backup/spire-data-$(date +%Y%m%d-%H%M%S).tar.gz
                  # Upload to S3
                  aws s3 cp /backup/ s3://company-backups/spire/ --recursive
                  # Clean up old backups (keep the last 30 days)
                  find /backup -name "*.sql" -mtime +30 -delete
                  find /backup -name "*.tar.gz" -mtime +30 -delete
              volumeMounts:
                - name: backup
                  mountPath: /backup
          restartPolicy: OnFailure
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: backup-pvc
Performance Tuning
# performance-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-performance
  namespace: production
data:
  envoy.yaml: |
    static_resources:
      clusters:
        - name: service_cluster
          connect_timeout: 0.25s
          type: STRICT_DNS
          lb_policy: LEAST_REQUEST
          # Circuit breaker configuration
          circuit_breakers:
            thresholds:
              - priority: DEFAULT
                max_connections: 1000
                max_pending_requests: 1000
                max_requests: 1000
                max_retries: 3
          # Health checking
          health_checks:
            - timeout: 5s
              interval: 10s
              unhealthy_threshold: 2
              healthy_threshold: 2
              http_health_check:
                path: /health
          # Connection pooling
          upstream_connection_options:
            tcp_keepalive:
              keepalive_probes: 3
              keepalive_time: 10
              keepalive_interval: 5
          # HTTP/2 optimization
          typed_extension_protocol_options:
            envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
              "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
              explicit_http_config:
                http2_protocol_options:
                  max_concurrent_streams: 100
                  initial_stream_window_size: 65536
                  initial_connection_window_size: 1048576
Troubleshooting Guide
Common Issues and Solutions
| Issue | Symptoms | Root Cause | Solution |
| --- | --- | --- | --- |
| SVID Not Issued | "no identity issued" errors | Workload not registered | Check workload registration and selectors |
| mTLS Handshake Failure | tls: bad certificate | Certificate validation failed | Verify trust bundle distribution |
| Policy Denial | 403 Forbidden | Authorization policy mismatch | Review OPA logs and policy rules |
| High Latency | Slow response times | Policy evaluation overhead | Optimize policy rules, enable caching |
| Memory Pressure | OOM kills | Large policy bundles | Implement policy sharding |
Debug Commands
# Check SPIRE Server health
kubectl exec -n spire spire-server-0 -- \
/opt/spire/bin/spire-server healthcheck
# List registered workloads
kubectl exec -n spire spire-server-0 -- \
/opt/spire/bin/spire-server entry show
# Debug workload attestation (assumes the spire-agent binary is available in the pod image)
kubectl exec -n production frontend-pod -- \
/opt/spire/bin/spire-agent api fetch x509 \
-socketPath /spiffe-workload-api/spire-agent.sock
# Check Envoy configuration
kubectl exec -n production frontend-pod -c envoy -- \
curl -s localhost:15000/config_dump | jq .
# Validate OPA policies
kubectl exec -n production opa-pod -- \
opa test /policies
Security Best Practices
Defense in Depth Strategy
graph TB
subgraph "Security Layers"
L1[Network Security]
L2[Transport Security]
L3[Application Security]
L4[Data Security]
L5[Operational Security]
end
subgraph "Controls"
C1[Firewalls & Network Policies]
C2[mTLS & Encryption]
C3[Authentication & Authorization]
C4[Encryption at Rest]
C5[Audit & Monitoring]
end
L1 --> C1
L2 --> C2
L3 --> C3
L4 --> C4
L5 --> C5
style L2 fill:#f96,stroke:#333,stroke-width:2px
style C2 fill:#9f9,stroke:#333,stroke-width:2px
Security Checklist
- Enable mTLS for all service communication
- Enforce strict workload identity verification
- Configure least-privilege authorization policies
- Enable comprehensive audit logging
- Implement rate limiting and circuit breaking
- Scan container images for vulnerabilities on a regular schedule
- Automate certificate rotation with short-lived SVIDs (TTL < 24 hours)
- Segment the network with Kubernetes network policies
- Keep secrets in an encrypted secrets manager
- Run regular security audits and penetration tests
Conclusion
Implementing a secure service mesh with SPIFFE/SPIRE provides a robust foundation for zero-trust security in microservices architectures. The combination of cryptographic workload identity, policy-based authorization, and comprehensive observability creates a defense-in-depth strategy that significantly enhances your security posture.
Key takeaways:
- Identity-First Security: Every workload has a cryptographically verifiable identity
- Policy as Code: Authorization rules are version-controlled and auditable
- Automated Security: Certificate rotation and policy updates happen automatically
- Observable Security: Rich metrics and logs provide security visibility
- Scalable Architecture: Designed for high availability and performance
By following this implementation guide and adapting it to your specific requirements, you can build a production-ready secure service mesh that provides both strong security guarantees and operational flexibility.