Mastering ETCD Backup and Restore in Kubernetes
ETCD is the backbone of Kubernetes, storing all cluster data. Understanding how to backup and restore ETCD is crucial for any Kubernetes administrator. In this guide, we’ll walk through the process of backing up ETCD and restoring it in case of a disaster.
Prerequisites
- A running Kubernetes cluster (preferably set up with kubeadm)
- SSH access to the control plane node
kubectl
andetcdctl
installed on the control plane node
Step 1: Cluster Health Check
First, let’s ensure our cluster is healthy:
kubectl get nodes
All nodes should be in the “Ready” state.
Step 2: Identify ETCD Version
To find the ETCD version:
kubectl describe pod etcd-controlplane -n kube-system | grep Image:
Note the version for compatibility purposes.
Step 3: Create a Sample Deployment
Let’s create a deployment to demonstrate the backup and restore process:
kubectl create deployment nginx --image=nginx --replicas=2
Verify the deployment:
kubectl get deployments
Step 4: Identify ETCD Endpoint and Certificates
We need to locate the ETCD endpoint, server certificates, and CA certificates. These are typically found in the ETCD pod manifest:
sudo cat /etc/kubernetes/manifests/etcd.yaml
Look for:
--listen-client-urls
for the endpoint--cert-file
for the server certificate--trusted-ca-file
for the CA certificate
Step 5: Set ETCDCTL API Version
Set the ETCDCTL API version:
export ETCDCTL_API=3
Step 6: Take ETCD Snapshot
Now, let’s take a snapshot of ETCD:
sudo ETCDCTL_API=3 etcdctl snapshot save /opt/etcd-snapshot.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
Verify the snapshot:
sudo ETCDCTL_API=3 etcdctl snapshot status /opt/etcd-snapshot.db
Step 7: Simulate Disaster
Let’s delete our nginx deployment to simulate a disaster:
kubectl delete deployment nginx
Verify that the deployment is gone:
kubectl get deployments
Step 8: Restore ETCD from Backup
Now, let’s restore ETCD from our backup:
-
Stop the kubelet service:
sudo systemctl stop kubelet
-
Stop the container runtime (e.g., containerd):
sudo systemctl stop containerd
-
Restore the snapshot:
sudo ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd-snapshot.db \ --data-dir /var/lib/etcd-from-backup
-
Update the ETCD pod manifest to use the new data directory:
sudo sed -i 's/path: \/var\/lib\/etcd/path: \/var\/lib\/etcd-from-backup/g' /etc/kubernetes/manifests/etcd.yaml
-
Restart the kubelet and container runtime:
sudo systemctl start containerd sudo systemctl start kubelet
Step 9: Verify Restoration
After a few minutes, check if the nginx deployment is back:
kubectl get deployments
You should see the nginx deployment with 2 replicas, just as it was before we deleted it.
Key Takeaways
-
Regular Backups: Implement a regular backup schedule for ETCD. The frequency depends on your recovery point objective (RPO).
-
Version Compatibility: Ensure that the
etcdctl
version matches your ETCD version for compatibility. -
Secure Your Backups: ETCD backups contain sensitive cluster data. Ensure they are stored securely and encrypted.
-
Test Your Backups: Regularly test the restore process to ensure your backups are valid and your team is familiar with the procedure.
-
Document the Process: Keep detailed documentation of your backup and restore process, including the location of certificates and endpoints.
-
Cluster-Specific Details: Be aware that certificate paths and endpoints may vary depending on your cluster setup.
-
Minimize Downtime: Practice the restore process to minimize downtime during a real disaster recovery scenario.
-
Backup Metadata: Along with ETCD data, backup important metadata like the ETCD version and cluster configuration.
Conclusion
Mastering ETCD backup and restore is crucial for maintaining a resilient Kubernetes cluster. By following this guide, you’ve learned how to take snapshots of your ETCD data and restore your cluster to a previous state. Remember, regular practice and documentation of this process are key to ensuring you can quickly recover from potential disasters.
As you continue to work with Kubernetes, consider integrating these backup and restore procedures into your regular maintenance routines and disaster recovery plans. With proper ETCD management, you can ensure the reliability and stability of your Kubernetes clusters, even in the face of unexpected issues.