Mastering ETCD Backup and Restore in Kubernetes

ETCD is the backbone of Kubernetes: it stores all cluster state. Knowing how to back up and restore ETCD is essential for any Kubernetes administrator. In this guide, we’ll walk through taking an ETCD snapshot and restoring from it after a disaster.

Prerequisites

A running Kubernetes cluster (preferably set up with kubeadm)
SSH access to the control plane node
kubectl and etcdctl installed on the control plane node

Step 1: Cluster Health Check

First, let’s ensure our cluster is healthy:

kubectl get nodes

All nodes should be in the “Ready” state.
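Checking the STATUS column by eye gets tedious on larger clusters. Here is a minimal sketch of a filter that lists every node whose status is anything other than exactly Ready. The function name not_ready_nodes is hypothetical, and the sketch assumes the default column layout of kubectl get nodes --no-headers (name first, status second):

```shell
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and prints the name of every node whose STATUS column is not exactly "Ready".
# This also flags cordoned nodes such as "Ready,SchedulingDisabled".
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage on a live cluster (prints nothing when every node is Ready):
#   kubectl get nodes --no-headers | not_ready_nodes
```

An empty result means it is reasonable to continue with the backup.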

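Step 2: Identify ETCD Version

To find the ETCD version, inspect the image tag of the ETCD static pod. On kubeadm clusters the pod is named etcd-<node-name>; the command below assumes the control plane node is called controlplane:

kubectl describe pod etcd-controlplane -n kube-system | grep Image:

Note the version in the image tag so you can use a matching etcdctl release.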
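To compare versions in a script rather than by eye, the version can be pulled out of the image reference. This is a minimal sketch: the function name etcd_version_from_image is hypothetical, and it assumes a tag of the form 3.5.12-0 after the last colon, which will not hold for images pinned by digest:

```shell
# Hypothetical helper: extracts the etcd version from an image reference
# such as "registry.k8s.io/etcd:3.5.12-0" (tag format assumed, not guaranteed).
etcd_version_from_image() (
  image=$1
  tag=${image##*:}      # keep everything after the last ':'  -> 3.5.12-0
  version=${tag%%-*}    # drop the build suffix after '-'     -> 3.5.12
  printf '%s\n' "${version#v}"   # drop a leading 'v' if present
)

# Usage on a live cluster (pod name assumes a node called "controlplane"):
#   image=$(kubectl get pod etcd-controlplane -n kube-system \
#     -o jsonpath='{.spec.containers[0].image}')
#   etcd_version_from_image "$image"
#   etcdctl version   # compare with the client version
```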

Step 3: Create a Sample Deployment

Let’s create a deployment to demonstrate the backup and restore process:

kubectl create deployment nginx --image=nginx --replicas=2

Verify the deployment:

kubectl get deployments

Step 4: Identify ETCD Endpoint and Certificates

We need to locate the ETCD endpoint, the server certificate and key, and the CA certificate. On a kubeadm cluster these are set as flags in the ETCD static pod manifest:

sudo cat /etc/kubernetes/manifests/etcd.yaml

Look for:

--listen-client-urls for the endpoint
--cert-file for the server certificate
--key-file for the server private key
--trusted-ca-file for the CA certificate
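Rather than scrolling through the manifest, the flags can be extracted with a small filter. A minimal sketch follows; the function name etcd_flag is hypothetical, and it assumes each flag appears as its own "- --flag=value" list item, as it does in kubeadm-generated manifests:

```shell
# Hypothetical helper: prints the value of one "--flag=value" argument from an
# etcd static pod manifest supplied on stdin.
etcd_flag() {
  sed -n "s/^[[:space:]]*-[[:space:]]*--$1=//p"
}

# Usage on the control plane node:
#   sudo cat /etc/kubernetes/manifests/etcd.yaml | etcd_flag listen-client-urls
#   sudo cat /etc/kubernetes/manifests/etcd.yaml | etcd_flag trusted-ca-file
```

Note that --listen-client-urls may hold several comma-separated URLs; any one of them that is reachable from where etcdctl runs can serve as the endpoint.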

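Step 5: Set ETCDCTL API Version

Set the ETCDCTL API version to 3, which the snapshot commands belong to. etcdctl 3.4 and later already default to version 3, so this mainly matters for older clients:

export ETCDCTL_API=3

By default sudo does not pass exported variables through to the command it runs, which is why the commands in the following steps also set ETCDCTL_API=3 inline.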

Step 6: Take ETCD Snapshot

Now, let’s take a snapshot of ETCD:

sudo ETCDCTL_API=3 etcdctl snapshot save /opt/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

Verify the snapshot:

sudo ETCDCTL_API=3 etcdctl snapshot status /opt/etcd-snapshot.db
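For repeated use, the save command can be wrapped in a function that also checks the resulting file. This is a sketch under assumptions, not a hardened backup tool: take_snapshot and the ETCDCTL override variable are hypothetical names, and the endpoint and certificate paths are simply the ones used in this guide.

```shell
ETCDCTL=${ETCDCTL:-etcdctl}   # hypothetical override hook, handy for testing

# Saves a snapshot to the path given as the first argument and fails if the
# resulting file is missing or empty.
take_snapshot() (
  dest=$1
  ETCDCTL_API=3 "$ETCDCTL" snapshot save "$dest" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key || exit 1

  # A missing or zero-byte file means the save did not really succeed.
  if [ ! -s "$dest" ]; then
    echo "snapshot $dest is missing or empty" >&2
    exit 1
  fi
  echo "snapshot written to $dest"
)

# Usage (as root on the control plane node):
#   take_snapshot "/opt/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db"
```

Putting a timestamp in the file name keeps earlier snapshots from being overwritten.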

Step 7: Simulate Disaster

Let’s delete our nginx deployment to simulate a disaster:

kubectl delete deployment nginx

Verify that the deployment is gone:

kubectl get deployments

Step 8: Restore ETCD from Backup

Now, let’s restore ETCD from our backup:

Stop the kubelet service:
```
sudo systemctl stop kubelet
```
Stop the container runtime (e.g., containerd):
```
sudo systemctl stop containerd
```

Restore the snapshot:

sudo ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd-snapshot.db \
  --data-dir /var/lib/etcd-from-backup
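The restore is safest when --data-dir points at a directory that does not exist yet or is empty, so that data from an earlier attempt is never mixed in. A small pre-flight check can confirm that before the restore runs; the function name restore_target_is_free is hypothetical, and the check needs enough privileges to read the directory:

```shell
# Hypothetical pre-flight check: succeeds only if the restore target is absent
# or is an empty directory.
restore_target_is_free() {
  if [ ! -e "$1" ]; then
    return 0
  fi
  [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Usage before running the restore:
#   restore_target_is_free /var/lib/etcd-from-backup \
#     || echo "target is in use, pick another --data-dir" >&2
```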

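Update the ETCD pod manifest so that its hostPath volume points at the new data directory. Inside the container the data is still mounted at /var/lib/etcd, so the --data-dir flag stays unchanged. The pattern is anchored at the end of the line so that running the command twice does not append the suffix again:

sudo sed -i 's|path: /var/lib/etcd$|path: /var/lib/etcd-from-backup|' /etc/kubernetes/manifests/etcd.yaml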
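A slightly more defensive variant of that manifest edit keeps a copy of the original file and can be rerun safely. This is a sketch: the function name point_manifest_at and its three arguments are hypothetical, and it assumes the data volume is declared as "path: /var/lib/etcd" at the end of a line, as in kubeadm-generated manifests.

```shell
# Hypothetical helper: rewrites the etcd data hostPath in a manifest.
# Arguments: manifest path, new data directory, backup file path.
point_manifest_at() (
  manifest=$1
  new_dir=$2
  backup=$3

  # Keep the backup outside /etc/kubernetes/manifests: the kubelet may treat
  # extra files in that directory as static pod definitions.
  cp "$manifest" "$backup" || exit 1

  tmp=$(mktemp) || exit 1
  # The "$" anchor leaves an already-updated path alone.
  sed "s|path: /var/lib/etcd\$|path: $new_dir|" "$manifest" > "$tmp" || exit 1
  # Overwrite in place so the file keeps its owner and permissions.
  cat "$tmp" > "$manifest" && rm -f "$tmp"
)

# Usage (as root):
#   point_manifest_at /etc/kubernetes/manifests/etcd.yaml \
#     /var/lib/etcd-from-backup /root/etcd.yaml.orig
```

If the restored cluster misbehaves, copying the backup file back over the manifest returns ETCD to the original data directory.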

Restart the container runtime first, then the kubelet:

sudo systemctl start containerd
sudo systemctl start kubelet

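Step 9: Verify Restoration

Give the control plane a few minutes to come back up (kubectl may report connection errors until the API server is ready), then check whether the nginx deployment is back:

kubectl get deployments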

You should see the nginx deployment with 2 replicas, just as it was before we deleted it.
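Instead of rerunning the command by hand while the control plane restarts, the check can be polled. A minimal sketch of a generic retry loop follows; retry_until is a hypothetical name, and the attempt count and delay in the usage line are arbitrary values to adjust for your cluster:

```shell
# Hypothetical helper: retries a command until it succeeds or attempts run out.
# Arguments: max attempts, delay in seconds between attempts, then the command.
retry_until() (
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      exit 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  exit 1
)

# Usage: poll for up to about 5 minutes until the deployment is visible again.
#   retry_until 30 10 kubectl get deployment nginx && kubectl get deployments
```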

Key Takeaways

Regular Backups: Implement a regular backup schedule for ETCD. The frequency depends on your recovery point objective (RPO).
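Version Compatibility: Use an etcdctl release that matches your ETCD server version. From ETCD 3.5 onward, etcdctl snapshot restore is deprecated in favor of etcdutl snapshot restore, so check which tool your version expects.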
Secure Your Backups: ETCD backups contain sensitive cluster data. Ensure they are stored securely and encrypted.
Test Your Backups: Regularly test the restore process to ensure your backups are valid and your team is familiar with the procedure.
Document the Process: Keep detailed documentation of your backup and restore process, including the location of certificates and endpoints.
Cluster-Specific Details: Be aware that certificate paths and endpoints may vary depending on your cluster setup.
Minimize Downtime: Practice the restore process to minimize downtime during a real disaster recovery scenario.
Backup Metadata: Along with ETCD data, back up important metadata such as the ETCD version and cluster configuration.
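A regular backup schedule also needs a retention policy, or the snapshot directory grows without bound. Here is a minimal sketch of one; prune_snapshots is a hypothetical name, and it assumes snapshots are named etcd-snapshot-YYYYmmdd-HHMMSS.db so that sorting by name matches sorting by age:

```shell
# Hypothetical retention helper: keeps the newest N snapshots in a directory.
# Arguments: snapshot directory, number of snapshots to keep.
prune_snapshots() (
  dir=$1
  keep=$2
  # Newest first by name, then everything past the first $keep entries is old.
  ls -1 "$dir"/etcd-snapshot-*.db 2>/dev/null | sort -r | tail -n "+$((keep + 1))" |
  while IFS= read -r old; do
    rm -f -- "$old"
    echo "removed $old"
  done
)

# Usage, for example from a daily cron job run as root:
#   prune_snapshots /opt 7
```

How many snapshots to keep follows from the same RPO and storage considerations as the backup frequency.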

Conclusion

Mastering ETCD backup and restore is crucial for maintaining a resilient Kubernetes cluster. By following this guide, you’ve learned how to take snapshots of your ETCD data and restore your cluster to a previous state. Remember, regular practice and documentation of this process are key to ensuring you can quickly recover from potential disasters.

As you continue to work with Kubernetes, consider integrating these backup and restore procedures into your regular maintenance routines and disaster recovery plans. With proper ETCD management, you can ensure the reliability and stability of your Kubernetes clusters, even in the face of unexpected issues.