Mastering ETCD Backup and Restore in Kubernetes

Published: at 08:30 AM

ETCD is the backbone of Kubernetes: it is the key-value store that holds all cluster state. Knowing how to back up and restore ETCD is crucial for any Kubernetes administrator. In this guide, we’ll walk through taking an ETCD snapshot and restoring from it after a disaster.

Prerequisites

Step 1: Cluster Health Check

First, let’s ensure our cluster is healthy:

kubectl get nodes

All nodes should be in the “Ready” state.
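
If you want to script this check, a minimal sketch follows. The helper name all_nodes_ready is hypothetical (it is not part of kubectl), and it assumes the STATUS value is the second column of kubectl get nodes --no-headers output.

```shell
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin.
# Exits 0 only if at least one node is listed and every STATUS (column 2)
# is exactly "Ready"; "NotReady" or "Ready,SchedulingDisabled" count as failures.
all_nodes_ready() {
  awk '$2 != "Ready" { bad = 1 } END { if (bad || NR == 0) exit 1 }'
}

# Usage (assumes kubectl is already configured for the cluster):
#   kubectl get nodes --no-headers | all_nodes_ready && echo "all nodes Ready"
```

Treating a cordoned node as a failure is a deliberately strict choice; relax the comparison if cordoned nodes are acceptable before a backup.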

Step 2: Identify ETCD Version

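To find the ETCD version, check the image tag of the etcd static pod. The pod is named etcd-<node-name>, so adjust etcd-controlplane if your control plane node has a different name: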

kubectl describe pod etcd-controlplane -n kube-system | grep Image:

Note the version: the etcdctl client you use for backup and restore should match it.
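
To pull just the version number out of the image reference, something like the sketch below may help. The helper name and the example image string are illustrative assumptions; the function only handles tag-style references (name:tag), not digests.

```shell
# Hypothetical helper: extract the etcd version from an image reference such as
# "registry.k8s.io/etcd:3.5.10-0" (tag form only; digest references are not handled).
etcd_version_from_image() {
  local image="$1" tag
  tag="${image##*:}"          # keep what follows the last ':'  (e.g. "3.5.10-0")
  printf '%s\n' "${tag%%-*}"  # drop the build suffix after '-' (e.g. "3.5.10")
}

# Usage: compare the result by eye with what `etcdctl version` reports.
#   etcd_version_from_image "registry.k8s.io/etcd:3.5.10-0"
```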

Step 3: Create a Sample Deployment

Let’s create a deployment to demonstrate the backup and restore process:

kubectl create deployment nginx --image=nginx --replicas=2

Verify the deployment:

kubectl get deployments

Step 4: Identify ETCD Endpoint and Certificates

We need to locate the ETCD endpoint, server certificates, and CA certificates. These are typically found in the ETCD pod manifest:

sudo cat /etc/kubernetes/manifests/etcd.yaml

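Look for these flags in the etcd container’s command: --listen-client-urls (the client endpoint), --cert-file and --key-file (the server certificate and key), and --trusted-ca-file (the CA certificate).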
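
The endpoint and certificate paths are passed to etcd as command-line flags in that manifest, so they can also be read out with a small filter instead of by eye. This is a sketch only: etcd_flag is a hypothetical helper, and it assumes each flag appears in the manifest as a list item of the form "- --name=value".

```shell
# Hypothetical helper: print the value of one etcd command-line flag from a
# static pod manifest. $1 = flag name without leading dashes, $2 = manifest path.
etcd_flag() {
  sed -n "s/^[[:space:]]*-[[:space:]]*--$1=//p" "$2" | head -n 1
}

# Usage on a control plane node (reading the manifest usually requires root):
#   etcd_flag listen-client-urls /etc/kubernetes/manifests/etcd.yaml
#   etcd_flag trusted-ca-file    /etc/kubernetes/manifests/etcd.yaml
#   etcd_flag cert-file          /etc/kubernetes/manifests/etcd.yaml
#   etcd_flag key-file           /etc/kubernetes/manifests/etcd.yaml
```

Because the pattern is anchored on the list dash, asking for cert-file does not accidentally return the value of --peer-cert-file.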

Step 5: Set ETCDCTL API Version

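Set the ETCDCTL API version to 3. By default, sudo does not pass exported variables through to the command it runs, which is why the commands in the following steps also set ETCDCTL_API=3 inline: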

export ETCDCTL_API=3

Step 6: Take ETCD Snapshot

Now, let’s take a snapshot of ETCD:

sudo ETCDCTL_API=3 etcdctl snapshot save /opt/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

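Verify the snapshot. The status output reports the snapshot’s hash, revision, total keys, and total size:

sudo ETCDCTL_API=3 etcdctl snapshot status /opt/etcd-snapshot.db --write-out=table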
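
For repeated backups, writing to a fixed path overwrites the previous snapshot. The sketch below wraps the same save and status commands in two hypothetical helpers (snapshot_path and take_snapshot are names invented here) and gives each snapshot a UTC timestamp. It assumes the endpoint and certificate paths shown in this guide and that it runs as root.

```shell
# Hypothetical helper: build a timestamped snapshot path under a backup directory,
# e.g. /opt/backups/etcd-snapshot-20240101T083000Z.db
snapshot_path() {
  local dir="$1" stamp
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  printf '%s/etcd-snapshot-%s.db\n' "${dir%/}" "$stamp"
}

# Hypothetical helper: save a snapshot to $1, then print its status.
# Uses the endpoint and certificate paths from this guide; adjust for your cluster.
take_snapshot() {
  local target="$1"
  ETCDCTL_API=3 etcdctl snapshot save "$target" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key &&
    ETCDCTL_API=3 etcdctl snapshot status "$target" --write-out=table
}

# Usage (as root; /opt/backups is an example directory):
#   take_snapshot "$(snapshot_path /opt/backups)"
```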

Step 7: Simulate Disaster

Let’s delete our nginx deployment to simulate a disaster:

kubectl delete deployment nginx

Verify that the deployment is gone:

kubectl get deployments

Step 8: Restore ETCD from Backup

Now, let’s restore ETCD from our backup:

  1. Stop the kubelet service:

    sudo systemctl stop kubelet
  2. Stop the container runtime (e.g., containerd):

    sudo systemctl stop containerd
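  3. Restore the snapshot into a new data directory. On etcd 3.5 and later, etcdctl snapshot restore is deprecated in favor of etcdutl snapshot restore, which accepts the same --data-dir flag: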

    sudo ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd-snapshot.db \
      --data-dir /var/lib/etcd-from-backup
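  4. Update the ETCD pod manifest so that its hostPath volume points to the new data directory. The trailing $ in the pattern keeps the command from matching a path that has already been rewritten:

    sudo sed -i 's|path: /var/lib/etcd$|path: /var/lib/etcd-from-backup|' /etc/kubernetes/manifests/etcd.yaml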
  5. Restart the kubelet and container runtime:

    sudo systemctl start containerd
    sudo systemctl start kubelet
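
The manifest edit is the step most likely to go wrong when repeated, so it can be worth wrapping. The sketch below is one possible helper (repoint_etcd_data_dir is a hypothetical name). It rewrites only the hostPath line that ends exactly in /var/lib/etcd, which makes a second run a no-op. It also keeps its temporary file outside the manifests directory, since the kubelet treats files there as pod manifests. It assumes the new directory path contains no | or & characters.

```shell
# Hypothetical helper: point the etcd hostPath volume at a restored data directory.
# $1 = manifest path, $2 = new data directory on the host.
repoint_etcd_data_dir() {
  local manifest="$1" new_dir="$2" tmp status
  tmp="$(mktemp)" || return 1
  # Match only "path: /var/lib/etcd" at end of line, so already-rewritten
  # paths and unrelated entries such as the pki mount are left alone.
  sed "s|path: /var/lib/etcd\$|path: ${new_dir}|" "$manifest" > "$tmp" &&
    cat "$tmp" > "$manifest"   # overwrite in place, keeping owner and mode
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Usage (as root, with the kubelet stopped as in the steps above):
#   repoint_etcd_data_dir /etc/kubernetes/manifests/etcd.yaml /var/lib/etcd-from-backup
```

Only the host-side path changes; the container still mounts the volume at its original location, so the --data-dir flag in the manifest stays as it is.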

Step 9: Verify Restoration

After a few minutes, once the control plane pods are running again, check whether the nginx deployment is back:

kubectl get deployments

You should see the nginx deployment with 2 replicas, just as it was before we deleted it.
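
Rather than waiting a fixed time, you can poll until the API server answers. The sketch below is a generic retry loop; retry_until is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary example values.

```shell
# Hypothetical helper: run a command until it succeeds or attempts run out.
# $1 = maximum attempts, $2 = seconds to sleep between attempts, rest = command.
retry_until() {
  local attempts="$1" delay="$2"
  shift 2
  while [ "$attempts" -gt 0 ]; do
    "$@" && return 0
    attempts=$((attempts - 1))
    if [ "$attempts" -gt 0 ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage: try for up to 5 minutes (30 attempts, 10 seconds apart).
#   retry_until 30 10 kubectl get deployment nginx
```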

Key Takeaways

  1. Regular Backups: Implement a regular backup schedule for ETCD. The frequency depends on your recovery point objective (RPO).

  2. Version Compatibility: Use an etcdctl version that matches your ETCD server version.

  3. Secure Your Backups: ETCD backups contain sensitive cluster data. Ensure they are stored securely and encrypted.

  4. Test Your Backups: Regularly test the restore process to ensure your backups are valid and your team is familiar with the procedure.

  5. Document the Process: Keep detailed documentation of your backup and restore process, including the location of certificates and endpoints.

  6. Cluster-Specific Details: Be aware that certificate paths and endpoints may vary depending on your cluster setup.

  7. Minimize Downtime: Practice the restore process to minimize downtime during a real disaster recovery scenario.

  8. Backup Metadata: Along with ETCD data, back up important metadata such as the ETCD version and cluster configuration.
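
A regular backup schedule also needs a retention policy, or the backup directory grows without bound. The sketch below keeps only the newest N snapshots. It assumes snapshots are named etcd-snapshot-<UTC timestamp>.db (for example etcd-snapshot-20240101T083000Z.db) so that alphabetical order equals age. That naming scheme and the helper name prune_snapshots are assumptions of this example, not an ETCD convention.

```shell
# Hypothetical helper: delete all but the newest $2 snapshots in directory $1.
# Relies on timestamped names so that the shell's sorted glob lists oldest first.
prune_snapshots() {
  local dir="$1" keep="$2"
  set -- "$dir"/etcd-snapshot-*.db
  # If nothing matched, the pattern is left as-is (or empty); nothing to prune.
  [ -e "$1" ] || return 0
  while [ "$#" -gt "$keep" ]; do
    rm -f -- "$1"   # oldest remaining snapshot
    shift
  done
}

# Usage (example directory and count): keep the 7 most recent snapshots.
#   prune_snapshots /opt/backups 7
```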

Conclusion

Mastering ETCD backup and restore is crucial for maintaining a resilient Kubernetes cluster. By following this guide, you’ve learned how to take snapshots of your ETCD data and restore your cluster to a previous state. Remember, regular practice and documentation of this process are key to ensuring you can quickly recover from potential disasters.

As you continue to work with Kubernetes, consider integrating these backup and restore procedures into your regular maintenance routines and disaster recovery plans. With proper ETCD management, you can ensure the reliability and stability of your Kubernetes clusters, even in the face of unexpected issues.