EKS Cluster Disaster Recovery Using Velero: Best Practices
Updated: Nov 5
Introduction
Velero is a robust tool for Kubernetes disaster recovery, enabling users to back up, migrate, and restore applications and persistent volumes. This section provides guidance on using Velero as a disaster recovery strategy within an Amazon EKS cluster.
Objectives
The primary objectives of implementing Velero for disaster recovery are as follows:
Efficient Backup Strategies: Leverage Velero to create periodic backups of your EKS cluster resources, ensuring minimal data loss in case of a disaster.
Automated Scheduling: Utilize Velero schedules to automate the backup process, reducing manual intervention and ensuring regular snapshots.
Seamless Restore Operations: Develop clear restore strategies using Velero manifests, allowing for a quick and efficient recovery process.
Considerations
Backup Frequency: Determine an appropriate backup frequency based on the criticality of your applications and data.
Retention Policies: Define retention policies for your backups to manage storage costs effectively.
Backup and restore workflow
Velero consists of two components:
A Velero server pod that runs in your Amazon EKS cluster
A command-line client (Velero CLI) that runs locally
Whenever we issue a backup against an Amazon EKS cluster, Velero performs a backup of cluster resources in the following way:
The Velero CLI makes a call to the Kubernetes API server to create a Backup object (an instance of Velero's Backup custom resource definition).
The backup controller:
Checks the scope of the Backup object, that is, whether any filters were set.
Queries the API server for the resources that need a backup.
Compresses the retrieved Kubernetes objects into a .tar file and saves it in Amazon S3.
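To make the backup flow above concrete, here is a minimal sketch of the kind of Backup object the CLI creates, written to a file from the shell. The object name, the my-app namespace, the velero install namespace, and the default storage location name are all illustrative assumptions, not values taken from this article.

```shell
# Write a hypothetical on-demand Backup object to backup.yaml.
# Every name below (my-app-backup, my-app, default) is a placeholder.
cat > backup.yaml <<'EOF'
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: my-app-backup
  namespace: velero          # namespace where the Velero server runs (assumed)
spec:
  includedNamespaces:        # the filter whose scope the backup controller checks
    - my-app
  storageLocation: default   # backup storage location backed by the S3 bucket (assumed name)
  snapshotVolumes: true      # snapshot persistent volumes in scope
  ttl: 720h0m0s              # expire the backup after 30 days
EOF
```

Applying this file with kubectl apply -f backup.yaml should have much the same effect as running velero backup create my-app-backup --include-namespaces my-app, since in both cases the backup controller picks up the new object and does the actual work.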
Similarly, whenever we issue a restore operation:
The Velero CLI makes a call to the Kubernetes API server to create a Restore object that will restore from an existing backup.
The restore controller:
Validates the Restore object.
Makes a call to Amazon S3 to retrieve backup files.
Initiates the restore operation.
Velero also performs backup and restore of any persistent volume in scope:
If you are using Amazon Elastic Block Store (Amazon EBS), Velero will create Amazon EBS snapshots of persistent volumes in scope.
For any other volume type (except hostPath), use Velero’s Restic integration to take file-level backups of the contents of your volumes. At the time of writing, Restic is in Beta, and therefore not recommended for production-grade backups.
Steps
1. Velero Installation.
Follow the official guide to complete the Velero installation. The guide also outlines the resources that must be created before configuring Velero.
Alternatively, you can install Velero with Helm. The chart's configurable values are listed at https://github.com/vmware-tanzu/helm-charts/blob/main/charts/velero/values.yaml. Remember to create the required AWS resources before installing.
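A Helm-based installation could look roughly like the sketch below. It assumes you have already prepared a values file (bucket, region, IAM role, AWS plugin) based on the chart's values.yaml; the function name, the default file name, and the velero namespace are illustrative choices.

```shell
# Sketch: install Velero from the vmware-tanzu Helm chart.
# Assumes a values file you have prepared yourself; nothing here creates AWS resources.
install_velero_with_helm() {
  values_file="${1:-values.yaml}"   # hypothetical path to your customized values

  helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
  helm repo update
  helm install velero vmware-tanzu/velero \
    --namespace velero \
    --create-namespace \
    --values "$values_file"
}
```

Call it as install_velero_with_helm my-values.yaml once the S3 bucket and IAM role exist.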
2. Check resource creation.
After installation and configuration succeed, verify that all required resources (IAM role, S3 bucket) were created and that the Velero pod is running correctly.
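One way to run these checks from a terminal is sketched below. It assumes Velero was installed into the velero namespace and that kubectl, the Velero CLI, and the AWS CLI are configured for the right cluster and account; the function name and its two arguments are hypothetical.

```shell
# Sketch: verify the pieces Velero depends on.
# $1 = S3 bucket name, $2 = IAM role name (both supplied by you).
check_velero_install() {
  bucket="$1"
  role_name="$2"

  # The Velero server pod should be in the Running state.
  kubectl get pods --namespace velero

  # The backup storage location should be reported as available.
  velero backup-location get

  # Exits non-zero if the bucket is missing or not accessible.
  aws s3api head-bucket --bucket "$bucket"

  # Prints the role definition if the IAM role exists.
  aws iam get-role --role-name "$role_name"
}
```

For example: check_velero_install my-velero-bucket my-velero-role, with both names replaced by your own.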
Running velero --help prints a list of all the available Velero verbs.
3. Schedule Backups.
Create a Velero schedule manifest (schedule.yaml) to define the backup frequency and included namespaces. Example:
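A minimal sketch of such a manifest, written out from the shell. The schedule name, cron expression, namespace list, and retention period are illustrative assumptions rather than recommended values; choose them according to the backup frequency and retention considerations described earlier.

```shell
# Write a hypothetical schedule.yaml; all names and values are placeholders.
cat > schedule.yaml <<'EOF'
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero          # namespace where the Velero server runs (assumed)
spec:
  schedule: "0 2 * * *"      # cron expression: once a day at 02:00
  template:                  # backup settings applied to every scheduled run
    includedNamespaces:
      - my-app
    snapshotVolumes: true
    ttl: 168h0m0s            # keep each backup for 7 days
EOF
```

Apply it with kubectl apply -f schedule.yaml, then confirm it was registered with velero schedule get.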
4. Restore from Backup.
In the event of a disaster, use a Velero restore manifest (restore.yaml) to initiate the recovery process. Example:
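A minimal sketch of such a manifest, written out from the shell. The restore name, the namespace, and especially the backup name are placeholders; the real backup name should be taken from the output of velero backup get.

```shell
# Write a hypothetical restore.yaml; all names and values are placeholders.
cat > restore.yaml <<'EOF'
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: my-app-restore
  namespace: velero          # namespace where the Velero server runs (assumed)
spec:
  backupName: daily-backup-20240101020000   # placeholder; pick one from `velero backup get`
  includedNamespaces:
    - my-app
  restorePVs: true           # also restore persistent volumes from their snapshots
EOF
```

Apply it with kubectl apply -f restore.yaml and follow progress with velero restore describe my-app-restore.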
5. Validation.
Regularly validate your disaster recovery strategy by simulating restore operations in a non-production environment.
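A restore drill could be scripted along the lines of the sketch below. It assumes the CLI is pointed at a non-production cluster whose Velero installation can read the same backup location; the function name, its arguments, and the dr-test- prefix are hypothetical.

```shell
# Sketch: restore a backup into a different namespace and inspect the result.
# $1 = backup name, $2 = original namespace, $3 = scratch namespace to restore into.
simulate_restore() {
  backup_name="$1"
  source_ns="$2"
  target_ns="$3"
  restore_name="dr-test-${backup_name}"

  # Remap the namespace so the drill does not collide with the original workload.
  velero restore create "$restore_name" \
    --from-backup "$backup_name" \
    --namespace-mappings "${source_ns}:${target_ns}" \
    --wait

  # Review warnings and errors, then check what actually came back.
  velero restore describe "$restore_name"
  kubectl get all --namespace "$target_ns"
}
```

For example: simulate_restore daily-backup-20240101020000 my-app my-app-drill. Recording how long the drill takes gives a rough sense of the recovery time to expect in a real incident.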
Martín Carletti
Cloud Engineer
Teracloud
Fabricio Blas
Cloud Engineer
Teracloud