The right tool to back up your k8s cluster!



Hey folks, one of the most challenging tasks for anyone running Kubernetes in production is taking backups and having a disaster recovery plan to roll back whenever it is needed.
There are many methods and techniques to do so; I am going to list several ways to implement backups and disaster recovery.
First of all, the question that needs to be asked is:
What are the most important pieces in a Kubernetes cluster that need to be backed up?
Since every resource in k8s is treated as an API object, I would say everything is important, unless you have limited storage for your backups!
In that case, you may want to back up the data of the microservice application(s) running on your k8s cluster, plus the etcd* components (etcd and etcd-events).
These are, in my opinion, the most critical pieces. But so far we have only talked about backups, and what is their value if we cannot restore them?
This is what I will try to cover later in the article.
Let's start with the manual way of taking backups of our Kubernetes cluster.

I am assuming here that there is only one master. If you have several, you can repeat the same process on all the master nodes.
In order to back up etcd, we need the certificates.
$ sudo cp -r /etc/kubernetes/pki backup/
$ sudo docker run --rm -v $(pwd)/backup:/backup --network host -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd --env ETCDCTL_API=3 k8s.gcr.io/etcd-amd64:3.2.18 \
etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key snapshot save /backup/etcd.db
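If you want, you can sanity-check the snapshot you just wrote with the same image (--write-out=table just makes the output readable):
$ sudo docker run --rm -v $(pwd)/backup:/backup --env ETCDCTL_API=3 k8s.gcr.io/etcd-amd64:3.2.18 \
etcdctl snapshot status /backup/etcd.db --write-out=table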
To restore the etcd backup, you perform the inverse of the backup you just took: replace save with restore,
and then move the restored data back to /var/lib/etcd/.
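Here is a minimal sketch of that on a single master, with the snapshot still in ./backup (the temporary data directory and the etcd image version are assumptions on my side; on a multi-member cluster you would also pass --name, --initial-cluster and --initial-advertise-peer-urls matching your etcd configuration):
$ sudo docker run --rm -v $(pwd)/backup:/backup -v /var/lib:/var/lib --env ETCDCTL_API=3 k8s.gcr.io/etcd-amd64:3.2.18 \
etcdctl snapshot restore /backup/etcd.db --data-dir /var/lib/etcd-restore
# stop etcd first (e.g. move the etcd static pod manifest out of /etc/kubernetes/manifests),
# then swap the data directories and let etcd come back up
$ sudo mv /var/lib/etcd /var/lib/etcd.old
$ sudo mv /var/lib/etcd-restore /var/lib/etcd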
To automate this process, you can create a CronJob in Kubernetes. The documentation covers this well [1] and [2].
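For reference, here is a minimal sketch of such a CronJob, applied via a heredoc. Everything specific in it is an assumption on my side: the kube-system namespace, the six-hour schedule, the /opt/etcd-backup hostPath, the image version, and the batch/v1beta1 API version (use batch/v1 on recent clusters). It also assumes a kubeadm-style master where the certificates live under /etc/kubernetes/pki/etcd.
$ kubectl apply -f - <<'EOF'
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"       # every six hours (assumption)
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true     # reach etcd on https://127.0.0.1:2379
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/master: ""
          tolerations:
          - key: node-role.kubernetes.io/master
            effect: NoSchedule
          containers:
          - name: etcd-backup
            image: k8s.gcr.io/etcd-amd64:3.2.18
            env:
            - name: ETCDCTL_API
              value: "3"
            command:
            - etcdctl
            - --endpoints=https://127.0.0.1:2379
            - --cacert=/etc/kubernetes/pki/etcd/ca.crt
            - --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt
            - --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
            - snapshot
            - save
            - /backup/etcd.db
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /opt/etcd-backup
              type: DirectoryOrCreate
EOF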
However, the tool I like the most for backup and disaster recovery, and the one I recently started contributing to, is Velero.
It is written in Go and developed by the folks at Heptio. In short, Velero consists of a server that runs in your cluster and a CLI that runs locally.
What makes it very powerful is that it supports many cloud providers and, of course, on-premise infrastructures. It gives you the ability to copy your cluster resources to other clusters and to replicate your environment.
You can install Velero by downloading the latest release, or by cloning the code and checking out the tag you want to work with.
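For instance, assuming v1.0.0 on Linux (pick the release and platform that match your setup):
$ wget https://github.com/heptio/velero/releases/download/v1.0.0/velero-v1.0.0-linux-amd64.tar.gz
$ tar -xzvf velero-v1.0.0-linux-amd64.tar.gz
$ sudo mv velero-v1.0.0-linux-amd64/velero /usr/local/bin/
Or, from source:
$ git clone https://github.com/heptio/velero.git && cd velero && git checkout v1.0.0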
To set up the server, you need to start the local storage service by applying a few YAML files.
$ kubectl apply -f examples/common/00-prereqs.yaml
$ kubectl apply -f examples/minio/
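To check that everything came up, look at the pods in the velero namespace (that is the namespace 00-prereqs.yaml creates in the releases I have used):
$ kubectl get pods -n velero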
That's it! Pretty easy. Now simulate a workload by creating any example (e.g., an nginx deployment), then try to back it up.
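For instance, a throwaway nginx deployment (the namespace and names are placeholders I picked; kubectl create deployment labels it with app=nginx, which you can reuse as the selector below):
$ kubectl create namespace nginx-example
$ kubectl -n nginx-example create deployment nginx --image=nginx
$ kubectl -n nginx-example get pods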
$ velero backup create test-backup --selector label=value
Or, if you would like to back up everything in one namespace of your cluster:
$ velero backup create test-backup --include-namespaces examples_namespace
Otherwise, if you want to back up all objects except those matching the label label=ignore:
$ velero backup create test-backup --selector 'label notin (ignore)'
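Whichever variant you go with, you can watch the backup and inspect it the same way (the name matches the examples above):
$ velero backup get
$ velero backup describe test-backup --details
$ velero backup logs test-backup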
You can also make scheduled backups with velero schedule create; its --schedule flag takes a cron expression.
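For example, a daily backup at 3 a.m. (the schedule name and the cron expression are just my own picks):
$ velero schedule create daily-backup --schedule="0 3 * * *"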
Then, if you want to restore a backup, run velero restore create --from-backup test-backup.
As with kubectl, you can describe your restores: first get the name of the restore, then describe it.
$ velero restore get 
$ velero restore describe restore_name
Finally, you can delete the backups you took by running velero backup delete test-backup.

Thanks for passing by, let me know if I missed anything. o/