ETCD Backup and Recovery
The ETCD backup plugin is a feature provided by UK8S for backing up the ETCD database. ETCD backups can effectively help you recover the cluster from failures, and the plugin settings let you configure scheduled backups, the backup storage target, and more.
1. First Operation
If you are prompted to save the certificate and key when installing the plugin for the first time, execute the following command on any one Master node, then refresh the page.
kubectl create secret generic uk8s-etcd-backup-cert -n kube-system --from-file /etc/kubernetes/ssl/etcd.pem --from-file /etc/kubernetes/ssl/etcd-key.pem
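To confirm that the secret was created successfully, you can check for it with kubectl (an optional verification step):
# The secret should be listed in the kube-system namespace
kubectl get secret uk8s-etcd-backup-cert -n kube-system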
2. Backup Operation
When ETCD backup is enabled, the backup system creates and retains backup data according to the configured settings, as described below:
- Choose the number of backups you wish to keep. The backup system uses this number as the upper limit on how many ETCD backups are retained in the storage space.
- Add a scheduled backup time. The schedule uses crontab syntax; see section 4, Explanation of crontab syntax, in this article.
- We have provided two storage mediums for backups: UFile and SSHServer.
- When choosing UFile, you need to create a corresponding UFile token and storage space. The UFile token must be granted the upload, download, delete, and file list permissions.
- If you use SSHServer, you need to fill in a specific host that is reachable over SSH, and ensure that the account you provide has read and write permissions on the backup directory on that host. A quick way to verify this is shown after this list.
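As a quick sanity check for the SSHServer option, you can verify that the configured account can write to the backup directory. The user, host, and directory below are placeholders; replace them with your own values:
# Replace backup-user, 10.23.1.100 and /data/etcd-backup with your own account, host and directory
ssh backup-user@10.23.1.100 'touch /data/etcd-backup/.write-test && rm -f /data/etcd-backup/.write-test && echo write OK'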
3. Recovery Operation
This article only discusses the recovery of ETCD. If you need to recover other K8S core components such as the APIServer or Controller Manager, please refer to UK8S core component fault recovery.
⚠️ For the ETCD recovery operation, it is recommended that you carefully read the documentation before proceeding.
When a UK8S cluster Master node is accidentally deleted, the recovery process is fairly involved and you need to contact UK8S technical support; for node faults that are not caused by deletion, the handling is relatively simple. Both cases are explained below.
- 3.1 etcd node is not deleted, but there is a fault in the node
- 3.1.1 etcd cluster is still available
- 3.1.2 etcd cluster is unavailable
- 3.2 Accidentally deleted etcd node
- 3.2.1 etcd cluster is still available
- 3.2.2 etcd cluster is unavailable
3.1 ETCD node is not deleted, but the node has a fault
3.1.1 ETCD cluster is still available
At this time, the number of faulty ETCD nodes is less than half of the total (the number of ETCD nodes in UK8S is set to 3 by default, so at most 1 ETCD node is faulty in this case). In this case, you can recover the cluster without using an ETCD backup. The operation steps are as follows:
- First, repair the faulty node. If you need to reinstall the operating system, it is recommended to install the CentOS 7.6 64-bit operating system, or contact technical support personnel to install the UK8S custom image;
- Log in to a healthy node (node 10.23.17.200 in this example) and perform the following operations to remove the faulty node from the ETCD cluster;
# Note that the following operations need to replace the relevant parameters to match your current cluster
# Configure ETCDCTL related parameters
export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/ssl/ca.pem
export ETCDCTL_CERT=/etc/kubernetes/ssl/etcd.pem
export ETCDCTL_KEY=/etc/kubernetes/ssl/etcd-key.pem
# Replace the IP address with your ETCD cluster node IP
export ETCDCTL_ENDPOINTS=10.23.17.200:2379,10.23.207.11:2379,10.23.97.19:2379
# Execute the following command to view the cluster status
etcdctl endpoint health
# The output is as follows, you can see that the node 10.23.97.19 is in a fault state
...
10.23.17.200:2379 is healthy: successfully committed proposal: took = 15.028244ms
10.23.207.11:2379 is healthy: successfully committed proposal: took = 15.712076ms
10.23.97.19:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
# Execute the following command to view the node ID
etcdctl member list
# The output is as follows, you can know that the ID of the faulty node 10.23.97.19 is 45b4ced6b6a33ef5, and the etcd name is etcd2
1d3f61116d4f3128, started, etcd3, https://10.23.207.11:2380, https://10.23.207.11:2379
45b4ced6b6a33ef5, started, etcd2, https://10.23.97.19:2380, https://10.23.97.19:2379
8a190c41d92119cb, started, etcd1, https://10.23.17.200:2380, https://10.23.17.200:2379
# Execute the following command to delete the faulty node
etcdctl member remove 45b4ced6b6a33ef5
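Before moving on, you can optionally confirm that the faulty member has been removed (this reuses the ETCDCTL_* variables exported above):
# The faulty node 10.23.97.19 should no longer appear in the member list
etcdctl member list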
- Log in to a healthy node (node 10.23.17.200 in this example) and copy the following files to the repaired faulty node (10.23.97.19 in this example);
# Copy the related etcd program
scp /usr/bin/etcd /usr/bin/etcdctl 10.23.97.19:/usr/bin
# Copy the etcd configuration file
scp -r /etc/etcd 10.23.97.19:/etc/etcd
# Copy the kubernetes configuration file and etcd certificate
scp -r /etc/kubernetes 10.23.97.19:/etc/kubernetes
# Copy the etcd service file
scp /usr/lib/systemd/system/etcd.service 10.23.97.19:/usr/lib/systemd/system/etcd.service
- Log in to the repaired faulty node (10.23.97.19 in this example) and modify the related etcd configuration;
# Backup the original configuration file
cp /etc/etcd/etcd.conf /etc/etcd/etcd.conf.bak
# Empty the original configuration
echo '' >/etc/etcd/etcd.conf
# Set parameters
# Replace with the etcd name of the failed node
export FAILURE_ETCD_NAME=etcd2
# Replace with the IP of the healthy node used
export NORMAL_ETCD_NODE_IP=10.23.17.200
# Replace with the IP of the failed node
export FAILURE_ETCD_NODE_IP=10.23.97.19
# Execute the following command to generate a new configuration
cat /etc/etcd/etcd.conf.bak | while read LINE; do
if [[ $LINE == "ETCD_INITIAL_CLUSTER="* ]]; then
echo $LINE >>/etc/etcd/etcd.conf
elif [[ $LINE == "ETCD_INITIAL_CLUSTER_STATE="* ]]; then
echo 'ETCD_INITIAL_CLUSTER_STATE="existing"' >>/etc/etcd/etcd.conf
elif [[ $LINE == "ETCD_NAME="* ]]; then
echo 'ETCD_NAME='$FAILURE_ETCD_NAME >>/etc/etcd/etcd.conf
else
echo $LINE | sed "s/$NORMAL_ETCD_NODE_IP/$FAILURE_ETCD_NODE_IP/g" >>/etc/etcd/etcd.conf
fi
done
# If there is leftover data, delete the old etcd data
rm -rf /var/lib/etcd
# Create a new data directory
mkdir -p /var/lib/etcd/default.etcd
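You can optionally check the regenerated configuration before continuing (this assumes the UK8S etcd.conf uses the standard ETCD_* environment variable names); the name and URLs should now reference the repaired node rather than the healthy node the file was copied from:
# ETCD_NAME should be the failed node's name, and the listen/advertise URLs should use its IP
grep -E "ETCD_NAME|ETCD_INITIAL_CLUSTER|URLS" /etc/etcd/etcd.conf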
- Log in to a healthy node (node 10.23.17.200 in this example), add the faulty node back to the etcd cluster;
# Note that the following operations need to replace the relevant parameters to match your current cluster
export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/ssl/ca.pem
export ETCDCTL_CERT=/etc/kubernetes/ssl/etcd.pem
export ETCDCTL_KEY=/etc/kubernetes/ssl/etcd-key.pem
# Replace the IP address with your etcd cluster node IP
export ETCDCTL_ENDPOINTS=10.23.17.200:2379,10.23.207.11:2379,10.23.97.19:2379
# Replace with the etcd name of the failed node
export FAILURE_ETCD_NAME=etcd2
# Replace with the IP of the failed node
export FAILURE_ETCD_NODE_IP=10.23.97.19
# Execute the following command to add the faulty node back to the etcd cluster
etcdctl member add $FAILURE_ETCD_NAME --peer-urls=https://$FAILURE_ETCD_NODE_IP:2380
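If you run etcdctl member list again at this point, the newly added member should appear with an unstarted status; this is expected until etcd is started on the faulty node in the next step:
# The re-added member should be listed as unstarted for now
etcdctl member list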
- Log in to the faulty node and start etcd;
# Execute the following command to start etcd, if there is no error, the startup is successful
systemctl enable --now etcd
# After successful startup, you can refer to Step 2 to view the etcd cluster status.
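If etcd fails to start, checking the service status and recent logs is usually the quickest way to find the cause (standard systemd commands, assuming the default systemd-based UK8S node image):
# Check the etcd service status and the most recent log lines
systemctl status etcd --no-pager
journalctl -u etcd -n 50 --no-pager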
3.1.2 ETCD cluster is unavailable
An unavailable ETCD cluster means that half or more of the ETCD nodes have failed. In the extreme case where no node can log in to the system normally, you must contact support personnel, who will attempt to recover the relevant ETCD and UK8S files from the faulty nodes or reconstruct them from the database data. Here we only consider the case where at least one ETCD node can still log in to the system normally. In this case, you can use the data saved by the UK8S ETCD backup plugin to recover the cluster. The operation steps are as follows:
- Keep one node that can still log in normally untouched, and repair the other faulty nodes. If you need to reinstall the operating system, it is recommended to install the CentOS 7.6 64-bit operating system, or contact technical support personnel to install the UK8S custom image;
- If repairing the faulty nodes did not cause any file system data loss, you can skip this step; otherwise, log in to the retained node (node 10.23.166.234 in this example) and copy the following files to the repaired faulty nodes (nodes 10.23.172.172 and 10.23.95.6 in this example);
# Copy the related etcd program
scp /usr/bin/etcd /usr/bin/etcdctl 10.23.172.172:/usr/bin
scp /usr/bin/etcd /usr/bin/etcdctl 10.23.95.6:/usr/bin
# Copy the etcd configuration file
scp -r /etc/etcd 10.23.172.172:/etc/etcd
scp -r /etc/etcd 10.23.95.6:/etc/etcd
# Copy the kubernetes configuration file and etcd certificate
scp -r /etc/kubernetes 10.23.172.172:/etc/kubernetes
scp -r /etc/kubernetes 10.23.95.6:/etc/kubernetes
# Copy the etcd service file
scp /usr/lib/systemd/system/etcd.service 10.23.172.172:/usr/lib/systemd/system/etcd.service
scp /usr/lib/systemd/system/etcd.service 10.23.95.6:/usr/lib/systemd/system/etcd.service
- If repairing the faulty nodes did not cause any file system data loss, you can skip this step; otherwise, after completing step 2, modify the related ETCD configuration on the faulty nodes (nodes 10.23.172.172 and 10.23.95.6 in this example). The following shows the operation on one of the faulty nodes (node 10.23.172.172 in this example); perform the same operations on the other faulty nodes (remember to adjust the parameters);
# Backup the original configuration file
cp /etc/etcd/etcd.conf /etc/etcd/etcd.conf.bak
# Empty the original configuration
echo '' >/etc/etcd/etcd.conf
# Set parameters
# Replace with the etcd name of the failed node, here 10.23.172.172 corresponds to etcd2
export FAILURE_ETCD_NAME=etcd2
# Replace with the IP of the retained node
export RETAIN_ETCD_NODE_IP=10.23.166.234
# Replace with the IP of the failed node
export FAILURE_ETCD_NODE_IP=10.23.172.172
# Execute the following command to generate a new configuration
cat /etc/etcd/etcd.conf.bak | while read LINE; do
if [[ $LINE == "ETCD_INITIAL_CLUSTER="* ]]; then
echo $LINE >>/etc/etcd/etcd.conf
elif [[ $LINE == "ETCD_NAME="* ]]; then
echo 'ETCD_NAME='$FAILURE_ETCD_NAME >>/etc/etcd/etcd.conf
else
echo $LINE | sed "s/$RETAIN_ETCD_NODE_IP/$FAILURE_ETCD_NODE_IP/g" >>/etc/etcd/etcd.conf
fi
done
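If you want to confirm exactly what was changed on this node, you can compare the regenerated configuration with the backed-up copy (an optional check):
# The differences should be limited to the node name and the IP substitutions
diff /etc/etcd/etcd.conf.bak /etc/etcd/etcd.conf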
- Execute the following commands on all ETCD nodes;
# Stop etcd service
systemctl stop etcd
# If there is leftover data, delete the old etcd data
rm -rf /var/lib/etcd
# Create a new data directory
mkdir /var/lib/etcd
- Get the ETCD backup file from UFile or the backup server and upload it to all ETCD nodes, then log in to each node and restore from the backup file (remember to adjust the parameters to match each node);
# In this example, the backup file has been saved to the /root/uk8s-f1wymalx-backup-etcd-3.3.17-2020-02-11T03-56-12.db.tar.gz path
# Unzip to get the backup data
tar zxvf uk8s-f1wymalx-backup-etcd-3.3.17-2020-02-11T03-56-12.db.tar.gz
# Recover from the backup data, note to adjust the ETCD_NAME and NODE_IP to match the node information
export ETCD_NAME=etcd2
export NODE_IP=10.23.172.172
export ETCDCTL_API=3
etcdctl --name=$ETCD_NAME --endpoints=https://$NODE_IP:2379 --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem --cacert=/etc/kubernetes/ssl/ca.pem --initial-cluster-token=etcd-cluster --initial-advertise-peer-urls=https://$NODE_IP:2380 --initial-cluster=etcd1=https://10.23.95.6:2380,etcd2=https://10.23.172.172:2380,etcd3=https://10.23.166.234:2380 --data-dir=/var/lib/etcd/default.etcd/ snapshot restore uk8s-f1wymalx-backup-etcd-3.3.17-2020-02-11T03-56-12.db
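Optionally, you can also inspect the backup file to confirm it is the snapshot you expect; etcdctl can print its hash, revision, and key count (this relies on the ETCDCTL_API=3 setting exported above):
# Print basic information about the snapshot file (hash, revision, total keys, total size)
etcdctl snapshot status uk8s-f1wymalx-backup-etcd-3.3.17-2020-02-11T03-56-12.db --write-out=table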
- After all ETCD nodes have completed Step 5, execute the following command on all ETCD nodes to start ETCD;
# Note that when the command is first executed, it may hang, which is normal
# Continue executing the command on other nodes, and when more than half of the nodes are started, the command will exit normally
systemctl enable --now etcd
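While bringing up etcd on the remaining nodes, you can follow the logs on a node where the start command appears to hang (a standard journald command, assuming the default systemd-based UK8S node image):
# Follow the etcd logs in another terminal while the other nodes are being started
journalctl -u etcd -f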
- Log in to any etcd node to view the cluster status;
export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/ssl/ca.pem
export ETCDCTL_CERT=/etc/kubernetes/ssl/etcd.pem
export ETCDCTL_KEY=/etc/kubernetes/ssl/etcd-key.pem
# Replace the IP address with your etcd cluster node IP
export ETCDCTL_ENDPOINTS=10.23.95.6:2379,10.23.172.172:2379,10.23.166.234:2379
# Execute the following command to view the cluster status
etcdctl endpoint health
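In addition to the health check, you can view per-member details, including which member is currently the leader (this reuses the ETCDCTL_* variables exported above):
# Show per-endpoint details such as version, DB size, and leadership
etcdctl endpoint status --write-out=table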
3.2 Accidental deletion of etcd nodes
3.2.1 etcd cluster is still available
At this time, the number of accidentally deleted etcd nodes is less than half of the total (the number of etcd nodes in UK8S is set to 3 by default, so at most 1 etcd node is deleted in this case), and the etcd cluster can still be accessed normally. In this case, you can recover the cluster without using the etcd backup. The operation steps are as follows:
- Contact technical support personnel to create a virtual machine with the same configuration and IP as the deleted node based on the information of the deleted node;
- Refer to subsection 3.1.1 (etcd cluster is still available) of the etcd node is not deleted, but the node has a fault section, using the virtual machine created in step 1 as the repaired faulty node to recover the cluster;
- After the etcd cluster repair is completed, contact technical support personnel to change the UHost ID of the mistakenly deleted virtual machine in the database to the UHost ID of the virtual machine created in step 1.
3.2.2 etcd cluster is unavailable
An unavailable ETCD cluster means that half or more of the ETCD nodes have been deleted. In the extreme case where all nodes have been deleted, you must contact support personnel, who will attempt to reconstruct the relevant files from the database data. Here we only consider the case where at least one ETCD node still exists and can log in to the system normally. In this case, you can use the data saved by the UK8S ETCD backup plugin to recover the cluster. The operation steps are as follows:
- Contact technical support personnel to create virtual machines with the same configuration and IPs as the deleted nodes, based on the information of all the mistakenly deleted nodes;
- Refer to subsection 3.1.2 (etcd cluster is unavailable) of the etcd node is not deleted, but the node has a fault section, using the virtual machines created in step 1 as the repaired faulty nodes to recover the cluster;
- After the etcd cluster is repaired, contact technical support personnel to change the UHost IDs of the mistakenly deleted virtual machines in the database to the UHost IDs of the virtual machines created in step 1.
4. Explanation of crontab syntax
The schedule syntax used here is consistent with standard crontab syntax. Several commonly used expressions are listed below.
Crontab format (the first 5 fields specify the time; only these 5 fields are used here)
<minute> <hour> <day> <month> <week> <command>
Once a day, execute at 0:00
0 0 * * *
Once a week, execute at 0:00
0 0 * * 0
Once a month, execute at 0:00
0 0 1 * *
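Every 6 hours (using the standard crontab step syntax), execute at 0:00, 6:00, 12:00 and 18:00
0 */6 * * *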