
ETCD Backup and Recovery

The ETCD backup plugin is a feature provided by UK8S for backing up the ETCD database. ETCD backups can effectively help you recover the cluster from faults, and the plugin settings allow scheduled backups, a choice of storage destination, and more.

1. First Operation

If you are prompted to save the certificate and key when installing the plugin for the first time, execute the following command on any one Master node, then refresh the page.

kubectl create secret generic uk8s-etcd-backup-cert -n kube-system --from-file /etc/kubernetes/ssl/etcd.pem --from-file /etc/kubernetes/ssl/etcd-key.pem
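
If you want to confirm that the secret has been created before refreshing the page, you can run the following optional check (it simply lists the secret and is not required by the plugin):

kubectl get secret uk8s-etcd-backup-cert -n kube-system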

2. Backup Operation

When ETCD backup is enabled, the backup system retains backup data according to the configured settings, as described below:

  1. Choose the number of backups you wish to keep. The backup system will cap the number of ETCD backup files kept in the storage space at this number.
  2. Add a scheduled backup time. The schedule syntax used here is described in 4. Explanation of crontab syntax in this article; an example schedule is shown after this list.
  3. Two backup destinations are provided: UFile and SSHServer.
  4. When choosing UFile, you need to create a corresponding UFile token and storage space. The UFile token needs to be granted upload, download, delete, and file-list permissions.
  5. If you use SSHServer, you need to fill in the specific host that can be reached over SSH, and ensure that the account you fill in has read and write permissions on the target directory of that host.
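
For example, assuming you want the plugin to back up ETCD once a day at 02:00, the scheduled backup time could be filled in with the following crontab expression (see 4. Explanation of crontab syntax for details):

0 2 * * *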

3. Recovery Operation

This article only discusses ETCD recovery. If you need to recover other K8S core components, such as the APIServer or Controller Manager, please refer to UK8S core component fault recovery.

⚠️ For the ETCD recovery operation, it is recommended that you carefully read the documentation before proceeding.

When a UK8S cluster Master node is accidentally deleted, the recovery process is quite involved and requires contacting UK8S technical support; for node faults that do not involve deletion, the handling is relatively simple. Both cases are explained below.

  • 3.1 etcd node is not deleted, but there is a fault in the node
    • 3.1.1 etcd cluster is still available
    • 3.1.2 etcd cluster is unavailable
  • 3.2 Accidentally deleted etcd node
    • 3.2.1 etcd cluster is still available
    • 3.2.2 etcd cluster is unavailable

3.1 ETCD node is not deleted, but the node has a fault

3.1.1 ETCD cluster is still available

At this time, the number of faulty ETCD nodes is less than half of the total (UK8S sets the number of ETCD nodes to 3 by default, so at most 1 ETCD node is faulty in this case). In this case, you can recover the cluster without using the ETCD backup. The operation steps are as follows:

  1. First, repair the faulty node. If you need to reinstall the system, it is recommended to install the CentOS 7.6 64-bit operating system or contact technical support to install the UK8S custom image;

  2. Log in to a healthy node (node 10.23.17.200 in this example) and perform the following operation to remove the faulty node from the ETCD cluster;

# Note: replace the relevant parameters below to match your current cluster
# Configure the ETCDCTL related parameters
export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/ssl/ca.pem
export ETCDCTL_CERT=/etc/kubernetes/ssl/etcd.pem
export ETCDCTL_KEY=/etc/kubernetes/ssl/etcd-key.pem
# Replace the IP addresses with your ETCD cluster node IPs
export ETCDCTL_ENDPOINTS=10.23.17.200:2379,10.23.207.11:2379,10.23.97.19:2379

# Execute the following command to view the cluster status
etcdctl endpoint health

# The output is as follows; you can see that node 10.23.97.19 is in a faulty state
10.23.17.200:2379 is healthy: successfully committed proposal: took = 15.028244ms
10.23.207.11:2379 is healthy: successfully committed proposal: took = 15.712076ms
10.23.97.19:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster

# Execute the following command to view the node IDs
etcdctl member list

# The output is as follows; the ID of the faulty node 10.23.97.19 is 45b4ced6b6a33ef5, and its etcd name is etcd2
1d3f61116d4f3128, started, etcd3, https://10.23.207.11:2380, https://10.23.207.11:2379
45b4ced6b6a33ef5, started, etcd2, https://10.23.97.19:2380, https://10.23.97.19:2379
8a190c41d92119cb, started, etcd1, https://10.23.17.200:2380, https://10.23.17.200:2379

# Execute the following command to remove the faulty node from the cluster
etcdctl member remove 45b4ced6b6a33ef5
  3. Log in to a healthy node (node 10.23.17.200 in this example) and copy the following files to the repaired faulty node (10.23.97.19 in this example);
# Copy the related etcd programs
scp /usr/bin/etcd /usr/bin/etcdctl 10.23.97.19:/usr/bin
# Copy the etcd configuration files
scp -r /etc/etcd 10.23.97.19:/etc/etcd
# Copy the kubernetes configuration files and etcd certificates
scp -r /etc/kubernetes 10.23.97.19:/etc/kubernetes
# Copy the etcd service file
scp /usr/lib/systemd/system/etcd.service 10.23.97.19:/usr/lib/systemd/system/etcd.service
  4. Log in to the repaired faulty node (10.23.97.19 in this example) and modify the related etcd configuration;
# Back up the original configuration file
cp /etc/etcd/etcd.conf /etc/etcd/etcd.conf.bak
# Empty the original configuration
echo '' >/etc/etcd/etcd.conf

# Set parameters
# Replace with the etcd name of the failed node
export FAILURE_ETCD_NAME=etcd2
# Replace with the IP of the healthy node used above
export NORMAL_ETCD_NODE_IP=10.23.17.200
# Replace with the IP of the failed node
export FAILURE_ETCD_NODE_IP=10.23.97.19

# Execute the following command to generate a new configuration
cat /etc/etcd/etcd.conf.bak | while read LINE; do
  if [[ $LINE == "ETCD_INITIAL_CLUSTER="* ]]; then
    echo $LINE >>/etc/etcd/etcd.conf
  elif [[ $LINE == "ETCD_INITIAL_CLUSTER_STATE="* ]]; then
    echo 'ETCD_INITIAL_CLUSTER_STATE="existing"' >>/etc/etcd/etcd.conf
  elif [[ $LINE == "ETCD_NAME="* ]]; then
    echo 'ETCD_NAME='$FAILURE_ETCD_NAME >>/etc/etcd/etcd.conf
  else
    echo $LINE | sed "s/$NORMAL_ETCD_NODE_IP/$FAILURE_ETCD_NODE_IP/g" >>/etc/etcd/etcd.conf
  fi
done

# If there is leftover data, delete the old etcd data
rm -rf /var/lib/etcd
# Create a new data directory
mkdir -p /var/lib/etcd/default.etcd
  5. Log in to a healthy node (node 10.23.17.200 in this example) and add the faulty node back to the etcd cluster;
# Note: replace the relevant parameters below to match your current cluster
export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/ssl/ca.pem
export ETCDCTL_CERT=/etc/kubernetes/ssl/etcd.pem
export ETCDCTL_KEY=/etc/kubernetes/ssl/etcd-key.pem
# Replace the IP addresses with your etcd cluster node IPs
export ETCDCTL_ENDPOINTS=10.23.17.200:2379,10.23.207.11:2379,10.23.97.19:2379
# Replace with the etcd name of the failed node
export FAILURE_ETCD_NAME=etcd2
# Replace with the IP of the failed node
export FAILURE_ETCD_NODE_IP=10.23.97.19

# Execute the following command to add the faulty node back to the etcd cluster
etcdctl member add $FAILURE_ETCD_NAME --peer-urls=https://$FAILURE_ETCD_NODE_IP:2380
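
# Optional check (a minimal sketch, assuming the ETCDCTL_* variables exported above): until etcd is
# started on the repaired node in the next step, the newly added member is listed as unstarted
etcdctl member list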
  6. Log in to the faulty node and start etcd;
# Execute the following command to start etcd; if there is no error, the startup succeeded
systemctl enable --now etcd
# After a successful startup, you can refer to step 2 to view the etcd cluster status
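
In addition to the health check from step 2, you can optionally view per-member status in table form; this is a minimal sketch that assumes the same ETCDCTL_* variables from step 2 are exported on the node where you run it:

# Show the endpoint, database size, leader flag and raft status for each member
etcdctl endpoint status --write-out=table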

3.1.2 ETCD cluster is unavailable

The ETCD cluster being unavailable means that half or more of the ETCD nodes have failed. In the extreme case where no node can log in to the system normally, you must contact support personnel, who will attempt to recover the relevant ETCD and UK8S files from the faulty nodes or reconstruct them from the database data. Here we only consider the case where at least one ETCD node can still log in to the system normally. In this case, you can use the data saved by the UK8S ETCD backup plugin to recover the cluster. The operation steps are as follows:

  1. Keep one node that can log in normally untouched, and repair the other faulty nodes. If you need to reinstall the system, it is recommended to install the CentOS 7.6 64-bit operating system or contact technical support to install the UK8S custom image;

  2. If repairing the faulty nodes did not cause file system data loss, you can skip this step. Otherwise, log in to the retained node (node 10.23.166.234 in this example) and copy the following files to the repaired faulty nodes (nodes 10.23.172.172 and 10.23.95.6 in this example);

# Copy the related etcd programs
scp /usr/bin/etcd /usr/bin/etcdctl 10.23.172.172:/usr/bin
scp /usr/bin/etcd /usr/bin/etcdctl 10.23.95.6:/usr/bin
# Copy the etcd configuration files
scp -r /etc/etcd 10.23.172.172:/etc/etcd
scp -r /etc/etcd 10.23.95.6:/etc/etcd
# Copy the kubernetes configuration files and etcd certificates
scp -r /etc/kubernetes 10.23.172.172:/etc/kubernetes
scp -r /etc/kubernetes 10.23.95.6:/etc/kubernetes
# Copy the etcd service file
scp /usr/lib/systemd/system/etcd.service 10.23.172.172:/usr/lib/systemd/system/etcd.service
scp /usr/lib/systemd/system/etcd.service 10.23.95.6:/usr/lib/systemd/system/etcd.service
  3. If repairing the faulty nodes did not cause file system data loss, you can skip this step. Otherwise, after completing step 2, modify the related ETCD configuration on the faulty nodes (nodes 10.23.172.172 and 10.23.95.6 in this example). The following shows the operation on one of the faulty nodes (node 10.23.172.172 in this example); perform the same operations on the other faulty nodes, taking care to adjust the parameters;
# Back up the original configuration file
cp /etc/etcd/etcd.conf /etc/etcd/etcd.conf.bak
# Empty the original configuration
echo '' >/etc/etcd/etcd.conf

# Set parameters
# Replace with the etcd name of the failed node; here 10.23.172.172 corresponds to etcd2
export FAILURE_ETCD_NAME=etcd2
# Replace with the IP of the retained node
export RETAIN_ETCD_NODE_IP=10.23.166.234
# Replace with the IP of the failed node
export FAILURE_ETCD_NODE_IP=10.23.172.172

# Execute the following command to generate a new configuration
cat /etc/etcd/etcd.conf.bak | while read LINE; do
  if [[ $LINE == "ETCD_INITIAL_CLUSTER="* ]]; then
    echo $LINE >>/etc/etcd/etcd.conf
  elif [[ $LINE == "ETCD_NAME="* ]]; then
    echo 'ETCD_NAME='$FAILURE_ETCD_NAME >>/etc/etcd/etcd.conf
  else
    echo $LINE | sed "s/$RETAIN_ETCD_NODE_IP/$FAILURE_ETCD_NODE_IP/g" >>/etc/etcd/etcd.conf
  fi
done
  4. Execute the following command on all ETCD nodes;
# Stop the etcd service
systemctl stop etcd
# If there is leftover data, delete the old etcd data
rm -rf /var/lib/etcd
# Create a new data directory
mkdir /var/lib/etcd
  5. Get the ETCD backup file from UFile or the backup server, upload it to all ETCD nodes, and then log in to each node to restore from the backup file (take care to match the node information);
# In this example, the backup file has been saved to /root/uk8s-f1wymalx-backup-etcd-3.3.17-2020-02-11T03-56-12.db.tar.gz
# Unpack it to get the backup data
tar zxvf uk8s-f1wymalx-backup-etcd-3.3.17-2020-02-11T03-56-12.db.tar.gz

# Restore from the backup data; adjust ETCD_NAME and NODE_IP to match the node being restored
export ETCD_NAME=etcd2
export NODE_IP=10.23.172.172
export ETCDCTL_API=3
etcdctl --name=$ETCD_NAME \
  --endpoints=https://$NODE_IP:2379 \
  --cert=/etc/kubernetes/ssl/etcd.pem \
  --key=/etc/kubernetes/ssl/etcd-key.pem \
  --cacert=/etc/kubernetes/ssl/ca.pem \
  --initial-cluster-token=etcd-cluster \
  --initial-advertise-peer-urls=https://$NODE_IP:2380 \
  --initial-cluster=etcd1=https://10.23.95.6:2380,etcd2=https://10.23.172.172:2380,etcd3=https://10.23.166.234:2380 \
  --data-dir=/var/lib/etcd/default.etcd/ \
  snapshot restore uk8s-f1wymalx-backup-etcd-3.3.17-2020-02-11T03-56-12.db
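
# Optional check (a minimal sketch, not part of the original procedure): at any point you can verify
# the integrity of the unpacked snapshot file; this prints its hash, revision, total keys and size
etcdctl snapshot status uk8s-f1wymalx-backup-etcd-3.3.17-2020-02-11T03-56-12.db --write-out=table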
  6. After all ETCD nodes have completed step 5, execute the following command on all ETCD nodes to start ETCD;
# Note: the first time this command is executed it may hang, which is normal
# Continue executing the command on the other nodes; once more than half of the nodes have started, the command will exit normally
systemctl enable --now etcd
  7. Log in to any etcd node to view the cluster status;
export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/ssl/ca.pem
export ETCDCTL_CERT=/etc/kubernetes/ssl/etcd.pem
export ETCDCTL_KEY=/etc/kubernetes/ssl/etcd-key.pem
# Replace the IP addresses with your etcd cluster node IPs
export ETCDCTL_ENDPOINTS=10.23.95.6:2379,10.23.172.172:2379,10.23.166.234:2379
# Execute the following command to view the cluster status
etcdctl endpoint health
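
Once every endpoint reports healthy, you can optionally confirm that the Kubernetes control plane can read from ETCD again, for example by listing the cluster nodes from a Master node (a simple sanity check, assuming kubectl is configured on that node):

kubectl get nodes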

3.2 Accidental deletion of etcd nodes

3.2.1 etcd cluster is still available

At this time, the number of accidentally deleted etcd nodes is less than half of the total (UK8S sets the number of etcd nodes to 3 by default, so at most 1 etcd node has been deleted in this case), and the etcd cluster can still be accessed normally. In this case, you can recover the cluster without using the etcd backup. The operation steps are as follows:

  1. Contact technical support to create a virtual machine with the same configuration and IP as the deleted node, based on the information of the deleted node;
  2. Refer to 3.1.1 etcd cluster is still available under 3.1 ETCD node is not deleted, but the node has a fault, and use the virtual machine created in step 1 as the repaired faulty node to recover the cluster;
  3. After the etcd cluster repair is complete, contact technical support to change the UHost ID of the accidentally deleted virtual machine recorded in the database to the UHost ID of the virtual machine created in step 1.

3.2.2 etcd cluster is unavailable

The ETCD cluster being unavailable means that half or more of the ETCD nodes have been deleted. In the extreme case where all nodes have been deleted, you must contact support personnel, who will try to reconstruct the relevant files from the database data. Here we only consider the case where at least one ETCD node can still log in to the system normally. In this case, you can use the data saved by the UK8S ETCD backup plugin to recover the cluster. The operation steps are as follows:

  1. Contact technical support to create virtual machines with the same configuration and IPs as the deleted nodes, based on the information of all the accidentally deleted nodes;
  2. Refer to 3.1.2 etcd cluster is unavailable under 3.1 ETCD node is not deleted, but the node has a fault, and use the virtual machines created in step 1 as the repaired faulty nodes to recover the cluster;
  3. After the etcd cluster is repaired, contact technical support to change the UHost IDs of the accidentally deleted virtual machines recorded in the database to the UHost IDs of the virtual machines created in step 1.

4. Explanation of crontab syntax

The schedule syntax used by the backup plugin is the standard crontab syntax. Several commonly used examples are listed below.

Crontab format (the first 5 fields specify the time; only these first 5 fields are used here)

<minute> <hour> <day> <month> <week> <command>

Once a day, execute at 0:00

0 0 * * *

Once a week, execute at 0:00

0 0 * * 0

Once a month, execute at 0:00

0 0 1 * *
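
As a further example, to run twice a day, at 0:00 and 12:00

0 0,12 * * *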