Common issues of the cluster log component
1. Issue Manifestation
In the UK8S service console, on the cluster's Application Center > ELK Logs page, after enabling the cluster log plugin and using it for a while, you may encounter the following issues:
- No latest logs appear on the ELK log search page
- The ELK log component status page shows 0 logs in the past 10 minutes
2. Issue Troubleshooting Reference
The ELK log component is deployed in the cluster's default namespace by default. If it was deployed in a custom namespace, replace default with that namespace when executing the commands below.
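If you are not sure which namespace the plugin was deployed to, one way to locate the ELK pods (the uk8s-elk name prefix below follows the examples in this document and may differ in your cluster) is:
kubectl get pods --all-namespaces | grep uk8s-elk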
Step 1. View the Logstash component logs. Log in to a cluster master node and execute the command:
kubectl logs -f uk8s-elk-release-logstash-0 -n default
You can see the following information being printed continuously:
[2021-11-16T09:55:31,753][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 403 ({"type"=>"cluster_block_exception", "reason"=>"index [uk8s-vidxqjoo-kube-system-2021.11.16] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"})
[2021-11-16T09:55:31,753][INFO ][logstash.outputs.elasticsearch] Retrying individual bulk actions that failed or were rejected by the previous bulk request. {:count=>1}
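If the Logstash output is noisy, you can, as a quick check, filter the most recent lines for the blocking error (standard kubectl and grep options; adjust the pod name and namespace to your deployment):
kubectl logs uk8s-elk-release-logstash-0 -n default --tail=200 | grep cluster_block_exception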
Step 2. Check the storage volume usage of the ES component. Log in to a cluster master node and execute the command:
for pod in multi-master-0 multi-master-1 multi-master-2
do
kubectl exec -t -i $pod -n default -- sh -c 'df -h | grep /usr/share/elasticsearch/data'
done
You can see that the disk usage is as high as 96%:
/dev/vdb 20G 19G 933M 96% /usr/share/elasticsearch/data
/dev/vdb 20G 19G 939M 96% /usr/share/elasticsearch/data
/dev/vdc 20G 19G 933M 96% /usr/share/elasticsearch/data
Step 3. Query the index status via the ES API. Log in to a cluster master node and execute the command:
ES_CLUSTER_IP=`kubectl get svc multi-master | awk 'NR>1 {print $3}'`
curl http://${ES_CLUSTER_IP}:9200/_all/_settings?pretty
You can see that the returned information includes "read_only_allow_delete": "true".
From this, you can determine the cause of the failure: although the disk is not completely full, ES's disk protection mechanism has been triggered:
- cluster.routing.allocation.disk.watermark.low controls the low watermark for disk usage. The default value is 85%; once a node exceeds it, ES no longer allocates new shards to that node.
- cluster.routing.allocation.disk.watermark.high controls the high watermark. The default value is 90%; once a node exceeds it, ES attempts to relocate shards to other nodes.
- cluster.routing.allocation.disk.watermark.flood_stage controls the flood stage watermark. The default value is 95%; once a node exceeds it, the ES cluster forcibly marks the affected indices as read-only, so new logs can no longer be written and the latest logs cannot be queried. To recover, you must manually set index.blocks.read_only_allow_delete back to false. To confirm how close each node is to these thresholds, see the check after this list.
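One way to confirm this from ES's own point of view is to query the allocation API, whose disk.percent column can be compared against the watermarks above (this reuses the service lookup from step 3):
ES_CLUSTER_IP=`kubectl get svc multi-master | awk 'NR>1 {print $3}'`
curl http://${ES_CLUSTER_IP}:9200/_cat/allocation?v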
3. Recommended Solution
3.1 ES PVC Expansion
The ELK log component is deployed in the cluster's default namespace by default. If it was deployed in a custom namespace, replace default with that namespace when executing the commands below.
Step 1. Log in to a master node and execute the command: kubectl get pvc -n default
The following PVCs are used by ES:
multi-master-multi-master-0
multi-master-multi-master-1
multi-master-multi-master-2
Execute kubectl edit pvc {pvc-name} -n default, increase the value of spec.resources.requests.storage, then save and exit. Within a minute or so, the PV, the PVC, and the file system inside the container will complete the online expansion. For more detailed operations, refer to UDisk Dynamic Expansion.
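As a non-interactive alternative to kubectl edit, each PVC can also be patched directly. The 40Gi below is only an example target size; choose a value that fits your log volume and repeat the command for each of the three PVCs:
kubectl patch pvc multi-master-multi-master-0 -n default -p '{"spec":{"resources":{"requests":{"storage":"40Gi"}}}}'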
After expansion, confirm the status of PV/PVC: kubectl get pv | grep multi-master && kubectl get pvc | grep multi-master
Step 2. Release the ES indices from read-only mode
ES_CLUSTER_IP=`kubectl get svc multi-master | awk 'NR>1 {print $3}'`
curl -H "Content-Type: application/json" -XPUT http://${ES_CLUSTER_IP}:9200/_all/_settings -d '{ "index.blocks.read_only_allow_delete": false }'
Step 3. Confirm the ES cluster status
ES_CLUSTER_IP=`kubectl get svc multi-master | awk 'NR>1 {print $3}'`
curl http://${ES_CLUSTER_IP}:9200/_cat/allocation?pretty
curl http://${ES_CLUSTER_IP}:9200/_cat/health
curl http://${ES_CLUSTER_IP}:9200/_all/_settings | jq
3.2 Adjusting ES Configurations
If the current ES PVC capacity is very large, then under ES's default configuration 90% usage still leaves plenty of free space. In that case, you can raise ES's watermark thresholds, release the indices from read-only mode, and restore the ES cluster to its normal state.
ES_CLUSTER_IP=`kubectl get svc multi-master | awk 'NR>1 {print $3}'`
curl -H "Content-Type: application/json" -XPUT http://${ES_CLUSTER_IP}:9200/_cluster/settings -d '{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": "90%",
"cluster.routing.allocation.disk.watermark.high": "95%",
"cluster.routing.allocation.disk.watermark.flood_stage": "97%",
"cluster.info.update.interval": "1m"
}
}'
curl -H "Content-Type: application/json" -XPUT http://${ES_CLUSTER_IP}:9200/_all/_settings -d '{
"index.blocks.read_only_allow_delete": false
}'
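To verify that the new thresholds took effect, you can read the cluster settings back; the values set above should appear under the persistent section:
ES_CLUSTER_IP=`kubectl get svc multi-master | awk 'NR>1 {print $3}'`
curl http://${ES_CLUSTER_IP}:9200/_cluster/settings?pretty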
3.3 ES Related Parameter Explanation:
- cluster.routing.allocation.disk.watermark.low controls the low watermark of disk usage. It defaults to 85%, which means Elasticsearch will not allocate shards to nodes whose storage usage exceeds 85%. It can also be set to an absolute byte value (such as 500MB) to prevent Elasticsearch from allocating shards when the available space falls below that amount. This setting does not affect the primary shards of newly created indices or, in particular, any shards that have never been allocated before.
- cluster.routing.allocation.disk.watermark.high controls the high watermark. It defaults to 90%, which means Elasticsearch will attempt to relocate shards away from nodes whose storage usage exceeds 90%. It can also be set to an absolute byte value (like the low watermark) to relocate shards away from a node once its available space falls below that amount. This setting affects the allocation of all shards, whether or not they were previously allocated.
- cluster.routing.allocation.disk.watermark.flood_stage controls the flood stage watermark. It defaults to 95%. Once a node's storage usage exceeds the flood stage, Elasticsearch applies a read-only block (index.blocks.read_only_allow_delete: true) to every index that has a shard allocated on that node. This is the last resort to prevent the node from running out of storage space. Once there is sufficient free space for indexing operations to continue, you must manually set index.blocks.read_only_allow_delete back to false to remove the read-only block. A query that shows the current and default values of these settings follows this list.
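One way to inspect the current and default values of these watermark settings in your own cluster is to query the cluster settings with defaults included (include_defaults and flat_settings are standard Elasticsearch query parameters):
ES_CLUSTER_IP=`kubectl get svc multi-master | awk 'NR>1 {print $3}'`
curl "http://${ES_CLUSTER_IP}:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep disk.watermark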
Reference: Elasticsearch Official Documentation