Elasticsearch and OpenSearch Cluster Health Troubleshooting Guide
Maintaining healthy Elasticsearch and OpenSearch clusters is crucial for any SIEM or log analytics infrastructure. This comprehensive guide covers common cluster health issues, their diagnosis, and secure resolution methods.
Initial Assessment
Before implementing any fixes, perform a thorough assessment of your cluster’s current state.
Basic Health Check
```bash
# Check overall cluster health
curl -k -X GET "https://<elasticsearch-host>:9200/_cluster/health?pretty" \
  -u <username>:<password>
```

Expected output format:

```json
{
  "cluster_name" : "your-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 15,
  "active_shards" : 15,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 5,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 75.0
}
```
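For scripted checks, the health response can be reduced to just the fields you care about with `jq` (assuming `jq` is installed; in practice you would pipe the output of the curl command above into the filter rather than a fixed string):

```shell
# Abbreviated sample of the health response shown above.
health='{"status":"yellow","unassigned_shards":5,"active_shards_percent_as_number":75.0}'

# Extract status and unassigned shard count in one line.
echo "$health" | jq -r '"status=\(.status) unassigned=\(.unassigned_shards)"'
# → status=yellow unassigned=5
```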
Detailed Shard Analysis
```bash
# Get detailed shard allocation explanation
curl -k -X GET "https://<elasticsearch-host>:9200/_cluster/allocation/explain?pretty" \
  -u <username>:<password>

# List indices with detailed health status
curl -k -X GET "https://<elasticsearch-host>:9200/_cat/indices?v&h=index,health,status,pri,rep,docs.count,store.size" \
  -u <username>:<password>

# Check specific shard allocation and unassignment reasons
curl -k -X GET "https://<elasticsearch-host>:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" \
  -u <username>:<password>
```
Node Information
```bash
# Check node information and resources
curl -k -X GET "https://<elasticsearch-host>:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m,disk.used_percent,node.role,master" \
  -u <username>:<password>

# Get cluster settings
curl -k -X GET "https://<elasticsearch-host>:9200/_cluster/settings?pretty" \
  -u <username>:<password>
```
Common Issues and Solutions
1. Yellow Cluster Status with Unassigned Replicas (Single Node)
Problem: In single-node clusters, replica shards cannot be allocated, causing yellow status.
Root Cause: Elasticsearch/OpenSearch will not allocate a replica shard to the same node that holds its primary, since a copy on the same node provides no redundancy.
Solution: Disable replicas for single-node deployments.
```bash
# Disable replicas for all existing indices
curl -k -X PUT "https://<elasticsearch-host>:9200/*/_settings" \
  -H "Content-Type: application/json" \
  -u <username>:<password> \
  -d '{
    "index": {
      "number_of_replicas": 0
    }
  }'

# Set template for future indices
curl -k -X PUT "https://<elasticsearch-host>:9200/_template/default_template" \
  -H "Content-Type: application/json" \
  -u <username>:<password> \
  -d '{
    "index_patterns": ["*"],
    "settings": {
      "number_of_replicas": 0
    },
    "order": 1
  }'

# Verify template creation
curl -k -X GET "https://<elasticsearch-host>:9200/_template/default_template?pretty" \
  -u <username>:<password>
```
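To confirm the change took effect, the `rep` column of `_cat/indices?h=index,rep` should now be 0 everywhere; a small `awk` filter flags any stragglers. A sketch against sample output (pipe the live curl output into the same filter in practice):

```shell
# Two sample lines of `_cat/indices?h=index,rep` output; the index names are
# hypothetical. Column 2 is the replica count.
printf 'wazuh-alerts-2024\t1\nfilebeat-7.10\t0\n' | awk '$2 > 0 {print $1}'
# → wazuh-alerts-2024 (the only index still configured with replicas)
```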
2. OpenDistro/ISM Index Issues
Problem: OpenDistro Index State Management indices showing yellow status.
Root Cause: ISM indices created with default replica settings.
Solution: Update ISM-specific indices and create dedicated templates.
```bash
# Update existing ISM indices
curl -k -X PUT "https://<elasticsearch-host>:9200/.opendistro-ism*/_settings" \
  -H "Content-Type: application/json" \
  -u <username>:<password> \
  -d '{
    "index": {
      "number_of_replicas": 0
    }
  }'

# Create ISM-specific template
curl -k -X PUT "https://<elasticsearch-host>:9200/_template/ism_template" \
  -H "Content-Type: application/json" \
  -u <username>:<password> \
  -d '{
    "index_patterns": [".opendistro-ism*", ".opensearch-ism*"],
    "settings": {
      "number_of_replicas": 0,
      "index": {
        "auto_expand_replicas": "0-1"
      }
    },
    "order": 10
  }'

# Update OpenSearch security indices
curl -k -X PUT "https://<elasticsearch-host>:9200/.opensearch-*/_settings" \
  -H "Content-Type: application/json" \
  -u <username>:<password> \
  -d '{
    "index": {
      "number_of_replicas": 0
    }
  }'
```
3. Monitoring and Security Index Issues
Problem: Wazuh, Filebeat, or other monitoring indices showing unassigned shards.
Root Cause: Default templates creating replicas in single-node environments.
Solution: Create comprehensive monitoring templates.
```bash
# Create monitoring indices template
curl -k -X PUT "https://<elasticsearch-host>:9200/_template/monitoring_template" \
  -H "Content-Type: application/json" \
  -u <username>:<password> \
  -d '{
    "index_patterns": [
      "wazuh-*",
      "filebeat-*",
      "otel-v1-*",
      "logstash-*",
      "metricbeat-*",
      "winlogbeat-*"
    ],
    "settings": {
      "number_of_replicas": 0,
      "index": {
        "lifecycle": {
          "name": "monitoring_policy",
          "rollover_alias": "monitoring"
        }
      }
    },
    "order": 5
  }'

# Update existing monitoring indices
indices=("wazuh-*" "filebeat-*" "otel-v1-*" "logstash-*")
for pattern in "${indices[@]}"; do
  curl -k -X PUT "https://<elasticsearch-host>:9200/${pattern}/_settings" \
    -H "Content-Type: application/json" \
    -u <username>:<password> \
    -d '{ "index": { "number_of_replicas": 0 } }'
done
```
4. Disk Space Issues
Problem: Cluster showing red status due to disk space constraints.
Solution: Implement disk management and cleanup procedures.
```bash
# Check disk usage by index
curl -k -X GET "https://<elasticsearch-host>:9200/_cat/indices?v&s=store.size:desc&h=index,store.size,docs.count" \
  -u <username>:<password>

# Configure disk watermarks
curl -k -X PUT "https://<elasticsearch-host>:9200/_cluster/settings" \
  -H "Content-Type: application/json" \
  -u <username>:<password> \
  -d '{
    "persistent": {
      "cluster.routing.allocation.disk.watermark.low": "85%",
      "cluster.routing.allocation.disk.watermark.high": "90%",
      "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
    }
  }'
```
```bash
# List indices sorted by creation date (oldest first)
curl -k -X GET "https://<elasticsearch-host>:9200/_cat/indices?v&s=creation.date" \
  -u <username>:<password>

# Example: delete a specific old index (be very careful with this command;
# deletion is irreversible, replace with the actual index name)
# curl -k -X DELETE "https://<elasticsearch-host>:9200/old-index-name" \
#   -u <username>:<password>
```
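Before deleting anything, an index's age can be computed from the `creation.date` column (epoch milliseconds, available via `_cat/indices?h=index,creation.date`). A pure-bash sketch with hypothetical values:

```shell
# creation.date as returned by the _cat API, in epoch milliseconds.
# Both values below are hypothetical, fixed for illustration.
creation_ms=1700000000000
now_ms=1702592000000   # against a live cluster, use: $(( $(date +%s) * 1000 ))

# 86400000 ms per day
age_days=$(( (now_ms - creation_ms) / 86400000 ))
echo "age: ${age_days} days"
# → age: 30 days
```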
5. Shard Allocation Issues
Problem: Shards stuck in initializing or relocating state.
Solution: Force shard allocation and resolve allocation issues.
```bash
# Check allocation filters
curl -k -X GET "https://<elasticsearch-host>:9200/_cluster/settings?pretty&include_defaults=true" \
  -u <username>:<password>

# Clear allocation filters if present
curl -k -X PUT "https://<elasticsearch-host>:9200/_cluster/settings" \
  -H "Content-Type: application/json" \
  -u <username>:<password> \
  -d '{
    "persistent": {
      "cluster.routing.allocation.include._name": null,
      "cluster.routing.allocation.exclude._name": null,
      "cluster.routing.allocation.require._name": null
    }
  }'

# Force allocation of unassigned shards
# WARNING: allocate_empty_primary creates an empty shard and discards any data
# the shard previously held; use only as a last resort
curl -k -X POST "https://<elasticsearch-host>:9200/_cluster/reroute?retry_failed=true" \
  -H "Content-Type: application/json" \
  -u <username>:<password> \
  -d '{
    "commands": [
      {
        "allocate_empty_primary": {
          "index": "problem-index",
          "shard": 0,
          "node": "node-name",
          "accept_data_loss": true
        }
      }
    ]
  }'
```
Post-Fix Verification
After implementing fixes, verify the cluster health:
```bash
# Force cluster routing refresh
curl -k -X POST "https://<elasticsearch-host>:9200/_cluster/reroute?retry_failed=true" \
  -u <username>:<password>

# Wait a few moments and check health
sleep 30
curl -k -X GET "https://<elasticsearch-host>:9200/_cluster/health?pretty" \
  -u <username>:<password>

# Verify no unassigned shards
curl -k -X GET "https://<elasticsearch-host>:9200/_cat/shards?v&h=index,shard,prirep,state" \
  -u <username>:<password> | grep UNASSIGNED

# Check cluster stats
curl -k -X GET "https://<elasticsearch-host>:9200/_cluster/stats?pretty" \
  -u <username>:<password>
```
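Rather than a fixed `sleep`, the health API also accepts a `wait_for_status` parameter (e.g. `GET /_cluster/health?wait_for_status=yellow&timeout=60s`) that blocks until the cluster reaches the target status or the timeout expires. The same polling pattern can be written as a bounded retry loop; a sketch with a stubbed check standing in for the live `curl | jq` call:

```shell
# Stub standing in for:
#   curl -s -u "$ES_USER:$ES_PASS" "$ES_HOST/_cluster/health" | jq -r '.status'
check_health() { echo "green"; }

status="unknown"
for attempt in 1 2 3 4 5; do
  status=$(check_health)
  [ "$status" = "green" ] && break
  sleep 1
done
echo "final status after ${attempt} attempt(s): ${status}"
# → final status after 1 attempt(s): green
```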
Security Considerations
Certificate Management
Replace the `-k` flag with proper certificate validation:
```bash
# Use proper certificate validation
curl --cacert /path/to/ca.crt \
  -X GET "https://<elasticsearch-host>:9200/_cluster/health" \
  -u <username>:<password>

# For client certificates
curl --cert /path/to/client.crt \
  --key /path/to/client.key \
  --cacert /path/to/ca.crt \
  -X GET "https://<elasticsearch-host>:9200/_cluster/health"
```
Credential Management
```bash
# Use environment variables for credentials
export ES_USERNAME="admin"
export ES_PASSWORD="your-secure-password"
export ES_HOST="https://elasticsearch:9200"

# Create a secure wrapper script
cat > es_health_check.sh << 'EOF'
#!/bin/bash
set -euo pipefail

# Load credentials from secure location
source /etc/elasticsearch/credentials

curl --cacert /etc/elasticsearch/certs/ca.crt \
  -X GET "${ES_HOST}/_cluster/health?pretty" \
  -u "${ES_USERNAME}:${ES_PASSWORD}"
EOF

chmod 700 es_health_check.sh
```
RBAC Implementation
```bash
# Create a dedicated monitoring role with least-privilege cluster access
# (Elasticsearch security API; OpenSearch uses _plugins/_security/api/roles)
curl -k -X PUT "https://<elasticsearch-host>:9200/_security/role/cluster_monitor" \
  -H "Content-Type: application/json" \
  -u <username>:<password> \
  -d '{
    "cluster": ["monitor"],
    "indices": [
      {
        "names": ["*"],
        "privileges": ["monitor"]
      }
    ]
  }'

# Create monitoring user
curl -k -X PUT "https://<elasticsearch-host>:9200/_security/user/monitor_user" \
  -H "Content-Type: application/json" \
  -u <username>:<password> \
  -d '{
    "password": "secure-monitor-password",
    "roles": ["cluster_monitor"],
    "full_name": "Cluster Monitor User"
  }'
```
Network Security
```bash
# Implement firewall rules
sudo ufw allow from 192.168.1.0/24 to any port 9200
sudo ufw allow from 192.168.1.0/24 to any port 9300

# Configure elasticsearch.yml for network binding
echo "network.host: 192.168.1.100" >> /etc/elasticsearch/elasticsearch.yml
echo "discovery.seed_hosts: [\"192.168.1.100\", \"192.168.1.101\"]" >> /etc/elasticsearch/elasticsearch.yml
```
Preventive Measures
Monitoring Setup
```bash
# Tune shard allocation and recovery concurrency
curl -k -X PUT "https://<elasticsearch-host>:9200/_cluster/settings" \
  -H "Content-Type: application/json" \
  -u <username>:<password> \
  -d '{
    "persistent": {
      "cluster.routing.allocation.cluster_concurrent_rebalance": 2,
      "cluster.routing.allocation.node_concurrent_recoveries": 2,
      "cluster.routing.allocation.node_initial_primaries_recoveries": 4
    }
  }'

# Set up index lifecycle management
curl -k -X PUT "https://<elasticsearch-host>:9200/_ilm/policy/monitoring_policy" \
  -H "Content-Type: application/json" \
  -u <username>:<password> \
  -d '{
    "policy": {
      "phases": {
        "hot": {
          "actions": {
            "rollover": {
              "max_size": "10GB",
              "max_age": "7d"
            }
          }
        },
        "warm": {
          "min_age": "7d",
          "actions": {
            "shrink": {
              "number_of_shards": 1
            }
          }
        },
        "cold": {
          "min_age": "30d",
          "actions": {
            "readonly": {}
          }
        },
        "delete": {
          "min_age": "90d",
          "actions": {
            "delete": {}
          }
        }
      }
    }
  }'
```
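Note that the `_ilm` API is Elasticsearch-specific; OpenSearch ships Index State Management instead, where policies are PUT to `_plugins/_ism/policies/<policy_id>`. A rough ISM sketch of the hot and delete phases above (a simplified equivalent, not a field-for-field translation):

```json
{
  "policy": {
    "description": "Rough ISM equivalent of the ILM monitoring policy (sketch)",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          { "rollover": { "min_size": "10gb", "min_index_age": "7d" } }
        ],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "90d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [ { "delete": {} } ]
      }
    ]
  }
}
```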
Backup Strategy
```bash
# Configure snapshot repository
curl -k -X PUT "https://<elasticsearch-host>:9200/_snapshot/backup_repo" \
  -H "Content-Type: application/json" \
  -u <username>:<password> \
  -d '{
    "type": "fs",
    "settings": {
      "location": "/backup/elasticsearch",
      "compress": true,
      "chunk_size": "100MB"
    }
  }'

# Create automated backup script
cat > backup_cluster.sh << 'EOF'
#!/bin/bash
set -euo pipefail

DATE=$(date +%Y%m%d_%H%M%S)
SNAPSHOT_NAME="backup_${DATE}"

curl -k -X PUT "https://elasticsearch:9200/_snapshot/backup_repo/${SNAPSHOT_NAME}?wait_for_completion=true" \
  -H "Content-Type: application/json" \
  -u "${ES_USERNAME}:${ES_PASSWORD}" \
  -d '{
    "indices": "*",
    "include_global_state": true,
    "ignore_unavailable": true
  }'

echo "Backup completed: ${SNAPSHOT_NAME}"
EOF

chmod +x backup_cluster.sh
```
Health Monitoring Script
```bash
# Create comprehensive health monitoring script
cat > cluster_health_monitor.sh << 'EOF'
#!/bin/bash
set -euo pipefail

# Configuration
ES_HOST="${ES_HOST:-https://localhost:9200}"
ES_USER="${ES_USER:-admin}"
ES_PASS="${ES_PASS:-password}"
ALERT_EMAIL="${ALERT_EMAIL:-admin@company.com}"   # reserved for notification logic

# Colors for output
RED='\033[0;31m'
YELLOW='\033[1;33m'
GREEN='\033[0;32m'
NC='\033[0m'

# Check cluster health
health_status=$(curl -s -k -u "${ES_USER}:${ES_PASS}" \
  "${ES_HOST}/_cluster/health" | jq -r '.status')

case "$health_status" in
  "green")
    echo -e "${GREEN}✓ Cluster health: $health_status${NC}"
    ;;
  "yellow")
    echo -e "${YELLOW}⚠ Cluster health: $health_status${NC}"
    # Add notification logic here
    ;;
  "red")
    echo -e "${RED}✗ Cluster health: $health_status${NC}"
    # Add critical alert logic here
    ;;
esac

# Check disk usage on the data path
disk_usage=$(df -h /var/lib/elasticsearch | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$disk_usage" -gt 80 ]; then
  echo -e "${RED}✗ High disk usage: ${disk_usage}%${NC}"
else
  echo -e "${GREEN}✓ Disk usage: ${disk_usage}%${NC}"
fi

# Check unassigned shards
unassigned=$(curl -s -k -u "${ES_USER}:${ES_PASS}" \
  "${ES_HOST}/_cluster/health" | jq -r '.unassigned_shards')

if [ "$unassigned" -gt 0 ]; then
  echo -e "${YELLOW}⚠ Unassigned shards: $unassigned${NC}"
else
  echo -e "${GREEN}✓ All shards assigned${NC}"
fi

# Check per-node heap and disk usage
curl -s -k -u "${ES_USER}:${ES_PASS}" \
  "${ES_HOST}/_cat/nodes?h=name,heap.percent,disk.used_percent" |
while read -r name heap disk; do
  if [ "${heap%.*}" -gt 85 ]; then
    echo -e "${RED}✗ Node $name high heap: $heap${NC}"
  elif [ "${disk%.*}" -gt 85 ]; then
    echo -e "${YELLOW}⚠ Node $name high disk: $disk${NC}"
  else
    echo -e "${GREEN}✓ Node $name healthy${NC}"
  fi
done
EOF

chmod +x cluster_health_monitor.sh
```
Monitoring and Alerting
Prometheus Integration
```yaml
# prometheus.yml excerpt (assumes a Prometheus exporter plugin is installed on
# the cluster to serve /_prometheus/metrics)
scrape_configs:
  - job_name: "elasticsearch"
    static_configs:
      - targets: ["elasticsearch:9200"]
    metrics_path: "/_prometheus/metrics"
    basic_auth:
      username: "monitor_user"
      password: "secure-password"
```
Grafana Dashboard
Key metrics to monitor:
- Cluster health status
- Node availability
- Heap usage percentage
- Disk usage percentage
- Unassigned shard count
- Search and indexing rates
- JVM garbage collection metrics
Alerting Rules
```yaml
groups:
  - name: elasticsearch
    rules:
      - alert: ElasticsearchClusterYellow
        expr: elasticsearch_cluster_health_status{color="yellow"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Elasticsearch cluster status is yellow"

      - alert: ElasticsearchClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Elasticsearch cluster status is red"

      - alert: ElasticsearchHighHeapUsage
        expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Elasticsearch node heap usage is high"
```
Best Practices
- Regular Health Checks: Implement automated monitoring with appropriate alerting thresholds
- Capacity Planning: Monitor trends and plan for capacity expansion before issues occur
- Index Lifecycle Management: Implement proper ILM policies to manage data retention
- Security Updates: Keep clusters updated with security patches
- Backup Verification: Regularly test backup and restore procedures
- Documentation: Maintain runbooks for common issues and procedures
Conclusion
Maintaining healthy Elasticsearch and OpenSearch clusters requires proactive monitoring, proper configuration, and systematic troubleshooting approaches. This guide provides the foundation for identifying and resolving common cluster health issues while maintaining security best practices.
Remember to:
- Always backup configurations before making changes
- Test fixes in staging environments first
- Monitor cluster health continuously
- Implement proper security measures
- Document all changes and procedures
Regular maintenance and monitoring will help prevent most cluster health issues and ensure optimal performance for your SIEM and analytics infrastructure.