Comprehensive Guide to Service Metrics Exporters
Effective monitoring of service performance is essential for maintaining reliable infrastructure. Metrics exporters serve as the backbone of modern monitoring systems, collecting vital information about CPU usage, memory consumption, service uptime, and other critical metrics. This comprehensive guide explores how to implement, validate, and troubleshoot metrics exporters for your services, with practical examples for both standard and custom exporters.
Understanding Metrics Exporters
Metrics exporters are specialized agents that collect performance data from services and systems, then expose this information in a format that monitoring platforms like Prometheus can scrape and analyze.
```mermaid
graph TD
    A[Services] --> B[Metrics Exporters]
    B --> C[Prometheus/Monitoring System]
    C --> D[Alerting]
    C --> E[Visualization/Grafana]

    subgraph "Exporters"
        F[Node Exporter]
        G[Process Exporter]
        H[Custom Exporters]
    end

    B --> F
    B --> G
    B --> H
```
Common Types of Exporters
- Node Exporter: Collects system-wide metrics like CPU, memory, disk, and network usage
- Process Exporter: Focuses on process-specific metrics
- Custom Exporters: Tailored solutions for specific applications or services
- Application Exporters: Built into applications to expose internal metrics
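To make the "application exporter" idea concrete, here is a minimal sketch using the `prometheus_client` Python library, where the application updates its own metrics and exposes them over HTTP; the metric names and port are illustrative choices, not part of any standard.

```python
#!/usr/bin/env python3
# Minimal application-exporter sketch using prometheus_client
# (install with `pip install prometheus_client`). Metric names and
# port 8000 are illustrative assumptions.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

# Metrics the application updates as part of its normal work
REQUESTS_TOTAL = Counter("myapp_requests_total", "Requests handled by the app")
QUEUE_DEPTH = Gauge("myapp_queue_depth", "Items currently waiting in the queue")

if __name__ == "__main__":
    # Expose /metrics on port 8000 for Prometheus to scrape
    start_http_server(8000)
    while True:
        REQUESTS_TOTAL.inc()                    # simulate a handled request
        QUEUE_DEPTH.set(random.randint(0, 10))  # simulate a fluctuating queue
        time.sleep(1)
```

Scraping http://localhost:8000/metrics then returns plain-text lines such as `myapp_requests_total 42.0`, which is the exposition format Prometheus expects from every exporter.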
Implementing a Custom Python Metrics Exporter
While standard exporters cover many use cases, sometimes you need a custom solution. Here's how to create a Python-based metrics exporter for specific services like `xdr-dashboard` and `wazuh-indexer`:
Setting Up the Custom Exporter
First, let’s create a Python script that collects and exports metrics:
```python
#!/usr/bin/env python3
import psutil
import json
import os
import time
import argparse
from datetime import datetime

# Configuration
SERVICES = ["xdr-dashboard", "wazuh-indexer"]
METRICS_DIR = "metrics"

def ensure_metrics_dir():
    """Ensure metrics directory exists"""
    if not os.path.exists(METRICS_DIR):
        os.makedirs(METRICS_DIR)

def get_process_metrics(process_name):
    """Get metrics for a specific process"""
    metrics = {
        "cpu_percent": 0,
        "memory_percent": 0,
        "memory_mb": 0,
        "uptime_seconds": 0,
        "is_running": False
    }

    # Find all processes matching the name
    matching_processes = []
    for proc in psutil.process_iter(['pid', 'name', 'cmdline', 'create_time']):
        try:
            # Check if process name contains the service name
            if process_name in proc.info['name'] or any(process_name in cmd for cmd in proc.info.get('cmdline', []) if cmd):
                matching_processes.append(proc)
        except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
            pass

    if matching_processes:
        metrics["is_running"] = True

        # Aggregate metrics from all matching processes
        for proc in matching_processes:
            try:
                # Add CPU and memory usage
                process = psutil.Process(proc.info['pid'])
                metrics["cpu_percent"] += process.cpu_percent(interval=0.1)
                metrics["memory_percent"] += process.memory_percent()
                metrics["memory_mb"] += process.memory_info().rss / (1024 * 1024)  # Convert to MB

                # Calculate uptime
                current_time = time.time()
                process_start_time = proc.info['create_time']
                process_uptime = current_time - process_start_time
                metrics["uptime_seconds"] = max(metrics["uptime_seconds"], process_uptime)
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                pass

    return metrics

def collect_metrics():
    """Collect metrics for configured services"""
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    all_metrics = {}

    for service in SERVICES:
        metrics = get_process_metrics(service)
        all_metrics[service] = metrics

        # Save individual service metrics
        service_filename = f"{METRICS_DIR}/{service}_{timestamp}.json"
        with open(service_filename, 'w') as f:
            json.dump(metrics, f, indent=2)

        # Print service metrics to console
        print(f"=== {service} Metrics ===")
        print(f"Running: {metrics['is_running']}")
        print(f"CPU Usage: {metrics['cpu_percent']:.2f}%")
        print(f"Memory Usage: {metrics['memory_percent']:.2f}% ({metrics['memory_mb']:.2f} MB)")
        print(f"Uptime: {metrics['uptime_seconds']/3600:.2f} hours")
        print("")

    # Save combined metrics
    combined_filename = f"{METRICS_DIR}/all_services_{timestamp}.json"
    with open(combined_filename, 'w') as f:
        json.dump(all_metrics, f, indent=2)

    return all_metrics

def main():
    parser = argparse.ArgumentParser(description='Service Metrics Exporter')
    parser.add_argument('--continuous', action='store_true', help='Run continuously')
    parser.add_argument('--interval', type=int, default=300, help='Interval in seconds (default: 300)')
    args = parser.parse_args()

    ensure_metrics_dir()

    if args.continuous:
        print(f"Starting continuous monitoring with {args.interval} second intervals...")
        try:
            while True:
                print(f"\n[{datetime.now()}] Collecting metrics...")
                collect_metrics()
                time.sleep(args.interval)
        except KeyboardInterrupt:
            print("Monitoring stopped by user.")
    else:
        collect_metrics()

if __name__ == "__main__":
    main()
```
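Note that this script writes JSON snapshots to disk rather than exposing a Prometheus endpoint. If you want Prometheus to scrape these values directly, a minimal sketch along the following lines could sit next to it and reuse `get_process_metrics()`; the port (9877) and metric names are assumptions for illustration, and the `prometheus_client` package must be installed.

```python
#!/usr/bin/env python3
# Sketch: expose the collected metrics over HTTP for Prometheus instead of
# writing JSON files. Assumes metrics_exporter.py (above) is importable and
# `pip install prometheus_client` has been run. Port and metric names are
# illustrative, not standard.
import time

from prometheus_client import Gauge, start_http_server

from metrics_exporter import SERVICES, get_process_metrics

CPU = Gauge("service_cpu_percent", "CPU usage per service", ["service"])
MEMORY_MB = Gauge("service_memory_mb", "Resident memory in MB per service", ["service"])
UP = Gauge("service_up", "1 if the service process is running", ["service"])

if __name__ == "__main__":
    start_http_server(9877)  # Prometheus scrapes http://host:9877/metrics
    while True:
        for service in SERVICES:
            m = get_process_metrics(service)
            CPU.labels(service=service).set(m["cpu_percent"])
            MEMORY_MB.labels(service=service).set(m["memory_mb"])
            UP.labels(service=service).set(1 if m["is_running"] else 0)
        time.sleep(30)  # refresh interval between scrapes
```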
Installing Dependencies
Ensure you have the necessary dependencies:
```bash
pip install psutil
```
Running the Exporter
The script can be run in different ways depending on your needs:
```bash
# Single collection
python metrics_exporter.py

# Continuous monitoring every 5 minutes
python metrics_exporter.py --continuous --interval 300

# Continuous monitoring every minute
python metrics_exporter.py --continuous --interval 60
```
Setting Up as a System Service
For production use, create a systemd service:
```bash
# Create service file
sudo tee /etc/systemd/system/metrics-exporter.service > /dev/null << 'EOF'
[Unit]
Description=Custom Metrics Exporter
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/bin/python3 /opt/monitoring/metrics_exporter.py --continuous --interval 300
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable metrics-exporter
sudo systemctl start metrics-exporter
```
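One caveat: the script's `METRICS_DIR` is a relative path, so under systemd the JSON files are written relative to the unit's working directory (the filesystem root by default). If you keep the JSON output, consider adding `WorkingDirectory=/opt/monitoring` to the `[Service]` section and making sure that directory is writable by the `prometheus` user.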
Working with Standard Exporters
While custom exporters are useful, standard ones like Node Exporter and Process Exporter are often more reliable for production use.
Node Exporter Setup
Node Exporter provides system-wide metrics:
```bash
# Download Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz

# Extract and install
tar xvfz node_exporter-*.tar.gz
sudo mv node_exporter-*/node_exporter /usr/local/bin/

# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/node_exporter
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
```
Process Exporter Setup
Process Exporter focuses on specific processes:
```bash
# Download Process Exporter
wget https://github.com/ncabatoff/process-exporter/releases/download/v0.7.10/process-exporter-0.7.10.linux-amd64.tar.gz

# Extract and install
tar xvfz process-exporter-*.tar.gz
sudo mv process-exporter-*/process-exporter /usr/local/bin/

# Create configuration
sudo mkdir -p /etc/process-exporter
sudo tee /etc/process-exporter/config.yml > /dev/null << 'EOF'
process_names:
  - name: "xdr-dashboard"
    cmdline:
      - '.+xdr-dashboard.+'
  - name: "wazuh-indexer"
    cmdline:
      - '.+wazuh-indexer.+'
EOF

# Create systemd service
sudo tee /etc/systemd/system/process-exporter.service > /dev/null << 'EOF'
[Unit]
Description=Process Exporter
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/process-exporter --config.path=/etc/process-exporter/config.yml
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable process-exporter
sudo systemctl start process-exporter
```
Validating Metrics Collection
After setting up exporters, it’s crucial to verify they’re working correctly. Here are comprehensive validation methods:
Using curl to Check Metrics Endpoints
```bash
# Check Node Exporter metrics
curl http://localhost:9100/metrics | grep process

# Check Process Exporter metrics
curl http://localhost:9256/metrics | grep process

# Check if specific services are being monitored
curl http://localhost:9256/metrics | grep -E "xdr|wazuh"
```
Verifying Exporter Services
```bash
# Check Node Exporter status
sudo systemctl status node_exporter
sudo journalctl -u node_exporter -f

# Check Process Exporter status
sudo systemctl status process-exporter
sudo journalctl -u process-exporter -f
```
Checking Process Discovery
```bash
# List all processes being monitored by Process Exporter
curl http://localhost:9256/metrics | grep process_name

# Check if specific services are running
ps aux | grep -E "xdr|wazuh"
```
Creating a Testing Script
For quick validation, you can create a testing script:
```python
#!/usr/bin/env python3
import requests
import re
import sys

def test_node_exporter():
    """Test if Node Exporter is working"""
    try:
        response = requests.get("http://localhost:9100/metrics", timeout=5)
        if response.status_code == 200:
            print("✅ Node Exporter is running and responding")
            return True
        else:
            print(f"❌ Node Exporter returned status code {response.status_code}")
            return False
    except requests.exceptions.RequestException as e:
        print(f"❌ Node Exporter is not accessible: {e}")
        return False

def test_process_exporter():
    """Test if Process Exporter is working"""
    try:
        response = requests.get("http://localhost:9256/metrics", timeout=5)
        if response.status_code == 200:
            print("✅ Process Exporter is running and responding")
            return True
        else:
            print(f"❌ Process Exporter returned status code {response.status_code}")
            return False
    except requests.exceptions.RequestException as e:
        print(f"❌ Process Exporter is not accessible: {e}")
        return False

def check_service_metrics(service_name):
    """Check if metrics for specific service are available"""
    try:
        response = requests.get("http://localhost:9256/metrics", timeout=5)
        if response.status_code == 200:
            if re.search(rf'process_name="{service_name}"', response.text):
                print(f"✅ Metrics for {service_name} found")
                return True
            else:
                print(f"❌ No metrics found for {service_name}")
                return False
        else:
            print(f"❌ Process Exporter returned status code {response.status_code}")
            return False
    except requests.exceptions.RequestException as e:
        print(f"❌ Process Exporter is not accessible: {e}")
        return False

if __name__ == "__main__":
    print("=== Metrics Exporter Validation ===")

    # Test basic connectivity
    node_ok = test_node_exporter()
    process_ok = test_process_exporter()

    # Check for specific services
    services = ["xdr-dashboard", "wazuh-indexer"]
    service_status = []

    if process_ok:
        for service in services:
            service_status.append(check_service_metrics(service))

    # Print summary
    print("\n=== Test Summary ===")
    print(f"Node Exporter: {'OK' if node_ok else 'FAIL'}")
    print(f"Process Exporter: {'OK' if process_ok else 'FAIL'}")

    if process_ok:
        for i, service in enumerate(services):
            print(f"Service {service}: {'OK' if service_status[i] else 'FAIL'}")

    # Determine overall status
    if all([node_ok, process_ok] + service_status):
        print("\n✅ All tests passed!")
        sys.exit(0)
    else:
        print("\n❌ Some tests failed!")
        sys.exit(1)
```
Creating a Bash Health Check Script
For a lightweight alternative, create a bash script:
```bash
#!/bin/bash
# Set script variables
timestamp=$(date '+%Y-%m-%d %H:%M:%S')
node_exporter_port=9100
process_exporter_port=9256
services=("xdr-dashboard" "wazuh-indexer")

# ANSI color codes
GREEN='\033[0;32m'
RED='\033[0;31m'
NC='\033[0m' # No Color

echo "=== Metrics Health Check - $timestamp ==="

# Check if ports are open
echo -n "Checking Node Exporter port ($node_exporter_port): "
if nc -z localhost $node_exporter_port; then
    echo -e "${GREEN}OK${NC}"
else
    echo -e "${RED}FAIL${NC}"
fi

echo -n "Checking Process Exporter port ($process_exporter_port): "
if nc -z localhost $process_exporter_port; then
    echo -e "${GREEN}OK${NC}"
else
    echo -e "${RED}FAIL${NC}"
fi

# Check for metrics from each service
echo -e "\nChecking service metrics:"
for service in "${services[@]}"; do
    echo -n "  $service: "
    if curl -s "http://localhost:$process_exporter_port/metrics" | grep -q "process_name=\"$service\""; then
        echo -e "${GREEN}Found${NC}"
    else
        echo -e "${RED}Not found${NC}"
    fi
done

# Check system services
echo -e "\nChecking exporter services:"
echo -n "  node_exporter: "
if systemctl is-active --quiet node_exporter; then
    echo -e "${GREEN}Running${NC}"
else
    echo -e "${RED}Not running${NC}"
fi

echo -n "  process-exporter: "
if systemctl is-active --quiet process-exporter; then
    echo -e "${GREEN}Running${NC}"
else
    echo -e "${RED}Not running${NC}"
fi

# Add information about Prometheus scraping
echo -e "\nPrometheus targets can be checked at: http://localhost:9090/targets"
```
Troubleshooting Common Issues
If your metrics exporters aren’t working as expected, try these troubleshooting steps:
Network and Connectivity Issues
```bash
# Check if exporters are running and listening on expected ports
sudo netstat -tulpn | grep -E '9100|9256'

# Check firewall rules
sudo ufw status
# or
sudo iptables -L
```
Service Problems
```bash
# Check service logs
sudo journalctl -u node_exporter -n 50
sudo journalctl -u process-exporter -n 50
```
Configuration Validation
```bash
# Verify Prometheus configuration
sudo cat /etc/prometheus/prometheus.yml

# Check if Prometheus is scraping the targets
curl http://localhost:9090/api/v1/targets | jq .
```
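If `jq` isn't available, a short Python sketch can summarize the same targets API output; it assumes Prometheus is reachable on localhost:9090.

```python
#!/usr/bin/env python3
# Sketch: summarize Prometheus target health via the HTTP API.
# Assumes Prometheus is reachable on localhost:9090.
import requests

resp = requests.get("http://localhost:9090/api/v1/targets", timeout=5)
resp.raise_for_status()

for target in resp.json()["data"]["activeTargets"]:
    job = target["labels"].get("job", "unknown")
    health = target["health"]  # "up", "down", or "unknown"
    print(f"{job:15} {health:8} {target['scrapeUrl']}")
    if health != "up":
        print(f"  last error: {target.get('lastError', '')}")
```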
Permissions and Access
```bash
# Check if the exporter user has proper permissions
sudo -u prometheus /usr/local/bin/node_exporter --version

# Verify directory permissions
ls -la /var/lib/prometheus/
```
Process Matching Issues
If Process Exporter isn’t finding your services, check your regex patterns:
```bash
# Test your regex pattern against running processes
ps aux | grep -E "xdr|wazuh"

# Adjust your process-exporter config.yml if needed
sudo nano /etc/process-exporter/config.yml
```
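To see what each pattern would match before restarting the exporter, a small psutil-based sketch like the one below can be handy. The patterns mirror the example config.yml above, and the matching is an approximation of how Process Exporter applies its regexes (it tests against the joined command line).

```python
#!/usr/bin/env python3
# Sketch: test process-exporter cmdline regexes against running processes
# using psutil. This approximates the exporter's matching by searching the
# full joined command line; patterns are copied from the example config.
import re
import psutil

PATTERNS = {
    "xdr-dashboard": r".+xdr-dashboard.+",
    "wazuh-indexer": r".+wazuh-indexer.+",
}

for group, pattern in PATTERNS.items():
    regex = re.compile(pattern)
    matches = []
    for proc in psutil.process_iter(["pid", "cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if regex.search(cmdline):
            matches.append((proc.info["pid"], cmdline[:80]))
    print(f"{group}: {len(matches)} matching process(es)")
    for pid, cmd in matches:
        print(f"  pid {pid}: {cmd}")
```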
Integrating with Prometheus
To complete your monitoring setup, integrate your exporters with Prometheus:
```yaml
# Add to prometheus.yml
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]

  - job_name: "process"
    static_configs:
      - targets: ["localhost:9256"]
```
Creating Alerts for Critical Services
```yaml
groups:
  - name: service_alerts
    rules:
      - alert: ServiceDown
        expr: process_up{name=~"xdr-dashboard|wazuh-indexer"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.name }} is down"
          description: "The service {{ $labels.name }} has been down for more than 1 minute."

      - alert: HighCPUUsage
        expr: rate(process_cpu_seconds_total{name=~"xdr-dashboard|wazuh-indexer"}[5m]) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage for {{ $labels.name }}"
          description: "{{ $labels.name }} is using more than 80% CPU for over 5 minutes."
```
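Before relying on a rule, it's worth confirming its expression actually returns series. A quick sketch against the Prometheus query API, assuming Prometheus on localhost:9090 and the metric and label names used in the rules above:

```python
#!/usr/bin/env python3
# Sketch: evaluate an alert expression via the Prometheus query API to
# confirm it returns series. Assumes Prometheus on localhost:9090 and the
# metric/label names used in the alert rules above.
import requests

EXPR = 'process_up{name=~"xdr-dashboard|wazuh-indexer"}'

resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": EXPR},
    timeout=5,
)
resp.raise_for_status()

results = resp.json()["data"]["result"]
if not results:
    print("Expression returned no series - check the metric and label names")
for series in results:
    print(series["metric"].get("name"), "=>", series["value"][1])
```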
Visualizing Metrics with Grafana
Create a comprehensive dashboard to visualize your metrics:
```bash
# Create a dashboard JSON file
cat << EOF > service-metrics-dashboard.json
{
  "dashboard": {
    "title": "Service Performance Dashboard",
    "panels": [
      {
        "title": "Service Status",
        "type": "stat",
        "targets": [
          { "expr": "process_up{name=~\"xdr-dashboard|wazuh-indexer\"}" }
        ],
        "mappings": [
          {
            "type": "value",
            "options": {
              "0": { "text": "Down", "color": "red" },
              "1": { "text": "Up", "color": "green" }
            }
          }
        ]
      },
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(process_cpu_seconds_total{name=~\"xdr-dashboard|wazuh-indexer\"}[5m]) * 100",
            "legendFormat": "{{name}}"
          }
        ],
        "yaxes": [
          { "format": "percent" }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "process_resident_memory_bytes{name=~\"xdr-dashboard|wazuh-indexer\"} / 1024 / 1024",
            "legendFormat": "{{name}} (MB)"
          }
        ],
        "yaxes": [
          { "format": "decmbytes" }
        ]
      }
    ]
  }
}
EOF
```
```bash
# Import the dashboard through Grafana's API
curl -X POST -H "Content-Type: application/json" \
  -d @service-metrics-dashboard.json \
  http://admin:password@localhost:3000/api/dashboards/db
```
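To confirm the import succeeded, a quick sketch against Grafana's search API can list dashboards matching the title; the URL and credentials mirror the curl example above and should be adjusted for your setup.

```python
#!/usr/bin/env python3
# Sketch: verify the dashboard import via Grafana's search API.
# URL and credentials mirror the curl example above; adjust as needed.
import requests

GRAFANA_URL = "http://localhost:3000"
AUTH = ("admin", "password")

resp = requests.get(
    f"{GRAFANA_URL}/api/search",
    params={"query": "Service Performance Dashboard"},
    auth=AUTH,
    timeout=5,
)
resp.raise_for_status()

dashboards = resp.json()
if dashboards:
    for d in dashboards:
        print(f"Found: {d['title']} (uid={d.get('uid')})")
else:
    print("Dashboard not found - check the import response for errors")
```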
Best Practices for Production Use
Security Considerations
- Use TLS for metrics endpoints:

  ```bash
  # For Node Exporter
  /usr/local/bin/node_exporter --web.config.file=/etc/node_exporter/web-config.yml

  # TLS config file example
  cat << EOF > /etc/node_exporter/web-config.yml
  tls_server_config:
    cert_file: /etc/node_exporter/cert.pem
    key_file: /etc/node_exporter/key.pem
  EOF
  ```

- Implement authentication (see the hash-generation sketch after this list):

  ```bash
  cat << EOF > /etc/node_exporter/web-config.yml
  basic_auth_users:
    prometheus: $HASHED_PASSWORD
  EOF
  ```

- Use least-privilege users:

  ```bash
  sudo useradd -rs /bin/false prometheus
  sudo chown prometheus:prometheus /usr/local/bin/node_exporter
  ```
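The `basic_auth_users` entry above expects a bcrypt hash rather than a plaintext password. One way to generate it is the sketch below, which assumes the `bcrypt` Python package is installed (`htpasswd -nBC 10` from apache2-utils is an alternative).

```python
#!/usr/bin/env python3
# Sketch: generate a bcrypt hash for the web-config.yml basic_auth_users
# entry. Assumes `pip install bcrypt` has been run.
import getpass
import bcrypt

password = getpass.getpass("Password for the prometheus user: ")
hashed = bcrypt.hashpw(password.encode(), bcrypt.gensalt(rounds=10))
print(hashed.decode())  # use this value in place of $HASHED_PASSWORD
```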
Scaling Considerations
- Use service discovery for dynamic environments
- Implement federation for large-scale deployments
- Set appropriate scrape intervals based on resource consumption
Maintenance Guidelines
- Regular updates:

  ```bash
  # Set up a cronjob to check for updates
  0 2 * * 0 /usr/local/bin/check_exporter_updates.sh >> /var/log/exporter_updates.log 2>&1
  ```

- Backup configurations:

  ```bash
  # Backup all exporter configs
  tar -czf /var/backups/exporters-$(date +%Y%m%d).tar.gz /etc/process-exporter/ /etc/node_exporter/
  ```

- Monitor the monitors:

  ```yaml
  # Create alerts for your monitoring system
  - alert: PrometheusNotIngestingMetrics
    expr: rate(prometheus_tsdb_head_samples_appended_total[5m]) <= 0
    for: 10m
  ```
Conclusion
Implementing and validating metrics exporters is essential for maintaining visibility into your system's health and performance. By following this guide, you can set up robust monitoring for important services like `xdr-dashboard` and `wazuh-indexer`, ensuring you're promptly alerted to any issues.
The combination of standard exporters like Node Exporter and Process Exporter with custom Python-based solutions provides comprehensive coverage for both system-level and service-specific metrics. Regular validation and testing ensure your monitoring system itself remains reliable.
Remember that effective monitoring is not a set-and-forget solution—it requires ongoing maintenance, validation, and refinement to adapt to changing infrastructure and service needs.