Comprehensive Guide to Service Metrics Exporters
Effective monitoring of service performance is essential for maintaining reliable infrastructure. Metrics exporters serve as the backbone of modern monitoring systems, collecting vital information about CPU usage, memory consumption, service uptime, and other critical metrics. This comprehensive guide explores how to implement, validate, and troubleshoot metrics exporters for your services, with practical examples for both standard and custom exporters.
Understanding Metrics Exporters
Metrics exporters are specialized agents that collect performance data from services and systems, then expose this information in a format that monitoring platforms like Prometheus can scrape and analyze.
graph TD
A[Services] --> B[Metrics Exporters]
B --> C[Prometheus/Monitoring System]
C --> D[Alerting]
C --> E[Visualization/Grafana]
subgraph "Exporters"
F[Node Exporter]
G[Process Exporter]
H[Custom Exporters]
end
B --> F
B --> G
B --> H
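Whatever the exporter, the data it exposes is plain text served over HTTP in the Prometheus exposition format: one sample per line, with optional labels in braces and a numeric value. A scrape of Node Exporter, for example, returns lines like the following (the values here are illustrative):

curl -s http://localhost:9100/metrics | head -n 20

# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 312568.4
node_memory_MemAvailable_bytes 8.254152704e+09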
Common Types of Exporters
- Node Exporter: Collects system-wide metrics like CPU, memory, disk, and network usage
- Process Exporter: Focuses on process-specific metrics
- Custom Exporters: Tailored solutions for specific applications or services
- Application Exporters: Built into applications to expose internal metrics
Implementing a Custom Python Metrics Exporter
While standard exporters cover many use cases, sometimes you need a custom solution. Here’s how to create a Python-based metrics exporter for specific services like xdr-dashboard and wazuh-indexer:
Setting Up the Custom Exporter
First, let’s create a Python script that collects and exports metrics:
#!/usr/bin/env python3
# metrics_exporter.py
import psutil
import json
import os
import time
import argparse
from datetime import datetime

# Configuration
SERVICES = ["xdr-dashboard", "wazuh-indexer"]
METRICS_DIR = "metrics"

def ensure_metrics_dir():
    """Ensure metrics directory exists"""
    if not os.path.exists(METRICS_DIR):
        os.makedirs(METRICS_DIR)

def get_process_metrics(process_name):
    """Get metrics for a specific process"""
    metrics = {
        "cpu_percent": 0,
        "memory_percent": 0,
        "memory_mb": 0,
        "uptime_seconds": 0,
        "is_running": False
    }

    # Find all processes matching the name
    matching_processes = []
    for proc in psutil.process_iter(['pid', 'name', 'cmdline', 'create_time']):
        try:
            # Check if the process name or command line contains the service name
            cmdline = proc.info.get('cmdline') or []  # cmdline can be None for some processes
            if process_name in proc.info['name'] or any(process_name in cmd for cmd in cmdline if cmd):
                matching_processes.append(proc)
        except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
            pass

    if matching_processes:
        metrics["is_running"] = True
        # Aggregate metrics from all matching processes
        for proc in matching_processes:
            try:
                # Add CPU and memory usage
                process = psutil.Process(proc.info['pid'])
                metrics["cpu_percent"] += process.cpu_percent(interval=0.1)
                metrics["memory_percent"] += process.memory_percent()
                metrics["memory_mb"] += process.memory_info().rss / (1024 * 1024)  # Convert to MB

                # Calculate uptime
                current_time = time.time()
                process_start_time = proc.info['create_time']
                process_uptime = current_time - process_start_time
                metrics["uptime_seconds"] = max(metrics["uptime_seconds"], process_uptime)
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                pass

    return metrics

def collect_metrics():
    """Collect metrics for configured services"""
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    all_metrics = {}

    for service in SERVICES:
        metrics = get_process_metrics(service)
        all_metrics[service] = metrics

        # Save individual service metrics
        service_filename = f"{METRICS_DIR}/{service}_{timestamp}.json"
        with open(service_filename, 'w') as f:
            json.dump(metrics, f, indent=2)

        # Print service metrics to console
        print(f"=== {service} Metrics ===")
        print(f"Running: {metrics['is_running']}")
        print(f"CPU Usage: {metrics['cpu_percent']:.2f}%")
        print(f"Memory Usage: {metrics['memory_percent']:.2f}% ({metrics['memory_mb']:.2f} MB)")
        print(f"Uptime: {metrics['uptime_seconds']/3600:.2f} hours")
        print("")

    # Save combined metrics
    combined_filename = f"{METRICS_DIR}/all_services_{timestamp}.json"
    with open(combined_filename, 'w') as f:
        json.dump(all_metrics, f, indent=2)

    return all_metrics

def main():
    parser = argparse.ArgumentParser(description='Service Metrics Exporter')
    parser.add_argument('--continuous', action='store_true', help='Run continuously')
    parser.add_argument('--interval', type=int, default=300, help='Interval in seconds (default: 300)')
    args = parser.parse_args()

    ensure_metrics_dir()

    if args.continuous:
        print(f"Starting continuous monitoring with {args.interval} second intervals...")
        try:
            while True:
                print(f"\n[{datetime.now()}] Collecting metrics...")
                collect_metrics()
                time.sleep(args.interval)
        except KeyboardInterrupt:
            print("Monitoring stopped by user.")
    else:
        collect_metrics()

if __name__ == "__main__":
    main()
Installing Dependencies
Ensure you have the necessary dependencies:
pip install psutil
Running the Exporter
The script can be run in different ways depending on your needs:
# Single collection
python metrics_exporter.py
# Continuous monitoring every 5 minutes
python metrics_exporter.py --continuous --interval 300
# Continuous monitoring every minute
python metrics_exporter.py --continuous --interval 60
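The script above writes JSON snapshots to disk, which is handy for ad-hoc checks but is not something Prometheus can scrape directly. If you want a scrapeable variant, the same psutil logic can be wrapped with the prometheus_client library (pip install prometheus_client). The following is a minimal sketch, not part of the script above: the port 9101, the metric names, and the assumption that both files sit in the same directory are choices you can change.

#!/usr/bin/env python3
# prometheus_metrics_exporter.py - sketch of a scrapeable variant of the exporter above
import time
from prometheus_client import Gauge, start_http_server
from metrics_exporter import SERVICES, get_process_metrics  # reuses the functions defined earlier

CPU = Gauge("custom_service_cpu_percent", "CPU usage per service", ["service"])
MEM = Gauge("custom_service_memory_mb", "Resident memory in MB per service", ["service"])
UP = Gauge("custom_service_up", "1 if the service has at least one running process", ["service"])

if __name__ == "__main__":
    start_http_server(9101)  # metrics served at http://localhost:9101/metrics
    while True:
        for service in SERVICES:
            m = get_process_metrics(service)
            CPU.labels(service=service).set(m["cpu_percent"])
            MEM.labels(service=service).set(m["memory_mb"])
            UP.labels(service=service).set(1 if m["is_running"] else 0)
        time.sleep(30)  # refresh the gauges every 30 seconds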
Setting Up as a System Service
For production use, create a systemd service:
# Create service file
sudo tee /etc/systemd/system/metrics-exporter.service > /dev/null << 'EOF'
[Unit]
Description=Custom Metrics Exporter
After=network.target
[Service]
Type=simple
User=prometheus
# WorkingDirectory so the script's relative "metrics" directory resolves under /opt/monitoring
WorkingDirectory=/opt/monitoring
ExecStart=/usr/bin/python3 /opt/monitoring/metrics_exporter.py --continuous --interval 300
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable metrics-exporter
sudo systemctl start metrics-exporter
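Once the unit is running, a quick spot check confirms that it is active and that snapshots are accumulating (the path assumes the WorkingDirectory set above):

sudo systemctl status metrics-exporter
sudo journalctl -u metrics-exporter -n 20
ls -lt /opt/monitoring/metrics | head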
Working with Standard Exporters
While custom exporters are useful, standard ones like Node Exporter and Process Exporter are often more reliable for production use.
Node Exporter Setup
Node Exporter provides system-wide metrics:
# Download Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
# Extract and install
tar xvfz node_exporter-*.tar.gz
sudo mv node_exporter-*/node_exporter /usr/local/bin/
# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Node Exporter
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/node_exporter
Restart=always
[Install]
WantedBy=multi-user.target
EOF
# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
Process Exporter Setup
Process Exporter focuses on specific processes:
# Download Process Exporter
wget https://github.com/ncabatoff/process-exporter/releases/download/v0.7.10/process-exporter-0.7.10.linux-amd64.tar.gz
# Extract and install
tar xvfz process-exporter-*.tar.gz
sudo mv process-exporter-*/process-exporter /usr/local/bin/
# Create configuration
sudo mkdir -p /etc/process-exporter
sudo tee /etc/process-exporter/config.yml > /dev/null << 'EOF'
process_names:
  - name: "xdr-dashboard"
    cmdline:
      - '.+xdr-dashboard.+'
  - name: "wazuh-indexer"
    cmdline:
      - '.+wazuh-indexer.+'
EOF
# Create systemd service
sudo tee /etc/systemd/system/process-exporter.service > /dev/null << 'EOF'
[Unit]
Description=Process Exporter
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/process-exporter --config.path=/etc/process-exporter/config.yml
Restart=always
[Install]
WantedBy=multi-user.target
EOF
# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable process-exporter
sudo systemctl start process-exporter
Validating Metrics Collection
After setting up exporters, it’s crucial to verify they’re working correctly. Here are comprehensive validation methods:
Using curl to Check Metrics Endpoints
# Check Node Exporter metrics
curl http://localhost:9100/metrics | grep process
# Check Process Exporter metrics
curl http://localhost:9256/metrics | grep process
# Check if specific services are being monitored
curl http://localhost:9256/metrics | grep -E "xdr|wazuh"
Verifying Exporter Services
# Check Node Exporter status
sudo systemctl status node_exporter
sudo journalctl -u node_exporter -f
# Check Process Exporter status
sudo systemctl status process-exporter
sudo journalctl -u process-exporter -f
Checking Process Discovery
# List all processes being monitored by Process Exporter
curl http://localhost:9256/metrics | grep groupname
# Check if specific services are running
ps aux | grep -E "xdr|wazuh"
Creating a Testing Script
For quick validation, you can create a testing script:
#!/usr/bin/env python3
# test_metrics.py
import requests
import re
import sys

def test_node_exporter():
    """Test if Node Exporter is working"""
    try:
        response = requests.get("http://localhost:9100/metrics", timeout=5)
        if response.status_code == 200:
            print("✅ Node Exporter is running and responding")
            return True
        else:
            print(f"❌ Node Exporter returned status code {response.status_code}")
            return False
    except requests.exceptions.RequestException as e:
        print(f"❌ Node Exporter is not accessible: {e}")
        return False

def test_process_exporter():
    """Test if Process Exporter is working"""
    try:
        response = requests.get("http://localhost:9256/metrics", timeout=5)
        if response.status_code == 200:
            print("✅ Process Exporter is running and responding")
            return True
        else:
            print(f"❌ Process Exporter returned status code {response.status_code}")
            return False
    except requests.exceptions.RequestException as e:
        print(f"❌ Process Exporter is not accessible: {e}")
        return False

def check_service_metrics(service_name):
    """Check if metrics for specific service are available"""
    try:
        response = requests.get("http://localhost:9256/metrics", timeout=5)
        if response.status_code == 200:
            # Process Exporter labels each configured group with groupname="<name>"
            if re.search(rf'groupname="{service_name}"', response.text):
                print(f"✅ Metrics for {service_name} found")
                return True
            else:
                print(f"❌ No metrics found for {service_name}")
                return False
        else:
            print(f"❌ Process Exporter returned status code {response.status_code}")
            return False
    except requests.exceptions.RequestException as e:
        print(f"❌ Process Exporter is not accessible: {e}")
        return False

if __name__ == "__main__":
    print("=== Metrics Exporter Validation ===")

    # Test basic connectivity
    node_ok = test_node_exporter()
    process_ok = test_process_exporter()

    # Check for specific services
    services = ["xdr-dashboard", "wazuh-indexer"]
    service_status = []
    if process_ok:
        for service in services:
            service_status.append(check_service_metrics(service))

    # Print summary
    print("\n=== Test Summary ===")
    print(f"Node Exporter: {'OK' if node_ok else 'FAIL'}")
    print(f"Process Exporter: {'OK' if process_ok else 'FAIL'}")
    if process_ok:
        for i, service in enumerate(services):
            print(f"Service {service}: {'OK' if service_status[i] else 'FAIL'}")

    # Determine overall status
    if all([node_ok, process_ok] + service_status):
        print("\n✅ All tests passed!")
        sys.exit(0)
    else:
        print("\n❌ Some tests failed!")
        sys.exit(1)
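The test script depends on the requests library and can be run directly; because it exits non-zero on failure, it is easy to wire into cron or CI:

pip install requests
python3 test_metrics.py
echo $?   # 0 = all checks passed, 1 = something failed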
Creating a Bash Health Check Script
For a lightweight alternative, create a bash script:
#!/bin/bash
# check_metrics.sh
# Set script variables
timestamp=$(date '+%Y-%m-%d %H:%M:%S')
node_exporter_port=9100
process_exporter_port=9256
services=("xdr-dashboard" "wazuh-indexer")
# ANSI color codes
GREEN='\033[0;32m'
RED='\033[0;31m'
NC='\033[0m' # No Color
echo "=== Metrics Health Check - $timestamp ==="
# Check if ports are open
echo -n "Checking Node Exporter port ($node_exporter_port): "
if nc -z localhost $node_exporter_port; then
echo -e "${GREEN}OK${NC}"
else
echo -e "${RED}FAIL${NC}"
fi
echo -n "Checking Process Exporter port ($process_exporter_port): "
if nc -z localhost $process_exporter_port; then
echo -e "${GREEN}OK${NC}"
else
echo -e "${RED}FAIL${NC}"
fi
# Check for metrics from each service
echo -e "\nChecking service metrics:"
for service in "${services[@]}"; do
echo -n " $service: "
if curl -s "http://localhost:$process_exporter_port/metrics" | grep -q "groupname=\"$service\""; then
echo -e "${GREEN}Found${NC}"
else
echo -e "${RED}Not found${NC}"
fi
done
# Check system services
echo -e "\nChecking exporter services:"
echo -n " node_exporter: "
if systemctl is-active --quiet node_exporter; then
echo -e "${GREEN}Running${NC}"
else
echo -e "${RED}Not running${NC}"
fi
echo -n " process-exporter: "
if systemctl is-active --quiet process-exporter; then
echo -e "${GREEN}Running${NC}"
else
echo -e "${RED}Not running${NC}"
fi
# Add information about Prometheus scraping
echo -e "\nPrometheus targets can be checked at: http://localhost:9090/targets"
Troubleshooting Common Issues
If your metrics exporters aren’t working as expected, try these troubleshooting steps:
Network and Connectivity Issues
# Check if exporters are running and listening on expected ports
sudo netstat -tulpn | grep -E '9100|9256'
# Check firewall rules
sudo ufw status
# or
sudo iptables -L
Service Problems
# Check service logs
sudo journalctl -u node_exporter -n 50
sudo journalctl -u process-exporter -n 50
Configuration Validation
# Verify Prometheus configuration
sudo cat /etc/prometheus/prometheus.yml
# Check if Prometheus is scraping the targets
curl http://localhost:9090/api/v1/targets | jq .
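If you only want each target's health rather than the full JSON document, a jq filter like the following (assuming jq is installed) narrows the output:

curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health: .health}'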
Permissions and Access
# Check if the exporter user has proper permissions
sudo -u prometheus /usr/local/bin/node_exporter --version
# Verify directory permissions
ls -la /var/lib/prometheus/
Process Matching Issues
If Process Exporter isn’t finding your services, check your regex patterns:
# Test your regex pattern against running processes
ps aux | grep -E "xdr|wazuh"
# Adjust your process-exporter config.yml if needed
sudo nano /etc/process-exporter/config.yml
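Before restarting the exporter, it helps to confirm that a pattern actually matches a live command line. The pattern below is the one from the config above; adapt it to whatever you have configured:

# Does the pattern match a running command line?
ps -eo args | grep -E '.+wazuh-indexer.+'

# After editing config.yml, restart and confirm the group appears
sudo systemctl restart process-exporter
curl -s http://localhost:9256/metrics | grep groupname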
Integrating with Prometheus
To complete your monitoring setup, integrate your exporters with Prometheus:
# Add to prometheus.yml
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]
  - job_name: "process"
    static_configs:
      - targets: ["localhost:9256"]
Creating Alerts for Critical Services
# prometheus-alerts.yml
groups:
  - name: service_alerts
    rules:
      - alert: ServiceDown
        # Process Exporter exposes one namedprocess_namegroup_num_procs series per configured group
        expr: namedprocess_namegroup_num_procs{groupname=~"xdr-dashboard|wazuh-indexer"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.groupname }} is down"
          description: "The service {{ $labels.groupname }} has been down for more than 1 minute."
      - alert: HighCPUUsage
        expr: sum by (groupname) (rate(namedprocess_namegroup_cpu_seconds_total{groupname=~"xdr-dashboard|wazuh-indexer"}[5m])) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage for {{ $labels.groupname }}"
          description: "{{ $labels.groupname }} is using more than 80% CPU for over 5 minutes."
Visualizing Metrics with Grafana
Create a comprehensive dashboard to visualize your metrics:
# Create a dashboard JSON file
cat << EOF > service-metrics-dashboard.json
{
"dashboard": {
"title": "Service Performance Dashboard",
"panels": [
{
"title": "Service Status",
"type": "stat",
"targets": [
{
"expr": "process_up{name=~\"xdr-dashboard|wazuh-indexer\"}"
}
],
"mappings": [
{
"type": "value",
"options": {
"0": {
"text": "Down",
"color": "red"
},
"1": {
"text": "Up",
"color": "green"
}
}
}
]
},
{
"title": "CPU Usage",
"type": "graph",
"targets": [
{
"expr": "rate(process_cpu_seconds_total{name=~\"xdr-dashboard|wazuh-indexer\"}[5m]) * 100",
"legendFormat": "{{name}}"
}
],
"yaxes": [
{
"format": "percent"
}
]
},
{
"title": "Memory Usage",
"type": "graph",
"targets": [
{
"expr": "process_resident_memory_bytes{name=~\"xdr-dashboard|wazuh-indexer\"} / 1024 / 1024",
"legendFormat": "{{name}} (MB)"
}
],
"yaxes": [
{
"format": "decmbytes"
}
]
}
]
}
}
EOF
# Import the dashboard through Grafana's API
curl -X POST -H "Content-Type: application/json" \
-d @service-metrics-dashboard.json \
http://admin:password@localhost:3000/api/dashboards/db
Best Practices for Production Use
Security Considerations
- Use TLS for metrics endpoints:

# TLS config file example
cat << EOF > /etc/node_exporter/web-config.yml
tls_server_config:
  cert_file: /etc/node_exporter/cert.pem
  key_file: /etc/node_exporter/key.pem
EOF

# Point Node Exporter at the config
/usr/local/bin/node_exporter --web.config.file=/etc/node_exporter/web-config.yml
- Implement authentication (see the note on generating the password hash after this list):

cat << EOF > /etc/node_exporter/web-config.yml
basic_auth_users:
  prometheus: $HASHED_PASSWORD
EOF
- Use least-privilege users:

sudo useradd -rs /bin/false prometheus
sudo chown prometheus:prometheus /usr/local/bin/node_exporter
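One detail worth calling out from the authentication step: the basic_auth_users value must be a bcrypt hash, not a plain-text password. A common way to generate one, assuming the htpasswd tool from apache2-utils is available:

# Prompts for the password, prints a bcrypt hash suitable for web-config.yml
htpasswd -nBC 10 "" | tr -d ':\n'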
Scaling Considerations
- Use service discovery for dynamic environments (see the sketch after this list)
- Implement federation for large-scale deployments (also covered in the sketch below)
- Set appropriate scrape intervals based on resource consumption
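A minimal sketch of the first two points, using file-based service discovery and a federation job. The target names and file paths here are placeholders, not values from earlier in this guide:

# prometheus.yml (excerpt)
scrape_configs:
  # File-based service discovery: edit the target files without restarting Prometheus
  - job_name: "node-dynamic"
    scrape_interval: 30s
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json

  # Federation: a central Prometheus pulls selected series from edge servers
  - job_name: "federate"
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~"node|process"}'
    static_configs:
      - targets: ["edge-prometheus-1:9090", "edge-prometheus-2:9090"]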
Maintenance Guidelines
- Regular updates (a sketch of the update-check script follows this list):

# Set up a cronjob to check for updates
0 2 * * 0 /usr/local/bin/check_exporter_updates.sh >> /var/log/exporter_updates.log 2>&1
- Backup configurations:

# Backup all exporter configs
tar -czf /var/backups/exporters-$(date +%Y%m%d).tar.gz /etc/process-exporter/ /etc/node_exporter/
- Monitor the monitors:

# Create alerts for your monitoring system
- alert: PrometheusNotIngestingMetrics
  expr: rate(prometheus_tsdb_head_samples_appended_total[5m]) <= 0
  for: 10m
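The cron entry under "Regular updates" refers to a script this guide does not ship. A rough sketch of what /usr/local/bin/check_exporter_updates.sh might look like follows; the version parsing and the GitHub API call are assumptions to adapt to your environment:

#!/bin/bash
# check_exporter_updates.sh - compare the installed node_exporter with the latest release
installed=$(/usr/local/bin/node_exporter --version 2>&1 | head -n1 | awk '{print $3}')
latest=$(curl -s https://api.github.com/repos/prometheus/node_exporter/releases/latest | jq -r '.tag_name' | sed 's/^v//')
if [ -n "$latest" ] && [ "$installed" != "$latest" ]; then
    echo "$(date '+%F %T') node_exporter update available: $installed -> $latest"
fi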
Conclusion
Implementing and validating metrics exporters is essential for maintaining visibility into your system’s health and performance. By following this guide, you can set up robust monitoring for important services like xdr-dashboard and wazuh-indexer, ensuring you’re promptly alerted to any issues.
The combination of standard exporters like Node Exporter and Process Exporter with custom Python-based solutions provides comprehensive coverage for both system-level and service-specific metrics. Regular validation and testing ensure your monitoring system itself remains reliable.
Remember that effective monitoring is not a set-and-forget solution—it requires ongoing maintenance, validation, and refinement to adapt to changing infrastructure and service needs.