Comprehensive Guide to Service Metrics Exporters
Effective monitoring of service performance is essential for maintaining reliable infrastructure. Metrics exporters serve as the backbone of modern monitoring systems, collecting vital information about CPU usage, memory consumption, service uptime, and other critical metrics. This comprehensive guide explores how to implement, validate, and troubleshoot metrics exporters for your services, with practical examples for both standard and custom exporters.
Understanding Metrics Exporters
Metrics exporters are specialized agents that collect performance data from services and systems, then expose this information in a format that monitoring platforms like Prometheus can scrape and analyze.
graph TD
A[Services] --> B[Metrics Exporters]
B --> C[Prometheus/Monitoring System]
C --> D[Alerting]
C --> E[Visualization/Grafana]
subgraph "Exporters"
F[Node Exporter]
G[Process Exporter]
H[Custom Exporters]
end
B --> F
B --> G
B --> H
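Whatever the exporter, the data it exposes is plain text served over HTTP in the Prometheus exposition format: one sample per line, with optional labels in braces and a numeric value. A scrape of Node Exporter, for example, returns lines like the following (the values here are illustrative):

curl -s http://localhost:9100/metrics | head -n 20

# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 312568.4
node_memory_MemAvailable_bytes 8.254152704e+09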
Common Types of Exporters
- Node Exporter: Collects system-wide metrics like CPU, memory, disk, and network usage
- Process Exporter: Focuses on process-specific metrics
- Custom Exporters: Tailored solutions for specific applications or services
- Application Exporters: Built into applications to expose internal metrics
Implementing a Custom Python Metrics Exporter
While standard exporters cover many use cases, sometimes you need a custom solution. Here’s how to create a Python-based metrics exporter for specific services like xdr-dashboard and wazuh-indexer:
Setting Up the Custom Exporter
First, let’s create a Python script that collects and exports metrics:
#!/usr/bin/env python3
# metrics_exporter.py
import psutil
import json
import os
import time
import argparse
from datetime import datetime

# Configuration
SERVICES = ["xdr-dashboard", "wazuh-indexer"]
METRICS_DIR = "metrics"

def ensure_metrics_dir():
    """Ensure metrics directory exists"""
    if not os.path.exists(METRICS_DIR):
        os.makedirs(METRICS_DIR)

def get_process_metrics(process_name):
    """Get metrics for a specific process"""
    metrics = {
        "cpu_percent": 0,
        "memory_percent": 0,
        "memory_mb": 0,
        "uptime_seconds": 0,
        "is_running": False
    }

    # Find all processes matching the name
    matching_processes = []
    for proc in psutil.process_iter(['pid', 'name', 'cmdline', 'create_time']):
        try:
            # Check if the process name or command line contains the service name
            cmdline = proc.info.get('cmdline') or []  # cmdline can be None for some processes
            if process_name in proc.info['name'] or any(process_name in cmd for cmd in cmdline if cmd):
                matching_processes.append(proc)
        except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
            pass

    if matching_processes:
        metrics["is_running"] = True
        # Aggregate metrics from all matching processes
        for proc in matching_processes:
            try:
                # Add CPU and memory usage
                process = psutil.Process(proc.info['pid'])
                metrics["cpu_percent"] += process.cpu_percent(interval=0.1)
                metrics["memory_percent"] += process.memory_percent()
                metrics["memory_mb"] += process.memory_info().rss / (1024 * 1024)  # Convert to MB

                # Calculate uptime
                current_time = time.time()
                process_start_time = proc.info['create_time']
                process_uptime = current_time - process_start_time
                metrics["uptime_seconds"] = max(metrics["uptime_seconds"], process_uptime)
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                pass

    return metrics

def collect_metrics():
    """Collect metrics for configured services"""
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    all_metrics = {}

    for service in SERVICES:
        metrics = get_process_metrics(service)
        all_metrics[service] = metrics

        # Save individual service metrics
        service_filename = f"{METRICS_DIR}/{service}_{timestamp}.json"
        with open(service_filename, 'w') as f:
            json.dump(metrics, f, indent=2)

        # Print service metrics to console
        print(f"=== {service} Metrics ===")
        print(f"Running: {metrics['is_running']}")
        print(f"CPU Usage: {metrics['cpu_percent']:.2f}%")
        print(f"Memory Usage: {metrics['memory_percent']:.2f}% ({metrics['memory_mb']:.2f} MB)")
        print(f"Uptime: {metrics['uptime_seconds']/3600:.2f} hours")
        print("")

    # Save combined metrics
    combined_filename = f"{METRICS_DIR}/all_services_{timestamp}.json"
    with open(combined_filename, 'w') as f:
        json.dump(all_metrics, f, indent=2)

    return all_metrics

def main():
    parser = argparse.ArgumentParser(description='Service Metrics Exporter')
    parser.add_argument('--continuous', action='store_true', help='Run continuously')
    parser.add_argument('--interval', type=int, default=300, help='Interval in seconds (default: 300)')
    args = parser.parse_args()

    ensure_metrics_dir()

    if args.continuous:
        print(f"Starting continuous monitoring with {args.interval} second intervals...")
        try:
            while True:
                print(f"\n[{datetime.now()}] Collecting metrics...")
                collect_metrics()
                time.sleep(args.interval)
        except KeyboardInterrupt:
            print("Monitoring stopped by user.")
    else:
        collect_metrics()

if __name__ == "__main__":
    main()
Installing Dependencies
Ensure you have the necessary dependencies:
pip install psutil
Running the Exporter
The script can be run in different ways depending on your needs:
# Single collection
python metrics_exporter.py
# Continuous monitoring every 5 minutes
python metrics_exporter.py --continuous --interval 300
# Continuous monitoring every minute
python metrics_exporter.py --continuous --interval 60
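The script above writes JSON snapshots to disk, which is handy for ad-hoc checks but is not something Prometheus can scrape directly. If you want a scrapeable variant, the same psutil logic can be wrapped with the prometheus_client library (pip install prometheus_client). The following is a minimal sketch, not part of the script above: the port 9101, the metric names, and the assumption that both files sit in the same directory are choices you can change.

#!/usr/bin/env python3
# prometheus_metrics_exporter.py - sketch of a scrapeable variant of the exporter above
import time
from prometheus_client import Gauge, start_http_server
from metrics_exporter import SERVICES, get_process_metrics  # reuses the functions defined earlier

CPU = Gauge("custom_service_cpu_percent", "CPU usage per service", ["service"])
MEM = Gauge("custom_service_memory_mb", "Resident memory in MB per service", ["service"])
UP = Gauge("custom_service_up", "1 if the service has at least one running process", ["service"])

if __name__ == "__main__":
    start_http_server(9101)  # metrics served at http://localhost:9101/metrics
    while True:
        for service in SERVICES:
            m = get_process_metrics(service)
            CPU.labels(service=service).set(m["cpu_percent"])
            MEM.labels(service=service).set(m["memory_mb"])
            UP.labels(service=service).set(1 if m["is_running"] else 0)
        time.sleep(30)  # refresh the gauges every 30 seconds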
Setting Up as a System Service
For production use, create a systemd service:
# Create service file
sudo tee /etc/systemd/system/metrics-exporter.service > /dev/null << 'EOF'
[Unit]
Description=Custom Metrics Exporter
After=network.target
[Service]
Type=simple
User=prometheus
# WorkingDirectory so the script's relative "metrics" directory resolves under /opt/monitoring
WorkingDirectory=/opt/monitoring
ExecStart=/usr/bin/python3 /opt/monitoring/metrics_exporter.py --continuous --interval 300
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable metrics-exporter
sudo systemctl start metrics-exporter
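Once the unit is running, a quick spot check confirms that it is active and that snapshots are accumulating (the path assumes the WorkingDirectory set above):

sudo systemctl status metrics-exporter
sudo journalctl -u metrics-exporter -n 20
ls -lt /opt/monitoring/metrics | head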
Working with Standard Exporters
While custom exporters are useful, standard ones like Node Exporter and Process Exporter are often more reliable for production use.
Node Exporter Setup
Node Exporter provides system-wide metrics:
# Download Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
# Extract and install
tar xvfz node_exporter-*.tar.gz
sudo mv node_exporter-*/node_exporter /usr/local/bin/
# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Node Exporter
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/node_exporter
Restart=always
[Install]
WantedBy=multi-user.target
EOF
# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
Process Exporter Setup
Process Exporter focuses on specific processes:
# Download Process Exporter
wget https://github.com/ncabatoff/process-exporter/releases/download/v0.7.10/process-exporter-0.7.10.linux-amd64.tar.gz
# Extract and install
tar xvfz process-exporter-*.tar.gz
sudo mv process-exporter-*/process-exporter /usr/local/bin/
# Create configuration
sudo mkdir -p /etc/process-exporter
sudo tee /etc/process-exporter/config.yml > /dev/null << 'EOF'
process_names:
  - name: "xdr-dashboard"
    cmdline:
      - '.+xdr-dashboard.+'
  - name: "wazuh-indexer"
    cmdline:
      - '.+wazuh-indexer.+'
EOF
# Create systemd service
sudo tee /etc/systemd/system/process-exporter.service > /dev/null << 'EOF'
[Unit]
Description=Process Exporter
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/process-exporter --config.path=/etc/process-exporter/config.yml
Restart=always
[Install]
WantedBy=multi-user.target
EOF
# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable process-exporter
sudo systemctl start process-exporter
Validating Metrics Collection
After setting up exporters, it’s crucial to verify they’re working correctly. Here are comprehensive validation methods:
Using curl to Check Metrics Endpoints
# Check Node Exporter metrics
curl http://localhost:9100/metrics | grep process
# Check Process Exporter metrics
curl http://localhost:9256/metrics | grep process
# Check if specific services are being monitored
curl http://localhost:9256/metrics | grep -E "xdr|wazuh"
Verifying Exporter Services
# Check Node Exporter status
sudo systemctl status node_exporter
sudo journalctl -u node_exporter -f
# Check Process Exporter status
sudo systemctl status process-exporter
sudo journalctl -u process-exporter -f
Checking Process Discovery
# List all processes being monitored by Process Exporter
curl http://localhost:9256/metrics | grep groupname
# Check if specific services are running
ps aux | grep -E "xdr|wazuh"
Creating a Testing Script
For quick validation, you can create a testing script:
#!/usr/bin/env python3
# test_metrics.py
import requests
import re
import sys

def test_node_exporter():
    """Test if Node Exporter is working"""
    try:
        response = requests.get("http://localhost:9100/metrics", timeout=5)
        if response.status_code == 200:
            print("✅ Node Exporter is running and responding")
            return True
        else:
            print(f"❌ Node Exporter returned status code {response.status_code}")
            return False
    except requests.exceptions.RequestException as e:
        print(f"❌ Node Exporter is not accessible: {e}")
        return False

def test_process_exporter():
    """Test if Process Exporter is working"""
    try:
        response = requests.get("http://localhost:9256/metrics", timeout=5)
        if response.status_code == 200:
            print("✅ Process Exporter is running and responding")
            return True
        else:
            print(f"❌ Process Exporter returned status code {response.status_code}")
            return False
    except requests.exceptions.RequestException as e:
        print(f"❌ Process Exporter is not accessible: {e}")
        return False

def check_service_metrics(service_name):
    """Check if metrics for specific service are available"""
    try:
        response = requests.get("http://localhost:9256/metrics", timeout=5)
        if response.status_code == 200:
            # Process Exporter labels each configured group with groupname="<name>"
            if re.search(rf'groupname="{service_name}"', response.text):
                print(f"✅ Metrics for {service_name} found")
                return True
            else:
                print(f"❌ No metrics found for {service_name}")
                return False
        else:
            print(f"❌ Process Exporter returned status code {response.status_code}")
            return False
    except requests.exceptions.RequestException as e:
        print(f"❌ Process Exporter is not accessible: {e}")
        return False

if __name__ == "__main__":
    print("=== Metrics Exporter Validation ===")

    # Test basic connectivity
    node_ok = test_node_exporter()
    process_ok = test_process_exporter()

    # Check for specific services
    services = ["xdr-dashboard", "wazuh-indexer"]
    service_status = []
    if process_ok:
        for service in services:
            service_status.append(check_service_metrics(service))

    # Print summary
    print("\n=== Test Summary ===")
    print(f"Node Exporter: {'OK' if node_ok else 'FAIL'}")
    print(f"Process Exporter: {'OK' if process_ok else 'FAIL'}")
    if process_ok:
        for i, service in enumerate(services):
            print(f"Service {service}: {'OK' if service_status[i] else 'FAIL'}")

    # Determine overall status
    if all([node_ok, process_ok] + service_status):
        print("\n✅ All tests passed!")
        sys.exit(0)
    else:
        print("\n❌ Some tests failed!")
        sys.exit(1)
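The test script depends on the requests library and can be run directly; because it exits non-zero on failure, it is easy to wire into cron or CI:

pip install requests
python3 test_metrics.py
echo $?   # 0 = all checks passed, 1 = something failed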
Creating a Bash Health Check Script
For a lightweight alternative, create a bash script:
#!/bin/bash
# check_metrics.sh
# Set script variables
timestamp=$(date '+%Y-%m-%d %H:%M:%S')
node_exporter_port=9100
process_exporter_port=9256
services=("xdr-dashboard" "wazuh-indexer")
# ANSI color codes
GREEN='\033[0;32m'
RED='\033[0;31m'
NC='\033[0m' # No Color
echo "=== Metrics Health Check - $timestamp ==="
# Check if ports are open
echo -n "Checking Node Exporter port ($node_exporter_port): "
if nc -z localhost $node_exporter_port; then
echo -e "${GREEN}OK${NC}"
else
echo -e "${RED}FAIL${NC}"
fi
echo -n "Checking Process Exporter port ($process_exporter_port): "
if nc -z localhost $process_exporter_port; then
echo -e "${GREEN}OK${NC}"
else
echo -e "${RED}FAIL${NC}"
fi
# Check for metrics from each service
echo -e "\nChecking service metrics:"
for service in "${services[@]}"; do
echo -n " $service: "
if curl -s "http://localhost:$process_exporter_port/metrics" | grep -q "groupname=\"$service\""; then
echo -e "${GREEN}Found${NC}"
else
echo -e "${RED}Not found${NC}"
fi
done
# Check system services
echo -e "\nChecking exporter services:"
echo -n " node_exporter: "
if systemctl is-active --quiet node_exporter; then
echo -e "${GREEN}Running${NC}"
else
echo -e "${RED}Not running${NC}"
fi
echo -n " process-exporter: "
if systemctl is-active --quiet process-exporter; then
echo -e "${GREEN}Running${NC}"
else
echo -e "${RED}Not running${NC}"
fi
# Add information about Prometheus scraping
echo -e "\nPrometheus targets can be checked at: http://localhost:9090/targets"
Troubleshooting Common Issues
If your metrics exporters aren’t working as expected, try these troubleshooting steps:
Network and Connectivity Issues
# Check if exporters are running and listening on expected ports
sudo netstat -tulpn | grep -E '9100|9256'
# Check firewall rules
sudo ufw status
# or
sudo iptables -L
Service Problems
# Check service logs
sudo journalctl -u node_exporter -n 50
sudo journalctl -u process-exporter -n 50
Configuration Validation
# Verify Prometheus configuration
sudo cat /etc/prometheus/prometheus.yml
# Check if Prometheus is scraping the targets
curl http://localhost:9090/api/v1/targets | jq .
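If you only want each target's health rather than the full JSON document, a jq filter like the following (assuming jq is installed) narrows the output:

curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health: .health}'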
Permissions and Access
# Check if the exporter user has proper permissions
sudo -u prometheus /usr/local/bin/node_exporter --version
# Verify directory permissions
ls -la /var/lib/prometheus/
Process Matching Issues
If Process Exporter isn’t finding your services, check your regex patterns:
# Test your regex pattern against running processes
ps aux | grep -E "xdr|wazuh"
# Adjust your process-exporter config.yml if needed
sudo nano /etc/process-exporter/config.yml
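Before restarting the exporter, it helps to confirm that a pattern actually matches a live command line. The pattern below is the one from the config above; adapt it to whatever you have configured:

# Does the pattern match a running command line?
ps -eo args | grep -E '.+wazuh-indexer.+'

# After editing config.yml, restart and confirm the group appears
sudo systemctl restart process-exporter
curl -s http://localhost:9256/metrics | grep groupname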
Integrating with Prometheus
To complete your monitoring setup, integrate your exporters with Prometheus:
# Add to prometheus.yml
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]
  - job_name: "process"
    static_configs:
      - targets: ["localhost:9256"]
Creating Alerts for Critical Services
# prometheus-alerts.yml
groups:
  - name: service_alerts
    rules:
      - alert: ServiceDown
        # Process Exporter exposes one namedprocess_namegroup_num_procs series per configured group
        expr: namedprocess_namegroup_num_procs{groupname=~"xdr-dashboard|wazuh-indexer"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.groupname }} is down"
          description: "The service {{ $labels.groupname }} has been down for more than 1 minute."
      - alert: HighCPUUsage
        expr: sum by (groupname) (rate(namedprocess_namegroup_cpu_seconds_total{groupname=~"xdr-dashboard|wazuh-indexer"}[5m])) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage for {{ $labels.groupname }}"
          description: "{{ $labels.groupname }} is using more than 80% CPU for over 5 minutes."
Visualizing Metrics with Grafana
Create a comprehensive dashboard to visualize your metrics:
# Create a dashboard JSON file
cat << EOF > service-metrics-dashboard.json
{
"dashboard": {
"title": "Service Performance Dashboard",
"panels": [
{
"title": "Service Status",
"type": "stat",
"targets": [
{
"expr": "process_up{name=~\"xdr-dashboard|wazuh-indexer\"}"
}
],
"mappings": [
{
"type": "value",
"options": {
"0": {
"text": "Down",
"color": "red"
},
"1": {
"text": "Up",
"color": "green"
}
}
}
]
},
{
"title": "CPU Usage",
"type": "graph",
"targets": [
{
"expr": "rate(process_cpu_seconds_total{name=~\"xdr-dashboard|wazuh-indexer\"}[5m]) * 100",
"legendFormat": "{{name}}"
}
],
"yaxes": [
{
"format": "percent"
}
]
},
{
"title": "Memory Usage",
"type": "graph",
"targets": [
{
"expr": "process_resident_memory_bytes{name=~\"xdr-dashboard|wazuh-indexer\"} / 1024 / 1024",
"legendFormat": "{{name}} (MB)"
}
],
"yaxes": [
{
"format": "decmbytes"
}
]
}
]
}
}
EOF
# Import the dashboard through Grafana's API
curl -X POST -H "Content-Type: application/json" \
-d @service-metrics-dashboard.json \
http://admin:password@localhost:3000/api/dashboards/db
Best Practices for Production Use
Security Considerations
- Use TLS for metrics endpoints:

# TLS config file example
cat << EOF > /etc/node_exporter/web-config.yml
tls_server_config:
  cert_file: /etc/node_exporter/cert.pem
  key_file: /etc/node_exporter/key.pem
EOF

# Point Node Exporter at the config
/usr/local/bin/node_exporter --web.config.file=/etc/node_exporter/web-config.yml
- Implement authentication (see the note on generating the password hash after this list):

cat << EOF > /etc/node_exporter/web-config.yml
basic_auth_users:
  prometheus: $HASHED_PASSWORD
EOF
- Use least-privilege users:

sudo useradd -rs /bin/false prometheus
sudo chown prometheus:prometheus /usr/local/bin/node_exporter
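One detail worth calling out from the authentication step: the basic_auth_users value must be a bcrypt hash, not a plain-text password. A common way to generate one, assuming the htpasswd tool from apache2-utils is available:

# Prompts for the password, prints a bcrypt hash suitable for web-config.yml
htpasswd -nBC 10 "" | tr -d ':\n'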
Scaling Considerations
- Use service discovery for dynamic environments (see the sketch after this list)
- Implement federation for large-scale deployments (also covered in the sketch below)
- Set appropriate scrape intervals based on resource consumption
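A minimal sketch of the first two points, using file-based service discovery and a federation job. The target names and file paths here are placeholders, not values from earlier in this guide:

# prometheus.yml (excerpt)
scrape_configs:
  # File-based service discovery: edit the target files without restarting Prometheus
  - job_name: "node-dynamic"
    scrape_interval: 30s
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json

  # Federation: a central Prometheus pulls selected series from edge servers
  - job_name: "federate"
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~"node|process"}'
    static_configs:
      - targets: ["edge-prometheus-1:9090", "edge-prometheus-2:9090"]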
Maintenance Guidelines
- Regular updates (a sketch of the update-check script follows this list):

# Set up a cronjob to check for updates
0 2 * * 0 /usr/local/bin/check_exporter_updates.sh >> /var/log/exporter_updates.log 2>&1
- Backup configurations:

# Backup all exporter configs
tar -czf /var/backups/exporters-$(date +%Y%m%d).tar.gz /etc/process-exporter/ /etc/node_exporter/
- Monitor the monitors:

# Create alerts for your monitoring system
- alert: PrometheusNotIngestingMetrics
  expr: rate(prometheus_tsdb_head_samples_appended_total[5m]) <= 0
  for: 10m
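The cron entry under "Regular updates" refers to a script this guide does not ship. A rough sketch of what /usr/local/bin/check_exporter_updates.sh might look like follows; the version parsing and the GitHub API call are assumptions to adapt to your environment:

#!/bin/bash
# check_exporter_updates.sh - compare the installed node_exporter with the latest release
installed=$(/usr/local/bin/node_exporter --version 2>&1 | head -n1 | awk '{print $3}')
latest=$(curl -s https://api.github.com/repos/prometheus/node_exporter/releases/latest | jq -r '.tag_name' | sed 's/^v//')
if [ -n "$latest" ] && [ "$installed" != "$latest" ]; then
    echo "$(date '+%F %T') node_exporter update available: $installed -> $latest"
fi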
Conclusion
Implementing and validating metrics exporters is essential for maintaining visibility into your system’s health and performance. By following this guide, you can set up robust monitoring for important services like xdr-dashboard and wazuh-indexer, ensuring you’re promptly alerted to any issues.
The combination of standard exporters like Node Exporter and Process Exporter with custom Python-based solutions provides comprehensive coverage for both system-level and service-specific metrics. Regular validation and testing ensure your monitoring system itself remains reliable.
Remember that effective monitoring is not a set-and-forget solution—it requires ongoing maintenance, validation, and refinement to adapt to changing infrastructure and service needs.