
Comprehensive Guide to Service Metrics Exporters - Implementation, Validation, and Monitoring

Published: at 10:00 AM

Effective monitoring of service performance is essential for maintaining reliable infrastructure. Metrics exporters serve as the backbone of modern monitoring systems, collecting vital information about CPU usage, memory consumption, service uptime, and other critical metrics. This comprehensive guide explores how to implement, validate, and troubleshoot metrics exporters for your services, with practical examples for both standard and custom exporters.

Understanding Metrics Exporters

Metrics exporters are specialized agents that collect performance data from services and systems, then expose this information in a format that monitoring platforms like Prometheus can scrape and analyze.

graph TD
    A[Services] --> B[Metrics Exporters]
    B --> C[Prometheus/Monitoring System]
    C --> D[Alerting]
    C --> E[Visualization/Grafana]

    subgraph "Exporters"
    F[Node Exporter]
    G[Process Exporter]
    H[Custom Exporters]
    end

    B --> F
    B --> G
    B --> H
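
Whatever the exporter, the data it exposes follows the Prometheus text-based exposition format: one sample per line, with optional HELP and TYPE metadata. The excerpt below is purely illustrative; the metric and label names are invented for this example:

# HELP service_cpu_percent CPU usage of the monitored service
# TYPE service_cpu_percent gauge
service_cpu_percent{service="xdr-dashboard"} 12.5
service_cpu_percent{service="wazuh-indexer"} 43.0
# HELP service_up Whether the monitored service is running (1 = up, 0 = down)
# TYPE service_up gauge
service_up{service="xdr-dashboard"} 1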

Common Types of Exporters

  1. Node Exporter: Collects system-wide metrics like CPU, memory, disk, and network usage
  2. Process Exporter: Focuses on process-specific metrics
  3. Custom Exporters: Tailored solutions for specific applications or services
  4. Application Exporters: Built into applications to expose internal metrics
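
To make the last category concrete, here is a minimal sketch of an application exporter built with the Python prometheus_client library (pip install prometheus_client); the metric names and port 8000 are arbitrary choices for illustration:

#!/usr/bin/env python3
# app_exporter_sketch.py - hypothetical application exporter (names and port are illustrative)

import random
import time

from prometheus_client import Counter, Gauge, start_http_server

# Metrics this hypothetical application reports about itself
REQUESTS_TOTAL = Counter("myapp_requests_total", "Total requests handled")
QUEUE_DEPTH = Gauge("myapp_queue_depth", "Items currently waiting in the work queue")

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics for Prometheus to scrape
    while True:
        # In a real application these values would come from the application itself
        REQUESTS_TOTAL.inc()
        QUEUE_DEPTH.set(random.randint(0, 50))
        time.sleep(5)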

Implementing a Custom Python Metrics Exporter

While standard exporters cover many use cases, sometimes you need a custom solution. Here’s how to create a Python-based metrics exporter for specific services like xdr-dashboard and wazuh-indexer:

Setting Up the Custom Exporter

First, let’s create a Python script that collects and exports metrics:

#!/usr/bin/env python3
# metrics_exporter.py

import psutil
import json
import os
import time
import argparse
from datetime import datetime

# Configuration
SERVICES = ["xdr-dashboard", "wazuh-indexer"]
METRICS_DIR = "metrics"

def ensure_metrics_dir():
    """Ensure metrics directory exists"""
    if not os.path.exists(METRICS_DIR):
        os.makedirs(METRICS_DIR)

def get_process_metrics(process_name):
    """Get metrics for a specific process"""
    metrics = {
        "cpu_percent": 0,
        "memory_percent": 0,
        "memory_mb": 0,
        "uptime_seconds": 0,
        "is_running": False
    }

    # Find all processes matching the name
    matching_processes = []
    for proc in psutil.process_iter(['pid', 'name', 'cmdline', 'create_time']):
        try:
            # Check if process name contains the service name
            if process_name in proc.info['name'] or any(process_name in cmd for cmd in (proc.info.get('cmdline') or []) if cmd):
                matching_processes.append(proc)
        except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
            pass

    if matching_processes:
        metrics["is_running"] = True

        # Aggregate metrics from all matching processes
        for proc in matching_processes:
            try:
                # Add CPU and memory usage
                process = psutil.Process(proc.info['pid'])
                metrics["cpu_percent"] += process.cpu_percent(interval=0.1)
                metrics["memory_percent"] += process.memory_percent()
                metrics["memory_mb"] += process.memory_info().rss / (1024 * 1024)  # Convert to MB

                # Calculate uptime
                current_time = time.time()
                process_start_time = proc.info['create_time']
                process_uptime = current_time - process_start_time
                metrics["uptime_seconds"] = max(metrics["uptime_seconds"], process_uptime)
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                pass

    return metrics

def collect_metrics():
    """Collect metrics for configured services"""
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    all_metrics = {}

    for service in SERVICES:
        metrics = get_process_metrics(service)
        all_metrics[service] = metrics

        # Save individual service metrics
        service_filename = f"{METRICS_DIR}/{service}_{timestamp}.json"
        with open(service_filename, 'w') as f:
            json.dump(metrics, f, indent=2)

        # Print service metrics to console
        print(f"=== {service} Metrics ===")
        print(f"Running: {metrics['is_running']}")
        print(f"CPU Usage: {metrics['cpu_percent']:.2f}%")
        print(f"Memory Usage: {metrics['memory_percent']:.2f}% ({metrics['memory_mb']:.2f} MB)")
        print(f"Uptime: {metrics['uptime_seconds']/3600:.2f} hours")
        print("")

    # Save combined metrics
    combined_filename = f"{METRICS_DIR}/all_services_{timestamp}.json"
    with open(combined_filename, 'w') as f:
        json.dump(all_metrics, f, indent=2)

    return all_metrics

def main():
    parser = argparse.ArgumentParser(description='Service Metrics Exporter')
    parser.add_argument('--continuous', action='store_true', help='Run continuously')
    parser.add_argument('--interval', type=int, default=300, help='Interval in seconds (default: 300)')
    args = parser.parse_args()

    ensure_metrics_dir()

    if args.continuous:
        print(f"Starting continuous monitoring with {args.interval} second intervals...")
        try:
            while True:
                print(f"\n[{datetime.now()}] Collecting metrics...")
                collect_metrics()
                time.sleep(args.interval)
        except KeyboardInterrupt:
            print("Monitoring stopped by user.")
    else:
        collect_metrics()

if __name__ == "__main__":
    main()
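
Note that this script writes JSON snapshots to disk rather than serving an HTTP endpoint, so Prometheus cannot scrape it directly. If you want that, one possible adaptation is a small bridge using prometheus_client (pip install prometheus_client). The sketch below reuses get_process_metrics from the script above; the port 9877 and the metric names are assumptions rather than established conventions:

#!/usr/bin/env python3
# prometheus_bridge.py - hypothetical bridge exposing the collected metrics for scraping

import time

from prometheus_client import Gauge, start_http_server

from metrics_exporter import SERVICES, get_process_metrics

# One gauge per measurement, labelled by service name
CPU = Gauge("custom_service_cpu_percent", "CPU usage percent", ["service"])
MEMORY_MB = Gauge("custom_service_memory_mb", "Resident memory in MB", ["service"])
UP = Gauge("custom_service_up", "1 if the service is running, 0 otherwise", ["service"])

if __name__ == "__main__":
    # Serve /metrics on an arbitrary port; add a matching scrape job in prometheus.yml
    start_http_server(9877)
    while True:
        for service in SERVICES:
            metrics = get_process_metrics(service)
            CPU.labels(service=service).set(metrics["cpu_percent"])
            MEMORY_MB.labels(service=service).set(metrics["memory_mb"])
            UP.labels(service=service).set(1 if metrics["is_running"] else 0)
        time.sleep(60)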

Installing Dependencies

Ensure you have the necessary dependencies (the validation script later in this guide also needs requests):

pip install psutil requests

Running the Exporter

The script can be run in different ways depending on your needs:

# Single collection
python metrics_exporter.py

# Continuous monitoring every 5 minutes
python metrics_exporter.py --continuous --interval 300

# Continuous monitoring every minute
python metrics_exporter.py --continuous --interval 60

Setting Up as a System Service

For production use, create a systemd service. Because the script writes its JSON files to a relative metrics directory, set WorkingDirectory to a path the service user can write to (here, /opt/monitoring):

# Create service file
sudo tee /etc/systemd/system/metrics-exporter.service > /dev/null << 'EOF'
[Unit]
Description=Custom Metrics Exporter
After=network.target

[Service]
Type=simple
User=prometheus
WorkingDirectory=/opt/monitoring
ExecStart=/usr/bin/python3 /opt/monitoring/metrics_exporter.py --continuous --interval 300
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable metrics-exporter
sudo systemctl start metrics-exporter

Working with Standard Exporters

While custom exporters are useful, standard ones like Node Exporter and Process Exporter are often more reliable for production use.

Node Exporter Setup

Node Exporter provides system-wide metrics:

# Download Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz

# Extract and install
tar xvfz node_exporter-*.tar.gz
sudo mv node_exporter-*/node_exporter /usr/local/bin/

# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/node_exporter
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

Process Exporter Setup

Process Exporter focuses on specific processes:

# Download Process Exporter
wget https://github.com/ncabatoff/process-exporter/releases/download/v0.7.10/process-exporter-0.7.10.linux-amd64.tar.gz

# Extract and install
tar xvfz process-exporter-*.tar.gz
sudo mv process-exporter-*/process-exporter /usr/local/bin/

# Create configuration
sudo mkdir -p /etc/process-exporter
sudo tee /etc/process-exporter/config.yml > /dev/null << 'EOF'
process_names:
  - name: "xdr-dashboard"
    cmdline:
    - '.+xdr-dashboard.+'
  - name: "wazuh-indexer"
    cmdline:
    - '.+wazuh-indexer.+'
EOF

# Create systemd service
sudo tee /etc/systemd/system/process-exporter.service > /dev/null << 'EOF'
[Unit]
Description=Process Exporter
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/process-exporter --config.path=/etc/process-exporter/config.yml
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable process-exporter
sudo systemctl start process-exporter

Validating Metrics Collection

After setting up exporters, it’s crucial to verify they’re working correctly. Bear in mind that the exact metric and label names depend on the exporter and its version: recent process-exporter releases, for example, expose series prefixed with namedprocess_namegroup_ and label process groups with groupname rather than process_name, so adjust the grep patterns and queries below to match your installation. Here are comprehensive validation methods:

Using curl to Check Metrics Endpoints

# Check Node Exporter metrics
curl http://localhost:9100/metrics | grep process

# Check Process Exporter metrics
curl http://localhost:9256/metrics | grep process

# Check if specific services are being monitored
curl http://localhost:9256/metrics | grep -E "xdr|wazuh"

Verifying Exporter Services

# Check Node Exporter status
sudo systemctl status node_exporter
sudo journalctl -u node_exporter -f

# Check Process Exporter status
sudo systemctl status process-exporter
sudo journalctl -u process-exporter -f

Checking Process Discovery

# List all processes being monitored by Process Exporter
curl http://localhost:9256/metrics | grep process_name

# Check if specific services are running
ps aux | grep -E "xdr|wazuh"

Creating a Testing Script

For quick validation, you can create a testing script:

#!/usr/bin/env python3
# test_metrics.py

import requests
import re
import sys

def test_node_exporter():
    """Test if Node Exporter is working"""
    try:
        response = requests.get("http://localhost:9100/metrics", timeout=5)
        if response.status_code == 200:
            print("✅ Node Exporter is running and responding")
            return True
        else:
            print(f"❌ Node Exporter returned status code {response.status_code}")
            return False
    except requests.exceptions.RequestException as e:
        print(f"❌ Node Exporter is not accessible: {e}")
        return False

def test_process_exporter():
    """Test if Process Exporter is working"""
    try:
        response = requests.get("http://localhost:9256/metrics", timeout=5)
        if response.status_code == 200:
            print("✅ Process Exporter is running and responding")
            return True
        else:
            print(f"❌ Process Exporter returned status code {response.status_code}")
            return False
    except requests.exceptions.RequestException as e:
        print(f"❌ Process Exporter is not accessible: {e}")
        return False

def check_service_metrics(service_name):
    """Check if metrics for specific service are available"""
    try:
        response = requests.get("http://localhost:9256/metrics", timeout=5)
        if response.status_code == 200:
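            # NOTE: process-exporter labels its series with groupname rather than
            # process_name; adjust this pattern to whatever your exporter exposes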
            if re.search(rf'process_name="{service_name}"', response.text):
                print(f"✅ Metrics for {service_name} found")
                return True
            else:
                print(f"❌ No metrics found for {service_name}")
                return False
        else:
            print(f"❌ Process Exporter returned status code {response.status_code}")
            return False
    except requests.exceptions.RequestException as e:
        print(f"❌ Process Exporter is not accessible: {e}")
        return False

if __name__ == "__main__":
    print("=== Metrics Exporter Validation ===")

    # Test basic connectivity
    node_ok = test_node_exporter()
    process_ok = test_process_exporter()

    # Check for specific services
    services = ["xdr-dashboard", "wazuh-indexer"]
    service_status = []

    if process_ok:
        for service in services:
            service_status.append(check_service_metrics(service))

    # Print summary
    print("\n=== Test Summary ===")
    print(f"Node Exporter: {'OK' if node_ok else 'FAIL'}")
    print(f"Process Exporter: {'OK' if process_ok else 'FAIL'}")

    if process_ok:
        for i, service in enumerate(services):
            print(f"Service {service}: {'OK' if service_status[i] else 'FAIL'}")

    # Determine overall status
    if all([node_ok, process_ok] + service_status):
        print("\n✅ All tests passed!")
        sys.exit(0)
    else:
        print("\n❌ Some tests failed!")
        sys.exit(1)

Creating a Bash Health Check Script

For a lightweight alternative, create a bash script (remember to make it executable with chmod +x before running it):

#!/bin/bash
# check_metrics.sh

# Set script variables
timestamp=$(date '+%Y-%m-%d %H:%M:%S')
node_exporter_port=9100
process_exporter_port=9256
services=("xdr-dashboard" "wazuh-indexer")

# ANSI color codes
GREEN='\033[0;32m'
RED='\033[0;31m'
NC='\033[0m' # No Color

echo "=== Metrics Health Check - $timestamp ==="

# Check if ports are open
echo -n "Checking Node Exporter port ($node_exporter_port): "
if nc -z localhost $node_exporter_port; then
    echo -e "${GREEN}OK${NC}"
else
    echo -e "${RED}FAIL${NC}"
fi

echo -n "Checking Process Exporter port ($process_exporter_port): "
if nc -z localhost $process_exporter_port; then
    echo -e "${GREEN}OK${NC}"
else
    echo -e "${RED}FAIL${NC}"
fi

# Check for metrics from each service
echo -e "\nChecking service metrics:"
for service in "${services[@]}"; do
    echo -n "  $service: "
    if curl -s "http://localhost:$process_exporter_port/metrics" | grep -q "process_name=\"$service\""; then
        echo -e "${GREEN}Found${NC}"
    else
        echo -e "${RED}Not found${NC}"
    fi
done

# Check system services
echo -e "\nChecking exporter services:"
echo -n "  node_exporter: "
if systemctl is-active --quiet node_exporter; then
    echo -e "${GREEN}Running${NC}"
else
    echo -e "${RED}Not running${NC}"
fi

echo -n "  process-exporter: "
if systemctl is-active --quiet process-exporter; then
    echo -e "${GREEN}Running${NC}"
else
    echo -e "${RED}Not running${NC}"
fi

# Add information about Prometheus scraping
echo -e "\nPrometheus targets can be checked at: http://localhost:9090/targets"

Troubleshooting Common Issues

If your metrics exporters aren’t working as expected, try these troubleshooting steps:

Network and Connectivity Issues

# Check if exporters are running and listening on expected ports
sudo netstat -tulpn | grep -E '9100|9256'

# Check firewall rules
sudo ufw status
# or
sudo iptables -L

Service Problems

# Check service logs
sudo journalctl -u node_exporter -n 50
sudo journalctl -u process-exporter -n 50

Configuration Validation

# Verify Prometheus configuration
sudo cat /etc/prometheus/prometheus.yml

# Check if Prometheus is scraping the targets
curl http://localhost:9090/api/v1/targets | jq .
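
If jq isn’t installed, a short Python sketch can summarise the same targets API response (assuming Prometheus listens on the default port 9090):

#!/usr/bin/env python3
# list_targets.py - print scrape target health from the Prometheus targets API

import requests

response = requests.get("http://localhost:9090/api/v1/targets", timeout=5)
response.raise_for_status()

for target in response.json()["data"]["activeTargets"]:
    job = target["labels"].get("job", "?")
    print(f"{job:<12} {target['scrapeUrl']:<45} {target['health']}")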

Permissions and Access

# Check if the exporter user has proper permissions
sudo -u prometheus /usr/local/bin/node_exporter --version

# Verify directory permissions
ls -la /var/lib/prometheus/

Process Matching Issues

If Process Exporter isn’t finding your services, check your regex patterns:

# Test your regex pattern against running processes
ps aux | grep -E "xdr|wazuh"

# Adjust your process-exporter config.yml if needed
sudo nano /etc/process-exporter/config.yml

Integrating with Prometheus

To complete your monitoring setup, integrate your exporters with Prometheus:

# Add to prometheus.yml
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]

  - job_name: "process"
    static_configs:
      - targets: ["localhost:9256"]

Creating Alerts for Critical Services

These example rules use generic metric names (process_up, process_cpu_seconds_total, process_resident_memory_bytes) with a name label; substitute whatever your exporters actually expose. With process-exporter, for instance, the equivalents are the namedprocess_namegroup_* series labelled by groupname, and a service-down condition can be written as namedprocess_namegroup_num_procs == 0.

# prometheus-alerts.yml
groups:
  - name: service_alerts
    rules:
      - alert: ServiceDown
        expr: process_up{name=~"xdr-dashboard|wazuh-indexer"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.name }} is down"
          description: "The service {{ $labels.name }} has been down for more than 1 minute."

      - alert: HighCPUUsage
        expr: rate(process_cpu_seconds_total{name=~"xdr-dashboard|wazuh-indexer"}[5m]) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage for {{ $labels.name }}"
          description: "{{ $labels.name }} is using more than 80% CPU for over 5 minutes."
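
Before relying on these rules, it can help to evaluate the expressions directly against Prometheus’s query API and confirm they return the series you expect. A small sketch, again assuming the default port:

#!/usr/bin/env python3
# check_alert_expr.py - evaluate an alert expression against the Prometheus query API

import requests

# Expression taken from the ServiceDown rule above; adjust it to your metric names
EXPR = 'process_up{name=~"xdr-dashboard|wazuh-indexer"} == 0'

response = requests.get(
    "http://localhost:9090/api/v1/query", params={"query": EXPR}, timeout=5
)
response.raise_for_status()

results = response.json()["data"]["result"]
if results:
    for series in results:
        print(f"Alert condition currently true for: {series['metric']}")
else:
    print("Expression returned no series: either everything is up, "
          "or the metric/label names need adjusting for your exporters.")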

Visualizing Metrics with Grafana

Create a comprehensive dashboard to visualize your metrics. The PromQL expressions below use the same generic metric names as the alert rules above, so adjust them to whatever your exporters actually expose:

# Create a dashboard JSON file
cat << EOF > service-metrics-dashboard.json
{
  "dashboard": {
    "title": "Service Performance Dashboard",
    "panels": [
      {
        "title": "Service Status",
        "type": "stat",
        "targets": [
          {
            "expr": "process_up{name=~\"xdr-dashboard|wazuh-indexer\"}"
          }
        ],
        "mappings": [
          {
            "type": "value",
            "options": {
              "0": {
                "text": "Down",
                "color": "red"
              },
              "1": {
                "text": "Up",
                "color": "green"
              }
            }
          }
        ]
      },
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(process_cpu_seconds_total{name=~\"xdr-dashboard|wazuh-indexer\"}[5m]) * 100",
            "legendFormat": "{{name}}"
          }
        ],
        "yaxes": [
          {
            "format": "percent"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "process_resident_memory_bytes{name=~\"xdr-dashboard|wazuh-indexer\"} / 1024 / 1024",
            "legendFormat": "{{name}} (MB)"
          }
        ],
        "yaxes": [
          {
            "format": "decmbytes"
          }
        ]
      }
    ]
  }
}
EOF

# Import the dashboard through Grafana's API
curl -X POST -H "Content-Type: application/json" \
  -d @service-metrics-dashboard.json \
  http://admin:password@localhost:3000/api/dashboards/db
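
Embedding admin credentials in the URL is fine for a quick test, but for anything scripted an API token is preferable. A rough Python equivalent of the import above (the token value is a placeholder you would create in Grafana):

#!/usr/bin/env python3
# import_dashboard.py - import the dashboard JSON via the Grafana HTTP API

import json

import requests

GRAFANA_URL = "http://localhost:3000"
API_TOKEN = "REPLACE_WITH_A_GRAFANA_API_TOKEN"  # a token with dashboard write permissions

with open("service-metrics-dashboard.json") as f:
    payload = json.load(f)

payload["overwrite"] = True  # replace the dashboard if it already exists

response = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=payload,
    timeout=10,
)
response.raise_for_status()
print(response.json())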

Best Practices for Production Use

Security Considerations

  1. Use TLS for metrics endpoints:

    # For Node Exporter
    /usr/local/bin/node_exporter --web.config.file=/etc/node_exporter/web-config.yml
    
    # TLS config file example
    cat << EOF > /etc/node_exporter/web-config.yml
    tls_server_config:
      cert_file: /etc/node_exporter/cert.pem
      key_file: /etc/node_exporter/key.pem
    EOF
    
  2. Implement authentication (the password must be supplied as a bcrypt hash; see the hash-generation sketch after this list):

    cat << EOF > /etc/node_exporter/web-config.yml
    basic_auth_users:
      prometheus: $HASHED_PASSWORD
    EOF
    
  3. Use least privilege users:

    sudo useradd -rs /bin/false prometheus
    sudo chown prometheus:prometheus /usr/local/bin/node_exporter
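
As noted in item 2 above, the basic_auth_users value must be a bcrypt hash of the password, not the plain text. A minimal way to generate one in Python (assuming the bcrypt package is installed):

#!/usr/bin/env python3
# hash_password.py - generate a bcrypt hash for the basic_auth_users entry in web-config.yml

import getpass

import bcrypt

password = getpass.getpass("Password to hash: ").encode()
print(bcrypt.hashpw(password, bcrypt.gensalt()).decode())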
    

Scaling Considerations

  1. Use service discovery for dynamic environments
  2. Implement federation for large-scale deployments
  3. Set appropriate scrape intervals based on resource consumption

Maintenance Guidelines

  1. Regular updates:

    # Set up a cronjob to check for updates
    0 2 * * 0 /usr/local/bin/check_exporter_updates.sh >> /var/log/exporter_updates.log 2>&1
    
  2. Backup configurations:

    # Backup all exporter configs
    tar -czf /var/backups/exporters-$(date +%Y%m%d).tar.gz /etc/process-exporter/ /etc/node_exporter/
    
  3. Monitor the monitors:

    # Create alerts for your monitoring system
    - alert: PrometheusNotIngestingMetrics
      expr: rate(prometheus_tsdb_head_samples_appended_total[5m]) <= 0
      for: 10m
    

Conclusion

Implementing and validating metrics exporters is essential for maintaining visibility into your system’s health and performance. By following this guide, you can set up robust monitoring for important services like xdr-dashboard and wazuh-indexer, ensuring you’re promptly alerted to any issues.

The combination of standard exporters like Node Exporter and Process Exporter with custom Python-based solutions provides comprehensive coverage for both system-level and service-specific metrics. Regular validation and testing ensure your monitoring system itself remains reliable.

Remember that effective monitoring is not a set-and-forget solution—it requires ongoing maintenance, validation, and refinement to adapt to changing infrastructure and service needs.

Resources