
Integrating Wazuh with Fluentd for Unified Logging and Big Data Analytics#

Introduction#

In today’s data-driven landscape, organizations generate logs from countless sources at an unprecedented rate. Managing this deluge of data across diverse systems and applications without a centralized approach is like trying to drink from a fire hose. This is where the powerful combination of Wazuh and Fluentd comes into play.

By integrating Wazuh with Fluentd, organizations can:

  • 📊 Unify log collection from diverse security sources
  • 🚀 Scale log processing for big data workloads
  • 🔄 Stream security events to multiple destinations
  • 💾 Build data lakes for advanced analytics
  • 🤖 Enable ML workflows on security data

Understanding the Architecture#

How Wazuh Fluentd Forwarder Works#

flowchart LR
    subgraph "Wazuh Server"
        A1[Alerts JSON] --> S1[Socket<br/>fluent.sock]
        S1 --> FF[Fluentd<br/>Forwarder]
    end
    subgraph "Fluentd Server"
        FF -->|TCP/24224| FS[Fluentd<br/>Receiver]
        FS --> P1[Parser]
        P1 --> R1[Router]
    end
    subgraph "Data Destinations"
        R1 --> H1[Hadoop HDFS]
        R1 --> E1[Elasticsearch]
        R1 --> S3[AWS S3]
        R1 --> K1[Kafka]
    end
    style FF fill:#51cf66
    style FS fill:#4dabf7
    style H1 fill:#ffd43b

Why Fluentd?#

Fluentd provides several advantages for log management:

  1. Unified Logging Layer: Collects logs from 500+ data sources
  2. Flexible Routing: Route logs based on tags and patterns (see the sketch after this list)
  3. Buffering: Handles network failures gracefully
  4. Plugin Ecosystem: Extensive output plugins for various destinations
  5. Performance: High-throughput, low-footprint event processing that scales horizontally
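
To make the routing point concrete, here is a minimal sketch using only built-in output plugins (the /var/log/fluent/wazuh-alerts path is just an example): events tagged wazuh.* are written to file, everything else falls through to stdout.

<match wazuh.**>
  @type file
  path /var/log/fluent/wazuh-alerts
</match>
<match **>
  @type stdout
</match>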

Infrastructure Setup#

For this implementation, we’ll need:

  • Wazuh Server: OVA 4.7.3 with all core components
  • Ubuntu 22.04: Hosting Fluentd and Hadoop
  • Network: Connectivity between Wazuh and Fluentd servers

Implementation Guide#

Phase 1: Configure Wazuh Fluentd Forwarder#

Enable Fluentd Forwarding#

Edit /var/ossec/etc/ossec.conf on the Wazuh server:

<ossec_config>
  <!-- Define UDP socket for Fluentd -->
  <socket>
    <name>fluent_socket</name>
    <location>/var/run/fluent.sock</location>
    <mode>udp</mode>
  </socket>

  <!-- Configure alerts as input -->
  <localfile>
    <log_format>json</log_format>
    <location>/var/ossec/logs/alerts/alerts.json</location>
    <target>fluent_socket</target>
  </localfile>

  <!-- Fluentd forwarder configuration -->
  <fluent-forward>
    <enabled>yes</enabled>
    <tag>wazuh</tag>
    <socket_path>/var/run/fluent.sock</socket_path>
    <address>FLUENTD_SERVER_IP</address>
    <port>24224</port>
  </fluent-forward>
</ossec_config>

Restart Wazuh manager:

Terminal window
systemctl restart wazuh-manager

Secure Mode Configuration (Optional)#

For production environments, enable TLS:

<fluent-forward>
  <enabled>yes</enabled>
  <tag>wazuh.secure</tag>
  <socket_path>/var/run/fluent.sock</socket_path>
  <address>FLUENTD_SERVER_IP</address>
  <port>24224</port>
  <shared_key>YOUR_SHARED_KEY</shared_key>
  <ca_cert>/path/to/ca.pem</ca_cert>
  <user_cert>/path/to/cert.pem</user_cert>
  <user_key>/path/to/key.pem</user_key>
</fluent-forward>

Phase 2: Install and Configure Fluentd#

Installation Options#

Fluentd offers multiple installation methods:

Terminal window
# Option 1: fluent-package (Recommended)
curl -fsSL https://toolbelt.treasuredata.com/sh/install-ubuntu-jammy-fluent-package5-lts.sh | sh
# Option 2: td-agent
curl -L https://toolbelt.treasuredata.com/sh/install-ubuntu-jammy-td-agent4.sh | sh
# Option 3: calyptia-fluentd
curl -sSL https://toolbelt.treasuredata.com/sh/install-ubuntu-jammy-calyptia-fluentd.sh | sh

Configure Fluentd#

Edit /etc/fluent/fluentd.conf:

# Input from Wazuh
<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>

# Match Wazuh events
<match wazuh>
  @type copy
  # Store to Hadoop HDFS
  <store>
    @type webhdfs
    host localhost
    port 9870
    append yes
    path "/Wazuh/%Y%m%d/alerts.json"
    <buffer>
      flush_mode immediate
    </buffer>
    <format>
      @type json
    </format>
  </store>
  # Also output to stdout for debugging
  <store>
    @type stdout
  </store>
</match>

Start Fluentd service:

Terminal window
systemctl start fluentd
systemctl enable fluentd
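
Before wiring Wazuh in, you can push a hand-crafted event into the forward input to confirm Fluentd is listening. The fluent-cat utility ships with Fluentd (its exact location on PATH can vary by package); by default it sends to localhost:24224, matching the source above.

Terminal window
# Send a test record tagged "wazuh" to the local forward input
echo '{"message":"fluentd smoke test"}' | fluent-cat wazuh
# The record should show up in the stdout store (check the fluentd service log)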

Phase 3: Deploy Hadoop as Data Lake#

Install Prerequisites#

Terminal window
# Update system
apt update && apt upgrade -y
# Install Java and SSH
apt install openssh-server openssh-client openjdk-11-jdk -y

Create Hadoop User#

Terminal window
# Create dedicated user
adduser hdoop
usermod -aG sudo hdoop
# Switch to hdoop user
su hdoop
# Setup passwordless SSH
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
chmod 700 ~/.ssh

Install Hadoop#

Terminal window
# Download and install
sudo wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
sudo tar xzf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /usr/local/hadoop
sudo chown -R hdoop:hdoop /usr/local/hadoop
# Set Java home
echo 'export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")' | \
sudo tee -a /usr/local/hadoop/etc/hadoop/hadoop-env.sh
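
It is also convenient to export the Hadoop paths for the hdoop user so the hadoop/hdfs commands and start scripts can be run without full paths. A minimal sketch for ~/.bashrc, assuming the /usr/local/hadoop install location above:

Terminal window
# Append Hadoop environment variables for the hdoop user
cat >> ~/.bashrc << 'EOF'
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
EOF
source ~/.bashrc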

Configure Hadoop#

Edit /usr/local/hadoop/etc/hadoop/core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Enable HDFS append operations in /usr/local/hadoop/etc/hadoop/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.support.append</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.support.broken.append</name>
    <value>true</value>
  </property>
</configuration>

Initialize and Start Hadoop#

Terminal window
# Format namenode
/usr/local/hadoop/bin/hdfs namenode -format
# Start services
/usr/local/hadoop/sbin/start-dfs.sh
/usr/local/hadoop/sbin/start-yarn.sh
# Create Wazuh directory
/usr/local/hadoop/bin/hadoop fs -mkdir /Wazuh
/usr/local/hadoop/bin/hadoop fs -chmod -R 777 /Wazuh

Access the Hadoop web interfaces (Hadoop 3.x defaults):

  • NameNode UI: http://localhost:9870
  • YARN ResourceManager UI: http://localhost:8088

Advanced Fluentd Configurations#

Multi-Destination Routing#

<match wazuh.**>
  @type copy

  # Hadoop for long-term storage
  <store>
    @type webhdfs
    host hadoop-server
    port 9870
    path "/security/wazuh/%Y/%m/%d/#{Socket.gethostname}.json"
    <buffer time>
      timekey 1h
      timekey_wait 10m
    </buffer>
  </store>

  # Elasticsearch for real-time search
  <store>
    @type elasticsearch
    host elasticsearch-server
    port 9200
    index_name wazuh-alerts-%Y.%m
    type_name _doc
  </store>

  # S3 for backup
  <store>
    @type s3
    aws_key_id YOUR_AWS_KEY
    aws_sec_key YOUR_AWS_SECRET
    s3_bucket wazuh-backup
    s3_region us-east-1
    path logs/wazuh/%Y/%m/%d/
    <buffer time>
      timekey 3600
      timekey_wait 10m
    </buffer>
  </store>
</match>

Event Filtering and Transformation#

# Filter high-severity alerts
<filter wazuh.**>
  @type grep
  <regexp>
    key rule.level
    pattern /^(9|10|11|12|13|14|15)$/
  </regexp>
</filter>

# Add metadata
<filter wazuh.**>
  @type record_transformer
  <record>
    fluentd_hostname "#{Socket.gethostname}"
    fluentd_timestamp ${time}
    environment "production"
  </record>
</filter>

# Parse and enrich
<filter wazuh.**>
  @type parser
  key_name full_log
  <parse>
    @type regexp
    expression /^(?<time>[^ ]+) (?<host>[^ ]+) (?<process>[^:]+): (?<message>.*)$/
  </parse>
</filter>

Performance Optimization#

# Optimize buffer settings
<match wazuh.**>
  @type webhdfs
  host hadoop-server
  port 9870
  path "/wazuh/optimized/%Y%m%d/alerts.json"
  <buffer time,tag>
    @type file
    path /var/log/fluent/buffer/wazuh
    # Flush every 5 minutes
    timekey 300
    timekey_wait 60
    chunk_limit_size 256m
    total_limit_size 2g
    overflow_action drop_oldest_chunk
    compress gzip
    # Retry settings (buffer-level parameters in Fluentd v1)
    retry_forever false
    retry_max_times 3
    retry_wait 10s
    retry_exponential_backoff_base 2
  </buffer>
  <format>
    @type json
  </format>
</match>

Testing and Validation#

Verify Integration#

  1. Check Fluentd Connection:
Terminal window
# On Wazuh server
tail -f /var/ossec/logs/ossec.log | grep fluent
# On Fluentd server
tail -f /var/log/fluent/fluentd.log
  2. Generate Test Alerts:
Terminal window
# Trigger authentication failure
ssh invalid@localhost
# Check if alert reached Hadoop
/usr/local/hadoop/bin/hadoop fs -tail /Wazuh/$(date +%Y%m%d)/alerts.json
  3. Monitor Through Web UI:
    • Navigate to the Hadoop NameNode UI
    • Browse to the /Wazuh directory
    • Verify alert files are being created

Troubleshooting Common Issues#

Issue 1: Connection Refused#

Terminal window
# Check Fluentd is listening
netstat -tuln | grep 24224
# Test connectivity
telnet FLUENTD_IP 24224
# Check firewall
ufw status

Issue 2: No Data in Hadoop#

Terminal window
# Verify HDFS permissions
/usr/local/hadoop/bin/hadoop fs -ls -la /Wazuh
# Check Fluentd logs
grep ERROR /var/log/fluent/fluentd.log
# Test WebHDFS
curl -i "http://localhost:9870/webhdfs/v1/Wazuh?op=LISTSTATUS"

Use Cases and Analytics#

1. Security Data Lake#

# PySpark analysis example
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("WazuhSecurityAnalytics") \
    .getOrCreate()

# Read Wazuh alerts from HDFS
alerts_df = spark.read.json("hdfs://localhost:9000/Wazuh/*/alerts.json")

# Analyze top threats
threats = alerts_df.groupBy("rule.description") \
    .count() \
    .orderBy("count", ascending=False) \
    .limit(10)
threats.show()
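
If the aggregated view will be queried repeatedly, it can optionally be persisted back to HDFS in a columnar format; a small sketch (the /Wazuh/analytics path is just an example location):

# Persist the top-threats view as Parquet for faster repeated queries
threats.write.mode("overwrite").parquet("hdfs://localhost:9000/Wazuh/analytics/top_threats")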

2. Machine Learning Pipeline#

# Anomaly detection on authentication patterns
from pyspark.sql.functions import array_contains, col, hour
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.clustering import KMeans

# Feature engineering: keep authentication alerts and derive numeric features
auth_df = alerts_df.filter(array_contains(col("rule.groups"), "authentication")) \
    .select(hour(col("timestamp")).alias("hour"),
            col("agent.id").alias("agent_id"),
            col("data.srcip").alias("srcip"))

# Encode the agent ID and assemble the feature vector
indexed = StringIndexer(inputCol="agent_id", outputCol="agent_id_encoded") \
    .fit(auth_df).transform(auth_df)
assembled_data = VectorAssembler(inputCols=["hour", "agent_id_encoded"],
                                 outputCol="features").transform(indexed)

# Detect anomalous authentication patterns with k-means clustering
kmeans = KMeans(k=5, seed=1)
model = kmeans.fit(assembled_data)
predictions = model.transform(assembled_data)

3. Compliance Reporting#

-- Hive external table over the raw JSON alerts (uses the hive-hcatalog JSON SerDe)
CREATE EXTERNAL TABLE wazuh_alerts (
  `timestamp` STRING,
  rule STRUCT<
    level: INT,
    description: STRING,
    pci_dss: ARRAY<STRING>,
    gdpr: ARRAY<STRING>
  >,
  agent STRUCT<
    id: STRING,
    name: STRING
  >
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/Wazuh/';

-- PCI DSS compliance query
SELECT
  rule.pci_dss[0] AS requirement,
  COUNT(*) AS violation_count,
  collect_set(agent.name) AS affected_systems
FROM wazuh_alerts
WHERE size(rule.pci_dss) > 0
GROUP BY rule.pci_dss[0]
ORDER BY violation_count DESC;
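
To run this against the HDFS data, the statements can be saved to a file and executed through Hive's CLI; a minimal example, assuming Hive is deployed alongside Hadoop with HiveServer2 on its default port and the query saved as compliance.sql:

Terminal window
# Execute the compliance report via beeline (HiveServer2 default port assumed)
beeline -u jdbc:hive2://localhost:10000 -f compliance.sql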

Best Practices#

1. Buffer Management#

# Prevent data loss during outages
<match wazuh.**>
  @type forward
  <server>
    name primary
    host primary-fluentd
    port 24224
  </server>
  <server>
    name secondary
    host secondary-fluentd
    port 24224
    standby
  </server>
  <buffer>
    @type file
    path /var/log/fluent/buffer/
    flush_mode interval
    flush_interval 10s
    retry_type exponential_backoff
    retry_forever true
  </buffer>
</match>

2. Security Hardening#

# Enable TLS and authentication
<source>
  @type forward
  port 24224
  bind 0.0.0.0
  <transport tls>
    cert_path /etc/fluent/certs/server.crt
    private_key_path /etc/fluent/certs/server.key
    client_cert_auth true
    ca_path /etc/fluent/certs/ca.crt
  </transport>
  <security>
    self_hostname fluentd-server
    shared_key YOUR_SHARED_KEY
  </security>
</source>

3. Monitoring and Alerting#

# Monitor Fluentd health
<source>
  @type monitor_agent
  bind 127.0.0.1
  port 24220
</source>

# Prometheus metrics
<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>

# Alert on errors in Fluentd's own logs
# (filters are top-level directives, so the grep runs before the mail output)
<filter fluent.**>
  @type grep
  <regexp>
    key message
    pattern /error|Error|ERROR/
  </regexp>
</filter>
<match fluent.**>
  @type mail
  host smtp.example.com
  port 587
  from fluentd@example.com
  to ops@example.com
  subject "Fluentd Error Alert"
</match>
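
With the monitoring inputs above in place, both endpoints can be checked from the shell:

Terminal window
# Plugin and buffer metrics from monitor_agent (JSON)
curl -s http://127.0.0.1:24220/api/plugins.json
# Prometheus-format metrics
curl -s http://localhost:24231/metrics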

Integration with Other Systems#

Kafka Integration#

<match wazuh.**>
  @type kafka2
  brokers kafka1:9092,kafka2:9092,kafka3:9092
  default_topic wazuh-alerts
  <format>
    @type json
  </format>
  <buffer>
    @type file
    path /var/log/fluent/buffer/kafka
    flush_interval 3s
  </buffer>
  # Partition by agent ID
  partition_key_key agent.id
  max_send_retries 3
  required_acks -1
</match>
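
To confirm events are landing on the topic, a quick check with the standard Kafka console consumer works (script path and broker address depend on your Kafka install):

Terminal window
# Read Wazuh alerts from the topic to verify delivery
kafka-console-consumer.sh --bootstrap-server kafka1:9092 --topic wazuh-alerts --from-beginning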

Splunk Integration#

<match wazuh.**>
  @type splunk_hec
  hec_host splunk.example.com
  hec_port 8088
  hec_token YOUR_HEC_TOKEN
  source wazuh
  sourcetype _json
  <format>
    @type json
  </format>
</match>

Conclusion#

Integrating Wazuh with Fluentd opens up a world of possibilities for security data management and analytics. This setup provides:

  • Unified logging pipeline for all security events
  • 📊 Scalable data storage with Hadoop HDFS
  • 🔄 Flexible routing to multiple destinations
  • 🤖 Big data analytics capabilities
  • 📈 Machine learning readiness

By leveraging this integration, organizations can build sophisticated security analytics platforms that scale with their needs.

Key Takeaways#

  1. Start Simple: Begin with basic forwarding, then add complexity
  2. Buffer Wisely: Configure buffers to prevent data loss
  3. Monitor Everything: Track Fluentd performance and errors
  4. Plan Storage: Design your HDFS structure for efficient querying
  5. Secure Transport: Always use TLS in production


Unite your logs, amplify your security insights! 🚀
