Building ML-Powered Threat Intelligence with Honeypot Datasets: From Raw Data to Production Models

Introduction

Picture this: you’re staring at security logs with thousands of events streaming in daily. Which ones are actually dangerous? Which can you safely ignore? Traditional signature-based detection is like playing whack-a-mole with cybercriminals — they’ve gotten really good at dodging known signatures faster than we can create them.

Enter machine learning — your new cybersecurity superpower! Imagine having a system that learns attacker behavior patterns and predicts new threats before they even hit signature databases. Sounds too good to be true? Well, it’s not!

Honeypot data is the secret sauce that makes this magic happen. Unlike those sterile academic datasets gathering dust, honeypots capture real attackers in their natural habitat — like having a hidden camera in the cybercriminal underworld. This authentic data gives us unprecedented insights into how bad actors actually operate.

In this comprehensive guide, I’ll take you on a journey from raw honeypot data to a working threat detection system that would make any SOC analyst jealous. Ready to turn chaos into clarity and transform your threat detection game? Let’s dive in!

Part 1: Understanding Honeypot Data

What’s Hidden in Your Honeypot Data?

Before we start cooking up some ML magic, let’s peek behind the curtain and see what treasures our honeypot traps actually capture. Think of honeypots as security cameras recording cybercriminals in action. Here’s what our “footage” reveals:

Network Flow Data

Raw network connections contain fundamental information about attack patterns:

Source/Destination IPs and Ports: Geographic and service targeting patterns
Protocol Information: TCP/UDP usage, application layer protocols
Flow Statistics: Packet counts, byte volumes, session duration
Timing Data: Connection timestamps, session intervals

Example Network Flow:

{
  "timestamp": "2025-09-24T14:30:15Z",
  "src_ip": "203.0.113.5",
  "src_port": 54321,
  "dst_ip": "10.0.0.100",
  "dst_port": 22,
  "protocol": "TCP",
  "bytes_sent": 1024,
  "bytes_received": 256,
  "session_duration": 45.2,
  "packet_count": 12
}
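Flow records like this usually arrive as one JSON object per line. A minimal sketch of loading them, assuming the field names from the example above; `FlowRecord` and `parse_flow_lines` are illustrative names, not part of any honeypot tool:

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FlowRecord:
    timestamp: datetime
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    bytes_sent: int
    bytes_received: int
    session_duration: float
    packet_count: int


def parse_flow_lines(lines):
    """Parse JSON-lines flow logs, skipping lines that are malformed."""
    records = []
    for line in lines:
        try:
            raw = json.loads(line)
            # fromisoformat() only accepts a trailing "Z" from Python 3.11,
            # so normalize it to an explicit UTC offset first
            raw["timestamp"] = datetime.fromisoformat(
                raw["timestamp"].replace("Z", "+00:00")
            )
            records.append(FlowRecord(**raw))
        except (ValueError, KeyError, TypeError, AttributeError):
            continue  # bad JSON, missing/extra field, or wrong value type
    return records


sample = [
    '{"timestamp": "2025-09-24T14:30:15Z", "src_ip": "203.0.113.5", '
    '"src_port": 54321, "dst_ip": "10.0.0.100", "dst_port": 22, '
    '"protocol": "TCP", "bytes_sent": 1024, "bytes_received": 256, '
    '"session_duration": 45.2, "packet_count": 12}',
    "not json at all",
]
flows = parse_flow_lines(sample)
print(len(flows), flows[0].dst_port)
```

Skipping bad lines silently is a choice made here for brevity; in practice you would probably count or log them so a broken sensor does not go unnoticed.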

Application-Layer Events

Higher-level application interactions provide behavioral insights:

Login Attempts: Credential stuffing, brute force patterns
Command Execution: Shell commands, malware deployment
File Operations: Upload/download activities, data exfiltration attempts
Protocol-Specific Actions: HTTP requests, SSH sessions, database queries

Example SSH Attack:

{
  "event_type": "ssh_login_attempt",
  "username": "admin",
  "password": "password123",
  "success": false,
  "command_attempts": ["whoami", "cat /etc/passwd"],
  "session_duration": 12.5
}
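Events like this become useful once they are grouped by attacker. The sketch below counts failed logins and distinct usernames per source. It assumes each event also carries a `src_ip` field (the example above does not show one), and `summarize_login_attempts` is a hypothetical helper name:

```python
from collections import defaultdict


def summarize_login_attempts(events):
    """Count failed logins and distinct usernames per source IP."""
    summary = defaultdict(lambda: {"failed": 0, "usernames": set()})
    for event in events:
        if event.get("event_type") != "ssh_login_attempt":
            continue
        entry = summary[event["src_ip"]]
        entry["usernames"].add(event["username"])
        if not event["success"]:
            entry["failed"] += 1
    return dict(summary)


# Made-up events for illustration only
events = [
    {"event_type": "ssh_login_attempt", "src_ip": "203.0.113.5",
     "username": "admin", "success": False},
    {"event_type": "ssh_login_attempt", "src_ip": "203.0.113.5",
     "username": "root", "success": False},
    {"event_type": "ssh_login_attempt", "src_ip": "198.51.100.7",
     "username": "admin", "success": True},
    {"event_type": "http_request", "src_ip": "198.51.100.7"},
]
summary = summarize_login_attempts(events)
for ip, stats in summary.items():
    print(ip, stats["failed"], sorted(stats["usernames"]))
```

Many failures across many usernames from one source is the classic brute-force signature; many sources each trying one or two credentials looks more like credential stuffing.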

Enriched Metadata

Additional context enhances the raw data:

Geolocation: Country, region, ASN information
Threat Intelligence: IP reputation, known malware signatures
Behavioral Patterns: Session clustering, attack campaign attribution

Example Enriched Event:

{
  "src_ip": "203.0.113.5",
  "country": "CN",
  "region": "Beijing",
  "asn": "AS4134",
  "isp": "China Telecom",
  "threat_score": 8.5,
  "known_malicious": true,
  "attack_campaign": "SSH_Brute_Force_2025_Q3"
}
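Enrichment is essentially a lookup keyed on the source address. Here is a rough sketch using only the standard library. The `PREFIX_INFO` table is a hypothetical stand-in with made-up entries on documentation address ranges; a real deployment would query a GeoIP or ASN database instead:

```python
import ipaddress

# Hypothetical prefix table with made-up values, for illustration only
PREFIX_INFO = {
    ipaddress.ip_network("203.0.113.0/24"): {"country": "CN", "asn": "AS4134"},
    ipaddress.ip_network("198.51.100.0/24"): {"country": "US", "asn": "AS64500"},
}


def enrich_event(event, prefix_info=PREFIX_INFO):
    """Return a copy of the event with country and ASN fields attached."""
    address = ipaddress.ip_address(event["src_ip"])
    enriched = dict(event, country="Unknown", asn=None)
    for network, info in prefix_info.items():
        if address in network:
            enriched.update(info)
            break
    return enriched


known = enrich_event({"src_ip": "203.0.113.5", "dst_port": 22})
unknown = enrich_event({"src_ip": "192.0.2.10", "dst_port": 22})
print(known["country"], unknown["country"])
```

The linear scan is fine for a handful of prefixes; for full routing tables a prefix tree or a purpose-built database reader would be the usual choice.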

Types of Honeypot Data

Low-Interaction Honeypots

Characteristics: Emulate services, minimal attacker interaction
Data Quality: High volume, less detailed
Use Cases: Network scanning detection, port enumeration analysis
Examples: Honeyd, KFSensor

High-Interaction Honeypots

Characteristics: Full operating systems, deep attacker interaction
Data Quality: Lower volume, highly detailed
Use Cases: Malware analysis, attack technique research
Examples: Honeynets (networks of real systems), Cowrie in proxy mode (its default emulated shell is medium-interaction)

Hybrid Approaches

Characteristics: Combination of both approaches
Data Quality: Balanced volume and detail
Use Cases: Comprehensive threat intelligence
Examples: Modern honeypot farms with multiple honeypot types

Part 2: Data Preprocessing Pipeline

Turning Chaos into Order: Data Cleaning

Raw honeypot data is like crude oil — full of potential, but you need to refine it first! Think of yourself as a detective sorting through evidence: some witness statements are unreliable, timestamps don’t add up, and some records are just duplicates. Here’s how we bring order to this beautiful chaos:

Step 1: Data Validation and Sanitization

import ipaddress

import pandas as pd


def validate_network_data(df):
    """
    Validate and sanitize network security data

    Args:
        df: Raw honeypot data DataFrame

    Returns:
        Cleaned DataFrame with validated fields
    """
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Map column names (handle various naming conventions)
    column_mapping = {
        'dest_port': 'dst_port',
        'destination_port': 'dst_port',
        'source_port': 'src_port',
        '@timestamp': 'timestamp',
        'time': 'timestamp'
    }

    for old_col, new_col in column_mapping.items():
        if old_col in df.columns and new_col not in df.columns:
            df[new_col] = df[old_col]

    # Convert timestamps to UTC, tolerating mixed formats and timezones
    # (format='mixed' requires pandas >= 2.0)
    if 'timestamp' in df.columns:
        df['timestamp'] = pd.to_datetime(
            df['timestamp'],
            format='mixed',
            errors='coerce',
            utc=True
        )
    else:
        raise ValueError("No timestamp column found in data")

    # Convert port columns to numeric, handling string ports
    for port_col in ['src_port', 'dst_port']:
        if port_col in df.columns:
            df[port_col] = pd.to_numeric(df[port_col], errors='coerce')

    # Validate source IP addresses (IPv4 only: IPv6 sources are dropped)
    if 'src_ip' in df.columns:
        df['src_ip'] = df['src_ip'].astype(str)

        def is_valid_ipv4(ip):
            try:
                ipaddress.IPv4Address(ip)
                return True
            except ValueError:
                return False

        df = df[df['src_ip'].apply(is_valid_ipv4)]

    # Validate port ranges (1-65535); between() is False for NaN,
    # so unparseable ports are removed as well
    for port_col in ['src_port', 'dst_port']:
        if port_col in df.columns:
            df = df[df[port_col].between(1, 65535)]

    # Remove rows with invalid timestamps: unparseable (NaT), in the
    # future, or older than roughly five years
    current_time = pd.Timestamp.now(tz='UTC')
    five_years_ago = current_time - pd.Timedelta(days=5 * 365)
    df = df[df['timestamp'].between(five_years_ago, current_time)]

    print(f"Validation complete: {len(df)} valid records remaining")

    return df
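The validation step above keeps only IPv4 sources. If your honeypots are reachable over IPv6, you may prefer to keep both families and record which one each address belongs to. A minimal sketch with the standard `ipaddress` module; `classify_ip` is a hypothetical helper name:

```python
import ipaddress


def classify_ip(value):
    """Return 'ipv4', 'ipv6', or None for anything that is not an IP address."""
    try:
        address = ipaddress.ip_address(str(value).strip())
    except ValueError:
        return None
    return f"ipv{address.version}"


for candidate in ["203.0.113.5", "2001:db8::1", "999.1.1.1", "not-an-ip"]:
    print(candidate, "->", classify_ip(candidate))
```

With a DataFrame, something like `df['ip_version'] = df['src_ip'].map(classify_ip)` followed by dropping the `None` rows would replace the IPv4-only filter.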

Step 2: Handling Missing Data and Outliers

def preprocess_security_events(df):
    """
    Handle missing data and outliers in security event dataset

    Args:
        df: DataFrame with validated network data

    Returns:
        Preprocessed DataFrame ready for feature engineering
    """
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Handle missing geolocation data.
    # Results are assigned back instead of calling fillna(inplace=True)
    # on a column, which is unreliable in pandas 2.x and stops working
    # under Copy-on-Write.
    if 'country' in df.columns:
        df['country'] = df['country'].fillna('Unknown')

    if 'asn' in df.columns:
        df['asn'] = df['asn'].fillna(0)

    # Handle missing protocol information
    if 'protocol' in df.columns:
        df['protocol'] = df['protocol'].fillna('Unknown')

    # Cap extreme outliers in numerical features (99th percentile)
    numerical_cols = {
        'bytes_sent': 'Bytes sent',
        'bytes_received': 'Bytes received',
        'session_duration': 'Session duration',
        'packet_count': 'Packet count'
    }

    for col, description in numerical_cols.items():
        if col in df.columns:
            q99 = df[col].quantile(0.99)
            outliers_count = (df[col] > q99).sum()

            if outliers_count > 0:
                print(f"Capping {outliers_count} outliers in {description} at {q99:.2f}")
                df[col] = df[col].clip(upper=q99)

            # Fill missing with 0 (assuming no data means no activity)
            df[col] = df[col].fillna(0)

    # Remove duplicate events (keep first occurrence)
    duplicate_cols = ['src_ip', 'dst_port', 'timestamp']
    if all(col in df.columns for col in duplicate_cols):
        duplicates_before = len(df)
        df = df.drop_duplicates(subset=duplicate_cols, keep='first')
        duplicates_removed = duplicates_before - len(df)

        if duplicates_removed > 0:
            print(f"Removed {duplicates_removed} duplicate events")

    # Reset index after all filtering
    df = df.reset_index(drop=True)

    print(f"Preprocessing complete: {len(df)} records ready for feature engineering")

    return df
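To see what percentile capping actually does, here is a small worked example on synthetic numbers (made up for illustration): 999 ordinary values plus one enormous transfer.

```python
import pandas as pd

# Synthetic data: values 1..999 plus a single extreme outlier
bytes_sent = pd.Series(list(range(1, 1000)) + [10_000_000])

cap = bytes_sent.quantile(0.99)
capped = bytes_sent.clip(upper=cap)
affected = int((bytes_sent > cap).sum())

print(f"cap={cap:.2f}, rows capped={affected}")
print(f"mean before={bytes_sent.mean():.1f}, after={capped.mean():.1f}")
```

Note that the cap always touches about the top 1% of rows, not only the obvious outlier, so legitimate large values get flattened as well. On very small datasets the 99th percentile sits close to the maximum and the cap barely changes anything.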

Step 3: Data Quality Metrics

def calculate_data_quality_metrics(df):
    """
    Calculate and report data quality metrics

    Args:
        df: DataFrame to analyze

    Returns:
        Dictionary with quality metrics
    """
    metrics = {
        'total_records': len(df),
        'missing_values': df.isnull().sum().to_dict(),
        'duplicate_records': int(df.duplicated().sum()),
        'timestamp_range': {
            'start': df['timestamp'].min() if 'timestamp' in df.columns else None,
            'end': df['timestamp'].max() if 'timestamp' in df.columns else None
        },
        'unique_sources': df['src_ip'].nunique() if 'src_ip' in df.columns else 0,
        'unique_targets': df['dst_port'].nunique() if 'dst_port' in df.columns else 0
    }

    # Print quality report
    print("=" * 60)
    print("DATA QUALITY REPORT")
    print("=" * 60)
    print(f"Total Records: {metrics['total_records']:,}")
    print(f"Unique Source IPs: {metrics['unique_sources']:,}")
    print(f"Unique Target Ports: {metrics['unique_targets']:,}")
    print(f"Duplicate Records: {metrics['duplicate_records']:,}")

    # The start is None without a timestamp column and NaT for an empty
    # or all-invalid column, so check for both explicitly
    start = metrics['timestamp_range']['start']
    if start is not None and pd.notna(start):
        print(f"Time Range: {start} to {metrics['timestamp_range']['end']}")

    print("\nMissing Values by Column:")
    for col, count in metrics['missing_values'].items():
        if count > 0:
            percentage = (count / len(df)) * 100
            print(f"  {col}: {count:,} ({percentage:.2f}%)")

    print("=" * 60)

    return metrics
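A report is only useful if something acts on it. One option is a simple quality gate that stops the pipeline when the data looks wrong. The sketch below is an assumption about how you might do that; the `check_quality` name, the required column list, and the 20% threshold are all illustrative choices:

```python
import pandas as pd


def check_quality(df, max_missing_ratio=0.2,
                  required=("timestamp", "src_ip", "dst_port")):
    """Return a list of human-readable problems; empty means the data passes."""
    problems = []
    for col in required:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    # isnull().mean() gives the fraction of missing values per column
    for col, ratio in df.isnull().mean().items():
        if ratio > max_missing_ratio:
            problems.append(f"{col}: {ratio:.0%} missing")
    return problems


sample = pd.DataFrame({
    "timestamp": [1, 2, 3, 4],
    "src_ip": ["203.0.113.5", None, None, None],
    "country": ["CN", "US", "DE", None],
})
problems = check_quality(sample)
print(problems)
```

Returning the problems instead of raising keeps the decision with the caller: a batch job might abort, while a dashboard might just display them.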

Part 3: Feature Engineering for Threat Detection

Now for the fun part — turning raw data into ML model “food”! Think of this as cooking a gourmet meal from raw ingredients: each feature is like a spice that adds its unique flavor to our understanding of attacks. Let’s explore which “recipes” work best:

1. Temporal Features

Time-based patterns are crucial for identifying attack campaigns and behavioral anomalies:

def create_temporal_features(df):
    """
    Create time-based features for attack pattern detection

    Args:
        df: DataFrame with timestamp and src_ip columns and a unique index

    Returns:
        DataFrame with added temporal features
    """
    if 'timestamp' not in df.columns:
        raise ValueError("Timestamp column required for temporal features")

    # Basic time components
    df['hour'] = df['timestamp'].dt.hour
    df['day_of_week'] = df['timestamp'].dt.dayofweek
    df['day_of_month'] = df['timestamp'].dt.day
    df['month'] = df['timestamp'].dt.month

    # Working hours indicator (9 AM - 5 PM on weekdays, in UTC)
    df['is_working_hours'] = (
        (df['hour'] >= 9) &
        (df['hour'] < 17) &
        (df['day_of_week'] < 5)
    ).astype(int)

    # Weekend indicator
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

    # Night time indicator (10 PM - 6 AM)
    df['is_night'] = (
        (df['hour'] >= 22) |
        (df['hour'] < 6)
    ).astype(int)

    # Time since first connection from same source
    df['first_seen'] = df.groupby('src_ip')['timestamp'].transform('min')
    df['hours_since_first_seen'] = (
        (df['timestamp'] - df['first_seen']).dt.total_seconds() / 3600
    )

    # Connection frequency features
    df['daily_connection_count'] = df.groupby([
        'src_ip',
        df['timestamp'].dt.date
    ])['src_ip'].transform('count')

    # Hourly connection rate (lowercase 'h': the 'H' alias is deprecated
    # in recent pandas releases)
    df['hourly_connection_count'] = df.groupby([
        'src_ip',
        df['timestamp'].dt.floor('h')
    ])['src_ip'].transform('count')

    # Time between connections (session gaps), computed per source in
    # timestamp order; the result is aligned back to df by index
    gaps = (
        df.sort_values(['src_ip', 'timestamp'])
        .groupby('src_ip')['timestamp']
        .diff()
        .dt.total_seconds()
    )

    # Fill NaN with 0 (first connection has no previous connection)
    df['time_since_last_connection'] = gaps.fillna(0)

    print("Created temporal features: hour, day_of_week, is_weekend, connection frequency metrics")

    return df
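A tiny worked example makes the per-source gap and hourly count features easier to picture. The four events below are made up; the column names match the ones used above:

```python
import pandas as pd

events = pd.DataFrame({
    "src_ip": ["203.0.113.5", "198.51.100.7", "203.0.113.5", "203.0.113.5"],
    "timestamp": pd.to_datetime([
        "2025-09-24T14:30:00Z", "2025-09-24T14:31:00Z",
        "2025-09-24T14:30:10Z", "2025-09-24T15:45:00Z",
    ], utc=True),
})

# Seconds since the previous event from the same source (0 for the first)
gaps = (
    events.sort_values(["src_ip", "timestamp"])
    .groupby("src_ip")["timestamp"]
    .diff()
    .dt.total_seconds()
)
events["time_since_last_connection"] = gaps.fillna(0)

# Number of events from the same source within the same clock hour
events["hourly_connection_count"] = events.groupby(
    ["src_ip", events["timestamp"].dt.floor("h")]
)["src_ip"].transform("count")

print(events)
```

The second event from 203.0.113.5 arrives 10 seconds after the first, which is the kind of tight spacing automated tools produce; the third arrives more than an hour later and falls into a different hourly bucket.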

2. Behavioral Aggregation Features

Statistical summaries reveal attacker patterns:

def create_behavioral_features(df):
    """
    Create behavioral aggregation features based on source IP patterns

    Args:
        df: DataFrame with network events (run create_temporal_features
            first: connection_rate uses hours_since_first_seen)

    Returns:
        DataFrame with behavioral features
    """
    print("Creating behavioral aggregation features...")

    def most_common(values):
        # mode() is empty when every value in the group is missing
        modes = values.mode()
        return modes.iloc[0] if not modes.empty else 'Unknown'

    # Per-source IP aggregations (named aggregation keeps each output
    # column next to the statistic that produces it)
    source_stats = df.groupby('src_ip').agg(
        unique_ports=('dst_port', 'nunique'),
        total_connections=('dst_port', 'count'),
        avg_bytes_sent=('bytes_sent', 'mean'),
        std_bytes_sent=('bytes_sent', 'std'),
        max_bytes_sent=('bytes_sent', 'max'),
        total_bytes_sent=('bytes_sent', 'sum'),
        avg_bytes_received=('bytes_received', 'mean'),
        std_bytes_received=('bytes_received', 'std'),
        max_bytes_received=('bytes_received', 'max'),
        total_bytes_received=('bytes_received', 'sum'),
        avg_duration=('session_duration', 'mean'),
        median_duration=('session_duration', 'median'),
        max_duration=('session_duration', 'max'),
        primary_protocol=('protocol', most_common),
    ).reset_index()

    # Merge back to original dataset
    df = df.merge(source_stats, on='src_ip', how='left')

    # Port scanning indicators
    df['port_diversity'] = df['unique_ports'] / df['total_connections']
    df['is_port_scanner'] = (df['unique_ports'] > 10).astype(int)

    # Connection pattern features
    df['avg_connection_size'] = df['avg_bytes_sent'] + df['avg_bytes_received']

    # Bandwidth usage patterns
    df['total_bandwidth'] = df['total_bytes_sent'] + df['total_bytes_received']
    df['bandwidth_asymmetry'] = (
        abs(df['total_bytes_sent'] - df['total_bytes_received']) /
        (df['total_bandwidth'] + 1)  # +1 to avoid division by zero
    )

    # Session behavior patterns
    df['connection_rate'] = df['total_connections'] / (df['hours_since_first_seen'] + 1)

    # Consistency metric: std relative to the mean (lower means more
    # consistent behavior); NaN for sources with a single connection
    df['bytes_consistency'] = df['std_bytes_sent'] / (df['avg_bytes_sent'] + 1)

    print(f"Created {len(source_stats.columns) - 1} per-source aggregate features")

    return df
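To get a feel for `port_diversity`, compare two made-up sources: one that touches 20 different ports once each, and one that hammers port 22 twenty times.

```python
import pandas as pd

# Synthetic events: a port scanner and an SSH brute-forcer
events = pd.DataFrame({
    "src_ip": ["198.51.100.7"] * 20 + ["203.0.113.5"] * 20,
    "dst_port": list(range(1, 21)) + [22] * 20,
})

stats = events.groupby("src_ip").agg(
    unique_ports=("dst_port", "nunique"),
    total_connections=("dst_port", "count"),
)
stats["port_diversity"] = stats["unique_ports"] / stats["total_connections"]

print(stats)
```

The scanner ends up with a diversity of 1.0 and the brute-forcer close to 0, even though both made the same number of connections. That is why the ratio carries information the raw counts do not.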

3. Geographic and Network Features

Geographic patterns help identify coordinated attacks:

def create_geographic_features(df):
    """
    Create geography-based threat intelligence features

    Args:
        df: DataFrame with geolocation data

    Returns:
        DataFrame with geographic features
    """
    print("Creating geographic and network features...")

    # Country-level threat scoring
    if 'country' in df.columns:
        country_threat_scores = (
            df.groupby('country')['src_ip'].nunique()
            .rename('unique_ips_from_country')
            .reset_index()
        )

        # Attack counts need labels; without an 'attack_type' column they
        # default to 0. Because this ratio is derived from the labels,
        # compute it on training data only before evaluating a model.
        if 'attack_type' in df.columns:
            attack_counts = (
                (df['attack_type'] != 'benign')
                .groupby(df['country'])
                .sum()
            )
            country_threat_scores['attack_count_from_country'] = (
                country_threat_scores['country'].map(attack_counts)
            )
        else:
            country_threat_scores['attack_count_from_country'] = 0

        country_threat_scores['country_threat_ratio'] = (
            country_threat_scores['attack_count_from_country'] /
            (country_threat_scores['unique_ips_from_country'] + 1)
        )

        df = df.merge(
            country_threat_scores[['country', 'country_threat_ratio', 'unique_ips_from_country']],
            on='country',
            how='left'
        )

    # ASN-based features (connections are counted as rows per ASN, so this
    # does not depend on the per-source aggregates existing yet)
    if 'asn' in df.columns:
        asn_stats = df.groupby('asn').agg(
            unique_ips_per_asn=('src_ip', 'nunique'),
            avg_bytes_per_asn=('bytes_sent', 'mean'),
            total_conn_per_asn=('src_ip', 'size'),
        ).reset_index()

        df = df.merge(asn_stats, on='asn', how='left')

    # Protocol distribution features (one-hot encoded as 0/1 integers)
    if 'protocol' in df.columns:
        protocol_dummies = pd.get_dummies(df['protocol'], prefix='protocol', dtype=int)
        df = pd.concat([df, protocol_dummies], axis=1)

    # High-risk country flag (simplified example with a hard-coded list)
    # In production, integrate with threat intelligence feeds
    known_malicious_countries = ['CN', 'RU', 'KP']  # Example
    if 'country' in df.columns:
        df['from_high_risk_country'] = df['country'].isin(known_malicious_countries).astype(int)

    print("Geographic features created successfully")

    return df
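One caveat with `country_threat_ratio`: it is computed from the labels, so calculating it over the whole dataset lets label information leak into the test set. A minimal sketch of the fit-on-train, apply-everywhere pattern follows; the two helper names and the placeholder country codes are assumptions made for illustration:

```python
import pandas as pd


def fit_country_threat_ratio(train):
    """Learn attacks per (unique source IP + 1) for each country."""
    is_attack = train["attack_type"] != "benign"
    grouped = train.assign(is_attack=is_attack).groupby("country")
    return grouped["is_attack"].sum() / (grouped["src_ip"].nunique() + 1)


def apply_country_threat_ratio(df, ratios, default=0.0):
    """Attach the learned ratio; countries unseen in training get the default."""
    out = df.copy()
    out["country_threat_ratio"] = out["country"].map(ratios).fillna(default)
    return out


# Made-up training and test rows with placeholder country codes
train = pd.DataFrame({
    "country": ["AA", "AA", "AA", "BB"],
    "src_ip": ["203.0.113.5", "203.0.113.5", "203.0.113.9", "198.51.100.7"],
    "attack_type": ["brute_force", "brute_force", "port_scan", "benign"],
})
test = pd.DataFrame({"country": ["AA", "BB", "ZZ"]})

ratios = fit_country_threat_ratio(train)
scored = apply_country_threat_ratio(test, ratios)
print(scored)
```

The same pattern applies to any feature derived from labels: learn the mapping on training data, then treat it as a fixed lookup at evaluation and inference time.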

4. Target-Based Features

Analyzing what attackers are targeting:

def create_target_features(df):
    """
    Create features based on attack targets (ports, services)

    Args:
        df: DataFrame with destination port information

    Returns:
        DataFrame with target-based features
    """
    print("Creating target-based features...")

    # Common service ports mapping
    common_ports = {
        22: 'SSH',
        23: 'Telnet',
        80: 'HTTP',
        443: 'HTTPS',
        1433: 'MSSQL',
        3306: 'MySQL',
        3389: 'RDP',
        5432: 'PostgreSQL',
        8080: 'HTTP-Alt'
    }

    if 'dst_port' in df.columns:
        df['target_service'] = df['dst_port'].map(common_ports).fillna('Other')

        # Service targeting patterns
        df['targets_ssh'] = (df['dst_port'] == 22).astype(int)
        df['targets_rdp'] = (df['dst_port'] == 3389).astype(int)
        df['targets_web'] = df['dst_port'].isin([80, 443, 8080]).astype(int)
        df['targets_database'] = df['dst_port'].isin([3306, 5432, 1433]).astype(int)

        # Well-known (0-1023), registered (1024-49151), and
        # dynamic/ephemeral (49152-65535) port ranges
        df['targets_wellknown_port'] = (df['dst_port'] < 1024).astype(int)
        df['targets_registered_port'] = ((df['dst_port'] >= 1024) & (df['dst_port'] < 49152)).astype(int)
        df['targets_ephemeral_port'] = (df['dst_port'] >= 49152).astype(int)

    print("Target features created successfully")

    return df
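The three port-range flags follow the IANA split into system, user, and dynamic ports. The same logic for a single port, as a small standalone sketch (`port_category` is a hypothetical helper name):

```python
def port_category(port):
    """Classify a TCP/UDP port number into its IANA range."""
    if not 0 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    if port < 1024:
        return "well-known"   # system ports, 0-1023
    if port < 49152:
        return "registered"   # user ports, 1024-49151
    return "dynamic"          # dynamic/private ports, 49152-65535


for port in (22, 3389, 8080, 54321):
    print(port, port_category(port))
```

Traffic aimed at dynamic ports on a honeypot is worth a second look, since no standard service normally listens there.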

Complete Feature Engineering Pipeline

def engineer_all_features(df):
    """
    Complete feature engineering pipeline

    Args:
        df: Raw preprocessed DataFrame

    Returns:
        DataFrame with all engineered features
    """
    print("\n" + "=" * 60)
    print("STARTING FEATURE ENGINEERING PIPELINE")
    print("=" * 60 + "\n")

    # Work on a copy to avoid modifying the original
    df_features = df.copy()

    # Apply feature engineering steps
    df_features = create_temporal_features(df_features)
    df_features = create_behavioral_features(df_features)
    df_features = create_geographic_features(df_features)
    df_features = create_target_features(df_features)

    print("\n" + "=" * 60)
    print("FEATURE ENGINEERING COMPLETE")
    print("=" * 60)
    print(f"Total features: {len(df_features.columns)}")
    print(f"Total records: {len(df_features)}")
    print("=" * 60 + "\n")

    return df_features

Part 4: Machine Learning Model Selection

Time to pick our weapon of choice! Just like in video games where different bosses require different strategies, threat detection needs different ML approaches. Let’s figure out which “gear” works best for your specific mission:

1. Binary Classification: Attack vs. Benign

For basic threat detection, start with binary classification:

import os

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split


def train_binary_classifier(df, save_model=True):
    """
    Train binary classifier to detect attacks vs benign traffic

    Args:
        df: DataFrame with engineered features
        save_model: Whether to save trained model to disk

    Returns:
        Trained model and performance metrics
    """
    print("\n" + "=" * 60)
    print("TRAINING BINARY CLASSIFIER (Attack vs Benign)")
    print("=" * 60 + "\n")

    # Define feature columns
    feature_cols = [
        # Temporal features
        'hour', 'day_of_week', 'is_weekend', 'is_night',
        'hours_since_first_seen', 'daily_connection_count',

        # Behavioral features
        'unique_ports', 'total_connections', 'port_diversity',
        'avg_bytes_sent', 'avg_bytes_received', 'avg_duration',
        'connection_rate', 'bandwidth_asymmetry',

        # Geographic features
        'country_threat_ratio', 'from_high_risk_country',

        # Target features
        'targets_ssh', 'targets_web', 'targets_database'
    ]

    # Filter to available columns
    available_features = [col for col in feature_cols if col in df.columns]

    print(f"Using {len(available_features)} features for training")

    # Prepare features and labels
    X = df[available_features].fillna(0)

    # Create binary label from 'attack_type' or a numeric 'label' column
    if 'attack_type' in df.columns:
        y = (df['attack_type'] != 'benign').astype(int)
    elif 'label' in df.columns:
        y = (df['label'] != 0).astype(int)
    else:
        raise ValueError("No label column found (need 'attack_type' or 'label')")

    # Check class distribution
    print("Class distribution:")
    print(f"  Benign: {(y == 0).sum()} ({(y == 0).sum() / len(y) * 100:.2f}%)")
    print(f"  Attack: {(y == 1).sum()} ({(y == 1).sum() / len(y) * 100:.2f}%)")

    # Split data (stratified to maintain class distribution).
    # Per-source aggregates are identical for every row of a source, so a
    # random row split can put the same src_ip in both sets and make the
    # test score look better than it is.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=0.2,
        random_state=42,
        stratify=y
    )

    print(f"\nTraining set: {len(X_train)} samples")
    print(f"Test set: {len(X_test)} samples")

    # Train Random Forest model
    print("\nTraining Random Forest classifier...")
    rf_model = RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        min_samples_split=5,
        min_samples_leaf=2,
        class_weight='balanced',  # Handle imbalanced data
        random_state=42,
        n_jobs=-1  # Use all CPU cores
    )

    rf_model.fit(X_train, y_train)

    # Evaluate on test set
    print("\nEvaluating model...")
    y_pred = rf_model.predict(X_test)
    y_pred_proba = rf_model.predict_proba(X_test)[:, 1]

    # Print classification report
    print("\nClassification Report:")
    print(classification_report(
        y_test, y_pred,
        target_names=['Benign', 'Attack']
    ))

    # Calculate ROC-AUC
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    print(f"ROC-AUC Score: {roc_auc:.4f}")

    # Feature importance
    feature_importance = pd.DataFrame({
        'feature': available_features,
        'importance': rf_model.feature_importances_
    }).sort_values('importance', ascending=False)

    print("\nTop 10 Most Important Features:")
    print(feature_importance.head(10).to_string(index=False))

    # Save model
    if save_model:
        # joblib.dump does not create missing directories
        os.makedirs('models', exist_ok=True)

        model_path = 'models/binary_classifier_rf.pkl'
        joblib.dump(rf_model, model_path)
        print(f"\nModel saved to: {model_path}")

        # Save feature list
        feature_path = 'models/binary_classifier_features.pkl'
        joblib.dump(available_features, feature_path)
        print(f"Feature list saved to: {feature_path}")

    return rf_model, {
        'roc_auc': roc_auc,
        'features': available_features,
        'feature_importance': feature_importance
    }
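Because many features are aggregated per source IP, every row from one source carries the same values. A random row-level split can then place the same attacker in both train and test. One way around this, sketched here on synthetic data under the assumption that `src_ip` is a sensible grouping key, is scikit-learn's `GroupShuffleSplit`:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic data: 8 hypothetical sources with 5 events each
rng = np.random.default_rng(0)
src_ips = np.repeat([f"198.51.100.{i}" for i in range(1, 9)], 5)
X = rng.normal(size=(len(src_ips), 3))
y = rng.integers(0, 2, size=len(src_ips))

# test_size is a fraction of the groups (sources), not of the rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=src_ips))

train_sources = set(src_ips[train_idx])
test_sources = set(src_ips[test_idx])
print(f"{len(train_sources)} sources in train, {len(test_sources)} in test")
print("overlap:", train_sources & test_sources)
```

The trade-off is that a group split does not stratify by class, so it is worth checking the class balance of each side before training.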

2. Multi-Class Attack Classification

For detailed threat categorization:

import os

import joblib
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier


def train_multiclass_classifier(df, save_model=True):
    """
    Train multi-class classifier to categorize attack types

    Args:
        df: DataFrame with engineered features
        save_model: Whether to save trained model

    Returns:
        Trained model, label encoder, and metrics
    """
    print("\n" + "=" * 60)
    print("TRAINING MULTI-CLASS CLASSIFIER (Attack Type Categorization)")
    print("=" * 60 + "\n")

    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Encode attack types as integers 0..n_classes-1
    le = LabelEncoder()

    if 'attack_type' in df.columns:
        df['attack_label'] = le.fit_transform(df['attack_type'])
    elif 'label' in df.columns:
        df['attack_label'] = le.fit_transform(df['label'])
    else:
        raise ValueError("No attack type column found")

    print(f"Attack types found: {len(le.classes_)}")
    for idx, attack_type in enumerate(le.classes_):
        count = (df['attack_label'] == idx).sum()
        print(f"  {attack_type}: {count} samples")

    # Feature selection
    feature_cols = [
        # Temporal
        'hour', 'day_of_week', 'is_weekend', 'is_night',
        'daily_connection_count', 'hourly_connection_count',

        # Behavioral
        'unique_ports', 'total_connections', 'port_diversity',
        'avg_bytes_sent', 'std_bytes_sent',
        'avg_bytes_received', 'std_bytes_received',
        'avg_duration', 'median_duration',
        'connection_rate', 'bytes_consistency',

        # Geographic
        'country_threat_ratio', 'from_high_risk_country',

        # Target
        'targets_ssh', 'targets_rdp', 'targets_web', 'targets_database'
    ]

    available_features = [col for col in feature_cols if col in df.columns]

    X = df[available_features].fillna(0)
    y = df['attack_label']

    # Split data (stratify needs at least two samples of every class)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=0.2,
        random_state=42,
        stratify=y
    )

    # Train XGBoost for multi-class
    print(f"\nTraining XGBoost classifier with {len(available_features)} features...")

    xgb_model = XGBClassifier(
        n_estimators=200,
        max_depth=6,
        learning_rate=0.1,
        subsample=0.8,
        colsample_bytree=0.8,
        random_state=42,
        n_jobs=-1,
        eval_metric='mlogloss'
    )

    # eval_set is used for monitoring only; there is no early stopping here
    xgb_model.fit(
        X_train, y_train,
        eval_set=[(X_test, y_test)],
        verbose=False
    )

    # Evaluate
    y_pred = xgb_model.predict(X_test)

    # Pass labels explicitly and convert class names to strings so the
    # report also works for numeric labels and for classes that happen
    # to be missing from the test split
    print("\nMulti-Class Classification Report:")
    print(classification_report(
        y_test, y_pred,
        labels=list(range(len(le.classes_))),
        target_names=[str(c) for c in le.classes_],
        zero_division=0
    ))

    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    print("\nConfusion Matrix:")
    print(cm)

    # Save model
    if save_model:
        # joblib.dump does not create missing directories
        os.makedirs('models', exist_ok=True)

        model_path = 'models/multiclass_classifier_xgb.pkl'
        joblib.dump(xgb_model, model_path)
        print(f"\nModel saved to: {model_path}")

        encoder_path = 'models/label_encoder.pkl'
        joblib.dump(le, encoder_path)
        print(f"Label encoder saved to: {encoder_path}")

    return xgb_model, le, {'confusion_matrix': cm}
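The label encoder is saved alongside the model because the classifier only ever sees integers. At prediction time you need the same encoder to turn those integers back into attack names. A short sketch with made-up attack type names:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical attack type labels, for illustration only
attack_types = ["brute_force", "port_scan", "benign", "port_scan"]

le = LabelEncoder()
encoded = le.fit_transform(attack_types)

# classes_ is sorted, so the integer codes follow alphabetical order
print(list(le.classes_))
print(list(encoded))

# Map model output (integers) back to readable names
predicted_codes = [2, 0]
decoded = le.inverse_transform(predicted_codes)
print(list(decoded))
```

If the encoder were refitted on different data, the same integer could point at a different attack type, which is why it should be stored and versioned with the model.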

3. Anomaly Detection for Zero-Day Threats#

Unsupervised learning identifies previously unseen attack patterns:

1
from sklearn.ensemble import IsolationForest
2
from sklearn.preprocessing import StandardScaler
3

4
def train_anomaly_detector(df, contamination=0.05, save_model=True):
5
    """
6
    Train anomaly detection model for zero-day threat detection
7

8
    Args:
9
        df: DataFrame with engineered features
10
        contamination: Expected proportion of anomalies (0.05 = 5%)
        save_model: Whether to save trained model

    Returns:
        Trained model, scaler, and anomaly scores
    """
    print("\n" + "=" * 60)
    print("TRAINING ANOMALY DETECTOR (Zero-Day Threat Detection)")
    print("=" * 60 + "\n")

    # Use only benign traffic for training (if available).
    # Honeypot data often contains no benign rows at all, so fall back
    # to the full dataset instead of fitting on an empty frame.
    if 'attack_type' in df.columns and (df['attack_type'] == 'benign').any():
        benign_data = df[df['attack_type'] == 'benign']
        print(f"Training on {len(benign_data)} benign samples")
    else:
        benign_data = df
        print(f"Training on {len(benign_data)} total samples (no benign labels available)")

    # Feature selection for anomaly detection
    feature_cols = [
        'unique_ports', 'total_connections', 'port_diversity',
        'avg_bytes_sent', 'avg_bytes_received',
        'avg_duration', 'connection_rate',
        'hours_since_first_seen', 'daily_connection_count'
    ]

    available_features = [col for col in feature_cols if col in df.columns]

    X_benign = benign_data[available_features].fillna(0)

    # Scale features. Isolation Forest splits on one feature at a time and is
    # largely insensitive to scale, but a saved scaler keeps preprocessing
    # consistent if the model is later swapped for a distance-based detector.
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_benign)

    print(f"Training Isolation Forest with contamination={contamination}...")

    # Train Isolation Forest
    iso_forest = IsolationForest(
        n_estimators=100,
        contamination=contamination,
        max_samples='auto',
        random_state=42,
        n_jobs=-1
    )

    iso_forest.fit(X_scaled)

    # Test on full dataset
    print("\nTesting on full dataset...")
    X_all = df[available_features].fillna(0)
    X_all_scaled = scaler.transform(X_all)

    # Get anomaly scores (lower = more anomalous)
    anomaly_scores = iso_forest.decision_function(X_all_scaled)
    predictions = iso_forest.predict(X_all_scaled)

    # Note: this adds two columns to the caller's DataFrame
    df['anomaly_score'] = anomaly_scores
    df['is_anomaly'] = (predictions == -1).astype(int)

    # Statistics
    anomaly_count = df['is_anomaly'].sum()
    anomaly_percentage = (anomaly_count / len(df)) * 100

    print(f"\nAnomalies detected: {anomaly_count} ({anomaly_percentage:.2f}%)")

    # If we have labels, evaluate performance
    if 'attack_type' in df.columns:
        actual_attacks = (df['attack_type'] != 'benign').astype(int)
        detected_anomalies = df['is_anomaly']

        from sklearn.metrics import precision_score, recall_score, f1_score

        # zero_division=0 avoids warnings when nothing is flagged or no attacks exist
        precision = precision_score(actual_attacks, detected_anomalies, zero_division=0)
        recall = recall_score(actual_attacks, detected_anomalies, zero_division=0)
        f1 = f1_score(actual_attacks, detected_anomalies, zero_division=0)

        print("\nPerformance vs Known Attacks:")
        print(f"  Precision: {precision:.4f}")
        print(f"  Recall: {recall:.4f}")
        print(f"  F1-Score: {f1:.4f}")

    # Save model
    if save_model:
        import os
        os.makedirs('models', exist_ok=True)  # joblib.dump does not create directories

        model_path = 'models/anomaly_detector_isoforest.pkl'
        joblib.dump(iso_forest, model_path)
        print(f"\nModel saved to: {model_path}")

        scaler_path = 'models/anomaly_scaler.pkl'
        joblib.dump(scaler, scaler_path)
        print(f"Scaler saved to: {scaler_path}")

    return iso_forest, scaler, df[['anomaly_score', 'is_anomaly']]
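
To make the `contamination` parameter less of a black box, here is a minimal pure-Python sketch of the idea behind it: pick a cut-off so that roughly the lowest-scoring fraction of samples is flagged. The function names `threshold_for_contamination` and `flag_anomalies` and the sample scores are hypothetical. scikit-learn derives its own cut-off internally from the training scores, so treat this as an illustration of the concept rather than the library's implementation.

```python
import math


def threshold_for_contamination(scores, contamination=0.05):
    """Return a cut-off so that about `contamination` of the samples fall at or below it.

    Lower scores mean "more anomalous", matching the convention used above.
    """
    if not 0 < contamination <= 0.5:
        raise ValueError("contamination must be in (0, 0.5]")
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    # Number of samples we are willing to flag (at least one)
    k = max(1, math.floor(len(ordered) * contamination))
    return ordered[k - 1]


def flag_anomalies(scores, threshold):
    """Return 1 for scores at or below the threshold, else 0."""
    return [1 if s <= threshold else 0 for s in scores]


# Hypothetical anomaly scores for 20 source IPs (lower = more anomalous)
example_scores = [0.12, 0.10, 0.15, 0.09, 0.11, 0.13, 0.08, 0.14, 0.10, 0.12,
                  0.11, 0.09, 0.13, 0.10, 0.12, 0.11, -0.20, 0.14, 0.10, -0.05]

cutoff = threshold_for_contamination(example_scores, contamination=0.10)
flags = flag_anomalies(example_scores, cutoff)
print(f"cut-off={cutoff}, flagged={sum(flags)} of {len(flags)}")
```

A practical consequence: with a fixed contamination rate, the detector flags roughly that share of the data it was fitted on, whether or not that many samples are truly malicious. The value is a budget for analyst attention as much as a statistical estimate.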

Part 5: Model Deployment and Production#

Real-Time Inference Pipeline#

Deploy models for real-time threat detection:

import joblib
from datetime import datetime
import numpy as np

class ThreatDetectionPipeline:
    """
    Production-ready threat detection pipeline
    """

    def __init__(self, model_path, scaler_path=None, feature_list_path=None):
        """
        Initialize pipeline with trained models

        Args:
            model_path: Path to trained model file
            scaler_path: Path to feature scaler (optional)
            feature_list_path: Path to feature list file (optional)
        """
        self.model = joblib.load(model_path)
        self.scaler = joblib.load(scaler_path) if scaler_path else None
        self.feature_list = joblib.load(feature_list_path) if feature_list_path else None

        print(f"Loaded model from: {model_path}")
        if self.scaler:
            print(f"Loaded scaler from: {scaler_path}")
        if self.feature_list:
            print(f"Using {len(self.feature_list)} features")

    def preprocess_event(self, event):
        """
        Convert raw security event to feature vector

        Args:
            event: Dictionary with security event data

        Returns:
            Feature vector ready for model prediction
        """
        # Extract temporal features (fall back to the current time if the
        # event carries no timestamp)
        if 'timestamp' in event:
            dt = datetime.fromisoformat(event['timestamp'].replace('Z', '+00:00'))
        else:
            dt = datetime.now()

        hour = dt.hour
        day_of_week = dt.weekday()
        is_weekend = 1 if day_of_week >= 5 else 0
        is_night = 1 if (hour >= 22 or hour < 6) else 0

        # Build feature dictionary
        features = {
            # Temporal
            'hour': hour,
            'day_of_week': day_of_week,
            'is_weekend': is_weekend,
            'is_night': is_night,

            # Behavioral
            'unique_ports': event.get('unique_ports', 1),
            'total_connections': event.get('connection_count', 1),
            'port_diversity': event.get('port_diversity', 0),
            'avg_bytes_sent': event.get('bytes_sent', 0),
            'avg_bytes_received': event.get('bytes_received', 0),
            'avg_duration': event.get('session_duration', 0),
            'connection_rate': event.get('connection_rate', 0),

            # Geographic
            'country_threat_ratio': event.get('country_threat_ratio', 0),
            'from_high_risk_country': event.get('from_high_risk_country', 0),

            # Target
            'targets_ssh': 1 if event.get('dst_port') == 22 else 0,
            'targets_web': 1 if event.get('dst_port') in [80, 443] else 0,
            'targets_database': 1 if event.get('dst_port') in [3306, 5432] else 0,
        }

        # If we have a specific feature list, use only those features,
        # in the order the model was trained on
        if self.feature_list:
            feature_vector = np.array(
                [features.get(f, 0) for f in self.feature_list], dtype=float
            )
        else:
            # Without a feature list, the vector has all 16 features above.
            # This only works if the model and scaler were fitted on exactly
            # these columns in this order.
            feature_vector = np.array(list(features.values()), dtype=float)

        feature_vector = feature_vector.reshape(1, -1)

        # Apply scaling if scaler is available
        if self.scaler:
            feature_vector = self.scaler.transform(feature_vector)

        return feature_vector

    def predict_threat(self, event):
        """
        Predict threat level for a security event

        Args:
            event: Security event dictionary

        Returns:
            Threat assessment dictionary
        """
        features = self.preprocess_event(event)

        # Get prediction probability
        if hasattr(self.model, 'predict_proba'):
            # Column 1 is the positive ("threat") class of a binary classifier
            threat_probability = self.model.predict_proba(features)[0][1]
        else:
            # For models without predict_proba (like Isolation Forest),
            # the result is all-or-nothing: -1 means anomaly
            prediction = self.model.predict(features)[0]
            threat_probability = 1.0 if prediction == -1 else 0.0

        # Determine risk level
        if threat_probability > 0.8:
            risk_level = 'critical'
        elif threat_probability > 0.6:
            risk_level = 'high'
        elif threat_probability > 0.4:
            risk_level = 'medium'
        else:
            risk_level = 'low'

        return {
            'threat_probability': float(threat_probability),
            # Cast to a plain bool so the result stays JSON-serializable
            'is_threat': bool(threat_probability > 0.5),
            'risk_level': risk_level,
            'timestamp': datetime.now().isoformat(),
            'src_ip': event.get('src_ip', 'unknown'),
            'dst_port': event.get('dst_port', 0)
        }

    def batch_predict(self, events):
        """
        Predict threats for multiple events

        Args:
            events: List of security event dictionaries

        Returns:
            List of threat assessments
        """
        return [self.predict_threat(event) for event in events]
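
One limitation of the fallback branch in `predict_threat`: for models without `predict_proba`, the result is always 0.0 or 1.0, so the risk level can only be `low` or `critical`. A possible refinement, sketched below, is to squash the model's decision score through a logistic function to get a graded value. This assumes a detector whose decision score is negative for anomalies, which is the convention Isolation Forest uses. The helper names and the `steepness` value are hypothetical, and the output is a ranking heuristic, not a calibrated probability.

```python
import math


def score_to_probability(decision_score, steepness=10.0):
    """Map an anomaly decision score to a pseudo-probability in (0, 1).

    Scores below 0 (anomalous) map above 0.5; scores above 0 map below 0.5.
    """
    return 1.0 / (1.0 + math.exp(steepness * decision_score))


def risk_level(threat_probability):
    """Apply the same thresholds as predict_threat."""
    if threat_probability > 0.8:
        return 'critical'
    if threat_probability > 0.6:
        return 'high'
    if threat_probability > 0.4:
        return 'medium'
    return 'low'


# Hypothetical decision scores, from clearly anomalous to clearly normal
for score in (-0.25, -0.05, 0.0, 0.20):
    p = score_to_probability(score)
    print(f"score={score:+.2f} -> p={p:.2f} ({risk_level(p)})")
```

The `steepness` parameter controls how quickly scores near zero are pushed toward the extremes. It would need tuning against labeled incidents before the resulting risk levels mean anything operationally.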

Usage Examples#

# Initialize pipeline
pipeline = ThreatDetectionPipeline(
    model_path='models/binary_classifier_rf.pkl',
    feature_list_path='models/binary_classifier_features.pkl'
)

# Single event prediction
sample_event = {
    'timestamp': '2025-09-24T14:30:00Z',
    'src_ip': '203.0.113.5',
    'dst_port': 22,
    'bytes_sent': 2048,
    'bytes_received': 512,
    'session_duration': 45.2,
    'unique_ports': 5,
    'connection_count': 15,
    'port_diversity': 0.33,
    'country_threat_ratio': 0.75,
    'from_high_risk_country': 1
}

result = pipeline.predict_threat(sample_event)
print("Threat Assessment:")
print(f"  Source IP: {result['src_ip']}")
print(f"  Threat Probability: {result['threat_probability']:.2%}")
print(f"  Risk Level: {result['risk_level']}")
print(f"  Is Threat: {result['is_threat']}")
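
Note that `sample_event` has no `connection_rate` field, so the pipeline quietly fills it with 0. Silent defaults like this can hide a broken collector. Below is a minimal sketch of one way to surface them: align the available features against the saved feature list and report which names had to be defaulted. The `align_features` helper and the three-item feature list are hypothetical, chosen only for illustration.

```python
def align_features(features, feature_list, default=0.0):
    """Order `features` by `feature_list` and report which names were missing.

    Returns (vector, missing), where `vector` follows the training-time order.
    """
    vector = []
    missing = []
    for name in feature_list:
        if name in features:
            vector.append(float(features[name]))
        else:
            vector.append(float(default))
            missing.append(name)
    return vector, missing


# Hypothetical feature list, as it might have been saved at training time
feature_list = ['unique_ports', 'total_connections', 'connection_rate']

# Features actually derived from an incoming event
event_features = {'total_connections': 15, 'unique_ports': 5}

vector, missing = align_features(event_features, feature_list)
print(f"vector:  {vector}")
print(f"missing: {missing}")
```

Logging the `missing` list per event, or counting it over time, gives an early signal when an upstream field disappears, before it shows up as degraded model accuracy.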

Part 6: Working with Hugging Face Datasets#

Dataset Arsenal: Your Threat Intelligence Toolkit#

I’ve assembled a complete collection of real honeypot datasets on Hugging Face. Each dataset tells its own “story” about how attackers behave in the wild:

The Big Boss: cyber-security-events-full#

Dataset: mranv/cyber-security-events-full

Size: 772K events — the heavyweight champion for serious experiments
What’s Inside: A full-length movie about cyberattacks with rich feature sets
Features: Network flows, behavioral patterns, geographic data, IP reputation
Perfect For: Training production-ready threat detection models
Special Power: Like the Wikipedia of attacks — everything’s in here!

from datasets import load_dataset

# Load the comprehensive dataset
dataset = load_dataset("mranv/cyber-security-events-full")
df_full = dataset['train'].to_pandas()

print(f"Loaded {len(df_full)} security events")
print(f"Features: {list(df_full.columns)}")

The Time Whisperer: attacks-daily#

Dataset: mranv/attacks-daily

Size: 676K records — laser-focused on temporal patterns
What’s Inside: Daily chronicles of attacks with precise timestamps
Features: Time series attacks, seasonal patterns, activity cycles
Perfect For: Predicting “when” the next attack will happen
Special Power: Shows that even hackers have daily routines!

import pandas as pd

# Load daily attack patterns
dataset_daily = load_dataset("mranv/attacks-daily")
df_daily = dataset_daily['train'].to_pandas()

# Analyze temporal patterns
df_daily['timestamp'] = pd.to_datetime(df_daily['timestamp'])
df_daily['hour'] = df_daily['timestamp'].dt.hour

attack_by_hour = df_daily.groupby('hour').size()
print("Attacks by hour:")
print(attack_by_hour)
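
If the `hour` column is later used as a model feature, one caveat is that the raw value treats 23:00 and 00:00 as 23 units apart even though they are adjacent. A common workaround is to encode the hour as a point on a circle using sine and cosine. The sketch below uses only the standard library. The helper names `encode_hour` and `circular_distance` are hypothetical and not part of the pipeline above.

```python
import math


def encode_hour(hour):
    """Encode an hour (0-23) as a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * (hour % 24) / 24
    return math.sin(angle), math.cos(angle)


def circular_distance(h1, h2):
    """Euclidean distance between two encoded hours."""
    s1, c1 = encode_hour(h1)
    s2, c2 = encode_hour(h2)
    return math.hypot(s1 - s2, c1 - c2)


# Adjacent hours across midnight end up close; opposite hours end up far apart
print(f"23h vs 0h:  {circular_distance(23, 0):.3f}")
print(f"0h vs 12h:  {circular_distance(0, 12):.3f}")
```

Tree-based models such as random forests can often cope with the raw hour, so this encoding tends to matter more for linear or distance-based models.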

The Compact Trainer: cyber-security-events#

Dataset: mranv/cyber-security-events

Size: 15.1K events — perfect size for rapid experimentation
What’s Inside: Curated selection of the most interesting attacks
Features: Balanced mix of different attack types
Perfect For: First steps and quick prototyping
Special Power: Like a starter pack for ML researchers!

# Load compact dataset for quick experiments
dataset_compact = load_dataset("mranv/cyber-security-events")
df_compact = dataset_compact['train'].to_pandas()

print(f"Compact dataset: {len(df_compact)} events")
print(f"Attack types: {df_compact['attack_type'].value_counts()}")

The Intrusion Specialist: network-intrusion-detection#

Dataset: mranv/network-intrusion-detection

Size: 100 records — small but mighty
What’s Inside: High-quality examples of network intrusions
Features: Clear classifications, samples for IDS systems
Perfect For: Intrusion detection system developers
Special Power: Each record is a textbook example!

Complete Pipeline with Hugging Face Datasets#

from datasets import load_dataset
import pandas as pd

def load_and_prepare_dataset(dataset_name="mranv/cyber-security-events-full"):
    """
    Load dataset from Hugging Face and prepare for ML

    Args:
        dataset_name: Hugging Face dataset identifier

    Returns:
        Prepared DataFrame ready for model training
    """
    print(f"Loading dataset: {dataset_name}")

    # Load dataset
    dataset = load_dataset(dataset_name)

    # Convert to pandas
    df = dataset['train'].to_pandas()

    print(f"Loaded {len(df)} records with {len(df.columns)} features")

    # Apply preprocessing pipeline
    print("\nApplying preprocessing...")
    df = validate_network_data(df)
    df = preprocess_security_events(df)

    # Calculate quality metrics
    metrics = calculate_data_quality_metrics(df)

    # Engineer features
    print("\nEngineering features...")
    df = engineer_all_features(df)

    print(f"\nDataset ready: {len(df)} records, {len(df.columns)} features")

    return df

# Example usage
df = load_and_prepare_dataset("mranv/cyber-security-events-full")

# Train models
binary_model, binary_metrics = train_binary_classifier(df)
multiclass_model, label_encoder, mc_metrics = train_multiclass_classifier(df)
anomaly_model, anomaly_scaler, anomaly_scores = train_anomaly_detector(df)

Part 7: macOS M1/M2/M4 Compatibility#

Installing Dependencies for Apple Silicon#

When running this code on Apple Silicon Macs (M1, M2, M4), you may encounter XGBoost installation issues. Here’s how to resolve them:

# Install OpenMP runtime (required for XGBoost)
brew install libomp

# Install Python packages
pip install pandas numpy scikit-learn datasets xgboost matplotlib seaborn

# If you encounter issues with XGBoost, try:
pip uninstall xgboost
pip install xgboost --no-cache-dir --no-binary xgboost

# Alternative: Install from conda-forge (recommended for M1/M2/M4)
conda install -c conda-forge xgboost

Troubleshooting Common Issues#

Issue 1: XGBoostError: XGBoost Library (libxgboost.dylib) could not be loaded

# Solution
brew install libomp
export LDFLAGS="-L/opt/homebrew/opt/libomp/lib"
export CPPFLAGS="-I/opt/homebrew/opt/libomp/include"

pip install xgboost --no-cache-dir

Issue 2: Performance issues on Apple Silicon

# Ensure you're using native ARM64 Python
python -c "import platform; print(platform.machine())"
# Should output: arm64

# If output is x86_64, you're running through Rosetta
# Install native Python:
brew install python@3.11

Issue 3: NumPy/Pandas performance on M-series chips

# Use optimized versions
pip install --upgrade numpy pandas

# For maximum performance, use conda
conda install numpy pandas scikit-learn -c conda-forge

Optimized Setup Script for Apple Silicon#

#!/bin/bash
echo "Setting up ML environment for Apple Silicon..."

# Check architecture (abort on anything other than native ARM64)
if [[ "$(uname -m)" != "arm64" ]]; then
    echo "Error: Not running on ARM64 architecture"
    exit 1
fi

# Install Homebrew dependencies
echo "Installing Homebrew dependencies..."
brew install libomp

# Set environment variables
export LDFLAGS="-L/opt/homebrew/opt/libomp/lib"
export CPPFLAGS="-I/opt/homebrew/opt/libomp/include"

# Install Python packages
echo "Installing Python packages..."
pip install --upgrade pip
pip install pandas numpy scikit-learn matplotlib seaborn jupyter
pip install datasets huggingface_hub
pip install xgboost --no-cache-dir

echo "Installation complete!"
echo "Testing XGBoost..."
python -c "import xgboost; print(f'XGBoost version: {xgboost.__version__}')"
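
The script above only tests XGBoost. As a complement, here is a small Python sketch that reports the interpreter architecture and tries to import each package the guide relies on, so a broken install shows up in one place. The `check_environment` function and its default package list are assumptions for illustration. It uses only the standard library and catches any exception on import, since a missing native library may not surface as a plain `ImportError`.

```python
import importlib
import platform


def check_environment(packages=("numpy", "pandas", "sklearn", "xgboost")):
    """Report the interpreter architecture and which packages import cleanly."""
    report = {
        "machine": platform.machine(),  # 'arm64' for native Apple Silicon Python
        "python": platform.python_version(),
        "packages": {},
    }
    for name in packages:
        try:
            module = importlib.import_module(name)
            report["packages"][name] = getattr(module, "__version__", "unknown")
        except Exception as exc:  # ImportError or a native-library loading error
            report["packages"][name] = f"FAILED: {exc.__class__.__name__}"
    return report


if __name__ == "__main__":
    result = check_environment()
    print(f"Architecture: {result['machine']}, Python {result['python']}")
    for pkg, status in result["packages"].items():
        print(f"  {pkg}: {status}")
```

If the architecture line shows `x86_64` on an M-series Mac, the interpreter is running under Rosetta and the fixes in Issue 2 apply.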

Part 8: Best Practices and Production Considerations#

1. Dataset Quality and Labeling#

def validate_labels(df):
    """
    Validate and verify dataset labels

    Args:
        df: DataFrame with labels

    Returns:
        Validation report
    """
    if 'attack_type' not in df.columns:
        return {'error': 'No labels found'}

    label_stats = {
        'total_samples': len(df),
        'label_distribution': df['attack_type'].value_counts().to_dict(),
        # Cast NumPy integers to plain ints so the report can be dumped to JSON
        'missing_labels': int(df['attack_type'].isnull().sum()),
        'unique_labels': int(df['attack_type'].nunique())
    }

    # Check for label quality issues
    issues = []

    # Check for severely imbalanced classes (smallest class under 1% of rows)
    min_class_percentage = (df['attack_type'].value_counts().min() / len(df)) * 100
    if min_class_percentage < 1:
        issues.append(f"Severely imbalanced: smallest class is {min_class_percentage:.2f}%")

    # Check for missing labels
    if label_stats['missing_labels'] > 0:
        issues.append(f"{label_stats['missing_labels']} samples with missing labels")

    label_stats['issues'] = issues

    return label_stats
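
When `validate_labels` reports a severe imbalance, one common response is to weight classes inversely to their frequency during training. The sketch below computes such weights with the heuristic n_samples / (n_classes * class_count), which is the formula scikit-learn documents for `class_weight='balanced'`. The `balanced_class_weights` helper and the label counts are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical label column with one dominant and one very rare class
labels = ['ssh_bruteforce'] * 90 + ['port_scan'] * 9 + ['sql_injection'] * 1

weights = balanced_class_weights(labels)
for label, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{label}: {weight:.2f}")
```

Weights like these can be passed to classifiers that accept a per-class weight mapping. Whether reweighting, resampling, or collecting more data for the rare class works best depends on the dataset, so it is worth comparing them on a held-out set.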

2. Model Drift Monitoring#

import json
import os
from datetime import datetime

class ModelMonitor:
    """
    Monitor model performance and detect drift
    """

    def __init__(self, baseline_metrics):
        """
        Initialize with baseline metrics from training

        Args:
            baseline_metrics: Dict with baseline performance metrics
        """
        self.baseline = baseline_metrics
        self.history = []

    def log_predictions(self, y_true, y_pred, timestamp=None):
        """
        Log prediction results for monitoring

        Args:
            y_true: True labels
            y_pred: Predicted labels
            timestamp: Prediction timestamp
        """
        from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

        metrics = {
            'timestamp': timestamp or datetime.now().isoformat(),
            'accuracy': accuracy_score(y_true, y_pred),
            'precision': precision_score(y_true, y_pred, average='weighted', zero_division=0),
            'recall': recall_score(y_true, y_pred, average='weighted', zero_division=0),
            'f1': f1_score(y_true, y_pred, average='weighted', zero_division=0),
            'sample_count': len(y_true)
        }

        self.history.append(metrics)

        # Check for drift
        self.detect_drift(metrics)

        return metrics

    def detect_drift(self, current_metrics, threshold=0.1):
        """
        Detect if model performance has degraded

        Args:
            current_metrics: Current performance metrics
            threshold: Acceptable relative degradation (default 0.1 = 10%)

        Returns:
            Boolean indicating drift detected
        """
        drift_detected = False
        alerts = []

        for metric in ['accuracy', 'precision', 'recall', 'f1']:
            if metric in self.baseline and metric in current_metrics:
                baseline_value = self.baseline[metric]
                current_value = current_metrics[metric]

                # Relative degradation is undefined for a zero baseline
                if baseline_value == 0:
                    continue

                degradation = (baseline_value - current_value) / baseline_value

                if degradation > threshold:
                    drift_detected = True
                    alerts.append(
                        f"{metric.upper()} degraded by {degradation*100:.1f}% "
                        f"(baseline: {baseline_value:.3f}, current: {current_value:.3f})"
                    )

        if drift_detected:
            print("⚠️  MODEL DRIFT DETECTED!")
            for alert in alerts:
                print(f"   - {alert}")
            print("   Consider retraining the model with recent data.")

        return drift_detected

    def save_history(self, filepath='monitoring/model_history.json'):
        """
        Save monitoring history to file

        Args:
            filepath: Path to save history
        """
        # Create the target directory if it does not exist yet
        directory = os.path.dirname(filepath)
        if directory:
            os.makedirs(directory, exist_ok=True)

        with open(filepath, 'w') as f:
            json.dump(self.history, f, indent=2)

        print(f"Monitoring history saved to: {filepath}")
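
`detect_drift` compares a single batch against the baseline, so one unusually hard batch can trigger an alert. A possible variation, sketched below under that assumption, is to average a metric over the last few batches and alert only when the rolling mean degrades past the threshold. The `RollingDriftCheck` class, the window size, and the F1 values are hypothetical.

```python
from collections import deque


class RollingDriftCheck:
    """Flag drift only when the rolling mean of a metric degrades past a threshold."""

    def __init__(self, baseline, window=5, threshold=0.1):
        if baseline <= 0:
            raise ValueError("baseline must be positive")
        self.baseline = baseline
        self.threshold = threshold
        self.values = deque(maxlen=window)  # keeps only the most recent values

    def update(self, value):
        """Add a new metric value; return True if the window is full and drifted."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False  # not enough evidence yet
        rolling_mean = sum(self.values) / len(self.values)
        degradation = (self.baseline - rolling_mean) / self.baseline
        return degradation > self.threshold


# Hypothetical per-batch F1 scores against a training baseline of 0.90
check = RollingDriftCheck(baseline=0.90, window=3, threshold=0.10)
for f1 in (0.89, 0.70, 0.88, 0.75, 0.72):
    print(f"f1={f1:.2f} drift={check.update(f1)}")
```

In this run the single dip to 0.70 does not raise an alert on its own. The alert fires only once low values dominate the window. The trade-off is slower detection, so the window size should reflect how quickly labeled feedback arrives.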

3. SIEM Integration#

import requests
from datetime import datetime

class SIEMIntegration:
    """
    Integration with SIEM systems (Splunk, Wazuh, etc.)
    """

    def __init__(self, siem_url, api_key):
        """
        Initialize SIEM integration

        Args:
            siem_url: SIEM API endpoint
            api_key: Authentication key
        """
        self.siem_url = siem_url
        self.api_key = api_key
        self.headers = {
            'Authorization': f'Bearer {api_key}',
            'Content-Type': 'application/json'
        }

    def send_alert(self, threat_assessment, event_data):
        """
        Send ML threat assessment to SIEM

        Args:
            threat_assessment: Model prediction result
            event_data: Original security event
        """
        alert_payload = {
            'timestamp': datetime.now().isoformat(),
            'source': 'ML_Threat_Detector',
            'severity': self.map_risk_to_severity(threat_assessment['risk_level']),
            'threat_probability': threat_assessment['threat_probability'],
            'risk_level': threat_assessment['risk_level'],
            'src_ip': event_data.get('src_ip'),
            'dst_port': event_data.get('dst_port'),
            'description': f"ML-detected threat from {event_data.get('src_ip')} "
                          f"with {threat_assessment['threat_probability']:.0%} confidence",
            'raw_event': event_data
        }

        try:
            response = requests.post(
                f"{self.siem_url}/alerts",
                headers=self.headers,
                json=alert_payload,
                timeout=10
            )

            # response.ok is True for any status below 400, so 201/202 also count
            if response.ok:
                print(f"✓ Alert sent to SIEM: {threat_assessment['risk_level']} risk")
            else:
                print(f"✗ Failed to send alert: {response.status_code}")

        except requests.RequestException as e:
            # Covers connection errors, timeouts, and other request failures
            print(f"✗ Error sending alert to SIEM: {str(e)}")

    def map_risk_to_severity(self, risk_level):
        """
        Map ML risk level to SIEM severity

        Args:
            risk_level: ML risk assessment (low/medium/high/critical)

        Returns:
            SIEM severity level
        """
        mapping = {
            'low': 1,
            'medium': 2,
            'high': 3,
            'critical': 4
        }

        return mapping.get(risk_level, 1)
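
`send_alert` makes a single attempt, so a brief SIEM outage means a lost alert. One way to soften that is retrying with exponential backoff. The sketch below is deliberately independent of `requests`: it assumes a sender callable that returns `True` on success, which `send_alert` could be wrapped to provide. The names `send_with_retry` and `flaky_send` are hypothetical, and the fake sender exists only to make the example runnable without a network.

```python
import time


def send_with_retry(send, payload, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `send(payload)` until it returns True or attempts run out.

    `send` is any callable returning True on success; exceptions count as failures.
    Delays between attempts grow as base_delay * 2**attempt (1s, 2s, 4s, ...).
    """
    for attempt in range(max_attempts):
        try:
            if send(payload):
                return True
        except Exception as exc:
            print(f"Attempt {attempt + 1} failed: {exc}")
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return False


# Demonstration with a fake sender that fails twice, then succeeds
calls = {'count': 0}


def flaky_send(payload):
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError("SIEM unreachable")
    return True


# sleep is replaced with a no-op so the demo finishes instantly
delivered = send_with_retry(flaky_send, {'risk_level': 'high'}, sleep=lambda s: None)
print(f"delivered={delivered} after {calls['count']} attempts")
```

For high alert volumes, a persistent queue in front of the SIEM is usually a more robust choice than in-process retries, since retries block the detection loop while they wait.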

Conclusion#

Congratulations! You’ve mastered the complete journey from raw honeypot data to production-ready ML threat detection systems. Let’s recap what you’ve accomplished:

Key Achievements#

✅ Data Preprocessing - Validated, cleaned, and prepared real honeypot datasets

✅ Feature Engineering - Created temporal, behavioral, geographic, and target-based features

✅ Model Training - Built binary classifiers, multi-class categorizers, and anomaly detectors

✅ Production Deployment - Implemented real-time inference pipelines

✅ SIEM Integration - Connected ML models with security operations

✅ Monitoring & Maintenance - Set up drift detection and performance tracking

The Complete Workflow#

Raw Honeypot Data
       ↓
Data Validation & Cleaning
       ↓
Feature Engineering
       ↓
Model Training (Binary/Multi-class/Anomaly)
       ↓
Deployment Pipeline
       ↓
Real-time Threat Detection
       ↓
SIEM Integration & Alerting
       ↓
Monitoring & Retraining

What Makes This Approach Powerful#

Real-World Data: Honeypots capture actual attacker behavior, not synthetic scenarios
Multiple Detection Layers: Binary, multi-class, and anomaly detection provide comprehensive coverage
Production-Ready: Complete pipeline from data to deployment
Adaptable: Easy to customize for your specific environment
Cost-Effective: Open-source tools and free datasets

Next Steps#

Immediate Actions#

Download Datasets: Start with mranv/cyber-security-events for quick experiments
Run the Code: Execute the complete pipeline on your local machine
Experiment: Try different models and feature combinations
Deploy: Set up real-time inference in your environment

Advanced Enhancements#

Deep Learning: Implement LSTM or Transformer models for sequence analysis
Ensemble Methods: Combine multiple models for better accuracy
Transfer Learning: Fine-tune pre-trained models on your data
AutoML: Automated hyperparameter tuning and model selection
Explainable AI: Add SHAP values for model interpretability

Resources#

Code Repository#

All code from this guide is available on GitHub:

git clone https://github.com/mranv/ml-threat-intelligence
cd ml-threat-intelligence
pip install -r requirements.txt

Datasets on Hugging Face#

mranv/cyber-security-events-full: 772K events, full feature set
mranv/attacks-daily: 676K records, temporal patterns
mranv/cyber-security-events: 15.1K events, quick experiments
mranv/network-intrusion-detection: 100 records, intrusion examples

Community and Support#

GitHub Issues: Report bugs and request features
Discussions: Share your implementations and ask questions
Pull Requests: Contribute improvements

Final Thoughts#

Machine learning is transforming cybersecurity from reactive to proactive defense. By combining real honeypot data with modern ML techniques, you can detect threats that traditional signature-based systems miss entirely.

The key to success is continuous improvement: keep collecting new data, monitoring model performance, and adapting to emerging threats. Security is an ongoing journey, and you now have the tools to stay ahead.

Remember: the attackers are always evolving. Your models should too. 🛡️🤖

Happy Threat Hunting!

Author: Anubhav Gain | Last Updated: October 4, 2025 | Version: 1.0 | License: MIT

Disclaimer: This guide is for educational and defensive security purposes. Always ensure you have proper authorization before deploying security systems and collecting network data. Comply with all applicable laws and regulations regarding data privacy and security monitoring.