Firecracker Production Deployment Guide: Enterprise-Ready MicroVM Infrastructure


Introduction

Deploying Firecracker microVMs in production requires careful planning, robust automation, and comprehensive operational practices. This guide provides a complete roadmap for building enterprise-ready Firecracker infrastructure, from initial setup through ongoing operations.

We’ll cover infrastructure design patterns, deployment automation, monitoring strategies, disaster recovery, and operational procedures that have been proven in large-scale production environments. By following this guide, you’ll build a resilient, scalable, and maintainable Firecracker platform.

Production Architecture Overview

graph TB
    subgraph "Load Balancer Tier"
        LB1[Load Balancer 1]
        LB2[Load Balancer 2]
        LB3[Load Balancer 3]
    end
    subgraph "Control Plane"
        API[API Server Cluster]
        SCHED[Scheduler Service]
        ETCD[etcd Cluster]
        MONITOR[Monitoring Stack]
    end
    subgraph "Compute Nodes"
        subgraph "Zone A"
            NODE1[Compute Node 1]
            NODE2[Compute Node 2]
            NODE3[Compute Node 3]
        end
        subgraph "Zone B"
            NODE4[Compute Node 4]
            NODE5[Compute Node 5]
            NODE6[Compute Node 6]
        end
        subgraph "Zone C"
            NODE7[Compute Node 7]
            NODE8[Compute Node 8]
            NODE9[Compute Node 9]
        end
    end
    subgraph "Storage Tier"
        STORAGE[Distributed Storage]
        BACKUP[Backup Storage]
    end
    subgraph "Network Infrastructure"
        SWITCH[Top-of-Rack Switches]
        SPINE[Spine Switches]
        BORDER[Border Routers]
    end
    LB1 --> API
    LB2 --> API
    LB3 --> API
    API --> SCHED
    API --> ETCD
    SCHED --> NODE1
    SCHED --> NODE4
    SCHED --> NODE7
    NODE1 --> STORAGE
    NODE4 --> STORAGE
    NODE7 --> STORAGE
    MONITOR --> NODE1
    MONITOR --> NODE4
    MONITOR --> NODE7
    STORAGE --> BACKUP

Design Principles

High Availability: No single points of failure across all components
Horizontal Scalability: Linear scaling with additional compute nodes
Security Isolation: Multiple layers of isolation and access control
Operational Simplicity: Automated operations with minimal manual intervention
Cost Optimization: Efficient resource utilization and dynamic scaling

Infrastructure Prerequisites

Hardware Requirements

hardware-requirements.yaml
compute_nodes:
  minimum_configuration:
    cpu_cores: 16
    memory_gb: 64
    storage_gb: 1000
    network_interfaces: 2
    network_bandwidth_gbps: 10
  recommended_configuration:
    cpu_cores: 32
    memory_gb: 128
    storage_gb: 2000
    network_interfaces: 4
    network_bandwidth_gbps: 25
  optimal_configuration:
    cpu_cores: 64
    memory_gb: 256
    storage_gb: 4000
    network_interfaces: 4
    network_bandwidth_gbps: 100
control_plane:
  minimum_nodes: 3
  cpu_cores_per_node: 8
  memory_gb_per_node: 16
  storage_gb_per_node: 500
storage_requirements:
  min_iops_per_vm: 1000
  min_bandwidth_mbps_per_vm: 100
  replication_factor: 3
  backup_retention_days: 30
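These sizing numbers translate directly into per-node VM density. As a rough sanity check, a small calculator (the memory headroom and vCPU overcommit factors below are illustrative assumptions, not Firecracker requirements) shows which resource bounds how many microVMs a node can host:

```python
def max_vms_per_node(node_memory_gb, node_cores, vm_memory_mb=512, vm_vcpus=1,
                     memory_headroom=0.2, cpu_overcommit=4):
    """Estimate microVM density for one compute node.

    Reserves a fraction of RAM for the host OS and VMM overhead, and
    allows vCPU overcommit, since idle microVMs rarely saturate a core.
    """
    usable_mb = node_memory_gb * 1024 * (1 - memory_headroom)
    by_memory = int(usable_mb // vm_memory_mb)
    by_cpu = int(node_cores * cpu_overcommit // vm_vcpus)
    return min(by_memory, by_cpu)

# Recommended node (32 cores / 128 GB) running 512 MiB single-vCPU VMs:
print(max_vms_per_node(128, 32))  # 128 -- CPU-bound at 4x overcommit
```

With these assumptions the recommended node is CPU-bound; raising the overcommit factor shifts the limit to memory.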

Network Design

#!/bin/bash
# Network configuration for production Firecracker deployment

echo "=== Production Network Setup ==="

# VLAN configuration for multi-tenant isolation
configure_vlans() {
    # Management VLAN (VLAN 100)
    sudo ip link add link eth0 name eth0.100 type vlan id 100
    sudo ip addr add 10.0.100.10/24 dev eth0.100
    sudo ip link set eth0.100 up

    # Compute VLAN (VLAN 200)
    sudo ip link add link eth0 name eth0.200 type vlan id 200
    sudo ip addr add 10.0.200.10/24 dev eth0.200
    sudo ip link set eth0.200 up

    # Storage VLAN (VLAN 300). The subnet cannot simply mirror the VLAN id:
    # 10.0.300.0/24 is not a valid IPv4 network, so we use 10.0.30.0/24.
    sudo ip link add link eth0 name eth0.300 type vlan id 300
    sudo ip addr add 10.0.30.10/24 dev eth0.300
    sudo ip link set eth0.300 up

    # Tenant VLANs (VLAN 400-410). Tenant subnets live in 10.1.0.0/16 so
    # they cannot collide with the management/compute/storage subnets above.
    for vlan in {400..410}; do
        sudo ip link add link eth1 name eth1.$vlan type vlan id $vlan
        sudo ip addr add 10.1.$((vlan - 400)).10/24 dev eth1.$vlan
        sudo ip link set eth1.$vlan up
    done

    echo "✓ VLAN configuration complete"
}

# Configure Open vSwitch for advanced networking
setup_ovs() {
    # Install Open vSwitch
    sudo apt update
    sudo apt install -y openvswitch-switch

    # Create management bridge
    sudo ovs-vsctl add-br br-mgmt
    sudo ovs-vsctl add-port br-mgmt eth0.100

    # Create compute bridge with VXLAN support
    sudo ovs-vsctl add-br br-compute
    sudo ovs-vsctl add-port br-compute eth0.200

    # Configure VXLAN for overlay networking
    sudo ovs-vsctl add-port br-compute vxlan1 -- \
        set interface vxlan1 type=vxlan options:remote_ip=10.0.200.11

    # Create tenant bridges
    for tenant in {1..10}; do
        bridge_name="br-tenant-$tenant"
        sudo ovs-vsctl add-br $bridge_name
        sudo ovs-vsctl set bridge $bridge_name \
            other_config:hwaddr=02:00:00:00:00:$(printf "%02x" $tenant)
    done

    echo "✓ Open vSwitch configuration complete"
}

# Configure SR-IOV for high-performance networking
setup_sriov() {
    echo "Setting up SR-IOV..."

    # Enable SR-IOV (requires a compatible NIC)
    echo 8 | sudo tee /sys/class/net/eth2/device/sriov_numvfs

    # Configure virtual functions
    for vf in {0..7}; do
        sudo ip link set eth2 vf $vf mac 02:00:00:00:01:$(printf "%02x" $vf)
        sudo ip link set eth2 vf $vf vlan $((400 + vf))
        sudo ip link set eth2 vf $vf spoofchk on
        sudo ip link set eth2 vf $vf trust off
    done

    echo "✓ SR-IOV configuration complete"
}

# Main network setup
configure_vlans
setup_ovs
setup_sriov
echo "Production network setup complete!"
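The tenant-VLAN loop derives each tenant's subnet from its VLAN id. That arithmetic is worth auditing, because a careless offset can land a tenant subnet on top of the management or compute VLAN. A small helper (names hypothetical) makes the scheme explicit, using a dedicated 10.1.0.0/16 range for tenants:

```python
import ipaddress

TENANT_VLAN_BASE = 400  # first tenant VLAN id

def tenant_subnet(vlan_id: int) -> ipaddress.IPv4Network:
    """Map a tenant VLAN id (400-499) to its /24, carved from 10.1.0.0/16.

    Keeping tenants in 10.1.0.0/16 guarantees no overlap with the
    management (10.0.100.0/24) or compute (10.0.200.0/24) subnets.
    """
    if not 400 <= vlan_id <= 499:
        raise ValueError(f"not a tenant VLAN: {vlan_id}")
    return ipaddress.ip_network(f"10.1.{vlan_id - TENANT_VLAN_BASE}.0/24")

print(tenant_subnet(400))  # 10.1.0.0/24
print(tenant_subnet(410))  # 10.1.10.0/24
```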

Deployment Automation

Infrastructure as Code with Terraform

# main.tf - Firecracker Infrastructure
terraform {
  required_version = ">= 1.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

variable "environment" {
  description = "Environment name"
  type        = string
  default     = "production"
}

variable "cluster_name" {
  description = "Firecracker cluster name"
  type        = string
  default     = "firecracker-prod"
}

variable "compute_node_count" {
  description = "Number of compute nodes"
  type        = number
  default     = 9
}

# VPC and networking
resource "aws_vpc" "firecracker_vpc" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name        = "${var.cluster_name}-vpc"
    Environment = var.environment
  }
}

resource "aws_subnet" "compute_subnets" {
  count                   = 3
  vpc_id                  = aws_vpc.firecracker_vpc.id
  cidr_block              = "10.0.${count.index + 1}.0/24"
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = false

  tags = {
    Name        = "${var.cluster_name}-compute-subnet-${count.index + 1}"
    Environment = var.environment
    Type        = "compute"
  }
}

resource "aws_subnet" "control_subnets" {
  count                   = 3
  vpc_id                  = aws_vpc.firecracker_vpc.id
  cidr_block              = "10.0.${count.index + 10}.0/24"
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true

  tags = {
    Name        = "${var.cluster_name}-control-subnet-${count.index + 1}"
    Environment = var.environment
    Type        = "control"
  }
}

# Security Groups
resource "aws_security_group" "compute_nodes" {
  name_prefix = "${var.cluster_name}-compute-"
  vpc_id      = aws_vpc.firecracker_vpc.id

  # Firecracker API access
  ingress {
    from_port   = 8080
    to_port     = 8099
    protocol    = "tcp"
    cidr_blocks = [aws_vpc.firecracker_vpc.cidr_block]
  }

  # SSH access
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/16"]
  }

  # VM networking
  ingress {
    from_port   = 0
    to_port     = 65535
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/16"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name        = "${var.cluster_name}-compute-sg"
    Environment = var.environment
  }
}

# Launch template for compute nodes
resource "aws_launch_template" "compute_nodes" {
  name_prefix   = "${var.cluster_name}-compute-"
  image_id      = data.aws_ami.ubuntu.id
  instance_type = "m6i.4xlarge"
  key_name      = aws_key_pair.cluster_key.key_name

  vpc_security_group_ids = [aws_security_group.compute_nodes.id]

  user_data = base64encode(templatefile("${path.module}/user_data/compute_node.sh", {
    cluster_name = var.cluster_name
    environment  = var.environment
  }))

  block_device_mappings {
    device_name = "/dev/sda1"
    ebs {
      volume_size = 500
      volume_type = "gp3"
      iops        = 12000
      throughput  = 1000
      encrypted   = true
    }
  }

  # Additional EBS volume for VM storage
  block_device_mappings {
    device_name = "/dev/sdf"
    ebs {
      volume_size = 2000
      volume_type = "gp3"
      iops        = 16000
      throughput  = 1000
      encrypted   = true
    }
  }

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name        = "${var.cluster_name}-compute"
      Environment = var.environment
      Role        = "compute"
    }
  }

  tags = {
    Name        = "${var.cluster_name}-compute-template"
    Environment = var.environment
  }
}

# Auto Scaling Group for compute nodes
resource "aws_autoscaling_group" "compute_nodes" {
  name                = "${var.cluster_name}-compute-asg"
  vpc_zone_identifier = aws_subnet.compute_subnets[*].id
  target_group_arns   = []
  health_check_type   = "EC2"
  min_size            = var.compute_node_count
  max_size            = var.compute_node_count * 2
  desired_capacity    = var.compute_node_count

  launch_template {
    id      = aws_launch_template.compute_nodes.id
    version = "$Latest"
  }

  tag {
    key                 = "Name"
    value               = "${var.cluster_name}-compute-node"
    propagate_at_launch = true
  }

  tag {
    key                 = "Environment"
    value               = var.environment
    propagate_at_launch = true
  }

  tag {
    key                 = "Role"
    value               = "compute"
    propagate_at_launch = true
  }
}

# Application Load Balancer for API access
resource "aws_lb" "api_lb" {
  name               = "${var.cluster_name}-api-alb"
  internal           = true
  load_balancer_type = "application"
  security_groups    = [aws_security_group.api_lb.id]
  subnets            = aws_subnet.control_subnets[*].id

  enable_deletion_protection = false

  tags = {
    Name        = "${var.cluster_name}-api-alb"
    Environment = var.environment
  }
}

# RDS for metadata storage
resource "aws_rds_cluster" "metadata_db" {
  cluster_identifier          = "${var.cluster_name}-metadata"
  engine                      = "aurora-postgresql"
  engine_version              = "13.7"
  availability_zones          = data.aws_availability_zones.available.names
  database_name               = "firecracker_metadata"
  master_username             = "fcadmin"
  manage_master_user_password = true
  backup_retention_period     = 30
  preferred_backup_window     = "03:00-05:00"
  vpc_security_group_ids      = [aws_security_group.database.id]
  db_subnet_group_name        = aws_db_subnet_group.metadata.name

  tags = {
    Name        = "${var.cluster_name}-metadata-db"
    Environment = var.environment
  }
}

# Outputs
output "vpc_id" {
  value = aws_vpc.firecracker_vpc.id
}

output "compute_subnets" {
  value = aws_subnet.compute_subnets[*].id
}

output "api_load_balancer_dns" {
  value = aws_lb.api_lb.dns_name
}

output "database_endpoint" {
  value = aws_rds_cluster.metadata_db.endpoint
}
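The compute and control subnets are carved out of the VPC CIDR by `count.index`. The same derivation in Python makes the address layout easy to verify before an apply (a sketch mirroring Terraform's `"10.0.${count.index + offset}.0/24"` interpolation):

```python
import ipaddress

VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")

def subnet_cidr(index: int, offset: int) -> ipaddress.IPv4Network:
    """Mirror the Terraform cidr_block expression and verify containment."""
    net = ipaddress.ip_network(f"10.0.{index + offset}.0/24")
    assert net.subnet_of(VPC_CIDR), f"{net} escapes the VPC CIDR"
    return net

compute = [subnet_cidr(i, 1) for i in range(3)]   # 10.0.1.0/24 .. 10.0.3.0/24
control = [subnet_cidr(i, 10) for i in range(3)]  # 10.0.10.0/24 .. 10.0.12.0/24
print(compute, control)
```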

Configuration Management with Ansible

ansible/playbooks/deploy-firecracker.yml
---
- name: Deploy Firecracker Infrastructure
  hosts: compute_nodes
  become: yes
  vars:
    firecracker_version: "1.4.1"
    kata_version: "3.0.0"
    # cluster_name and environment are supplied at runtime (e.g. via
    # --extra-vars); redefining them here as "{{ cluster_name }}" would
    # create a recursive template loop
  tasks:
    - name: Update system packages
      apt:
        update_cache: yes
        upgrade: dist

    - name: Install required packages
      apt:
        name:
          - curl
          - git
          - jq
          - bridge-utils
          - iptables-persistent
          - qemu-kvm
          - libvirt-daemon-system
          - libvirt-clients
          - cpu-checker
        state: present

    - name: Check KVM support
      command: kvm-ok
      register: kvm_check
      failed_when: "'KVM acceleration can be used' not in kvm_check.stdout"

    - name: Create firecracker user
      user:
        name: firecracker
        system: yes
        shell: /bin/bash
        home: /var/lib/firecracker
        create_home: yes

    - name: Add firecracker user to kvm group
      user:
        name: firecracker
        groups: kvm
        append: yes

    - name: Download Firecracker binary
      get_url:
        url: "https://github.com/firecracker-microvm/firecracker/releases/download/v{{ firecracker_version }}/firecracker-v{{ firecracker_version }}-x86_64.tgz"
        dest: /tmp/firecracker.tgz
        mode: '0644'

    - name: Extract Firecracker binary
      unarchive:
        src: /tmp/firecracker.tgz
        dest: /tmp
        remote_src: yes

    - name: Install Firecracker binary
      copy:
        src: "/tmp/release-v{{ firecracker_version }}-x86_64/firecracker-v{{ firecracker_version }}-x86_64"
        dest: /usr/local/bin/firecracker
        mode: '0755'
        remote_src: yes
        owner: root
        group: root

    - name: Install Jailer binary
      copy:
        src: "/tmp/release-v{{ firecracker_version }}-x86_64/jailer-v{{ firecracker_version }}-x86_64"
        dest: /usr/local/bin/jailer
        mode: '0755'
        remote_src: yes
        owner: root
        group: root

    - name: Create firecracker directories
      file:
        path: "{{ item }}"
        state: directory
        owner: firecracker
        group: firecracker
        mode: '0755'
      loop:
        - /var/lib/firecracker
        - /var/lib/firecracker/images
        - /var/lib/firecracker/kernels
        - /var/lib/firecracker/vms
        - /var/log/firecracker
        - /etc/firecracker

    - name: Configure system for Firecracker
      template:
        src: sysctl-firecracker.conf.j2
        dest: /etc/sysctl.d/99-firecracker.conf
        mode: '0644'
      notify: reload sysctl

    - name: Configure hugepages
      lineinfile:
        path: /etc/default/grub
        regexp: '^GRUB_CMDLINE_LINUX_DEFAULT='
        line: 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash hugepagesz=1G hugepages=4 hugepagesz=2M hugepages=1024 isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7"'
      register: grub_config

    - name: Update GRUB
      command: update-grub
      when: grub_config.changed

    - name: Install Docker for container support
      apt:
        name: docker.io
        state: present

    - name: Install containerd
      apt:
        name: containerd
        state: present

    - name: Configure containerd for Kata
      template:
        src: containerd-config.toml.j2
        dest: /etc/containerd/config.toml
        mode: '0644'
      notify: restart containerd

    - name: Install Kata Containers
      block:
        - name: Add Kata repository
          apt_repository:
            repo: "deb http://download.opensuse.org/repositories/home:/katacontainers:/releases:/{{ ansible_distribution_release }}:/main/xUbuntu_{{ ansible_distribution_version }}/ /"
            state: present

        - name: Add Kata GPG key
          apt_key:
            url: "https://download.opensuse.org/repositories/home:katacontainers:releases:{{ ansible_distribution_release }}:main/xUbuntu_{{ ansible_distribution_version }}/Release.key"
            state: present

        - name: Install Kata Containers
          apt:
            name: kata-containers
            state: present
            update_cache: yes

    - name: Configure Kata for Firecracker
      template:
        src: kata-configuration.toml.j2
        dest: /etc/kata-containers/configuration-fc.toml
        mode: '0644'

    - name: Create VM management service
      template:
        src: firecracker-manager.service.j2
        dest: /etc/systemd/system/firecracker-manager.service
        mode: '0644'
      notify:
        - reload systemd
        - start firecracker-manager

    - name: Install monitoring agent
      template:
        src: firecracker-monitoring.py.j2
        dest: /usr/local/bin/firecracker-monitoring
        mode: '0755'

    - name: Create monitoring service
      template:
        src: firecracker-monitoring.service.j2
        dest: /etc/systemd/system/firecracker-monitoring.service
        mode: '0644'
      notify:
        - reload systemd
        - start firecracker-monitoring

    - name: Configure log rotation
      template:
        src: firecracker-logrotate.j2
        dest: /etc/logrotate.d/firecracker
        mode: '0644'

    - name: Install cleanup cron job
      cron:
        name: "Clean up old Firecracker logs"
        minute: "0"
        hour: "2"
        job: "/usr/local/bin/firecracker-cleanup"
        user: root

  handlers:
    - name: reload sysctl
      command: sysctl -p /etc/sysctl.d/99-firecracker.conf

    - name: restart containerd
      service:
        name: containerd
        state: restarted

    - name: reload systemd
      systemd:
        daemon_reload: yes

    - name: start firecracker-manager
      service:
        name: firecracker-manager
        state: started
        enabled: yes

    - name: start firecracker-monitoring
      service:
        name: firecracker-monitoring
        state: started
        enabled: yes
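The playbook gates on `kvm-ok` before installing anything. The same pre-flight check can be done from any orchestration code before scheduling VMs onto a node; a minimal sketch, assuming only that Firecracker needs read/write access to /dev/kvm:

```python
import os

def kvm_available(dev: str = "/dev/kvm") -> bool:
    """True if the KVM device exists and is read/write accessible,
    i.e. Firecracker can actually launch microVMs as the current user."""
    return os.path.exists(dev) and os.access(dev, os.R_OK | os.W_OK)

print("KVM ready" if kvm_available() else "KVM unavailable")
```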

CI/CD Pipeline

.github/workflows/deploy-firecracker.yml
name: Deploy Firecracker Infrastructure

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  AWS_REGION: us-west-2
  CLUSTER_NAME: firecracker-prod

jobs:
  validate:
    name: Validate Infrastructure Code
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.6.0

      - name: Terraform Format Check
        run: terraform fmt -check

      - name: Terraform Init
        run: terraform init -backend=false

      - name: Terraform Validate
        run: terraform validate

      - name: Setup Ansible
        run: pip install ansible ansible-lint

      - name: Ansible Lint
        run: ansible-lint ansible/playbooks/

  security-scan:
    name: Security Scan
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Run Checkov
        uses: bridgecrewio/checkov-action@master
        with:
          directory: .
          framework: terraform
          output_format: sarif
          output_file_path: reports/results.sarif

      - name: Upload SARIF file
        uses: github/codeql-action/upload-sarif@v2
        with:
          sarif_file: reports/results.sarif

  plan:
    name: Terraform Plan
    runs-on: ubuntu-latest
    needs: [validate, security-scan]
    if: github.event_name == 'pull_request'
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.6.0

      - name: Terraform Init
        run: terraform init

      - name: Terraform Plan
        run: |
          terraform plan -out=tfplan \
            -var="cluster_name=${{ env.CLUSTER_NAME }}" \
            -var="environment=staging"

      - name: Upload plan
        uses: actions/upload-artifact@v3
        with:
          name: terraform-plan
          path: tfplan

  deploy-staging:
    name: Deploy to Staging
    runs-on: ubuntu-latest
    needs: [plan]
    if: github.event_name == 'pull_request'
    environment: staging
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.6.0

      # Re-initialize providers in this job; the downloaded plan cannot be
      # applied without a terraform init in the fresh runner
      - name: Terraform Init
        run: terraform init

      - name: Download plan
        uses: actions/download-artifact@v3
        with:
          name: terraform-plan

      - name: Terraform Apply
        run: terraform apply -auto-approve tfplan

      - name: Get infrastructure outputs
        id: tf-outputs
        run: |
          echo "vpc_id=$(terraform output -raw vpc_id)" >> $GITHUB_OUTPUT
          echo "api_lb_dns=$(terraform output -raw api_load_balancer_dns)" >> $GITHUB_OUTPUT

      - name: Setup Ansible
        run: pip install ansible boto3 botocore

      - name: Generate Ansible inventory
        run: |
          ansible-playbook \
            -e vpc_id=${{ steps.tf-outputs.outputs.vpc_id }} \
            -e cluster_name=${{ env.CLUSTER_NAME }} \
            ansible/playbooks/generate-inventory.yml

      - name: Deploy Firecracker software
        run: |
          ansible-playbook \
            -i inventory/staging.ini \
            -e cluster_name=${{ env.CLUSTER_NAME }} \
            -e environment=staging \
            ansible/playbooks/deploy-firecracker.yml

      - name: Run integration tests
        run: |
          python tests/integration_tests.py \
            --api-endpoint ${{ steps.tf-outputs.outputs.api_lb_dns }} \
            --environment staging

  deploy-production:
    name: Deploy to Production
    runs-on: ubuntu-latest
    needs: [validate, security-scan]
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.6.0

      - name: Terraform Init
        run: terraform init

      - name: Terraform Plan
        run: |
          terraform plan -out=tfplan \
            -var="cluster_name=${{ env.CLUSTER_NAME }}" \
            -var="environment=production" \
            -var="compute_node_count=15"

      - name: Terraform Apply
        run: terraform apply -auto-approve tfplan

      - name: Get infrastructure outputs
        id: tf-outputs
        run: |
          echo "vpc_id=$(terraform output -raw vpc_id)" >> $GITHUB_OUTPUT
          echo "api_lb_dns=$(terraform output -raw api_load_balancer_dns)" >> $GITHUB_OUTPUT

      - name: Setup Ansible
        run: pip install ansible boto3 botocore

      - name: Deploy Firecracker software
        run: |
          ansible-playbook \
            -i inventory/production.ini \
            -e cluster_name=${{ env.CLUSTER_NAME }} \
            -e environment=production \
            ansible/playbooks/deploy-firecracker.yml

      - name: Run smoke tests
        run: |
          python tests/smoke_tests.py \
            --api-endpoint ${{ steps.tf-outputs.outputs.api_lb_dns }} \
            --environment production

      - name: Notify deployment
        uses: 8398a7/action-slack@v3
        with:
          status: ${{ job.status }}
          channel: '#deployments'
          text: |
            Firecracker production deployment completed!
            Environment: production
            API Endpoint: ${{ steps.tf-outputs.outputs.api_lb_dns }}
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
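The workflow calls tests/smoke_tests.py, which isn't reproduced in this guide. A minimal sketch of what such a post-deploy check might contain follows; the /healthz path and the function names are assumptions, not part of any documented API:

```python
def health_url(endpoint: str) -> str:
    """Build the health-check URL for the API load balancer (path assumed)."""
    return f"http://{endpoint.rstrip('/')}/healthz"

def run_smoke_test(endpoint: str, environment: str) -> bool:
    """Return True if the API behind the load balancer answers its health check."""
    import requests  # third-party: pip install requests
    try:
        resp = requests.get(health_url(endpoint), timeout=10)
    except requests.RequestException as exc:
        print(f"[{environment}] API unreachable: {exc}")
        return False
    if resp.status_code != 200:
        print(f"[{environment}] health check failed: {resp.status_code}")
        return False
    print(f"[{environment}] health check passed")
    return True
```

A real smoke test would also create and delete one throwaway VM through the API, since a healthy load balancer does not prove the scheduler works.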

VM Lifecycle Management

VM Manager Service

#!/usr/bin/env python3
"""
Firecracker VM Lifecycle Manager
Manages VM creation, monitoring, and cleanup in production environments
"""
import json
import time
import uuid
import logging
import threading
import subprocess
from pathlib import Path
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple
from dataclasses import dataclass, asdict
from enum import Enum

import psutil
import requests

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('/var/log/firecracker/vm-manager.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger('vm-manager')


class VMState(Enum):
    CREATING = "creating"
    RUNNING = "running"
    STOPPING = "stopping"
    STOPPED = "stopped"
    ERROR = "error"


@dataclass
class VMConfiguration:
    vm_id: str
    vcpus: int
    memory_mb: int
    kernel_path: str
    rootfs_path: str
    network_config: Dict
    storage_config: List[Dict]
    metadata: Dict
    created_at: datetime
    ttl_hours: Optional[int] = None


@dataclass
class VMInstance:
    config: VMConfiguration
    state: VMState
    pid: Optional[int]
    api_socket: str
    metrics_path: str
    log_path: str
    last_health_check: datetime
    resource_usage: Optional[Dict] = None


class FirecrackerVMManager:
    """Production-grade Firecracker VM lifecycle manager"""

    def __init__(self, config_path: str = '/etc/firecracker/manager.conf'):
        self.config = self._load_config(config_path)
        self.vms: Dict[str, VMInstance] = {}
        self.running = False
        self.health_check_interval = 30
        self.cleanup_interval = 300

        # Initialize directories
        self.base_dir = Path(self.config['base_directory'])
        self.vm_dir = self.base_dir / 'vms'
        self.image_dir = self.base_dir / 'images'
        self.kernel_dir = self.base_dir / 'kernels'
        self.log_dir = Path('/var/log/firecracker')
        for directory in [self.vm_dir, self.image_dir, self.kernel_dir, self.log_dir]:
            directory.mkdir(parents=True, exist_ok=True)

        # Load existing VMs
        self._discover_existing_vms()
    def _load_config(self, config_path: str) -> Dict:
        """Load manager configuration"""
        default_config = {
            'base_directory': '/var/lib/firecracker',
            'max_vms_per_node': 50,
            'default_vm_ttl_hours': 24,
            'health_check_enabled': True,
            'metrics_enabled': True,
            'auto_cleanup_enabled': True,
            'resource_limits': {
                'max_memory_mb': 8192,
                'max_vcpus': 8
            },
            'network': {
                'bridge_name': 'br0',
                'subnet': '172.16.0.0/16',
                'dhcp_range_start': '172.16.1.100',
                'dhcp_range_end': '172.16.1.200'
            }
        }
        try:
            with open(config_path, 'r') as f:
                user_config = json.load(f)
            default_config.update(user_config)
        except FileNotFoundError:
            logger.warning(f"Config file {config_path} not found, using defaults")
        return default_config

    def _discover_existing_vms(self):
        """Discover VMs that are already running"""
        logger.info("Discovering existing VMs...")
        for proc in psutil.process_iter(['pid', 'name', 'cmdline']):
            try:
                if proc.info['name'] == 'firecracker':
                    vm_id = self._extract_vm_id_from_cmdline(proc.info['cmdline'])
                    if vm_id:
                        vm_dir = self.vm_dir / vm_id
                        config_file = vm_dir / 'config.json'
                        if config_file.exists():
                            config = self._load_vm_config(config_file)
                            instance = VMInstance(
                                config=config,
                                state=VMState.RUNNING,
                                pid=proc.info['pid'],
                                api_socket=str(vm_dir / 'api.sock'),
                                metrics_path=str(vm_dir / 'metrics.json'),
                                log_path=str(self.log_dir / f'{vm_id}.log'),
                                last_health_check=datetime.now()
                            )
                            self.vms[vm_id] = instance
                            logger.info(f"Discovered existing VM: {vm_id}")
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
        logger.info(f"Discovered {len(self.vms)} existing VMs")

    def _extract_vm_id_from_cmdline(self, cmdline: List[str]) -> Optional[str]:
        """Extract VM ID from Firecracker command line"""
        for i, arg in enumerate(cmdline):
            if '--api-sock' in arg and i + 1 < len(cmdline):
                socket_path = cmdline[i + 1]
                return Path(socket_path).parent.name
        return None

    def _load_vm_config(self, config_file: Path) -> VMConfiguration:
        """Load VM configuration from file"""
        with open(config_file, 'r') as f:
            data = json.load(f)
        return VMConfiguration(**data)
    def create_vm(self, vm_spec: Dict) -> Tuple[str, bool]:
        """Create a new VM instance"""
        vm_id = vm_spec.get('vm_id', str(uuid.uuid4())[:8])

        # Validate resource limits
        if not self._validate_resources(vm_spec):
            return vm_id, False

        # Check capacity
        if len(self.vms) >= self.config['max_vms_per_node']:
            logger.error(f"Maximum VM capacity reached: {self.config['max_vms_per_node']}")
            return vm_id, False

        try:
            # Create VM configuration
            config = VMConfiguration(
                vm_id=vm_id,
                vcpus=vm_spec.get('vcpus', 1),
                memory_mb=vm_spec.get('memory_mb', 512),
                kernel_path=vm_spec.get('kernel_path', str(self.kernel_dir / 'vmlinux.bin')),
                rootfs_path=vm_spec.get('rootfs_path', str(self.image_dir / 'rootfs.ext4')),
                network_config=vm_spec.get('network_config', {}),
                storage_config=vm_spec.get('storage_config', []),
                metadata=vm_spec.get('metadata', {}),
                created_at=datetime.now(),
                ttl_hours=vm_spec.get('ttl_hours', self.config['default_vm_ttl_hours'])
            )

            # Create VM directory structure
            vm_dir = self.vm_dir / vm_id
            vm_dir.mkdir(exist_ok=True)

            # Save configuration
            config_file = vm_dir / 'config.json'
            with open(config_file, 'w') as f:
                json.dump(asdict(config), f, indent=2, default=str)

            # Prepare VM files
            if not self._prepare_vm_files(config, vm_dir):
                return vm_id, False

            # Start Firecracker process
            if not self._start_firecracker(config, vm_dir):
                return vm_id, False

            # Create VM instance
            instance = VMInstance(
                config=config,
                state=VMState.CREATING,
                pid=None,
                api_socket=str(vm_dir / 'api.sock'),
                metrics_path=str(vm_dir / 'metrics.json'),
                log_path=str(self.log_dir / f'{vm_id}.log'),
                last_health_check=datetime.now()
            )

            # Wait for Firecracker to start
            if self._wait_for_api(instance.api_socket, timeout=30):
                # Configure and start VM
                if self._configure_and_start_vm(instance):
                    instance.state = VMState.RUNNING
                    self.vms[vm_id] = instance
                    logger.info(f"Successfully created VM: {vm_id}")
                    return vm_id, True

            instance.state = VMState.ERROR
            logger.error(f"Failed to start VM: {vm_id}")
            return vm_id, False
        except Exception as e:
            logger.error(f"Error creating VM {vm_id}: {e}")
            return vm_id, False

    def _validate_resources(self, vm_spec: Dict) -> bool:
        """Validate VM resource requirements"""
        limits = self.config['resource_limits']
        vcpus = vm_spec.get('vcpus', 1)
        memory_mb = vm_spec.get('memory_mb', 512)

        if vcpus > limits['max_vcpus']:
            logger.error(f"vCPU count {vcpus} exceeds limit {limits['max_vcpus']}")
            return False
        if memory_mb > limits['max_memory_mb']:
            logger.error(f"Memory {memory_mb}MB exceeds limit {limits['max_memory_mb']}MB")
            return False

        # Check available system resources (keep 20% headroom)
        system_memory = psutil.virtual_memory()
        used_memory = sum(vm.config.memory_mb for vm in self.vms.values()
                          if vm.state == VMState.RUNNING)
        if used_memory + memory_mb > system_memory.available // (1024 * 1024) * 0.8:
            logger.error("Insufficient system memory for new VM")
            return False
        return True

    def _prepare_vm_files(self, config: VMConfiguration, vm_dir: Path) -> bool:
        """Prepare VM filesystem and kernel images"""
        try:
            # Copy/create rootfs if needed
            rootfs_source = Path(config.rootfs_path)
            rootfs_dest = vm_dir / 'rootfs.ext4'
            if not rootfs_dest.exists() and rootfs_source.exists():
                subprocess.run(['cp', str(rootfs_source), str(rootfs_dest)], check=True)
                logger.info(f"Copied rootfs for VM {config.vm_id}")

            # Kernel should already exist
            if not Path(config.kernel_path).exists():
                logger.error(f"Kernel not found: {config.kernel_path}")
                return False
            return True
        except Exception as e:
            logger.error(f"Error preparing VM files: {e}")
            return False

    def _start_firecracker(self, config: VMConfiguration, vm_dir: Path) -> bool:
        """Start Firecracker process"""
        try:
            api_socket = vm_dir / 'api.sock'
            log_file = self.log_dir / f'{config.vm_id}.log'
            cmd = [
                'firecracker',
                '--api-sock', str(api_socket),
                '--config-file', str(vm_dir / 'fc_config.json')
            ]

            # Create Firecracker configuration
            fc_config = self._generate_firecracker_config(config, vm_dir)
            with open(vm_dir / 'fc_config.json', 'w') as f:
                json.dump(fc_config, f, indent=2)

            # Start process
            with open(log_file, 'w') as log:
                process = subprocess.Popen(
                    cmd,
                    stdout=log,
                    stderr=subprocess.STDOUT,
                    cwd=str(vm_dir)
                )

            # Update VM with PID
            if config.vm_id in self.vms:
                self.vms[config.vm_id].pid = process.pid
            logger.info(f"Started Firecracker process for VM {config.vm_id} (PID: {process.pid})")
            return True
        except Exception as e:
            logger.error(f"Error starting Firecracker: {e}")
            return False
    def _generate_firecracker_config(self, config: VMConfiguration, vm_dir: Path) -> Dict:
        """Generate Firecracker configuration file"""
        rootfs_path = vm_dir / 'rootfs.ext4'
        fc_config = {
            "boot-source": {
                "kernel_image_path": config.kernel_path,
                "boot_args": "console=ttyS0 reboot=k panic=1 pci=off nomodules ro"
            },
            "drives": [
                {
                    "drive_id": "rootfs",
                    "path_on_host": str(rootfs_path),
                    "is_root_device": True,
                    "is_read_only": False
                }
            ],
            "machine-config": {
                "vcpu_count": config.vcpus,
                "mem_size_mib": config.memory_mb
            },
            "logger": {
                "level": "Info",
                "log_path": str(self.log_dir / f'{config.vm_id}-vmm.log')
            },
            "metrics": {
                "metrics_path": str(vm_dir / 'metrics.json')
            }
        }

        # Add network configuration if provided
        if config.network_config:
            fc_config["network-interfaces"] = [config.network_config]

        # Add additional storage if provided
        for i, storage in enumerate(config.storage_config):
            drive_config = {
                "drive_id": f"storage_{i}",
                "path_on_host": storage["path"],
                "is_root_device": False,
                "is_read_only": storage.get("read_only", False)
            }
            fc_config["drives"].append(drive_config)
        return fc_config

    def _wait_for_api(self, api_socket: str, timeout: int = 30) -> bool:
        """Wait for Firecracker API to become available"""
        start_time = time.time()
        while time.time() - start_time < timeout:
            if Path(api_socket).exists():
                try:
                    import requests_unixsocket
                    session = requests_unixsocket.Session()
                    base_url = f'http+unix://{api_socket.replace("/", "%2F")}'
                    response = session.get(f'{base_url}/', timeout=5)
                    if response.status_code == 200:
                        return True
                except Exception:
                    pass
            time.sleep(1)
        return False

    def _configure_and_start_vm(self, instance: VMInstance) -> bool:
        """Configure and start the VM via Firecracker API"""
        try:
            import requests_unixsocket
            session = requests_unixsocket.Session()
            base_url = f'http+unix://{instance.api_socket.replace("/", "%2F")}'

            # Start the VM
            response = session.put(
                f'{base_url}/actions',
                json={'action_type': 'InstanceStart'},
                timeout=10
            )
            if response.status_code == 204:
                logger.info(f"Successfully started VM {instance.config.vm_id}")
                return True
            else:
                logger.error(f"Failed to start VM {instance.config.vm_id}: {response.status_code}")
                return False
        except Exception as e:
            logger.error(f"Error configuring VM {instance.config.vm_id}: {e}")
            return False

    def stop_vm(self, vm_id: str, force: bool = False) -> bool:
        """Stop a running VM"""
        if vm_id not in self.vms:
            logger.error(f"VM not found: {vm_id}")
            return False
        instance = self.vms[vm_id]
        try:
            if not force:
                # Try graceful shutdown first
                if self._graceful_shutdown(instance):
                    instance.state = VMState.STOPPED
                    logger.info(f"Gracefully stopped VM: {vm_id}")
                    return True

            # Force stop
            if instance.pid and psutil.pid_exists(instance.pid):
                proc = psutil.Process(instance.pid)
                proc.terminate()
                # Wait for process to exit
                try:
                    proc.wait(timeout=10)
                except psutil.TimeoutExpired:
                    proc.kill()
                    proc.wait(timeout=5)
                instance.state = VMState.STOPPED
                logger.info(f"Force stopped VM: {vm_id}")
                return True
            return False
        except Exception as e:
            logger.error(f"Error stopping VM {vm_id}: {e}")
            return False
    def _graceful_shutdown(self, instance: VMInstance) -> bool:
        """Attempt graceful VM shutdown"""
        try:
            import requests_unixsocket
            session = requests_unixsocket.Session()
            base_url = f'http+unix://{instance.api_socket.replace("/", "%2F")}'

            # Send shutdown action
            response = session.put(
                f'{base_url}/actions',
                json={'action_type': 'SendCtrlAltDel'},
                timeout=5
            )
            if response.status_code == 204:
                # Wait for shutdown
                time.sleep(5)
                return not psutil.pid_exists(instance.pid)
            return False
        except Exception:
            return False

    def delete_vm(self, vm_id: str) -> bool:
        """Delete a VM and clean up its resources"""
        if vm_id not in self.vms:
            logger.error(f"VM not found: {vm_id}")
            return False
        instance = self.vms[vm_id]

        # Stop VM first
        if instance.state == VMState.RUNNING:
            if not self.stop_vm(vm_id, force=True):
                logger.error(f"Failed to stop VM before deletion: {vm_id}")
                return False
        try:
            # Clean up VM directory
            vm_dir = self.vm_dir / vm_id
            if vm_dir.exists():
                subprocess.run(['rm', '-rf', str(vm_dir)], check=True)

            # Clean up logs
            log_file = Path(instance.log_path)
            if log_file.exists():
                log_file.unlink()

            # Remove from tracking
            del self.vms[vm_id]
            logger.info(f"Deleted VM: {vm_id}")
            return True
        except Exception as e:
            logger.error(f"Error deleting VM {vm_id}: {e}")
            return False

    def get_vm_status(self, vm_id: str) -> Optional[Dict]:
        """Get VM status and metrics"""
        if vm_id not in self.vms:
            return None
        instance = self.vms[vm_id]
        status = {
            'vm_id': vm_id,
            'state': instance.state.value,
            'config': asdict(instance.config),
            'pid': instance.pid,
            'uptime_seconds': (datetime.now() - instance.config.created_at).total_seconds(),
            'last_health_check': instance.last_health_check.isoformat()
        }

        # Add resource usage if available
        if instance.resource_usage:
            status['resource_usage'] = instance.resource_usage

        # Add Firecracker metrics if available
        metrics_file = Path(instance.metrics_path)
        if metrics_file.exists():
            try:
                with open(metrics_file, 'r') as f:
                    status['firecracker_metrics'] = json.load(f)
            except Exception:
                pass
        return status

    def list_vms(self) -> List[Dict]:
        """List all VMs"""
        return [self.get_vm_status(vm_id) for vm_id in self.vms.keys()]

    def health_check(self):
        """Perform health checks on all VMs"""
        logger.debug("Performing health checks...")
        for vm_id, instance in list(self.vms.items()):
            try:
                if instance.state == VMState.RUNNING:
                    # Check if process is still running
                    if instance.pid and not psutil.pid_exists(instance.pid):
                        logger.warning(f"VM process died: {vm_id}")
                        instance.state = VMState.ERROR
                        continue
                    # Update resource usage
                    if instance.pid:
                        proc = psutil.Process(instance.pid)
                        instance.resource_usage = {
                            'cpu_percent': proc.cpu_percent(),
                            'memory_info': proc.memory_info()._asdict(),
                            'io_counters': proc.io_counters()._asdict() if hasattr(proc, 'io_counters') else {}
}
instance.last_health_check = datetime.now()
except Exception as e:
logger.error(f"Health check failed for VM {vm_id}: {e}")
instance.state = VMState.ERROR
def cleanup_expired_vms(self):
"""Clean up expired VMs based on TTL"""
if not self.config['auto_cleanup_enabled']:
return
logger.debug("Checking for expired VMs...")
current_time = datetime.now()
expired_vms = []
for vm_id, instance in self.vms.items():
if instance.config.ttl_hours:
expiry_time = instance.config.created_at + timedelta(hours=instance.config.ttl_hours)
if current_time > expiry_time:
expired_vms.append(vm_id)
for vm_id in expired_vms:
logger.info(f"Cleaning up expired VM: {vm_id}")
self.delete_vm(vm_id)
def start_background_tasks(self):
"""Start background maintenance tasks"""
self.running = True
def health_check_loop():
while self.running:
try:
self.health_check()
time.sleep(self.health_check_interval)
except Exception as e:
logger.error(f"Error in health check loop: {e}")
time.sleep(self.health_check_interval)
def cleanup_loop():
while self.running:
try:
self.cleanup_expired_vms()
time.sleep(self.cleanup_interval)
except Exception as e:
logger.error(f"Error in cleanup loop: {e}")
time.sleep(self.cleanup_interval)
# Start background threads
self.health_check_thread = threading.Thread(target=health_check_loop, daemon=True)
self.cleanup_thread = threading.Thread(target=cleanup_loop, daemon=True)
self.health_check_thread.start()
self.cleanup_thread.start()
logger.info("Background tasks started")
def stop_background_tasks(self):
"""Stop background maintenance tasks"""
self.running = False
logger.info("Background tasks stopped")
# REST API server for VM management
from flask import Flask, request, jsonify
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
app = Flask(__name__)
limiter = Limiter(
key_func=get_remote_address,
app=app,
default_limits=["100 per hour"]
)
vm_manager = FirecrackerVMManager()
@app.route('/health', methods=['GET'])
def health_check():
return jsonify({'status': 'healthy', 'timestamp': datetime.now().isoformat()})
@app.route('/vms', methods=['GET'])
@limiter.limit("10 per minute")
def list_vms():
return jsonify({'vms': vm_manager.list_vms()})
@app.route('/vms', methods=['POST'])
@limiter.limit("5 per minute")
def create_vm():
vm_spec = request.get_json()
if not vm_spec:
return jsonify({'error': 'Invalid JSON'}), 400
vm_id, success = vm_manager.create_vm(vm_spec)
if success:
return jsonify({'vm_id': vm_id, 'status': 'created'}), 201
else:
return jsonify({'vm_id': vm_id, 'status': 'failed'}), 500
@app.route('/vms/<vm_id>', methods=['GET'])
def get_vm_status(vm_id):
status = vm_manager.get_vm_status(vm_id)
if status:
return jsonify(status)
else:
return jsonify({'error': 'VM not found'}), 404
@app.route('/vms/<vm_id>', methods=['DELETE'])
@limiter.limit("5 per minute")
def delete_vm(vm_id):
if vm_manager.delete_vm(vm_id):
return jsonify({'status': 'deleted'})
else:
return jsonify({'error': 'Failed to delete VM'}), 500
@app.route('/vms/<vm_id>/stop', methods=['POST'])
@limiter.limit("5 per minute")
def stop_vm(vm_id):
force = (request.get_json(silent=True) or {}).get('force', False)
if vm_manager.stop_vm(vm_id, force=force):
return jsonify({'status': 'stopped'})
else:
return jsonify({'error': 'Failed to stop VM'}), 500
if __name__ == '__main__':
# Start background tasks
vm_manager.start_background_tasks()
try:
# Start API server
app.run(host='0.0.0.0', port=8080, debug=False)
except KeyboardInterrupt:
logger.info("Shutting down VM manager...")
vm_manager.stop_background_tasks()
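Every call the manager makes to a microVM travels over that VM's Unix domain socket; `requests_unixsocket` expects the socket path percent-encoded into the host portion of an `http+unix://` URL, which is what the inline `replace("/", "%2F")` does. A small standalone helper (illustrative, not part of the manager above) makes the encoding explicit:

```python
def unix_socket_url(socket_path: str) -> str:
    """Encode a Unix socket path for use with requests_unixsocket.

    The path's slashes are percent-encoded so the entire path fits into
    the URL's host component, e.g.
    /run/fc/api.sock -> http+unix://%2Frun%2Ffc%2Fapi.sock
    """
    return "http+unix://" + socket_path.replace("/", "%2F")


print(unix_socket_url("/var/lib/firecracker/vms/vm-1/api.sock"))
# http+unix://%2Fvar%2Flib%2Ffirecracker%2Fvms%2Fvm-1%2Fapi.sock
```

Keeping the encoding in one helper avoids subtle mismatches when several methods build the base URL independently.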

VM Templates and Images#

#!/bin/bash
# VM image and template management
echo "=== VM Image Management ==="
# Base paths
IMAGE_DIR="/var/lib/firecracker/images"
KERNEL_DIR="/var/lib/firecracker/kernels"
TEMPLATE_DIR="/var/lib/firecracker/templates"
# Create directory structure
sudo mkdir -p "$IMAGE_DIR" "$KERNEL_DIR" "$TEMPLATE_DIR"
# Build optimized kernel
build_optimized_kernel() {
local kernel_version="6.1"  # kernel.org names the tarball linux-6.1.tar.xz (no ".0")
local build_dir="/tmp/kernel-build"
echo "Building optimized kernel v$kernel_version..."
# Download kernel source
mkdir -p "$build_dir"
cd "$build_dir"
wget "https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-${kernel_version}.tar.xz"
tar -xf "linux-${kernel_version}.tar.xz"
cd "linux-${kernel_version}"
# Apply Firecracker-optimized config
cat > .config << 'EOF'
# Firecracker optimized kernel configuration
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_SMP=y
CONFIG_HYPERVISOR_GUEST=y
CONFIG_PARAVIRT=y
CONFIG_KVM_GUEST=y
CONFIG_VIRTIO=y
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_BLK=y
CONFIG_VIRTIO_NET=y
CONFIG_VIRTIO_CONSOLE=y
CONFIG_VIRTIO_VSOCKETS=y
CONFIG_EXT4_FS=y
CONFIG_PROC_FS=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_DEVTMPFS=y
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_NET=y
CONFIG_INET=y
CONFIG_TCP_CONG_BBR=y
CONFIG_PREEMPT_NONE=y
CONFIG_NO_HZ_IDLE=y
CONFIG_HIGH_RES_TIMERS=y
# Disable unnecessary features (Kconfig's canonical "not set" form)
# CONFIG_MODULES is not set
# CONFIG_DEBUG_KERNEL is not set
# CONFIG_SUSPEND is not set
# CONFIG_HIBERNATION is not set
# CONFIG_ACPI is not set
# CONFIG_PCI is not set
# CONFIG_USB is not set
# CONFIG_SOUND is not set
# CONFIG_DRM is not set
EOF
# Resolve the remaining options non-interactively, then build
make olddefconfig
make -j$(nproc) vmlinux
# Copy to kernel directory
sudo cp vmlinux "$KERNEL_DIR/vmlinux-optimized.bin"
echo "✓ Optimized kernel built and installed"
# Cleanup
cd /
rm -rf "$build_dir"
}
# Create base Ubuntu rootfs
create_ubuntu_rootfs() {
local image_name="ubuntu-22.04-base.ext4"
local image_path="$IMAGE_DIR/$image_name"
local mount_point="/tmp/rootfs-build"
local image_size="2G"
echo "Creating Ubuntu 22.04 base image..."
# Create a sparse ext4 image (-F is required because the target is a regular file)
sudo truncate -s "$image_size" "$image_path"
sudo mkfs.ext4 -F "$image_path"
# Mount image
sudo mkdir -p "$mount_point"
sudo mount -o loop "$image_path" "$mount_point"
# Install Ubuntu base system
sudo debootstrap --arch=amd64 --variant=minbase jammy "$mount_point" http://archive.ubuntu.com/ubuntu/
# Chroot and configure system
sudo chroot "$mount_point" bash << 'CHROOT_SCRIPT'
# Non-interactive package installation inside the chroot
export DEBIAN_FRONTEND=noninteractive
# Update package list
apt-get update
# Install essential packages
apt-get install -y \
systemd \
systemd-sysv \
dbus \
openssh-server \
cloud-init \
curl \
wget \
vim \
htop \
net-tools \
iproute2 \
iptables \
ca-certificates
# Configure SSH
systemctl enable ssh
mkdir -p /root/.ssh
chmod 700 /root/.ssh
# Configure cloud-init
cat > /etc/cloud/cloud.cfg << 'EOF'
cloud_init_modules:
- bootcmd
- write-files
- resizefs
- set_hostname
- update_hostname
- update_etc_hosts
- ca-certs
- rsyslog
- users-groups
- ssh
cloud_config_modules:
- ssh-import-id
- locale
- set-passwords
- package-update-upgrade-install
- timezone
- puppet
- chef
- salt-minion
- mcollective
- disable-ec2-metadata
- runcmd
- byobu
cloud_final_modules:
- rightscale_userdata
- scripts-vendor
- scripts-per-once
- scripts-per-boot
- scripts-per-instance
- scripts-user
- ssh-authkey-fingerprints
- keys-to-console
- phone-home
- final-message
- power-state-change
system_info:
default_user:
name: ubuntu
lock_passwd: True
gecos: Ubuntu
groups: [adm, audio, cdrom, dialout, dip, floppy, lxd, netdev, plugdev, sudo, video]
sudo: ["ALL=(ALL) NOPASSWD:ALL"]
shell: /bin/bash
datasource_list: [ NoCloud, None ]
EOF
# Configure networking
cat > /etc/systemd/network/10-virtio.network << 'EOF'
[Match]
Name=eth0
[Network]
DHCP=yes
EOF
systemctl enable systemd-networkd
systemctl enable systemd-resolved
# Enable a login console on the serial port Firecracker exposes
systemctl enable serial-getty@ttyS0.service
# Create fstab (Firecracker attaches the rootfs as the whole device /dev/vda, not a partition)
cat > /etc/fstab << 'EOF'
/dev/vda / ext4 defaults 0 1
EOF
# Clean up
apt-get clean
rm -rf /var/lib/apt/lists/*
rm -rf /tmp/*
rm -rf /var/tmp/*
CHROOT_SCRIPT
# Unmount
sudo umount "$mount_point"
sudo rmdir "$mount_point"
echo "✓ Ubuntu base image created: $image_path"
}
# Create specialized images
create_web_server_image() {
local base_image="$IMAGE_DIR/ubuntu-22.04-base.ext4"
local web_image="$IMAGE_DIR/ubuntu-22.04-webserver.ext4"
local mount_point="/tmp/web-rootfs"
echo "Creating web server image..."
# Copy base image
sudo cp "$base_image" "$web_image"
# Mount and customize
sudo mkdir -p "$mount_point"
sudo mount -o loop "$web_image" "$mount_point"
sudo chroot "$mount_point" bash << 'CHROOT_SCRIPT'
# Update packages
apt-get update
# Install web server stack
apt-get install -y \
nginx \
php8.1-fpm \
php8.1-mysql \
php8.1-curl \
php8.1-json \
php8.1-zip \
mysql-client \
redis-tools \
supervisor
# Configure nginx
cat > /etc/nginx/nginx.conf << 'EOF'
user www-data;
worker_processes auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;
events {
worker_connections 1024;
use epoll;
multi_accept on;
}
http {
sendfile on;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 65;
types_hash_max_size 2048;
include /etc/nginx/mime.types;
default_type application/octet-stream;
gzip on;
gzip_vary on;
gzip_proxied any;
gzip_comp_level 6;
include /etc/nginx/conf.d/*.conf;
include /etc/nginx/sites-enabled/*;
}
EOF
# Enable services
systemctl enable nginx
systemctl enable php8.1-fpm
systemctl enable supervisor
# Clean up
apt-get clean
rm -rf /var/lib/apt/lists/*
CHROOT_SCRIPT
sudo umount "$mount_point"
sudo rmdir "$mount_point"
echo "✓ Web server image created: $web_image"
}
create_database_image() {
local base_image="$IMAGE_DIR/ubuntu-22.04-base.ext4"
local db_image="$IMAGE_DIR/ubuntu-22.04-database.ext4"
local mount_point="/tmp/db-rootfs"
echo "Creating database image..."
# Copy base image
sudo cp "$base_image" "$db_image"
# Mount and customize
sudo mkdir -p "$mount_point"
sudo mount -o loop "$db_image" "$mount_point"
sudo chroot "$mount_point" bash << 'CHROOT_SCRIPT'
# Update packages
apt-get update
# Install PostgreSQL
apt-get install -y \
postgresql-14 \
postgresql-client-14 \
postgresql-contrib-14 \
redis-server \
htop \
iotop \
sysstat
# PostgreSQL is not running inside the chroot, so defer database
# creation to first boot (for example via a cloud-init runcmd)
# Configure Redis
systemctl enable redis-server
# Enable services
systemctl enable postgresql
# Clean up
apt-get clean
rm -rf /var/lib/apt/lists/*
CHROOT_SCRIPT
sudo umount "$mount_point"
sudo rmdir "$mount_point"
echo "✓ Database image created: $db_image"
}
# Create VM templates
create_vm_templates() {
echo "Creating VM templates..."
# Web server template
cat > "$TEMPLATE_DIR/webserver.json" << 'EOF'
{
"name": "Ubuntu Web Server",
"description": "Ubuntu 22.04 with Nginx, PHP, and common web server tools",
"vcpus": 2,
"memory_mb": 1024,
"kernel_path": "/var/lib/firecracker/kernels/vmlinux-optimized.bin",
"rootfs_path": "/var/lib/firecracker/images/ubuntu-22.04-webserver.ext4",
"network_config": {
"iface_id": "eth0",
"guest_mac": "AA:FC:00:00:00:01",
"host_dev_name": "tap-{{vm_id}}"
},
"metadata": {
"category": "web",
"os": "ubuntu",
"version": "22.04"
},
"ttl_hours": 24
}
EOF
# Database template
cat > "$TEMPLATE_DIR/database.json" << 'EOF'
{
"name": "Ubuntu Database Server",
"description": "Ubuntu 22.04 with PostgreSQL and Redis",
"vcpus": 2,
"memory_mb": 2048,
"kernel_path": "/var/lib/firecracker/kernels/vmlinux-optimized.bin",
"rootfs_path": "/var/lib/firecracker/images/ubuntu-22.04-database.ext4",
"network_config": {
"iface_id": "eth0",
"guest_mac": "AA:FC:00:00:00:01",
"host_dev_name": "tap-{{vm_id}}"
},
"storage_config": [
{
"path": "/var/lib/firecracker/storage/{{vm_id}}-data.ext4",
"size_gb": 20,
"read_only": false
}
],
"metadata": {
"category": "database",
"os": "ubuntu",
"version": "22.04"
},
"ttl_hours": 48
}
EOF
# Microservice template
cat > "$TEMPLATE_DIR/microservice.json" << 'EOF'
{
"name": "Ubuntu Microservice",
"description": "Minimal Ubuntu 22.04 for microservice workloads",
"vcpus": 1,
"memory_mb": 512,
"kernel_path": "/var/lib/firecracker/kernels/vmlinux-optimized.bin",
"rootfs_path": "/var/lib/firecracker/images/ubuntu-22.04-base.ext4",
"network_config": {
"iface_id": "eth0",
"guest_mac": "AA:FC:00:00:00:01",
"host_dev_name": "tap-{{vm_id}}"
},
"metadata": {
"category": "microservice",
"os": "ubuntu",
"version": "22.04"
},
"ttl_hours": 12
}
EOF
echo "✓ VM templates created in $TEMPLATE_DIR"
}
# Main execution
echo "Building Firecracker VM images and templates..."
# Build optimized kernel
build_optimized_kernel
# Create base image
create_ubuntu_rootfs
# Create specialized images
create_web_server_image
create_database_image
# Create templates
create_vm_templates
echo "VM image management setup complete!"
echo "Available images:"
ls -la "$IMAGE_DIR"
echo ""
echo "Available templates:"
ls -la "$TEMPLATE_DIR"
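The templates above leave `{{vm_id}}` placeholders (and a fixed example MAC, `AA:FC:00:00:00:01`) to be filled in at launch time. The renderer below is a minimal sketch of that instantiation step; the function name and the MAC-derivation scheme are illustrative assumptions, not part of the scripts above, and a real scheduler would also have to guarantee MAC uniqueness across hosts:

```python
import hashlib
import json


def render_template(template_text: str, vm_id: str) -> dict:
    """Fill {{vm_id}} placeholders and derive a per-VM guest MAC."""
    spec = json.loads(template_text.replace("{{vm_id}}", vm_id))
    # Derive a stable, locally-administered unicast MAC from the VM ID
    # (the AA:FC prefix matches the templates; low octets come from a hash)
    digest = hashlib.sha256(vm_id.encode()).hexdigest()
    mac = "AA:FC:" + ":".join(digest[i:i + 2] for i in range(0, 8, 2)).upper()
    spec["network_config"]["guest_mac"] = mac
    return spec


template = ('{"network_config": {"iface_id": "eth0", '
            '"guest_mac": "AA:FC:00:00:00:01", '
            '"host_dev_name": "tap-{{vm_id}}"}}')
spec = render_template(template, "vm-42")
print(spec["network_config"]["host_dev_name"])  # tap-vm-42
```

Deriving the MAC from the VM ID keeps it stable across restarts while avoiding the template's shared placeholder address being reused by every VM on a bridge.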

Monitoring and Observability#

Comprehensive Monitoring Stack#

monitoring/docker-compose.yml
version: '3.8'
services:
prometheus:
image: prom/prometheus:latest
container_name: firecracker-prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
ports:
- "9090:9090"
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
- ./prometheus/rules:/etc/prometheus/rules
- prometheus_data:/prometheus
restart: unless-stopped
networks:
- monitoring
grafana:
image: grafana/grafana:latest
container_name: firecracker-grafana
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin123
- GF_USERS_ALLOW_SIGN_UP=false
- GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-simple-json-datasource
ports:
- "3000:3000"
volumes:
- ./grafana/provisioning:/etc/grafana/provisioning
- ./grafana/dashboards:/var/lib/grafana/dashboards
- grafana_data:/var/lib/grafana
restart: unless-stopped
networks:
- monitoring
depends_on:
- prometheus
alertmanager:
image: prom/alertmanager:latest
container_name: firecracker-alertmanager
command:
- '--config.file=/etc/alertmanager/config.yml'
- '--storage.path=/alertmanager'
ports:
- "9093:9093"
volumes:
- ./alertmanager/config.yml:/etc/alertmanager/config.yml
- alertmanager_data:/alertmanager
restart: unless-stopped
networks:
- monitoring
node-exporter:
image: prom/node-exporter:latest
container_name: firecracker-node-exporter
command:
- '--path.rootfs=/host'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
ports:
- "9100:9100"
volumes:
- '/:/host:ro,rslave'
restart: unless-stopped
networks:
- monitoring
firecracker-exporter:
build: ./firecracker-exporter
container_name: firecracker-exporter
command:
- '--config.file=/etc/firecracker-exporter/config.yml'
- '--web.listen-address=0.0.0.0:9200'
ports:
- "9200:9200"
volumes:
- ./firecracker-exporter/config.yml:/etc/firecracker-exporter/config.yml
- /var/lib/firecracker:/var/lib/firecracker:ro
- /var/run:/var/run:ro
restart: unless-stopped
networks:
- monitoring
# Host PID namespace so the exporter can discover Firecracker processes
pid: host
privileged: true
loki:
image: grafana/loki:latest
container_name: firecracker-loki
command: -config.file=/etc/loki/local-config.yaml
ports:
- "3100:3100"
volumes:
- ./loki/config.yml:/etc/loki/local-config.yaml
- loki_data:/loki
restart: unless-stopped
networks:
- monitoring
promtail:
image: grafana/promtail:latest
container_name: firecracker-promtail
command: -config.file=/etc/promtail/config.yml
volumes:
- ./promtail/config.yml:/etc/promtail/config.yml
- /var/log:/var/log:ro
- /var/lib/firecracker:/var/lib/firecracker:ro
restart: unless-stopped
networks:
- monitoring
depends_on:
- loki
volumes:
prometheus_data:
grafana_data:
alertmanager_data:
loki_data:
networks:
monitoring:
driver: bridge
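The compose file mounts `./prometheus/prometheus.yml`, which has to point Prometheus at the exporters defined above. A minimal scrape configuration might look like the following sketch (job names and service-name targets are assumptions based on the compose file; adjust to your network):

```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'firecracker'
    static_configs:
      - targets: ['firecracker-exporter:9200']
```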

Custom Firecracker Exporter#

#!/usr/bin/env python3
"""
Firecracker Prometheus Exporter
Collects metrics from Firecracker VMs and exposes them for Prometheus
"""
import json
import time
import logging
import argparse
from pathlib import Path
from typing import Dict, List, Optional
from dataclasses import dataclass
import psutil
from prometheus_client import start_http_server, Gauge, Counter, Histogram, Info
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('firecracker-exporter')
class FirecrackerExporter:
"""Prometheus exporter for Firecracker metrics"""
def __init__(self, config_path: str = '/etc/firecracker-exporter/config.yml'):
self.config = self._load_config(config_path)
self.setup_metrics()
self.vm_processes = {}
def _load_config(self, config_path: str) -> Dict:
"""Load exporter configuration"""
default_config = {
'firecracker_base_dir': '/var/lib/firecracker',
'collection_interval': 15,
'metrics_port': 9200,
'log_level': 'INFO'
}
try:
import yaml
with open(config_path, 'r') as f:
user_config = yaml.safe_load(f)
default_config.update(user_config)
except (FileNotFoundError, ImportError):
logger.warning(f"Config file {config_path} not found, using defaults")
return default_config
def setup_metrics(self):
"""Setup Prometheus metrics"""
# System metrics
self.system_cpu_usage = Gauge('firecracker_host_cpu_usage_percent', 'Host CPU usage percentage')
self.system_memory_usage = Gauge('firecracker_host_memory_usage_percent', 'Host memory usage percentage')
self.system_load_avg = Gauge('firecracker_host_load_average', 'Host load average', ['period'])
# VM count metrics
self.total_vms = Gauge('firecracker_vms_total', 'Total number of Firecracker VMs')
self.vms_by_state = Gauge('firecracker_vms_by_state', 'Number of VMs by state', ['state'])
# VM resource metrics
self.vm_cpu_usage = Gauge('firecracker_vm_cpu_usage_percent', 'VM CPU usage percentage', ['vm_id', 'vm_name'])
self.vm_memory_usage = Gauge('firecracker_vm_memory_usage_bytes', 'VM memory usage in bytes', ['vm_id', 'vm_name', 'type'])
self.vm_uptime = Gauge('firecracker_vm_uptime_seconds', 'VM uptime in seconds', ['vm_id', 'vm_name'])
# VM I/O metrics
self.vm_io_read_bytes = Counter('firecracker_vm_io_read_bytes_total', 'VM I/O read bytes', ['vm_id', 'vm_name'])
self.vm_io_write_bytes = Counter('firecracker_vm_io_write_bytes_total', 'VM I/O write bytes', ['vm_id', 'vm_name'])
self.vm_io_read_ops = Counter('firecracker_vm_io_read_ops_total', 'VM I/O read operations', ['vm_id', 'vm_name'])
self.vm_io_write_ops = Counter('firecracker_vm_io_write_ops_total', 'VM I/O write operations', ['vm_id', 'vm_name'])
# VM network metrics (from Firecracker API)
self.vm_network_rx_bytes = Counter('firecracker_vm_network_rx_bytes_total', 'VM network RX bytes', ['vm_id', 'vm_name', 'interface'])
self.vm_network_tx_bytes = Counter('firecracker_vm_network_tx_bytes_total', 'VM network TX bytes', ['vm_id', 'vm_name', 'interface'])
self.vm_network_rx_packets = Counter('firecracker_vm_network_rx_packets_total', 'VM network RX packets', ['vm_id', 'vm_name', 'interface'])
self.vm_network_tx_packets = Counter('firecracker_vm_network_tx_packets_total', 'VM network TX packets', ['vm_id', 'vm_name', 'interface'])
# VM block device metrics
self.vm_block_read_bytes = Counter('firecracker_vm_block_read_bytes_total', 'VM block device read bytes', ['vm_id', 'vm_name', 'device'])
self.vm_block_write_bytes = Counter('firecracker_vm_block_write_bytes_total', 'VM block device write bytes', ['vm_id', 'vm_name', 'device'])
self.vm_block_read_ops = Counter('firecracker_vm_block_read_ops_total', 'VM block device read operations', ['vm_id', 'vm_name', 'device'])
self.vm_block_write_ops = Counter('firecracker_vm_block_write_ops_total', 'VM block device write operations', ['vm_id', 'vm_name', 'device'])
# vCPU metrics
self.vm_vcpu_exits = Counter('firecracker_vm_vcpu_exits_total', 'VM vCPU exits', ['vm_id', 'vm_name', 'vcpu', 'exit_type'])
# Exporter metrics
self.collection_duration = Histogram('firecracker_exporter_collection_duration_seconds', 'Time spent collecting metrics')
self.collection_errors = Counter('firecracker_exporter_collection_errors_total', 'Number of collection errors', ['type'])
# VM info
self.vm_info = Info('firecracker_vm_info', 'VM information', ['vm_id', 'vm_name'])
def discover_vms(self) -> Dict[str, Dict]:
"""Discover running Firecracker VMs"""
vms = {}
base_dir = Path(self.config['firecracker_base_dir'])
vm_dir = base_dir / 'vms'
if not vm_dir.exists():
return vms
# Find VM processes
firecracker_procs = {}
for proc in psutil.process_iter(['pid', 'name', 'cmdline', 'create_time']):
try:
if proc.info['name'] == 'firecracker':
# Extract VM ID from command line
vm_id = self._extract_vm_id(proc.info['cmdline'])
if vm_id:
firecracker_procs[vm_id] = {
'process': proc,
'pid': proc.info['pid'],
'start_time': proc.info['create_time']
}
except (psutil.NoSuchProcess, psutil.AccessDenied):
continue
# Match with VM directories
for vm_path in vm_dir.iterdir():
if vm_path.is_dir():
vm_id = vm_path.name
config_file = vm_path / 'config.json'
vm_info = {
'vm_id': vm_id,
'vm_name': vm_id, # Default name
'config_file': config_file,
'api_socket': vm_path / 'api.sock',
'metrics_file': vm_path / 'metrics.json',
'state': 'unknown',
'process': None
}
# Load VM configuration
if config_file.exists():
try:
with open(config_file, 'r') as f:
config_data = json.load(f)
vm_info['vm_name'] = config_data.get('metadata', {}).get('name', vm_id)
vm_info['config'] = config_data
except Exception as e:
logger.warning(f"Failed to load config for VM {vm_id}: {e}")
# Match with running process
if vm_id in firecracker_procs:
vm_info['process'] = firecracker_procs[vm_id]['process']
vm_info['pid'] = firecracker_procs[vm_id]['pid']
vm_info['start_time'] = firecracker_procs[vm_id]['start_time']
vm_info['state'] = 'running'
else:
vm_info['state'] = 'stopped'
vms[vm_id] = vm_info
return vms
def _extract_vm_id(self, cmdline: List[str]) -> Optional[str]:
"""Extract VM ID from Firecracker command line"""
for i, arg in enumerate(cmdline):
if arg == '--api-sock' and i + 1 < len(cmdline):
# The socket lives at .../vms/<vm_id>/api.sock, so the parent directory name is the VM ID
return Path(cmdline[i + 1]).parent.name
if arg.startswith('--api-sock='):
return Path(arg.split('=', 1)[1]).parent.name
return None
def collect_system_metrics(self):
"""Collect host system metrics"""
try:
# CPU usage
cpu_percent = psutil.cpu_percent(interval=1)
self.system_cpu_usage.set(cpu_percent)
# Memory usage
memory = psutil.virtual_memory()
self.system_memory_usage.set(memory.percent)
# Load averages
if hasattr(psutil, 'getloadavg'):
load_avg = psutil.getloadavg()
self.system_load_avg.labels(period='1m').set(load_avg[0])
self.system_load_avg.labels(period='5m').set(load_avg[1])
self.system_load_avg.labels(period='15m').set(load_avg[2])
except Exception as e:
logger.error(f"Error collecting system metrics: {e}")
self.collection_errors.labels(type='system').inc()
def collect_vm_metrics(self, vms: Dict[str, Dict]):
"""Collect VM-specific metrics"""
# Update VM counts
self.total_vms.set(len(vms))
# Count VMs by state
state_counts = {}
for vm_info in vms.values():
state = vm_info['state']
state_counts[state] = state_counts.get(state, 0) + 1
for state, count in state_counts.items():
self.vms_by_state.labels(state=state).set(count)
# Collect metrics for each VM
for vm_id, vm_info in vms.items():
vm_name = vm_info['vm_name']
try:
# VM info (the info values must not repeat the metric's own label names,
# or prometheus_client raises an overlapping-labels error)
if 'config' in vm_info:
config = vm_info['config']
self.vm_info.labels(vm_id=vm_id, vm_name=vm_name).info({
'vcpus': str(config.get('vcpus', 'unknown')),
'memory_mb': str(config.get('memory_mb', 'unknown')),
'kernel_path': config.get('kernel_path', 'unknown'),
'created_at': str(config.get('created_at', 'unknown'))
})
if vm_info['state'] == 'running' and vm_info['process']:
self._collect_process_metrics(vm_id, vm_name, vm_info['process'])
# VM uptime
if 'start_time' in vm_info:
uptime = time.time() - vm_info['start_time']
self.vm_uptime.labels(vm_id=vm_id, vm_name=vm_name).set(uptime)
# Collect Firecracker API metrics
self._collect_firecracker_api_metrics(vm_id, vm_name, vm_info)
except Exception as e:
logger.error(f"Error collecting metrics for VM {vm_id}: {e}")
self.collection_errors.labels(type='vm').inc()
def _collect_process_metrics(self, vm_id: str, vm_name: str, process):
"""Collect process-level metrics for a VM"""
try:
# CPU usage
cpu_percent = process.cpu_percent()
self.vm_cpu_usage.labels(vm_id=vm_id, vm_name=vm_name).set(cpu_percent)
# Memory usage
memory_info = process.memory_info()
self.vm_memory_usage.labels(vm_id=vm_id, vm_name=vm_name, type='rss').set(memory_info.rss)
self.vm_memory_usage.labels(vm_id=vm_id, vm_name=vm_name, type='vms').set(memory_info.vms)
if hasattr(memory_info, 'shared'):
self.vm_memory_usage.labels(vm_id=vm_id, vm_name=vm_name, type='shared').set(memory_info.shared)
# I/O counters are cumulative kernel values, so set the counters
# through the client's internal setter rather than inc()
if hasattr(process, 'io_counters'):
io_counters = process.io_counters()
self.vm_io_read_bytes.labels(vm_id=vm_id, vm_name=vm_name)._value.set(io_counters.read_bytes)
self.vm_io_write_bytes.labels(vm_id=vm_id, vm_name=vm_name)._value.set(io_counters.write_bytes)
self.vm_io_read_ops.labels(vm_id=vm_id, vm_name=vm_name)._value.set(io_counters.read_count)
self.vm_io_write_ops.labels(vm_id=vm_id, vm_name=vm_name)._value.set(io_counters.write_count)
except (psutil.NoSuchProcess, psutil.AccessDenied) as e:
logger.warning(f"Process access error for VM {vm_id}: {e}")
except Exception as e:
logger.error(f"Error collecting process metrics for VM {vm_id}: {e}")
def _collect_firecracker_api_metrics(self, vm_id: str, vm_name: str, vm_info: Dict):
"""Collect metrics from Firecracker API"""
try:
metrics_file = vm_info['metrics_file']
if Path(metrics_file).exists():
with open(metrics_file, 'r') as f:
api_metrics = json.load(f)
# Network metrics
if 'net' in api_metrics:
net_metrics = api_metrics['net']
interface = 'eth0' # Default interface
# Firecracker reports cumulative values, so set the counters through the client's internal setter
if 'rx_queue_event_count' in net_metrics:
self.vm_network_rx_packets.labels(vm_id=vm_id, vm_name=vm_name, interface=interface)._value.set(net_metrics['rx_queue_event_count'])
if 'tx_queue_event_count' in net_metrics:
self.vm_network_tx_packets.labels(vm_id=vm_id, vm_name=vm_name, interface=interface)._value.set(net_metrics['tx_queue_event_count'])
# Block device metrics
if 'block' in api_metrics:
block_metrics = api_metrics['block']
device = 'rootfs' # Default device
if 'read_count' in block_metrics:
self.vm_block_read_ops.labels(vm_id=vm_id, vm_name=vm_name, device=device)._value.set(block_metrics['read_count'])
if 'write_count' in block_metrics:
self.vm_block_write_ops.labels(vm_id=vm_id, vm_name=vm_name, device=device)._value.set(block_metrics['write_count'])
if 'read_bytes' in block_metrics:
self.vm_block_read_bytes.labels(vm_id=vm_id, vm_name=vm_name, device=device)._value.set(block_metrics['read_bytes'])
if 'write_bytes' in block_metrics:
self.vm_block_write_bytes.labels(vm_id=vm_id, vm_name=vm_name, device=device)._value.set(block_metrics['write_bytes'])
# vCPU metrics
if 'vcpu' in api_metrics:
for vcpu_id, vcpu_metrics in api_metrics['vcpu'].items():
for exit_type, count in vcpu_metrics.items():
if exit_type.startswith('exit_'):
exit_name = exit_type.replace('exit_', '', 1)
self.vm_vcpu_exits.labels(vm_id=vm_id, vm_name=vm_name, vcpu=vcpu_id, exit_type=exit_name)._value.set(count)
except Exception as e:
logger.debug(f"API metrics not available for VM {vm_id}: {e}")
def collect_all_metrics(self):
"""Collect all metrics"""
with self.collection_duration.time():
try:
# Collect system metrics
self.collect_system_metrics()
# Discover and collect VM metrics
vms = self.discover_vms()
self.collect_vm_metrics(vms)
logger.debug(f"Collected metrics for {len(vms)} VMs")
except Exception as e:
logger.error(f"Error during metric collection: {e}")
self.collection_errors.labels(type='collection').inc()
def run(self):
"""Main exporter loop"""
logger.info(f"Starting Firecracker exporter on port {self.config['metrics_port']}")
# Start Prometheus metrics server
start_http_server(self.config['metrics_port'])
# Collection loop
collection_interval = self.config['collection_interval']
logger.info(f"Collecting metrics every {collection_interval} seconds")
while True:
try:
self.collect_all_metrics()
time.sleep(collection_interval)
except KeyboardInterrupt:
logger.info("Exporter stopped by user")
break
except Exception as e:
logger.error(f"Unexpected error: {e}")
time.sleep(collection_interval)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Firecracker Prometheus Exporter')
parser.add_argument('--config.file', dest='config_file',
default='/etc/firecracker-exporter/config.yml',
help='Path to configuration file')
parser.add_argument('--web.listen-address', dest='listen_address',
default='0.0.0.0:9200',
help='Address to listen on for web interface')
args = parser.parse_args()
# Override config with command line args
config = {}
host, port = args.listen_address.rsplit(':', 1)
config['metrics_port'] = int(port)
exporter = FirecrackerExporter(args.config_file)
if config:
exporter.config.update(config)
exporter.run()
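Setting a counter through `._value.set()` (as the exporter above does for Firecracker's cumulative values) works, but it leans on private `prometheus_client` internals. The library's supported pattern for externally maintained cumulative counters is a custom collector that emits `CounterMetricFamily` samples at scrape time. The sketch below assumes a `read_rx_bytes` callable returning `{vm_id: cumulative_rx_bytes}`; the class and metric names are illustrative:

```python
from prometheus_client import CollectorRegistry, generate_latest
from prometheus_client.core import CounterMetricFamily


class FirecrackerNetCollector:
    """Re-export externally maintained cumulative counters at scrape time."""

    def __init__(self, read_rx_bytes):
        self._read_rx_bytes = read_rx_bytes  # callable: () -> {vm_id: bytes}

    def collect(self):
        # Build a fresh metric family per scrape from the current readings
        family = CounterMetricFamily(
            'firecracker_vm_network_rx_bytes',
            'VM network RX bytes (cumulative, read from Firecracker metrics)',
            labels=['vm_id'],
        )
        for vm_id, rx_bytes in self._read_rx_bytes().items():
            family.add_metric([vm_id], rx_bytes)
        yield family


registry = CollectorRegistry()
registry.register(FirecrackerNetCollector(lambda: {'vm-1': 1024.0}))
print(generate_latest(registry).decode())
```

Because the family is rebuilt on every `collect()`, stale VMs simply disappear from the exposition instead of lingering as frozen label children.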

Disaster Recovery and Backup#

Backup Strategy#

#!/usr/bin/env python3
"""
Firecracker Backup and Disaster Recovery System
"""
import json
import time
import logging
import subprocess
import threading
from datetime import datetime, timedelta
from pathlib import Path
from typing import Dict, List, Optional
import boto3
from dataclasses import dataclass
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('firecracker-backup')
@dataclass
class BackupJob:
vm_id: str
backup_type: str # 'snapshot', 'full', 'incremental'
schedule: str # cron-like schedule
retention_days: int
storage_backend: str # 's3', 'local', 'nfs'
compression: bool = True
encryption: bool = True
class FirecrackerBackupManager:
"""Manage backups and disaster recovery for Firecracker VMs"""
def __init__(self, config_path: str = '/etc/firecracker/backup.conf'):
self.config = self._load_config(config_path)
self.base_dir = Path(self.config['firecracker_base_dir'])
self.backup_dir = Path(self.config['local_backup_dir'])
self.backup_dir.mkdir(parents=True, exist_ok=True)
# Initialize storage backends
self.storage_backends = {}
if 's3' in self.config.get('storage_backends', {}):
self._init_s3_backend()
# Track running backup jobs
self.active_jobs = {}
self.job_history = []
    def _load_config(self, config_path: str) -> Dict:
        """Load backup configuration"""
        default_config = {
            'firecracker_base_dir': '/var/lib/firecracker',
            'local_backup_dir': '/var/backups/firecracker',
            'max_concurrent_backups': 3,
            'default_retention_days': 30,
            'compression': True,
            'compression_level': 6,
            'storage_backends': {
                's3': {
                    'bucket': 'firecracker-backups',
                    'region': 'us-west-2',
                    'storage_class': 'STANDARD_IA'
                }
            },
            'encryption': {
                'enabled': True,
                'key_id': 'alias/firecracker-backup'
            }
        }
        try:
            with open(config_path, 'r') as f:
                user_config = json.load(f)
            default_config.update(user_config)
        except FileNotFoundError:
            logger.warning(f"Config file {config_path} not found, using defaults")
        return default_config

    def _init_s3_backend(self):
        """Initialize S3 storage backend"""
        s3_config = self.config['storage_backends']['s3']
        try:
            self.storage_backends['s3'] = {
                'client': boto3.client('s3', region_name=s3_config['region']),
                'bucket': s3_config['bucket'],
                'config': s3_config
            }
            logger.info(f"Initialized S3 backend: {s3_config['bucket']}")
        except Exception as e:
            logger.error(f"Failed to initialize S3 backend: {e}")
    def create_vm_snapshot(self, vm_id: str, snapshot_type: str = 'full') -> Dict:
        """Create a snapshot of a running VM"""
        logger.info(f"Creating {snapshot_type} snapshot for VM {vm_id}")

        vm_dir = self.base_dir / 'vms' / vm_id
        if not vm_dir.exists():
            raise ValueError(f"VM directory not found: {vm_dir}")

        # Create snapshot directory
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        snapshot_name = f"{vm_id}_{snapshot_type}_{timestamp}"
        snapshot_dir = self.backup_dir / 'snapshots' / snapshot_name
        snapshot_dir.mkdir(parents=True, exist_ok=True)

        job_info = {
            'job_id': snapshot_name,
            'vm_id': vm_id,
            'type': snapshot_type,
            'started_at': datetime.now(),
            'status': 'running',
            'snapshot_dir': str(snapshot_dir)
        }

        # Defined before the try block so the except handler can
        # safely decide whether the VM needs to be resumed
        api_socket = vm_dir / 'api.sock'
        vm_paused = False

        try:
            # Pause VM for a consistent snapshot (if running)
            if api_socket.exists():
                if self._pause_vm(str(api_socket)):
                    vm_paused = True
                    logger.info(f"Paused VM {vm_id} for snapshot")

            # Copy VM files
            files_copied = []

            # Copy rootfs
            rootfs_path = vm_dir / 'rootfs.ext4'
            if rootfs_path.exists():
                snapshot_rootfs = snapshot_dir / 'rootfs.ext4'
                if self.config['compression']:
                    self._copy_and_compress(rootfs_path, f"{snapshot_rootfs}.gz")
                    files_copied.append('rootfs.ext4.gz')
                else:
                    subprocess.run(['cp', str(rootfs_path), str(snapshot_rootfs)], check=True)
                    files_copied.append('rootfs.ext4')

            # Copy additional storage volumes
            for storage_file in vm_dir.glob('storage_*.ext4'):
                snapshot_storage = snapshot_dir / storage_file.name
                if self.config['compression']:
                    self._copy_and_compress(storage_file, f"{snapshot_storage}.gz")
                    files_copied.append(f"{storage_file.name}.gz")
                else:
                    subprocess.run(['cp', str(storage_file), str(snapshot_storage)], check=True)
                    files_copied.append(storage_file.name)

            # Copy configuration
            config_file = vm_dir / 'config.json'
            if config_file.exists():
                subprocess.run(['cp', str(config_file), str(snapshot_dir / 'config.json')], check=True)
                files_copied.append('config.json')

            # Create snapshot metadata
            metadata = {
                'vm_id': vm_id,
                'snapshot_name': snapshot_name,
                'snapshot_type': snapshot_type,
                'created_at': datetime.now().isoformat(),
                'files': files_copied,
                'compression': self.config['compression'],
                'vm_config': self._get_vm_config(vm_dir)
            }
            with open(snapshot_dir / 'metadata.json', 'w') as f:
                json.dump(metadata, f, indent=2)

            # Resume VM if it was paused
            if vm_paused:
                self._resume_vm(str(api_socket))
                logger.info(f"Resumed VM {vm_id}")

            # Calculate snapshot size
            snapshot_size = sum(f.stat().st_size for f in snapshot_dir.rglob('*') if f.is_file())

            job_info.update({
                'status': 'completed',
                'completed_at': datetime.now(),
                'files_copied': files_copied,
                'snapshot_size_bytes': snapshot_size
            })
            logger.info(f"Snapshot created successfully: {snapshot_name} "
                        f"({snapshot_size // (1024 * 1024)}MB)")
            return job_info

        except Exception as e:
            job_info.update({
                'status': 'failed',
                'error': str(e),
                'completed_at': datetime.now()
            })
            logger.error(f"Snapshot creation failed for VM {vm_id}: {e}")
            # Resume VM if it was paused
            if vm_paused:
                self._resume_vm(str(api_socket))
            raise
        finally:
            self.job_history.append(job_info)
    def _copy_and_compress(self, source: Path, dest: str):
        """Copy and compress a file using gzip"""
        cmd = ['gzip', '-c', str(source)]
        with open(dest, 'wb') as f:
            subprocess.run(cmd, stdout=f, check=True)

    def _pause_vm(self, api_socket: str) -> bool:
        """Pause VM via the Firecracker API"""
        try:
            import requests_unixsocket  # local import: optional dependency
            session = requests_unixsocket.Session()
            base_url = f'http+unix://{api_socket.replace("/", "%2F")}'
            response = session.patch(
                f'{base_url}/vm',
                json={'state': 'Paused'},
                timeout=10
            )
            return response.status_code == 204
        except Exception as e:
            logger.warning(f"Failed to pause VM: {e}")
            return False

    def _resume_vm(self, api_socket: str) -> bool:
        """Resume VM via the Firecracker API"""
        try:
            import requests_unixsocket  # local import: optional dependency
            session = requests_unixsocket.Session()
            base_url = f'http+unix://{api_socket.replace("/", "%2F")}'
            response = session.patch(
                f'{base_url}/vm',
                json={'state': 'Resumed'},
                timeout=10
            )
            return response.status_code == 204
        except Exception as e:
            logger.warning(f"Failed to resume VM: {e}")
            return False

    def _get_vm_config(self, vm_dir: Path) -> Dict:
        """Get VM configuration"""
        config_file = vm_dir / 'config.json'
        if config_file.exists():
            with open(config_file, 'r') as f:
                return json.load(f)
        return {}
    def upload_to_s3(self, snapshot_dir: Path, snapshot_name: str) -> bool:
        """Upload snapshot to S3"""
        if 's3' not in self.storage_backends:
            logger.error("S3 backend not configured")
            return False

        s3_client = self.storage_backends['s3']['client']
        bucket = self.storage_backends['s3']['bucket']
        try:
            for file_path in snapshot_dir.rglob('*'):
                if not file_path.is_file():
                    continue
                relative_path = file_path.relative_to(snapshot_dir)
                s3_key = f"snapshots/{snapshot_name}/{relative_path}"
                logger.info(f"Uploading {file_path.name} to S3...")

                extra_args = {}
                if self.config['encryption']['enabled']:
                    extra_args['ServerSideEncryption'] = 'aws:kms'
                    extra_args['SSEKMSKeyId'] = self.config['encryption']['key_id']

                s3_client.upload_file(str(file_path), bucket, s3_key,
                                      ExtraArgs=extra_args)

            logger.info(f"Successfully uploaded snapshot {snapshot_name} to S3")
            return True
        except Exception as e:
            logger.error(f"Failed to upload snapshot to S3: {e}")
            return False
    def restore_vm_from_snapshot(self, snapshot_name: str,
                                 target_vm_id: Optional[str] = None) -> bool:
        """Restore VM from snapshot"""
        logger.info(f"Restoring VM from snapshot: {snapshot_name}")

        # Find the snapshot locally, falling back to S3
        snapshot_dir = self.backup_dir / 'snapshots' / snapshot_name
        if not snapshot_dir.exists():
            if not self._download_snapshot_from_s3(snapshot_name):
                raise ValueError(f"Snapshot not found: {snapshot_name}")

        # Load snapshot metadata
        metadata_file = snapshot_dir / 'metadata.json'
        if not metadata_file.exists():
            raise ValueError(f"Snapshot metadata not found: {metadata_file}")
        with open(metadata_file, 'r') as f:
            metadata = json.load(f)

        # Determine target VM ID
        original_vm_id = metadata['vm_id']
        if target_vm_id is None:
            target_vm_id = f"{original_vm_id}_restored_{int(time.time())}"
        logger.info(f"Restoring as VM: {target_vm_id}")

        # Resolved before the try block so the except handler can clean up
        target_vm_dir = self.base_dir / 'vms' / target_vm_id
        try:
            # Create target VM directory
            target_vm_dir.mkdir(parents=True, exist_ok=True)

            # Restore files
            for file_name in metadata['files']:
                source_file = snapshot_dir / file_name
                if file_name.endswith('.gz'):
                    # Decompress while copying
                    target_file = target_vm_dir / file_name[:-3]  # strip .gz
                    with open(source_file, 'rb') as src, open(target_file, 'wb') as tgt:
                        subprocess.run(['gunzip', '-c'], stdin=src, stdout=tgt, check=True)
                else:
                    # Direct copy
                    target_file = target_vm_dir / file_name
                    subprocess.run(['cp', str(source_file), str(target_file)], check=True)
                logger.info(f"Restored file: {file_name}")

            # Update VM configuration with the new ID
            vm_config = metadata['vm_config'].copy()
            vm_config['vm_id'] = target_vm_id
            vm_config['restored_from'] = snapshot_name
            vm_config['restored_at'] = datetime.now().isoformat()
            with open(target_vm_dir / 'config.json', 'w') as f:
                json.dump(vm_config, f, indent=2)

            logger.info(f"VM restored successfully: {target_vm_id}")
            return True

        except Exception as e:
            logger.error(f"Failed to restore VM from snapshot: {e}")
            # Clean up partial restore
            if target_vm_dir.exists():
                subprocess.run(['rm', '-rf', str(target_vm_dir)], check=False)
            return False
    def _download_snapshot_from_s3(self, snapshot_name: str) -> bool:
        """Download snapshot from S3"""
        if 's3' not in self.storage_backends:
            return False

        s3_client = self.storage_backends['s3']['client']
        bucket = self.storage_backends['s3']['bucket']
        try:
            # List objects under the snapshot prefix
            prefix = f"snapshots/{snapshot_name}/"
            response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
            if 'Contents' not in response:
                logger.error(f"Snapshot not found in S3: {snapshot_name}")
                return False

            # Create local directory
            snapshot_dir = self.backup_dir / 'snapshots' / snapshot_name
            snapshot_dir.mkdir(parents=True, exist_ok=True)

            # Download files
            for obj in response['Contents']:
                s3_key = obj['Key']
                file_path = snapshot_dir / Path(s3_key).relative_to(prefix)
                file_path.parent.mkdir(parents=True, exist_ok=True)
                logger.info(f"Downloading {Path(s3_key).name} from S3...")
                s3_client.download_file(bucket, s3_key, str(file_path))

            logger.info(f"Downloaded snapshot {snapshot_name} from S3")
            return True
        except Exception as e:
            logger.error(f"Failed to download snapshot from S3: {e}")
            return False
    def cleanup_old_snapshots(self, retention_days: Optional[int] = None):
        """Clean up old snapshots based on the retention policy"""
        if retention_days is None:
            retention_days = self.config['default_retention_days']
        cutoff_date = datetime.now() - timedelta(days=retention_days)
        logger.info(f"Cleaning up snapshots older than {retention_days} days")

        snapshots_dir = self.backup_dir / 'snapshots'
        if not snapshots_dir.exists():
            return

        cleaned_count = 0
        for snapshot_dir in snapshots_dir.iterdir():
            if not snapshot_dir.is_dir():
                continue
            metadata_file = snapshot_dir / 'metadata.json'
            if not metadata_file.exists():
                continue
            try:
                with open(metadata_file, 'r') as f:
                    metadata = json.load(f)
                created_at = datetime.fromisoformat(metadata['created_at'])
                if created_at < cutoff_date:
                    logger.info(f"Removing old snapshot: {snapshot_dir.name}")
                    subprocess.run(['rm', '-rf', str(snapshot_dir)], check=True)
                    cleaned_count += 1
            except Exception as e:
                logger.warning(f"Error processing snapshot {snapshot_dir.name}: {e}")

        logger.info(f"Cleaned up {cleaned_count} old snapshots")

    def schedule_backup(self, backup_job: BackupJob):
        """Schedule a backup job"""
        # This would integrate with a scheduler such as cron or a job queue;
        # the implementation depends on the chosen scheduling system
        logger.info(f"Scheduled backup job for VM {backup_job.vm_id}")

    def get_backup_status(self) -> Dict:
        """Get backup system status"""
        snapshots_dir = self.backup_dir / 'snapshots'
        snapshot_count = len(list(snapshots_dir.iterdir())) if snapshots_dir.exists() else 0

        # Calculate total backup size
        total_size = 0
        if snapshots_dir.exists():
            for snapshot_dir in snapshots_dir.iterdir():
                if snapshot_dir.is_dir():
                    total_size += sum(f.stat().st_size
                                      for f in snapshot_dir.rglob('*') if f.is_file())

        return {
            'total_snapshots': snapshot_count,
            'total_backup_size_gb': total_size / (1024 ** 3),
            'active_jobs': len(self.active_jobs),
            'recent_jobs': self.job_history[-10:],  # last 10 jobs
            'storage_backends': list(self.storage_backends.keys())
        }
if __name__ == '__main__':
    backup_manager = FirecrackerBackupManager()

    # Example: create a snapshot
    # backup_manager.create_vm_snapshot('vm001', 'full')

    # Example: restore from a snapshot
    # backup_manager.restore_vm_from_snapshot('vm001_full_20250117_120000')

    # Example: clean up old snapshots
    # backup_manager.cleanup_old_snapshots(retention_days=30)

    # Show status (default=str so datetime values in the job history serialize)
    status = backup_manager.get_backup_status()
    print(json.dumps(status, indent=2, default=str))
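The `schedule_backup` method above is deliberately left as a stub, since the right integration depends on your scheduling system. As one illustration, a fixed-interval dispatcher can drive periodic snapshot jobs; this `IntervalBackupScheduler` is a simplified, hypothetical sketch (it uses plain intervals rather than the cron-like `schedule` strings in `BackupJob`), with the clock injected so it can be tested without sleeping:

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(order=True)
class ScheduledJob:
    next_run: float                                   # only field used for ordering
    interval_s: float = field(compare=False)
    vm_id: str = field(compare=False)
    action: Callable[[str], None] = field(compare=False)


class IntervalBackupScheduler:
    """Minimal fixed-interval dispatcher (stand-in for cron integration)."""

    def __init__(self) -> None:
        self._heap: List[ScheduledJob] = []           # min-heap ordered by next_run

    def add(self, vm_id: str, interval_s: float,
            action: Callable[[str], None], now: float) -> None:
        """Register a job; first run is one interval from `now`."""
        heapq.heappush(self._heap,
                       ScheduledJob(now + interval_s, interval_s, vm_id, action))

    def run_due(self, now: float) -> List[str]:
        """Run every job whose deadline has passed; return the VM ids run."""
        ran = []
        while self._heap and self._heap[0].next_run <= now:
            job = heapq.heappop(self._heap)
            job.action(job.vm_id)
            ran.append(job.vm_id)
            # Re-arm for the next interval
            heapq.heappush(self._heap,
                           ScheduledJob(job.next_run + job.interval_s,
                                        job.interval_s, job.vm_id, job.action))
        return ran
```

In production the `action` callback would call `backup_manager.create_vm_snapshot(vm_id)` and then `upload_to_s3`, and `run_due(time.time())` would be invoked from a periodic loop or a systemd timer.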

Conclusion#

This guide has laid out a complete blueprint for building enterprise-ready Firecracker infrastructure. Key areas covered include:

  • 🏗️ Infrastructure Design: Multi-tier architecture with high availability and scalability
  • 🤖 Automation: Infrastructure as Code, configuration management, and CI/CD pipelines
  • 🔧 Operations: VM lifecycle management, templates, and image building
  • 📊 Monitoring: Comprehensive observability stack with custom metrics
  • 🛡️ Disaster Recovery: Backup strategies, snapshots, and restoration procedures

By following this guide, organizations can deploy Firecracker microVMs at scale while maintaining security, reliability, and operational efficiency. The modular approach allows teams to adopt components incrementally and customize them for specific requirements.

Resources#

Firecracker Production Deployment Guide: Enterprise-Ready MicroVM Infrastructure
https://mranv.pages.dev/posts/firecracker-practical-deployment-guide/
Author
Anubhav Gain
Published at
2025-01-17
License
CC BY-NC-SA 4.0