Fleet Management Architecture — Specification Document

A pattern for centralized cloud management of distributed software running on bare-metal Linux servers.

Version: 1.0
Status: Reference Specification
Audience: Developers, architects, and DevOps engineers implementing agent-based remote management systems.


Table of Contents

  1. Problem Statement
  2. Architecture Overview
  3. Component Specification
  4. Communication Specification
  5. Agent Lifecycle Specification
  6. Managed App Specification
  7. CLI Tool Specification
  8. Packaging & Distribution Specification
  9. CI/CD Pipeline Specification
  10. Security Model
  11. Filesystem Layout Standard
  12. User Stories & Scenarios
  13. Glossary

1. Problem Statement

An organization operates multiple Linux servers, each running one or more business applications. Without centralized management, every operational task — checking health, updating config, deploying new versions, reading logs — requires SSH access to individual servers. This approach fails to scale and lacks auditability.

WITHOUT fleet management:

  Admin needs to check 10 servers:
    ssh srv-01 → systemctl status app-a → exit
    ssh srv-02 → systemctl status app-a → exit
    ssh srv-03 → systemctl status app-b → exit
    ... (repeat 10 times)

  Admin needs to push a config change:
    ssh srv-01 → nano config → restart → exit
    ssh srv-02 → nano config → restart → exit
    ... (did I miss one? did I typo on srv-07?)

WITH fleet management:

  Admin opens dashboard (or CLI):
    → sees all 10 servers, all apps, all metrics, one screen
    → pushes config to all servers in one action
    → full audit trail of who changed what, when

1.1 Goals

  • Monitor the health and status of all servers and applications from a single point.
  • Push configuration changes remotely without SSH.
  • Trigger application restarts and updates remotely.
  • Maintain a complete audit trail of all management actions.
  • Support the above without modifying the managed applications themselves.

1.2 Non-Goals

  • Container orchestration (this is for bare-metal / VM workloads).
  • Application-level logic (the fleet system manages processes, not business rules).
  • Replacing SSH entirely (admin access is still needed for infrastructure tasks).

2. Architecture Overview

2.1 Component Map

The system consists of four components across two deployment zones:

CLOUD ZONE (single deployment)
┌──────────────────────────────────────────────────────────────┐
│                       CLOUD CONSOLE                          │
│                                                              │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────────────┐  │
│  │  Dashboard   │  │   REST API   │  │  Realtime Hub      │  │
│  │  (Web UI)    │  │  (for CLI &  │  │  (for agents,      │  │
│  │              │  │   dashboard) │  │   e.g. WebSocket)  │  │
│  └──────┬───────┘  └──────┬───────┘  └──────┬─────────────┘  │
│         │                 │                  │                │
│         └─────────────────┼──────────────────┘                │
│                           │                                  │
│                    ┌──────▼───────┐                           │
│                    │  Data Store  │  Agent registry,          │
│                    │              │  audit log, config store  │
│                    └──────────────┘                           │
└───────────────────────────┬──────────────────────────────────┘
                            │
                  TLS / mTLS encrypted channel
                            │
SERVER ZONE (one per physical server)
         ┌──────────────────┼──────────────────┐
         │                  │                  │
  ┌──────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐
  │  Server A   │   │  Server B   │   │  Server C   │
  │             │   │             │   │             │
  │  ┌───────┐  │   │  ┌───────┐  │   │  ┌───────┐  │
  │  │ AGENT │  │   │  │ AGENT │  │   │  │ AGENT │  │
  │  └───┬───┘  │   │  └───┬───┘  │   │  └───┬───┘  │
  │      │      │   │      │      │   │      │      │
  │  ┌───▼───┐  │   │  ┌───▼───┐  │   │  ┌───▼───┐  │
  │  │App  1 │  │   │  │App  1 │  │   │  │App  2 │  │
  │  │App  2 │  │   │  │App  3 │  │   │  │App  3 │  │
  │  └───────┘  │   │  └───────┘  │   │  └───────┘  │
  └─────────────┘   └─────────────┘   └─────────────┘

ADMIN ZONE (any workstation)
  ┌───────────────┐
  │  CLI TOOL     │ ──── HTTP/REST ────► Cloud Console API
  │  (terminal)   │
  └───────────────┘

2.2 Control Plane vs Data Plane

The cloud console serves two distinct roles:

CONTROL PLANE                          DATA PLANE
(decision-making)                      (information flow)
─────────────────                      ──────────────────
"Restart app-a on server-01"           "Server-01 CPU is 45%"
"Push LogLevel=Debug to app-b"         "App-a version is 2.1.0"
"Update app-a to v2.2"                 "Config change applied at 10:30"

Commands flow DOWN                     Telemetry flows UP
Cloud → Agent → Managed App            Managed App → Agent → Cloud

Initiated by:                          Initiated by:
  - Admin (via CLI or dashboard)         - Agent (heartbeat loop)
  - Automated rules (future)             - Agent (event reporting)

2.3 Awareness Model

A critical design property — awareness flows in one direction only:

Cloud Console
    │  knows about
    ▼
  Agent
    │  knows about
    ▼
  Managed Apps
    │  knows about
    ▼
  ❌ NOTHING above it

Implication: managed apps require zero modification to be managed. The agent wraps management around existing services. The cloud never communicates directly with managed apps.

2.4 Component Relationships

Component          Depends On        Talks To              Unaware Of
──────────────────────────────────────────────────────────────────────
Cloud Console      Data store        Agents (realtime)     Managed apps directly
Agent              Cloud Console     Cloud (realtime)      Other agents
                   OS (systemd)      Managed apps (local)
Managed App        OS (systemd)      Its own clients/DBs   Agent, Cloud, other apps
CLI Tool           Cloud Console     Cloud (REST API)      Agents, managed apps

3. Component Specification

3.1 Cloud Console

What it is: A server-side application deployed centrally (cloud, on-prem VM, etc.) that acts as the single management point for the entire fleet.

Responsibilities:

Category        Capability                           Interface
───────────────────────────────────────────────────────────────────────────
Monitoring      Receive and store agent heartbeats   Realtime hub (inbound)
                Track agent online/offline status    Realtime hub (inbound)
                Display fleet-wide dashboard         Web UI

Auditing        Store all management events          Realtime hub (inbound)
                Provide audit log queries            REST API

Configuration   Send config changes to agents        Realtime hub (outbound)
                Store config history                 Data store

Updates         Send update commands to agents       Realtime hub (outbound)
                Manage artifact references           Data store

API             Expose fleet state to CLI/UI         REST API
                Expose command endpoints to CLI/UI   REST API

Subcomponents:

┌─────────────────────────────────────────────────────────────┐
│                      CLOUD CONSOLE                          │
│                                                             │
│  ┌─────────────────┐    ┌────────────────────┐              │
│  │   Realtime Hub  │    │    REST API        │              │
│  │                 │    │                    │              │
│  │  Agent connects │    │  GET  /agents      │              │
│  │  here via       │    │  GET  /agents/:id  │              │
│  │  persistent     │    │  GET  /audit       │              │
│  │  connection     │    │  POST /agents/:id/ │              │
│  │  (WebSocket,    │    │       restart-app  │              │
│  │   gRPC stream,  │    │  POST /agents/:id/ │              │
│  │   SSE, etc.)    │    │       push-config  │              │
│  │                 │    │  POST /agents/:id/ │              │
│  └────────┬────────┘    │       update-app   │              │
│           │             └──────────┬─────────┘              │
│           │                        │                        │
│           └────────────┬───────────┘                        │
│                        │                                    │
│                 ┌──────▼──────┐                              │
│                 │  Data Store │  Agent registry              │
│                 │             │  Heartbeat history           │
│                 │  (DB, file, │  Audit event log             │
│                 │   in-memory)│  Config snapshots            │
│                 └─────────────┘                              │
│                                                             │
│  ┌─────────────────┐                                        │
│  │   Web Dashboard │  (optional: can be a separate app)     │
│  │   or SPA        │  Consumes REST API + realtime events   │
│  └─────────────────┘                                        │
└─────────────────────────────────────────────────────────────┘

3.2 Agent

What it is: A lightweight background daemon (system service) running on each managed Ubuntu server. It is the cloud console’s representative on the local machine.

Responsibilities:

Direction          Capability
─────────────────────────────────────────────────────────
Agent → Cloud      Register self on startup
                   Send periodic heartbeat (system metrics + app status)
                   Report audit events (config applied, app restarted, etc.)
                   Report command execution results (success/failure)

Cloud → Agent      Receive and execute: restart app
                   Receive and execute: push config to app
                   Receive and execute: update app to new version
                   Receive: request immediate heartbeat (ping)

Agent → Local OS   Monitor managed app status (systemd queries)
                   Start/stop/restart managed apps (systemd commands)
                   Read/write managed app config files
                   Download and swap managed app binaries

What the agent is NOT:

The agent is NOT:
  ✗ A business application (it has no domain logic)
  ✗ A general-purpose remote execution engine (no arbitrary commands)
  ✗ A replacement for SSH (admin still needs SSH for infrastructure tasks)
  ✗ A container runtime (it manages native Linux services)
  ✗ Self-aware enough to update itself (admin updates the agent via dpkg/apt)

3.3 Managed App

What it is: Any application running as a systemd service on the server. The managed app performs actual business work (serving requests, processing data, etc.). It does not know it is being managed.

Properties:

Property                    Requirement
───────────────────────────────────────────────────────────────
Process model               Must run as a systemd service
Awareness of agent          None required (zero coupling)
Awareness of cloud          None required (zero coupling)
Config format               File-based (JSON, YAML, env, INI — agent must know which)
Version identification      Must be discoverable (file, CLI flag, or env var)
Health check                Ideally: responds to a health endpoint or exit code
Source code changes needed  None

Relationship to agent:

Agent manages apps through the OPERATING SYSTEM layer,
not through the application layer:

  Agent ──► systemctl restart app-a        (not: app-a.restart())
  Agent ──► writes /opt/app-a/config.json  (not: app-a.setConfig())
  Agent ──► reads systemctl is-active      (not: app-a.getStatus())

This is what makes managed apps UNAWARE of the agent.
The agent uses the same interfaces an admin would use via SSH.

3.4 CLI Tool

What it is: A command-line application that admins run on their workstations (macOS, Linux, WSL) to interact with the cloud console. It is a thin client over the REST API.

Properties:

Property               Value
──────────────────────────────────────────────────────────
Runs on                Admin's workstation (not on servers)
Lifecycle              Invoked → executes → exits (not a daemon)
Talks to               Cloud Console REST API (HTTP)
Does NOT talk to       Agents directly, managed apps directly
Config location        ~/.config/<cli-name>/   (XDG standard)
Auth                   API key or token stored in config file
Output modes           Human-readable table (default), JSON (--json)

4. Communication Specification

4.1 Communication Channels

Channel                  Between              Transport         Direction
───────────────────────────────────────────────────────────────────────────
Realtime Hub             Agent ↔ Cloud        WebSocket /       Bidirectional
                                              gRPC stream /     persistent
                                              SSE+HTTP          connection

REST API                 CLI → Cloud          HTTP/HTTPS        Request-response
                         Dashboard → Cloud

Local OS                 Agent → Managed App  systemd (dbus) /  Local only
                                              filesystem

4.2 Realtime Hub Messages

Agent → Cloud (inbound):

Message              Payload                           When
───────────────────────────────────────────────────────────────────────
Register             Agent ID, hostname, OS info,      On first connect
                     list of managed apps               and reconnect

Heartbeat            Agent ID, timestamp,              Periodic
                     CPU/RAM/disk metrics,              (every N seconds)
                     list of managed apps with status

AuditEvent           Agent ID, timestamp,              On any notable event
                     category, message

CommandResult        Agent ID, command type,           After executing
                     success/failure, message           a cloud command

Cloud → Agent (outbound):

Message              Payload                           When
───────────────────────────────────────────────────────────────────────
RestartApp           App ID                            On admin request

PushConfig           App ID,                           On admin request
                     key-value settings map

UpdateApp            App ID,                           On admin request
                     target version,
                     artifact download URL

RequestHeartbeat     (none)                            On-demand ping
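
The spec does not mandate a wire format, but the inbound messages above can be sketched as type-tagged JSON envelopes. The field names here (agent_id, cpu_percent, etc.) are illustrative assumptions, not part of the specification:

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class Heartbeat:
    # Field names are illustrative, not mandated by the spec.
    agent_id: str
    timestamp: float
    cpu_percent: float
    mem_mb: int
    apps: list = field(default_factory=list)   # per-app status dicts

@dataclass
class CommandResult:
    agent_id: str
    command_type: str
    success: bool
    message: str = ""

def to_wire(msg):
    """Wrap any message in a type-tagged envelope for the hub to route."""
    return json.dumps({"type": type(msg).__name__, "payload": asdict(msg)})
```

Carrying the message type in the envelope lets the realtime hub route each message to the right handler without inspecting the payload.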

4.3 REST API Surface

The cloud console exposes these endpoints for the CLI and dashboard:

Method   Path                          Purpose
───────────────────────────────────────────────────────────
GET      /api/agents                   List all agents with current state
GET      /api/agents/:agentId          Get single agent detail
GET      /api/audit                    Get recent audit events
POST     /api/agents/:agentId/         Send restart command
           restart-app
POST     /api/agents/:agentId/         Send config push command
           push-config
POST     /api/agents/:agentId/         Send update command
           update-app

All POST endpoints return 202 Accepted (the command is dispatched asynchronously to the agent via the realtime hub). The actual result arrives later as a CommandResult message from the agent.
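
The 202 pattern can be sketched in miniature: the handler only validates the agent and enqueues the command for its realtime connection. This is an illustrative in-memory model, not a prescribed implementation:

```python
import queue

# In-memory stand-in for the realtime hub's per-agent outbound queues.
outbound = {}   # agent_id -> queue.Queue of pending commands

def dispatch(agent_id, command_type, payload, online_agents):
    """POST handler core: validate, enqueue, return immediately.

    Returns an (http_status, body) pair. 202 means "dispatched", not
    "done" -- the CommandResult arrives later over the realtime hub.
    """
    if agent_id not in online_agents:
        return 404, {"error": f"Agent '{agent_id}' not found"}
    q = outbound.setdefault(agent_id, queue.Queue())
    q.put({"type": command_type, **payload})
    return 202, {"status": "dispatched"}
```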

4.4 Connection Resilience

The agent must handle network disruptions gracefully:

Scenario                    Agent Behavior
──────────────────────────────────────────────────────────────
Cloud unreachable           Retry connection with exponential backoff
                            Continue running (managed apps unaffected)
                            Queue audit events for delivery on reconnect

Connection dropped          Auto-reconnect (built into realtime client)
                            Re-register on reconnect

Cloud comes back            Agent reconnects, re-registers, resumes heartbeat
                            Cloud sees agent as "online" again

Agent restarts              Connect on startup, register, begin heartbeat loop
                            Managed apps are unaffected (they're separate processes)

5. Agent Lifecycle Specification

5.1 Who Can Do What

Action              Local Admin    Cloud Console    Why
────────────────────────────────────────────────────────────────────
Install Agent       ✅ YES         ❌ NO            Agent doesn't exist yet;
                                                    trust decision requires
                                                    physical/SSH access

Upgrade Agent       ✅ YES         ❌ NO *          Agent can't safely replace
                    (dpkg/apt)                       its own binary while running;
                                                    dpkg + systemd handle this
                                                    atomically

Configure Agent     ✅ YES         ❌ NO            Agent identity and cloud URL
                    (edit env)                       must be set before it can
                                                    connect to receive commands

Restart Agent       ✅ YES         ⚠️  AVOID        If restart fails, cloud loses
                    (systemctl)                      contact with no recovery path

Uninstall Agent     ✅ YES         ❌ NO            Security: compromised cloud
                    (dpkg -r)                        must not be able to remove
                                                    agents fleet-wide

Restart App         ✅ YES         ✅ YES           Both paths valid; cloud
                                                    preferred for auditability

Configure App       ✅ YES         ✅ YES           Cloud preferred for
                                                    centralization and audit trail

Update App          ✅ YES         ✅ YES           Cloud preferred; agent
                                                    handles download/swap/restart

* Advanced systems may support cloud-triggered agent self-update via a helper script, but this adds significant complexity and is not part of this base specification.

5.2 Agent Startup Sequence

1. Process starts (launched by systemd)
       │
2. Read local config (env file + app settings)
       │
3. Build realtime connection to cloud console
       │
4. Connect with retry (exponential backoff on failure)
       │
5. Register: send agent ID, hostname, OS, managed app list
       │
6. Register command handlers (for incoming cloud commands)
       │
7. Enter heartbeat loop:
       │
       ├──► Collect system metrics (CPU, RAM, disk)
       ├──► Collect managed app statuses (systemd queries)
       ├──► Send heartbeat to cloud
       ├──► Sleep N seconds
       └──► Repeat
       │
8. On shutdown signal (SIGTERM from systemd):
       │
9. Graceful disconnect from cloud
       │
10. Process exits
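
Step 7's metric collection might look like the following sketch. shutil.disk_usage and os.getloadavg are standard-library calls; the payload field names are assumptions, and the per-app statuses are passed in from the systemd query layer so the collection logic stays testable:

```python
import os
import shutil
import time

def collect_heartbeat(agent_id, app_statuses):
    """Build one heartbeat payload (step 7 of the startup sequence).

    Field names are illustrative; `app_statuses` is supplied by the
    systemd query layer rather than collected here.
    """
    _total, _used, free = shutil.disk_usage("/")
    return {
        "agentId": agent_id,
        "timestamp": time.time(),
        "diskFreeBytes": free,
        "loadAvg": os.getloadavg()[0],   # 1-minute load average (POSIX)
        "apps": app_statuses,
    }
```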

5.3 Reconnection Behavior

Connection lost
       │
       ▼
  Wait 0 seconds → attempt reconnect
       │ failure
       ▼
  Wait 2 seconds → attempt reconnect
       │ failure
       ▼
  Wait 5 seconds → attempt reconnect
       │ failure
       ▼
  Wait 10 seconds → attempt reconnect
       │ failure
       ▼
  Wait 30 seconds → attempt reconnect (cap here)
       │ ...repeat at 30s intervals...
       │
       │ success
       ▼
  Re-register with cloud (send full agent info again)
  Resume heartbeat loop

During disconnection: managed apps continue running normally. Only remote monitoring and command execution are interrupted.
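
The backoff schedule above can be expressed as a tiny generator; this is one possible implementation:

```python
from itertools import islice

def backoff_delays():
    """Reconnect delay schedule: 0, 2, 5, 10, then capped at 30 seconds."""
    yield from (0, 2, 5, 10)
    while True:
        yield 30

# First seven attempts:
# list(islice(backoff_delays(), 7)) -> [0, 2, 5, 10, 30, 30, 30]
```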


6. Managed App Specification

6.1 How an Existing App Becomes “Managed”

No source code changes required. The agent discovers or is configured to know about apps:

Step 1: The app already exists as a systemd service
  /etc/systemd/system/my-app.service
  /opt/my-app/my-app-binary

Step 2: Agent configuration declares it
  In the agent's config (env, JSON, YAML, or registry):
    managed_apps:
      - id: my-app
        service_name: my-app.service
        binary_path: /opt/my-app/
        config_path: /opt/my-app/appsettings.json
        version_file: /opt/my-app/version.txt

Step 3: Agent starts monitoring it
  Agent reads: systemctl is-active my-app.service
  Agent reads: cat /opt/my-app/version.txt
  Agent reports status in each heartbeat

Step 4: Cloud console can now manage it
  Restart: agent runs → systemctl restart my-app.service
  Config:  agent writes → /opt/my-app/appsettings.json
  Update:  agent downloads new binary → stops → swaps → starts
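
Step 3's status probe can be sketched like this. The systemctl invocation is the one shown above; the runner is injectable so the reporting logic can be exercised without systemd:

```python
import subprocess

def systemctl_is_active(service):
    # The real probe: prints "active", "inactive", "failed", ...
    out = subprocess.run(["systemctl", "is-active", service],
                         capture_output=True, text=True)
    return out.stdout.strip()

def app_status(app, run=systemctl_is_active):
    """Status of one declared app, as reported in each heartbeat.
    `run` is injectable so the logic can be tested without systemd."""
    state = run(app["service_name"])
    try:
        with open(app["version_file"]) as f:
            version = f.read().strip()
    except OSError:
        version = "unknown"
    return {"id": app["id"], "state": state, "version": version}
```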

6.2 Agent Operations on Managed Apps

Restart:

1. Agent receives RestartApp command
2. Agent runs: systemctl restart <service_name>
3. Agent waits briefly, checks: systemctl is-active <service_name>
4. Agent reports result to cloud (success or failure with message)
5. Agent logs audit event

Config Push:

1. Agent receives PushConfig command (app ID + key-value map)
2. Agent reads current config file for the app
3. Agent merges new settings into the config
4. Agent writes updated config file
5. Agent restarts the app (to pick up new config)
6. Agent reports result to cloud
7. Agent logs audit event with changed keys
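
Steps 2–4 amount to a read-merge-write on the app's config file. A minimal sketch for a JSON config (a shallow merge; real configs may need nested-key handling):

```python
import json

def push_config(config_path, settings):
    """Merge a key-value settings map into a JSON config file
    (steps 2-4 of the Config Push flow). Returns the list of changed
    keys, which is what the audit event records."""
    try:
        with open(config_path) as f:
            config = json.load(f)
    except FileNotFoundError:
        config = {}
    changed = [k for k, v in settings.items() if config.get(k) != v]
    config.update(settings)
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
    return changed
```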

Update:

1. Agent receives UpdateApp command (app ID, version, artifact URL)
2. Agent downloads artifact from URL to temp directory
3. Agent verifies artifact integrity (checksum if provided)
4. Agent stops the app: systemctl stop <service_name>
5. Agent backs up current binary directory
6. Agent extracts new version into place
7. Agent starts the app: systemctl start <service_name>
8. Agent performs health check (is-active + optional endpoint check)
9. On SUCCESS: agent reports success, removes backup
10. On FAILURE: agent restores backup, restarts old version, reports failure
11. Agent logs audit event with version transition
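
The swap-with-rollback core (steps 4–10) can be sketched as follows, with systemctl and the health check injected as callables so the control flow stands on its own:

```python
import shutil
import tempfile
from pathlib import Path

def update_app(binary_dir, new_version_dir, start, stop, healthy):
    """Swap an app's binary directory with backup and rollback
    (steps 4-10 of the Update flow). `start`, `stop`, and `healthy`
    stand in for systemctl and the post-update health check."""
    binary_dir = Path(binary_dir)
    backup = Path(tempfile.mkdtemp()) / "backup"
    stop()                                          # step 4
    shutil.move(binary_dir, backup)                 # step 5: back up current
    shutil.copytree(new_version_dir, binary_dir)    # step 6: new version in
    start()                                         # step 7
    if healthy():                                   # step 8
        shutil.rmtree(backup)                       # step 9: keep new version
        return True
    stop()                                          # step 10: roll back
    shutil.rmtree(binary_dir)
    shutil.move(backup, binary_dir)
    start()
    return False
```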

6.3 Failure Isolation

Failure                        Impact on Agent    Impact on Other Apps
──────────────────────────────────────────────────────────────────────
Managed app crashes            None               None
Agent crashes                  —                  None (apps keep running)
Cloud console goes down        Agent retries      None
Network outage                 Agent retries      None
Bad config pushed              One app affected   None (per-app config)
Bad update deployed            One app affected   None (per-app update)
                               (agent rolls back)

This isolation is the fundamental architectural guarantee: no single failure cascades across the system.


7. CLI Tool Specification

7.1 Design Principles

Principle                    Rationale
──────────────────────────────────────────────────────────────────
Stateless                    No local database; all data from cloud API
Config in ~/.config/         XDG standard for per-user interactive tools
Table output by default      Human-readable for interactive use
JSON output with --json      Machine-readable for scripting and piping
Non-zero exit on error       Standard Unix convention for automation
No DI framework              CLI tools should be simple; manual wiring
API key in config file       Protected with file mode 600

7.2 Command Tree

<cli-name>
│
├── login                         Configure console URL and credentials
│   --url <console-url>           (required)
│   --api-key <key>               (optional; prompt interactively if omitted)
│
├── agents                        Fleet visibility
│   ├── list                      Tabular overview of all agents
│   │   --json                    JSON output
│   │   --online-only             Filter to connected agents
│   │
│   └── status <agent-id>         Detailed view of one agent
│       --json                    JSON output
│
├── apps                          Remote app management
│   ├── restart <agent-id> <app-id>
│   │
│   ├── config <agent-id> <app-id>
│   │   --set <key=value>         Repeatable; one or more settings
│   │
│   └── update <agent-id> <app-id>
│       --version <version>       Target version (required)
│       --artifact <url>          Download URL (required)
│
└── audit                         Audit log viewer
    --last <n>                    Number of entries (default: 20)
    --agent <agent-id>            Filter by agent
    --json                        JSON output

7.3 Config File

Location: ~/.config/<cli-name>/config.json

{
  "consoleUrl": "https://console.example.com",
  "apiKey": "sk-..."
}

File permissions: 600 (owner read/write only).

Created by the login command. All other commands check for its existence and print a clear error if missing: Not configured. Run '<cli-name> login' first.
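
A sketch of the login command's file handling, including the 600 mode and the missing-config error described above:

```python
import json
from pathlib import Path

def save_cli_config(path, console_url, api_key):
    """What `login` would do: write the file, then restrict it to mode 600."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"consoleUrl": console_url,
                                "apiKey": api_key}, indent=2))
    path.chmod(0o600)   # owner read/write only

def load_cli_config(path):
    """What every other command does first."""
    path = Path(path)
    if not path.exists():
        raise SystemExit("Not configured. Run '<cli-name> login' first.")
    return json.loads(path.read_text())
```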

7.4 Output Formats

Table (default) — designed for human scanning:

$ <cli-name> agents list

AGENT ID          HOSTNAME        STATUS    CPU     MEM (MB)   APPS
agent-001         srv-alpha       online    23.4%   128        2
agent-002         srv-beta        offline   -       -          3

JSON (--json) — designed for piping to jq, scripts, or other tools:

$ <cli-name> agents list --json
[
  {"agentId": "agent-001", "hostname": "srv-alpha", "isOnline": true, ...},
  {"agentId": "agent-002", "hostname": "srv-beta", "isOnline": false, ...}
]
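
The table mode needs nothing more than column-width padding; a minimal renderer might look like:

```python
def render_table(headers, rows):
    """Left-align each column to its widest cell, two spaces between
    columns -- the same shape as the `agents list` output above."""
    cols = list(zip(*([headers] + rows)))            # transpose to columns
    widths = [max(len(str(c)) for c in col) for col in cols]
    lines = ("  ".join(str(c).ljust(w) for c, w in zip(row, widths)).rstrip()
             for row in [headers] + rows)
    return "\n".join(lines)
```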

7.5 Error Handling

Condition                         Behavior                    Exit Code
─────────────────────────────────────────────────────────────────────────
No config file                    "Run '<cli-name> login'"    1
Console unreachable               "Cannot reach <url>"        1
Agent not found (404)             "Agent '<id>' not found"    1
Command accepted (202)            Print confirmation          0
Auth failure (401/403)            "Authentication failed"     1
Unknown error                     Print status code + body    1

8. Packaging & Distribution Specification

8.1 Package Format: .deb

The agent is distributed as a Debian package (.deb) for Ubuntu servers.

What goes inside the .deb:

<agent-pkg>_<version>_amd64.deb
│
├── DEBIAN/                          Metadata + lifecycle scripts
│   ├── control                      Package name, version, deps, description
│   ├── conffiles                    List of config files (preserved on upgrade)
│   ├── postinst                     After install: create user, set perms, start
│   ├── prerm                        Before remove: stop service
│   └── postrm                       After remove: cleanup (on purge)
│
└── (filesystem overlay)
    ├── opt/<agent-pkg>/             Application binary + default config
    ├── etc/<agent-pkg>/             Admin-managed config (secrets, overrides)
    ├── etc/systemd/system/          Service unit file
    └── var/lib/<agent-pkg>/         Writable data directory (empty on install)
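
For illustration, a minimal DEBIAN/control might read as follows (package name, maintainer, and dependency list are placeholders, not mandated values):

```
Package: fleet-agent
Version: 1.2.0
Section: admin
Priority: optional
Architecture: amd64
Depends: systemd
Maintainer: Ops Team <ops@example.com>
Description: Fleet management agent
 Background daemon that connects managed apps on this server
 to the cloud console.
```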

8.2 Lifecycle Scripts

postinst (runs after files are placed):

1. Create system user (no login, no home) — idempotent
2. Create writable data directory if not exists
3. Set file ownership (binary dir, data dir, config) to service user
4. Set config file permissions to 600
5. Set binary as executable
6. systemctl daemon-reload
7. systemctl enable <service>
8. Start or restart the service (handle both fresh install and upgrade)

prerm (runs before files are removed):

1. Stop the service (if active)
2. Disable the service (ignore errors)

postrm (runs after files are removed):

1. On "purge" only: delete system user, remove data dir, remove config dir
2. Always: systemctl daemon-reload

8.3 Config Preservation on Upgrade

The config file is declared in conffiles. This ensures:

Fresh install:     dpkg places default config file
Upgrade:           dpkg detects admin has modified the file
                   → keeps admin's version (or prompts)
Purge:             dpkg removes config file

8.4 Distribution via GitHub Releases

For proof-of-concept or small-scale deployments, use GitHub Releases as the package registry:

GitHub Release: v1.2.0
│
├── <agent-pkg>_1.2.0_amd64.deb          Agent package
├── <cli-name>-linux-x64.tar.gz          CLI for Linux
├── <cli-name>-macos-x64.tar.gz          CLI for macOS Intel
└── <cli-name>-macos-arm64.tar.gz        CLI for macOS Apple Silicon

8.5 Convenience Installer Script

A one-line install command for admins:

curl -fsSL https://raw.githubusercontent.com/<org>/<repo>/main/scripts/install.sh | sudo bash

Script logic:

1. Accept optional version argument (default: query GitHub API for latest)
2. Download .deb from GitHub Releases to /tmp/
3. Verify file is non-empty
4. dpkg -i /tmp/<package>.deb
5. Print service status
6. Remind admin to edit config: /etc/<agent-pkg>/agent.env
7. Clean up temp file

8.6 End-User Workflows

First install:

sudo dpkg -i <agent-pkg>_1.0.0_amd64.deb
sudo nano /etc/<agent-pkg>/agent.env       # set console URL + agent ID
sudo systemctl restart <agent-pkg>

Upgrade:

sudo dpkg -i <agent-pkg>_1.1.0_amd64.deb
# postinst restarts service automatically
# config file preserved

Uninstall:

sudo dpkg -r <agent-pkg>       # remove binaries, keep config
sudo dpkg -P <agent-pkg>       # purge everything including config and user

9. CI/CD Pipeline Specification

9.1 Trigger

Pipeline runs on push of a semantic version tag: v*.*.* (e.g., v1.0.0, v2.3.1).

9.2 Job Graph

push tag v1.2.0
    │
    ▼
┌──────────┐     ┌──────────────┐
│   test   │────►│  build-agent │────┐
│          │     │    (.deb)    │    │
│ lint     │     └──────────────┘    │     ┌──────────────┐
│ build    │                         ├────►│   release    │
│ test     │     ┌──────────────┐    │     │              │
│          │────►│  build-cli   │────┘     │ create GH    │
│          │     │  (matrix)    │          │ release      │
└──────────┘     │  linux-x64   │          │ attach all   │
                 │  macos-x64   │          │ artifacts    │
                 │  macos-arm64 │          └──────────────┘
                 └──────────────┘

9.3 Job Details

Job 1: test

Runner: ubuntu-latest
Steps:
  1. Checkout code
  2. Setup language toolchain
  3. Install dependencies
  4. Lint / static analysis
  5. Build all projects
  6. Run unit + integration tests

Job 2: build-agent (needs: test)

Runner: ubuntu-latest
Steps:
  1. Checkout code
  2. Setup toolchain
  3. Extract version from tag (strip 'v' prefix)
  4. Build agent binary (self-contained, single file, linux-x64)
  5. Run packaging script: ./packaging/build-deb.sh <version>
  6. Upload .deb as workflow artifact

Job 3: build-cli (needs: test)

Runner: ubuntu-latest
Strategy matrix: [linux-x64, macos-x64, macos-arm64]
Steps (per target):
  1. Checkout code
  2. Setup toolchain
  3. Build CLI binary (self-contained, single file, target platform)
  4. Archive: tar.gz
  5. Upload archive as workflow artifact

Job 4: release (needs: build-agent, build-cli)

Runner: ubuntu-latest
Steps:
  1. Download all artifacts from previous jobs
  2. Create GitHub Release from tag
  3. Attach all artifacts (.deb + CLI tarballs)
  4. Auto-generate release notes from commits

9.4 Version Flow

git tag v1.2.0
    │
    ├──► GitHub Actions extracts: 1.2.0
    │
    ├──► Substituted into DEBIAN/control: Version: 1.2.0
    │
    ├──► Package file named: <agent-pkg>_1.2.0_amd64.deb
    │
    └──► GitHub Release titled: v1.2.0

Single source of truth for version: the git tag.
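
The tag-to-version extraction can be pinned down in a few lines, assuming tags follow the v*.*.* convention above (str.removeprefix requires Python 3.9+):

```python
def version_from_tag(ref):
    """Extract '1.2.0' from 'refs/tags/v1.2.0' or a bare 'v1.2.0'."""
    tag = ref.rsplit("/", 1)[-1]        # drop any refs/tags/ prefix
    return tag.removeprefix("v")        # strip the leading 'v' only
```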


10. Security Model

10.1 Trust Hierarchy

Level 1: Infrastructure Admin (highest trust)
────────────────────────────────────────────
  - SSH + sudo access to servers
  - Installs and uninstalls agent
  - Sets agent identity and secrets
  - Can override anything

Level 2: Cloud Console Operator
────────────────────────────────
  - Authenticated access to cloud dashboard / CLI
  - Manages apps THROUGH the agent
  - Can: push config, restart apps, trigger updates
  - Cannot: install/uninstall agent, access server filesystem,
            run arbitrary commands, modify agent identity

Level 3: Agent Process (least trust)
────────────────────────────────────
  - Runs as a restricted system user
  - Executes only predefined command types from cloud
  - Cannot: modify its own binary, escalate privileges,
            access other services, run arbitrary code

49.11.2 10.2 Agent Security Hardening (systemd)

# Run as dedicated unprivileged user
User=<service-user>
Group=<service-user>

# Filesystem restrictions
ProtectSystem=strict        # Entire FS read-only except declared paths
ProtectHome=true            # Cannot access /home/*
PrivateTmp=true             # Isolated /tmp
ReadWritePaths=/var/lib/<agent-pkg>    # Only writable path
ReadOnlyPaths=/opt/<agent-pkg>         # Binary dir is read-only at runtime

# Privilege restrictions
NoNewPrivileges=true        # Cannot gain new capabilities
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true

# Resource limits
MemoryMax=512M
CPUQuota=50%

49.11.3 10.3 Config File Protection

/etc/<agent-pkg>/agent.env
  Owner: <service-user>
  Mode: 600
  Contains: API keys, console URL, agent identity
  Only readable by: the agent process and root

~/.config/<cli-name>/config.json
  Owner: the admin user
  Mode: 600
  Contains: API key for cloud console
  Only readable by: that specific user and root
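One way to create a file with the ownership and mode described above is install(1), which sets owner and permissions in a single atomic step. This sketch writes under /tmp and uses the current user so it can run anywhere; a real postinst would target /etc/<agent-pkg>/ and the dedicated service user.

```shell
#!/bin/sh
# Sketch: creating the agent env file with restrictive mode.
# Path and user are stand-ins for /etc/<agent-pkg>/ and <service-user>.
set -eu

AGENT_PKG="agent-pkg"
SERVICE_USER="$(id -un)"     # in production: the dedicated service user

mkdir -p "/tmp/etc-demo/${AGENT_PKG}"
# install(1) creates the file, sets owner, and sets mode 600 in one step.
install -o "$SERVICE_USER" -m 600 /dev/null \
  "/tmp/etc-demo/${AGENT_PKG}/agent.env"

stat -c '%a' "/tmp/etc-demo/${AGENT_PKG}/agent.env"   # -> 600
```

Using install(1) rather than touch + chmod avoids a window in which the file exists with a looser default mode.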

49.12 11. Filesystem Layout Standard

49.12.1 11.1 Server (Agent + Managed Apps)

Follows the Linux Filesystem Hierarchy Standard (FHS):

Path                              Purpose                  Owner           Mode
──────────────────────────────────────────────────────────────────────────────────
/opt/<agent-pkg>/                 Agent binary + defaults   root (dpkg)     755
/etc/<agent-pkg>/agent.env        Agent secrets + config    <service-user>  600
/etc/systemd/system/              Service unit file         root (dpkg)     644
  <agent-pkg>.service
/var/lib/<agent-pkg>/             Agent runtime data        <service-user>  755
/var/log/<agent-pkg>/             Logs (if not journald)    <service-user>  755

/opt/<managed-app>/               App binary + defaults     root / app-user
/etc/<managed-app>/               App config                app-user
/etc/systemd/system/              App service unit          root
  <managed-app>.service

49.12.2 11.2 Admin Workstation (CLI)

Path                               Purpose
──────────────────────────────────────────────────
~/.config/<cli-name>/config.json   Console URL + API key (mode 600)
~/bin/<cli-name>  or               CLI executable
/usr/local/bin/<cli-name>

49.12.3 11.3 Why This Layout

PATH           WHY
──────────────────────────────────────────────────────────────────────
/opt/          Third-party software (not from distro repos). Agent binary
               lives here because it's not part of the Ubuntu distribution.

/etc/          Configuration. The admin edits files here. dpkg knows to
               preserve files listed in conffiles during upgrades.

/var/lib/      Variable persistent data. The agent writes runtime state,
               cached downloads, etc. here. Survives reboot, wiped on purge.

/var/log/      Logs. Only needed if not using journald (systemd's built-in
               logging). With journald, logs go to the journal automatically.

~/.config/     Per-user config (XDG standard). For interactive tools only.
               NEVER for system daemons.

49.13 12. User Stories & Scenarios

49.13.1 Story 1: First Server Onboarding

As an infrastructure admin,
I want to install the agent on a new Ubuntu server and have it appear
in the cloud console within seconds,
so that the server joins the managed fleet.

STEPS:
  1. Admin downloads .deb from GitHub Releases
     $ curl -LO https://github.com/org/repo/releases/download/v1.0.0/agent_1.0.0_amd64.deb

  2. Admin installs the package
     $ sudo dpkg -i agent_1.0.0_amd64.deb
     → postinst creates system user, enables and starts service

  3. Admin configures agent identity
     $ sudo nano /etc/<agent-pkg>/agent.env
     → sets console URL and agent ID

  4. Admin restarts to pick up config
     $ sudo systemctl restart <agent-pkg>

  5. Agent connects to cloud console, registers itself

  6. Cloud console dashboard shows new agent as "online"
     with hostname, OS, CPU/RAM metrics, and managed app list

ACCEPTANCE:
  - Agent appears in cloud console within 30 seconds of restart
  - Heartbeat begins immediately
  - Dashboard shows correct hostname and OS info

49.13.2 Story 2: Remote App Restart

As a cloud operator,
I want to restart a misbehaving application on a remote server
without SSH access,
so that I can resolve issues quickly from my desk.

STEPS:
  1. Operator notices app-a on agent-001 showing errors in dashboard

  2. Operator uses CLI:
     $ <cli-name> apps restart agent-001 app-a
     → "Restart command sent to app-a on agent-001."

  3. Cloud console sends RestartApp command to agent-001 via realtime hub

  4. Agent executes: systemctl restart app-a.service

  5. Agent checks: systemctl is-active app-a.service → "active"

  6. Agent reports success to cloud

  7. Audit log records: "app-a restarted on agent-001 by operator@cli"

  8. Dashboard shows app-a status returns to "Running"

ACCEPTANCE:
  - App restarts within 5 seconds of command
  - Operator sees confirmation in CLI
  - Audit log captures the event with timestamp
  - Dashboard reflects new status on next heartbeat
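Steps 4-5 of the agent's side can be sketched as a restart-and-verify routine. The SYSTEMCTL variable is an illustrative indirection so the sketch can run without systemd present; a real agent would invoke systemctl directly and report the result over the realtime hub.

```shell
#!/bin/sh
# Sketch of the agent's restart-and-verify logic (steps 4-5 above).
# SYSTEMCTL is parameterized only so the sketch is testable without systemd.
set -eu

SYSTEMCTL="${SYSTEMCTL:-systemctl}"

restart_app() {
    app="$1"
    "$SYSTEMCTL" restart "${app}.service" || return 1
    # Verify the unit actually came back up before reporting success.
    state="$("$SYSTEMCTL" is-active "${app}.service")" || true
    if [ "$state" = "active" ]; then
        echo "restart ok: ${app}"
    else
        echo "restart failed: ${app} is ${state}" >&2
        return 1
    fi
}
```

The explicit `is-active` check matters: `systemctl restart` can return success even when the service crashes moments later, so the agent confirms the final state before reporting to the cloud.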

49.13.3 Story 3: Fleet-Wide Config Push

As a cloud operator,
I want to change the log level to Debug on all instances of app-b
across the fleet,
so that I can diagnose a production issue.

STEPS:
  1. Operator lists online agents and notes which ones run app-b:
     $ <cli-name> agents list --online-only
     → agent-001, agent-002, agent-003

  2. Operator pushes config to each (or scripts it):
     $ for agent in agent-001 agent-002 agent-003; do
         <cli-name> apps config $agent app-b --set LogLevel=Debug
       done
     → "Config pushed to app-b on agent-001: LogLevel"
     → "Config pushed to app-b on agent-002: LogLevel"
     → "Config pushed to app-b on agent-003: LogLevel"

  3. Each agent receives PushConfig, writes to app-b's config file,
     restarts app-b

  4. Each agent reports success/failure

  5. Audit log shows 3 config change events

ACCEPTANCE:
  - All three agents apply the change within 10 seconds
  - Config files on each server reflect the new value
  - Apps restart automatically to pick up new config
  - Audit log shows all three events with correct agent IDs

49.13.4 Story 4: Rolling App Update

As a cloud operator,
I want to update app-a from v2.0 to v2.1 on one server first (canary),
verify it works, then roll out to the rest,
so that I can deploy safely.

STEPS:
  1. Operator targets canary server:
     $ <cli-name> apps update agent-001 app-a \
         --version 2.1.0 \
         --artifact https://artifacts.example.com/app-a-2.1.0.tar.gz
     → "Update command sent: app-a → v2.1.0 on agent-001."

  2. Agent on agent-001:
     a. Downloads artifact
     b. Stops app-a
     c. Backs up current version
     d. Extracts new version
     e. Starts app-a
     f. Health check passes
     g. Reports success

  3. Operator verifies canary:
     $ <cli-name> agents status agent-001
     → app-a shows version 2.1.0, status Running

  4. Operator waits, monitors, then rolls out to remaining servers

  5. If canary fails: agent rolls back automatically, reports failure
     Operator sees: "Update failed for app-a on agent-001: health check failed"

ACCEPTANCE:
  - Update completes within 60 seconds (download + swap + restart)
  - On success: version number updates in heartbeat
  - On failure: automatic rollback to previous version
  - Audit log records the entire sequence (start, download, stop, swap, start, result)
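The swap-with-rollback portion of steps b-g can be sketched as follows. The directory layout, artifact handling, and health check are illustrative placeholders, and the systemd stop/start calls around the swap are elided so the sketch stays self-contained; the spec does not mandate this exact implementation.

```shell
#!/bin/sh
# Sketch of the update sequence (steps c-g above) with automatic rollback.
# app_dir, artifact, and health_check are caller-supplied placeholders.
set -eu

update_app() {
    app_dir="$1"       # e.g. /opt/app-a
    artifact="$2"      # already-downloaded tarball (step a)
    health_check="$3"  # command that exits 0 when the app is healthy

    backup="${app_dir}.bak"
    rm -rf "$backup"
    mv "$app_dir" "$backup"             # back up current version (step c)
    mkdir -p "$app_dir"
    tar -xzf "$artifact" -C "$app_dir"  # extract new version (step d)

    if $health_check; then              # health check (step f)
        rm -rf "$backup"
        echo "update ok"
    else
        rm -rf "$app_dir"               # roll back to previous version
        mv "$backup" "$app_dir"
        echo "update failed: rolled back" >&2
        return 1
    fi
}
```

Keeping the old version as a sibling directory until the health check passes is what makes the rollback in step 5 a cheap rename rather than a re-download.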

49.13.5 Story 5: Agent Upgrade via dpkg

As an infrastructure admin,
I want to upgrade the agent itself to a new version using dpkg,
so that I get new features and bug fixes.

STEPS:
  1. Admin downloads new version:
     $ curl -LO https://github.com/org/repo/releases/download/v1.1.0/agent_1.1.0_amd64.deb

  2. Admin installs over existing:
     $ sudo dpkg -i agent_1.1.0_amd64.deb
     → dpkg replaces binary in /opt/<agent-pkg>/
     → dpkg preserves /etc/<agent-pkg>/agent.env (conffiles)
     → postinst restarts the service

  3. Agent comes back online with new version

  4. Cloud console sees agent reconnect
     → heartbeat shows new agent version
     → audit log: "agent registered (version 1.1.0)"

ACCEPTANCE:
  - Upgrade completes in under 10 seconds
  - Agent reconnects automatically after restart
  - Admin's config file is NOT overwritten
  - Managed apps are unaffected (they keep running during agent restart)

49.13.6 Story 6: Server Decommissioning

As an infrastructure admin,
I want to cleanly remove the agent from a server being decommissioned,
so that no fleet management artifacts remain.

STEPS:
  1. Admin purges the package:
     $ sudo dpkg -P <agent-pkg>
     → prerm: stops service, disables it
     → removes: binary, service unit, config
     → postrm (purge): deletes system user, data dir, config dir

  2. Cloud console sees agent heartbeat stop
     → status changes to "offline"

  3. Server is clean — no agent user, no files, no service

ACCEPTANCE:
  - No files remain under /opt/<agent-pkg>/, /etc/<agent-pkg>/, /var/lib/<agent-pkg>/
  - System user is deleted
  - systemd has no knowledge of the service
  - Cloud console shows agent as offline (does not auto-remove from registry)
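The purge behavior in step 1 can be sketched as a postrm maintainer script. dpkg invokes postrm with the action ("remove", "purge", etc.) as its first argument; the package name, user name, and PREFIX indirection below are illustrative placeholders, with PREFIX empty in production and set only for testing outside a real system.

```shell
#!/bin/sh
# Sketch of the postrm purge path (step 1 above), wrapped in a function.
# "agent-pkg", "agent-svc", and PREFIX are placeholders.
set -eu

PREFIX="${PREFIX:-}"
AGENT_PKG="agent-pkg"
SERVICE_USER="agent-svc"

postrm() {
    case "$1" in
      purge)
        # Delete runtime data and config dirs (dpkg removed packaged files).
        rm -rf "${PREFIX}/var/lib/${AGENT_PKG}" \
               "${PREFIX}/etc/${AGENT_PKG}"
        # Remove the dedicated system user; ignore if already gone.
        deluser --system "$SERVICE_USER" 2>/dev/null || true
        ;;
      remove|upgrade|failed-upgrade|abort-install|abort-upgrade)
        ;;   # plain remove keeps config, matching dpkg -r semantics
    esac
}
```

Splitting remove from purge is what lets `dpkg -r` keep the admin's config while `dpkg -P` leaves the server with no trace, as the acceptance criteria require.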

49.13.7 Story 7: New Developer Onboarding

As a new developer joining the team,
I want to run the entire system locally on my machine,
so that I can understand and develop against the architecture.

STEPS:
  1. Clone the repo:
     $ git clone https://github.com/org/repo.git

  2. Start the cloud console (terminal 1):
     $ <start-console-command>        # e.g., dotnet run, go run, npm start
     → Listening on http://localhost:5000

  3. Start an agent (terminal 2):
     $ <start-agent-command>
     → Connected to cloud hub
     → Registered as agent-dev-01

  4. Open browser at http://localhost:5000
     → Dashboard shows agent-dev-01 online with simulated apps

  5. Use CLI (terminal 3):
     $ <cli-name> login --url http://localhost:5000
     $ <cli-name> agents list
     → Shows agent-dev-01

  6. Start a second agent with a different ID (terminal 4):
     $ <start-agent-command> --agent-id agent-dev-02
     → Dashboard now shows two agents

ACCEPTANCE:
  - Full system runs locally with no external dependencies
  - Agent uses simulated apps (no real systemd services needed)
  - CLI connects to local console
  - Developer can exercise all commands against local setup

49.14 13. Glossary

Term                 Definition
────────────────────────────────────────────────────────────────────────
Agent                A daemon running on each managed server. It connects
                     to the cloud console, reports health, and executes
                     management commands. It manages apps through the OS,
                     not through application code.

Managed App          Any application running as a systemd service on a
                     managed server. It performs business logic and is
                     unaware of the agent or cloud console.

Cloud Console        The centralized server that all agents connect to.
                     It provides a dashboard, REST API, and realtime hub
                     for monitoring and managing the fleet.

Control Plane        The command and decision path: cloud → agent → app.
                     Admin says "restart app-a", command flows down.

Data Plane           The telemetry and status path: app → agent → cloud.
                     Metrics and events flow up for visibility.

CLI Tool             A command-line application run by admins on their
                     workstation. It talks to the cloud console REST API.
                     It is a thin client, not a daemon.

Heartbeat            A periodic message from agent to cloud carrying
                     system metrics and managed app statuses.

Audit Event          An immutable record of a management action
                     (config change, restart, update, etc.) stored
                     in the cloud console.

Realtime Hub         The persistent bidirectional communication channel
                     between agents and the cloud console (WebSocket,
                     gRPC stream, SSE, or similar).

conffiles            A Debian packaging concept: config files listed here
                     are preserved when the package is upgraded. The
                     admin's modifications are not overwritten.

FHS                  Filesystem Hierarchy Standard. The Linux convention
                     for where files go: /opt for third-party apps,
                     /etc for config, /var for variable data.

XDG Base Directory   A freedesktop.org standard defining where per-user
                     config (~/.config/), data (~/.local/share/), and
                     cache (~/.cache/) should go.

systemd              The init system and service manager on modern Linux.
                     It starts, stops, and supervises all services
                     including both the agent and managed apps.

dpkg                 The low-level Debian package manager. It installs,
                     removes, and manages .deb packages.

Self-contained       A build mode where the application binary includes
binary               the language runtime, so no runtime needs to be
                     pre-installed on the target server.

End of specification.