49 Fleet Management Architecture — Specification Document
A pattern for centralized cloud management of distributed software running on bare-metal Linux servers.
Version: 1.0
Status: Reference Specification
Audience: Developers, architects, and DevOps engineers implementing agent-based remote management systems.
49.1 Table of Contents
- Problem Statement
- Architecture Overview
- Component Specification
- Communication Specification
- Agent Lifecycle Specification
- Managed App Specification
- CLI Tool Specification
- Packaging & Distribution Specification
- CI/CD Pipeline Specification
- Security Model
- Filesystem Layout Standard
- User Stories & Scenarios
- Glossary
49.2 1. Problem Statement
An organization operates multiple Linux servers, each running one or more business applications. Without centralized management, every operational task — checking health, updating config, deploying new versions, reading logs — requires SSH access to individual servers. This approach fails to scale and lacks auditability.
WITHOUT fleet management:
Admin needs to check 10 servers:
ssh srv-01 → systemctl status app-a → exit
ssh srv-02 → systemctl status app-a → exit
ssh srv-03 → systemctl status app-b → exit
... (repeat 10 times)
Admin needs to push a config change:
ssh srv-01 → nano config → restart → exit
ssh srv-02 → nano config → restart → exit
... (did I miss one? did I typo on srv-07?)
WITH fleet management:
Admin opens dashboard (or CLI):
→ sees all 10 servers, all apps, all metrics, one screen
→ pushes config to all servers in one action
→ full audit trail of who changed what, when
49.2.1 Goals
- Monitor the health and status of all servers and applications from a single point.
- Push configuration changes remotely without SSH.
- Trigger application restarts and updates remotely.
- Maintain a complete audit trail of all management actions.
- Support the above without modifying the managed applications themselves.
49.2.2 Non-Goals
- Container orchestration (this is for bare-metal / VM workloads).
- Application-level logic (the fleet system manages processes, not business rules).
- Replacing SSH entirely (admin access is still needed for infrastructure tasks).
49.3 2. Architecture Overview
49.3.1 2.1 Component Map
The system consists of four components across two deployment zones:
CLOUD ZONE (single deployment)
┌──────────────────────────────────────────────────────────────┐
│ CLOUD CONSOLE │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────────────┐ │
│ │ Dashboard │ │ REST API │ │ Realtime Hub │ │
│ │ (Web UI) │ │ (for CLI & │ │ (for agents, │ │
│ │ │ │ dashboard) │ │ e.g. WebSocket) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬─────────────┘ │
│ │ │ │ │
│ └─────────────────┼──────────────────┘ │
│ │ │
│ ┌──────▼───────┐ │
│ │ Data Store │ Agent registry, │
│ │ │ audit log, config store │
│ └──────────────┘ │
└───────────────────────────┬──────────────────────────────────┘
│
TLS / mTLS encrypted channel
│
SERVER ZONE (one per physical server)
┌──────────────────┼──────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ Server A │ │ Server B │ │ Server C │
│ │ │ │ │ │
│ ┌───────┐ │ │ ┌───────┐ │ │ ┌───────┐ │
│ │ AGENT │ │ │ │ AGENT │ │ │ │ AGENT │ │
│ └───┬───┘ │ │ └───┬───┘ │ │ └───┬───┘ │
│ │ │ │ │ │ │ │ │
│ ┌───▼───┐ │ │ ┌───▼───┐ │ │ ┌───▼───┐ │
│ │App 1 │ │ │ │App 1 │ │ │ │App 2 │ │
│ │App 2 │ │ │ │App 3 │ │ │ │App 3 │ │
│ └───────┘ │ │ └───────┘ │ │ └───────┘ │
└─────────────┘ └─────────────┘ └─────────────┘
ADMIN ZONE (any workstation)
┌───────────────┐
│ CLI TOOL │ ──── HTTP/REST ────► Cloud Console API
│ (terminal) │
└───────────────┘
49.3.2 2.2 Control Plane vs Data Plane
The cloud console serves two distinct roles:
CONTROL PLANE DATA PLANE
(decision-making) (information flow)
───────────────── ──────────────────
"Restart app-a on server-01" "Server-01 CPU is 45%"
"Push LogLevel=Debug to app-b" "App-a version is 2.1.0"
"Update app-a to v2.2" "Config change applied at 10:30"
Commands flow DOWN Telemetry flows UP
Cloud → Agent → Managed App Managed App → Agent → Cloud
Initiated by: Initiated by:
- Admin (via CLI or dashboard) - Agent (heartbeat loop)
- Automated rules (future) - Agent (event reporting)
49.3.3 2.3 Awareness Model
A critical design property — awareness flows in one direction only:
Cloud Console
│ knows about
▼
Agent
│ knows about
▼
Managed Apps
│ knows about
▼
❌ NOTHING above it
Implication: managed apps require zero modification to be managed. The agent wraps management around existing services. The cloud never communicates directly with managed apps.
49.3.4 2.4 Component Relationships
Component Depends On Talks To Unaware Of
──────────────────────────────────────────────────────────────────────
Cloud Console Data store Agents (realtime) Managed apps directly
Agent Cloud Console Cloud (realtime) Other agents
OS (systemd) Managed apps (local)
Managed App OS (systemd) Its own clients/DBs Agent, Cloud, other apps
CLI Tool Cloud Console Cloud (REST API) Agents, managed apps
49.4 3. Component Specification
49.4.1 3.1 Cloud Console
What it is: A server-side application deployed centrally (cloud, on-prem VM, etc.) that acts as the single management point for the entire fleet.
Responsibilities:
Category Capability Interface
────────────────────────────────────────────────────────────────
Monitoring Receive and store agent heartbeats Realtime hub (inbound)
Track agent online/offline status Realtime hub (inbound)
Display fleet-wide dashboard Web UI
Auditing Store all management events Realtime hub (inbound)
Provide audit log queries REST API
Configuration Send config changes to agents Realtime hub (outbound)
Store config history Data store
Updates Send update commands to agents Realtime hub (outbound)
Manage artifact references Data store
API Expose fleet state to CLI/UI REST API
Expose command endpoints to CLI/UI REST API
Subcomponents:
┌─────────────────────────────────────────────────────────────┐
│ CLOUD CONSOLE │
│ │
│ ┌─────────────────┐ ┌────────────────────┐ │
│ │ Realtime Hub │ │ REST API │ │
│ │ │ │ │ │
│ │ Agent connects │ │ GET /agents │ │
│ │ here via │ │ GET /agents/:id │ │
│ │ persistent │ │ GET /audit │ │
│ │ connection │ │ POST /agents/:id/ │ │
│ │ (WebSocket, │ │ restart-app │ │
│ │ gRPC stream, │ │ POST /agents/:id/ │ │
│ │ SSE, etc.) │ │ push-config │ │
│ │ │ │ POST /agents/:id/ │ │
│ └────────┬────────┘ │ update-app │ │
│ │ └──────────┬─────────┘ │
│ │ │ │
│ └────────────┬───────────┘ │
│ │ │
│ ┌──────▼──────┐ │
│ │ Data Store │ Agent registry │
│ │ │ Heartbeat history │
│ │ (DB, file, │ Audit event log │
│ │ in-memory)│ Config snapshots │
│ └─────────────┘ │
│ │
│ ┌─────────────────┐ │
│ │ Web Dashboard │ (optional: can be a separate app) │
│ │ or SPA │ Consumes REST API + realtime events │
│ └─────────────────┘ │
└─────────────────────────────────────────────────────────────┘
49.4.2 3.2 Agent
What it is: A lightweight background daemon (system service) running on each managed Ubuntu server. It is the cloud console’s representative on the local machine.
Responsibilities:
Direction Capability
─────────────────────────────────────────────────────────
Agent → Cloud Register self on startup
Send periodic heartbeat (system metrics + app status)
Report audit events (config applied, app restarted, etc.)
Report command execution results (success/failure)
Cloud → Agent Receive and execute: restart app
Receive and execute: push config to app
Receive and execute: update app to new version
Receive: request immediate heartbeat (ping)
Agent → Local OS Monitor managed app status (systemd queries)
Start/stop/restart managed apps (systemd commands)
Read/write managed app config files
Download and swap managed app binaries
What the agent is NOT:
The agent is NOT:
✗ A business application (it has no domain logic)
✗ A general-purpose remote execution engine (no arbitrary commands)
✗ A replacement for SSH (admin still needs SSH for infrastructure tasks)
✗ A container runtime (it manages native Linux services)
✗ Self-aware enough to update itself (admin updates the agent via dpkg/apt)
49.4.3 3.3 Managed App
What it is: Any application running as a systemd service on the server. The managed app performs actual business work (serving requests, processing data, etc.). It does not know it is being managed.
Properties:
Property Requirement
───────────────────────────────────────────────────────────────
Process model Must run as a systemd service
Awareness of agent None required (zero coupling)
Awareness of cloud None required (zero coupling)
Config format File-based (JSON, YAML, env, INI — agent must know which)
Version identification Must be discoverable (file, CLI flag, or env var)
Health check Ideally: responds to a health endpoint or exit code
Source code changes needed None
Relationship to agent:
Agent manages apps through the OPERATING SYSTEM layer,
not through the application layer:
Agent ──► systemctl restart app-a (not: app-a.restart())
Agent ──► writes /opt/app-a/config.json (not: app-a.setConfig())
Agent ──► reads systemctl is-active (not: app-a.getStatus())
This is what makes managed apps UNAWARE of the agent.
The agent uses the same interfaces an admin would use via SSH.
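The OS-layer relationship above can be sketched in code. This is an illustrative Python sketch (the spec is implementation-language-agnostic); the function names are assumptions, but the `systemctl` invocations are exactly those listed above.

```python
# The agent drives managed apps through systemd, never through the app's
# own API. These helpers build the argv lists the agent would execute.
import subprocess

def restart_cmd(service: str) -> list[str]:
    """Argv for restarting a managed app via the OS layer."""
    return ["systemctl", "restart", service]

def is_active_cmd(service: str) -> list[str]:
    """Argv for querying a managed app's status via the OS layer."""
    return ["systemctl", "is-active", service]

def run(argv: list[str]) -> bool:
    """Execute and report success/failure back to the cloud."""
    return subprocess.run(argv, capture_output=True).returncode == 0
```

Because these are the same commands an admin would type over SSH, the managed app needs no awareness of the agent.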
49.4.4 3.4 CLI Tool
What it is: A command-line application that admins run on their workstations (macOS, Linux, WSL) to interact with the cloud console. It is a thin client over the REST API.
Properties:
Property Value
──────────────────────────────────────────────────────────
Runs on Admin's workstation (not on servers)
Lifecycle Invoked → executes → exits (not a daemon)
Talks to Cloud Console REST API (HTTP)
Does NOT talk to Agents directly, managed apps directly
Config location ~/.config/<cli-name>/ (XDG standard)
Auth API key or token stored in config file
Output modes Human-readable table (default), JSON (--json)
49.5 4. Communication Specification
49.5.1 4.1 Communication Channels
Channel Between Transport Direction
───────────────────────────────────────────────────────────────────────────
Realtime Hub Agent ↔ Cloud WebSocket / Bidirectional
gRPC stream / persistent
SSE+HTTP connection
REST API CLI → Cloud HTTP/HTTPS Request-response
Dashboard → Cloud
Local OS Agent → Managed App systemd (dbus) / Local only
filesystem
49.5.2 4.2 Realtime Hub Messages
Agent → Cloud (inbound):
Message Payload When
───────────────────────────────────────────────────────────────────────
Register Agent ID, hostname, OS info, On first connect
list of managed apps and reconnect
Heartbeat Agent ID, timestamp, Periodic
CPU/RAM/disk metrics, (every N seconds)
list of managed apps with status
AuditEvent Agent ID, timestamp, On any notable event
category, message
CommandResult Agent ID, command type, After executing
success/failure, message a cloud command
Cloud → Agent (outbound):
Message Payload When
───────────────────────────────────────────────────────────────────────
RestartApp App ID On admin request
PushConfig App ID, On admin request
key-value settings map
UpdateApp App ID, On admin request
target version,
artifact download URL
RequestHeartbeat (none) On-demand ping
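The message tables above can be modeled as tagged payloads. A minimal Python sketch follows; the field names and the JSON-with-type-tag envelope are illustrative assumptions, not a wire-format requirement.

```python
# Illustrative shapes for two realtime-hub messages: one inbound
# (Heartbeat, Agent → Cloud) and one outbound (RestartApp, Cloud → Agent).
from dataclasses import dataclass, field, asdict
import json, time

@dataclass
class Heartbeat:                    # Agent → Cloud, periodic
    agent_id: str
    timestamp: float = field(default_factory=time.time)
    cpu_percent: float = 0.0
    mem_mb: int = 0
    apps: list[dict] = field(default_factory=list)  # [{"id": ..., "status": ...}]

@dataclass
class RestartApp:                   # Cloud → Agent, on admin request
    app_id: str

def encode(msg) -> str:
    """Serialize with a type tag so the receiving side can dispatch."""
    return json.dumps({"type": type(msg).__name__, "payload": asdict(msg)})
```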
49.5.3 4.3 REST API Surface
The cloud console exposes these endpoints for the CLI and dashboard:
Method Path Purpose
───────────────────────────────────────────────────────────
GET /api/agents List all agents with current state
GET /api/agents/:agentId Get single agent detail
GET /api/audit Get recent audit events
POST /api/agents/:agentId/ Send restart command
restart-app
POST /api/agents/:agentId/ Send config push command
push-config
POST /api/agents/:agentId/ Send update command
update-app
All POST endpoints return 202 Accepted (the command is dispatched asynchronously to the agent via the realtime hub). The actual result arrives later as a CommandResult message from the agent.
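The 202-Accepted dispatch pattern can be sketched as follows. The per-agent outbox queue and command ids are assumptions for illustration; the point is that the HTTP handler returns before the agent has executed anything.

```python
# Minimal sketch of asynchronous command dispatch: the REST handler
# enqueues the command for the agent's hub connection and returns 202.
# The real result arrives later as a CommandResult message.
import uuid
from collections import defaultdict

outbox: dict[str, list[dict]] = defaultdict(list)   # agent_id -> queued commands

def post_restart_app(agent_id: str, app_id: str) -> tuple[int, dict]:
    command = {"id": str(uuid.uuid4()), "type": "RestartApp", "app_id": app_id}
    outbox[agent_id].append(command)      # realtime hub drains this queue
    return 202, {"commandId": command["id"]}
```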
49.5.4 4.4 Connection Resilience
The agent must handle network disruptions gracefully:
Scenario Agent Behavior
──────────────────────────────────────────────────────────────
Cloud unreachable Retry connection with exponential backoff
Continue running (managed apps unaffected)
Queue audit events for delivery on reconnect
Connection dropped Auto-reconnect (built into realtime client)
Re-register on reconnect
Cloud comes back Agent reconnects, re-registers, resumes heartbeat
Cloud sees agent as "online" again
Agent restarts Connect on startup, register, begin heartbeat loop
Managed apps are unaffected (they're separate processes)
49.6 5. Agent Lifecycle Specification
49.6.1 5.1 Who Can Do What
Action Local Admin Cloud Console Why
────────────────────────────────────────────────────────────────────
Install Agent ✅ YES ❌ NO Agent doesn't exist yet;
trust decision requires
physical/SSH access
Upgrade Agent ✅ YES ❌ NO * Agent can't safely replace
(dpkg/apt) its own binary while running;
dpkg + systemd handle this
atomically
Configure Agent ✅ YES ❌ NO Agent identity and cloud URL
(edit env) must be set before it can
connect to receive commands
Restart Agent ✅ YES ⚠️ AVOID If restart fails, cloud loses
(systemctl) contact with no recovery path
Uninstall Agent ✅ YES ❌ NO Security: compromised cloud
(dpkg -r) must not be able to remove
agents fleet-wide
Restart App ✅ YES ✅ YES Both paths valid; cloud
preferred for auditability
Configure App ✅ YES ✅ YES Cloud preferred for
centralization and audit trail
Update App ✅ YES ✅ YES Cloud preferred; agent
handles download/swap/restart
* Advanced systems may support cloud-triggered agent self-update via a helper script, but this adds significant complexity and is not part of this base specification.
49.6.2 5.2 Agent Startup Sequence
1. Process starts (launched by systemd)
│
2. Read local config (env file + app settings)
│
3. Build realtime connection to cloud console
│
4. Connect with retry (exponential backoff on failure)
│
5. Register: send agent ID, hostname, OS, managed app list
│
6. Register command handlers (for incoming cloud commands)
│
7. Enter heartbeat loop:
│
├──► Collect system metrics (CPU, RAM, disk)
├──► Collect managed app statuses (systemd queries)
├──► Send heartbeat to cloud
├──► Sleep N seconds
└──► Repeat
│
8. On shutdown signal (SIGTERM from systemd):
│
9. Graceful disconnect from cloud
│
10. Process exits
49.6.3 5.3 Reconnection Behavior
Connection lost
│
▼
Wait 0 seconds → attempt reconnect
│ failure
▼
Wait 2 seconds → attempt reconnect
│ failure
▼
Wait 5 seconds → attempt reconnect
│ failure
▼
Wait 10 seconds → attempt reconnect
│ failure
▼
Wait 30 seconds → attempt reconnect (cap here)
│ ...repeat at 30s intervals...
│
│ success
▼
Re-register with cloud (send full agent info again)
Resume heartbeat loop
During disconnection: managed apps continue running normally. Only remote monitoring and command execution are interrupted.
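The reconnect schedule above (0, 2, 5, 10, then capped at 30 seconds) can be expressed as a generator. The specific step values are this spec's example; an implementation may tune them.

```python
# Exponential-ish backoff with a cap, matching the schedule in 5.3.
# The agent retries forever: managed apps don't depend on the connection.
from typing import Iterator

def reconnect_delays() -> Iterator[int]:
    for d in (0, 2, 5, 10):
        yield d
    while True:            # cap here: repeat at 30-second intervals
        yield 30
```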
49.7 6. Managed App Specification
49.7.1 6.1 How an Existing App Becomes “Managed”
No source code changes required. The agent discovers or is configured to know about apps:
Step 1: The app already exists as a systemd service
/etc/systemd/system/my-app.service
/opt/my-app/my-app-binary
Step 2: Agent configuration declares it
In the agent's config (env, JSON, YAML, or registry):
managed_apps:
- id: my-app
service_name: my-app.service
binary_path: /opt/my-app/
config_path: /opt/my-app/appsettings.json
version_file: /opt/my-app/version.txt
Step 3: Agent starts monitoring it
Agent reads: systemctl is-active my-app.service
Agent reads: cat /opt/my-app/version.txt
Agent reports status in each heartbeat
Step 4: Cloud console can now manage it
Restart: agent runs → systemctl restart my-app.service
Config: agent writes → /opt/my-app/appsettings.json
Update: agent downloads new binary → stops → swaps → starts
49.7.2 6.2 Agent Operations on Managed Apps
Restart:
1. Agent receives RestartApp command
2. Agent runs: systemctl restart <service_name>
3. Agent waits briefly, checks: systemctl is-active <service_name>
4. Agent reports result to cloud (success or failure with message)
5. Agent logs audit event
Config Push:
1. Agent receives PushConfig command (app ID + key-value map)
2. Agent reads current config file for the app
3. Agent merges new settings into the config
4. Agent writes updated config file
5. Agent restarts the app (to pick up new config)
6. Agent reports result to cloud
7. Agent logs audit event with changed keys
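Steps 2–4 and 7 of the config push can be sketched as a read-merge-write, shown here for a JSON config file (the spec allows JSON, YAML, env, or INI). The function name and flat key-value merge are illustrative.

```python
# Merge pushed settings into the app's config file and return the list
# of changed keys, which the agent includes in its audit event.
import json
from pathlib import Path

def push_config(config_path: Path, settings: dict) -> list[str]:
    current = json.loads(config_path.read_text()) if config_path.exists() else {}
    changed = [k for k, v in settings.items() if current.get(k) != v]
    current.update(settings)                          # merge new settings
    config_path.write_text(json.dumps(current, indent=2))
    return changed
```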
Update:
1. Agent receives UpdateApp command (app ID, version, artifact URL)
2. Agent downloads artifact from URL to temp directory
3. Agent verifies artifact integrity (checksum if provided)
4. Agent stops the app: systemctl stop <service_name>
5. Agent backs up current binary directory
6. Agent extracts new version into place
7. Agent starts the app: systemctl start <service_name>
8. Agent performs health check (is-active + optional endpoint check)
9. On SUCCESS: agent reports success, removes backup
10. On FAILURE: agent restores backup, restarts old version, reports failure
11. Agent logs audit event with version transition
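The backup-swap-rollback core of the update flow (steps 4–10) can be sketched on plain directories. Download and `systemctl` calls are stubbed out here (the health check is passed in as a callable) so the rollback logic stands alone; the backup-path convention is an assumption.

```python
# Swap a new version into place; on a failed health check, restore the
# backup so the old version keeps running (steps 5-10 of 6.2 Update).
import shutil
from pathlib import Path

def update_app(app_dir: Path, new_version_dir: Path, health_check) -> bool:
    backup = app_dir.with_suffix(".bak")
    shutil.copytree(app_dir, backup)            # 5. back up current version
    shutil.rmtree(app_dir)
    shutil.copytree(new_version_dir, app_dir)   # 6. extract new version into place
    if health_check():                          # 8. health check after restart
        shutil.rmtree(backup)                   # 9. success: remove backup
        return True
    shutil.rmtree(app_dir)                      # 10. failure: restore backup
    shutil.move(str(backup), str(app_dir))
    return False
```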
49.7.3 6.3 Failure Isolation
Failure Impact on Agent Impact on Other Apps
──────────────────────────────────────────────────────────────────────
Managed app crashes None None
Agent crashes — None (apps keep running)
Cloud console goes down Agent retries None
Network outage Agent retries None
Bad config pushed One app affected None (per-app config)
Bad update deployed One app affected None (per-app update)
(agent rolls back)
This isolation is the fundamental architectural guarantee: no single failure cascades across the system.
49.8 7. CLI Tool Specification
49.8.1 7.1 Design Principles
Principle Rationale
──────────────────────────────────────────────────────────────────
Stateless No local database; all data from cloud API
Config in ~/.config/ XDG standard for per-user interactive tools
Table output by default Human-readable for interactive use
JSON output with --json Machine-readable for scripting and piping
Non-zero exit on error Standard Unix convention for automation
No DI framework CLI tools should be simple; manual wiring
API key in config file Protected with file mode 600
49.8.2 7.2 Command Tree
<cli-name>
│
├── login Configure console URL and credentials
│ --url <console-url> (required)
│ --api-key <key> (optional; prompt interactively if omitted)
│
├── agents Fleet visibility
│ ├── list Tabular overview of all agents
│ │ --json JSON output
│ │ --online-only Filter to connected agents
│ │
│ └── status <agent-id> Detailed view of one agent
│ --json JSON output
│
├── apps Remote app management
│ ├── restart <agent-id> <app-id>
│ │
│ ├── config <agent-id> <app-id>
│ │ --set <key=value> Repeatable; one or more settings
│ │
│ └── update <agent-id> <app-id>
│ --version <version> Target version (required)
│ --artifact <url> Download URL (required)
│
└── audit Audit log viewer
--last <n> Number of entries (default: 20)
--agent <agent-id> Filter by agent
--json JSON output
49.8.3 7.3 Config File
Location: ~/.config/<cli-name>/config.json
{
"consoleUrl": "https://console.example.com",
"apiKey": "sk-..."
}
File permissions: 600 (owner read/write only).
Created by the login command. All other commands check for its existence and print a clear error if missing: Not configured. Run '<cli-name> login' first.
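The existence check and permission handling can be sketched as follows (in Python here; the CLI's implementation language is unspecified). The error text matches the spec; tightening permissions on load is an illustrative hardening choice, not a requirement.

```python
# Load the CLI config, failing with the spec's error if it is missing,
# and enforcing mode 600 since the file holds the API key.
import json, os, sys
from pathlib import Path

def load_config(path: Path, cli_name: str = "<cli-name>") -> dict:
    if not path.exists():
        print(f"Not configured. Run '{cli_name} login' first.", file=sys.stderr)
        sys.exit(1)
    if (path.stat().st_mode & 0o777) != 0o600:
        os.chmod(path, 0o600)          # key file must be owner-only
    return json.loads(path.read_text())
```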
49.8.4 7.4 Output Formats
Table (default) — designed for human scanning:
$ <cli-name> agents list
AGENT ID HOSTNAME STATUS CPU MEM (MB) APPS
agent-001 srv-alpha online 23.4% 128 2
agent-002 srv-beta offline - - 3
JSON (--json) — designed for piping to jq, scripts, or other tools:
$ <cli-name> agents list --json
[
{"agentId": "agent-001", "hostname": "srv-alpha", "isOnline": true, ...},
{"agentId": "agent-002", "hostname": "srv-beta", "isOnline": false, ...}
]
49.8.5 7.5 Error Handling
Condition Behavior Exit Code
─────────────────────────────────────────────────────────────────────────
No config file "Run '<cli-name> login'" 1
Console unreachable "Cannot reach <url>" 1
Agent not found (404) "Agent '<id>' not found" 1
Command accepted (202) Print confirmation 0
Auth failure (401/403) "Authentication failed" 1
Unknown error Print status code + body 1
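The error table above maps directly to code. A sketch, assuming a status of 0 as the transport-failure sentinel (connection refused, DNS failure, etc.) — that sentinel and the function name are illustrative:

```python
# Map an HTTP outcome to (message, exit code) per the 7.5 error table.
def outcome(status: int, url: str = "", agent_id: str = "") -> tuple[str, int]:
    if status == 0:                       # transport failure, e.g. unreachable
        return (f"Cannot reach {url}", 1)
    if status == 202:
        return ("Command accepted.", 0)   # only success exits 0
    if status in (401, 403):
        return ("Authentication failed", 1)
    if status == 404:
        return (f"Agent '{agent_id}' not found", 1)
    return (f"Unexpected response: {status}", 1)
```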
49.9 8. Packaging & Distribution Specification
49.9.1 8.1 Package Format: .deb
The agent is distributed as a Debian package (.deb) for Ubuntu servers.
What goes inside the .deb:
<agent-pkg>_<version>_amd64.deb
│
├── DEBIAN/ Metadata + lifecycle scripts
│ ├── control Package name, version, deps, description
│ ├── conffiles List of config files (preserved on upgrade)
│ ├── postinst After install: create user, set perms, start
│ ├── prerm Before remove: stop service
│ └── postrm After remove: cleanup (on purge)
│
└── (filesystem overlay)
├── opt/<agent-pkg>/ Application binary + default config
├── etc/<agent-pkg>/ Admin-managed config (secrets, overrides)
├── etc/systemd/system/ Service unit file
└── var/lib/<agent-pkg>/ Writable data directory (empty on install)
49.9.2 8.2 Lifecycle Scripts
postinst (runs after files are placed):
1. Create system user (no login, no home) — idempotent
2. Create writable data directory if not exists
3. Set file ownership (binary dir, data dir, config) to service user
4. Set config file permissions to 600
5. Set binary as executable
6. systemctl daemon-reload
7. systemctl enable <service>
8. Start or restart the service (handle both fresh install and upgrade)
prerm (runs before files are removed):
1. Stop the service (if active)
2. Disable the service (ignore errors)
postrm (runs after files are removed):
1. On "purge" only: delete system user, remove data dir, remove config dir
2. Always: systemctl daemon-reload
49.9.3 8.3 Config Preservation on Upgrade
The config file is declared in conffiles. This ensures:
Fresh install: dpkg places default config file
Upgrade: dpkg detects admin has modified the file
→ keeps admin's version (or prompts)
Purge: dpkg removes config file
49.9.4 8.4 Distribution via GitHub Releases
For proof-of-concept or small-scale deployments, use GitHub Releases as the package registry:
GitHub Release: v1.2.0
│
├── <agent-pkg>_1.2.0_amd64.deb Agent package
├── <cli-name>-linux-x64.tar.gz CLI for Linux
├── <cli-name>-macos-x64.tar.gz CLI for macOS Intel
└── <cli-name>-macos-arm64.tar.gz CLI for macOS Apple Silicon
49.9.5 8.5 Convenience Installer Script
A one-line install command for admins:
curl -fsSL https://raw.githubusercontent.com/<org>/<repo>/main/scripts/install.sh | sudo bash
Script logic:
1. Accept optional version argument (default: query GitHub API for latest)
2. Download .deb from GitHub Releases to /tmp/
3. Verify file is non-empty
4. dpkg -i /tmp/<package>.deb
5. Print service status
6. Remind admin to edit config: /etc/<agent-pkg>/agent.env
7. Clean up temp file
49.9.6 8.6 End-User Workflows
First install:
sudo dpkg -i <agent-pkg>_1.0.0_amd64.deb
sudo nano /etc/<agent-pkg>/agent.env # set console URL + agent ID
sudo systemctl restart <agent-pkg>
Upgrade:
sudo dpkg -i <agent-pkg>_1.1.0_amd64.deb
# postinst restarts service automatically
# config file preserved
Uninstall:
sudo dpkg -r <agent-pkg> # remove binaries, keep config
sudo dpkg -P <agent-pkg> # purge everything including config and user
49.10 9. CI/CD Pipeline Specification
49.10.1 9.1 Trigger
Pipeline runs on push of a semantic version tag: v*.*.* (e.g., v1.0.0, v2.3.1).
49.10.2 9.2 Job Graph
push tag v1.2.0
│
▼
┌──────────┐ ┌──────────────┐
│ test │────►│ build-agent │────┐
│ │ │ (.deb) │ │
│ lint │ └──────────────┘ │ ┌──────────────┐
│ build │ ├────►│ release │
│ test │ ┌──────────────┐ │ │ │
│ │────►│ build-cli │────┘ │ create GH │
│ │ │ (matrix) │ │ release │
└──────────┘ │ linux-x64 │ │ attach all │
│ macos-x64 │ │ artifacts │
│ macos-arm64 │ └──────────────┘
└──────────────┘
49.10.3 9.3 Job Details
Job 1: test
Runner: ubuntu-latest
Steps:
1. Checkout code
2. Setup language toolchain
3. Install dependencies
4. Lint / static analysis
5. Build all projects
6. Run unit + integration tests
Job 2: build-agent (needs: test)
Runner: ubuntu-latest
Steps:
1. Checkout code
2. Setup toolchain
3. Extract version from tag (strip 'v' prefix)
4. Build agent binary (self-contained, single file, linux-x64)
5. Run packaging script: ./packaging/build-deb.sh <version>
6. Upload .deb as workflow artifact
Job 3: build-cli (needs: test)
Runner: ubuntu-latest
Strategy matrix: [linux-x64, macos-x64, macos-arm64]
Steps (per target):
1. Checkout code
2. Setup toolchain
3. Build CLI binary (self-contained, single file, target platform)
4. Archive: tar.gz
5. Upload archive as workflow artifact
Job 4: release (needs: build-agent, build-cli)
Runner: ubuntu-latest
Steps:
1. Download all artifacts from previous jobs
2. Create GitHub Release from tag
3. Attach all artifacts (.deb + CLI tarballs)
4. Auto-generate release notes from commits
49.10.4 9.4 Version Flow
git tag v1.2.0
│
├──► GitHub Actions extracts: 1.2.0
│
├──► Substituted into DEBIAN/control: Version: 1.2.0
│
├──► Package file named: <agent-pkg>_1.2.0_amd64.deb
│
└──► GitHub Release titled: v1.2.0
Single source of truth for version: the git tag.
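The tag-to-version extraction shown above is a one-liner; sketched here in Python (CI systems typically do this in shell, e.g. `${TAG#v}`):

```python
# Strip the leading 'v' from a semver tag to get the package version.
def version_from_tag(tag: str) -> str:
    return tag[1:] if tag.startswith("v") else tag
```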
49.11 10. Security Model
49.11.1 10.1 Trust Hierarchy
Level 1: Infrastructure Admin (highest trust)
────────────────────────────────────────────
- SSH + sudo access to servers
- Installs and uninstalls agent
- Sets agent identity and secrets
- Can override anything
Level 2: Cloud Console Operator
────────────────────────────────
- Authenticated access to cloud dashboard / CLI
- Manages apps THROUGH the agent
- Can: push config, restart apps, trigger updates
- Cannot: install/uninstall agent, access server filesystem,
run arbitrary commands, modify agent identity
Level 3: Agent Process (least trust)
────────────────────────────────────
- Runs as a restricted system user
- Executes only predefined command types from cloud
- Cannot: modify its own binary, escalate privileges,
access other services, run arbitrary code
49.11.2 10.2 Agent Security Hardening (systemd)
# Run as dedicated unprivileged user
User=<service-user>
Group=<service-user>
# Filesystem restrictions
ProtectSystem=strict # Entire FS read-only except declared paths
ProtectHome=true # Cannot access /home/*
PrivateTmp=true # Isolated /tmp
ReadWritePaths=/var/lib/<agent-pkg> # Only writable path
ReadOnlyPaths=/opt/<agent-pkg> # Binary dir is read-only at runtime
# Privilege restrictions
NoNewPrivileges=true # Cannot gain new capabilities
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
# Resource limits
MemoryMax=512M
CPUQuota=50%
49.11.3 10.3 Config File Protection
/etc/<agent-pkg>/agent.env
Owner: <service-user>
Mode: 600
Contains: API keys, console URL, agent identity
Only readable by: the agent process and root
~/.config/<cli-name>/config.json
Owner: the admin user
Mode: 600
Contains: API key for cloud console
Only readable by: that specific user and root
49.12 11. Filesystem Layout Standard
49.12.1 11.1 Server (Agent + Managed Apps)
Follows the Linux Filesystem Hierarchy Standard (FHS):
Path Purpose Owner Mode
──────────────────────────────────────────────────────────────────────────────────
/opt/<agent-pkg>/ Agent binary + defaults root (dpkg) 755
/etc/<agent-pkg>/agent.env Agent secrets + config <service-user> 600
/etc/systemd/system/ Service unit file root (dpkg) 644
<agent-pkg>.service
/var/lib/<agent-pkg>/ Agent runtime data <service-user> 755
/var/log/<agent-pkg>/ Logs (if not journald) <service-user> 755
/opt/<managed-app>/ App binary + defaults root / app-user
/etc/<managed-app>/ App config app-user
/etc/systemd/system/ App service unit root
<managed-app>.service
49.12.2 11.2 Admin Workstation (CLI)
Path Purpose
──────────────────────────────────────────────────
~/.config/<cli-name>/config.json Console URL + API key (mode 600)
~/bin/<cli-name> or CLI executable
/usr/local/bin/<cli-name>
49.12.3 11.3 Why This Layout
PATH WHY
──────────────────────────────────────────────────────────────────────
/opt/ Third-party software (not from distro repos). Agent binary
lives here because it's not part of the Ubuntu distribution.
/etc/ Configuration. The admin edits files here. dpkg knows to
preserve files listed in conffiles during upgrades.
/var/lib/ Variable persistent data. The agent writes runtime state,
cached downloads, etc. here. Survives reboot, wiped on purge.
/var/log/ Logs. Only needed if not using journald (systemd's built-in
logging). With journald, logs go to the journal automatically.
~/.config/ Per-user config (XDG standard). For interactive tools only.
NEVER for system daemons.
49.13 12. User Stories & Scenarios
49.13.1 Story 1: First Server Onboarding
As an infrastructure admin,
I want to install the agent on a new Ubuntu server and have it appear
in the cloud console within seconds,
so that the server joins the managed fleet.
STEPS:
1. Admin downloads .deb from GitHub Releases
$ curl -LO https://github.com/org/repo/releases/download/v1.0.0/agent_1.0.0_amd64.deb
2. Admin installs the package
$ sudo dpkg -i agent_1.0.0_amd64.deb
→ postinst creates system user, enables and starts service
3. Admin configures agent identity
$ sudo nano /etc/<agent-pkg>/agent.env
→ sets console URL and agent ID
4. Admin restarts to pick up config
$ sudo systemctl restart <agent-pkg>
5. Agent connects to cloud console, registers itself
6. Cloud console dashboard shows new agent as "online"
with hostname, OS, CPU/RAM metrics, and managed app list
ACCEPTANCE:
- Agent appears in cloud console within 30 seconds of restart
- Heartbeat begins immediately
- Dashboard shows correct hostname and OS info
49.13.2 Story 2: Remote App Restart
As a cloud operator,
I want to restart a misbehaving application on a remote server
without SSH access,
so that I can resolve issues quickly from my desk.
STEPS:
1. Operator notices app-a on agent-001 showing errors in dashboard
2. Operator uses CLI:
$ <cli-name> apps restart agent-001 app-a
→ "Restart command sent to app-a on agent-001."
3. Cloud console sends RestartApp command to agent-001 via realtime hub
4. Agent executes: systemctl restart app-a.service
5. Agent checks: systemctl is-active app-a.service → "active"
6. Agent reports success to cloud
7. Audit log records: "app-a restarted on agent-001 by operator@cli"
8. Dashboard shows app-a status returns to "Running"
ACCEPTANCE:
- App restarts within 5 seconds of command
- Operator sees confirmation in CLI
- Audit log captures the event with timestamp
- Dashboard reflects new status on next heartbeat
49.13.3 Story 3: Fleet-Wide Config Push
As a cloud operator,
I want to change the log level to Debug on all instances of app-b
across the fleet,
so that I can diagnose a production issue.
STEPS:
1. Operator identifies all agents running app-b:
$ <cli-name> agents list --online-only
→ agent-001, agent-002, agent-003
2. Operator pushes config to each (or scripts it):
$ for agent in agent-001 agent-002 agent-003; do
<cli-name> apps config $agent app-b --set LogLevel=Debug
done
→ "Config pushed to app-b on agent-001: LogLevel"
→ "Config pushed to app-b on agent-002: LogLevel"
→ "Config pushed to app-b on agent-003: LogLevel"
3. Each agent receives PushConfig, writes to app-b's config file,
restarts app-b
4. Each agent reports success/failure
5. Audit log shows 3 config change events
ACCEPTANCE:
- All three agents apply the change within 10 seconds
- Config files on each server reflect the new value
- Apps restart automatically to pick up new config
- Audit log shows all three events with correct agent IDs
49.13.4 Story 4: Rolling App Update
As a cloud operator,
I want to update app-a from v2.0 to v2.1 on one server first (canary),
verify it works, then roll out to the rest,
so that I can deploy safely.
STEPS:
1. Operator targets canary server:
$ <cli-name> apps update agent-001 app-a \
--version 2.1.0 \
--artifact https://artifacts.example.com/app-a-2.1.0.tar.gz
→ "Update command sent: app-a → v2.1.0 on agent-001."
2. Agent on agent-001:
a. Downloads artifact
b. Stops app-a
c. Backs up current version
d. Extracts new version
e. Starts app-a
f. Health check passes
g. Reports success
3. Operator verifies canary:
$ <cli-name> agents status agent-001
→ app-a shows version 2.1.0, status Running
4. Operator waits, monitors, then rolls out to remaining servers
5. If canary fails: agent rolls back automatically, reports failure
Operator sees: "Update failed for app-a on agent-001: health check failed"
ACCEPTANCE:
- Update completes within 60 seconds (download + swap + restart)
- On success: version number updates in heartbeat
- On failure: automatic rollback to previous version
- Audit log records the entire sequence (start, download, stop, swap, start, result)
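The update sequence in step 2 (a–g) plus the automatic rollback from step 5 can be sketched as one function. All operations are injected callables with hypothetical names; a real agent would bind them to artifact download, systemctl, and filesystem swap logic:

```python
def update_app(app, version, download, stop, backup, extract, start, healthy, restore):
    """Run the update sequence; on a failed health check, restore the backup."""
    download(version)      # a. fetch the artifact
    stop(app)              # b. stop the running app
    backup(app)            # c. keep the current version for rollback
    extract(version)       # d. unpack the new version into place
    start(app)             # e. start the app
    if healthy(app):       # f. health check
        return "success"   # g. report success (version appears in heartbeat)
    # Rollback path: put the backed-up version back and restart it.
    stop(app)
    restore(app)
    start(app)
    return "rolled-back"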
49.13.5 Story 5: Agent Upgrade via dpkg
As an infrastructure admin,
I want to upgrade the agent itself to a new version using dpkg,
so that I get new features and bug fixes.
STEPS:
1. Admin downloads new version:
$ curl -LO https://github.com/org/repo/releases/download/v1.1.0/agent_1.1.0_amd64.deb
2. Admin installs over existing:
$ sudo dpkg -i agent_1.1.0_amd64.deb
→ dpkg replaces binary in /opt/<agent-pkg>/
→ dpkg preserves /etc/<agent-pkg>/agent.env (conffiles)
→ postinst restarts the service
3. Agent comes back online with new version
4. Cloud console sees agent reconnect
→ heartbeat shows new agent version
→ audit log: "agent registered (version 1.1.0)"
ACCEPTANCE:
- Upgrade completes in under 10 seconds
- Agent reconnects automatically after restart
- Admin's config file is NOT overwritten
- Managed apps are unaffected (they keep running during agent restart)
49.13.6 Story 6: Server Decommissioning
As an infrastructure admin,
I want to cleanly remove the agent from a server being decommissioned,
so that no fleet management artifacts remain.
STEPS:
1. Admin purges the package:
$ sudo dpkg -P <agent-pkg>
→ prerm: stops service, disables it
→ removes: binary, service unit, config
→ postrm (purge): deletes system user, data dir, config dir
2. Cloud console sees agent heartbeat stop
→ status changes to "offline"
3. Server is clean — no agent user, no files, no service
ACCEPTANCE:
- No files remain under /opt/<agent-pkg>/, /etc/<agent-pkg>/, /var/lib/<agent-pkg>/
- System user is deleted
- systemd has no knowledge of the service
- Cloud console shows agent as offline (does not auto-remove from registry)
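The first three acceptance criteria amount to a post-purge check. A small sketch (hypothetical paths; a complete check would also verify the system user via `getent passwd` and the unit via `systemctl`) that returns any leftover artifacts:

```python
from pathlib import Path

def leftover_artifacts(pkg: str, root: Path = Path("/")) -> list:
    """Return paths that should be gone after `dpkg -P <pkg>` but still exist."""
    candidates = [
        root / "opt" / pkg,                                    # binary dir
        root / "etc" / pkg,                                    # config dir
        root / "var" / "lib" / pkg,                            # data dir
        root / "etc" / "systemd" / "system" / f"{pkg}.service",  # unit file
    ]
    return [str(p) for p in candidates if p.exists()]
```

An empty result means the server satisfies the "no fleet management artifacts remain" goal; the `root` parameter exists so the check can be exercised against a scratch directory.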
49.13.7 Story 7: New Developer Onboarding
As a new developer joining the team,
I want to run the entire system locally on my machine,
so that I can understand and develop against the architecture.
STEPS:
1. Clone the repo:
$ git clone https://github.com/org/repo.git
2. Start the cloud console (terminal 1):
$ <start-console-command> # e.g., dotnet run, go run, npm start
→ Listening on http://localhost:5000
3. Start an agent (terminal 2):
$ <start-agent-command>
→ Connected to cloud hub
→ Registered as agent-dev-01
4. Open browser at http://localhost:5000
→ Dashboard shows agent-dev-01 online with simulated apps
5. Use CLI (terminal 3):
$ <cli-name> login --url http://localhost:5000
$ <cli-name> agents list
→ Shows agent-dev-01
6. Start a second agent with a different ID (terminal 4):
$ <start-agent-command> --agent-id agent-dev-02
→ Dashboard now shows two agents
ACCEPTANCE:
- Full system runs locally with no external dependencies
- Agent uses simulated apps (no real systemd services needed)
- CLI connects to local console
- Developer can exercise all commands against local setup
49.14 13. Glossary
Term                  Definition
────────────────────────────────────────────────────────────────────────
Agent                 A daemon running on each managed server. It connects
                      to the cloud console, reports health, and executes
                      management commands. It manages apps through the OS,
                      not through application code.
Managed App           Any application running as a systemd service on a
                      managed server. It performs business logic and is
                      unaware of the agent or cloud console.
Cloud Console         The centralized server that all agents connect to.
                      It provides a dashboard, REST API, and realtime hub
                      for monitoring and managing the fleet.
Control Plane         The command and decision path: cloud → agent → app.
                      Admin says "restart app-a"; the command flows down.
Data Plane            The telemetry and status path: app → agent → cloud.
                      Metrics and events flow up for visibility.
CLI Tool              A command-line application run by admins on their
                      workstation. It talks to the cloud console REST API.
                      It is a thin client, not a daemon.
Heartbeat             A periodic message from agent to cloud carrying
                      system metrics and managed app statuses.
Audit Event           An immutable record of a management action
                      (config change, restart, update, etc.) stored
                      in the cloud console.
Realtime Hub          The persistent bidirectional communication channel
                      between agents and the cloud console (WebSocket,
                      gRPC stream, SSE, or similar).
conffiles             A Debian packaging concept: config files listed here
                      are preserved when the package is upgraded. The
                      admin's modifications are not overwritten.
FHS                   Filesystem Hierarchy Standard. The Linux convention
                      for where files go: /opt for third-party apps,
                      /etc for config, /var for variable data.
XDG Base Directory    A freedesktop.org standard defining where per-user
                      config (~/.config/), data (~/.local/share/), and
                      cache (~/.cache/) should go.
systemd               The init system and service manager on modern Linux.
                      It starts, stops, and supervises all services,
                      including both the agent and managed apps.
dpkg                  The low-level Debian package manager. It installs,
                      removes, and manages .deb packages.
Self-contained binary A build mode where the application binary includes
                      the language runtime, so no runtime needs to be
                      pre-installed on the target server.
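The Heartbeat entry above implies a concrete wire message carried over the Realtime Hub. A sketch with hypothetical field names (this specification does not fix a schema):

```python
import json
import time

def build_heartbeat(agent_id, agent_version, metrics, apps):
    """Serialize one heartbeat: system metrics plus the status of each managed app."""
    return json.dumps({
        "agentId": agent_id,
        "agentVersion": agent_version,   # lets the console spot upgrades (Story 5)
        "timestamp": time.time(),
        "metrics": metrics,              # e.g. {"cpuPercent": 12.5, "memUsedMb": 840}
        "apps": apps,                    # e.g. [{"name": "app-a",
                                         #        "status": "Running",
                                         #        "version": "2.1.0"}]
    })
```

Carrying the app version in every heartbeat is what makes the canary verification in Story 4 observable from the dashboard without any extra query.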
End of specification.