Fleet Management Stack Components

A production-grade fleet management system has five essential components. Most teams build these incrementally — start with telemetry and remote access, then add the update system and alerting as the fleet scales past 10 robots.

  • Device registry: A database of every robot in the fleet — serial number, model, firmware version, location, assigned operator, and current status. This is the source of truth for all other systems. A simple PostgreSQL table works; purpose-built solutions like AWS IoT or Azure IoT Hub provide this at scale.
  • Telemetry pipeline: Streaming metrics from robot to cloud. Typically MQTT or gRPC from the robot, ingested into a time-series database (InfluxDB or TimescaleDB). Target <10 second latency for operational metrics, <1 second for safety-critical signals.
  • Remote access layer: Authenticated access for operators and engineers to inspect and control robots without physical presence. Discussed in detail in the Remote Access Methods section.
  • OTA update system: Mechanism to push firmware, software, and policy updates to robots in the field. Staged rollouts and rollback capability are non-negotiable.
  • Alerting and on-call: Automated detection of anomalies with escalation to on-call engineers. PagerDuty or Opsgenie handles on-call rotation management.
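
As a minimal sketch of the device registry, the table below holds the fields listed above. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so the example runs anywhere. The table name, column names, default status, and sample values are illustrative assumptions, not a prescribed schema.

```python
import sqlite3

# Hypothetical schema: columns mirror the registry fields described above.
SCHEMA = """
CREATE TABLE robots (
    serial_number     TEXT PRIMARY KEY,
    model             TEXT NOT NULL,
    firmware_version  TEXT NOT NULL,
    location          TEXT,
    assigned_operator TEXT,
    status            TEXT NOT NULL DEFAULT 'offline'
)
"""


def open_registry() -> sqlite3.Connection:
    """Create an in-memory registry (stand-in for a PostgreSQL database)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def register_robot(conn, serial, model, firmware, location=None, operator=None):
    """Insert one robot; the serial number is the unique key."""
    conn.execute(
        "INSERT INTO robots (serial_number, model, firmware_version, "
        "location, assigned_operator) VALUES (?, ?, ?, ?, ?)",
        (serial, model, firmware, location, operator),
    )


def robots_on_firmware(conn, firmware):
    """Return serial numbers of all robots running a given firmware version."""
    rows = conn.execute(
        "SELECT serial_number FROM robots "
        "WHERE firmware_version = ? ORDER BY serial_number",
        (firmware,),
    )
    return [row[0] for row in rows]


registry = open_registry()
register_robot(registry, "SN-0001", "arm-a", "1.4.2", "lab-1", "alice")
register_robot(registry, "SN-0002", "arm-a", "1.4.1", "lab-1", "bob")
print(robots_on_firmware(registry, "1.4.1"))
```

A query like robots_on_firmware is the kind of lookup the OTA update system would likely run against the registry to decide which robots still need an update.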

Telemetry to Collect

Collect only what you will act on. Excessive telemetry increases costs and creates noise. These are the metrics that consistently prove valuable:

  • Joint temperatures: Per-joint motor temperature in °C. Alert at 70°C (warning), emergency stop at 85°C. Elevated temperature is an early indicator of increased friction, impending bearing failure, or overloaded task profiles.
  • Joint error codes: Any error code from the motor driver, logged with timestamp and joint ID. Error codes should be decoded to human-readable descriptions in your monitoring dashboard.
  • Battery state of charge and voltage: Battery % and voltage at 1-minute intervals. Track charge cycle count for battery lifecycle management.
  • Task success/failure rate: Per-episode result with task type, duration, and failure mode. The primary business KPI.
  • Network latency: Round-trip latency from robot to cloud at 30-second intervals. Latency spikes >200 ms indicate network issues that will degrade teleoperation quality.
  • Camera health: Frame rate and dropped frame count per camera. A camera dropping >5% of frames indicates a hardware issue or USB bandwidth saturation.
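
To illustrate the camera health metric, the sketch below derives a drop rate from delivered and dropped frame counts and flags a camera that exceeds the 5% figure above. The class name, field names, and JSON layout are assumptions for illustration; the document does not specify a payload format.

```python
import json
from dataclasses import dataclass, asdict

DROP_RATE_LIMIT = 0.05  # the >5% figure from the camera health bullet


@dataclass
class CameraHealth:
    """One camera health sample (hypothetical field names)."""
    camera_id: str
    fps: float
    frames_delivered: int
    frames_dropped: int

    @property
    def drop_rate(self) -> float:
        """Fraction of frames dropped; 0.0 when no frames were expected."""
        total = self.frames_delivered + self.frames_dropped
        return self.frames_dropped / total if total else 0.0


def camera_payload(sample: CameraHealth, timestamp: float) -> str:
    """Serialize a sample plus derived fields as a JSON telemetry record."""
    record = asdict(sample)
    record["drop_rate"] = round(sample.drop_rate, 4)
    record["needs_attention"] = sample.drop_rate > DROP_RATE_LIMIT
    record["timestamp"] = timestamp
    return json.dumps(record, sort_keys=True)


wrist = CameraHealth("wrist", fps=28.5, frames_delivered=940, frames_dropped=60)
print(camera_payload(wrist, timestamp=1700000000.0))
```

Computing the rate on the robot and sending only the derived value is one way to follow the "collect only what you will act on" advice.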

Monitoring Infrastructure

The standard open-source monitoring stack works well for robot fleets up to ~200 units:

  • Prometheus + Grafana: Prometheus scrapes metrics endpoints exposed by each robot's fleet agent at 15-second intervals. Grafana visualizes fleet-level dashboards: total uptime, per-robot health, task throughput, and alert history. Pre-built Grafana dashboards for robot fleets are available at grafana.com/grafana/dashboards.
  • InfluxDB: For high-frequency telemetry (joint positions at 100 Hz), store the data in InfluxDB, which applies time-series compression, rather than in Prometheus, which is not optimized for high-cardinality, high-frequency data.
  • PagerDuty: Manages on-call rotations and alert escalation. Route Prometheus Alertmanager notifications to PagerDuty for automated incident creation. Define separate escalation policies for safety alerts (immediate page) and maintenance alerts (business hours only).
  • Custom fleet health dashboard: Build a single-screen "mission control" view in the platform showing: map of all robot locations with status indicators, top 5 failing tasks, fleet uptime percentage, and robots requiring maintenance.

Remote Access Methods

Method | Latency | Security | Best For
SSH over VPN (WireGuard) | 20–80 ms, depending on VPN server location | High: key-based auth, encrypted tunnel | Engineering diagnostics, log review, config changes
WebRTC remote desktop | 50–150 ms | Medium: requires signaling server security | Operator GUI access, rviz2 visualization
ROS2 bridge (rosbridge_suite) | 30–100 ms | Low by default; add TLS + auth explicitly | Programmatic telemetry access, remote monitoring scripts

WireGuard VPN is the recommended foundation for all remote access. Deploy a WireGuard server (e.g., on a $5/month DigitalOcean droplet) and configure each robot as a WireGuard client with a unique key pair. All remote access happens over the VPN tunnel — SSH, web dashboards, and ROS2 bridge traffic are all tunneled, eliminating the need to expose any robot port directly to the internet.
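
The per-robot client setup can be sketched as a small generator that emits a wg-quick style configuration for each robot. The 10.8.0.0/24 subnet, the endpoint hostname, and the key strings are placeholders chosen for illustration; real keys would come from `wg genkey` and `wg pubkey`, and each robot must receive its own private key.

```python
import ipaddress


def robot_address(index: int, subnet: str = "10.8.0.0/24") -> str:
    """Assign robot number `index` a tunnel address.

    Assumes the first host address in the subnet belongs to the server.
    """
    hosts = list(ipaddress.ip_network(subnet).hosts())
    return f"{hosts[index + 1]}/32"


def render_client_config(private_key, address, server_public_key,
                         endpoint, allowed_ips, keepalive_s=25):
    """Render a wg-quick client config as text."""
    return "\n".join([
        "[Interface]",
        f"PrivateKey = {private_key}",
        f"Address = {address}",
        "",
        "[Peer]",
        f"PublicKey = {server_public_key}",
        f"Endpoint = {endpoint}",
        f"AllowedIPs = {allowed_ips}",
        # Keepalive helps robots behind NAT stay reachable from the server.
        f"PersistentKeepalive = {keepalive_s}",
    ]) + "\n"


config = render_client_config(
    private_key="<robot-private-key>",        # placeholder, never commit real keys
    address=robot_address(0),
    server_public_key="<server-public-key>",  # placeholder
    endpoint="vpn.example.com:51820",         # hypothetical server endpoint
    allowed_ips="10.8.0.0/24",                # route only VPN traffic via the tunnel
)
print(config)
```

Restricting AllowedIPs to the VPN subnet keeps ordinary robot traffic, such as telemetry upload, off the tunnel; whether that is desirable depends on the deployment.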

Alert Thresholds

Metric | Warning Threshold | Emergency Threshold | Automated Action
Joint temperature | >70°C | >85°C | Emergency: immediate e-stop
Task success rate (7-day rolling) | <80% | <60% | Emergency: suspend policy, alert on-call
Battery SoC | <20% | <10% | Emergency: return to charger or alert operator
Network latency (robot→cloud) | >200 ms | >500 ms | Warning: log; Emergency: disable teleoperation
Camera frame drop rate | >5% | >20% | Warning: log; Emergency: pause data collection
Consecutive task failures | 3 in a row | 5 in a row | Warning: operator alert; Emergency: suspend + escalate
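
The thresholds in the table above can be encoded as data and evaluated by one function. This is a minimal sketch: the metric keys and the Threshold class are hypothetical names, percentages are expressed as fractions, and "3 in a row" is treated as an inclusive bound while the other rows use the strict comparisons shown in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    emergency: float
    higher_is_worse: bool    # True for temperature, False for battery SoC
    inclusive: bool = False  # True when the limit itself counts as a breach


THRESHOLDS = {
    "joint_temperature_c": Threshold(70, 85, higher_is_worse=True),
    "task_success_rate_7d": Threshold(0.80, 0.60, higher_is_worse=False),
    "battery_soc": Threshold(0.20, 0.10, higher_is_worse=False),
    "network_latency_ms": Threshold(200, 500, higher_is_worse=True),
    "camera_frame_drop_rate": Threshold(0.05, 0.20, higher_is_worse=True),
    "consecutive_task_failures": Threshold(3, 5, higher_is_worse=True,
                                           inclusive=True),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'emergency' for one metric reading."""
    t = THRESHOLDS[metric]

    def breached(limit: float) -> bool:
        if t.higher_is_worse:
            return value >= limit if t.inclusive else value > limit
        return value <= limit if t.inclusive else value < limit

    # Check the emergency limit first so it takes precedence over warning.
    if breached(t.emergency):
        return "emergency"
    if breached(t.warning):
        return "warning"
    return "ok"


print(classify("joint_temperature_c", 72))
print(classify("battery_soc", 0.08))
```

Keeping the limits in a table-like structure means a threshold change is a data edit rather than a code change, which may simplify review.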

OTA Update Process

Over-the-air updates are how you ship improvements and security fixes to deployed robots without site visits. A disciplined update process prevents updates from causing incidents:

  • Build: Every update (firmware, software, or policy) is built in CI and produces a versioned artifact with a sha256 checksum. Artifacts are stored in a release registry (S3 bucket or Artifact Registry).
  • Staging test: Before any field deployment, the update is applied to 2–3 staging robots in the lab and validated with a 50-trial automated test suite.
  • Canary rollout (10%): Deploy to 10% of the fleet, with a minimum of 3 robots, for 48 hours. Monitor success rate, error codes, and telemetry for regressions.
  • Full rollout: If canary metrics are nominal, roll out to the remaining fleet. Stagger the rollout at 25% per hour to avoid simultaneous restarts causing fleet-wide downtime.
  • Rollback capability: Every robot maintains the previous version artifact locally. Rollback takes <2 minutes and can be triggered per-robot or fleet-wide from the management dashboard.
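
The rollout steps above can be sketched as two helpers: one that verifies an artifact against its sha256 checksum, and one that splits a fleet into a canary group and hourly waves. Function and parameter names are hypothetical, and reading the 3-robot figure as a floor on the canary size is an interpretation of the canary step.

```python
import hashlib
import math


def sha256_matches(artifact: bytes, expected_hex: str) -> bool:
    """Check a downloaded artifact against its published sha256 checksum."""
    return hashlib.sha256(artifact).hexdigest() == expected_hex.lower()


def plan_rollout(robot_ids, canary_percent=10, canary_min=3, wave_percent=25):
    """Split a fleet into (canary_group, list_of_hourly_waves).

    The canary is 10% of the fleet, but at least canary_min robots
    (capped at fleet size). Each wave covers 25% of the whole fleet.
    """
    robots = list(robot_ids)
    n = len(robots)
    canary_size = min(n, max(canary_min, math.ceil(n * canary_percent / 100)))
    canary, rest = robots[:canary_size], robots[canary_size:]
    wave_size = max(1, math.ceil(n * wave_percent / 100))
    waves = [rest[i:i + wave_size] for i in range(0, len(rest), wave_size)]
    return canary, waves


fleet = [f"SN-{i:04d}" for i in range(1, 41)]  # a hypothetical 40-robot fleet
canary, waves = plan_rollout(fleet)
print(len(canary), [len(w) for w in waves])
```

For the 40-robot example this yields a 4-robot canary followed by waves of 10, 10, 10, and 6 robots, so the full rollout completes within about four hours after the 48-hour canary window.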

Fleet KPIs

KPI | Definition | Target | Measurement Interval
MTBF (Mean Time Between Failures) | Average operating hours between unplanned stoppages | >200 hours | Monthly
MTTR (Mean Time to Repair) | Average time from incident detection to resumed operation | <2 hours | Monthly
Fleet uptime | % of scheduled operating hours spent in active operation | >95% | Weekly
Task completion rate | % of tasks completed successfully without human intervention | >90% | Daily
OTA update success rate | % of update deployments that succeed without rollback | >99% | Per-release
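
As a worked example of the first three KPI definitions, the sketch below computes MTBF, MTTR, and uptime from a list of incidents. It assumes every incident is an unplanned stoppage, that all time between detection and resumption counts as downtime, and that scheduled hours are summed across the robots being measured. The Incident class and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One unplanned stoppage, in hours since the start of the period."""
    detected_h: float
    resumed_h: float


def fleet_kpis(scheduled_hours: float, incidents: list) -> dict:
    """Compute MTBF, MTTR, and uptime for one measurement period."""
    downtime = sum(i.resumed_h - i.detected_h for i in incidents)
    operating = scheduled_hours - downtime
    count = len(incidents)
    return {
        # Operating hours divided by the number of unplanned stoppages.
        "mtbf_hours": operating / count if count else float("inf"),
        # Mean time from detection to resumed operation.
        "mttr_hours": downtime / count if count else 0.0,
        # Share of scheduled hours spent in active operation.
        "uptime_pct": 100.0 * operating / scheduled_hours,
    }


# Hypothetical month: 720 scheduled hours and two stoppages.
kpis = fleet_kpis(720, [Incident(100.0, 101.5), Incident(400.0, 402.5)])
print(kpis)
```

In this example the robot meets the MTBF and uptime targets (358 hours and roughly 99.4%) but sits exactly at 2 hours MTTR, which misses the "<2 hours" target in the table.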

Connectivity Options: Detailed Comparison

Option | Typical Latency | Bandwidth | Cost/Robot/Month | Best For
WireGuard VPN over WiFi | 20-80 ms | 100+ Mbps | $5 (VPN server) | Lab and warehouse with existing WiFi
WireGuard VPN over 5G | 15-40 ms | 100-500 Mbps | $30-$80 (data plan) | Mobile robots, outdoor, no WiFi available
Tailscale (managed WireGuard) | 20-80 ms | 100+ Mbps | $0-$18/device | Quick setup, NAT traversal, SSO integration
WebRTC peer-to-peer | 50-150 ms | 10-50 Mbps | $0 (STUN/TURN server) | Browser-based remote viewing, video streams
Wired Ethernet (on-premise) | <1 ms | 1 Gbps | $0 (existing infra) | Fixed arm installations, highest reliability

For most deployments, the recommended architecture is: wired Ethernet for the robot LAN (arm controller to workstation), WireGuard VPN for remote access (engineer laptop to robot), and WiFi or 5G for cloud telemetry upload. This provides sub-millisecond latency for the control loop while enabling secure remote access.

Fault Detection and Automatic Recovery

A well-designed fleet management system recovers from common faults without human intervention, reducing MTTR from hours to minutes:

  • Watchdog timer: The onboard fleet agent sends a heartbeat every 10 seconds. If the fleet manager receives no heartbeat for 30 seconds, it marks the robot as "unreachable" and triggers a network diagnostic sequence (ping, traceroute, DNS check). If the robot stays unreachable for 5 minutes, the fleet manager creates a PagerDuty incident.
  • Automatic process restart: Use systemd service units for all robot software (ROS2 launch, fleet agent, camera drivers). Configure Restart=on-failure with RestartSec=5s. This recovers from process crashes (segfaults, OOM kills) automatically. Log every restart event to the telemetry pipeline.
  • Camera recovery: USB cameras occasionally drop from the bus. The fleet agent monitors /dev/video* device nodes. If a camera disappears, the agent runs usbreset on the port, waits 3 seconds, and verifies that the camera reappears. If the camera has not reappeared after 3 attempts, the agent alerts the operator for physical inspection.
  • Network failover: For robots with both WiFi and cellular connectivity, configure automatic failover: if WiFi latency exceeds 200 ms for 30 seconds, switch telemetry and API traffic to the cellular backup. Switch back to WiFi when latency drops below 100 ms for 60 seconds.
  • Disk space management: HDF5 data collection can fill disks quickly (3 cameras at 30 fps = ~50 GB/hour). The fleet agent monitors disk usage and automatically transfers completed episodes to the NAS or cloud storage when disk exceeds 70%. At 90%, it pauses data collection until space is freed.
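
Three of the recovery rules above reduce to small decision functions. The sketch below shows the watchdog escalation, the WiFi/cellular failover with its hysteresis, and the disk usage policy. All function, class, and return-value names are hypothetical, and the code only decides what to do; actually switching interfaces or moving files is left out.

```python
def heartbeat_status(seconds_since_heartbeat: float) -> str:
    """Map heartbeat silence to the watchdog escalation level."""
    if seconds_since_heartbeat >= 300:   # unreachable for 5 minutes
        return "page_on_call"
    if seconds_since_heartbeat >= 30:    # no heartbeat for 30 seconds
        return "unreachable"
    return "healthy"


class LinkSelector:
    """Choose WiFi or cellular from WiFi latency samples, with hysteresis."""

    def __init__(self):
        self.active = "wifi"
        self._bad_s = 0.0    # continuous seconds with WiFi latency > 200 ms
        self._good_s = 0.0   # continuous seconds with WiFi latency < 100 ms

    def update(self, wifi_latency_ms: float, dt_s: float) -> str:
        """Feed one sample covering dt_s seconds; return the active link."""
        if self.active == "wifi":
            self._bad_s = self._bad_s + dt_s if wifi_latency_ms > 200 else 0.0
            if self._bad_s >= 30:
                self.active, self._good_s = "cellular", 0.0
        else:
            self._good_s = self._good_s + dt_s if wifi_latency_ms < 100 else 0.0
            if self._good_s >= 60:
                self.active, self._bad_s = "wifi", 0.0
        return self.active


def disk_action(used_fraction: float) -> str:
    """Decide the disk policy step for the current usage level."""
    if used_fraction >= 0.90:
        return "pause_collection"
    if used_fraction > 0.70:
        return "offload_episodes"
    return "none"


selector = LinkSelector()
print([selector.update(250, 10) for _ in range(3)])
print(heartbeat_status(45), disk_action(0.75))
```

The gap between the 200 ms switch-out threshold and the 100 ms switch-back threshold, together with the longer 60-second recovery window, is what prevents the link from flapping when WiFi latency hovers near a single limit.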

SVRC Platform Integration

The SVRC platform provides a managed fleet management layer that eliminates the need to build custom monitoring infrastructure:

  • Fleet dashboard: Real-time map of all robot locations with status indicators (green/yellow/red). Click any robot to view live telemetry, camera feeds, and task history.
  • Telemetry pipeline: Robots send metrics via MQTT to the platform's InfluxDB backend. Pre-built Grafana dashboards cover joint health, task performance, and fleet utilization.
  • OTA update manager: Upload new firmware, software, or policy artifacts to the platform. Schedule staged rollouts with automatic canary testing and one-click rollback.
  • Alert routing: Configure alert rules (threshold-based or anomaly detection) with routing to email, Slack, PagerDuty, or the platform's built-in notification system.

Work with SVRC

SVRC provides fleet management infrastructure for robot deployments of any scale.

  • Data Platform -- managed fleet dashboard, telemetry, OTA updates, and alert routing
  • Repair and Maintenance -- remote diagnostics and on-site service for fleet robots
  • Robot Leasing -- lease fleet-ready robots with pre-installed fleet management agents
  • Contact Us -- request a fleet architecture review for your deployment