# Monitor and operate
/state, /metrics, /logs. Health endpoints, Prometheus, alert recipes. The relay you actually run.
The relay has three admin endpoints, all gated by a bearer token. Enable them once:
```bash
docker run -d \
  -e RELAY_ENABLE_ADMIN_HTTP=1 \
  -e RELAY_ADMIN_TOKEN="$(openssl rand -hex 32)" \
  …
```

Store the token in your secret manager. It grants full visibility into who's connected and recent activity.
## /health
No auth. Lightweight liveness check:
```bash
curl https://relay.your-co.com/health
```

```json
{
  "ok": true,
  "service": "relay",
  "uptimeMs": 123456,
  "tlsEnabled": true,
  "relayMode": "single",
  "relayId": "relay-a"
}
```

Use it for your load balancer's health check. If `ok` is false, the container is broken. Restart.
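If your load balancer runs a command-style probe, a minimal check is the following sketch (assumes `jq` is available on the probe host):

```bash
# Exit non-zero unless the relay reports ok: true.
curl -fsS https://relay.your-co.com/health | jq -e '.ok == true' > /dev/null
```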
## /state

Requires `Authorization: Bearer $RELAY_ADMIN_TOKEN`. Returns the in-memory connection registry:
```json
{
  "uptimeMs": 6789012,
  "totalConnections": 248,
  "workspaces": [
    {
      "workspaceId": "01J…",
      "daemons": [
        { "installId": "ins_…", "connectedAt": "…", "framesIn": 1240, "framesOut": 8920 }
      ],
      "clients": 3
    }
  ]
}
```

Useful for:
- Confirming a specific daemon is connected.
- Counting active workspaces.
- Spotting load distribution before scaling.
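For the first of these, a quick sketch with `jq` against the registry shape above (the install ID is a placeholder; substitute your daemon's actual ID):

```bash
# Exit non-zero unless a daemon with the given installId is connected.
INSTALL_ID="ins_example"
curl -s -H "Authorization: Bearer $RELAY_ADMIN_TOKEN" https://relay.your-co.com/state \
  | jq -e --arg id "$INSTALL_ID" \
      '.workspaces[].daemons[] | select(.installId == $id)' > /dev/null \
  && echo "daemon $INSTALL_ID is connected"
```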
## /logs
Returns the recent in-memory log buffer (last ~1000 entries by default):
```bash
curl -H "Authorization: Bearer $TOKEN" https://relay.your-co.com/logs
curl -H "Authorization: Bearer $TOKEN" "https://relay.your-co.com/logs?summary=1"
```

`summary=1` returns counts by log level instead of the entries themselves. Useful for "is there any error noise?"
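Wired into a periodic check, that might look like the sketch below. It assumes the summary response is a flat JSON object keyed by level name, which isn't pinned down here; adjust the `jq` path to the shape your relay version returns:

```bash
# Count error-level entries in the recent buffer; the "error" key is an assumption.
errors=$(curl -s -H "Authorization: Bearer $TOKEN" \
  "https://relay.your-co.com/logs?summary=1" | jq -r '.error // 0')
if [ "$errors" -gt 0 ]; then
  echo "relay reporting $errors recent error-level log entries"
fi
```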
For persistent logs, capture stdout / stderr of the container. The relay writes structured JSON logs to stdout.
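Since the logs are structured JSON on stdout, they filter cleanly with `jq`. A sketch, assuming the container is named `viewport-relay` and entries carry a `level` field:

```bash
# Follow the container's logs and surface only error-level entries;
# fromjson? silently skips any non-JSON lines.
docker logs -f viewport-relay 2>&1 | jq -cR 'fromjson? | select(.level == "error")'
```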
## /metrics
Prometheus exposition. Set up scraping in your Prometheus config:
```yaml
scrape_configs:
  - job_name: viewport-relay
    bearer_token: "${RELAY_ADMIN_TOKEN}"
    static_configs:
      - targets: ['relay.your-co.com:7781']
```

Key metrics:
| Metric | Type | Meaning |
|---|---|---|
| `relay_connections_total` | gauge | Currently connected daemons + clients. |
| `relay_handshakes_total` | counter | Successful handshakes. |
| `relay_handshake_failures_total{reason}` | counter | Auth failures, malformed frames, etc. |
| `relay_frames_in_total{workspace_id}` | counter | Frames from daemon. |
| `relay_frames_out_total{workspace_id}` | counter | Frames to clients. |
| `relay_frame_size_bytes` | histogram | Frame size distribution. |
| `relay_jwt_validate_duration_seconds` | histogram | Latency of the platform's JWT validate call. |
| `relay_backplane_publish_seconds` | histogram | Latency to publish onto the backplane (server / redis mode). |
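To spot-check the exposition by hand before wiring up Prometheus:

```bash
# Pull the raw metrics and show the connection and handshake series.
curl -s -H "Authorization: Bearer $RELAY_ADMIN_TOKEN" https://relay.your-co.com/metrics \
  | grep -E '^relay_(connections_total|handshakes_total|handshake_failures_total)'
```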
## Alert recipes
Alert: relay unreachable.
```yaml
- alert: ViewportRelayDown
  expr: up{job="viewport-relay"} == 0
  for: 2m
  labels: { severity: critical }
  annotations:
    summary: "Viewport relay {{ $labels.instance }} is down"
```

Alert: high handshake failure rate.
```yaml
- alert: ViewportRelayHandshakeFailing
  expr: rate(relay_handshake_failures_total[5m]) > 0.5
  for: 5m
```

0.5 failures/sec sustained for 5 minutes usually means platform JWT validate is slow or your JWKS URL is wrong.
Alert: JWT validate slow.
```yaml
- alert: ViewportRelayJwtSlow
  expr: histogram_quantile(0.95, rate(relay_jwt_validate_duration_seconds_bucket[5m])) > 1
  for: 10m
```

If p95 JWT validate latency stays above 1s, the platform-side validate endpoint is degraded. Check the platform side first.
Alert: backplane lag (server / redis modes).
```yaml
- alert: ViewportRelayBackplaneSlow
  expr: histogram_quantile(0.95, rate(relay_backplane_publish_seconds_bucket[5m])) > 0.2
  for: 10m
```

## Restart, redeploy, upgrade
The relay is stateless. Restart any time. Connected daemons and clients see a clean disconnect and reconnect within a few seconds (configurable backoff on the daemon side).
Upgrading:

- Pull the new image: `docker pull ghcr.io/viewportai/relay:latest`.
- Restart your container.
- Daemons reconnect automatically.
- Watch `relay_handshakes_total` to confirm activity returns.
Rolling deploys (multiple replicas): just `kubectl rollout restart` or your equivalent.
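On Kubernetes, for example (the deployment name is a placeholder):

```bash
# Restart relay pods one at a time and wait for the rollout to complete.
kubectl rollout restart deployment/viewport-relay
kubectl rollout status deployment/viewport-relay
```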
## Where to go next
- Backplane modes. When single isn't enough.
- Security posture.
- Reference: env vars.