# Observability
Klaxon's telemetry is unified under OpenTelemetry. Every component (server, auth, worker, web, mobile) emits traces + metrics + logs through OTLP into an in-cluster OTel Collector, which forwards to OneUptime. There's exactly one OneUptime credential in the system — the Collector holds it.
## Architecture
```
┌─ browser (@klaxon/web) ──────────────────┐
│ @opentelemetry/sdk-trace-web             │
│ @opentelemetry/sdk-logs                  │  OTLP/HTTP + traceparent header
│ instrumentation-fetch/document-load      ├──────────────┐
└──────────────────────────────────────────┘              │
┌─ mobile (klaxon-mobile / Expo) ──────────┐              │
│ @opentelemetry/sdk-trace-base            │  OTLP/HTTP   │
│ manual fetch wrapper + screen spans      ├──────────────┤
└──────────────────────────────────────────┘              ▼
                                 ┌─────────────────────────────────────────┐
┌─ klaxon-server ───────────────┐│ OTel Collector (Deployment)             │
│ klaxon-telemetry crate        ││  OTLP gRPC :4317  ◄── Rust              │
│ traces + logs via OTLP/gRPC   ├┤  OTLP HTTP :4318  ◄── browser/mobile    │
└───────────────────────────────┘│  prometheus receiver ── /metrics scrape │
┌─ klaxon-auth ─────────────────┐│  processors: batch, memory_limiter,     │
│ same klaxon-telemetry init    ├┤              resource                   │
└───────────────────────────────┘│  exporter: otlphttp → OneUptime         │
┌─ klaxon-server --worker ──────┐│                                         │
│ + worker-specific metrics     ├┤                                         │
└───────────────────────────────┘└────────────────────┬────────────────────┘
                                                      ▼
                                           OneUptime OTLP ingestor
```

Signal paths:

- Traces — each binary/app pushes OTLP to the Collector. Browser + mobile inject the W3C `traceparent` header on every fetch, so a single trace spans browser → klaxon-auth → klaxon-server when a user clicks through the UI.
- Metrics — HTTP + worker metrics stay on the existing `metrics` crate + Prometheus `/metrics` endpoint. The Collector's `prometheus` receiver scrapes those and re-exports them as OTLP, side-stepping the Rust `metrics` ↔ OTel SDK bridge.
- Logs — Rust uses `opentelemetry-appender-tracing`, so every `tracing::info!`/`warn!`/`error!` event becomes an OTLP log record stamped with the active `trace_id` and `span_id`. Browser + mobile do the same via `@opentelemetry/api-logs`.
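For illustration, the `traceparent` value those clients attach has the W3C Trace Context shape `00-<trace-id>-<span-id>-<flags>`. A minimal TypeScript sketch of the wire format (in the real apps this is handled by `@opentelemetry/instrumentation-fetch` and `propagation.inject`, not hand-rolled; `tracedFetch` is a hypothetical name):

```typescript
import { randomBytes } from "node:crypto";

// W3C Trace Context header: version "00", 16-byte trace-id,
// 8-byte parent span-id, flags "01" (sampled).
export function makeTraceparent(): string {
  const traceId = randomBytes(16).toString("hex"); // 32 hex chars
  const spanId = randomBytes(8).toString("hex");   // 16 hex chars
  return `00-${traceId}-${spanId}-01`;
}

// Hypothetical wrapper mirroring what the auto-instrumentation does:
// attach the header so the server continues the same trace.
export function tracedFetch(url: string, init: RequestInit = {}): Promise<Response> {
  const headers = new Headers(init.headers);
  headers.set("traceparent", makeTraceparent());
  return fetch(url, { ...init, headers });
}
```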
## Configuration
### Rust (klaxon-server + klaxon-auth + worker)
| Variable | Default | Description |
|---|---|---|
| `OTEL_ENABLED` | `false` | When false, only JSON stdout logging runs. |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | `http://localhost:4317` | Collector gRPC endpoint. |
| `OTEL_SERVICE_NAME` | `klaxon-server` | `service.name` resource attribute. |
| `DEPLOYMENT_ENVIRONMENT` | `development` | `deployment.environment` resource attribute. |
| `K8S_POD_NAME`, `K8S_NAMESPACE_NAME` | unset | Populated from the k8s downward API when deployed via Helm. |
The Helm chart wires all of these automatically from `values.yaml::otelCollector.enabled` — operators don't set the env vars individually.
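For reference, the downward-API wiring for `K8S_POD_NAME`/`K8S_NAMESPACE_NAME` typically renders to something like this in the pod spec (standard Kubernetes `fieldRef` syntax; the chart's exact template may differ):

```yaml
env:
  - name: K8S_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: K8S_NAMESPACE_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
```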
### Web (@klaxon/web)
Vite env vars, baked at build time:
- `VITE_OTEL_ENDPOINT` — OTLP/HTTP base URL. Defaults to `/otel`, i.e. same origin as the web UI (ingress routes `/otel/*` to the Collector).
- `VITE_APP_VERSION` — populated from your CI build ID; becomes `service.version`.
Local dev: the Vite config proxies `/otel` to `localhost:4318` so `pnpm --filter @klaxon/web dev` works against `docker compose up -d` without CORS gymnastics.
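A plausible shape for that proxy rule, shown as a standalone object (the actual `vite.config.ts` may differ, and whether the `/otel` prefix is stripped depends on how the ingress and Collector are set up):

```typescript
// Hypothetical extract of the dev-server proxy described above:
// requests to /otel on the Vite origin are forwarded to the local Collector.
export const otelProxy = {
  "/otel": {
    target: "http://localhost:4318",
    changeOrigin: true,
    // One plausible choice: strip the /otel prefix so the Collector
    // sees /v1/traces, /v1/logs, etc. directly.
    rewrite: (path: string) => path.replace(/^\/otel/, ""),
  },
};
```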
### Mobile (klaxon-mobile, Expo)
Expo `EXPO_PUBLIC_*` env vars, baked at EAS build time:

- `EXPO_PUBLIC_OTEL_ENDPOINT` — OTLP/HTTP base URL, e.g. `https://klaxon.sh/otel`. Defaults to `http://localhost:4318` for `expo start`.
- `EXPO_PUBLIC_ENVIRONMENT` — becomes `deployment.environment`.
## OneUptime setup
1. Create (or open) a OneUptime project and navigate to Settings → Telemetry.
2. Copy the OTLP ingestion URL (typically `https://oneuptime.com/otlp` for OneUptime Cloud) and the project token.
3. Populate the Helm values:

   ```yaml
   secrets:
     oneuptimeOtlpEndpoint: "https://oneuptime.com/otlp"
     oneuptimeOtlpToken: "<your project token>"
   ```

   Or via Pulumi:

   ```bash
   pulumi config set klaxon:oneuptimeOtlpEndpoint https://oneuptime.com/otlp
   pulumi config set --secret klaxon:oneuptimeOtlpToken <your token>
   ```

4. Deploy: `helm upgrade --install klaxon deploy/helm/klaxon -f values.yaml`.
5. Verify in OneUptime that `klaxon-server`, `klaxon-auth`, `klaxon-worker`, `klaxon-web`, and `klaxon-mobile` all appear under Services, and that metrics with names like `http_requests_total` and `klaxon_worker_task_duration_seconds` show up.
### Auth format
OneUptime authenticates OTLP/HTTP with an HTTP `Authorization: Basic <token>` header. The Collector config wraps `${env:ONEUPTIME_OTLP_TOKEN}` in that header — operators just supply the raw token. If your self-hosted OneUptime install uses a different scheme (e.g. a custom header), edit `exporters.otlphttp/oneuptime.headers` in `deploy/helm/klaxon/templates/otel-collector-configmap.yaml` directly.
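The exporter stanza in question looks roughly like the following (a sketch, not the rendered template; `ONEUPTIME_OTLP_ENDPOINT` is an assumed env var name):

```yaml
exporters:
  otlphttp/oneuptime:
    endpoint: ${env:ONEUPTIME_OTLP_ENDPOINT}
    headers:
      Authorization: "Basic ${env:ONEUPTIME_OTLP_TOKEN}"
```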
## Local development
```bash
docker compose up -d   # starts postgres + redis + otel-collector
```

The local Collector is `otel/opentelemetry-collector-contrib` with a `debug` exporter (see `deploy/otel-collector-local.yaml`), so everything sent to it is pretty-printed to the Collector's stdout:

```bash
docker compose logs -f otel-collector
```

Run the Rust binaries against it:
```bash
OTEL_ENABLED=true \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
DEPLOYMENT_ENVIRONMENT=development \
cargo run -p klaxon-server
```

Run the web dev server:
```bash
pnpm --filter @klaxon/web dev
# opens http://localhost:5173 — the Vite proxy forwards /otel → 4318
```

Run Expo on a physical device (LAN):
```bash
cd apps/klaxon-mobile
EXPO_PUBLIC_OTEL_ENDPOINT=http://<your-laptop-lan-ip>:4318 \
npx expo start
```

## What's instrumented
### Server (klaxon-server)
- Every HTTP request gets a root span via the `track_metrics` middleware (`crates/klaxon-server/src/metrics.rs`). Attributes: `method`, `path`, `request_id`, `status`, `duration_ms`. The middleware also populates the `http_requests_total` + `http_request_duration_seconds` Prometheus metrics.
- `#[tracing::instrument]` on all handler functions — every DB transaction, every MCP tool call, every WebSocket lifecycle event gets a child span.
- Two business metrics: `klaxon_mcp_tool_calls_total{tool,success}` and `klaxon_active_websockets`.
### Auth (klaxon-auth)
- Uses the shared `klaxon-telemetry` crate, so distributed traces from the browser (via the `traceparent` header) stitch together across the OAuth consent flow and subsequent `/api/*` calls.
- No custom metrics yet.
### Worker (klaxon-server --worker)
- Each sweep (`snooze_sweep`, `auto_archive_sweep`, `push_batch`, `webhook_batch`, `email_notifications`, `session_cleanup`, `audit_retention`) gets a `#[tracing::instrument]` span and updates two metrics:
  - `klaxon_worker_task_duration_seconds{task}` histogram
  - `klaxon_worker_task_failures_total{task}` counter
- `klaxon_notification_queue_depth` gauge sampled every 15 s from `notification_queue`.
### Web (@klaxon/web)
- Auto-instrumented via `@opentelemetry/instrumentation-fetch` + `-xml-http-request` + `-document-load`.
- Every `apiFetch` call also gets a manual wrapping span with `http.method` + `http.url` + `http.status_code`, so queries/filters can run against business-level attributes without needing to know the auto-instrumentation's span names.
- `log.*` calls in `packages/common/src/logger.ts` emit OTLP log records and mirror to `console`.
### Mobile (klaxon-mobile)
- Manual fetch wrapper (`apps/klaxon-mobile/lib/api.ts`) injects `traceparent` via `propagation.inject`.
- A `screen.load` span per route change, driven by `usePathname` in `_layout.tsx`.
- Same `log.*` shape as the web app.
## Request ID
Every server response includes an `x-request-id` header; if the client sends one, it's propagated, otherwise a UUID v4 is generated. Use this to correlate JSON log entries (`request_id` field), OTel trace spans (`request_id` attribute), and client-side records.

Now that `trace_id` is present in every log record, `request_id` serves mostly for operators grepping `kubectl logs` — OneUptime's UI jumps from span to logs via `trace_id` directly.
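The propagate-or-generate rule can be sketched as follows (a TypeScript illustration of behaviour the Rust server implements; `resolveRequestId` is a hypothetical name):

```typescript
import { randomUUID } from "node:crypto";

// Reuse the client's x-request-id when present and non-empty,
// otherwise mint a fresh UUID v4 — matching the rule described above.
export function resolveRequestId(incoming: string | undefined): string {
  return incoming && incoming.length > 0 ? incoming : randomUUID();
}
```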
## Scrape config (Prometheus, if you have one locally)
```yaml
scrape_configs:
  - job_name: klaxon
    static_configs:
      - targets: ["localhost:3000", "localhost:3001"]
    metrics_path: /metrics
    scrape_interval: 15s
```

In production the OTel Collector does this scrape automatically via the `prometheus` receiver in `deploy/helm/klaxon/templates/otel-collector-configmap.yaml`.
## Debug tips
- No spans reaching OneUptime? Check `kubectl logs <pod>-otel-collector`. If you see `failed to push data to exporter`, the `ONEUPTIME_OTLP_TOKEN` is wrong or the endpoint is unreachable.
- `trace_id` missing from log records? The layer ordering in `klaxon-telemetry::init` is load-bearing (the `OpenTelemetryLayer` before the appender). Regressing this will let logs pass through without span context. Covered by a unit test in the crate.
- Browser OTLP being CORS-blocked? The Collector's `otlp.http.cors.allowed_origins` must include your web origin — edit `values.yaml::otelCollector.cors.allowedOrigins`.
- Eyeballing incoming signals in cluster? Flip `values.yaml::otelCollector.debugExporter.enabled = true` and redeploy; the Collector will then log pretty-printed spans/logs alongside forwarding them to OneUptime. Disable before production (it logs full payloads).
- RN app not producing spans? Make sure `EXPO_PUBLIC_OTEL_ENDPOINT` is reachable from the device (not just the Metro dev server) — physical devices need your laptop's LAN IP, not `localhost`.
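Enabling the debug-exporter flag roughly amounts to adding the Collector's built-in `debug` exporter alongside the OneUptime one, along these lines (a sketch; pipeline and exporter names are illustrative, check the rendered ConfigMap):

```yaml
exporters:
  debug:
    verbosity: detailed
service:
  pipelines:
    traces:
      exporters: [otlphttp/oneuptime, debug]
```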